Video Applications For the Pentium III Processor
Intel claimed that Pentium III would be an excellent processor for multimedia applications, and projects in 3D and video have indeed shown the Pentium III to be a capable piece of hardware. In this article, Asi Elbaz, an Intel engineer, describes how the Pentium III processor and Streaming SIMD Extensions can improve the performance of integer-based applications, using examples from the MPEG encoder application.
November 5, 1999
Author: by Asi Elbaz
After the great success of Intel's MMX technology, the increasing demand for more complex algorithms based on floating-point calculations drove Intel to define yet another new technology. This time around, it defined a new set of instructions and data types for floating-point based algorithms, such as 3D and advanced signal & image processing algorithms, and extended MMX technology support for integer-based algorithms, all while maintaining compatibility with the existing software designed for the Intel architecture. It also included new memory operations that could accelerate any memory-based algorithm - especially multimedia applications, which typically use large blocks of memory.
Subsequent projects in 3D and video applications have demonstrated that the Pentium III processor is an excellent processor for multimedia applications. One of the most impressive such projects is the high resolution, real-time MPEG2 Encoder. This paper describes how the Pentium III processor and Streaming SIMD Extensions can improve the performance of integer-based applications, using examples from the MPEG encoder application.
Motion Estimation & Motion Compensation
For a better understanding, the following examples introduce two of the most basic operations in video compression: Motion Estimation (ME) and Motion Compensation (MC).
ME is performed during encoding. It makes use of the fact that the next frame in a sequence is almost the same as the previous frame. The technique looks for the location of a given block in the previous frame by comparing the block to certain related blocks in the previous frame. The output of this operation for each block is a motion vector.
MC is the opposite operation. Given a certain motion vector and a difference block, MC builds a new block by taking the block, which can be located by the motion vector from the previous frame, and adding it to the difference block.
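As a rough illustration of these two operations, the following plain C sketch searches for a motion vector and then rebuilds a block from it. The frame layout (8-bit samples, row-major, `stride` bytes per row), the 16x16 block size, the exhaustive search, and all function and type names are assumptions made for this sketch; they are not taken from the encoder described here.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>
#include <stdlib.h>

enum { BLOCK = 16 };  /* assumed macroblock size */

typedef struct { int dx, dy; } MotionVector;  /* hypothetical type */

/* Sum of absolute differences between two BLOCK x BLOCK blocks. */
static unsigned block_sad(const uint8_t *a, const uint8_t *b, int stride)
{
    unsigned sum = 0;
    for (int y = 0; y < BLOCK; y++)
        for (int x = 0; x < BLOCK; x++)
            sum += (unsigned)abs(a[y * stride + x] - b[y * stride + x]);
    return sum;
}

/* ME: try every displacement within +/-range pixels and keep the one whose
 * block in the previous frame best matches the current block at (bx, by).
 * The caller must keep the search window inside the frame. */
static MotionVector motion_estimate(const uint8_t *cur, const uint8_t *prev,
                                    int stride, int bx, int by, int range)
{
    assert(range >= 0);
    MotionVector best = { 0, 0 };
    unsigned best_sad = UINT_MAX;
    const uint8_t *c = cur + by * stride + bx;
    for (int dy = -range; dy <= range; dy++) {
        for (int dx = -range; dx <= range; dx++) {
            const uint8_t *p = prev + (by + dy) * stride + (bx + dx);
            unsigned sad = block_sad(c, p, stride);
            if (sad < best_sad) {
                best_sad = sad;
                best.dx = dx;
                best.dy = dy;
            }
        }
    }
    return best;
}

/* MC: rebuild the block at (bx, by) as "block located by the motion vector
 * in the previous frame" plus "difference block", clamped to 0..255. */
static void motion_compensate(uint8_t *out, const uint8_t *prev,
                              const int16_t *diff, int stride,
                              int bx, int by, MotionVector mv)
{
    const uint8_t *p = prev + (by + mv.dy) * stride + (bx + mv.dx);
    uint8_t *o = out + by * stride + bx;
    for (int y = 0; y < BLOCK; y++) {
        for (int x = 0; x < BLOCK; x++) {
            int v = p[y * stride + x] + diff[y * BLOCK + x];
            o[y * stride + x] = (uint8_t)(v < 0 ? 0 : v > 255 ? 255 : v);
        }
    }
}
```

In an encoder the difference block would be computed as the current block minus the block that `motion_estimate` located; feeding that difference and the same vector to `motion_compensate` reproduces the current block.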
Streaming SIMD Extensions
The Streaming SIMD Extensions add a small set of basic integer operations that video and communication algorithms use constantly.
The Streaming SIMD Extensions include the following instructions:
pavgb - SIMD rounded average of unsigned byte-sized operands. A crucial operation in MC & ME algorithms.
psadbw - Sum of the absolute differences of unsigned byte-sized operands, accumulated into a single word. Crucial for block matching algorithms.
pminub/pminsw & pmaxub/pmaxsw - SIMD minimum or maximum of two operands (unsigned bytes or signed words).
As the following examples show, these new instructions ease and speed up a lot of the basic kernels in video applications and other integer-based algorithms.
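To make the behavior of these instructions concrete, here is a scalar C model of what each one computes on a 64-bit MMX register viewed as eight unsigned bytes. The function names are hypothetical, and the model only describes the arithmetic; it says nothing about timing.

```c
#include <assert.h>
#include <stdint.h>

/* pavgb: per-byte rounded average, (a + b + 1) >> 1, with no overflow. */
static void pavgb_model(uint8_t dst[8], const uint8_t src[8])
{
    for (int i = 0; i < 8; i++)
        dst[i] = (uint8_t)((dst[i] + src[i] + 1) >> 1);
}

/* psadbw: sum of the eight absolute byte differences, as one 16-bit value. */
static uint16_t psadbw_model(const uint8_t a[8], const uint8_t b[8])
{
    unsigned sum = 0;
    for (int i = 0; i < 8; i++)
        sum += (unsigned)(a[i] > b[i] ? a[i] - b[i] : b[i] - a[i]);
    assert(sum <= 8u * 255u);  /* always fits in 16 bits */
    return (uint16_t)sum;
}

/* pminub: per-byte unsigned minimum. */
static void pminub_model(uint8_t dst[8], const uint8_t src[8])
{
    for (int i = 0; i < 8; i++)
        if (src[i] < dst[i])
            dst[i] = src[i];
}

/* pmaxub: per-byte unsigned maximum. */
static void pmaxub_model(uint8_t dst[8], const uint8_t src[8])
{
    for (int i = 0; i < 8; i++)
        if (src[i] > dst[i])
            dst[i] = src[i];
}
```

Each model replaces a multi-instruction MMX sequence (unpack, add, shift, pack) with a single operation, which is where the savings in the listings come from.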
The following example shows the basic loop for MC using MMX technology:
Motion_Comp_Loop:
    movq      mm0, [eax+ecx]     // read eight pixels from one block.
    movq      mm4, [eax+ecx+8]   // next eight pixels.
    movq      mm1, [ebx+ecx]     // read eight pixels from second block.
    movq      mm5, [ebx+ecx+8]   // next eight pixels.
    movq      mm2, mm0
    movq      mm3, mm1
    movq      mm6, mm4           // No MMX registers left.
                                 // mm7 was initialized to be zero.
    punpcklbw mm0, mm7           // convert the first four pixels
    punpcklbw mm1, mm7           // from byte format to short format.
    punpcklbw mm4, mm7
    punpckhbw mm2, mm7           // convert the second four pixels
    punpckhbw mm3, mm7           // from byte format to short format.
    punpckhbw mm6, mm7
    // Calculate the average values.
    paddw     mm0, mm1           // after add values are 9 bits.
    paddw     mm2, mm3
    movq      mm1, mm5           // Now mm1 is free.
    punpcklbw mm5, mm7
    punpckhbw mm1, mm7
    paddw     mm4, mm5
    paddw     mm6, mm1
    psrlw     mm0, 1             // divide by two.
    psrlw     mm2, 1             // after division values are 8 bits.
    psrlw     mm4, 1             // divide by two.
    psrlw     mm6, 1             // after division values are 8 bits.
    packuswb  mm0, mm2           // convert back to byte format.
    packuswb  mm4, mm6           // convert back to byte format.
    movq      [edx+ecx], mm0     // store results.
    movq      [edx+ecx+8], mm4   // store results.
    // Increment pointer to the next line.
    // Jmp back while not end of macro block
Example 1. Motion Compensation Using MMX Technology
Since the sum of two 8-bit pixels needs nine bits, you have to convert the values to short format before calculating the average. Shifting each pixel right by one (dividing by 2) before the add would keep everything in eight bits, but it would cost one bit of accuracy.
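The accuracy point can be checked with a few lines of scalar C. The sketch below compares three ways of averaging two pixels: widening before the add (what the MMX loop does), shifting before the add, and a rounded average of the form (a + b + 1) >> 1, which is what 'pavgb' computes. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Widen to 16 bits, add, then shift: the result is the truncated average. */
static uint8_t avg_widened(uint8_t a, uint8_t b)
{
    return (uint8_t)(((uint16_t)a + b) >> 1);
}

/* Shift each operand first: stays within 8 bits, but the low bit of each
 * pixel is thrown away before the add. */
static uint8_t avg_preshift(uint8_t a, uint8_t b)
{
    return (uint8_t)((a >> 1) + (b >> 1));
}

/* Rounded average, the arithmetic pavgb performs per byte. */
static uint8_t avg_rounded(uint8_t a, uint8_t b)
{
    return (uint8_t)((a + b + 1) >> 1);
}
```

For the pixel pair 3 and 5, the widened and rounded versions both give 4, while the pre-shifted version gives 3. For 255 and 255, the pre-shifted version gives 254 instead of 255. Note also that the truncating MMX loop and the rounding 'pavgb' can differ by one when the sum is odd (4 and 5 give 4 and 5 respectively).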
Motion Compensation Using Streaming SIMD Extensions
The next example shows the same implementation using Streaming SIMD Extensions. In this example, you can see that there is no need to do any conversions when using the 'pavgb' instruction.
Motion_Comp_Loop:
    movq      mm0, [eax+ecx]     // read eight pixels from one block.
    movq      mm2, [eax+ecx+8]   // next eight pixels.
    movq      mm1, [ebx+ecx]     // read eight pixels from second block.
    movq      mm3, [ebx+ecx+8]   // next eight pixels.
    pavgb     mm0, mm1           // calculate the average values.
    pavgb     mm2, mm3           // calculate the average values.
    movq      [edx+ecx], mm0     // store results.
    movq      [edx+ecx+8], mm2   // store results.
    // Increment pointer to the next line.
    // Jmp back while not end of macro block
Example 2. Motion Compensation Using Streaming SIMD Extensions
Another basic operation in ME algorithms is "Block Matching", taking two blocks and calculating the energy of the difference block.
The following example shows the basic code for block matching using MMX technology:
Motion_Est_Loop:
    movq      mm1, [edx]         // read 8 pixels of ref block.
    movq      mm3, [edx+8]       // read next 8 pixels of ref block.
    movq      mm0, [ebx]         // read 8 pixels of predicted block.
    movq      mm2, [ebx+8]       // read next 8 pixels of predicted block.
    movq      mm4, mm0
    psubusb   mm0, mm1           // difference between pixels of two blocks.
    psubusb   mm1, mm4           // difference other way.
    por       mm0, mm1           // absolute difference of pixels in mm0.
    movq      mm4, mm2
    psubusb   mm2, mm3           // difference between pixels of two blocks.
    psubusb   mm3, mm4           // difference other way.
    por       mm2, mm3           // absolute difference of pixels in mm2.
    // Calculation of the sum of absolute differences.
    movq      mm1, mm0
    punpcklbw mm0, mm6           // mm6 was initialized to be zero.
    punpckhbw mm1, mm6           // converts 8 bytes in one mm to 8 shorts.
    movq      mm3, mm2
    punpcklbw mm2, mm6
    punpckhbw mm3, mm6           // converts 8 bytes in one mm to 8 shorts.
    paddusw   mm0, mm1           // summing them up.
    paddusw   mm0, mm2
    paddusw   mm0, mm3
IF there_is_threshold
    // Need to calculate final sum.
    pmaddwd   mm0, MASK64        // mult every word by 1 and sum it up to dwords.
    movq      mm1, mm0
    psrlq     mm1, 32
    paddd     mm0, mm1           // final sum of differences.
    paddd     mm7, mm0           // one total sum of differences.
    movd      esi, mm7           // compare the running total, as in Example 4.
    cmp       esi, Threshold_Energy
    jge       Fast_Out
ELSE
    paddusw   mm7, mm0           // four partial sums of differences.
END
    // Increment pointer to the next line.
    // Jmp back while not end of macro block
IF not there_is_threshold
    // Final sum.
    pmaddwd   mm7, MASK64        // mult every word by 1 and sum it up to dwords.
    movq      mm1, mm7
    psrlq     mm1, 32
    paddd     mm7, mm1           // final sum of differences.
END
Fast_Out:
Example 3. Block Matching Using MMX Technology
Since MMX technology has no horizontal operation (such as a sum of the four short elements in one MMX register), and since the sum of the absolute differences needs more than eight bits, the block matching implementation must convert to short format and perform three extra adds in each iteration. At the end of the loop, you need to sum all four partial values to produce one final result.
Moreover, when a threshold energy is used to avoid unnecessary calculations (which is typically the case in ME algorithms), the overhead is large: with there_is_threshold=TRUE you must calculate the final sum in every iteration in order to compare it with the threshold energy. The 'psadbw' instruction enables a quick and efficient comparison at each iteration.
The following example shows the same implementation using the 'psadbw' instruction, which is specifically designed to solve these problems.
Motion_Est_Loop:
    movq      mm1, [edx]         // read 8 pixels of ref block.
    movq      mm3, [edx+8]       // read next 8 pixels of ref block.
    movq      mm0, [ebx]         // read 8 pixels of predicted block.
    movq      mm2, [ebx+8]       // read next 8 pixels of predicted block.
    psadbw    mm1, mm0           // mm1 = sum of absolute differences of 8 pixels.
    psadbw    mm3, mm2           // mm3 = sum of absolute differences of 8 pixels.
    paddd     mm7, mm1
    paddd     mm7, mm3           // one total sum of differences.
IF there_is_threshold
    movd      esi, mm7
    cmp       esi, Threshold_Energy
    jge       Fast_Out
END
    // Increment pointer to the next line.
    // Jmp back while not end of macro block
Fast_Out:
Example 4. Block Matching Using Streaming SIMD Extensions
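The control flow of the thresholded loop may be easier to follow in scalar form. The C sketch below accumulates the sum of absolute differences one 16-pixel row at a time and leaves as soon as the running total reaches the threshold, mirroring the Jge Fast_Out test. The function names, the `rows_done` output, and the 16x16 block size are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Scalar stand-in for the two psadbw instructions: SAD of one 16-pixel row. */
static unsigned row_sad16(const uint8_t *a, const uint8_t *b)
{
    unsigned sum = 0;
    for (int x = 0; x < 16; x++)
        sum += (unsigned)abs(a[x] - b[x]);
    return sum;
}

/* SAD of a 16x16 block with early exit. Returns the total accumulated so
 * far and reports how many rows were processed before leaving the loop. */
static unsigned sad16x16_early_out(const uint8_t *ref, const uint8_t *pred,
                                   int stride, unsigned threshold,
                                   int *rows_done)
{
    assert(stride >= 16);
    unsigned total = 0;
    int y = 0;
    while (y < 16) {
        total += row_sad16(ref + y * stride, pred + y * stride);
        y++;
        if (total >= threshold)  /* candidate is already too expensive */
            break;
    }
    *rows_done = y;
    return total;
}
```

Because the running total is already a single scalar after every row, the comparison costs almost nothing, which is the property that 'psadbw' gives the SIMD version.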
Table 1 shows possible performance boosts to be gained by using Streaming SIMD Extensions. The measurements assume Block size: 16x16 pixels and hot cache.
Streaming SIMD Extensions include more instructions that can improve performance of integer-based algorithms. For MMX technology developers, these extensions can be easily integrated into previous implementations.
Special Memory Instructions
Since multimedia applications are often memory-bound while the calculation units are fast, cache utilization is important. For that reason the Pentium III processor provides two major advanced memory instructions, which help improve cache locality.
prefetch. The first instruction is the 'prefetch' instruction, which moves a certain block in memory closer to the processor. The 'prefetch' instruction provides hints to the processor as to where to move the specified data. There are four different locality hints for different purposes.
One hint is 'prefetchnta', which is intended for non-temporal data (data that is not anticipated to be accessed again in the near future). This operation loads the data into L1 and does not pollute the L2 cache level.
When data is needed for more than one usage, or when there is a need to read, modify and then write back to the same place in memory, you can use 'prefetcht0', 'prefetcht1', or 'prefetcht2' hints, where each hint tells the processor which level of cache should be updated in response to the 'prefetch' instruction.
For example, while doing the last operation on the current frame, you can use 'prefetch' to move the next frame closer to the processor. Yet since you are in the middle of using the current frame, you would like to move the next frame as close as you can without overwriting the memory needed for the current operations. Therefore, you could use 'prefetcht1' to get the next frame, as this fetches data only to L2. Consequently, the current data in the low cache levels is not polluted and the next frame is closer for future calculation.
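A rough C sketch of this pattern follows. It assumes a GCC-compatible compiler, where `__builtin_prefetch(addr, 0, 2)` requests a read prefetch with moderate temporal locality (on x86 targets GCC typically emits 'prefetcht1' for it); on other compilers the macro compiles to nothing. The checksum merely stands in for "the last operation on the current frame", and the function name and the 32-byte line size constant are assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__GNUC__) || defined(__clang__)
#define PREFETCH_L2(p) __builtin_prefetch((p), 0, 2)  /* read, locality 2 */
#else
#define PREFETCH_L2(p) ((void)(p))                    /* no-op fallback */
#endif

enum { CACHE_LINE = 32 };  /* Pentium III cache line size in bytes */

/* Walks the current frame one cache line at a time and, for each line,
 * hints that the matching line of the next frame will be needed soon. */
static uint32_t sum_frame_prefetch_next(const uint8_t *cur,
                                        const uint8_t *next, size_t size)
{
    assert(cur != NULL && next != NULL);
    uint32_t sum = 0;
    for (size_t i = 0; i < size; i += CACHE_LINE) {
        PREFETCH_L2(next + i);  /* a hint only; it never changes results */
        size_t end = i + CACHE_LINE < size ? i + CACHE_LINE : size;
        for (size_t j = i; j < end; j++)
            sum += cur[j];
    }
    return sum;
}
```

Issuing one hint per cache line, rather than one per pixel, keeps the number of outstanding prefetches small.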
movntq. The second instruction is the streaming store operation - 'movntq'.
This is a store operation like the 'movq' instruction, except that it does not pollute the cache hierarchy. That is, if the store hits L1 then it behaves as a regular 'movq'; otherwise it goes directly to memory through the line fill-buffers. This is useful when you want to write data that will probably not be in L1 the next time it is accessed, so there is no need to bring this data into the cache. In contrast, a regular store brings the data into the cache and consequently pollutes the caches.
These two operations have proven to be very efficient in multimedia applications, where memory is one of the major bottlenecks. Understanding the data flow in the application and using a suitable memory layout for the data structures, while respecting the basic principles of the 'movntq' & 'prefetch' operations, greatly reduces the dependency of the application on the memory sub-system.
An Important Constraint
There is an important constraint that should be taken into account when using the 'prefetch' and 'movntq' operations. Misunderstanding this issue will cost performance at the application level.
The interface between memory (main memory or L2) and the cache hierarchy is via the line fill-buffers (32-bytes size). These line buffers are allocated whenever:
There is a miss and data must be fetched to the cache.
A prefetch operation is used.
A streaming store operation or other uncacheable write (WC, USWC) is made.
If there is a new request (one of the three above) and all line fill-buffers are occupied, one of the following two outcomes is possible.
If all buffers are occupied due to load from memory (prefetch or miss), there is a stall until one of the buffers is free. The next transaction will wait inside the load buffer or store buffer.
If one of the buffers holds data that should be written to memory, this data is flushed to memory and the buffer is freed.
To avoid overloading this resource you should avoid the following:
Extensive use of 'movntq' & 'prefetch' - This will overload the load-buffer or store-buffer.
Using 'movntq' for non-sequential memory or writing to different cache lines - A fill-buffer is flushed whenever all fill-buffers are occupied and a new request is pending, so the flush may occur before the fill-buffer is full (32 bytes). By writing one cache line at a time, you increase the odds that a fill-buffer is full whenever it is flushed, which makes the best use of the fill-buffers. For this reason, you may have to think about the layout of the data structure (Array of Structures vs. Structure of Arrays).
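The effect of write order on the fill-buffers can be illustrated with a toy model in C. This is a deliberately simplified sketch, not a description of the real hardware: the number of buffers (four), the round-robin choice of which buffer to flush, and all names are assumptions. The model only counts how many 32-byte buffers leave full and how many are forced out partially filled.

```c
#include <assert.h>
#include <stdint.h>

enum {
    LINE_BYTES  = 32,  /* fill-buffer size given in the text */
    STORE_BYTES = 8,   /* one movntq writes 8 bytes */
    NUM_BUFFERS = 4,   /* assumed buffer count for this model */
    FULL_MASK   = (1 << (LINE_BYTES / STORE_BYTES)) - 1
};

typedef struct {
    uint32_t line[NUM_BUFFERS];  /* cache line held by each buffer */
    uint8_t  mask[NUM_BUFFERS];  /* written 8-byte chunks; 0 means free */
    unsigned next_victim;        /* round-robin eviction pointer */
    unsigned full_flushes;
    unsigned partial_flushes;
} FillBuffers;

/* Models one 8-byte streaming store to byte address addr. */
static void stream_store(FillBuffers *fb, uint32_t addr)
{
    assert(addr % STORE_BYTES == 0);
    uint32_t line = addr / LINE_BYTES;
    uint8_t bit = (uint8_t)(1u << ((addr % LINE_BYTES) / STORE_BYTES));
    int slot = -1;

    for (int i = 0; i < NUM_BUFFERS; i++)      /* buffer already open? */
        if (fb->mask[i] != 0 && fb->line[i] == line) { slot = i; break; }
    if (slot < 0)
        for (int i = 0; i < NUM_BUFFERS; i++)  /* otherwise take a free one */
            if (fb->mask[i] == 0) { slot = i; break; }
    if (slot < 0) {                            /* all busy: forced flush */
        slot = (int)fb->next_victim;
        fb->next_victim = (fb->next_victim + 1) % NUM_BUFFERS;
        fb->partial_flushes++;
        fb->mask[slot] = 0;
    }
    fb->line[slot] = line;
    fb->mask[slot] |= bit;
    if (fb->mask[slot] == FULL_MASK) {         /* whole line written */
        fb->full_flushes++;
        fb->mask[slot] = 0;
    }
}
```

Under this model, writing 16 cache lines sequentially (64 stores) produces 16 full flushes and no partial ones, while writing the same 64 chunks column by column across the 16 lines produces 60 partial flushes and no full ones. Real processors differ in detail, but the trend is the reason for keeping streaming stores sequential.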
Conclusion
The Pentium III processor with Streaming SIMD Extensions boosts the performance of floating-point as well as integer based applications, thanks in part to new advanced memory operations. This advanced technology enables the high resolution MPEG encoder, based on software only, to work in real-time with high quality results. Other multimedia and communication applications can easily improve their performance using this technology.
Asi Elbaz has a B.Sc. (1998) in Electrical Engineering from The Technion - Israel Institute of Technology. His areas of concentration are in image and signal processing. Formerly he co-developed the first real time MPEG-2 Encoder for the Pentium III processor. Asi currently works in the Networking group at Intel's Israel Design Center (IDC) in Haifa, Israel. He can be contacted at [email protected].