After the great success of Intel's MMX technology, the increasing demand for more complex algorithms based on floating-point calculations drove Intel to define yet another new technology. This time around, it defined a new set of instructions and data types for floating-point based algorithms, such as 3D and advanced signal & image processing algorithms, and extended MMX technology support for integer-based algorithms, all while maintaining compatibility with the existing software designed for the Intel architecture. It also included new memory operations that could accelerate any memory-based algorithm - especially multimedia applications, which typically use large blocks of memory.
Subsequent projects in 3D and video applications have demonstrated that the Pentium III processor is an excellent processor for multimedia applications. One of the most impressive such projects is the high resolution, real-time MPEG2 Encoder. This paper describes how the Pentium III processor and Streaming SIMD Extensions can improve the performance of integer-based applications, using examples from the MPEG encoder application.
Motion Estimation & Motion Compensation
To set the context, this section introduces two of the most basic operations in video compression: Motion Estimation (ME) and Motion Compensation (MC).
ME is performed during encoding. It exploits the fact that the next frame in a sequence is almost the same as the previous frame. For a given block, the technique searches the previous frame for the best-matching location by comparing the block with candidate blocks there. The output of this operation for each block is a motion vector.
MC is the opposite operation. Given a motion vector and a difference block, MC builds a new block by taking the block that the motion vector points to in the previous frame and adding it to the difference block.
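The two operations can be sketched in plain C. This is a minimal illustration rather than the encoder's actual code: the 8x8 block size, the exhaustive search window, the `MotionVector` type and all function names are assumptions made for the example, and the sum of absolute differences is used as the matching cost.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define BLK 8   /* hypothetical block size for this sketch */

typedef struct { int dx, dy; } MotionVector;

/* Sum of absolute differences between the BLK x BLK block of `cur` at
   (cx, cy) and the block of `prev` at (px, py). Frames share one stride. */
static unsigned block_sad(const uint8_t *cur, const uint8_t *prev, int stride,
                          int cx, int cy, int px, int py)
{
    unsigned sad = 0;
    for (int y = 0; y < BLK; y++)
        for (int x = 0; x < BLK; x++)
            sad += (unsigned)abs(cur[(cy + y) * stride + cx + x] -
                                 prev[(py + y) * stride + px + x]);
    return sad;
}

/* ME: exhaustive search in a +/-range window around (cx, cy). */
static MotionVector estimate_motion(const uint8_t *cur, const uint8_t *prev,
                                    int width, int height, int stride,
                                    int cx, int cy, int range)
{
    MotionVector best = {0, 0};
    unsigned best_sad = block_sad(cur, prev, stride, cx, cy, cx, cy);
    assert(range >= 0);
    for (int dy = -range; dy <= range; dy++) {
        for (int dx = -range; dx <= range; dx++) {
            int px = cx + dx, py = cy + dy;
            if (px < 0 || py < 0 || px + BLK > width || py + BLK > height)
                continue;   /* candidate falls outside the frame */
            unsigned sad = block_sad(cur, prev, stride, cx, cy, px, py);
            if (sad < best_sad) { best_sad = sad; best.dx = dx; best.dy = dy; }
        }
    }
    return best;
}

/* MC: rebuild a block as (block the vector points to in prev) + difference,
   clamped to the 8-bit pixel range. */
static void compensate_motion(uint8_t *out, const uint8_t *prev,
                              const int16_t *diff, int stride,
                              int cx, int cy, MotionVector mv)
{
    for (int y = 0; y < BLK; y++) {
        for (int x = 0; x < BLK; x++) {
            int v = prev[(cy + mv.dy + y) * stride + cx + mv.dx + x]
                  + diff[y * BLK + x];
            out[(cy + y) * stride + cx + x] =
                (uint8_t)(v < 0 ? 0 : v > 255 ? 255 : v);
        }
    }
}
```

The inner loop of `block_sad` is the kernel that dominates the encoder's run time, which is why the rest of the paper concentrates on it.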
Streaming SIMD Extensions
The Streaming SIMD Extensions meet the demand for specific, advanced, and yet basic operations for video and communication.
The Streaming SIMD Extensions include the following instructions:
pavgb - SIMD averaging of two unsigned byte-sized operands. A crucial operation in MC & ME algorithms.
psadbw - Absolute subtract and sum of two byte-sized operands. Crucial for block matching algorithms.
pmin & pmax - SIMD minimum or maximum of two operands ('pminub'/'pmaxub' for unsigned bytes, 'pminsw'/'pmaxsw' for signed words).
As the following examples show, these new instructions simplify and speed up many of the basic kernels in video applications and other integer-based algorithms.
The following example shows the basic loop for MC using MMX technology:
Motion_Comp_Loop:
Movq mm0,[eax+ecx] // read eight pixels from one block.
// Calculate the average values.
Movq mm1,mm5 // Now mm1 is free.
Paddw mm4,mm5
Psrlw mm0,1 // divide by two.
Movq [edx+ecx],mm0 // store results.
// Increment pointer to the next line.
Example 1. Motion Compensation Using MMX Technology
Since the sum of two pixels can exceed eight bits, you have to convert the values to short (16-bit) format and then calculate the average. Although you could instead shift each pixel right (divide by 2) before adding, this would lose one bit of accuracy.
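The arithmetic behind this trade-off can be checked per pixel with a small scalar sketch. The function names are hypothetical and the code only models the three options discussed; it is not taken from the encoder.

```c
#include <assert.h>
#include <stdint.h>

/* Add in 8 bits, then shift: the ninth bit of the sum is lost. */
static uint8_t avg_wrapping(uint8_t a, uint8_t b)
{
    uint8_t sum = (uint8_t)(a + b);   /* wraps modulo 256 */
    return (uint8_t)(sum >> 1);
}

/* Shift before adding: never overflows, but drops each operand's low bit. */
static uint8_t avg_preshift(uint8_t a, uint8_t b)
{
    return (uint8_t)((a >> 1) + (b >> 1));
}

/* Widen to 16 bits, add, then shift: the approach the MMX loop takes. */
static uint8_t avg_widened(uint8_t a, uint8_t b)
{
    uint16_t sum = (uint16_t)(a + b);
    return (uint8_t)(sum >> 1);
}

/* Worked values for the three variants. */
static void avg_examples(void)
{
    assert(avg_wrapping(255, 255) == 127);  /* overflow: should be 255 */
    assert(avg_preshift(1, 1) == 0);        /* lost bit: should be 1   */
    assert(avg_widened(255, 255) == 255);
    assert(avg_widened(1, 1) == 1);
}
```

Only the widened version is exact for all inputs, which is why the MMX loop pays for the unpack and pack steps.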
Motion Compensation Using Streaming SIMD Extensions
The next example shows the same implementation using Streaming SIMD Extensions. In this example, you can see that there is no need to do any conversions when using the 'pavgb' instruction.
Motion_Comp_Loop:
Movq mm0,[eax+ecx] // read eight pixels from one block.
Movq mm1,[ebx+ecx] // read eight pixels from second block.
Pavgb mm0,mm1 // calculate the average values.
Movq [edx+ecx],mm0 // store results.
// Increment pointer to the next line.
Example 2. Motion Compensation Using Streaming SIMD Extensions
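For readers who want to check results against a reference, the per-byte behaviour of 'pavgb' can be modelled in scalar C. The instruction keeps a 9-bit intermediate sum and rounds halves upward, so its output can differ by one from a truncating shift. The function names and the row/stride parameters below are assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar model of one 'pavgb' on 64-bit operands:
   eight independent rounded averages of unsigned bytes. */
static void pavgb_model(uint8_t dst[8], const uint8_t a[8], const uint8_t b[8])
{
    for (int i = 0; i < 8; i++) {
        /* 9-bit intermediate; the +1 rounds halves upward */
        dst[i] = (uint8_t)(((unsigned)a[i] + (unsigned)b[i] + 1u) >> 1);
    }
}

/* Average two blocks row by row, eight pixels per step, as the loop above
   does. `width` must be a multiple of 8; `stride` is the row pitch in bytes. */
static void average_blocks(uint8_t *dst, const uint8_t *ref_a,
                           const uint8_t *ref_b,
                           size_t width, size_t height, size_t stride)
{
    assert(width % 8 == 0);
    for (size_t y = 0; y < height; y++)
        for (size_t x = 0; x < width; x += 8)
            pavgb_model(dst + y * stride + x,
                        ref_a + y * stride + x,
                        ref_b + y * stride + x);
}
```

A model like this is mainly useful as a test oracle when porting an MMX kernel to the new instruction.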
Another basic operation in ME algorithms is "Block Matching": taking two blocks and calculating the energy of the difference block.
The following example shows the basic code for block matching using MMX technology:
Motion_Est_Loop:
Movq mm1,[edx] // read 8 pixels of ref block.
Movd esi,mm0
IF not there_is_threshold
Fast_Out:
Example 3. Block Matching Using MMX Technology
Since MMX technology does not contain a horizontal operation such as a sum of four short elements in one MMX technology register, and since the sum of the absolute differences takes more than eight bits, the implementation of the block matching algorithm must convert to short format and perform three extra adds in each iteration. At the end of the loop, you need to sum all four difference values to produce one final result.
Moreover, when using a threshold energy to avoid unnecessary calculations (which is typically the case in ME algorithms), the overhead is large, since that method (there_is_threshold=TRUE) means you must calculate the final sum in each iteration for comparison with the threshold energy. Using the 'psadbw' instruction enables a quick and efficient comparison at each iteration.
The following example shows the same implementation using the 'psadbw' instruction, which is specifically designed to solve these problems.
Motion_Est_Loop:
Movq mm1,[edx] // read 8 pixels of ref1 block.
Paddd mm7,mm1
Fast_Out:
Example 4. Block Matching Using Streaming SIMD Extensions
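A scalar C model makes the role of 'psadbw' and of the threshold check easier to follow. It is a sketch under stated assumptions: the function names are hypothetical, the block is 16x16, and the early exit stands in for the Fast_Out label in the listings.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of 'psadbw' on 64-bit operands: the sum of the absolute
   differences of eight unsigned byte pairs. The result is at most
   8 * 255 = 2040, so it always fits in 16 bits. */
static uint16_t psadbw_model(const uint8_t a[8], const uint8_t b[8])
{
    unsigned sum = 0;
    for (int i = 0; i < 8; i++)
        sum += (a[i] > b[i]) ? (unsigned)(a[i] - b[i])
                             : (unsigned)(b[i] - a[i]);
    return (uint16_t)sum;
}

/* Energy of the difference between two 16x16 blocks, with an early exit:
   once the running total exceeds `threshold`, the candidate cannot win. */
static unsigned block_match_16x16(const uint8_t *cur, const uint8_t *ref,
                                  int stride, unsigned threshold)
{
    unsigned total = 0;
    assert(stride >= 16);
    for (int row = 0; row < 16; row++) {
        total += psadbw_model(cur, ref);          /* left eight pixels  */
        total += psadbw_model(cur + 8, ref + 8);  /* right eight pixels */
        if (total > threshold)
            break;                                /* the Fast_Out path  */
        cur += stride;
        ref += stride;
    }
    return total;
}
```

Because each 'psadbw' already yields a single scalar sum, the running total is available for the threshold comparison after every row without any extra horizontal adds.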
Table 1 shows the possible performance boosts to be gained by using Streaming SIMD Extensions. The measurements assume a block size of 16x16 pixels and a hot cache.
Table 1. MMX Technology vs. Streaming SIMD Extensions Implementation for ME & MC
Streaming SIMD Extensions include more instructions that can improve the performance of integer-based algorithms. For MMX technology developers, these extensions can be easily integrated into previous implementations.
Special Memory Instructions
Since multimedia applications are often constrained by memory while the calculation units are fast, cache utilization is important. For that reason the Pentium III processor provides two major advanced memory instructions, which help improve cache locality.
prefetch. The first instruction is the 'prefetch' instruction, which moves a certain block in memory closer to the processor. The 'prefetch' instruction provides hints to the processor as to where to move the specified data. There are four different locality hints for different purposes.
One hint is 'prefetchnta', which is intended for non-temporal data (data that is not anticipated to be accessed again in the near future). This operation loads the data into L1 and does not pollute the L2 cache level.
When data is needed for more than one usage, or when there is a need to read, modify and then write back to the same place in memory, you can use 'prefetcht0', 'prefetcht1', or 'prefetcht2' hints, where each hint tells the processor which level of cache should be updated in response to the 'prefetch' instruction.
For example, while doing the last operation on the current frame, you can use 'prefetch' to move the next frame closer to the processor. Yet since you are in the middle of using the current frame, you would like to move the next frame as close as you can without overwriting the memory needed for the current operations. Therefore, you could use 'prefetcht1' to get the next frame, as this fetches data only to L2. Consequently, the current data in the lower cache levels is not polluted and the next frame is closer for future calculation.
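In C, the same idea can be sketched with a compiler prefetch builtin. The example below assumes GCC or Clang, where `__builtin_prefetch(addr, 0, 2)` is expected to map to 'prefetcht1' on x86; that mapping, the `PREFETCH_T1` macro and the `process_frame` kernel are assumptions for illustration, and on other compilers the hint compiles to nothing.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LINE_BYTES 32   /* cache-line size quoted in the text */

/* Assumption: GCC/Clang builtin; locality 2 is expected to select
   'prefetcht1' on x86. Elsewhere the macro is a no-op. */
#if defined(__GNUC__) || defined(__clang__)
#define PREFETCH_T1(p) __builtin_prefetch((p), 0, 2)
#else
#define PREFETCH_T1(p) ((void)(p))
#endif

/* Hypothetical per-frame kernel: sums the pixels of `cur` while hinting
   the matching line of `next` toward the cache for the following pass.
   `next` may be NULL when there is no following frame. */
static uint64_t process_frame(const uint8_t *cur, const uint8_t *next,
                              size_t bytes)
{
    uint64_t sum = 0;
    assert(cur != NULL);
    for (size_t off = 0; off < bytes; off += LINE_BYTES) {
        if (next != NULL)
            PREFETCH_T1(next + off);   /* one hint per cache line */
        size_t end = off + LINE_BYTES < bytes ? off + LINE_BYTES : bytes;
        for (size_t i = off; i < end; i++)
            sum += cur[i];
    }
    return sum;
}
```

The prefetch is only a hint, so the result is the same with or without it; what changes is how much of the next frame is already close to the processor when its turn comes.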
movntq. The second instruction is the streaming store operation, 'movntq'. This is a store operation just like the 'movq' instruction, except that it does not pollute the cache hierarchy. That is, if the store hits L1 then it behaves as a regular 'movq'; otherwise it goes directly to memory through the line fill-buffers. This is useful when you want to write data that will probably not be in L1 the next time it is accessed, so there is no need to bring this data into the cache. In contrast, a regular store brings the data into the cache and consequently pollutes the caches.
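From C, a streaming store can be reached through the `_mm_stream_pi` intrinsic in `<xmmintrin.h>`, which compilers are expected to translate to 'movntq'. The sketch below is illustrative only: the `copy_streaming` name and the `USE_MOVNTQ` guard are assumptions, the destination is assumed to be 8-byte aligned, and builds without SSE fall back to ordinary stores.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__SSE__) && (defined(__GNUC__) || defined(__clang__))
#include <xmmintrin.h>   /* _mm_stream_pi, _mm_sfence, _mm_empty */
#define USE_MOVNTQ 1
#else
#define USE_MOVNTQ 0     /* fallback: ordinary cached stores */
#endif

/* Copy `bytes` (a multiple of 8) from src to dst, writing dst sequentially
   with non-temporal 64-bit stores where they are available.
   Assumption: dst is 8-byte aligned. */
static void copy_streaming(uint8_t *dst, const uint8_t *src, size_t bytes)
{
    assert(bytes % 8 == 0);
#if USE_MOVNTQ
    for (size_t i = 0; i < bytes; i += 8) {
        __m64 v;
        memcpy(&v, src + i, sizeof v);                 /* ordinary load   */
        _mm_stream_pi((__m64 *)(void *)(dst + i), v);  /* streaming store */
    }
    _mm_sfence();   /* order the streaming stores before later stores */
    _mm_empty();    /* leave MMX state before any x87 code runs       */
#else
    memcpy(dst, src, bytes);
#endif
}
```

Writing the destination in ascending address order matters here, because it lets consecutive stores fill the same buffer before it is written out.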
These two operations have proven to be very efficient in multimedia applications, where memory is one of the major bottlenecks. Understanding the data flow in the application and using a suitable memory layout for the data structures, while respecting the basic principles of the 'movntq' & 'prefetch' operations, greatly reduces the application's dependency on the memory sub-system.
An Important Constraint
There is an important constraint that should be taken into account when trying to use 'prefetch' and 'movntq' operations. Misunderstanding this issue will result in lost performance at the application level.
The interface between memory (main memory or L2) and the cache hierarchy is via the line fill-buffers (32 bytes each). These line buffers are allocated whenever:
- There is a miss and data must be fetched to the cache.
- A prefetch operation is used.
- A streaming store operation or other uncacheable write (WC, USWC) is made.
If there is a new request (one of the three above) and all line fill-buffers are occupied, one of the following two outcomes occurs:
- If all buffers are occupied due to loads from memory (prefetch or miss), there is a stall until one of the buffers is free. The next transaction waits inside the load buffer or store buffer.
- If one of the buffers holds data that should be written to memory, this data is flushed to memory and the buffer is freed.
To avoid overloading this resource you should avoid the following:
- Extensive use of 'movntq' & 'prefetch' - This will overload the load buffer or store buffer.
- Using 'movntq' for non-sequential memory or writing to different cache lines - The flush of a fill-buffer happens whenever all fill-buffers are occupied and a new request is pending, so the flush may occur before the fill-buffer is full (32 bytes). By writing one cache line at a time, you increase the odds that the fill-buffer is full whenever it is flushed, which makes full use of the fill-buffers. For this reason, you may have to think about the data structure (Array of Structures vs. Structure of Arrays).
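The effect of the data layout on the fill-buffers can be estimated with a small counting sketch. It is not a model of the processor: it only counts how many distinct 32-byte lines a sequence of writes touches, assuming the first write starts on a line boundary. The struct layouts and the `lines_touched` helper are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define LINE_BYTES 32u   /* fill-buffer size quoted in the text */

/* Hypothetical layouts, for illustration only. */
struct VertexAoS   { float x, y, z, w; };      /* x values 16 bytes apart */
struct VerticesSoA { float *x, *y, *z, *w; };  /* x values 4 bytes apart  */

/* Number of distinct 32-byte lines touched when writing `count` elements
   of `elem` bytes each, spaced `stride` bytes apart. */
static size_t lines_touched(size_t count, size_t elem, size_t stride)
{
    size_t lines = 0;
    size_t next_line = 0;   /* lowest line index not yet counted */
    assert(elem >= 1 && stride >= elem);
    for (size_t i = 0; i < count; i++) {
        size_t first = (i * stride) / LINE_BYTES;
        size_t last  = (i * stride + elem - 1) / LINE_BYTES;
        if (first < next_line)
            first = next_line;          /* line already counted */
        if (last >= first) {
            lines += last - first + 1;
            next_line = last + 1;
        }
    }
    return lines;
}
```

With 4-byte floats, updating only the `x` field of 64 elements touches 32 lines in the array-of-structures layout, each only partly written, but 8 completely written lines in the structure-of-arrays layout. The second pattern is the one that lets a streaming store fill a buffer before it is flushed.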
Conclusion
The Pentium III processor with Streaming SIMD Extensions boosts the performance of floating-point as well as integer-based applications, thanks in part to new advanced memory operations. This advanced technology enables a high resolution, software-only MPEG encoder to work in real time with high quality results. Other multimedia and communication applications can easily improve their performance using this technology.
Asi Elbaz has a B.Sc. (1998) in Electrical Engineering from The Technion - Israel Institute of Technology. His areas of concentration are in image and signal processing. Formerly he co-developed the first real time MPEG-2 Encoder for the Pentium III processor. Asi currently works in the Networking group at Intel's Israel Design Center (IDC) in Haifa, Israel. He can be contacted at [email protected].