
Video Applications For the Pentium III Processor

Intel claimed that Pentium III would be an excellent processor for multimedia applications, and projects in 3D and video have indeed shown the Pentium III to be a capable piece of hardware. In this article, Asi Elbaz, an Intel engineer, describes how the Pentium III processor and Streaming SIMD Extensions can improve the performance of integer-based applications, using examples from the MPEG encoder application.

After the great success of Intel's MMX technology, the increasing demand for more complex algorithms based on floating-point calculations drove Intel to define yet another new technology. This time around, it defined a new set of instructions and data types for floating-point based algorithms, such as 3D and advanced signal & image processing algorithms, and extended MMX technology support for integer-based algorithms, all while maintaining compatibility with the existing software designed for the Intel architecture. It also included new memory operations that could accelerate any memory-based algorithm - especially multimedia applications, which typically use large blocks of memory.

Subsequent projects in 3D and video applications have demonstrated that the Pentium III processor is an excellent processor for multimedia applications. One of the most impressive such projects is the high resolution, real-time MPEG2 Encoder. This paper describes how the Pentium III processor and Streaming SIMD Extensions can improve the performance of integer-based applications, using examples from the MPEG encoder application.

Motion Estimation & Motion Compensation

To set the stage for the examples that follow, this section introduces two of the most basic operations in video compression: Motion Estimation (ME) and Motion Compensation (MC).

ME is performed during encoding. It makes use of the fact that the next frame in a sequence is almost the same as the previous frame. The technique looks for the location of a given block in the previous frame by comparing the block to a set of candidate blocks in the previous frame. The output of this operation for each block is a motion vector.

MC is the opposite operation. Given a motion vector and a difference block, MC builds a new block by taking the block that the motion vector points to in the previous frame and adding the difference block to it.
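The relationship between the two operations can be sketched in plain C. This is a minimal scalar illustration, not the encoder's actual code: the 8x8 block size, the exhaustive search, and the names `MotionVector`, `estimate_motion`, `difference_block` and `compensate_motion` are all assumptions made for the example.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>
#include <stdlib.h>

enum { BLK = 8 }; /* illustrative block size; the article's blocks are 16x16 */

typedef struct { int dx, dy; } MotionVector; /* hypothetical helper type */

/* Sum of absolute differences between two BLK x BLK blocks. */
static unsigned block_sad(const uint8_t *a, const uint8_t *b, int stride)
{
    unsigned sad = 0;
    for (int y = 0; y < BLK; y++)
        for (int x = 0; x < BLK; x++)
            sad += (unsigned)abs(a[y * stride + x] - b[y * stride + x]);
    return sad;
}

/* ME: exhaustive search of the previous frame within +/-range pixels
   around block (bx, by); candidates outside the frame are skipped. */
static MotionVector estimate_motion(const uint8_t *cur, const uint8_t *prev,
                                    int width, int height,
                                    int bx, int by, int range)
{
    assert(range >= 0);
    MotionVector best = {0, 0};
    unsigned best_sad = UINT_MAX;
    for (int dy = -range; dy <= range; dy++) {
        for (int dx = -range; dx <= range; dx++) {
            int px = bx + dx, py = by + dy;
            if (px < 0 || py < 0 || px + BLK > width || py + BLK > height)
                continue;
            unsigned sad = block_sad(cur + by * width + bx,
                                     prev + py * width + px, width);
            if (sad < best_sad) {
                best_sad = sad;
                best.dx = dx;
                best.dy = dy;
            }
        }
    }
    return best;
}

/* Encoder side: difference between the current block and the block the
   motion vector points to in the previous frame. */
static void difference_block(const uint8_t *cur, const uint8_t *prev,
                             int width, int bx, int by, MotionVector mv,
                             int16_t diff[BLK * BLK])
{
    const uint8_t *c = cur + by * width + bx;
    const uint8_t *p = prev + (by + mv.dy) * width + (bx + mv.dx);
    for (int y = 0; y < BLK; y++)
        for (int x = 0; x < BLK; x++)
            diff[y * BLK + x] = (int16_t)(c[y * width + x] - p[y * width + x]);
}

/* MC: rebuild the block as predicted block + difference block. */
static void compensate_motion(const uint8_t *prev, int width, int bx, int by,
                              MotionVector mv, const int16_t diff[BLK * BLK],
                              uint8_t out[BLK * BLK])
{
    const uint8_t *p = prev + (by + mv.dy) * width + (bx + mv.dx);
    for (int y = 0; y < BLK; y++) {
        for (int x = 0; x < BLK; x++) {
            int v = p[y * width + x] + diff[y * BLK + x];
            out[y * BLK + x] = (uint8_t)(v < 0 ? 0 : v > 255 ? 255 : v);
        }
    }
}
```

Running `estimate_motion`, then `difference_block`, then `compensate_motion` on the same block should reproduce the current block exactly, which is the sense in which MC undoes ME. A real encoder would normally search far fewer candidates than this exhaustive loop does.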

Streaming SIMD Extensions

The Streaming SIMD Extensions add integer instructions for operations that are basic to video and communication algorithms, yet take several instructions each when written with MMX technology alone.

The Streaming SIMD Extensions include the following instructions:


pavgb - SIMD average, with rounding, of unsigned byte-sized operands. A crucial operation in MC & ME algorithms.

psadbw - Sum of absolute differences of unsigned byte-sized operands. Crucial for block matching algorithms.

pminub, pmaxub, pminsw & pmaxsw - SIMD minimum or maximum of unsigned byte or signed word operands.

As the following examples show, these new instructions ease and speed up a lot of the basic kernels in video applications and other integer-based algorithms.
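For readers who want to check results against a reference, the per-byte behavior of 'pavgb' and 'psadbw' on one 64-bit MMX register can be modeled in scalar C. This is only a sketch: the function names `pavgb_model` and `psadbw_model` are made up for this example, and each 8-byte array stands in for one MMX register.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Model of pavgb dst,src: each unsigned byte becomes the average of the
   two operands, rounded up, i.e. (a + b + 1) >> 1. */
static void pavgb_model(uint8_t dst[8], const uint8_t src[8])
{
    for (size_t i = 0; i < 8; i++)
        dst[i] = (uint8_t)((dst[i] + src[i] + 1) >> 1);
}

/* Model of psadbw: the absolute differences of the eight unsigned byte
   pairs are summed into a single 16-bit result. */
static uint16_t psadbw_model(const uint8_t a[8], const uint8_t b[8])
{
    unsigned sum = 0;
    for (size_t i = 0; i < 8; i++)
        sum += a[i] > b[i] ? (unsigned)(a[i] - b[i]) : (unsigned)(b[i] - a[i]);
    return (uint16_t)sum; /* at most 8 * 255 = 2040, so it always fits */
}
```

Note that 'pavgb' rounds up, so its result can be one higher than an average computed by adding and shifting right without the rounding term.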

The following example shows the basic loop for MC using MMX technology:

Motion_Comp_Loop:

Movq mm0,[eax+ecx] // read eight pixels from one block.
Movq mm4,[eax+ecx+8] // next eight pixels.
Movq mm1,[ebx+ecx] // read eight pixels from second block.
Movq mm5,[ebx+ecx+8] // next eight pixels.
Movq mm2,mm0
Movq mm3,mm1
Movq mm6,mm4 // No MMX registers left.
// mm7 was initialized to be zero.
Punpcklbw mm0,mm7 // convert the first four pixels
Punpcklbw mm1,mm7 // from byte format to short format.
Punpcklbw mm4,mm7
Punpckhbw mm2,mm7 // convert the second four pixels
Punpckhbw mm3,mm7 // from byte format to short format.
Punpckhbw mm6,mm7

// Calculate the average values.
Paddw mm0,mm1 // after add values are 9 bits.
Paddw mm2,mm3

Movq mm1,mm5 // Now mm1 is free.
Punpcklbw mm5,mm7
Punpckhbw mm1,mm7

Paddw mm4,mm5
Paddw mm6,mm1

Psrlw mm0,1 // divide by two.
Psrlw mm2,1 // after division values are 8 bits.
Psrlw mm4,1 // divide by two.
Psrlw mm6,1 // after division values are 8 bits.
Packuswb mm0,mm2 // convert back to byte format.
Packuswb mm4,mm6 // convert back to byte format.

Movq [edx+ecx],mm0 // store results.
Movq [edx+ecx+8],mm4 // store results.

// Increment pointer to the next line.
Jmp back while not end of macro block

 

Example 1. Motion Compensation Using MMX Technology

Since the sum of two pixels needs more than eight bits, you have to convert the values to short format before calculating the average. Although we could avoid the conversion by shifting each value (dividing by 2) before the add, this would lose one bit of accuracy.
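The lost bit is easy to demonstrate in scalar C. The sketch below compares the widen-add-shift average of Example 1 with the shift-first shortcut over all byte pairs; the function names are illustrative only.

```c
#include <assert.h>
#include <stdint.h>

/* Widen, add, then shift: the truncating average that Example 1 computes. */
static uint8_t avg_widened(uint8_t a, uint8_t b)
{
    return (uint8_t)(((unsigned)a + b) >> 1);
}

/* Shift each operand first so that the sum still fits in eight bits. */
static uint8_t avg_preshift(uint8_t a, uint8_t b)
{
    return (uint8_t)((a >> 1) + (b >> 1));
}

/* Count the (a, b) byte pairs for which the shortcut gives a different
   result from the widened computation. */
static unsigned count_preshift_errors(void)
{
    unsigned errors = 0;
    for (unsigned a = 0; a < 256; a++)
        for (unsigned b = 0; b < 256; b++)
            if (avg_preshift((uint8_t)a, (uint8_t)b) !=
                avg_widened((uint8_t)a, (uint8_t)b))
                errors++;
    return errors;
}
```

The shortcut is off by one whenever both inputs are odd, because each shift discards a low bit that together would have carried into the result. For example, averaging 3 and 5 gives 4 with the widened form but 3 with the shortcut.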


Motion Compensation Using Streaming SIMD Extensions

The next example shows the same implementation using Streaming SIMD Extensions. In this example, you can see that there is no need to do any conversions when using the 'pavgb' instruction.

Motion_Comp_Loop:

Movq mm0,[eax+ecx] // read eight pixels from one block.
Movq mm2,[eax+ecx+8] // next eight pixels.

Movq mm1,[ebx+ecx] // read eight pixels from second block.
Movq mm3,[ebx+ecx+8] // next eight pixels.

Pavgb mm0,mm1 // calculate the average values.
Pavgb mm2,mm3 // calculate the average values.

Movq [edx+ecx],mm0 // store results.
Movq [edx+ecx+8],mm2 // store results.

// Increment pointer to the next line.
Jmp back while not end of macro block

Example 2. Motion Compensation Using Streaming SIMD Extensions

Another basic operation in ME algorithms is "Block Matching": taking two blocks and calculating the energy of the difference block, measured here as the sum of absolute differences.

The following example shows the basic code for block matching using MMX technology:

Motion_Est_Loop:

Movq mm1,[edx] // read 8 pixels of ref block.
Movq mm3,[edx +8] // read next 8 pixels of ref block.
Movq mm0,[ebx] // read 8 pixels of predicted block.
Movq mm2,[ebx +8] // read next 8 pixels of predicted block.

Movq mm4,mm0
Psubusb mm0,mm1 // difference between pixels of two blocks.
Psubusb mm1,mm4 // difference other way.
Por mm0,mm1 // absolute difference of pixels in mm0.

Movq mm4,mm2
Psubusb mm2,mm3 // difference between pixels of two blocks.
Psubusb mm3,mm4 // difference other way.
Por mm2,mm3 // absolute difference of pixels in mm2.

// Calculation of the sum of absolute differences.
Movq mm1,mm0
Punpcklbw mm0,mm6 // mm6 was initialized to be zero.
Punpckhbw mm1,mm6 // converts 8 bytes in one mm to 8 shorts.

Movq mm3,mm2
Punpcklbw mm2,mm6
Punpckhbw mm3,mm6 // converts 8 bytes in one mm to 8 shorts.

Paddusw mm0,mm1 // summing them up.
Paddusw mm0,mm2
Paddusw mm0,mm3

IF there_is_threshold
// Need to calculate final sum.
Pmaddwd mm0,MASK64 // multiply every word by 1 and sum pairs into dwords.
Movq mm1,mm0
Psrlq mm1,32
Paddd mm0,mm1 // final sum of differences for this row.
Paddd mm7,mm0 // mm7 holds one running total of differences.

Movd esi,mm7 // compare the running total with the threshold.
Cmp esi,Threshold_Energy
Jge Fast_Out
ELSE
Paddusw mm7,mm0 // mm7 holds four partial sums of differences.
END
// Increment pointer to the next line.
Jmp back while not end of macro block

IF not there_is_threshold
// Final sum.
Pmaddwd mm7,MASK64 // multiply every word by 1 and sum pairs into dwords.
Movq mm1,mm7
Psrlq mm1,32
Paddd mm7,mm1 // final sum of differences.
END

Fast_Out:

Example 3. Block Matching Using MMX Technology

MMX technology has no horizontal operation, such as a sum of the four short elements in one MMX register, and the sum of the absolute differences needs more than eight bits. The block matching implementation must therefore convert the differences to short format and perform three extra adds in each iteration. At the end of the loop, you need to sum all four partial values to produce one final result.

Moreover, when a threshold energy is used to avoid unnecessary calculations (which is typically the case in ME algorithms), the overhead is large: with there_is_threshold=TRUE you must calculate the final sum in every iteration in order to compare it with the threshold energy. The 'psadbw' instruction enables a quick and efficient comparison at each iteration.

The following example shows the same implementation using the 'psadbw' instruction, which is specifically designed to solve these problems.

Motion_Est_Loop:

Movq mm1,[edx] // read 8 pixels of ref block.
Movq mm3,[edx +8] // read next 8 pixels of ref block.
Movq mm0,[ebx] // read 8 pixels of predicted block.
Movq mm2,[ebx +8] // read next 8 pixels of predicted block.

Psadbw mm1,mm0 // mm1 = sum of absolute difference of 8 pixels.
Psadbw mm3,mm2 // mm3 = sum of absolute difference of 8 pixels.

Paddd mm7,mm1
Paddd mm7,mm3 // 1 total sum of differences.
IF there_is_threshold
Movd esi,mm7
Cmp esi,Threshold_Energy
Jge Fast_Out
END
// Increment pointer to the next line.
Jmp back while not end of macro block

Fast_Out:

Example 4. Block Matching Using Streaming SIMD Extensions
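The control flow of the thresholded loop can also be written as a scalar C reference, which is handy for checking an assembly implementation. This sketch assumes a 16x16 block and one threshold test per row, mirroring the loop structure of the listing; the function name and the `rows_done` output parameter are additions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Accumulate the sum of absolute differences of a 16x16 block one row at
   a time and stop as soon as the running total reaches the threshold.
   rows_done receives the number of rows that were actually processed. */
static unsigned sad16x16_early_out(const uint8_t *ref, const uint8_t *pred,
                                   size_t stride, unsigned threshold,
                                   int *rows_done)
{
    assert(rows_done != NULL);
    unsigned total = 0;
    int rows = 0;
    while (rows < 16) {
        const uint8_t *r = ref + (size_t)rows * stride;
        const uint8_t *p = pred + (size_t)rows * stride;
        for (int x = 0; x < 16; x++)
            total += r[x] > p[x] ? (unsigned)(r[x] - p[x])
                                 : (unsigned)(p[x] - r[x]);
        rows++;
        if (total >= threshold) /* corresponds to Jge Fast_Out */
            break;
    }
    *rows_done = rows;
    return total;
}
```

When the candidate block is a poor match, the early exit skips most of the rows, which is why a cheap per-row total matters so much for ME.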

Table 1 shows the performance gains that are possible with Streaming SIMD Extensions. The measurements assume a block size of 16x16 pixels and a hot cache. The percentages give the improvement over the MMX technology implementation.

| Operation | Total Cycles | Total µops |
|---|---|---|
| Motion estimation with MMX technology | 308 | 432 |
| Motion estimation with Streaming SIMD Extensions | 208 (48%) | 272 (58%) |
| Motion compensation with MMX technology | 298 | 528 |
| Motion compensation with Streaming SIMD Extensions | 134 (122%) | 192 (175%) |

Table 1. MMX Technology vs. Streaming SIMD Extensions Implementation for ME & MC


Streaming SIMD Extensions include more instructions that can improve performance of integer-based algorithms. For MMX technology developers, these extensions can be easily integrated into previous implementations.


Special Memory Instructions

Since multimedia applications are constrained by memory while the calculation units are fast, cache utilization is important. For that reason the Pentium III processor provides two major memory instructions, which help improve cache locality.

prefetch. The first instruction is the 'prefetch' instruction, which moves a certain block in memory closer to the processor. The 'prefetch' instruction provides hints to the processor as to where to move the specified data. There are four different locality hints for different purposes.

One hint is the 'prefetchnta' hint, which is intended for non-temporal data (data that is not expected to be accessed again in the near future). This operation loads the data into L1 and does not pollute the L2 cache level.

When data is needed for more than one usage, or when there is a need to read, modify and then write back to the same place in memory, you can use 'prefetcht0', 'prefetcht1', or 'prefetcht2' hints, where each hint tells the processor which level of cache should be updated in response to the 'prefetch' instruction.

For example, while performing the last operation on the current frame, you can use 'prefetch' to move the next frame closer to the processor. Since you are still using the current frame, you want the next frame as close as possible without evicting the data needed for the current operations. You could therefore use 'prefetcht1' to get the next frame, as this fetches data only to L2. The current data in the lower cache levels is not polluted, and the next frame is closer for future calculation.
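From C, this pattern can be approximated with a compiler builtin. The sketch below assumes GCC or Clang, whose `__builtin_prefetch(addr, rw, locality)` typically compiles to 'prefetcht1' on x86 when `locality` is 2; on other compilers the macro falls back to a no-op. The function name, the `PREFETCH_L2` macro and the one-prefetch-per-32-byte-line stride are illustrative choices, not tuned values.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__GNUC__) || defined(__clang__)
/* rw = 0 (read), locality = 2: usually emitted as prefetcht1 on x86. */
#define PREFETCH_L2(p) __builtin_prefetch((p), 0, 2)
#else
#define PREFETCH_L2(p) ((void)(p)) /* no-op fallback (assumption) */
#endif

enum { LINE_BYTES = 32 }; /* Pentium III cache line size */

/* Process the current frame (here: just sum its pixels) while hinting
   the matching part of the next frame toward L2, one line at a time. */
static uint32_t sum_frame_prefetch_next(const uint8_t *cur,
                                        const uint8_t *next, size_t n)
{
    assert(cur != NULL && next != NULL);
    uint32_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (i % LINE_BYTES == 0)
            PREFETCH_L2(next + i);
        sum += cur[i];
    }
    return sum;
}
```

A prefetch is only a hint, so the function returns the same result with or without it. Whether it helps depends on how far ahead the data is requested and how much other memory traffic is in flight.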

movntq. The second instruction is the streaming store operation - 'movntq'.

Like the 'movq' instruction, this is a store operation, but one that does not pollute the cache hierarchy. If the store hits L1 it behaves like a regular 'movq'; otherwise the data goes directly to memory through the line fill-buffers. This is useful when you write data that will probably not be in L1 the next time it is accessed, so there is no need to bring it into the cache. In contrast, a regular store brings the data into the cache and consequently pollutes it.

These two operations have proven very efficient in multimedia applications, where memory is one of the major bottlenecks. Understanding the data flow in the application and using a suitable memory layout for the data structures, while following the basic principles of the 'movntq' & 'prefetch' operations, reduces the dependency of the application on the memory sub-system.

An Important Constraint

There is an important constraint to take into account when using the 'prefetch' and 'movntq' operations. Misunderstanding it will cost performance at the application level.

The interface between memory (main memory or L2) and the cache hierarchy is via the line fill-buffers (32 bytes each). A line fill-buffer is allocated whenever:

  • There is a miss and data must be fetched to the cache.
  • A prefetch operation is used.
  • A streaming store operation or other uncacheable write (WC, USWC) is made.

If there is a new request (one of the three above) and all line fill-buffers are occupied, one of the following two things happens:

  • If all buffers are occupied due to loads from memory (prefetch or miss), there is a stall until one of the buffers is free. The next transaction waits inside the load buffer or store buffer.
  • If one of the buffers holds data that should be written to memory, this data is flushed to memory and the buffer is freed.

To avoid overloading this resource you should avoid the following:

  • Extensive use of 'movntq' & 'prefetch' - This will overload the load-buffer or store-buffer.

  • Using 'movntq' for non-sequential memory or writing to different cache lines - A fill-buffer is flushed whenever all fill-buffers are occupied and a new request is pending, so the flush may occur before the fill-buffer is full (32 bytes). By writing one cache line at a time, you increase the odds that a fill-buffer is full whenever it is flushed, which makes full use of the fill-buffers. For this reason, you may have to think about the data structure (Array of Structures vs. Structure of Arrays).
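The effect of the data layout on sequential writes can be estimated with a few lines of C. The sketch below counts how many distinct 32-byte lines are touched when one byte-sized field is written for a run of elements. The `PixelAoS` record and the `lines_touched` helper are hypothetical, and the helper assumes the writes proceed in increasing address order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { LINE_BYTES = 32 }; /* fill-buffer and cache line size */

/* Hypothetical Array-of-Structures pixel record, 4 bytes per pixel. */
typedef struct { uint8_t y, u, v, pad; } PixelAoS;

/* Number of distinct 32-byte lines touched when writing `count`
   one-byte fields that start at byte offset `first` and lie `stride`
   bytes apart, visited in increasing address order. */
static size_t lines_touched(size_t first, size_t stride, size_t count)
{
    assert(stride > 0);
    if (count == 0)
        return 0;
    size_t lines = 1;
    size_t prev = first / LINE_BYTES;
    for (size_t i = 1; i < count; i++) {
        size_t line = (first + i * stride) / LINE_BYTES;
        if (line != prev) {
            lines++;
            prev = line;
        }
    }
    return lines;
}
```

Writing the `y` field of 256 pixels in the Array-of-Structures layout (`stride = sizeof(PixelAoS)`) touches 32 lines and fills only a quarter of each. The Structure-of-Arrays layout (`stride = 1`) touches 8 lines and fills each one completely, which is the access pattern that suits streaming stores.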

Conclusion

The Pentium III processor with Streaming SIMD Extensions boosts the performance of floating-point as well as integer based applications, thanks in part to new advanced memory operations. This advanced technology enables the high resolution MPEG encoder, based on software only, to work in real-time with high quality results. Other multimedia and communication applications can easily improve their performance using this technology.

Asi Elbaz has a B.Sc. (1998) in Electrical Engineering from The Technion - Israel Institute of Technology. His areas of concentration are in image and signal processing. Formerly he co-developed the first real time MPEG-2 Encoder for the Pentium III processor. Asi currently works in the Networking group at Intel's Israel Design Center (IDC) in Haifa, Israel. He can be contacted at [email protected].
