Intel's Streaming SIMD Extensions, or “SSE”, is a 128-bit SIMD vector extension to the x86 ISA that is quite similar to AltiVec. Most of the good practices for AltiVec apply. These include enabling full compiler optimizations, function call inlining, proper alignment and organization of data, attention to pipeline latencies, dispatch limitations, etc. As always, the largest opportunities for performance improvement come from high-level optimization techniques, most importantly choosing the right algorithm. The same goes for PowerPC vs. x86 in general.
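As a rough illustration of what "128-bit SIMD" and "proper alignment" mean in practice, the following minimal sketch adds two float arrays four elements at a time using the standard SSE intrinsics from `<xmmintrin.h>`. The function name `add_floats_sse` and its preconditions (length a multiple of 4, all pointers 16-byte aligned) are assumptions made for this example, not something taken from this document.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>  /* SSE intrinsics: __m128, _mm_load_ps, _mm_add_ps, _mm_store_ps */

/* Hypothetical helper for illustration only: out[i] = a[i] + b[i].
 * Assumptions: n is a multiple of 4, and a, b, and out are all
 * 16-byte aligned, because _mm_load_ps and _mm_store_ps require
 * aligned addresses. */
void add_floats_sse(const float *a, const float *b, float *out, size_t n)
{
    assert(n % 4 == 0);
    assert((uintptr_t)a % 16 == 0);
    assert((uintptr_t)b % 16 == 0);
    assert((uintptr_t)out % 16 == 0);

    for (size_t i = 0; i < n; i += 4) {
        __m128 va = _mm_load_ps(a + i);            /* aligned load of 4 floats (128 bits) */
        __m128 vb = _mm_load_ps(b + i);
        _mm_store_ps(out + i, _mm_add_ps(va, vb)); /* 4 additions in one instruction */
    }
}
```

Readers coming from AltiVec may recognize the shape: the loop body corresponds roughly to a `vec_ld` / `vec_add` / `vec_st` sequence operating on `vector float`.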
However, there are some key differences between the two. For a broad overview of general tips and techniques for writing universal binaries, please see: Universal Binary Programming Guidelines.
A good source of x86-specific tuning advice and architectural documentation is Intel's web site. In particular, please see the processor optimization reference manual and the accompanying software developer's manuals: Intel Pentium References
There are also a number of very interesting (though in many cases highly speculative) resources available on the web to help you better understand Pentium behavior.
This document is intended to be an addendum to the above sources with information specifically relevant to tuning for SSE and high performance programming in general. It is targeted specifically towards the segment of the developer population that is already knowledgeable about high performance programming using AltiVec, especially those people with a substantial investment in AltiVec who would like to leverage that investment moving forward onto the Intel architecture.
Before we begin, we strongly urge developers who are starting to port AltiVec code to SSE to check whether this work has already been done for you in Accelerate.framework. A large body of work has been added to Accelerate.framework in recent years that you may not have been able to take advantage of previously, for reasons that may no longer exist. We recommend taking a few minutes to look. Accelerate.framework provides signal processing (vDSP.h), image processing (vImage.h), linear algebra (BLAS/LAPACK), a vector math library (vMathLib), and large integer computation (vBasicOps.h, vBigNum.h). The framework will transparently select the best code for the CPU it is running on, be that G3, G4, G5 or Pentium. In many cases, you don't have to know anything about vector programming to use Accelerate.framework.
AltiVec and SSE
Hardware Overview
Instruction Overview
Last updated: 2005-09-08