Testing Inequalities
Vector compares are done on SSE in substantially the same way as on AltiVec. The same basic set of compare instructions (similar to vec_cmp*) is available. Each returns a vector of like-sized elements, with -1 (all bits set) for a true result and 0 for a false result in the corresponding element. The floating-point compares cover the full set that AltiVec provides (except vec_cmpb) and add ordered and unordered compares and the != test. All vector floating-point compares come in both scalar and packed versions.
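As a minimal sketch of the extra floating-point tests mentioned above, the following uses the standard intrinsics from xmmintrin.h. The helper names lanes_unordered and lanes_not_equal are illustrative only, and __m128 is used directly in place of vFloat:

```c
#include <xmmintrin.h>  /* SSE: __m128, _mm_cmpunord_ps, _mm_cmpneq_ps */

/* Per-element "unordered" test: each result element is all ones (-1)
   where either input element is a NaN, and 0 otherwise. */
static __m128 lanes_unordered(__m128 a, __m128 b)
{
    return _mm_cmpunord_ps(a, b);
}

/* Per-element a != b. An element that holds a NaN always
   compares as not equal, even against itself. */
static __m128 lanes_not_equal(__m128 a, __m128 b)
{
    return _mm_cmpneq_ps(a, b);
}
```

Calling lanes_unordered( x, x ) is one way to find the NaNs in x, since only a NaN is unordered with respect to itself.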
The integer compares test for equality and inequality (greater than). The inequality tests are for signed integers only; there is no unsigned compare-greater-than instruction. There are no compare instructions for 64-bit types.
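One common way to work around the missing unsigned compare is to flip the sign bit of both operands and then use the signed compare. This is a sketch under that assumption, not something the text above prescribes; the name cmpgt_epu32 is hypothetical and is not a real intrinsic:

```c
#include <emmintrin.h>  /* SSE2: __m128i, _mm_xor_si128, _mm_cmpgt_epi32 */

/* Unsigned 32-bit a > b, built from the signed compare.
   XORing with 0x80000000 maps the unsigned range 0..2^32-1 onto the
   signed range in the same order, so the signed test gives the
   unsigned answer. */
static __m128i cmpgt_epu32(__m128i a, __m128i b)
{
    /* the cast assumes the usual two's complement wraparound */
    const __m128i signBit = _mm_set1_epi32((int)0x80000000u);
    a = _mm_xor_si128(a, signBit);
    b = _mm_xor_si128(b, signBit);
    return _mm_cmpgt_epi32(a, b);   /* -1 where true, 0 where false */
}
```

The cost is two extra XORs per compare. If one operand is a constant, its flipped value can be computed ahead of time.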
Conditional Execution
Branching based on the result of a compare is handled differently from AltiVec, however. The AltiVec compares set some bits in the condition register, upon which the processor can branch directly. SSE compares set no analogous bits. Instead, use the MOVMSKPD, MOVMSKPS or PMOVMSKB instruction to copy the top bit out of each element, pack those bits into a 2-, 4- or 16-bit field for double, float and integer data respectively, and copy the result to an integer register. You may then test that bit field to decide whether or not to branch. This example implements an SSE version of AltiVec's vec_any_eq intrinsic for vFloat:
int _mm_any_eq( vFloat a, vFloat b )
{
    //test a==b for each float in a & b
    vFloat mask = _mm_cmpeq_ps( a, b );
    //copy top bit of each result to maskbits
    int maskBits = _mm_movemask_ps( mask );
    return maskBits != 0;
}
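The same pattern extends to the "all" predicates and to integer data. The sketch below is illustrative: the names all_eq_ps and any_eq_epi32 are made up here, and __m128 and __m128i stand in for the vector typedefs used elsewhere in this document:

```c
#include <emmintrin.h>  /* SSE2; also pulls in the SSE float intrinsics */

/* SSE analog of vec_all_eq for four floats:
   true only when all four mask bits are set. */
static int all_eq_ps(__m128 a, __m128 b)
{
    __m128 mask = _mm_cmpeq_ps(a, b);
    return _mm_movemask_ps(mask) == 0xF;
}

/* Any-equal test for four 32-bit integers. PMOVMSKB collects one bit
   per byte, so a matching element contributes four set bits. */
static int any_eq_epi32(__m128i a, __m128i b)
{
    __m128i mask = _mm_cmpeq_epi32(a, b);
    return _mm_movemask_epi8(mask) != 0;
}
```

For an "all" test on integer data, compare the result of _mm_movemask_epi8 against 0xFFFF instead.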
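If you are branching based on the result of a compare of one element only, then you can do the compare in one instruction using either UCOMISD/UCOMISS or COMISD/COMISS. These set the integer condition flags (EFLAGS) directly, so a conditional branch can follow without first extracting a mask.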
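A minimal sketch of the single-element case, using the COMISS and UCOMISS intrinsics from xmmintrin.h. The function names are illustrative only:

```c
#include <xmmintrin.h>  /* SSE: _mm_comigt_ss, _mm_ucomieq_ss */

/* Compares only the lowest float of each vector; the upper three
   elements are ignored. Returns 1 if a[0] > b[0], else 0. */
static int low_gt(__m128 a, __m128 b)
{
    return _mm_comigt_ss(a, b);
}

/* Same idea for equality, using the UCOMISS form, which raises the
   invalid exception only for signaling NaNs (COMISS also raises it
   for quiet NaNs). */
static int low_eq(__m128 a, __m128 b)
{
    return _mm_ucomieq_ss(a, b);
}
```

The return value is an ordinary int, so it can be used directly in an if statement.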
Select
Branching is expensive on Intel, just as it is on PowerPC. Most of the time that a test is done, the developer on either platform will elect not to do conditional execution, but instead evaluate both sides of the branch and select the correct result based on the value of the test. In AltiVec, this would look like this:
// if (a > 0 ) a += a;
vUInt32 mask = vec_cmpgt( a, zero );
vFloat twoA = vec_add( a, a);
a = vec_sel( a, twoA, mask );
In SSE, the same algorithm is used. However, SSE has no select instruction. One must use AND, ANDNOT and OR instead:
vFloat _mm_sel_ps( vFloat a, vFloat b, vFloat mask )
{
    b = _mm_and_ps( b, mask );
    a = _mm_andnot_ps( mask, a );
    return _mm_or_ps( a, b );
}
Then, the SSE version of the above AltiVec code may be written:
// if (a > 0 ) a += a
vFloat mask = _mm_cmpgt_ps( a, zero );
vFloat twoA = _mm_add_ps( a, a);
a = _mm_sel_ps( a, twoA, mask );
We have found that in practice, it is sometimes possible to cleverly replace select with simpler Boolean operators like a single AND, OR or XOR, especially in vector floating point code. While not a performance win for AltiVec (it's a wash), for SSE this replaces three instructions with one, and can be a large win for code that uses select frequently. Very infrequently, sleepy AltiVec programmers may momentarily forget about vec_min and vec_max, and use compare / select instead. Replacing those with the SSE min and max instructions is a nice win too, when you can find them.
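Two illustrative cases of this kind of simplification are sketched below. The function names are hypothetical, and __m128 stands in for vFloat:

```c
#include <xmmintrin.h>  /* SSE: _mm_cmpgt_ps, _mm_and_ps, _mm_max_ps */

/* r = (a > 0) ? a : 0
   Because the "false" side is zero, a select is unnecessary:
   ANDing a with the compare mask keeps a where the mask is all
   ones and produces +0.0 where the mask is zero. */
static __m128 keep_positive(__m128 a)
{
    __m128 mask = _mm_cmpgt_ps(a, _mm_setzero_ps());
    return _mm_and_ps(a, mask);
}

/* r = (a > b) ? a : b
   A compare followed by a select collapses to one MAXPS. */
static __m128 larger_of(__m128 a, __m128 b)
{
    return _mm_max_ps(a, b);
}
```

Note that MAXPS returns its second operand when either input is a NaN, which matches what the compare / select form does when a is the first operand of the greater-than test.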
Algorithms and Conversions
Here is a conversion table for AltiVec to SSE translation for vector compares and select:
Last updated: 2005-09-08