I was wondering whether it is possible to check the processor's flags register by means of Intel's SSE intrinsic functions.
For example:
int idx = _mm_cmpistri(mmrange, mmstr, 0x14);
int zero = _mm_cmpistrz(mmrange, mmstr, 0x14);
In this example the compiler is able to optimize those two intrinsics into a single instruction (pcmpistri) and check the flags register with a jump instruction (jz).
However, in the following example the compiler doesn't manage to optimize the code properly:
__m128i mmmask = _mm_cmpistrm(mmoldchar, mmstr, 0x40);
int zero = _mm_cmpistrz(mmoldchar, mmstr, 0x40);
Here, the compiler generates both a pcmpistrm and a pcmpistri instruction. However, in my opinion, the second instruction is redundant because pcmpistrm sets the flags in the processor's flags register in the same way as pcmpistri.
So, to come back to my question, is there a way to either read the flags register directly or to instruct the compiler to only generate a pcmpistrm instruction?
Answer
Looks like just an MSVC missed-optimization bug, not anything inherent.
gcc6.2 and icc17 successfully use both results from one PCMPISTRM in a test function I wrote that branches on the zero result (on the Godbolt compiler explorer):
#include <immintrin.h>
__m128i foo(__m128i mmoldchar, __m128i mmstr)
{
    __m128i mmmask = _mm_cmpistrm(mmoldchar, mmstr, 0x40);
    int zero = _mm_cmpistrz(mmoldchar, mmstr, 0x40);
    if (zero)
        return mmmask;
    else
        return _mm_setzero_si128();
}
## gcc6.2 -O3 -march=nehalem
    pcmpistrm xmm0, xmm1, 64
    je .L5
    pxor xmm0, xmm0
    ret
.L5:
    ret
OTOH, clang3.9 fails to CSE the two intrinsics, and uses a separate PCMPISTRI to get the flag result:
foo:
    movdqa xmm2, xmm0
    pcmpistri xmm2, xmm1, 64
    pxor xmm0, xmm0
    jne .LBB0_2
    pcmpistrm xmm2, xmm1, 64
.LBB0_2:
    ret
Note that according to Agner Fog's instruction tables, PCMPISTRM has good throughput but high latency, so there's lots of room to do two in parallel if latency is the bottleneck. Jumping through hoops like reading the flags register with __readeflags() might actually be worse than letting the compiler emit the extra instruction.
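For anyone who wants to try this on their own compiler, here is a minimal sketch of the pattern discussed above: both intrinsics get identical operands and an identical immediate, which is the precondition for the compiler to merge them into one PCMPISTRM. The names scan_chunk, chunk_result, SCAN_MODE and TARGET_SSE42 are hypothetical and not from the original code, and the target attribute is a GCC/clang assumption (with MSVC it expands to nothing). Whether the two calls are actually merged still depends on the compiler, so check the generated asm.

```c
#include <nmmintrin.h>  /* SSE4.2: _mm_cmpistrm, _mm_cmpistrz, _SIDD_* */

/* Assumption: GCC/clang. Enables SSE4.2 for one function without -msse4.2. */
#if defined(__GNUC__)
#define TARGET_SSE42 __attribute__((target("sse4.2")))
#else
#define TARGET_SSE42
#endif

/* One shared immediate (equal to 0x40) so both intrinsics describe the
   same comparison: unsigned bytes, "equal any", byte-wide mask output. */
#define SCAN_MODE (_SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_ANY | _SIDD_UNIT_MASK)

/* Hypothetical result type for this sketch. */
typedef struct {
    int mask;     /* bit i set if chunk[i] is one of the characters in set */
    int hit_end;  /* 1 if the chunk contains a terminating null byte */
} chunk_result;

TARGET_SSE42
chunk_result scan_chunk(const char set[16], const char chunk[16])
{
    __m128i mmset = _mm_loadu_si128((const __m128i *)set);
    __m128i mmstr = _mm_loadu_si128((const __m128i *)chunk);

    /* Same operands and same immediate in both calls: a compiler that
       CSEs them needs only a single PCMPISTRM plus a flag test. */
    __m128i mmmask = _mm_cmpistrm(mmset, mmstr, SCAN_MODE);
    int zero = _mm_cmpistrz(mmset, mmstr, SCAN_MODE);

    chunk_result r = { _mm_movemask_epi8(mmmask), zero };
    return r;
}
```

Both buffers must be at least 16 bytes long, with unused bytes zeroed. Compiling this at -O3 and counting the pcmpistr* instructions in the output shows whether your compiler performs the merge.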