Saturday 3 December 2016

c - What is the Cost of an L1 Cache Miss?



Edit: For reference purposes (if anyone stumbles across this question), Igor Ostrovsky wrote a great post about cache misses. It discusses several different issues and shows example numbers. End Edit



I did some testing and am wondering if a performance difference is due to memory cache misses. The following code demonstrates the issue and boils it down to the critical timing portion. The following code has a couple of loops that visit memory in random order and then in ascending address order.



I ran it on an XP machine (compiled with VS2005: cl /O2) and on a Linux box (gcc –Os). Both produced similar times. These times are in milliseconds. I believe all loops are running and are not optimized out (otherwise it would run “instantly”).





*** Testing 20000 nodes
Total Ordered Time: 888.822899
Total Random Time: 2155.846268


Do these numbers make sense? Is the difference primarily due to L1 cache misses or is something else going on as well? There are 20,000^2 memory accesses and if every one were a cache miss, that is about 3.2 nanoseconds per miss. The XP (P4) machine I tested on is 3.2GHz and I suspect (but don’t know) has a 32KB L1 cache and 512KB L2. With 20,000 entries (80KB), I assume there is not a significant number of L2 misses. So this would be (3.2*10^9 cycles/second) * 3.2*10^-9 seconds/miss) = 10.1 cycles/miss. That seems high to me. Maybe it’s not, or maybe my math is bad. I tried measuring cache misses with VTune, but I got a BSOD. And now I can’t get it to connect to the license server (grrrr).



typedef struct stItem

{
long lData;
//char acPad[20];
} LIST_NODE;



#if defined( WIN32 )
void StartTimer( LONGLONG *pt1 )
{

QueryPerformanceCounter( (LARGE_INTEGER*)pt1 );
}

void StopTimer( LONGLONG t1, double *pdMS )
{
LONGLONG t2, llFreq;

QueryPerformanceCounter( (LARGE_INTEGER*)&t2 );
QueryPerformanceFrequency( (LARGE_INTEGER*)&llFreq );
*pdMS = ((double)( t2 - t1 ) / (double)llFreq) * 1000.0;

}
#else
// doesn't need 64-bit integer in this case
void StartTimer( LONGLONG *pt1 )
{
// Just use clock(), this test doesn't need higher resolution
*pt1 = clock();
}

void StopTimer( LONGLONG t1, double *pdMS )

{
LONGLONG t2 = clock();
*pdMS = (double)( t2 - t1 ) / ( CLOCKS_PER_SEC / 1000 );
}
#endif



long longrand()
{

#if defined( WIN32 )
// Stupid cheesy way to make sure it is not just a 16-bit rand value
return ( rand() << 16 ) | rand();
#else
return rand();
#endif
}

// get random value in the given range
int randint( int m, int n )

{
int ret = longrand() % ( n - m + 1 );
return ret + m;
}

// I think I got this out of Programming Pearls (Bentley).
void ShuffleArray
(
long *plShuffle, // (O) return array of "randomly" ordered integers
long lNumItems // (I) length of array

)
{
long i;
long j;
long t;

for ( i = 0; i < lNumItems; i++ )
plShuffle[i] = i;

for ( i = 0; i < lNumItems; i++ )

{
j = randint( i, lNumItems - 1 );

t = plShuffle[i];
plShuffle[i] = plShuffle[j];
plShuffle[j] = t;
}
}




int main( int argc, char* argv[] )
{
long *plDataValues;
LIST_NODE *pstNodes;
long lNumItems = 20000;
long i, j;
LONGLONG t1; // for timing
double dms;


if ( argc > 1 && atoi(argv[1]) > 0 )
lNumItems = atoi( argv[1] );

printf( "\n\n*** Testing %u nodes\n", lNumItems );

srand( (unsigned int)time( 0 ));

// allocate the nodes as one single chunk of memory
pstNodes = (LIST_NODE*)malloc( lNumItems * sizeof( LIST_NODE ));
assert( pstNodes != NULL );


// Create an array that gives the access order for the nodes
plDataValues = (long*)malloc( lNumItems * sizeof( long ));
assert( plDataValues != NULL );

// Access the data in order
for ( i = 0; i < lNumItems; i++ )
plDataValues[i] = i;

StartTimer( &t1 );


// Loop through and access the memory a bunch of times
for ( j = 0; j < lNumItems; j++ )
{
for ( i = 0; i < lNumItems; i++ )
{
pstNodes[plDataValues[i]].lData = i * j;
}
}


StopTimer( t1, &dms );
printf( "Total Ordered Time: %f\n", dms );

// now access the array positions in a "random" order
ShuffleArray( plDataValues, lNumItems );

StartTimer( &t1 );

for ( j = 0; j < lNumItems; j++ )
{

for ( i = 0; i < lNumItems; i++ )
{
pstNodes[plDataValues[i]].lData = i * j;
}
}

StopTimer( t1, &dms );
printf( "Total Random Time: %f\n", dms );

}


Answer



While I can't offer an answer to whether or not the numbers make sense (I'm not well versed in the cache latencies, but for the record ~10 cycle L1 cache misses sounds about right), I can offer you Cachegrind as a tool to help you actually see the differences in cache performance between your 2 tests.



Cachegrind is a Valgrind tool (the framework that powers the always-lovely memcheck) which profiles cache and branch hits/misses. It will give you an idea of how many cache hits/misses you are actually getting in your program.


No comments:

Post a Comment

c++ - Does curly brackets matter for empty constructor?

Those brackets declare an empty, inline constructor. In that case, with them, the constructor does exist, it merely does nothing more than t...