I have 200 million records, some of which have variable sized fields (string, variable length array etc.). I need to perform some filters, aggregations etc. on them (analytics oriented queries).
I want to lay them all out in memory (they will fit on a big box) and then do linear scans over them. There are two approaches I can take, and I want to hear your opinions on which is better for maximizing speed:
- Using an array of structs with char* and int* etc. to deal with the variable length fields
- Using a large byte array, scanning it like a binary stream, and then parsing the records
Which approach would you recommend?
Update: Using C.
Answer
The unfortunate answer is "it depends on the details you haven't provided", which, while true, is not particularly useful. The general advice for approaching a problem like this is to start with the simplest/most obvious design and then profile and optimize it as needed. If it really matters, you can start with a few basic benchmarks of the candidate designs using your exact data and use cases to get a more accurate idea of which direction to take.
Looking in general at a few specific designs and their general pros/cons:
One Big Buffer
char* pBuffer = malloc(200000000); /* sized to hold all the packed record data */
- Assumes your data can fit into memory all at once.
- Would work better for all text (or mostly text) data.
- Wouldn't be my first choice for large data as it just mirrors the data on the disk. Better just to use the hardware/software file cache/read ahead and read data directly from the drive, or map it if needed.
- For linear scans this is a good format but you lose a bit if it requires complex parsing (especially if you have to do multiple scans).
- Potential for the least overhead assuming you can pack the records one after the other (a minimal parsing sketch follows this list).
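To make the parsing cost concrete, here is a minimal sketch of a linear scan over such a packed buffer. It assumes a hypothetical length-prefixed layout (a uint32 string length, the string bytes, a uint32 count, then that many int32 values) and simply sums all the int values; your actual record format and query will differ.

#include <stdint.h>
#include <string.h>

/* Hypothetical packed layout, records back to back:
 *   uint32_t nameLen; char name[nameLen];
 *   uint32_t valCount; int32_t values[valCount];
 */
static int64_t sum_all_values(const char *pBuffer, size_t bufferSize)
{
    const char *p   = pBuffer;
    const char *end = pBuffer + bufferSize;
    int64_t total = 0;

    while (p < end) {
        uint32_t nameLen, valCount;

        memcpy(&nameLen, p, sizeof nameLen);    /* variable-length string field */
        p += sizeof nameLen + nameLen;          /* skip it; copy it out if the query needs it */

        memcpy(&valCount, p, sizeof valCount);  /* variable-length int array */
        p += sizeof valCount;

        for (uint32_t i = 0; i < valCount; i++) {
            int32_t v;
            memcpy(&v, p, sizeof v);
            total += v;                         /* the aggregation step */
            p += sizeof v;
        }
    }
    return total;
}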
Static Structure
typedef struct {
    char Data1[32];
    int Data2[10];
} myStruct;

myStruct *pData = malloc(sizeof(myStruct) * 200000000);
- Simplest design and likely the best potential for speed at a cost of memory (without actual profiling).
- If your variable length arrays have a wide range of sizes you are going to waste a lot of memory. Since you have 200 million records you may not have enough memory to use this method.
- For a linear scan this is likely the best memory structure due to memory cache/prefetching (see the scan sketch after this list).
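For example, a filter-and-aggregate query over this layout is just a tight loop over contiguous memory. This sketch reuses the myStruct definition above; the "starts with foo" predicate is made up:

#include <stddef.h>
#include <string.h>

/* Count records whose Data1 starts with "foo" and sum their first Data2 entry. */
static long long scan_static(const myStruct *pData, size_t numRecords, size_t *pCount)
{
    long long sum = 0;
    *pCount = 0;
    for (size_t i = 0; i < numRecords; i++) {
        if (strncmp(pData[i].Data1, "foo", 3) == 0) {   /* filter */
            (*pCount)++;
            sum += pData[i].Data2[0];                   /* aggregate */
        }
    }
    return sum;
}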
Dynamic Structure
typedef struct {
    char* pData1;
    int* pData2;
} myStruct2;

myStruct2 *pData = malloc(sizeof(myStruct2) * 200000000);
- With 200 million records this is going to require a lot of dynamic memory allocations, which is very likely going to have a significant impact on speed (a rough sketch of the allocation loop follows this list).
- Has the potential to be more memory efficient if your dynamic arrays have a wide range of sizes (though see next point).
- Be aware of the overhead of the pointer sizes. On a 32-bit system this structure needs 8 bytes (ignoring padding) to store the pointers which is 1.6 GB alone for 200 million records! If your dynamic arrays are generally small (or empty) you may be spending more memory on the overhead than the actual data.
- For a linear scan of data this type of structure will probably perform poorly as you are accessing memory in a non-linear manner which cannot be predicted by the prefetcher.
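To make the allocation cost concrete, filling in this layout looks roughly like the loop below; nameLen and valCount stand in for whatever per-record sizes your data actually has:

#include <stdlib.h>

/* Two heap allocations per record, i.e. roughly 400 million malloc calls
 * for 200 million records, plus the corresponding frees later. */
static void fill_dynamic(myStruct2 *pData, size_t numRecords,
                         const size_t *nameLen, const size_t *valCount)
{
    for (size_t i = 0; i < numRecords; i++) {
        pData[i].pData1 = malloc(nameLen[i] + 1);            /* per-record string */
        pData[i].pData2 = malloc(valCount[i] * sizeof(int)); /* per-record int array */
        /* ...copy the actual field data in here... */
    }
}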
Streaming
- If you only need to do one scan of the data then I would look at a streaming solution where you read a small bit of the data at a time from the file (a minimal read loop is sketched after this list).
- Works well with very large data sets that wouldn't fit into memory.
- Main limitation here is the disk read speed and how complex your parsing is.
- Even if you have to do multiple passes, with file caching this might be comparable in speed to the other methods.
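A minimal sketch of the streaming approach, assuming the records live in a flat file called data.bin; the file name, chunk size, and the parse_records hook are all placeholders for your own setup:

#include <stdio.h>
#include <stdlib.h>

#define CHUNK_SIZE (4 * 1024 * 1024)    /* 4 MB read buffer, tune to taste */

int main(void)
{
    FILE *fp = fopen("data.bin", "rb"); /* hypothetical input file */
    if (fp == NULL)
        return 1;

    char *chunk = malloc(CHUNK_SIZE);
    if (chunk == NULL) {
        fclose(fp);
        return 1;
    }

    size_t n;
    while ((n = fread(chunk, 1, CHUNK_SIZE, fp)) > 0) {
        /* parse_records(chunk, n);  -- run your filters/aggregations here,
         * remembering that a record may straddle two chunks */
    }

    free(chunk);
    fclose(fp);
    return 0;
}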
Which of these is "best" really depends on your specific case; I can think of situations where each one would be the preferred method.