I looked into this because part of our pipeline is forced to be chunked. Most advice I've seen boils down to "more contiguity = better", but without numbers, or at least not generalizable ones.
My concrete tasks will already reach peak performance before 128 kB and I couldn't find pure processing workloads that benefit significantly beyond 1 MB chunk size. Code is linked in the post, it would be nice to see results on more systems.
twoodfin 1 days ago [-]
Your results match similar analyses of database systems I’ve seen.
64KB-128KB seems like the sweet spot.
throwaway81523 15 hours ago [-]
Doesn't it depend what you're doing? xz data compression or some video codecs? Retrograde chess analysis (endgame tablebases)? Number Field Sieve factorization in the linear algebra phase?
sweetjuly 24 hours ago [-]
I wonder how much of the cost is coming from the cache misses vs the more frequent indirections/ILP drop?
For example, I wonder what this test looks like if you don't randomize the chunks but instead just have the chunks in work order? If you still see the perf hit, that suggests the cost is not from the cache misses but rather the overhead of needing to switch chunks more often.
PhilipTrettner 14 hours ago [-]
that's a bit what the "repeated" scenario (roughly middle of the post) measures. It's not in work order but it is the same order every time, so caches work. And there you see that the working set size matters.
Note that the base setup has zero cache reuse because each run touches a completely different and cold part of memory. (that makes the result more of an upper bound on the needed chunk size)
gwking 1 days ago [-]
I’ve casually experimented with this in python a number of times for various hot loops, including those where I’m passing the chunk between c routines. On Apple M1 I’ve never seen a case where chunks larger than 16k mattered. That’s the page size, so totally unsurprising.
Nevertheless it’s been a helpful rule of thumb to not overthink optimizations.
_zoltan_ 1 days ago [-]
is this an attempt at nerd sniping? ;-)
on GPU databases sometimes we go up to the GB range per "item of work" (input permitting) as it's very efficient.
I need to add it to my TODO list to have a look at your github code...
PhilipTrettner 1 days ago [-]
It definitely worked on myself :)
Do have a look, I've tried to roughly keep it small and readable. It's ~250 LOC effectively.
Also, this is CPU only. I'm not super sure what a good GPU version of my benchmark would be, though ... Maybe measuring a "map" more than a "reduction" like I do on the CPU? We should probably take a look at common chunking patterns there.
aapoalas 1 days ago [-]
Would kernel huge pages possibly have an effect here also?
senderista 21 hours ago [-]
This kind of empirical analysis is helpful for things like sizing B-tree pages or unrolled linked list chunks.
smj-edison 1 days ago [-]
Side note, but this product looks really cool! I have a fundamental mistrust of all boolean operations, so to see a system that actually works with degenerate cases correctly is refreshing.
01HNNWZ0MV43FF 1 days ago [-]
This is good data, but I'm not sure what the actionable is for me as a Grug Programmer.
It means if I'm doing very light processing (sums) I should try to move that to structure-of-arrays to take advantage of cache? But if I'm doing something very expensive, I can leave it as array-of-structures, since the computation will dominate the memory access in Amdahl's Law analysis?
This data should tell me something about organizing my data and accessing it, right?
corysama 22 hours ago [-]
Even in code where performance is a serious concern, you don't need to feel guilty about using a data structure that is an array of pointers to 4 kbyte chunks or a tree of such chunks. 4K is linear enough that using a completely flat array probably won't be significantly faster.
Rendered at 20:15:00 GMT+0000 (Coordinated Universal Time) with Vercel.
My concrete tasks will already reach peak performance before 128 kB and I couldn't find pure processing workloads that benefit significantly beyond 1 MB chunk size. Code is linked in the post, it would be nice to see results on more systems.
64KB-128KB seems like the sweet spot.
For example, I wonder what this test looks like if you don't randomize the chunks but instead just have the chunks in work order? If you still see the perf hit, that suggests the cost is not from the cache misses but rather the overhead of needing to switch chunks more often.
Note that the base setup has zero cache reuse because each run touches a completely different and cold part of memory. (that makes the result more of an upper bound on the needed chunk size)
Nevertheless it’s been a helpful rule of thumb to not overthink optimizations.
on GPU databases sometimes we go up to the GB range per "item of work" (input permitting) as it's very efficient.
I need to add it to my TODO list to have a look at your github code...
Do have a look, I've tried to roughly keep it small and readable. It's ~250 LOC effectively.
Also, this is CPU only. I'm not super sure what a good GPU version of my benchmark would be, though ... Maybe measuring a "map" more than a "reduction" like I do on the CPU? We should probably take a look at common chunking patterns there.
It means if I'm doing very light processing (sums) I should try to move that to structure-of-arrays to take advantage of cache? But if I'm doing something very expensive, I can leave it as array-of-structures, since the computation will dominate the memory access in Amdahl's Law analysis?
This data should tell me something about organizing my data and accessing it, right?