Refactor stacking for reuse and overlappability
Currently, the stacking feature adds a lot of code to various parts of ShmemTrainMatcher. It would be nice if this could be cleaner, for:
- maintainability: a more maintainable ShmemTrainMatcher
- reusability: the DetectorAssembler also does stacking, except in its own less good way
- reusability: we will at some point probably split ShmemTrainMatcher, and some of the resulting matching-and-processing devices will want to stack
- improvability: the current implementation is pretty naive; although handle_source is intended to be run in a thread pool, only one train can be processed at a time, and changing this requires rethinking the buffer handling
- bonus concern: if we start overlapping processing and decoupling it from sending, the stacking buffers in the current version would not satisfy the safeNDArray flag
Plan for this MR:
- Move stacking functionality to a "friend" class
- Get rid of global stacking buffers; idea: create, per train, a stacking context with fresh buffers (see the sketch after this list)
- Add some testing of stacking (not just utils, but the friend class working on tables)
- (Out of scope) Test that it works in an overlapping trains setting
- (Out of scope) Start overlapping train processing
- Refactor DetectorAssembler to use StackingFriend to stack before assembly
- Change the configuration that the manager passes to DetectorAssembler correspondingly
- Nevermind, DetectorAssembler is not a great fit actually; maybe use xarray and exploit extra-geom's partial data assembly functionality (Out of scope; see !74 (closed))
- Test everything with actual Karabo (pending node allocation)
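A minimal sketch of the per-train stacking-context idea mentioned in the plan; all names (StackingContext, group_shapes, fill) are hypothetical and only illustrate the "fresh buffers per train" part:

```python
import numpy as np


class StackingContext:
    """Hypothetical per-train context: allocates fresh stacking buffers,
    so a later train never reuses buffers that an output channel may
    still be reading (relevant for the safeNDArray concern above)."""

    def __init__(self, group_shapes):
        # group_shapes: {group_name: (n_sources, *entry_shape)}; dtype fixed for brevity
        self.buffers = {
            group: np.full(shape, np.nan, dtype=np.float32)
            for group, shape in group_shapes.items()
        }

    def fill(self, group, index, data):
        # Copy one source's data into its slot of the stacked buffer
        self.buffers[group][index] = data


# Usage idea: one context per train; the buffers are handed off together
# with the output and simply dropped afterwards instead of being reused globally.
```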
Activity
assigned to @hammerd
Worth noting: if we do go for overlapping train processing, then we could also decide that we don't need multi-threading over sources within a given train. The current implementation parallelizes over sources with a thread pool to be fast enough per train, but the implementation would be simpler if we accepted higher latency and only parallelized over trains in a short queue. Then the stacking could be naive np.stack and such.
As I'm offended by latency, I'll first try to get it running with the option to parallelize both over sources within a train and over trains potentially in a queue.
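A rough sketch of the "parallelize only over trains" variant, assuming hypothetical names (process_train, on_train) and a plain concurrent.futures pool:

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor


def process_train(sources):
    # sources: {source_name: ndarray of image data for this train}.
    # With one worker per train, stacking inside the train can stay naive.
    stacked = np.stack([sources[name] for name in sorted(sources)])
    return stacked  # hand off to sending / further processing


pool = ThreadPoolExecutor(max_workers=2)  # short queue of in-flight trains


def on_train(sources):
    # Each train is submitted as a whole; per-train latency is higher,
    # but no coordination between per-source tasks is needed.
    return pool.submit(process_train, sources)
```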
added 1 commit
- 4c1d521f - Simplify merge index setting, per-train preparatory steps
added 1 commit
- ad4606b8 - Finished overhaul of source stacking, added test
added 2 commits
So, the overall design for a stacking friend (and a frame selection friend, but that one is trivial) is now in place. I ended up moving away from the context manager (which would have held the buffers prepared for a single train) and instead let the friend do setup and process sources immediately, optionally using a passed thread pool (similar to assembly in extra_geom). This means more submissions to the thread pool, but it has some advantages:
- the entire stacking process can run after frame selection, so the two don't clash (they shouldn't happen on the same device in the future, but it's nice that they are orthogonal)
- it allows the alternative execution pattern described below
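A sketch of how such a friend might be driven, under the assumption of hypothetical names and APIs (handle_train, frame_selection_friend.apply, stacking_friend.process) and an externally owned pool:

```python
from concurrent.futures import ThreadPoolExecutor

# Frame selection runs first, then the stacking friend processes the
# (already reduced) sources, optionally reusing the device's thread pool.
pool = ThreadPoolExecutor(max_workers=8)


def handle_train(sources, frame_selection_friend, stacking_friend):
    sources = frame_selection_friend.apply(sources)       # hypothetical API
    stacking_friend.process(sources, thread_pool=pool)    # hypothetical API
    return sources
```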
Other interesting change: stacking within handle_source used to be called for every source present, so we had to keep track of who was missing and fill them in afterwards. The version in this MR instead calls auxiliary functions for each copy operation that was supposed to happen; if a given source is not there, the missing values are filled in immediately. This would also allow splitting the work if multiple keys were stacked within a source. Key stacking is not quite as sophisticated, though.
added 2 commits
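A minimal sketch of the per-copy-operation idea from the comment above; the names (copy_op, fill_value, buffer layout) are hypothetical:

```python
import numpy as np


def copy_op(buffer, index, sources, source_name, key, fill_value=np.nan):
    """Hypothetical auxiliary function: one call per planned copy operation.
    If the source is missing for this train, its slot is filled immediately
    instead of being patched up afterwards."""
    data = sources.get(source_name)
    if data is None:
        buffer[index] = fill_value
    else:
        buffer[index] = data[key]
```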
added 1 commit
- 45067015 - Add processing time measurements, add exponential averaging to ToF
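For reference, a one-line sketch of the exponential averaging mentioned in the commit above; the smoothing factor and where it is applied are assumptions:

```python
# Exponential moving average of a per-train processing time measurement
alpha = 0.1  # assumed smoothing factor
average = None


def update(new_time):
    global average
    average = new_time if average is None else alpha * new_time + (1 - alpha) * average
    return average
```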