Overlapping processing of trains
We've discussed multiple times whether to multithread the processing, letting the input handler return early. So far this has not been necessary, as we could consistently stay below 100 ms for the full input handler (load data, load to GPU, process, load back to memory, write out). The most costly part of this, by far, is memory transfers - either on the network / framework level (outside our control) or to / from GPU. The latter is something we may be able to do something about: with multiple GPU streams, we could probably overlap (or at least pipeline) the transfers with the processing itself.
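To make the overlap idea concrete, here is a minimal sketch of pipelining host-to-device copies, a stand-in kernel, and device-to-host copies across a few streams. Assumptions, since none of this is pinned down here: CuPy as the GPU binding, pinned host buffers, and a dummy `(1024, 1024)` payload per train.

```python
import numpy as np
import cupy as cp

def pinned_empty(shape, dtype=np.float32):
    """Page-locked host buffer; needed for truly asynchronous H2D/D2H copies."""
    count = int(np.prod(shape))
    mem = cp.cuda.alloc_pinned_memory(count * np.dtype(dtype).itemsize)
    return np.frombuffer(mem, dtype, count).reshape(shape)

SHAPE = (1024, 1024)              # placeholder train payload
N_STREAMS = 4
streams = [cp.cuda.Stream(non_blocking=True) for _ in range(N_STREAMS)]
host_in = [pinned_empty(SHAPE) for _ in range(N_STREAMS)]
host_out = [pinned_empty(SHAPE) for _ in range(N_STREAMS)]
dev = [cp.empty(SHAPE, dtype=cp.float32) for _ in range(N_STREAMS)]

for train_id in range(16):        # pretend 16 trains arrive
    s = train_id % N_STREAMS
    streams[s].synchronize()      # slot s is free again; its previous result
                                  # in host_out[s] would be sent out here
    host_in[s][...] = train_id    # stand-in for "load data"
    with streams[s]:
        dev[s].set(host_in[s], stream=streams[s])        # H2D, async
        cp.multiply(dev[s], 2.0, out=dev[s])             # stand-in processing
        dev[s].get(stream=streams[s], out=host_out[s])   # D2H, async

for s in streams:
    s.synchronize()               # flush the tail of the pipeline
```

While stream 0 is copying its result back, stream 1 can already be computing and stream 2 uploading - exactly the overlap we would be after.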
Most obvious implementation in my mind (rough sketch after the list):
- thread pool for processing trains
- each thread has its own input/output buffer (constants are shared, though)
- each thread has its own GPU stream
- some synchronization to ensure trains are sent out in order
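A rough sketch of that design, again assuming CuPy plus a standard `ThreadPoolExecutor`; `process_train`, `on_train`, `send_out`, and `FLAT_FIELD` are hypothetical names standing in for our handler, output path, and shared constants. In-order output is enforced by draining futures strictly in submission order.

```python
import threading
from collections import deque
from concurrent.futures import ThreadPoolExecutor

import cupy as cp

N_WORKERS = 4
FLAT_FIELD = cp.ones((1024, 1024), dtype=cp.float32)  # shared, read-only constants
_tls = threading.local()                              # per-thread stream + buffers

def _init_worker():
    _tls.stream = cp.cuda.Stream(non_blocking=True)
    _tls.dev = cp.empty(FLAT_FIELD.shape, dtype=cp.float32)

def process_train(data):
    """Runs on a pool thread: H2D, kernel, D2H, all on this thread's own stream.
    Pinned per-thread input/output buffers would make the copies fully async."""
    with _tls.stream:
        _tls.dev.set(data, stream=_tls.stream)
        cp.multiply(_tls.dev, FLAT_FIELD, out=_tls.dev)  # stand-in processing
        result = _tls.dev.get(stream=_tls.stream)
    _tls.stream.synchronize()
    return result

def send_out(result):
    pass  # stand-in for writing the train back out

pool = ThreadPoolExecutor(max_workers=N_WORKERS, initializer=_init_worker)
pending = deque()  # futures in arrival order

def on_train(data):
    """Called by the input handler; returns immediately."""
    pending.append(pool.submit(process_train, data))
    # Send out whatever is finished, strictly in arrival order; a final
    # flush elsewhere would drain the remainder after the last train.
    while pending and pending[0].done():
        send_out(pending.popleft().result())
```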
Requested explicitly by @esobolev, as data analysis integration will inevitably start pushing the time budget (!61). Should take RDMA developments into account, though: it would be wasteful to tailor a very channel-centric solution right now if we will soon do RDMA with the DAQ.