Add exdf-du CLI to determine storage size per source or key
As I've been needing this quite a few times in recent weeks and the notebook prototype was fairly clunky to use, I finally shaped it into a CLI. It can determine the amount of storage taken up by sources, or their individual keys, in a collection of EXDF files. In the spirit of `du`, it is called `exdf-du`.
It offers two methods to determine storage size:
- Array memory size: The default method iterates over all sources and their keys and determines the size their data would take in memory via `np.prod(kd.shape) * kd.dtype.itemsize`. This is fairly efficient, as it only accesses the key names and INDEX data, but it does not account for chunking or compression. For AGIPD proc data, e.g., it will overestimate the size by a factor of 5 due to the compression of gain and mask. A warning is built in whenever it encounters compressed datasets (both methods are sketched after this list).
- Actual storage size: In exact mode, it iterates over every dataset of all sources and keys to determine their true storage size via `h5py.Dataset.id.get_storage_size()`. This should always be accurate, but its runtime seems to scale with the data size and can thus be quite time-consuming for large runs. A raw AGIPD run, e.g., can take several minutes, even when restricted to instrument data only.
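For illustration, here is a minimal sketch of both approaches in terms of the EXtra-data API and plain `h5py`. The proposal/run numbers and helper names are hypothetical, and `run.files`/`.filename` assume EXtra-data's file access objects; this is not the actual `exdf-du` code.

```python
import h5py
import numpy as np
from extra_data import open_run

run = open_run(proposal=700000, run=1)  # hypothetical proposal/run numbers

# Method 1: array memory size, computed from the shape and dtype
# recorded in the metadata. No actual data is read.
mem_size = 0
for source in run.all_sources:
    for key in run.keys_for_source(source):
        kd = run[source, key]
        mem_size += np.prod(kd.shape, dtype=np.int64) * kd.dtype.itemsize

# Method 2: actual storage size, visiting every dataset in every file.
def file_storage_size(path):
    total = 0

    def visit(name, obj):
        nonlocal total
        if isinstance(obj, h5py.Dataset):
            total += obj.id.get_storage_size()

    with h5py.File(path, 'r') as f:
        f.visititems(visit)
    return total

disk_size = sum(file_storage_size(fa.filename) for fa in run.files)

print(f'In memory: {mem_size / 2**30:.1f} GiB, '
      f'on disk: {disk_size / 2**30:.1f} GiB')
```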
UPDATE: The default method now uses a combination of both: uncompressed datasets use the array memory size, while for compressed datasets the actual storage size in the first file is extrapolated to the entire dataset.
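A rough sketch of that hybrid per-dataset estimate, assuming a uniform compression ratio across a run's files; `first_dset` (the key's dataset in the first file) and `total_entries` (the entry count across all files) are hypothetical inputs, not the actual implementation:

```python
import h5py
import numpy as np

def estimate_storage(first_dset: h5py.Dataset, total_entries: int) -> int:
    # Uncompressed: storage size equals the in-memory array size.
    if first_dset.compression is None:
        entry_size = (np.prod(first_dset.shape[1:], dtype=np.int64)
                      * first_dset.dtype.itemsize)
        return int(total_entries * entry_size)

    # Compressed: measure the actual storage in the first file and
    # extrapolate by the ratio of total entries to entries in it.
    entries_here = first_dset.shape[0]
    if entries_here == 0:
        return 0
    stored = first_dset.id.get_storage_size()
    return int(stored * total_entries / entries_here)
```

This keeps the cheap metadata-only path for the common uncompressed case, and pays the storage-size query only once per compressed key instead of once per file.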
@kluyvert Could you have a brief look please?