I'm working on a system which has to deal with time series data. I've been happy using Cassandra for time series and Spark looks promising as a computational platform.
I consider chunking time series in Cassandra necessary, e.g. into 3-week chunks as KairosDB does. This allows an 8-byte chunk start timestamp with 4-byte offsets for the individual measurements, and it keeps each chunk below 2x10^9 measurements (about 1.8x10^9) even at 1000 Hz.
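To make the arithmetic concrete, here is a minimal sketch of the split into chunk start and offset. It assumes millisecond timestamps and chunks aligned to the Unix epoch; the names `CHUNK_MS`, `split_timestamp` and `join_timestamp` are made up for illustration and are not taken from KairosDB.

```python
# Assumption: timestamps are epoch milliseconds and chunks are aligned to the epoch.
CHUNK_MS = 21 * 24 * 60 * 60 * 1000  # three weeks in milliseconds


def split_timestamp(ts_ms):
    """Split a timestamp into (chunk_start, offset_within_chunk)."""
    offset = ts_ms % CHUNK_MS
    return ts_ms - offset, offset


def join_timestamp(chunk_start, offset):
    """Rebuild the full timestamp from the two stored parts."""
    return chunk_start + offset


# Largest offset that can occur inside one chunk.
max_offset = CHUNK_MS - 1

# At 1000 Hz there is one measurement per millisecond,
# so a full chunk holds CHUNK_MS measurements.
points_per_chunk_at_1khz = CHUNK_MS
```

With these numbers the largest offset is 1,814,399,999, which fits into a signed 32-bit integer (maximum 2,147,483,647), and a full chunk at 1000 Hz holds 1,814,400,000 measurements.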
This schema works well when dealing with one time series at a time. Because the data is partitioned by time series id and time chunk (e.g. the three weeks mentioned above), a little client-side logic is needed to retrieve the partitions and glue them together, but that is manageable.
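The client-side logic I mean looks roughly like the sketch below. It is plain Python with no driver calls: `fetch_chunk` is a hypothetical callback standing in for the per-partition query, assumed to return `(offset, value)` pairs ordered by offset, and chunks are again assumed to be epoch-aligned and in milliseconds.

```python
# Assumption: epoch-aligned three-week chunks, timestamps in milliseconds.
CHUNK_MS = 21 * 24 * 60 * 60 * 1000


def chunk_starts(from_ms, to_ms):
    """Start timestamps of every chunk that overlaps [from_ms, to_ms]."""
    first = from_ms - from_ms % CHUNK_MS
    return list(range(first, to_ms + 1, CHUNK_MS))


def read_range(fetch_chunk, series_id, from_ms, to_ms):
    """Fetch each partition in time order and glue the rows together.

    fetch_chunk(series_id, chunk_start) is a hypothetical function that
    returns (offset, value) pairs for one partition, ordered by offset.
    """
    points = []
    for start in chunk_starts(from_ms, to_ms):
        for offset, value in fetch_chunk(series_id, start):
            ts = start + offset
            # Trim the first and last chunk to the requested range.
            if from_ms <= ts <= to_ms:
                points.append((ts, value))
    return points
```

For a single series this is one query per chunk, issued in time order, so the result comes back already sorted.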
However, when working with many or all of the time series in a table at once, e.g. in Spark, the story changes dramatically. Say I want to compute something as simple as a moving average: the data for each series is spread across many partitions. I can't currently think of anything other than performing an aggregateByKey, which causes a shuffle every time.
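To illustrate the regrouping I mean, here is a plain-Python stand-in (not Spark code) for what the job has to do. The row layout `(series_id, chunk_start, offset, value)` and the function name `moving_averages` are assumptions for this sketch, and the window is counted in points rather than in time.

```python
from collections import defaultdict, deque


def moving_averages(rows, window):
    """Trailing moving average per series over the last `window` points.

    rows: iterable of (series_id, chunk_start, offset, value) in any order,
    i.e. chunks of different series and different periods mixed together.
    """
    # Step 1: bring all points of one series together. In Spark this is
    # the by-key regrouping that triggers the shuffle.
    by_series = defaultdict(list)
    for series_id, chunk_start, offset, value in rows:
        by_series[series_id].append((chunk_start + offset, value))

    # Step 2: order each series by time and slide the window across
    # chunk boundaries.
    result = {}
    for series_id, points in by_series.items():
        points.sort()
        recent = deque(maxlen=window)
        averaged = []
        for ts, value in points:
            recent.append(value)
            averaged.append((ts, sum(recent) / len(recent)))
        result[series_id] = averaged
    return result
```

The first point of every chunk needs the tail of the previous chunk of the same series, which is why I don't see how to avoid bringing the chunks of a series together first.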
Anyone have experience with combining time series chunking and computation on all / many time series at once? Any advice?