Cassandra time series + Spark

Rumph, Frens Jan
Hi,

I'm working on a system which has to deal with time series data. I've been happy using Cassandra for time series and Spark looks promising as a computational platform.

I consider chunking time series in Cassandra necessary, e.g. into 3-week chunks as KairosDB does. This allows an 8-byte chunk start timestamp with 4-byte offsets for the individual measurements, and it keeps the number of cells per partition below 2x10^9 even at 1000 Hz.
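The chunking arithmetic above can be sketched in a few lines (names are mine, not from any library). A 3-week chunk spans 3*7*86400*1000 = 1,814,400,000 ms, so at 1000 Hz both the per-partition cell count and the 4-byte offset stay within bounds:

```python
# Hypothetical sketch of the 3-week chunking scheme described above.
CHUNK_MS = 3 * 7 * 24 * 3600 * 1000  # 3 weeks in milliseconds

def chunk_key(ts_ms):
    """Split a millisecond timestamp into an 8-byte chunk start
    timestamp (the partition key component) and a 4-byte offset
    within the chunk (the clustering column)."""
    chunk_start = ts_ms - (ts_ms % CHUNK_MS)
    offset = ts_ms - chunk_start
    assert offset < 2**32  # fits in 4 bytes: CHUNK_MS ~ 1.8e9 < 2^32
    return chunk_start, offset
```

At 1000 Hz a chunk holds at most 1,814,400,000 samples, which is just under Cassandra's ~2x10^9 cells-per-partition ceiling.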

This schema works quite okay when dealing with one time series at a time. Because the data is partitioned by time series id and time chunk (e.g. the three weeks mentioned above), retrieving a range takes a little client-side logic to fetch the partitions and glue them together, but this is quite okay.
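That client-side glue might look roughly like this (a sketch under my own naming; `fetch_chunk` stands in for a single-partition Cassandra read):

```python
def chunk_keys(start_ms, end_ms, chunk_ms):
    """All chunk start timestamps whose partitions may hold
    data in the half-open window [start_ms, end_ms)."""
    first = start_ms - (start_ms % chunk_ms)
    return list(range(first, end_ms, chunk_ms))

def read_range(fetch_chunk, series_id, start_ms, end_ms, chunk_ms):
    """Read every partition covering the window and glue the
    (timestamp, value) rows together, trimming the edges.
    fetch_chunk(series_id, chunk_start) is a hypothetical
    stand-in for one partition query."""
    points = []
    for cs in chunk_keys(start_ms, end_ms, chunk_ms):
        points.extend(fetch_chunk(series_id, cs))
    return [(t, v) for t, v in points if start_ms <= t < end_ms]
```

Only the first and last chunks need trimming; the middle chunks are returned whole.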

However, when working with many / all of the time series in a table at once, e.g. in Spark, the story changes dramatically. Say I want to compute something as simple as a moving average: I have to deal with data scattered all over the cluster. I can't currently think of anything but performing an aggregateByKey, causing a shuffle every time.
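The shuffle-based approach I'm describing would, in plain Python, amount to something like the following (a stand-in for a Spark `keyBy(...).aggregateByKey(...)` pipeline, with hypothetical names): every point is re-keyed by the window it falls in and then grouped, and that re-keying across partitions is exactly what forces the shuffle.

```python
from collections import defaultdict

def windowed_average(points, window_ms):
    """points: iterable of (series_id, ts_ms, value) tuples that may
    come from any chunk partition. Re-key each point by
    (series_id, window start) and average per window, mimicking
    an aggregateByKey with a (sum, count) accumulator."""
    sums = defaultdict(lambda: [0.0, 0])
    for series_id, ts, v in points:
        key = (series_id, ts - ts % window_ms)
        acc = sums[key]
        acc[0] += v
        acc[1] += 1
    return {k: s / n for k, (s, n) in sums.items()}
```

In Spark the grouping step is an all-to-all exchange, so every moving-average pass over all series pays the shuffle cost, which is the problem I'd like to avoid.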

Anyone have experience with combining time series chunking and computation on all / many time series at once? Any advice?

Cheers,
Frens Jan