Timeseries analysis using Cassandra and partition by date period

classic Classic list List threaded Threaded
11 messages Options
Reply | Threaded
Open this post in threaded view
|

Timeseries analysis using Cassandra and partition by date period

Serega Sheypak
Hi, I switched from HBase to Cassandra and try to find problem solution for timeseries analysis on top Cassandra.
I have a entity named "Event".
"Event" has attributes:
user_id - a guy who triggered event
event_ts - when even happened
event_type - type of event
some_other_attr - some other attrs we don't care about right now.

The DDL for entity event looks this way:

CREATE TABLE user_plans (
 
  id timeuuid,
  user_id timeuuid,
  event_ts timestamp,
  event_type int,
  some_other_attr text
  
PRIMARY KEY (user_id, ends)
);

Table is "infinite", It would grow continuously during application lifetime.
I want to ask question:
Cassandra, give me all event where event_ts >= xxx and event_ts <=yyy.

Right now it would lead to full table scan.

There is a trick in HBase. HBase has table abstraction and HBase has Column Family abstraction. 
Column family should be declared in advance. 
Column family - physically is a pack of HFiles ("SSTables in C*").
So I can easily add partitioning for my HBase table:
alter table hbase_events add column familiy '2015_01' 
and store all 2015 January data to Column familiy named '2015_01'.

When I want to get January data, I would directly access column family named '2015_01' and I won't massage all data in table, just this piece.

What is approach in C* in this case?
I have an idea create several tables: event_2015_01, event_2015_02, e.t.c. but it looks rather ugly from my current understanding how it works.


Reply | Threaded
Open this post in threaded view
|

Re: Timeseries analysis using Cassandra and partition by date period

Jack Krupansky-2
It depends on the actual number of events per user, but simply bucketing the partition key can give you the same effect - clustering rows by time range. A composite partition key could be comprised of the user name and the date.

It also depends on the data rate - is it many events per day or just a few events per week, or over what time period. You need to be careful - you don't want your Cassandra partitions to be too big (millions of rows) or too small (just a few or even one row per partition.)

-- Jack Krupansky

On Sat, Apr 4, 2015 at 7:03 AM, Serega Sheypak <[hidden email]> wrote:
Hi, I switched from HBase to Cassandra and try to find problem solution for timeseries analysis on top Cassandra.
I have a entity named "Event".
"Event" has attributes:
user_id - a guy who triggered event
event_ts - when even happened
event_type - type of event
some_other_attr - some other attrs we don't care about right now.

The DDL for entity event looks this way:

CREATE TABLE user_plans (
 
  id timeuuid,
  user_id timeuuid,
  event_ts timestamp,
  event_type int,
  some_other_attr text
  
PRIMARY KEY (user_id, ends)
);

Table is "infinite", It would grow continuously during application lifetime.
I want to ask question:
Cassandra, give me all event where event_ts >= xxx and event_ts <=yyy.

Right now it would lead to full table scan.

There is a trick in HBase. HBase has table abstraction and HBase has Column Family abstraction. 
Column family should be declared in advance. 
Column family - physically is a pack of HFiles ("SSTables in C*").
So I can easily add partitioning for my HBase table:
alter table hbase_events add column familiy '2015_01' 
and store all 2015 January data to Column familiy named '2015_01'.

When I want to get January data, I would directly access column family named '2015_01' and I won't massage all data in table, just this piece.

What is approach in C* in this case?
I have an idea create several tables: event_2015_01, event_2015_02, e.t.c. but it looks rather ugly from my current understanding how it works.



Reply | Threaded
Open this post in threaded view
|

Re: Timeseries analysis using Cassandra and partition by date period

Serega Sheypak
Hi, we plan to have 10^8 users and each user could generate 10 events per day.
So we have: 
10^8 records per day
10^8*30 records per month. 
Our timewindow analysis could be from 1 to 6 months.

Right now PK is PRIMARY KEY (user_id, ends) where endts is exact ts of event.

So you suggest this approach:
PRIMARY KEY ((ymd, user_id), event_ts ) 
WITH CLUSTERING ORDER BY (event_ts DESC);

where ymd=20150102 (the Second of January)?

What happens to writes:
SSTable with past days (ymd < current_day) stay untouched and don't take part in Compaction process since there are o changes to them?

What happens to read:
I issue query: 
select some_attributes 
from events where ymd >= 20150101 and ymd < 20150301
Does Cassandra skip SSTables which don't have ymd in specified range and give me a kind of partition elimination, like in traditional DBs?


2015-04-04 14:41 GMT+02:00 Jack Krupansky <[hidden email]>:
It depends on the actual number of events per user, but simply bucketing the partition key can give you the same effect - clustering rows by time range. A composite partition key could be comprised of the user name and the date.

It also depends on the data rate - is it many events per day or just a few events per week, or over what time period. You need to be careful - you don't want your Cassandra partitions to be too big (millions of rows) or too small (just a few or even one row per partition.)

-- Jack Krupansky

On Sat, Apr 4, 2015 at 7:03 AM, Serega Sheypak <[hidden email]> wrote:
Hi, I switched from HBase to Cassandra and try to find problem solution for timeseries analysis on top Cassandra.
I have a entity named "Event".
"Event" has attributes:
user_id - a guy who triggered event
event_ts - when even happened
event_type - type of event
some_other_attr - some other attrs we don't care about right now.

The DDL for entity event looks this way:

CREATE TABLE user_plans (
 
  id timeuuid,
  user_id timeuuid,
  event_ts timestamp,
  event_type int,
  some_other_attr text
  
PRIMARY KEY (user_id, ends)
);

Table is "infinite", It would grow continuously during application lifetime.
I want to ask question:
Cassandra, give me all event where event_ts >= xxx and event_ts <=yyy.

Right now it would lead to full table scan.

There is a trick in HBase. HBase has table abstraction and HBase has Column Family abstraction. 
Column family should be declared in advance. 
Column family - physically is a pack of HFiles ("SSTables in C*").
So I can easily add partitioning for my HBase table:
alter table hbase_events add column familiy '2015_01' 
and store all 2015 January data to Column familiy named '2015_01'.

When I want to get January data, I would directly access column family named '2015_01' and I won't massage all data in table, just this piece.

What is approach in C* in this case?
I have an idea create several tables: event_2015_01, event_2015_02, e.t.c. but it looks rather ugly from my current understanding how it works.




Reply | Threaded
Open this post in threaded view
|

Re: Timeseries analysis using Cassandra and partition by date period

Jack Krupansky-2
Unfortunately, a non-equal relation on a partition key is not supported. You would need to bucket by some larger unit, like a month, and then use the date/time as a clustering column for the row key. Then you could query within the partition. The partition key determines which node can satisfy the query. Designing your partition key judiciously is the key (haha!) to performant Cassandra applications.

-- Jack Krupansky

On Sat, Apr 4, 2015 at 9:33 AM, Serega Sheypak <[hidden email]> wrote:
Hi, we plan to have 10^8 users and each user could generate 10 events per day.
So we have: 
10^8 records per day
10^8*30 records per month. 
Our timewindow analysis could be from 1 to 6 months.

Right now PK is PRIMARY KEY (user_id, ends) where endts is exact ts of event.

So you suggest this approach:
PRIMARY KEY ((ymd, user_id), event_ts ) 
WITH CLUSTERING ORDER BY (event_ts DESC);

where ymd=20150102 (the Second of January)?

What happens to writes:
SSTable with past days (ymd < current_day) stay untouched and don't take part in Compaction process since there are o changes to them?

What happens to read:
I issue query: 
select some_attributes 
from events where ymd >= 20150101 and ymd < 20150301
Does Cassandra skip SSTables which don't have ymd in specified range and give me a kind of partition elimination, like in traditional DBs?


2015-04-04 14:41 GMT+02:00 Jack Krupansky <[hidden email]>:
It depends on the actual number of events per user, but simply bucketing the partition key can give you the same effect - clustering rows by time range. A composite partition key could be comprised of the user name and the date.

It also depends on the data rate - is it many events per day or just a few events per week, or over what time period. You need to be careful - you don't want your Cassandra partitions to be too big (millions of rows) or too small (just a few or even one row per partition.)

-- Jack Krupansky

On Sat, Apr 4, 2015 at 7:03 AM, Serega Sheypak <[hidden email]> wrote:
Hi, I switched from HBase to Cassandra and try to find problem solution for timeseries analysis on top Cassandra.
I have a entity named "Event".
"Event" has attributes:
user_id - a guy who triggered event
event_ts - when even happened
event_type - type of event
some_other_attr - some other attrs we don't care about right now.

The DDL for entity event looks this way:

CREATE TABLE user_plans (
 
  id timeuuid,
  user_id timeuuid,
  event_ts timestamp,
  event_type int,
  some_other_attr text
  
PRIMARY KEY (user_id, ends)
);

Table is "infinite", It would grow continuously during application lifetime.
I want to ask question:
Cassandra, give me all event where event_ts >= xxx and event_ts <=yyy.

Right now it would lead to full table scan.

There is a trick in HBase. HBase has table abstraction and HBase has Column Family abstraction. 
Column family should be declared in advance. 
Column family - physically is a pack of HFiles ("SSTables in C*").
So I can easily add partitioning for my HBase table:
alter table hbase_events add column familiy '2015_01' 
and store all 2015 January data to Column familiy named '2015_01'.

When I want to get January data, I would directly access column family named '2015_01' and I won't massage all data in table, just this piece.

What is approach in C* in this case?
I have an idea create several tables: event_2015_01, event_2015_02, e.t.c. but it looks rather ugly from my current understanding how it works.





Reply | Threaded
Open this post in threaded view
|

Re: Timeseries analysis using Cassandra and partition by date period

Serega Sheypak
>non-equal relation on a partition key is not supported
Ok, can I generate select query:
select some_attributes 
from events where ymd = 20150101 or ymd = 20150102 or 20150103 ... or 20150331

> The partition key determines which node can satisfy the query
So you mean that all rows with the same (ymd, user_id) would be on one physical node?


2015-04-04 16:38 GMT+02:00 Jack Krupansky <[hidden email]>:
Unfortunately, a non-equal relation on a partition key is not supported. You would need to bucket by some larger unit, like a month, and then use the date/time as a clustering column for the row key. Then you could query within the partition. The partition key determines which node can satisfy the query. Designing your partition key judiciously is the key (haha!) to performant Cassandra applications.

-- Jack Krupansky

On Sat, Apr 4, 2015 at 9:33 AM, Serega Sheypak <[hidden email]> wrote:
Hi, we plan to have 10^8 users and each user could generate 10 events per day.
So we have: 
10^8 records per day
10^8*30 records per month. 
Our timewindow analysis could be from 1 to 6 months.

Right now PK is PRIMARY KEY (user_id, ends) where endts is exact ts of event.

So you suggest this approach:
PRIMARY KEY ((ymd, user_id), event_ts ) 
WITH CLUSTERING ORDER BY (event_ts DESC);

where ymd=20150102 (the Second of January)?

What happens to writes:
SSTable with past days (ymd < current_day) stay untouched and don't take part in Compaction process since there are o changes to them?

What happens to read:
I issue query: 
select some_attributes 
from events where ymd >= 20150101 and ymd < 20150301
Does Cassandra skip SSTables which don't have ymd in specified range and give me a kind of partition elimination, like in traditional DBs?


2015-04-04 14:41 GMT+02:00 Jack Krupansky <[hidden email]>:
It depends on the actual number of events per user, but simply bucketing the partition key can give you the same effect - clustering rows by time range. A composite partition key could be comprised of the user name and the date.

It also depends on the data rate - is it many events per day or just a few events per week, or over what time period. You need to be careful - you don't want your Cassandra partitions to be too big (millions of rows) or too small (just a few or even one row per partition.)

-- Jack Krupansky

On Sat, Apr 4, 2015 at 7:03 AM, Serega Sheypak <[hidden email]> wrote:
Hi, I switched from HBase to Cassandra and try to find problem solution for timeseries analysis on top Cassandra.
I have a entity named "Event".
"Event" has attributes:
user_id - a guy who triggered event
event_ts - when even happened
event_type - type of event
some_other_attr - some other attrs we don't care about right now.

The DDL for entity event looks this way:

CREATE TABLE user_plans (
 
  id timeuuid,
  user_id timeuuid,
  event_ts timestamp,
  event_type int,
  some_other_attr text
  
PRIMARY KEY (user_id, ends)
);

Table is "infinite", It would grow continuously during application lifetime.
I want to ask question:
Cassandra, give me all event where event_ts >= xxx and event_ts <=yyy.

Right now it would lead to full table scan.

There is a trick in HBase. HBase has table abstraction and HBase has Column Family abstraction. 
Column family should be declared in advance. 
Column family - physically is a pack of HFiles ("SSTables in C*").
So I can easily add partitioning for my HBase table:
alter table hbase_events add column familiy '2015_01' 
and store all 2015 January data to Column familiy named '2015_01'.

When I want to get January data, I would directly access column family named '2015_01' and I won't massage all data in table, just this piece.

What is approach in C* in this case?
I have an idea create several tables: event_2015_01, event_2015_02, e.t.c. but it looks rather ugly from my current understanding how it works.






Reply | Threaded
Open this post in threaded view
|

Re: Timeseries analysis using Cassandra and partition by date period

Jack Krupansky-2
It sounds like your time bucket should be a month, but it depends on the amount of data per user per day and your main query range. Within the partition you can then query for a range of days.

Yes, all of the rows within a partition are stored on one physical node as well as the replica nodes.

-- Jack Krupansky

On Sat, Apr 4, 2015 at 1:38 PM, Serega Sheypak <[hidden email]> wrote:
>non-equal relation on a partition key is not supported
Ok, can I generate select query:
select some_attributes 
from events where ymd = 20150101 or ymd = 20150102 or 20150103 ... or 20150331

> The partition key determines which node can satisfy the query
So you mean that all rows with the same (ymd, user_id) would be on one physical node?


2015-04-04 16:38 GMT+02:00 Jack Krupansky <[hidden email]>:
Unfortunately, a non-equal relation on a partition key is not supported. You would need to bucket by some larger unit, like a month, and then use the date/time as a clustering column for the row key. Then you could query within the partition. The partition key determines which node can satisfy the query. Designing your partition key judiciously is the key (haha!) to performant Cassandra applications.

-- Jack Krupansky

On Sat, Apr 4, 2015 at 9:33 AM, Serega Sheypak <[hidden email]> wrote:
Hi, we plan to have 10^8 users and each user could generate 10 events per day.
So we have: 
10^8 records per day
10^8*30 records per month. 
Our timewindow analysis could be from 1 to 6 months.

Right now PK is PRIMARY KEY (user_id, ends) where endts is exact ts of event.

So you suggest this approach:
PRIMARY KEY ((ymd, user_id), event_ts ) 
WITH CLUSTERING ORDER BY (event_ts DESC);

where ymd=20150102 (the Second of January)?

What happens to writes:
SSTable with past days (ymd < current_day) stay untouched and don't take part in Compaction process since there are o changes to them?

What happens to read:
I issue query: 
select some_attributes 
from events where ymd >= 20150101 and ymd < 20150301
Does Cassandra skip SSTables which don't have ymd in specified range and give me a kind of partition elimination, like in traditional DBs?


2015-04-04 14:41 GMT+02:00 Jack Krupansky <[hidden email]>:
It depends on the actual number of events per user, but simply bucketing the partition key can give you the same effect - clustering rows by time range. A composite partition key could be comprised of the user name and the date.

It also depends on the data rate - is it many events per day or just a few events per week, or over what time period. You need to be careful - you don't want your Cassandra partitions to be too big (millions of rows) or too small (just a few or even one row per partition.)

-- Jack Krupansky

On Sat, Apr 4, 2015 at 7:03 AM, Serega Sheypak <[hidden email]> wrote:
Hi, I switched from HBase to Cassandra and try to find problem solution for timeseries analysis on top Cassandra.
I have a entity named "Event".
"Event" has attributes:
user_id - a guy who triggered event
event_ts - when even happened
event_type - type of event
some_other_attr - some other attrs we don't care about right now.

The DDL for entity event looks this way:

CREATE TABLE user_plans (
 
  id timeuuid,
  user_id timeuuid,
  event_ts timestamp,
  event_type int,
  some_other_attr text
  
PRIMARY KEY (user_id, ends)
);

Table is "infinite", It would grow continuously during application lifetime.
I want to ask question:
Cassandra, give me all event where event_ts >= xxx and event_ts <=yyy.

Right now it would lead to full table scan.

There is a trick in HBase. HBase has table abstraction and HBase has Column Family abstraction. 
Column family should be declared in advance. 
Column family - physically is a pack of HFiles ("SSTables in C*").
So I can easily add partitioning for my HBase table:
alter table hbase_events add column familiy '2015_01' 
and store all 2015 January data to Column familiy named '2015_01'.

When I want to get January data, I would directly access column family named '2015_01' and I won't massage all data in table, just this piece.

What is approach in C* in this case?
I have an idea create several tables: event_2015_01, event_2015_02, e.t.c. but it looks rather ugly from my current understanding how it works.







Reply | Threaded
Open this post in threaded view
|

Re: Timeseries analysis using Cassandra and partition by date period

Serega Sheypak
Okay, so bucketing by day/week/month is a capacity planning stuff and actual questions I want to ask. 
As as a conclusion:
I have a table events

CREATE TABLE user_plans (
  id timeuuid,
  user_id timeuuid,
  event_ts timestamp,
  event_type int,
  some_other_attr text
  
PRIMARY KEY (user_id, ends)
);
which fits tactic queries: 
select smth from user_plans where user_id='xxx' and end_ts > now()

Then I create second table user_plans_daily (or weekly, monthy)

with DDL:
CREATE TABLE user_plans_daily/weekly/monthly (
  ymd int,
  user_id timeuuid,
  event_ts timestamp,
  event_type int,
  some_other_attr text
)  
PRIMARY KEY ((ymd, user_id), event_ts ) 
WITH CLUSTERING ORDER BY (event_ts DESC);

And this table is good for answering strategic questions: 
select * from 
user_plans_daily/weekly/monthly
where ymd in (....)
And I should avoid long condition inside IN clause, that is why you suggest me to create bigger bucket, correct?


2015-04-04 20:00 GMT+02:00 Jack Krupansky <[hidden email]>:
It sounds like your time bucket should be a month, but it depends on the amount of data per user per day and your main query range. Within the partition you can then query for a range of days.

Yes, all of the rows within a partition are stored on one physical node as well as the replica nodes.

-- Jack Krupansky

On Sat, Apr 4, 2015 at 1:38 PM, Serega Sheypak <[hidden email]> wrote:
>non-equal relation on a partition key is not supported
Ok, can I generate select query:
select some_attributes 
from events where ymd = 20150101 or ymd = 20150102 or 20150103 ... or 20150331

> The partition key determines which node can satisfy the query
So you mean that all rows with the same (ymd, user_id) would be on one physical node?


2015-04-04 16:38 GMT+02:00 Jack Krupansky <[hidden email]>:
Unfortunately, a non-equal relation on a partition key is not supported. You would need to bucket by some larger unit, like a month, and then use the date/time as a clustering column for the row key. Then you could query within the partition. The partition key determines which node can satisfy the query. Designing your partition key judiciously is the key (haha!) to performant Cassandra applications.

-- Jack Krupansky

On Sat, Apr 4, 2015 at 9:33 AM, Serega Sheypak <[hidden email]> wrote:
Hi, we plan to have 10^8 users and each user could generate 10 events per day.
So we have: 
10^8 records per day
10^8*30 records per month. 
Our timewindow analysis could be from 1 to 6 months.

Right now PK is PRIMARY KEY (user_id, ends) where endts is exact ts of event.

So you suggest this approach:
PRIMARY KEY ((ymd, user_id), event_ts ) 
WITH CLUSTERING ORDER BY (event_ts DESC);

where ymd=20150102 (the Second of January)?

What happens to writes:
SSTable with past days (ymd < current_day) stay untouched and don't take part in Compaction process since there are o changes to them?

What happens to read:
I issue query: 
select some_attributes 
from events where ymd >= 20150101 and ymd < 20150301
Does Cassandra skip SSTables which don't have ymd in specified range and give me a kind of partition elimination, like in traditional DBs?


2015-04-04 14:41 GMT+02:00 Jack Krupansky <[hidden email]>:
It depends on the actual number of events per user, but simply bucketing the partition key can give you the same effect - clustering rows by time range. A composite partition key could be comprised of the user name and the date.

It also depends on the data rate - is it many events per day or just a few events per week, or over what time period. You need to be careful - you don't want your Cassandra partitions to be too big (millions of rows) or too small (just a few or even one row per partition.)

-- Jack Krupansky

On Sat, Apr 4, 2015 at 7:03 AM, Serega Sheypak <[hidden email]> wrote:
Hi, I switched from HBase to Cassandra and try to find problem solution for timeseries analysis on top Cassandra.
I have a entity named "Event".
"Event" has attributes:
user_id - a guy who triggered event
event_ts - when even happened
event_type - type of event
some_other_attr - some other attrs we don't care about right now.

The DDL for entity event looks this way:

CREATE TABLE user_plans (
 
  id timeuuid,
  user_id timeuuid,
  event_ts timestamp,
  event_type int,
  some_other_attr text
  
PRIMARY KEY (user_id, ends)
);

Table is "infinite", It would grow continuously during application lifetime.
I want to ask question:
Cassandra, give me all event where event_ts >= xxx and event_ts <=yyy.

Right now it would lead to full table scan.

There is a trick in HBase. HBase has table abstraction and HBase has Column Family abstraction. 
Column family should be declared in advance. 
Column family - physically is a pack of HFiles ("SSTables in C*").
So I can easily add partitioning for my HBase table:
alter table hbase_events add column familiy '2015_01' 
and store all 2015 January data to Column familiy named '2015_01'.

When I want to get January data, I would directly access column family named '2015_01' and I won't massage all data in table, just this piece.

What is approach in C* in this case?
I have an idea create several tables: event_2015_01, event_2015_02, e.t.c. but it looks rather ugly from my current understanding how it works.








Reply | Threaded
Open this post in threaded view
|

Re: Timeseries analysis using Cassandra and partition by date period

Kevin Burton
> Hi, I switched from HBase to Cassandra and try to find problem solution for timeseries analysis on top Cassandra.

Depending on what you’re looking for, you might want to check out KairosDB.

0.95 beta2 just shipped yesterday as well so you have good timing.


On Sat, Apr 4, 2015 at 11:29 AM, Serega Sheypak <[hidden email]> wrote:
Okay, so bucketing by day/week/month is a capacity planning stuff and actual questions I want to ask. 
As as a conclusion:
I have a table events

CREATE TABLE user_plans (
  id timeuuid,
  user_id timeuuid,
  event_ts timestamp,
  event_type int,
  some_other_attr text
  
PRIMARY KEY (user_id, ends)
);
which fits tactic queries: 
select smth from user_plans where user_id='xxx' and end_ts > now()

Then I create second table user_plans_daily (or weekly, monthy)

with DDL:
CREATE TABLE user_plans_daily/weekly/monthly (
  ymd int,
  user_id timeuuid,
  event_ts timestamp,
  event_type int,
  some_other_attr text
)  
PRIMARY KEY ((ymd, user_id), event_ts ) 
WITH CLUSTERING ORDER BY (event_ts DESC);

And this table is good for answering strategic questions: 
select * from 
user_plans_daily/weekly/monthly
where ymd in (....)
And I should avoid long condition inside IN clause, that is why you suggest me to create bigger bucket, correct?


2015-04-04 20:00 GMT+02:00 Jack Krupansky <[hidden email]>:
It sounds like your time bucket should be a month, but it depends on the amount of data per user per day and your main query range. Within the partition you can then query for a range of days.

Yes, all of the rows within a partition are stored on one physical node as well as the replica nodes.

-- Jack Krupansky

On Sat, Apr 4, 2015 at 1:38 PM, Serega Sheypak <[hidden email]> wrote:
>non-equal relation on a partition key is not supported
Ok, can I generate select query:
select some_attributes 
from events where ymd = 20150101 or ymd = 20150102 or 20150103 ... or 20150331

> The partition key determines which node can satisfy the query
So you mean that all rows with the same (ymd, user_id) would be on one physical node?


2015-04-04 16:38 GMT+02:00 Jack Krupansky <[hidden email]>:
Unfortunately, a non-equal relation on a partition key is not supported. You would need to bucket by some larger unit, like a month, and then use the date/time as a clustering column for the row key. Then you could query within the partition. The partition key determines which node can satisfy the query. Designing your partition key judiciously is the key (haha!) to performant Cassandra applications.

-- Jack Krupansky

On Sat, Apr 4, 2015 at 9:33 AM, Serega Sheypak <[hidden email]> wrote:
Hi, we plan to have 10^8 users and each user could generate 10 events per day.
So we have: 
10^8 records per day
10^8*30 records per month. 
Our timewindow analysis could be from 1 to 6 months.

Right now PK is PRIMARY KEY (user_id, ends) where endts is exact ts of event.

So you suggest this approach:
PRIMARY KEY ((ymd, user_id), event_ts ) 
WITH CLUSTERING ORDER BY (event_ts DESC);

where ymd=20150102 (the Second of January)?

What happens to writes:
SSTable with past days (ymd < current_day) stay untouched and don't take part in Compaction process since there are o changes to them?

What happens to read:
I issue query: 
select some_attributes 
from events where ymd >= 20150101 and ymd < 20150301
Does Cassandra skip SSTables which don't have ymd in specified range and give me a kind of partition elimination, like in traditional DBs?


2015-04-04 14:41 GMT+02:00 Jack Krupansky <[hidden email]>:
It depends on the actual number of events per user, but simply bucketing the partition key can give you the same effect - clustering rows by time range. A composite partition key could be comprised of the user name and the date.

It also depends on the data rate - is it many events per day or just a few events per week, or over what time period. You need to be careful - you don't want your Cassandra partitions to be too big (millions of rows) or too small (just a few or even one row per partition.)

-- Jack Krupansky

On Sat, Apr 4, 2015 at 7:03 AM, Serega Sheypak <[hidden email]> wrote:
Hi, I switched from HBase to Cassandra and try to find problem solution for timeseries analysis on top Cassandra.
I have a entity named "Event".
"Event" has attributes:
user_id - a guy who triggered event
event_ts - when even happened
event_type - type of event
some_other_attr - some other attrs we don't care about right now.

The DDL for entity event looks this way:

CREATE TABLE user_plans (
 
  id timeuuid,
  user_id timeuuid,
  event_ts timestamp,
  event_type int,
  some_other_attr text
  
PRIMARY KEY (user_id, ends)
);

Table is "infinite", It would grow continuously during application lifetime.
I want to ask question:
Cassandra, give me all event where event_ts >= xxx and event_ts <=yyy.

Right now it would lead to full table scan.

There is a trick in HBase. HBase has table abstraction and HBase has Column Family abstraction. 
Column family should be declared in advance. 
Column family - physically is a pack of HFiles ("SSTables in C*").
So I can easily add partitioning for my HBase table:
alter table hbase_events add column familiy '2015_01' 
and store all 2015 January data to Column familiy named '2015_01'.

When I want to get January data, I would directly access column family named '2015_01' and I won't massage all data in table, just this piece.

What is approach in C* in this case?
I have an idea create several tables: event_2015_01, event_2015_02, e.t.c. but it looks rather ugly from my current understanding how it works.











--

Founder/CEO Spinn3r.com
Location: San Francisco, CA
… or check out my Google+ profile

Reply | Threaded
Open this post in threaded view
|

Re: Timeseries analysis using Cassandra and partition by date period

Serega Sheypak
Thanks, is it a kind of opentsdb?

2015-04-05 18:28 GMT+02:00 Kevin Burton <[hidden email]>:
> Hi, I switched from HBase to Cassandra and try to find problem solution for timeseries analysis on top Cassandra.

Depending on what you’re looking for, you might want to check out KairosDB.

0.95 beta2 just shipped yesterday as well so you have good timing.


On Sat, Apr 4, 2015 at 11:29 AM, Serega Sheypak <[hidden email]> wrote:
Okay, so bucketing by day/week/month is a capacity planning stuff and actual questions I want to ask. 
As as a conclusion:
I have a table events

CREATE TABLE user_plans (
  id timeuuid,
  user_id timeuuid,
  event_ts timestamp,
  event_type int,
  some_other_attr text
  
PRIMARY KEY (user_id, ends)
);
which fits tactic queries: 
select smth from user_plans where user_id='xxx' and end_ts > now()

Then I create second table user_plans_daily (or weekly, monthy)

with DDL:
CREATE TABLE user_plans_daily/weekly/monthly (
  ymd int,
  user_id timeuuid,
  event_ts timestamp,
  event_type int,
  some_other_attr text
)  
PRIMARY KEY ((ymd, user_id), event_ts ) 
WITH CLUSTERING ORDER BY (event_ts DESC);

And this table is good for answering strategic questions: 
select * from 
user_plans_daily/weekly/monthly
where ymd in (....)
And I should avoid long condition inside IN clause, that is why you suggest me to create bigger bucket, correct?


2015-04-04 20:00 GMT+02:00 Jack Krupansky <[hidden email]>:
It sounds like your time bucket should be a month, but it depends on the amount of data per user per day and your main query range. Within the partition you can then query for a range of days.

Yes, all of the rows within a partition are stored on one physical node as well as the replica nodes.

-- Jack Krupansky

On Sat, Apr 4, 2015 at 1:38 PM, Serega Sheypak <[hidden email]> wrote:
>non-equal relation on a partition key is not supported
Ok, can I generate select query:
select some_attributes 
from events where ymd = 20150101 or ymd = 20150102 or 20150103 ... or 20150331

> The partition key determines which node can satisfy the query
So you mean that all rows with the same (ymd, user_id) would be on one physical node?


2015-04-04 16:38 GMT+02:00 Jack Krupansky <[hidden email]>:
Unfortunately, a non-equal relation on a partition key is not supported. You would need to bucket by some larger unit, like a month, and then use the date/time as a clustering column for the row key. Then you could query within the partition. The partition key determines which node can satisfy the query. Designing your partition key judiciously is the key (haha!) to performant Cassandra applications.

-- Jack Krupansky

On Sat, Apr 4, 2015 at 9:33 AM, Serega Sheypak <[hidden email]> wrote:
Hi, we plan to have 10^8 users and each user could generate 10 events per day.
So we have: 
10^8 records per day
10^8*30 records per month. 
Our timewindow analysis could be from 1 to 6 months.

Right now PK is PRIMARY KEY (user_id, ends) where endts is exact ts of event.

So you suggest this approach:
PRIMARY KEY ((ymd, user_id), event_ts ) 
WITH CLUSTERING ORDER BY (event_ts DESC);

where ymd=20150102 (the Second of January)?

What happens to writes:
SSTable with past days (ymd < current_day) stay untouched and don't take part in Compaction process since there are o changes to them?

What happens to read:
I issue query: 
select some_attributes 
from events where ymd >= 20150101 and ymd < 20150301
Does Cassandra skip SSTables which don't have ymd in specified range and give me a kind of partition elimination, like in traditional DBs?


2015-04-04 14:41 GMT+02:00 Jack Krupansky <[hidden email]>:
It depends on the actual number of events per user, but simply bucketing the partition key can give you the same effect - clustering rows by time range. A composite partition key could be comprised of the user name and the date.

It also depends on the data rate - is it many events per day or just a few events per week, or over what time period. You need to be careful - you don't want your Cassandra partitions to be too big (millions of rows) or too small (just a few or even one row per partition.)

-- Jack Krupansky

On Sat, Apr 4, 2015 at 7:03 AM, Serega Sheypak <[hidden email]> wrote:
Hi, I switched from HBase to Cassandra and try to find problem solution for timeseries analysis on top Cassandra.
I have a entity named "Event".
"Event" has attributes:
user_id - a guy who triggered event
event_ts - when even happened
event_type - type of event
some_other_attr - some other attrs we don't care about right now.

The DDL for entity event looks this way:

CREATE TABLE user_plans (
 
  id timeuuid,
  user_id timeuuid,
  event_ts timestamp,
  event_type int,
  some_other_attr text
  
PRIMARY KEY (user_id, ends)
);

Table is "infinite", It would grow continuously during application lifetime.
I want to ask question:
Cassandra, give me all event where event_ts >= xxx and event_ts <=yyy.

Right now it would lead to full table scan.

There is a trick in HBase. HBase has table abstraction and HBase has Column Family abstraction. 
Column family should be declared in advance. 
Column family - physically is a pack of HFiles ("SSTables in C*").
So I can easily add partitioning for my HBase table:
alter table hbase_events add column familiy '2015_01' 
and store all 2015 January data to Column familiy named '2015_01'.

When I want to get January data, I would directly access column family named '2015_01' and I won't massage all data in table, just this piece.

What is approach in C* in this case?
I have an idea create several tables: event_2015_01, event_2015_02, e.t.c. but it looks rather ugly from my current understanding how it works.











--

Founder/CEO Spinn3r.com
Location: San Francisco, CA
… or check out my Google+ profile


Reply | Threaded
Open this post in threaded view
|

Re: Timeseries analysis using Cassandra and partition by date period

Srinivasa T N

Comparison to OpenTSDB HBase

For one we do not use id’s for strings. The string data (metric names and tags) are written to row keys and the appropriate indexes. Because Cassandra has much wider rows there are far fewer keys written to the database. The space saved by using id’s is minor and by not using id’s we avoid having to use any kind of locks across the cluster.

As mentioned the Cassandra has wider rows. The default row size in OpenTSDB HBase is 1 hour. Cassandra is set to 3 weeks.

http://kairosdb.github.io/kairosdocs/CassandraSchema.html

On Mon, Apr 6, 2015 at 3:27 PM, Serega Sheypak <[hidden email]> wrote:
Thanks, is it a kind of opentsdb?

2015-04-05 18:28 GMT+02:00 Kevin Burton <[hidden email]>:
> Hi, I switched from HBase to Cassandra and try to find problem solution for timeseries analysis on top Cassandra.

Depending on what you’re looking for, you might want to check out KairosDB.

0.95 beta2 just shipped yesterday as well so you have good timing.


On Sat, Apr 4, 2015 at 11:29 AM, Serega Sheypak <[hidden email]> wrote:
Okay, so bucketing by day/week/month is a capacity planning stuff and actual questions I want to ask. 
As as a conclusion:
I have a table events

CREATE TABLE user_plans (
  id timeuuid,
  user_id timeuuid,
  event_ts timestamp,
  event_type int,
  some_other_attr text
  
PRIMARY KEY (user_id, ends)
);
which fits tactic queries: 
select smth from user_plans where user_id='xxx' and end_ts > now()

Then I create second table user_plans_daily (or weekly, monthy)

with DDL:
CREATE TABLE user_plans_daily/weekly/monthly (
  ymd int,
  user_id timeuuid,
  event_ts timestamp,
  event_type int,
  some_other_attr text
)  
PRIMARY KEY ((ymd, user_id), event_ts ) 
WITH CLUSTERING ORDER BY (event_ts DESC);

And this table is good for answering strategic questions: 
select * from 
user_plans_daily/weekly/monthly
where ymd in (....)
And I should avoid long condition inside IN clause, that is why you suggest me to create bigger bucket, correct?


2015-04-04 20:00 GMT+02:00 Jack Krupansky <[hidden email]>:
It sounds like your time bucket should be a month, but it depends on the amount of data per user per day and your main query range. Within the partition you can then query for a range of days.

Yes, all of the rows within a partition are stored on one physical node as well as the replica nodes.

-- Jack Krupansky

On Sat, Apr 4, 2015 at 1:38 PM, Serega Sheypak <[hidden email]> wrote:
>non-equal relation on a partition key is not supported
Ok, can I generate select query:
select some_attributes 
from events where ymd = 20150101 or ymd = 20150102 or 20150103 ... or 20150331

> The partition key determines which node can satisfy the query
So you mean that all rows with the same (ymd, user_id) would be on one physical node?


2015-04-04 16:38 GMT+02:00 Jack Krupansky <[hidden email]>:
Unfortunately, a non-equal relation on a partition key is not supported. You would need to bucket by some larger unit, like a month, and then use the date/time as a clustering column for the row key. Then you could query within the partition. The partition key determines which node can satisfy the query. Designing your partition key judiciously is the key (haha!) to performant Cassandra applications.

-- Jack Krupansky

On Sat, Apr 4, 2015 at 9:33 AM, Serega Sheypak <[hidden email]> wrote:
Hi, we plan to have 10^8 users and each user could generate 10 events per day.
So we have: 
10^8 records per day
10^8*30 records per month. 
Our timewindow analysis could be from 1 to 6 months.

Right now PK is PRIMARY KEY (user_id, ends) where endts is exact ts of event.

So you suggest this approach:
PRIMARY KEY ((ymd, user_id), event_ts ) 
WITH CLUSTERING ORDER BY (event_ts DESC);

where ymd=20150102 (the Second of January)?

What happens to writes:
SSTable with past days (ymd < current_day) stay untouched and don't take part in Compaction process since there are o changes to them?

What happens to read:
I issue query: 
select some_attributes 
from events where ymd >= 20150101 and ymd < 20150301
Does Cassandra skip SSTables which don't have ymd in specified range and give me a kind of partition elimination, like in traditional DBs?


2015-04-04 14:41 GMT+02:00 Jack Krupansky <[hidden email]>:
It depends on the actual number of events per user, but simply bucketing the partition key can give you the same effect - clustering rows by time range. A composite partition key could be comprised of the user name and the date.

It also depends on the data rate - is it many events per day or just a few events per week, or over what time period. You need to be careful - you don't want your Cassandra partitions to be too big (millions of rows) or too small (just a few or even one row per partition.)

-- Jack Krupansky

On Sat, Apr 4, 2015 at 7:03 AM, Serega Sheypak <[hidden email]> wrote:
Hi, I switched from HBase to Cassandra and try to find problem solution for timeseries analysis on top Cassandra.
I have a entity named "Event".
"Event" has attributes:
user_id - a guy who triggered event
event_ts - when even happened
event_type - type of event
some_other_attr - some other attrs we don't care about right now.

The DDL for entity event looks this way:

CREATE TABLE user_plans (
 
  id timeuuid,
  user_id timeuuid,
  event_ts timestamp,
  event_type int,
  some_other_attr text
  
PRIMARY KEY (user_id, ends)
);

Table is "infinite", It would grow continuously during application lifetime.
I want to ask question:
Cassandra, give me all event where event_ts >= xxx and event_ts <=yyy.

Right now it would lead to full table scan.

There is a trick in HBase. HBase has table abstraction and HBase has Column Family abstraction. 
Column family should be declared in advance. 
Column family - physically is a pack of HFiles ("SSTables in C*").
So I can easily add partitioning for my HBase table:
alter table hbase_events add column familiy '2015_01' 
and store all 2015 January data to Column familiy named '2015_01'.

When I want to get January data, I would directly access column family named '2015_01' and I won't massage all data in table, just this piece.

What is approach in C* in this case?
I have an idea create several tables: event_2015_01, event_2015_02, e.t.c. but it looks rather ugly from my current understanding how it works.











--

Founder/CEO Spinn3r.com
Location: San Francisco, CA
… or check out my Google+ profile



Reply | Threaded
Open this post in threaded view
|

Re: Timeseries analysis using Cassandra and partition by date period

Serega Sheypak
Thank you, we'll see that instrument,

2015-04-06 12:30 GMT+02:00 Srinivasa T N <[hidden email]>:

Comparison to OpenTSDB HBase

For one we do not use id’s for strings. The string data (metric names and tags) are written to row keys and the appropriate indexes. Because Cassandra has much wider rows there are far fewer keys written to the database. The space saved by using id’s is minor and by not using id’s we avoid having to use any kind of locks across the cluster.

As mentioned the Cassandra has wider rows. The default row size in OpenTSDB HBase is 1 hour. Cassandra is set to 3 weeks.

http://kairosdb.github.io/kairosdocs/CassandraSchema.html

On Mon, Apr 6, 2015 at 3:27 PM, Serega Sheypak <[hidden email]> wrote:
Thanks, is it a kind of opentsdb?

2015-04-05 18:28 GMT+02:00 Kevin Burton <[hidden email]>:
> Hi, I switched from HBase to Cassandra and try to find problem solution for timeseries analysis on top Cassandra.

Depending on what you’re looking for, you might want to check out KairosDB.

0.95 beta2 just shipped yesterday as well so you have good timing.


On Sat, Apr 4, 2015 at 11:29 AM, Serega Sheypak <[hidden email]> wrote:
Okay, so bucketing by day/week/month is a capacity planning stuff and actual questions I want to ask. 
As as a conclusion:
I have a table events

CREATE TABLE user_plans (
  id timeuuid,
  user_id timeuuid,
  event_ts timestamp,
  event_type int,
  some_other_attr text
  
PRIMARY KEY (user_id, ends)
);
which fits tactic queries: 
select smth from user_plans where user_id='xxx' and end_ts > now()

Then I create second table user_plans_daily (or weekly, monthy)

with DDL:
CREATE TABLE user_plans_daily/weekly/monthly (
  ymd int,
  user_id timeuuid,
  event_ts timestamp,
  event_type int,
  some_other_attr text
)  
PRIMARY KEY ((ymd, user_id), event_ts ) 
WITH CLUSTERING ORDER BY (event_ts DESC);

And this table is good for answering strategic questions: 
select * from 
user_plans_daily/weekly/monthly
where ymd in (....)
And I should avoid long condition inside IN clause, that is why you suggest me to create bigger bucket, correct?


2015-04-04 20:00 GMT+02:00 Jack Krupansky <[hidden email]>:
It sounds like your time bucket should be a month, but it depends on the amount of data per user per day and your main query range. Within the partition you can then query for a range of days.

Yes, all of the rows within a partition are stored on one physical node as well as the replica nodes.

-- Jack Krupansky

On Sat, Apr 4, 2015 at 1:38 PM, Serega Sheypak <[hidden email]> wrote:
>non-equal relation on a partition key is not supported
Ok, can I generate select query:
select some_attributes 
from events where ymd = 20150101 or ymd = 20150102 or 20150103 ... or 20150331

> The partition key determines which node can satisfy the query
So you mean that all rows with the same (ymd, user_id) would be on one physical node?


2015-04-04 16:38 GMT+02:00 Jack Krupansky <[hidden email]>:
Unfortunately, a non-equal relation on a partition key is not supported. You would need to bucket by some larger unit, like a month, and then use the date/time as a clustering column for the row key. Then you could query within the partition. The partition key determines which node can satisfy the query. Designing your partition key judiciously is the key (haha!) to performant Cassandra applications.

-- Jack Krupansky

On Sat, Apr 4, 2015 at 9:33 AM, Serega Sheypak <[hidden email]> wrote:
Hi, we plan to have 10^8 users and each user could generate 10 events per day.
So we have: 
10^8 records per day
10^8*30 records per month. 
Our timewindow analysis could be from 1 to 6 months.

Right now PK is PRIMARY KEY (user_id, ends) where endts is exact ts of event.

So you suggest this approach:
PRIMARY KEY ((ymd, user_id), event_ts ) 
WITH CLUSTERING ORDER BY (event_ts DESC);

where ymd=20150102 (the Second of January)?

What happens to writes:
SSTable with past days (ymd < current_day) stay untouched and don't take part in Compaction process since there are o changes to them?

What happens to read:
I issue query: 
select some_attributes 
from events where ymd >= 20150101 and ymd < 20150301
Does Cassandra skip SSTables which don't have ymd in specified range and give me a kind of partition elimination, like in traditional DBs?


2015-04-04 14:41 GMT+02:00 Jack Krupansky <[hidden email]>:
It depends on the actual number of events per user, but simply bucketing the partition key can give you the same effect - clustering rows by time range. A composite partition key could be comprised of the user name and the date.

It also depends on the data rate - is it many events per day or just a few events per week, or over what time period. You need to be careful - you don't want your Cassandra partitions to be too big (millions of rows) or too small (just a few or even one row per partition.)

-- Jack Krupansky

On Sat, Apr 4, 2015 at 7:03 AM, Serega Sheypak <[hidden email]> wrote:
Hi, I switched from HBase to Cassandra and try to find problem solution for timeseries analysis on top Cassandra.
I have a entity named "Event".
"Event" has attributes:
user_id - a guy who triggered event
event_ts - when even happened
event_type - type of event
some_other_attr - some other attrs we don't care about right now.

The DDL for entity event looks this way:

CREATE TABLE user_plans (
 
  id timeuuid,
  user_id timeuuid,
  event_ts timestamp,
  event_type int,
  some_other_attr text
  
PRIMARY KEY (user_id, ends)
);

Table is "infinite", It would grow continuously during application lifetime.
I want to ask question:
Cassandra, give me all event where event_ts >= xxx and event_ts <=yyy.

Right now it would lead to full table scan.

There is a trick in HBase. HBase has table abstraction and HBase has Column Family abstraction. 
Column family should be declared in advance. 
Column family - physically is a pack of HFiles ("SSTables in C*").
So I can easily add partitioning for my HBase table:
alter table hbase_events add column familiy '2015_01' 
and store all 2015 January data to Column familiy named '2015_01'.

When I want to get January data, I would directly access column family named '2015_01' and I won't massage all data in table, just this piece.

What is approach in C* in this case?
I have an idea create several tables: event_2015_01, event_2015_02, e.t.c. but it looks rather ugly from my current understanding how it works.











--

Founder/CEO Spinn3r.com
Location: San Francisco, CA
… or check out my Google+ profile