Fwd: Data Modelling Help

classic Classic list List threaded Threaded
4 messages Options
Reply | Threaded
Open this post in threaded view
|

Fwd: Data Modelling Help

Sandeep Gupta
Hi,

I am a newbie with Cassandra and thus need data modelling help as I haven't found a resource that tackles the same problem.

The user case is similar to an email-system. I want to store a timeline of all emails a user has received and then fetch them back with three different ways:

1. All emails ever received
2. Mails that have been read by a user
3. Mails that are still unread by a user

My current model is as under:

CREATE TABLE TIMELINE (
    userID varchar,
    emailID varchar,
    timestamp bigint,
    read boolean,
    PRIMARY KEY (userID, timestamp)
) WITH CLUSTERING ORDER BY (timestamp desc);

CREATE INDEX ON TIMELINE (userID, read);

The queries I need to support are:

SELECT * FROM TIMELINE where userID = 12;
SELECT * FROM TIMELINE where userID = 12 order by timestamp asc;
SELECT * FROM TIMELINE where userID = 12 and read = true;
SELECT * FROM TIMELINE where userID = 12 and read = false;
SELECT * FROM TIMELINE where userID = 12 and read = true order by timestamp asc;
SELECT * FROM TIMELINE where userID = 12 and read = false order by timestamp asc;


Queries are:

1. Should I keep  read as my secondary index as It will be frequently updated and can create tombstones - per http://docs.datastax.com/en/cql/3.1/cql/ddl/ddl_when_use_index_c.html its a problem.

2. Can we do inequality check on secondary index because i found out that atleast one equality condition should be present on secondary index

3. If this is not the right way to model, please suggest on how to support the above queries. Maintaining three different tables worries me about the number of insertions (for read/unread) as number of users * emails viewed per day will be huge.


Thanks in advance.

Best Regards!





Keep Walking,
~ Sandeep



Reply | Threaded
Open this post in threaded view
|

RE: Data Modelling Help

Donald Smith

Secondary indicies are inefficient and are deprecated, as far as I know.

 

Unless you store many thousands of emails for a long time (which I recommend against), just use a single table with the partition key being the userid and the timestamp being the clustering (column) key, as in your schema.   You might want to use a TTL to expire old emails.

 

If you need to store a huge number of emails, consider splitting into tables by year, for example.

 

If you had two tables (one for read emails and one for unread emails) you’d have to move rows between them when an email got marked (un)read.  But it would support efficiently finding (un)read emails.

 

Don

 

From: Sandeep Gupta [mailto:[hidden email]]
Sent: Monday, April 27, 2015 11:46 AM
To: [hidden email]
Subject: Fwd: Data Modelling Help

 

Hi,

 

I am a newbie with Cassandra and thus need data modelling help as I haven't found a resource that tackles the same problem.

 

The user case is similar to an email-system. I want to store a timeline of all emails a user has received and then fetch them back with three different ways:

 

1. All emails ever received

2. Mails that have been read by a user

3. Mails that are still unread by a user

 

My current model is as under:

 

CREATE TABLE TIMELINE (

    userID varchar,

    emailID varchar,

    timestamp bigint,

    read boolean,

    PRIMARY KEY (userID, timestamp)

) WITH CLUSTERING ORDER BY (timestamp desc);


CREATE INDEX ON TIMELINE (userID, read);

 

The queries I need to support are:

 

SELECT * FROM TIMELINE where userID = 12;

SELECT * FROM TIMELINE where userID = 12 order by timestamp asc;

SELECT * FROM TIMELINE where userID = 12 and read = true;

SELECT * FROM TIMELINE where userID = 12 and read = false;

SELECT * FROM TIMELINE where userID = 12 and read = true order by timestamp asc;

SELECT * FROM TIMELINE where userID = 12 and read = false order by timestamp asc;

 

 

Queries are:

 

1. Should I keep  read as my secondary index as It will be frequently updated and can create tombstones - per http://docs.datastax.com/en/cql/3.1/cql/ddl/ddl_when_use_index_c.html its a problem.

 

2. Can we do inequality check on secondary index because i found out that atleast one equality condition should be present on secondary index

 

3. If this is not the right way to model, please suggest on how to support the above queries. Maintaining three different tables worries me about the number of insertions (for read/unread) as number of users * emails viewed per day will be huge.

 

 

Thanks in advance.

 

Best Regards!

 

 

 

 

 

Keep Walking,
~ Sandeep

 

 

 

Reply | Threaded
Open this post in threaded view
|

Re: Data Modelling Help

Robert Coli-3
On Wed, Apr 29, 2015 at 9:01 AM, Donald Smith <[hidden email]> wrote:

Secondary indicies are inefficient and are deprecated, as far as I know.


They are not deprecated, the correct summary is that they should only be used in very particular circumstances. If you're not sure if you're in one of those cases, you probably aren't.

=Rob
 
Reply | Threaded
Open this post in threaded view
|

Re: Data Modelling Help

Paulo Motta
This article explains when it's OK to use secondary indexes in Cassandra: http://www.wentnet.com/blog/?p=77

PS: the article is from 2013, so it can be outdated by now.. but at least it should give you some preliminary background on the topic.

2015-04-29 16:44 GMT-03:00 Robert Coli <[hidden email]>:
On Wed, Apr 29, 2015 at 9:01 AM, Donald Smith <[hidden email]> wrote:

Secondary indicies are inefficient and are deprecated, as far as I know.


They are not deprecated, the correct summary is that they should only be used in very particular circumstances. If you're not sure if you're in one of those cases, you probably aren't.

=Rob