Quantcast

Data model suggestions

classic Classic list List threaded Threaded
12 messages Options
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Data model suggestions

Ali Akhtar
Hey all,

We are working on moving a mysql based application to Cassandra.

The workflow in mysql is this: We have two tables: active and archive . Every hour, we pull in data from an external API. The records which are active, are kept in 'active' table. Once a record is no longer active, its deleted from 'active' and re-inserted into 'archive'

The purpose for that, is because most of the time, queries are only done against the active records rather than archived. Therefore keeping the active table small may help with faster queries, if it only has to search 200k records vs 3 million or more.

Is it advisable to keep the same data model in Cassandra? I'm concerned about tombstone issues when records are deleted from active. 

Thanks.
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Data model suggestions

Manoj Khangaonkar
Hi,

How do you determine if the record is no longer active ? Is it a perioidic process that goes through every record and checks when the last update happened ?

regards

On Thu, Apr 23, 2015 at 8:09 AM, Ali Akhtar <[hidden email]> wrote:
Hey all,

We are working on moving a mysql based application to Cassandra.

The workflow in mysql is this: We have two tables: active and archive . Every hour, we pull in data from an external API. The records which are active, are kept in 'active' table. Once a record is no longer active, its deleted from 'active' and re-inserted into 'archive'

The purpose for that, is because most of the time, queries are only done against the active records rather than archived. Therefore keeping the active table small may help with faster queries, if it only has to search 200k records vs 3 million or more.

Is it advisable to keep the same data model in Cassandra? I'm concerned about tombstone issues when records are deleted from active. 

Thanks.



--
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Data model suggestions

Ali Akhtar
That's returned by the external API we're querying. We query them for active records, if a previous active record isn't included in the results, that means its time to archive that record.

On Thu, Apr 23, 2015 at 9:20 PM, Manoj Khangaonkar <[hidden email]> wrote:
Hi,

How do you determine if the record is no longer active ? Is it a perioidic process that goes through every record and checks when the last update happened ?

regards

On Thu, Apr 23, 2015 at 8:09 AM, Ali Akhtar <[hidden email]> wrote:
Hey all,

We are working on moving a mysql based application to Cassandra.

The workflow in mysql is this: We have two tables: active and archive . Every hour, we pull in data from an external API. The records which are active, are kept in 'active' table. Once a record is no longer active, its deleted from 'active' and re-inserted into 'archive'

The purpose for that, is because most of the time, queries are only done against the active records rather than archived. Therefore keeping the active table small may help with faster queries, if it only has to search 200k records vs 3 million or more.

Is it advisable to keep the same data model in Cassandra? I'm concerned about tombstone issues when records are deleted from active. 

Thanks.



--

Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Data model suggestions

Manoj Khangaonkar
Hi,

If your external API returns active records, that means I am guessing you need to do a select * on the active table to figure out which records in the table are no longer active.

You might be aware that range selects based on partition key will timeout in cassandra. They can however be made to work using the column cluster key.

To comment more, We would need to see your proposed cassandra tables and queries that you might need to run.

regards




On Thu, Apr 23, 2015 at 9:45 AM, Ali Akhtar <[hidden email]> wrote:
That's returned by the external API we're querying. We query them for active records, if a previous active record isn't included in the results, that means its time to archive that record.

On Thu, Apr 23, 2015 at 9:20 PM, Manoj Khangaonkar <[hidden email]> wrote:
Hi,

How do you determine if the record is no longer active ? Is it a perioidic process that goes through every record and checks when the last update happened ?

regards

On Thu, Apr 23, 2015 at 8:09 AM, Ali Akhtar <[hidden email]> wrote:
Hey all,

We are working on moving a mysql based application to Cassandra.

The workflow in mysql is this: We have two tables: active and archive . Every hour, we pull in data from an external API. The records which are active, are kept in 'active' table. Once a record is no longer active, its deleted from 'active' and re-inserted into 'archive'

The purpose for that, is because most of the time, queries are only done against the active records rather than archived. Therefore keeping the active table small may help with faster queries, if it only has to search 200k records vs 3 million or more.

Is it advisable to keep the same data model in Cassandra? I'm concerned about tombstone issues when records are deleted from active. 

Thanks.



--




--
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Data model suggestions

Ali Akhtar
Good point about the range selects. I think they can be made to work with limits, though. Or, since the active records will never usually be > 500k, the ids may just be cached in memory.

Most of the time, during reads, the queries will just consist of select * where primaryKey = someValue . One row at a time.

The question is just, whether to keep all records in one table (including archived records which wont be queried 99% of the time), or to keep active records in their own table, and delete them when they're no longer active. Will that produce tombstone issues?

On Fri, Apr 24, 2015 at 12:56 AM, Manoj Khangaonkar <[hidden email]> wrote:
Hi,

If your external API returns active records, that means I am guessing you need to do a select * on the active table to figure out which records in the table are no longer active.

You might be aware that range selects based on partition key will timeout in cassandra. They can however be made to work using the column cluster key.

To comment more, We would need to see your proposed cassandra tables and queries that you might need to run.

regards




On Thu, Apr 23, 2015 at 9:45 AM, Ali Akhtar <[hidden email]> wrote:
That's returned by the external API we're querying. We query them for active records, if a previous active record isn't included in the results, that means its time to archive that record.

On Thu, Apr 23, 2015 at 9:20 PM, Manoj Khangaonkar <[hidden email]> wrote:
Hi,

How do you determine if the record is no longer active ? Is it a perioidic process that goes through every record and checks when the last update happened ?

regards

On Thu, Apr 23, 2015 at 8:09 AM, Ali Akhtar <[hidden email]> wrote:
Hey all,

We are working on moving a mysql based application to Cassandra.

The workflow in mysql is this: We have two tables: active and archive . Every hour, we pull in data from an external API. The records which are active, are kept in 'active' table. Once a record is no longer active, its deleted from 'active' and re-inserted into 'archive'

The purpose for that, is because most of the time, queries are only done against the active records rather than archived. Therefore keeping the active table small may help with faster queries, if it only has to search 200k records vs 3 million or more.

Is it advisable to keep the same data model in Cassandra? I'm concerned about tombstone issues when records are deleted from active. 

Thanks.



--




--

Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Data model suggestions

Narendra Sharma

I think one table say record should be good. The primary key is record id. This will ensure good distribution.
Just update the active attribute to true or false.
For range query on active vs archive records maintain 2 indexes or try secondary index.

On Apr 23, 2015 1:32 PM, "Ali Akhtar" <[hidden email]> wrote:
Good point about the range selects. I think they can be made to work with limits, though. Or, since the active records will never usually be > 500k, the ids may just be cached in memory.

Most of the time, during reads, the queries will just consist of select * where primaryKey = someValue . One row at a time.

The question is just, whether to keep all records in one table (including archived records which wont be queried 99% of the time), or to keep active records in their own table, and delete them when they're no longer active. Will that produce tombstone issues?

On Fri, Apr 24, 2015 at 12:56 AM, Manoj Khangaonkar <[hidden email]> wrote:
Hi,

If your external API returns active records, that means I am guessing you need to do a select * on the active table to figure out which records in the table are no longer active.

You might be aware that range selects based on partition key will timeout in cassandra. They can however be made to work using the column cluster key.

To comment more, We would need to see your proposed cassandra tables and queries that you might need to run.

regards




On Thu, Apr 23, 2015 at 9:45 AM, Ali Akhtar <[hidden email]> wrote:
That's returned by the external API we're querying. We query them for active records, if a previous active record isn't included in the results, that means its time to archive that record.

On Thu, Apr 23, 2015 at 9:20 PM, Manoj Khangaonkar <[hidden email]> wrote:
Hi,

How do you determine if the record is no longer active ? Is it a perioidic process that goes through every record and checks when the last update happened ?

regards

On Thu, Apr 23, 2015 at 8:09 AM, Ali Akhtar <[hidden email]> wrote:
Hey all,

We are working on moving a mysql based application to Cassandra.

The workflow in mysql is this: We have two tables: active and archive . Every hour, we pull in data from an external API. The records which are active, are kept in 'active' table. Once a record is no longer active, its deleted from 'active' and re-inserted into 'archive'

The purpose for that, is because most of the time, queries are only done against the active records rather than archived. Therefore keeping the active table small may help with faster queries, if it only has to search 200k records vs 3 million or more.

Is it advisable to keep the same data model in Cassandra? I'm concerned about tombstone issues when records are deleted from active. 

Thanks.



--




--

Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

RE: Data model suggestions

Peer, Oded

I would maintain two tables.

An “archive” table that holds all the active and inactive records, and is updated hourly (re-inserting the same record has some compaction overhead but on the other side deleting records has tombstones overhead).

An “active” table which holds all the records in the last external API invocation.

To avoid tombstones and read-before-delete issues “active” should actually a synonym, an alias, to the most recent active table.

I suggest you create two identical tables, “active1” and “active2”, and an “active_alias” table that informs which of the two is the most recent.

Thus when you query the external API you insert the data to “archive” and to the unaliased “activeN” table, switch the alias value in “active_alias” and truncate the new unaliased “activeM” table.

No need to query the data before inserting it. Make sure truncating doesn’t create automatic snapshots.

 

 

From: Narendra Sharma [mailto:[hidden email]]
Sent: Friday, April 24, 2015 6:53 AM
To: [hidden email]
Subject: Re: Data model suggestions

 

I think one table say record should be good. The primary key is record id. This will ensure good distribution.
Just update the active attribute to true or false.
For range query on active vs archive records maintain 2 indexes or try secondary index.

On Apr 23, 2015 1:32 PM, "Ali Akhtar" <[hidden email]> wrote:

Good point about the range selects. I think they can be made to work with limits, though. Or, since the active records will never usually be > 500k, the ids may just be cached in memory.

 

Most of the time, during reads, the queries will just consist of select * where primaryKey = someValue . One row at a time.

 

The question is just, whether to keep all records in one table (including archived records which wont be queried 99% of the time), or to keep active records in their own table, and delete them when they're no longer active. Will that produce tombstone issues?

 

On Fri, Apr 24, 2015 at 12:56 AM, Manoj Khangaonkar <[hidden email]> wrote:

Hi,

If your external API returns active records, that means I am guessing you need to do a select * on the active table to figure out which records in the table are no longer active.

You might be aware that range selects based on partition key will timeout in cassandra. They can however be made to work using the column cluster key.

To comment more, We would need to see your proposed cassandra tables and queries that you might need to run.

regards

 

 

 

On Thu, Apr 23, 2015 at 9:45 AM, Ali Akhtar <[hidden email]> wrote:

That's returned by the external API we're querying. We query them for active records, if a previous active record isn't included in the results, that means its time to archive that record.

 

On Thu, Apr 23, 2015 at 9:20 PM, Manoj Khangaonkar <[hidden email]> wrote:

Hi,

How do you determine if the record is no longer active ? Is it a perioidic process that goes through every record and checks when the last update happened ?

regards

 

On Thu, Apr 23, 2015 at 8:09 AM, Ali Akhtar <[hidden email]> wrote:

Hey all,

 

We are working on moving a mysql based application to Cassandra.

 

The workflow in mysql is this: We have two tables: active and archive . Every hour, we pull in data from an external API. The records which are active, are kept in 'active' table. Once a record is no longer active, its deleted from 'active' and re-inserted into 'archive'

 

The purpose for that, is because most of the time, queries are only done against the active records rather than archived. Therefore keeping the active table small may help with faster queries, if it only has to search 200k records vs 3 million or more.

 

Is it advisable to keep the same data model in Cassandra? I'm concerned about tombstone issues when records are deleted from active. 

 

Thanks.



--

 



--

 

Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Data model suggestions

Shahab Yunus
Interesting approach Oded.

Is this something similar that has been described here:

Regards,
Shahab

On Sun, Apr 26, 2015 at 4:29 AM, Peer, Oded <[hidden email]> wrote:

I would maintain two tables.

An “archive” table that holds all the active and inactive records, and is updated hourly (re-inserting the same record has some compaction overhead but on the other side deleting records has tombstones overhead).

An “active” table which holds all the records in the last external API invocation.

To avoid tombstones and read-before-delete issues “active” should actually a synonym, an alias, to the most recent active table.

I suggest you create two identical tables, “active1” and “active2”, and an “active_alias” table that informs which of the two is the most recent.

Thus when you query the external API you insert the data to “archive” and to the unaliased “activeN” table, switch the alias value in “active_alias” and truncate the new unaliased “activeM” table.

No need to query the data before inserting it. Make sure truncating doesn’t create automatic snapshots.

 

 

From: Narendra Sharma [mailto:[hidden email]]
Sent: Friday, April 24, 2015 6:53 AM
To: [hidden email]
Subject: Re: Data model suggestions

 

I think one table say record should be good. The primary key is record id. This will ensure good distribution.
Just update the active attribute to true or false.
For range query on active vs archive records maintain 2 indexes or try secondary index.

On Apr 23, 2015 1:32 PM, "Ali Akhtar" <[hidden email]> wrote:

Good point about the range selects. I think they can be made to work with limits, though. Or, since the active records will never usually be > 500k, the ids may just be cached in memory.

 

Most of the time, during reads, the queries will just consist of select * where primaryKey = someValue . One row at a time.

 

The question is just, whether to keep all records in one table (including archived records which wont be queried 99% of the time), or to keep active records in their own table, and delete them when they're no longer active. Will that produce tombstone issues?

 

On Fri, Apr 24, 2015 at 12:56 AM, Manoj Khangaonkar <[hidden email]> wrote:

Hi,

If your external API returns active records, that means I am guessing you need to do a select * on the active table to figure out which records in the table are no longer active.

You might be aware that range selects based on partition key will timeout in cassandra. They can however be made to work using the column cluster key.

To comment more, We would need to see your proposed cassandra tables and queries that you might need to run.

regards

 

 

 

On Thu, Apr 23, 2015 at 9:45 AM, Ali Akhtar <[hidden email]> wrote:

That's returned by the external API we're querying. We query them for active records, if a previous active record isn't included in the results, that means its time to archive that record.

 

On Thu, Apr 23, 2015 at 9:20 PM, Manoj Khangaonkar <[hidden email]> wrote:

Hi,

How do you determine if the record is no longer active ? Is it a perioidic process that goes through every record and checks when the last update happened ?

regards

 

On Thu, Apr 23, 2015 at 8:09 AM, Ali Akhtar <[hidden email]> wrote:

Hey all,

 

We are working on moving a mysql based application to Cassandra.

 

The workflow in mysql is this: We have two tables: active and archive . Every hour, we pull in data from an external API. The records which are active, are kept in 'active' table. Once a record is no longer active, its deleted from 'active' and re-inserted into 'archive'

 

The purpose for that, is because most of the time, queries are only done against the active records rather than archived. Therefore keeping the active table small may help with faster queries, if it only has to search 200k records vs 3 million or more.

 

Is it advisable to keep the same data model in Cassandra? I'm concerned about tombstone issues when records are deleted from active. 

 

Thanks.



--

 



--

 


Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Data model suggestions

Ali Akhtar
In reply to this post by Peer, Oded
Thanks Peer. I like the approach you're suggesting.

Why do you recommend truncating the last active table rather than just dropping it? Since all the data would be inserted into a new table, seems like it would make sense to drop the last table, and that way truncate snapshotting also won't have to be dealt with (unless I'm missing anything).

Thanks.


On Sun, Apr 26, 2015 at 1:29 PM, Peer, Oded <[hidden email]> wrote:

I would maintain two tables.

An “archive” table that holds all the active and inactive records, and is updated hourly (re-inserting the same record has some compaction overhead but on the other side deleting records has tombstones overhead).

An “active” table which holds all the records in the last external API invocation.

To avoid tombstones and read-before-delete issues “active” should actually a synonym, an alias, to the most recent active table.

I suggest you create two identical tables, “active1” and “active2”, and an “active_alias” table that informs which of the two is the most recent.

Thus when you query the external API you insert the data to “archive” and to the unaliased “activeN” table, switch the alias value in “active_alias” and truncate the new unaliased “activeM” table.

No need to query the data before inserting it. Make sure truncating doesn’t create automatic snapshots.

 

 

From: Narendra Sharma [mailto:[hidden email]]
Sent: Friday, April 24, 2015 6:53 AM
To: [hidden email]
Subject: Re: Data model suggestions

 

I think one table say record should be good. The primary key is record id. This will ensure good distribution.
Just update the active attribute to true or false.
For range query on active vs archive records maintain 2 indexes or try secondary index.

On Apr 23, 2015 1:32 PM, "Ali Akhtar" <[hidden email]> wrote:

Good point about the range selects. I think they can be made to work with limits, though. Or, since the active records will never usually be > 500k, the ids may just be cached in memory.

 

Most of the time, during reads, the queries will just consist of select * where primaryKey = someValue . One row at a time.

 

The question is just, whether to keep all records in one table (including archived records which wont be queried 99% of the time), or to keep active records in their own table, and delete them when they're no longer active. Will that produce tombstone issues?

 

On Fri, Apr 24, 2015 at 12:56 AM, Manoj Khangaonkar <[hidden email]> wrote:

Hi,

If your external API returns active records, that means I am guessing you need to do a select * on the active table to figure out which records in the table are no longer active.

You might be aware that range selects based on partition key will timeout in cassandra. They can however be made to work using the column cluster key.

To comment more, We would need to see your proposed cassandra tables and queries that you might need to run.

regards

 

 

 

On Thu, Apr 23, 2015 at 9:45 AM, Ali Akhtar <[hidden email]> wrote:

That's returned by the external API we're querying. We query them for active records, if a previous active record isn't included in the results, that means its time to archive that record.

 

On Thu, Apr 23, 2015 at 9:20 PM, Manoj Khangaonkar <[hidden email]> wrote:

Hi,

How do you determine if the record is no longer active ? Is it a perioidic process that goes through every record and checks when the last update happened ?

regards

 

On Thu, Apr 23, 2015 at 8:09 AM, Ali Akhtar <[hidden email]> wrote:

Hey all,

 

We are working on moving a mysql based application to Cassandra.

 

The workflow in mysql is this: We have two tables: active and archive . Every hour, we pull in data from an external API. The records which are active, are kept in 'active' table. Once a record is no longer active, its deleted from 'active' and re-inserted into 'archive'

 

The purpose for that, is because most of the time, queries are only done against the active records rather than archived. Therefore keeping the active table small may help with faster queries, if it only has to search 200k records vs 3 million or more.

 

Is it advisable to keep the same data model in Cassandra? I'm concerned about tombstone issues when records are deleted from active. 

 

Thanks.



--

 



--

 


Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

RE: Data model suggestions

Peer, Oded

I recommend truncating the table instead of dropping it since you don’t need to re-issue DDL commands and put load on the system keyspace.

Both DROP and TRUNCATE automatically create snapshots, there no “snapshotting” advantage for using DROP . See http://docs.datastax.com/en/cassandra/2.1/cassandra/configuration/configCassandra_yaml_r.html?scroll=reference_ds_qfg_n1r_1k__auto_snapshot

 

 

From: Ali Akhtar [mailto:[hidden email]]
Sent: Sunday, April 26, 2015 10:31 PM
To: [hidden email]
Subject: Re: Data model suggestions

 

Thanks Peer. I like the approach you're suggesting.

 

Why do you recommend truncating the last active table rather than just dropping it? Since all the data would be inserted into a new table, seems like it would make sense to drop the last table, and that way truncate snapshotting also won't have to be dealt with (unless I'm missing anything).

 

Thanks.

 

 

On Sun, Apr 26, 2015 at 1:29 PM, Peer, Oded <[hidden email]> wrote:

I would maintain two tables.

An “archive” table that holds all the active and inactive records, and is updated hourly (re-inserting the same record has some compaction overhead but on the other side deleting records has tombstones overhead).

An “active” table which holds all the records in the last external API invocation.

To avoid tombstones and read-before-delete issues “active” should actually a synonym, an alias, to the most recent active table.

I suggest you create two identical tables, “active1” and “active2”, and an “active_alias” table that informs which of the two is the most recent.

Thus when you query the external API you insert the data to “archive” and to the unaliased “activeN” table, switch the alias value in “active_alias” and truncate the new unaliased “activeM” table.

No need to query the data before inserting it. Make sure truncating doesn’t create automatic snapshots.

 

 

From: Narendra Sharma [mailto:[hidden email]]
Sent: Friday, April 24, 2015 6:53 AM
To: [hidden email]
Subject: Re: Data model suggestions

 

I think one table say record should be good. The primary key is record id. This will ensure good distribution.
Just update the active attribute to true or false.
For range query on active vs archive records maintain 2 indexes or try secondary index.

On Apr 23, 2015 1:32 PM, "Ali Akhtar" <[hidden email]> wrote:

Good point about the range selects. I think they can be made to work with limits, though. Or, since the active records will never usually be > 500k, the ids may just be cached in memory.

 

Most of the time, during reads, the queries will just consist of select * where primaryKey = someValue . One row at a time.

 

The question is just, whether to keep all records in one table (including archived records which wont be queried 99% of the time), or to keep active records in their own table, and delete them when they're no longer active. Will that produce tombstone issues?

 

On Fri, Apr 24, 2015 at 12:56 AM, Manoj Khangaonkar <[hidden email]> wrote:

Hi,

If your external API returns active records, that means I am guessing you need to do a select * on the active table to figure out which records in the table are no longer active.

You might be aware that range selects based on partition key will timeout in cassandra. They can however be made to work using the column cluster key.

To comment more, We would need to see your proposed cassandra tables and queries that you might need to run.

regards

 

 

 

On Thu, Apr 23, 2015 at 9:45 AM, Ali Akhtar <[hidden email]> wrote:

That's returned by the external API we're querying. We query them for active records, if a previous active record isn't included in the results, that means its time to archive that record.

 

On Thu, Apr 23, 2015 at 9:20 PM, Manoj Khangaonkar <[hidden email]> wrote:

Hi,

How do you determine if the record is no longer active ? Is it a perioidic process that goes through every record and checks when the last update happened ?

regards

 

On Thu, Apr 23, 2015 at 8:09 AM, Ali Akhtar <[hidden email]> wrote:

Hey all,

 

We are working on moving a mysql based application to Cassandra.

 

The workflow in mysql is this: We have two tables: active and archive . Every hour, we pull in data from an external API. The records which are active, are kept in 'active' table. Once a record is no longer active, its deleted from 'active' and re-inserted into 'archive'

 

The purpose for that, is because most of the time, queries are only done against the active records rather than archived. Therefore keeping the active table small may help with faster queries, if it only has to search 200k records vs 3 million or more.

 

Is it advisable to keep the same data model in Cassandra? I'm concerned about tombstone issues when records are deleted from active. 

 

Thanks.



--

 



--

 

 

Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Data model suggestions

Ali Akhtar
Wouldn't truncating the table create tombstones? 

On Mon, Apr 27, 2015 at 11:55 AM, Peer, Oded <[hidden email]> wrote:

I recommend truncating the table instead of dropping it since you don’t need to re-issue DDL commands and put load on the system keyspace.

Both DROP and TRUNCATE automatically create snapshots, there no “snapshotting” advantage for using DROP . See http://docs.datastax.com/en/cassandra/2.1/cassandra/configuration/configCassandra_yaml_r.html?scroll=reference_ds_qfg_n1r_1k__auto_snapshot

 

 

From: Ali Akhtar [mailto:[hidden email]]
Sent: Sunday, April 26, 2015 10:31 PM


To: [hidden email]
Subject: Re: Data model suggestions

 

Thanks Peer. I like the approach you're suggesting.

 

Why do you recommend truncating the last active table rather than just dropping it? Since all the data would be inserted into a new table, seems like it would make sense to drop the last table, and that way truncate snapshotting also won't have to be dealt with (unless I'm missing anything).

 

Thanks.

 

 

On Sun, Apr 26, 2015 at 1:29 PM, Peer, Oded <[hidden email]> wrote:

I would maintain two tables.

An “archive” table that holds all the active and inactive records, and is updated hourly (re-inserting the same record has some compaction overhead but on the other side deleting records has tombstones overhead).

An “active” table which holds all the records in the last external API invocation.

To avoid tombstones and read-before-delete issues “active” should actually a synonym, an alias, to the most recent active table.

I suggest you create two identical tables, “active1” and “active2”, and an “active_alias” table that informs which of the two is the most recent.

Thus when you query the external API you insert the data to “archive” and to the unaliased “activeN” table, switch the alias value in “active_alias” and truncate the new unaliased “activeM” table.

No need to query the data before inserting it. Make sure truncating doesn’t create automatic snapshots.

 

 

From: Narendra Sharma [mailto:[hidden email]]
Sent: Friday, April 24, 2015 6:53 AM
To: [hidden email]
Subject: Re: Data model suggestions

 

I think one table say record should be good. The primary key is record id. This will ensure good distribution.
Just update the active attribute to true or false.
For range query on active vs archive records maintain 2 indexes or try secondary index.

On Apr 23, 2015 1:32 PM, "Ali Akhtar" <[hidden email]> wrote:

Good point about the range selects. I think they can be made to work with limits, though. Or, since the active records will never usually be > 500k, the ids may just be cached in memory.

 

Most of the time, during reads, the queries will just consist of select * where primaryKey = someValue . One row at a time.

 

The question is just, whether to keep all records in one table (including archived records which wont be queried 99% of the time), or to keep active records in their own table, and delete them when they're no longer active. Will that produce tombstone issues?

 

On Fri, Apr 24, 2015 at 12:56 AM, Manoj Khangaonkar <[hidden email]> wrote:

Hi,

If your external API returns active records, that means I am guessing you need to do a select * on the active table to figure out which records in the table are no longer active.

You might be aware that range selects based on partition key will timeout in cassandra. They can however be made to work using the column cluster key.

To comment more, We would need to see your proposed cassandra tables and queries that you might need to run.

regards

 

 

 

On Thu, Apr 23, 2015 at 9:45 AM, Ali Akhtar <[hidden email]> wrote:

That's returned by the external API we're querying. We query them for active records, if a previous active record isn't included in the results, that means its time to archive that record.

 

On Thu, Apr 23, 2015 at 9:20 PM, Manoj Khangaonkar <[hidden email]> wrote:

Hi,

How do you determine if the record is no longer active ? Is it a perioidic process that goes through every record and checks when the last update happened ?

regards

 

On Thu, Apr 23, 2015 at 8:09 AM, Ali Akhtar <[hidden email]> wrote:

Hey all,

 

We are working on moving a mysql based application to Cassandra.

 

The workflow in mysql is this: We have two tables: active and archive . Every hour, we pull in data from an external API. The records which are active, are kept in 'active' table. Once a record is no longer active, its deleted from 'active' and re-inserted into 'archive'

 

The purpose for that, is because most of the time, queries are only done against the active records rather than archived. Therefore keeping the active table small may help with faster queries, if it only has to search 200k records vs 3 million or more.

 

Is it advisable to keep the same data model in Cassandra? I'm concerned about tombstone issues when records are deleted from active. 

 

Thanks.



--

 



--

 

 


Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Data model suggestions

Laing, Michael
No - it immediately removes the sstables on all nodes.

On Mon, Apr 27, 2015 at 7:53 AM, Ali Akhtar <[hidden email]> wrote:
Wouldn't truncating the table create tombstones? 

On Mon, Apr 27, 2015 at 11:55 AM, Peer, Oded <[hidden email]> wrote:

I recommend truncating the table instead of dropping it since you don’t need to re-issue DDL commands and put load on the system keyspace.

Both DROP and TRUNCATE automatically create snapshots, there no “snapshotting” advantage for using DROP . See http://docs.datastax.com/en/cassandra/2.1/cassandra/configuration/configCassandra_yaml_r.html?scroll=reference_ds_qfg_n1r_1k__auto_snapshot

 

 

From: Ali Akhtar [mailto:[hidden email]]
Sent: Sunday, April 26, 2015 10:31 PM


To: [hidden email]
Subject: Re: Data model suggestions

 

Thanks Peer. I like the approach you're suggesting.

 

Why do you recommend truncating the last active table rather than just dropping it? Since all the data would be inserted into a new table, seems like it would make sense to drop the last table, and that way truncate snapshotting also won't have to be dealt with (unless I'm missing anything).

 

Thanks.

 

 

On Sun, Apr 26, 2015 at 1:29 PM, Peer, Oded <[hidden email]> wrote:

I would maintain two tables.

An “archive” table that holds all the active and inactive records, and is updated hourly (re-inserting the same record has some compaction overhead but on the other side deleting records has tombstones overhead).

An “active” table which holds all the records in the last external API invocation.

To avoid tombstones and read-before-delete issues “active” should actually a synonym, an alias, to the most recent active table.

I suggest you create two identical tables, “active1” and “active2”, and an “active_alias” table that informs which of the two is the most recent.

Thus when you query the external API you insert the data to “archive” and to the unaliased “activeN” table, switch the alias value in “active_alias” and truncate the new unaliased “activeM” table.

No need to query the data before inserting it. Make sure truncating doesn’t create automatic snapshots.

 

 

From: Narendra Sharma [mailto:[hidden email]]
Sent: Friday, April 24, 2015 6:53 AM
To: [hidden email]
Subject: Re: Data model suggestions

 

I think one table say record should be good. The primary key is record id. This will ensure good distribution.
Just update the active attribute to true or false.
For range query on active vs archive records maintain 2 indexes or try secondary index.

On Apr 23, 2015 1:32 PM, "Ali Akhtar" <[hidden email]> wrote:

Good point about the range selects. I think they can be made to work with limits, though. Or, since the active records will never usually be > 500k, the ids may just be cached in memory.

 

Most of the time, during reads, the queries will just consist of select * where primaryKey = someValue . One row at a time.

 

The question is just, whether to keep all records in one table (including archived records which wont be queried 99% of the time), or to keep active records in their own table, and delete them when they're no longer active. Will that produce tombstone issues?

 

On Fri, Apr 24, 2015 at 12:56 AM, Manoj Khangaonkar <[hidden email]> wrote:

Hi,

If your external API returns active records, that means I am guessing you need to do a select * on the active table to figure out which records in the table are no longer active.

You might be aware that range selects based on partition key will timeout in cassandra. They can however be made to work using the column cluster key.

To comment more, We would need to see your proposed cassandra tables and queries that you might need to run.

regards

 

 

 

On Thu, Apr 23, 2015 at 9:45 AM, Ali Akhtar <[hidden email]> wrote:

That's returned by the external API we're querying. We query them for active records, if a previous active record isn't included in the results, that means its time to archive that record.

 

On Thu, Apr 23, 2015 at 9:20 PM, Manoj Khangaonkar <[hidden email]> wrote:

Hi,

How do you determine if the record is no longer active ? Is it a perioidic process that goes through every record and checks when the last update happened ?

regards

 

On Thu, Apr 23, 2015 at 8:09 AM, Ali Akhtar <[hidden email]> wrote:

Hey all,

 

We are working on moving a mysql based application to Cassandra.

 

The workflow in mysql is this: We have two tables: active and archive . Every hour, we pull in data from an external API. The records which are active, are kept in 'active' table. Once a record is no longer active, its deleted from 'active' and re-inserted into 'archive'

 

The purpose for that, is because most of the time, queries are only done against the active records rather than archived. Therefore keeping the active table small may help with faster queries, if it only has to search 200k records vs 3 million or more.

 

Is it advisable to keep the same data model in Cassandra? I'm concerned about tombstone issues when records are deleted from active. 

 

Thanks.



--

 



--

 

 



Loading...