Inserting null values


Inserting null values

Matthew Johnson

Hi all,

 

I have some fields that I am storing into Cassandra, but some of them could be null at any given point. As there are quite a lot of them, it makes the code much more readable if I don’t check each one for null before adding it to the INSERT.

 

I can see a few Jiras around CQL 3 supporting inserting nulls:

 

https://issues.apache.org/jira/browse/CASSANDRA-3783

https://issues.apache.org/jira/browse/CASSANDRA-5648

 

But I have tested inserting null and it seems to work fine (when querying the table with cqlsh, it shows up as a red lowercase null).

 

Are there any obvious pitfalls to look out for that I have missed? Could it be a performance concern to insert a row with some nulls, as opposed to checking the values first and inserting the row and just omitting those columns?

 

Thanks!

Matt
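
To make the two alternatives in the question concrete, here is a minimal sketch using the DataStax Java driver that comes up later in the thread; the keyspace, table, and column names are hypothetical:

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.Session;

    // Minimal sketch of the two alternatives; keyspace/table are hypothetical:
    //   CREATE TABLE demo.users (id int PRIMARY KEY, email text, phone text);
    public class NullInsertSketch {
        public static void main(String[] args) {
            try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
                 Session session = cluster.connect("demo")) {

                // Option 1: insert null explicitly. Simpler calling code, but the null
                // column is written as a tombstone (see the replies below).
                session.execute("INSERT INTO users (id, email, phone) VALUES (1, 'a@example.com', null)");

                // Option 2: omit the column entirely. No tombstone, but the statement
                // text has to vary with whichever columns happen to be present.
                session.execute("INSERT INTO users (id, email) VALUES (2, 'b@example.com')");
            }
        }
    }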

 


RE: Inserting null values

Peer, Oded

Inserting a null value creates a tombstone. Tombstones can have major performance implications.

You can see the tombstones using sstable2json.

If you have a small number of records with null values this seems OK; otherwise I recommend using the QueryBuilder (for Java clients) and waiting for https://issues.apache.org/jira/browse/CASSANDRA-7304.

 

 


Re: Inserting null values

Ali Akhtar
In reply to this post by Matthew Johnson

Have you considered adding a 'toSafe' method which checks if the item is null, and if so, returns a default value? E.g. String foo = safe(bar, "");
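
A minimal sketch of that kind of helper (the class name and generic signature are illustrative, not from the post):

    // Illustrative null-default helper along the lines Ali describes.
    public final class Defaults {
        private Defaults() {}

        /** Returns {@code value} if it is non-null, otherwise {@code defaultValue}. */
        public static <T> T safe(T value, T defaultValue) {
            return value != null ? value : defaultValue;
        }
    }

    // Usage: String foo = Defaults.safe(bar, "");

One trade-off, raised later in the thread: a default such as the empty string is a real stored value, whereas the tombstone a null would create is smaller at rest.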


Re: Inserting null values

DuyHai Doan
<auto promotion mode on>

The problem of NULL inserts was solved a long time ago with the Insert Strategy in Achilles: https://github.com/doanduyhai/Achilles/wiki/Insert-Strategy

</auto promotion off>

However, it's nice to see that there will be a flag on the protocol side to handle this problem.


Re: Inserting null values

Robert Wille-2
In reply to this post by Matthew Johnson
I’ve come across the same thing. I have a table with at least half a dozen columns that could be null, in any combination. Having a prepared statement for each permutation of null columns just isn’t going to happen. I don’t want to build custom queries each time because I have a really cool system of managing my queries that relies on them being prepared.

Fortunately for me, I should have at most a handful of tombstones in each partition, and most of my records are written exactly once. So, I just let the tombstones get written and they’ll eventually get compacted out and life will go on.

It’s annoying and not ideal, but what can you do?
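
For illustration, a minimal sketch of that single-prepared-statement approach (the table and column names are assumed): binding null for an absent value keeps one prepared statement, at the cost of a tombstone per null column.

    import com.datastax.driver.core.BoundStatement;
    import com.datastax.driver.core.PreparedStatement;
    import com.datastax.driver.core.Session;

    // Sketch only: the 'users' table (id, email, phone) is hypothetical.
    public class SinglePreparedInsert {
        private final PreparedStatement insert;

        public SinglePreparedInsert(Session session) {
            // One statement covers every combination of present/absent columns...
            insert = session.prepare("INSERT INTO users (id, email, phone) VALUES (?, ?, ?)");
        }

        public void save(Session session, int id, String email, String phone) {
            // ...because null is bound for missing values; each null writes a tombstone.
            BoundStatement bound = insert.bind(id, email, phone);
            session.execute(bound);
        }
    }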


RE: Inserting null values

Matthew Johnson

Thank you all for the advice!

 

I have decided to use the Insert query builder (com.datastax.driver.core.querybuilder.Insert) which allows me to dynamically insert as many or as few columns as I need, and doesn’t require multiple prepared statements. Then, I will look at Ali’s suggestion – I will create a small helper method like ‘addToInsertIfNotNull’ and pump all my values into that, which will then filter out the ones that are null. Should keep the code nice and neat – I will feed back if I find any problems with this approach (but please jump in if you have already spotted any :)).

 

Thanks!

Matt
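
A rough sketch of what that could look like with the query builder; the keyspace, table, and column names are placeholders, and the helper is shaped after the 'addToInsertIfNotNull' idea Matthew describes:

    import com.datastax.driver.core.Session;
    import com.datastax.driver.core.querybuilder.Insert;
    import com.datastax.driver.core.querybuilder.QueryBuilder;

    // Build the INSERT dynamically and only add columns whose values are non-null,
    // so no tombstones are written for absent values. Names are assumed.
    public class NullSafeInsert {

        /** Adds the column only when the value is non-null. */
        static Insert addToInsertIfNotNull(Insert insert, String column, Object value) {
            return value != null ? insert.value(column, value) : insert;
        }

        public void save(Session session, int id, String email, String phone) {
            Insert insert = QueryBuilder.insertInto("demo", "users")
                    .value("id", id);                    // primary key is always present
            insert = addToInsertIfNotNull(insert, "email", email);
            insert = addToInsertIfNotNull(insert, "phone", phone);
            session.execute(insert);                     // Insert is a regular statement
        }
    }

Note that this builds a fresh, non-prepared statement per insert, which is the trade-off Robert mentions above.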

 


Re: Inserting null values

Eric Stevens
Correct me if I'm wrong, but tombstones are only really problematic if you have them going into clustering keys and then perform a range select on that column, right (assuming it's not a symptom of the antipattern of indefinitely overwriting the same value)? I.e., you're deleting clusters off of a partition. A tombstone isn't any more costly than a normal column, and in some ways it's less costly (it's smaller at rest than, say, inserting an empty string or other default value as someone suggested).

Tombstones stay around a little longer post-compaction than other values, so that's a downside, but they would also drop off the record, as if the value had never been set, on the next compaction after the gc grace period.

Tombstones aren't intrinsically bad, but they can have some bad properties in certain situations.  This doesn't strike me as one of them.  If you have a way to avoid inserting null when you know you aren't occluding an underlying value, that would be ideal.  But because the tombstone would sit adjacent on disk to other values from the same insert, even if you were on platters, the drive head is already positioned over the tombstone location when it's read, because it read the prior value and subsequent value which were written during the same insert.

In the end, inserting a tombstone into a non-clustered column shouldn't be appreciably worse (if it is at all) than inserting a value instead.  Or am I missing something here?


Re: Inserting null values

Jonathan Haddad
Enough tombstones can inflate the size of an SSTable, causing issues during compaction (imagine a multi-TB SSTable with 99% tombstones), even if there's no clustering key defined.

Perhaps an edge case, but worth considering.


Re: Inserting null values

Eric Stevens
But we're talking about a single tombstone on each of a finite (small) set of values, right?  We're not talking about INSERTs which are 99% nulls (at least I don't think that's what Matthew was suggesting).  Unless you're engaging in the antipattern of repeated overwrite, I'm still struggling to see why this is worse than an equivalent number of non-tombstoned writes.  In fact from the description I don't think we're talking about these tombstones even occluding any value at all.

imagine a multi tb sstable w/ 99% tombstones

Let's play with this hypothetical, which doesn't seem like a probable consequence of the original question.  You'd have to have taken enough writes inside gc grace period to have even produced a multi-TB sstable to come anywhere near this, and even then this either exceeds or comes really close to the recommended maximum total data size per node (let alone in a single sstable).  If you did have such an sstable, it doesn't seem very likely to compact again inside gc grace period short of manually triggered major compaction.  

But let's assume you do that: you run cassandra-stress inserting nothing but tombstones and kick off major compaction periodically. If it compacted inside the gc grace period, is this worse for compaction than the same number of non-tombstoned values (i.e. a multi-TB sstable is costly to compact no matter what the contents)? If it compacted outside the gc grace period, then 99% of the work is just dropping tombstones; it seems like it would run really fast (for being an absurdly large sstable), as there would be just 1% of the contents to actually copy over to the new sstable.

I'm still not clear on what I'm missing.  Is a tombstone more expensive to compact than a non-tombstone?


Re: Inserting null values

Philip Thompson
In a way, yes. A tombstone will only be removed after gc_grace iff the compaction is sure that it includes all rows which that tombstone might shadow. When two non-tombstone conflicting rows are compacted, it's always just last-write-wins (LWW).


Re: Inserting null values

Robert Coli-3
In reply to this post by Eric Stevens
On Wed, Apr 29, 2015 at 9:16 AM, Eric Stevens <[hidden email]> wrote:
In the end, inserting a tombstone into a non-clustered column shouldn't be appreciably worse (if it is at all) than inserting a value instead.  Or am I missing something here?

There's thresholds (log messages, etc.) which operate on tombstone counts over a certain number, but not on column counts over the same number.

Given that tombstones are often smaller than data columns, that's sort of hard to understand conceptually?

=Rob


Re: Inserting null values

Eric Stevens
I agree that inserting null is not as good as not inserting that column at all when you have confidence that you are not shadowing any underlying data. But pragmatically speaking it really doesn't sound like a small number of incidental nulls/tombstones (< 20% of columns, otherwise CASSANDRA-3442 takes over) is going to have any performance impact either in your query patterns or in compaction in any practical sense.

If INSERT of null values is problematic for small portions of your data, then it stands to reason that an INSERT option containing an instruction to prevent tombstone creation would be an important performance optimization (and would also address the fact that non-null collections generate tombstones on INSERT). INSERT INTO ... USING no_tombstones;


There's thresholds (log messages, etc.) which operate on tombstone counts over a certain number, but not on column counts over the same number.

tombstone_warn_threshold and tombstone_failure_threshold only apply to clustering scans, right? I.e., tombstones don't count against those thresholds if they are not part of the clustering key column being considered for the non-EQ relation? The documentation certainly implies so:

tombstone_warn_threshold
(Default: 1000) The maximum number of tombstones a query can scan before warning.
tombstone_failure_threshold
(Default: 100000) The maximum number of tombstones a query can scan before aborting.


RE: Inserting null values

Peer, Oded

I’ve added an option to trunk that prevents tombstone creation when using PreparedStatements; see CASSANDRA-7304.

 

The problem is having tombstones in regular columns.

When you perform a read request (range query or by PK):

- Cassandra iterates over all the cells (all, not only the cells specified in the query) in the relevant rows while counting tombstone cells (https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/db/filter/SliceQueryFilter.java#L199)

- creates a ColumnFamily object instance with the rows

- filters the selected columns from the internal CF (https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/cql3/statements/SelectStatement.java#L653)

- returns the result

 

If you have many unnecessary tombstones, you read many unnecessary cells.
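
For completeness, a sketch of how the CASSANDRA-7304 behaviour looks from the client side. It assumes a newer setup than the one discussed above: native protocol v4 (Cassandra 2.2+) and a Java driver version that supports unset bind values. With those in place, a bound variable that is simply never set writes neither a value nor a tombstone, unlike binding an explicit null:

    import com.datastax.driver.core.BoundStatement;
    import com.datastax.driver.core.PreparedStatement;
    import com.datastax.driver.core.Session;

    // Sketch only: assumes protocol v4 and unset-bind-value support (CASSANDRA-7304);
    // the table and column names are hypothetical.
    public class UnsetExample {
        public void save(Session session, int id, String email, String phone) {
            PreparedStatement ps =
                    session.prepare("INSERT INTO users (id, email, phone) VALUES (?, ?, ?)");
            BoundStatement bound = ps.bind();
            bound.setInt("id", id);
            bound.setString("email", email);
            if (phone != null) {
                bound.setString("phone", phone);   // set only when there is a value
            }
            // 'phone' left unset: the column is not written at all, so no tombstone.
            session.execute(bound);
        }
    }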

 

 

 
