Node Inconsistency

Node Inconsistency

Vram Kouramajian
We are running a five node cluster in production with a replication
factor of three. The query results from the 5 nodes are returning
different results (2 of the 5 nodes return extra columns for the same
row key).

We are not sure of the root of the problem (possibly a config issue). Any suggestions?

Thanks,
Vram

Re: Node Inconsistency

Tyler Hobbs
What version of Cassandra?  What consistency level are you writing/reading at?

Do you continue to get inconsistent results when you read the same data over and over (i.e. read repair is not fixing something)?

Do all of your nodes show the same thing when you run nodetool ring against them?

- Tyler



Re: Node Inconsistency

Scott McCarty
Hi, Tyler,

I'm working with Vram on this project and can respond to your questions.

We do indeed continue to get inconsistent data after many read operations.  These columns and row keys are months old and have had many reads done on them over that time.  Using cassandra-cli, we see that on 3 of the 5 servers we get 54 columns back for a specific row key, while on the other 2 we get back 68 columns.  We've run the same query on 4 or 5 other row keys: different row keys get different (inconsistent) results, with only one of them returning the same results from all 5 servers.  Some row keys get 4 consistent and 1 different, and some show the 3:2 split above.  From looking at the data, I'm guessing that the results from the 3 nodes are correct and the results from the 2 nodes are old (the diff between the result sets is that the 54 columns are a subset of the 68).

Running "nodetool ring" shows all 5 of them have the same view of the ring.  The configuration for the cluster has not changed in quite a while, and we're running 0.6.8 right now but were running 0.6.5 for quite a while before that.
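For what it's worth, the subset relationship can be checked mechanically. Here is a sketch with invented column names standing in for the real ones:

```python
# Hypothetical reconstruction of the observation above (column names are
# invented): 3 replicas return 54 columns, 2 return 68, for the same row key.
fresh_nodes = {f"col{i}" for i in range(54)}   # the 3 up-to-date replicas
stale_nodes = {f"col{i}" for i in range(68)}   # the 2 replicas with extras

# The smaller set being a strict subset of the larger one points at stale,
# undeleted columns rather than random corruption:
assert fresh_nodes < stale_nodes
extra = stale_nodes - fresh_nodes
print(len(extra))  # 14 columns deleted on 3 nodes but still present on 2
```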

Thanks,
  Scott




Re: Node Inconsistency

Peter Schuller
> above.  From looking at the data I'm guessing that the results from the 3
> nodes are correct and the results from the 2 nodes are old (the diff between
> the result sets is that the 54 is a subset of the 68).

If I interpret the thread correctly, those 2 that you say you believe
are old are the ones that are returning extra results. So, returning
results that should no longer be there. That suggests to me that you
are seeing old values that have since been deleted (unless you're
seeing some kind of arbitrary random data popping up).

Is it possible that you're not running 'nodetool repair' frequently
enough with respect to GCGraceSeconds? The negative results of that
would be that deletions are essentially forgotten if they don't reach
all nodes in time. Have you so far only seen incorrect results in the
form of additional data that should be deleted, or have you also seen
old versions of columns? (That refuse to self-heal, that is.)

Now, GCGraceSeconds/repair issues would not explain, to me at least,
why read-repair is not fixing the discrepancy. I don't think I saw you
explicitly confirm it, so I'll ask: Are you indeed running with read
repair turned on?

If it is turned off, and if all your discrepancies are in the form of
forgotten deletes, then GCGraceSeconds/repair seems like a likely
candidate cause.
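The failure mode described above can be sketched with a toy model (this is an illustration of the concept only, not Cassandra's actual storage code; the names and the shortened grace period are invented):

```python
GC_GRACE_SECONDS = 10  # shortened for illustration; the usual default is 10 days

TOMBSTONE = object()

class Replica:
    """Toy replica: column name -> (timestamp, value or TOMBSTONE)."""
    def __init__(self):
        self.data = {}

    def write(self, col, value, ts):
        self.data[col] = (ts, value)

    def delete(self, col, ts):
        self.data[col] = (ts, TOMBSTONE)

    def compact(self, now):
        # Tombstones older than GC_GRACE_SECONDS are purged entirely; after
        # this, the replica keeps no record that the column ever existed.
        self.data = {c: (ts, v) for c, (ts, v) in self.data.items()
                     if not (v is TOMBSTONE and now - ts > GC_GRACE_SECONDS)}

    def live_columns(self):
        return {c for c, (ts, v) in self.data.items() if v is not TOMBSTONE}

# Three replicas receive a write; only two receive the later delete.
a, b, c = Replica(), Replica(), Replica()
for r in (a, b, c):
    r.write("col", "v1", ts=0)
a.delete("col", ts=1)
b.delete("col", ts=1)   # replica c missed the delete (e.g. it was down)

# GCGraceSeconds passes with no repair; a and b purge their tombstones.
a.compact(now=20)
b.compact(now=20)

# a and b now return nothing for the row, while c still returns the
# "extra" column -- and no tombstone remains anywhere to heal it.
print(a.live_columns(), c.live_columns())  # set() {'col'}
```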

--
/ Peter Schuller

Re: Node Inconsistency

Jonathan Ellis-3
On Mon, Jan 10, 2011 at 3:31 PM, Peter Schuller
<[hidden email]> wrote:
> Now, GCGraceSeconds/repair issues would not explain, to me at least,
> why read-repair is not fixing the discrepancy.

Short version: Once GCGraceSeconds expires, the tombstone is no longer
relevant, as it will not be included in RR (otherwise, you could have
nodes that haven't compacted yet, RR tombstones to other replicas that
had removed it).  Long version: see my last comment on
https://issues.apache.org/jira/browse/CASSANDRA-1316.

--
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com

Re: Node Inconsistency

aaron morton
In reply to this post by Peter Schuller
Just to add, the CLI works at CL ONE. What do you see when you use a higher CL through an API?

A


Re: Node Inconsistency

Peter Schuller
In reply to this post by Jonathan Ellis-3
> Short version: Once GCGraceSeconds expires, the tombstone is no longer
> relevant, as it will not be included in RR (otherwise, you could have
> nodes that haven't compacted yet, RR tombstones to other replicas that
> had removed it).  Long version: see my last comment on
> https://issues.apache.org/jira/browse/CASSANDRA-1316.

Subtle, indeed.

I've attempted to document this here:

http://wiki.apache.org/cassandra/Operations#Frequency_of_nodetool_repair
http://wiki.apache.org/cassandra/Operations#Dealing_with_the_consequences_of_nodetool_repair_not_running_within_GCGraceSeconds

--
/ Peter Schuller

Re: Node Inconsistency

Vram Kouramajian
Thank you all for your assistance. It has been very helpful.

I have a few more questions:

1. If we change the write/delete consistency level to ALL, do we
eliminate the data inconsistency among nodes (since the delete
operations will apply to ALL replicas)?

2. My understanding is that "Read Repair" doesn't handle tombstones.
How about "Node Tool Repair" (do we still see inconsistent data among
nodes after running "Node Tool Repair")?

Thanks,
Vram



Re: Node Inconsistency

Peter Schuller
> I have a few more questions:
>
> 1. If we change the write/delete consistency level to ALL, do we
> eliminate the data inconsistency among nodes (since the delete
> operations will apply to ALL replicas)?
>
> 2. My understanding is that "Read Repair" doesn't handle tombstones.
> How about "Node Tool Repair" (do we still see inconsistent data among
> nodes after running "Node Tool Repair")?

Read repair and nodetool repair handle it under normal circumstances.
The root cause here is that not running nodetool repair within
GCGraceSeconds breaks the underlying design, leading to the type of
inconsistency you got that is not healed by RR or repair.

The most important thing is to, from now on, make sure nodetool repair
is run often enough - either by running it more often or by increasing
GCGraceSeconds - so that deletes are never forgotten to begin with.
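As a rough sanity check of repair cadence against GCGraceSeconds (the numbers below are assumed for illustration; 10 days is the usual default):

```python
# Rule of thumb from the advice above: every node must complete a
# "nodetool repair" at least once within GCGraceSeconds, or tombstones
# can expire before reaching all replicas.
GC_GRACE_SECONDS = 864000                  # the usual default: 10 days
repair_interval_seconds = 7 * 24 * 3600    # e.g. a weekly cron job
repair_duration_seconds = 12 * 3600        # assumed worst-case runtime per node

# Even a slow repair must finish before the grace period elapses:
safe = repair_interval_seconds + repair_duration_seconds < GC_GRACE_SECONDS
print(safe)  # True: a weekly repair fits comfortably inside 10 days
```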

In terms of what to do now that you're in this position, my summary of
my understanding based on the JIRA ticket and DistributedDeletes is
here:
   http://wiki.apache.org/cassandra/Operations#Dealing_with_the_consequences_of_nodetool_repair_not_running_within_GCGraceSeconds

Running at CL.ALL won't help in this case, since the expiry of the
tombstones means the inconsistency won't be reconciled (see the JIRA
ticket). If you're not in this position (i.e., nodetool repair has
been run often enough), using CL.ALL could technically "help" in the
normal case, but that's not the best way to heal inconsistency.
Instead, let Cassandra use normal read repair. Using CL.ALL means, for
one thing, that you cannot survive node failures, since CL.ALL queries
will start failing.

Basically, the tombstone issue is a non-problem as long as you run
nodetool repair often enough with respect to GCGraceSeconds. The
situation right now is a bit special because the constraints of the
cluster were violated (i.e., expired tombstones prior to nodetool
repair having been run).
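The availability trade-off can be illustrated with the usual replica-overlap arithmetic (a sketch, using RF=3 as in this cluster):

```python
RF = 3  # replication factor, as in this cluster

def consistent(write_replicas, read_replicas):
    """Reads see the latest write whenever write and read sets must overlap."""
    return write_replicas + read_replicas > RF

# Writing at ALL (3) and reading at ONE (1) does guarantee overlap...
assert consistent(3, 1)
# ...but so does the more failure-tolerant QUORUM/QUORUM pairing:
assert consistent(2, 2)

def write_available(write_replicas, nodes_down):
    """A write succeeds only if enough replicas are still up."""
    return RF - nodes_down >= write_replicas

# ALL cannot tolerate even one failed replica, while QUORUM can:
print(write_available(3, 1))  # False
print(write_available(2, 1))  # True
```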

I hope that clarifies.

--
/ Peter Schuller

Re: Node Inconsistency

Vram Kouramajian
Thanks, Peter, for the reply. We are currently "fixing" our inconsistent
data (since we have the master data saved).

We will follow your suggestion and run nodetool repair more often in
the future. However, what happens to data inserted/deleted after
nodetool repair runs (i.e., between a repair and the next major
compaction)?

Vram



Re: Node Inconsistency

Peter Schuller
> We will follow your suggestion and run nodetool repair more often in
> the future. However, what happens to data inserted/deleted after
> nodetool repair runs (i.e., between a repair and the next major
> compaction)?

It is handled as you would expect: deletions are propagated across the
cluster just as, e.g., an overwrite would be.

What makes tombstones special is that deletion itself is a special
case. Normal insertions, overwrites or not, are fine because, given
some set of versions of a column, there is never any difficulty
deciding which is the latest. The *lack* of a column, however, is
problematic in a distributed system, so active removals are
represented by these tombstones. If you were willing to store
tombstones forever, they would never be an issue. But typically that
would not make sense, since removed data would keep having a
performance impact on the cluster (and take up some disk space).
Usually, when you remove data, you want it actually *removed*, with no
trace of it left at all. But as soon as you remove the tombstone, you
lose the record that the data was deleted. So unless you *know* there
is no data anywhere in the cluster for a column that is older than the
tombstone indicating its removal, the tombstone is not safe to remove.

The grace period, and the necessity to run nodetool repair within it,
exist for that reason. Periodic nodetool repair is the method by which
you can *know* that there is in fact no data anywhere in the cluster
for a column that is older than the tombstone indicating its removal.
Hence the expiry of the tombstones is safe.

--
/ Peter Schuller