Cluster status instability

Marcin Pietraszek
Hi!

We have a 56-node cluster running C* 2.0.13 with the CASSANDRA-9036 patch
installed. Assume we have nodes A, B, C, D, E. On some irregular basis
one of those nodes starts to report that a subset of the other nodes is in
DN state, although the C* daemon on all nodes is running:

A$ nodetool status
UN B
DN C
DN D
UN E

B$ nodetool status
UN A
UN C
UN D
UN E

C$ nodetool status
DN A
UN B
UN D
UN E

After restarting node A, C and D report that A is in UN, and A also
claims that the whole cluster is in UN state. Right now I don't have any
clear steps to reproduce the situation. Do you have any idea
what could be causing such behaviour, and how could it be prevented?

It seems that when node A is the coordinator and gets a request for
data replicated on C and D, it responds with an UnavailableException;
after restarting A the problem disappears.
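
For reference, this is roughly how I compare each node's view of the ring; the host names in the loop are placeholders for our real ones:

# Ask every node for its own view of the ring so that one-sided DN
# reports stand out. Host names below are placeholders.
for host in nodeA nodeB nodeC nodeD nodeE; do
  echo "== view from $host =="
  ssh "$host" nodetool status | grep '^DN' || echo "all peers reported UN"
done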

--
mp

Re: Cluster status instability

Michal Michalski
Hey Marcin,

Are they actually going up and down repeatedly (flapping), or do they go down and never come back?
There might be different reasons for flapping nodes, but to list what's at the top of my head right now:

1. Network issues. I don't think it's your case, but you can read about the issues some people are having when deploying C* on AWS EC2 (keyword to look for: phi_convict_threshold)

2. Heavy load. The node is under heavy load because of a massive number of reads / writes / bulkloads, or e.g. unthrottled compaction, which may result in extensive GC.

Could any of these be a problem in your case? I'd start by investigating the GC logs, e.g. to see how long the "stop the world" full GCs take (GC logs should be on by default from what I can see [1]).
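
Something along these lines should show both the long pauses and where the phi setting lives (a rough sketch; the paths are the packaged defaults, adjust for your install):

# Long stop-the-world pauses are reported by Cassandra's GCInspector.
grep -i 'GCInspector' /var/log/cassandra/system.log | tail -n 20

# phi_convict_threshold lives in cassandra.yaml (commented out by default,
# effective default 8); raising it makes the failure detector less eager
# to mark peers as down.
grep -n 'phi_convict_threshold' /etc/cassandra/cassandra.yaml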


Michał


Kind regards,
Michał Michalski,

Re: Cluster status instability

Jan
Marcin,

Are all your nodes within the same region?
If not in the same region, what is the snitch type that you are using?

Jan/



Re: Cluster status instability

daemeon reiydelle
Do you happen to be using a tool like Nagios or Ganglia that can report utilization (CPU, load, disk I/O, network)? There are plugins for both that will also notify you (depending on whether you enabled intermediate GC logging) about what is happening.
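
If you don't have either in place, a quick manual pass on the affected node gives a similar signal (a rough sketch; assumes the standard sysstat tools are installed):

# Thread pool backlog and dropped messages on the suspect coordinator.
nodetool tpstats

# Any long-running or unthrottled compactions.
nodetool compactionstats

# OS-level CPU, memory and disk pressure, sampled a few times.
vmstat 5 3
iostat -x 5 3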



Re: Cluster status instability

Erik Forsberg
To elaborate a bit on what Marcin said:

* Once a node starts to believe that a few other nodes are down, it seems to stay that way for a very long time (hours). I'm not even sure it will recover without a restart.
* I've tried stopping and then starting gossip with nodetool on the node that thinks several other nodes are down. It did not help.
* nodetool gossipinfo when run on an affected node claims STATUS:NORMAL for all nodes (including the ones marked as down in status output)
* It is quite possible that the problem starts at the time of day when we have a lot of bulkloading going on. But why does it then stay for several hours after the load goes down? (See the log grep after this list.)
* I have the feeling this started with our upgrade from 1.2.18 to 2.0.12 about a month ago, but I have no hard data to back that up.
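
To check the bulkload theory, something like this shows when the affected node last changed its mind about its peers; the timestamps should reveal whether the DOWN markings line up with the bulkload window (log path is the packaged default):

# Gossip state changes are logged as "is now DOWN" / "is now UP".
grep -E 'is now (DOWN|UP)' /var/log/cassandra/system.log | tail -n 40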

Regarding region/snitch - this is not an AWS deployment, we run in our own datacenter with GossipingPropertyFileSnitch.

Right now I have this situation with one node (04-05) thinking that 4 nodes are down. The rest of the cluster (56 nodes in total) thinks all nodes are up. Load on the cluster right now is minimal and there's no GC going on. Heap usage is approximately 3.5 GB out of 6 GB.

root@cssa04-05:~# nodetool status|grep DN
DN  2001:4c28:1:413:0:1:2:5   1.07 TB    256     1.8%   114ff46e-57d0-40dd-87fb-3e4259e96c16  rack2
DN  2001:4c28:1:413:0:1:2:6   1.06 TB    256     1.8%   b161a6f3-b940-4bba-9aa3-cfb0fc1fe759  rack2
DN  2001:4c28:1:413:0:1:2:13  896.82 GB  256     1.6%   4a488366-0db9-4887-b538-4c5048a6d756  rack2
DN  2001:4c28:1:413:0:1:3:7   1.04 TB    256     1.8%   95cf2cdb-d364-4b30-9b91-df4c37f3d670  rack3

Excerpt from nodetool gossipinfo showing one node that status thinks is down (2:5) and one that status thinks is up (3:12):

/2001:4c28:1:413:0:1:2:5
  generation:1427712750
  heartbeat:2310212
  NET_VERSION:7
  RPC_ADDRESS:0.0.0.0
  RELEASE_VERSION:2.0.13
  RACK:rack2
  LOAD:1.172524771195E12
  INTERNAL_IP:2001:4c28:1:413:0:1:2:5
  HOST_ID:114ff46e-57d0-40dd-87fb-3e4259e96c16
  DC:iceland
  SEVERITY:0.0
  STATUS:NORMAL,100493381707736523347375230104768602825
  SCHEMA:4b994277-19a5-3458-b157-f69ef9ad3cda
/2001:4c28:1:413:0:1:3:12
  generation:1427714889
  heartbeat:2305710
  NET_VERSION:7
  RPC_ADDRESS:0.0.0.0
  RELEASE_VERSION:2.0.13
  RACK:rack3
  LOAD:1.047542503234E12
  INTERNAL_IP:2001:4c28:1:413:0:1:3:12
  HOST_ID:bb20ddcb-0a14-4d91-b90d-fb27536d6b00
  DC:iceland
  SEVERITY:0.0
  STATUS:NORMAL,100163259989151698942931348962560111256
  SCHEMA:4b994277-19a5-3458-b157-f69ef9ad3cda

I also tried disablegossip + enablegossip on 02-05 to see if that made 04-05 mark it as up, with no success. 
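
For completeness, the sequence I ran was roughly the following (the 30 second wait is arbitrary; the address is the one from the status output above):

# On 02-05, the node that 04-05 marks as down:
nodetool disablegossip
sleep 30
nodetool enablegossip

# Then on 04-05, check whether its view changed:
nodetool status | grep DN
nodetool gossipinfo | grep -A 2 '2001:4c28:1:413:0:1:2:5'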

Please let me know what other debug information I can provide.

Regards,
\EF
