Adding New Node Issue


Adding New Node Issue

Thomas Miller

Hello,

 

Yesterday we ran into a serious issue while joining a new node to our existing 4-node Cassandra cluster (version 2.0.7). The average data size per node is 152 GB, with a replication factor of 3. The node was prepped as described in the following document: http://docs.datastax.com/en/cassandra/2.0/cassandra/operations/ops_add_node_to_cluster_t.html.

 

When I started the new node, Opscenter showed the node as “Active – Joining” but we immediately began getting timeouts on our websites because lookups were taking too long. On the 4 existing nodes the network interface showed about 200Mbps being used, the CPU never went over 20% and the memory usage barely changed.

 

The question I have is: does adding a new node cause some sort of throttling that would keep our web servers from functioning normally? The only thing we can think of that might have had some effect is that a repair was just finishing on one of the nodes when the new node was added. The repair finished while the new node was in the joining state, but the timeouts did not go away afterwards.

 

Our impatience got the better of us so we ended up stopping the Cassandra service on the new node because it appeared, at the time, to have stalled out in the joining state and nothing more was being streamed to it. But even stopping it did not allow the cluster to resume its normal operation and we were still getting timeouts. We tried rebooting our web servers and then our 4 existing Cassandra servers but none of it worked.

 

We never saw any errors/exceptions in the Cassandra and system logs at all. It completely mystified us why there would be no errors/exceptions unless this was working as intended.

 

We ended up getting it working by adding the new node again and just letting it go until it finally finished joining, and everything magically started working again. Towards the end it was barely streaming anything: Opscenter was not showing any running streams, and when we checked the size of the data directory we saw it growing and shrinking ever so slightly.

 

We have to add one more new node and then decommission two of the existing nodes so we can perform some hardware maintenance on the server those two existing nodes are on, but we are hesitant to try this again without scheduling a maintenance window for this node add and decommissioning process.

 

So, to reiterate what I am asking: does adding a node cause the cluster to become unusable or time out? Also, can we expect decommissioning the other two nodes to cause the same type of downtime, since they have to stream their content out to the other nodes in the cluster?

 

Thanks,

Thomas Miller
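
For a sense of scale, the figures in this post allow a back-of-envelope estimate of how much data a fifth node would need to receive and how long that would take at the network rate observed. This is only a sketch: it assumes data ends up evenly balanced across nodes, uses decimal units, and assumes streaming runs at the full rate the whole time, none of which the thread confirms.

```python
# Rough bootstrap estimate from the numbers in the post (4 nodes, 152 GB each).
# Assumptions, not facts from the thread: even data balance after the join,
# decimal units (1 GB = 8000 megabits), and a constant streaming rate.

def per_node_load_after_join(nodes_before: int, avg_load_gb: float) -> float:
    """Average on-disk load per node once one more node has joined."""
    total_gb = nodes_before * avg_load_gb  # per-node load already includes replicas
    return total_gb / (nodes_before + 1)


def stream_minutes(data_gb: float, rate_mbps: float) -> float:
    """Minutes needed to move data_gb at an aggregate rate in megabits/s."""
    megabits = data_gb * 8 * 1000
    return megabits / rate_mbps / 60


new_node_gb = per_node_load_after_join(nodes_before=4, avg_load_gb=152)
one_source = stream_minutes(new_node_gb, rate_mbps=200)
two_sources = stream_minutes(new_node_gb, rate_mbps=400)

print(f"new node should receive about {new_node_gb:.1f} GB")
print(f"at 200 Mbps in total: about {one_source:.0f} minutes")
print(f"at 400 Mbps in total: about {two_sources:.0f} minutes")
```

Under these assumptions the join is a matter of an hour or so of streaming, so a join that appears stalled for much longer than that is worth investigating rather than waiting out.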


 


Re: Adding New Node Issue

Jeff Ferland
Sounds to me like your stream throughput value is too high. `nodetool getstreamthroughput` and `nodetool setstreamthroughput` will show and update this value live. Limit it to something lower so that the system isn’t overloaded by streaming. The bottleneck that slows things down is most likely to be disk or network.
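
As a minimal sketch of the suggestion above, the helper below builds the `nodetool` invocations for reading the current cap, lowering it, and reading it back. The function names, the limit of 50, and the example host are illustrative assumptions, not recommendations. The commands only run if `nodetool` is on the PATH; otherwise they are printed.

```python
import shutil
import subprocess


def nodetool_cmd(*args, host=None):
    """Build a nodetool argument list, optionally targeting a remote host."""
    cmd = ["nodetool"]
    if host is not None:
        cmd += ["-h", host]
    return cmd + [str(a) for a in args]


def throttle_plan(limit, host=None):
    """Commands to read the current cap, set a new one, then read it back."""
    return [
        nodetool_cmd("getstreamthroughput", host=host),
        nodetool_cmd("setstreamthroughput", limit, host=host),
        nodetool_cmd("getstreamthroughput", host=host),
    ]


if __name__ == "__main__":
    # 50 is an illustrative value only; pick a limit suited to your hardware.
    for cmd in throttle_plan(50):
        if shutil.which("nodetool") is None:
            print("would run:", " ".join(cmd))
        else:
            subprocess.run(cmd, check=True)
```

The setting is applied per node, so the plan would need to be run against each node that streams, for example `throttle_plan(50, host="10.0.0.1")` with a hypothetical address.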




RE: Adding New Node Issue

Thomas Miller

Jeff,

 

Thanks for the response. I had come across that as a possible solution previously but there are discrepancies that would lead me to think that that is not the issue.

 

It appears our stream throughput is currently set to 200Mbps, but unless the Cassandra service is subject to that same limit when serving client requests, it does not seem like 200Mbps of bandwidth usage would overwhelm the nodes. The 200Mbps of usage appears on only two of the four nodes while the new node is being added, so it seems like the other two nodes should still be able to handle requests. When my backups run at night they reach around 300Mbps and we see no timeouts at all.

 

Then there is the question of why the timeouts did not stop when we stopped the Cassandra service on the joining node. Opscenter no longer showed that node, and “nodetool status” confirmed that. We were thinking that gossip may have caused the existing nodes to believe a node was still joining even though the new node had been shut down, but that is not confirmed.

 

 

Thanks,

Thomas Miller
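
One way to check the suspicion above on each existing node is to scan that node's own `nodetool status` output for rows that are not Up/Normal. The sketch below parses the two-letter status/state code at the start of each node row. The sample text is made up for illustration (addresses, loads, and host IDs are hypothetical), and `parse_status` is a name invented here.

```python
import re

# Each node row in `nodetool status` starts with a two-letter code:
# first letter U(p)/D(own), second letter N(ormal)/L(eaving)/J(oining)/M(oving).
ROW = re.compile(r"^([UD])([NLJM])\s+(\S+)")
STATES = {"N": "normal", "L": "leaving", "J": "joining", "M": "moving"}


def parse_status(text):
    """Return (address, is_up, state) for every node row in the output."""
    nodes = []
    for line in text.splitlines():
        m = ROW.match(line.strip())
        if m:
            nodes.append((m.group(3), m.group(1) == "U", STATES[m.group(2)]))
    return nodes


# Made-up sample output; every value here is hypothetical.
SAMPLE = """\
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address   Load      Tokens  Owns   Host ID  Rack
UN  10.0.0.1  152.3 GB  256     25.1%  aaaa     rack1
UN  10.0.0.2  151.8 GB  256     24.9%  bbbb     rack1
DJ  10.0.0.5  12.4 GB   256     ?      cccc     rack1
"""

# Anything that is down or not in the normal state deserves a closer look.
not_normal = [n for n in parse_status(SAMPLE) if n[2] != "normal" or not n[1]]
print(not_normal)
```

Because each node reports its own view of the ring, running this against every existing node (and comparing with `nodetool gossipinfo`) would show whether any of them still believes a join is in progress.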

 

From: Jeff Ferland [mailto:[hidden email]]
Sent: Thursday, April 23, 2015 2:46 PM
To: [hidden email]
Subject: Re: Adding New Node Issue

 

Sounds to me like your stream throughput value is too high. `nodetool getstreamthroughput` and `nodetool setstreamthroughput` will show and update this value live. Limit it to something lower so that the system isn’t overloaded by streaming. The bottleneck that slows things down is most likely to be disk or network.

 


 


Re: Adding New Node Issue

Ali Akhtar
What version are you running?


 



RE: Adding New Node Issue

Thomas Miller

Ali,

 

Our Cassandra version is 2.0.7.

 

Thanks,

Thomas Miller

 

From: Ali Akhtar [mailto:[hidden email]]
Sent: Thursday, April 23, 2015 4:22 PM
To: [hidden email]
Subject: Re: Adding New Node Issue

 

What version are you running?

 


 

 


Re: Adding New Node Issue

Andrei Ivanov
In reply to this post by Ali Akhtar
Thomas, just in case you missed it there is a bug with throughput setting prior to 2.0.13, here is the link:

So, it may happen you are setting it to 1600 megabytes

Andrei
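
The 1600 figure matches 200 multiplied by 8, the number of bits in a byte. The sketch below works through that arithmetic under the assumption that the bug amounts to a megabits/megabytes mix-up. The linked report is not reproduced in this thread, so that reading is a guess. The observed rate of 200 Mbps is the figure reported from the network interfaces.

```python
BITS_PER_BYTE = 8


def megabytes_to_megabits(rate):
    """Convert a rate in megabytes/s to megabits/s."""
    return rate * BITS_PER_BYTE


configured = 200     # throughput value reported in the thread
observed_mbps = 200  # peak seen on the network interfaces, per the thread

cap_if_megabits = configured                          # value read as megabits/s
cap_if_megabytes = megabytes_to_megabits(configured)  # value read as megabytes/s

for label, cap in (("megabits", cap_if_megabits), ("megabytes", cap_if_megabytes)):
    ratio = observed_mbps / cap
    print(f"read as {label}/s: cap = {cap} Mbps, observed/cap = {ratio:.3f}")
```

If the value were being applied as megabytes per second, the observed traffic would sit at one eighth of the permitted rate, which would point to something other than the cap (disk or the receiving node, for instance) as the limiting factor.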


 




RE: Adding New Node Issue

Thomas Miller

Andrei,

 

I did not see that bug report. Thanks for the heads up on that.

 

I am thinking that is still not the issue, though, because if it were I should be seeing more than 200Mbps on that interface. My Zabbix monitoring shows that the two streaming nodes never go over 200Mbps. If this bug were affecting us I should see those interfaces getting hammered, right?

 

Thanks,

Thomas Miller

 

From: Andrei Ivanov [mailto:[hidden email]]
Sent: Thursday, April 23, 2015 4:40 PM
To: [hidden email]
Subject: Re: Adding New Node Issue

 

Thomas, just in case you missed it there is a bug with throughput setting prior to 2.0.13, here is the link:

 

So, it may happen you are setting it to 1600 megabytes

 

Andrei

 


 

 

 


Re: Adding New Node Issue

Andrei Ivanov
Thomas, 

From our experience, C* almost always degrades quite a bit when we bootstrap new nodes. We have no idea why, and we were never able to get any help or hints. And we never reach anywhere close to 200Mbps, though we also see higher CPU usage.

Actually, there is another way of adding nodes, I guess: start the new node without auto bootstrap and initiate a rebuild. But this approach is not completely flawless.

Andrei.
