Issue with removing a node and adding it back

Shiwen Cheng
Hi all,

I ran into an issue after removing a node and adding it back.
Here is how it happened:
(1) We have a four-node cluster running, but there was a hard disk failure on one of the nodes.
Since we needed to replace the hard disk, I used removenode to remove the failed node.
(2) A few days later, after the new hard disk was installed, I reinstalled Cassandra on this node.
I checked the .yaml file and it is the same as on the other three nodes (the only difference is the listen_address). The newly added node is not in the seeds list.
I used the same Cassandra version as the other nodes, which is 2.0.5.
(3) With bootstrap set to true (the default), the new node "seems" able to join the cluster.
But:
 (a) OpsCenter shows this node with "unknown datacenter".
 (b) The status of this node in OpsCenter is shown as "joining".
 (c) One of the nodes started streaming data to the new node. However, after a few hours there was no further streaming, and the data size is not even close to that of the other nodes, so the bootstrap has definitely not finished.
 (d) The node is still shown as "joining" with "unknown datacenter" in OpsCenter, and has been in this state for more than 12 hours.
 (e) nodetool status on the other three machines doesn't show this newly added node.
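
The check in (e) can be scripted so it is easy to repeat on each machine. The following is only a sketch: the sample output, addresses, and host IDs are made up for illustration, and the column layout is assumed to match what nodetool status prints on a 2.0-era node. The function names are hypothetical.

```python
import re

# Illustrative `nodetool status` output; addresses, loads and host IDs are made up.
SAMPLE = """\
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load      Tokens  Owns   Host ID                               Rack
UN  10.0.0.1   1.4 TB    256     33.1%  11111111-1111-1111-1111-111111111111  rack1
UN  10.0.0.2   1.5 TB    256     33.6%  22222222-2222-2222-2222-222222222222  rack1
UJ  10.0.0.4   600.2 GB  256     ?      44444444-4444-4444-4444-444444444444  rack1
"""

# A node line starts with a two-letter code: U/D (up/down), then N/L/J/M (state).
NODE_LINE = re.compile(r"^([UD])([NLJM])\s+(\S+)")


def parse_status(text):
    """Return {address: (status, state)} for every node line in the output."""
    nodes = {}
    for line in text.splitlines():
        m = NODE_LINE.match(line)
        if m:
            status, state, address = m.groups()
            nodes[address] = (status, state)
    return nodes


def missing_or_not_normal(text, expected):
    """List expected addresses that are absent or not in the Up/Normal state."""
    nodes = parse_status(text)
    return [a for a in expected if nodes.get(a) != ("U", "N")]
```

Running missing_or_not_normal against the output from each of the existing nodes, with the full list of expected addresses, would flag both a node that is still joining and one that does not appear at all.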

There are no exceptions in the log from the newly added node.
I reinstalled Cassandra, OpsCenter, and datastax-agent many times, but that did not solve it.
So I am stuck here.

Can anybody help? I would really appreciate it!

Thanks,
Shiwen

Re: Issue with removing a node and adding it back

Robert Coli-3
On Thu, Mar 26, 2015 at 11:31 AM, Shiwen Cheng <[hidden email]> wrote:
I encountered an issue by removing and adding back a node.

You are encountering a failed/hung bootstrap, which probably has nothing to do with the node having been previously removenoded.

Stop the node, wipe all the data on the node, including its system directory, and re-bootstrap.
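
A minimal sketch of the wipe step, for illustration only. The paths are assumed package-install defaults; the real locations are whatever cassandra.yaml sets for data_file_directories, commitlog_directory and saved_caches_directory. The system keyspace normally lives under the data directory, so clearing that directory covers it. The helper name wipe_contents is hypothetical, and it defaults to a dry run.

```python
import shutil
from pathlib import Path

# Assumed package-install defaults; check cassandra.yaml for the real paths.
DEFAULT_DIRS = [
    "/var/lib/cassandra/data",
    "/var/lib/cassandra/commitlog",
    "/var/lib/cassandra/saved_caches",
]


def wipe_contents(directories, dry_run=True):
    """Delete everything inside each directory but keep the directory itself.

    Returns the list of paths that were (or, in a dry run, would be) removed.
    Only call this while the Cassandra process on the node is stopped.
    """
    removed = []
    for d in directories:
        root = Path(d)
        if not root.is_dir():
            continue  # skip paths that do not exist on this machine
        for child in sorted(root.iterdir()):
            removed.append(str(child))
            if dry_run:
                continue
            if child.is_dir() and not child.is_symlink():
                shutil.rmtree(child)
            else:
                child.unlink()
    return removed
```

Calling wipe_contents(DEFAULT_DIRS) first prints nothing and deletes nothing; it only returns what would go. Passing dry_run=False performs the deletion, after which the node can be started again to re-bootstrap.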

=Rob
 

Re: Issue with removing a node and adding it back

Shiwen Cheng
Thanks Robert!
Yes, I tried what you said: I cleaned the data and re-bootstrapped. But it still failed, once after 600GB had been transferred and once after 1.1TB :(

I do see the following exceptions from time to time:
=====================
java.io.IOException: net.jpountz.lz4.LZ4Exception: Error decoding offset 15 of input buffer
        at org.apache.cassandra.io.compress.LZ4Compressor.uncompress(LZ4Compressor.java:89)
        at org.apache.cassandra.streaming.compress.CompressedInputStream.decompress(CompressedInputStream.java:108)
        at org.apache.cassandra.streaming.compress.CompressedInputStream.read(CompressedInputStream.java:86)
        at java.io.InputStream.read(InputStream.java:170)
        at java.io.InputStream.skip(InputStream.java:222)
        at org.apache.cassandra.streaming.StreamReader.drain(StreamReader.java:117)
        at org.apache.cassandra.streaming.compress.CompressedStreamReader.read(CompressedStreamReader.java:89)
        at org.apache.cassandra.streaming.messages.IncomingFileMessage$1.deserialize(IncomingFileMessage.java:47)
        at org.apache.cassandra.streaming.messages.IncomingFileMessage$1.deserialize(IncomingFileMessage.java:37)
        at org.apache.cassandra.streaming.messages.StreamMessage.deserialize(StreamMessage.java:55)
        at org.apache.cassandra.streaming.ConnectionHandler$IncomingMessageHandler.run(ConnectionHandler.java:283)
        at java.lang.Thread.run(Thread.java:744)
=======================

And 
=======================
CassandraDaemon.java (line 479) Exception encountered during startup
java.lang.RuntimeException: Error during boostrap: Stream failed
        at org.apache.cassandra.dht.BootStrapper.bootstrap(BootStrapper.java:86)
        at org.apache.cassandra.service.StorageService.bootstrap(StorageService.java:975)
        at org.apache.cassandra.service.StorageService.joinTokenRing(StorageService.java:736)
        at org.apache.cassandra.service.StorageService.initServer(StorageService.java:583)
        at org.apache.cassandra.service.StorageService.initServer(StorageService.java:482)
        at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:345)
        at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:462)
        at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:552)
Caused by: org.apache.cassandra.streaming.StreamException: Stream failed
        at org.apache.cassandra.streaming.management.StreamEventJMXNotifier.onFailure(StreamEventJMXNotifier.java:85)
        at com.google.common.util.concurrent.Futures$4.run(Futures.java:1160)
        at com.google.common.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:297)
        at com.google.common.util.concurrent.ExecutionList.executeListener(ExecutionList.java:156)
        at com.google.common.util.concurrent.ExecutionList.execute(ExecutionList.java:145)
        at com.google.common.util.concurrent.AbstractFuture.setException(AbstractFuture.java:202)
        at org.apache.cassandra.streaming.StreamResultFuture.maybeComplete(StreamResultFuture.java:211)
        at org.apache.cassandra.streaming.StreamResultFuture.handleSessionComplete(StreamResultFuture.java:186)
        at org.apache.cassandra.streaming.StreamSession.closeSession(StreamSession.java:329)
        at org.apache.cassandra.streaming.StreamSession.convict(StreamSession.java:592)
        at org.apache.cassandra.gms.FailureDetector.interpret(FailureDetector.java:236)
        at org.apache.cassandra.gms.Gossiper.doStatusCheck(Gossiper.java:623)
        at org.apache.cassandra.gms.Gossiper.access$700(Gossiper.java:64)
        at org.apache.cassandra.gms.Gossiper$GossipTask.run(Gossiper.java:170)
        at org.apache.cassandra.concurrent.DebuggableScheduledThreadPoolExecutor$UncomplainingRunnable.run(DebuggableScheduledThreadPoolExecutor.java:75)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
        at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:744)
 INFO [StorageServiceShutdownHook] 2015-03-26 10:29:48,471 Gossiper.java (line 1251) Announcing shutdown
==========================

Is there anything else I could try?
Thanks!

Shiwen



Re: Issue with removing a node and adding it back

Robert Coli-3
On Fri, Mar 27, 2015 at 4:27 PM, Shiwen Cheng <[hidden email]> wrote:
Thanks Robert!
Yes I tried what you said: clean the data and re-bootstrap. But still it failed, once at the point of 600GB transferred and once at 1.1TB :(


1) Figure out what is making your streams die (usually either a flaky network (AWS) or stop-the-world GC) and fix that
OR
2) Try tuning streaming_socket_timeout_in_ms
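
For option 1, one way to look for stop-the-world GC is to scan system.log for long pauses around the time a stream dies. This is a rough sketch: the GCInspector line format is an assumption about what a 2.0-era node logs, the GC sample lines are invented for illustration, and long_gc_pauses is a hypothetical helper.

```python
import re

# Assumed shape of a 2.0-era GCInspector line in system.log.
GC_LINE = re.compile(
    r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d+ GCInspector\.java .*?"
    r"GC for (\w+): (\d+) ms"
)


def long_gc_pauses(lines, threshold_ms=1000):
    """Return (timestamp, collector, pause_ms) for pauses at or above the threshold."""
    hits = []
    for line in lines:
        m = GC_LINE.search(line)
        if m and int(m.group(3)) >= threshold_ms:
            hits.append((m.group(1), m.group(2), int(m.group(3))))
    return hits


# Invented sample lines; only the last one mirrors a line quoted in this thread.
SAMPLE_LOG = [
    " INFO [ScheduledTasks:1] 2015-03-26 10:12:01,120 GCInspector.java (line 116) "
    "GC for ParNew: 240 ms for 1 collections, 4194304 used; max is 8375238656",
    " INFO [ScheduledTasks:1] 2015-03-26 10:29:31,902 GCInspector.java (line 116) "
    "GC for ConcurrentMarkSweep: 15320 ms for 1 collections, 7340032000 used; max is 8375238656",
    " INFO [StorageServiceShutdownHook] 2015-03-26 10:29:48,471 Gossiper.java (line 1251) Announcing shutdown",
]
```

Feeding the real log in with open("system.log") on both the joining node and the node that is streaming to it, then comparing the timestamps of long pauses against the time of each "Stream failed", would show whether GC is a plausible cause. The 1000 ms threshold is an arbitrary starting point.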

=Rob
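
For option 2, the setting lives in cassandra.yaml. A small sketch of changing it with a line-based edit is below; the helper name is hypothetical, the one-hour value is only an example and not a recommendation from this thread, and the node would need a restart to pick up the change.

```python
import re

# Matches an existing top-level entry, whether active or commented out.
_TIMEOUT_LINE = re.compile(
    r"^#?[ \t]*streaming_socket_timeout_in_ms:.*$", re.MULTILINE
)


def set_streaming_timeout(yaml_text, timeout_ms):
    """Return yaml_text with streaming_socket_timeout_in_ms set to timeout_ms.

    Replaces the first existing entry (active or commented out), or appends
    a new line if the key is not present.
    """
    new_line = f"streaming_socket_timeout_in_ms: {timeout_ms}"
    if _TIMEOUT_LINE.search(yaml_text):
        return _TIMEOUT_LINE.sub(new_line, yaml_text, count=1)
    sep = "" if (not yaml_text or yaml_text.endswith("\n")) else "\n"
    return f"{yaml_text}{sep}{new_line}\n"


# Example value only: one hour expressed in milliseconds.
ONE_HOUR_MS = 60 * 60 * 1000
```

For example, set_streaming_timeout(open("cassandra.yaml").read(), ONE_HOUR_MS) returns the edited text, which can then be reviewed before it is written back.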