get_key_range (CASSANDRA-169)


Simon Smith
I'm seeing an issue similar to:

http://issues.apache.org/jira/browse/CASSANDRA-169

Here is when I see it.  I'm running Cassandra on 5 nodes using the
OrderPreservingPartitioner and have populated it with 78 records, and
I can use get_key_range via Thrift just fine.  Then, if I manually
kill one of the nodes (say node #5), the node I've been using to call
get_key_range (node #1) times out with the error:

 Thrift: Internal error processing get_key_range

And the Cassandra output shows the same trace as in 169:

ERROR - Encountered IOException on connection:
java.nio.channels.SocketChannel[closed]
java.net.ConnectException: Connection refused
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:592)
        at org.apache.cassandra.net.TcpConnection.connect(TcpConnection.java:349)
        at org.apache.cassandra.net.SelectorManager.doProcess(SelectorManager.java:131)
        at org.apache.cassandra.net.SelectorManager.run(SelectorManager.java:98)
WARN - Closing down connection java.nio.channels.SocketChannel[closed]
ERROR - Internal error processing get_key_range
java.lang.RuntimeException: java.util.concurrent.TimeoutException:
Operation timed out.
        at org.apache.cassandra.service.StorageProxy.getKeyRange(StorageProxy.java:573)
        at org.apache.cassandra.service.CassandraServer.get_key_range(CassandraServer.java:595)
        at org.apache.cassandra.service.Cassandra$Processor$get_key_range.process(Cassandra.java:853)
        at org.apache.cassandra.service.Cassandra$Processor.process(Cassandra.java:606)
        at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:253)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
        at java.lang.Thread.run(Thread.java:675)
Caused by: java.util.concurrent.TimeoutException: Operation timed out.
        at org.apache.cassandra.net.AsyncResult.get(AsyncResult.java:97)
        at org.apache.cassandra.service.StorageProxy.getKeyRange(StorageProxy.java:569)
        ... 7 more



If it gave an error just once, I could simply catch the error and try
again.  But get_key_range calls to the node I was already querying
(node #1) never work again (it is still up and responds fine to
multiget Thrift calls), sometimes not even after I restart the down
node (node #5).  I end up having to restart node #1 in addition to
node #5.  The behavior of the other 3 nodes varies: some of them are
also unable to respond to get_key_range calls, while others keep
responding.
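
(For reference, the catch-and-retry wrapper I have in mind is roughly
the sketch below.  The Callable just stands in for the real Thrift
get_key_range call, so none of these names are the actual generated
client API.)

    import java.util.List;
    import java.util.concurrent.Callable;

    // Rough catch-and-retry sketch only; keyRangeCall would wrap the real
    // Thrift get_key_range call, which is not reproduced here.
    class KeyRangeRetry {
        static List<String> withRetry(Callable<List<String>> keyRangeCall, int attempts)
                throws Exception {
            Exception last = null;
            for (int i = 0; i < attempts; i++) {
                try {
                    return keyRangeCall.call();   // e.g. () -> client.get_key_range(...)
                } catch (Exception e) {           // Thrift "Internal error" / timeout
                    last = e;
                    Thread.sleep(1000);           // back off briefly before retrying
                }
            }
            throw last;                           // give up after the final attempt
        }
    }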

My question is: what path should I go down to reproduce this problem?
I'm using Aug 27 trunk code - should I update my Cassandra install
before gathering more information for this issue, and if so, to which
version (0.4 or trunk)?  If anyone is familiar with this issue, could
you let me know what I might be doing wrong, or what my next
info-gathering step should be?

Thank you,

Simon Smith
Arcode Corporation

Re: get_key_range (CASSANDRA-169)

Jonathan Ellis
Getting temporary errors when a node goes down, until the other nodes'
failure detectors realize it's down, is normal.  (This should only
take a dozen seconds or so.)

But after that it should route requests to other nodes, and it should
also realize when you restart #5 that it is alive again.  Those are
two separate issues.

Can you verify that "bin/nodeprobe cluster" shows that node 1
eventually does/does not see #5 as dead, and then alive again?

-Jonathan


Re: get_key_range (CASSANDRA-169)

Simon Smith
The error starts as soon as node #5 goes down and lasts until I
restart it.

bin/nodeprobe cluster is accurate (it knows quickly when #5 is down,
and when it is up again).

Since I set the replication factor to 3, I'm confused as to why
(after the first few seconds or so) there is still an error just
because one host is down temporarily.

My test setup is a script running on each of the nodes that calls
get_key_range over and over against "localhost".  Depending on which
node I take down, the behavior varies: if I take down one particular
host, it is the only one giving errors (the other 4 nodes still work).
In the other 4 cases, either 2 or 3 nodes continue to work (i.e. the
downed node plus one or two other nodes are the ones giving errors).
Note: the nodes that keep working never fail at all, not even for a
few seconds.

I am running this on 4GB "cloud server" boxes at Rackspace.  I can set
up just about any test needed to help debug this, capture output or
logs, and give a Cassandra developer access if it would help.  Of
course I can include whatever config files or log files would be
helpful; I just don't want to spam the list unless it is relevant.

Thanks again,

Simon



Re: get_key_range (CASSANDRA-169)

Jonathan Ellis
Okay, so when #5 comes back up, #1 eventually stops erroring out and
you don't have to restart #1?  That is good; the alternative would
have been a bigger problem. :)

If you are comfortable using a Java debugger (by default Cassandra
listens for one on port 8888), you can look at what is going on inside
StorageProxy.getKeyRange on node #1 at the call to

        EndPoint endPoint = StorageService.instance().findSuitableEndPoint(command.startWith);

findSuitableEndPoint is supposed to pick a live node, not a dead one. :)

If not, I can write a patch to log extra information for this bug so
we can track it down.
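
(Roughly, the intent of findSuitableEndPoint is something like the
sketch below.  These are stand-in types for illustration, not the
actual 0.4 classes, but the idea is the same: walk the candidate
replicas in ring order and return the first one the failure detector
still considers alive.)

    import java.util.List;

    // Illustration of the intent of findSuitableEndPoint, using stand-in
    // types rather than the real Cassandra 0.4 classes.
    class EndpointSelection {
        interface FailureDetector { boolean isAlive(String endpoint); }

        static String findSuitableEndpoint(List<String> replicasInRingOrder,
                                           FailureDetector fd) {
            for (String replica : replicasInRingOrder) {
                if (fd.isAlive(replica)) {   // skip nodes currently marked dead
                    return replica;
                }
            }
            throw new RuntimeException("no live replicas");
        }
    }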

-Jonathan


Re: get_key_range (CASSANDRA-169)

Simon Smith
I think it would take me quite a bit of effort to figure out how to
use a Java debugger - it will be a lot quicker if you can give me a
patch, and then I can certainly rebuild with ant against either latest
trunk or latest 0.4 and re-run my test.

Thanks,

Simon


Re: get_key_range (CASSANDRA-169)

Jonathan Ellis
I think I see the problem.

Can you check whether your range query is spanning multiple nodes in
the cluster?  You can tell by setting the log level to DEBUG and
checking whether, after it logs get_key_range, it says "reading
RangeCommand(...) from ... @machine" more than once.

The bug is that when picking the node on which to start the range
query, it consults the failure detector to avoid dead nodes, but if
the query spans multiple nodes it does not do that for the subsequent
nodes.
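
(In other words, every hop of a multi-node range query needs the same
liveness filter as the first hop.  A self-contained sketch of that
idea, again with made-up stand-in types rather than the real
StorageProxy/ring code:)

    import java.util.List;

    // Sketch of the fix idea only, with stand-in types: when a range query
    // continues past the first endpoint, the next endpoint chosen from ring
    // order should also be checked against the failure detector.
    class NextRangeHop {
        interface FailureDetector { boolean isAlive(String endpoint); }

        static String nextLiveEndpoint(List<String> ringOrder, int currentIndex,
                                       FailureDetector fd) {
            for (int i = currentIndex + 1; i < ringOrder.size(); i++) {
                if (fd.isAlive(ringOrder.get(i))) {   // same check as the first hop
                    return ringOrder.get(i);
                }
            }
            throw new RuntimeException("no live endpoint left for this range query");
        }
    }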

But if you are only generating one RangeCommand per get_key_range then
we have two bugs. :)

-Jonathan


Re: get_key_range (CASSANDRA-169)

Simon Smith
I sent get_key_range to node #1 (174.143.182.178), and here are the
resulting log lines from 174.143.182.178's log.  (Do you want the
other nodes' log lines?  Let me know if so.)

DEBUG - get_key_range
DEBUG - reading RangeCommand(table='users', columnFamily=pwhash,
startWith='', stopAt='', maxResults=100) from 648@174.143.182.178:7000
DEBUG - collecting :false:32@1252535119
 [ ... chop the repeated & identical collecting messages ... ]
DEBUG - collecting :false:32@1252535119
DEBUG - Sending RangeReply(keys=[java, java1, java2, java3, java4,
java5, match, match1, match2, match3, match4, match5, newegg, newegg1,
newegg2, newegg3, newegg4, newegg5, now, now1, now2, now3, now4, now5,
sgs, sgs1, sgs2, sgs3, sgs4, sgs5, test, test1, test2, test3, test4,
test5, xmind, xmind1, xmind2, xmind3, xmind4, xmind5],
completed=false) to 648@174.143.182.178:7000
DEBUG - Processing response on an async result from 648@174.143.182.178:7000
DEBUG - reading RangeCommand(table='users', columnFamily=pwhash,
startWith='', stopAt='', maxResults=58) from 649@174.143.182.182:7000
DEBUG - Processing response on an async result from 649@174.143.182.182:7000
DEBUG - reading RangeCommand(table='users', columnFamily=pwhash,
startWith='', stopAt='', maxResults=58) from 650@174.143.182.179:7000
DEBUG - Processing response on an async result from 650@174.143.182.179:7000
DEBUG - reading RangeCommand(table='users', columnFamily=pwhash,
startWith='', stopAt='', maxResults=22) from 651@174.143.182.185:7000
DEBUG - Processing response on an async result from 651@174.143.182.185:7000
DEBUG - Disseminating load info ...


Thanks,

Simon


Re: get_key_range (CASSANDRA-169)

Jonathan Ellis
That confirms what I suspected, thanks.

Can you file a ticket on Jira and I'll work on a fix for you to test?

thanks,

-Jonathan


Re: get_key_range (CASSANDRA-169)

Simon Smith
https://issues.apache.org/jira/browse/CASSANDRA-440

Thanks again.  Of course I'm happy to provide any additional
information, and I will gladly test the fix.

Simon



Re: get_key_range (CASSANDRA-169)

Simon Smith
Jonathan:

I tried out the patch you attached to CASSANDRA-440, applied it to
0.4, and it works for me.  Now, as soon as I take the node down, there
may be one or two seconds of the Thrift internal error (timeout), but
as soon as the host doing the querying sees that the node is down, the
error stops and the get_key_range query returns valid output again.
And there isn't any disruption when the node comes back up.

Thanks!  (I put this same note in the bug report.)

Simon Smith





Re: get_key_range (CASSANDRA-169)

Jonathan Ellis
Great, thanks for testing it.  I'll commit soon.

-Jonathan
