Help understanding aftermath of death by GC

Help understanding aftermath of death by GC

Robert Wille-2
I moved my site over to Cassandra a few months ago, and everything has been just peachy until a few hours ago (yes, it would be in the middle of the night) when my entire cluster suffered death by GC. By death by GC, I mean this:

[rwille@cas031 cassandra]$ grep GC system.log | head -5
 INFO [ScheduledTasks:1] 2015-03-31 02:49:57,480 GCInspector.java (line 116) GC for ConcurrentMarkSweep: 30219 ms for 1 collections, 7664429440 used; max is 8329887744
 INFO [ScheduledTasks:1] 2015-03-31 02:50:32,180 GCInspector.java (line 116) GC for ConcurrentMarkSweep: 30673 ms for 1 collections, 7707488712 used; max is 8329887744
 INFO [ScheduledTasks:1] 2015-03-31 02:51:05,108 GCInspector.java (line 116) GC for ConcurrentMarkSweep: 30453 ms for 1 collections, 7693634672 used; max is 8329887744
 INFO [ScheduledTasks:1] 2015-03-31 02:51:38,787 GCInspector.java (line 116) GC for ConcurrentMarkSweep: 30691 ms for 1 collections, 7686028472 used; max is 8329887744
 INFO [ScheduledTasks:1] 2015-03-31 02:52:12,452 GCInspector.java (line 116) GC for ConcurrentMarkSweep: 30346 ms for 1 collections, 7701401200 used; max is 8329887744

I’m pretty sure I know what triggered it. When I first started developing against Cassandra, I found the IN clause to be supremely useful, and I used it a lot. Later I figured out it was a bad thing, repented, and fixed my code, but I missed one spot. A maintenance task spent a couple of hours repeatedly issuing queries with IN clauses containing 1000 items each, and the whole system went belly up.
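
For the record, here's roughly what my fix looks like: fan the big IN out into one small query per key, issued asynchronously, so no single coordinator has to materialize 1000 partitions on its heap at once. This is just a sketch against the DataStax Java driver, and the keyspace, table, and column names are made up:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.ResultSetFuture;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

import java.util.ArrayList;
import java.util.List;

public class FanOutInsteadOfIn {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("cas031").build();
        Session session = cluster.connect("myks");  // hypothetical keyspace

        // before: SELECT * FROM items WHERE id IN (?, ?, ... 1000 of them)
        // after: one tiny query per key, issued asynchronously
        PreparedStatement ps = session.prepare("SELECT * FROM items WHERE id = ?");
        List<ResultSetFuture> futures = new ArrayList<ResultSetFuture>();
        for (long id = 0; id < 1000; id++) {  // stand-in for the real key list
            futures.add(session.executeAsync(ps.bind(id)));
        }
        // in production you'd throttle how many futures are in flight at once
        for (ResultSetFuture f : futures) {
            for (Row row : f.getUninterruptibly()) {
                // ... process row ...
            }
        }
        cluster.close();
    }
}

Each query hits only the replicas that own its key, and the driver pipelines them, so throughput stays reasonable.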

I get that my bad queries caused Cassandra to require more heap than was available, but here’s what I don’t understand. When the crap hit the fan, the maintenance task died due to a timeout error, but the cluster never recovered. I would have expected that once I was no longer issuing the bad queries, the heap would get cleaned up and life would return to normal. Can anybody help me understand why Cassandra wouldn’t recover? How is it that GC pressure can cause heap to become permanently uncollectable?

This makes me pretty worried. I can fix my code, but I don’t really have control over spikes. If memory pressure spikes, I can tolerate some timeouts and errors, but if the cluster can’t come back when the pressure is gone, that seems pretty bad.

Any insights would be greatly appreciated.

Robert

Re: Help understanding aftermath of death by GC

Jason Wee
Hey Robert, you might want to start by looking into Cassandra's statistics, either exposed via nodetool or, if you have a monitoring system, by watching the important metrics. I read this article a moment ago and hope it helps you begin to understand where and how to determine the root cause: http://aryanet.com/blog/cassandra-garbage-collector-tuning
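
For example, nodetool tpstats will show dropped and blocked stages, and the same GC counters from your log are reachable over JMX. Here is a quick polling sketch (Cassandra's default JMX port is 7199; the host name is taken from your log, and the class name is just illustrative):

import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

import javax.management.MBeanServerConnection;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class GcPoller {
    public static void main(String[] args) throws Exception {
        // Cassandra exposes JMX on port 7199 by default
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://cas031:7199/jmxrmi");
        JMXConnector jmxc = JMXConnectorFactory.connect(url);
        try {
            MBeanServerConnection mbsc = jmxc.getMBeanServerConnection();
            // same counters GCInspector logs: collection count and total time
            for (GarbageCollectorMXBean gc : ManagementFactory.getPlatformMXBeans(
                    mbsc, GarbageCollectorMXBean.class)) {
                System.out.printf("%s: %d collections, %d ms total%n",
                        gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
            }
        } finally {
            jmxc.close();
        }
    }
}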

jason

On Tue, Mar 31, 2015 at 8:22 PM, Robert Wille <[hidden email]> wrote:

> [...]
> Can anybody help me understand why Cassandra wouldn’t recover? How is it that GC pressure can cause heap to become permanently uncollectable?
Re: Help understanding aftermath of death by GC

Jens Rantil
In reply to this post by Robert Wille-2
Hi Robert,

On Tue, Mar 31, 2015 at 2:22 PM, Robert Wille <[hidden email]> wrote:
> Can anybody help me understand why Cassandra wouldn’t recover?

One issue when you are running a JVM and start running out of memory is that the JVM can start throwing `OutOfMemoryError` in any thread - not necessarily in the thread that is consuming all the memory. I've seen this happen multiple times. If this happened to you, a critical Cassandra thread could have died and brought the whole Cassandra DB down with it.
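
A common defensive pattern - just a sketch, I'm not claiming this is what your Cassandra version ships with - is to fail fast when any thread surfaces an OutOfMemoryError, rather than limping along with dead internal threads:

public final class OomFailFast {
    // Call OomFailFast.install() once at startup.
    public static void install() {
        // An OOME can surface in *any* thread; if it kills a critical
        // internal thread, the process is half-dead but still "up".
        Thread.setDefaultUncaughtExceptionHandler(new Thread.UncaughtExceptionHandler() {
            @Override
            public void uncaughtException(Thread t, Throwable e) {
                if (e instanceof OutOfMemoryError) {
                    // heap state is suspect; skip shutdown hooks and exit now
                    Runtime.getRuntime().halt(1);
                }
            }
        });
    }
}

Running the JVM with -XX:OnOutOfMemoryError="kill -9 %p" gets you the same fail-fast behavior from the outside.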

Just an idea - cheers,
Jens

--
Jens Rantil
Backend engineer
Tink AB

Phone: +46 708 84 18 32

Re: Help understanding aftermath of death by GC

Robert Coli-3

On Tue, Mar 31, 2015 at 9:12 AM, Jens Rantil <[hidden email]> wrote:
> One issue when you are running a JVM and start running out of memory is that the JVM can start throwing `OutOfMemoryError` in any thread - not necessarily in the thread that is consuming all the memory. I've seen this happen multiple times. If this happened to you, a critical Cassandra thread could have died and brought the whole Cassandra DB down with it.

Jens is correct that the JVM has few options as to what to do when it runs out of heap:


Expands a bit on Cassandra-specific behavior here.

But basically, once you've OOMed (any generation of) the heap, you almost certainly want to stop or restart the JVM, even if it hasn't crashed itself.
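
If you want to detect that wedged state programmatically, the JDK can notify you when the old generation is still nearly full *after* a collection - which is exactly what the log at the top of this thread shows. A sketch (runs inside the JVM, e.g. from an agent; the 90% threshold and class name are arbitrary):

import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;

public class FullHeapDetector {
    public static void main(String[] args) throws InterruptedException {
        MemoryPoolMXBean oldGen = null;
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            // under ConcurrentMarkSweep the old generation is "CMS Old Gen"
            if (pool.getName().contains("Old Gen")
                    && pool.isCollectionUsageThresholdSupported()) {
                oldGen = pool;
                // fire once >90% of the pool is still live *after* a GC
                pool.setCollectionUsageThreshold((long) (pool.getUsage().getMax() * 0.9));
            }
        }
        if (oldGen == null) return;
        while (true) {
            if (oldGen.isCollectionUsageThresholdExceeded()) {
                MemoryUsage u = oldGen.getCollectionUsage();
                System.err.printf("%s: %d of %d bytes still used after GC - restart candidate%n",
                        oldGen.getName(), u.getUsed(), u.getMax());
            }
            Thread.sleep(10000);
        }
    }
}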

=Rob