I had a nasty streak of OOMs earlier today (several on one node, and a single OOM on one other node). I've downloaded a few of the hprof files for local analysis. In each case, there is a single ReadStage thread with a huge (> 7.5GB) org.apache.cassandra.db.ArrayBackedSortedColumns instance. I'm trying to understand exactly what this means.
1) Does a ReadStage thread process only one query at a time? If so, then a reasonable conclusion (I think) would be that a single query produced a ton of results. If not (if a ReadStage thread can work on multiple queries concurrently), then this volume of data might have been produced by a combination of queries.
2) My driver (gocql) does not appear to enable paging by default. Am I correct in assuming that enabling paging should "solve the problem"? More precisely, it should avoid OOMs caused by fetching a ton of rows, assuming that is the problem and not that I am fetching a small number of very large rows.
3) Is there any way for me (either from the system.log or from the hprof dumps) to tell which query was executing when the process OOMed? If I dig down in the object hierarchy, I see: Thread -> MessageDeliveryTask -> message -> payload, which has the right ksName and cfName. But the "key" property is a byte array. Is there an easy way for me to map this back onto my key (a composite key made up of multiple CQL columns)?
4) Alternatively, is it possible for me to see how many rows had been read for that query so far? That way I can at least validate that the problem was "too many rows" and not "rows are too big".
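For question 3, here is a minimal sketch of how I would try to split the "key" byte array into its components. It assumes the key uses Cassandra's CompositeType layout, where each component is stored as a 2-byte big-endian length, the value bytes, and then one end-of-component byte. I have not confirmed that layout against my dumps, and the function name splitCompositeKey and the sample key are hypothetical.

```go
package main

import (
	"encoding/binary"
	"encoding/hex"
	"fmt"
)

// splitCompositeKey splits a key that is ASSUMED to be in CompositeType
// layout: for each component, a 2-byte big-endian length, the value bytes,
// and one end-of-component byte. It returns the raw bytes of each component.
func splitCompositeKey(key []byte) ([][]byte, error) {
	var parts [][]byte
	for pos := 0; pos < len(key); {
		if len(key)-pos < 2 {
			return nil, fmt.Errorf("truncated length at offset %d", pos)
		}
		n := int(binary.BigEndian.Uint16(key[pos:]))
		pos += 2
		// Need n value bytes plus the trailing end-of-component byte.
		if len(key)-pos < n+1 {
			return nil, fmt.Errorf("truncated component at offset %d", pos)
		}
		parts = append(parts, key[pos:pos+n])
		pos += n + 1 // skip the value and the end-of-component byte
	}
	return parts, nil
}

func main() {
	// Hypothetical key: a text component "abc" followed by a 4-byte int 42.
	key := []byte{0, 3, 'a', 'b', 'c', 0, 0, 4, 0, 0, 0, 42, 0}
	parts, err := splitCompositeKey(key)
	if err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	for i, p := range parts {
		fmt.Printf("component %d: 0x%s (%q)\n", i, hex.EncodeToString(p), p)
	}
}
```

Each component would still need to be interpreted according to its CQL type (for example, text as UTF-8, int as a 4-byte big-endian integer). If the table has only a single-column partition key, I would expect the byte array to be the raw value with no length prefix, so this sketch would not apply.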