Recently, we discovered that millions of mutations were getting dropped on our cluster. Eventually, we solved this problem by increasing memtable_flush_writers from 1 to 3. We usually write 3 CFs simultaneously, and one of them has 4 secondary indexes.
New changes also include:
concurrent_compactors: 12 (earlier it was the default)
compaction_throughput_mb_per_sec: 32 (earlier it was the default)
in_memory_compaction_limit_in_mb: 400 (earlier it was the default, 64)
memtable_flush_writers: 3 (earlier 1)
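For reference, the list above corresponds to this cassandra.yaml fragment (values from this thread; the default annotations are my assumptions based on the 2.0.x documentation):

```yaml
concurrent_compactors: 12              # previously unset (default)
compaction_throughput_mb_per_sec: 32   # 2.0.x default is 16
in_memory_compaction_limit_in_mb: 400  # default is 64
memtable_flush_writers: 3              # previously 1
```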
After making the above changes, our write-heavy workload scenarios started throwing "promotion failed" exceptions in the GC logs.
We made the following JVM tuning and Cassandra config changes to address this:
MAX_HEAP_SIZE="12G" (increased heap from 8G to reduce fragmentation)
JVM_OPTS="$JVM_OPTS -XX:SurvivorRatio=2" (we observed that even at SurvivorRatio=4 our survivor space was getting 100% utilized under heavy write load, so we suspected minor collections were promoting objects directly to the tenured generation)
JVM_OPTS="$JVM_OPTS -XX:MaxTenuringThreshold=20" (lots of objects were moving from Eden to tenured on each minor collection; possibly medium-lived objects related to memtables and compactions, as suggested by a heap dump)
JVM_OPTS="$JVM_OPTS -XX:CMSWaitDuration=2000" (though this is the default value)
JVM_OPTS="$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=70" (reduced to avoid concurrent mode failures)
memtable_total_space_in_mb: 1000 (to make memtable flushes more frequent; the default is 1/4 of the heap, which creates more long-lived objects)
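As a sanity check on that last change: with the 12G heap above, the default memtable threshold (1/4 of heap) works out as follows (a quick back-of-the-envelope calculation, not Cassandra code):

```python
heap_mb = 12 * 1024                   # MAX_HEAP_SIZE="12G"
default_memtable_mb = heap_mb // 4    # default: 1/4 of the heap
configured_mb = 1000                  # explicit setting above
print(default_memtable_mb, configured_mb)  # 3072 1000
```

So the explicit setting cuts the memtable budget to roughly a third of what the default would be on this heap.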
1. Why did increasing memtable_flush_writers cause promotion failures in the JVM? Does more memtable_flush_writers mean more memtables in memory?
2. Objects are still getting promoted to the tenured space at a high rate. CMS runs on the old generation every 4-5 minutes under heavy write load, and around 750 minor collections of up to 300 ms each occurred in 45 minutes. Do you see any problems with the new JVM tuning and Cassandra config? Does the justification given for those changes sound logical? Any suggestions?
3. What are the best practices for reducing heap fragmentation and promotion failures when allocation and promotion rates are high?
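To put the numbers from question 2 in perspective, here is a rough upper bound on time spent in minor GC, assuming the worst case that every one of the ~750 collections took the full 300 ms (the thread only says "up to"):

```python
collections = 750
pause_s = 0.300                  # up to 300 ms each
window_s = 45 * 60               # 45-minute window
worst_case_gc_s = collections * pause_s
overhead = worst_case_gc_s / window_s
print(f"{worst_case_gc_s:.0f}s of {window_s}s = {overhead:.1%}")  # 225s of 2700s = 8.3%
```

Even in this pessimistic reading, minor GC alone stays under ~10% of wall time, so the promotion rate, rather than the minor pauses themselves, looks like the more worrying signal.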
This is an intricate matter; I cannot tell for sure the good parameters from the wrong ones, since too many things changed at once.
However, there are many things to consider.
With heavy writes, it is NOT recommended to use
In our 2.0.14 cluster we experienced node unavailability due to long Full GC pauses. We discovered bogus legacy data: a single outlier was so wrong that it updated the same CQL rows hundreds of thousands of times with duplicate data. Given that the tables we were writing to were configured to use LCS, this resulted in keeping
On Tue, Apr 21, 2015 at 5:12 PM, Anuj Wadehra <[hidden email]> wrote:
Hi, I cannot really answer your question with rock-solid certainty.
When we had problems, we did mainly two things
I don’t know precisely how to answer, but :
Touching JVM heap parameters can be hazardous: increasing the heap may seem like a nice thing, but in the worst case it can increase GC time.
Also increasing the
So I would rather work on the real cause than on GC. One thing caught my attention:
Could it be that the model is based on wide rows? That could be a problem for several reasons, not limited to compaction. If so, I'd advise revising the data model.
On Tue, Apr 21, 2015 at 7:53 PM, Anuj Wadehra <[hidden email]> wrote:
Another reason for memtables to be kept in memory is wide rows. Maybe someone can chime in and confirm or deny, but I believe wide rows (in the Thrift sense) need to be synced entirely across nodes. So, from the numbers you gave, a node can send ~100 MB over the network for a single row. Combined with compaction and other activity, this may be an issue, as these objects can stay in the heap long enough to survive a collection.
Think about the row cache too: with wide rows, Cassandra will hold the tables a bit longer to serialize the data into the off-heap row cache (in 2.0.x; not sure about other versions). See this page: http://docs.datastax.com/en/cassandra/2.0/cassandra/operations/ops_configuring_caches_c.html
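If the row cache does turn out to be a factor, the relevant 2.0.x cassandra.yaml knobs look like this (a sketch; row_cache_size_in_mb defaults to 0, i.e. disabled, so this only matters if it was explicitly enabled):

```yaml
row_cache_size_in_mb: 0      # 0 disables the row cache (the default)
row_cache_save_period: 0     # seconds between saving the cache to disk; 0 disables saving
```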
On Wed, Apr 22, 2015 at 2:47 PM, Anuj Wadehra <[hidden email]> wrote: