Handle Write Heavy Loads in Cassandra 2.0.3


Handle Write Heavy Loads in Cassandra 2.0.3

Anuj
Hi,
 
Recently, we discovered that millions of mutations were getting dropped on our cluster. Eventually, we solved this problem by increasing the value of memtable_flush_writers from 1 to 3. We usually write 3 CFs simultaneously, and one of them has 4 secondary indexes.
 
New changes also include:
concurrent_compactors: 12 (earlier the default)
compaction_throughput_mb_per_sec: 32 (earlier the default)
in_memory_compaction_limit_in_mb: 400 (earlier the default of 64)
memtable_flush_writers: 3 (earlier 1)
 
After making the above changes, our write-heavy workload scenarios started producing "promotion failed" events in the GC logs.
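A quick way to quantify such events is to scan the GC log for the two CMS failure modes; a minimal sketch (the sample log lines below are made up, and real line formats vary with JVM version and flags):

```python
def count_cms_failures(gc_log_lines):
    """Tally the two CMS failure modes that show up in a HotSpot gc.log:
    'promotion failed'        - the old gen was too full or fragmented to
                                absorb objects surviving a minor GC
    'concurrent mode failure' - CMS lost the race with allocation and fell
                                back to a stop-the-world full GC
    """
    counts = {"promotion failed": 0, "concurrent mode failure": 0}
    for line in gc_log_lines:
        for marker in counts:
            if marker in line:
                counts[marker] += 1
    return counts

# Hypothetical example lines, for illustration only:
sample = [
    '2015-04-20T19:53:01: [ParNew (promotion failed): 786432K->786432K]',
    '2015-04-20T19:53:02: [CMS (concurrent mode failure): 9437184K->812K]',
]
```

Tracking these counts per hour before and after a config change makes it easier to tell whether a tuning step actually helped.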
 
We have done JVM tuning and Cassandra config changes to solve this:
 
MAX_HEAP_SIZE="12G" (increased heap from 8G to reduce fragmentation)
HEAP_NEWSIZE="3G"
 
JVM_OPTS="$JVM_OPTS -XX:SurvivorRatio=2" (We observed that even at SurvivorRatio=4 our survivor space was getting 100% utilized under heavy write load, and we suspected that minor collections were promoting objects directly to the tenured generation)
 
JVM_OPTS="$JVM_OPTS -XX:MaxTenuringThreshold=20" (Lots of objects were moving from Eden to tenured on each minor collection; possibly medium-lived objects tied to memtables and compactions, as suggested by a heap dump)
 
JVM_OPTS="$JVM_OPTS -XX:ConcGCThreads=20"
JVM_OPTS="$JVM_OPTS -XX:+UnlockDiagnosticVMOptions"
JVM_OPTS="$JVM_OPTS -XX:+UseGCTaskAffinity"
JVM_OPTS="$JVM_OPTS -XX:+BindGCTaskThreadsToCPUs"
JVM_OPTS="$JVM_OPTS -XX:ParGCCardsPerStrideChunk=32768"
JVM_OPTS="$JVM_OPTS -XX:+CMSScavengeBeforeRemark"
JVM_OPTS="$JVM_OPTS -XX:CMSMaxAbortablePrecleanTime=30000"
JVM_OPTS="$JVM_OPTS -XX:CMSWaitDuration=2000"  # already the default value
JVM_OPTS="$JVM_OPTS -XX:+CMSEdenChunksRecordAlways"
JVM_OPTS="$JVM_OPTS -XX:+CMSParallelInitialMarkEnabled"
JVM_OPTS="$JVM_OPTS -XX:-UseBiasedLocking"
JVM_OPTS="$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=70" (reduced to avoid concurrent mode failures)
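As a back-of-the-envelope check on the SurvivorRatio change, HotSpot carves the young generation into Eden : S0 : S1 = SurvivorRatio : 1 : 1, so lowering the ratio from 4 to 2 grows each survivor space at Eden's expense. A sketch of the arithmetic, using the HEAP_NEWSIZE="3G" value above:

```python
def young_gen_layout(young_mb, survivor_ratio):
    """Split a young generation of young_mb into (eden_mb, survivor_mb)
    using HotSpot's Eden : S0 : S1 = survivor_ratio : 1 : 1 layout."""
    survivor = young_mb / (survivor_ratio + 2)  # size of EACH survivor space
    eden = young_mb - 2 * survivor
    return eden, survivor

# HEAP_NEWSIZE="3G" (= 3072 MB):
# SurvivorRatio=4 -> Eden 2048 MB, each survivor 512 MB
# SurvivorRatio=2 -> Eden 1536 MB, each survivor 768 MB
```

The 50% larger survivor spaces give objects surviving a minor GC more room before overflow forces early promotion, at the cost of more frequent minor collections from the smaller Eden.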
 
Cassandra config:
compaction_throughput_mb_per_sec: 24
memtable_total_space_in_mb: 1000 (to make memtable flushes more frequent; the default is 1/4 of the heap, which creates more long-lived objects)
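The flush-frequency reasoning can be sketched numerically: flushes are triggered (roughly) once the shared memtable space fills, so shrinking it from the default of heap/4 shortens how long memtable data lives on the heap. The ingest rate below is a hypothetical figure for illustration, not one reported in this thread:

```python
def approx_flush_interval_s(memtable_space_mb, heap_ingest_mb_per_s):
    """Rough upper bound on how long memtable data accumulates on-heap
    before a flush is forced: shared memtable space / ingest rate."""
    return memtable_space_mb / heap_ingest_mb_per_s

INGEST = 64.0  # hypothetical on-heap memtable ingest, MB/s

# Default, 1/4 of a 12 GB heap = 3072 MB -> ~48 s of heap residency.
# Tuned to 1000 MB                       -> ~15.6 s.
```

Shorter heap residency means memtable objects die after fewer minor collections, improving their odds of being reclaimed in the young generation instead of being promoted.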
 
Questions:
1. Why did increasing memtable_flush_writers cause promotion failures in the JVM? Do more memtable_flush_writers mean more memtables in memory?
2. Objects are still being promoted to the tenured space at a high rate. CMS runs on the old gen every 4-5 minutes under heavy write load, and around 750+ minor collections of up to 300 ms happened in 45 minutes. Do you see any problems with the new JVM tuning and Cassandra config? Does the justification given for those changes sound logical? Any suggestions?
3. What is the best practice for reducing heap fragmentation/promotion failures when allocation and promotion rates are high?
 
Thanks
Anuj
 
 



Re: Handle Write Heavy Loads in Cassandra 2.0.3

Anuj
Small correction: we are writing to 5 CFs and reading from one at high speed.


Thanks
Anuj Wadehra

Sent from Yahoo Mail on Android





Re: Handle Write Heavy Loads in Cassandra 2.0.3

Anuj
Any suggestions or comments on this one?

Thanks
Anuj Wadehra

Sent from Yahoo Mail on Android





Re: Handle Write Heavy Loads in Cassandra 2.0.3

Brice Dutheil

This is an intricate matter; I cannot say for sure which parameters are good and which are wrong, since too many things changed at once.

However, there are many things to consider:

  • What is your OS?
  • Do your nodes have SSDs or mechanical drives? How many cores do you have?
  • Is it the CPUs or the IO that is overloaded?
  • What is the write request rate per node and cluster-wide?
  • What is the compaction strategy of the tables you are writing into?
  • Are you using LOGGED BATCH statements?

With heavy writes, it is NOT recommended to use LOGGED BATCH statements.

In our 2.0.14 cluster we have experienced node unavailability due to long full GC pauses. We discovered bogus legacy data: a single outlier was so wrong that the same CQL rows had been updated hundreds of thousands of times with duplicate data. Given that the tables we were writing to were configured to use LCS, this resulted in keeping memtables in memory long enough to promote them to the old generation (the MaxTenuringThreshold default is 1).
Handling this data proved to be the thing to fix; with default GC settings the cluster (10 nodes) handles 39 write requests/s.

Note that memtables are allocated on-heap in 2.0.x; with 2.1.x they will be allocated off-heap.


-- Brice




Re: Handle Write Heavy Loads in Cassandra 2.0.3

Anuj
Thanks, Brice!

We are using Red Hat Linux 6.4, 24 cores, 64 GB RAM, and SSDs in RAID 5. The CPUs are not overloaded even at peak load. I don't think IO is an issue: iostat shows await < 17 at all times, and the util attribute in iostat usually jumps from 0 to 100 and comes back immediately. I'm not an expert at analyzing IO, but things look OK. We are using STCS and are not using logged batches. We are making around 12k writes/sec across 5 CFs (one with 4 secondary indexes) and 2300 reads/sec on each node of a 3-node cluster. 2 CFs have wide rows with a maximum of around 100 MB of data per row. We have further reduced in_memory_compaction_limit_in_mb to 125, though we are still getting logs saying "compacting large row".

We are planning to upgrade to 2.0.14, as 2.1 is not yet production ready.

I would appreciate it if you could answer the queries posted in the initial mail.

Thanks
Anuj Wadehra

Sent from Yahoo Mail on Android






Re: Handle Write Heavy Loads in Cassandra 2.0.3

Brice Dutheil

Hi, I cannot really answer your questions with rock-solid certainty.

When we had problems, we mainly did two things:

  • Analyzed the GC logs (with Censum from jClarity; this tool is really awesome and a good investment, even more so if production runs other Java applications)
  • Took a heap dump of Cassandra during a GC, which helped narrow down the actual issue

I don't know precisely how to answer, but:

  • concurrent_compactors could be lowered to 10; it seems from another thread here that a high value can be harmful, see https://issues.apache.org/jira/browse/CASSANDRA-6142
  • memtable_flush_writers: we set it to 2
  • compaction_throughput_mb_per_sec could probably be increased; on SSDs that should help
  • trickle_fsync: don't forget this one too if you're on SSDs

Touching JVM heap parameters can be hazardous: increasing the heap may seem like a nice thing, but in the worst case it can increase GC times.

Also, increasing MaxTenuringThreshold is probably wrong too. As you probably know, it means objects are copied from Eden to Survivor 0/1, then to the other survivor space on each subsequent collection, until the threshold is reached and they are copied to the old generation. Applied to memtables, that can mean several copies on each GC, and memtables are not small objects, so this can take a while on a loaded system. Another fact to account for is that on each collection, the active survivor space (S0/S1) has to be big enough for the memtables to fit, alongside all the other surviving objects.
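The copy-cost tradeoff described above can be made concrete with a simplified model (real HotSpot also adjusts the tenuring age dynamically and promotes early when a survivor space overflows, so this is an upper bound on survivor copies):

```python
def minor_gc_copies(collections_survived, max_tenuring_threshold):
    """Number of times an object is copied by minor GCs (simplified model).

    Each minor GC the object survives copies it once (Eden->survivor or
    survivor->survivor); once its age passes the threshold, the next
    copy lands it in the old generation and the copying stops.
    """
    return min(collections_survived, max_tenuring_threshold + 1)

# An object that stays live through 12 minor GCs:
# MaxTenuringThreshold=1  -> 2 copies, then it sits in the old gen
# MaxTenuringThreshold=20 -> 12 copies, all within the young gen
```

A high threshold trades old-gen pressure (and fragmentation) for repeated young-gen copying work, which is exactly why it only pays off if the objects actually die before reaching the threshold.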

So I would rather work on the real cause than on the GC. One thing caught my attention:

Though still getting logs saying “compacting large row”.

Could it be that the model is based on wide rows? That could be a problem for several reasons, not limited to compaction. If so, I'd advise revising the data model.


-- Brice





Re: Handle Write Heavy Loads in Cassandra 2.0.3

Anuj
Thanks, Brice, for all the comments.

We analyzed GC logs and a heap dump before tuning the JVM and GC. With the new JVM config I specified, we were able to remove the promotion failures seen with the default config. From the heap dump, I got the idea that memtables and compaction are the biggest culprits.

CASSANDRA-6142 talks about multithreaded_compaction, but we are using concurrent_compactors; I think they are different. On nodes with many cores, it is usually recommended to run cores/2 concurrent compactors. I don't think 10 vs 12 would make a big difference.

For now, we have kept compaction throughput at 24, as we already have scenarios that create heap pressure under heavy read/write load. Yes, we can think of increasing it on SSDs.

We have already enabled trickle_fsync.

The justification for increasing MaxTenuringThreshold and the young gen size, and for creating large survivor spaces, is to collect most memtables in the young gen itself. To make sure memtables are smaller and not kept in the heap too long, we reduced memtable_total_space_in_mb to 1 GB from the default of heap size / 4. We flush a memtable to disk approximately every 15 seconds, and our minor collections run every 3-7 seconds, so it is highly probable that most memtables will be collected in the young gen. The idea is that most short-lived and medium-lived objects should not reach the old gen; otherwise CMS old-gen collections would be very frequent and more expensive, since they might not collect memtables, and fragmentation would be higher.
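The flush-vs-GC timing argument can be checked with the numbers given here (a simplified model that ignores survivor-space overflow and dynamic age adjustment):

```python
import math

def minor_gcs_survived(memtable_lifetime_s, minor_gc_interval_s):
    """How many minor collections a memtable's objects live through
    before the flush releases them (simplified model)."""
    return math.ceil(memtable_lifetime_s / minor_gc_interval_s)

# Memtable flushed every ~15 s, minor GCs every 3-7 s:
worst = minor_gcs_survived(15, 3)  # fast GCs: 5 collections survived
best = minor_gcs_survived(15, 7)   # slow GCs: 3 collections survived
# Both are well under MaxTenuringThreshold=20, so (survivor space
# permitting) memtable objects should die in the young gen as intended.
```

The caveat is the "survivor space permitting" part: if the live memtable data does not fit in a survivor space at collection time, it is promoted early regardless of the threshold.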

I think wide rows of less than 100 MB shouldn't be a problem; Cassandra in fact provides a very good wide-row format suitable for time series and other scenarios. The problem is: when my in_memory_compaction_limit_in_mb is 125 MB, why is Cassandra printing "compacting large row" when the rows are less than 100 MB?



Thanks
Anuj Wadehra

Sent from Yahoo Mail on Android


From:"Brice Dutheil" <[hidden email]>
Date:Wed, 22 Apr, 2015 at 3:52 am
Subject:Re: Handle Write Heavy Loads in Cassandra 2.0.3

Hi, I cannot really answer your question as some rock solid truth.

When we had problems, we did mainly two things

  • Analyzed the GC logs (with censum from jClarity, this tool IS really awesome, it’s good investment even better if the production is running other java applications)
  • Heap dumped cassandra when there was a GC, this helped in narrowing down the actual issue

I don’t know precisely how to answer, but :

  • concurrent_compactors could be lowered to 10, it seems from another thread here that it can be harmful, see https://issues.apache.org/jira/browse/CASSANDRA-6142
  • memtable_flush_writers we set it to 2
  • compaction_throughput_mb_per_sec could probably be increased, on SSDs that should help
  • trickle_fsync don’t forget this one too if you’re on SSDs

Touching JVM heap parameters can be hazardous, increasing heap may seem like a nice thing, but it can increase GC time in the worst case scenario.

Also increasing the MaxTenuringThreshold is probably wrong too, as you probably know it means objects will be copied from Eden to Survivor 0/1 and to the other Survivor on the next collection until that threshold is reached, then it will be copied in Old generation. That means that’s being applied to Memtables, so it may mean several copies to be done on each GCs, and memtables are not small objects that could take a little while for an available system. Another fact to take account for is that upon each collection the active survivor S0/S1 has to be big enough for the memtable to fit there, and there’s other objects too.

So I would rather work on the real cause than on GC. One thing caught my attention:

Though still getting logs saying “compacting large row”.

Could it be that the model is based on wide rows? That could be a problem for several reasons, not limited to compaction. If so, I'd advise revising the data model.


-- Brice

On Tue, Apr 21, 2015 at 7:53 PM, Anuj Wadehra <anujw_2003@...> wrote:
Thanks Brice!!

We are using Red Hat Linux 6.4, 24 cores, 64 GB RAM, and SSDs in RAID 5. CPUs are not overloaded even at peak load. I don't think IO is an issue, as iostat shows await < 17 at all times; the util attribute in iostat usually jumps from 0 to 100 and comes back immediately. I'm no expert at analyzing IO, but things look OK. We are using STCS and are not using logged batches. We are making around 12k writes/sec into 5 CFs (one with 4 secondary indexes) and 2300 reads/sec on each node of a 3-node cluster. 2 CFs have wide rows with a maximum of around 100 MB of data per row. We have further reduced in_memory_compaction_limit_in_mb to 125, though we are still getting logs saying "compacting large row".

We are planning to upgrade to 2.0.14 as 2.1 is not yet production ready.

I would appreciate if you could answer the queries posted in initial mail.

Thanks
Anuj Wadehra

Sent from Yahoo Mail on Android


From: "Brice Dutheil" <brice.dutheil@...>
Date: Tue, 21 Apr, 2015 at 10:22 pm

Subject: Re: Handle Write Heavy Loads in Cassandra 2.0.3

This is an intricate matter; I cannot say for sure which parameters are good and which are wrong, as too many things changed at once.

However, there are many things to consider:

  • What is your OS ?
  • Do your nodes have SSDs or mechanical drives ? How many cores do you have ?
  • Is it the CPUs or IOs that are overloaded ?
  • What is the write request/s per node and cluster wide ?
  • What is the compaction strategy of the tables you are writing into ?
  • Are you using LOGGED BATCH statements?

With heavy writes, it is NOT recommended to use LOGGED BATCH statements.

In our 2.0.14 cluster we have experienced node unavailability due to long full GC pauses. We discovered bogus legacy data: a single outlier was so wrong that it updated the same CQL rows hundreds of thousands of times with duplicate data. Given that the tables we were writing to were configured to use LCS, this kept memtables in memory long enough to promote them to the old generation (the MaxTenuringThreshold default is 1).
Handling this data proved to be the thing to fix; with default GC settings the cluster (10 nodes) handles 39 write requests/s.

Note Memtables are allocated on heap with 2.0.x. With 2.1.x they will be allocated off-heap.
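For reference, in 2.1 the allocation strategy became a cassandra.yaml option (option names as documented for 2.1; this does not apply to 2.0.x):

```yaml
# cassandra.yaml (2.1+ only): move memtable data off the Java heap
memtable_allocation_type: offheap_objects   # alternatives: heap_buffers, offheap_buffers
```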


-- Brice

On Tue, Apr 21, 2015 at 5:12 PM, Anuj Wadehra <[hidden email]> wrote:
Any suggestions or comments on this one?

Thanks
Anuj Wadehra

Sent from Yahoo Mail on Android


From: "Anuj Wadehra" <[hidden email]>
Date: Mon, 20 Apr, 2015 at 11:51 pm
Subject: Re: Handle Write Heavy Loads in Cassandra 2.0.3

Small correction: we are writing to 5 CFs and reading from one at high speed.



Thanks
Anuj Wadehra

Sent from Yahoo Mail on Android


From: "Anuj Wadehra" <[hidden email]>
Date: Mon, 20 Apr, 2015 at 7:53 pm
Subject: Handle Write Heavy Loads in Cassandra 2.0.3

Hi,
 
Recently, we discovered that millions of mutations were getting dropped on our cluster. Eventually, we solved this problem by increasing the value of memtable_flush_writers from 1 to 3. We usually write to 3 CFs simultaneously, and one of them has 4 secondary indexes.
 
New changes also include:
concurrent_compactors: 12 (earlier it was the default)
compaction_throughput_mb_per_sec: 32 (earlier it was the default)
in_memory_compaction_limit_in_mb: 400 (earlier it was the default, 64)
memtable_flush_writers: 3 (earlier 1)
 
After making the above changes, our write-heavy workload scenarios started producing "promotion failed" exceptions in GC logs.
 
We have done JVM tuning and Cassandra config changes to solve this:
 
MAX_HEAP_SIZE="12G" (increased heap from 8G to reduce fragmentation)
HEAP_NEWSIZE="3G"
 
JVM_OPTS="$JVM_OPTS -XX:SurvivorRatio=2" (we observed that even at SurvivorRatio=4 our survivor space was getting 100% utilized under heavy write load, and we suspected that minor collections were promoting objects directly to the tenured generation)
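As a quick sanity check on these numbers: under ParNew, each survivor space is young_gen / (SurvivorRatio + 2). A back-of-the-envelope sketch using the values quoted in this mail (3G young gen, SurvivorRatio=2):

```shell
# Survivor sizing under ParNew/CMS: each survivor = young / (SurvivorRatio + 2)
YOUNG_MB=3072        # HEAP_NEWSIZE="3G"
SURVIVOR_RATIO=2     # -XX:SurvivorRatio=2
SURVIVOR_MB=$(( YOUNG_MB / (SURVIVOR_RATIO + 2) ))
EDEN_MB=$(( YOUNG_MB - 2 * SURVIVOR_MB ))
echo "eden=${EDEN_MB}MB, each survivor=${SURVIVOR_MB}MB"
# prints: eden=1536MB, each survivor=768MB
```

So each survivor is roughly 768 MB, which has to hold the live memtables plus everything else surviving a minor collection.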
 
JVM_OPTS="$JVM_OPTS -XX:MaxTenuringThreshold=20" (lots of objects were moving from Eden to Tenured on each minor collection; this may be related to medium-lived objects tied to memtables and compaction, as suggested by a heap dump)
 
JVM_OPTS="$JVM_OPTS -XX:ConcGCThreads=20"
JVM_OPTS="$JVM_OPTS -XX:+UnlockDiagnosticVMOptions"
JVM_OPTS="$JVM_OPTS -XX:+UseGCTaskAffinity"
JVM_OPTS="$JVM_OPTS -XX:+BindGCTaskThreadsToCPUs"
JVM_OPTS="$JVM_OPTS -XX:ParGCCardsPerStrideChunk=32768"
JVM_OPTS="$JVM_OPTS -XX:+CMSScavengeBeforeRemark"
JVM_OPTS="$JVM_OPTS -XX:CMSMaxAbortablePrecleanTime=30000"
JVM_OPTS="$JVM_OPTS -XX:CMSWaitDuration=2000" (though it's the default value)
JVM_OPTS="$JVM_OPTS -XX:+CMSEdenChunksRecordAlways"
JVM_OPTS="$JVM_OPTS -XX:+CMSParallelInitialMarkEnabled"
JVM_OPTS="$JVM_OPTS -XX:-UseBiasedLocking"
JVM_OPTS="$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=70" (we reduced the value to avoid concurrent mode failures)
 
Cassandra config:
compaction_throughput_mb_per_sec: 24
memtable_total_space_in_mb: 1000 (to make memtable flushes frequent; the default is 1/4 of the heap, which creates more long-lived objects)
 
Questions:
1. Why did increasing memtable_flush_writers and in_memory_compaction_limit_in_mb cause promotion failures in the JVM? Does more memtable_flush_writers mean more memtables in memory?

2. Objects are still getting promoted to the tenured space at a high rate. CMS is running on the old gen every 4-5 minutes under heavy write load, and around 750+ minor collections of up to 300 ms happened in 45 minutes. Do you see any problems with the new JVM tuning and Cassandra config? Does the justification given for those changes sound logical? Any suggestions?
3. What is the best practice for reducing heap fragmentation/promotion failure when allocation and promotion rates are high?
 
Thanks
Anuj
 
 





Re: Handle Write Heavy Loads in Cassandra 2.0.3

Anuj
Any other suggestions on the JVM tuning and Cassandra config changes we made to solve the promotion failures during GC?

I would appreciate it if someone could try to answer the queries mentioned in the initial mail.

Thanks
Anuj Wadehra

Sent from Yahoo Mail on Android


From: "Anuj Wadehra" <[hidden email]>
Date: Wed, 22 Apr, 2015 at 6:12 pm
Subject: Re: Handle Write Heavy Loads in Cassandra 2.0.3

Thanks Brice for all the comments.

We analyzed GC logs and a heap dump before tuning the JVM and GC. With the new JVM config I described, we were able to eliminate the promotion failures seen with the default config. The heap dump suggested that memtables and compaction are the biggest culprits.

CASSANDRA-6142 talks about multithreaded_compaction, but we are using concurrent_compactors; I think they are different. On nodes with many cores it is usually recommended to run cores/2 concurrent compactors, and I don't think 10 vs 12 would make a big difference.

For now, we have kept compaction throughput at 24, as we already have scenarios that create heap pressure due to heavy read/write load. Yes, we can consider increasing it on SSDs.

We have already enabled trickle_fsync.

The justification behind increasing MaxTenuringThreshold and the young gen size, and creating a large survivor space, is to collect most memtables in the young gen itself. To make sure that memtables are smaller and not kept in the heap too long, we have reduced memtable_total_space_in_mb to 1G from the default of heap size/4. We flush a memtable to disk approximately every 15 seconds, and our minor collections run every 3-7 seconds, so it is highly probable that most memtables will be collected in the young gen. The idea is that most short-lived and medium-lived objects should not reach the old gen; otherwise CMS old-gen collections would be very frequent and more expensive, as they may not collect memtables, and fragmentation would be higher.
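The reasoning in that paragraph can be sketched numerically (using the ~15 s flush interval and 3-7 s minor-GC interval observed on our cluster; these are observations from this thread, not universal values):

```shell
# How many minor GCs does a memtable survive before it is flushed and freed?
FLUSH_INTERVAL_S=15      # memtable flushed to disk roughly every 15 s
MINOR_GC_INTERVAL_S=5    # minor GC every 3-7 s; take a middle value
MAX_TENURING=20          # -XX:MaxTenuringThreshold=20
AGE_AT_FLUSH=$(( FLUSH_INTERVAL_S / MINOR_GC_INTERVAL_S ))
echo "memtable age at flush: ~${AGE_AT_FLUSH} collections (tenuring threshold: ${MAX_TENURING})"
# prints: memtable age at flush: ~3 collections (tenuring threshold: 20)
```

Since a memtable typically becomes garbage after ~3 minor collections, well below the tenuring threshold of 20, it should die in the young gen instead of being promoted.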





Re: Handle Write Heavy Loads in Cassandra 2.0.3

Brice Dutheil
Another reason for memtables to be kept in memory is wide rows. Maybe someone can chime in and confirm, but I believe wide rows (in the Thrift sense) need to be synced entirely across nodes. So from the numbers you gave, a node can send ~100 MB over the network for a single row. With compaction and other activity, this may be an issue, as these objects can stay in the heap long enough to survive a collection.

Think about the row cache too: with wide rows, Cassandra will hold the tables a bit longer to serialize the data into the off-heap row cache (in 2.0.x; not sure about other versions). See this page: http://docs.datastax.com/en/cassandra/2.0/cassandra/operations/ops_configuring_caches_c.html
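If the row cache is in play, these are the relevant 2.0.x knobs (a sketch only; the cache is off by default, and with ~100 MB wide rows enabling it would mean caching entire rows, which is usually a bad fit):

```yaml
# cassandra.yaml (2.0.x) row cache knobs; 0 disables the cache
row_cache_size_in_mb: 0
row_cache_save_period: 0
```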




-- Brice




