Out of Memory Error While Opening SSTables on Startup

Out of Memory Error While Opening SSTables on Startup

Paul Nickerson
I am getting an out of memory error when I try to start Cassandra on one of my nodes. Cassandra will run for a minute, and then exit without outputting any error in the log file. It is happening while SSTableReader is opening a couple hundred thousand things.

I am running a 6 node cluster using Apache Cassandra 2.1.2 with DataStax OpsCenter 5.0.2 from the AWS EC2 AMI "DataStax Auto-Clustering AMI 2.5.1-hvm" (DataStax Community AMI). I'm using m3.xlarge instances, which have 15 GiB of memory.

By default, the formula in /etc/cassandra/cassandra-env.sh sets Xms and Xmx to 3.6 GiB. I tried overriding with 8 GiB and 2 GiB; both result in the same problem.
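For reference, the default comes from a formula along these lines (a sketch of the 2.1-era calculate_heap_sizes logic in cassandra-env.sh, not a verbatim copy):

```shell
# Sketch of the default heap formula in cassandra-env.sh (Cassandra 2.1 era):
#   MAX_HEAP_SIZE = max(min(1/2 * RAM, 1024 MB), min(1/4 * RAM, 8192 MB))
system_memory_in_mb=15360   # m3.xlarge: 15 GiB, in MB
half=$((system_memory_in_mb / 2))
if [ "$half" -gt 1024 ]; then half=1024; fi
quarter=$((system_memory_in_mb / 4))
if [ "$quarter" -gt 8192 ]; then quarter=8192; fi
max_heap_size_in_mb=$(( half > quarter ? half : quarter ))
echo "${max_heap_size_in_mb}M"
```

For a nominal 15 GiB this yields 3840M; the 3.6 GiB figure above presumably reflects the instance reporting slightly under 15 GiB to the OS.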

This shows up in system.log on startup. There is nothing after that last line. I'm not sure why there are over 200 thousand SSTable things to open.

    INFO  [main] 2015-02-10 18:31:44,766 ColumnFamilyStore.java:268 - Initializing OpsCenter.settings
    INFO  [SSTableBatchOpen:1] 2015-02-10 18:31:44,767 SSTableReader.java:392 - Opening /raid0/cassandra/data/OpsCenter/settings-4455ec427ca411e4bd3f1927a2a71193/OpsCenter-settings-ka-1755 (290 bytes)
    ...
    INFO  [SSTableBatchOpen:4] 2015-02-10 18:31:44,775 SSTableReader.java:392 - Opening /raid0/cassandra/data/OpsCenter/settings-4455ec427ca411e4bd3f1927a2a71193/OpsCenter-settings-ka-1753 (288 bytes)
    INFO  [main] 2015-02-10 18:31:44,797 AutoSavingCache.java:146 - reading saved cache /raid0/cassandra/saved_caches/OpsCenter-settings-4455ec427ca411e4bd3f1927a2a71193-KeyCache-b.db
    INFO  [main] 2015-02-10 18:31:56,504 ColumnFamilyStore.java:268 - Initializing OpsCenter.rollups60
    INFO  [SSTableBatchOpen:2] 2015-02-10 18:32:08,353 SSTableReader.java:392 - Opening /raid0/cassandra/data/OpsCenter/rollups60-445613507ca411e4bd3f1927a2a71193/OpsCenter-rollups60-ka-359458 (195 bytes)
    ... (201,260 more lines like this)
    INFO  [SSTableBatchOpen:1] 2015-02-10 18:32:47,804 SSTableReader.java:392 - Opening /raid0/cassandra/data/OpsCenter/rollups60-445613507ca411e4bd3f1927a2a71193/OpsCenter-rollups60-ka-332976 (291 bytes)

When I run Cassandra right on the command line (as opposed to starting the service), I get error information in the output.

    Java HotSpot(TM) 64-Bit Server VM warning: INFO: os::commit_memory(0x00000007fd2ec000, 1241088, 0) failed; error='Cannot allocate memory' (errno=12)
    #
    # There is insufficient memory for the Java Runtime Environment to continue.
    # Native memory allocation (malloc) failed to allocate 1241088 bytes for committing reserved memory.
    # An error report file with more information is saved as:
    # /raid0/cassandra/hs_err_pid22970.log

That log file is big, but part of it near the top reads

    #  Out of Memory Error (os_linux.cpp:2726), pid=22970, tid=140587792205568
    #
    # JRE version: Java(TM) SE Runtime Environment (7.0_51-b13) (build 1.7.0_51-b13)
    # Java VM: Java HotSpot(TM) 64-Bit Server VM (24.51-b03 mixed mode linux-amd64 compressed oops)
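Worth noting: os::commit_memory failing with errno=12 is a native allocation failure, not Java heap exhaustion, which would be consistent with raising -Xmx not helping. A guess on my part, not something confirmed above: with a couple hundred thousand memory-mapped SSTables, the kernel's per-process mapping limit (vm.max_map_count, commonly 65530 by default on Linux; check with `sysctl vm.max_map_count`) could be the wall being hit. Rough arithmetic:

```shell
# Back-of-envelope: do ~200k memory-mapped SSTables exceed the typical
# per-process mmap limit? Assumes at least 2 mapped files per SSTable
# (data + index), which is an assumption, not a measured figure.
sstables=201262                     # count suggested by the startup log above
mappings=$((sstables * 2))
default_max_map_count=65530         # common Linux default for vm.max_map_count
echo "$mappings"
echo $((mappings > default_max_map_count))   # 1 if over the default limit
```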

Does anyone know how I might get Cassandra on this node running again? I'm not very familiar with correctly tuning Java memory parameters, and I'm not sure if that's the right solution in this case anyway.

BTW, I didn't know what an SSTable was. I found the definition here: http://www.datastax.com/documentation/cassandra/2.1/share/glossary/gloss_sstable.html

Thank you,
 ~ Paul Nickerson

Re: Out of Memory Error While Opening SSTables on Startup

Robert Coli-3
Try running 2.1.1, and/or increasing heap size beyond 8gb.

Are there actually that many SSTables on disk?

=Rob
 

Re: Out of Memory Error While Opening SSTables on Startup

Paul Nickerson
Thank you Rob. I tried a 12 GiB heap size, and still crashed out. There are 1,617,289 files under OpsCenter/rollups60.
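For anyone wanting to reproduce that count, something along these lines works (the directory name is taken from the startup log earlier in the thread; adjust to your data directory layout):

```shell
# count_files DIR — print the number of regular files under DIR
count_files() {
    find "$1" -type f | wc -l | tr -d ' '
}

# e.g. (path assumed from the log output above):
# count_files /raid0/cassandra/data/OpsCenter/rollups60-445613507ca411e4bd3f1927a2a71193
```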

Once I downgraded Cassandra to 2.1.1 (apt-get install cassandra=2.1.1), I was able to start up Cassandra OK with the default heap size formula.

Now my cluster is running multiple versions of Cassandra. I think I will downgrade the rest to 2.1.1.

 ~ Paul Nickerson


Re: Out of Memory Error While Opening SSTables on Startup

Chris Lohfink-3
Your cluster is probably having issues with compactions (with STCS you should never have this many). I would probably punt with OpsCenter/rollups60: turn the node off and move all of the sstables off to a different directory for backup (or just rm them if you really don't care about 1 minute metrics), then turn the server back on.

Once you get your cluster running again, go back and investigate why compactions stopped. My guess is you hit an exception in the past that killed your CompactionExecutor, and things just built up slowly until you got to this point.

Chris


Re: Out of Memory Error While Opening SSTables on Startup

Paul Nickerson
I was having trouble with snapshots failing while trying to repair that table (http://www.mail-archive.com/user@.../msg40686.html). I have a repair running on it now, and it seems to be going successfully this time. I am going to wait for that to finish, then try a manual nodetool compact. If that goes successfully, would it be safe to chalk the past lack of compaction on this table up to 2.1.2 problems?


 ~ Paul Nickerson


Re: Out of Memory Error While Opening SSTables on Startup

Chris Lohfink-3
Yeah... probably just 2.1.2 things and not compactions. You still probably want to do something about the 1.6 million files though. It may be worth just mv/rm'ing the 60 sec rollup data unless you're really attached to it.

Chris


Re: Out of Memory Error While Opening SSTables on Startup

Eric Stevens
This kind of recovery is definitely not my strong point, so feedback on this approach would certainly be welcome.

As I understand it, if you really want to keep that data, you ought to be able to mv it out of the way to get your node online, then move those files back several thousand at a time, nodetool refresh OpsCenter rollups60 && nodetool compact OpsCenter rollups60; rinse and repeat. This should let you incrementally restore the data in that keyspace without putting so many sstables in there that it OOMs your cluster again.
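A minimal sketch of that loop, under stated assumptions: keyspace OpsCenter, table rollups60, and whole-SSTable batching — all components of a given generation (-Data.db, -Index.db, -Summary.db, etc.) must move back together, which this naive filename-order version does not guarantee:

```shell
# restore_batches SRC DST N — move up to N files at a time from backup
# directory SRC back into the live table directory DST, then refresh and
# compact after each batch. Sketch only; a real run must batch by SSTable
# generation so a Data.db never lands without its Index.db, and assumes
# filenames contain no whitespace.
restore_batches() {
    src=$1; dst=$2; n=$3
    while batch=$(ls "$src" 2>/dev/null | head -n "$n"); [ -n "$batch" ]; do
        for f in $batch; do
            mv "$src/$f" "$dst/"
        done
        nodetool refresh OpsCenter rollups60
        nodetool compact OpsCenter rollups60
    done
}
```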


Re: Out of Memory Error While Opening SSTables on Startup

Flavien Charlon
I already experienced the same problem (hundreds of thousands of SSTables) with Cassandra 2.1.2. It seems to appear when running an incremental repair while there is a medium to high insert load on the cluster. The repair goes in a bad state and starts creating way more SSTables than it should (even when there should be nothing to repair).


Re: Out of Memory Error While Opening SSTables on Startup

Jan
Paul Nickerson;

curious,
did you get a solution to your problem?

Regards,
Jan



