I run a 10-node Cassandra cluster in production. The workload is 99% writes, 1% reads, and 0% deletes. The nodes have 32 GB RAM; C* runs with an 8 GB heap. Each node has an SSD for the commitlog and 2x4 TB spinning disks for data (sstables). The schema uses key caching only. The C* version is 2.1.2.
The cluster is projected to run out of free disk space before long, so its storage capacity needs to be increased. The client prefers increasing disk size over adding more nodes, so the plan is to replace the 2x4 TB spinning disks in each node with 3x6 TB spinning disks.
Thanks in advance.
The preferred pattern for scaling data with Cassandra is to add nodes. Growing the disk on each node is an anti-pattern. The key strength of Cassandra is that it is a DISTRIBUTED database, so always keep your eye on distributing your data.
But if you do need to grow disk, be sure to grow RAM and CPU power as well. More disk without more RAM AND CPU is just asking for trouble. But even that has its limits relative to the preferred pattern of adding nodes.
-- Jack Krupansky
On Wed, Apr 8, 2015 at 4:36 AM, Thomas Borg Salling <[hidden email]> wrote:
Agreed with Jack. Cassandra is a database meant to scale horizontally by adding nodes, and what you're describing is vertical scale.
Aside from the vertical scale issue, unless you're running a very specific workload (time series data w/ Date Tiered Compaction) and you REALLY know what you're doing, I wouldn't go above 3-5 TB per node right now. Beyond that you'll start to see GC issues and your cluster performance will suffer.
Add nodes and sleep comfortably at night.
On Wed, Apr 8, 2015 at 4:27 AM Jack Krupansky <[hidden email]> wrote:
In reply to this post by Thomas Borg Salling
First off, I agree that the preferred path is adding nodes, but what you describe is possible.
> Can C* handle up to 18 TB data size per node with this amount of RAM?
Depends on how deep in the weeds you want to get tuning and testing. See below.
> Is it feasible to increase the disk size by mounting a new (larger) disk, copy all SS tables to it, and then mount it on the same mount point as the original (smaller) disk (to replace it)?
Yes (with C* off of course).
As for tuning, you will need to look at, experiment with, and get a good understanding of:
- index_interval (turn this up now anyway if you have not already; start at 512 and go up from there)
- bloom filter space usage via bloom_filter_fp_chance
- compression metadata storage via chunk_length_kb
- repair time, and how compaction_throughput_in_mb_per_sec and stream_throughput_outbound_megabits_per_sec will affect it
The first three will have a direct negative impact on read performance.
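To give a feel for the bloom filter item in the list above, here is a minimal sketch using the textbook optimal bloom filter sizing formula (bits per key = -ln(p) / (ln 2)^2). It is not Cassandra's exact memory accounting, and the partition count is a purely hypothetical figure; substitute your own numbers.

```python
import math


def bloom_filter_bytes(num_keys: int, fp_chance: float) -> float:
    """Textbook optimal bloom filter size, in bytes, for num_keys at fp_chance.

    This is the standard formula, not Cassandra's exact implementation.
    """
    bits_per_key = -math.log(fp_chance) / (math.log(2) ** 2)
    return num_keys * bits_per_key / 8


# Hypothetical: 5 billion partition keys stored on a single dense node.
NUM_KEYS = 5_000_000_000

for fp_chance in (0.01, 0.1):
    gib = bloom_filter_bytes(NUM_KEYS, fp_chance) / 2**30
    print(f"bloom_filter_fp_chance={fp_chance}: roughly {gib:.1f} GiB")
```

Raising bloom_filter_fp_chance shrinks the filters, but more reads will then touch sstables that do not contain the key, which is the read-performance trade-off mentioned above.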
You will definitely want to use JBOD so you don't have to repair everything if you lose a single disk, but you will still be degraded for *a very long time* when you lose a disk.
This is hard and takes experimentation and research (I can't emphasize this part enough), but I've seen it work. That said, the engineering time spent is probably more than buying and deploying additional hardware in the first place. YMMV.
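As a back-of-the-envelope illustration of "a very long time", the sketch below estimates how long it takes to re-stream one lost 6 TB disk at a given streaming cap. The 200 megabits/s figure is an assumed value for stream_throughput_outbound_megabits_per_sec (check your own cassandra.yaml), and the estimate ignores compression, parallel streams from several replicas, and the compaction that follows.

```python
def hours_to_stream(data_tb: float, throughput_megabits_per_sec: float) -> float:
    """Hours needed to move data_tb (decimal terabytes) at the given rate."""
    bits = data_tb * 1e12 * 8
    seconds = bits / (throughput_megabits_per_sec * 1e6)
    return seconds / 3600


# Assumed values: one full 6 TB data disk, streaming capped at 200 megabits/s.
hours = hours_to_stream(6, 200)
print(f"about {hours:.0f} hours ({hours / 24:.1f} days)")
```

Raising the cap shortens the window, but the spinning disks and compaction then have to keep up with the incoming data.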
I can certainly sympathize if you have IT staff/management who will willingly spring for some disk drives, but not for full machines, even if they are relatively commodity boxes. Seems penny-wise and pound-foolish to me, but management has their own priorities, plus there is the pre-existing Oracle mindset of dense/fat nodes as a preference.
-- Jack Krupansky
On Wed, Apr 8, 2015 at 2:00 PM, Nate McCall <[hidden email]> wrote:
Yikes, 18 TB/node is a very bad idea.
I don't like to go over 2-3 TB personally, and you have to be careful with JBOD. See one of Ellis's latest posts on this and the suggested use of LVM. It is a reversal of the previous position on JBOD.
On Apr 8, 2015, at 3:11 PM, Jack Krupansky <[hidden email]> wrote: