Wide rows or tons of rows?


Héctor Izquierdo Seliva
Hi everyone.

I'm sure this question or a similar one has come up before, but I can't find a
clear answer. I have to store an unknown number of items in Cassandra,
which can vary from a few hundred to a few million per customer.

I read that in Cassandra wide rows are better than a lot of rows, but
then I face two problems. First, column distribution. The only way I can
think of to distribute items among a given set of rows is hashing the
item id to a row id, and then using the item id as the column name. In
this way, I can distribute data among a few rows evenly, but if there
are only a few items it's equivalent to a row per item plus more
overhead, and if there are millions of items then the rows are too big,
and I have to turn off the row cache. Does anybody know a way around this?
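
[Editor's illustration] The hashing scheme described above can be sketched roughly as follows. This is a minimal sketch, not code from the thread: `row_key_for`, `put_item`, the `customer:bucket` key layout, the bucket count of 16, and the in-memory `store` dictionary (standing in for a column family) are all assumptions made for illustration.

```python
import hashlib

def row_key_for(customer_id, item_id, num_buckets):
    """Map an item to one of num_buckets rows belonging to its customer."""
    # md5 is used only because it is stable across processes,
    # unlike Python's built-in hash() for strings.
    digest = hashlib.md5(item_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % num_buckets
    return f"{customer_id}:{bucket}"

# Toy stand-in for a column family: row key -> {column name: value}.
store = {}

def put_item(customer_id, item_id, value, num_buckets=16):
    """Store the item as a column named by item id in its bucket row."""
    key = row_key_for(customer_id, item_id, num_buckets)
    store.setdefault(key, {})[item_id] = value
    return key

for i in range(1000):
    put_item("customer-1", f"item-{i}", b"payload")
```

With a fixed bucket count, a customer with few items ends up with close to one row per item, and a customer with millions of items ends up with very wide rows, which is exactly the trade-off raised in the question.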

The second issue is that in my benchmarks, once the data is mmapped, one
item per row performs faster than wide rows by a significant margin. Is
this how it is supposed to be?

I can give additional data if needed. English is not my first language
so I apologize beforehand if some of this doesn't make sense.

Thanks for your time


Re: Wide rows or tons of rows?

Edward Capriolo
In reply to the post by Héctor Izquierdo Seliva (2010/10/11):

If you have wide rows, RowCache is a problem. IMHO RowCache is only
viable in situations where you have a fixed amount of data and thus
will get a high hit rate. I was running a large row cache for some
time and I found it unpredictable. It causes memory pressure on the
JVM from moving things in and out of memory, and if the hit rate is
low, taking a key and all its columns in and out repeatedly ends up
being counterproductive for disk utilization. I suggest KeyCache in
most situations (there is a ticket open for a fractional row cache).
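
[Editor's illustration] The hit-rate argument can be made concrete with a toy simulation. This is a sketch under stated assumptions, not a model of Cassandra's actual row cache: `LRUCache` and `hit_rate` are hypothetical names, reads are assumed to be uniformly random over all rows, and the eviction policy is plain LRU.

```python
import random
from collections import OrderedDict

class LRUCache:
    """Tiny least-recently-used cache that counts hits and misses."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.data = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, key, load):
        if key in self.data:
            self.data.move_to_end(key)      # mark as most recently used
            self.hits += 1
            return self.data[key]
        self.misses += 1
        value = load(key)                   # simulated read from disk
        self.data[key] = value
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)   # evict least recently used
        return value

def hit_rate(num_rows, cache_size, reads=20000, seed=1):
    """Fraction of uniformly random reads served from the cache."""
    rng = random.Random(seed)
    cache = LRUCache(cache_size)
    for _ in range(reads):
        cache.get(rng.randrange(num_rows), load=lambda k: k)
    return cache.hits / reads

fixed_dataset = hit_rate(num_rows=1_000, cache_size=1_000)
growing_dataset = hit_rate(num_rows=100_000, cache_size=1_000)
print(fixed_dataset, growing_dataset)
```

When the whole dataset fits in the cache, nearly every read after warm-up is a hit; when the dataset is a hundred times larger than the cache, most reads load a row only to evict it again shortly afterwards.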

Another factor to consider is that if you have many rows and many columns
you end up with larger indexes. In our case we have start-up times
slightly longer than we would like because the process of sampling
indexes during start-up is intensive. If I could do it all over again
I might serialize more into single columns rather than exploding data
across multiple rows and columns. If you always need to look up the
entire row, do not break it down by columns.

Memory mapping: there are different dynamics depending on data size
relative to memory size. You may have something like ~40GB of data
and a 10GB index with 32GB RAM per node; this system is not going to
respond the same way as one with, say, 200GB of data and 25GB of
indexes. Also, it is very workload dependent.
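
[Editor's illustration] The two configurations mentioned above differ a lot in how much of the on-disk data could be resident in memory at once. The following back-of-the-envelope sketch uses a hypothetical helper, `ram_coverage`, and assumes (optimistically) that all RAM is available for mapped files, ignoring the JVM heap and the operating system.

```python
def ram_coverage(data_gb, index_gb, ram_gb):
    """Upper bound on the fraction of data + index that fits in RAM."""
    return min(1.0, ram_gb / (data_gb + index_gb))

# Figures taken from the message above; 32GB RAM per node in both cases.
smaller = ram_coverage(data_gb=40, index_gb=10, ram_gb=32)
larger = ram_coverage(data_gb=200, index_gb=25, ram_gb=32)
print(f"{smaller:.0%} vs {larger:.0%}")
```

Under these assumptions roughly two thirds of the smaller dataset can stay mapped in memory, against about a seventh of the larger one, so the same read pattern would hit disk far more often on the second system.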

Hope this helps,
Edward

Re: Wide rows or tons of rows?

Héctor Izquierdo Seliva
On Mon, 11-10-2010 at 11:08 -0400, Edward Capriolo wrote:

Inlined:

> If you have wide rows, RowCache is a problem. IMHO RowCache is only
> viable in situations where you have a fixed amount of data and thus
> will get a high hit rate. I was running a large row cache for some
> time and I found it unpredictable. It causes memory pressure on the
> JVM from moving things in and out of memory, and if the hit rate is
> low, taking a key and all its columns in and out repeatedly ends up
> being counterproductive for disk utilization. I suggest KeyCache in
> most situations (there is a ticket open for a fractional row cache).

I saw the same behavior. It's a pity there is not a column cache. That
would be awesome.

> Another factor to consider is that if you have many rows and many columns
> you end up with larger indexes. In our case we have start-up times
> slightly longer than we would like because the process of sampling
> indexes during start-up is intensive. If I could do it all over again
> I might serialize more into single columns rather than exploding data
> across multiple rows and columns. If you always need to look up the
> entire row, do not break it down by columns.

So it might be better to store a JSON-serialized version then? I was
using SuperColumns to store item info, but a simple string might give me
the option to do some compression.
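
[Editor's illustration] Serializing an item into a single compressed column value might look like the sketch below. The function names `pack_item` and `unpack_item`, the sample item, and the choice of zlib are assumptions for illustration; the thread does not name a compression method.

```python
import json
import zlib

def pack_item(item):
    """Serialize an item to compact JSON and compress it into one value."""
    text = json.dumps(item, sort_keys=True, separators=(",", ":"))
    return zlib.compress(text.encode("utf-8"))

def unpack_item(blob):
    """Reverse of pack_item: decompress and parse the JSON."""
    return json.loads(zlib.decompress(blob).decode("utf-8"))

item = {"id": "item-42", "name": "example", "tags": ["a", "b"], "price": 9.99}
blob = pack_item(item)          # bytes suitable for a single column value
restored = unpack_item(blob)
```

The cost of this layout is that individual fields can no longer be read or updated on their own, and very small items may not shrink at all once the compression header is added.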

> Memory mapping: there are different dynamics depending on data size
> relative to memory size. You may have something like ~40GB of data
> and a 10GB index with 32GB RAM per node; this system is not going to
> respond the same way as one with, say, 200GB of data and 25GB of
> indexes. Also, it is very workload dependent.

We have a 6-node cluster with 16 GB RAM each, although the whole
dataset is expected to be around 100GB per machine. Which indexes are
more expensive, row or column indexes?

> Hope this helps,
> Edward

It does!



Re: Wide rows or tons of rows?

Jeremy Davis-3
In reply to this post by Edward Capriolo

Thanks for this reply. I'm wondering about the same issue: should I bucket things into wide rows (say 10M rows), or narrow ones (say 10K or 100K)?
Of course it depends on my access patterns, right?

Does anyone know if a partial row cache is a feasible feature to implement? My use case is something like:
I have rows with 10MB / 100K Columns of data. I _typically_ slice from oldest to newest on the row, and _typically_ only need the first 100 columns / 10KB, etc...

If someone were to implement a cache strategy to support this, would they find it feasible, or difficult/impossible because of <some limitation xyz>?
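
[Editor's illustration] The access pattern described above (slice from the start of the row, usually only the first 100 columns) could in principle be served by a cache that keeps just the head of each row. The sketch below is purely hypothetical and says nothing about feasibility inside Cassandra itself: `HeadCache`, its methods, and the invalidate-on-write policy are assumptions, and the backing store is a plain dictionary of sorted column lists.

```python
class HeadCache:
    """Caches only the first head_size columns of each row (hypothetical)."""

    def __init__(self, backing, head_size=100):
        self.backing = backing    # row key -> list of (name, value), sorted by name
        self.head_size = head_size
        self.heads = {}           # row key -> cached first head_size columns
        self.hits = 0
        self.misses = 0

    def slice_from_start(self, row_key, count):
        """Return the first count columns, from cache when possible."""
        if count > self.head_size:
            self.misses += 1      # request is larger than what we cache
            return self.backing[row_key][:count]
        head = self.heads.get(row_key)
        if head is None:
            self.misses += 1
            head = self.backing[row_key][: self.head_size]
            self.heads[row_key] = head
        else:
            self.hits += 1
        return head[:count]

    def insert(self, row_key, name, value):
        """Write a column and drop the cached head (simplest safe policy)."""
        columns = self.backing.setdefault(row_key, [])
        columns.append((name, value))
        columns.sort()
        self.heads.pop(row_key, None)

rows = {"r": [(f"{i:06d}", i) for i in range(1000)]}
cache = HeadCache(rows, head_size=100)
first = cache.slice_from_start("r", 10)
second = cache.slice_from_start("r", 10)   # served from the cached head
```

Dropping the cached head on every write keeps the sketch correct but would hurt write-heavy rows; a real design would have to decide how to keep the head consistent with inserts and deletes near the start of the row.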

-JD




Re: Wide rows or tons of rows?

aaron morton
No idea about a partial row cache, but I would start with fat rows in your use case. If you find that performance is really a problem, then you could add a second "recent / oldest" CF that you maintain with the most recent entries and use the row cache there. Or add more nodes.
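
[Editor's illustration] The second-column-family idea can be sketched as a dual write: every entry goes to the full row, and a small capped copy of the latest entries is kept alongside it. `TwoFamilyStore`, its method names, and the limit of 100 are hypothetical, both column families are modelled as in-memory dictionaries, and entries are assumed to arrive in timestamp order.

```python
from collections import deque

class TwoFamilyStore:
    """Full rows in `main`, a capped copy of the newest entries in `recent`."""

    def __init__(self, recent_limit=100):
        self.main = {}      # row key -> list of (timestamp, value), everything
        self.recent = {}    # row key -> deque holding the newest entries only
        self.recent_limit = recent_limit

    def append(self, row_key, timestamp, value):
        """Write to both families; the deque drops its oldest entry when full."""
        self.main.setdefault(row_key, []).append((timestamp, value))
        recent = self.recent.setdefault(row_key, deque(maxlen=self.recent_limit))
        recent.append((timestamp, value))

    def latest(self, row_key, count):
        """Return the newest count entries, preferring the small family."""
        if count <= 0:
            return []
        if count <= self.recent_limit:
            return list(self.recent.get(row_key, ()))[-count:]
        return self.main.get(row_key, [])[-count:]

store = TwoFamilyStore(recent_limit=3)
for t in range(10):
    store.append("r", t, f"v{t}")
newest_two = store.latest("r", 2)
```

Because the small family has a bounded size per row, it is the kind of fixed-size data for which a row cache tends to behave predictably, at the price of writing every entry twice.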
 

Aaron


(In reply to the post by Jeremy Davis of 12 Oct 2010, 10:08 AM.)