SuperColumn vs range of Columns


Matt Corgan
Hi,
I've been watching some of the Cassandra presentation videos and
looking through slides and the website, but I'm still missing the
motivation behind SuperColumns.

1) What is the difference between a super-column like:

homeAddress: {
  street: "1234 x street",
  city: "san francisco",
  zip: "94107",
}

and the BigTable or HBase style of concatenating nested keys together
into something like:

homeAddress/street: "1234 x street",
homeAddress/city: "san francisco",
homeAddress/zip: "94107"

Wouldn’t they be sorted the same way on disk and be similarly
efficient for range queries?  Is it that you avoid storing the string
“homeAddress” redundantly?  Maybe that really adds up if you’re doing
inbox search and storing billions of doc ids where the column name is
several times the size of the doc id.  Seems like BigTable/HBase could
get a similar benefit by using prefix compression and omitting the
timestamps.


2) Can SuperColumns only add one level of nesting beyond normal
columns? That seems limiting considering BigTable and HBase can append
an arbitrary number of nested keys together.


3) Can you update the columns in the row of a supercolumn without
overwriting the whole row? For example, if a Facebook user sends his
10,000th message with the word Steelers in it, does that mean all
10,000 columns need to be overwritten (something like 100KB), or can a
single column be squeezed into the front of a supercolumn?  Similarly,
can you read a fraction of a SuperColumn without pulling the whole
thing to the client?

As far as I can tell, the only benefit of a SuperColumn over a bunch
of Columns stored together is the savings you get by not storing the
column name and timestamp over and over.  What am I missing?

Thanks!  (maybe this could be added to an FAQ section on the project wiki)

Matt

Re: SuperColumn vs range of Columns

Jonathan Ellis-3
On Thu, Sep 10, 2009 at 7:57 PM, Matt Corgan <[hidden email]> wrote:

> 1) What is the difference between a super-column like:
>
> homeAddress: {
>   street: "1234 x street",
>   city: "san francisco",
>   zip: "94107",
> }
>
> and the BigTable or HBase style of concatenating nested keys together
> into something like:
>
> homeAddress/street: "1234 x street",
> homeAddress/city: "san francisco",
> homeAddress/zip: "94107"
>
> Wouldn’t they be sorted the same way on disk and be similarly
> efficient for range queries?  Is it that you avoid storing the string
> “homeAddress” redundantly?

[Note that in Cassandra we refer to column "names" to avoid confusion
w/ row "keys."]

Supercolumns are primarily useful when your column set is not fixed.
Cassandra can currently handle up to a million or so columns in a row
without problems, and with a little work could handle billions, so
treating a row as an associative array with dynamic column names that
are determined at runtime is a totally legitimate thing to do.  If you
are storing "objects" like address data, a supercolumn maps more
closely to what you would think of in an OO language as
Map<String, Address> addresses

rather than having to treat each field separately:

Map<String, String> streets
Map<String, String> cities
Map<String, String> zips

Besides being a more natural fit for the data, your row-level index of
column names is much more effective when related data is grouped like
this than when you repeat the name N times for N fields.
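
To make the comparison concrete, here is a minimal sketch that models a
row as a plain Java sorted map.  It is only an in-memory stand-in, not
Cassandra's API; the names AddressRowSketch, superStyle and flatStyle
are made up for illustration.

```java
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical in-memory model of one row; this is not Cassandra's API.
class AddressRowSketch {

    // Supercolumn style: one top-level name, with the fields as subcolumns.
    static SortedMap<String, SortedMap<String, String>> superStyle() {
        SortedMap<String, String> home = new TreeMap<>();
        home.put("street", "1234 x street");
        home.put("city", "san francisco");
        home.put("zip", "94107");

        SortedMap<String, SortedMap<String, String>> row = new TreeMap<>();
        row.put("homeAddress", home);
        return row;
    }

    // Flat style: the parent name is repeated inside every column name.
    static SortedMap<String, String> flatStyle() {
        SortedMap<String, String> row = new TreeMap<>();
        row.put("homeAddress/street", "1234 x street");
        row.put("homeAddress/city", "san francisco");
        row.put("homeAddress/zip", "94107");
        return row;
    }

    public static void main(String[] args) {
        // Top-level names a row-level index would have to cover in each layout.
        System.out.println(superStyle().keySet());
        System.out.println(flatStyle().keySet());
    }
}
```

In this model the supercolumn-style row has a single top-level name to
index, while the flat row has three, each repeating the homeAddress
prefix.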

> 2) Can SuperColumns only add one level of nesting beyond normal
> columns? That seems limiting considering BigTable and HBase can append
> an arbitrary number of nested keys together.

Yes, only one level of nesting.

Remember, column names are just a byte[].  You can still smush column
names together if you want to.  You don't need my permission. :)
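
A rough sketch of what smushing names together could look like, again
using a plain Java sorted map rather than the Cassandra API.  The "/"
separator, the byte-wise ordering, the helper names (name, under) and
the sample values are all assumptions made for the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Comparator;
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical model: column names are byte[] built by joining parts with '/'.
class CompositeNameSketch {

    // Assumed ordering: unsigned lexicographic comparison of the raw bytes.
    static final Comparator<byte[]> BYTE_ORDER = Arrays::compareUnsigned;

    // Build one column name from any number of nested parts.
    static byte[] name(String... parts) {
        return String.join("/", parts).getBytes(StandardCharsets.UTF_8);
    }

    // Return every column whose name starts with the given parts plus '/'.
    static SortedMap<byte[], String> under(SortedMap<byte[], String> row, String... parts) {
        byte[] from = (String.join("/", parts) + "/").getBytes(StandardCharsets.UTF_8);
        byte[] to = from.clone();
        to[to.length - 1]++; // '/' (0x2F) becomes '0' (0x30): first name past the prefix
        return row.subMap(from, to); // from inclusive, to exclusive
    }

    public static void main(String[] args) {
        SortedMap<byte[], String> row = new TreeMap<>(BYTE_ORDER);
        row.put(name("addresses", "home", "city"), "san francisco");
        row.put(name("addresses", "home", "zip"), "94107");
        row.put(name("addresses", "work", "city"), "oakland"); // made-up sample data
        row.put(name("phones", "home"), "555-0100");           // made-up sample data

        // Range scan over everything nested under addresses/home.
        for (byte[] column : under(row, "addresses", "home").keySet()) {
            System.out.println(new String(column, StandardCharsets.UTF_8));
        }
    }
}
```

The nesting depth here is limited only by how many parts you join, at
the cost of repeating the prefix in every name.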

(Although needing more than one level of nesting is often a sign you
should rethink your row model.)

> 3) Can you update the columns in the row of a supercolumn without
> overwriting the whole row?

Yes.

> Similarly,
> can you read a fraction of a SuperColumn without pulling the whole
> thing to the client?

Yes.
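
In terms of the map model, both operations touch only the subcolumns
involved.  The following is a sketch under that model only: the class
InboxTermSketch and its methods are hypothetical and say nothing about
the actual client API or on-disk behavior.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical model of an inbox-search row: one supercolumn per search term,
// one subcolumn per message id.  Not the Cassandra client API.
class InboxTermSketch {

    // term -> (message id -> payload)
    final SortedMap<String, SortedMap<Long, String>> row = new TreeMap<>();

    // Add a single subcolumn; existing subcolumns under the term are untouched.
    void addMessage(String term, long messageId, String payload) {
        row.computeIfAbsent(term, t -> new TreeMap<>()).put(messageId, payload);
    }

    // Read at most `limit` message ids for a term, in sorted order.
    List<Long> firstIds(String term, int limit) {
        List<Long> ids = new ArrayList<>();
        SortedMap<Long, String> subcolumns = row.get(term);
        if (subcolumns == null) {
            return ids;
        }
        for (Long id : subcolumns.keySet()) {
            if (ids.size() == limit) {
                break;
            }
            ids.add(id);
        }
        return ids;
    }

    public static void main(String[] args) {
        InboxTermSketch inbox = new InboxTermSketch();
        inbox.addMessage("steelers", 1L, "first");
        inbox.addMessage("steelers", 2L, "second");
        inbox.addMessage("steelers", 3L, "third"); // one new subcolumn, nothing rewritten
        System.out.println(inbox.firstIds("steelers", 2));
    }
}
```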

-Jonathan