Visual representation of Cassandra data model

Mark McBride
While working on an updated data model wiki page I'm trying to put
together a graphical representation of the data model.  I threw this
together based on Curt's goal of modeling delicious.  The basic gist
is descriptive data for tags, users, and bookmarks goes in the
Description column family.  The relationships between bookmarks, tags
and users go in the map supercolumn.  I'm not sure this is how you
would do it in production (I'm guessing at the very least you'd want
separate supercolumns for bookmarks, tags and users), but it seems to
be simple enough for a new user to digest, and covers all the bases of
the data model (aside from ordering, I guess).  So, two questions:

1) did I get it right (I'm new to this as well)?
2) is this a useful representation?

  ---Mark

Attachment: cassandra-data-format.png (159K)

Re: Visual representation of Cassandra data model

Ryan King
A few quick comments:

* it's not clear which column family the super column you're using is in.
* it might be useful to include the timestamps in the columns (since
they're user-supplied)
* given that the colon-delimited api has been removed, it might be
easier to explain the data model without such strings
* why would you mix different kinds of data in the same column family,
rather than having separate column families for each? (users,
bookmarks, tags)
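
To illustrate the timestamp point, here is a minimal Python sketch. It assumes a column is a (name, value, timestamp) triple and that the write carrying the highest client-supplied timestamp wins. The `Column` type and `apply_write` helper are hypothetical names made up for this example, not part of any Cassandra API.

```python
from typing import Dict, NamedTuple


class Column(NamedTuple):
    name: str
    value: str
    timestamp: int  # supplied by the client, not generated by the server


def apply_write(row: Dict[str, Column], col: Column) -> None:
    """Keep the incoming column only if it is newer than the stored one."""
    current = row.get(col.name)
    if current is None or col.timestamp > current.timestamp:
        row[col.name] = col


row: Dict[str, Column] = {}
apply_write(row, Column("title", "The Sartorialist", timestamp=2))
apply_write(row, Column("title", "stale title", timestamp=1))  # older, ignored
```

Showing the timestamp next to each value in a diagram makes it clearer why two clients writing the same column do not simply overwrite each other in arrival order.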

-ryan

On Wed, Aug 12, 2009 at 4:57 PM, Mark McBride<[hidden email]> wrote:

> While working on an updated data model wiki page I'm trying to put
> together a graphical representation of the data model.  I threw this
> together based on Curt's goal of modeling delicious.  The basic gist
> is descriptive data for tags, users, and bookmarks goes in the
> Description column family.  The relationships between bookmarks, tags
> and users goes in the map supercolumn.  I'm not sure this is how you
> would do it in production (I'm guessing at the very least you'd want
> separate supercolumns for bookmarks, tags and users), but it seems to
> be simple enough for a new user to digest, and covers all the bases of
> the data model (aside from ordering I guess).  So two questions
>
> 1) did I get it right (I'm new to this as well)?
> 2) is this a useful representation?
>
>  ---Mark
>

Re: Visual representation of Cassandra data model

Mark McBride
Is this clearer?  I had the key names set up as <type>:<id> just to
keep it simple and put everything in one keyspace.  Ditto the super
column, although I guess that could be spread out into three things,
or you could spread it out into three keyspaces.  Not sure what best
practices there are.

What I'd like to do (and I'll get started on this tonight) is start
with a problem statement, and then go about building up a
storage-conf.xml file with this structure, showing API examples along
the way.  So while this is a final picture, there would be simpler
ones up front.

   ---Mark

On Wed, Aug 12, 2009 at 5:35 PM, Ryan King<[hidden email]> wrote:

> A few quick comments:
>
> * its not clear what column family the super column you're using is in.
> * it might be useful to include the timestamps in the columns (since
> they're user-supplied)
> * given that the colon-delimited api has been removed, it might be
> easier to explain the data model without such strings
> * why would you mix different kinds of data in the same column family,
> rather than having separate column families for each? (users,
> bookmarks, tags)
>
> -ryan

Attachment: cassandra-data-format.png (169K)

Re: Visual representation of Cassandra data model

Jonathan Ellis-3
Thanks for taking a stab at this, Mark.

I'm not a fan of teaching this by showing CF-spanning rows.  (The
bigtable paper does this IIRC but it's wrong. :)

You can have data in different CFs with the same key, yes, but all
that means is they will be stored on the same nodes.  Each CF is
stored separately on disk and queried separately and the common case
is that they _won't_ have keys in common, rather than the reverse.
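
A rough way to picture this, as a Python sketch: each column family is its own map from row key to columns, and a read always names exactly one CF. The CF names, the `get_row` helper and the sample data are made up for illustration.

```python
# Each column family (CF) is an independent map: row key -> {column name: value}.
keyspace = {
    "Users": {"mccv": {"name": "Mark McBride"}},
    "Bookmarks": {
        "http://thesartorialist.blogspot.com": {"title": "The Sartorialist"},
    },
}


def get_row(cf_name: str, key: str) -> dict:
    """Read one row from one CF; there is no cross-CF row lookup."""
    return keyspace[cf_name].get(key, {})


# The same key may or may not exist in another CF; each lookup is separate.
user = get_row("Users", "mccv")
missing = get_row("Bookmarks", "mccv")
```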

-Jonathan

On Wed, Aug 12, 2009 at 10:24 PM, Mark McBride<[hidden email]> wrote:

> Is this clearer?  I had the key names set up as <type>:<id> just to
> keep it simple and put everything in one keyspace.  Ditto the super
> column, although I guess that could be spread out into three things,
> or you could spread it out into three keyspaces.  Not sure what best
> practices there are.
>
> What I'd like to do (and I'll get started on this tonight) is start
> with a problem statement, and then go about building up a
> storage-conf.xml file with this structure, showing API examples along
> the way.  So while this is a final picture, there would be simpler
> ones up front.
>
>   ---Mark

Re: Visual representation of Cassandra data model

Arin Sarkissian
FWIW: I find that the only sane way to visually represent a data model
is to use a JSON-ish notation.
Picture-type visualizations confuse me even more.

I don't mean to be a downer, but a lot of my peers and I found all the
picture-type visual aids even more confusing.
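
For example, a JSON-ish rendering might look like the sketch below, written as a Python dict so it can be printed with `json.dumps`. The CF names, keys and values are illustrative only and are not taken from the attached picture. The one structural assumption is that a super column family nests one level deeper than a standard one.

```python
import json

model = {
    # Standard CF: row key -> column name -> value
    "Users": {
        "mccv": {"name": "Mark McBride"},
    },
    # Super CF: row key -> super column name -> column name -> value
    "Bookmarks": {
        "http://thesartorialist.blogspot.com": {
            "details": {"title": "The Sartorialist"},
            "tags": {"blog": None, "news": None},
        },
    },
}

rendered = json.dumps(model, indent=2)
print(rendered)
```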

-arin
aka: phatduckk

On Wed, Aug 12, 2009 at 8:35 PM, Jonathan Ellis<[hidden email]> wrote:

> Thanks for taking a stab at this, Mark.
>
> I'm not a fan of teaching this by showing CF-spanning rows.  (The
> bigtable paper does this IIRC but it's wrong. :)
>
> You can have data in different CFs with the same key, yes, but all
> that means is they will be stored on the same nodes.  Each CF is
> stored separately on disk and queried separately and the common case
> is that they _won't_ have keys in common, rather than the reverse.
>
> -Jonathan

Re: Visual representation of Cassandra data model

Michael Koziarski
On Thu, Aug 13, 2009 at 5:12 PM, Arin Sarkissian<[hidden email]> wrote:
> FWIW: I find that the only sane way to visually represent a data model
> is to use a JSON-ish notation.
> Picture type visualizations confuse me even more.
>
> I don't mean to be a downer but me and a lot of my peers found all the
> picture type visual aides even more confusing

I agree, it's generally easier and pretty much everyone understands
JSON-ish notation (though I find Ruby's => notation for hashes is
easier to follow ;))

Having said that, Evan's pictures were really useful:

http://blog.evanweaver.com/files/cassandra/twitter_small.jpg
http://blog.evanweaver.com/files/cassandra/twitter.jpg

--
Cheers

Koz

Re: Visual representation of Cassandra data model

Curt Micol
In reply to this post by Arin Sarkissian
On Thu, Aug 13, 2009 at 1:12 AM, Arin Sarkissian<[hidden email]> wrote:
> FWIW: I find that the only sane way to visually represent a data model
> is to use a JSON-ish notation.
> Picture type visualizations confuse me even more.
>
> I don't mean to be a downer but me and a lot of my peers found all the
> picture type visual aides even more confusing

I can see that.  I've found both to be helpful (Evan's drawings were
very helpful in visualizing his post).

I've put together an attempt here, using Mark's layout but with
Jonathan's information also.  I haven't had time to fill in all the
data, but I think this is in the right direction:

http://www.asenchi.com/~cbm/misc-cassandra/delicious-schema.html


--
# Curt Micol

Re: Visual representation of Cassandra data model

Curt Micol
On Thu, Aug 13, 2009 at 1:34 AM, Curt Micol<[hidden email]> wrote:

> http://www.asenchi.com/~cbm/misc-cassandra/delicious-schema.html

Ok, I took an opportunity to update this a bit while waiting for some
processes to finish at work.

Refresh, and go up a dir to see the matching 'storage-conf.xml'.  I've
also got all of this here:
http://github.com/asenchi/misc-cassandra/tree/master

This diagram and the matching conf feel better to me. I think I am
beginning to "get it". :)

Criticisms, opinions, hate mail?

--
# Curt Micol

Re: Visual representation of Cassandra data model

Colin Mollenhour
In reply to this post by Mark McBride
I'm really glad that you all are working on this.  Cassandra's data model
was, and still is, a big learning curve for me to completely digest due to
the various unknown implications (to Cassandra newbies especially) that the
data model has on performance and usability. This also seems to be changing
somewhat with the Thrift API changes, so it would be really nice to have
a "designing a Cassandra schema for your application" guide.

In your model I don't think it is best to have a general "map" SC with
all of the relations in it since there will be unnecessary
deserialization and network transfer of the map data that you won't
always make use of. I think you should denormalize and use separate CFs
for the various mappings. Cassandra handles lots of keys better than
large SCs from what I understand.  Here is my first stab at the data
model you are working on:

Schema Legend:
<CF or SC name> (SC|CF keyed on <key description>)
<example key>: {<column name>: <value>, ...}
or
<example key>: [<CF name>: {<column name>: <value>, ...}, <CF name>:
{...}, ...]

Delicious Keyspace Schema:
user (CF keyed on nick)
"mccv": {name: "Mark McBride", email: "[hidden email]"}

bookmark (SC keyed on url with CFs for related users and related tags)
"http://thesartorialist.blogspot.com": [details: {title: "The
Sartorialist", other_meta_data: <value>}, users: {"mccv": null}, tags:
{"blog": null, "news": null}]
(storing users here may be overkill, but it is reasonable that when
retrieving a bookmark you will usually want the tags too)

bookmark_tag_users (CF keyed on bookmark|tag containing list of related
users)
"http://thesartorialist.blogspot.com|blog": {"mccv": null, ...}
"http://thesartorialist.blogspot.com|news": {"mccv": null, ...}

user_bookmark_tags (CF keyed on user|bookmark to lookup a user's tags
for a bookmark or all of a user's bookmarks and their tags (using
key_range))
"mccv|http://thesartorialist.blogspot.com": {"blog": null, "news": null,
...}

tag_bookmarks (CF keyed on tag name to lookup all bookmarks for a given tag)
"blog": {"http://thesartorialist.blogspot.com": "The Sartorialist", ...}
"news": {"http://thesartorialist.blogspot.com": "The Sartorialist", ...}

user_tag_bookmarks (CF keyed on user|tag to lookup all bookmarks for a
given tag and user or just a given user (using key_range))
"mccv|blog": {"http://thesartorialist.blogspot.com":"The Sartorialist", ...}
"mccv|news": {"http://thesartorialist.blogspot.com":"The Sartorialist", ...}
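
A minimal Python sketch of what this denormalized schema implies for writes, using in-memory dicts in place of column families. The `add_bookmark` helper is hypothetical; the CF names and key formats follow the example rows above, and the `user` CF and `bookmark` SC are omitted for brevity.

```python
from collections import defaultdict

# One dict per simple CF: row key -> {column name: value}
cfs = {
    name: defaultdict(dict)
    for name in (
        "bookmark_tag_users",
        "user_bookmark_tags",
        "tag_bookmarks",
        "user_tag_bookmarks",
    )
}


def add_bookmark(user: str, url: str, title: str, tags: list) -> None:
    """Write one logical bookmark into every CF that serves a query."""
    for tag in tags:
        cfs["bookmark_tag_users"][f"{url}|{tag}"][user] = None
        cfs["user_bookmark_tags"][f"{user}|{url}"][tag] = None
        cfs["tag_bookmarks"][tag][url] = title
        cfs["user_tag_bookmarks"][f"{user}|{tag}"][url] = title


add_bookmark("mccv", "http://thesartorialist.blogspot.com",
             "The Sartorialist", ["blog", "news"])
```

The cost of the denormalization is visible here: one bookmark with two tags turns into eight column writes, in exchange for each read being a single row lookup.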

I think a good approach to designing a Cassandra schema from scratch is
to make a list of the queries that you *know* you will need to be fast
and then look at your model attempts and see how well they fit while
trying to minimize overhead. Example:
-All bookmarks for a user
-All of a user's bookmarks for a tag
-All bookmarks for a tag
-All tags for a bookmark
-etc..
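
As a sketch of how such a list maps onto lookups, the Python below assumes row keys are kept in sorted order, so that a key-range scan over a prefix such as `mccv|` is possible (in Cassandra this presumably depends on an order-preserving partitioner). The `key_range` function is a stand-in written for this example, not the Thrift call.

```python
from bisect import bisect_left

# user_tag_bookmarks rows keyed on "user|tag". The first two rows come from
# the schema above; the third is invented to show where the scan stops.
rows = {
    "mccv|blog": {"http://thesartorialist.blogspot.com": "The Sartorialist"},
    "mccv|news": {"http://thesartorialist.blogspot.com": "The Sartorialist"},
    "other|blog": {"http://example.com": "Example"},
}
sorted_keys = sorted(rows)


def key_range(prefix: str) -> list:
    """Return all row keys that start with prefix, using the sort order."""
    start = bisect_left(sorted_keys, prefix)
    result = []
    for key in sorted_keys[start:]:
        if not key.startswith(prefix):
            break
        result.append(key)
    return result


# All of a user's bookmarks for a tag: a single row lookup.
blog_bookmarks = rows["mccv|blog"]

# All bookmarks for a user: scan every row whose key starts with "mccv|".
user_keys = key_range("mccv|")
```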

I would start with a highly denormalized schema that consists of only
simple CFs. My take on SCs is that if you know that every time you
retrieve data from one CF for a key you will also retrieve data from
another CF with the same key, then you should probably combine them in an
SC; otherwise they probably need to be in separate simple CFs (due to
the entire SC having to be deserialized in memory just to retrieve a
slice). However, it seems like you can end up with lots of special-purpose
CFs used as maps, and I'm not sure at what point you would want
to simply go with a different database system with a richer querying
capability. I don't know much about Delicious, but it seems that using
natural keys is perfectly acceptable in this case.

I'm sure this isn't the best schema but it is an alternative approach.
I'd really love to see how the experts would model this in a production
system.

Thanks,
Colin
