Re: [Fwd: Re: Greetings!]

Jonathan Ellis-3
On Fri, Jul 31, 2009 at 5:42 PM, Colin Mollenhour<[hidden email]> wrote:

> This reply keeps getting blocked as spam so I am just sending to you
> directly..
> Jonathan, thank you very much for the excellent response. If I may, a few
> more questions (inline):
> > One caveat is that the subcolumns of supercolumns are not indexed.
> > When you query those, Cassandra reads the entire Supercolumn into
> > memory.  So they are best suited for small bunches of attributes, not
> > up to 60k events.
> Given that subcolumns of SCs are not indexed it seems that the only time it
> makes sense to use them is when some or most of the subcolumns will be
> needed within the same request, otherwise you could just have a separate
> simple CF for each sub-group of data. Is there any other reason to use a SC?

Most generally, it's useful when you want a dynamic "container", since
supercolumns can come into existence as needed but CFs are more

> For example, in Evan Weaver's blog post he gives this diagram:
> with subcolumns
> user_timeline and home_timeline of the UserRelationships SC.  But, because
> they will never be requested simultaneously, these would be better off if
> they were each their own simple CF, right?

That's what it looks like to me.

> > If the event names cannot clash with user names then you might just
> > put all of the data / event / permissions data in the same row without
> > extra namespacing.  Otherwise, you will have to put each of those
> > types of data in a single row.  Which is better depends on your query
> > needs.  (My initial impression is the 2nd is a better fit for you
> > here.)
> I'm not sure I follow you here but the reason I had them as SC:CF is that
> pending_events is something I need to be able to add/remove from easily and
> permissions will always be retrieved as a full list. In many cases I think
> these will need to be fetched to serve the same request. What is the
> drawback of this approach that I am failing to see?

My impression was that pending_events is likely to be large, in which
case per the above it is a bad fit for a SC.  Otherwise it is fine.

> > There's a related problem with your type index: Cassandra still
> > materializes entire rows in memory at compaction time (see
> > CASSANDRA-16).  So for now you might want to split those across rows
> > as $type|$journalid, in a simple columnfamily with each row only about
> > that one journal.  Then you can do range queries to get the journals
> > needed, then slice for the events as needed.
> Cool. Will it ever be possible to retrieve the actual columns from a range
> query rather than just the keys within the range?

Yes.  The only question is when someone will need it enough to code it. :)
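
As a rough illustration of the $type|$journalid row-key scheme quoted above, here is a minimal Python sketch. It does not call Cassandra at all: the sorted list stands in for row keys as an order-preserving partitioner would store them (an assumption; key range queries are only meaningful with ordered keys), and the helper names journal_row_key and keys_for_type are made up for this example.

```python
from bisect import bisect_left

SEP = "|"  # separator between type and journal id, as in $type|$journalid


def journal_row_key(journal_type, journal_id):
    """Build a composite row key; hypothetical helper, not a Cassandra API."""
    if SEP in journal_type:
        raise ValueError("journal type must not contain the separator")
    return f"{journal_type}{SEP}{journal_id}"


def keys_for_type(sorted_keys, journal_type):
    """Simulate a key range query: return every key that starts with 'type|'.

    sorted_keys must already be in sorted order, which is what an
    order-preserving partitioner would give you on the server side.
    """
    prefix = journal_type + SEP
    start = bisect_left(sorted_keys, prefix)
    matches = []
    for key in sorted_keys[start:]:
        if not key.startswith(prefix):
            break  # keys are sorted, so the range has ended
        matches.append(key)
    return matches


# Each row holds the events of exactly one journal.
keys = sorted(
    journal_row_key(t, j)
    for t, j in [("blog", "a1"), ("wiki", "c3"), ("blog", "b2")]
)
blog_keys = keys_for_type(keys, "blog")
```

With real data the second step would be a column slice on each returned row to fetch the events; that part depends on the client and is left out here.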

> > One other suggestion would be that it generally simplifies things to
> > use natural keys, rather than surrogate (_id keys).  And if you do use
> > surrogate keys, use UUIDs rather than numeric counters.
> I am having trouble finding anything on how to use UUIDs. Even a search on
> the wiki for UUID has no results and all of the examples set the id
> explicitly.. How do I do this using the Thrift interface?

Column names are byte[] now, and a UUID is just 16 bytes laid out the
right way.  How you generate the UUID in the first place and serialize
it to byte[] is going to be client language dependent.  (For Python,
the tests in test/system/ have an example.)
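
A minimal sketch of the "16 bytes laid out the right way" idea, using only Python's standard uuid module. This is not necessarily what the tests in test/system/ do, and the helper names below are hypothetical; the resulting bytes value is what you would pass wherever the Thrift call expects a byte[] column name.

```python
import uuid


def uuid_to_column_name(u):
    """Return the 16-byte big-endian form of a UUID (hypothetical helper)."""
    return u.bytes


def column_name_to_uuid(raw):
    """Rebuild a UUID from a 16-byte column name (hypothetical helper)."""
    if len(raw) != 16:
        raise ValueError("a UUID column name must be exactly 16 bytes")
    return uuid.UUID(bytes=raw)


# uuid1() is time-based (version 1); uuid4() would give a random UUID instead.
event_id = uuid.uuid1()
name = uuid_to_column_name(event_id)
round_tripped = column_name_to_uuid(name)
```

Which UUID version fits depends on whether you want the names to carry a timestamp; the byte layout is the same 16 bytes either way.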

> > No.  If anything, you may not be denormalizing enough.  Having CFs
> > like the event details off by itself when that's not directly needing
> > to be queried looks fishy.
> The take-away seems to be, "Design your schema as if you are using a
> key/value hash and then group CFs together under a SC only if they are
> frequently retrieved in-full by the same app request.". Is there a point at
> which this wouldn't be true because your data was so denormalized that you
> had too many indexes, or does that just mean that Cassandra is not a good
> fit for the application?

In general, Cassandra is a poor fit where you need to do lots of
ad-hoc queries.  But I don't think that's what you have here.