Hi all, I am new to the Cassandra scene. I have watched presentations,
read papers and articles, run the server with some basic usage, digested
the Thrift interface and absorbed as much info as possible without
frying my brain with all of this stuff. I am working on a web app that
will have some social networking aspects as well as some other features
that involve lots of "event" records and while I have a good enough
understanding to do some damage, I don't feel comfortable writing an app
just yet. I actually started the app with a PHP framework and a MySQL
schema that isn't too complex and have started distilling it into a
Cassandra schema as best I can but this is where I am getting stuck. I'm
not sure if I'm trying to fit a square peg into a round hole or if I am
just not lining it up right so perhaps you can help me?
I've been going off of the "twitter" examples (Evan Weaver, Eric
Florenzano) as my point of reference but have a few questions about
I have for the most part, "users", "journals", and "events".
Events have one of several types (variable) and are either a start-end
range or a single point in time and have various metadata.
Journals have multiple events plus various metadata.
In the lifetime of a journal I am estimating it will accrue 20k-60k events.
Users have multiple journals and can share access to journals with other
users.
Users will own <10 journals but some users might share access to more
than that at once.
I'd like it to scale to as many users as we can get to sign up,
potentially very very many, hence my interest in Cassandra :)
I need to be able to fetch all or latest events with the following
-A specific journal
-All of a user's journals
-A specific event type
-A specific event type for a specific journal
-A specific event type for all of a user's journals
After much deliberation in trying to figure out how to do the above
without having to loop through many many queries here is the schema I
arrived at:
If I am correct in my thinking, all of the above cases can be retrieved
in one or two steps with the maximum number of queries being determined
by the number of journals in question.
Am I wrong to try to reduce the number of indexes and round-trips to the
database by modeling this way?
Some more general questions:
My model assumes the use of get_slice_by_names with a potentially large
number of keys, is that ok?
Cassandra lacks transactions and increment methods, is there a way to
generate unique user ids with just Cassandra as the authority that I am
Is it silly to use short column names for the sake of performance or
storage efficiency? E.g. uid instead of user_id. I like verbose names...
On Tue, Jul 28, 2009 at 4:26 AM, Colin Mollenhour <[hidden email]> wrote:
> I need to be able to fetch all or latest events with the following
> -A specific journal
> -All of a user's journals
> -A specific event type
> -A specific event type for a specific journal
> -A specific event type for all of a user's journals
> After much deliberation in trying to figure out how to do the above
> without having to loop through many many queries here is the schema I
> arrived at:
> If I am correct in my thinking, all of the above cases can be retrieved
> in one or two steps with the maximum number of queries being determined
> by the number of journals in question.
I think you have the right idea. And thanks for taking the trouble to
draw a diagram, that was very useful. :)
One caveat is that the subcolumns of supercolumns are not indexed.
When you query those, Cassandra reads the entire Supercolumn into
memory. So they are best suited for small bunches of attributes, not
up to 60k events.
If the event names cannot clash with user names then you might just
put all of the data / event / permissions data in the same row without
extra namespacing. Otherwise, you will have to put each of those
types of data in its own row. Which is better depends on your query
needs. (My initial impression is the 2nd is a better fit for you
There's a related problem with your type index: Cassandra still
materializes entire rows in memory at compaction time (see
CASSANDRA-16). So for now you might want to split those across rows
keyed as $type|$journalid, in a simple column family where each row
covers only one journal. Then you can do range queries over the keys to
find the journals needed, then slice each row for the events as needed.
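To make the $type|$journalid idea concrete, here is a rough sketch in Python. It is only a toy in-memory stand-in for a simple column family, not Cassandra client code: the TypeIndex class, its method names, the integer timestamps and the "\uffff" upper bound are all assumptions made up for illustration. It also assumes row keys sort in string order, which in Cassandra would need an order-preserving partitioner.

```python
from bisect import bisect_left, insort


def row_key(event_type, journal_id):
    """Build the composite row key: one row per (type, journal) pair."""
    return f"{event_type}|{journal_id}"


class TypeIndex:
    """Toy in-memory model of a simple column family (hypothetical)."""

    def __init__(self):
        # row key -> list of (timestamp, event_id), kept sorted by timestamp
        self.rows = {}

    def add(self, event_type, journal_id, timestamp, event_id):
        columns = self.rows.setdefault(row_key(event_type, journal_id), [])
        insort(columns, (timestamp, event_id))

    def key_range(self, start, finish):
        """Range query over row keys, inclusive, in string order."""
        return sorted(k for k in self.rows if start <= k <= finish)

    def slice(self, key, start_ts, finish_ts):
        """Slice one row for events with start_ts <= timestamp <= finish_ts."""
        columns = self.rows.get(key, [])
        lo = bisect_left(columns, (start_ts,))
        hi = bisect_left(columns, (finish_ts + 1,))
        return columns[lo:hi]


index = TypeIndex()
index.add("sleep", "j1", 100, "e1")
index.add("sleep", "j1", 200, "e2")
index.add("sleep", "j2", 150, "e3")
index.add("meal", "j1", 120, "e4")

# Step 1: range query for every journal row of one event type.
sleep_rows = index.key_range("sleep|", "sleep|\uffff")
# Step 2: slice each row for the time window of interest.
recent = {key: index.slice(key, 150, 250) for key in sleep_rows}
```

Because each row holds one type for one journal, no single row grows past that journal's events of that type, which keeps the compaction-time row size bounded.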
One other suggestion would be that it generally simplifies things to
use natural keys, rather than surrogate (_id keys). And if you do use
surrogate keys, use UUIDs rather than numeric counters.
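As a minimal sketch of the UUID suggestion, using Python's standard uuid module: ids are generated on the client, so no shared counter or transaction is needed. The function names new_user_id and new_event_id are hypothetical, and choosing version 4 for users and version 1 for events is just one plausible split, not a recommendation from Cassandra itself.

```python
import uuid


def new_user_id():
    """Random (version 4) UUID; needs no coordination between clients."""
    return uuid.uuid4()


def new_event_id():
    """Time-based (version 1) UUID; embeds a timestamp and node id."""
    return uuid.uuid1()


user_id = new_user_id()
event_id = new_event_id()
# str(user_id) gives the usual 36-character hyphenated form,
# user_id.bytes gives the compact 16-byte form.
```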
> Am I wrong to try to reduce the number of indexes and round-trips to the
> database by modeling this way?
No. If anything, you may not be denormalizing enough. Having a CF
like the event details off by itself, when it doesn't need to be
queried directly, looks fishy.
> Some more general questions:
> My model assumes the use of get_slice_by_names with a potentially large
> number of keys, is that ok?
For the numbers you are talking about (< 100,000) it should be. Just
be aware that serialization of the request won't be negligible at
those numbers. Using get_slice with start and finish ranges will be
more efficient in that respect.
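A rough back-of-the-envelope sketch of why the range form is cheaper to send: listing names costs bytes per column requested, while a range costs only its two endpoints. This does not model Thrift's actual wire encoding; the payload functions and the 13-digit zero-padded column names are assumptions for illustration only.

```python
from bisect import bisect_left, bisect_right


def by_names_payload(names):
    """Approximate request size when every column name is listed."""
    return sum(len(name.encode("utf-8")) for name in names)


def by_range_payload(start, finish):
    """Approximate request size when only the two endpoints are sent."""
    return len(start.encode("utf-8")) + len(finish.encode("utf-8"))


def slice_range(sorted_names, start, finish):
    """Columns with start <= name <= finish, given names in sorted order."""
    lo = bisect_left(sorted_names, start)
    hi = bisect_right(sorted_names, finish)
    return sorted_names[lo:hi]


# Hypothetical column names: zero-padded timestamps so they sort correctly.
columns = [f"{ts:013d}" for ts in range(60_000)]
wanted = slice_range(columns, columns[10_000], columns[29_999])

names_cost = by_names_payload(wanted)
range_cost = by_range_payload(wanted[0], wanted[-1])
```

The range form only helps when the wanted columns are contiguous in sort order, which is one more reason to pick column names (timestamps, for instance) that sort the way you query.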
> Cassandra lacks transactions and increment methods, is there a way to
> generate unique user ids with just Cassandra as the authority that I am
Yeah, UUIDs as above.
> Is it silly to use short column names for the sake of performance or
> storage efficiency? E.g. uid instead of user_id. I like verbose names...
IMO, that is unlikely to make the difference between a workable
solution and an unworkable one.