Concurrent updates

Ivan Chang
I have the following scenario that I would like the best solution for.
 
Here's the scenario:
 
Table1.Standard1['cassandra']['frequency']
 
It is used to keep track of how many times the word "cassandra" has appeared.
 
Let's say we have a bunch of articles stored in Hadoop. A Map/Reduce job greps
all articles throughout the Hadoop cluster that match the pattern ^cassandra$
and updates Table1.Standard1['cassandra']['frequency'].  Hence
Table1.Standard1['cassandra']['frequency'] will be updated concurrently.
 
One of the issues I am facing is that Table1.Standard1['cassandra']['frequency']
stores the count as a String (I am using Java). So in order to update the frequency
properly, the thread running the Map/Reduce has to retrieve
Table1.Standard1['cassandra']['frequency'] in its native String format, hold
it in temp (a Java String), convert it to an int, add the new counts, and finally issue
"SET Table1.Standard1['cassandra']['frequency']. =  '" + temp.toString() + "'"
 
During the entire process, how do we guarantee correctness under concurrency?  The CQL SET does
not allow something like
 
SET Table1.Standard1['cassandra']['frequency']. = Table1.Standard1['cassandra']['frequency']. + newCounts
 
since there's only one String type.
 
What would be the best solution in this situation?
 
Thanks,
Ivan

Re: Concurrent updates

Michael Greene
Even if CQL SET allowed for the operation you're describing, it's at
odds with the availability and consistency constraints of Cassandra.
Another process, somewhere else, could be reading and writing that
frequency value at the same time.  Reducing the operation to one
statement does not make it transactional or idempotent.
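
To make the race concrete, here is a minimal Java sketch of the read-modify-write sequence from the original post. It is not Cassandra client code: a `HashMap` stands in for the column, and the names `LostUpdateDemo`, `read` and `write` are invented for illustration. The two "mappers" are interleaved by hand so the outcome is deterministic.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for Table1.Standard1['cassandra']['frequency'];
// an in-memory map, not a Cassandra client.
class LostUpdateDemo {
    static final Map<String, String> column = new HashMap<>();

    // Fetch the stored String and convert it to an int.
    static int read() {
        return Integer.parseInt(column.get("frequency"));
    }

    // Convert back to a String and overwrite the stored value.
    static void write(int value) {
        column.put("frequency", Integer.toString(value));
    }

    // Interleaves two mappers by hand: both read before either writes.
    static int run() {
        column.put("frequency", "10");
        int seenByA = read();   // mapper A reads 10
        int seenByB = read();   // mapper B reads 10
        write(seenByA + 5);     // A writes 15
        write(seenByB + 3);     // B writes 13, overwriting A's update
        return read();
    }

    public static void main(String[] args) {
        // prints "expected 18, got 13"
        System.out.println("expected 18, got " + run());
    }
}
```

Folding the read and the write into a single statement would not change this outcome unless the store also serialized the two writers.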

Unless you are looking for estimates in that cell and the delay
between processing updates to that cell is large enough to provide
reasonable estimates, you will want to look at a queueing solution or
a transaction solution outside of Cassandra.  There are a few issues
open in JIRA that would allow you to up the consistency on this
particular read/write call to ensure that you are getting better
estimates, but this is a scenario that Cassandra does not handle well.

If you can think of a way to model your operation to be idempotent,
then that would be preferable.  Otherwise an external queue (such as
AMQP) or transaction system (such as ZooKeeper) is all I can think of
at the moment.
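
One possible reading of "idempotent" here, sketched in Java under assumptions of my own: instead of adding to a shared value, each unit of work overwrites an entry that only it owns, so a retried write changes nothing. Keying by article id is a hypothetical choice, and the class and method names are invented for illustration.

```java
import java.util.HashMap;
import java.util.Map;

class IdempotencyDemo {
    // Non-idempotent: a retried increment is applied a second time.
    static int incrementTwice(int start, int delta) {
        int value = start;
        value += delta;   // first delivery
        value += delta;   // retry of the same update
        return value;
    }

    // Idempotent: each article owns one entry, so a retry rewrites the
    // same value and the total is unchanged.
    static int recordTwice(String articleId, int matches) {
        Map<String, Integer> perArticle = new HashMap<>();
        perArticle.put(articleId, matches);   // first delivery
        perArticle.put(articleId, matches);   // retry of the same update
        int total = 0;
        for (int n : perArticle.values()) {
            total += n;
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(incrementTwice(0, 5));          // prints 10
        System.out.println(recordTwice("article-1", 5));   // prints 5
    }
}
```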

Michael

On Fri, Jul 17, 2009 at 9:14 AM, Ivan Chang<[hidden email]> wrote:


Re: Concurrent updates

Jun Rao
In reply to this post by Ivan Chang

This is a case where a test-and-set feature would be useful. See the following JIRA. We just don't have it nailed down yet.
https://issues.apache.org/jira/browse/CASSANDRA-48
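
For illustration only, this is roughly what a test-and-set retry loop looks like in plain Java, using `ConcurrentHashMap.replace(key, oldValue, newValue)` as the conditional write. It is not the API discussed in CASSANDRA-48, which is not settled yet; the in-memory map and the name `TestAndSetDemo` are assumptions made for the sketch.

```java
import java.util.concurrent.ConcurrentHashMap;

// Illustration only: an in-memory map stands in for the column.
class TestAndSetDemo {
    static final ConcurrentHashMap<String, String> column = new ConcurrentHashMap<>();

    // Adds delta with a test-and-set loop: the write succeeds only if the
    // stored String is still the one this caller read; otherwise it retries.
    static void add(int delta) {
        while (true) {
            String seen = column.get("frequency");
            String next = Integer.toString(Integer.parseInt(seen) + delta);
            if (column.replace("frequency", seen, next)) {
                return;
            }
        }
    }

    // Two threads each add 1 a thousand times; no update is lost.
    static int run() {
        column.put("frequency", "0");
        Runnable mapper = () -> {
            for (int i = 0; i < 1000; i++) {
                add(1);
            }
        };
        Thread a = new Thread(mapper);
        Thread b = new Thread(mapper);
        a.start();
        b.start();
        try {
            a.join();
            b.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return Integer.parseInt(column.get("frequency"));
    }

    public static void main(String[] args) {
        System.out.println(run()); // prints 2000
    }
}
```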

Jun
IBM Almaden Research Center
K55/B1, 650 Harry Road, San Jose, CA 95120-6099

[hidden email]


Re: Concurrent updates

Jonathan Ellis-3
This is the kind of inconsistency that vector clocks can handle but
the more simplistic timestamp-based resolution cannot.

Between test-and-set and vector clocks, vector clocks fit Cassandra much better.
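
A rough Java sketch of the difference, not Cassandra code: timestamp resolution keeps whichever write carries the larger timestamp, while comparing vector clocks shows that two writes were made independently, so the conflict is visible instead of silently dropped. How the conflict is then merged is left open here. All names (`VectorClockDemo`, `tick`, `descends`) are invented for illustration.

```java
import java.util.HashMap;
import java.util.Map;

class VectorClockDemo {
    // Timestamp resolution: the write with the larger timestamp simply wins.
    static int lastWriteWins(int valueA, long tsA, int valueB, long tsB) {
        return tsA >= tsB ? valueA : valueB;
    }

    // Returns a copy of the clock with this node's entry advanced by one.
    static Map<String, Integer> tick(Map<String, Integer> clock, String node) {
        Map<String, Integer> next = new HashMap<>(clock);
        next.merge(node, 1, Integer::sum);
        return next;
    }

    // True if clock a has seen every event that clock b has seen.
    static boolean descends(Map<String, Integer> a, Map<String, Integer> b) {
        for (Map.Entry<String, Integer> e : b.entrySet()) {
            if (a.getOrDefault(e.getKey(), 0) < e.getValue()) {
                return false;
            }
        }
        return true;
    }

    // Neither clock has seen the other's write: the writes conflict.
    static boolean concurrent(Map<String, Integer> a, Map<String, Integer> b) {
        return !descends(a, b) && !descends(b, a);
    }

    // Two writers start from the same version and write independently.
    static boolean detectsConflict() {
        Map<String, Integer> base = new HashMap<>();
        Map<String, Integer> fromA = tick(base, "A");
        Map<String, Integer> fromB = tick(base, "B");
        return concurrent(fromA, fromB);
    }

    public static void main(String[] args) {
        // Writer A stored 15 at time 100, writer B stored 13 at time 101.
        System.out.println(lastWriteWins(15, 100L, 13, 101L)); // prints 13
        System.out.println(detectsConflict());                 // prints true
    }
}
```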

-Jonathan

On Fri, Jul 17, 2009 at 9:59 AM, Jun Rao<[hidden email]> wrote:


Re: Concurrent updates

Sandeep Tata
You could (for now) store counters in
Table1.Standard1['cassandra']['frequency-mapperid'].
At the end, you do a get_slice and add them up.
This is really bad for fault tolerance -- you'll get wrong counts if
mappers are restarted because of failures. But then, you'd have the
same problem if you (transactionally) incremented a single counter
too.
This way, modulo failures, your answer is still correct.
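
A minimal Java sketch of this layout, assuming an in-memory sorted map in place of the row: each mapper writes only its own `frequency-<mapperid>` column, and the total is computed afterwards by walking those columns, which imitates the get_slice step without using any Cassandra API. The mapper ids and counts are made up.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentSkipListMap;

// Illustration only: the sorted map stands in for the columns of one row.
class PerMapperCounters {
    static final ConcurrentSkipListMap<String, String> row = new ConcurrentSkipListMap<>();

    // Each mapper writes only its own column, so writers never collide.
    static void report(String mapperId, int count) {
        row.put("frequency-" + mapperId, Integer.toString(count));
    }

    // Imitates the slice-and-add step: sum every frequency-* column.
    static int total() {
        int sum = 0;
        for (Map.Entry<String, String> column : row.entrySet()) {
            if (column.getKey().startsWith("frequency-")) {
                sum += Integer.parseInt(column.getValue());
            }
        }
        return sum;
    }

    // Three mappers report concurrently; the sum is taken after all finish.
    static int run() {
        row.clear();
        Thread[] mappers = {
            new Thread(() -> report("m1", 5)),
            new Thread(() -> report("m2", 3)),
            new Thread(() -> report("m3", 7)),
        };
        for (Thread t : mappers) {
            t.start();
        }
        try {
            for (Thread t : mappers) {
                t.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return total();
    }

    public static void main(String[] args) {
        System.out.println(run()); // prints 15
    }
}
```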



On Fri, Jul 17, 2009 at 8:41 AM, Jonathan Ellis<[hidden email]> wrote:
