Cassandra and Pig - how to get column values?

Eric Lee
Hey guys,

I'm having a problem with Pig and Cassandra and was hoping someone could point me in the right direction. I've set up Pig and Cassandra and I'm able to run through the example shown in the README.txt - I can view a list of the top column names. That's all good stuff.

What I would like to do next is just dump out the column values. Suppose I have a very simple column family called User. To that column family I've added 2 rows of data; each row has just 1 column, 'userName'. I'm using a GUID as my key.

When I load and dump my rows, I get some data like:

(6c7fef29-16dd-44ca-bde1-f53995b2e818,{(userName,someUserName1)})
(8be0b934-45aa-444f-90e2-ce7137a73b68,{(userName,someUserName2)})
(c51fc8ce-2a53-46bb-b872-0f644b972f62,{(userName,someUserName3)})

As I understand it, at this point, the GUID is $0 and $1 is the bag that contains my columns.

So, like in the README, I run:

cols = FOREACH rows GENERATE flatten($1);

As I understand it, when I flatten a bag, I get a set of tuples. When I dump cols, I get the following:

(userName,someUserName1)
(userName,someUserName2)
(userName,someUserName3)

If I continue with the README, I would run colnames = FOREACH cols GENERATE $0 to give me the column names.

I'm a little confused why I only get column names - when I do a describe on cols, I get the following:

cols: {bytearray}

It seems like $0 should be the entire line (userName,someUserName1), not just the column name.

Anyway, what I really want is the column value, not the name. Is there a way to do that? Here are the attempts that failed:
  • colnames = FOREACH cols GENERATE $1; I was told $1 was out of bounds.
  • casted = FOREACH cols GENERATE (tuple(chararray, chararray))$0; all I got back were empty tuples.
  • values = FOREACH cols GENERATE $0.$1; I got an error telling me a data byte array can't be cast to a tuple.
So I'm stuck - any help would be greatly appreciated.

Thanks!

Eric.


Re: Cassandra and Pig - how to get column values?

Eric Lee
I have this working now with the following:

rows = LOAD 'cassandra://TwitterExample/User' using CassandraStorage();
cols = FOREACH rows GENERATE FLATTEN((bag{tuple(chararray,chararray)})$1);
users = FOREACH cols GENERATE $1;

Not sure if that operation with cols is correct or not, but it appears to be working. Any thoughts would be appreciated.

Eric.
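
For readers unsure what FLATTEN followed by a positional reference does, the two steps above can be modelled in plain Python. This is only a sketch of Pig's data model (a row as a tuple, a bag as a list of tuples), not Pig or CassandraStorage code. The function name flatten_columns is made up for illustration, and the data is the sample output from the original question.

```python
# Plain-Python model of Pig's data model; this is not Pig or Cassandra code.
# A row is (key, bag); a bag is a list of (name, value) tuples.
rows = [
    ("6c7fef29-16dd-44ca-bde1-f53995b2e818", [("userName", "someUserName1")]),
    ("8be0b934-45aa-444f-90e2-ce7137a73b68", [("userName", "someUserName2")]),
    ("c51fc8ce-2a53-46bb-b872-0f644b972f62", [("userName", "someUserName3")]),
]


def flatten_columns(rows):
    """Model of FOREACH rows GENERATE FLATTEN($1): one output tuple per bag element."""
    return [col for _key, bag in rows for col in bag]


cols = flatten_columns(rows)

# Model of FOREACH cols GENERATE $1: take the second field of every tuple.
users = [col[1] for col in cols]

print(users)  # ['someUserName1', 'someUserName2', 'someUserName3']
```

In this model the cast in the Pig script has no counterpart, because Python already knows each element is a two-field tuple. Judging from the DESCRIBE output in the question, Pig had no such schema for the bag's contents until the cast supplied one, which would explain why $1 was out of bounds before.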

On Fri, Oct 15, 2010 at 8:02 PM, Eric Lee <[hidden email]> wrote:
[...]

--
WonderAffect

http://www.wonderaffect.com
http://www.wonderaffect.com/blog

Re: Cassandra and Pig - how to get column values?

Brandon Williams
On Sat, Oct 16, 2010 at 3:55 PM, Eric Lee <[hidden email]> wrote:
> I have this working now with the following:
>
> rows = LOAD 'cassandra://TwitterExample/User' using CassandraStorage();
> cols = FOREACH rows GENERATE FLATTEN((bag{tuple(chararray,chararray)})$1);
> users = FOREACH cols GENERATE $1;
>
> Not sure if that operation with cols is correct or not, but it appears to be working. Any thoughts would be appreciated.

You can do what you want in a single pass by dereferencing the bag:

rows = LOAD 'cassandra://TwitterExample/User' USING CassandraStorage() AS (key, columns: bag {T: tuple(name, value)});
values = FOREACH rows GENERATE columns.value;

-Brandon
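
One thing worth noting about the shape of the result: projecting a field out of a bag (columns.value) keeps one output record per row, each holding a bag of one-field tuples, whereas FLATTEN emits one record per column. The sketch below models that difference in plain Python. It is not Pig or Cassandra code; the keys, the column data, and the helper name project_field are hypothetical.

```python
# Plain-Python model of Pig bag projection; not Pig or Cassandra code.
# Hypothetical data: each row is (key, bag of (name, value) tuples).
rows = [
    ("key1", [("userName", "alice"), ("email", "alice@example.com")]),
    ("key2", [("userName", "bob")]),
]


def project_field(bag, index):
    """Model of bag.field: keep only one field of every tuple in the bag."""
    return [(t[index],) for t in bag]


# Model of: values = FOREACH rows GENERATE columns.value;
# Each output record has a single field, which is itself a bag.
values = [(project_field(bag, 1),) for _key, bag in rows]

# Model of flattening the projected bag: one record per column value.
flat_values = [v for (bag,) in values for v in bag]

print(len(values))       # 2: one record per row
print(len(flat_values))  # 3: one record per column
```

Which shape is preferable depends on what comes next: the nested form keeps a row's values together, while the flattened form is easier to group or count.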

Re: Cassandra and Pig - how to get column values?

Eric Lee
Ah nice, even better, thanks!

On Sat, Oct 16, 2010 at 2:31 PM, Brandon Williams <[hidden email]> wrote:
[...]


Re: Cassandra and Pig - how to get column values?

tamilmani
This post has NOT been accepted by the mailing list yet.
In reply to this post by Brandon Williams
If I use

GENERATE columns.value
The result data type becomes col_val: {list: {(value: chararray)}}

({(234561),(bcdefa),(xxx.com)})

So if I try to access it tuple-wise, I get:

GENERATE list.$1

Index 1 out of range in schema:value:chararray


What would be the reason?

Regards,
Tamil
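
A likely explanation, going by the schema shown above: in Pig, list.$1 asks for the second field of every tuple in the bag, not for the second tuple in the bag. Each tuple here has a single field (value), so only $0 exists. Bags are also unordered, so individual tuples cannot be addressed by position. The plain-Python sketch below models this; it is not Pig code, and the helper name project and its error message are invented for illustration.

```python
# Plain-Python model of projecting a field from a Pig bag; not Pig code.
# The bag from the post above: three tuples, each with exactly one field.
col_val = [("234561",), ("bcdefa",), ("xxx.com",)]


def project(bag, index):
    """Model of bag.$index: take field `index` from every tuple in the bag."""
    out = []
    for t in bag:
        if index >= len(t):
            # Invented message; Pig reports the problem in its own words.
            raise IndexError(f"index {index} out of range for tuple of width {len(t)}")
        out.append((t[index],))
    return out


print(project(col_val, 0))  # the same three one-field tuples

try:
    project(col_val, 1)     # there is no second field in any tuple
except IndexError as err:
    print("projection failed:", err)
```

If the goal is to handle the values one at a time, flattening the bag so that each value becomes its own record is probably the simpler route.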