
Reading hundreds of thousands of rows at once?


Reading hundreds of thousands of rows at once?

John Anderson
Hey, I'm looking at querying around 500,000 rows that I need to pull into a Pandas data frame for processing. Testing this on a single Cassandra node, it currently takes around 21 seconds:

https://gist.github.com/sontek/4ca95f5c5aa539663eaf
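In rough outline, the single-process read looks something like this (a simplified sketch rather than the gist itself; the keyspace, table, and column names are placeholders, not my actual schema):

import pandas as pd
from cassandra.cluster import Cluster

cluster = Cluster(['127.0.0.1'])
session = cluster.connect('survey_keyspace')  # placeholder keyspace

# Larger pages mean fewer round trips; the driver pages transparently.
session.default_fetch_size = 10000

rows = session.execute('SELECT * FROM survey_responses')  # ~500,000 rows
df = pd.DataFrame(list(rows))
print(len(df))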

I tried introducing multiprocessing so I could use 4 processes at a time to query this, and that got it down to 14 seconds:

https://gist.github.com/sontek/542f13307ef9679c0094
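(Again only a sketch, not the gist itself: one common way to fan this kind of read out over worker processes is to split the token ring, assuming the default Murmur3 partitioner, the same placeholder table name as above, and survey_id as the partition key.)

from multiprocessing import Pool

import pandas as pd
from cassandra.cluster import Cluster

NUM_WORKERS = 4

def fetch_range(bounds):
    """Read one slice of the token ring in its own process."""
    lo, hi = bounds
    cluster = Cluster(['127.0.0.1'])  # each worker opens its own connection
    session = cluster.connect('survey_keyspace')
    rows = session.execute(
        'SELECT * FROM survey_responses '
        'WHERE token(survey_id) > %s AND token(survey_id) <= %s',
        (lo, hi))
    df = pd.DataFrame(list(rows))
    cluster.shutdown()
    return df

if __name__ == '__main__':
    # Split the Murmur3 token range into equal slices, one per worker.
    MIN_TOKEN, MAX_TOKEN = -2**63, 2**63 - 1
    step = (MAX_TOKEN - MIN_TOKEN) // NUM_WORKERS
    bounds = [(MIN_TOKEN + i * step,
               MAX_TOKEN if i == NUM_WORKERS - 1 else MIN_TOKEN + (i + 1) * step)
              for i in range(NUM_WORKERS)]

    with Pool(NUM_WORKERS) as pool:
        frames = pool.map(fetch_range, bounds)
    df = pd.concat(frames, ignore_index=True)
    print(len(df))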

Although shaving off 7 seconds is great, it still isn't where I'd like to be in terms of performance; for this many rows I'd really like to get the query time down to a maximum of 1-2 seconds.

What kinds of optimizations can I make to improve read performance when querying a large set of data? Will this timing improve linearly as I add more nodes?

This is what the schema looks like currently:

https://gist.github.com/sontek/d6fa3fc1b6d085ad3fa4


I'm not tied to the current schema at all; it's mostly just a replication of what we have in SQL Server. I'm more interested in what I can change to make querying it faster.

Thanks,
John

Re: Reading hundreds of thousands of rows at once?

Anishek Agarwal
I think these will help speed things up:

- Removing compression.
- You mention a lot of independent columns. If you are always going to query all of them together, another thing that will help is storing a full JSON (or some custom object representation) of the value data and changing the model to just survey_id, hour_created, respondent_id, json_value. Both changes are sketched below.
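A minimal sketch of both suggestions, with placeholder names rather than the actual schema (the compression syntax below is for Cassandra 2.x; on 3.0+ it is WITH compression = {'enabled': false}):

import json
import uuid
from datetime import datetime

from cassandra.cluster import Cluster

cluster = Cluster(['127.0.0.1'])
session = cluster.connect('survey_keyspace')  # placeholder keyspace

# One narrow row per respondent, with the independent values packed into a
# single JSON text column, and compression disabled on the table.
session.execute("""
    CREATE TABLE IF NOT EXISTS survey_responses_json (
        survey_id     int,
        hour_created  timestamp,
        respondent_id uuid,
        json_value    text,
        PRIMARY KEY ((survey_id), hour_created, respondent_id)
    ) WITH compression = {'sstable_compression': ''}
""")

# Writes serialize the per-respondent values once...
values = {'answer_1': 'yes', 'answer_2': 'no', 'answer_3': '42'}
session.execute(
    'INSERT INTO survey_responses_json '
    '(survey_id, hour_created, respondent_id, json_value) '
    'VALUES (%s, %s, %s, %s)',
    (123, datetime(2015, 4, 22, 1, 0), uuid.uuid4(), json.dumps(values)))

# ...and reads pull back a single column per row and decode it client side.
for row in session.execute(
        'SELECT json_value FROM survey_responses_json WHERE survey_id = %s',
        (123,)):
    data = json.loads(row.json_value)

Whether this wins depends on your data, but the idea is that fewer, fatter cells mean less per-column overhead on both the read path and the wire.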



Re: Reading hundreds of thousands of rows at once?

Anishek Agarwal
You might also want to go through a thread here with the subject "High latencies for simple queries".




Re: Reading hundreds of thousands of rows at once?

Robert Wille-2
In reply to this post by John Anderson
Add more nodes to your cluster.

