I have a requirement to fetch a million rows as the result of my query, which is giving timeout errors.
I am fetching results by selecting on clustering columns, so why are the queries taking so long? I can change the timeout settings, but I need the data to be fetched faster as per my requirement.
My table definition is:
CREATE TABLE images.results (
    uuid uuid,
    analysis_execution_id varchar,
    analysis_execution_uuid uuid,
    x double,
    y double,
    loc varchar,
    w double,
    h double,
    normalized varchar,
    type varchar,
    filehost varchar,
    filename varchar,
    image_uuid uuid,
    image_uri varchar,
    image_caseid varchar,
    image_mpp_x double,
    image_mpp_y double,
    image_width double,
    image_height double,
    objective double,
    cancer_type varchar,
    Area float,
    submit_date timestamp,
    points list<double>,
    PRIMARY KEY ((image_caseid), Area, uuid)
);
Here each row is uniquely identified by its uuid. But since my data is generally queried by image_caseid, I have made that the partition key.
I am currently using the DataStax Java API to fetch the results, but the query takes so long that it results in timeout errors:
Exception in thread "main" com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: localhost/127.0.0.1:9042 (com.datastax.driver.core.exceptions.DriverException: Timed out waiting for server response))
Caused by: com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: localhost/127.0.0.1:9042 (com.datastax.driver.core.exceptions.DriverException: Timed out waiting for server response))
Also, when I try the same query on the console, even with a limit of 2000 rows:
cqlsh:images> select count(*) from results where image_caseid='TCGA-HN-A2NL-01Z-00-DX1' and Area<100 and Area>20 limit 2000;
Thanks and Regards,
Try setting fetchsize before querying. Assuming you don't set it too high, and you don't have too many tombstones, that should do it.
On Wed, Mar 18, 2015 at 2:58 AM, Mehak Mehta <[hidden email]> wrote:
I have tried a fetch size of 10000, but it's still not giving any results.
My expectation was that Cassandra can handle a million rows easily.
Is there any mistake in the way I am defining the keys or querying them?
On Wed, Mar 18, 2015 at 3:02 AM, Jens Rantil <[hidden email]> wrote:
Have you tried a smaller fetch size, such as 2k to 5k?
On Wed, Mar 18, 2015 at 12:22 PM, Mehak Mehta <[hidden email]> wrote:
Yes, it works for 1000 but not for more than that.
How can I fetch all the rows efficiently this way?
On Wed, Mar 18, 2015 at 3:29 AM, Ali Akhtar <[hidden email]> wrote:
Perhaps just fetch them in batches of 1000 or 2000? For 1m rows, it seems like the difference would only be a few minutes. Do you have to do this all the time, or only once in a while?
On Wed, Mar 18, 2015 at 12:34 PM, Mehak Mehta <[hidden email]> wrote:
We have a UI which needs this data for rendering.
So the efficiency of pulling this data matters a lot. It should be fetched within a minute.
Is there a way to achieve such efficiency?
On Wed, Mar 18, 2015 at 4:06 AM, Ali Akhtar <[hidden email]> wrote:
I would probably do this in a background thread and cache the results, that way when you have to render, you can just cache the latest results.
I don't know why Cassandra doesn't seem to be able to fetch large batch sizes. I've also run into these timeouts, but reducing the batch size to 2k seemed to work for me.
On Wed, Mar 18, 2015 at 1:24 PM, Mehak Mehta <[hidden email]> wrote:
Sorry, meant to say "that way when you have to render, you can just display the latest cache."
On Wed, Mar 18, 2015 at 1:30 PM, Ali Akhtar <[hidden email]> wrote:
The rendering tool renders a portion of a very large image. It may fetch different data each time from billions of rows.
So I don't think I can cache such large results, since the same results will rarely be fetched again.
Also, do you know how I can do 2D range queries using Cassandra? Some other users suggested that I use Solr.
But is there any way I can achieve that without using any other technology?
On Wed, Mar 18, 2015 at 4:33 AM, Ali Akhtar <[hidden email]> wrote:
How often does the data change?
I would still recommend caching of some kind, but without knowing more details (how often the data is changing, what you're doing with the 1m rows after getting them, etc.) I can't recommend a solution.
I did see your other thread. I would also vote for Elasticsearch / Solr; they are more suited to the kind of analytics you seem to be doing. Cassandra is more for storing data; it isn't all that great for complex queries / analytics.
If you want to stick to Cassandra, you might have better luck if you made your range columns part of the primary key, so something like PRIMARY KEY(caseId, x, y).
On Wed, Mar 18, 2015 at 1:41 PM, Mehak Mehta <[hidden email]> wrote:
Data won't change much but queries will be different.
I am not working on the rendering tool myself, so I don't know many details about it.
Also, as you suggested, I tried to fetch the data with a fetch size of 500 or 1000 using the Java driver's automatic pagination.
It fails when the number of records is high (around 100000) with the following error:
On Wed, Mar 18, 2015 at 4:47 AM, Ali Akhtar <[hidden email]> wrote:
If even 500-1000 isn't working, then your Cassandra node might not be up.
1) Try running nodetool status from the shell on your Cassandra server and make sure the nodes are up.
2) Are you calling this from the same server where Cassandra is running? It's trying to connect to localhost. If you're running it on a different server, try passing in the IP address of your Cassandra server.
On Wed, Mar 18, 2015 at 2:05 PM, Mehak Mehta <[hidden email]> wrote:
Currently there is only a single node, which I am calling directly, with around 150000 rows. The full data set will be around billions of rows per node.
The code is working only for a fetch size of 100/200. Also, the consecutive fetching is taking around 5-10 seconds.
I have a parallel script which is inserting data while I am reading it. When I stopped the script, it worked for 500/1000 but not for more than that.
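The figures in this post can be turned into a rough estimate of how long a full scan would take. The sketch below assumes the 5-10 seconds quoted is the time for each page of results (the post does not say so explicitly); the class and method names are hypothetical and only illustrate the arithmetic.

```java
// Hypothetical helper: back-of-the-envelope paging cost, not part of any driver API.
final class FetchEstimate {

    // Number of round trips needed to page through totalRows at the given fetch size.
    static long pages(long totalRows, int fetchSize) {
        return (totalRows + fetchSize - 1) / fetchSize; // ceiling division
    }

    // Total time if every page takes secondsPerPage (assumption: pages are fetched one after another).
    static double totalSeconds(long totalRows, int fetchSize, double secondsPerPage) {
        return pages(totalRows, fetchSize) * secondsPerPage;
    }

    public static void main(String[] args) {
        // Numbers taken from the thread: 150000 rows, fetch size 200, 5 s per page (lower bound).
        System.out.println(pages(150_000L, 200) + " pages");
        System.out.println(totalSeconds(150_000L, 200, 5.0) + " seconds");
    }
}
```

Under that assumption, 150000 rows at 200 rows per page is 750 pages, or about 3750 seconds at 5 seconds per page, which is far from the one-minute target mentioned earlier in the thread.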
On Wed, Mar 18, 2015 at 5:08 AM, Ali Akhtar <[hidden email]> wrote:
What's your memory / CPU usage at? And how much ram + cpu do you have on this server?
On Wed, Mar 18, 2015 at 2:31 PM, Mehak Mehta <[hidden email]> wrote:
Currently the Cassandra Java process is taking 1% of CPU (8% is being used in total) and 14.3% of memory (out of 4 GB total).
As you can see, there is not much load from other processes.
Should I try changing the default memory parameters in the Cassandra settings?
On Wed, Mar 18, 2015 at 5:33 AM, Ali Akhtar <[hidden email]> wrote:
Yeah, it may be that the process is being limited by swap. This page:
Lines 42 - 48 list a few settings that you could try out for increasing / reducing the memory limits (assuming you're on linux).
Also, are you using an SSD? If so, make sure the IO scheduler is noop or deadline.
On Wed, Mar 18, 2015 at 2:48 PM, Mehak Mehta <[hidden email]> wrote:
4 GB also seems small for the kind of load you are trying to handle (billions of rows).
I would also try adding more nodes to the cluster.
On Wed, Mar 18, 2015 at 2:53 PM, Ali Akhtar <[hidden email]> wrote:
Yes, I have a cluster of 10 nodes in total, but I am currently testing with just one node.
Total data for all nodes will exceed 5 billion rows. But I may have memory on other nodes.
On Wed, Mar 18, 2015 at 6:06 AM, Ali Akhtar <[hidden email]> wrote:
Cassandra can certainly handle millions and even billions of rows, but... it is a very clear anti-pattern to design a single query to return more than a relatively small number of rows except through paging. How small? Low hundreds is probably a reasonable limit. It is also an anti-pattern to filter or analyze a large number of rows in a single query - that's why there are so many crazy restrictions and the requirement to use ALLOW FILTERING - to reinforce that Cassandra is designed for short and performant queries, not large-scale retrieval of a large number of rows. As a general rule, the use of ALLOW FILTERING is an anti-pattern and a yellow flag that you are doing something wrong.
As a minor point, check your partition key - you should try to "bucket" rows that will tend to be accessed together, so that they have locality and can be fetched together.
Rather than using a raw x and y coordinate range, consider indexing by a "chunk" number and then you can query by chunk number for direct access to the partition and row key, without the need for inequality filtering.
-- Jack Krupansky
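One way the "chunk" suggestion above could be sketched on the client side is to map coordinates to fixed-size square tiles and derive a chunk number from the tile position. This is only an illustration under stated assumptions: the tile size of 1024, the tilesPerRow parameter, and the ChunkIndex class are hypothetical and do not come from the thread, and coordinates are assumed to be non-negative.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: maps x/y coordinates to fixed-size square tiles ("chunks").
final class ChunkIndex {

    // Assumed tile edge length in coordinate units; pick a value that suits the data.
    static final int CHUNK_SIZE = 1024;

    // Tile column or row that a coordinate falls into (coordinates assumed non-negative).
    static long tileOf(double coord) {
        return (long) Math.floor(coord / CHUNK_SIZE);
    }

    // Single chunk number for a point, given how many tile columns the image has.
    static long chunkId(double x, double y, long tilesPerRow) {
        return tileOf(y) * tilesPerRow + tileOf(x);
    }

    // Chunk numbers of every tile overlapping the rectangle [minX, maxX] x [minY, maxY].
    static List<Long> chunksCovering(double minX, double minY,
                                     double maxX, double maxY, long tilesPerRow) {
        List<Long> ids = new ArrayList<>();
        for (long row = tileOf(minY); row <= tileOf(maxY); row++) {
            for (long col = tileOf(minX); col <= tileOf(maxX); col++) {
                ids.add(row * tilesPerRow + col);
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        long tilesPerRow = 100; // assumption: image is at most 100 tiles wide
        System.out.println(chunkId(1500.0, 2500.0, tilesPerRow));
        System.out.println(chunksCovering(1000, 1000, 2100, 1100, tilesPerRow));
    }
}
```

With a chunk number stored as part of the key, a 2D range query becomes a handful of equality lookups, one per chunk returned by chunksCovering, with any remaining rows outside the exact rectangle discarded on the client.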
On Wed, Mar 18, 2015 at 3:22 AM, Mehak Mehta <[hidden email]> wrote: