Read performance

Read performance

Alprema
Hi,

I am writing an application that will periodically read large amounts of data from Cassandra, and I am seeing odd performance.

My column family is a classic time-series one, with the series ID and the day as the partition key and a timestamp as the clustering key; the value is a double.
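
For reference, the table definition looks roughly like this (a simplified sketch; the types are inferred from the query below, so the exact definition may differ):

-- simplified sketch of the schema, not the exact definition
CREATE TABLE "Metric_OneSec" (
    "MetricId" uuid,
    "Day"      timestamp,
    "UtcDate"  timestamp,
    "Value"    double,
    PRIMARY KEY (("MetricId", "Day"), "UtcDate")
);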

The query I run gets all the values for a given time series for a given day (so about 86,400 points):

SELECT "UtcDate", "Value"
FROM "Metric_OneSec"
WHERE "MetricId" = 12215ece-6544-4fcf-a15d-4f9e9ce1567e
AND "Day" = '2015-05-05 00:00:00+0000'
LIMIT 86400;

This takes about 450 ms to run, and when I trace the query I see that it takes about 110 ms to read the data from disk and 224 ms to send it from the responsible node to the coordinator (full trace in the attachment).

I did a quick estimate of the requested data size (correct me if I'm wrong):
86,400 * (column name + column value + timestamp + TTL)
= 86,400 * (8 + 8 + 8 + 8?) bytes
= 2,764,800 bytes
≈ 2.6 MB

Let's say about 3 MB with miscellaneous overhead, so these timings seem pretty slow to me for a modern SSD and a 1 Gb/s NIC.

Do those timings seem normal? Am I missing something?

Thank you,

Kévin



trace_slow_read.txt (7K)

Re: Read performance

Bryan Holladay
Try breaking it up into smaller chunks using multiple threads and token ranges. 86,400 rows is pretty large; I have found that around 1,000 results per query works well. This will spread the burden across all the servers a little more evenly.
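
Since all 86,400 points here are in a single partition (and therefore map to a single token), the chunking would be on the clustering column rather than on token ranges; a rough sketch, with one-hour slices chosen arbitrarily:

-- one chunk; the next thread/query takes 01:00 to 02:00, and so on
SELECT "UtcDate", "Value"
FROM "Metric_OneSec"
WHERE "MetricId" = 12215ece-6544-4fcf-a15d-4f9e9ce1567e
AND "Day" = '2015-05-05 00:00:00+0000'
AND "UtcDate" >= '2015-05-05 00:00:00+0000'
AND "UtcDate" < '2015-05-05 01:00:00+0000';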



Re: Read performance

Alprema
I was planning to use a more "server-friendly" strategy anyway (by parallelizing my workload across multiple metrics), but my concern here is more about the raw numbers.

According to the trace and my estimate of the data size, the read from disk ran at about 30 MB/s and the transfer between the responsible node and the coordinator at about 120 Mbit/s, which doesn't seem right given that the cluster was not busy and the network is gigabit-capable.
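
The arithmetic behind those figures, using the ~3 MB estimate from my first message:
3 MB / 0.110 s ≈ 27 MB/s for the read from disk
3 MB / 0.224 s ≈ 13 MB/s ≈ 107 Mbit/s for the node-to-coordinator transfer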

I know that there is some overhead, but these numbers seem odd to me. Do they seem normal to you?

