Really high read latency


Really high read latency

Dave Galbraith
Hi! So I've got a table like this:

CREATE TABLE "default".metrics (
    row_time int,
    attrs varchar,
    offset int,
    value double,
    PRIMARY KEY (row_time, attrs, offset)
) WITH COMPACT STORAGE
  AND bloom_filter_fp_chance = 0.01
  AND caching = 'KEYS_ONLY'
  AND comment = ''
  AND dclocal_read_repair_chance = 0
  AND gc_grace_seconds = 864000
  AND index_interval = 128
  AND read_repair_chance = 1
  AND replicate_on_write = 'true'
  AND populate_io_cache_on_flush = 'false'
  AND default_time_to_live = 0
  AND speculative_retry = 'NONE'
  AND memtable_flush_period_in_ms = 0
  AND compaction = {'class': 'DateTieredCompactionStrategy', 'timestamp_resolution': 'MILLISECONDS'}
  AND compression = {'sstable_compression': 'LZ4Compressor'};

and I'm running Cassandra on an EC2 m3.2xlarge out in the cloud, with 4 GB of heap space. It's time-series data: I increment "row_time" each day, "attrs" is additional identifying information about each series, and "offset" is the number of milliseconds into the day for each data point. For the past 5 days I've been inserting 3k points/second distributed across 100k distinct "attrs"es. Now when I try to run queries on this data that look like

"SELECT * FROM "default".metrics WHERE row_time = 5 AND attrs = 'potatoes_and_jam'"

it takes an absurdly long time and sometimes just times out. I did "nodetool cfstats default" and here's what I get:

Keyspace: default
    Read Count: 59
    Read Latency: 397.12523728813557 ms.
    Write Count: 155128
    Write Latency: 0.3675690719921613 ms.
    Pending Flushes: 0
        Table: metrics
        SSTable count: 26
        Space used (live): 35146349027
        Space used (total): 35146349027
        Space used by snapshots (total): 0
        SSTable Compression Ratio: 0.10386468749216264
        Memtable cell count: 141800
        Memtable data size: 31071290
        Memtable switch count: 41
        Local read count: 59
        Local read latency: 397.126 ms
        Local write count: 155128
        Local write latency: 0.368 ms
        Pending flushes: 0
        Bloom filter false positives: 0
        Bloom filter false ratio: 0.00000
        Bloom filter space used: 2856
        Compacted partition minimum bytes: 104
        Compacted partition maximum bytes: 36904729268
        Compacted partition mean bytes: 986530969
        Average live cells per slice (last five minutes): 501.66101694915255
        Maximum live cells per slice (last five minutes): 502.0
        Average tombstones per slice (last five minutes): 0.0
        Maximum tombstones per slice (last five minutes): 0.0

Ouch! 400ms of read latency, orders of magnitude higher than it has any right to be. How could this have happened? Is there something fundamentally broken about my data model? Thanks!
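[Editor's note] A quick back-of-envelope check against the numbers above: with only row_time as the partition key, every point written during a day lands in a single partition, regardless of how many distinct "attrs" values there are.

```latex
% Approximate cells accumulated per row_time partition per day,
% using the write rate stated in the post (3k points/second):
3000\ \tfrac{\text{points}}{\text{s}} \times 86{,}400\ \tfrac{\text{s}}{\text{day}}
\approx 2.59 \times 10^{8}\ \text{cells/day}
```

That growth rate is consistent with the cfstats figures above (mean compacted partition ~986 MB, maximum ~36.9 GB).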


Re: Really high read latency

Duncan Sands
Hi Dave,

On 23/03/15 05:56, Dave Galbraith wrote:

> Hi! So I've got a table like this:
>
> CREATE TABLE "default".metrics (row_time int,attrs varchar,offset int,value
> double, PRIMARY KEY(row_time, attrs, offset)) WITH COMPACT STORAGE AND
> bloom_filter_fp_chance=0.01 AND caching='KEYS_ONLY' AND comment='' AND
> dclocal_read_repair_chance=0 AND gc_grace_seconds=864000 AND index_interval=128
> AND read_repair_chance=1 AND replicate_on_write='true' AND
> populate_io_cache_on_flush='false' AND default_time_to_live=0 AND
> speculative_retry='NONE' AND memtable_flush_period_in_ms=0 AND
> compaction={'class':'DateTieredCompactionStrategy','timestamp_resolution':'MILLISECONDS'}
> AND compression={'sstable_compression':'LZ4Compressor'};

does it work better with
   PRIMARY KEY((row_time, attrs), offset)
?

Ciao, Duncan.
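
[Editor's note] Spelled out as a full statement (a sketch only, assuming the other table options from the original schema stay unchanged), Duncan's suggestion is:

```cql
-- Composite partition key: each (row_time, attrs) pair becomes its own
-- partition, so a day's writes are spread across the ~100k series instead
-- of accumulating in one giant partition.
CREATE TABLE "default".metrics (
    row_time int,
    attrs varchar,
    offset int,
    value double,
    PRIMARY KEY ((row_time, attrs), offset)
) WITH COMPACT STORAGE;
```

Note that the partition key of an existing table cannot be altered; this change requires creating a new table and reloading the data.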



Re: Really high read latency

Jens Rantil
Also, two control questions:
  • Are you using EBS for data storage? It might introduce additional latencies.
  • Are you doing proper paging when querying the keyspace?
Cheers,
Jens

On Mon, Mar 23, 2015 at 5:56 AM, Dave Galbraith <[hidden email]> wrote:

--
Jens Rantil
Backend engineer
Tink AB

Phone: +46 708 84 18 32


Re: Really high read latency

Dave Galbraith
Duncan: I'm thinking it might be something like that. I'm also seeing just a ton of garbage collection on the box, could it be pulling rows for all 100k attrs for a given row_time into memory since only row_time is the partition key?

Jens: I'm not using EBS (although I used to, until I read up on how useless it is). I'm not sure what constitutes proper paging, but my client has a pretty small amount of available memory, so I'm doing pages of size 5k using the DataStax C++ driver.

Thanks for the replies!

-Dave

On Mon, Mar 23, 2015 at 2:00 AM, Jens Rantil <[hidden email]> wrote:


Re: Really high read latency

Eric Stevens
Enable tracing in cqlsh and see how many sstables are being lifted to satisfy the query (are you repeatedly writing to the same partition [row_time] over time?).

Also watch for whether you're hitting a lot of tombstones (are you deleting lots of values in the same partition over time?).
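
[Editor's note] In cqlsh, that check looks like this (the query is the one from the original post):

```cql
TRACING ON;
SELECT * FROM "default".metrics WHERE row_time = 5 AND attrs = 'potatoes_and_jam';
-- The trace printed after the rows shows the elapsed time per event, plus
-- lines like "Merging data from memtables and N sstables" and the number
-- of tombstoned cells read.
```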

On Mon, Mar 23, 2015 at 4:01 AM, Dave Galbraith <[hidden email]> wrote:


Re: Really high read latency

Dave Galbraith
I haven't deleted anything. Here's output from a traced cqlsh query (I tried to make the spaces line up, hope it's legible):

Execute CQL3 query | 2015-03-23 21:04:37.422000 | 172.31.32.211 | 0
Parsing select * from default.metrics where row_time = 16511 and attrs = '[redacted]' limit 100; [SharedPool-Worker-2] | 2015-03-23 21:04:37.423000 | 172.31.32.211 | 93
Preparing statement [SharedPool-Worker-2] | 2015-03-23 21:04:37.423000 | 172.31.32.211 | 696
Executing single-partition query on metrics [SharedPool-Worker-1] | 2015-03-23 21:04:37.425000 | 172.31.32.211 | 2807
Acquiring sstable references [SharedPool-Worker-1] | 2015-03-23 21:04:37.425000 | 172.31.32.211 | 2993
Merging memtable tombstones [SharedPool-Worker-1] | 2015-03-23 21:04:37.426000 | 172.31.32.211 | 3049
Partition index with 484338 entries found for sstable 15966 [SharedPool-Worker-1] | 2015-03-23 21:04:38.625000 | 172.31.32.211 | 202304
Seeking to partition indexed section in data file [SharedPool-Worker-1] | 2015-03-23 21:04:38.625000 | 172.31.32.211 | 202354
Bloom filter allows skipping sstable 5613 [SharedPool-Worker-1] | 2015-03-23 21:04:38.625000 | 172.31.32.211 | 202445
Bloom filter allows skipping sstable 5582 [SharedPool-Worker-1] | 2015-03-23 21:04:38.625000 | 172.31.32.211 | 202478
Bloom filter allows skipping sstable 5611 [SharedPool-Worker-1] | 2015-03-23 21:04:38.625000 | 172.31.32.211 | 202508
Bloom filter allows skipping sstable 5610 [SharedPool-Worker-1] | 2015-03-23 21:04:38.625000 | 172.31.32.211 | 202539
Bloom filter allows skipping sstable 5549 [SharedPool-Worker-1] | 2015-03-23 21:04:38.625001 | 172.31.32.211 | 202678
Bloom filter allows skipping sstable 5544 [SharedPool-Worker-1] | 2015-03-23 21:04:38.625001 | 172.31.32.211 | 202720
Bloom filter allows skipping sstable 5237 [SharedPool-Worker-1] | 2015-03-23 21:04:38.625001 | 172.31.32.211 | 202752
Bloom filter allows skipping sstable 2516 [SharedPool-Worker-1] | 2015-03-23 21:04:38.625001 | 172.31.32.211 | 202782
Bloom filter allows skipping sstable 2632 [SharedPool-Worker-1] | 2015-03-23 21:04:38.625001 | 172.31.32.211 | 202812
Bloom filter allows skipping sstable 3015 [SharedPool-Worker-1] | 2015-03-23 21:04:38.625001 | 172.31.32.211 | 202852
Skipped 0/11 non-slice-intersecting sstables, included 0 due to tombstones [SharedPool-Worker-1] | 2015-03-23 21:04:38.625001 | 172.31.32.211 | 202882
Merging data from memtables and 1 sstables [SharedPool-Worker-1] | 2015-03-23 21:04:38.625001 | 172.31.32.211 | 202902
Read 101 live and 0 tombstoned cells [SharedPool-Worker-1] | 2015-03-23 21:04:38.626000 | 172.31.32.211 | 203752
Request complete | 2015-03-23 21:04:38.628253 | 172.31.32.211 | 206253

On Mon, Mar 23, 2015 at 11:53 AM, Eric Stevens <[hidden email]> wrote:


Re: Really high read latency

Chris Lohfink

>>  Compacted partition maximum bytes: 36904729268

that's huge... a 36 GB partition is going to cause a lot of problems; even when you specify a precise cell beneath it, Cassandra still has to deserialize an enormous column index on every read of the partition. As mentioned above, you should include your attribute name in the partition key ((row_time, attrs)) to spread this out... I'd call that critical.

Chris

On Mon, Mar 23, 2015 at 4:13 PM, Dave Galbraith <[hidden email]> wrote:
I haven't deleted anything. Here's output from a traced cqlsh query (I tried to make the spaces line up, hope it's legible):

Execute CQL3 query                                                                                                                                                           | 2015-03-23 21:04:37.422000 | 172.31.32.211 |              0
Parsing select * from default.metrics where row_time = 16511 and attrs = '[redacted]' limit 100; [SharedPool-Worker-2] | 2015-03-23 21:04:37.423000 | 172.31.32.211 |             93
Preparing statement [SharedPool-Worker-2]                                                                                                                      | 2015-03-23 21:04:37.423000 | 172.31.32.211 |            696
Executing single-partition query on metrics [SharedPool-Worker-1]                                                                                   | 2015-03-23 21:04:37.425000 | 172.31.32.211 |           2807
Acquiring sstable references [SharedPool-Worker-1]                                                                                                         | 2015-03-23 21:04:37.425000 | 172.31.32.211 |           2993
Merging memtable tombstones [SharedPool-Worker-1]                                                                                                     | 2015-03-23 21:04:37.426000 | 172.31.32.211 |           3049
Partition index with 484338 entries found for sstable 15966 [SharedPool-Worker-1]                                                         | 2015-03-23 21:04:38.625000 | 172.31.32.211 |         202304
Seeking to partition indexed section in data file [SharedPool-Worker-1]                                                                            | 2015-03-23 21:04:38.625000 | 172.31.32.211 |         202354
Bloom filter allows skipping sstable 5613 [SharedPool-Worker-1]                                                                                      | 2015-03-23 21:04:38.625000 | 172.31.32.211 |         202445
Bloom filter allows skipping sstable 5582 [SharedPool-Worker-1]                                                                                      | 2015-03-23 21:04:38.625000 | 172.31.32.211 |         202478
Bloom filter allows skipping sstable 5611 [SharedPool-Worker-1]                                                                                      | 2015-03-23 21:04:38.625000 | 172.31.32.211 |         202508
Bloom filter allows skipping sstable 5610 [SharedPool-Worker-1]                                                                                      | 2015-03-23 21:04:38.625000 | 172.31.32.211 |         202539
Bloom filter allows skipping sstable 5549 [SharedPool-Worker-1]                                                                                      | 2015-03-23 21:04:38.625001 | 172.31.32.211 |         202678
Bloom filter allows skipping sstable 5544 [SharedPool-Worker-1]                                                                                      | 2015-03-23 21:04:38.625001 | 172.31.32.211 |         202720
Bloom filter allows skipping sstable 5237 [SharedPool-Worker-1]                                                                                      | 2015-03-23 21:04:38.625001 | 172.31.32.211 |         202752
Bloom filter allows skipping sstable 2516 [SharedPool-Worker-1]                                                                                      | 2015-03-23 21:04:38.625001 | 172.31.32.211 |         202782
Bloom filter allows skipping sstable 2632 [SharedPool-Worker-1]                                                                                     | 2015-03-23 21:04:38.625001 | 172.31.32.211 |         202812
Bloom filter allows skipping sstable 3015 [SharedPool-Worker-1]                                                                                     | 2015-03-23 21:04:38.625001 | 172.31.32.211 |         202852
Skipped 0/11 non-slice-intersecting sstables, included 0 due to tombstones [SharedPool-Worker-1]                               | 2015-03-23 21:04:38.625001 | 172.31.32.211 |         202882
Merging data from memtables and 1 sstables [SharedPool-Worker-1]                                                                             | 2015-03-23 21:04:38.625001 | 172.31.32.211 |         202902
Read 101 live and 0 tombstoned cells [SharedPool-Worker-1]                                                                                          | 2015-03-23 21:04:38.626000 | 172.31.32.211 |         203752
Request complete                                                                                                                                                              | 2015-03-23 21:04:38.628253 | 172.31.32.211 |         206253

On Mon, Mar 23, 2015 at 11:53 AM, Eric Stevens <[hidden email]> wrote:
Enable tracing in cqlsh and see how many sstables are being lifted to satisfy the query (are you repeatedly writing to the same partition [row_time]) over time?).

Also watch for whether you're hitting a lot of tombstones (are you deleting lots of values in the same partition over time?).

On Mon, Mar 23, 2015 at 4:01 AM, Dave Galbraith <[hidden email]> wrote:
Duncan: I'm thinking it might be something like that. I'm also seeing just a ton of garbage collection on the box, could it be pulling rows for all 100k attrs for a given row_time into memory since only row_time is the partition key?

Jens: I'm not using EBS (although I used to, until I read up on how unreliable it is). I'm not sure what constitutes proper paging, but my client has a pretty small amount of available memory, so I'm doing pages of size 5k using the DataStax C++ driver.

Thanks for the replies!

-Dave

On Mon, Mar 23, 2015 at 2:00 AM, Jens Rantil <[hidden email]> wrote:
Also, two control questions:
  • Are you using EBS for data storage? It might introduce additional latencies.
  • Are you doing proper paging when querying the keyspace?
Cheers,
Jens


--
Jens Rantil
Backend engineer
Tink AB

Phone: +46 708 84 18 32





Re: Really high read latency

Ben Bromhead
nodetool cfhistograms is also very helpful in diagnosing these kinds of data modelling issues. 

On 23 March 2015 at 14:43, Chris Lohfink <[hidden email]> wrote:

>>  Compacted partition maximum bytes: 36904729268

that's huge... 36 GB rows are going to cause a lot of problems. Even when you specify a precise cell under this, it still has an enormous column index to deserialize on every read of the partition. As mentioned above, you should include your attribute name in the partition key ((row_time, attrs)) to spread this out... I'd call that critical.
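
[A sketch of the revised schema the advice above describes — the table name metrics_v2 is hypothetical, and the other options are carried over from Dave's original DDL:]

```sql
-- Move attrs into the partition key so each (day, series) pair becomes
-- its own partition instead of one 36 GB partition per day.
CREATE TABLE "default".metrics_v2 (
    row_time int,
    attrs varchar,
    offset int,
    value double,
    PRIMARY KEY ((row_time, attrs), offset)
) WITH compaction = {'class': 'DateTieredCompactionStrategy',
                     'timestamp_resolution': 'MILLISECONDS'}
  AND compression = {'sstable_compression': 'LZ4Compressor'};
```

[Existing reads are unaffected in shape: a query with `WHERE row_time = ? AND attrs = ?` still addresses exactly one partition, just a far smaller one.]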

Chris




--

Ben Bromhead

Instaclustr | www.instaclustr.com | @instaclustr | (650) 284 9692