Denormalization leads to terrible, rather than better, Cassandra performance -- I am really puzzled

Denormalization leads to terrible, rather than better, Cassandra performance -- I am really puzzled

dlu66061

Cassandra gurus, I am really puzzled by my observations, and hope to get some help explaining the results. Thanks in advance.

I think it has always been advocated in the Cassandra community that de-normalization leads to better performance. I wanted to see how much improvement it could offer, but my results were the exact opposite: performance degraded dramatically under simultaneous requests for the same set of data.

Environment:

I have a Cassandra cluster consisting of 3 AWS m3.large instances, with Cassandra 2.0.6 installed and pretty much default settings. My program is written in Java using Java Driver 2.0.8.

Normalized case:

I have two tables, created with the following two CQL statements:

CREATE TABLE event (event_id UUID, time_token timeuuid, … 30 other attributes …, PRIMARY KEY (event_id))

CREATE TABLE event_index (index_key text, time_token timeuuid, event_id UUID,   PRIMARY KEY (index_key, time_token))

In my program, given the proper index_key and a token range (tokenLowerBound to tokenUpperBound), I first query the event_index table

Query 1:

SELECT * FROM event_index WHERE index_key IN (…) AND time_token > tokenLowerBound AND time_token <= tokenUpperBound ORDER BY time_token ASC LIMIT 2000

to get a list of event_ids and then run the following CQL to get the event details.

Query 2:

SELECT * FROM event WHERE event_id IN (a list of event_ids from the above query)

I repeat the above process, with the token range updated from the previous run. This actually performs pretty well.

In this normalized process, I have to run 2 queries to get the data: the first one should be very quick since it is getting a slice of an internally wide row; the second one may take longer because it has to hit up to 2000 rows of the event table.
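
Roughly, the read loop looks like the following sketch (simplified, not my actual program; the contact point, keyspace name, and index key 'key1' are placeholders):

import java.util.ArrayList;
import java.util.List;
import java.util.UUID;

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;
import com.datastax.driver.core.utils.UUIDs;

public class NormalizedReaderSketch {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("10.0.0.1").build();  // placeholder node
        Session session = cluster.connect("events_ks");                           // placeholder keyspace

        // Token range for this run; the real program carries the range over from the previous run.
        UUID tokenLowerBound = UUIDs.startOf(System.currentTimeMillis() - 3600000L);
        UUID tokenUpperBound = UUIDs.endOf(System.currentTimeMillis());

        // Query 1: slice of the internally wide event_index row.
        SimpleStatement q1 = new SimpleStatement(
            "SELECT event_id, time_token FROM event_index WHERE index_key IN ('key1')"
            + " AND time_token > " + tokenLowerBound
            + " AND time_token <= " + tokenUpperBound
            + " ORDER BY time_token ASC LIMIT 2000");
        q1.setConsistencyLevel(ConsistencyLevel.LOCAL_QUORUM);

        List<UUID> eventIds = new ArrayList<UUID>();
        for (Row row : session.execute(q1)) {
            eventIds.add(row.getUUID("event_id"));
            tokenLowerBound = row.getUUID("time_token");  // last token becomes the next run's lower bound
        }

        // Query 2: fetch the event details for the ids returned by Query 1.
        if (!eventIds.isEmpty()) {
            StringBuilder ids = new StringBuilder();
            for (UUID id : eventIds) {
                if (ids.length() > 0) ids.append(", ");
                ids.append(id);
            }
            SimpleStatement q2 = new SimpleStatement(
                "SELECT * FROM event WHERE event_id IN (" + ids + ")");
            q2.setConsistencyLevel(ConsistencyLevel.LOCAL_QUORUM);
            for (Row row : session.execute(q2)) {
                // process the event detail row here
            }
        }
        cluster.close();
    }
}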

De-normalized case:

What if we attach the event details to the index and run just 1 query? Like Query 1, wouldn't it be much faster, since it is also getting a slice of an internally wide row?

I created a third table that merges the above two tables together. Notice that the first three attributes and the PRIMARY KEY definition are exactly the same as in the "event_index" table.

CREATE TABLE event_index_with_detail (index_key text, time_token timeuuid, event_id UUID, … 30 other attributes …, PRIMARY KEY (index_key, time_token))

Then I can just run the following query to achieve my goal, with the same index and token range as in query 1:

Query 3:

SELECT * FROM event_index_with_detail WHERE index_key IN (…) AND time_token > tokenLowerBound AND time_token <= tokenUpperBound ORDER BY time_token ASC LIMIT 2000

Performance observations

Using Java Driver 2.0.8, I wrote a program that runs Query 1 + Query 2 in the normalized case, or Query 3 in the denormalized case. All queries are executed at the LOCAL_QUORUM consistency level.
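
For the denormalized case, the sketch is just Query 3 issued at LOCAL_QUORUM. Again this is only illustrative; the session, index key, and token bounds are assumed to come from the surrounding program:

import java.util.UUID;

import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;

public class DenormalizedReaderSketch {
    // Runs Query 3 for one index key and token range at LOCAL_QUORUM.
    static ResultSet readSlice(Session session, String indexKey, UUID lower, UUID upper) {
        SimpleStatement q3 = new SimpleStatement(
            "SELECT * FROM event_index_with_detail WHERE index_key IN ('" + indexKey + "')"
            + " AND time_token > " + lower
            + " AND time_token <= " + upper
            + " ORDER BY time_token ASC LIMIT 2000");
        q3.setConsistencyLevel(ConsistencyLevel.LOCAL_QUORUM);  // same consistency level as Query 1 + Query 2
        return session.execute(q3);
    }
}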

Then I created 1 or more instances of the program to simultaneously retrieve the SAME set of 1 million events stored in Cassandra. Each test runs for 5 minutes, and the results are shown below.

 

             | 1 instance | 5 instances | 10 instances
Normalized   |         89 |         315 |          417
Denormalized |        100 |          43 |            3

Note that the unit of measure is the number of operations. So in the normalized case, the program runs 89 times and retrieves 178K events with a single instance, 315 times and 630K events with 5 instances (each instance getting about 126K events), and 417 times and 834K events with 10 simultaneous instances (each instance getting about 83.4K events).

In the de-normalized case, performance is a little better with a single instance: the program runs 100 times and retrieves 200K events. However, it turns sharply south with multiple simultaneous instances. The 5 instances together completed only 43 operations successfully, and the 10 instances together completed only 3. In the latter case, the log showed that 3 instances each retrieved 2000 events successfully, while the other 7 retrieved none.

In the de-normalized case, the program reported a lot of exceptions such as the following:

com.datastax.driver.core.exceptions.ReadTimeoutException, Cassandra timeout during read query at consistency LOCAL_QUORUM (2 responses were required but only 1 replica responded)

com.datastax.driver.core.exceptions.NoHostAvailableException, All host(s) tried for query failed (no host was tried)

I repeated the two cases back and forth several times, and the results remained the same.

I also observed CPU usage on the 3 Cassandra servers, and they were all much higher for the de-normalized case.

 

             | 1 instance        | 5 instances      | 10 instances
Normalized   | 7% usr, 2% sys    | 30% usr, 8% sys  | 40% usr, 10% sys
Denormalized | 44% usr, 0.3% sys | 65% usr, 1% sys  | 70% usr, 2% sys

Questions

This is really not what I expected, and I am puzzled and have not figured out a good explanation.

  • Why are there so many exceptions in the de-normalized case? I would think Cassandra should be able to handle simultaneous accesses to the same data. Why are there NO exceptions in the normalized case? I mean, the environments for the two cases are basically the same.
  • Are (internally) wide rows only good for small amounts of data under each column name?
  • Or is it an issue with Java Driver?
  • Or did I do something wrong?

Re: Denormalization leads to terrible, rather than better, Cassandra performance -- I am really puzzled

dlu66061

I carried out additional tests and tried to make it easier to figure out an answer.

Clearly, the consistency checking algorithm is at the center of this issue. Since the final data set (2000 event records) retrieved in the normalized and de-normalized cases is the same, I have to think that Cassandra has to check every attribute of every row. In that sense, the time spent on consistency checking should be roughly the same for both cases.

If the total time spent on consistency checking is the same, then where is the difference? I can only speculate that it is the size of an internal row that makes the difference. In the normalized case, we retrieve 2001 internal rows (1 from event_index, and 2000 from event). Each row is small, so the consistency checking for each row is very quick. If the algorithm releases the CPU between rows, all the instances have a chance to proceed. If the algorithm caches the checking results, then since all the instances in my tests retrieve the same set of data, things should be even better.

For the de-normalized case, however, there is only one wide row with all the data. If the algorithm has to finish checking the whole row before releasing the CPU, the check could take so long that threads serving other client instances never get a chance before they time out. I would also have to think that some kind of (row) locking is involved; otherwise, all threads should be able to process the same row at the same time without waiting.

LIMIT setting in CQL query:

To reduce the size of the slice read from the single wide row in the de-normalized case, I changed the CQL LIMIT value to 250, 200, 150, and 100. Although the number of occurrences decreased, there were still timeout exceptions with 10 client instances running simultaneously. Only when the LIMIT was set to 50 did the exceptions disappear.
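
An alternative I have not tried yet would be to keep the larger logical slice but let the driver fetch it in small pages via setFetchSize, so no single read has to materialize the whole slice at once. This is only a sketch, assuming Java Driver 2.0.x automatic paging against Cassandra 2.0, and assuming a single index_key per query (as far as I know, a query combining IN on the partition key with ORDER BY cannot be paged):

import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;

public class PagedReadSketch {
    // Reads the denormalized slice in pages of 50 rows instead of one 2000-row read.
    static int readSlicePaged(Session session, String cql) {
        SimpleStatement stmt = new SimpleStatement(cql);
        stmt.setConsistencyLevel(ConsistencyLevel.LOCAL_QUORUM);
        stmt.setFetchSize(50);  // server returns at most 50 rows per page
        int count = 0;
        for (Row row : session.execute(stmt)) {  // the driver fetches further pages transparently
            count++;  // process the row here
        }
        return count;
    }
}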

To have a fair performance comparison, I re-ran the tests with LIMIT set to 50. Here are the results:

 

             | 1 instance | 5 instances | 10 instances
Normalized   |       2754 |          NA |         8247
Denormalized |       3934 |          NA |         5062

Again, the unit of measure is the number of operations. So in the normalized case, the program runs 2754 times and retrieves 137K events with a single instance, and 8247 times and 412K events with 10 simultaneous instances (each instance getting about 41.2K events).

In the de-normalized case, performance is a little better with a single instance: the program runs 3934 times and retrieves 197K events. Still, it turns sharply south with 10 simultaneous instances, even though there are no timeout exceptions. The 10 instances together ran 5062 times and retrieved 253K events (each instance getting about 25.3K events).

Change consistency level to LOCAL_ONE

What if I forgo LOCAL_QUORUM and use LOCAL_ONE instead? Can I then use a larger LIMIT setting in the CQL SELECT statement?

I changed the consistency level in the program, and again tried CQL LIMIT values of 250, 200, 150, and 100. The timeout exceptions still occurred with 10 client instances running simultaneously. The exception now changed to:

com.datastax.driver.core.exceptions.ReadTimeoutException, Cassandra timeout during read query at consistency LOCAL_ONE (1 responses were required but only 0 replica responded)

At this consistency level, and with 0 replicas responding, I can only speculate that some kind of row locking is going on such that other Cassandra threads cannot get to the data. But what needs to be checked if we only need one copy?

Only when the LIMIT was set to 50 did this exception disappear. I re-ran the tests with LIMIT set to 50. Here are the results:

 

             | 1 instance | 5 instances | 10 instances
Normalized   |       4944 |          NA |        17215
Denormalized |       5139 |          NA |         8136

Similar to before, with only 1 instance the de-normalized case is about 4% better. However, with 10 instances the de-normalized throughput is less than half that of the normalized case. Such a big difference.


Re: Denormalization leads to terrible, rather than better, Cassandra performance -- I am really puzzled

Erick Ramirez
In reply to this post by dlu66061
Hello, there.

In relation to the Java driver, I would recommend updating to the latest version, as there were a lot of issues reported in versions earlier than 2.0.9 where the driver incorrectly marks nodes as down/not available.

In fact, there is a new version of the driver being released in the next 24-48 hours that reverts JAVA-425 to resolve this issue.




Re: Denormalization leads to terrible, rather than better, Cassandra performance -- I am really puzzled

Steve Robenalt
A few observations from what you've said so far:

1) IN clauses in CQL can have a performance impact because they can include sets of keys that are spread across the cluster, forcing a single coordinator to fan the query out (see the sketch after this list).

2) We previously used m3.large instances in our cluster and would see occasional read timeouts even at CL.ONE. We upgraded to i2.xlarge with local SSD drive and no longer experience those problems.

3) You didn't state how you had your storage configured, but if you're using EBS for your Cassandra partitions, it can seriously impact performance due to network lag. If you're using local storage on your instances (which is recommended), you should be using separate drives for your data and commitlog partitions because the access patterns are very different. SSD is the preferred local storage option (over spinning disks) in any case.

4) If you have all of the above covered, you might also want to compare CL.ONE read results with CL.QUORUM read results. The former will likely perform much better.
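
To illustrate point 1, here is a rough sketch (the table and column names follow your schema; the helper itself is just illustrative) of replacing a large IN (...) read with one asynchronous query per key, so the work is spread across coordinators instead of one coordinator fanning the whole IN list out. The consistency level parameter also makes it easy to compare CL.ONE with CL.QUORUM as in point 4:

import java.util.ArrayList;
import java.util.List;
import java.util.UUID;

import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.ResultSetFuture;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;

public class PerKeyReadSketch {
    // Issues one async SELECT per event_id instead of a single IN (...) query.
    static List<ResultSet> readEvents(Session session, List<UUID> eventIds, ConsistencyLevel cl) {
        List<ResultSetFuture> futures = new ArrayList<ResultSetFuture>();
        for (UUID id : eventIds) {
            SimpleStatement stmt = new SimpleStatement("SELECT * FROM event WHERE event_id = " + id);
            stmt.setConsistencyLevel(cl);             // e.g. ConsistencyLevel.ONE vs ConsistencyLevel.QUORUM
            futures.add(session.executeAsync(stmt));  // non-blocking; the queries run in parallel
        }
        List<ResultSet> results = new ArrayList<ResultSet>();
        for (ResultSetFuture f : futures) {
            results.add(f.getUninterruptibly());      // wait for each reply
        }
        return results;
    }
}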

Any of the issues above can cause excessive contention between nodes and seriously degrade performance as traffic increases.

These are just some of the first things that jump out. Others on the list are a lot more experienced than I am with Cassandra performance and may have additional advice. There are also quite a few good papers and videos on Planet Cassandra and the YouTube channel regarding performance, storage, data models, and the interactions between them.

Hope that helps,
Steve

  




Re: Denormalization leads to terrible, rather than better, Cassandra performance -- I am really puzzled

dlu66061
Thank you both very much, Erick and Steve.

Steve,

1. You are right on target. I realized this later last week and ran additional tests. I am in the process of summarizing them and will post the results here as a comparison reference for others.

2. It is good to know.

3. Yes, I am using EBS. I understand its implications for overall performance, but I do not think it is the main cause of what I observed.

4. I also tested CL.ONE vs. CL.QUORUM. I am summarizing that part too.