Using a separate commit log drive was 4x slower

classic Classic list List threaded Threaded
6 messages Options
Reply | Threaded
Open this post in threaded view
|

Using a separate commit log drive was 4x slower

Jeremy Davis-3

I have a weird one to share with the list, Using a separate commit log drive dropped my performance a lot more than I would expect...

I'm doing perf tests on 3 identical machines but with 3 different drive sets. (SAS 15K,10K, and SATA 7.5K)
Each system has a single system disk (Same as the data set) and the data set ( a 5 drive RAID-0 )

I'm using Cassandra 0.6.4 with Java 1.6_20.
This is all RF=1, CL=1

I inserted an initial data set of 100K keys (each with 1000 columns of random data (1000 bytes). Compacted and restarted Cassandra.

Then I did a write baseline where I have 500 threads inserting a random 1000 bytes on a random key/column combination (always 1 column per request).
If my commit log is on my RAID'd Data drive I get about 19K Columns/Inserts per second.

If I then add some random reads ( 30 threads doing a random read Key/Column read - always 1 column per read) I get ~ 8K Reads/Writes per second

Host
Write Baseline Columns Per Second.
Write Columns Per Second.
Read Columns Per Second.
SAS15K
18800
8700
8100
SAS10K
15800
7600
7300
SATA
13200
7300
8000

Now, if I do the same thing but with the commit log on the system disk, I get:

Host
Write Baseline Columns Per Second.
Write Columns Per Second.
Read Columns Per Second.
SAS15K
12600
2200
1600
SAS10K
11400
3000
1900
SATA
9900
3100
1800


I would think that the Write level would stay at about the baseline, and I have no idea why the read level would be so low.

Any thoughts?


Some iostat (while separate commit log):

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          38.33    0.00    4.72    2.48    0.00   54.47

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
cciss/c0d0        0.00   908.50    0.00  110.50     0.00  8152.00    73.77     0.57    5.20   4.93  54.50
cciss/c0d1        0.00     0.00   16.50    0.00  1424.00     0.00    86.30     0.10    6.06   2.73   4.50
dm-0              0.00     0.00    0.00 1019.00     0.00  8152.00     8.00     6.25    6.13   0.53  54.50
dm-1              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          37.42    0.00    2.37    3.54    0.00   56.68

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
cciss/c0d0        0.00   854.50    0.00  124.50     0.00  7816.00    62.78     0.61    4.94   4.82  60.00
cciss/c0d1        0.00     0.00   32.00    0.00  4032.00     0.00   126.00     0.21    6.72   3.12  10.00
dm-0              0.00     0.00    0.00  979.50     0.00  7836.00     8.00     5.57    5.69   0.61  60.00
dm-1              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00


some top (while separate commit log):

top - 15:56:38 up 6 days, 21:26,  9 users,  load average: 17.09, 7.92, 6.87
Tasks: 358 total,   1 running, 357 sleeping,   0 stopped,   0 zombie
Cpu(s): 35.4%us,  1.6%sy,  0.0%ni, 59.4%id,  3.2%wa,  0.0%hi,  0.5%si,  0.0%st
Mem:  24729068k total, 19789732k used,  4939336k free,   132056k buffers
Swap:  5849080k total,    54976k used,  5794104k free, 14839884k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                                                                                              
19411 root      20   0  142g 8.2g 5.0g S  599 34.9 423:25.42 java    


----

<Storage>
  <ClusterName>Test Cluster</ClusterName>
  <AutoBootstrap>false</AutoBootstrap>
  <HintedHandoffEnabled>true</HintedHandoffEnabled>

<Keyspaces>
    <Keyspace Name="Keyspace1">
      <ColumnFamily Name="PerfTest" CompareWith="LongType"/>
      <ReplicaPlacementStrategy>org.apache.cassandra.locator.RackUnawareStrategy</ReplicaPlacementStrategy>
      <ReplicationFactor>1</ReplicationFactor>
      <EndPointSnitch>org.apache.cassandra.locator.EndPointSnitch</EndPointSnitch>
    </Keyspace>

  </Keyspaces>

  <Authenticator>org.apache.cassandra.auth.AllowAllAuthenticator</Authenticator>
  <Partitioner>org.apache.cassandra.dht.OrderPreservingPartitioner</Partitioner>
  <InitialToken></InitialToken>
  <CommitLogDirectory>/data/commitlog</CommitLogDirectory>
  <DataFileDirectories>
      <DataFileDirectory>/data/data</DataFileDirectory>
  </DataFileDirectories>
  <Seeds>
      <Seed>127.0.0.1</Seed>
  </Seeds>

<RpcTimeoutInMillis>10000</RpcTimeoutInMillis>
  <CommitLogRotationThresholdInMB>1024</CommitLogRotationThresholdInMB>
  <ListenAddress>10.2.60.20</ListenAddress>
  <StoragePort>7000</StoragePort>
  <ThriftAddress>10.2.60.20</ThriftAddress>
  <ThriftPort>9160</ThriftPort> 
  <ThriftFramedTransport>false</ThriftFramedTransport>
  <DiskAccessMode>auto</DiskAccessMode>
  <RowWarningThresholdInMB>512</RowWarningThresholdInMB>
  <SlicedBufferSizeInKB>64</SlicedBufferSizeInKB>

<FlushDataBufferSizeInMB>32</FlushDataBufferSizeInMB>
  <FlushIndexBufferSizeInMB>8</FlushIndexBufferSizeInMB>
  <ColumnIndexSizeInKB>64</ColumnIndexSizeInKB>
  <MemtableThroughputInMB>512</MemtableThroughputInMB>
  <BinaryMemtableThroughputInMB>256</BinaryMemtableThroughputInMB>
  <MemtableOperationsInMillions>1.2</MemtableOperationsInMillions>
  <MemtableFlushAfterMinutes>60</MemtableFlushAfterMinutes>
  <ConcurrentReads>16</ConcurrentReads>
  <ConcurrentWrites>512</ConcurrentWrites>

  <CommitLogSync>batch</CommitLogSync>
  <CommitLogSyncBatchWindowInMS>5</CommitLogSyncBatchWindowInMS>
  <GCGraceSeconds>864000</GCGraceSeconds>
</Storage>


Reply | Threaded
Open this post in threaded view
|

Re: Using a separate commit log drive was 4x slower

Peter Schuller
I have no explanation for the slower reads, but I have an hypothesis
on the writes.

Your iostat shows:

> Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
> cciss/c0d0        0.00   908.50    0.00  110.50     0.00  8152.00    73.77     0.57    5.20   4.93  54.50
> cciss/c0d1        0.00     0.00   16.50    0.00  1424.00     0.00    86.30     0.10    6.06   2.73   4.50
> dm-0              0.00     0.00    0.00 1019.00     0.00  8152.00     8.00     6.25    6.13   0.53  54.50
> dm-1              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00

So that's keeping the disk busy around 50-60% of the time. This seems
roughly consistent with your commit batch window being set to 5 ms and
the system drive NOT being supported by a battery-backed cache (such
that an fsync() actually does have to wait for the data to be on the
platter).

Is your non-system drive backed by the BBU? (I'm not sure if the
controllers would support having some volumes backed by BBU protected
cache and not others.)

If it is the case that the other volume is BBU backed, then maybe the
slowdown on writes is due to this. In any case, whatever the situation
was before, the above stats do seem roughly consistent with
write-through fsync() and batch window of 5ms, given sufficient
concurrency to achieve the throughput you're seeing. On the other
hand, "roughly consistent" is not very precise, and the original
performance on the RAID:ed device is probably also roughly consistent
with this ;)

--
/ Peter Schuller
Reply | Threaded
Open this post in threaded view
|

Re: Using a separate commit log drive was 4x slower

Jeremy Davis-3
Yeah, it has a BBU, and it is charged and on..
Very odd behavior, I'm stumped.

-JD

On Tue, Aug 10, 2010 at 12:28 AM, Peter Schuller <[hidden email]> wrote:
I have no explanation for the slower reads, but I have an hypothesis
on the writes.

Your iostat shows:

> Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
> cciss/c0d0        0.00   908.50    0.00  110.50     0.00  8152.00    73.77     0.57    5.20   4.93  54.50
> cciss/c0d1        0.00     0.00   16.50    0.00  1424.00     0.00    86.30     0.10    6.06   2.73   4.50
> dm-0              0.00     0.00    0.00 1019.00     0.00  8152.00     8.00     6.25    6.13   0.53  54.50
> dm-1              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00

So that's keeping the disk busy around 50-60% of the time. This seems
roughly consistent with your commit batch window being set to 5 ms and
the system drive NOT being supported by a battery-backed cache (such
that an fsync() actually does have to wait for the data to be on the
platter).

Is your non-system drive backed by the BBU? (I'm not sure if the
controllers would support having some volumes backed by BBU protected
cache and not others.)

If it is the case that the other volume is BBU backed, then maybe the
slowdown on writes is due to this. In any case, whatever the situation
was before, the above stats do seem roughly consistent with
write-through fsync() and batch window of 5ms, given sufficient
concurrency to achieve the throughput you're seeing. On the other
hand, "roughly consistent" is not very precise, and the original
performance on the RAID:ed device is probably also roughly consistent
with this ;)

--
/ Peter Schuller

Reply | Threaded
Open this post in threaded view
|

Re: Using a separate commit log drive was 4x slower

Jonathan Ellis-3
Other activity, e.g. syslog?

Journaling at the FS level? you could try making a small partition
formatted as ext2.

On Tue, Aug 10, 2010 at 5:10 PM, Jeremy Davis
<[hidden email]> wrote:

> Yeah, it has a BBU, and it is charged and on..
> Very odd behavior, I'm stumped.
>
> -JD
>
> On Tue, Aug 10, 2010 at 12:28 AM, Peter Schuller
> <[hidden email]> wrote:
>>
>> I have no explanation for the slower reads, but I have an hypothesis
>> on the writes.
>>
>> Your iostat shows:
>>
>> > Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s
>> > avgrq-sz avgqu-sz   await  svctm  %util
>> > cciss/c0d0        0.00   908.50    0.00  110.50     0.00  8152.00
>> > 73.77     0.57    5.20   4.93  54.50
>> > cciss/c0d1        0.00     0.00   16.50    0.00  1424.00     0.00
>> > 86.30     0.10    6.06   2.73   4.50
>> > dm-0              0.00     0.00    0.00 1019.00     0.00  8152.00
>> > 8.00     6.25    6.13   0.53  54.50
>> > dm-1              0.00     0.00    0.00    0.00     0.00     0.00
>> > 0.00     0.00    0.00   0.00   0.00
>>
>> So that's keeping the disk busy around 50-60% of the time. This seems
>> roughly consistent with your commit batch window being set to 5 ms and
>> the system drive NOT being supported by a battery-backed cache (such
>> that an fsync() actually does have to wait for the data to be on the
>> platter).
>>
>> Is your non-system drive backed by the BBU? (I'm not sure if the
>> controllers would support having some volumes backed by BBU protected
>> cache and not others.)
>>
>> If it is the case that the other volume is BBU backed, then maybe the
>> slowdown on writes is due to this. In any case, whatever the situation
>> was before, the above stats do seem roughly consistent with
>> write-through fsync() and batch window of 5ms, given sufficient
>> concurrency to achieve the throughput you're seeing. On the other
>> hand, "roughly consistent" is not very precise, and the original
>> performance on the RAID:ed device is probably also roughly consistent
>> with this ;)
>>
>> --
>> / Peter Schuller
>
>



--
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com
Reply | Threaded
Open this post in threaded view
|

Re: Using a separate commit log drive was 4x slower

Peter Schuller
In reply to this post by Jeremy Davis-3
> Yeah, it has a BBU, and it is charged and on..
> Very odd behavior, I'm stumped.

I advise double-checking raid volume settings and ensuring that policy
is truly such that the write cache is used. This may also be a
function of kernel driver settings depending on what RAID
controller/kernel version you have (for example make sure that an
fsync() doesn't result in asking the RAID controller to flush caches
regardless of BBU state). In any case, these stats:

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s
avgrq-sz avgqu-sz   await  svctm  %util
cciss/c0d0        0.00   908.50    0.00  110.50     0.00  8152.00
73.77     0.57    5.20   4.93  54.50

... are highly inconsistent with write-back caching under the
assumption that the writes are indeed the sequential writes to the
commit log. Write counts in the ~ 100/second ballpark, with no reads,
average request size of 74, an average transaction time of 4.93 and a
utilization of 54% *really really* sounds like the I/O is going all
the way down to the platter. Either that or the RAID firmware/driver
is not doing its job properly.

I've attached a small script (note: UGLY hack since I just whipped it
up, not a proper tool, but it does the job for this) that you can run
to test it:

   ./writelat.py /path/to/file_to_write_to # warning, file will be
truncated/destroyed if it exists

If you run this on an idle system and you're truly doing write-back
caching, you should see numbers BELOW 1 (i.e., sub-millisecond times)
(but you can ignore the first sample probably).

It writes one 8kb chunk of data, fsync:s, writes another 8kb, fsyncs,
etc. I predict you'll see numbers in the 3-7 millisecond range.

Sample output for me with a plain old SATA drive (and ZFS on FreeBSD) follows.

33.8141918182
8.32605361938
8.44812393188
8.44788551331
8.40210914612
8.4490776062
8.38303565979
8.57901573181
8.20922851562
8.21614265442
10.0581645966
8.37683677673
8.50605964661
8.376121521
9.86790657043
8.43715667725
8.45789909363
8.40520858765
8.4171295166
8.46195220947
8.41498374939
8.46099853516
8.44287872314
8.43000411987
8.455991745

--
/ Peter Schuller

writelat.py (598 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Using a separate commit log drive was 4x slower

Jeremy Davis-3
Thanks for all the feedback, I'll be back in 2 weeks and pick up then.
-JD

On Tue, Aug 10, 2010 at 3:45 PM, Peter Schuller <[hidden email]> wrote:
> Yeah, it has a BBU, and it is charged and on..
> Very odd behavior, I'm stumped.

I advise double-checking raid volume settings and ensuring that policy
is truly such that the write cache is used. This may also be a
function of kernel driver settings depending on what RAID
controller/kernel version you have (for example make sure that an
fsync() doesn't result in asking the RAID controller to flush caches
regardless of BBU state). In any case, these stats:

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s
avgrq-sz avgqu-sz   await  svctm  %util
cciss/c0d0        0.00   908.50    0.00  110.50     0.00  8152.00
73.77     0.57    5.20   4.93  54.50

... are highly inconsistent with write-back caching under the
assumption that the writes are indeed the sequential writes to the
commit log. Write counts in the ~ 100/second ballpark, with no reads,
average request size of 74, an average transaction time of 4.93 and a
utilization of 54% *really really* sounds like the I/O is going all
the way down to the platter. Either that or the RAID firmware/driver
is not doing its job properly.

I've attached a small script (note: UGLY hack since I just whipped it
up, not a proper tool, but it does the job for this) that you can run
to test it:

  ./writelat.py /path/to/file_to_write_to # warning, file will be
truncated/destroyed if it exists

If you run this on an idle system and you're truly doing write-back
caching, you should see numbers BELOW 1 (i.e., sub-millisecond times)
(but you can ignore the first sample probably).

It writes one 8kb chunk of data, fsync:s, writes another 8kb, fsyncs,
etc. I predict you'll see numbers in the 3-7 millisecond range.

Sample output for me with a plain old SATA drive (and ZFS on FreeBSD) follows.

33.8141918182
8.32605361938
8.44812393188
8.44788551331
8.40210914612
8.4490776062
8.38303565979
8.57901573181
8.20922851562
8.21614265442
10.0581645966
8.37683677673
8.50605964661
8.376121521
9.86790657043
8.43715667725
8.45789909363
8.40520858765
8.4171295166
8.46195220947
8.41498374939
8.46099853516
8.44287872314
8.43000411987
8.455991745

--
/ Peter Schuller