Cassandra to store 1 billion small 64KB Blobs


Cassandra to store 1 billion small 64KB Blobs

Michael Widmann
Hi

We plan to use Cassandra as a data store on at least 2 nodes with RF=2 for about 1 billion small files.
We have about 48TB of disk space behind each node.

Now my question is: is this possible with Cassandra, and is it reliable (meaning every blob is stored on 2 JBODs)?

We may grow to nearly 40TB or more of Cassandra "storage" data ...

Has anyone out there done something similar?

For retrieval, the blobs will be indexed by a hash value (i.e. the hashes are used as the keys under which the blobs are stored),
so we can search quickly for an entry in the database and combine the blobs back into a normal file.

Thanks for any answers,

michael
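
As an illustration of the scheme described above (not from the original mail), here is a minimal Python sketch of splitting a file into 64KB blobs, keying each blob by its hash, and keeping an ordered hash list per file so it can be reassembled. The choice of SHA-256 is an assumption; the thread never names a hash function, and the two dicts merely stand in for whatever Cassandra column families end up holding the data.

```python
import hashlib

CHUNK_SIZE = 64 * 1024  # 64KB blobs, as described in the thread

def split_into_blobs(path):
    """Yield (hash, blob) pairs, one per 64KB chunk of the file."""
    with open(path, "rb") as f:
        while True:
            blob = f.read(CHUNK_SIZE)
            if not blob:
                break
            yield hashlib.sha256(blob).hexdigest(), blob

def store_file(path, blob_store, file_index):
    """Store blobs keyed by hash; remember the ordered hash list for the file."""
    hashes = []
    for h, blob in split_into_blobs(path):
        blob_store.setdefault(h, blob)   # an identical blob is stored only once
        hashes.append(h)
    file_index[path] = hashes            # ordered list needed for reassembly

def restore_file(path, blob_store, file_index):
    """Reassemble the original bytes from the stored blobs."""
    return b"".join(blob_store[h] for h in file_index[path])
```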

Re: Cassandra to store 1 billion small 64KB Blobs

Jonathan Shook
There are two scaling factors to consider here. In general the worst-case
growth of operations in Cassandra is kept near O(log2(N)). Any worse growth
would be considered a design problem, or at least a high-priority target for
improvement. This is important when considering the load generated by very
large column families, as a binary search is used when the bloom filter
doesn't exclude rows from a query. O(log2(N)) is basically the best
achievable growth for this type of data, but the bloom filter improves on it
in some cases, at the price of a small constant cost on every lookup.

The other factor to be aware of is the reduction of binary search
performance for datasets which can put disk seek times into high
ranges. This is mostly a direct consideration for those installations
which will be doing lots of cold reads (not cached data) against large
sets. Disk seek times are much more limited (low) for adjacent or near
tracks, and generally much higher when tracks are sufficiently far
apart (as in a very large data set). This can compound with other
factors when session times are longer, but that is to be expected with
any system. Your storage system may have completely different
characteristics depending on caching, etc.

The read performance is still quite high relative to other systems for
a similar data set size, but the drop-off in performance may be much
steeper than expected if you assume it will be linear. Again, this is
not unique to Cassandra. It's just an important consideration when
dealing with extremely large sets of data, when memory is not likely
to be able to hold enough hot data for the specific application.

As always, the real questions have lots more to do with your specific
access patterns, storage system, etc. I would look at the benchmarking
info available on the lists as a good starting point.
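
To make the growth argument concrete, here is an editorial sketch (not Cassandra's actual code) of the lookup pattern described: a bloom-filter check that rejects most missing keys at roughly constant cost, followed by a binary search over a sorted key index whose cost grows with log2 of the number of keys.

```python
import hashlib
from bisect import bisect_left

class TinyBloom:
    """Deliberately simplistic bloom filter, for illustration only."""
    def __init__(self, nbits=1 << 20, nhashes=3):
        self.nbits, self.nhashes = nbits, nhashes
        self.bits = bytearray(nbits // 8)

    def _positions(self, key):
        for i in range(self.nhashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.nbits

    def add(self, key):
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, key):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(key))

def lookup(key, bloom, sorted_keys):
    """Constant-cost bloom check first; only possible hits pay the O(log2 N) search."""
    if not bloom.might_contain(key):
        return False
    i = bisect_left(sorted_keys, key)
    return i < len(sorted_keys) and sorted_keys[i] == key
```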


Re: Cassandra to store 1 billion small 64KB Blobs

Peter Schuller
In reply to this post by Michael Widmann

Other than what Jonathan Shook mentioned, I'd expect one potential
problem to be the number of sstables. At 40 TB, the larger compactions
are going to take quite some time. How many memtables will be flushed
to disk during the time it takes to perform a ~ 40 TB compaction? That
may or may not be an issue depending on how fast writes will happen,
how large your memtables are (the bigger the better) and what your
reads will look like.

(This relates to another thread where I posted about concurrent
compaction, but right now Cassandra only does a single compaction at a
time.)

--
/ Peter Schuller
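
A back-of-envelope way to think about this question (every number below is an invented assumption, not a measurement from the thread): given an assumed compaction throughput and write rate, you can estimate how long a ~40 TB compaction runs and how many memtable flushes accumulate while it does.

```python
# All figures are assumptions for illustration only.
data_tb = 40                  # data rewritten by the large compaction
compact_mb_per_s = 50         # assumed sustained compaction throughput
write_mb_per_s = 10           # assumed incoming write rate during compaction
memtable_flush_mb = 128       # assumed memtable size at flush time

compact_seconds = data_tb * 1024 * 1024 / compact_mb_per_s
flushes = write_mb_per_s * compact_seconds / memtable_flush_mb

print(f"compaction runs for roughly {compact_seconds / 86400:.1f} days")
print(f"about {flushes:.0f} new sstables are flushed in that window")
```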

Re: Cassandra to store 1 billion small 64KB Blobs

Michael Widmann
In reply to this post by Jonathan Shook
Hi Jonathan

Thanks for your very valuable input on this.

Maybe I didn't give enough explanation, so I'll try to clarify.

Here are some thoughts:

  • Binary data will not be indexed - only stored.
  • The file name of the binary data (a hash) should be indexed for search.
  • We could group the hashes into 62 "entry" points (a-z, A-Z, 0-9) for search and retrieval -> I think these would be super columns (if I have the terminology right).
  • The 64K blobs' metadata (which blob belongs to which file) should be stored separately in Cassandra.
  • For hardware we rely on Solaris/OpenSolaris with ZFS in the backend.
  • Write operations occur much more often than reads.
  • Memory should mainly hold the hash values for fast search (not the binary data).
  • Read operations (restore from Cassandra) may be async - get about 1000 blobs, group them, restore.
So my questions are also:

2 or 3 big boxes, or 10 to 20 small boxes for storage?
Could we separate "caching" - hash value CFs cached and indexed, binary data CFs not?
Writes happen around the clock - not at tremendous speed, but constantly.
Would compaction of the database need a lot of disk space?
Is it reliable at this size? (That is my bigger fear.)

Thanks for thinking about this and for your answers.

greetings

Mike

--
bayoda.com - Professional Online Backup Solutions for Small and Medium Sized Companies
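
As an aside (not part of the original mail): the "62 entry points" idea amounts to bucketing hash strings by their first character. A minimal sketch of that grouping, assuming the hashes are plain alphanumeric strings:

```python
from collections import defaultdict

def bucket_hashes(hashes):
    """Group hash strings by their first character (a-z, A-Z, 0-9)."""
    buckets = defaultdict(list)
    for h in hashes:
        buckets[h[0]].append(h)
    return buckets

groups = bucket_hashes(["abc123", "a9f0e1", "Zf00d5"])
# groups == {'a': ['abc123', 'a9f0e1'], 'Z': ['Zf00d5']}
```

Whether such buckets are modelled as super columns, separate rows, or just key prefixes is exactly what the later replies discuss.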

Re: Cassandra to store 1 billion small 64KB Blobs

Michael Widmann
In reply to this post by Peter Schuller
Hi Peter

We are trying to figure out how much data will be coming into Cassandra once we're in full operation mode.

Reads depend more on the hash values (the file names) of the binary blobs than on the binary data itself.
We will try to store the hash values "grouped" based on their first byte (a-z, A-Z, 0-9).
Writes will sometimes be very fast (it depends on the workload and the clients writing to the system).

Question: is concurrent compaction planned for the future?

Mike

--
bayoda.com - Professional Online Backup Solutions for Small and Medium Sized Companies

Re: Cassandra to store 1 billion small 64KB Blobs

aaron morton
In reply to this post by Michael Widmann
For what it's worth...

* Many smaller boxes with local disk storage are preferable to 2 with huge NAS storage.
* To cache the hash values look at the KeysCached setting in the storage-config
* There are some row size limits see http://wiki.apache.org/cassandra/CassandraLimitations
* If you want to get 1000 blobs, rather than grouping them in a single row using a super column, consider building a secondary index in a standard column family: one CF for the blobs keyed by your hash, and one CF keyed by whatever the grouping key is, with a column for every blob's hash value. Read from the index first, then from the blobs themselves.

Aaron
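
The last bullet describes a two-step read: fetch the index row first, then the blobs it names. A hedged sketch of that access pattern, with a dict-backed stand-in for the column families (the get_row/multiget helpers are illustrative, not a real client API):

```python
class FakeCF(dict):
    """Dict-backed stand-in for a column family; real code would go through a Cassandra client."""
    def get_row(self, key):
        return self[key]
    def multiget(self, keys):
        return {k: self[k] for k in keys}

blob_cf = FakeCF({"ABC": b"\x00" * 64, "DEF": b"\x01" * 64})   # row key = blob hash
index_cf = FakeCF({"file-1": {"ABC": "", "DEF": ""}})          # row key = grouping key

def fetch_group(group_key):
    """Read the index row first, then bulk-fetch the blobs it lists."""
    hashes = list(index_cf.get_row(group_key).keys())
    return blob_cf.multiget(hashes)

blobs = fetch_group("file-1")   # {'ABC': ..., 'DEF': ...}
```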


Re: Cassandra to store 1 billion small 64KB Blobs

Michael Widmann
Thanks for this detailed description ...

You mentioned the secondary index in a standard column family - would it be better to build several indexes?
Is it even possible to build an index on, for example, 32 columns?

The hint with the smaller boxes is very valuable!

Mike

--
bayoda.com - Professional Online Backup Solutions for Small and Medium Sized Companies

Re: Cassandra to store 1 billion small 64KB Blobs

aaron morton
Some background reading: http://ria101.wordpress.com/2010/02/22/cassandra-randompartitioner-vs-orderpreservingpartitioner/

Not sure about your follow-up question, so I'll just wildly blather on about things :)

My assumption about your data is that you have 64K chunks identified by a hash, which can somehow be grouped together into larger files (so there is a "file name" of sorts).

One possible storage design (assuming the Random Partitioner) is...

A Chunks CF: each row uses the hash of the chunk as its key and has a single column with the chunk data. You could use more columns to store metadata here.

A ChunkIndex CF: each row uses the file name (from above) as the key and has one column for each chunk in the file. The column name *could* be the offset of the chunk and the column value the hash of the chunk. Or you could use the chunk hash as the col name and the offset as the col value if needed.

To rebuild a file, read the entire row from ChunkIndex, then make a series of multigets to read all the chunks. Or you could lazily fetch only the ones you need.

This all assumes that the "1000s" comment below means you may want to combine the chunks into 60+ MB files. It would be easier to keep all the chunks for a file together in one row, but if you are going to have large (unbounded) file sizes this may not be appropriate.

You could also think about using the order-preserving partitioner and a compound key for each row such as "file_name_hash.offset". Then, by using get_range_slices to scan the range of chunks for a file, you would not need to maintain a secondary index. There are some drawbacks to that approach; read the article above.

Hope that helps
Aaron
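
Restating the Chunks/ChunkIndex design above as data (an editorial sketch using plain dicts; in practice these would be two column families and the rebuild loop would be a multiget through your client library):

```python
import hashlib

# Chunks CF: row key = chunk hash, a single 'data' column holding the chunk bytes.
chunks = {}
# ChunkIndex CF: row key = file name, column name = offset, column value = chunk hash.
chunk_index = {}

def store(file_name, data, chunk_size=64 * 1024):
    row = {}
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]
        h = hashlib.sha256(chunk).hexdigest()
        chunks.setdefault(h, {"data": chunk})   # shared chunks stored once
        row[offset] = h
    chunk_index[file_name] = row

def rebuild(file_name):
    """Read the whole ChunkIndex row, then fetch each chunk (a multiget in practice)."""
    row = chunk_index[file_name]
    return b"".join(chunks[row[offset]]["data"] for offset in sorted(row))

store("customerABC/file123", b"x" * 200_000)
assert rebuild("customerABC/file123") == b"x" * 200_000
```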



Re: Cassandra to store 1 billion small 64KB Blobs

Michael Widmann
Hi

Wow, that was a lot of information...

Think of users storing files online (under their customer name) - each customer maintains his own "hashtable" of files. Each file can consist of a few or several thousand entries (depending on the size of the whole file).

For example:

The file Test.doc consists of 3 x 64K blobs, and each blob has the hash value ABC - so we store only one blob, one hash value, and a note that this blob is needed 3 times. That way we try to avoid duplicate file data (and there are a lot of duplicates).

So our model is:

Filename:
Test.doc
Hashes: 3
Hash 1: ABC
Hash 2: ABC
Hash 3: ABC
BLOB:ABC
 Used: 3
 Binary: Data 1*

Each customer does have:

Customer:  Customer:ID
  
Filename: Test.doc 
 MetaData (Path / Accessed / modified / Size / compression / OS-Type from / Security)
 Version: 0
 Hash 1 / Hash 2 / Hash 3

Filename: Another.doc
 
  MetaData (Path / Accessed / modified / Size / compression / OS-Type from / Security)
  Version: 0
  Hash 1 / Hash 2 / Hash 3
  Version: 1
  Hash 1 / Hash 2 / Hash 4
  Version: 2
  Hash 3 / Hash 2 / Hash 2

Hope this clears some things up :-)

Mike
 

--
bayoda.com - Professional Online Backup Solutions for Small and Medium Sized Companies
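
Michael's Test.doc example above boils down to content-addressed storage with a usage count: the blob ABC is stored once and referenced three times by the file's hash list. A small sketch of that bookkeeping (the dicts are stand-ins for the column families discussed in the thread):

```python
from collections import Counter

blobs = {}          # hash -> binary data, stored only once
usage = Counter()   # hash -> number of references to the blob
files = {}          # (customer, file name, version) -> ordered list of hashes

def add_file(customer, name, version, hash_blob_pairs):
    hashes = []
    for h, data in hash_blob_pairs:
        blobs.setdefault(h, data)   # duplicate blobs are not stored again
        usage[h] += 1
        hashes.append(h)
    files[(customer, name, version)] = hashes

# Test.doc: three 64K blobs that all hash to "ABC" -> one stored blob, used 3 times.
add_file("customer1", "Test.doc", 0, [("ABC", b"..."), ("ABC", b"..."), ("ABC", b"...")])
assert len(blobs) == 1 and usage["ABC"] == 3
```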

Re: Cassandra to store 1 billion small 64KB Blobs

aaron morton
I see - I got carried away thinking about it, so here are some thoughts...

Your access patterns will determine the best storage design, so what follows is probably not the best solution. I would welcome thoughts from others.

=> Standard CF: Chunks

* key is chunk hash
* a col named 'data' whose value is the chunk data
* another col named "hash" whose value is the hash
* could also store access data against the individual chunks here
* probably cache lots of keys, and only a few rows

- To test whether the hashes you have already exist, do a multiget slice. But you need to specify a column to return the value for, and not the big data one - so use the hash column.

=> Standard CF: CustomerFiles

* key is customer id 
* col per file name, col value perhaps latest version number or last accessed
* cache keys and rows

- To get all the files for a customer, slice the customer's row.
- If you want access data when you list all the files for a customer, use a super CF with a super col for each file name. Store the metadata for the file in this CF *and* in the Files CF.

=> Super CF: Files 

* key is client_id.file_name
* super column called "meta" with path / accessed etc. including versions
* super column called "current" with columns named "0001" and col values as the chunk hash
* super column called "version.X" with the same format as above for current
* cache keys and rows

- assumes meta is shared across versions 
- To rebuild a file, get_slice for all cols in the "current" super col and then do multigets for the chunks.
- The row grows as the versions of a file grow, but it only stores links to the chunks, so it's probably OK. Consider how many versions times how many chunks; you may then want to make the number of rows grow instead of the row size. Perhaps have a FileVersions CF where the key includes the version number, then maintain information about the current version in both the Files and FileVersions CFs. The Files CF would only ever have the current file; update access metadata in both CFs.

=> Standard CF: ChunkUsage 

* key is chunk hash
* col name is a versioned file key (cust_id.file_name.version), col value is the count of usage in that version
* no caching 

- You cannot increment a counter until Cassandra 0.7, so you cannot keep a count of chunk usage.
- This is the reverse index to see where a chunk is used.
- It does not consider what is current, just all-time usage.
- Not sure how much re-use there is, but row size grows with reuse. Should be OK for a couple of million cols.
 

Oh, and if you're going to use Hadoop/Pig to analyse the data in this beastie, you need to think about that in the design. You'll probably want a single CF to serve the queries from.

Hope that helps
Aaron
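
One detail of the Chunks CF above worth spelling out (again an editorial sketch with dict stand-ins rather than driver calls): the small duplicate "hash" column exists so that an existence check can slice only that column and never pull the 64K "data" column.

```python
chunks = {
    # row key = chunk hash; 'data' is the 64K payload, 'hash' a tiny copy of the key
    "ABC": {"data": b"\x00" * 65536, "hash": "ABC"},
    "DEF": {"data": b"\x01" * 65536, "hash": "DEF"},
}

def existing_hashes(candidates, columns=("hash",)):
    """Multiget-slice style check: only the named (small) columns are read back."""
    found = {}
    for key in candidates:
        row = chunks.get(key)
        if row:
            found[key] = {c: row[c] for c in columns if c in row}
    return found

print(existing_hashes(["ABC", "XYZ"]))   # {'ABC': {'hash': 'ABC'}} - XYZ is not stored yet
```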




Re: Cassandra to store 1 billion small 64KB Blobs

Michael Widmann
Okay, that really tied a knot in my brain - it's twisting a little bit now.
I'll have to draw it on the whiteboard to understand it better, but I've seen some very interesting cornerstones in your answer
for our project.

Really, thanks a lot,
mike

2010/7/26 aaron morton <[hidden email]>
I see, got carried away thinking about it so here are some thoughts.... 

Your access patterns will determine the best storage design, so it's probably not the best solution. I would welcome thoughts from others. 

=> Standard CF: Chunks

* key is chunk hash
* col named 'data' col value is chunk data 
* another col with name "hash" and value of the hash. 
* could also store access data against the individual chunks here 
* probably cache lots of keys, and a only a few rows 

- to test if the hashes you have already exist do a multi get slice . But you need to specify a col to return the value for, and not the big data one. So use the hash column. 

=> Standard CF: CustomerFiles

* key is customer id 
* col per file name, col value perhaps latest version number or last accessed
* cache keys and rows

- to get all the files for a customer slice the customers row
- if you want access data when you list all files for a customer, use a super CF with a super col for each file name. Store the meta data for the file in this CF *and* in the Files CF

=> Super CF: Files 

* key is client_id.file_name
* super column called "meta" with path / accessed etc. including versions
* super column called "current" with columns named "0001" and col values as the chunk hash
* super column called "version.X" with the same format as above for current
* cache keys and rows

- assumes meta is shared across versions 
- to rebuild a file get_slice for all cols in the "current" super col and then do multi gets for the chunks  
- row grows as versions of file grows, but only storing links to the chunks so probably OK. Consider how many versions X how many chunks, then may way to make the number of rows grow instead of row size. Perhaps have a FileVersions CF where the key includes the version number, then maintain information about the current version in both Files and FileVersions CFs. Files CF would only ever have the current file, update access meta data in both CF's.

=> Standard CF: ChunkUsage 

* key is chunk hash
* col name is a versioned file key cust_id.file_name.version col value is the count of usage in that version
* no caching 

- cannot increase a counter until cassandra 0.7, so cannot keep a count of chunk usage.
- this is the reverse index to see where the chunk is used. 
- does not consider what is current, just all time usage. 
- not sure how much re-use their is, but row size grows with reuse. Should be ok for couple of million cols.
 

Oh and if your going to use Hadoop / PIG to analyse the data in this beastie you need to think about that in the design. You'll probably want single CF to serve the queries with.

Hope that helps
Aaron


On 26 Jul 2010, at 17:29, Michael Widmann wrote:

Hi

Wow that was lot of information...

Think about users storing files online (means with their customer name) - each customer maintains his own "hashtable" of files. Each File can consist of some or several thousand entries (depends on the size of the whole file).

for example:

File Test.doc  consists of 3 * 64K Blobs  - each blob does have the hash value ABC - so we will only store one blob, one hash value and the entry that this blob is needed 3 times, so we try to avoid duplicate file data (and there's a lot of duplicates)

Means our modell is:

Filename:
Test.doc
Hashes: 3
Hash 1: ABC
Hash 2: ABC
Hash 3: ABC
BLOB:ABC
 Used: 3
 Binary: Data 1*

Each customer does have:

Customer:  Customer:ID
  
Filename: Test.doc 
 MetaData (Path / Accessed / modified / Size / compression / OS-Type from / Security)
 Version: 0
 Hash 1 / Hash 2 / Hash 3

Filename: Another.doc
 
  MetaData (Path / Accessed / modified / Size / compression / OS-Type from / Security)
  Version: 0
  Hash 1 / Hash 2 / Hash 3
  Version: 1
  Hash 1 / Hash 2 / Hash 4
  Version: 2
  Hash 3 / Hash 2 / Hash 2

Hope this clear some things :-) 

Mike
 

2010/7/26 Aaron Morton <[hidden email]>
Some background reading.. http://ria101.wordpress.com/2010/02/22/cassandra-randompartitioner-vs-orderpreservingpartitioner/

Not sure on your follow up question, so I'll just wildly blather on about things :)

My assumption of your data is you have 64K chunks that are identified by a hash, which can somehow be grouped together into larger files (so there is a "file name" of sorts).

One possible storage design (assuming the Random Partitioner) is....

A Chunks CF, each row in this CF uses the hash of the chunk as it's key and has is a single column with the chunk data. You could use more columns to store meta here.

A ChunkIndex CF, each row uses the file name (from above) as the key and has one column for each chunk in the file. The column name *could* be an offset for the chunk and the column value could be the hash for the chunk. Or you could use the chunk hash as the col name and the offset as the col value if needed.

To rebuild the file read the entire row from the ChunkIndex, then make a series of multi gets to read all the chunks. Or you could lazy populate the ones you needed.

This is all assuming that the 1000's comment below means you could want to combine the chunks  60+ MB chunks. It would be easier to keep all the chunks together in one row, if you are going to have large (unbounded) file size this may not be appropriate.

You could also think about using the order preserving partitioner, and using a compound key for each row such as "file_name_hash.offset" . Then by using the get_range_slices to scan the range of chunks for a file you would not need to maintain a secondary index. Some drawbacks to that approach, read the article above.

Hope the helps
Aaron



On 26 Jul, 2010,at 04:01 PM, Michael Widmann <[hidden email]> wrote:

Thanks for this detailed description ...

You mentioned the secondary index in a standard column, would it be better to build several indizes?
Is that even possible to build a index on for example 32 columns?

The hint with the smaller boxes is very valuable!

Mike

2010/7/26 Aaron Morton <[hidden email]>
For what it's worth...

* Many smaller boxes with local disk storage are preferable to 2 with huge NAS storage.
* To cache the hash values look at the KeysCached setting in the storage-config
* There are some row size limits see http://wiki.apache.org/cassandra/CassandraLimitations
* If you wanted to get 1000 blobs, rather then group them in a single row using a super column consider building a secondary index in a standard column. One CF for the blobs using your hash, one CF that uses whatever they grouping key is with a col for every blobs hash value. Read from the index first, then from the blobs themselves.

Aaron



On 24 Jul, 2010,at 06:51 PM, Michael Widmann <[hidden email]> wrote:

Hi Jonathan

Thanks for your very valuable input on this.

I maybe didn't enough explanation - so I'll try to clarify

Here are some thoughts:

  • binary data will not be indexed - only stored. 
  • The file name to the binary data (a hash) should be indexed for search
  • We could group the hashes in 62 "entry" points for search retrieving -> i think suprcolumns (If I'm right in terms) (a-z,A_Z,0-9)
  • the 64k Blobs meta data (which one belong to which file) should be stored separate in cassandra
  • For Hardware we rely on solaris / opensolaris with ZFS in the backend
  • Write operations occur much more often than reads
  • Memory should hold the hash values mainly for fast search (not the binary data)
  • Read Operations (restore from cassandra) may be async - (get about 1000 Blobs) - group them restore
So my question is too: 

2 or 3 Big boxes or 10 till 20 small boxes for storage...
Could we separate "caching" - hash values CFs cashed and indexed - binary data CFs not ...
Writes happens around the clock - on not that tremor speed but constantly
Would compaction of the database need really much disk space
Is it reliable on this size (more my fear)

thx for thinking and answers...

greetings

Mike

2010/7/23 Jonathan Shook <[hidden email]>
There are two scaling factors to consider here. In general the worst
case growth of operations in Cassandra is kept near to O(log2(N)). Any
worse growth would be considered a design problem, or at least a high
priority target for improvement.  This is important for considering
the load generated by very large column families, as binary search is
used when the bloom filter doesn't exclude rows from a query.
O(log2(N)) is basically the best achievable growth for this type of
data, but the bloom filter improves on it in some cases by paying a
lower cost every time.

The other factor to be aware of is the reduction of binary search
performance for datasets which can put disk seek times into high
ranges. This is mostly a direct consideration for those installations
which will be doing lots of cold reads (not cached data) against large
sets. Disk seek times are much more limited (low) for adjacent or near
tracks, and generally much higher when tracks are sufficiently far
apart (as in a very large data set). This can compound with other
factors when session times are longer, but that is to be expected with
any system. Your storage system may have completely different
characteristics depending on caching, etc.

The read performance is still quite high relative to other systems for
a similar data set size, but the drop-off in performance may be much
worse than expected if you are wanting it to be linear. Again, this is
not unique to Cassandra. It's just an important consideration when
dealing with extremely large sets of data, when memory is not likely
to be able to hold enough hot data for the specific application.

As always, the real questions have lots more to do with your specific
access patterns, storage system, etc I would look at the benchmarking
info available on the lists as a good starting point.


On Fri, Jul 23, 2010 at 11:51 AM, Michael Widmann
<[hidden email]> wrote:
> Hi
>
> We plan to use cassandra as a data storage on at least 2 nodes with RF=2
> for about 1 billion small files.
> We do have about 48TB discspace behind for each node.
>
> now my question is - is this possible with cassandra - reliable - means
> (every blob is stored on 2 jbods)..
>
> we may grow up to nearly 40TB or more on cassandra "storage" data ...
>
> anyone out did something similar?
>
> for retrieval of the blobs we are going to index them with an hashvalue
> (means hashes are used to store the blob) ...
> so we can search fast for the entry in the database and combine the blobs to
> a normal file again ...
>
> thanks for answer
>
> michael
>



--
bayoda.com - Professional Online Backup Solutions for Small and Medium Sized Companies

Re: Cassandra to store 1 billion small 64KB Blobs

Jonathan Shook
Some possibilities open up when using OPP, especially with aggregate
keys. This is more of an option when RF==cluster size, but not
necessarily a good reason to make RF=cluster size if you haven't
already.

For example, ':' and ';' make good boundary markers in aggregate keys,
since they are already in ordinal order and not usually part of key
names or hash values.
If you structure your chunks like so:

rowkey: <custname>:<filename>
  colname <chunk id/name>
  coldata <blob>

... then you can use the CF/row structure as a map. Asking for all
rows between "customerABC:" and "customerABC;" will yield all files
for ABC. Asking for all rows between 'customerABC:file12' and
'customerABC;' would yield all of ABC's files whose names sort at or
after 'file12', and so forth. One drawback is that you can't do searches on file name
endings unless you iterate through the set of all files for that
customer. This is where the secondary indexes may be better for
describing your files. On the other hand, if you know the name of the
file (via secondary index, ...) having a structure like this would
potentially reduce your round-trips through Cassandra to get or store
it.
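
As a plain-Python illustration of those boundary markers - the (start, finish) pairs are what you would hand to get_range_slices; nothing here actually talks to Cassandra:

# Build the key range for an OPP scan using ':' and ';' as boundaries.
def all_files_for_customer(cust):
    # ':' (0x3A) sorts just below ';' (0x3B), so this pair brackets every
    # row key of the form '<cust>:<filename>'.
    return cust + ':', cust + ';'

def files_from(cust, filename):
    # Every file for <cust> whose name sorts at or after <filename>.
    return cust + ':' + filename, cust + ';'

print(all_files_for_customer('customerABC'))  # ('customerABC:', 'customerABC;')
print(files_from('customerABC', 'file12'))    # ('customerABC:file12', 'customerABC;')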


If you need another level of separation for row-length management, you
can add another layer. A good 3rd layer would be the name/id of the
first column in each row. I'd personally choose to store it in UTF-8
encoded hex. Your column names can simply be the first-byte offset of
each chunk/column for reassembly or stream control/verification.

64MB per row, 1MB columns
customerABC:file123:00000000 (colnames: 00000000, 00100000, 00200000, ...)
customerABC:file123:04000000 (colnames: 04000000, 04100000, ... )
if 0xFFFFFFFF is not enough for the file size (4,294,967,295), then
you can start with 10 or 12 digits instead (up to 2.8e+14)
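
A small helper to show how a byte offset maps onto that layout (plain Python; the customer and file names are just placeholders):

# Map a byte offset to a row key and column name under the
# 64MB-row / 1MB-column scheme above (zero-padded uppercase hex).
ROW_BYTES = 64 * 1024 * 1024   # 64MB per row
COL_BYTES = 1024 * 1024        # 1MB per column

def chunk_location(cust, fname, offset):
    row_start = (offset // ROW_BYTES) * ROW_BYTES
    col_start = (offset // COL_BYTES) * COL_BYTES
    return '%s:%s:%08X' % (cust, fname, row_start), '%08X' % col_start

# The 5MB mark of file123 lands in the first 64MB row, in the column
# that starts at 0x00500000.
print(chunk_location('customerABC', 'file123', 5 * 1024 * 1024))
# ('customerABC:file123:00000000', '00500000')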

If you needed to add metadata to chunk groups/chunks, you can use
column names which are disjoint from '0'-'F', as long as your API
knows how to set your predicates up likewise. If there is at least one
column name which is dependable in each chunk row, then you can use it
as your predicate for "what's out there" queries. This avoids loading
column data for the chunks when looking up names (row/file/... names).
On the other hand, if you use an empty predicate, there is not an easy
way to avoid tombstone rows unless you make another trip to Cassandra
to verify.
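
And a sketch of that "dependable column" query, again assuming pycassa plus a made-up sentinel column name 'meta:name' (anything outside '0'-'F' stays disjoint from the hex chunk offsets):

# Ask only for the sentinel column, so the 1MB chunk columns never come back.
import pycassa

pool = pycassa.ConnectionPool('Backup', ['127.0.0.1:9160'])
chunk_rows = pycassa.ColumnFamily(pool, 'FileChunks')

def list_chunk_rows(start_key, end_key):
    # Only the sentinel column is requested for each row in the key range.
    for key, cols in chunk_rows.get_range(start=start_key,
                                          finish=end_key,
                                          columns=['meta:name']):
        yield key, cols.get('meta:name')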

This is just more to consider. I'm not quite sure of the trade-offs
here. I'd be curious of others' opinions on it.

On Mon, Jul 26, 2010 at 10:06 AM, Michael Widmann
<[hidden email]> wrote:

> Okay. That really made a knot in my brain - it twists a little bit now.
> I'll have to draw that on the whiteboard to understand it better ... but I've
> seen some very interesting cornerstones in your answer
> for our project.
>
> really thanks a lot
> mike
>
> 2010/7/26 aaron morton <[hidden email]>
>>
>> I see, got carried away thinking about it so here are some thoughts....
>> Your access patterns will determine the best storage design, so it's
>> probably not the best solution. I would welcome thoughts from others.
>> => Standard CF: Chunks
>> * key is chunk hash
>> * col named 'data' col value is chunk data
>> * another col with name "hash" and value of the hash.
>> * could also store access data against the individual chunks here
>> * probably cache lots of keys, and only a few rows
>> - to test whether the hashes you have already exist, do a multiget slice. But
>> you need to specify a col to return the value for, and not the big data one.
>> So use the hash column.
>> => Standard CF: CustomerFiles
>> * key is customer id
>> * col per file name, col value perhaps latest version number or last
>> accessed
>> * cache keys and rows
>> - to get all the files for a customer slice the customers row
>> - if you want access data when you list all files for a customer, use a
>> super CF with a super col for each file name. Store the meta data for the
>> file in this CF *and* in the Files CF
>> => Super CF: Files
>> * key is client_id.file_name
>> * super column called "meta" with path / accessed etc. including versions
>> * super column called "current" with columns named "0001" and col values
>> as the chunk hash
>> * super column called "version.X" with the same format as above for
>> current
>> * cache keys and rows
>> - assumes meta is shared across versions
>> - to rebuild a file get_slice for all cols in the "current" super col and
>> then do multi gets for the chunks
>> - the row grows as the number of file versions grows, but it only stores links
>> to the chunks so it's probably OK. Consider how many versions X how many chunks;
>> you may then want to make the number of rows grow instead of the row size. Perhaps
>> have a FileVersions CF where the key includes the version number, then maintain
>> information about the current version in both the Files and FileVersions CFs.
>> The Files CF would only ever have the current file; update access metadata in
>> both CFs.
>> => Standard CF: ChunkUsage
>> * key is chunk hash
>> * col name is a versioned file key cust_id.file_name.version col value is
>> the count of usage in that version
>> * no caching
>> - cannot increase a counter until cassandra 0.7, so cannot keep a count of
>> chunk usage.
>> - this is the reverse index to see where the chunk is used.
>> - does not consider what is current, just all time usage.
>> - not sure how much re-use there is, but row size grows with reuse. Should
>> be OK for a couple of million cols.
>>
>> Oh, and if you're going to use Hadoop / Pig to analyse the data in this
>> beastie you need to think about that in the design. You'll probably want a
>> single CF to serve the queries with.
>> Hope that helps
>> Aaron
>>
>> On 26 Jul 2010, at 17:29, Michael Widmann wrote:
>>
>> Hi
>>
>> Wow that was lot of information...
>>
>> Think about users storing files online (that is, under their customer name) -
>> each customer maintains his own "hashtable" of files. Each file can consist
>> of a few or several thousand entries (depending on the size of the whole file).
>>
>> for example:
>>
>> File Test.doc consists of 3 * 64K blobs - each blob has the hash
>> value ABC - so we will only store one blob, one hash value and the entry
>> that this blob is needed 3 times; that way we avoid duplicate file data
>> (and there are a lot of duplicates).
>>
>> So our model is:
>>
>> Filename:
>> Test.doc
>> Hashes: 3
>> Hash 1: ABC
>> Hash 2: ABC
>> Hash 3: ABC
>> BLOB:ABC
>>  Used: 3
>>  Binary: Data 1*
>>
>> Each customer does have:
>>
>> Customer:  Customer:ID
>>
>> Filename: Test.doc
>>  MetaData (Path / Accessed / modified / Size / compression / OS-Type from
>> / Security)
>>  Version: 0
>>  Hash 1 / Hash 2 / Hash 3
>>
>> Filename: Another.doc
>>
>>   MetaData (Path / Accessed / modified / Size / compression / OS-Type from
>> / Security)
>>   Version: 0
>>   Hash 1 / Hash 2 / Hash 3
>>   Version: 1
>>   Hash 1 / Hash 2 / Hash 4
>>   Version: 2
>>   Hash 3 / Hash 2 / Hash 2
>>
>> Hope this clear some things :-)
>>
>> Mike
>>
>>
>> 2010/7/26 Aaron Morton <[hidden email]>
>>>
>>> Some background reading..
>>> http://ria101.wordpress.com/2010/02/22/cassandra-randompartitioner-vs-orderpreservingpartitioner/
>>>
>>> Not sure on your follow up question, so I'll just wildly blather on about
>>> things :)
>>>
>>> My assumption of your data is you have 64K chunks that are identified by
>>> a hash, which can somehow be grouped together into larger files (so there is
>>> a "file name" of sorts).
>>>
>>> One possible storage design (assuming the Random Partitioner) is....
>>>
>>> A Chunks CF: each row in this CF uses the hash of the chunk as its key
>>> and has a single column with the chunk data. You could use more columns
>>> to store metadata here.
>>>
>>> A ChunkIndex CF, each row uses the file name (from above) as the key and
>>> has one column for each chunk in the file. The column name *could* be an
>>> offset for the chunk and the column value could be the hash for the chunk.
>>> Or you could use the chunk hash as the col name and the offset as the col
>>> value if needed.
>>>
>>> To rebuild the file read the entire row from the ChunkIndex, then make a
>>> series of multi gets to read all the chunks. Or you could lazy populate the
>>> ones you needed.
>>>
>>> This is all assuming that the 1000s comment below means you could want
>>> to combine the chunks into 60+ MB files. It would be easier to keep all the
>>> chunks together in one row, but if you are going to have large (unbounded) file
>>> sizes this may not be appropriate.
>>>
>>> You could also think about using the order preserving partitioner, and
>>> using a compound key for each row such as "file_name_hash.offset" . Then by
>>> using the get_range_slices to scan the range of chunks for a file you would
>>> not need to maintain a secondary index. Some drawbacks to that approach,
>>> read the article above.
>>>
>>> Hope that helps
>>> Aaron
>>>
>>>
>>> On 26 Jul, 2010,at 04:01 PM, Michael Widmann <[hidden email]>
>>> wrote:
>>>
>>> Thanks for this detailed description ...
>>>
>>> You mentioned the secondary index in a standard column - would it be
>>> better to build several indexes?
>>> Is it even possible to build an index on, for example, 32 columns?
>>>
>>> The hint with the smaller boxes is very valuable!
>>>
>>> Mike
>>>
>>> 2010/7/26 Aaron Morton <[hidden email]>
>>>>
>>>> For what it's worth...
>>>>
>>>> * Many smaller boxes with local disk storage are preferable to 2 with
>>>> huge NAS storage.
>>>> * To cache the hash values look at the KeysCached setting in the
>>>> storage-config
>>>> * There are some row size limits see
>>>> http://wiki.apache.org/cassandra/CassandraLimitations
>>>> * If you wanted to get 1000 blobs, rather then group them in a single
>>>> row using a super column consider building a secondary index in a standard
>>>> column. One CF for the blobs using your hash, one CF that uses whatever they
>>>> grouping key is with a col for every blobs hash value. Read from the index
>>>> first, then from the blobs themselves.
>>>>
>>>> Aaron
>>>>
>>>>
>>>> On 24 Jul, 2010,at 06:51 PM, Michael Widmann <[hidden email]>
>>>> wrote:
>>>>
>>>> Hi Jonathan
>>>>
>>>> Thanks for your very valuable input on this.
>>>>
>>>> I maybe didn't enough explanation - so I'll try to clarify
>>>>
>>>> Here are some thoughts:
>>>>
>>>> binary data will not be indexed - only stored.
>>>> The file name to the binary data (a hash) should be indexed for search
>>>> We could group the hashes in 62 "entry" points for search retrieving ->
>>>> i think suprcolumns (If I'm right in terms) (a-z,A_Z,0-9)
>>>> the 64k Blobs meta data (which one belong to which file) should be
>>>> stored separate in cassandra
>>>> For Hardware we rely on solaris / opensolaris with ZFS in the backend
>>>> Write operations occur much more often than reads
>>>> Memory should hold the hash values mainly for fast search (not the
>>>> binary data)
>>>> Read Operations (restore from cassandra) may be async - (get about 1000
>>>> Blobs) - group them restore
>>>>
>>>> So my question is too:
>>>>
>>>> 2 or 3 Big boxes or 10 till 20 small boxes for storage...
>>>> Could we separate "caching" - hash values CFs cashed and indexed -
>>>> binary data CFs not ...
>>>> Writes happens around the clock - on not that tremor speed but
>>>> constantly
>>>> Would compaction of the database need really much disk space
>>>> Is it reliable on this size (more my fear)
>>>>
>>>> thx for thinking and answers...
>>>>
>>>> greetings
>>>>
>>>> Mike
>>>>
>>>> 2010/7/23 Jonathan Shook <[hidden email]>
>>>>>
>>>>> There are two scaling factors to consider here. In general the worst
>>>>> case growth of operations in Cassandra is kept near to O(log2(N)). Any
>>>>> worse growth would be considered a design problem, or at least a high
>>>>> priority target for improvement.  This is important for considering
>>>>> the load generated by very large column families, as binary search is
>>>>> used when the bloom filter doesn't exclude rows from a query.
>>>>> O(log2(N)) is basically the best achievable growth for this type of
>>>>> data, but the bloom filter improves on it in some cases by paying a
>>>>> lower cost every time.
>>>>>
>>>>> The other factor to be aware of is the reduction of binary search
>>>>> performance for datasets which can put disk seek times into high
>>>>> ranges. This is mostly a direct consideration for those installations
>>>>> which will be doing lots of cold reads (not cached data) against large
>>>>> sets. Disk seek times are much more limited (low) for adjacent or near
>>>>> tracks, and generally much higher when tracks are sufficiently far
>>>>> apart (as in a very large data set). This can compound with other
>>>>> factors when session times are longer, but that is to be expected with
>>>>> any system. Your storage system may have completely different
>>>>> characteristics depending on caching, etc.
>>>>>
>>>>> The read performance is still quite high relative to other systems for
>>>>> a similar data set size, but the drop-off in performance may be much
>>>>> worse than expected if you are wanting it to be linear. Again, this is
>>>>> not unique to Cassandra. It's just an important consideration when
>>>>> dealing with extremely large sets of data, when memory is not likely
>>>>> to be able to hold enough hot data for the specific application.
>>>>>
>>>>> As always, the real questions have lots more to do with your specific
>>>>> access patterns, storage system, etc I would look at the benchmarking
>>>>> info available on the lists as a good starting point.
>>>>>
>>>>>
>>>>> On Fri, Jul 23, 2010 at 11:51 AM, Michael Widmann
>>>>> <[hidden email]> wrote:
>>>>> > Hi
>>>>> >
>>>>> > We plan to use cassandra as a data storage on at least 2 nodes with
>>>>> > RF=2
>>>>> > for about 1 billion small files.
>>>>> > We do have about 48TB discspace behind for each node.
>>>>> >
>>>>> > now my question is - is this possible with cassandra - reliable -
>>>>> > means
>>>>> > (every blob is stored on 2 jbods)..
>>>>> >
>>>>> > we may grow up to nearly 40TB or more on cassandra "storage" data ...
>>>>> >
>>>>> > anyone out did something similar?
>>>>> >
>>>>> > for retrieval of the blobs we are going to index them with an
>>>>> > hashvalue
>>>>> > (means hashes are used to store the blob) ...
>>>>> > so we can search fast for the entry in the database and combine the
>>>>> > blobs to
>>>>> > a normal file again ...
>>>>> >
>>>>> > thanks for answer
>>>>> >
>>>>> > michael
>>>>> >
>>>>
>>>>
>>>>
>>>> --
>>>> bayoda.com - Professional Online Backup Solutions for Small and Medium
>>>> Sized Companies
>>>
>>>
>>>
>>> --
>>> bayoda.com - Professional Online Backup Solutions for Small and Medium
>>> Sized Companies
>>
>>
>>
>> --
>> bayoda.com - Professional Online Backup Solutions for Small and Medium
>> Sized Companies
>>
>
>
>
> --
> bayoda.com - Professional Online Backup Solutions for Small and Medium Sized
> Companies
>

Re: Cassandra to store 1 billion small 64KB Blobs

aaron morton
> Some possibilities open up when using OPP, especially with aggregate
> keys. This is more of an option when RF==cluster size, but not
> necessarily a good reason to make RF=cluster size if you haven't
> already.

This use of the OPP sounds like the way Lucandra stores data; they
want to have range scans and some random key distribution.

http://github.com/tjake/Lucandra

See the hash_key() function in CassandraUtils.java for how they manually hash
the key before storing it in cassandra.
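
For anyone skimming, the general shape is something like the sketch below - a guess at the idea in plain Python, not Lucandra's actual hash_key() code:

# Prefix the row key with a hash of the part you want distributed; keys that
# share a prefix still sort together under the OPP, so range scans within one
# partition_part keep working.
import hashlib

def hash_key(partition_part, ordered_part):
    prefix = hashlib.md5(partition_part.encode('utf-8')).hexdigest()
    return prefix + ':' + partition_part + ':' + ordered_part

print(hash_key('customerABC', 'file123'))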


> 64MB per row, 1MB columns
> customerABC:file123:00000000 (colnames: 00000000, 00100000, 00200000, ...)
> customerABC:file123:04000000 (colnames: 04000000, 04100000, ... )
> if 0xFFFFFFFF is not enough for the file size (4,294,967,295), then
> you can start with 10 or 12 digits instead (up to 2.8e+14)

Grouping together chunks into larger groups/extents is an interesting idea.
You could have a 'read ahead' buffer. I'm sure somewhere in all these designs
there is a magical balance between row size and the number of rows. They were
saying chunks with the same hash should only be stored once though, so I'm not
sure if it's applicable in this case.

> If you needed to add metadata to chunk groups/chunks, you can use
> column names which are disjoint from '0'-'F', as long as your API
> knows how to set your predicates up likewise. If there is at least one
> column name which is dependable in each chunk row, then you can use it
> as your predicate for "what's out there" queries. This avoids loading
> column data for the chunks when looking up names (row/file/... names).
> On the other hand, if you use an empty predicate, there is not an easy
> way to avoid tombstone rows unless you make another trip to Cassandra
> to verify.

I've experimented with namespacing columns before, and found it easier to
use a super CF in the long run.

Cheers
Aaron
 


Re: Cassandra to store 1 billion small 64KB Blobs

Bryan Whitehead
In reply to this post by Michael Widmann
Just a warning about ZFS. If the plan is to use JBOD w/RAID-Z, don't.
3, 4, 5, ... or N disks in a RAID-Z array (using ZFS) will result in
read performance equivalent to only 1 disk.

Check out this blog entry:
http://blogs.sun.com/relling/entry/zfs_raid_recommendations_space_performance

The second chart and the section "The Parity Performance Rathole" are
both a must read.

On Fri, Jul 23, 2010 at 11:51 PM, Michael Widmann
<[hidden email]> wrote:

> Hi Jonathan
>
> Thanks for your very valuable input on this.
>
> I maybe didn't enough explanation - so I'll try to clarify
>
> Here are some thoughts:
>
> binary data will not be indexed - only stored.
> The file name to the binary data (a hash) should be indexed for search
> We could group the hashes in 62 "entry" points for search retrieving -> i
> think suprcolumns (If I'm right in terms) (a-z,A_Z,0-9)
> the 64k Blobs meta data (which one belong to which file) should be stored
> separate in cassandra
> For Hardware we rely on solaris / opensolaris with ZFS in the backend
> Write operations occur much more often than reads
> Memory should hold the hash values mainly for fast search (not the binary
> data)
> Read Operations (restore from cassandra) may be async - (get about 1000
> Blobs) - group them restore
>
> So my question is too:
>
> 2 or 3 Big boxes or 10 till 20 small boxes for storage...
> Could we separate "caching" - hash values CFs cashed and indexed - binary
> data CFs not ...
> Writes happens around the clock - on not that tremor speed but constantly
> Would compaction of the database need really much disk space
> Is it reliable on this size (more my fear)
>
> thx for thinking and answers...
>
> greetings
>
> Mike
>
> 2010/7/23 Jonathan Shook <[hidden email]>
>>
>> There are two scaling factors to consider here. In general the worst
>> case growth of operations in Cassandra is kept near to O(log2(N)). Any
>> worse growth would be considered a design problem, or at least a high
>> priority target for improvement.  This is important for considering
>> the load generated by very large column families, as binary search is
>> used when the bloom filter doesn't exclude rows from a query.
>> O(log2(N)) is basically the best achievable growth for this type of
>> data, but the bloom filter improves on it in some cases by paying a
>> lower cost every time.
>>
>> The other factor to be aware of is the reduction of binary search
>> performance for datasets which can put disk seek times into high
>> ranges. This is mostly a direct consideration for those installations
>> which will be doing lots of cold reads (not cached data) against large
>> sets. Disk seek times are much more limited (low) for adjacent or near
>> tracks, and generally much higher when tracks are sufficiently far
>> apart (as in a very large data set). This can compound with other
>> factors when session times are longer, but that is to be expected with
>> any system. Your storage system may have completely different
>> characteristics depending on caching, etc.
>>
>> The read performance is still quite high relative to other systems for
>> a similar data set size, but the drop-off in performance may be much
>> worse than expected if you are wanting it to be linear. Again, this is
>> not unique to Cassandra. It's just an important consideration when
>> dealing with extremely large sets of data, when memory is not likely
>> to be able to hold enough hot data for the specific application.
>>
>> As always, the real questions have lots more to do with your specific
>> access patterns, storage system, etc. I would look at the benchmarking
>> info available on the lists as a good starting point.
>>
>> On Fri, Jul 23, 2010 at 11:51 AM, Michael Widmann
>> <[hidden email]> wrote:
>> > Hi
>> >
>> > We plan to use cassandra as a data storage on at least 2 nodes with RF=2
>> > for about 1 billion small files.
>> > We do have about 48TB discspace behind for each node.
>> >
>> > now my question is - is this possible with cassandra - reliable - means
>> > (every blob is stored on 2 jbods)..
>> >
>> > we may grow up to nearly 40TB or more on cassandra "storage" data ...
>> >
>> > anyone out did something similar?
>> >
>> > for retrieval of the blobs we are going to index them with an hashvalue
>> > (means hashes are used to store the blob) ...
>> > so we can search fast for the entry in the database and combine the
>> > blobs to
>> > a normal file again ...
>> >
>> > thanks for answer
>> >
>> > michael
>> >
>
>
>
> --
> bayoda.com - Professional Online Backup Solutions for Small and Medium Sized
> Companies
>