Storing photos, images, docs etc.

mcasandra
Is it advisable or ok to store photos, images and docs in cassandra where you expect high volume of uploads and views?

I was reading about facebook implementation of haystack to store the photos. They don't put anything in their mysql db.

Since Cassandra is different from mysql I was wondering if it's ok or if there are going to be any issues.

I tried searching online to read articles or papers on similar subject but couldn't find any where cassandra was being used to store docs/images etc.

Re: Storing photos, images, docs etc.

Edward Capriolo
On Tue, Mar 1, 2011 at 1:43 PM, mcasandra <[hidden email]> wrote:

> Is it advisable or ok to store photos, images and docs in cassandra where you
> expect high volume of uploads and views?
>
> I was reading about facebook implementation of haystack to store the photos.
> They don't put anything in their mysql db.
>
> Since Cassandra is different from mysql I was wondering if it's ok or if
> there are going to be any issues.
>
> I tried searching online to read articles or papers on similar subject but
> couldn't find any where cassandra was being used to store docs/images etc.

Google of terms cassandra large files + feeling lucky
http://www.google.com/search?q=cassandra+large+files&ie=utf-8&oe=utf-8&aq=t&rls=org.mozilla:en-US:official&client=firefox-a

Yields:
http://wiki.apache.org/cassandra/FAQ#large_file_and_blob_storage

This is also nearly a bi-monthly mailing list topic.

Re: Storing photos, images, docs etc.

mcasandra
Thanks! If I am reading it correctly, it looks like Cassandra is not a good solution for storing photos/images/blobs etc., even though it says it's fixed in version 0.7.

Re: Storing photos, images, docs etc.

RW>N
Depends on the specs of your large files.
If the files are less than 64MB, there will be no splitting.
Cassandra (actually Thrift) has no streaming abilities. But if your
objects are small (a few MB) they would fit in memory easily.

I will have a lot of binaries under a few MB in size. I am actively
looking at Cassandra for it. The ability to have master-master
writes, where you can distribute writes pretty much evenly across all
nodes, makes Cassandra good enough to consider for smaller binaries
(in my opinion).



On Tue, Mar 1, 2011 at 3:10 PM, mcasandra <[hidden email]> wrote:
> Thanks! If I am reading it correctly, it looks like Cassandra is not a good
> solution for storing photos/images/blobs etc., even though it says it's fixed
> in version 0.7.

Re: Storing photos, images, docs etc.

mcasandra
Why do we think it's good to have files < 64 MB? How did one arrive at this number?

If I understand correctly, the problem is that Java heap usage might grow because of the large files. But doesn't it really depend on concurrent requests * size of the response?
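
As a rough illustration of that product (the request count and blob size below are made-up numbers, not measurements, and `copies_per_request` is an assumed knob for extra in-memory copies made while serializing):

```python
MB = 1024 * 1024


def peak_buffer_bytes(concurrent_requests, blob_bytes, copies_per_request=1):
    """Estimate heap needed if every in-flight request holds a whole blob.

    This is only the naive product from the question above; real heap
    usage also depends on the server and client implementation.
    """
    return concurrent_requests * blob_bytes * copies_per_request


# Hypothetical load: 50 requests in flight, each holding one 64 MB blob.
estimate = peak_buffer_bytes(concurrent_requests=50, blob_bytes=64 * MB)
print(f"{estimate / MB:.0f} MB of heap just for blob buffers")
```

With small objects (a few MB) the same product stays modest, which is why the size of each value matters more than the total volume stored.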

What are the other options, then? Storing it on the filesystem introduces a SPOF.

Re: Storing photos, images, docs etc.

Peter Schuller
In reply to this post by mcasandra
> Is it advisable or ok to store photos, images and docs in cassandra where you
> expect high volume of uploads and views?

To diverge a bit from the direction the thread is going: You can
definitely store large files in Cassandra. I would recommend against
doing so by simply smacking entire files into column values, simply
because the architecture is such that columns are assumed to be
reasonably sized (lots of them fitting in memory, lots of temporary
columns are okay to create, etc).

Off the top of my head my starting point would be using one row per
file and splitting the actual content up into columns. For dealing
with larger files you may wish to consider splitting into multiple
rows so that even individual files can get replicated across a cluster
(avoids single very large files causing out-of-disk or performance
problems on an individual node, and allows an individual file to enjoy
scaling out for performance).
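
A minimal sketch of that layout, assuming a fixed chunk size and using a plain Python dict as a stand-in for a column family (no Cassandra client is involved; `CHUNK_SIZE`, `CHUNKS_PER_ROW`, and the `<file_id>:<row_index>` key format are illustrative assumptions):

```python
from collections import defaultdict

CHUNK_SIZE = 4        # bytes per column value; tiny on purpose for the demo (assumed)
CHUNKS_PER_ROW = 3    # columns per row before spilling into the next row (assumed)


def chunk_layout(file_id, data):
    """Map file content to {row_key: {column_name: chunk_bytes}}.

    Row keys look like "<file_id>:<row_index>" and column names are
    chunk numbers, so a large file spreads over several rows.
    """
    rows = defaultdict(dict)
    for offset in range(0, len(data), CHUNK_SIZE):
        chunk_no = offset // CHUNK_SIZE
        row_key = f"{file_id}:{chunk_no // CHUNKS_PER_ROW}"
        rows[row_key][chunk_no] = data[offset:offset + CHUNK_SIZE]
    return dict(rows)


def reassemble(rows):
    """Concatenate chunks in chunk-number order to rebuild the file."""
    chunks = {}
    for columns in rows.values():
        chunks.update(columns)
    return b"".join(chunks[n] for n in sorted(chunks))


layout = chunk_layout("photo42", b"abcdefghijklmnopqrstuvwxyz")
```

Because each row key is distinct, the rows of one file can land on different nodes, which is the point of splitting a large file across multiple rows.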

However, all that is just deciding on the representation of data in
Cassandra appropriately for the use case. I think the more real and
bigger issue is what you're looking for in terms of efficiency. I
wouldn't necessarily call Cassandra the most efficient way to store
large blobs, just because compaction will be a lot more expensive in
relative terms than when used for small individual items of data.
However on the other hand Cassandra should shine in giving you
reasonably efficient random access to subranges of files, yet allow
you to easily write file data in a non-coordinated fashion
(concurrency across sub ranges). There are non-trivial trade-offs.
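
To make the sub-range access concrete: if a file is stored as fixed-size chunks, one per column, a byte range maps to a small set of columns. A minimal sketch, assuming fixed-size chunks numbered from zero (the function name and chunk size are illustrative, not part of any Cassandra API):

```python
def chunks_for_range(start, length, chunk_size):
    """Return (chunk_no, offset_in_chunk, nbytes) for each chunk a byte range touches."""
    if start < 0 or length < 0 or chunk_size <= 0:
        raise ValueError("start and length must be >= 0, chunk_size must be > 0")
    pieces = []
    pos = start
    end = start + length
    while pos < end:
        chunk_no = pos // chunk_size
        offset = pos % chunk_size
        # Read up to the end of this chunk or the end of the range.
        nbytes = min(chunk_size - offset, end - pos)
        pieces.append((chunk_no, offset, nbytes))
        pos += nbytes
    return pieces


# Bytes 5..14 of a file stored in 4-byte chunks touch chunks 1, 2 and 3.
pieces = chunks_for_range(start=5, length=10, chunk_size=4)
```

Only the listed columns need to be fetched, and writers working on disjoint ranges touch disjoint columns.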

If you were to store say predominantly 5-50 MB files and you had no
desire beyond just storing them as single large blobs, a local storage
model which implied one-file-per-blob would be much more efficient
assuming each individual blob could be streamed to the client.

Bottom line, I think the two primary potential concerns would be: Are
you looking at a *lot* of writes? Write overhead in terms of
throughput and disk I/O should be larger than for your typical
database with small "things" (regardless of row/column/supercolumn
division) being written. The other thing is that if compaction becomes
I/O bound rather than CPU bound, you may have bigger issues with read
latency than otherwise.

Regardless, I don't think focusing on whether or not it's a good idea
to have a huge single column is the right approach to the problem
since that's more about using the Cassandra data model appropriately.

--
/ Peter Schuller

Re: Storing photos, images, docs etc.

Sasha Dolgy-2
In reply to this post by Edward Capriolo
I took the advice from previous threads and use Cassandra to hold
pointers to the uploaded files and other meta information.
Amazon S3 can be quite simple and pain-free at times, and was a great
cost-effective place for me to keep the large files. I have had some
great success already with this approach, and read/serve the data from
the large files only when it is required.
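
A minimal sketch of that pointer pattern, with two in-memory dicts standing in for the object store and the database (no S3 or Cassandra client is used here; the key scheme and field names are assumptions for illustration):

```python
import hashlib

blob_store = {}      # stand-in for an object store such as S3 (assumed)
metadata_store = {}  # stand-in for a Cassandra column family (assumed)


def upload(user, filename, content):
    """Store the bytes in the blob store and only a pointer plus metadata in the database."""
    blob_key = hashlib.sha256(content).hexdigest()
    blob_store[blob_key] = content          # write the blob first
    metadata_store[(user, filename)] = {    # then the row that points at it
        "blob_key": blob_key,
        "size": len(content),
    }
    return blob_key


def fetch(user, filename):
    """Look up the pointer, then read the blob only when it is needed."""
    meta = metadata_store[(user, filename)]
    return blob_store[meta["blob_key"]]


upload("alice", "cat.jpg", b"\xff\xd8 fake jpeg bytes")
```

Listing or searching files only touches the small metadata rows; the large payload is read on demand.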

-sd

On Tue, Mar 1, 2011 at 8:44 PM, Edward Capriolo <[hidden email]> wrote:

> On Tue, Mar 1, 2011 at 1:43 PM, mcasandra <[hidden email]> wrote:
>> Is it advisable or ok to store photos, images and docs in cassandra where you
>> expect high volume of uploads and views?
>>
>> I was reading about facebook implementation of haystack to store the photos.
>> They don't put anything in their mysql db.
>>
>> Since Cassandra is different from mysql I was wondering if it's ok or if
>> there are going to be any issues.
>>
>> I tried searching online to read articles or papers on similar subject but
>> couldn't find any where cassandra was being used to store docs/images etc.

Re: Storing photos, images, docs etc.

Norman Maurer
In reply to this post by Peter Schuller
2011/3/2 Peter Schuller <[hidden email]>:

>> Is it advisable or ok to store photos, images and docs in cassandra where you
>> expect high volume of uploads and views?
>
> To diverge a bit from the direction the thread is going: You can
> definitely store large files in Cassandra. I would recommend against
> doing so by simply smacking entire files into column values, simply
> because the architecture is such that columns are assumed to be
> reasonably sized (lots of them fitting in memory, lots of temporary
> columns are okay to create, etc).
>
> Off the top of my head my starting point would be using one row per
> file and splitting the actual content up into columns. For dealing
> with larger files you may wish to consider splitting into multiple
> rows so that even individual files can get replicated across a cluster
> (avoids single very large files causing out-of-disk or performance
> problems on an individual node, and allows an individual file to enjoy
> scaling out for performance).
>
</snip>

Hector has some InputStream/OutputStream implementation for doing such
stuff since the last release.  See:

https://github.com/rantav/hector/tree/master/core/src/main/java/me/prettyprint/cassandra/io

Maybe it helps.

Bye,
Norman

Re: Storing photos, images, docs etc.

RW>N
In reply to this post by mcasandra
>>What are other options then <<
Several.
1. MogileFS. Stores files on the filesystem but metadata in a database
(MySQL or Postgres). Also has redundancy built in. Does not require RAID.
No SPOF. But I think it has too many moving parts and requires a few more
boxes than Cassandra.
2. Of course, the good old BLOB in databases. But that leads to chunking
for sizes greater than the page size. Overhead is high and fragmentation
is an issue over time.
3. S3. Takes care of redundancy and durability much better than
anything. But you are tied to Amazon. Also check their SLAs on uptime.
It might or might not meet your needs.
4. MongoDB is pretty good too for large files. But it splits files at
16MB and storage is not atomic for sizes greater than that. Like
MogileFS, it too needs a few more boxes for trackers and metadata.

Each has its pros and cons. I still think there is an opportunity to
build simpler storage just for large files, with metadata kept in
a central place. If anyone is aware of other options or customized
solutions, please suggest them.


On Tue, Mar 1, 2011 at 5:36 PM, mcasandra <[hidden email]> wrote:

> Why do we think it's good to have files < 64 MB? How did one arrive at this
> number?
>
> If I understand correctly, the problem is that Java heap usage might grow
> because of the large files. But doesn't it really depend on concurrent
> requests * size of the response?
>
> What are the other options, then? Storing it on the filesystem
> introduces a SPOF.

Re: Storing photos, images, docs etc.

mcasandra
Thanks! Please let me know if others have more suggestions.

All in all, the feeling I get is to keep the images/docs off Cassandra. Flickr and Facebook seem to use MySQL for metadata, and the actual photos are stored somewhere else.

Looks like I need to search for a hosting platform where the data can be stored and the platform deals with redundancy and performance.

Re: Storing photos, images, docs etc.

mcasandra
Has anyone heard about the Lustre distributed file system? I am wondering if it will work well to keep the metadata in Cassandra and the images in Lustre.

I looked at MogileFS but am not too sure about its support.

RE: Storing photos, images, docs etc.

Weili McClenahan
How did Amazon implement S3? It seems to me that you are going to implement something like S3, a data storage system. I have the same requirement: I need to store a huge amount of large files (pdf, image, zip, video, audio...).

Another question: how about HDFS?

-----Original Message-----
From: mcasandra [mailto:[hidden email]]
Sent: Thursday, March 03, 2011 11:49 AM
To: [hidden email]
Subject: Re: Storing photos, images, docs etc.

Has anyone heard about the Lustre distributed file system? I am wondering if it
will work well to keep the metadata in Cassandra and the images in Lustre.

I looked at MogileFS but am not too sure about its support.

Re: Storing photos, images, docs etc.

Edward Capriolo
In reply to this post by mcasandra
On Thu, Mar 3, 2011 at 2:49 PM, mcasandra <[hidden email]> wrote:
> Has anyone heard about the Lustre distributed file system? I am wondering if it
> will work well to keep the metadata in Cassandra and the images in Lustre.
>
> I looked at MogileFS but am not too sure about its support.

Lustre and GlusterFS are cool, but this is apples and oranges. Those are
both mountable file systems with POSIX support. This is very different
from a key-value store.

Re: Storing photos, images, docs etc.

RW>N
In reply to this post by mcasandra
Why would you keep metadata in Cassandra? Even for millions of
documents, the metadata would be very small; MySQL/Postgres should
suffice.

Lustre of course is well known and widely used, along with GlusterFS.
Lustre, I think, requires kernel modifications and will be much more
complex. Also, storing files in the filesystem and metadata in a
database is easier said than done. You will have to create a custom
solution to integrate them and create transactions across them, avoid
a metadata SPOF, ensure load balancing, etc.
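
One small piece of that integration problem, sketched under heavy assumptions: write the blob first, then the metadata, and undo the blob write if the metadata write fails, so the database never points at a missing file and no orphaned blob is left behind. The dicts stand in for the file store and the database, and `fail_metadata` is a hypothetical switch that simulates a failed database write:

```python
class MetadataWriteError(Exception):
    """Raised by the stand-in metadata store to simulate a failed write."""


def store_file(blobs, metadata, key, content, fail_metadata=False):
    """Write blob then metadata; roll the blob back if the metadata write fails."""
    blobs[key] = content
    try:
        if fail_metadata:
            raise MetadataWriteError(key)
        metadata[key] = {"size": len(content)}
    except MetadataWriteError:
        # Compensate: remove the blob so no orphan is left behind.
        del blobs[key]
        return False
    return True


blobs, metadata = {}, {}
ok = store_file(blobs, metadata, "a", b"x")
failed = store_file(blobs, metadata, "b", b"y", fail_metadata=True)
```

This is a compensation step rather than a real transaction: a crash between the two writes would still leave an orphaned blob, which is why a periodic cleanup job is usually needed as well.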




On Thu, Mar 3, 2011 at 2:49 PM, mcasandra <[hidden email]> wrote:
> Has anyone heard about the Lustre distributed file system? I am wondering if it
> will work well to keep the metadata in Cassandra and the images in Lustre.
>
> I looked at MogileFS but am not too sure about its support.

Re: Storing photos, images, docs etc.

mcasandra
Well, it's not just metadata that I need to store but also usernames, profiles, followers etc. What I meant was to store the location of the images along with the other information that I described above. And when a user queries them, pull them from the file system.

Most of the high-volume sites (Facebook, Flickr, Digg etc.) currently seem to be storing the location/URI in the DB and the actual blobs (images/docs etc.) on a distributed file system. Some have written their own and some are using MogileFS, Lustre etc.

Can't use S3 since the requirement is to keep everything within the network so that it's secure and under control.

Initially I thought Cassandra could be used for both row data and large files. But from what I've read and the suggestions I've got, it looks like I need to look at a distributed file system which is fault tolerant and also scales well. After reading online I can come up with only a few options like Lustre, GlusterFS and MogileFS to store large files. As you mentioned, Lustre needs kernel tweaking/volume creation etc. Still trying to read more about GlusterFS. And the last updated date on the MogileFS site is back in 2010, so not sure if it's still widely supported (in case of issues :)).

Re: Storing photos, images, docs etc.

Dan Kuebrich
It's still maintained: https://github.com/mogilefs/ .  I don't have a good sense of the community, though we did use it at my last job.

On Thu, Mar 3, 2011 at 3:44 PM, mcasandra <[hidden email]> wrote:
Well, it's not just metadata that I need to store but also usernames, profiles,
followers etc. What I meant was to store the location of the images along with
the other information that I described above. And when a user queries them,
pull them from the file system.

Most of the high-volume sites (Facebook, Flickr, Digg etc.) currently seem
to be storing the location/URI in the DB and the actual blobs (images/docs etc.)
on a distributed file system. Some have written their own and some are
using MogileFS, Lustre etc.

Can't use S3 since the requirement is to keep everything within the network
so that it's secure and under control.

Initially I thought Cassandra could be used for both row data and large
files. But from what I've read and the suggestions I've got, it looks
like I need to look at a distributed file system which is fault tolerant and
also scales well. After reading online I can come up with only a few options
like Lustre, GlusterFS and MogileFS to store large files. As you mentioned,
Lustre needs kernel tweaking/volume creation etc. Still trying to read
more about GlusterFS. And the last updated date on the MogileFS site is back in
2010, so not sure if it's still widely supported (in case of issues :)).


Re: Storing photos, images, docs etc.

Robert Coli
On Thu, Mar 3, 2011 at 1:02 PM, Dan Kuebrich <[hidden email]> wrote:
> It's still maintained: https://github.com/mogilefs/ .  I don't have a good
> sense of the community, though we did use it at my last job.

#mogilefs on freenode contains one of the most solicitous and helpful
project maintainers I have ever had the fortune of interacting with,
dormando.

MogileFS in general Just Works for Digg, and very well at that. Don't
let the perl thing scare you...

=Rob
PS - we may be veering into OFF-TOPIC-for-cassandra-user@ topics, but
feel free to contact me privately with followups about Mogilefs..