Backup solution

Backup solution

Rene Kochen
Hi all,

Is the following a good backup solution?

Create two data-centers:

- A live data-center with multiple nodes (commodity hardware). Clients connect to this cluster with LOCAL_QUORUM.
- A backup data-center with a single node (with fast SSDs). Clients do not connect to this cluster; it is used only for creating and storing snapshots.

Advantages:

- No snapshots or bulk network I/O (snapshot transfers) needed on the live cluster.
- Clients are not slowed down because writes to the backup data-center are async.
- On the backup cluster snapshots are made on a regular basis. This again does not affect the live cluster.
- The backup cluster does not need to process client requests/reads, so it needs fewer machines than the live cluster.

Are there any disadvantages with this approach?

Thanks!

Re: Backup solution

Jabbar Azam

Hello,

If the live data centre disappears, restoring the data from the backup is going to take ages, especially if the data has to travel from one data centre to another, unless you have a high-bandwidth connection between data centres or only a small amount of data.

Jabbar Azam


Re: Backup solution

kochen
Thank you. I have a high-bandwidth connection. But that also means that regular repairs on the backup data-center will take a long time.


Re: Backup solution

Aaron Turner
Honestly, at this point I don't think anyone can give you good feedback based on facts, because so far you haven't given us any facts. Like:

1. How big is the data set?
2. How many nodes in your primary DC?
3. How many transactions/sec is your primary DC doing?
4. What are your uptime SLAs?
5. Just how fast is "high bandwidth"? How much latency?

Anyway, will it work? Possibly. What are the disadvantages? Well, that depends on a bunch of things you haven't told us.



--
Aaron Turner
http://synfin.net/         Twitter: @synfinatic
http://tcpreplay.synfin.net/ - Pcap editing and replay tools for Unix & Windows
Those who would give up essential Liberty, to purchase a little temporary
Safety, deserve neither Liberty nor Safety.
    -- Benjamin Franklin
"carpe diem quam minimum credula postero"

Re: Backup solution

Philip O'Toole
You could consider using a WAN optimization appliance such as a Riverbed Steelhead to significantly speed up your transfers, though that will cost money. It is a common approach to speeding up inter-datacenter transfers. Steelheads for the AWS EC2 cloud are also available.

(Disclaimer: I used to write software for the physical and AWS Steelheads.)

Philip


Re: Backup solution

kochen
Hi Aaron,

We have many deployments, but typically:

- Live cluster of six nodes, replication factor = 3.
- A node processes more reads than writes (approximately 100 get_slices per second, narrow rows).
- Data per node is about 50 to 100 GBytes.
- We should recover within 4 hours.

The idea is to put the backup cluster close to the live cluster with a gigabit connection only for Cassandra.

Thanks!

Rene
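Back-of-the-envelope, those figures suggest the raw transfer fits the 4-hour window. Here is a minimal sketch of that arithmetic (my own illustration, not from the thread; it assumes a dedicated 1 Gbit/s link and ignores repair, streaming, and protocol overhead):

```python
# Rough restore-time estimate for the figures above. Hypothetical sketch:
# ignores repair/streaming overhead, compaction, and protocol inefficiency.

def restore_hours(nodes, gb_per_node, link_gbit_per_s):
    total_gb = nodes * gb_per_node      # worst case: re-stream everything
    gb_per_s = link_gbit_per_s / 8      # gigabits -> gigabytes per second
    return total_gb / gb_per_s / 3600   # seconds -> hours

# Worst case from the thread: 6 nodes x 100 GB over 1 Gbit/s.
hours = restore_hours(nodes=6, gb_per_node=100, link_gbit_per_s=1.0)
print(round(hours, 2))  # -> 1.33 hours of raw transfer, inside the 4-hour SLA
```

Real repairs and restores will be slower than the wire-speed number, but it shows the order of magnitude involved.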


Re: Backup solution

Aaron Turner
100 reads/sec/node doesn't sound like a lot to me... And 100 GB/node is far below the recommended limit. Sounds to me like you've possibly over-spec'd your cluster (not a bad thing, just an observation). Of course, if your data set is growing, then...

That said, I wouldn't consider a single node in a 2nd DC receiving updates via Cassandra a "backup". That's because a bug in Cassandra that corrupts your data, or a user accidentally doing the wrong thing (like issuing deletes they shouldn't), gets replicated to all your nodes, including the one in the other DC.

A real backup would be to take snapshots on the nodes and then copy
them off the cluster.

I'd say replication is good if you want a hot standby at a disaster recovery site, so you can quickly recover from a hardware fault. Especially with a 4-hour SLA: how are you going to get your primary DC back up after a fire, earthquake, etc. in 4 hours? Heck, a switch failure might knock you out for 4 hours, depending on how quickly you can swap another one in and how recent your config backups are.

Better to have a DR site with a smaller set of nodes with the data ready to go. Maybe they won't be as fast, but hopefully you can make sure the most important queries are handled. For that, I would probably go with something more than just a single node in the DR DC.

One thing to remember is that compactions will limit the feasible single-node size to something smaller than the disk space you can potentially allocate. I.e., just because you can build a 4 TB disk array doesn't mean you can have a single Cassandra node with 4 TB of data. Typically, people around here seem to recommend ~400 GB, but that depends on hardware.
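That headroom effect can be sketched with simple arithmetic (my own illustration; the 0.5 factor is an assumption based on worst-case size-tiered compaction needing up to the data's size again in free space, not a fixed Cassandra rule):

```python
# Sketch: live data a node can safely hold, given compaction headroom.
# headroom_factor=0.5 is an assumption: a worst-case major compaction can
# temporarily need roughly as much free space as the data it rewrites.

def max_live_data_gb(disk_gb, headroom_factor=0.5):
    return disk_gb * headroom_factor

# A 4 TB array does not mean 4 TB of Cassandra data:
print(max_live_data_gb(4000))  # -> 2000.0 GB at best, before other overhead
```

Operational limits like the ~400 GB rule of thumb are usually driven by repair and restore times as much as by disk space.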

Honestly, for the price of a single computer you could test this pretty easily. That's what I'd do.


Re: Backup solution

kochen
Hi Aaron,

Thank you for your answer!

My idea was to take the snapshots in the backup DC only. That way the backup procedure will not affect the live DC. However, I'm afraid that a point-in-time recovery via the snapshots in the second DC (first restore the backup on the backup DC, then repair the live DC) will take too long. I expect the data to grow significantly.

It makes more sense to use the second cluster as a hot standby (and make snapshots on both clusters).

Rene


Re: Backup solution

Aaron Turner

Remember, snapshots are *cheap*. There's almost literally zero I/O associated with taking a snapshot. Backing all that data up off the system is a different story, but at least it's large sequential reads, which are pretty well optimized.
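Snapshots are nearly free because Cassandra's SSTable files are immutable, so a snapshot is just a set of hard links to files that already exist. The mechanism can be illustrated in a few lines (my own illustration of the filesystem behavior, not Cassandra code):

```python
# Illustration of why snapshots are nearly free: like `nodetool snapshot`,
# a hard link adds a directory entry but reads and writes no file data.
import os
import tempfile

with tempfile.TemporaryDirectory() as d:
    sstable = os.path.join(d, "data.db")
    with open(sstable, "wb") as f:
        f.write(b"x" * 1024)        # stand-in for an immutable SSTable

    snapshot = os.path.join(d, "snapshot-data.db")
    os.link(sstable, snapshot)      # O(1): no data is copied

    # Both names point at the same inode, so no extra space is used.
    print(os.stat(sstable).st_ino == os.stat(snapshot).st_ino)  # True
```

Copying the snapshot off the node is where the real I/O and network cost lives.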


Re: Backup solution

aaron morton
IMHO this is a bad idea.

* Your secondary DC will have no redundancy; when you restart it, you will be relying on hinted handoff and nodetool repair.
* If your secondary DC machine fails, so does the single copy of your backup.
* There will be additional management overhead for managing an unbalanced DC.
* Restoring will be a pain.

Cheers

-----------------
Aaron Morton
Freelance Cassandra Consultant
New Zealand

@aaronmorton

