data model to store large volume syslog


data model to store large volume syslog

Mohan L

Dear All,

I am looking at Cassandra to store time-series data (mostly syslog). The volume of data is very large, and many entries can arrive with the same timestamp. Each record contains the following fields:
 
timestamps:host-name:facility:message

The following queries need to be supported:


1). Need to get data between times X and Y
2). Need to get data between times X and Y for a given host-name.
3). Need to search for a 'pattern' in the message field

The data model I am considering is:

1). Create a column family 'cfrawlog' which stores the raw log as received. The row key could be 'yyyymmddhh' (a new row is added each hour or less); each column name is a UUID, with the raw log entry as its value. Since we are also going to use this log for forensic purposes, this keeps every raw log entry in the column family without losing any.

2). I want to create one more column family holding the parsed log, which we will use for querying. My question is: how should I model this CF so that it can answer the questions above? What would be the row key for this CF?

3). Does the above data model make sense?

Any help and suggestion would be greatly appreciated.


Thanks
Mohan L



RE: data model to store large volume syslog

moshe.kranc

A row key based on the hour will create hot spots for writes: for an entire hour, all the writes will go to the same node, i.e., the node where the row resides. You need to come up with a row key that distributes writes evenly across all your C* nodes, e.g., time concatenated with a sequence counter.
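For illustration, a minimal pycassa sketch of one way to bucket the row key (the keyspace name, CF name, bucket count, and TimeUUID comparator are all assumptions, not part of the proposed schema):

import uuid
from datetime import datetime
from pycassa.pool import ConnectionPool
from pycassa.columnfamily import ColumnFamily

NUM_BUCKETS = 16  # assumed; pick something that spreads load across the cluster

pool = ConnectionPool('logs')             # hypothetical keyspace
raw_log = ColumnFamily(pool, 'cfrawlog')  # assumes a TimeUUIDType comparator

def write_raw(line):
    # Spread each hour's writes across NUM_BUCKETS rows instead of one,
    # so no single replica set takes the whole hour's write load.
    hour = datetime.utcnow().strftime('%Y%m%d%H')
    bucket = uuid.uuid4().int % NUM_BUCKETS
    raw_log.insert('%s:%02d' % (hour, bucket), {uuid.uuid1(): line})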

 



Re: data model to store large volume syslog

aaron morton
> 1). Create a column family 'cfrawlog' which stores the raw log as received. The row key could be 'yyyymmddhh' (a new row is added each hour or less); each column name is a UUID, with the raw log entry as its value. Since we are also going to use this log for forensic purposes, this keeps every raw log entry in the column family without losing any.
As Moshe said, there is a chance of hot spotting if you are sending all writes to a single row.
You also need to consider how big the row will get; in general, stay below about 30 MB. You can go higher, but there are some implications.


> 2). I want to create one more column family holding the parsed log, which we will use for querying. My question is: how should I model this CF so that it can answer the questions above? What would be the row key for this CF?
Something like:

row_key: YYYYMMDD
column: <host:timestamp:>

Note: I've not considered how to handle duplicate timestamps from the same host.
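A rough pycassa sketch of slicing that layout for queries 1 and 2 (the keyspace and CF names are made up, and plain string column names of the form 'host:timestamp' are assumed):

from pycassa.pool import ConnectionPool
from pycassa.columnfamily import ColumnFamily

pool = ConnectionPool('logs')               # hypothetical keyspace
parsed = ColumnFamily(pool, 'cfparsedlog')  # hypothetical CF name

# Query 2: entries for one host between two times on a given day.
# String column names sort lexically, so a 'host:timestamp' slice stays
# within one host and comes back ordered by time.
cols = parsed.get('20130305',
                  column_start='example.com:2013-03-05 05:02:11',
                  column_finish='example.com:2013-03-05 06:28:27',
                  column_count=1000)  # get() returns only 100 columns by default
for name, value in cols.items():
    print(name, value)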

> 3). Does the above data model make sense?
Sort of.
Do some googling for Cassandra and log data, and look at https://github.com/thobbs/logsandra


Cheers

-----------------
Aaron Morton
Freelance Cassandra Developer
New Zealand

@aaronmorton
http://www.thelastpickle.com



Re: data model to store large volume syslog

Mohan L


On Fri, Mar 8, 2013 at 9:42 PM, aaron morton <[hidden email]> wrote:
> > 1). Create a column family 'cfrawlog' which stores the raw log as received. The row key could be 'yyyymmddhh' (a new row is added each hour or less); each column name is a UUID, with the raw log entry as its value. Since we are also going to use this log for forensic purposes, this keeps every raw log entry in the column family without losing any.
> As Moshe said, there is a chance of hot spotting if you are sending all writes to a single row.
> You also need to consider how big the row will get; in general, stay below about 30 MB. You can go higher, but there are some implications.
>
> > 2). I want to create one more column family holding the parsed log, which we will use for querying. My question is: how should I model this CF so that it can answer the questions above? What would be the row key for this CF?
> Something like:
>
> row_key: YYYYMMDD
> column: <host:timestamp:>
>
> Note: I've not considered how to handle duplicate timestamps from the same host.

I have created a standard column family with:

row_key : <YYYYMMDDHH:hostname>
Column_Name  : <timestamp:hostname>
Column_Value (as JSON dump) : {"date": "2013-03-05 06:21:56", "hostname": "example.com", "error_message": "Starting checkpoint of DB.db at Tue Mar 05 2013 06:21"}
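In pycassa, that insert might look something like this (a sketch; 'col_fam' is assumed to be an open ColumnFamily for this CF, and the row key follows the <YYYYMMDDHH:hostname> layout above):

import json

entry = {"date": "2013-03-05 06:21:56",
         "hostname": "example.com",
         "error_message": "Starting checkpoint of DB.db at Tue Mar 05 2013 06:21"}

row_key = '2013030506:example.com'                       # <YYYYMMDDHH:hostname>
col_name = '%s:%s' % (entry['date'], entry['hostname'])  # <timestamp:hostname>
col_fam.insert(row_key, {col_name: json.dumps(entry)})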

I have two questions about the above model:

1). If the column_name is the same for a given row_key, then Cassandra will update the column_value. Is there any way to append to the value in the same column (say, insert the first time and append the next time)? Does my question make sense?

2). Is there any way I can search/filter based on column_value? If not, what is the workaround to achieve this kind of column_value-based search/filter in Cassandra?

Say, for example: the query below returns a subrange of the columns in a row, i.e., all values within the range. What would be the way to filter the subrange output based on column_value?

key = '2013030505example.com'
result = col_fam.get(key, column_start='2013-03-05 05:02:11example.com', column_finish='2013-03-05 06:28:27example.com')

Any help and suggestion would be greatly appreciated.

Thanks
Mohan L

Re: data model to store large volume syslog

Aaron Turner
On Wed, Mar 13, 2013 at 4:23 AM, Mohan L <[hidden email]> wrote:

>
> I have created a standard column family with:
>
> row_key : <YYYYMMDDHH:hostname>
> Column_Name  : <timestamp:hostname>
> Column_Value (as JSON dump) : {"date": "2013-03-05 06:21:56", "hostname":
> "example.com", "error_message": "Starting checkpoint of DB.db at Tue Mar 05
> 2013 06:21"}
>
> I have two questions about the above model:
>
> 1). If the column_name is the same for a given row_key, then Cassandra
> will update the column_value. Is there any way to append to the value in
> the same column (say, insert the first time and append the next time)?
> Does my question make sense?

You can only insert a new value, which overwrites the old
rowkey/column_name pair.  The slow way is to do a read followed by a
write.  Faster is to keep some kind of in-memory cache of recent inserts,
so you read from memory and then write; obviously, though, that could
have scaling issues. Another solution is to write another column and
concatenate the values on read.
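A minimal sketch of the read-then-write approach (pycassa; the function name and separator are hypothetical), with the caveat that the read-modify-write is not atomic:

import pycassa

def append_value(cf, row_key, col_name, new_part, sep='\n'):
    # Read the existing value, if any, then write back the concatenation.
    # Not atomic: two concurrent appenders can lose one of the updates.
    try:
        old = cf.get(row_key, columns=[col_name])[col_name]
        cf.insert(row_key, {col_name: old + sep + new_part})
    except pycassa.NotFoundException:
        cf.insert(row_key, {col_name: new_part})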

> 2). Is there any way I can search/filter based on column_value? If not,
> what is the workaround to achieve this kind of column_value-based
> search/filter in Cassandra?


You can with indexes, but indexes really only work if your
column_names are known in advance; for your use case that's probably
not useful.  The usual solution is to insert the same data multiple
times (de-normalize your data) so that your read queries are
efficient.  I.e., depending on the query, you would probably query a
different CF.  Again, remember to distribute your writes across
multiple rows to avoid hot spots.  For example, if you want to search
by priority and facility, you'd want to encode that in the rowkey or
column_name.
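To make that concrete, a hedged sketch of writing each parsed entry once per query pattern (the CF names 'by_host' and 'by_facility', the 'facility' field, and the key layout are illustrative, not an established schema):

import json

def index_entry(by_host, by_facility, entry):
    # 'entry' is the parsed record, e.g. {"date": ..., "hostname": ...,
    # "facility": ..., "error_message": ...}.
    day = entry['date'][:10].replace('-', '')  # 'YYYYMMDD'
    payload = json.dumps(entry)
    # Serves query 2: slice by host and time range.
    by_host.insert('%s:%s' % (day, entry['hostname']),
                   {entry['date']: payload})
    # Serves a facility-based query: same data, different key.
    by_facility.insert('%s:%s' % (day, entry['facility']),
                       {'%s:%s' % (entry['date'], entry['hostname']): payload})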


> Say, for example: the query below returns a subrange of the columns in a
> row, i.e., all values within the range. What would be the way to filter
> the subrange output based on column_value?
>
> key = '2013030505example.com'
> result = col_fam.get(key, column_start='2013-03-05 05:02:11example.com',
> column_finish='2013-03-05 06:28:27example.com')
>
> Any help and suggestion would be greatly appreciated.


I'd suggest using a TimeUUID for the timestamp in the column name;
probably a lot wiser than rolling your own solution.
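For instance (a sketch; pycassa can build a TimeUUID from a datetime, so two events in the same second still get distinct, time-ordered column names):

from datetime import datetime
from pycassa.util import convert_time_to_uuid

# randomize=True fills the non-time bits randomly, avoiding collisions
# when many entries share the same timestamp.
col_name = convert_time_to_uuid(datetime(2013, 3, 5, 6, 21, 56),
                                randomize=True)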

One thing I'd add is that there is no reason to duplicate information
like the hostname in both the row key and the column name.  You're
just wasting storage at that point.  Just put it in the rowkey and be
done with it.

That said, you should think about what other kinds of queries you need
to do.  Basically, you won't be able to search for anything in the
value, only by row key and column name.  So, for example, if you care
about the facility and priority, then you'll need to somehow encode
that in the row/column name.  Otherwise you'll have to filter out
records post-query.  So for read performance, chances are you'll have
to insert the information multiple times depending on your search
parameters.

FYI, I could have sworn someone on this list announced, a few months ago,
some kind of C*-powered syslog storage solution they had developed.
You may want to do some searches and see if you can find the project
and learn anything from it.


--
Aaron Turner
http://synfin.net/         Twitter: @synfinatic
http://tcpreplay.synfin.net/ - Pcap editing and replay tools for Unix & Windows
Those who would give up essential Liberty, to purchase a little temporary
Safety, deserve neither Liberty nor Safety.
    -- Benjamin Franklin
"carpe diem quam minimum credula postero"