Monday, August 19, 2013

Oracle, Hadoop and Flume

The video below was created by Oracle to provide some more insight into what Oracle can offer in the field of big data and Hadoop-like / Hadoop-based solutions. Hadoop applications are data intensive in most cases; "most" because you can also run a massively parallel computation on a small dataset and still use Hadoop for its distribution capabilities. In most cases, however, it is data intensive, and with Oracle being one of the major players in the database and systems fields that work with massive amounts of data, it is more than understandable that they jump on the big-data bandwagon. Currently Oracle offers a big-data appliance, provides connectors between the Oracle database and the Hadoop platform, and works closely with Cloudera.

One of the interesting things in the above video is the mention of Apache Flume, which is named only in passing but is a complete Apache project in itself.

Apache Flume is a distributed, reliable, and available system for efficiently collecting, aggregating and moving large amounts of log data from many different sources to a centralized data store. The use of Apache Flume is not only restricted to log data aggregation. Since data sources are customizable, Flume can be used to transport massive quantities of event data including but not limited to network traffic data, social-media-generated data, email messages and pretty much any data source possible.

Flume is generally installed on a cluster of servers to mine data from logfiles and send it via channels to a "data sink", in our case HDFS (Hadoop Distributed File System). The nice thing about Flume is that it is very lightweight and can be deployed on every server from which you need to collect logfiles and have them aggregated into your HDFS filesystem, where Hadoop can start analysing the data. Stating that the "data sink" is HDFS is not the whole story, as you can also use "normal" flat files and custom data targets supported by custom "data sinks". For now we will use HDFS as an example.

Flume has a straightforward design, but some of its terminology needs explaining before a Flume implementation can be understood. On a high level Flume consists of data sources, Flume agents, Flume collectors and, in our case, an HDFS storage location. This is also the order in which data flows through a Flume implementation.
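Put together, the flow just described can be sketched as follows (a simplified view; the hostnames and paths are illustrative):

```
 /var/log/messages        /var/log/messages      <- data sources (local logfiles)
        |                        |
  [Flume agent]            [Flume agent]         <- tail the file, forward events
         \                      /
          \                    /
           [Flume collector]                     <- aggregate the agent streams
                  |
         hdfs://namenode/flume/                  <- the "data sink" (HDFS)
```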

A data source is in this case any file (or files) present on the local filesystem of a host computer/server/device running Flume. This can be, for example, a file in /var/log that you want to mine with Flume. As an example we take /var/log/messages, which is generally available on all Linux systems.

A Flume agent is responsible for collecting data from the data source and sending it to the Flume collector. You can have more than one Flume agent running at the same time.

A Flume collector gathers and aggregates the data from one or more Flume agents and sends it (in our case) to the HDFS data store.

When configuring Flume you will (primarily) have to configure your Flume agent and your Flume collector. What also needs to be taken into account is the configuration of your sink, which is the Flume term for the destination your data will be sent to.

First we configure the agent, for which we have to set the source and the sink. The source can for example be a log file in /var/log, and the sink will (for an agent) be the agentSink, which is located on the local machine. For this you can configure a tail of the logfile you want Flume to monitor; the agentSink will be a port on localhost. You can do this, for example, with the command below:

exec config agent 'tail("/var/log/somefile.log")' 'agentSink("localhost",35853)'

This will configure the agent to read the /var/log/somefile.log file and send the output to localhost port 35853. The collector will read the output from the agentSink at port 35853.

A configuration line for the collector to collect the data from this port (on the same machine) and send it to an HDFS filesystem could be the one below (where you have to change the location of your collectorSink to a valid one).

collector : collectorSource(35853) | collectorSink("hdfs://namenode/flume/","srcdata"); 
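As mentioned earlier, the sink does not have to be HDFS. A sketch of the same collector writing to a plain local directory instead, assuming your Flume/Hadoop version accepts file:// URIs for local paths (the path is an example and must exist and be writable):

```
collector : collectorSource(35853) | collectorSink("file:///tmp/flume/","srcdata");
```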

A collectorSource listens for data from agentSinks forwarding to the given port. If no port is specified, the node's default collector TCP port, 35863, is used. This source registers itself at the Master so that its failover chains can be determined automatically. A less documented feature, however, is that you can also provide a host in your collectorSource definition. This helps in building a single collector that collects from a number of agents and sends the data to an HDFS storage location. When working with such a model it is very advisable to look into the high availability and failover options of Flume. There are a large number of ways to implement high availability and failover for Flume, and when you are implementing Flume for mission-critical solutions this is worth investigating.
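A sketch of such a multi-agent setup with one central collector, assuming the collector runs on a host named collectorhost (the hostname, node names, port and paths are illustrative, not taken from a real deployment):

```
# on each server that produces logs, point the agentSink at the collector host
exec config agent1 'tail("/var/log/messages")' 'agentSink("collectorhost",35853)'
exec config agent2 'tail("/var/log/messages")' 'agentSink("collectorhost",35853)'

# on the collector host, gather all agent streams and write them to HDFS
collectorhost : collectorSource(35853) | collectorSink("hdfs://namenode/flume/","srcdata");
```

For anything mission critical you would, as stated above, look into Flume's failover and high-availability options rather than rely on a single collector.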

On the topic of Flume, Oracle has also launched a specific video on YouTube, which you can view below. Next to this there are some very good sources of information available at the Cloudera website and at the Apache Flume project website.
