Johan Louwers - Tech blog: August 2013

Monday, August 19, 2013

Oracle, hadoop and flume

The below video is created by Oracle to provide some more insight in what Oracle can offer in the field of big-data and Hadoop like / based solutions. Hadoop applications are data intensive in most cases. Reason for stating most cases is that this is not always the case as you can have a massive parallel computation on a small dataset and still use Hadoop for this due to its distribution capabilities. However, in most cases it is data intensive and Oracle being one of the major players in the database and systems fields that work with massive amounts of data it is not more then understandable that they jump on the Big-Data bandwagon. Currently Oracle is offering a big-data appliance, they do provide connectors between the Oracle database and the Hadoop platform and they are working closely with Cloudera.

One of the interesting things in the above video is the mention of Apache Flume, which is casually named however is a complete Apache project on itself.

Apache Flume is a distributed, reliable, and available system for efficiently collecting, aggregating and moving large amounts of log data from many different sources to a centralized data store. The use of Apache Flume is not only restricted to log data aggregation. Since data sources are customizable, Flume can be used to transport massive quantities of event data including but not limited to network traffic data, social-media-generated data, email messages and pretty much any data source possible.

Flume is in general used to be installed on a cluster of servers and to mine data from logfiles and send them via channels to an "data sink" in the form of HDFS (Hadoop Distributed File System). The nice thing of Flume is that it is very light wait and can be deployed on every server where you have the need to collect logfiles from and where you need them to be aggregated into your HDFS filesystem where Hadoop can start working on analysing the data. Having stated that the "data sink" is HDFS is essentially correct as you can also use "normal" flat files and custom data targets supported by custom "data sinks". For now we will use HDFS as an example.

Flume has a straightforward design however some naming needs some understanding to understand a flume implementation. Flume consist on a high level out of data sources, flume agents, flume collectors and in our case in a HDFS storage location. This is also the way data is flowing in a Flum implementation.

A data source is in this case any file (or files that are present on the local filesystem of a host computer/server/device running flume. This can be for example a file in /var/log that you want to mine with flume. We take here as an example /var/log/messages which is generally available on all Linux systems.

A Flume agent is responsible for collecting data from the data source and sending this to the flume collector. You can have more then one flume agent running at the same time.

A Flume collector gathers the data from and aggregates this from one or more flume agents and send this (in our case) to the HDFS data store.

When configuring Flume you will have to configure (primarily) your flume agent and your flume collector. What needs to be taken into account is the configuration of your sink (which is the term for where you will send your data to).

First we configure the agent and we have to set the source and sink. The source can be for example a log file in /etc/logs and the sink will be (for an agent) the agentsink which will be located on the local machine. For this you can configure a tail of the logfile you want flume to monitor. The agentsink will be a port on localhost. You can do this for example with the below command;

exec config agent 'tail("/var/log/somefile.log")' 'agentSink("localhost",35853)'

This will configure the agent to read the /var/log/somefile.log file and send the output to localhost port 35852. The collector will read the output from the agentsink at port 35853

A configuration line for the collector to collect the data (on the same machine) from this port and send it to a HDFS filesystem could be the one below (where you have to change the location of your collectorsink to a valid one).

collector : collectorSource(35853) | collectorSink("hdfs://namenode/flume/","srcdata");

Collector source. Listens for data from agentSinks forwarding to port port. If port is not specified, the node default collector TCP port, 35863. This source registers itself at the Master so that its failover chains can automatically be determined. A less documented feature however is that you can also provide a host in your collectorSource definition. This helps in building a single collector that is collecting from a number of agents and is sending the data to a HDFS storage location. When working with such a model it is very advisable however to look into the high availability and failover options of Flume. There are a large number of ways to implement high availability and failover for Flume and when you are implementing Flume for some mission critical solutions this is worth investigating.

On the topic of Flume Oracle has also luanched a specific video on youtube which you can view here below. Next to this there are some very good source of information available at the Cloudera website and at the Apache Flume project website

Tuesday, August 13, 2013

Technical specifications for Oracle Virtual Compute Appliance

Oracle announced today a new engineered system, the Oracle Virtual Compute Appliance. The announcement was done by wim coekaerts, the Oracle Senior Vice President for Linux and virtualization technology. The engineered system is based upon X86 processor technology and uses Oracle VM as a hypervisor technology. One of the interesting things is that Oracle states it will be supporting Windows from day one. This is next to the expected Oracle Linux, Solaris and RedHat Linux support.

The Oracle Virtual Compute Appliance is based upon a maximum number of 25 Sun X3-2 servers as compute nodes, a 18TB ZFS storage appliance and infiniband connect solution. Next tot his the systems is equipped with a local web-based management console and a Oracle VM Manager console. It is expected that this will tie into Oracle OPSCenter and Oracle Enterprise Manager however details are still a little vague on this point.

Technical specifications of the Oracle Virtual Compute Appliance:

Compute Nodes: 2 to 25Controller Nodes: 2

(2) Intel 2.2 GHz Xeon (8 core) processors
256 GB 1,600 MHz RAM
(2) 900 GB HDDs (RAID1)
(1) Dual-port QDR InfiniBand HCA (PCIe)
(1) GbE management port (BASE-T)

Oracle ZFS Storage Appliance

(4) QDR InfiniBand ports (one active and one passive per storage head)
292 GB solid state disk write cache
18 TB serial-attached SCSI (SAS) disks
(2) GbE management ports

Oracle Virtual Networking

(2) Oracle Fabric Interconnect F1-15 model with 15 I/O module slots each with:
(20) Non-blocking QDR InfiniBand server ports
(4) Quad-port 10 Gb Ethernet modules
(2) Dual-port 8 Gb Fibre Channel modules**Fibre Channel is included, but not supported in the initial release

InfiniBand Spine Switch

(36) QDR InfiniBand ports
(1) GbE management ports (BASE-T)

Preinstalled Software

Oracle Virtual Compute Appliance controller
Oracle VM
Oracle VM Manager
Storage Operating System Software
Oracle SDN

More technical specifications on the Oracle Virtual Compute Appliance can be found in the below embeded document:

ovca-faq-1988834 by Johan Louwers

A frequently asked questions document on the Oracle Virtual Compute Appliance can be found in the below embedded document

Oracle Virtual Compute Appliance Ds 1988829 by Johan Louwers

More information on the Oracle Virtual Compute Appliance can be found on this landing page at oracle.com

Monday, August 12, 2013

Social media and the influence on social

Social media by its name is implying that it is social. Social in some sense should bring people together and should help form groups. Humans are social beings by nature so social media should make us more happy. It should help us to become more happy people if we believe the implicit promise of the naming social media. For some people this might be the case and social media will help people to become more social and form groups. However, in the below video, based upon the book Alone together written by Sherry Turkle we get a look at what social media influence has on loneliness.

Some very interesting views are expressed and the video contains some warnings. Warnings to people who do adopt social media and to humanity in total. Somehow it carries the message that social media, in its current version, can be an addition to our face-to-face social life however cannot be a replacement. As social media is fairly new and our ways of doing things are already engrained in our DNA from the day we have walked the earth it is naturally we do not immediately have a fit with the new technology. Whatever your opinion and if your agree or disagree with those vies, the views in the video or the views in the book, it is an interesting thought and something interesting to discuss.

If you are interested in the book Alone together, please find an embed of the book below which is currently placed on books.google.com.