Thursday, February 16, 2012

MapReduce in relation to Big Data and Oracle

Everyone is talking about big data. We are still trying to define when data becomes big data, and we are only at the doorstep of understanding what we can do with big data once we apply big analysis to it. Even though this field of (enterprise) IT is quite new, a lot of companies are taking big data very seriously. Oracle, for example, takes it very seriously, as it is seen as the company that should be able to handle large sets of data. Oracle is teaming up with some of the big players in the market, for example Cloudera, one of the leading players in the Hadoop field.

As the data company, Oracle is spending a lot of time thinking about big data and building products and solutions to work with it. That means Oracle is trying to answer the question "how did data become big data" or, to rephrase it, "when is data big data". The answer Oracle has come up with, as promoted by Tom Kyte, is shown in this slide from their latest presentation.

Oracle states that big data can be defined by four criteria: volume, velocity (the speed of data growth), variety (all the kinds of sources and forms the data comes in), and value (the value the data has, or the potential value it can have once you are able to extract it).

Extracting and unlocking the true value of your big data takes a lot of computing power, and for this you need a superb compute infrastructure. One answer is MapReduce, which was developed by Google and released a couple of years ago. In the slide below you can see how the MapReduce compute infrastructure / algorithm works. This is the MapReduce picture used by Tom Kyte during his presentation on big data.

MapReduce is a framework for processing highly distributable problems across huge datasets using a large number of computers (nodes), collectively referred to as a cluster (if all nodes use the same hardware) or a grid (if the nodes use different hardware). Computational processing can occur on data stored either in a filesystem (unstructured) or in a database (structured).

"Map" step: The master node takes the input, partitions it up into smaller sub-problems, and distributes them to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure. The worker node processes the smaller problem, and passes the answer back to its master node.

"Reduce" step: The master node then collects the answers to all the sub-problems and combines them in some way to form the output – the answer to the problem it was originally trying to solve.

MapReduce allows for distributed processing of the map and reduction operations. Provided each mapping operation is independent of the others, all maps can be performed in parallel – though in practice it is limited by the number of independent data sources and/or the number of CPUs near each source. Similarly, a set of 'reducers' can perform the reduction phase - provided all outputs of the map operation that share the same key are presented to the same reducer at the same time. While this process can often appear inefficient compared to algorithms that are more sequential, MapReduce can be applied to significantly larger datasets than "commodity" servers can handle – a large server farm can use MapReduce to sort a petabyte of data in only a few hours. The parallelism also offers some possibility of recovering from partial failure of servers or storage during the operation: if one mapper or reducer fails, the work can be rescheduled – assuming the input data is still available.
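
The point about keys is easier to see in a small example. The sketch below is a word count in Python, under my own assumptions: the names map_words, shuffle, reduce_counts and word_count are made up for this post, and a thread pool stands in for the cluster (in CPython this shows the structure of the computation rather than a real speed-up). The shuffle function is what guarantees that all values sharing a key end up at the same reducer.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_words(line):
    """Mapper: emit a (word, 1) pair for every word in one line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(mapped):
    """Group all values by key so each key is handled by exactly one reducer."""
    groups = defaultdict(list)
    for pairs in mapped:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_counts(item):
    """Reducer: combine all values that share one key."""
    key, values = item
    return key, sum(values)


def word_count(lines, workers=4):
    """Map lines independently, group by key, then reduce each key."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each line is mapped independently, so the maps can run in parallel.
        mapped = list(pool.map(map_words, lines))
        groups = shuffle(mapped)
        # Each key is reduced independently of the other keys.
        return dict(pool.map(reduce_counts, groups.items()))


if __name__ == "__main__":
    print(word_count(["big data", "Big analysis on big data"]))
```

Since map_words has no side effects, a failed call could simply be run again on the same line, which is the rescheduling idea mentioned above.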

As Google is the company that came up with MapReduce, it might be good to check what Google has to say when explaining it. In the video below you can see a recording from Google Developers Day 2008, where Google explained the MapReduce solution they had developed and were using internally.

MapReduce, and Hadoop, the primary open source MapReduce solution from the Apache foundation, fit the statement "the future of computing is parallelism", which in my opinion is still very valid. In that article we zoomed in more on parallelism in general, whereas Hadoop and MapReduce are about parallelism on a more massive scale; in essence, however, the statement still holds and the idea is the same.
