Saturday, July 26, 2014

Query Big Data with SQL

Data management used to be “easy” within enterprises, in most common cases data lived was stored in files on a file system or it was stored in a relational database. With some small exceptions this was where you where able to find data. With the explosion of data we see today and with the innovation around the question how to handle the data explosion we see a lot more options coming into play. The rise of NoSQL databases and the rise of HDFS based Hadoop solutions places data in a lot more places then only the two mentioned.
Having the option to store data where it is most likely adding the most value to the company is from an architectural point of view a great addition. By having the option for example to not choice for a relational database however store data in a NoSQL database or HDFS file system is giving architects a lot more flexibility when creating an enterprise wide strategy. However, it is also causing a new problem, when you try to combine data this might become much harder. When you store all your data in a relations database you can easily query all the data with a single SQL statement. When parts of your data reside in a relational database, parts in a NoSQL database and parts in a HDFS cluster the answer to this question might become a bit harder and a lot of additional coding might be required to get a single overview.
Oracle announced “Oracle Big Data SQL” which is an “addition” to the SQL language which enables you to query data not only in the Oracle Database however also query, in the same select statement, data that resides in other places. Other places being Hadoop HDFS clusters and NoSQL databases. By extending the data dictionary of the Oracle database and allowing it to store information of data that is stored in NoSQL or Hadoop HDFS clusters Oracle can now make use of those sources in combination with the data stored in the database itself.

The Oracle Big Data SQL way of working will allow you to create single queries in your familiar SQL language however execute them on other platforms. The Oracle Big Data SQL implementation will take care of the translation to other languages while developers can stick to SQL as they are used to.

Oracle Big Data SQL is available with Oracle Database 12C in combination with theOracle Exadata Engineered system and the Oracle Big Data appliance engineered system. The use of Oracle Engineered systems make sense as you are able to use infiniband connections between the two systems to eliminate the network bottleneck. Also the entire design of pushing parts of a query to another system is in line with how Exadata works. In the Exadata machine the workload (or number crunching) is done for a large part not on the compute nodes but rather on the storage nodes. This ensures that more CPU cycles are available for other tasks and sorting, filtering and other things are done where they are supposed to be done, on the storage layer.

A similar strategy is what you see in the implementation of Oracle Big Data SQL. When a query (or part of a query) is pushed to the Oracle Big Data Appliance only the answer is send back and not a full set of data. This means that (again) the CPU’s of the database instance are not loaded with tasks that can be done somewhere else (on the Big Data Appliance).
The option to use Oracle Big Data SQL has a number of advantages to our customers, both on a technical as well as architectural and integration level. We can now lower the load on database instance CPU’s and are not forced to manual create connections between relations databases and NoSQL and Hadoop HDFS solutions. While on the other hand helps customers get rapid return on investment. Some Capgemini statements can be found on the Oracle website in a post by Peter Jeffcock and Brad Tewksbury from Oracle after the Oracle Key partner briefing on Oracle Big Data SQL.

No comments: