Wednesday, July 25, 2012

pharmaceutical map reduce


The website pharmatimes.com is diving into the results of a report recently published by Oracle on how the pharmaceutical industry is working with data and what the current bottlenecks are.

Around 93% said their organisation is collecting and managing more business information today than two years ago, by an average of 78% more. However, 29% of life sciences executives give their company a 'D' or 'F' in preparedness to manage the data deluge; only 10% give their organisation an 'A'.


Life sciences respondents say they are unable to realise, on average, 20% of additional revenue per year, translating to about $98.5 million per year, "by not being able to fully leverage the information they collect." Of those surveyed from the pharma/biotech sector, 30% are "most frustrated with their inability to give business managers access to the information they need without relying on the IT team".

It is interesting to see that the amount of data collected, and the amount of data potentially available to analysts and business users, is increasing. It is equally interesting that there is frustration within companies about having to involve IT whenever the business wants to use the collected data. What you commonly see in most companies is that the business needs a report or an analysis and turns to IT for it. The request for change is added to the work list of the developers within IT, who build and deliver the new reporting functionality to the user who requested it.

When there is an urgent business need for a new report, this lead time can be frustrating, as the business has to wait until the report is delivered. Users in general would like a framework in which they can quickly and easily build their own reports and, by doing so, no longer depend on the IT department.

Such tools and platforms are available on the market; however, they are not commonly deployed, for a couple of reasons:

a) The use of such tooling is blocked by the IT department, which is afraid that it will decrease the department's value within the company.

b) IT claims that the reports, and the resulting queries, built by the business users are sub-optimal and could cause performance issues.

c) The “self-build” reporting tool is considered by some to have a steep learning curve, and for this reason the tooling is not deployed.

Point C is debatable and will depend on the skill level of the employees and their affinity with IT. Whether it is true also depends on the tool(s) selected. Point A and point B, however, can be tackled and should not hold your company back from enabling users to build their own reports.

Reason A will have to be tackled in the political arena of your company; with management backing, IT management should be persuaded to provide the support needed to get the project started. This will inevitably lead to a decrease of work for the IT department in the form of building new reports, but it will increase the need to support the new platform and can open up a whole new area of services for IT. This new area can also include building the more complex and challenging reports.

Reason B depends heavily on a couple of factors. One is how much understanding the users have of what their queries will do to the system performance-wise, and how well they are trained to use the tool correctly. A second is the selected tool: how “smart” is it in generating queries from what the user builds in the drag-and-drop interface? A last factor is the size of the data you have available: querying a couple of terabytes will be faster than querying multiple petabytes of data.

Removing reason B as an argument against deploying such tools requires more detailed thought and planning. It depends partially on the tool selection, but also on how you organize your data. Looking at the rate at which companies are gathering data, you can state that for a large number of companies it would be beneficial to look at solutions from the field of big data. Solutions developed and deployed in this field take a different, more distributed approach to storing and handling data. If you take the design of your hardware and the way you access and compute data into consideration, you can deploy a platform that is ideal for users who run their own reports and queries.

In a traditional setup, as sketched below, you store all your data in a single data source and have a single node that takes care of computing the results and communicating with the users. For small sets of data this works; however, with large sets of data it can become problematic, as the resources available to the computing node become a bottleneck. When lots of users run their custom-written queries on it, performance can drop to an unacceptable level. Due to the nature of this setup, scaling out horizontally is not an option; you can only scale vertically by adding more CPUs to your computing node.
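As a minimal sketch of this bottleneck, consider a single computing node that has to scan the complete data set for every user query (the data set and the revenue query are hypothetical, chosen only for illustration):

```python
# Sketch of the traditional setup: one data source, one computing node.
# The records and the query are hypothetical examples.

records = [
    {"product": "drug_a", "region": "EU", "revenue": 120.0},
    {"product": "drug_b", "region": "US", "revenue": 310.0},
    {"product": "drug_a", "region": "US", "revenue": 95.0},
    # ... in reality millions of rows, all held by one node
]

def total_revenue(product):
    # Every query walks the full data set on the single computing node;
    # with many concurrent users this node becomes the bottleneck.
    return sum(r["revenue"] for r in records if r["product"] == product)

print(total_revenue("drug_a"))  # 215.0
```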


In a more parallel way of doing things, and in line with big-data thinking, you can create a cluster of smaller sub-sets of your data and dedicate a computing node to each sub-set. When a user starts a request, all nodes work on it for a small amount of time and send their results back to a node that collects all the answers and provides them in a consolidated way to the end user. This way of working gives you a faster way of computing your results and at the same time provides the option of horizontal scaling: adding more computing and data nodes when your data grows or when the need for more performance arises.
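A minimal sketch of this scatter/gather pattern, reusing the hypothetical data set from above and simulating the cluster nodes with a local process pool (a real deployment would spread the partitions over separate machines):

```python
# Sketch of the parallel setup: the data is split into sub-sets, each
# "node" computes a partial result, and a collector consolidates them.
# A local process pool stands in for real cluster nodes here.
from multiprocessing import Pool

partitions = [
    [{"product": "drug_a", "revenue": 120.0}, {"product": "drug_b", "revenue": 310.0}],
    [{"product": "drug_a", "revenue": 95.0}, {"product": "drug_b", "revenue": 40.0}],
    [{"product": "drug_a", "revenue": 60.0}],
]

def partial_revenue(partition):
    # Each computing node works only on its own sub-set of the data.
    return sum(r["revenue"] for r in partition if r["product"] == "drug_a")

if __name__ == "__main__":
    with Pool(processes=len(partitions)) as pool:
        partials = pool.map(partial_revenue, partitions)  # scatter
    print(sum(partials))                                  # gather: 275.0
```

Adding a node to the cluster here means adding a partition and a worker; no single node ever has to hold or scan the complete data set.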


A popular way of deploying such a strategy is to use an implementation of the map/reduce programming paradigm. Companies such as Oracle and Pentaho are adopting the map/reduce paradigm by implementing hooks into the Hadoop framework, which does this work for you.
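To illustrate the paradigm itself (independent of Hadoop or any vendor product), a map/reduce job is expressed as a map function that emits key/value pairs and a reduce function that aggregates all values per key; a minimal in-process sketch, again using the hypothetical revenue records:

```python
# Minimal illustration of the map/reduce paradigm: map emits (key, value)
# pairs, the framework groups them by key (the "shuffle"), and reduce
# aggregates the values per key.
from collections import defaultdict

def map_fn(record):
    # Emit one (key, value) pair per input record.
    yield record["product"], record["revenue"]

def reduce_fn(key, values):
    # Aggregate all values that share the same key.
    return key, sum(values)

def run_mapreduce(records):
    groups = defaultdict(list)
    for record in records:                  # map phase
        for key, value in map_fn(record):
            groups[key].append(value)       # shuffle: group by key
    return dict(reduce_fn(k, v) for k, v in groups.items())  # reduce phase

records = [
    {"product": "drug_a", "revenue": 120.0},
    {"product": "drug_b", "revenue": 310.0},
    {"product": "drug_a", "revenue": 95.0},
]
print(run_mapreduce(records))  # {'drug_a': 215.0, 'drug_b': 310.0}
```

Because map runs independently per record and reduce independently per key, both phases can be spread over the data nodes of the previous sketch, which is exactly what frameworks such as Hadoop do at scale.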

When selecting a tool that will enable your users to build their own reports and queries, it is advisable to look at how the tool uses the map/reduce programming paradigm and how well it scales with data growth. By taking this into consideration, you can safeguard the usability of the tooling for the future, when the data and the demand on the system keep growing.
