Thursday, September 13, 2012

Get Linux file system blocksize

In a previous post I discussed the topic of blocksize for an Oracle database. As shown in that post, it can be important to understand which blocksize to select when creating your database. When talking about blocksize it is also good to know what the blocksize of your file system is. The way your database requests blocks from disk is related to the blocksize of the file system mounted on your operating system.

If you want to know the blocksize of your file system you can check this in multiple ways. One of the most commonly used ways is to execute the below command:

dumpe2fs /dev/sda1 | grep -i 'Block size'

This calls dumpe2fs, which prints the superblock and block group information for the filesystem on the given device; in the above example we gather the information for /dev/sda1.

As you can see we grep on 'Block size'. If you do not grep, you will get a LOT of information; the first part is the most useful for most users. You can simply pipe the output through more to read it comfortably:

dumpe2fs /dev/sda1 | more

As an example you can see the output of this below, which is the information from one of my workstations:

dumpe2fs 1.42 (29-Nov-2011)
Filesystem volume name:   
Last mounted on:          /
Filesystem UUID:          584232d2-3bc8-41bb-9534-c3a4f4f6e64c
Filesystem magic number:  0xEF53
Filesystem revision #:    1 (dynamic)
Filesystem features:      has_journal ext_attr resize_inode dir_index filetype needs_recovery extent flex_bg sparse_super large_file huge_file uninit_bg dir_nlink extra_isize
Filesystem flags:         signed_directory_hash 
Default mount options:    (none)
Filesystem state:         clean
Errors behavior:          Continue
Filesystem OS type:       Linux
Inode count:              91054080
Block count:              364187648
Reserved block count:     18209382
Free blocks:              337917734
Free inodes:              90742705
First block:              0
Block size:               4096
Fragment size:            4096
Reserved GDT blocks:      937
Blocks per group:         32768
Fragments per group:      32768
Inodes per group:         8192
Inode blocks per group:   512
Flex block group size:    16
Filesystem created:       Sat Dec 31 17:42:47 2011
Last mount time:          Sun Sep  9 20:37:55 2012
Last write time:          Wed Apr 11 17:52:11 2012
Mount count:              11
Maximum mount count:      21
Last checked:             Wed Apr 11 17:52:11 2012
Check interval:           15552000 (6 months)
Next check after:         Mon Oct  8 17:52:11 2012
Lifetime writes:          225 GB
Reserved blocks uid:      0 (user root)
Reserved blocks gid:      0 (group root)
First inode:              11
Inode size:           256
Required extra isize:     28
Desired extra isize:      28
Journal inode:            8
First orphan inode:       70780693
Default directory hash:   half_md4
Directory Hash Seed:      c55ef959-1938-4934-b178-eb9409c00d0b
Journal backup:           inode blocks
Journal features:         journal_incompat_revoke
Journal size:             128M
Journal length:           32768
Journal sequence:         0x00102c0c
Journal start:            30520
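
If you only need the blocksize value itself, a couple of other standard Linux commands will report it as well. A minimal sketch, assuming an ext2/3/4 filesystem on /dev/sda1 mounted on /; note that blockdev reports the blocksize the kernel uses for the block device, which can differ from the filesystem blocksize:

# Superblock information via tune2fs (ext2/3/4 only, needs root)
tune2fs -l /dev/sda1 | grep -i 'block size'

# Fundamental block size of the filesystem mounted on /
stat -fc %S /

# Blocksize the kernel uses for the block device (may differ from the filesystem)
blockdev --getbsz /dev/sda1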



Sunday, September 09, 2012

Oracle database blocksize considerations

When creating an Oracle database (or a new database file) most people quickly skip over the question of which database blocksize to use. In most cases the standard Oracle database blocksize is selected, which is 8KB (8192 bytes) or 4KB (4096 bytes). In general this might be a good choice, however it can be worthwhile to consider / reconsider your blocksize when you have a demanding environment where database performance is essential.

You have to weigh some considerations and you have to know the advantages and disadvantages of larger or smaller blocksizes for your Oracle database. From the Oracle 11.2 Database Performance Tuning Guide we can learn the following:

Smaller blocksize.
  Advantages;
  • Good for small rows with lots of random access.
  • Reduces block contention.
  Disadvantages;
  • Has relatively large space overhead due to metadata (that is, block header).
  • Not recommended for large rows. There might only be a few rows stored for each block, or worse, row chaining if a single row does not fit into a block.

Larger blocksize.
  Advantages;
  • Has lower overhead, so there is more room to store data.
  • Permits reading several rows into the buffer cache with a single I/O (depending on row size and block size).
  • Good for sequential access or very large rows (such as LOB data).

  Disadvantages;
  • Wastes space in the buffer cache, if you are doing random access to small rows and have a large block size. For example, with an 8 KB block size and 50 byte row size, you waste 7,950 bytes in the buffer cache when doing random access.
  • Not good for index blocks used in an OLTP environment, because they increase block contention on the index leaf blocks.

Oracle provides some guidelines for making this decision, based on whether you have many read operations or many write operations. Remember that you can specify a blocksize per tablespace (the datafiles in that tablespace inherit it), so when you are designing an architecture for your database you might consider placing write-heavy data in a different tablespace than the data you mostly read. You will have to make some trade-offs here and this might affect your application design, however it is worth looking at your options.

Read operations;
Regardless of the size of the data, the goal is to minimize the number of reads required to retrieve the desired data.
  • If the rows are small and access is predominantly random, then choose a smaller block size.
  • If the rows are small and access is predominantly sequential, then choose a larger block size.
  • If the rows are small and access is both random and sequential, then it might be effective to choose a larger block size.
  • If the rows are large, such as rows containing large object (LOB) data, then choose a larger block size.
Write operations;
  • For high-concurrency OLTP systems, consider appropriate values for INITRANS, MAXTRANS, and FREELISTS when using a larger block size. These parameters affect the degree of update concurrency allowed within a block. However, you do not need to specify the value for FREELISTS when using automatic segment-space management.
  • If you are uncertain about which block size to choose, then try a database block size of 8 KB for most systems that process a large number of transactions. This represents a good compromise and is usually effective. Only systems processing LOB data need more than 8 KB.
A larger data block size provides greater efficiency in disk and memory I/O (access and storage of data). Therefore, consider specifying a block size larger than your operating system block size if the following conditions exist:
  • Oracle Database is on a large computer system with a large amount of memory and fast disk drives. For example, databases controlled by mainframe computers with vast hardware resources typically use a data block size of 4K or greater.
  • The operating system that runs Oracle Database uses a small operating system block size. For example, if the operating system block size is 1K and the default data block size matches this, the database may be performing an excessive amount of disk I/O during normal operation. For best performance in this case, a database block should consist of multiple operating system blocks.
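
As an illustration of using a non-default blocksize for part of your data, the sketch below creates a dedicated tablespace with 16K blocks. This is a hedged example: the cache size, tablespace name, datafile path and sizes are placeholders you will have to adapt, and a buffer cache for the non-default blocksize must be configured before the tablespace can be created.

-- Reserve a buffer cache for 16K blocks (the size here is just an example)
ALTER SYSTEM SET db_16k_cache_size = 64M;

-- Create a tablespace using 16K blocks, e.g. for large rows or sequential access
CREATE TABLESPACE data_16k
  DATAFILE '/u01/oradata/ORCL/data_16k01.dbf' SIZE 512M
  BLOCKSIZE 16K;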
You can simply check your current blocksize per datafile by executing the below SQL query:
 SELECT 
       name, 
       bytes, 
       blocks, 
       block_size 
 FROM 
     v$datafile
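
In addition you can check the default blocksize of the database and the blocksize per tablespace. A small sketch:

-- Default database block size
SELECT value AS db_block_size FROM v$parameter WHERE name = 'db_block_size';

-- Blocksize per tablespace (datafiles inherit the blocksize of their tablespace)
SELECT tablespace_name, block_size FROM dba_tablespaces ORDER BY tablespace_name;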

Sunday, September 02, 2012

Developing big-data triggers

Most of you will know CERN from the LHC (Large Hadron Collider) experiment used in the search for the Higgs Boson particle. This is one of the most interesting experiments within physics at this moment and the search for the Higgs Boson particle comes into the news quite often. What a lot of people however do not realize is that this research is somewhat different from traditional research in the field of physics when it comes to the amount of data.

When Isaac Newton “discovered” gravity it only took him a tree to lean against and an apple to fall down. Those are not a lot of input streams of information. When it comes to finding the Higgs Boson particle we are playing in a totally different field when it comes to the number of inputs. During an event the data capture system will store, every second, a dataflow roughly the size of six times the Encyclopædia Britannica.

The main issue is that the systems will not be able to handle and store all the data presented to the sensors. All sensors have triggers developed to capture the most important data. As we are talking about finding a particle that has never been observed before, the triggers might discard the Higgs Boson particle data instead of storing it for analysis. Developing the triggers is therefore one of the most crucial and critical parts of the experiment. In the below video Tulika Bose, an assistant professor at Boston University, gives a short introduction to this.



Within CERN the TriDAS Project is responsible for developing the Data Acquisition and High-Level Trigger systems. Those systems select and store the data and finally deliver data that can be analyzed. For this a large group of scientists and top people from a large number of IT companies have been working together. IT companies like Oracle and Intel have been providing CERN with people and equipment, mainly so they can test their new systems in one of the most demanding and data intensive setups currently operational.

Below you can see a high level architecture of the CMS DAQ system. This image comes from the "Technical Design Report, Volume 2" delivered by the TriDAS Project team.

In a somewhat more detailed view the system looks like the architecture below from the ALICE project. This shows you the connection to the databases and other parts of the systems.


While finding the Higgs Boson particle is possibly not that interesting to the general public in the short term, having IT companies working together with CERN is, even though it might not be that obvious at first. CERN is handling an enormous load of data. IT companies who participate in this project are building new hardware, software and algorithms that are specific to finding the Higgs Boson particle. However, the developed technology will be reused in building solutions that will end up serving customers.

As big-data is getting more and more attention, and as all kinds of big-data based solutions are being developed, we can see that this is no longer a purely scientific playing field. It is getting into the day-to-day lives of people and will help them in the very near future. So, next time you question what the search for the Higgs Boson particle brings you as an individual in the short term, take the big-data part into consideration and do not think it is only interesting to scientists (which is an incorrect statement already, however a topic I will not cover on this blog :-) )

Oracle logistics chain

The overall logistics chain used within companies around the world is roughly 90% the same for all companies. There are some customer-specific variances and the way it is implemented might differ, however overall they all look quite the same. You purchase goods, you do something with them and you sell them to customers. Purchasing goods can be purchasing parts from which you build your own product, purchasing goods to mine your final product, or purchase, store and sell. All in all your company will buy things. Selling can be to external customers or internal customers (cost centers), however somewhere your company will make a profit (I hope).

Meaning that all in all the model will look quite the same and you will see this model reflected in your ERP system. Whether you run Oracle eBS, SAP or another ERP system, it will somehow have a standard logistics chain in the system. Each ERP product will have somewhat different naming for the modules. For Oracle eBS the logistics chain will look somewhat like the one below.



This model quickly gives an introduction to where to place the Oracle e-Business Suite modules in the logistics chain. You can see for example Oracle AP, Oracle AR, Inventory, General Ledger, etc.

Tuesday, August 28, 2012

Oracle Exadata InfiniBand connections

Oracle Exalytics is the Oracle appliance for running business intelligence in memory. Oracle Exalytics can be deployed in existing IT environments by itself or in conjunction with Oracle Exadata and/or Oracle Exalogic to enable extreme performance. A standard Exalytics machine is equipped with 1 terabyte of memory which can be used for in-memory operations. Even though it is capable of holding 1 terabyte in memory, it still requires that data is provided to the system. When you have “stable” data which is not changing frequently this is not a direct problem, however if you have a dataset which changes rapidly you can experience performance that is not up to what you expected. In those cases it can be beneficial to connect your Exalytics appliance, or a cluster of Exalytics appliances, to an Oracle Exadata appliance.

One of the benefits of connecting your Exalytics appliance to an Exadata appliance is that you can connect them via InfiniBand. Both appliances are equipped with InfiniBand connectors. InfiniBand is a switched fabric communications link used in high-performance computing and enterprise data centers. Its features include high throughput, low latency, quality of service and failover, and it is designed to be scalable. The InfiniBand architecture specification defines a connection between processor nodes and high performance I/O nodes such as storage devices.

By default an Oracle Exadata (full rack) setup is equipped with 3 InfiniBand switches, one spine switch and 2 leaf switches, which are connected as shown below. This hardens the setup against the failure of one of the switches, makes the appliance setup more resilient and enables you to update the firmware of your switches without the need for downtime.



When connecting a single Exalytics to your Exadata appliance it is common practice to connect to both leaf switches. You will have 2 InfiniBand ports on the back of your Exalytics appliance; those are used to connect to the two leaf switches, which makes the connection resilient against the failure of one of the switches.
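
Once the cabling is in place you can verify the InfiniBand fabric from the operating system. A hedged sketch, assuming the standard infiniband-diags tools are available on the appliance:

# Show the state and rate of the local InfiniBand ports
ibstat

# List the switches visible on the fabric (spine and leaf switches should appear)
ibswitches

# Quickly spot links that did not come up
iblinkinfo | grep -i down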



More information can be found in the slide deck about this subject shown below.


Sunday, August 19, 2012

Oracle NoSQL and Eclipse

The Oracle NoSQL database is a very simple to deploy NoSQL key-value store which requires almost no setup. In a previous blogpost I showed how you can quickly set up your Oracle NoSQL database and have it ready to be used. However, having your Oracle NoSQL database up and running is only the first step. The second step is actually doing something with it, which in most cases means coding an application for it. The Oracle NoSQL database ships with a couple of code examples.
In the below example you can see how to use the Eclipse development platform with your Oracle NoSQL database and how to use the Oracle NoSQL examples.



Wednesday, August 15, 2012

Integrated SOA gateway generate WSDL

Oracle e-Business Suite ships with an Integrated SOA Gateway which can be used to generate webservices. This will help you to expose Oracle eBS functionality to other applications via a webservice. The Integration Repository is filled with APIs you can use, however they are not deployed by default. If you would like to deploy them you will have to generate a WSDL for them.

The issue is that by default you will not have the option to create a WSDL when you grant yourself the Oracle eBS responsibility “Integrated SOA Gateway”. If you select this responsibility and search for an API you would like to turn into a webservice you will see a page like the one below. There should be a button stating “Generate WSDL”, however as you can see this one is missing.


You need to take some steps before this button will become available.

1) Log in to the system as the user SYSADMIN or as a user who has been granted the role “Security Administrator”. On how to grant this role see this blogpost.

2) Select the responsibility User Management and query for the user you would like to grant the “Generate WSDL” option.

3) Grant this user the role “Irep Administrator”.


Now you should be able to generate a WSDL and expose the functionality to other applications via a webservice.

Couchbase database overview

Robert Scoble is running a nice interview on Couchbase. Couchbase is a JSON document store NoSQL database. Couchbase is one of the upcoming NoSQL implementations, is seen more and more in the big-data field and is used by quite some startups.

Couchbase is a privately held company funded by Accel Partners, Ignition Partners, Mayfield Fund and North Bridge Venture Partners. It’s the simple, fast, elastic NoSQL database technology for interactive web applications. Couchbase is the NoSQL database market share leader, with production deployments at AOL, Deutsche Post, NTT Docomo, Salesforce.com, Turner Broadcasting Systems, Zynga and hundreds of other organizations worldwide. Couchbase products enable a more flexible, scalable, high-performance, and cost-effective approach to data management than relational database solutions and are particularly well suited for web applications deployed on virtualized or cloud infrastructures, and mobile applications requiring real-time data synchronization between devices and the cloud.

“NoSQL really started a few years ago when people started having increased problems with relational databases and technologies like MySQL and the proprietary technologies from the big guys,” explains Bob Wiederhold, CEO of Couchbase. “A lot of application developers were starting to look for something different. They wanted something that provided them a much more flexible data model, something with much higher performance and something that would scale a lot more easily to deal with the huge amount of data.”

Couchbase Server makes it easy to optimally match resources to the changing needs of an application by automatically distributing data and I/O across commodity servers or virtual machines. It scales out and supports live cluster topology changes while continuing to service data operations. Its managed object caching technology delivers consistent, sub-millisecond random reads, while sustaining high-throughput writes. As a document-oriented database, Couchbase Server accommodates changing data management requirements without the burden of schema management. Getting started with Couchbase Server is easy. Linux, Windows and OSX packages can be installed and configured in minutes.


Saturday, August 11, 2012

Oracle network sniffer tool


For all developers who have been working on software that is doing “something” on the network, there has been a moment that you had to analyze what really happened between two points. Your code should make sure a certain set of data is sent from A to B, however you have a strange feeling that it is not doing quite what you want it to do during runtime. A common way of checking what really happens is diving into the network stream and capturing the network traffic with a network sniffer to find out. Most people who have been looking into network communication have been using tools like Ettercap. Ettercap is used by a wide group of developers, network experts, security experts and hackers to capture all things traveling over the network. Other tools are also available, however Ettercap is one of the better known tools and easy to use.

When you are developing Oracle ADF applications you do want to check what is really happening on a network level from time to time. Especially when you are trying to speed up your application and are looking to remove all overhead in the network layer to keep it as lean as possible.

Oracle provides an HTTP analyzer and ships it with Oracle JDeveloper. In the below video you can see a quick example of how to set up the Oracle HTTP Analyzer for ADF round-trip monitoring.


Friday, August 10, 2012

The new social media is 3D


Do you think social media is cool? Do you think you know all about the ways to interact with social media platforms like Facebook, Google+, Twitter and Weibo? Most likely you do if you are a frequent user of social media platforms. However, do you also know what the future of presenting social media networks might look like? It is good to know that a lot of people are already interacting on social media platforms and their number is still rising. Upcoming generations will see it as one of the normal ways of interacting. Where we now have a large group of people who will never join an online social network, in the future adoption (in online countries) will come close to 100% in my personal opinion. The infographic below shows the usage of social networks in 2011.

You can see some concepts online of how the future might look. For example brandrepublic.com is showing a new way of interacting with Facebook in a 3D setting. Just watch the video below, which was made together with LG and makes use of an LG screen.


Now mix this thought with Google Glass (watch the video below) and it will not come as a surprise that in the near future you might see people standing in the street making strange gestures. They might just be rearranging some topics in their 3D social media feed or responding to a message.

Missing the mobile hype


Mobile usage is skyrocketing and especially the use of apps on smartphones has been in overdrive for quite some time. That would suggest that companies are investing heavily in smartphones and apps. One would think that mobile apps are the talk of the town at many corporate marketing departments and therefore a hot topic at many CIO meetings. You could expect that mobile is on the combined strategic roadmap of both IT and marketing.

A recent study done by Forrester shows that only 40% of the companies actually do have a mobile roadmap.

“Successful mobile road maps require investment in supporting activities. Making specific investments in mobile education and skills development, maintaining organizational flexibility to increase speed to market, and adapting to local markets are key to the success of a mobile road map.”

source: Service2Media
Indeed this is true, however having a good mobile strategy can put your company on top of the food chain, which is a comfortable place to be in the current economic period. In an economic downturn you want to make your brand (B2B and B2C) as visible as possible and you would like to offer your customers as much easy service as possible at a price as low as possible.

For example, say you run a reservation line for restaurants. People commonly decide where to eat when they are already with friends, and such events are not planned well in advance. Due to the nature of the decision cycle a phone call is commonly the quickest way to find out if there are still tables available. Every phone call to a service agent costs you 50 cents, so every reservation made through an app instead of the phone saves you that amount; calculating your ROI is quickly done. You also provide an extra service to your customers, who can now quickly browse available restaurants in the area from their smartphone.

The examples are numerous and the options are endless. Missing the boat on mobile can put your company at the bottom of the food chain. There are quite a few good companies around that can help you plan your mobile strategy and execute the full plan. Some good examples are Capgemini and Service2Media.

Tuesday, August 07, 2012

Oracle Data Integrator failed Checking operating system certification


Oracle Data Integrator is a comprehensive data integration platform that covers all data integration requirements: from high-volume, high-performance batch loads, to event-driven, trickle-feed integration processes, to SOA-enabled data services. Oracle Data Integrator can be used to communicate between Oracle databases and the Hadoop platform. When working on big-data projects ODI can help you to quickly develop and deploy map reduce solutions within your enterprise landscape.

ODI can be installed on several platforms including Oracle Linux. One of the issues with installing on Oracle Linux is that the latest version of Oracle Linux is currently not supported, which results in an error during installation. The installer (in my case on Oracle Linux 6.3) will state the following:

Checking operating system certification
Expected result: One of enterprise-5.4, enterprise-4, enterprise-5, redhat-5.4, redhat-4, redhat-5, SuSE-10, SuSE-11
Actual Result: redhat-6.3

Meaning that Oracle Data Integrator is currently not certified on Oracle Linux 6.3 and, secondly, that the Oracle Universal Installer identifies the operating system as a redhat-6.3 distribution.


You can resolve this by adding Oracle Linux (redhat) as a certified system to the file the installer uses to check which systems are certified. In the past you could do so by adding a line to the file $basedir/Disk1/install/linux/oraparm.ini. However, in the newer versions of the Oracle Universal Installer a new way of working has been introduced and you have to edit a different file.

You will have to edit the XML file where the prerequisites are stored. You can find this file at $basedir/Disk1/stage/prereq/linux/refhost.xml. In the refhost.xml file you will find a couple of configurations which are allowed; I added an entry for Oracle Linux 6.3 (redhat-6.3) to the file, modeled on one of the existing certified entries, to make it a valid configuration.
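
As a simple precaution it is good practice to keep a copy of the original file before you change it; a minimal sketch, using the $basedir path mentioned above:

# Back up the prerequisite file before editing it
cd $basedir/Disk1/stage/prereq/linux
cp refhost.xml refhost.xml.orig

# Add an entry for the redhat-6.3 release, modeled on an existing certified entry
vi refhost.xml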


After changing the XML file and saving it you can restart the installer and you will see that the check now completes without any issue. I have noticed some differences between machines: sometimes it is needed to restart the installer and sometimes you can simply do a retry. To be on the safe side it is good practice to start the installer again.



Wednesday, July 25, 2012

Oracle Web Cache routing only mode


Oracle Web Cache not only provides the option to act as a web cache server, it also provides the option to do load balancing over multiple nodes. Having a web cache server balance load over multiple nodes is a logical thing to do, as in most cases where you use Web Cache you will most likely be serving a high end website which needs to be up and performing all the time.

A less known option of the Oracle Web Cache server is that it can also act only as a load balancer, ignoring the caching part of the solution. In some cases you explicitly do not want to use the caching functionality, however you do want to use the load balancing options.



To set Web Cache in load balancing only mode you will have to change the configuration in your webcache.xml file. The .xml configuration file can be located at:

(UNIX) ORACLE_INSTANCE//config/WebCache/webcache_name
(Windows) ORACLE_INSTANCE\\config\WebCache\webcache_name

Within the configuration you will have to set the ROUTINGONLY attribute to YES. After setting this you will have to restart the Web Cache server to activate the new settings.

<CACHE WCDEBUGON="NO" CHRONOSONPERNODE="NO" CAPACITY="301" VOTES="1"
  INSTANCENAME="instance_name" COMPONENTNAME="component_name" ORACLEINSTANCE="instance" HOSTNAME="web_cache_host_name"
  ORACLEHOME="directory" NAME="web_cache_name"
  ROUTINGONLY="YES">
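
Restarting Web Cache can be done with opmnctl; a hedged example, assuming the Web Cache component in your instance is named webcache1 (check opmnctl status for the actual name):

# Restart only the Web Cache component of this instance
$ORACLE_INSTANCE/bin/opmnctl restartproc ias-component=webcache1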

When you have done so you can check in the Web Cache Manager whether routing only mode is set. You should see the below message in the administration section informing you that you are running in routing only mode.



Pharmaceutical map reduce


The website pharmatimes.com is diving into the results of a report recently published by Oracle on how the pharmaceutical industry is working with data and what the current bottlenecks are.

Around 93% said their organisation is collecting and managing more business information today than two years ago, by an average of 78% more. However 29% of life sciences executives give their company a 'D' or 'F' in preparedness to manage the data deluge; only 10% of the latter group give their organisation an 'A'.


Life sciences respondents say they are unable to realise, on average, 20% of additional revenue per year, translating to about $98.5 million per year, "by not being able to fully leverage the information they collect." Of those surveyed from the pharma/biotech sector, 30% are "most frustrated with their inability to give business managers access to the information they need without relying on the IT team".

It is interesting to see that there is an increase in the amount of data collected and the data potentially available to analysts and business users. Secondly it is interesting to see that there is frustration within companies about needing IT to be involved whenever they want to use the collected data. What you commonly see in most companies is that the business has a need for a report or analysis and turns to IT for it. The request for change is added to the list of work for the developers within IT, and they build and deliver the new reporting functionality to the user who requested it.

When there is an urgent business need for a new report, it can be frustrating that there is a lead time and the business has to wait until the report is delivered. Users in general would like to have a framework where they can quickly and easily build their own reporting and by doing so no longer depend on the IT department.

Such tools and platforms are available in the market, however they are not commonly deployed, for a couple of reasons.

a) The use of such tooling is blocked by the IT department as they are afraid that it will decrease their value within the company as a department

b) IT claims that the reports and the resulting queries built by the business users are sub-optimal and could cause performance issues

c) The use of “self-built” reporting tools is considered by some people to have a steep learning curve, and because of this the tooling is not deployed.

Point C is something you can debate; it will depend on the level of the employees and their affinity with IT, and on the tool(s) selected, whether this is indeed true. However, points A and B can be tackled and should not be holding your company back from enabling users to build their own reports.

Reason A is something that will have to be tackled in the political arena of your company. If management backing is available, IT management should be persuaded to provide the needed support in getting the project started. This will inevitably lead to a decrease of work for the IT department in the form of building new reports, however it will increase the need to support the new platforms and can open a whole new area of services for IT. This new area can also include building the more complex and challenging reports.

Reason B depends heavily on a couple of factors. One of them is how much understanding the users have of what their questions to the system will do performance wise and how well they are trained in using the tool in a correct manner. Secondly it will depend on the selected tool: how “smart” the tool is in creating the queries based upon what the user builds with a drag and drop interface. One last factor is the size of the data you have available: querying a couple of terabytes will be faster than querying multiple petabytes of data.

Removing reason B as an argument against deploying such tools requires more detailed thought and planning. It will depend partially on the tool selection, however it will also depend on how you organize your data. When we look at the rate at which companies are gathering data, you can state that for a large number of companies it would be beneficial to look at solutions in the field of big-data. Solutions developed and deployed in this field look at a different, more distributed way of storing and handling data. If you take the design of your hardware and the way you access and compute data into consideration, you can deploy a platform which is ideal for users who run their own written reports and queries.

In a traditional setup, as shown below, you store all your data in a single data source and you have a single node which takes care of computing the results and communicating with the users. For small sets of data this works; however, when working with large sets of data this can become problematic as the resources available to the computing node become a bottleneck. When lots of users run their custom written queries on it, performance can drop to a no longer acceptable level. Due to the nature of the setup, scaling out horizontally is not an option and you can only scale vertically by adding more CPUs to your computing node.


In a more parallel way of doing things, and in line with the thinking on how to handle big data, you can create a cluster of smaller sub-sets of your data and dedicate a computing node to each set of data. When a user starts a request, all nodes work on this request for a small amount of time and send back their result to a node which collects all the answers and provides them in a consolidated way to the end user. This way of working gives you a faster way of computing your results and at the same time provides the option to scale horizontally by adding more computing and data nodes when your data grows or when the need for more performance arises.


A popular way of deploying such a strategy is by deploying an implementation of the map/reduce programming paradigm. Companies like Oracle and Pentaho are adopting the map/reduce paradigm by implementing hooks into the Hadoop framework, which will do this for you.

When selecting a tool that will enable your users to build their own reports and queries it is advisable to look at how this tool uses the map/reduce programming paradigm and how well it scales with data growth. By taking this into consideration you can safeguard the usability of the tooling for the future, when the data and the demand on the system are growing.

Friday, July 20, 2012

Infection pattern analysis in Oracle


Quite recently some customers have asked me a somewhat similar question. They all have applications residing in an Oracle database. The application is a standard application built by a third party. During the past years they have been building additions to those applications. Now they come to a moment in time where the third party vendor is providing upgrades to the software. The upgrade will involve a whole set of newly developed or already existing but changed database objects.

The past couple of years they have been building extensions which have caused a version lock-in. The question that is now popping up is how intertwined the custom-built extensions are with the standard code and with the parts of the code that will change. To give a clear answer on this you will have to check the dependencies within the application (including the extensions).

When unraveling such a puzzle you will have to look at infection path analysis. For example, we have the below application which includes an extension. A, B, C, D & E are the standard components of the application; 1, 2, 3 & 4 are custom objects built as an extension. You can see the dependencies towards standard components visualized by the lines going from the custom objects to the standard objects.


In the above example all the red objects will change during an upgrade. This means that the customized objects you should look at, based upon this view, are object 1 (depending on changing object A) and object 3 (depending on changing objects A & D).

This is a first-generation infection path which only shows you the direct relations between custom objects and changing objects. You should take this a step deeper. In the below example we have gone a dependency level deeper and you can see that 4 depends on 3 and 1 depends on 2.



As we potentially have to change 3 to cope with the changes in A & D we also have to look at the potential code change of object 3. And if 3 is changed this might affect object 4.

Object 1 depends on object 2, and at this level of the infection 2 is not changed, so this does not add anything to the list of objects to check.

With every level you go deeper into an infection pattern you will see more objects that are potentially “infected” by the change and should be looked at properly by a developer. You can also identify “resistant islands” which are in no way affected by the change. Potentially you can have your entire database analyzed with a proper infection pattern algorithm. Whether this is wise is up for debate, because it can cloud the usability and the correctness of your outcome. In general I tend to think a 3 or 4 level deep infection pattern analysis is appropriate for Oracle databases.

When you are trying to develop your own infection pattern algorithm for a database it is good to have a look at a couple of things.

Within the database, dependencies are stored in sys.dependency$ and more information about the objects is stored in dba_objects. Combining the two in a query will give you a head start in building your algorithm. As a simple example, if I wanted to know something about object 123321 I could fire off the query:

select 
      objects2.owner as dep_owner,
      objects2.object_name as dep_object_name,
      objects2.object_type as dep_object_type
  from 
      sys.dependency$ depen,
      dba_objects objects2
where 
      depen.p_obj# = 123321
      and objects2.owner not like 'SYS'
      and objects2.object_type not in ('SYNONYM')
      and objects2.object_id = depen.d_obj#
order by 1, 2, 3;

If you build this into a more profound PL/SQL script and add a list of changing components to start with, you can create a dependency list. The easiest way is to output it to a flat file and provide it to your developers and consultants as a reference of things to look into, as they are the most likely to be hit by the upgrade.
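
As a hedged sketch of that idea, the query below walks the dependency tree several levels deep in one go, starting from a list of changing objects. The starting object names ('A', 'D') and the depth limit are placeholders you will have to replace with the objects of your upgrade:

select level as infection_level,
       obj.owner,
       obj.object_name,
       obj.object_type
  from sys.dependency$ dep,
       dba_objects     obj
 where obj.object_id = dep.d_obj#
   and obj.owner not like 'SYS'
   and obj.object_type not in ('SYNONYM')
 start with dep.p_obj# in (select object_id
                             from dba_objects
                            where object_name in ('A', 'D'))  -- the changing objects
connect by nocycle prior dep.d_obj# = dep.p_obj#
       and level <= 4                                         -- limit the analysis depth
 order by infection_level, obj.owner, obj.object_name;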

However, as we are humans and humans are somewhat visually and like to see things in pictures a great way to do this is not to output it to a flat text file however build as a secondary output a file that you can parse via DOT. DOT is a markup language used to plot relation diagrams as the ones above used in the examples. As DOT is a free and opensource way of doing this and it saves you hours and hours of building diagrams in MS Visio I do think it is worth looking into the DOT Language documentation.