Sunday, January 04, 2009

Oracle Database Server Tuning

A couple of days ago I was asked by someone to have a look at one of the LInux servers running a Oracle database. The complain was that the database was slow, query optimization was done and the database was tuned so that from a Oracle point of view everything should work correctly. The actual question was, 'do we have to add 2 more CPU's to the machine or do you advise adding 4 more CPU's to the machine'. Adding more CPU's to a server will however not always make your application run faster. To understand this principle you have to have some understanding of the process scheduler. I wil try to explain some of the detail here.

The 'problem' with SMP (Symmetric Multi Processor architecture) servers is that you can put more than one CPU in your server (and/or can have multiply cores). Your UNIX server will be able to make use of all your processor cores however a single single process can only run on a single core at once. It is important to understand this in the SMP architecture. When running a UNIX server with Oracle on top (or any other software) you will have a lot of process running which will all claim a amount of processing time on a CPU. The scheduling mechanism will schedule processes to run on a CPU and will share CPU's.You will have more processes running than can be handled by the CPU's so the scheduler will put processes in wait queues and will make sure everything will get his fair share of processing time. Best know schedulers are the Staircase scheduler which is developed for a large part by Con Kolivas and the priority based scheduler.

A scheduler will assign a state to a process, the general processing sates are the running state, suspended state(blocked state) or ready-to-run(ready) state.

Running means that the process is executed on the CPU, instruction sets are running on the CPU. The process will be running on the CPU until it is done or it is 'removed' from the CPU because an other process has to be run and has priority over the running the process. It can also happen that a process is 'removed' from the CPU because it has to wait for a resource like for example disk I/O. in this case the process is 'removed' from the cpu in a suspended state, when resources become available (data from the disk arrives) the process is returned in a ready-to-run state.

The ready to run state is a state where the process is placed in the run queue and is waiting for its moment to be placed upon the CPU and execute its instructions. Each CPU has its own run queue. When the number of processes in the run queue grows this is a indicator that the system is under utilized with CPU power. processes getting in and out of the running queue is normal however they should not 'stay' in the queue because this indicates that processes have to wait for processing time and this makes your server slowing down. The run queue is one of the most likely bottlenecks when your servers CPU's are unable to handle the load.

Monitoring the number of processes in the run queue is one thing to monitor, an other thing you will have to monitor is the number of processes in the suspended state. The number of processes waiting for, for example, data being read from the disks can indicate that your disks are slow. The same you can say for network i/o memory access etc.

However, it is possible that your CPU's are not keeping up, your network speed is correct, your disks are fast than you will see that processes are piling up. You can use vmstat to check this, below is a example of vmstat.



procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
2 0 0 586796 12932 266552 0 0 419 48 29 236 2 8 81 8
0 0 0 586648 12932 266552 0 0 0 0 13 123 0 3 98 0
0 0 0 586664 12932 266552 0 0 0 0 12 116 1 2 97 0
0 0 0 586664 12932 266552 0 0 0 0 12 115 1 2 98 0
0 0 0 586648 12932 266552 0 0 0 0 12 112 0 2 98 0
0 0 0 586664 12932 266552 0 0 0 0 12 115 1 2 97 0
0 0 0 586664 12932 266552 0 0 0 0 37 160 1 2 97 0
1 0 0 586664 12932 266600 0 0 0 0 97 347 2 5 93 0
4 1 0 585020 12984 267744 0 0 596 14 112 536 3 10 62 26
0 1 0 569168 13180 279640 0 0 6040 0 252 1448 6 23 0 71
0 1 0 553616 13308 287104 0 0 3772 6 243 891 14 27 0 60
0 0 0 542484 13384 288592 0 0 760 162 104 1117 16 31 34 19
1 0 0 542244 13396 288728 0 0 60 0 22 231 2 5 88 5
0 0 0 541888 13440 288876 0 0 50 250 54 257 2 6 88 5
0 0 0 539268 13456 288912 0 0 8 114 48 256 5 7 86 1


Important fields here are the first field 'r' and the 'id' field. 'r' is providing you information about the number of processes in the run queue. It is normal for a system to have processes in the run queue, there is no standard for what is normal however keeping a number of 4 processes on average can be quite ok. The second field, the 'id' field is providing you information about the load on the system. When the system is complete idle it will read 100 when it is completely loaded it will state 100. Those fields can provide you some insight on the load and the run queue on your system. If your system is real slow and the number of processes in the run queue is low you can state that there is no use to add more CPU's because the CPU's will not have work to process. In those cases you will have to look where the bottleneck is however you can say with some degree of certainty that the CPU's are not the bottleneck and that adding more CPU's will not help. Adding faster CPU's will help process the requests faster so when you have the feeling processes take to long to complete and there is no other bottleneck to solve in I/O for example you can always add faster CPU's.

In the example above I have executed the command 'vmstat -n 1' to make vmstat show the stats every second until I stopped the process. To get a good average however you should run this process for a longer period like a couple of weeks to get good insight. The data should be analyzed to get a good insight in the overall average and the average in certain periods of the day and/or week. Even do vmstat is providing real good insight you might want to look at other tools or scripts if you are planning to have a longer period of analysis of the system. Oracle is providing some scripts for this, you might want to take a look at the following metalink notes; 301137.1 and 466966.1.

Conclusion, as you can see adding more CPU's to a server will not always make the application run faster. In some conditions you have to change other things to remove the bottleneck, for example adding faster memory, adding faster disks, adding a faster NIC or looking at your storage systems to get your database running faster.

No comments: