At SmartApps we are currently working on a project to move all our Linux environments to Oracle Unbreakable Linux. Because we are an Oracle-minded company and host Oracle databases and Oracle E-Business Suite on Linux platforms, moving from Red Hat to Oracle Unbreakable Linux is a logical choice for us.
The SmartApps server farm consists of an enormous number of Linux servers, and a portion of them needed an operating system upgrade for various reasons. Until now we were running Red Hat servers, but Oracle now has its own Linux distribution. Because standardization is key when you operate large datacenters, the decision to switch to Oracle Unbreakable Linux was an easy one to make.
Some of the servers in the first set of upgrades are production servers, so the upgrade had to be done as fast as possible, with no more than 30 minutes of downtime per server. This would ensure that we stayed inside the maintenance windows defined in the customer SLAs.
Because of this, the decision was made to kickstart all the servers. Instead of starting with CD1, we did a PXE boot, and the servers would connect to a central server which holds an image of Oracle Unbreakable Linux. The servers would then install over the fiber network without (much) human intervention. Some information files were used to capture server-specific information and incorporate it in the install: things like which IP address is bound to which network interface, which network mounts need to be made, which users should be created and which Oracle instance needs to be started on which server.
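To give an idea of how per-server information can be merged into a generic kickstart file, here is a minimal sketch in Python. It is not the script we used; the `Interface` class, the `network_lines` function, the host name and the addresses are all hypothetical and only illustrate the idea. It assumes static addressing and produces the `network` lines of a kickstart file.

```python
from dataclasses import dataclass


@dataclass
class Interface:
    """Per-server network details (hypothetical structure for this sketch)."""
    device: str
    ip: str
    netmask: str
    gateway: str = ""


def network_lines(hostname, interfaces):
    """Build one kickstart 'network' line per interface.

    The hostname is attached to the first interface only.
    """
    lines = []
    for index, nic in enumerate(interfaces):
        parts = [
            "network",
            "--bootproto=static",
            f"--device={nic.device}",
            f"--ip={nic.ip}",
            f"--netmask={nic.netmask}",
        ]
        if nic.gateway:
            parts.append(f"--gateway={nic.gateway}")
        if index == 0:
            parts.append(f"--hostname={hostname}")
        lines.append(" ".join(parts))
    return lines


if __name__ == "__main__":
    # Example values only (documentation address range), not real servers.
    nics = [
        Interface("eth0", "192.0.2.10", "255.255.255.0", "192.0.2.1"),
        Interface("eth1", "192.0.2.138", "255.255.255.128"),
    ]
    for line in network_lines("db01", nics):
        print(line)
```

The same approach can be extended to the other per-server items (mounts, users, Oracle instances) by generating the matching lines for the post-install section from the same information files.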
Writing the kickstart script, and making it generic enough to use on all the servers in the first upgrade batch of the server farm, was the most time-consuming part of all. However, once the script was created and tested, installing a new server was indeed a <30 minute job.
The biggest problem we found while creating and testing the script was that PXE boot cannot always cope with certain settings on Cisco network equipment. This was the case for the servers that are not on fiber optic links. The cause is that we have redundant network links to ensure network uptime in case of a cable failure. To make sure that network traffic does not loop endlessly through these redundant links, we have configured the Spanning Tree Protocol (an OSI layer-2 protocol), which blocks redundant paths so that no loops can form.
While your switch is doing spanning tree calculations, the port does not forward any traffic at all, which can last up to the first 50 seconds, and that may exceed the timeouts in your PXE setup. On your switch's ports, try using "spanning-tree portfast" to jump immediately to a forwarding state. This solved the problem for us. Apart from that, we were able to do a very fast and successful upgrade of the first set of operating systems, and we are now planning to do the second set very shortly.
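The 50 seconds can be reconstructed from the default IEEE 802.1D timers: a max age of 20 seconds plus a forward delay of 15 seconds spent in the listening state and another 15 seconds in the learning state. The small Python sketch below does that arithmetic and compares it with a PXE timeout. The function names and the 30-second PXE timeout are assumptions made for this illustration, and treating portfast as "zero delay" is a simplification. A port that has just come up typically only goes through listening and learning (30 seconds with default timers), so 50 seconds should be read as a worst case.

```python
# Default IEEE 802.1D timer values, in seconds.
MAX_AGE = 20
FORWARD_DELAY = 15  # applied twice: once for listening, once for learning


def worst_case_blocked_seconds(max_age=MAX_AGE,
                               forward_delay=FORWARD_DELAY,
                               portfast=False):
    """Worst-case time before a port forwards traffic.

    With portfast the port skips listening and learning; this sketch
    models that as zero delay, which is a simplification.
    """
    if portfast:
        return 0
    return max_age + 2 * forward_delay


def pxe_can_boot(pxe_timeout_seconds, **stp_settings):
    """True if the port forwards before the (assumed) PXE timeout expires."""
    return worst_case_blocked_seconds(**stp_settings) < pxe_timeout_seconds


if __name__ == "__main__":
    assumed_pxe_timeout = 30  # hypothetical value; check your own PXE setup
    print("blocked for up to", worst_case_blocked_seconds(), "seconds")
    print("boots without portfast:", pxe_can_boot(assumed_pxe_timeout))
    print("boots with portfast:", pxe_can_boot(assumed_pxe_timeout, portfast=True))
```

If the timers on your switches have been tuned away from the defaults, pass your own `max_age` and `forward_delay` values to see whether the PXE client still gives up before the port starts forwarding.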
If you would like to know more about how IEEE Standard 802.1D works, have a look at http://ieee802.org/1/ . If you would like to know more about the algorithm behind the Spanning Tree Protocol, you can have a look at this document: http://www1.cs.columbia.edu/~ji/F02/ir02/p44-perlman.pdf