Saturday, December 13, 2008

Cluster Computing Network Blueprint

Those who have watched the lecture by Aaron Kimball from Google, "Cluster Computing and MapReduce Lecture 1", might have noticed that a big portion of the first part of the lecture is about networking and why networking is so important to distributed computing.

"Designing real distributed systems requires consideration of networking topology."

Let's say you are designing a Hadoop cluster for a company that processes lots and lots of information during the night so that the results can be used the next morning for its daily business. The last thing you want is for the work not to be completed during the night because of a networking problem. In certain network setups, a single failing switch can cost you a large portion of your computing power. Thinking about the network setup and making a good network blueprint for your system is a vital part of creating a successful solution.

Take, for example, the company mentioned above. This company works on chip research; during the day people create new designs and algorithms which need to be tested in the Hadoop cluster during the night. The next morning, when people come in, they expect to have the results of the jobs they placed in the Hadoop queue the previous day. The number of engineers in this company is enormous, and the cluster runs at around 98% utilization during 'non working hours' and 80% during working hours. As you can see, a failure of the entire system or a reduction of the computing power will have an enormous impact on the daily work of all the engineers.

A closer look at this theoretical cluster:
- The cluster consists of 960 computing nodes.
- A node is a 1U server with two quad-core processors and 6 GB of RAM.
- The nodes are placed in 19-inch racks; every rack houses 24 computing nodes.
- There are 40 racks with computing nodes.

As you can see, if we were to lose an entire rack due to a network failure, we would lose 2.5% of the computing power. Since the cluster is used at 98% during the night, we would no longer have enough computing power to do all the work before morning. Losing a single node is not a problem, but losing a whole stack of nodes results in a major problem the next day. For this we have to create a network blueprint that ensures we never lose an entire computing stack. When we talk about a computing stack, we mean a rack of 24 servers in this example.
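
To make the impact concrete, here is a quick back-of-the-envelope check (a minimal Python sketch, using only the numbers from the example above):

    # Headroom check: what is left when a full rack drops out of the cluster?
    racks = 40
    nodes_per_rack = 24
    total_nodes = racks * nodes_per_rack              # 960 nodes

    night_utilization = 0.98                          # fraction of capacity needed at night

    capacity_left = (total_nodes - nodes_per_rack) / float(total_nodes)
    print("Capacity left after losing one rack: %.1f%%" % (capacity_left * 100))
    print("Capacity needed during the night:    %.1f%%" % (night_utilization * 100))
    print("Enough headroom: %s" % (capacity_left >= night_utilization))
    # -> 97.5% left versus 98% needed: lose a whole rack and the night jobs do not finish.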

First we have to look at how we will connect the racks. If you look at the Cisco website you will find the Cisco Catalyst 3560 product page. The Cisco Catalyst 3560 has 4 fiber ports, and we will be using these switches to connect the computing stacks. However, to ensure network redundancy we will use two switches for every stack instead of one. As you can see in the diagram below, we crosslink the switches: SW-B0 and SW-B1 will both be handling computing stack B, switches SW-C0 and SW-C1 will be handling the network load for computing stack C, and so on. We connect SW-B0 with fiber to SW-C0 and SW-C1, and we also connect SW-B1 with fiber to SW-C0 and SW-C1. In case SW-B0 or SW-B1 fails, the network can still route traffic to the switches in the B computing stack and also to the A computing stack. By building the network this way it will not fail to route traffic to the other stacks; the only thing that happens is that the surviving switch has to handle more load.
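
The wiring scheme is easier to reason about when you write it down as a small graph. The sketch below is only an illustration (the switch names follow the SW-B0/SW-B1 naming used above, with a hypothetical stack A added in front); it lists the fiber crosslinks between neighbouring stacks and checks that traffic can still get past stack B when SW-B0 fails:

    from collections import deque

    # Fiber crosslinks between the switch pairs of three neighbouring stacks (A, B, C).
    # Each switch of a stack is linked to both switches of the next stack.
    links = [
        ("SW-A0", "SW-B0"), ("SW-A0", "SW-B1"),
        ("SW-A1", "SW-B0"), ("SW-A1", "SW-B1"),
        ("SW-B0", "SW-C0"), ("SW-B0", "SW-C1"),
        ("SW-B1", "SW-C0"), ("SW-B1", "SW-C1"),
    ]

    def reachable(start, goal, failed):
        """Breadth-first search over the switches that are still alive."""
        graph = {}
        for a, b in links:
            if a in failed or b in failed:
                continue
            graph.setdefault(a, set()).add(b)
            graph.setdefault(b, set()).add(a)
        seen, queue = {start}, deque([start])
        while queue:
            node = queue.popleft()
            if node == goal:
                return True
            for nxt in graph.get(node, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        return False

    # With SW-B0 down, stack A still reaches stack C through SW-B1.
    print(reachable("SW-A0", "SW-C0", failed={"SW-B0"}))   # True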


This setup will, however, not solve the problem that the nodes connected to the failing switch lose their network connection. To resolve this we attach every node to two switches. Every computing stack has 24 computing nodes, each switch has 48 ports, and we have two switches. So we place two network interfaces in every node: one will be in standby mode and one will be active. To spread the load, at all even numbered nodes the active NIC is connected to switch 0, and at all odd numbered nodes the active NIC is connected to switch 1. For the inactive (standby) NIC it is the other way around: all even numbered nodes are connected to switch 1 and all odd numbered nodes to switch 0. In a normal situation the load is balanced between the two switches; if one of the two switches fails, the standby NICs become active and all the network traffic to the nodes in that computing stack is handled by the surviving switch.
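
The even/odd assignment and the failover behaviour can be sketched in a few lines of Python (node numbers and switch names are made up for illustration):

    def nic_assignment(node_number):
        """Return (active_switch, standby_switch) for a node in the stack.

        Even numbered nodes are active on switch 0, odd numbered nodes on switch 1,
        so in normal operation the load is split evenly between the two switches.
        """
        if node_number % 2 == 0:
            return ("switch-0", "switch-1")
        return ("switch-1", "switch-0")

    def carrying_switch(node_number, failed_switch=None):
        """The switch that actually carries the node's traffic, given a failure."""
        active, standby = nic_assignment(node_number)
        return standby if active == failed_switch else active

    nodes = range(24)
    # Normal operation: 12 nodes on each switch.
    print(sum(1 for n in nodes if carrying_switch(n) == "switch-0"))                # 12
    # switch-0 fails: the standby NICs take over and switch-1 carries all 24 nodes.
    print(sum(1 for n in nodes if carrying_switch(n, "switch-0") == "switch-1"))    # 24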



To have the NICs switch over to the surviving network switch and to make sure that operations continue as normal, you have to make sure that the rest of the network keeps seeing the servers in the same way it did before one of the switches failed. To do this, you have to make sure that the new NIC has the same IP address and MAC address. For this you can make use of IPAT, IP Address Takeover.

"IP address takeover feature is available on many commercial clusters. This feature protects an installation against failures of the Network Interface Cards (NICs). In order to make this mechanism work, installations must have two NICs for each IP address assigned to a server. Both the NICs must be connected to the same physical network. One NIC is always active while the other is in a standby mode. The moment the system detects a problem with the main adapter, it immediately fails over to the standby NIC. Ongoing TCP/IP connections are not disturbed and as a result clients do not notice any downtime on the server. "

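On a Linux node such a takeover could, for example, be scripted around the standard iproute2 and arping tools. The sketch below is only an illustration of the idea, not part of any particular IPAT product; the interface names and address are placeholders, and it only moves the IP address and announces it with gratuitous ARP (the MAC handling described in the quote is left aside):

    import subprocess

    def take_over_address(ip_with_prefix, standby_nic):
        """Move the service IP to the standby NIC after the active NIC has failed.

        Assumes a Linux host with iproute2 and arping installed, and root privileges.
        """
        # Bring the standby interface up and attach the service address to it.
        subprocess.check_call(["ip", "link", "set", standby_nic, "up"])
        subprocess.check_call(["ip", "addr", "add", ip_with_prefix, "dev", standby_nic])
        # Send unsolicited (gratuitous) ARP so switches and peers learn the new port.
        ip_only = ip_with_prefix.split("/")[0]
        subprocess.check_call(["arping", "-U", "-c", "3", "-I", standby_nic, ip_only])

    # Example (placeholder values): the node lost eth0, move its address over to eth1.
    # take_over_address("10.1.2.15/24", "eth1")
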
Now we have tackled almost every possible breakdown. However, it can happen that not one but both switches in a stack break. If we look at the examples above, this would mean that the remaining stacks are separated by the broken stack. To prevent this, you have to make a connection between the first and the last stack, just like the connections between all the other stacks. By doing so you turn your network into a 'ring'. With a correct setup of all your switches and good routing and failover routes, your network can also handle the failure of a complete stack together with both switches in that stack.
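
Extending the earlier graph sketch, you can convince yourself that the ring helps by modelling all 40 stacks and removing both switches of one stack (the switch names here are again hypothetical):

    from collections import deque

    STACKS = 40

    def ring_links():
        """Fiber links for a ring of switch pairs: every switch connects to both
        switches of the next stack, and the last stack is wired back to the first."""
        links = []
        for s in range(STACKS):
            nxt = (s + 1) % STACKS                    # the ring closes the chain
            for a in (0, 1):
                for b in (0, 1):
                    links.append(("SW-%d-%d" % (s, a), "SW-%d-%d" % (nxt, b)))
        return links

    def all_connected(switches, links, failed):
        graph = dict((n, set()) for n in switches if n not in failed)
        for a, b in links:
            if a in graph and b in graph:
                graph[a].add(b)
                graph[b].add(a)
        start = next(iter(graph))
        seen, queue = {start}, deque([start])
        while queue:
            for nxt in graph[queue.popleft()]:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        return len(seen) == len(graph)

    switches = ["SW-%d-%d" % (s, a) for s in range(STACKS) for a in (0, 1)]
    # Both switches of stack 7 fail: with the ring closed, the other 39 stacks stay connected.
    print(all_connected(switches, ring_links(), failed={"SW-7-0", "SW-7-1"}))   # True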

Even though this is a theoretical blueprint, developing your network in such a way, in combination with writing your own code to control network flows, scripts to control IPAT and the switching back of IPAT, and thinking about reporting and alerting mechanisms, will give you a very solid network. If there are any questions about this networking blueprint for cluster computing, please send me an e-mail or post a comment. I will reply with an answer (good or bad) or explain things in more detail in a new post.

