Friday, August 04, 2017

Oracle Linux - Intuition Engineering and Site Reliability Engineering with Elastic and Vizceral

IT operations are vital to organisations; in daily business a massive system disruption can halt an entire enterprise. Running and operating massive-scale IT deployments that are too big to fail takes more than the traditional approach. Next to DevOps we see the rise of Site Reliability Engineering, originally pioneered by Google, complemented by Intuition Engineering, pioneered by Netflix. More and more companies whose IT is too big to fail are turning to these new concepts of operation, adopting and improving proven ways of working along the way.

Site Reliability Engineering
According to Ben Treynor, VP of Engineering at Google, Site Reliability Engineering is the following:
"Fundamentally, it's what happens when you ask a software engineer to design an operations function. When I came to Google, I was fortunate enough to be part of a team that was partially composed of folks who were software engineers, and who were inclined to use software as a way of solving problems that had historically been solved by hand. So when it was time to create a formal team to do this operational work, it was natural to take the "everything can be treated as a software problem" approach and run with it.

So SRE is fundamentally doing work that has historically been done by an operations team, but using engineers with software expertise, and banking on the fact that these engineers are inherently both predisposed to, and have the ability to, substitute automation for human labor.

On top of that, in Google, we have a bunch of rules of engagement, and principles for how SRE teams interact with their environment -- not only the production environment, but also the development teams, the testing teams, the users, and so on. Those rules and work practices help us to keep doing primarily engineering work and not operations work."

Intuition Engineering
An addition to Site Reliability Engineering can be Intuition Engineering. Intuition Engineering provides a Site Reliability Engineer with information in a way that appeals to the brain’s capacity to process massive amounts of visual data in parallel, giving users an experience -- a sense, an intuition -- of the state of a holistic system, rather than objective facts. An example of an Intuition Engineering tool is Vizceral, developed by Netflix and discussed by Casey Rosenthal, Engineering Manager at Netflix, Justin Reynolds and others in numerous talks. In the video below Justin Reynolds gives an introduction to Vizceral.


Implementing Vizceral
For small system footprints Vizceral might be interesting, however it is not that important for day-to-day operations; when operating a relatively small number of servers and services it is easy to locate an issue and make a decision. When you have a massive number of servers and services it becomes hard for a Site Reliability Engineer to take in the vast amount of data, spot possible issues and make split-second decisions. In deployments like that it can be very beneficial to implement Vizceral.

Even though Vizceral might look complicated at first glance, it is in reality a relatively simple yet extremely well-crafted solution which Netflix has donated to the open source community. Getting the right data into Vizceral to provide the needed view of the now is the more complex task.

The below image shows a common implementation where we are running a large number of Oracle Linux nodes. All nodes have a local Elastic Beat to collect logs and metrics and ship them to Elasticsearch, where Site Reliability Engineers can use Kibana to get insight into the data from all servers.
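As a minimal sketch of the per-node setup, assuming the Elastic yum repository is already configured on the node and Elasticsearch runs at elasticsearch.example.com (a hypothetical hostname), shipping logs with Filebeat comes down to something like:

# install Filebeat on an Oracle Linux 7 node (assumes the Elastic yum repository is configured)
sudo yum install -y filebeat
# in /etc/filebeat/filebeat.yml point the Elasticsearch output at the central cluster, for example:
#   output.elasticsearch:
#     hosts: ["elasticsearch.example.com:9200"]
sudo systemctl enable filebeat
sudo systemctl start filebeat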



Even though Elasticsearch and Kibana in combination with Logstash and Elastic Beats provide an enormous benefit to Site Reliability Engineers, they can still be overwhelmed by the massive amount of data available, and it can take time to find the root cause of an issue. As we are already collecting all data from all servers and services, we would like to also feed this to Vizceral. The below image shows a reference implementation where we pull data from Elasticsearch and provide it to Vizceral.



As you can see from the above image, we have introduced two new components: the "Vizceral Feeder API" and "Netflix Vizceral". Both components run as Docker containers.

The Vizceral Feeder API
To extract the data we collected in Elasticsearch and feed it to Vizceral we use the Vizceral Feeder API. The Vizceral Feeder API is an internal product which we hope to provide to the open source community at some point in the near future. In effect, the API is a bridge between Elasticsearch and Vizceral.

The Vizceral Feeder API queries Elasticsearch for all the required information. Based upon the returned dataset it creates a JSON file in the traffic format Vizceral expects.
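A minimal sketch of what such a feeder could do is shown below; the index pattern, field names, hostnames and URLs are illustrative, and the JSON shape roughly follows the traffic format used by the Vizceral sample data, so check the Vizceral documentation for the exact schema.

# 1) ask Elasticsearch how many events each host produced in the last minute
#    (index pattern and field name are illustrative)
curl -s -H 'Content-Type: application/json' \
  'http://elasticsearch.example.com:9200/filebeat-*/_search' -d '{
    "size": 0,
    "query": { "range": { "@timestamp": { "gte": "now-1m" } } },
    "aggs": { "per_host": { "terms": { "field": "beat.hostname" } } }
  }'

# 2) translate the aggregation buckets into Vizceral traffic JSON, roughly shaped like this
#    (simplified example)
cat > vizceral.json <<'EOF'
{
  "renderer": "global",
  "name": "edge",
  "nodes": [
    { "name": "INTERNET", "renderer": "region", "class": "normal" },
    { "name": "datacenter-1", "renderer": "region", "class": "normal", "maxVolume": 50000 }
  ],
  "connections": [
    { "source": "INTERNET", "target": "datacenter-1",
      "metrics": { "normal": 4800, "warning": 120, "danger": 5 },
      "class": "normal" }
  ]
}
EOF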

Depending on your appetite to modify Vizceral you can have Vizceral pull the JSON file from the Feeder API every x seconds or you can have a secondary process pull the file from the Feeder and place it locally in the Docker container hosting Vizceral.

If you are not into developing your own additions to Vizceral and would like to be up and running relatively fast, you should go for the local file replacement strategy.
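A minimal sketch of such a replacement loop, with an illustrative feeder URL, container name and target path (the exact path depends on how you packaged Vizceral), could look like this:

#!/bin/bash
# refresh the traffic file inside the Vizceral container every 10 seconds
# (feeder URL, container name and target path are illustrative)
while true; do
  curl -s -o /tmp/vizceral.json http://feeder.example.com:8080/vizceral.json
  docker cp /tmp/vizceral.json vizceral:/usr/src/app/dist/sample_data.json
  sleep 10
done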

If you go for the solution in which Vizceral pulls the JSON directly from the feeder, you will have to take the following into account (a quick way to verify both headers is shown after the list):

  • The Vizceral Feeder API needs to be accessible from the workstations used by the Site Reliability Engineers
  • The JSON file needs to be served with the Content-Type: application/json header to ensure the data is seen as true JSON
  • The JSON file needs to be served with the Access-Control-Allow-Origin: * header to ensure it is CORS compatible
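A quick way to verify both headers is a HEAD request against the feeder endpoint (the URL is illustrative):

curl -sI http://feeder.example.com:8080/vizceral.json
# expected in the output, among other headers:
#   Content-Type: application/json
#   Access-Control-Allow-Origin: *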

Thursday, August 03, 2017

Oracle Linux - enable Docker daemon socket option

Installing Docker on an Oracle Linux instance is relatively easy and you can get things to work extremely fast. Within a very short timeframe you will have your Docker engine running and your first containers up and running. However, at some point you will want to interact with Docker in a more integrated manner and not only use the docker command from the CLI; you will want to communicate with Docker over its API.

In our case the need was to have Jenkins build a Maven project which would build a Docker container with the help of the Docker Maven Plugin built by the people at Spotify. On the first run we hit an issue where the build failed with the below message:

[INFO] I/O exception (java.io.IOException) caught when processing request to {}->unix://localhost:80: Permission denied
[INFO] Retrying request to {}->unix://localhost:80

The issue is solved by taking two steps: (1) ensuring your Docker daemon is listening on a TCP socket and (2) ensuring you set an environment variable.

Setting the Docker daemon socket option:
To ensure the Docker daemon will listen on port 2375 you have to make some changes to /etc/sysconfig/docker. The location of this configuration file differs per Linux distribution, however on Oracle Linux this is the file you will need.

You will have to ensure that other_args states the sockets you want the daemon to listen on. In the below example we explicitly configure it to listen on the localhost IP, the external IP of the Docker host and the default Unix socket.

other_args="-H tcp://127.0.0.1:2375 -H tcp://192.168.56.4:2375 -H unix:///var/run/docker.sock"
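After changing /etc/sysconfig/docker the daemon needs to be restarted to pick up the new sockets. A quick check, assuming Oracle Linux 7 with systemd (use service docker restart on older releases), is to call the Docker API version endpoint over TCP:

sudo systemctl restart docker
# verify the API answers on the TCP socket
curl http://127.0.0.1:2375/version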

Setting the DOCKER_HOST environment variable:
To make sure that Jenkins knows where to find the Docker API you will have to set the DOCKER_HOST environment variable. You can do so from the command line with the below command:

export DOCKER_HOST="tcp://192.168.56.4:2375"
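With the variable set, the docker CLI (and anything using the standard Docker client libraries, such as the Maven plugin) talks to the TCP socket instead of the local Unix socket; a quick check is:

docker version
docker ps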

Even though the above export works, if you only need this for Jenkins you can also set a global environment variable within Jenkins, which might be the better option. You can set global environment variables in Jenkins under "Manage Jenkins" - "Configure System" - "Global Properties".

Now, when you run a build, it should connect to Docker on port 2375 (not 80) and finish without any issue.

Oracle Linux - IPv4 forwarding is disabled. Networking will not work

Using Docker for the first time can be confusing, especially the networking part. When you run Docker on a vanilla Oracle Linux instance you might hit a networking issue the first time you start a container and try to use network forwarding. By default IPv4 forwarding is disabled, and it should be enabled to make proper use of Docker.

The below error might be what you are facing when starting your first docker container on Oracle Linux:

WARNING: IPv4 forwarding is disabled. Networking will not work.

To resolve this issue you will have to make changes to the configuration of your Docker host OS. In our case we run an Oracle Linux operating system with the Docker engine on top of it. To ensure forwarding is active you will have to change a setting in /etc/sysctl.conf. By default you will find the following:

# Controls IP packet forwarding
net.ipv4.ip_forward = 0

You will have to change this to 1 as shown below:

# Controls IP packet forwarding
net.ipv4.ip_forward = 1 
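A change in /etc/sysctl.conf only takes effect once the kernel parameters are reloaded. To apply and verify the setting without a reboot:

sudo sysctl -p
# confirm the new value
sysctl net.ipv4.ip_forward
# expected output: net.ipv4.ip_forward = 1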

As soon as the new setting is active, and only after you have made sure it is active, your Docker containers should start without any issue.