Friday, August 04, 2017

Oracle Linux - Intuition Engineering and Site Reliability Engineering with Elastic and Vizceral

IT operations are vital to organisations, in daily business operations a massive system disruption will halt an entire enterprise. Running and operating massive scale IT deployments who are to big to fail takes more than how it is done traditionally. Next to DevOps we see the rise of Site Reliability Engineering, originally pioneered by Google, and complemented with Intuition Engineering, pioneered by Netflix. You see more and more companies who have IT which is to big to fail turn to new concepts of operation.  By developing new ways of operation proven ways are adopted and improved.

Site Reliability Engineering
According to Ben Treynor, VP engineering at Google Site Reliability Engineering is the following;
"Fundamentally, it's what happens when you ask a software engineer to design an operations function. When I came to Google, I was fortunate enough to be part of a team that was partially composed of folks who were software engineers, and who were inclined to use software as a way of solving problems that had historically been solved by hand. So when it was time to create a formal team to do this operational work, it was natural to take the "everything can be treated as a software problem" approach and run with it.

So SRE is fundamentally doing work that has historically been done by an operations team, but using engineers with software expertise, and banking on the fact that these engineers are inherently both predisposed to, and have the ability to, substitute automation for human labor.

On top of that, in Google, we have a bunch of rules of engagement, and principles for how SRE teams interact with their environment -- not only the production environment, but also the development teams, the testing teams, the users, and so on. Those rules and work practices help us to keep doing primarily engineering work and not operations work."

Intuition Engineering
An addition to Site Reliability Engineering can be Intuition Engineering. Intuition Engineering is providing a Site Reliability Engineer with with information in way that it appeals to the brain’s capacity to process massive amounts of visual data in parallel to give users an experience -- a sense, an intuition -- of the state of a holistic system, rather than objective facts. An example of a Intuition Engineering tool is Vizceral developed by Netflix and discussed by Casey Rosenthal, Engineering Manager at Netflix, Justin Reynolds and others in numerous talks. In the below video you can see Justin Reynolds give an introduction into Vizceral.

Implementing Vizceral
For small system footprints using Vizceral might be interesting however not that important for day to day operations. When operating a relative small number of servers and services it is relatively easy to locate an issue and make a decision. In cases where you have a massive number of servers and services it will be hard for a site reliability engineer to take in the vast amount of data and spot possible issues and take split second decisions. In deployments like this it can be very beneficial to implement Vizceral.

Even though Vizceral might look complicated at first glance it is in reality a relative simple however extremely well crafted solution which has been donated to the open source community by Netflix. The process of getting the right data into Vizceral to provide the needed view of the now is the more complex task.

The below image shows a common implementation where we are running a large number of Oracle Linux nodes. All nodes have a local Elastic Beat to collect logs and data and ship this to Elasticsearch where Site Reliability Engineers can use Kibana to get insight in all data from all servers.

Even though Elasticsearch and Kibana in combination with Logstash and Elastic Beats provide a enormous benefit to Site Reliability Engineers they can even still be overwhelmed by the massive amount of data available and it can take time to find the root cause of an issue. As we are already collecting all data from all servers and services we would like to also feed this to Vizceral. The below image shows a reference implementation where we pull data from Elasticsearch and provide to Vizceral.

As you can see from the above image we have introduced two new components, the "Vizceral Feeder API" and "Netflix Vizceral". Both components are running a Docker Containers.

The Vizceral Feeder API
To extract the data we collected inside Elasticsearch and feed this to Vizceral we use the Vizceral Feeder API. The Vizceral Feeder API is an internal product which we hope to provide to the Open Source community at one point in the near future. In effect the API is a bridge between Elasticsearch and Vizceral.

The Vizceral Feeder API will query Elasticsearch for all the required information. Based upon the dataset returned a Vizceral JSON file is created compatible with Vizceral.

Depending on your appetite to modify Vizceral you can have Vizceral pull the JSON file from the Feeder API every x seconds or you can have a secondary process pull the file from the Feeder and place it locally in the Docker container hosting Vizceral.

If you are not into developing your own addition to Vizceral and would like to be up and running relatively fast you should go for the local file replacement strategy.

If you go for the solution in which Vizceral will pull the JSON from the feeder you will have to make sure that you take the following into account;

  • The Vizceral Feeder API needs to be accessible by the workstations used by the Site Reliability Engineers 
  • The JSON file needs to be presented with the Content-type: application/json header to ensure the data is seen as true JSON
  • The JSON file needs to be presented with the Access-Control-Allow-Origin: * header to ensure it is CORS compatible

No comments: