Friday, August 04, 2017

Oracle Linux - Intuition Engineering and Site Reliability Engineering with Elastic and Vizceral

IT operations are vital to organisations; in daily business a massive system disruption can halt an entire enterprise. Running and operating massive-scale IT deployments that are too big to fail takes more than the traditional approach. Next to DevOps we see the rise of Site Reliability Engineering, originally pioneered by Google, complemented by Intuition Engineering, pioneered by Netflix. More and more companies whose IT is too big to fail are turning to these new concepts of operation, adopting and improving proven ways of working along the way.

Site Reliability Engineering
According to Ben Treynor, VP of Engineering at Google, Site Reliability Engineering is the following:
"Fundamentally, it's what happens when you ask a software engineer to design an operations function. When I came to Google, I was fortunate enough to be part of a team that was partially composed of folks who were software engineers, and who were inclined to use software as a way of solving problems that had historically been solved by hand. So when it was time to create a formal team to do this operational work, it was natural to take the "everything can be treated as a software problem" approach and run with it.

So SRE is fundamentally doing work that has historically been done by an operations team, but using engineers with software expertise, and banking on the fact that these engineers are inherently both predisposed to, and have the ability to, substitute automation for human labor.

On top of that, in Google, we have a bunch of rules of engagement, and principles for how SRE teams interact with their environment -- not only the production environment, but also the development teams, the testing teams, the users, and so on. Those rules and work practices help us to keep doing primarily engineering work and not operations work."

Intuition Engineering
An addition to Site Reliability Engineering can be Intuition Engineering. Intuition Engineering provides a Site Reliability Engineer with information in a way that appeals to the brain’s capacity to process massive amounts of visual data in parallel, giving users an experience -- a sense, an intuition -- of the state of a holistic system, rather than objective facts. An example of an Intuition Engineering tool is Vizceral, developed by Netflix and discussed by Casey Rosenthal, Engineering Manager at Netflix, Justin Reynolds and others in numerous talks. In the video below Justin Reynolds gives an introduction to Vizceral.


Implementing Vizceral
For small system footprints Vizceral might be interesting, however it is not that important for day-to-day operations; when operating a relatively small number of servers and services it is easy to locate an issue and make a decision. When you have a massive number of servers and services it becomes hard for a Site Reliability Engineer to take in the vast amount of data, spot possible issues and make split-second decisions. In deployments like that it can be very beneficial to implement Vizceral.

Even though Vizceral might look complicated at first glance, it is in reality a relatively simple yet extremely well-crafted solution which Netflix has donated to the open source community. Getting the right data into Vizceral to provide the needed view of the now is the more complex task.

The below image shows a common implementation where we are running a large number of Oracle Linux nodes. All nodes have a local Elastic Beat to collect logs and metrics and ship them to Elasticsearch, where Site Reliability Engineers can use Kibana to get insight into the data from all servers.
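As a minimal sketch of the per-node setup, assuming the Elastic yum repository is already configured on the node and Elasticsearch runs at elasticsearch.example.com (a hypothetical hostname), shipping logs with Filebeat comes down to something like:

# install Filebeat on an Oracle Linux 7 node (assumes the Elastic yum repository is configured)
sudo yum install -y filebeat
# in /etc/filebeat/filebeat.yml point the Elasticsearch output at the central cluster, for example:
#   output.elasticsearch:
#     hosts: ["elasticsearch.example.com:9200"]
sudo systemctl enable filebeat
sudo systemctl start filebeat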



Even though Elasticsearch and Kibana in combination with Logstash and Elastic Beats provide an enormous benefit to Site Reliability Engineers, they can still be overwhelmed by the massive amount of data available, and it can take time to find the root cause of an issue. As we are already collecting all data from all servers and services, we would like to also feed this to Vizceral. The below image shows a reference implementation where we pull data from Elasticsearch and provide it to Vizceral.



As you can see from the above image, we have introduced two new components: the "Vizceral Feeder API" and "Netflix Vizceral". Both components run as Docker containers.

The Vizceral Feeder API
To extract the data we collected in Elasticsearch and feed it to Vizceral we use the Vizceral Feeder API. The Vizceral Feeder API is an internal product which we hope to provide to the open source community at some point in the near future. In effect, the API is a bridge between Elasticsearch and Vizceral.

The Vizceral Feeder API queries Elasticsearch for all the required information. Based upon the returned dataset it creates a JSON file in the traffic format Vizceral expects.
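A minimal sketch of what such a feeder could do is shown below; the index pattern, field names, hostnames and URLs are illustrative, and the JSON shape roughly follows the traffic format used by the Vizceral sample data, so check the Vizceral documentation for the exact schema.

# 1) ask Elasticsearch how many events each host produced in the last minute
#    (index pattern and field name are illustrative)
curl -s -H 'Content-Type: application/json' \
  'http://elasticsearch.example.com:9200/filebeat-*/_search' -d '{
    "size": 0,
    "query": { "range": { "@timestamp": { "gte": "now-1m" } } },
    "aggs": { "per_host": { "terms": { "field": "beat.hostname" } } }
  }'

# 2) translate the aggregation buckets into Vizceral traffic JSON, roughly shaped like this
#    (simplified example)
cat > vizceral.json <<'EOF'
{
  "renderer": "global",
  "name": "edge",
  "nodes": [
    { "name": "INTERNET", "renderer": "region", "class": "normal" },
    { "name": "datacenter-1", "renderer": "region", "class": "normal", "maxVolume": 50000 }
  ],
  "connections": [
    { "source": "INTERNET", "target": "datacenter-1",
      "metrics": { "normal": 4800, "warning": 120, "danger": 5 },
      "class": "normal" }
  ]
}
EOF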

Depending on your appetite to modify Vizceral you can have Vizceral pull the JSON file from the Feeder API every x seconds or you can have a secondary process pull the file from the Feeder and place it locally in the Docker container hosting Vizceral.

If you are not into developing your own additions to Vizceral and would like to be up and running relatively fast, you should go for the local file replacement strategy.
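A minimal sketch of such a replacement loop, with an illustrative feeder URL, container name and target path (the exact path depends on how you packaged Vizceral), could look like this:

#!/bin/bash
# refresh the traffic file inside the Vizceral container every 10 seconds
# (feeder URL, container name and target path are illustrative)
while true; do
  curl -s -o /tmp/vizceral.json http://feeder.example.com:8080/vizceral.json
  docker cp /tmp/vizceral.json vizceral:/usr/src/app/dist/sample_data.json
  sleep 10
done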

If you go for the solution in which Vizceral pulls the JSON directly from the feeder, you will have to take the following into account (a quick way to verify both headers is shown after the list):

  • The Vizceral Feeder API needs to be accessible from the workstations used by the Site Reliability Engineers
  • The JSON file needs to be served with the Content-Type: application/json header to ensure the data is seen as true JSON
  • The JSON file needs to be served with the Access-Control-Allow-Origin: * header to ensure it is CORS compatible
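A quick way to verify both headers is a HEAD request against the feeder endpoint (the URL is illustrative):

curl -sI http://feeder.example.com:8080/vizceral.json
# expected in the output, among other headers:
#   Content-Type: application/json
#   Access-Control-Allow-Origin: *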

Thursday, August 03, 2017

Oracle Linux - enable Docker daemon socket option

Installing Docker on an Oracle Linux instance is relatively easy and you can get things to work extremely fast. Within a very short timeframe you will have your Docker engine running and your first containers up and running. However, at some point you will want to interact with Docker in a more integrated manner and not only use the docker command from the CLI; you will want to communicate with Docker over its API.

In our case the need was to have Jenkins build a Maven project which would build a Docker container with the help of the Docker Maven Plugin built by the people at Spotify. On the first run we hit an issue where the build failed with the below message:

[INFO] I/O exception (java.io.IOException) caught when processing request to {}->unix://localhost:80: Permission denied
[INFO] Retrying request to {}->unix://localhost:80

The issue is solved by taking two steps: (1) ensuring your Docker daemon is listening on a TCP socket and (2) ensuring you set an environment variable.

Setting the Docker daemon socket option:
To ensure the Docker daemon will listen on port 2375 you have to make some changes to /etc/sysconfig/docker. The location of this configuration file differs per Linux distribution, however on Oracle Linux this is the file you will need.

You will have to ensure that other_args states the sockets you want the daemon to listen on. In the below example we explicitly configure it to listen on the localhost IP, the external IP of the Docker host and the default Unix socket.

other_args="-H tcp://127.0.0.1:2375 -H tcp://192.168.56.4:2375 -H unix:///var/run/docker.sock"
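After changing /etc/sysconfig/docker the daemon needs to be restarted to pick up the new sockets. A quick check, assuming Oracle Linux 7 with systemd (use service docker restart on older releases), is to call the Docker API version endpoint over TCP:

sudo systemctl restart docker
# verify the API answers on the TCP socket
curl http://127.0.0.1:2375/version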

Setting the DOCKER_HOST environment variable:
To make sure that Jenkins knows where to find the Docker API you will have to set the DOCKER_HOST environment variable. You can do so from the command line with the below command:

export DOCKER_HOST="tcp://192.168.56.4:2375"
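With the variable set, the docker CLI (and anything using the standard Docker client libraries, such as the Maven plugin) talks to the TCP socket instead of the local Unix socket; a quick check is:

docker version
docker ps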

Even though the above export works, if you only need this for Jenkins you can also set a global environment variable within Jenkins, which might be the better option. You can set global environment variables in Jenkins under "Manage Jenkins" - "Configure System" - "Global Properties".

Now, when you run a build, it should connect to Docker on port 2375 (not 80) and finish without any issue.

Oracle Linux - IPv4 forwarding is disabled. Networking will not work

Using Docker for the first time can be confusing, especially the networking part. When you run Docker on a vanilla Oracle Linux instance you might hit a networking issue the first time you start a container and try to use network forwarding. By default IPv4 forwarding is disabled, and it should be enabled to make proper use of Docker.

The below error might be what you are facing when starting your first docker container on Oracle Linux:

WARNING: IPv4 forwarding is disabled. Networking will not work.

To resolve this issue you will have to make changes to the configuration of your Docker host OS. In our case we run an Oracle Linux operating system with the Docker engine on top of it. To ensure forwarding is active you will have to change a setting in /etc/sysctl.conf. By default you will find the following:

# Controls IP packet forwarding
net.ipv4.ip_forward = 0

You will have to change this to 1 as shown below:

# Controls IP packet forwarding
net.ipv4.ip_forward = 1 
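A change in /etc/sysctl.conf only takes effect once the kernel parameters are reloaded. To apply and verify the setting without a reboot:

sudo sysctl -p
# confirm the new value
sysctl net.ipv4.ip_forward
# expected output: net.ipv4.ip_forward = 1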

As soon as the new setting is active, and only after you have made sure it is active, your Docker containers should start without any issue.