Wednesday, September 25, 2019

Creating a training set table for machine learning in Oracle Database

When building a machine learning model, you will require a learning / training set of data. To enable you to quickly create a set of training data you can make use of the SQL SAMPLE clause in a select statement. Using the SAMPLE clause you instruct the database to select a random sample of rows from the table, rather than reading the entire table. This provides a very simple way of getting the random collection of records you require for training your model.

Situation
You have a large (or small) table of data in your database, in our case an Oracle Autonomous Data Warehouse, and you intend to use part of it as training data while using the remaining part for testing your model.


Assume we have a table named louwersj.loans which we want to use for both our training data as well as our test data. A simple way of splitting it in a 70/30 fashion is to use the below commands:

Step 1:
Check the total number of records in the table:

SELECT
    COUNT(1)
FROM
    louwersj.loans;

In our case this returns 614, as we have 614 records in our dataset.

Step 2:
Take 70% of the total and use this to create a table; this is where we use the SAMPLE clause in the SQL statement to get a random 70% of the records. By issuing the below command the table loans_traindata will have exactly the same structure as the original loans table, while containing only a random subset of the original records.

CREATE TABLE louwersj.loans_traindata
    AS
        SELECT
            *
        FROM
            louwersj.loans SAMPLE ( 70 ) SEED ( 1 )


To validate that this gives us what we wanted we can do another count to check that the training set indeed contains roughly 70% of the original table. Note that SAMPLE is probabilistic, so the count will be close to, but not exactly, 70% of the total; in our case the below command returns 455.

SELECT
    COUNT(1)
FROM
    louwersj.loans_traindata

Step 3:
In addition to the training data we need some test data to validate our machine learning model after we have trained it. For this we can use the remaining 30% of the data from the original table. With the following command we create a new table which contains exactly that:

CREATE TABLE louwersj.loans_testdata
    AS
        SELECT
            *
        FROM
            louwersj.loans
        MINUS
        SELECT
            *
        FROM
            louwersj.loans_traindata

Conclusion
Using the SAMPLE clause as part of a CREATE TABLE AS statement in the Oracle Database helps you speed up creating a good training set and test set for your machine learning model. There is no need to extract data from the database and re-insert it; you can do everything within the database without actually moving the data.

Tuesday, September 24, 2019

Groovy - AST Transformation

Groovy is a powerful language that gives its users the opportunity to plug into the compilation process to create what we call AST transformations, i.e. the ability to customize the Abstract Syntax Tree representing your programs before the compiler walks this tree to generate Java bytecode.

When writing a lot of Groovy code, especially as part of a wider team, it is very beneficial to take some time to look into the inner workings of AST transformations. Because you can build AST transformations yourself to extend how Groovy works, they can help ensure that code written by different developers is much more consistent than it would be without them.

When you are new to Groovy AST transformations, the below talk can be a good starting point.

Monday, September 23, 2019

Google Cloud Function call Oracle ADW Rest end-point

When running an Oracle Autonomous Database, for example an Oracle Autonomous Data Warehouse (ADW for short), it is very likely that multiple applications and solutions want access to the data available in the ADW. A common scenario is that a department in the enterprise has been developing an application in isolation and at one point in time requires some additional data from the data warehouse, in this case the Oracle Autonomous Data Warehouse.

Call Oracle ADW from Google Cloud Functions
When developing an application in the Google Cloud you can make use of Google Cloud Functions. As Google Cloud Functions support development in Python you can write a generic function to retrieve, for example, customer details based upon a customer ID. We deployed the Oracle ADW RESTful data service in a previous blogpost; in this blogpost we want to call it with a GET request from the Google Cloud.

One generic function
When building an application using Google Cloud Functions which needs to interact with data in the Oracle ADW at several points, you do not want to code this interaction multiple times. A more logical way of doing things is building one function to interact with Oracle ADW to obtain the needed data.


Every time your application calls the Google Cloud Function with the proper JSON payload, containing a valid customer ID, the Google Cloud Function will call the Oracle ORDS endpoint which we developed as part of Oracle ADW. The return message from Oracle ADW will be the return message from the Google Cloud Function.

By building this "interaction layer" developers will only have to build the interaction with the Oracle Cloud based Oracle ADW once and after that they can work within Google Cloud to complete their specific Google Cloud based application.

Deploy a Google Cloud Function for Oracle Database
Deploying a Google Cloud function for Oracle Database starts with the same steps as deploying any cloud function. In our case we build a Python based application. The below image showcases the initial creation of the function:



We indicate that we want to use Python 3.7 and that the function inside our code, which is the entrypoint for execution, is named getCustomer.

The code used is shown below. Do note: when developing a production solution you most likely want to add additional security and a lot more error handling than shown in this example; this is just a very (very, very) not-production-ready example. Additionally, the full URL of the Oracle ADW has been substituted with XXXX.

import urllib.request

def getCustomerResponse(requestedCustomerId):
    """
    Call the ORDS endpoint for the given customer ID and return the response body.

    :param requestedCustomerId: the customer ID to look up
    :return: the data returned by the ORDS endpoint
    """
    baseUrl = "https://XXXX.oraclecloudapps.com/ords/louwersj/parties/b2b/customers/"
    fullUrl = baseUrl + requestedCustomerId
    operUrl = urllib.request.urlopen(fullUrl)

    if operUrl.getcode() == 200:
        data = operUrl.read()
    else:
        data = "Error receiving data from ADW, status code: " + str(operUrl.getcode())
    return data

def getCustomer(request):
    """
    :param request:    :return:    """
    requestJson = request.get_json(silent=True)
    requestArgs = request.args

    if requestJson and 'customer_id' in requestJson:
        customerId = requestJson['customer_id']
    elif requestArgs and 'customer_id' in requestArgs:
        customerId = requestArgs['customer_id']
    else:
        customerId = 'ERROR'
    if customerId == "ERROR":
        responseData = "No customer_id provided"
    else:
        responseData = getCustomerResponse(customerId)
    return responseData

Testing the function
Upon deployment you can test the Google Cloud Function using the test functionality in the Google UI, or by calling it directly from another location. If all is working you should receive a JSON style return message as shown in the below screenshot.



In the above screenshot the trigger event field contains our test JSON payload and the function output contains a JSON response which originates from the Oracle ADW.
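For calling the function directly from another location, a minimal sketch using only the Python standard library could look like the below. Note that the function URL is a hypothetical placeholder and the customer_id is just an example value.

import json
import urllib.request

# hypothetical URL of the deployed Google Cloud Function
functionUrl = "https://REGION-PROJECT.cloudfunctions.net/getCustomer"

# the same style of JSON payload as used in the test event
payload = json.dumps({"customer_id": "some-customer-id"}).encode("utf-8")

request = urllib.request.Request(
    functionUrl,
    data=payload,
    headers={"Content-Type": "application/json"},
    method="POST"
)

# the response body is the message the function received from the Oracle ADW ORDS endpoint
with urllib.request.urlopen(request) as response:
    print(response.read().decode("utf-8"))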

Conclusion
When developing applications on multiple platforms, multiple clouds and multiple technologies, and you require access to one central source of truth, you can use multiple technologies to connect to a centrally located Oracle Autonomous Data Warehouse. However, using a REST interface is in most cases a very simple and "fit for the job" kind of solution.

A solution like this will require stricter error handling and strict authentication and authorization; however, the base principle stands that hybrid multi-cloud applications can integrate with an Oracle Autonomous Data Warehouse in a very easy and cloud native manner.

Create REST endpoint in Oracle Autonomous Database

Oracle provides, as part of the Oracle Cloud portfolio, an Autonomous Database solution. The Autonomous Database is provided in an OLTP as well as a Data Warehouse deployment model. Without going into the technical details or the technical and operational benefits, in this article we will focus on how to build REST interfaces in conjunction with the Oracle Autonomous Database. In this example we will use an Oracle Autonomous Data Warehouse.

The example environment
For this example, we will have an Oracle Autonomous Data Warehouse, or ADW for short. As part of our example we will have a table called customers which holds a generic structure of all our global customers and the parent / child relationships between customers in our table.

In the below screenshot you can see the table definition using the Oracle APEX object browser which is provisioned as part of the ADW deployment.



The example goal
The goal we will try to achieve in this example is providing a REST endpoint for applications to connect to and get some basic information about a customer, as well as providing a REST endpoint which enables an application to retrieve all subsidiaries of a given customer. All interactions are done based upon the customer ID, which in our case is a UUID.

Creating the first REST endpoint
This example shows the entire creation of the REST endpoint using the Oracle ADW APEX interface; however, this can also be achieved using any compatible SQL client and does not rely on the UI.

Creating a REST endpoint in Oracle ADW follows a certain hierarchy of components. RESTful Data Services requires a module, which can hold one or more templates (endpoints), and each template holds one or more handlers. Handlers are responsible for handling the request for a certain request type, for example a POST or a GET request.

In our example we first create a module, which in our case we name ADW.backend.parties, with a base path of /parties/.



When the module has been defined we can create the ORDS template; in this example we create a template for b2b/customers/:id, where the intention is that :id will be substituted with a customer ID. As we have a module with the URI /parties, the full path becomes, as an example, /parties/b2b/customers/{some-customer-id}.



Having the template without any handlers to handle an incoming request will not provide any added functionality. As we want users to be able to get information based upon a customer ID, we will create a GET request handler which will be triggered on any GET request executed against the endpoint. The handler is also the location where the actual PL/SQL code is defined that is executed when a GET request is sent. The below screenshot shows this.



Trigger the first REST endpoint
Having the first REST endpoint fully deployed we can test it by executing a GET request against it from an external location. As this is a GET request we can do this from a browser, however you could use anything from cURL to custom-written Python code to call the endpoint with a GET request.
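As an illustration, a minimal Python sketch of such a GET request could look like the below; the ADW hostname is again substituted with XXXX and the customer ID is a made-up value.

import urllib.request

# hypothetical endpoint; XXXX substitutes the actual ADW hostname
endpoint = "https://XXXX.oraclecloudapps.com/ords/louwersj/parties/b2b/customers/some-customer-id"

# execute the GET request and print the raw JSON response
with urllib.request.urlopen(endpoint) as response:
    print(response.read().decode("utf-8"))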

When providing the endpoint in a browser we get the below response:


For readability purposes we can format the message so it becomes easier to read for humans.


Building the subsidiary endpoint
As stated, as part of this example we would also build a way to look up all subsidiaries of a given company. The previous endpoint provided the details of one company, including the ID of its parent company. However, in some cases someone would like to retrieve a list of subsidiaries.

We already have an endpoint /parties/b2b/customers/{some-customer-id} and we can expand that with /subsidiaries, which makes the endpoint /parties/b2b/customers/{some-customer-id}/subsidiaries.

To achieve this we build a second ORDS endpoint specifically for this and we create a GET request handler for this newly created ORDS endpoint as well. The below screenshot shows the creation of the ORDS template to provide the required endpoint:


When the ORDS template is created we can create the GET handler. The GET handler is shown in the screenshot below and reacts to the :id which is part of the URI.


We have now created our second endpoint which will provide a JSON response containing all the subsidiaries for a given customer ID. When we call the endpoint and format the response we will see a message as shown below:


Conclusion
When you are using an Oracle Autonomous Database you automatically get a very simple way of building RESTful data services in the form of REST endpoints. Even though the above example only scratches the surface of the possibilities, and much more complex and much more secure implementations can be built, it showcases the ease of use and how quickly you can build a comprehensive REST interface while only leveraging the Oracle Cloud based solution in the form of an Oracle Autonomous Database.

Tuesday, May 07, 2019

Fn Project - quick install guide

The Fn project is an open source project originally started within Oracle as part of the drive for more enterprise open source solutions. The Fn project is an open source serverless compute platform. With Fn, you deploy your functions to an Fn server which automatically executes and manages them. Each function is executed in a Docker container enabling the platform to provide broad support for development languages including Java, JavaScript (Node), Go, Python, Ruby, and others. The Fn client and server are simple and elegant. You can run the server locally on your laptop, or on a server in your data center or in the cloud. The Fn project has a strong enterprise focus with emphasis on security, scalability, and observability.

With companies moving more and more to cloud native solutions, and requiring solutions that will not tie them to a single cloud platform, you can observe a move away from vendor-specific and proprietary serverless solutions.

The Fn project is a fully open source solution to build cloud native serverless solutions and you will be able to run it at any location. Oracle provides a managed service which allows you to consume Fn based serverless functions, while at the same time you can run it within your own datacenter and on any cloud provider of your choice.

The fastest way to get started with the Fn project is to install it on a virtual machine and experience it hands-on. The below presentation gives you a quick guide on how to get started with the Fn project on a local (virtual) machine.



More information can be found on the Fn project website and in other locations such as the Fn Slack channel and GitHub.

Monday, March 18, 2019

Oracle Linux & Cloud Automation - clean yum data with ansible

Building infrastructure, configuring servers and deploying software has been a manual task for many years. With the rise of cloud computing, virtual machines, containers and CI/CD processes we see more and more that those manual tasks are being diminished and fully automated. When building your infrastructure in the Oracle Cloud you can make use of a wide set of automation tools to make everything software defined and automated. Solutions like Ansible and Terraform provide some of the building blocks which can help you automate all the previously manual tasks. Leveraging solutions like these will increase the speed and agility supported by cloud computing.

In this series "Oracle Linux & Cloud Automation" we will go into more detail on several solutions you can leverage to automate large parts of your IT footprint lifecycle. Please use the tagged label to find all posts on this subject.

Ansible
Ansible is an open-source software provisioning, configuration management, and application deployment tool. It runs on many Unix-like systems, and can configure both Unix-like systems as well as Microsoft Windows. It includes its own declarative language to describe system configuration.

Cleaning yum
Ansible makes it easy to install (remove or update) packages using yum. As a good practice you should clean the cached yum data so it is not taking up any unneeded space on your local file system. Ansible provides a module as a wrapper around the yum command, which means that you can use Ansible instead of interacting directly with the yum command itself.

Missing from the module is a command to clean the cache. You can do so by putting the following in the Ansible playbook.

# clean all the yum data.
  - name: Clean all yum data
    command: yum clean all
    args:
      warn: yes 

As you can see this is not using the yum module, it is using the command module instead. This is also the reason that we have the warn flag, currently set to yes, although it is advisable to set it to no. The warn flag will print the below message:

 [WARNING]: Consider using the yum module rather than running yum.  If you need to use 
command because yum is insufficient you can add warn=False to this command task or set 
command_warnings=False in ansible.cfg to get rid of this message.

The reason for the warning is that Ansible expects that all yum related commands are being done via the yum module and not the command module. As the yum module is missing the "clean all" option we have to do this via the command module and use the full command.

Oracle Linux tested
The below playbook is tested on an Oracle Linux 7 instance with Ansible 2.7.7.

Show the playbook:
[root@localhost ansi]# cat webserver_playbook.yml 
- hosts: localhost
  tasks:
  - name: Ensure the latest yum-utils python package is available on the server
    yum:
      name: yum-utils
      state: latest
  - name: Ensure the latest python package is available on the server
    yum:
      name: python
      state: latest
  - name: Ensure the latest python-pip package is available on the server
    yum:
      name: python-pip
      state: latest
# clean all the yum data.
  - name: Clean all yum data
    command: yum clean all
    args:
      warn: yes 

Show the output:
[root@localhost ansi]# ansible-playbook webserver_playbook.yml 
 [WARNING]: provided hosts list is empty, only localhost is available. Note
that the implicit localhost does not match 'all'


PLAY [localhost] ***************************************************************

TASK [Gathering Facts] *********************************************************
ok: [localhost]

TASK [Ensure the latest yum-utils python package is available on the server] ***
ok: [localhost]

TASK [Ensure the latest python package is available on the server] *************
ok: [localhost]

TASK [Ensure the latest python-pip package is available on the server] *********
ok: [localhost]

TASK [Clean all yum data] ******************************************************
 [WARNING]: Consider using the yum module rather than running yum.  If you need
to use command because yum is insufficient you can add warn=False to this
command task or set command_warnings=False in ansible.cfg to get rid of this
message.

changed: [localhost]

PLAY RECAP *********************************************************************
localhost                  : ok=5    changed=1    unreachable=0    failed=0   

[root@localhost ansi]# 

Also note that you always have "changed=1". This is due to the fact that you always run yum clean all. Even though this is not a config change on your system, it is detected as a change. Technically it is a change as you clean the yum data; even in the cases where you do not download a package you still clean the meta-data.

Conclusion
Even though no native support for clean is available in the Ansible yum module, you can ensure that your system is not holding unnecessary data on the filesystem, as shown in the example above.

Friday, March 15, 2019

dash - changing the favicon

Dash is a Python framework for building analytical web applications. It can be used to very quickly develop small applications capable of running small analytical visualizations. As it is developed in Python it is a natural fit with solutions like pandas and the like.

When getting started with Dash, developed by Plotly, you might run into the following question: how do I change the favicon to show the one I want instead of the one shipped by default?



The answer is: stop trying to code the solution. The only way (currently) is to create a directory named assets in the root of your project and add the desired favicon into this location. This should result in the desired favicon showing.
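As a minimal sketch, assuming a Dash 1.x style project where app.py sits next to an assets directory containing favicon.ico, no favicon-related code is needed at all:

# assumed project layout for this sketch:
#   app.py
#   assets/favicon.ico
import dash
import dash_html_components as html

# Dash automatically serves everything in the assets directory, including favicon.ico
app = dash.Dash(__name__)
app.layout = html.Div("Hello Dash")

if __name__ == "__main__":
    app.run_server(debug=True)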


Tuesday, March 05, 2019

Python - machine learning and clustering

Clustering is the task of dividing the population or data points into a number of groups such that data points in the same group are more similar to each other than to data points in other groups. In simple words, the aim is to segregate groups with similar traits and assign them into clusters.

Within machine learning we place clustering under unsupervised learning; clustering is used for example in recommendation systems, targeted marketing and customer segmentation.

The below outline is a simple starting point showing a basic form of clustering on a relatively small dataset. The dataset we will use is displayed in the scatter chart below. The objective is to determine 3 clusters in the data shown in the scatter chart. In this example the data is just random data; the data can however represent virtually anything. The data could for example be customers, demographic data, sensor data-points or anything else.


If you look at the data, humans are by default driven by apophenia to try and see patterns. Apophenia has come to imply a universal human tendency to seek patterns in random information, such as in gambling. However, even though the human mind will try to see a pattern, this is far from correct in many cases. To make a truly valid clustering we need to actually base the clustering on math and not on the feeling of the human mind.

By leveraging Python code we can divide the data into 3 distinct clusters; the found clusters are shown below in different colors.


We can now see the different clusters that are within the data. Finding the members of each cluster is done based upon K-means clustering. K-means is a clustering algorithm that aims to partition n observations into k clusters. The main steps are:


  • Initialisation – K initial “means” (centroids) are generated at random
  • Assignment – K clusters are created by associating each observation with the nearest centroid
  • Update – The centroid of the clusters becomes the new mean

The result is that after the updates you will end up with (in our case) 3 centroids and each data point associated with the centroid to which it has the most optimal (smallest) distance.


The above scatter chart shows the centroids which form the backbone of the clustering. Normally they will be hidden as they do not form an actual datapoint from the dataset. As you can see we now have 3 clusters from the bigger dataset.
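A minimal sketch of how such a clustering can be produced with scikit-learn's KMeans is shown below; the random two-dimensional dataset and the choice of 3 clusters are assumptions purely for illustration.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# generate a random two-dimensional dataset
np.random.seed(42)
data = np.random.rand(200, 2)

# partition the data points into 3 clusters using K-means
kmeans = KMeans(n_clusters=3, random_state=42)
labels = kmeans.fit_predict(data)

# plot the data points colored per cluster and mark the centroids
plt.scatter(data[:, 0], data[:, 1], c=labels)
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], marker='x', s=200, c='red')
plt.show()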

Examples of clustering can be found on my Github project containing machine learning examples. 

Wednesday, February 20, 2019

Kubernetes - Minikube start dashboard for a Web UI

For those developing solutions that should run under Kubernetes, running a local version of Kubernetes leveraging Minikube can make your life much easier. One of the questions some people have is how to make use of the Kubernetes Web UI.

Running the Kubernetes Web UI while working with Minikube is relatively easy and you can start the Web UI with a single command of the minikube CLI. The below command showcases how to start the Web UI and also have your local browser open automatically to guide you to the correct URL.

Johans-MacBook-Pro:log jlouwers$ minikube dashboard
🔌  Enabling dashboard ...
🤔  Verifying dashboard health ...
🚀  Launching proxy ...
🤔  Verifying proxy health ...
🎉  Opening http://127.0.0.1:60438/api/v1/namespaces/kube-system/services/http:kubernetes-dashboard:/proxy/ in your default browser...

As you can see from the above example the only command needed is 'minikube dashboard'.


The above screenshot shows you the Kubernetes Web UI in the browser as started by the minikube command.

Friday, February 15, 2019

Python Pandas – consume Oracle Rest API data

When working with Pandas the most commonly known way to get data into a pandas DataFrame is to read a local csv file into the DataFrame using a read_csv() operation. In many cases the data which is encapsulated within the csv file originally came from a database. Getting from a database to a csv file on a machine where your Python code is running involves running a query, exporting the results to a csv file and transporting the csv file to a location where the Python code can read it and transform it into a pandas DataFrame.

When looking at modern systems we see that more and more persistent data stores provide REST APIs to expose data. Oracle has ORDS (Oracle REST Data Services) which provides an easy way to build REST API endpoints as part of your Oracle Database.

Instead of extracting the data from the database, building a csv file and transporting the csv file so you are able to consume it, you can also instruct your Python code to interact directly with the ORDS REST endpoint and read the JSON response directly.

The below JSON structure is an example of a very simple ORDS endpoint response message. From this message we are, in this example, only interested in the items it returns, and we want to have those in our pandas DataFrame.

{
 "items": [{
  "empno": 7369,
  "ename": "SMITH",
  "job": "CLERK",
  "mgr": 7902,
  "hiredate": "1980-12-17T00:00:00Z",
  "sal": 800,
  "comm": null,
  "deptno": 20
 }, {
  "empno": 7499,
  "ename": "ALLEN",
  "job": "SALESMAN",
  "mgr": 7698,
  "hiredate": "1981-02-20T00:00:00Z",
  "sal": 1600,
  "comm": 300,
  "deptno": 30
 }, {
  "empno": 7521,
  "ename": "WARD",
  "job": "SALESMAN",
  "mgr": 7698,
  "hiredate": "1981-02-22T00:00:00Z",
  "sal": 1250,
  "comm": 500,
  "deptno": 30
 }, {
  "empno": 7566,
  "ename": "JONES",
  "job": "MANAGER",
  "mgr": 7839,
  "hiredate": "1981-04-02T00:00:00Z",
  "sal": 2975,
  "comm": null,
  "deptno": 20
 }, {
  "empno": 7654,
  "ename": "MARTIN",
  "job": "SALESMAN",
  "mgr": 7698,
  "hiredate": "1981-09-28T00:00:00Z",
  "sal": 1250,
  "comm": 1400,
  "deptno": 30
 }, {
  "empno": 7698,
  "ename": "BLAKE",
  "job": "MANAGER",
  "mgr": 7839,
  "hiredate": "1981-05-01T00:00:00Z",
  "sal": 2850,
  "comm": null,
  "deptno": 30
 }, {
  "empno": 7782,
  "ename": "CLARK",
  "job": "MANAGER",
  "mgr": 7839,
  "hiredate": "1981-06-09T00:00:00Z",
  "sal": 2450,
  "comm": null,
  "deptno": 10
 }],
 "hasMore": true,
 "limit": 7,
 "offset": 0,
 "count": 7,
 "links": [{
  "rel": "self",
  "href": "http://192.168.33.10:8080/ords/pandas_test/test/employees"
 }, {
  "rel": "describedby",
  "href": "http://192.168.33.10:8080/ords/pandas_test/metadata-catalog/test/item"
 }, {
  "rel": "first",
  "href": "http://192.168.33.10:8080/ords/pandas_test/test/employees"
 }, {
  "rel": "next",
  "href": "http://192.168.33.10:8080/ords/pandas_test/test/employees?offset=7"
 }]
}

The below code shows how to fetch the data from the ORDS endpoint with Python and normalize the JSON so that we only have the information about the items in our DataFrame.

import json
from urllib2 import urlopen
from pandas.io.json import json_normalize

# Fetch the data from the remote ORDS endpoint
apiResponse = urlopen("http://192.168.33.10:8080/ords/pandas_test/test/employees")
apiResponseFile = apiResponse.read().decode('utf-8', 'replace')

# load the JSON data we fetched from the ORDS endpoint into a dict
jsonData = json.loads(apiResponseFile)

# load the dict containing the JSON data into a DataFrame by using json_normalized.
# do note we only use 'items'
df = json_normalize(jsonData['items'])

# show the evidence we received the data from the ORDS endpoint.
print (df.head())

Interacting with an ORDS endpoint to retrieve the data out of the Oracle Database can in many cases be much more efficient than taking the more traditional csv route. Options to use a direct connection to the database and SQL statements will be for another example post. You can see the code used above also in the machine learning examples project on Github.

Wednesday, February 13, 2019

resolved - cx_Oracle.DatabaseError: ORA-24454: client host name is not set

When developing Python code in combination with cx_Oracle on a Mac you might run into some issues, especially when configuring your Mac for the first time. One of the strange things I encountered was the ORA-24454 error when trying to connect to an Oracle database from my MacBook. ORA-24454 states that the client host name is not set.

When looking into the issue it turns out that the combination of the Oracle Instant Client and cx_Oracle will look into /etc/hosts on a Mac to find the client hostname to use when initiating the connection from the Mac to the database.
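For reference, a minimal sketch (with hypothetical connection details) of the kind of connection call that raises this error when the client host name cannot be resolved:

import cx_Oracle

# hypothetical user, password and DSN; this connect call is where ORA-24454
# surfaces when the client host name cannot be determined from /etc/hosts
connection = cx_Oracle.connect("scott", "tiger", "dbhost.example.com:1521/orclpdb1")
print(connection.version)
connection.close()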

Resolve the issue
A small disclaimer: this worked for me, and I expect it will work for other Mac users as well. First you have to find the actual hostname of your system, which you can do by executing one of the following commands:

Johans-MacBook-Pro:~ root# hostname 
Johans-MacBook-Pro.local

or you can run;

Johans-MacBook-Pro:~ root# python -c 'import socket; print(socket.gethostname());'
Johans-MacBook-Pro.local

Knowing the actual hostname of your machine you can now set it in /etc/hosts. This should make it look something like the below:

127.0.0.1 localhost
127.0.0.1 Johans-MacBook-Pro.local

When set, this should ensure you no longer encounter the cx_Oracle.DatabaseError: ORA-24454: client host name is not set error when running your Python code.

Tuesday, February 12, 2019

Python pandas – merge dataframes

Pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language. When working with data you can load data (from multiple types of sources) into a designated DataFrame which will hold the data for future actions. A DataFrame is a two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns).

In many cases the operations you want to do on data require data from more than one single data source. In those cases you have the option to merge (concatenate, join) multiple DataFrames into a single DataFrame for the operations you intend. In the below example, we merge two sets of data (DataFrames) from the World Bank into a single dataset (DataFrame) in one of the most basic merge manners.

Used datasets
For those interested in the datasets, the original data is coming from data.worldbank.org, for this specific example I have modified the way the .csv file is provided originally. You can get the modified .csv files from my machine learning examples project located at github.

Example code
The example we show is relatively simple and is shown in the diagram below; we load two datasets using Pandas read_csv() into their individual DataFrames. When both are loaded we merge the two DataFrames into a single (new) DataFrame using merge().


The below is an outline of the code example, you can get the code example, including the used datasets from my machine learning examples project at github.

import pandas as pd

df0 = pd.read_csv('../../data/dataset_4.csv', delimiter=";",)
print ('show the content of the first file via dataframe df0')
print (df0.head())

df1 = pd.read_csv('../../data/dataset_5.csv', delimiter=";",)
print ('show the content of the second file via dataframe df1')
print (df1.head())

df2 = pd.merge(df0, df1, on=['Country Code','Country Name'])
print ('show the content of merged dataframes as a single dataframe')
print (df2.head())

Monday, February 11, 2019

Secure Software Development - the importance of dependency manifest files

When developing code, in this specific example Python code, one thing you want to make sure of is that you do not introduce vulnerabilities. Vulnerabilities can be introduced primarily in two ways: you create them or you include them. One way of providing an extra check that you do not include vulnerabilities in your application is making sure you handle the dependency manifest files in the right way.

A dependency manifest file makes sure that all the components your application relies upon are listed in a central place. One of the advantages is that you can use this file to scan for known security issues in components you depend upon. It is very easy to do an import or include like statement and add additional functionality to your code. However, whatever you include might have a known bug or vulnerability in a specific version.

Creating a dependency manifest file in python
When developing Python code you can leverage pip to create a dependency manifest file, commonly named requirements.txt. The below command shows how you can create a dependency manifest file:

pip freeze > requirements.txt

If we look into the content of this file we will notice a structure like the one shown below, which lists all the dependencies and their exact versions.

altgraph==0.10.2
bdist-mpkg==0.5.0
bonjour-py==0.3
macholib==1.5.1
matplotlib==1.3.1
modulegraph==0.10.4
numpy==1.16.1
pandas==0.24.1
py2app==0.7.3
pyobjc-core==2.5.1
pyobjc-framework-Accounts==2.5.1
pyobjc-framework-AddressBook==2.5.1
pyobjc-framework-AppleScriptKit==2.5.1
pyobjc-framework-AppleScriptObjC==2.5.1
pyobjc-framework-Automator==2.5.1
pyobjc-framework-CFNetwork==2.5.1
pyobjc-framework-Cocoa==2.5.1
pyobjc-framework-Collaboration==2.5.1
pyobjc-framework-CoreData==2.5.1
pyobjc-framework-CoreLocation==2.5.1
pyobjc-framework-CoreText==2.5.1
pyobjc-framework-DictionaryServices==2.5.1
pyobjc-framework-EventKit==2.5.1
pyobjc-framework-ExceptionHandling==2.5.1
pyobjc-framework-FSEvents==2.5.1
pyobjc-framework-InputMethodKit==2.5.1
pyobjc-framework-InstallerPlugins==2.5.1
pyobjc-framework-InstantMessage==2.5.1
pyobjc-framework-LatentSemanticMapping==2.5.1
pyobjc-framework-LaunchServices==2.5.1
pyobjc-framework-Message==2.5.1
pyobjc-framework-OpenDirectory==2.5.1
pyobjc-framework-PreferencePanes==2.5.1
pyobjc-framework-PubSub==2.5.1
pyobjc-framework-QTKit==2.5.1
pyobjc-framework-Quartz==2.5.1
pyobjc-framework-ScreenSaver==2.5.1
pyobjc-framework-ScriptingBridge==2.5.1
pyobjc-framework-SearchKit==2.5.1
pyobjc-framework-ServiceManagement==2.5.1
pyobjc-framework-Social==2.5.1
pyobjc-framework-SyncServices==2.5.1
pyobjc-framework-SystemConfiguration==2.5.1
pyobjc-framework-WebKit==2.5.1
pyOpenSSL==0.13.1
pyparsing==2.0.1
python-dateutil==2.8.0
pytz==2013.7
scipy==0.13.0b1
six==1.12.0
xattr==0.6.4

Check for known security issues
One of the simplest ways to check for known security issues is checking your code in at github.com. As part of the service provided by GitHub you will get alerts, based upon the dependency manifest file, for dependencies which might have a known security issue. The below screenshot shows the result of uploading a Python dependency manifest file to GitHub.


As it turns out, somewhere in the chain of dependencies some project still has an old version of pyOpenSSL included which has a known security vulnerability. The beauty of this approach is that you have direct insight and you can correct this right away.

Sunday, February 10, 2019

Python Matplotlib - showing or hiding a legend in a plot


When working with Matplotlib to visualize your data there are situations where you want to show the legend and cases where you want to hide it. Showing or hiding the legend is very simple, as long as you know how to do it; the below example showcases both showing and hiding the legend in your plot.

The code used in this example uses pandas and matplotlib to plot the data. The full example of this is part of my machine learning example repository on Github where you can find this specific code and more.

Plot with legend
The below image shows the plotted data with a legend. Having a legend is in some cases very good, however in other cases it might be very distracting in your image. Personally I think keeping a plot very clean (without a legend) is in many cases the best way of presenting a plot.

The code used for this is shown below. As you can see we use legend=True

df.plot(kind='line',x='ds',y='y',ax=ax, legend=True)


Plot without legend
The below image shows the plotted data without a legend, which keeps the plot clean.

The code used for this is shown below. As you can see we use legend=False
df.plot(kind='line',x='ds',y='y',ax=ax, legend=False)
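For completeness, a small self-contained sketch (with made-up data, using the same ds and y column names as the snippets above) that you can use to try both settings:

import pandas as pd
import matplotlib.pyplot as plt

# made-up data with the same column names as used in the snippets above
df = pd.DataFrame({'ds': range(10), 'y': [value * value for value in range(10)]})

fig, ax = plt.subplots()

# switch legend between True and False to show or hide the legend
df.plot(kind='line', x='ds', y='y', ax=ax, legend=True)
plt.show()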

Thursday, January 31, 2019

machine learning - matplotlib error in matplotlib.backends import _macosx


When trying to visualize and plot data in Python you might work with Matplotlib. In case you are working on MacOS and you use a venv, in some cases you might run into the below error message:

RuntimeError: Python is not installed as a framework. The Mac OS X backend will not be able to function correctly if Python is not installed as a framework. See the Python documentation for more information on installing Python as a framework on Mac OS X. Please either reinstall Python as a framework, or try one of the other backends. If you are using (Ana)Conda please install python.app and replace the use of 'python' with 'pythonw'. See 'Working with Matplotlib on OSX' in the Matplotlib FAQ for more information.

The reason for this error is that Matplotlib is not able to find the correct backend. The easiest way to resolve this in a quick and dirty way is to add the following line to your code.

matplotlib.use('TkAgg')

This should remove (in most cases) the error and your code should be able to run correctly.
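A minimal sketch of the placement; matplotlib.use() is typically called right after importing matplotlib and before pyplot is imported:

import matplotlib
# select the TkAgg backend before pyplot is imported
matplotlib.use('TkAgg')
import matplotlib.pyplot as plt

# small plot to verify the backend works
plt.plot([1, 2, 3], [1, 4, 9])
plt.show()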