Tuesday, February 12, 2019

Python pandas – merge dataframes

Pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language. When working with data, you can load it (from multiple types of sources) into a DataFrame, which holds the data for later operations. A DataFrame is a two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns).
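To make the DataFrame description a bit more concrete, here is a minimal sketch. The column names mirror the ones used later in this post, but the values are hypothetical and made up purely for illustration; they are not World Bank data.

```python
import pandas as pd

# Hypothetical sample values, invented for illustration only.
df = pd.DataFrame({
    "Country Code": ["AAA", "BBB", "CCC"],
    "Country Name": ["Country A", "Country B", "Country C"],
    "Value": [1.5, 2.5, 3.5],
})

# Heterogeneous: every column carries its own dtype.
print(df.dtypes)

# Labeled axes: the row index and the column names.
print(df.index)
print(df.columns)

# Size-mutable: adding a column changes the shape of the same object.
df["Value doubled"] = df["Value"] * 2
print(df.shape)
```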

In many cases the operations you want to perform require data from more than one data source. In those cases you can combine (merge, concatenate, or join) multiple DataFrames into a single DataFrame for the operations you intend. In the example below, we merge two sets of data (DataFrames) from the World Bank into a single dataset (DataFrame) in one of the most basic ways.
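Merging, concatenating and joining are related but not identical. The sketch below contrasts the three on two tiny frames; the frames, the column name "key" and the values are hypothetical and exist only to show the difference.

```python
import pandas as pd

# Two tiny hypothetical frames sharing a "key" column.
left = pd.DataFrame({"key": ["a", "b"], "x": [1, 2]})
right = pd.DataFrame({"key": ["a", "b"], "y": [10, 20]})

# concat stacks the frames on top of each other; a column that is
# missing in one of the frames is filled with NaN for those rows.
stacked = pd.concat([left, right], ignore_index=True)
print(stacked)

# merge matches rows on the values of one or more key columns.
merged = pd.merge(left, right, on="key")
print(merged)

# join matches rows on the index, so the key is moved there first.
joined = left.set_index("key").join(right.set_index("key"))
print(joined)
```

In this sketch merge() and join() end up with the same pairing of rows; the difference is whether the key lives in a column or in the index.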

Used datasets
For those interested in the datasets: the original data comes from data.worldbank.org. For this specific example I have modified the format of the original .csv files. You can get the modified .csv files from my machine learning examples project on GitHub.

Example code
The example is relatively simple and is shown in the diagram below. We load two datasets using the pandas read_csv() function, each into its own DataFrame. Once both are loaded, we merge the two DataFrames into a single (new) DataFrame using merge().


Below is an outline of the code example. You can get the full example, including the datasets used, from my machine learning examples project on GitHub.

import pandas as pd

# Load the first semicolon-separated file into its own DataFrame.
df0 = pd.read_csv('../../data/dataset_4.csv', delimiter=';')
print('show the content of the first file via dataframe df0')
print(df0.head())

# Load the second file the same way.
df1 = pd.read_csv('../../data/dataset_5.csv', delimiter=';')
print('show the content of the second file via dataframe df1')
print(df1.head())

# Merge on the two columns both files share. merge() defaults to an
# inner join, so only countries present in both files are kept.
df2 = pd.merge(df0, df1, on=['Country Code', 'Country Name'])
print('show the content of merged dataframes as a single dataframe')
print(df2.head())
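
Because merge() performs an inner join unless told otherwise, rows whose keys appear in only one of the two DataFrames are dropped without any warning. The sketch below shows this with small in-memory frames, so it runs without the .csv files. The country codes, names, value columns and numbers are hypothetical stand-ins, not the real World Bank data.

```python
import pandas as pd

# Hypothetical stand-ins for the two .csv files; all values are invented.
df0 = pd.DataFrame({
    "Country Code": ["AAA", "BBB", "CCC"],
    "Country Name": ["Country A", "Country B", "Country C"],
    "Value A": [1.0, 2.0, 3.0],
})
df1 = pd.DataFrame({
    "Country Code": ["BBB", "CCC", "DDD"],
    "Country Name": ["Country B", "Country C", "Country D"],
    "Value B": [20.0, 30.0, 40.0],
})

keys = ["Country Code", "Country Name"]

# Default how="inner": only rows whose keys occur in both frames survive.
inner = pd.merge(df0, df1, on=keys)
print(inner)

# how="outer" keeps every row from both frames; indicator=True adds a
# "_merge" column saying where each row came from.
outer = pd.merge(df0, df1, on=keys, how="outer", indicator=True)
print(outer)
```

Checking the "_merge" column of an outer merge is a quick way to see which countries would be lost by the basic merge shown above.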
