Pandas Library Demonstration

Introduction

Pandas is a Python library (a set of useful functions that eliminate the need to write code from scratch) that is generally used for data manipulation and data analysis. It is usually used to handle structured data, i.e., data in tabular format. The name “pandas” is derived from the term “panel data”, an econometrics term for tabular data. The library provides two main data structures:

(i) A Series is a one-dimensional labeled array capable of holding data of any type (integers, strings, floats, Python objects, etc.).

(ii) A DataFrame is a two-dimensional data structure, i.e., data is aligned in a tabular fashion in rows and columns.

Install pandas

There are multiple ways to install Pandas, but I prefer using a conda environment. First, open the Anaconda Prompt.

conda install pandas

We might also need the “os” module to specify the paths of the input files; it is part of Python's standard library, so it does not need to be installed separately.

Import pandas

Once Pandas is successfully installed on your system, you can import it with the following commands.

import pandas as pd
import os

where pd is an alias used for Pandas so that the Pandas package can be referred to as pd instead of pandas.

Reading the data

Creating pandas Series and DataFrame

The following commands are used:

# A pandas series 
pd.Series([12, 13, 1, 3, 5])

# pandas dataframe
pd.DataFrame({'name': ['John', 'Michael', 'Neil', 'Robert'],
              'age': [21, 23, 24, 28],
              'occupation': ['data scientist', 'doctor', 'data analyst', 'engineer']})


To read data from a CSV file into a DataFrame, we use the read_csv() function. There are various other methods to read different types of files, such as read_json(), read_html(), read_excel(), etc., which can be used as per the requirement.

df = pd.read_csv("C:\\Users\\wb580236\\Demo_Data\\ITF_PASSENGER_TRANSPORT_13092021212626037.csv")
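
As a quick sketch of the same pattern with the other readers (the file names below are hypothetical placeholders, not files from this demo):

# hypothetical files, shown only to illustrate the call pattern
df_json = pd.read_json("data.json")
df_excel = pd.read_excel("data.xlsx")   # reading .xlsx requires openpyxl
tables = pd.read_html("page.html")      # returns a list of DataFrames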

Explore the data

First, we have to explore the data. There are multiple commands to understand the data. We can use the head() function to display the first 5 rows of our data set. For more rows we can use df.head(n), where n is the number of rows. Similarly, the tail() method shows the last 5 rows by default.

head() & tail()

df.head()
df.tail()
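
For example, to view a different number of rows:

df.head(10)   # first 10 rows
df.tail(3)    # last 3 rows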

Now, if we want to see the dimensions of our dataset, we can use the shape attribute, which returns the dimensions in (number of rows, number of columns) format.

shape

df.shape
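
Since shape is a plain tuple, its elements can be unpacked directly:

n_rows, n_cols = df.shape
print(f"{n_rows} rows, {n_cols} columns")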

If we want to know some more information about our dataset, we can use the info() function of pandas. It displays various information about our data, such as the column names, the number of non-null values in each column (feature), the data type of each column, memory usage, etc.

info()

df.info()

isnull() and sum()

Using the “isnull()” and “sum()” functions together, we can find the number of null values for every feature in a DataFrame.

df.isnull()                   # boolean DataFrame marking missing values
df['Flag Codes'].isnull()     # boolean Series for a single column
df.isnull().sum()             # number of missing values per column

drop()

We can drop a particular column using the pandas drop() function. Note – we pass inplace=True to modify the current DataFrame in place.

df.drop(['Flags'], axis=1, inplace=True)

describe()

Using the “describe()” function, we can get various statistics about the numerical columns in our DataFrame, such as the total count of rows, the mean, the minimum value, the maximum value, and the spread of the values in each feature.

df.describe()
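
By default, describe() summarizes only the numeric columns; to include the non-numeric ones as well:

df.describe(include='all')   # also summarize categorical columns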

value_counts()

The “value_counts()” function is used to identify the different categories in a feature, as well as the count of values per category.

df['Year'].value_counts()
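
To see proportions instead of raw counts, we can pass normalize=True:

df['Year'].value_counts(normalize=True)   # relative frequency of each category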

fillna()

Now, as we know, we can check whether our data has any missing values by using the “isnull()” and “sum()” functions. We can fill up the missing values using the “fillna()” function, for example with a constant value or with the mode (the value that appears most frequently in the feature).

df['Flag Codes'].fillna('Not Available', inplace=True)
df.iloc[1:2, :]   # inspect a row to verify the fill

Note – Filling missing or NaN values this way is not the most effective approach, and there are much more efficient imputation methods. A lot of thought goes into handling missing values; this is just to explain the functionality of the fillna() function.
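
For instance, a minimal sketch of filling with the mode of the column (assuming the constant fill above has not already been applied):

# fill missing values with the most frequent value of the column
df['Flag Codes'] = df['Flag Codes'].fillna(df['Flag Codes'].mode()[0])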

Processing the data

Once the data has been cleaned in the previous steps, the next step is to process it using various functions.

Dropping columns in the DataFrame

The first step is dropping unnecessary columns.

# Dropping a specific column
df1 = df.drop(['YEAR'], axis=1)

# Dropping multiple columns at a time
df2 = df.drop(['YEAR', 'Flag Codes'], axis=1)

Slicing the data in the DataFrame

# Extracting a particular element from the dataframe: dataframe.iloc[m, n], where m is the row index and n is the column index
df.iloc[3, 5]      # single element: 4th row and 6th column
df.iloc[1:20, :]   # rows at positions 1 to 19, all columns

There is also label-based slicing.

# Extracting rows with labels 2 through 50 from the 'Country' column
df.loc[2:50, 'Country']

We can extract specific columns or rows satisfying a particular condition.

# Extracting all rows where the Year value is greater than 2007
df.loc[df.Year > 2007]
df[df.Year > 2007]   # equivalent boolean indexing
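
Multiple conditions can be combined with & and |; a small sketch, where 'France' is just a hypothetical value of the Country column:

# rows after 2007 for a single country ('France' is a hypothetical value here)
df[(df.Year > 2007) & (df.Country == 'France')]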

Sorting the data in the DataFrame

We can sort our DataFrame by index or by values with the pandas “sort_index()” and “sort_values()” functions. Below is an example of sorting by values.

df.sort_values(by='Country')
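
Sorting by index, or sorting values in descending order, works the same way:

df.sort_index()                               # sort rows by their index labels
df.sort_values(by='Value', ascending=False)   # largest values first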

Other operations on DataFrames

Merging two datasets based on one or more key columns can be done using pd.merge().

data_1 = pd.DataFrame(
    {
        "key_1": ["K0", "K0", "K1", "K2"],
        "key_2": ["K0", "K1", "K0", "K1"],
        "A": ["A0", "A1", "A2", "A3"],
        "B": ["B0", "B1", "B2", "B3"],
    }
)
data_2 = pd.DataFrame(
    {
        "key_1": ["K0", "K1", "K1", "K2"],
        "key_2": ["K0", "K0", "K0", "K0"],
        "C": ["C0", "C1", "C2", "C3"],
        "D": ["D0", "D1", "D2", "D3"],
    }
)
# Merging the two dataframes by inner join on the key columns
pd.merge(data_1, data_2, how='inner', on=['key_1', 'key_2'])
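
Other join types follow the same pattern; for example, an outer join keeps every key combination from both DataFrames:

# outer join: unmatched rows are kept and the missing cells are filled with NaN
pd.merge(data_1, data_2, how='outer', on=['key_1', 'key_2'])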

Concatenating pandas objects along the specified axis.

# concatenating dataframes one on top of the other
pd.concat([data_1, data_2], axis=0)
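
Passing axis=1 instead places the DataFrames side by side, aligning rows on the index:

# concatenating dataframes side by side
pd.concat([data_1, data_2], axis=1)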

Pivot Table of the DataFrame

Creating a spreadsheet-like pivot table as a DataFrame.

df3 = df[["Country", "Year", "Variable", "Value"]]
df3.pivot_table(values="Value", index="Country", columns="Year", aggfunc="sum")
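
The aggfunc parameter accepts other aggregations too; for instance, averaging instead of summing:

# average Value per Country and Year instead of the sum
df3.pivot_table(values="Value", index="Country", columns="Year", aggfunc="mean")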

Exporting the DataFrame

Now, we can save our DataFrame to a CSV file using the pandas “to_csv()” function. Since we don't want to store the index of each row, we set index=False.

df3.to_csv("C:\\Users\\wb580236\\Demo_Data\\output.csv",index=False)
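
A similar writer exists for Excel files; a quick sketch, assuming the openpyxl package is installed and using a hypothetical output name in the same folder:

# hypothetical output file; writing .xlsx requires openpyxl
df3.to_excel("C:\\Users\\wb580236\\Demo_Data\\output.xlsx", index=False)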