Pandas Library Demonstration
Introduction
Pandas is a Python library (a set of useful functions that eliminates the need to write code from scratch) that is generally used for data manipulation and data analysis. It is usually used to handle structured data, i.e., data in tabular format. The name “pandas” is derived from the term “panel data”, an econometrics term for tabular data. The library provides two main data structures:
(i) A Series is a one-dimensional labeled array capable of holding data of any type (integer, string, float, Python objects, etc.).
(ii) A DataFrame is a two-dimensional data structure, i.e., data is aligned in a tabular fashion in rows and columns.
Install pandas
There are multiple ways to install Pandas, but I prefer using a conda environment. First, open the Anaconda Prompt (CMD.exe) and run the following command.
conda install pandas
We will also use the “os” module to work with the paths of the input files; it is part of Python's standard library, so it does not need to be installed separately.
Import pandas
Once Pandas is successfully installed on your system, you can import it with the following commands.
import pandas as pd
import os
where pd is an alias used for Pandas so that the Pandas package can be referred to as pd instead of pandas.
Reading the data
Creating pandas Series and DataFrame
The following commands are used:
# A pandas series
pd.Series([12, 13, 1, 3, 5])
# pandas dataframe
pd.DataFrame({'name': ['John', 'Michael', 'Neil', 'Robert'], 'age': [21, 23, 24, 28], 'occupation': ['data scientist', 'doctor', 'data analyst', 'engineer']})
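As a quick illustration of the “labeled” part, a Series can also carry a custom index, and values can then be looked up by label. A minimal sketch (the labels below are made up for this example):
# A Series with custom labels; values can be accessed by label
s = pd.Series([12, 13, 1], index=['a', 'b', 'c'])
s['b']  # returns 13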
To read the data from a file, we can use the read_csv() function, which loads a CSV file into a DataFrame. There are various other methods to read different types of files, such as read_json(), read_html(), read_excel(), etc., which can be used as per the requirement.
df = pd.read_csv("C:\\Users\\wb580236\\Demo_Data\\ITF_PASSENGER_TRANSPORT_13092021212626037.csv")
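Since we imported the os module earlier, one way to build the same path is with os.path.join(), which keeps the folder name in a single place (a small sketch using the folder and file from above):
# Build the file path from a base directory
data_dir = "C:\\Users\\wb580236\\Demo_Data"
df = pd.read_csv(os.path.join(data_dir, "ITF_PASSENGER_TRANSPORT_13092021212626037.csv"))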
Explore the data
First, we have to explore the data, and there are multiple commands to understand it. We can use the “head()” function to display the top 5 rows of our data set; for more rows we can use df.head(n), where n is the number of rows. Similarly, the “tail()” method shows the last rows.
head() & tail()
df.head()
df.tail()
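For example, to display the first ten rows instead of the default five:
df.head(10)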
Now, if we want to see the dimensions of our dataset, we can use the “shape” attribute, which shows them in (number of rows, number of columns) format.
shape
df.shape
If we want to know some more information about our dataset, we can use the “info()” function of pandas. It displays various information about our data, such as the column names, the number of non-null values in each column (feature), the type of each column, memory usage, etc.
info()
df.info()
isnull() and sum()
Using the “isnull()” and “sum()” functions, we can find the number of null values in a DataFrame for every feature: isnull() returns True wherever a value is missing, and sum() totals a column.
df.isnull()  # True/False mask for the whole DataFrame
df['Flag Codes'].isnull()  # mask for a single column
df['Value'].sum()  # sum() on its own totals a numeric column
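Chaining the two gives the number of missing values per column, since sum() counts each True as 1:
df.isnull().sum()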
drop()
We can drop a particular column using the pandas “drop()” function. Note – we provide “inplace=True” to modify the current DataFrame in place.
df.drop(['Flags'], axis=1, inplace=True)
describe()
Using the “describe()” function, we can get various information about the numerical columns in our DataFrame, such as the count of non-null rows, the mean, the minimum value, the maximum value, and the spread (standard deviation and quartiles) of the values in each feature.
df.describe()
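By default, describe() summarizes only the numeric columns; passing include='object' summarizes the text columns instead (count, number of unique values, the most frequent value, and its frequency):
df.describe(include='object')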
value_counts()
The “value_counts()” function is used to identify the different categories in a feature as well as the count of values per category.
df['Year'].value_counts()
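Passing normalize=True returns the share of rows in each category instead of the raw counts:
df['Year'].value_counts(normalize=True)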
fillna()
Now, as we know, by using the “isnull()” and “sum()” functions we can check whether our data has any missing values. We can fill the missing values using the “fillna()” function, for example with a placeholder string or with the mode (the value that appears most frequently in the feature). Here we fill the missing “Flag Codes” values with a placeholder:
df['Flag Codes'].fillna('Not Available', inplace=True)
df.iloc[1:2, :]  # display a row to check the result
Note – Filling the missing or NaN values this way is not the most effective approach, and there are other, much more efficient methods for imputation. A lot of thought goes into working with missing values, but this is just to explain the functionality of the fillna() function.
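For instance, to actually fill with the mode mentioned above, one common pattern is the following sketch (not necessarily the best imputation for this dataset):
# Fill missing values with the most frequent value in the column
df['Flag Codes'] = df['Flag Codes'].fillna(df['Flag Codes'].mode()[0])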
Processing the data
Once the data has been cleaned in the previous steps, the next step is processing it using multiple functions.
Dropping columns in the DataFrame
The first step is dropping unnecessary columns.
# Dropping a specific column
df1 = df.drop(['YEAR'], axis=1)
# Dropping multiple columns at a time
df2 = df.drop(['YEAR', 'Flag Codes'], axis=1)
Slicing the data in the DataFrame
# Extracting a particular element from the DataFrame: dataframe.iloc[m, n], where m is the row index and n is the column index
df.iloc[3, 5]  # single element: 4th row, 6th column
df.iloc[1:20, :]  # rows at positions 1 through 19, all columns
There is also label-based slicing.
# Extracting rows and columns based on row or column labels
df.loc[2:50, 'Country']  # rows with labels 2 through 50, column 'Country'
We can extract specific columns or rows satisfying a particular condition.
# Extracting all the rows where Year is greater than 2007
df.loc[df.Year > 2007]
df[df.Year > 2007]  # equivalent shorthand
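Conditions can also be combined with & (and) and | (or), wrapping each condition in parentheses. A sketch, where 'France' is just an assumed value in the Country column:
# Rows after 2007 for a single country (example value)
df[(df.Year > 2007) & (df.Country == 'France')]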
Sorting the data in the DataFrame
We can sort our DataFrame by index or by values with the pandas “sort_index()” and “sort_values()” functions. Below is the implementation for sorting by values.
df.sort_values(by='Country')
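We can also sort by several columns at once or in descending order, and sort_index() reorders the rows by their index:
df.sort_values(by=['Country', 'Year'], ascending=[True, False])
df.sort_index()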
Other operations on DataFrames
Merging two datasets based on one or more columns can be done using pd.merge().
data_1 = pd.DataFrame(
{
"key_1": ["K0", "K0", "K1", "K2"],
"key_2": ["K0", "K1", "K0", "K1"],
"A": ["A0", "A1", "A2", "A3"],
"B": ["B0", "B1", "B2", "B3"],
}
)
data_2 = pd.DataFrame(
{
"key_1": ["K0", "K1", "K1", "K2"],
"key_2": ["K0", "K0", "K0", "K0"],
"C": ["C0", "C1", "C2", "C3"],
"D": ["D0", "D1", "D2", "D3"],
}
)
# Merging the two dataframes by inner join on particular columns
pd.merge(data_1, data_2, how='inner', on=['key_1', 'key_2'])
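Changing the how argument controls the join type; for example, an outer join keeps key combinations that appear in only one of the two DataFrames and fills the gaps with NaN:
pd.merge(data_1, data_2, how='outer', on=['key_1', 'key_2'])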
The pd.concat() function concatenates pandas objects along a specified axis.
# concatenating dataframes one on top of the other
pd.concat([data_1, data_2], axis=0)
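Passing axis=1 instead places the DataFrames side by side, aligning rows by index, while ignore_index=True renumbers the rows when stacking vertically:
pd.concat([data_1, data_2], axis=1)
pd.concat([data_1, data_2], axis=0, ignore_index=True)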
Pivot Table of the DataFrame
Creating a spreadsheet-style pivot table as a DataFrame:
df3 = df[["Country", "Year", "Variable", "Value"]]
df3.pivot_table(values="Value", index="Country", columns="Year", aggfunc="sum")
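pivot_table() also accepts fill_value to replace empty cells and margins=True to append row and column totals:
df3.pivot_table(values="Value", index="Country", columns="Year", aggfunc="sum", fill_value=0, margins=True)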
Exporting the DataFrame
Now, we can save our DataFrame in a CSV file using the pandas “to_csv()” function. As we don't want to store the index of each row, we set index=False.
df3.to_csv("C:\\Users\\wb580236\\Demo_Data\\output.csv",index=False)
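Other writers work the same way; for example, to_excel() saves the DataFrame to an Excel file (this assumes an Excel engine such as openpyxl is installed):
df3.to_excel("C:\\Users\\wb580236\\Demo_Data\\output.xlsx", index=False)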