Important Pandas Functions For Data Cleaning Tasks

Ayanlowo Babatunde
4 min read · Oct 31, 2020


The main purpose of data cleaning is to identify and remove errors and duplicate data in order to create a reliable dataset. This improves the quality of the training data for analytics and enables accurate decision-making.

Data cleaning and data preparation are a critical first step in any AI/machine learning project. In reality, most data scientists spend the bulk of their time cleaning data during machine learning projects.

In this post we will look at some important pandas functions that can assist in the data cleaning process during machine learning tasks.

First, we create a sample dataframe for this example:


# SETTING UP MY DATAFRAME
import pandas as pd
import numpy as np

# Initialize a dictionary of lists for the first table's data
data = {'Cust_Id': [1001, 1002, 1003, 1004, 1005, 1007],
        'Name': ['Tom', 'nick', 'krish', 'jack', 'tommy', 'kunle'],
        'Age': [20, 21, 19, 18, np.nan, np.nan],
        'State': ['Lagos', 'Oyo', 'Kaduna', 'Imo', 'abia', 'Borno'],
        'Sex': ['Male', 'female', 'm', 'female', 'male', np.nan]}

# Create DataFrame
CUSTOMER = pd.DataFrame(data)

# Print the output
CUSTOMER

Sample dataset for customer orders
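
The snippets below also reference a second table, Customer_orders. A minimal, assumed version of that table is sketched here so the later examples have something to run against; the column names (CUST_ID, ORDER_QTY, UNIT_PRICE, ORDER_DATE) are taken from those snippets and the values are made up purely for illustration.

# Assumed Customer_orders table (illustrative values only)
order_data = {'CUST_ID': [1001, 1001, 1002, 1003, 1005],
              'ORDER_QTY': [2, 1, 5, 3, 4],
              'UNIT_PRICE': [1500, 700, 250, 1200, 300],
              'ORDER_DATE': ['01/15/20', '02/20/20', '02/25/20', '03/05/20', '03/10/20']}
Customer_orders = pd.DataFrame(order_data)
Customer_orders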

Preliminary functions: copy(), shape, head(), tail()

# copy(): making a backup of the dataset in case the original gets messed up
CUSTOMER_COPY = CUSTOMER.copy()
Customer_orders_COPY = Customer_orders.copy()

# Finding the number of rows and columns
# The result of shape is always a tuple of (rows, cols)
CUSTOMER.shape

# head() and tail()
# check the first five rows and the last five rows
CUSTOMER.head()
CUSTOMER.tail()
# head() returns the first five rows by default

# dtypes gives a Series with the datatype of each column in the dataframe
CUSTOMER.dtypes

# info() gives more details: what kind of object the dataset is,
# the index range starting from 0, and the datatypes in the dataset

CUSTOMER.info()

# select_dtypes()
# select columns with a specific kind of datatype
# this selects the columns with datatype object
CUSTOMER.select_dtypes(include=['object']).dtypes
# checking which columns are object type, i.e. strings
CUSTOMER.select_dtypes(include=['object']).columns
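
select_dtypes() also works the other way round; a quick sketch selecting only the numeric columns (include='number' and exclude are both standard options):

# select only the numeric columns (int and float)
CUSTOMER.select_dtypes(include=['number']).columns
# or exclude object columns instead of including a type
CUSTOMER.select_dtypes(exclude=['object']).dtypes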

# UNIT_PRICE is an int column; we want to convert it to float
# astype() returns a new Series, so assign it back to keep the change
Customer_orders['UNIT_PRICE'] = Customer_orders['UNIT_PRICE'].astype(float)

# unique() and nunique()
# unique() gives the distinct values within a column,
# while nunique() counts the number of unique values in the column
print(CUSTOMER['Sex'].unique())
# nunique() ignores NaN values unless you pass dropna=False
print(CUSTOMER['Sex'].nunique(dropna=False))

# value_counts()
# gives the count of each distinct value within a column
# for Sex the distinct value counts are as follows
CUSTOMER['Sex'].value_counts(dropna=False)

#dropna() & drop both are used for dropping
#dropna drops axis rows or cols with nulls values
#if axis is 0 it will drop corresponding rows,if 1 drops corresponding
#columns
CUSTOMER.dropna(axis=0)
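
dropna() takes a few other useful parameters; a short sketch of the common ones (how, subset and thresh), applied to the same sample frame:

# drop a row only if ALL of its values are null
CUSTOMER.dropna(how='all')
# drop rows that have a null in a specific column only
CUSTOMER.dropna(subset=['Age'])
# keep only rows with at least 4 non-null values
CUSTOMER.dropna(thresh=4)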

# fillna(): filling the NaN values in the Sex column with 'unspecified'
CUSTOMER['Sex'].fillna('unspecified', inplace=True)

# replace() can also be used to change values as follows
CUSTOMER = CUSTOMER_COPY.copy()
CUSTOMER.replace(to_replace=np.nan, value='UNSPECIFIED')

GroupBy

The groupby() function works on the principle of split-apply-combine: the data is split into groups based on a key, a function is applied to each group, and the results are combined into a new dataframe or series.


# groupby()
# this groups the Customer_orders table by CUST_ID and sums the
# ORDER_QTY and UNIT_PRICE columns for each customer
Customer_orders.groupby('CUST_ID')[['ORDER_QTY', 'UNIT_PRICE']].sum()
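
If you need a different aggregation per column, groupby() can be combined with agg(); a small sketch, using the assumed Customer_orders columns from above:

# different aggregation per column: total quantity, average unit price
Customer_orders.groupby('CUST_ID').agg({'ORDER_QTY': 'sum',
                                        'UNIT_PRICE': 'mean'})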

For more on groupby, these links give a good starting point:

1. https://www.shanelynn.ie/summarising-aggregation-and-grouping-data-in-python-pandas/
2. https://medium.com/jbennetcodes/how-to-rewrite-your-sql-queries-in-pandas-and-more-149d341fc53e

# to_datetime() converts an object column to a datetime column
Customer_orders['ORDER_DATE'] = pd.to_datetime(Customer_orders['ORDER_DATE'], format='%m/%d/%y')
print(Customer_orders.ORDER_DATE.dt.year)
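
Once the column is a datetime type, the .dt accessor exposes more than just the year; for example:

# other handy datetime accessors
print(Customer_orders.ORDER_DATE.dt.month)
print(Customer_orders.ORDER_DATE.dt.day_name())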

# set_index()
# indexing is a powerful tool to query your dataframe object with loc

Sometimes, during a join operation between two tables, you may need to set your index first:

CUSTOMER.set_index('Cust_Id', inplace=True)

# Using loc with the new index
CUSTOMER.loc[[1001, 1002, 1003]]

# loc may sometimes be preferable to the replace() function
CUSTOMER.loc[CUSTOMER.index == 1005, 'Age'] = 'unspecified'
CUSTOMER
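
loc also accepts a boolean mask together with a list of columns, which is handy for filtered views or updates; a quick sketch on the sample frame:

# boolean mask plus column selection: names and ages of customers in Lagos
CUSTOMER.loc[CUSTOMER['State'] == 'Lagos', ['Name', 'Age']]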

More examples in the links below:

1. https://www.shanelynn.ie/select-pandas-dataframe-rows-and-columns-using-iloc-loc-and-ix/
2. https://www.geeksforgeeks.org/indexing-and-selecting-data-with-pandas/

# query()
# note: the Age column must still be numeric for this comparison to work
CUSTOMER_Filtered1 = CUSTOMER.query('Age < 30')

print(CUSTOMER_Filtered1)

More here: https://cmdlinetips.com/2019/07/how-to-select-rows-of-pandas-dataframe-with-query-function/
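
query() can also reference local Python variables with the @ prefix; a small sketch:

# referencing a local variable inside query() with @
max_age = 30
CUSTOMER_Filtered2 = CUSTOMER.query('Age < @max_age')
print(CUSTOMER_Filtered2)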

Functions, apply() and lambda

def SEX_CONVERSION(sex):
    if sex == 'Male':
        return 'M'
    elif sex == 'm':
        return 'M'
    elif sex == 'male':
        return 'M'
    elif sex == 'female':
        return 'F'
    else:
        return 'Unspecified'

# apply() and lambda
CUSTOMER['Sex'] = CUSTOMER['Sex'].apply(lambda x: SEX_CONVERSION(x))
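
The same normalisation can usually be done without a custom function; a sketch using str.lower() and map() instead of apply() (the mapping dictionary is an assumption about how the labels should be grouped):

# vectorized alternative: lower-case the values, then map them to codes
sex_map = {'male': 'M', 'm': 'M', 'female': 'F'}
CUSTOMER['Sex'] = CUSTOMER['Sex'].str.lower().map(sex_map).fillna('Unspecified')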

merge()

merge() joins two dataframes on common columns or indexes, much like a SQL join.
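
A minimal sketch of such a join, assuming the Customer_orders columns sketched earlier and that Cust_Id has already been set as the CUSTOMER index:

# join orders to customers on the customer id
# left_on uses a Customer_orders column, right_index uses CUSTOMER's index
orders_with_names = Customer_orders.merge(CUSTOMER,
                                          left_on='CUST_ID',
                                          right_index=True,
                                          how='left')
orders_with_names.head()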

This sample exercise shows some of the most used pandas functions in the data cleaning process.

Written by Ayanlowo Babatunde
Industrial Engineer with interests in Machine Learning/Robotics/IoT