Looking to jump into data analysis with Python? Well, look no further than pandas! Pandas is a powerful open-source Python library extensively used for data manipulation and analysis. It offers two primary data structures:
Series: A one-dimensional array-like object that can hold any data type (integers, strings, Python objects, etc.). It's like a single column from a spreadsheet.
DataFrame: A two-dimensional labeled data structure with rows and columns, similar to a spreadsheet. It allows you to store and manipulate data in a tabular format.
We'll primarily focus on DataFrame methods, but it's important to understand that pandas also provides functionalities for working with Series. DataFrame methods allow you to manipulate, analyze, and transform the data contained within a DataFrame.
With the help of these 'Pandas Helpers', we'll transform ourselves from lone data explorers into confident analysts as we explore the methods used at each stage of the data analysis journey.
Let's dive right in:
Data Loaders:
These methods will equip us with essential tools to collect and organize our data.
- pd.read_csv(file_path): Load data from a CSV file into a DataFrame.
- pd.read_excel(file_path): Load data from an Excel file into a DataFrame.
- pd.read_sql(query, connection): Load the result of a SQL query (or a whole table) from a database into a DataFrame.
- pd.read_json(file_path): Load data from a JSON file into a DataFrame.
- pd.read_html(url_or_file_path): Extract tables from HTML and return them as a list of DataFrames.
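For example, here's a minimal sketch of loading a CSV file (the file name 'sales.csv' and the columns used in later examples are hypothetical):
import pandas as pd

# Load a CSV file into a DataFrame (file name is hypothetical)
df = pd.read_csv("sales.csv")

# Peek at the first few rows to confirm the load worked
print(df.head())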
Additional Techniques:
- Handling Delimiters: By default, read_csv() uses a comma (,) as the delimiter. If your data uses a different delimiter (like a semicolon ';'), you can specify it using the delimiter argument:
df = pd.read_csv(file_path, delimiter=";")
- Specifying headers: By default, read_csv() assumes the first row contains column names. If your CSV doesn't have headers, use the header argument with None:
df = pd.read_csv(file_path, header=None)
- Selecting data types: You can specify data types for columns while reading the CSV using the dtype argument:
df = pd.read_csv(file_path, dtype={'column_name': 'data_type'})
Now that we've successfully turned our data into a manageable DataFrame, we can start using some DataFrame-specific methods as well.
Data Detectives:
These will help us extract specific pieces of information from our data.
Indexing and Slicing:
- df.loc[]: Select rows and columns by label.
- df.iloc[]: Select rows and columns by integer position.
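A quick sketch, assuming the DataFrame from earlier has a 'region' column and the default integer index:
# Label-based selection: index labels 0 through 4 (inclusive), 'region' column only
subset_by_label = df.loc[0:4, "region"]

# Position-based selection: first five rows and first two columns
subset_by_position = df.iloc[0:5, 0:2]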
Conditional Selection:
- df[condition]: Select rows based on a condition.
- df.query(): Select rows using a query expression.
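For example (the 'price' and 'region' columns are assumptions):
# Boolean mask: keep only the rows where price is above 100
expensive = df[df["price"] > 100]

# The same idea expressed as a query string
expensive_east = df.query("price > 100 and region == 'East'")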
Column Selection:
- df['column_name']: Select a single column by name.
- df[['column1', 'column2']]: Select multiple columns by name.
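With the same hypothetical columns:
# A single column comes back as a Series
prices = df["price"]

# A list of names comes back as a DataFrame
price_and_region = df[["price", "region"]]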
Let's now clean our data.
Data Janitors:
Adding and Removing Columns:
- df['new_column'] = ...: Add a new column to the DataFrame.
- df.drop(columns=['column_name']): Remove one or more columns from the DataFrame.
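A short sketch (the 'price', 'quantity', and 'discount' columns are assumptions):
# Derive a new column from existing ones
df["revenue"] = df["price"] * df["quantity"]

# Drop a column that's no longer needed
df = df.drop(columns=["discount"])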
Handling Missing Data:
- df.dropna(): Drop rows with missing values.
- df.fillna(): Fill missing values with a specified value.
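For instance, assuming the hypothetical 'price' column again:
# Drop every row that contains at least one missing value
df_complete = df.dropna()

# Or fill missing prices with the column's median instead
df["price"] = df["price"].fillna(df["price"].median())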
Dealing With Duplicates:
- Use .duplicated() to find rows that are entirely identical (the optional subset argument lets you limit the check to specific columns).
- Use .drop_duplicates() to remove those duplicate rows, keeping the first occurrence by default (it accepts the same subset argument).
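For example (the 'order_id' column is hypothetical):
# Flag rows whose order_id already appeared earlier in the DataFrame
dupes = df.duplicated(subset=["order_id"])
print(dupes.sum())  # number of duplicate rows found

# Remove them, keeping the first occurrence of each order_id
df = df.drop_duplicates(subset=["order_id"])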
Data Type Conversions:
- Checking data types: Use .dtypes to get an overview of the data types for each column.
- Converting data types: Use methods like .astype() to convert columns to a specific data type (e.g., converting strings to numeric types).
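For example, assuming a hypothetical 'quantity' column that was read in as strings:
# Inspect the current data types
print(df.dtypes)

# Convert the string column to integers
df["quantity"] = df["quantity"].astype(int)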
String Manipulation:
- Removing leading/trailing whitespace: Use .str.strip() to remove leading and trailing whitespace from string columns.
- Converting text to lowercase/uppercase: Use .str.lower() or .str.upper() to convert text in string columns to lowercase or uppercase, respectively.
- Replacing text patterns: Use .str.replace() to replace specific patterns within string columns with a new value.
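A small sketch, assuming hypothetical 'region' and 'product' text columns:
# Trim stray whitespace and normalize the case
df["region"] = df["region"].str.strip().str.lower()

# Replace a literal pattern with a new value
df["product"] = df["product"].str.replace("_", " ", regex=False)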
Data Chefs:
Let's now 'prepare' our data.
- pd.pivot_table(): Create a pivot table from a DataFrame.
- pd.melt(): Unpivot a DataFrame from wide to long format.
- pd.concat(): Concatenate DataFrames along rows or columns.
- pd.merge(): Merge two DataFrames based on common columns or indices.
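Here's a rough, self-contained sketch of how a few of these fit together (the tiny DataFrames and their columns are made up purely for illustration):
import pandas as pd

# Two small monthly extracts and a customer lookup table (made-up data)
jan = pd.DataFrame({"customer_id": [1, 2], "month": ["Jan", "Jan"], "revenue": [120.0, 80.0]})
feb = pd.DataFrame({"customer_id": [1, 3], "month": ["Feb", "Feb"], "revenue": [150.0, 60.0]})
customers = pd.DataFrame({"customer_id": [1, 2, 3], "region": ["East", "West", "East"]})

# Stack the monthly extracts on top of each other
combined = pd.concat([jan, feb], ignore_index=True)

# Attach each customer's region by joining on the shared key
enriched = pd.merge(combined, customers, on="customer_id", how="left")

# Summarize revenue per region and month
summary = pd.pivot_table(enriched, values="revenue", index="region", columns="month", aggfunc="sum")

# Unpivot the summary back into long format
long_format = summary.reset_index().melt(id_vars="region", var_name="month", value_name="revenue")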
Data Analysts:
Most of these are DataFrame methods, since the analysis is performed directly on our DataFrame.
- df.describe(): Generate descriptive statistics for numerical columns.
- df.groupby(): Group the DataFrame using a mapper or by a Series of columns.
- df.sort_values(): Sort the DataFrame by values along either axis.
- df.value_counts(): Get a quick overview of how frequently each value (or combination of values) occurs.
- df.nunique(): Count the number of distinct entries in a Series or in each column of a DataFrame.
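For instance, with the hypothetical 'region' and 'revenue' columns from earlier:
# Summary statistics for every numeric column
print(df.describe())

# Total revenue per region, largest first
revenue_by_region = df.groupby("region")["revenue"].sum().sort_values(ascending=False)

# How often each region appears, and how many distinct regions there are
print(df["region"].value_counts())
print(df["region"].nunique())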
Data Storytellers:
- Basic plotting: Use the built-in pandas plotting methods like .plot(), .plot.bar(), and .plot.hist() to create basic visualizations (scatter plots, bar charts, histograms) directly from the DataFrame.
- Customization: Customize your plots with arguments like kind, color, title, and xlabel/ylabel to enhance readability and clarity.
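A minimal sketch (the columns are the same assumptions as before, and matplotlib must be installed for the figures to display):
import matplotlib.pyplot as plt

# Bar chart of total revenue per region
df.groupby("region")["revenue"].sum().plot.bar(color="steelblue", title="Revenue by region",
                                               xlabel="Region", ylabel="Revenue")
plt.show()

# Histogram of prices
df["price"].plot(kind="hist", bins=20, title="Price distribution")
plt.show()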
For the nitty-gritty details, head over to the pandas documentation: https://pandas.pydata.org/