Data Analysis with Pandas: A Guide to Wrangling Data

Looking to jump into data analysis with Python? Well, look no further than pandas! Pandas is a powerful open-source Python library extensively used for data manipulation and analysis. It offers two primary data structures:

  • Series: A one-dimensional array-like object that can hold any data type (integers, strings, Python objects, etc.). It's like a single column from a spreadsheet.

  • DataFrame: A two-dimensional labeled data structure with rows and columns, similar to a spreadsheet. It allows you to store and manipulate data in a tabular format.
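
For instance, both structures can be built directly from plain Python objects (the column names and values here are invented for illustration):

```python
import pandas as pd

# A Series: one labeled column of data
s = pd.Series([10, 20, 30], name="scores")

# A DataFrame: a labeled table of rows and columns
df = pd.DataFrame({
    "name": ["Ada", "Ben", "Cleo"],
    "scores": [10, 20, 30],
})

print(s.iloc[0])   # first element of the Series
print(df.shape)    # (rows, columns) -> (3, 2)
```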

We'll primarily focus on DataFrame methods, but it's important to understand that pandas also provides functionalities for working with Series. DataFrame methods allow you to manipulate, analyze, and transform the data contained within a DataFrame.

We'll transform ourselves from lone data explorers into confident analysts with the help of these 'Pandas Helpers', exploring the methods used at each stage of the data analysis journey.

Let's dive right in:

Data Loaders:

These methods will equip us with essential tools to collect and organize our data.

  • pd.read_csv(file_path): Load data from a CSV file into a DataFrame.

  • pd.read_excel(file_path): Load data from an Excel file into a DataFrame.

  • pd.read_sql(query, connection): Load the results of a SQL query (or an entire table) from a database connection into a DataFrame.

  • pd.read_json(file_path): Load data from a JSON file into a DataFrame.

  • pd.read_html(url): Parse tables from an HTML page (a URL, file path, or raw HTML string) into a list of DataFrames.

Additional Techniques:

  • Handling Delimiters: By default, read_csv() uses a comma (,) as the delimiter. If your data uses a different delimiter (like semicolon ';'), you can specify it using the delimiter argument:
df = pd.read_csv(file_path, delimiter=";")
  • Specifying headers: By default, read_csv() assumes the first row contains column names. If your CSV doesn't have headers, use the header argument with None:
df = pd.read_csv(file_path, header=None)
  • Selecting data types: You can specify data types for columns while reading the CSV using the dtype argument:
df = pd.read_csv(file_path, dtype={'column_name': 'data_type'})
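
Here's a sketch combining these options, using an in-memory CSV via io.StringIO in place of a real file path (the column names and data are invented for illustration):

```python
import io

import pandas as pd

# Stand-in for a semicolon-delimited file with no header row
raw = io.StringIO("Ada;34\nBen;29\nCleo;41\n")

df = pd.read_csv(
    raw,
    delimiter=";",            # non-comma delimiter
    header=None,              # no header row in the data
    names=["name", "age"],    # supply column names ourselves
    dtype={"age": "int64"},   # force a specific dtype
)

print(df.dtypes)
```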

Now that we've turned our data into a manageable DataFrame, we can use some DataFrame-specific methods as well.

Data Detectives:

These will help us extract specific pieces of information from our data.

  • Indexing and Slicing:

    • df.loc[]: Select rows and columns by label.

    • df.iloc[]: Select rows and columns by integer position.

  • Conditional Selection:

    • df[condition]: Select rows based on a condition.

    • df.query(): Select rows using a query expression.

  • Column Selection:

    • df['column_name']: Select a single column by name.

    • df[['column1', 'column2']]: Select multiple columns by names.
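
Each of these selection styles can be seen side by side on a small made-up DataFrame (the city/temperature data is purely illustrative):

```python
import pandas as pd

df = pd.DataFrame(
    {"city": ["Oslo", "Lima", "Pune"], "temp": [4, 22, 31]},
    index=["a", "b", "c"],
)

by_label = df.loc["b", "city"]       # label-based lookup -> "Lima"
by_position = df.iloc[0, 1]          # position-based lookup -> 4
hot = df[df["temp"] > 20]            # boolean condition -> 2 rows
also_hot = df.query("temp > 20")     # same result via a query string
temps = df["temp"]                   # single column -> Series
subset = df[["city", "temp"]]        # multiple columns -> DataFrame
```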

Let's now clean our data.

Data Janitors:

  • Adding and Removing Columns:

    • df['new_column'] = ...: Add a new column to the DataFrame.

    • df.drop(columns=['column_name']): Remove one or more columns from the DataFrame.

  • Handling Missing Data:

    • df.dropna(): Drop rows with missing values.

    • df.fillna(): Fill missing values with a specified value.

  • Dealing With Duplicates:

    • df.duplicated(): Flag rows that are duplicates of earlier rows (the optional subset argument restricts the check to specific columns).

    • df.drop_duplicates(): Remove duplicate rows, with the same optional subset argument.

  • Data Type Conversions:

    • Checking data types: Use .dtypes to get an overview of the data types for each column.

    • Converting data types: Use methods like .astype() to convert columns to a specific data type (e.g., converting strings to numeric types).

  • String Manipulation:

    • Removing leading/trailing whitespaces: Use .str.strip() to remove leading and trailing whitespaces from string columns.

    • Converting text to lowercase/uppercase: Use .str.lower() or .str.upper() to convert text in string columns to lowercase or uppercase, respectively.

    • Replacing text patterns: Use .str.replace() to replace specific patterns within string columns with a new value.
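
A short cleaning pass might chain several of these together; the messy example data below is invented to show each step:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["  Ada ", "ben", "ben", None],
    "age": ["34", "29", "29", "41"],
})

# Add and remove a column
df["senior"] = False
df = df.drop(columns=["senior"])

# Fill missing values
df["name"] = df["name"].fillna("unknown")

# Remove duplicate rows
df = df.drop_duplicates()

# Convert dtypes: the age column arrived as strings
df["age"] = df["age"].astype("int64")

# Tidy up strings
df["name"] = df["name"].str.strip().str.lower()
```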

Data Chefs:

Let's now 'prepare' our data.

  • pd.pivot_table(): Create a pivot table from DataFrame.

  • pd.melt(): Unpivot a DataFrame from wide to long format.

  • pd.concat(): Concatenate DataFrames along rows or columns.

  • pd.merge(): Merge two DataFrames based on common columns or indices.
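
These four reshaping tools can be demonstrated on a tiny sales table (all names and numbers here are made up):

```python
import pandas as pd

sales = pd.DataFrame({
    "region": ["north", "north", "south", "south"],
    "quarter": ["Q1", "Q2", "Q1", "Q2"],
    "revenue": [100, 120, 80, 95],
})

# Wide pivot table: one row per region, one column per quarter
wide = pd.pivot_table(sales, values="revenue", index="region", columns="quarter")

# Back to long format
long = pd.melt(wide.reset_index(), id_vars="region", value_name="revenue")

# Stack two DataFrames on top of each other
doubled = pd.concat([sales, sales], ignore_index=True)

# Join against a lookup table on a shared column
managers = pd.DataFrame({"region": ["north", "south"], "manager": ["Ada", "Ben"]})
joined = pd.merge(sales, managers, on="region")
```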

Data Analysts:

The majority of these are DataFrame methods, since they are performed directly on DataFrames.

  • df.describe(): Generate descriptive statistics for numerical columns.

  • df.groupby(): Group DataFrame using a mapper or by a Series of columns.

  • df.sort_values(): Sort DataFrame by values along either axis.

  • df.value_counts(): Count how frequently each distinct value (or row combination) occurs.

  • df.nunique(): Count the number of distinct values in each column of a DataFrame (or in a Series).
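
Here's how these analysis methods look on a small invented dataset:

```python
import pandas as pd

df = pd.DataFrame({
    "team": ["red", "red", "blue", "blue", "blue"],
    "points": [3, 5, 2, 4, 4],
})

stats = df.describe()                  # count, mean, std, min, quartiles, max
totals = df.groupby("team")["points"].sum()
ranked = df.sort_values("points", ascending=False)
counts = df["team"].value_counts()     # how often each team appears
distinct = df["points"].nunique()      # number of distinct point values
```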

Data Storytellers:

  • Basic Plotting: pandas has built-in plotting methods like .plot(), .plot.bar(), and .plot.hist() for creating basic visualizations (line plots, bar charts, histograms) directly from a DataFrame.

  • Customization: Customize plots with arguments like kind, color, title, and xlabel/ylabel to improve readability and clarity.
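
A minimal sketch of a customized bar chart, assuming matplotlib is installed (pandas delegates plotting to it); the sales figures are invented, and the non-interactive "Agg" backend is used so no window is needed:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend: render to a file, not a window

import pandas as pd

df = pd.DataFrame({"month": ["Jan", "Feb", "Mar"], "sales": [120, 150, 90]})

# A basic bar chart straight from the DataFrame, with a few customizations
ax = df.plot.bar(
    x="month",
    y="sales",
    color="steelblue",
    title="Monthly Sales",
    xlabel="Month",
    ylabel="Sales",
)

ax.figure.savefig("monthly_sales.png")
```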

For the nitty-gritty details, head over to the pandas documentation: https://pandas.pydata.org/