How to perform exploratory data analysis in Python

Mr. Irshad, 1 day ago


Imagine yourself as a detective walking into a new crime scene. You wouldn’t jump straight to conclusions. You’d scan the area, look for clues, identify patterns, and talk to witnesses. Exploratory data analysis (EDA) is the detective work of data science. It’s the first and most important step before building advanced machine learning models or drawing strong conclusions. It’s about asking smart questions, testing assumptions, and letting the data lead the way.

This guide will be your magnifying glass and notebook. Whether you’re a beginner just stepping into data analysis or an experienced professional looking for a refresher, we’ll walk through the full EDA process using Python—the go-to language for data scientists. With powerful libraries like Pandas, Matplotlib, and Seaborn, you’ll learn how to slice, dice, and visualize data to reveal insights that matter.

What is Exploratory Data Analysis (EDA) and Why Bother?

EDA, a term coined by the brilliant American mathematician John Tukey, is an approach to analyzing datasets to summarize their main characteristics, often with visual methods. It’s not about formal hypothesis testing or building predictive models just yet. Instead, it’s about developing an intuition for your data.

So, why is it so important?

Spotting Errors and Anomalies: You'll quickly find mistakes, typos, or impossible values (like a human age of 200) that need cleaning.
Understanding Variables: You get a feel for the different features in your data, their distributions, and their data types.
Uncovering Relationships: You can identify potential relationships and correlations between variables, which can be goldmines for future modeling.
Guiding Feature Engineering: The insights from EDA help you create new, more meaningful features from your existing data.
Validating Assumptions: It helps you check if the assumptions for a particular statistical model (like linearity or normality) hold true for your dataset.
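To make the first point concrete, here is a minimal sketch of a range check that flags impossible ages. The records, the column names, and the 0 to 120 bounds are all made up for illustration; pick bounds that make sense for your own data.

```python
import pandas as pd

# Made-up records for illustration; 200 and -3 are deliberately impossible ages.
people = pd.DataFrame({
    "name": ["Ann", "Bob", "Cara", "Dev"],
    "age": [34, 200, 27, -3],
})

# Plausible range for a human age (an assumption, adjust for your data).
MIN_AGE, MAX_AGE = 0, 120

# Keep only the rows whose age falls outside the plausible range.
suspicious = people[~people["age"].between(MIN_AGE, MAX_AGE)]
print(suspicious)
```

Rows that land in suspicious are candidates for correction or removal, not automatic deletions. It is worth checking where they came from first.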

Skipping EDA is like trying to build a house without looking at the blueprints. You might end up with something, but it probably won't be stable, reliable, or what you intended.

The EDA Toolkit: Your Python Arsenal

Before we dive in, let's make sure our toolkit is ready. We’ll be relying on a few core Python libraries. If you don't have them installed, running pip install pandas numpy matplotlib seaborn will do the trick.

Pandas: The undisputed champion for data manipulation and analysis in Python. It provides data structures like the DataFrame, which is essentially a smart spreadsheet you can control with code.
NumPy: The foundational package for numerical computing in Python. Pandas is built on top of it, and it's essential for any mathematical operations.
Matplotlib & Seaborn: These are our visualization powerhouses. Matplotlib is the foundational library, offering immense control over every detail of a plot. Seaborn is built on top of Matplotlib and provides a high-level interface for drawing attractive and informative statistical graphics. It makes creating beautiful, complex plots much simpler.

Let's get started by importing them into our Python environment (typically a Jupyter Notebook).

Python

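# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Set some visual preferences for our plots
# (set_theme is the current name; sns.set is an older alias for it)
sns.set_theme(style="whitegrid")

# Jupyter-only magic: render plots inline (remove this line in a plain .py script)
%matplotlib inline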

For our investigation, we'll use the classic "Titanic" dataset, which Seaborn can load by name with sns.load_dataset. The file is downloaded from Seaborn's online example-data repository the first time you load it, so you need an internet connection. It contains information about passengers on the Titanic and whether they survived the disaster. It's a fantastic dataset for practicing EDA because it has a mix of numerical and categorical data, and some missing values to deal with.

Python

# Load the dataset
df = sns.load_dataset('titanic')

Step 1: The First Glance - Getting to Know Your Data

Just like meeting someone for the first time, our initial goal is to get a general impression of our dataset. We're not looking for deep insights yet, just the basic facts.

Peeking at the Data

The .head() and .tail() methods are perfect for this. They show you the first and last few rows (five by default), respectively. This helps you understand what the columns are and the kind of data they hold.

Python

# Display the first 5 rows
print(df.head())

This simple command immediately tells us about columns like survived, pclass (passenger class), sex, age, etc.

Getting the Big Picture with .info()

The .info() method provides a concise summary of the DataFrame. It’s one of the most useful commands in EDA.

Python

# Get a summary of the dataframe
df.info()

This output is packed with crucial information:

It tells us there are 891 entries (rows).
It lists each column name and the count of non-null (non-empty) values. Notice that age, deck, embarked, and embark_town have fewer than 891 non-null values, which means they have missing data. This is a critical finding!
It shows the Dtype (data type) for each column. We see integers (int64), floating-point numbers (float64), objects (object, which usually means strings), booleans (bool), and the category dtype (category).
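If you want the same facts as values you can work with, rather than printed text, here is a minimal sketch. It uses a tiny made-up frame called toy as a stand-in for df so that it runs without downloading anything.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values, standing in for a freshly loaded dataset.
toy = pd.DataFrame({
    "survived": [0, 1, 1, 0],
    "age": [22.0, np.nan, 26.0, 35.0],
    "sex": ["male", "female", "female", "male"],
})

n_rows, n_cols = toy.shape      # (rows, columns)
non_null = toy.notna().sum()    # non-null count per column, like .info()
dtypes = toy.dtypes             # data type of each column

print(f"{n_rows} rows x {n_cols} columns")
print(pd.DataFrame({"non_null": non_null, "dtype": dtypes}))
```

Any column whose non_null count is below n_rows has missing data, which is exactly what we read off the .info() output by eye.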

Summarizing with .describe()

For numerical columns, the .describe() method is a statistical powerhouse. It gives you a quick rundown of the central tendency, dispersion, and shape of the distribution of a dataset.

Python

# Get descriptive statistics for numerical columns
df.describe()

This reveals things like:

The average age (mean) of a passenger was about 29.7 years.
The youngest passenger was a baby of 0.42 years (about 5 months), and the oldest was 80.
The fare (fare) varied wildly, from 0 to over 512. The std (standard deviation) is high, and the 75th percentile (the 75% row) is much lower than the max, suggesting some very high, possibly outlier, fares.

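For categorical columns stored as strings, you can use .describe(include=['object']). Columns that use the category dtype, such as class and deck, are only included if you pass include=['object', 'category'].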

Python

# Get descriptive statistics for categorical columns
df.describe(include=['object'])

This tells us there were more male passengers than female, and most people (top) embarked from Southampton.

Step 2: The Cleanup Crew - Handling Missing Data

Our initial investigation revealed missing values. Dirty data can skew our analysis and mislead our models. It's time to clean it up.

First, let's get a clear count of missing values per column.

Python

# Check for missing values
print(df.isnull().sum())

This confirms that age, deck, embarked, and embark_town have missing values, with deck being the most problematic with 688 missing entries.
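Raw counts are easier to judge as a share of all rows. The sketch below shows one way to get both side by side. It uses a small made-up frame called toy, so the numbers are illustrative and not the Titanic figures.

```python
import numpy as np
import pandas as pd

# Made-up values; only the pattern of missing entries matters here.
toy = pd.DataFrame({
    "age": [22.0, np.nan, 26.0, np.nan],
    "deck": [np.nan, "C", np.nan, np.nan],
    "fare": [7.25, 71.28, 7.92, 8.05],
})

# isna() gives True/False per cell; sum() counts the Trues, mean() gives their share.
missing = pd.DataFrame({
    "n_missing": toy.isna().sum(),
    "pct_missing": toy.isna().mean() * 100,
}).sort_values("pct_missing", ascending=False)

print(missing)
```

Sorting by pct_missing puts the worst columns at the top, which helps when deciding between filling values and dropping a column.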

How we handle these depends on the context:

For age: We could fill the missing values with the mean or median age. Since age distribution might differ by gender or class, a smarter approach is to fill it with the median age of a related group (e.g., median age of males in 3rd class). For simplicity here, let's use the overall median.
For deck: With so many values missing, trying to guess them might introduce more noise than signal. The best option is often to drop the column entirely.
For embark_town and embarked: There are only two missing values. We can fill them with the most common embarkation port (the mode).
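The group-based fill mentioned for age can be sketched with groupby and transform. This is an illustration on made-up data, not the approach used in the rest of this guide. The age_filled column name is hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up passengers: two groups, each with one missing age.
toy = pd.DataFrame({
    "sex":    ["male", "male", "male", "female", "female", "female"],
    "pclass": [3, 3, 3, 1, 1, 1],
    "age":    [20.0, 30.0, np.nan, 40.0, np.nan, 50.0],
})

# Median age of each (sex, pclass) group, repeated on every row of that group.
group_median = toy.groupby(["sex", "pclass"])["age"].transform("median")

# Fill each missing age with the median of its own group.
toy["age_filled"] = toy["age"].fillna(group_median)

# Fallback: a group with no known ages has no median, so use the overall median.
toy["age_filled"] = toy["age_filled"].fillna(toy["age"].median())

print(toy)
```

Here the missing male 3rd-class age becomes 25 and the missing female 1st-class age becomes 45, instead of both getting one overall value.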

Python

# Handling missing values
# Fill 'age' with the median (assign the result back; calling
# fillna(inplace=True) on a single column is deprecated in recent pandas
# and may not modify df)
df['age'] = df['age'].fillna(df['age'].median())

# Drop the 'deck' column
df = df.drop(columns='deck')

# Fill 'embark_town' and 'embarked' with the mode
df['embark_town'] = df['embark_town'].fillna(df['embark_town'].mode()[0])
df['embarked'] = df['embarked'].fillna(df['embarked'].mode()[0])

# Verify that there are no more missing values
print(df.isnull().sum())

Success! Every column should now report zero missing values.

Step 3: Single-Variable Stories - Univariate Analysis

Now that our data is clean, we can start analyzing variables one by one. This is called univariate analysis. The goal is to understand the distribution of each variable.

Analyzing Categorical Variables

For categorical columns like survived, pclass, and sex, we can use count plots to see the frequency of each category.

Python

# Univariate analysis of the 'survived' column
sns.countplot(x='survived', data=df)
plt.title('Survival Count (0 = No, 1 = Yes)')
plt.show()

This plot quickly shows us that more people died than survived.

Python

# Univariate analysis of passenger class
sns.countplot(x='pclass', data=df)
plt.title('Passenger Class Distribution')
plt.show()

This reveals that the majority of passengers were in the 3rd class.
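A count plot is a picture of a frequency table, and sometimes you want the numbers themselves. Here is a minimal sketch using value_counts on a made-up pclass column (the eight values are illustrative, not the real passenger list).

```python
import pandas as pd

# Made-up class labels for eight passengers.
pclass = pd.Series([3, 1, 3, 2, 3, 1, 3, 3], name="pclass")

counts = pclass.value_counts()                 # rows per class, most common first
shares = pclass.value_counts(normalize=True)   # same, as a fraction of all rows

print(counts)
print(shares)
```

On the real data you would call the same methods on df['pclass']. The normalize=True version makes a claim like "the majority" easy to check against 0.5.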

Analyzing Numerical Variables

For numerical columns like age and fare, histograms and box plots are excellent tools. A histogram shows the frequency distribution of the data.

Python

# Univariate analysis of 'age'
sns.histplot(df['age'], bins=30, kde=True)  # kde adds a smooth density line
plt.title('Age Distribution of Passengers')
plt.show()

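The age distribution seems to be skewed towards younger adults, with a peak between 20 and 30 years old. Keep in mind that part of that peak is artificial: in Step 2 we filled every missing age with the median, so all of those passengers now sit in a single bin.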

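A box plot is built around the five-number summary: minimum, first quartile (Q1), median, third quartile (Q3), and maximum. By default, Seaborn draws the whiskers out to the farthest points within 1.5 times the interquartile range (Q3 - Q1) of the box and plots anything beyond them as individual points, which makes it fantastic for spotting outliers.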

Python

# Univariate analysis of 'fare'
sns.boxplot(x=df['fare'])
plt.title('Fare Distribution')
plt.show()

This box plot for fare clearly shows a large number of outliers on the higher end, confirming our suspicion from the .describe() output.
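The points a box plot draws as outliers come from the 1.5 x IQR rule, which you can apply directly. Below is a minimal sketch on nine made-up fares. The variable names are illustrative, and 1.5 is the conventional default, not a law.

```python
import pandas as pd

# Made-up fares: eight ordinary values and one extreme one.
fare = pd.Series([7.0, 8.0, 8.0, 10.0, 13.0, 15.0, 26.0, 30.0, 500.0], name="fare")

# Quartiles and interquartile range.
q1, q3 = fare.quantile([0.25, 0.75])
iqr = q3 - q1

# Fences: anything outside [lower, upper] is flagged as a potential outlier.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = fare[(fare < lower) | (fare > upper)]

print(f"Q1={q1}, Q3={q3}, IQR={iqr}, fences=({lower}, {upper})")
print(outliers)
```

Flagged values are not necessarily errors. A very expensive ticket can be perfectly real, so treat the rule as a prompt to look closer.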

Step 4: Data in Dialogue - Bivariate Analysis

This is where the real detective work begins. Bivariate analysis is about exploring the relationship between two variables. We're looking for connections and correlations.

Numerical vs. Categorical

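How does a numerical variable change across different categories? For example, did the survival rate depend on the passenger class? Because survived is coded as 0 or 1, its mean within each class is that class's survival rate, and sns.barplot plots the mean of y for each category by default. The thin lines on top of the bars are error bars showing the uncertainty of that mean.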

Python

# Survival rate by passenger class
sns.barplot(x='pclass', y='survived', data=df)
plt.title('Survival Rate by Passenger Class')
plt.ylabel('Survival Rate')
plt.show()

Insight: A stark and powerful story. Passengers in 1st class had a much higher survival rate than those in 2nd, and 2nd class passengers had a higher rate than those in 3rd. This is a major clue!
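The bar heights in a plot like this are just group means, so you can get the same numbers with a groupby. Here is a minimal sketch on ten made-up passengers. The rates it prints belong to the toy data, not the real Titanic.

```python
import pandas as pd

# Made-up passengers; survived is 1 for yes and 0 for no.
toy = pd.DataFrame({
    "pclass":   [1, 1, 1, 1, 2, 2, 3, 3, 3, 3],
    "survived": [1, 1, 1, 0, 1, 0, 0, 0, 1, 0],
})

# Mean of a 0/1 column is the share of 1s; count shows how many rows back each rate.
rate = toy.groupby("pclass")["survived"].agg(["mean", "count"])

print(rate)
```

Keeping the count next to the mean is a useful habit, since a rate based on two passengers deserves less trust than one based on two hundred.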

What about the relationship between age and survival? A box plot is great here.

Python

# Age distribution by survival status
sns.boxplot(x='survived', y='age', data=df)
plt.title('Age Distribution by Survival Status')
plt.show()

This suggests that survivors skewed slightly younger, though the two distributions overlap heavily, so age on its own looks like a weak signal.

Categorical vs. Categorical

What about the relationship between two categorical variables, like sex and survival?

Python

# Survival rate by sex
sns.barplot(x='sex', y='survived', data=df)
plt.title('Survival Rate by Sex')
plt.ylabel('Survival Rate')
plt.show()

Insight: Another incredibly strong finding. Female passengers had a drastically higher survival rate than male passengers. The "women and children first" protocol appears to be reflected in the data.
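For two categorical variables, a cross-tabulation gives the table behind the chart. The sketch below uses pd.crosstab on eight made-up passengers. Its output describes the toy data only.

```python
import pandas as pd

# Made-up passengers for illustration.
toy = pd.DataFrame({
    "sex":      ["female", "female", "female", "female", "male", "male", "male", "male"],
    "survived": [1, 1, 1, 0, 0, 0, 1, 0],
})

# Raw counts for every (sex, survived) combination.
counts = pd.crosstab(toy["sex"], toy["survived"])

# normalize="index" turns each row into shares that sum to 1.
row_share = pd.crosstab(toy["sex"], toy["survived"], normalize="index")

print(counts)
print(row_share)
```

The column labelled 1 in row_share is the survival rate per sex, the same quantity the bar plot displays.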

Numerical vs. Numerical

To see how two numerical variables relate, a scatter plot is the go-to choice. Let's see if there's a relationship between a passenger's age and the fare they paid.

Python

# Age vs. Fare
sns.scatterplot(x='age', y='fare', data=df)
plt.title('Age vs. Fare Paid')
plt.show()

The plot doesn't show a strong linear relationship. Most people, regardless of age, paid lower fares, but there are some older passengers who paid very high fares.

To analyze all numerical relationships at once, we can compute a correlation matrix and visualize it as a heatmap.

Python

# Correlation heatmap for numerical variables
numeric_df = df.select_dtypes(include=np.number)
plt.figure(figsize=(10, 7))
sns.heatmap(numeric_df.corr(), annot=True, cmap='coolwarm')
plt.title('Correlation Heatmap')
plt.show()

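The heatmap shows the correlation coefficient between pairs of variables (Pearson's, which is the default for .corr()). Values close to 1 (or -1) indicate a strong positive (or negative) linear relationship, and values near 0 indicate little linear relationship. We see a moderate positive correlation between fare and survival and a negative one between pclass and survival. Because pclass counts up from 1 (first class) to 3 (third class), that negative sign means passengers in higher classes survived more often, which reinforces our earlier findings.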
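When you mostly care about one target column, it can be easier to pull that single column out of the correlation matrix and sort it. Here is a minimal sketch on six made-up rows. The with_target name is illustrative.

```python
import pandas as pd

# Made-up numeric data with the same column names as the Titanic set.
toy = pd.DataFrame({
    "survived": [0, 0, 1, 1, 0, 1],
    "pclass":   [3, 3, 1, 1, 2, 2],
    "fare":     [7.0, 8.0, 80.0, 60.0, 13.0, 26.0],
})

# Full Pearson correlation matrix.
corr = toy.corr()

# Correlation of every other column with the target, from most negative to most positive.
with_target = corr["survived"].drop("survived").sort_values()

print(with_target)
```

In this toy data pclass comes out negative and fare positive, the same signs the heatmap shows for the real dataset.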

Step 5: Putting It All Together - Drawing Conclusions

After this deep dive, what have we learned?

Survival was not random. It was heavily influenced by other factors.
Social Class was a major factor: First-class passengers had the best chance of survival.
Gender was critical: Being female dramatically increased the chances of survival.
Fare and Class are linked: Passengers in higher classes paid significantly higher fares.
Age played a role: Younger people had a slightly better chance of survival.

EDA is an iterative cycle. Each discovery leads to new questions. For instance, did being a female in 3rd class give you a better chance than being a male in 1st class? You can continue to slice and dice the data to answer these more complex, or multivariate, questions. A tool like sns.pairplot(df) can even be used to visualize relationships across all variables at once, though it can be overwhelming for datasets with many features.
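A question that combines sex and class, like the one above, can be approached with a pivot table. The sketch below runs on eight made-up passengers, so the rates it prints say nothing about the real Titanic. Swap in df to get the actual answer.

```python
import pandas as pd

# Made-up passengers: two per (sex, pclass) combination.
toy = pd.DataFrame({
    "sex":      ["female", "female", "female", "female", "male", "male", "male", "male"],
    "pclass":   [1, 1, 3, 3, 1, 1, 3, 3],
    "survived": [1, 1, 1, 0, 1, 0, 0, 0],
})

# Rows are sex, columns are class, each cell is the mean of survived (the survival rate).
rates = toy.pivot_table(index="sex", columns="pclass", values="survived", aggfunc="mean")

print(rates)
```

Each cell answers one "what about this subgroup" question, and sns.heatmap(rates, annot=True) would turn the table into a picture if you prefer one.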

Final Thoughts

Exploratory Data Analysis (EDA) is the starting point for every successful data project. It’s a creative, curious process that turns raw numbers into meaningful stories. With Python libraries like Pandas, Matplotlib, and Seaborn, you can clean, analyze, and visualize your data in a structured way—helping you uncover insights that might otherwise stay hidden.

The techniques you’ve seen here are core skills for anyone in data science. If you’re ready to move beyond the basics and tackle more complex, real-world challenges, it’s important to strengthen your foundation. A structured learning path can accelerate that growth. Enrolling in Uncodemy's comprehensive Data Science using Python course in Noida can give you hands-on experience, expert guidance, and the confidence to apply these concepts to industry-level projects.

So, grab a dataset, open up the Jupyter Notebook, and start exploring. The stories are already in the data—you just need to ask the right questions.