Pandas DataFrames: Data Manipulation and Analysis
Welcome to Pandas DataFrames! Think of pandas as Excel on steroids - it can handle massive datasets, perform complex operations, and analyze data in ways Excel could never dream of.
Why Pandas Matters
Pandas makes data analysis feel natural:
import pandas as pd
# Load data (like opening a spreadsheet)
df = pd.read_csv('sales_data.csv')
# Filter data (like Excel filters)
high_sales = df[df['revenue'] > 1000]
# Group and summarize (like pivot tables)
sales_by_region = df.groupby('region')['revenue'].sum()
# Add calculations (like Excel formulas)
df['profit_margin'] = df['profit'] / df['revenue'] * 100
# Handle missing data automatically
df = df.dropna() # Remove rows with missing values
Series: 1D Data
Before DataFrames, let's understand Series:
import pandas as pd
import numpy as np
# Create a Series from a list
sales = pd.Series([100, 200, 150, 300, 250])
print(sales)
# 0 100
# 1 200
# 2 150
# 3 300
# 4 250
# dtype: int64
# Series with custom index
sales = pd.Series([100, 200, 150, 300, 250],
index=['Mon', 'Tue', 'Wed', 'Thu', 'Fri'])
print(sales)
# Mon 100
# Tue 200
# Wed 150
# Thu 300
# Fri 250
# dtype: int64
# Access elements
print(sales['Mon']) # 100
print(sales.iloc[0]) # 100 (explicit positional; plain sales[0] is deprecated on a labeled Series)
print(sales.loc['Mon']) # 100 (explicit label)
# Series operations
print(sales * 1.1) # 10% increase
print(sales > 200) # Boolean mask
print(sales[sales > 200]) # Filter
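One more Series behavior worth knowing before moving to DataFrames: arithmetic between two Series aligns on index labels, not position. A minimal sketch:

```python
import pandas as pd

# Series arithmetic aligns on index labels, not position
q1 = pd.Series([100, 200], index=['Mon', 'Tue'])
q2 = pd.Series([50, 75], index=['Tue', 'Wed'])

# Labels present in only one Series produce NaN
combined = q1 + q2

# add() with fill_value treats a missing label as 0 instead
total = q1.add(q2, fill_value=0)
```

This alignment is also why adding two DataFrame columns "just works": matching row labels are paired up automatically.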
DataFrame Creation
From Dictionaries
# Create DataFrame from dictionary
data = {
'Name': ['Alice', 'Bob', 'Charlie', 'Diana'],
'Age': [25, 30, 35, 28],
'City': ['New York', 'London', 'Paris', 'Tokyo'],
'Salary': [50000, 60000, 70000, 55000]
}
df = pd.DataFrame(data)
print(df)
# Name Age City Salary
# 0 Alice 25 New York 50000
# 1 Bob 30 London 60000
# 2 Charlie 35 Paris 70000
# 3 Diana 28 Tokyo 55000
From Lists
# From list of lists
data = [
['Alice', 25, 'New York', 50000],
['Bob', 30, 'London', 60000],
['Charlie', 35, 'Paris', 70000],
['Diana', 28, 'Tokyo', 55000]
]
df = pd.DataFrame(data, columns=['Name', 'Age', 'City', 'Salary'])
print(df)
From NumPy Arrays
# From NumPy array
arr = np.random.rand(4, 3)
df = pd.DataFrame(arr, columns=['A', 'B', 'C'])
print(df)
From CSV Files
# Read CSV file
df = pd.read_csv('data.csv')
# Read with options
df = pd.read_csv('data.csv',
sep=',', # delimiter
header=0, # header row
index_col=0, # index column
parse_dates=['date']) # parse dates
# Read Excel files
df = pd.read_excel('data.xlsx', sheet_name='Sheet1')
# Read JSON
df = pd.read_json('data.json')
DataFrame Exploration
Basic Information
df = pd.DataFrame({
'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, 30, 35],
'Salary': [50000, 60000, 70000]
})
print("Shape:", df.shape) # (3, 3) - rows, columns
print("Columns:", df.columns) # Index(['Name', 'Age', 'Salary'])
print("Index:", df.index) # RangeIndex(start=0, stop=3)
print("Data types:")
print(df.dtypes)
# Name object
# Age int64
# Salary int64
# dtype: object
# Summary statistics
print(df.describe())
# Age Salary
# count 3.000000 3.000000
# mean 30.000000 60000.000000
# std 5.000000 10000.000000
# min 25.000000 50000.000000
# 25% 27.500000 55000.000000
# 50% 30.000000 60000.000000
# 75% 32.500000 65000.000000
# max 35.000000 70000.000000
# First few rows
print(df.head()) # First 5 rows
print(df.head(2)) # First 2 rows
# Last few rows
print(df.tail()) # Last 5 rows
print(df.tail(2)) # Last 2 rows
# Random sample
print(df.sample(2)) # Random 2 rows
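Two more exploration calls worth adding to the habit: info() gives a one-call overview, and memory_usage() shows where the bytes go (useful before converting columns to category).

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'Salary': [50000, 60000, 70000]
})

# info() prints dtypes, non-null counts, and memory usage in one call
df.info()

# memory_usage(deep=True) reports actual bytes, including string contents
mem = df.memory_usage(deep=True)
```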
Indexing and Selection
Column Selection
df = pd.DataFrame({
'Name': ['Alice', 'Bob', 'Charlie', 'Diana'],
'Age': [25, 30, 35, 28],
'City': ['NYC', 'London', 'Paris', 'Tokyo'],
'Salary': [50000, 60000, 70000, 55000]
})
# Single column (returns Series)
names = df['Name']
ages = df.Age # Attribute syntax (only works for names that are valid Python identifiers)
# Multiple columns (returns DataFrame)
subset = df[['Name', 'Age']]
# Column operations
df['Salary_2024'] = df['Salary'] * 1.05 # Add 5% raise
df['Age_Group'] = pd.cut(df['Age'], bins=[0, 30, 40, 100],
labels=['Young', 'Middle', 'Senior'])
Row Selection
# By position (iloc)
first_row = df.iloc[0] # First row
first_two = df.iloc[0:2] # First two rows
specific_rows = df.iloc[[0, 2]] # Rows 0 and 2
# By label (loc)
df.index = ['A', 'B', 'C', 'D'] # Set custom index
row_B = df.loc['B'] # Row with index 'B'
rows_BC = df.loc['B':'C'] # Rows B through C
# Boolean indexing
young_people = df[df['Age'] < 30]
high_earners = df[df['Salary'] > 55000]
nyc_residents = df[df['City'] == 'NYC']
# Multiple conditions
young_nyc = df[(df['Age'] < 30) & (df['City'] == 'NYC')]
Advanced Selection
# isin() - check membership
european_cities = df[df['City'].isin(['London', 'Paris'])]
print(european_cities)
# between() - range check
mid_salary = df[df['Salary'].between(55000, 65000)]
print(mid_salary)
# query() - SQL-like syntax
result = df.query('Age < 30 and Salary > 50000')
print(result)
# nlargest/nsmallest
top_earners = df.nlargest(2, 'Salary')
youngest = df.nsmallest(2, 'Age')
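When you need just one cell rather than a slice, .at (label-based) and .iat (position-based) return a scalar directly and are faster than .loc/.iloc. A small sketch:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob'],
    'Salary': [50000, 60000]
})

# .at (label) and .iat (position) fetch a single scalar value;
# faster than .loc/.iloc when you only need one cell
by_label = df.at[0, 'Salary']
by_position = df.iat[1, 1]
```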
Data Cleaning
Handling Missing Values
# Create DataFrame with missing values
df = pd.DataFrame({
'A': [1, 2, np.nan, 4],
'B': [5, np.nan, 7, 8],
'C': [9, 10, 11, np.nan]
})
print("Missing values:")
print(df.isnull().sum())
# A 1
# B 1
# C 1
# dtype: int64
# Drop rows with any missing values
df_clean = df.dropna()
# Drop rows with all missing values
df_clean = df.dropna(how='all')
# Drop columns with missing values
df_clean = df.dropna(axis=1)
# Fill missing values
df_filled = df.fillna(0) # Fill with 0
df_filled = df.fillna(df.mean()) # Fill with column means
df_filled = df.ffill() # Forward fill (fillna(method='ffill') is deprecated)
df_filled = df.bfill() # Backward fill
# Fill specific columns
df['A'] = df['A'].fillna(df['A'].median())
df['B'] = df['B'].fillna(df['B'].mean()) # Fill with column mean
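For ordered data such as time series, interpolate() is often a better fill than a constant: it estimates each gap from its neighbors (linear by default).

```python
import pandas as pd
import numpy as np

s = pd.Series([1.0, np.nan, 3.0, np.nan, 5.0])

# interpolate() estimates gaps from neighboring values (linear by default)
filled = s.interpolate()
```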
Data Type Conversion
df = pd.DataFrame({
'Age': ['25', '30', '35'],
'Salary': ['50000', '60000', '70000'],
'Date': ['2023-01-01', '2023-01-02', '2023-01-03']
})
# Convert to numeric
df['Age'] = pd.to_numeric(df['Age'])
df['Salary'] = pd.to_numeric(df['Salary'])
# Convert to datetime
df['Date'] = pd.to_datetime(df['Date'])
# Convert to categorical
df['Category'] = pd.Categorical(['A', 'B', 'A'])
print(df.dtypes)
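Real-world numeric columns are rarely perfectly clean; pd.to_numeric raises on the first bad value unless you tell it otherwise. errors='coerce' is the usual escape hatch:

```python
import pandas as pd

raw = pd.Series(['25', '30', 'n/a'])

# errors='coerce' converts unparseable strings to NaN instead of raising
ages = pd.to_numeric(raw, errors='coerce')
```

After coercing, check isna().sum() to see how many values failed to parse before filling or dropping them.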
Removing Duplicates
df = pd.DataFrame({
'Name': ['Alice', 'Bob', 'Alice', 'Charlie'],
'Age': [25, 30, 25, 35]
})
print("Duplicates:")
print(df.duplicated())
# 0 False
# 1 False
# 2 True
# 3 False
# dtype: bool
# Remove duplicates
df_unique = df.drop_duplicates()
# Remove duplicates based on specific columns
df_unique = df.drop_duplicates(subset=['Name'])
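By default drop_duplicates() keeps the first occurrence; the keep parameter controls which copy survives. A quick sketch of the other options:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Alice'],
    'Age': [25, 30, 25]
})

# keep='last' retains the final occurrence of each duplicate group
last_kept = df.drop_duplicates(keep='last')

# keep=False drops every row that has a duplicate anywhere
all_dropped = df.drop_duplicates(keep=False)
```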
Data Transformation
Adding Columns
df = pd.DataFrame({
'Name': ['Alice', 'Bob', 'Charlie'],
'Math': [85, 92, 78],
'Science': [88, 85, 90],
'English': [92, 88, 85]
})
# Calculate total and average
df['Total'] = df['Math'] + df['Science'] + df['English']
df['Average'] = df[['Math', 'Science', 'English']].mean(axis=1)
# Grade based on average
def assign_grade(avg):
if avg >= 90: return 'A'
elif avg >= 80: return 'B'
elif avg >= 70: return 'C'
else: return 'F'
df['Grade'] = df['Average'].apply(assign_grade)
# Percentage scores
df['Math_Pct'] = df['Math'] / df['Total'] * 100
String Operations
df = pd.DataFrame({
'Name': ['alice smith', 'BOB JOHNSON', 'Charlie Brown'],
'Email': ['alice@example.com', 'bob@test.com', 'charlie@demo.com']
})
# String methods
df['Name'] = df['Name'].str.title() # Title case
df['First_Name'] = df['Name'].str.split().str[0] # First name
df['Last_Name'] = df['Name'].str.split().str[1] # Last name
df['Name_Length'] = df['Name'].str.len() # String length
# Email domain
df['Domain'] = df['Email'].str.split('@').str[1]
# Check patterns
df['Is_Gmail'] = df['Email'].str.contains('gmail')
df['Valid_Email'] = df['Email'].str.match(r'^[\w\.-]+@[\w\.-]+\.\w+$')
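Beyond split and contains, str.extract() pulls regex capture groups straight into columns, which is often tidier than chained splits. A small sketch using named groups:

```python
import pandas as pd

emails = pd.Series(['alice@example.com', 'bob@test.com'])

# extract() pulls named regex capture groups into DataFrame columns
parts = emails.str.extract(r'(?P<user>[^@]+)@(?P<domain>.+)')
```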
Date Operations
df = pd.DataFrame({
'Date': pd.date_range('2023-01-01', periods=5, freq='D'),
'Sales': [100, 200, 150, 300, 250]
})
# Extract date components
df['Year'] = df['Date'].dt.year
df['Month'] = df['Date'].dt.month
df['Day'] = df['Date'].dt.day
df['Weekday'] = df['Date'].dt.day_name()
# Date arithmetic
df['Next_Week'] = df['Date'] + pd.Timedelta(days=7)
df['Days_Old'] = (pd.Timestamp.now() - df['Date']).dt.days
# Resampling (if Date is index)
df.set_index('Date', inplace=True)
weekly_sales = df.resample('W').sum() # Weekly totals
monthly_sales = df.resample('M').mean() # Monthly averages ('ME' in pandas >= 2.2)
Grouping and Aggregation
Basic GroupBy
df = pd.DataFrame({
'Department': ['Sales', 'Sales', 'HR', 'HR', 'IT', 'IT', 'IT'],
'Employee': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve', 'Frank', 'Grace'],
'Salary': [50000, 55000, 45000, 48000, 60000, 65000, 70000],
'Years_Exp': [3, 5, 2, 4, 6, 8, 10]
})
# Group by department
grouped = df.groupby('Department')
# Aggregate functions
print(grouped['Salary'].mean())
# Department
# HR 46500.0
# IT 65000.0
# Sales 52500.0
# Name: Salary, dtype: float64
print(grouped['Salary'].agg(['count', 'mean', 'min', 'max']))
# count mean min max
# Department
# HR 2 46500.0 45000 48000
# IT 3 65000.0 60000 70000
# Sales 2 52500.0 50000 55000
# Multiple aggregations
result = grouped.agg({
'Salary': ['mean', 'std'],
'Years_Exp': ['mean', 'max']
})
print(result)
Advanced Grouping
# Multiple groupby columns
df['Seniority'] = pd.cut(df['Years_Exp'], bins=[0, 3, 6, 20],
labels=['Junior', 'Mid', 'Senior'])
grouped_multi = df.groupby(['Department', 'Seniority'])
result = grouped_multi['Salary'].mean()
print(result)
# Department Seniority
# HR Junior 45000.0
# Mid 48000.0
# IT Mid 60000.0
# Senior 67500.0
# Sales Junior 50000.0
# Mid 55000.0
# Name: Salary, dtype: float64
# Custom aggregation functions
def salary_range(x):
return x.max() - x.min()
result = grouped.agg({
'Salary': ['mean', salary_range],
'Employee': 'count'
})
print(result)
Pivot Tables
# Create pivot table
pivot = df.pivot_table(
values='Salary',
index='Department',
columns='Seniority',
aggfunc='mean',
fill_value=0
)
print(pivot)
# Seniority Junior Mid Senior
# Department
# HR 45000.0 48000.0 0.0
# IT 0.0 60000.0 67500.0
# Sales 50000.0 55000.0 0.0
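pivot_table can also append row and column totals with margins=True, the equivalent of Excel's "Grand Total" row. A minimal sketch on a cut-down version of the department data:

```python
import pandas as pd

df = pd.DataFrame({
    'Department': ['Sales', 'Sales', 'HR', 'HR'],
    'Seniority': ['Junior', 'Mid', 'Junior', 'Mid'],
    'Salary': [50000, 55000, 45000, 48000]
})

# margins=True appends totals (here: means) under margins_name
pivot = df.pivot_table(values='Salary', index='Department',
                       columns='Seniority', aggfunc='mean',
                       margins=True, margins_name='Overall')
```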
Merging and Joining
Concatenation
df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [5, 6], 'B': [7, 8]})
# Vertical concatenation (add rows)
result = pd.concat([df1, df2])
print(result)
# A B
# 0 1 3
# 1 2 4
# 0 5 7
# 1 6 8
# Horizontal concatenation (add columns)
df3 = pd.DataFrame({'C': [9, 10]})
result = pd.concat([df1, df3], axis=1)
print(result)
# A B C
# 0 1 3 9
# 1 2 4 10
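Note the duplicated index labels (0, 1, 0, 1) in the vertical concat above; they can cause surprises in later lookups. ignore_index=True rebuilds a clean index:

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [5, 6], 'B': [7, 8]})

# ignore_index=True rebuilds a clean 0..n-1 index instead of
# keeping the duplicated labels from the inputs
result = pd.concat([df1, df2], ignore_index=True)
```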
Merging (Joins)
employees = pd.DataFrame({
'emp_id': [1, 2, 3, 4],
'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
'dept_id': [10, 20, 10, 30]
})
departments = pd.DataFrame({
'dept_id': [10, 20, 30],
'dept_name': ['Sales', 'HR', 'IT']
})
# Inner join (default)
result = pd.merge(employees, departments, on='dept_id')
print(result)
# emp_id name dept_id dept_name
# 0 1 Alice 10 Sales
# 1 2 Bob 20 HR
# 2 3 Charlie 10 Sales
# 3 4 Diana 30 IT
# Left join
result = pd.merge(employees, departments, on='dept_id', how='left')
# Right join
result = pd.merge(employees, departments, on='dept_id', how='right')
# Outer join
result = pd.merge(employees, departments, on='dept_id', how='outer')
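When debugging a join, indicator=True adds a _merge column recording where each row came from, which quickly exposes unmatched keys. A sketch with a deliberately unmatched department:

```python
import pandas as pd

employees = pd.DataFrame({'emp_id': [1, 2], 'dept_id': [10, 99]})
departments = pd.DataFrame({'dept_id': [10, 20], 'dept_name': ['Sales', 'HR']})

# indicator=True labels each row 'both', 'left_only', or 'right_only'
audit = pd.merge(employees, departments, on='dept_id',
                 how='outer', indicator=True)
```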
Practical Examples
Example 1: Sales Analysis
# Sample sales data
sales_data = {
'Date': pd.date_range('2023-01-01', periods=100, freq='D'),
'Product': np.random.choice(['A', 'B', 'C'], 100),
'Region': np.random.choice(['North', 'South', 'East', 'West'], 100),
'Sales': np.random.randint(100, 1000, 100)
}
df = pd.DataFrame(sales_data)
# Basic analysis
print("Total sales:", df['Sales'].sum())
print("Average sale:", df['Sales'].mean())
print("Best day:", df.loc[df['Sales'].idxmax(), 'Date'])
# Sales by product
product_sales = df.groupby('Product')['Sales'].agg(['sum', 'mean', 'count'])
print("\nSales by product:")
print(product_sales)
# Sales by region and product
pivot_table = df.pivot_table(
values='Sales',
index='Region',
columns='Product',
aggfunc='sum',
fill_value=0
)
print("\nSales by region and product:")
print(pivot_table)
# Monthly trends
df['Month'] = df['Date'].dt.month
monthly_sales = df.groupby('Month')['Sales'].sum()
print("\nMonthly sales:")
print(monthly_sales)
# Top performing regions
region_performance = df.groupby('Region')['Sales'].sum().sort_values(ascending=False)
print("\nRegion performance:")
print(region_performance)
Example 2: Student Grade Analysis
# Student data
students = pd.DataFrame({
'student_id': range(1, 21),
'name': [f'Student_{i}' for i in range(1, 21)],
'class': np.random.choice(['A', 'B', 'C'], 20),
'math': np.random.randint(50, 100, 20),
'science': np.random.randint(50, 100, 20),
'english': np.random.randint(50, 100, 20)
})
# Calculate averages
students['average'] = students[['math', 'science', 'english']].mean(axis=1)
# Assign grades
def get_grade(score):
if score >= 90: return 'A'
elif score >= 80: return 'B'
elif score >= 70: return 'C'
elif score >= 60: return 'D'
else: return 'F'
students['grade'] = students['average'].apply(get_grade)
# Class performance
class_stats = students.groupby('class').agg({
'math': 'mean',
'science': 'mean',
'english': 'mean',
'average': ['mean', 'std'],
'grade': lambda x: x.value_counts().index[0] # Most common grade
})
print("Class performance:")
print(class_stats)
# Identify struggling students
struggling = students[students['average'] < 70]
print(f"\nStruggling students ({len(struggling)}):")
print(struggling[['name', 'class', 'average', 'grade']])
# Subject-wise analysis
subject_means = students[['math', 'science', 'english']].mean()
print("\nSubject averages:")
print(subject_means)
# Correlation between subjects
correlation = students[['math', 'science', 'english']].corr()
print("\nSubject correlations:")
print(correlation)
Example 3: E-commerce Data Analysis
# Sample e-commerce data
orders = pd.DataFrame({
'order_id': range(1, 1001),
'customer_id': np.random.randint(1, 201, 1000),
'product_id': np.random.randint(1, 51, 1000),
'quantity': np.random.randint(1, 5, 1000),
'price': np.random.uniform(10, 500, 1000),
'order_date': pd.date_range('2023-01-01', periods=1000, freq='h') # lowercase 'h'; uppercase 'H' is deprecated
})
# Calculate order totals
orders['total'] = orders['quantity'] * orders['price']
# Revenue analysis
total_revenue = orders['total'].sum()
avg_order_value = orders['total'].mean()
print(f"Total revenue: ${total_revenue:,.2f}")
print(f"Average order value: ${avg_order_value:.2f}")
# Top products by revenue
product_revenue = orders.groupby('product_id')['total'].sum().sort_values(ascending=False)
print("\nTop 5 products by revenue:")
print(product_revenue.head())
# Customer analysis
customer_orders = orders.groupby('customer_id').agg({
'order_id': 'count',
'total': 'sum'
}).rename(columns={'order_id': 'order_count', 'total': 'total_spent'})
top_customers = customer_orders.sort_values('total_spent', ascending=False)
print("\nTop 5 customers:")
print(top_customers.head())
# Time-based analysis
orders['month'] = orders['order_date'].dt.month
monthly_revenue = orders.groupby('month')['total'].sum()
print("\nMonthly revenue:")
print(monthly_revenue)
# Customer segmentation
def segment_customer(total_spent):
if total_spent > 2000: return 'VIP'
elif total_spent > 1000: return 'Gold'
elif total_spent > 500: return 'Silver'
else: return 'Bronze'
customer_orders['segment'] = customer_orders['total_spent'].apply(segment_customer)
segment_counts = customer_orders['segment'].value_counts()
print("\nCustomer segments:")
print(segment_counts)
Best Practices
1. Chain Operations
# Method chaining for readable code
result = (df
.query('age > 18')
.assign(adult=True)
.groupby('department')
.agg({'salary': 'mean'})
.reset_index()
.sort_values('salary', ascending=False)
)
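Your own functions can join a chain too, via pipe(). The add_bonus helper below is a made-up example to show the pattern:

```python
import pandas as pd

def add_bonus(frame, pct):
    # Hypothetical helper: adds a bonus column as a fraction of salary
    return frame.assign(bonus=frame['salary'] * pct)

df = pd.DataFrame({'salary': [50000, 60000]})

# pipe() lets custom functions slot into a method chain
result = df.pipe(add_bonus, pct=0.1)
```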
2. Use Appropriate Data Types
# Convert to category for memory efficiency
df['category'] = df['category'].astype('category')
# Use datetime for date operations
df['date'] = pd.to_datetime(df['date'])
3. Handle Missing Data Properly
# Don't just drop missing data - understand why it's missing
print(df.isnull().sum()) # Check missing values
print(df.isnull().mean()) # Percentage missing
# Use appropriate imputation
df['numeric_col'] = df['numeric_col'].fillna(df['numeric_col'].median())
df['categorical_col'] = df['categorical_col'].fillna('Unknown')
4. Vectorized Operations
# Good - vectorized
df['new_col'] = df['col1'] * df['col2']
# Bad - loops
df['new_col'] = [x * y for x, y in zip(df['col1'], df['col2'])]
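Conditional logic can be vectorized too: np.where applies an if/else across the whole column at once, avoiding a slow row-by-row apply.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'score': [55, 85, 72]})

# np.where vectorizes an if/else over the whole column,
# avoiding a slow row-by-row apply
df['passed'] = np.where(df['score'] >= 60, 'yes', 'no')
```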
Practice Exercises
Exercise 1: Data Loading and Exploration
Load a CSV file and:
- Display the first 10 rows
- Show basic statistics for numeric columns
- Check for missing values
- Display data types of each column
Exercise 2: Data Cleaning
Given a messy DataFrame:
- Remove duplicate rows
- Fill missing numeric values with column means
- Fill missing categorical values with "Unknown"
- Convert date strings to datetime objects
Exercise 3: Data Transformation
Create new columns:
- Calculate BMI from height and weight
- Create age groups (18-25, 26-35, etc.)
- Calculate percentage scores
- Extract domain from email addresses
Exercise 4: Grouping and Aggregation
Analyze sales data by:
- Total sales by product category
- Average order value by region
- Top 5 customers by total spending
- Monthly sales trends
Exercise 5: Merging DataFrames
Given customer and order data:
- Join customer info with their orders
- Calculate customer lifetime value
- Find customers who haven't ordered in 30 days
- Create a customer summary report
Summary
Pandas DataFrames are powerful for data manipulation:
Creating DataFrames:
import pandas as pd
# From dictionary
df = pd.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6]})
# From CSV
df = pd.read_csv('data.csv')
Data Selection:
# Columns
df['column_name']
df[['col1', 'col2']]
# Rows
df.iloc[0:5] # Position-based
df.loc['index'] # Label-based
df[df['col'] > 10] # Boolean indexing
Data Operations:
# Cleaning
df.dropna() # Remove missing
df.fillna(value) # Fill missing
df.drop_duplicates() # Remove duplicates
# Transformation
df['new_col'] = df['col1'] * 2 # Add column
df.groupby('col')['val'].sum() # Group and aggregate
Key Concepts:
- Series (1D) and DataFrame (2D) structures
- Indexing with .loc and .iloc
- Data cleaning and preprocessing
- Grouping and aggregation
- Merging and joining data
Next: Data Cleaning - Handling missing data and preprocessing! 🧹