Pandas DataFrames: Data Manipulation and Analysis
Welcome to Pandas DataFrames! Think of pandas as Excel on steroids - it can handle massive datasets, perform complex operations, and analyze data in ways Excel could never dream of.
Why Pandas Matters
Pandas makes data analysis feel natural:
import pandas as pd
# Load data (like opening a spreadsheet)
df = pd.read_csv('sales_data.csv')
# Filter data (like Excel filters)
high_sales = df[df['revenue'] > 1000]
# Group and summarize (like pivot tables)
sales_by_region = df.groupby('region')['revenue'].sum()
# Add calculations (like Excel formulas)
df['profit_margin'] = df['profit'] / df['revenue'] * 100
# Handle missing data automatically
df = df.dropna() # Remove rows with missing values
Series: 1D Data
Before DataFrames, let's understand Series:
import pandas as pd
import numpy as np
# Create a Series from a list
sales = pd.Series([100, 200, 150, 300, 250])
print(sales)
# 0 100
# 1 200
# 2 150
# 3 300
# 4 250
# dtype: int64
# Series with custom index
sales = pd.Series([100, 200, 150, 300, 250],
index=['Mon', 'Tue', 'Wed', 'Thu', 'Fri'])
print(sales)
# Mon 100
# Tue 200
# Wed 150
# Thu 300
# Fri 250
# dtype: int64
# Access elements
print(sales['Mon']) # 100
print(sales.iloc[0]) # 100 (explicit positional; plain sales[0] is deprecated on a labeled Series)
print(sales.loc['Mon']) # 100 (explicit label)
# Series operations
print(sales * 1.1) # 10% increase
print(sales > 200) # Boolean mask
print(sales[sales > 200]) # Filter
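One more Series behavior worth knowing before moving to DataFrames: arithmetic between two Series aligns on index labels, not position. A minimal sketch:

```python
import pandas as pd

# Series arithmetic aligns on index labels, not position
q1 = pd.Series([100, 200], index=['Mon', 'Tue'])
q2 = pd.Series([50, 75], index=['Tue', 'Wed'])

# Labels present in only one Series produce NaN
combined = q1 + q2

# add() with fill_value treats a missing label as 0 instead
total = q1.add(q2, fill_value=0)
```

This alignment is also why adding two DataFrame columns "just works": matching row labels are paired up automatically.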
DataFrame Creation
From Dictionaries
# Create DataFrame from dictionary
data = {
'Name': ['Alice', 'Bob', 'Charlie', 'Diana'],
'Age': [25, 30, 35, 28],
'City': ['New York', 'London', 'Paris', 'Tokyo'],
'Salary': [50000, 60000, 70000, 55000]
}
df = pd.DataFrame(data)
print(df)
# Name Age City Salary
# 0 Alice 25 New York 50000
# 1 Bob 30 London 60000
# 2 Charlie 35 Paris 70000
# 3 Diana 28 Tokyo 55000
From Lists
# From list of lists
data = [
['Alice', 25, 'New York', 50000],
['Bob', 30, 'London', 60000],
['Charlie', 35, 'Paris', 70000],
['Diana', 28, 'Tokyo', 55000]
]
df = pd.DataFrame(data, columns=['Name', 'Age', 'City', 'Salary'])
print(df)
From NumPy Arrays
# From NumPy array
arr = np.random.rand(4, 3)
df = pd.DataFrame(arr, columns=['A', 'B', 'C'])
print(df)
From CSV Files
# Read CSV file
df = pd.read_csv('data.csv')
# Read with options
df = pd.read_csv('data.csv',
sep=',', # delimiter
header=0, # header row
index_col=0, # index column
parse_dates=['date']) # parse dates
# Read Excel files
df = pd.read_excel('data.xlsx', sheet_name='Sheet1')
# Read JSON
df = pd.read_json('data.json')
DataFrame Exploration
Basic Information
df = pd.DataFrame({
'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, 30, 35],
'Salary': [50000, 60000, 70000]
})
print("Shape:", df.shape) # (3, 3) - rows, columns
print("Columns:", df.columns) # Index(['Name', 'Age', 'Salary'])
print("Index:", df.index) # RangeIndex(start=0, stop=3)
print("Data types:")
print(df.dtypes)
# Name object
# Age int64
# Salary int64
# dtype: object
# Summary statistics
print(df.describe())
# Age Salary
# count 3.000000 3.000000
# mean 30.000000 60000.000000
# std 5.000000 10000.000000
# min 25.000000 50000.000000
# 25% 27.500000 55000.000000
# 50% 30.000000 60000.000000
# 75% 32.500000 65000.000000
# max 35.000000 70000.000000
# First few rows
print(df.head()) # First 5 rows
print(df.head(2)) # First 2 rows
# Last few rows
print(df.tail()) # Last 5 rows
print(df.tail(2)) # Last 2 rows
# Random sample
print(df.sample(2)) # Random 2 rows
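Two more exploration calls worth adding to the habit: info() gives a one-call overview, and memory_usage() shows where the bytes go (useful before converting columns to category).

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'Salary': [50000, 60000, 70000]
})

# info() prints dtypes, non-null counts, and memory usage in one call
df.info()

# memory_usage(deep=True) reports actual bytes, including string contents
mem = df.memory_usage(deep=True)
```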
Indexing and Selection
Column Selection
df = pd.DataFrame({
'Name': ['Alice', 'Bob', 'Charlie', 'Diana'],
'Age': [25, 30, 35, 28],
'City': ['NYC', 'London', 'Paris', 'Tokyo'],
'Salary': [50000, 60000, 70000, 55000]
})
# Single column (returns Series)
names = df['Name']
ages = df.Age # Attribute syntax (only works for names that are valid Python identifiers)
# Multiple columns (returns DataFrame)
subset = df[['Name', 'Age']]
# Column operations
df['Salary_2024'] = df['Salary'] * 1.05 # Add 5% raise
df['Age_Group'] = pd.cut(df['Age'], bins=[0, 30, 40, 100],
labels=['Young', 'Middle', 'Senior'])
Row Selection
# By position (iloc)
first_row = df.iloc[0] # First row
first_two = df.iloc[0:2] # First two rows
specific_rows = df.iloc[[0, 2]] # Rows 0 and 2
# By label (loc)
df.index = ['A', 'B', 'C', 'D'] # Set custom index
row_B = df.loc['B'] # Row with index 'B'
rows_BC = df.loc['B':'C'] # Rows B through C
# Boolean indexing
young_people = df[df['Age'] < 30]
high_earners = df[df['Salary'] > 55000]
nyc_residents = df[df['City'] == 'NYC']
# Multiple conditions
young_nyc = df[(df['Age'] < 30) & (df['City'] == 'NYC')]
Advanced Selection
# isin() - check membership
european_cities = df[df['City'].isin(['London', 'Paris'])]
print(european_cities)
# between() - range check
mid_salary = df[df['Salary'].between(55000, 65000)]
print(mid_salary)
# query() - SQL-like syntax
result = df.query('Age < 30 and Salary > 50000')
print(result)
# nlargest/nsmallest
top_earners = df.nlargest(2, 'Salary')
youngest = df.nsmallest(2, 'Age')
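When you need just one cell rather than a slice, .at (label-based) and .iat (position-based) return a scalar directly and are faster than .loc/.iloc. A small sketch:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob'],
    'Salary': [50000, 60000]
})

# .at (label) and .iat (position) fetch a single scalar value;
# faster than .loc/.iloc when you only need one cell
by_label = df.at[0, 'Salary']
by_position = df.iat[1, 1]
```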
Data Cleaning
Handling Missing Values
# Create DataFrame with missing values
df = pd.DataFrame({
'A': [1, 2, np.nan, 4],
'B': [5, np.nan, 7, 8],
'C': [9, 10, 11, np.nan]
})
print("Missing values:")
print(df.isnull().sum())
# A 1
# B 1
# C 1
# dtype: int64
# Drop rows with any missing values
df_clean = df.dropna()
# Drop rows with all missing values
df_clean = df.dropna(how='all')
# Drop columns with missing values
df_clean = df.dropna(axis=1)
# Fill missing values
df_filled = df.fillna(0) # Fill with 0
df_filled = df.fillna(df.mean()) # Fill with column means
df_filled = df.ffill() # Forward fill (fillna(method='ffill') is deprecated)
df_filled = df.bfill() # Backward fill
# Fill specific columns
df['A'] = df['A'].fillna(df['A'].median())
df['B'] = df['B'].fillna(df['B'].mean()) # Fill with column mean
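For ordered data such as time series, interpolate() is often a better fill than a constant: it estimates each gap from its neighbors (linear by default).

```python
import pandas as pd
import numpy as np

s = pd.Series([1.0, np.nan, 3.0, np.nan, 5.0])

# interpolate() estimates gaps from neighboring values (linear by default)
filled = s.interpolate()
```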
Data Type Conversion
df = pd.DataFrame({
'Age': ['25', '30', '35'],
'Salary': ['50000', '60000', '70000'],
'Date': ['2023-01-01', '2023-01-02', '2023-01-03']
})
# Convert to numeric
df['Age'] = pd.to_numeric(df['Age'])
df['Salary'] = pd.to_numeric(df['Salary'])
# Convert to datetime
df['Date'] = pd.to_datetime(df['Date'])
# Convert to categorical
df['Category'] = pd.Categorical(['A', 'B', 'A'])
print(df.dtypes)
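Real-world numeric columns are rarely perfectly clean; pd.to_numeric raises on the first bad value unless you tell it otherwise. errors='coerce' is the usual escape hatch:

```python
import pandas as pd

raw = pd.Series(['25', '30', 'n/a'])

# errors='coerce' converts unparseable strings to NaN instead of raising
ages = pd.to_numeric(raw, errors='coerce')
```

After coercing, check isna().sum() to see how many values failed to parse before filling or dropping them.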
Removing Duplicates
df = pd.DataFrame({
'Name': ['Alice', 'Bob', 'Alice', 'Charlie'],
'Age': [25, 30, 25, 35]
})
print("Duplicates:")
print(df.duplicated())
# 0 False
# 1 False
# 2 True
# 3 False
# dtype: bool
# Remove duplicates
df_unique = df.drop_duplicates()
# Remove duplicates based on specific columns
df_unique = df.drop_duplicates(subset=['Name'])
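By default drop_duplicates() keeps the first occurrence; the keep parameter controls which copy survives. A quick sketch of the other options:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Alice'],
    'Age': [25, 30, 25]
})

# keep='last' retains the final occurrence of each duplicate group
last_kept = df.drop_duplicates(keep='last')

# keep=False drops every row that has a duplicate anywhere
all_dropped = df.drop_duplicates(keep=False)
```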
Data Transformation
Adding Columns
df = pd.DataFrame({
'Name': ['Alice', 'Bob', 'Charlie'],
'Math': [85, 92, 78],
'Science': [88, 85, 90],
'English': [92, 88, 85]
})
# Calculate total and average
df['Total'] = df['Math'] + df['Science'] + df['English']
df['Average'] = df[['Math', 'Science', 'English']].mean(axis=1)
# Grade based on average
def assign_grade(avg):
if avg >= 90: return 'A'
elif avg >= 80: return 'B'
elif avg >= 70: return 'C'
else: return 'F'
df['Grade'] = df['Average'].apply(assign_grade)
# Percentage scores
df['Math_Pct'] = df['Math'] / df['Total'] * 100
String Operations
df = pd.DataFrame({
'Name': ['alice smith', 'BOB JOHNSON', 'Charlie Brown'],
'Email': ['alice@example.com', 'bob@test.com', 'charlie@demo.com']
})
# String methods
df['Name'] = df['Name'].str.title() # Title case
df['First_Name'] = df['Name'].str.split().str[0] # First name
df['Last_Name'] = df['Name'].str.split().str[1] # Last name
df['Name_Length'] = df['Name'].str.len() # String length
# Email domain
df['Domain'] = df['Email'].str.split('@').str[1]
# Check patterns
df['Is_Gmail'] = df['Email'].str.contains('gmail')
df['Valid_Email'] = df['Email'].str.match(r'^[\w\.-]+@[\w\.-]+\.\w+$')
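Beyond split and contains, str.extract() pulls regex capture groups straight into columns, which is often tidier than chained splits. A small sketch using named groups:

```python
import pandas as pd

emails = pd.Series(['alice@example.com', 'bob@test.com'])

# extract() pulls named regex capture groups into DataFrame columns
parts = emails.str.extract(r'(?P<user>[^@]+)@(?P<domain>.+)')
```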
Date Operations
df = pd.DataFrame({
'Date': pd.date_range('2023-01-01', periods=5, freq='D'),
'Sales': [100, 200, 150, 300, 250]
})
# Extract date components
df['Year'] = df['Date'].dt.year
df['Month'] = df['Date'].dt.month
df['Day'] = df['Date'].dt.day
df['Weekday'] = df['Date'].dt.day_name()
# Date arithmetic
df['Next_Week'] = df['Date'] + pd.Timedelta(days=7)
df['Days_Old'] = (pd.Timestamp.now() - df['Date']).dt.days
# Resampling (if Date is index)
df.set_index('Date', inplace=True)
weekly_sales = df.resample('W').sum() # Weekly totals
monthly_sales = df.resample('M').mean() # Monthly averages ('ME' in pandas >= 2.2)
Grouping and Aggregation
Basic GroupBy
df = pd.DataFrame({
'Department': ['Sales', 'Sales', 'HR', 'HR', 'IT', 'IT', 'IT'],
'Employee': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve', 'Frank', 'Grace'],
'Salary': [50000, 55000, 45000, 48000, 60000, 65000, 70000],
'Years_Exp': [3, 5, 2, 4, 6, 8, 10]
})
# Group by department
grouped = df.groupby('Department')
# Aggregate functions
print(grouped['Salary'].mean())
# Department
# HR 46500.0
# IT 65000.0
# Sales 52500.0
# Name: Salary, dtype: float64
print(grouped['Salary'].agg(['count', 'mean', 'min', 'max']))
# count mean min max
# Department
# HR 2 46500.0 45000 48000
# IT 3 65000.0 60000 70000
# Sales 2 52500.0 50000 55000
# Multiple aggregations
result = grouped.agg({
'Salary': ['mean', 'std'],
'Years_Exp': ['mean', 'max']
})
print(result)
Advanced Grouping
# Multiple groupby columns
df['Seniority'] = pd.cut(df['Years_Exp'], bins=[0, 3, 6, 20],
labels=['Junior', 'Mid', 'Senior'])
grouped_multi = df.groupby(['Department', 'Seniority'])
result = grouped_multi['Salary'].mean()
print(result)
# Department Seniority
# HR Junior 45000.0
# Mid 48000.0
# IT Mid 60000.0
# Senior 67500.0
# Sales Junior 50000.0
# Mid 55000.0
# Name: Salary, dtype: float64
# Custom aggregation functions
def salary_range(x):
return x.max() - x.min()
result = grouped.agg({
'Salary': ['mean', salary_range],
'Employee': 'count'
})
print(result)
Pivot Tables
# Create pivot table
pivot = df.pivot_table(
values='Salary',
index='Department',
columns='Seniority',
aggfunc='mean',
fill_value=0
)
print(pivot)
# Seniority Junior Mid Senior
# Department
# HR 45000.0 48000.0 0.0
# IT 0.0 60000.0 67500.0
# Sales 50000.0 55000.0 0.0
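pivot_table can also append row and column totals with margins=True, the equivalent of Excel's "Grand Total" row. A minimal sketch on a cut-down version of the department data:

```python
import pandas as pd

df = pd.DataFrame({
    'Department': ['Sales', 'Sales', 'HR', 'HR'],
    'Seniority': ['Junior', 'Mid', 'Junior', 'Mid'],
    'Salary': [50000, 55000, 45000, 48000]
})

# margins=True appends totals (here: means) under margins_name
pivot = df.pivot_table(values='Salary', index='Department',
                       columns='Seniority', aggfunc='mean',
                       margins=True, margins_name='Overall')
```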
Merging and Joining
Concatenation
df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [5, 6], 'B': [7, 8]})
# Vertical concatenation (add rows)
result = pd.concat([df1, df2])
print(result)
# A B
# 0 1 3
# 1 2 4
# 0 5 7
# 1 6 8
# Horizontal concatenation (add columns)
df3 = pd.DataFrame({'C': [9, 10]})
result = pd.concat([df1, df3], axis=1)
print(result)
# A B C
# 0 1 3 9
# 1 2 4 10
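Note the duplicated index labels (0, 1, 0, 1) in the vertical concat above; they can cause surprises in later lookups. ignore_index=True rebuilds a clean index:

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [5, 6], 'B': [7, 8]})

# ignore_index=True rebuilds a clean 0..n-1 index instead of
# keeping the duplicated labels from the inputs
result = pd.concat([df1, df2], ignore_index=True)
```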
Merging (Joins)
employees = pd.DataFrame({
'emp_id': [1, 2, 3, 4],
'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
'dept_id': [10, 20, 10, 30]
})
departments = pd.DataFrame({
'dept_id': [10, 20, 30],
'dept_name': ['Sales', 'HR', 'IT']
})
# Inner join (default)
result = pd.merge(employees, departments, on='dept_id')
print(result)
# emp_id name dept_id dept_name
# 0 1 Alice 10 Sales
# 1 2 Bob 20 HR
# 2 3 Charlie 10 Sales
# 3 4 Diana 30 IT
# Left join
result = pd.merge(employees, departments, on='dept_id', how='left')
# Right join
result = pd.merge(employees, departments, on='dept_id', how='right')
# Outer join
result = pd.merge(employees, departments, on='dept_id', how='outer')
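When debugging a join, indicator=True adds a _merge column recording where each row came from, which quickly exposes unmatched keys. A sketch with a deliberately unmatched department:

```python
import pandas as pd

employees = pd.DataFrame({'emp_id': [1, 2], 'dept_id': [10, 99]})
departments = pd.DataFrame({'dept_id': [10, 20], 'dept_name': ['Sales', 'HR']})

# indicator=True labels each row 'both', 'left_only', or 'right_only'
audit = pd.merge(employees, departments, on='dept_id',
                 how='outer', indicator=True)
```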
Practical Examples
Example 1: Sales Analysis
# Sample sales data
sales_data = {
'Date': pd.date_range('2023-01-01', periods=100, freq='D'),
'Product': np.random.choice(['A', 'B', 'C'], 100),
'Region': np.random.choice(['North', 'South', 'East', 'West'], 100),
'Sales': np.random.randint(100, 1000, 100)
}
df = pd.DataFrame(sales_data)
# Basic analysis
print("Total sales:", df['Sales'].sum())
print("Average sale:", df['Sales'].mean())
print("Best day:", df.loc[df['Sales'].idxmax(), 'Date'])
# Sales by product
product_sales = df.groupby('Product')['Sales'].agg(['sum', 'mean', 'count'])
print("\nSales by product:")
print(product_sales)
# Sales by region and product
pivot_table = df.pivot_table(
values='Sales',
index='Region',
columns='Product',
aggfunc='sum',
fill_value=0
)
print("\nSales by region and product:")
print(pivot_table)
# Monthly trends
df['Month'] = df['Date'].dt.month
monthly_sales = df.groupby('Month')['Sales'].sum()
print("\nMonthly sales:")
print(monthly_sales)
# Top performing regions
region_performance = df.groupby('Region')['Sales'].sum().sort_values(ascending=False)
print("\nRegion performance:")
print(region_performance)
Example 2: Student Grade Analysis
# Student data
students = pd.DataFrame({
'student_id': range(1, 21),
'name': [f'Student_{i}' for i in range(1, 21)],
'class': np.random.choice(['A', 'B', 'C'], 20),
'math': np.random.randint(50, 100, 20),
'science': np.random.randint(50, 100, 20),
'english': np.random.randint(50, 100, 20)
})
# Calculate averages
students['average'] = students[['math', 'science', 'english']].mean(axis=1)
# Assign grades
def get_grade(score):
if score >= 90: return 'A'
elif score >= 80: return 'B'
elif score >= 70: return 'C'
elif score >= 60: return 'D'
else: return 'F'
students['grade'] = students['average'].apply(get_grade)
# Class performance
class_stats = students.groupby('class').agg({
'math': 'mean',
'science': 'mean',
'english': 'mean',
'average': ['mean', 'std'],
'grade': lambda x: x.value_counts().index[0] # Most common grade
})
print("Class performance:")
print(class_stats)
# Identify struggling students
struggling = students[students['average'] < 70]
print(f"\nStruggling students ({len(struggling)}):")
print(struggling[['name', 'class', 'average', 'grade']])
# Subject-wise analysis
subject_means = students[['math', 'science', 'english']].mean()
print("\nSubject averages:")
print(subject_means)
# Correlation between subjects
correlation = students[['math', 'science', 'english']].corr()
print("\nSubject correlations:")
print(correlation)
Example 3: E-commerce Data Analysis
# Sample e-commerce data
orders = pd.DataFrame({
'order_id': range(1, 1001),
'customer_id': np.random.randint(1, 201, 1000),
'product_id': np.random.randint(1, 51, 1000),
'quantity': np.random.randint(1, 5, 1000),
'price': np.random.uniform(10, 500, 1000),
'order_date': pd.date_range('2023-01-01', periods=1000, freq='h') # lowercase 'h'; uppercase 'H' is deprecated
})
# Calculate order totals
orders['total'] = orders['quantity'] * orders['price']
# Revenue analysis
total_revenue = orders['total'].sum()
avg_order_value = orders['total'].mean()
print(f"Total revenue: ${total_revenue:,.2f}")
print(f"Average order value: ${avg_order_value:.2f}")
# Top products by revenue
product_revenue = orders.groupby('product_id')['total'].sum().sort_values(ascending=False)
print("\nTop 5 products by revenue:")
print(product_revenue.head())
# Customer analysis
customer_orders = orders.groupby('customer_id').agg({
'order_id': 'count',
'total': 'sum'
}).rename(columns={'order_id': 'order_count', 'total': 'total_spent'})
top_customers = customer_orders.sort_values('total_spent', ascending=False)
print("\nTop 5 customers:")
print(top_customers.head())
# Time-based analysis
orders['month'] = orders['order_date'].dt.month
monthly_revenue = orders.groupby('month')['total'].sum()
print("\nMonthly revenue:")
print(monthly_revenue)
# Customer segmentation
def segment_customer(total_spent):
if total_spent > 2000: return 'VIP'
elif total_spent > 1000: return 'Gold'
elif total_spent > 500: return 'Silver'
else: return 'Bronze'
customer_orders['segment'] = customer_orders['total_spent'].apply(segment_customer)
segment_counts = customer_orders['segment'].value_counts()
print("\nCustomer segments:")
print(segment_counts)
Best Practices
1. Chain Operations
# Method chaining for readable code
result = (df
.query('age > 18')
.assign(adult=True)
.groupby('department')
.agg({'salary': 'mean'})
.reset_index()
.sort_values('salary', ascending=False)
)
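Your own functions can join a chain too, via pipe(). The add_bonus helper below is a made-up example to show the pattern:

```python
import pandas as pd

def add_bonus(frame, pct):
    # Hypothetical helper: adds a bonus column as a fraction of salary
    return frame.assign(bonus=frame['salary'] * pct)

df = pd.DataFrame({'salary': [50000, 60000]})

# pipe() lets custom functions slot into a method chain
result = df.pipe(add_bonus, pct=0.1)
```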
2. Use Appropriate Data Types
# Convert to category for memory efficiency
df['category'] = df['category'].astype('category')
# Use datetime for date operations
df['date'] = pd.to_datetime(df['date'])
3. Handle Missing Data Properly
# Don't just drop missing data - understand why it's missing
print(df.isnull().sum()) # Check missing values
print(df.isnull().mean()) # Percentage missing
# Use appropriate imputation
df['numeric_col'] = df['numeric_col'].fillna(df['numeric_col'].median())
df['categorical_col'] = df['categorical_col'].fillna('Unknown')
4. Vectorized Operations
# Good - vectorized
df['new_col'] = df['col1'] * df['col2']
# Bad - loops
df['new_col'] = [x * y for x, y in zip(df['col1'], df['col2'])]
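Conditional logic can be vectorized too: np.where applies an if/else across the whole column at once, avoiding a slow row-by-row apply.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'score': [55, 85, 72]})

# np.where vectorizes an if/else over the whole column,
# avoiding a slow row-by-row apply
df['passed'] = np.where(df['score'] >= 60, 'yes', 'no')
```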
Practice Exercises
Exercise 1: Data Loading and Exploration
Load a CSV file and:
- Display the first 10 rows
- Show basic statistics for numeric columns
- Check for missing values
- Display data types of each column
Exercise 2: Data Cleaning
Given a messy DataFrame:
- Remove duplicate rows
- Fill missing numeric values with column means
- Fill missing categorical values with "Unknown"
- Convert date strings to datetime objects
Exercise 3: Data Transformation
Create new columns:
- Calculate BMI from height and weight
- Create age groups (18-25, 26-35, etc.)
- Calculate percentage scores
- Extract domain from email addresses
Exercise 4: Grouping and Aggregation
Analyze sales data by:
- Total sales by product category
- Average order value by region
- Top 5 customers by total spending
- Monthly sales trends
Exercise 5: Merging DataFrames
Given customer and order data:
- Join customer info with their orders
- Calculate customer lifetime value
- Find customers who haven't ordered in 30 days
- Create a customer summary report
Summary
Pandas DataFrames are powerful for data manipulation:
Creating DataFrames:
import pandas as pd
# From dictionary
df = pd.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6]})
# From CSV
df = pd.read_csv('data.csv')
Data Selection:
# Columns
df['column_name']
df[['col1', 'col2']]
# Rows
df.iloc[0:5] # Position-based
df.loc['index'] # Label-based
df[df['col'] > 10] # Boolean indexing
Data Operations:
# Cleaning
df.dropna() # Remove missing
df.fillna(value) # Fill missing
df.drop_duplicates() # Remove duplicates
# Transformation
df['new_col'] = df['col1'] * 2 # Add column
df.groupby('col')['val'].sum() # Group and aggregate
Key Concepts:
- Series (1D) and DataFrame (2D) structures
- Indexing with .loc and .iloc
- Data cleaning and preprocessing
- Grouping and aggregation
- Merging and joining data
Next: Data Cleaning - Handling missing data and preprocessing! 🧹