Data Cleaning: Handling Missing Data and Preprocessing
Welcome to Data Cleaning! Think of data cleaning as housekeeping for your data: you organize, fix, and prepare it so it's ready for analysis. In real-world projects, practitioners commonly report spending the majority of their time (often cited as around 80%) cleaning and preparing data.
Why Data Cleaning Matters
Garbage in, garbage out - clean data leads to reliable insights:
# Bad data leads to wrong conclusions
dirty_data = pd.DataFrame({
'sales': [100, 200, None, 400, 'not_a_number'],
'date': ['2023-01-01', 'invalid_date', '2023-01-03', None, '2023-01-05']
})
# Clean data enables proper analysis
clean_data = pd.DataFrame({
'sales': [100, 200, 250, 400, 350], # Missing values filled
'date': pd.to_datetime(['2023-01-01', '2023-01-02', '2023-01-03', '2023-01-04', '2023-01-05'])
})
Identifying Missing Data
Types of Missing Data
import pandas as pd
import numpy as np
# Create sample data with different types of missing values
df = pd.DataFrame({
'A': [1, 2, np.nan, 4, 5], # NaN (Not a Number)
'B': [1, 2, None, 4, 5], # None (Python null)
'C': [1, 2, '', 4, 5], # Empty string
'D': [1, 2, 'N/A', 4, 5], # Custom missing indicator
'E': [1, 2, 'NULL', 4, 5] # Another custom indicator
})
print("DataFrame:")
print(df)
print("\nData types:")
print(df.dtypes)
Detecting Missing Values
# Check for NaN/None values
print("Is null:")
print(df.isnull())
# A B C D E
# 0 False False False False False
# 1 False False False False False
# 2 True True False False False
# 3 False False False False False
# 4 False False False False False
# Count missing values per column
print("\nMissing values per column:")
print(df.isnull().sum())
# A 1
# B 1
# C 0
# D 0
# E 0
# dtype: int64
# Total missing values
print(f"\nTotal missing values: {df.isnull().sum().sum()}")
# Percentage of missing values
print("\nPercentage missing:")
print(df.isnull().mean() * 100)
Custom Missing Value Detection
# Define custom missing value indicators
missing_indicators = ['', 'N/A', 'NULL', 'null', 'NaN', 'n/a', 'missing']
# Check for missing values including custom indicators
def is_missing(value):
if pd.isna(value): # Catches NaN, None, NaT
return True
if isinstance(value, str) and value.strip().lower() in [x.lower() for x in missing_indicators]:
return True
return False
# Apply custom missing detection element-wise
# (DataFrame.applymap was deprecated and renamed to DataFrame.map in pandas 2.1)
missing_mask = df.map(is_missing)
print("Custom missing detection:")
print(missing_mask)
print(f"\nTotal missing (including custom): {missing_mask.sum().sum()}")
Handling Missing Data
Removal Strategies
# Sample data with missing values
df = pd.DataFrame({
'A': [1, 2, np.nan, 4, 5],
'B': [1, np.nan, 3, 4, 5],
'C': [1, 2, 3, np.nan, 5],
'D': [1, 2, 3, 4, np.nan]
})
print("Original data:")
print(df)
print(f"Shape: {df.shape}")
# Remove rows with any missing values
df_dropna = df.dropna()
print(f"\nAfter dropna(): {df_dropna.shape}")
print(df_dropna)
# Remove rows where all values are missing
df_dropna_all = df.dropna(how='all')
print(f"\nAfter dropna(how='all'): {df_dropna_all.shape}")
# Remove columns with any missing values
df_dropna_cols = df.dropna(axis=1)
print(f"\nAfter dropna(axis=1): {df_dropna_cols.shape}")
# Remove rows with missing values in specific columns
df_dropna_subset = df.dropna(subset=['A', 'B'])
print(f"\nAfter dropna(subset=['A', 'B']): {df_dropna_subset.shape}")
Imputation Strategies
# Forward fill (use previous value)
# (fillna(method='ffill') is deprecated; call ffill()/bfill() directly)
df_ffill = df.ffill()
print("Forward fill:")
print(df_ffill)
# Backward fill (use next value)
df_bfill = df.bfill()
print("\nBackward fill:")
print(df_bfill)
# Fill with specific value
df_fill_zero = df.fillna(0)
print("\nFill with 0:")
print(df_fill_zero)
# Fill with column mean (numeric columns only)
numeric_cols = df.select_dtypes(include=[np.number]).columns
df_fill_mean = df.copy()
df_fill_mean[numeric_cols] = df_fill_mean[numeric_cols].fillna(df_fill_mean[numeric_cols].mean())
print("\nFill with mean:")
print(df_fill_mean)
# Fill with column median
df_fill_median = df.copy()
df_fill_median[numeric_cols] = df_fill_median[numeric_cols].fillna(df_fill_median[numeric_cols].median())
print("\nFill with median:")
print(df_fill_median)
# Fill with column mode (most frequent value)
df_fill_mode = df.copy()
for col in df.columns:
mode_value = df[col].mode()
if not mode_value.empty:
df_fill_mode[col] = df_fill_mode[col].fillna(mode_value[0])
print("\nFill with mode:")
print(df_fill_mode)
Advanced Imputation
from sklearn.impute import KNNImputer, SimpleImputer
# KNN imputation (uses similar rows to fill missing values)
knn_imputer = KNNImputer(n_neighbors=2)
df_numeric = df.select_dtypes(include=[np.number])
df_knn = pd.DataFrame(knn_imputer.fit_transform(df_numeric),
columns=df_numeric.columns,
index=df.index)
print("KNN imputation:")
print(df_knn)
# Iterative imputation (MICE - Multiple Imputation by Chained Equations)
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
iterative_imputer = IterativeImputer(random_state=42)
df_iterative = pd.DataFrame(iterative_imputer.fit_transform(df_numeric),
columns=df_numeric.columns,
index=df.index)
print("\nIterative imputation:")
print(df_iterative)
Data Type Conversion
Converting Data Types
# Sample messy data
df = pd.DataFrame({
'age': ['25', '30', '35', '40', '45'],
'salary': ['$50,000', '$60,000', '$70,000', '$80,000', '$90,000'],
'date': ['2023-01-01', '2023-02-01', '2023-03-01', '2023-04-01', '2023-05-01'],
'rating': ['4.5', '3.8', '4.2', '4.9', '3.5'],
'is_active': ['True', 'False', 'True', 'True', 'False']
})
print("Original dtypes:")
print(df.dtypes)
# Convert to numeric
df['age'] = pd.to_numeric(df['age'])
df['rating'] = pd.to_numeric(df['rating'])
# Convert salary (remove $ and comma)
df['salary'] = df['salary'].str.replace(r'[\$,]', '', regex=True).astype(float)
# Convert to datetime
df['date'] = pd.to_datetime(df['date'])
# Convert to boolean
df['is_active'] = df['is_active'].map({'True': True, 'False': False})
print("\nConverted dtypes:")
print(df.dtypes)
print("\nConverted data:")
print(df)
Handling Conversion Errors
# Data with conversion errors
df = pd.DataFrame({
'numbers': ['1', '2', 'three', '4', '5'],
'dates': ['2023-01-01', '2023-02-01', 'not_a_date', '2023-04-01', '2023-05-01']
})
# Safe numeric conversion
df['numbers_safe'] = pd.to_numeric(df['numbers'], errors='coerce')
print("Safe numeric conversion:")
print(df[['numbers', 'numbers_safe']])
# Safe date conversion
df['dates_safe'] = pd.to_datetime(df['dates'], errors='coerce')
print("\nSafe date conversion:")
print(df[['dates', 'dates_safe']])
# Check for conversion failures
print(f"\nNumeric conversion failures: {df['numbers_safe'].isnull().sum()}")
print(f"Date conversion failures: {df['dates_safe'].isnull().sum()}")
String Data Cleaning
Text Preprocessing
# Sample text data
df = pd.DataFrame({
'names': [' Alice Smith ', 'BOB johnson', ' charlie brown ', 'Diana Prince'],
'emails': ['alice@example.com', 'bob@test.COM', 'CHARLIE@demo.com', 'diana@WORK.COM'],
'comments': ['Great product!', 'NOT GOOD', 'excellent work!!!', 'ok service']
})
# String cleaning
df['names_clean'] = df['names'].str.strip().str.title()
print("Clean names:")
print(df[['names', 'names_clean']])
# Email normalization
df['emails_clean'] = df['emails'].str.lower().str.strip()
print("\nClean emails:")
print(df[['emails', 'emails_clean']])
# Text standardization
df['comments_clean'] = (df['comments']
.str.lower()
.str.strip()
.str.replace(r'[^\w\s]', '', regex=True) # Remove punctuation
.str.replace(r'\s+', ' ', regex=True)) # Normalize whitespace
print("\nClean comments:")
print(df[['comments', 'comments_clean']])
String Extraction and Parsing
# Complex string data
df = pd.DataFrame({
'full_address': [
'123 Main St, New York, NY 10001',
'456 Oak Ave, Los Angeles, CA 90210',
'789 Pine Rd, Chicago, IL 60601'
],
'phone': [
'(555) 123-4567',
'555-987-6543',
'555.456.7890'
],
'product_info': [
'Laptop - Model: X1, Price: $999',
'Phone - Model: Y2, Price: $699',
'Tablet - Model: Z3, Price: $399'
]
})
# Extract address components
df[['street', 'city', 'state_zip']] = df['full_address'].str.split(', ', expand=True)
df[['state', 'zip']] = df['state_zip'].str.split(' ', expand=True)
df = df.drop('state_zip', axis=1)
print("Parsed addresses:")
print(df[['full_address', 'street', 'city', 'state', 'zip']])
# Standardize phone numbers
df['phone_clean'] = (df['phone']
.str.replace(r'[^\d]', '', regex=True) # Remove non-digits
.str.replace(r'^(\d{3})(\d{3})(\d{4})$', r'(\1) \2-\3', regex=True)) # Format
print("\nClean phone numbers:")
print(df[['phone', 'phone_clean']])
# Extract product information
df['model'] = df['product_info'].str.extract(r'Model: (\w+)', expand=False)
df['price'] = df['product_info'].str.extract(r'Price: \$(\d+)', expand=False).astype(float)
print("\nExtracted product info:")
print(df[['product_info', 'model', 'price']])
Outlier Detection and Handling
Statistical Methods
import numpy as np
from scipy import stats
# Sample data with outliers
np.random.seed(42)
data = np.random.normal(100, 10, 100) # Normal distribution
data = np.append(data, [200, 250, 300]) # Add outliers
df = pd.DataFrame({'values': data})
print("Data statistics:")
print(df['values'].describe())
# Z-score method (values > 3 standard deviations)
z_scores = np.abs(stats.zscore(df['values']))
outliers_z = df[z_scores > 3]
print(f"\nOutliers by Z-score (>3): {len(outliers_z)}")
print(outliers_z)
# IQR method (values outside 1.5 * IQR)
Q1 = df['values'].quantile(0.25)
Q3 = df['values'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
outliers_iqr = df[(df['values'] < lower_bound) | (df['values'] > upper_bound)]
print(f"\nOutliers by IQR method: {len(outliers_iqr)}")
print(outliers_iqr)
Handling Outliers
# Remove outliers
df_no_outliers = df[(df['values'] >= lower_bound) & (df['values'] <= upper_bound)]
print(f"Data points after removing outliers: {len(df_no_outliers)}")
# Cap outliers (winsorization)
df_capped = df.copy()
df_capped['values'] = np.clip(df_capped['values'], lower_bound, upper_bound)
print("After capping outliers:")
print(df_capped['values'].describe())
# Replace outliers with median
median_value = df['values'].median()
df_replaced = df.copy()
outlier_mask = (df['values'] < lower_bound) | (df['values'] > upper_bound)
df_replaced.loc[outlier_mask, 'values'] = median_value
print(f"\nOutliers replaced with median ({median_value}):")
print(df_replaced[outlier_mask])
Data Normalization and Scaling
Min-Max Scaling
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler
# Sample data
df = pd.DataFrame({
'feature1': [1, 2, 3, 4, 5],
'feature2': [100, 200, 300, 400, 500],
'feature3': [10, 20, 30, 40, 50]
})
# Min-Max scaling (0 to 1)
scaler = MinMaxScaler()
df_minmax = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)
print("Min-Max scaled:")
print(df_minmax)
# Custom range scaling (-1 to 1)
scaler_custom = MinMaxScaler(feature_range=(-1, 1))
df_custom = pd.DataFrame(scaler_custom.fit_transform(df), columns=df.columns)
print("\nCustom range scaled (-1 to 1):")
print(df_custom)
Standard Scaling (Z-score)
# Standard scaling (mean=0, std=1)
scaler = StandardScaler()
df_standard = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)
print("Standard scaled:")
print(df_standard)
print(f"Mean: {df_standard.mean().round(10).tolist()}")
# StandardScaler normalizes with the population std (ddof=0);
# pandas' default std() uses ddof=1 and would print ~1.118 for n=5
print(f"Std: {df_standard.std(ddof=0).round(10).tolist()}")
Robust Scaling
# Robust scaling (uses median and IQR, less sensitive to outliers)
scaler = RobustScaler()
df_robust = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)
print("Robust scaled:")
print(df_robust)
Encoding Categorical Variables
Label Encoding
from sklearn.preprocessing import LabelEncoder
# Categorical data
df = pd.DataFrame({
'color': ['red', 'blue', 'green', 'red', 'blue'],
'size': ['S', 'M', 'L', 'M', 'S'],
'quality': ['good', 'excellent', 'good', 'poor', 'excellent']
})
# Label encoding (converts to integers)
le = LabelEncoder()
df_encoded = df.copy()
for col in df.columns:
df_encoded[col] = le.fit_transform(df[col])
print("Label encoded:")
print(df_encoded)
# Show mapping
for col in df.columns:
le = LabelEncoder()
le.fit(df[col])
mapping = dict(zip(le.classes_, le.transform(le.classes_)))
print(f"{col} mapping: {mapping}")
One-Hot Encoding
# One-hot encoding (creates binary columns)
df_onehot = pd.get_dummies(df, prefix=['color', 'size', 'quality'])
print("One-hot encoded:")
print(df_onehot)
# Drop first category to avoid multicollinearity
df_onehot_drop = pd.get_dummies(df, drop_first=True, prefix=['color', 'size', 'quality'])
print("\nOne-hot encoded (drop first):")
print(df_onehot_drop)
Feature Engineering
Creating New Features
# Sample sales data
df = pd.DataFrame({
'date': pd.date_range('2023-01-01', periods=100, freq='D'),
'sales': np.random.randint(100, 1000, 100),
'customers': np.random.randint(10, 50, 100)
})
# Date features
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day'] = df['date'].dt.day
df['weekday'] = df['date'].dt.weekday
df['is_weekend'] = df['date'].dt.weekday >= 5
# Sales per customer
df['sales_per_customer'] = df['sales'] / df['customers']
# Rolling statistics
df['sales_7d_avg'] = df['sales'].rolling(window=7).mean()
df['sales_7d_std'] = df['sales'].rolling(window=7).std()
# Lag features
df['sales_prev_day'] = df['sales'].shift(1)
df['sales_change'] = df['sales'] - df['sales_prev_day']
# Cumulative features
df['cumulative_sales'] = df['sales'].cumsum()
df['sales_pct_change'] = df['sales'].pct_change()
print("Engineered features:")
print(df.head(10))
Data Quality Validation
Data Quality Checks
def validate_dataframe(df):
"""Comprehensive data quality validation."""
issues = []
# Check for missing values
missing = df.isnull().sum()
if missing.any():
issues.append(f"Missing values found: {missing[missing > 0].to_dict()}")
# Check for duplicates
duplicates = df.duplicated().sum()
if duplicates > 0:
issues.append(f"Duplicate rows found: {duplicates}")
# Check data types
for col in df.columns:
if df[col].dtype == 'object':
# Check for mixed types in object columns
types = df[col].apply(type).value_counts()
if len(types) > 1:
issues.append(f"Mixed data types in {col}: {types.to_dict()}")
# Check for outliers (numeric columns)
numeric_cols = df.select_dtypes(include=[np.number]).columns
for col in numeric_cols:
z_scores = np.abs(stats.zscore(df[col].dropna()))
outliers = (z_scores > 3).sum()
if outliers > 0:
issues.append(f"Outliers in {col}: {outliers} values > 3 std dev")
# Check for negative values where they shouldn't be
positive_cols = ['sales', 'price', 'quantity', 'age'] # domain knowledge
for col in positive_cols:
if col in df.columns:
negative = (df[col] < 0).sum()
if negative > 0:
issues.append(f"Negative values in {col}: {negative}")
return issues
# Test validation
df_test = pd.DataFrame({
'sales': [100, 200, -50, 300, 400],
'price': [10, 20, 30, np.nan, 50],
'category': ['A', 'B', 'A', 'B', 'A']
})
issues = validate_dataframe(df_test)
if issues:
print("Data quality issues found:")
for issue in issues:
print(f"- {issue}")
else:
print("No data quality issues found!")
Practical Examples
Example 1: Customer Data Cleaning
# Raw customer data
customers = pd.DataFrame({
'customer_id': ['C001', 'C002', 'C003', 'C004', 'C005'],
'name': [' John Doe ', 'Jane Smith', 'BOB Johnson', None, 'Alice Brown'],
'email': ['john@email.com', 'jane@test.COM', 'bob@work.com', 'invalid-email', None],
'age': ['25', 'thirty', '35', '40', '45'],
'signup_date': ['2023-01-01', '2023-02-01', 'not-a-date', '2023-04-01', '2023-05-01'],
'total_purchases': ['$1,200', '$500', '800', 'N/A', '$2,500']
})
print("Raw data:")
print(customers)
print(f"Missing values: {customers.isnull().sum().sum()}")
# Clean names
customers['name_clean'] = (customers['name']
.str.strip()
.str.title()
.fillna('Unknown'))
# Clean emails
customers['email_clean'] = (customers['email']
.str.lower()
.str.strip())
# Validate emails
import re
email_pattern = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'
customers['valid_email'] = customers['email_clean'].str.match(email_pattern, na=False)
# Clean age
customers['age_clean'] = pd.to_numeric(customers['age'], errors='coerce')
# Clean dates
customers['signup_date_clean'] = pd.to_datetime(customers['signup_date'], errors='coerce')
# Clean purchases
customers['total_purchases_clean'] = (customers['total_purchases']
.str.replace(r'[\$,]', '', regex=True)
.replace('N/A', np.nan)
.astype(float))
print("\nCleaned data:")
print(customers[['name_clean', 'email_clean', 'valid_email', 'age_clean', 'signup_date_clean', 'total_purchases_clean']])
# Data quality report
print(f"\nData quality report:")
print(f"Valid emails: {customers['valid_email'].sum()}/{len(customers)}")
print(f"Missing ages: {customers['age_clean'].isnull().sum()}")
print(f"Invalid dates: {customers['signup_date_clean'].isnull().sum()}")
print(f"Missing purchases: {customers['total_purchases_clean'].isnull().sum()}")
Example 2: Sales Data Preprocessing
# Raw sales data
sales = pd.DataFrame({
'transaction_id': range(1, 101),
'date': pd.date_range('2023-01-01', periods=100, freq='D'),
'product': np.random.choice(['Laptop', 'Phone', 'Tablet', 'Headphones'], 100),
'quantity': np.random.randint(1, 10, 100),
'unit_price': np.random.uniform(50, 1000, 100),
'customer_type': np.random.choice(['Regular', 'VIP', 'New'], 100),
'region': np.random.choice(['North', 'South', 'East', 'West'], 100)
})
# Introduce some data quality issues
sales.loc[10:15, 'quantity'] = -5 # Negative quantities
sales.loc[20:25, 'unit_price'] = np.nan # Missing prices
sales['date'] = sales['date'].astype('object')  # allow strings alongside timestamps
sales.loc[30:35, 'date'] = 'invalid_date'  # Invalid dates
sales.loc[40:45, 'product'] = np.nan # Missing products
print("Data with quality issues:")
print(f"Missing values: {sales.isnull().sum().sum()}")
print(f"Negative quantities: {(sales['quantity'] < 0).sum()}")
# Clean quantity (remove negatives)
sales['quantity_clean'] = sales['quantity'].clip(lower=0)
# Fill missing prices with median
median_price = sales['unit_price'].median()
sales['unit_price_clean'] = sales['unit_price'].fillna(median_price)
# Fill missing products with mode
mode_product = sales['product'].mode()[0]
sales['product_clean'] = sales['product'].fillna(mode_product)
# Fix dates
sales['date_clean'] = pd.to_datetime(sales['date'], errors='coerce')
# Fill invalid dates by carrying the previous valid date forward
sales['date_clean'] = sales['date_clean'].ffill()
# Calculate total sales
sales['total_sales'] = sales['quantity_clean'] * sales['unit_price_clean']
# Create derived features
sales['month'] = sales['date_clean'].dt.month
sales['is_high_value'] = sales['total_sales'] > 500
print("\nCleaned data summary:")
print(f"Total transactions: {len(sales)}")
print(f"Total sales: ${sales['total_sales'].sum():,.2f}")
print(f"Average transaction: ${sales['total_sales'].mean():.2f}")
print(f"High-value transactions: {sales['is_high_value'].sum()}")
# Sales by product
product_sales = sales.groupby('product_clean')['total_sales'].sum().sort_values(ascending=False)
print("\nSales by product:")
print(product_sales)
# Monthly trends
monthly_sales = sales.groupby('month')['total_sales'].sum()
print("\nMonthly sales:")
print(monthly_sales)
Best Practices
1. Document Your Cleaning Process
def clean_customer_data(df):
    """
    Clean customer data following these steps:
    1. Remove leading/trailing whitespace
    2. Standardize email formats
    3. Convert dates to datetime
    4. Handle missing values appropriately
    5. Validate data ranges
    """
    df = df.copy()
    # Column names follow the customer example above
    df['name'] = df['name'].str.strip().str.title()
    df['email'] = df['email'].str.lower().str.strip()
    df['signup_date'] = pd.to_datetime(df['signup_date'], errors='coerce')
    df['age'] = pd.to_numeric(df['age'], errors='coerce')
    # Keep plausible ages; retain rows where age is unknown
    df = df[df['age'].between(0, 120) | df['age'].isna()]
    return df
2. Create Data Quality Checks
def data_quality_report(df):
"""Generate comprehensive data quality report."""
report = {
'total_rows': len(df),
'total_columns': len(df.columns),
'missing_values': df.isnull().sum().to_dict(),
'duplicate_rows': df.duplicated().sum(),
'data_types': df.dtypes.to_dict()
}
# Add column-specific checks
for col in df.columns:
if df[col].dtype in ['int64', 'float64']:
report[f'{col}_stats'] = {
'min': df[col].min(),
'max': df[col].max(),
'mean': df[col].mean(),
'zeros': (df[col] == 0).sum()
}
return report
3. Use Pipelines for Reproducible Cleaning
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
# Define preprocessing steps
numeric_features = ['age', 'salary']
categorical_features = ['department', 'location']
numeric_transformer = Pipeline(steps=[
('scaler', StandardScaler())
])
categorical_transformer = Pipeline(steps=[
('onehot', OneHotEncoder(drop='first'))
])
preprocessor = ColumnTransformer(
transformers=[
('num', numeric_transformer, numeric_features),
('cat', categorical_transformer, categorical_features)
])
# Apply to data (assumes a DataFrame X containing the columns listed above)
X_processed = preprocessor.fit_transform(X)
Practice Exercises
Exercise 1: Basic Data Cleaning
Given a dataset with missing values:
- Identify all missing values
- Remove rows with more than 50% missing data
- Fill remaining missing values appropriately
- Report data quality improvements
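A possible starting sketch for this exercise (the toy DataFrame here is illustrative):

```python
import numpy as np
import pandas as pd

# Toy data: one fully-missing row, scattered gaps elsewhere
df = pd.DataFrame({
    'a': [1.0, np.nan, 3.0, np.nan],
    'b': [np.nan, np.nan, 3.0, 4.0],
    'c': [1.0, np.nan, 3.0, 4.0],
})

before = df.isnull().sum().sum()
# Keep only rows with at least 50% non-missing values
df = df.dropna(thresh=int(np.ceil(len(df.columns) / 2)))
# Fill what remains with each column's median
df = df.fillna(df.median())
after = df.isnull().sum().sum()
print(f"Missing before: {before}, after: {after}, rows kept: {len(df)}")
```

`dropna(thresh=...)` counts non-missing values per row, which is a convenient way to express the 50% rule.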
Exercise 2: Text Data Cleaning
Clean a dataset of product reviews:
- Remove HTML tags
- Convert to lowercase
- Remove punctuation and special characters
- Remove extra whitespace
- Remove stop words
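The steps above can be sketched with pandas string methods (the stop-word list here is a tiny illustrative subset, not a real NLP resource):

```python
import pandas as pd

reviews = pd.Series(['<b>Great</b> product!!!', '  NOT good   at all ', 'It is OK.'])
stop_words = {'it', 'is', 'at', 'the', 'a'}  # illustrative subset

clean = (reviews
         .str.replace(r'<[^>]+>', '', regex=True)   # remove HTML tags
         .str.lower()
         .str.replace(r'[^\w\s]', '', regex=True)   # remove punctuation
         .str.replace(r'\s+', ' ', regex=True)      # normalize whitespace
         .str.strip())
# Remove stop words token by token
clean = clean.apply(lambda s: ' '.join(w for w in s.split() if w not in stop_words))
print(clean.tolist())
# → ['great product', 'not good all', 'ok']
```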
Exercise 3: Outlier Detection
Analyze a dataset of house prices:
- Detect outliers using IQR method
- Detect outliers using Z-score method
- Compare the two methods
- Decide how to handle the outliers
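For the comparison step, both detectors can be run side by side on the same column (the house-price data here is synthetic):

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic prices with one extreme value appended
rng = np.random.default_rng(0)
prices = pd.Series(np.append(rng.normal(300_000, 50_000, 200), [2_000_000]))

# Z-score flags (> 3 standard deviations)
z_out = np.abs(stats.zscore(prices)) > 3
# IQR flags (outside 1.5 * IQR)
q1, q3 = prices.quantile([0.25, 0.75])
iqr = q3 - q1
iqr_out = (prices < q1 - 1.5 * iqr) | (prices > q3 + 1.5 * iqr)
print(f"Z-score flags: {z_out.sum()}, IQR flags: {iqr_out.sum()}")
```

The IQR method typically flags more points, since its bounds are tighter than 3 standard deviations on normal-ish data.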
Exercise 4: Feature Engineering
Create new features from transaction data:
- Extract date features (month, day of week, etc.)
- Calculate rolling averages
- Create lag features
- Generate categorical features from continuous variables
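Of the steps above, the last one (categorical features from continuous variables) is not shown earlier in the chapter; `pd.cut` is one way to sketch it:

```python
import pandas as pd

amounts = pd.Series([15, 80, 250, 900, 4000])
# Fixed-width bins with labels (bin edges here are illustrative)
amounts_binned = pd.cut(amounts,
                        bins=[0, 50, 500, float('inf')],
                        labels=['small', 'medium', 'large'])
print(amounts_binned.tolist())
# → ['small', 'medium', 'medium', 'large', 'large']
# Quantile-based bins (pd.qcut) give roughly equal-sized groups instead
amounts_quartiles = pd.qcut(amounts, q=4, labels=['q1', 'q2', 'q3', 'q4'])
```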
Exercise 5: Data Validation Pipeline
Create a data validation system that:
- Checks for required columns
- Validates data types
- Checks value ranges
- Ensures referential integrity
- Generates validation reports
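A minimal schema check along these lines (the column names and expected dtype kinds below are hypothetical):

```python
import pandas as pd

# Hypothetical schema: required column -> expected numpy dtype kind
# ('i' = integer, 'f' = float, 'O' = object/string)
SCHEMA = {'order_id': 'i', 'amount': 'f', 'region': 'O'}

def validate(df, schema=SCHEMA):
    errors = []
    for col, kind in schema.items():
        if col not in df.columns:
            errors.append(f"missing column: {col}")
        elif df[col].dtype.kind != kind:
            errors.append(f"bad dtype for {col}: {df[col].dtype}")
    # Example range rule: amounts must be non-negative
    if 'amount' in df.columns and (df['amount'] < 0).any():
        errors.append("negative values in amount")
    return errors

df = pd.DataFrame({'order_id': [1, 2], 'amount': [9.99, -1.0]})
print(validate(df))
# → ['missing column: region', 'negative values in amount']
```

Returning a list of error strings (rather than raising) makes it easy to collect all issues into a single validation report.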
Summary
Data cleaning transforms raw data into analysis-ready data:
Missing Data Handling:
# Detection
df.isnull().sum() # Count missing values
# Removal
df.dropna() # Remove missing rows
df.dropna(axis=1) # Remove missing columns
# Imputation
df.fillna(0) # Fill with constant
df.fillna(df.mean(numeric_only=True)) # Fill numeric columns with their means
Data Type Conversion:
# Safe conversion
pd.to_numeric(df['col'], errors='coerce')
pd.to_datetime(df['date'], errors='coerce')
# String cleaning
df['col'].str.strip().str.lower()
Outlier Handling:
# IQR method
Q1, Q3 = df['col'].quantile([0.25, 0.75])
IQR = Q3 - Q1
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR
df_clean = df[(df['col'] >= lower) & (df['col'] <= upper)]
Key Concepts:
- Missing data detection and imputation
- Data type conversion and validation
- Outlier detection and handling
- String data cleaning
- Feature engineering
- Data quality validation
Next: Data Visualization - Creating charts with matplotlib! 📊