Data Cleaning: Handling Missing Data and Preprocessing

Welcome to Data Cleaning! Think of data cleaning as housekeeping for your data - you organize, fix, and prepare your data so it's ready for analysis. In practice, data scientists often report spending around 80% of their time cleaning and preparing data.

Why Data Cleaning Matters

Garbage in, garbage out - clean data leads to reliable insights:

# Bad data leads to wrong conclusions
dirty_data = pd.DataFrame({
    'sales': [100, 200, None, 400, 'not_a_number'],
    'date': ['2023-01-01', 'invalid_date', '2023-01-03', None, '2023-01-05']
})

# Clean data enables proper analysis
clean_data = pd.DataFrame({
    'sales': [100, 200, 250, 400, 350],  # Missing values filled
    'date': pd.to_datetime(['2023-01-01', '2023-01-02', '2023-01-03', '2023-01-04', '2023-01-05'])
})

Identifying Missing Data

Types of Missing Data

import pandas as pd
import numpy as np

# Create sample data with different types of missing values
df = pd.DataFrame({
    'A': [1, 2, np.nan, 4, 5],           # NaN (Not a Number)
    'B': [1, 2, None, 4, 5],             # None (Python null)
    'C': [1, 2, '', 4, 5],               # Empty string
    'D': [1, 2, 'N/A', 4, 5],            # Custom missing indicator
    'E': [1, 2, 'NULL', 4, 5]            # Another custom indicator
})

print("DataFrame:")
print(df)
print("\nData types:")
print(df.dtypes)

Detecting Missing Values

# Check for NaN/None values
print("Is null:")
print(df.isnull())
#       A      B      C      D      E
# 0  False  False  False  False  False
# 1  False  False  False  False  False
# 2   True   True  False  False  False
# 3  False  False  False  False  False
# 4  False  False  False  False  False

# Count missing values per column
print("\nMissing values per column:")
print(df.isnull().sum())
# A    1
# B    1
# C    0
# D    0
# E    0
# dtype: int64

# Total missing values
print(f"\nTotal missing values: {df.isnull().sum().sum()}")

# Percentage of missing values
print("\nPercentage missing:")
print(df.isnull().mean() * 100)

Custom Missing Value Detection

# Define custom missing value indicators
missing_indicators = ['', 'N/A', 'NULL', 'null', 'NaN', 'n/a', 'missing']

# Check for missing values including custom indicators
def is_missing(value):
    if pd.isna(value):  # Catches NaN, None, NaT
        return True
    if isinstance(value, str) and value.strip().lower() in [x.lower() for x in missing_indicators]:
        return True
    return False

# Apply custom missing detection element-wise
# (DataFrame.applymap was renamed to DataFrame.map in pandas 2.1)
missing_mask = df.map(is_missing)
print("Custom missing detection:")
print(missing_mask)

print(f"\nTotal missing (including custom): {missing_mask.sum().sum()}")
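If the data comes from a CSV file, the same custom indicators can be declared once at load time instead: `pd.read_csv` accepts an `na_values` list and converts matching tokens to NaN while parsing. A minimal sketch (the inline CSV is a made-up stand-in for a real file):

```python
import io

import pandas as pd

# Hypothetical inline CSV standing in for a real file
csv_text = io.StringIO("A,B\n1,x\n2,N/A\nNULL,y\n,z\n")

# na_values converts the listed tokens to NaN on the way in;
# empty fields become NaN by default
df_csv = pd.read_csv(csv_text, na_values=['N/A', 'NULL', 'n/a', 'missing'])
print(df_csv.isnull().sum())
```

This keeps the cleaning rule in one place instead of re-applying a custom detector after every load.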

Handling Missing Data

Removal Strategies

# Sample data with missing values
df = pd.DataFrame({
    'A': [1, 2, np.nan, 4, 5],
    'B': [1, np.nan, 3, 4, 5],
    'C': [1, 2, 3, np.nan, 5],
    'D': [1, 2, 3, 4, np.nan]
})

print("Original data:")
print(df)
print(f"Shape: {df.shape}")

# Remove rows with any missing values
df_dropna = df.dropna()
print(f"\nAfter dropna(): {df_dropna.shape}")
print(df_dropna)

# Remove rows where all values are missing
df_dropna_all = df.dropna(how='all')
print(f"\nAfter dropna(how='all'): {df_dropna_all.shape}")

# Remove columns with any missing values
df_dropna_cols = df.dropna(axis=1)
print(f"\nAfter dropna(axis=1): {df_dropna_cols.shape}")

# Remove rows with missing values in specific columns
df_dropna_subset = df.dropna(subset=['A', 'B'])
print(f"\nAfter dropna(subset=['A', 'B']): {df_dropna_subset.shape}")
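`dropna` also takes a `thresh` parameter: the minimum number of non-missing values a row needs in order to survive. That makes rules like "drop rows that are mostly empty" a one-liner. A small sketch with made-up data:

```python
import numpy as np
import pandas as pd

df_sparse = pd.DataFrame({
    'A': [1, np.nan, 3],
    'B': [4, np.nan, 6],
    'C': [7, np.nan, np.nan],
    'D': [8, 9, 10]
})

# Keep only rows with at least 2 non-missing values;
# row 1 has just one (column D) and is dropped
df_thresh = df_sparse.dropna(thresh=2)
print(df_thresh.shape)  # (2, 4)
```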

Imputation Strategies

# Forward fill (use previous value)
# Note: fillna(method='ffill') is deprecated; call ffill()/bfill() directly
df_ffill = df.ffill()
print("Forward fill:")
print(df_ffill)

# Backward fill (use next value)
df_bfill = df.bfill()
print("\nBackward fill:")
print(df_bfill)

# Fill with specific value
df_fill_zero = df.fillna(0)
print("\nFill with 0:")
print(df_fill_zero)

# Fill with column mean (numeric columns only)
numeric_cols = df.select_dtypes(include=[np.number]).columns
df_fill_mean = df.copy()
df_fill_mean[numeric_cols] = df_fill_mean[numeric_cols].fillna(df_fill_mean[numeric_cols].mean())
print("\nFill with mean:")
print(df_fill_mean)

# Fill with column median
df_fill_median = df.copy()
df_fill_median[numeric_cols] = df_fill_median[numeric_cols].fillna(df_fill_median[numeric_cols].median())
print("\nFill with median:")
print(df_fill_median)

# Fill with column mode (most frequent value)
df_fill_mode = df.copy()
for col in df.columns:
    mode_value = df[col].mode()
    if not mode_value.empty:
        df_fill_mode[col] = df_fill_mode[col].fillna(mode_value[0])
print("\nFill with mode:")
print(df_fill_mode)
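For ordered numeric data such as time series, `interpolate()` is another option: it estimates each gap from its neighboring values rather than from a global statistic like the mean. A minimal example:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, np.nan, 4.0, 5.0])

# Linear interpolation fills the gap from its neighbors: (2 + 4) / 2 = 3
s_interp = s.interpolate()
print(s_interp.tolist())  # [1.0, 2.0, 3.0, 4.0, 5.0]
```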

Advanced Imputation

from sklearn.impute import KNNImputer, SimpleImputer

# KNN imputation (uses similar rows to fill missing values)
knn_imputer = KNNImputer(n_neighbors=2)
df_numeric = df.select_dtypes(include=[np.number])
df_knn = pd.DataFrame(knn_imputer.fit_transform(df_numeric), 
                      columns=df_numeric.columns, 
                      index=df.index)
print("KNN imputation:")
print(df_knn)

# Iterative imputation (MICE - Multiple Imputation by Chained Equations)
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

iterative_imputer = IterativeImputer(random_state=42)
df_iterative = pd.DataFrame(iterative_imputer.fit_transform(df_numeric), 
                           columns=df_numeric.columns, 
                           index=df.index)
print("\nIterative imputation:")
print(df_iterative)

Data Type Conversion

Converting Data Types

# Sample messy data
df = pd.DataFrame({
    'age': ['25', '30', '35', '40', '45'],
    'salary': ['$50,000', '$60,000', '$70,000', '$80,000', '$90,000'],
    'date': ['2023-01-01', '2023-02-01', '2023-03-01', '2023-04-01', '2023-05-01'],
    'rating': ['4.5', '3.8', '4.2', '4.9', '3.5'],
    'is_active': ['True', 'False', 'True', 'True', 'False']
})

print("Original dtypes:")
print(df.dtypes)

# Convert to numeric
df['age'] = pd.to_numeric(df['age'])
df['rating'] = pd.to_numeric(df['rating'])

# Convert salary (remove $ and comma)
df['salary'] = df['salary'].str.replace(r'[\$,]', '', regex=True).astype(float)

# Convert to datetime
df['date'] = pd.to_datetime(df['date'])

# Convert to boolean
df['is_active'] = df['is_active'].map({'True': True, 'False': False})

print("\nConverted dtypes:")
print(df.dtypes)
print("\nConverted data:")
print(df)

Handling Conversion Errors

# Data with conversion errors
df = pd.DataFrame({
    'numbers': ['1', '2', 'three', '4', '5'],
    'dates': ['2023-01-01', '2023-02-01', 'not_a_date', '2023-04-01', '2023-05-01']
})

# Safe numeric conversion
df['numbers_safe'] = pd.to_numeric(df['numbers'], errors='coerce')
print("Safe numeric conversion:")
print(df[['numbers', 'numbers_safe']])

# Safe date conversion
df['dates_safe'] = pd.to_datetime(df['dates'], errors='coerce')
print("\nSafe date conversion:")
print(df[['dates', 'dates_safe']])

# Check for conversion failures
print(f"\nNumeric conversion failures: {df['numbers_safe'].isnull().sum()}")
print(f"Date conversion failures: {df['dates_safe'].isnull().sum()}")

String Data Cleaning

Text Preprocessing

# Sample text data
df = pd.DataFrame({
    'names': ['  Alice Smith  ', 'BOB johnson', '  charlie brown  ', 'Diana Prince'],
    'emails': ['alice@example.com', 'bob@test.COM', 'CHARLIE@demo.com', 'diana@WORK.COM'],
    'comments': ['Great product!', 'NOT GOOD', 'excellent work!!!', 'ok service']
})

# String cleaning
df['names_clean'] = df['names'].str.strip().str.title()
print("Clean names:")
print(df[['names', 'names_clean']])

# Email normalization
df['emails_clean'] = df['emails'].str.lower().str.strip()
print("\nClean emails:")
print(df[['emails', 'emails_clean']])

# Text standardization
df['comments_clean'] = (df['comments']
    .str.lower()
    .str.strip()
    .str.replace(r'[^\w\s]', '', regex=True)  # Remove punctuation
    .str.replace(r'\s+', ' ', regex=True))    # Normalize whitespace
print("\nClean comments:")
print(df[['comments', 'comments_clean']])

String Extraction and Parsing

# Complex string data
df = pd.DataFrame({
    'full_address': [
        '123 Main St, New York, NY 10001',
        '456 Oak Ave, Los Angeles, CA 90210',
        '789 Pine Rd, Chicago, IL 60601'
    ],
    'phone': [
        '(555) 123-4567',
        '555-987-6543',
        '555.456.7890'
    ],
    'product_info': [
        'Laptop - Model: X1, Price: $999',
        'Phone - Model: Y2, Price: $699',
        'Tablet - Model: Z3, Price: $399'
    ]
})

# Extract address components
df[['street', 'city', 'state_zip']] = df['full_address'].str.split(', ', expand=True)
df[['state', 'zip']] = df['state_zip'].str.split(' ', expand=True)
df = df.drop('state_zip', axis=1)

print("Parsed addresses:")
print(df[['full_address', 'street', 'city', 'state', 'zip']])

# Standardize phone numbers
df['phone_clean'] = (df['phone']
    .str.replace(r'[^\d]', '', regex=True)  # Remove non-digits
    .str.replace(r'^(\d{3})(\d{3})(\d{4})$', r'(\1) \2-\3', regex=True))  # Format

print("\nClean phone numbers:")
print(df[['phone', 'phone_clean']])

# Extract product information
df['model'] = df['product_info'].str.extract(r'Model: (\w+)')
df['price'] = df['product_info'].str.extract(r'Price: \$(\d+)').astype(float)

print("\nExtracted product info:")
print(df[['product_info', 'model', 'price']])

Outlier Detection and Handling

Statistical Methods

import numpy as np
from scipy import stats

# Sample data with outliers
np.random.seed(42)
data = np.random.normal(100, 10, 100)  # Normal distribution
data = np.append(data, [200, 250, 300])  # Add outliers

df = pd.DataFrame({'values': data})

print("Data statistics:")
print(df['values'].describe())

# Z-score method (values > 3 standard deviations)
z_scores = np.abs(stats.zscore(df['values']))
outliers_z = df[z_scores > 3]
print(f"\nOutliers by Z-score (>3): {len(outliers_z)}")
print(outliers_z)

# IQR method (values outside 1.5 * IQR)
Q1 = df['values'].quantile(0.25)
Q3 = df['values'].quantile(0.75)
IQR = Q3 - Q1

lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

outliers_iqr = df[(df['values'] < lower_bound) | (df['values'] > upper_bound)]
print(f"\nOutliers by IQR method: {len(outliers_iqr)}")
print(outliers_iqr)

Handling Outliers

# Remove outliers
df_no_outliers = df[(df['values'] >= lower_bound) & (df['values'] <= upper_bound)]
print(f"Data points after removing outliers: {len(df_no_outliers)}")

# Cap outliers (winsorization)
df_capped = df.copy()
df_capped['values'] = np.clip(df_capped['values'], lower_bound, upper_bound)
print("After capping outliers:")
print(df_capped['values'].describe())

# Replace outliers with median
median_value = df['values'].median()
df_replaced = df.copy()
outlier_mask = (df['values'] < lower_bound) | (df['values'] > upper_bound)
df_replaced.loc[outlier_mask, 'values'] = median_value
print(f"\nOutliers replaced with median ({median_value}):")
print(df_replaced[outlier_mask])

Data Normalization and Scaling

Min-Max Scaling

from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler

# Sample data
df = pd.DataFrame({
    'feature1': [1, 2, 3, 4, 5],
    'feature2': [100, 200, 300, 400, 500],
    'feature3': [10, 20, 30, 40, 50]
})

# Min-Max scaling (0 to 1)
scaler = MinMaxScaler()
df_minmax = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)
print("Min-Max scaled:")
print(df_minmax)

# Custom range scaling (-1 to 1)
scaler_custom = MinMaxScaler(feature_range=(-1, 1))
df_custom = pd.DataFrame(scaler_custom.fit_transform(df), columns=df.columns)
print("\nCustom range scaled (-1 to 1):")
print(df_custom)

Standard Scaling (Z-score)

# Standard scaling (mean=0, std=1)
scaler = StandardScaler()
df_standard = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)
print("Standard scaled:")
print(df_standard)
print(f"Mean: {df_standard.mean().round(10).tolist()}")
# pandas .std() defaults to ddof=1; StandardScaler uses the population std (ddof=0)
print(f"Std: {df_standard.std(ddof=0).round(10).tolist()}")

Robust Scaling

# Robust scaling (uses median and IQR, less sensitive to outliers)
scaler = RobustScaler()
df_robust = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)
print("Robust scaled:")
print(df_robust)

Encoding Categorical Variables

Label Encoding

from sklearn.preprocessing import LabelEncoder

# Categorical data
df = pd.DataFrame({
    'color': ['red', 'blue', 'green', 'red', 'blue'],
    'size': ['S', 'M', 'L', 'M', 'S'],
    'quality': ['good', 'excellent', 'good', 'poor', 'excellent']
})

# Label encoding (converts each category to an integer)
# Note: LabelEncoder is designed for target labels; for feature columns,
# consider OrdinalEncoder or pd.factorize instead
le = LabelEncoder()
df_encoded = df.copy()
for col in df.columns:
    df_encoded[col] = le.fit_transform(df[col])

print("Label encoded:")
print(df_encoded)

# Show mapping
for col in df.columns:
    le = LabelEncoder()
    le.fit(df[col])
    mapping = dict(zip(le.classes_, le.transform(le.classes_)))
    print(f"{col} mapping: {mapping}")

One-Hot Encoding

# One-hot encoding (creates binary columns)
df_onehot = pd.get_dummies(df, prefix=['color', 'size', 'quality'])
print("One-hot encoded:")
print(df_onehot)

# Drop first category to avoid multicollinearity
df_onehot_drop = pd.get_dummies(df, drop_first=True, prefix=['color', 'size', 'quality'])
print("\nOne-hot encoded (drop first):")
print(df_onehot_drop)

Feature Engineering

Creating New Features

# Sample sales data
df = pd.DataFrame({
    'date': pd.date_range('2023-01-01', periods=100, freq='D'),
    'sales': np.random.randint(100, 1000, 100),
    'customers': np.random.randint(10, 50, 100)
})

# Date features
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day'] = df['date'].dt.day
df['weekday'] = df['date'].dt.weekday
df['is_weekend'] = df['date'].dt.weekday >= 5

# Sales per customer
df['sales_per_customer'] = df['sales'] / df['customers']

# Rolling statistics
df['sales_7d_avg'] = df['sales'].rolling(window=7).mean()
df['sales_7d_std'] = df['sales'].rolling(window=7).std()

# Lag features
df['sales_prev_day'] = df['sales'].shift(1)
df['sales_change'] = df['sales'] - df['sales_prev_day']

# Cumulative features
df['cumulative_sales'] = df['sales'].cumsum()
df['sales_pct_change'] = df['sales'].pct_change()

print("Engineered features:")
print(df.head(10))

Data Quality Validation

Data Quality Checks

def validate_dataframe(df):
    """Comprehensive data quality validation."""
    issues = []
    
    # Check for missing values
    missing = df.isnull().sum()
    if missing.any():
        issues.append(f"Missing values found: {missing[missing > 0].to_dict()}")
    
    # Check for duplicates
    duplicates = df.duplicated().sum()
    if duplicates > 0:
        issues.append(f"Duplicate rows found: {duplicates}")
    
    # Check data types
    for col in df.columns:
        if df[col].dtype == 'object':
            # Check for mixed types in object columns
            types = df[col].apply(type).value_counts()
            if len(types) > 1:
                issues.append(f"Mixed data types in {col}: {types.to_dict()}")
    
    # Check for outliers (numeric columns)
    numeric_cols = df.select_dtypes(include=[np.number]).columns
    for col in numeric_cols:
        z_scores = np.abs(stats.zscore(df[col].dropna()))
        outliers = (z_scores > 3).sum()
        if outliers > 0:
            issues.append(f"Outliers in {col}: {outliers} values > 3 std dev")
    
    # Check for negative values where they shouldn't be
    positive_cols = ['sales', 'price', 'quantity', 'age']  # domain knowledge
    for col in positive_cols:
        if col in df.columns:
            negative = (df[col] < 0).sum()
            if negative > 0:
                issues.append(f"Negative values in {col}: {negative}")
    
    return issues

# Test validation
df_test = pd.DataFrame({
    'sales': [100, 200, -50, 300, 400],
    'price': [10, 20, 30, np.nan, 50],
    'category': ['A', 'B', 'A', 'B', 'A']
})

issues = validate_dataframe(df_test)
if issues:
    print("Data quality issues found:")
    for issue in issues:
        print(f"- {issue}")
else:
    print("No data quality issues found!")

Practical Examples

Example 1: Customer Data Cleaning

# Raw customer data
customers = pd.DataFrame({
    'customer_id': ['C001', 'C002', 'C003', 'C004', 'C005'],
    'name': ['  John Doe  ', 'Jane Smith', 'BOB Johnson', None, 'Alice Brown'],
    'email': ['john@email.com', 'jane@test.COM', 'bob@work.com', 'invalid-email', None],
    'age': ['25', 'thirty', '35', '40', '45'],
    'signup_date': ['2023-01-01', '2023-02-01', 'not-a-date', '2023-04-01', '2023-05-01'],
    'total_purchases': ['$1,200', '$500', '800', 'N/A', '$2,500']
})

print("Raw data:")
print(customers)
print(f"Missing values: {customers.isnull().sum().sum()}")

# Clean names
customers['name_clean'] = (customers['name']
    .str.strip()
    .str.title()
    .fillna('Unknown'))

# Clean emails
customers['email_clean'] = (customers['email']
    .str.lower()
    .str.strip())

# Validate emails
import re
email_pattern = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'
customers['valid_email'] = customers['email_clean'].str.match(email_pattern, na=False)

# Clean age
customers['age_clean'] = pd.to_numeric(customers['age'], errors='coerce')

# Clean dates
customers['signup_date_clean'] = pd.to_datetime(customers['signup_date'], errors='coerce')

# Clean purchases
customers['total_purchases_clean'] = (customers['total_purchases']
    .str.replace(r'[\$,]', '', regex=True)
    .replace('N/A', np.nan)
    .astype(float))

print("\nCleaned data:")
print(customers[['name_clean', 'email_clean', 'valid_email', 'age_clean', 'signup_date_clean', 'total_purchases_clean']])

# Data quality report
print(f"\nData quality report:")
print(f"Valid emails: {customers['valid_email'].sum()}/{len(customers)}")
print(f"Missing ages: {customers['age_clean'].isnull().sum()}")
print(f"Invalid dates: {customers['signup_date_clean'].isnull().sum()}")
print(f"Missing purchases: {customers['total_purchases_clean'].isnull().sum()}")

Example 2: Sales Data Preprocessing

# Raw sales data
sales = pd.DataFrame({
    'transaction_id': range(1, 101),
    'date': pd.date_range('2023-01-01', periods=100, freq='D'),
    'product': np.random.choice(['Laptop', 'Phone', 'Tablet', 'Headphones'], 100),
    'quantity': np.random.randint(1, 10, 100),
    'unit_price': np.random.uniform(50, 1000, 100),
    'customer_type': np.random.choice(['Regular', 'VIP', 'New'], 100),
    'region': np.random.choice(['North', 'South', 'East', 'West'], 100)
})

# Introduce some data quality issues
sales.loc[10:15, 'quantity'] = -5  # Negative quantities
sales.loc[20:25, 'unit_price'] = np.nan  # Missing prices
sales['date'] = sales['date'].astype(object)  # Allow strings in the date column
sales.loc[30:35, 'date'] = 'invalid_date'  # Invalid dates
sales.loc[40:45, 'product'] = np.nan  # Missing products

print("Data with quality issues:")
print(f"Missing values: {sales.isnull().sum().sum()}")
print(f"Negative quantities: {(sales['quantity'] < 0).sum()}")

# Clean quantity (cap negatives at zero; clip does not drop rows)
sales['quantity_clean'] = sales['quantity'].clip(lower=0)

# Fill missing prices with median
median_price = sales['unit_price'].median()
sales['unit_price_clean'] = sales['unit_price'].fillna(median_price)

# Fill missing products with mode
mode_product = sales['product'].mode()[0]
sales['product_clean'] = sales['product'].fillna(mode_product)

# Fix dates
sales['date_clean'] = pd.to_datetime(sales['date'], errors='coerce')
# Carry the previous valid date forward (fillna(method='ffill') is deprecated)
sales['date_clean'] = sales['date_clean'].ffill()

# Calculate total sales
sales['total_sales'] = sales['quantity_clean'] * sales['unit_price_clean']

# Create derived features
sales['month'] = sales['date_clean'].dt.month
sales['is_high_value'] = sales['total_sales'] > 500

print("\nCleaned data summary:")
print(f"Total transactions: {len(sales)}")
print(f"Total sales: ${sales['total_sales'].sum():,.2f}")
print(f"Average transaction: ${sales['total_sales'].mean():.2f}")
print(f"High-value transactions: {sales['is_high_value'].sum()}")

# Sales by product
product_sales = sales.groupby('product_clean')['total_sales'].sum().sort_values(ascending=False)
print("\nSales by product:")
print(product_sales)

# Monthly trends
monthly_sales = sales.groupby('month')['total_sales'].sum()
print("\nMonthly sales:")
print(monthly_sales)

Best Practices

1. Document Your Cleaning Process

def clean_customer_data(df):
    """
    Clean customer data following these steps:
    1. Remove leading/trailing whitespace
    2. Standardize email formats
    3. Convert dates to datetime
    4. Handle missing values appropriately
    5. Validate data ranges
    """
    df = df.copy()
    df['name'] = df['name'].str.strip().str.title().fillna('Unknown')
    df['email'] = df['email'].str.strip().str.lower()
    df['signup_date'] = pd.to_datetime(df['signup_date'], errors='coerce')
    df['age'] = pd.to_numeric(df['age'], errors='coerce')
    df.loc[(df['age'] < 0) | (df['age'] > 120), 'age'] = np.nan
    return df

2. Create Data Quality Checks

def data_quality_report(df):
    """Generate comprehensive data quality report."""
    report = {
        'total_rows': len(df),
        'total_columns': len(df.columns),
        'missing_values': df.isnull().sum().to_dict(),
        'duplicate_rows': df.duplicated().sum(),
        'data_types': df.dtypes.to_dict()
    }
    
    # Add column-specific checks
    for col in df.columns:
        if df[col].dtype in ['int64', 'float64']:
            report[f'{col}_stats'] = {
                'min': df[col].min(),
                'max': df[col].max(),
                'mean': df[col].mean(),
                'zeros': (df[col] == 0).sum()
            }
    
    return report

3. Use Pipelines for Reproducible Cleaning

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Define preprocessing steps
numeric_features = ['age', 'salary']
categorical_features = ['department', 'location']

numeric_transformer = Pipeline(steps=[
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder(drop='first'))
])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ])

# Apply to a feature DataFrame X containing the columns listed above
X_processed = preprocessor.fit_transform(X)

Practice Exercises

Exercise 1: Basic Data Cleaning

Given a dataset with missing values:

  1. Identify all missing values
  2. Remove rows with more than 50% missing data
  3. Fill remaining missing values appropriately
  4. Report data quality improvements
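One possible starting point for this exercise, using a small made-up dataset (column names are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical dataset with missing values
ex1 = pd.DataFrame({
    'a': [1.0, np.nan, 3.0, np.nan],
    'b': [np.nan, np.nan, 6.0, 7.0],
    'c': [8.0, np.nan, 10.0, 11.0]
})

before = ex1.isnull().sum().sum()                 # 1. identify missing values
min_present = int(np.ceil(len(ex1.columns) / 2))  # at least half the columns
ex1 = ex1.dropna(thresh=min_present)              # 2. drop rows >50% missing
ex1 = ex1.fillna(ex1.mean())                      # 3. impute the rest with means
after = ex1.isnull().sum().sum()
print(f"Missing values: {before} -> {after}")     # 4. report the improvement
```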

Exercise 2: Text Data Cleaning

Clean a dataset of product reviews:

  1. Remove HTML tags
  2. Convert to lowercase
  3. Remove punctuation and special characters
  4. Remove extra whitespace
  5. Remove stop words
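A sketch of the five steps on two made-up reviews. The stop-word list here is a tiny illustrative set; libraries such as NLTK ship complete ones.

```python
import re

import pandas as pd

reviews = pd.Series(['<b>Great</b> product!!!', '  NOT   good  '])

# Tiny illustrative stop-word list, not a real one
stop_words = {'a', 'an', 'the', 'not'}

def clean_review(text):
    text = re.sub(r'<[^>]+>', ' ', text)                       # 1. strip HTML tags
    text = text.lower()                                        # 2. lowercase
    text = re.sub(r'[^\w\s]', '', text)                        # 3. drop punctuation
    text = re.sub(r'\s+', ' ', text).strip()                   # 4. collapse whitespace
    return ' '.join(w for w in text.split() if w not in stop_words)  # 5. stop words

print(reviews.map(clean_review).tolist())  # ['great product', 'good']
```

Note how removing "not" flips the meaning of the second review; if you plan to do sentiment analysis, keep negation words out of your stop-word list.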

Exercise 3: Outlier Detection

Analyze a dataset of house prices:

  1. Detect outliers using IQR method
  2. Detect outliers using Z-score method
  3. Compare the two methods
  4. Decide how to handle the outliers

Exercise 4: Feature Engineering

Create new features from transaction data:

  1. Extract date features (month, day of week, etc.)
  2. Calculate rolling averages
  3. Create lag features
  4. Generate categorical features from continuous variables
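The four steps can be sketched on hypothetical transaction data (column names and bin edges are illustrative):

```python
import numpy as np
import pandas as pd

np.random.seed(1)
# Hypothetical transaction data
tx = pd.DataFrame({
    'date': pd.date_range('2023-01-01', periods=30, freq='D'),
    'amount': np.random.uniform(10, 500, 30)
})

tx['month'] = tx['date'].dt.month                     # 1. date features
tx['weekday'] = tx['date'].dt.weekday
tx['amount_7d_avg'] = tx['amount'].rolling(7).mean()  # 2. rolling average
tx['amount_prev'] = tx['amount'].shift(1)             # 3. lag feature
tx['amount_band'] = pd.cut(tx['amount'],              # 4. continuous -> categorical
                           bins=[0, 100, 300, 500],
                           labels=['low', 'mid', 'high'])
print(tx.head())
```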

Exercise 5: Data Validation Pipeline

Create a data validation system that:

  1. Checks for required columns
  2. Validates data types
  3. Checks value ranges
  4. Ensures referential integrity
  5. Generates validation reports
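A minimal sketch covering checks 1-3 plus a simple report list; referential integrity (check 4) needs a second table and is left as an extension. All names here are hypothetical:

```python
import numpy as np
import pandas as pd

def validate(df, required_cols, expected_dtypes, value_ranges):
    """Return a list of validation failures (empty list means all checks pass)."""
    failures = []
    for col in required_cols:                              # 1. required columns
        if col not in df.columns:
            failures.append(f"missing column: {col}")
    for col, dtype in expected_dtypes.items():             # 2. data types
        if col in df.columns and df[col].dtype != dtype:
            failures.append(f"{col}: expected {dtype}, got {df[col].dtype}")
    for col, (lo, hi) in value_ranges.items():             # 3. value ranges
        if col in df.columns:
            bad = int(((df[col] < lo) | (df[col] > hi)).sum())
            if bad:
                failures.append(f"{col}: {bad} values outside [{lo}, {hi}]")
    return failures                                        # 5. the report itself

sample = pd.DataFrame({'age': [25, -3, 40], 'score': [0.5, 0.9, 1.4]})
issues = validate(sample, ['age', 'score', 'customer_id'],
                  {'age': np.dtype('int64')},
                  {'age': (0, 120), 'score': (0.0, 1.0)})
print(issues)
```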

Summary

Data cleaning transforms raw data into analysis-ready data:

Missing Data Handling:

# Detection
df.isnull().sum()  # Count missing values

# Removal
df.dropna()        # Remove missing rows
df.dropna(axis=1)  # Remove missing columns

# Imputation
df.fillna(0)                           # Fill with constant
df.fillna(df.mean(numeric_only=True))  # Fill with column means

Data Type Conversion:

# Safe conversion
pd.to_numeric(df['col'], errors='coerce')
pd.to_datetime(df['date'], errors='coerce')

# String cleaning
df['col'].str.strip().str.lower()

Outlier Handling:

# IQR method
Q1, Q3 = df['col'].quantile([0.25, 0.75])
IQR = Q3 - Q1
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR
df_clean = df[(df['col'] >= lower) & (df['col'] <= upper)]

Key Concepts:

  • Missing data detection and imputation
  • Data type conversion and validation
  • Outlier detection and handling
  • String data cleaning
  • Feature engineering
  • Data quality validation

Next: Data Visualization - Creating charts with matplotlib! 📊