Data Science Fundamentals
Welcome to Data Science Fundamentals! Think of data science as detective work with numbers - you collect clues (data), analyze patterns, and tell stories that help people make better decisions.
What You'll Learn
This module introduces core data science concepts with Python:
- NumPy Arrays - Efficient numerical computing
- Pandas DataFrames - Data manipulation and analysis
- Data Cleaning - Handling missing data and outliers
- Data Visualization - Creating charts with matplotlib
- Statistical Analysis - Basic statistics and correlations
- Real-World Projects - Analyzing actual datasets
Why Data Science Matters
Data science is reshaping work across many industries:
- Business - Customer insights, sales forecasting, market analysis
- Healthcare - Disease prediction, treatment optimization, drug discovery
- Finance - Risk assessment, fraud detection, algorithmic trading
- Sports - Performance analysis, game strategy, player evaluation
- Social Good - Climate modeling, education improvement, policy analysis
Real-World Applications
Data science powers:
- Recommendation Systems - Netflix shows, Amazon products
- Autonomous Vehicles - Self-driving car navigation
- Medical Diagnosis - Cancer detection, radiology analysis
- Financial Trading - High-frequency stock trading algorithms
- Social Media - Content moderation, trend analysis
- Climate Science - Weather prediction, environmental monitoring
Module Structure
11-data-science/
├── 01-numpy-arrays.md # Numerical computing with arrays
├── 02-pandas-dataframes.md # Data manipulation and analysis
├── 03-data-cleaning.md # Handling missing data and preprocessing
├── 04-data-visualization.md # Charts and plots with matplotlib
├── 05-statistical-analysis.md # Basic statistics and correlations
└── 06-real-world-project.md # Complete data analysis project
Prerequisites
Before starting this module, you should be familiar with:
- Python basics (variables, loops, functions)
- Basic mathematics (algebra, statistics)
- File handling (reading/writing files)
Tools You'll Use
NumPy
The foundation of scientific computing in Python:
import numpy as np
# Create arrays
arr = np.array([1, 2, 3, 4, 5])
matrix = np.array([[1, 2], [3, 4]])
# Mathematical operations
result = arr * 2 + 10
mean = np.mean(arr)
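To make the "efficient numerical computing" point concrete, here is a minimal sketch of elementwise arithmetic, an axis-wise reduction, and broadcasting on a 2-D array. The temperature values and variable names are made up for illustration.

```python
import numpy as np

# Hypothetical data: temperatures in Celsius for two cities (rows) over three days (columns)
temps_c = np.array([[20.0, 22.0, 24.0],
                    [10.0, 12.0, 14.0]])

# Elementwise arithmetic applies to every entry at once, with no explicit loop
temps_f = temps_c * 9 / 5 + 32

# Reduction along an axis: axis=1 averages across the columns, one value per row
city_means = temps_c.mean(axis=1)

# Broadcasting: subtract each row's mean from that row.
# city_means has shape (2,), so reshape it to (2, 1) to line up with the rows.
centered = temps_c - city_means.reshape(-1, 1)

print(temps_f)
print(city_means)
print(centered)
```

The same operations written as nested Python loops would be longer and typically much slower on large arrays, which is the main reason to keep numerical data in NumPy arrays.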
Pandas
The Excel of Python for data manipulation:
import pandas as pd
# Read data
df = pd.read_csv('data.csv')
# Analyze data
summary = df.describe()
filtered = df[df['price'] > 100]
# Group and aggregate
sales_by_category = df.groupby('category')['sales'].sum()
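The snippet above needs a data.csv file on disk. As a sketch that runs without one, the same filter and group-by steps can be tried on a small in-memory DataFrame; the rows below are invented purely for illustration.

```python
import pandas as pd

# Hypothetical in-memory data standing in for data.csv
df = pd.DataFrame({
    "category": ["books", "books", "games", "games", "music"],
    "price": [12.0, 150.0, 60.0, 120.0, 9.0],
    "sales": [30, 5, 12, 8, 40],
})

# Boolean mask: keep only the rows where price exceeds 100
filtered = df[df["price"] > 100]

# Group rows by category and total the sales column within each group
sales_by_category = df.groupby("category")["sales"].sum()

print(filtered)
print(sales_by_category)
```

Here `filtered` keeps the two rows priced above 100, and `sales_by_category` is a Series indexed by category name.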
Matplotlib
The artist's toolkit for data visualization:
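import matplotlib.pyplot as plt
# Sample data (hypothetical values for illustration)
months = ['Jan', 'Feb', 'Mar', 'Apr']
sales = [120, 135, 150, 142]
# Create a plot: plt.plot draws a line; plt.bar(months, sales) or
# plt.scatter(months, sales) would draw the same data as bars or points
plt.plot(months, sales, marker='o')
# Customize appearance
plt.title('Sales by Month')
plt.xlabel('Month')
plt.ylabel('Sales ($)')
plt.show()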
Learning Approach
This module follows a practical, project-based approach:
- Learn by Doing - Each concept includes hands-on examples
- Real Datasets - Work with actual data from various domains
- Progressive Difficulty - Start simple, build to complex analyses
- Visual Learning - See your data come to life through charts
- Problem-Solving - Apply data science to real-world challenges
Data Science Workflow
Most data science projects follow an iterative process along these lines:
graph TD
A[Ask Question] --> B[Collect Data]
B --> C[Clean Data]
C --> D[Explore Data]
D --> E[Analyze Data]
E --> F[Visualize Results]
F --> G[Communicate Findings]
G --> H{More Questions?}
H -->|Yes| A
H -->|No| I[Done]
1. Ask the Right Questions
- What problem are you trying to solve?
- What data do you need?
- What insights are you looking for?
2. Collect and Clean Data
- Gather data from various sources
- Handle missing values and errors
- Transform data into usable formats
3. Explore and Analyze
- Understand data distributions
- Find patterns and relationships
- Test hypotheses with statistics
4. Visualize and Communicate
- Create clear, compelling charts
- Tell the story behind the data
- Make recommendations based on findings
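The four steps above can be sketched end to end on a toy dataset. Everything here is hypothetical (the question, the region names, and the sales figures); real projects involve far more exploration at each stage.

```python
import pandas as pd

# 1. Ask: which region has the highest average sales? (hypothetical question)

# 2. Collect: a small in-memory table stands in for a real data source
raw = pd.DataFrame({
    "region": ["north", "north", "south", "south", "south", "north"],
    "sales": [100.0, None, 80.0, 90.0, 90.0, 120.0],
})

# 2. Clean: drop rows that have no sales figure
clean = raw.dropna(subset=["sales"])

# 3. Explore and analyze: average sales per region
avg_by_region = clean.groupby("region")["sales"].mean()

# 4. Communicate: state the finding in one sentence
top_region = avg_by_region.idxmax()
print(f"{top_region} has the highest average sales: {avg_by_region[top_region]:.1f}")
```

Dropping incomplete rows is only one possible cleaning choice; filling missing values with a median is another, and which is appropriate depends on the data.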
Common Data Science Tasks
Exploratory Data Analysis (EDA)
import pandas as pd
import matplotlib.pyplot as plt
# Load and examine data
df = pd.read_csv('sales_data.csv')
print(df.head())
df.info()  # prints its summary directly and returns None
print(df.describe())
# Check for missing values
print(df.isnull().sum())
# Visualize distributions
df['price'].hist(bins=50)
plt.title('Price Distribution')
plt.show()
Data Cleaning
# Remove duplicates
df = df.drop_duplicates()
# Fill missing values
df['price'] = df['price'].fillna(df['price'].median())
# Convert data types
df['date'] = pd.to_datetime(df['date'])
# Remove outliers (here: drop prices at or above the 99th percentile)
df = df[df['price'] < df['price'].quantile(0.99)]
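A fixed percentile cutoff always discards some rows, even when no value is unusual. One common alternative, shown here as a sketch rather than as this module's prescribed method, is Tukey's interquartile-range (IQR) rule, which only removes values far outside the middle half of the data. The prices below are invented for illustration.

```python
import pandas as pd

# Hypothetical prices with one missing value and one extreme value
df = pd.DataFrame({"price": [10.0, 12.0, 11.0, None, 13.0, 500.0]})

# Fill the missing price with the median of the observed prices
df["price"] = df["price"].fillna(df["price"].median())

# Tukey's rule: keep values within 1.5 * IQR of the first and third quartiles
q1 = df["price"].quantile(0.25)
q3 = df["price"].quantile(0.75)
iqr = q3 - q1
within_fences = df["price"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

cleaned = df[within_fences]
print(cleaned)
```

The median is used for filling because, unlike the mean, it is barely affected by the extreme value of 500.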
Statistical Analysis
# Correlation analysis
correlation = df['sales'].corr(df['advertising'])
print(f"Correlation: {correlation}")
# Group comparisons
avg_sales_by_region = df.groupby('region')['sales'].mean()
print(avg_sales_by_region)
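# Hypothesis testing: compare sales between two regions
# ('North' and 'South' are hypothetical values of the 'region' column)
from scipy import stats
group1 = df.loc[df['region'] == 'North', 'sales']
group2 = df.loc[df['region'] == 'South', 'sales']
t_stat, p_value = stats.ttest_ind(group1, group2)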
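To show what the correlation figure actually measures, here is a small sketch that computes the Pearson coefficient from its definition with NumPy and cross-checks it against `np.corrcoef`. The advertising and sales numbers are made up.

```python
import numpy as np

# Hypothetical advertising spend and sales figures
advertising = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
sales = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# Pearson r: sum of products of deviations from the mean,
# divided by the square root of the product of the summed squared deviations
dx = advertising - advertising.mean()
dy = sales - sales.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# Cross-check against NumPy's built-in correlation matrix
r_numpy = np.corrcoef(advertising, sales)[0, 1]

print(r_manual, r_numpy)
```

Both values come out at roughly 0.77, a fairly strong positive relationship. A correlation describes how two columns move together and does not by itself show that one causes the other.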
Real-World Example: Sales Analysis
Imagine you're analyzing sales data for a retail company:
import pandas as pd
import matplotlib.pyplot as plt
# Load sales data
sales_df = pd.read_csv('monthly_sales.csv')
# Assumes one row per month and category, so total the revenue within each month
monthly_revenue = sales_df.groupby('month', sort=False)['revenue'].sum()
# Calculate key metrics
total_sales = monthly_revenue.sum()
avg_monthly_revenue = monthly_revenue.mean()
best_month = monthly_revenue.idxmax()
# Visualize sales trend
plt.figure(figsize=(12, 6))
plt.plot(monthly_revenue.index, monthly_revenue.values, marker='o')
plt.title('Monthly Sales Revenue')
plt.xlabel('Month')
plt.ylabel('Revenue ($)')
plt.grid(True)
plt.show()
# Analyze by product category
category_sales = sales_df.groupby('category')['revenue'].sum()
category_sales.plot(kind='bar')
plt.title('Sales by Category')
plt.ylabel('Revenue ($)')
plt.show()
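A trend chart is often paired with a month-over-month change figure. As a brief sketch with invented revenue totals, pandas' `pct_change` computes each month's change relative to the previous one:

```python
import pandas as pd

# Hypothetical monthly revenue totals, already in calendar order
monthly = pd.Series([1000.0, 1100.0, 990.0], index=["Jan", "Feb", "Mar"], name="revenue")

# Percentage change relative to the previous month; the first month has no predecessor
mom_change = monthly.pct_change() * 100

print(mom_change.round(1))
```

The first entry is NaN because January has nothing to compare against; February shows a 10% rise and March a 10% drop.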
Career Opportunities
Data science offers diverse career paths:
Data Analyst
- Focus: Business intelligence and reporting
- Skills: Excel, SQL, basic statistics, visualization
- Salary: $60,000 - $90,000 USD
Data Scientist
- Focus: Machine learning and predictive modeling
- Skills: Python, R, statistics, ML algorithms
- Salary: $90,000 - $140,000 USD
Machine Learning Engineer
- Focus: Production ML systems and infrastructure
- Skills: Python, TensorFlow, cloud platforms, MLOps
- Salary: $110,000 - $160,000 USD
Data Engineer
- Focus: Data pipelines and infrastructure
- Skills: SQL, Python, Spark, cloud databases
- Salary: $90,000 - $130,000 USD
Getting Started
Ready to begin your data science journey? Let's start with NumPy arrays - the building blocks of numerical computing in Python!
Next: NumPy Arrays - Efficient numerical computing! 🔢