Data Science Portfolio Masterclass

00

Before You Write One Line of Code

Lessons from two decades in the field — what no bootcamp will tell you

I've reviewed thousands of portfolios. Most are graveyard repos: 10 Jupyter notebooks with Titanic survival predictions and Iris classifications. Nobody cares. What makes a recruiter lean forward is a portfolio that screams: "This person solves problems in the real world." Here's how you build that.

— Senior DS Mentor / 20+ Years Experience

The 3 Sins of Every Beginner Portfolio

✗

Sin #1 — Tutorial Datasets

Titanic. Iris. Boston Housing. MNIST. If your project uses these, a recruiter can't tell if YOU did anything. Real datasets are messy and demand context and domain knowledge. Always source your own data from APIs, Kaggle competitions, or web scraping.

✗

Sin #2 — No Business Problem

Your accuracy is 94.3%... so what? Every project must answer "Who benefits from this? By how much? What decision does this change?" Translate model output into dollars, risk reduction, or operational efficiency. That's what business stakeholders care about.

✗

Sin #3 — A Notebook is Not a Product

Notebooks are for exploration. Production is an API, a dashboard, a scheduled pipeline. If your project can't be used by someone who isn't you, it's not complete. Wrap everything in FastAPI, Streamlit, or at minimum a Docker container.

What Recruiters in 2025 Actually Want

LLM Integration

GPT-4, Claude, RAG systems

MLOps

MLflow, Airflow, CI/CD

Cloud

AWS/GCP/Azure deployment

Real-time

Kafka, streaming pipelines

Story

Business impact, not just metrics

Clean Code

Modular, tested, documented

🎯 What a Recruiter Does in 90 Seconds

Scans your GitHub README — is there a problem statement + demo link?
Checks folder structure — does this look like a professional project or homework?
Looks for a live demo / API endpoint / dashboard — can I see this working NOW?
Reads your results section — what was the measurable impact?
Checks tech stack tags — do they match our job description?

01

RAG-Powered Document Intelligence System

LLM + Vector Database + Production API — the most in-demand skill of 2025

Smart Document Q&A Engine

Ask natural language questions against 10,000+ documents — with citations and source grounding

🔥 Hot in 2025

Every company is sitting on a mountain of PDFs, reports, and internal docs that nobody reads. The person who can build a system to query that knowledge intelligently is immediately worth $120K+. This project teaches you the full modern AI stack.

— Why this project matters

Problem Statement

A law firm has 50,000 case documents. Lawyers spend 40% of their time searching for precedents. Build an AI system where a lawyer types "Find cases involving breach of contract in software agreements from 2018–2022" and gets accurate answers with source citations in under 3 seconds.

Python, LangChain, OpenAI / Claude API, ChromaDB / Pinecone, FastAPI, Docker, Streamlit, AWS S3

Step-by-Step Build Plan

1

Phase 1 — Data Collection & Domain Selection (Week 1)

Choose a real, interesting domain. Good options:

SEC 10-K filings (publicly available, financial domain)
ArXiv research papers in a niche (e.g., "climate ML papers")
Government policy documents from data.gov
Stack Overflow dumps (technical Q&A)

Download 500–2000 documents minimum. More = more impressive. Write a scraper or use existing APIs. Document your data collection in a clean notebook.

2

Phase 2 — Data Pipeline & Preprocessing (Week 1–2)

This is where junior DS fail — messy code with no structure. Build it properly:

PDF/text parsing with PyMuPDF or pdfplumber
Intelligent chunking strategy (not just 500 chars — use semantic chunking)
Metadata extraction (date, author, section headers)
Deduplication and quality filtering
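
The chunking and deduplication steps above can be sketched with the standard library alone. This is a minimal illustration, not semantic chunking in the embedding-based sense: it packs whole paragraphs into size-limited chunks as a simple stand-in, and it only removes exact duplicates after normalizing case and whitespace. The function names `chunk_by_paragraph` and `deduplicate` and the 800-character default are assumptions made for this example.

```python
import hashlib
import re


def chunk_by_paragraph(text: str, max_chars: int = 800) -> list[str]:
    """Pack whole paragraphs into chunks of at most max_chars characters."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks: list[str] = []
    current = ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_chars:
            current = candidate
        else:
            if current:
                chunks.append(current)
            # An oversized single paragraph is kept whole in this sketch;
            # a real pipeline would split it further (for example by sentence).
            current = para
    if current:
        chunks.append(current)
    return chunks


def deduplicate(chunks: list[str]) -> list[str]:
    """Drop exact duplicates after normalizing case and whitespace."""
    seen: set[str] = set()
    unique: list[str] = []
    for chunk in chunks:
        normalized = " ".join(chunk.lower().split())
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(chunk)
    return unique


# Hypothetical sample text: the third paragraph duplicates the first.
sample = "Intro paragraph.\n\nSecond paragraph.\n\nintro   paragraph."
chunks = chunk_by_paragraph(sample, max_chars=20)
unique_chunks = deduplicate(chunks)
```

Near-duplicate detection (MinHash, embedding similarity) is a natural next step once exact matching is in place.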

3

Phase 3 — Embedding & Vector Store (Week 2)

The core of RAG — embedding your documents:

Try multiple embedding models: OpenAI text-embedding-3-small vs sentence-transformers
Benchmark retrieval quality — this is your experiment section
Use ChromaDB locally first, then migrate to Pinecone for cloud
Implement hybrid search: vector + BM25 keyword search combined

# Phase 3: Embedding pipeline (production-grade)
# Import paths assume the split LangChain packages (langchain-openai,
# langchain-chroma, langchain-text-splitters). Older releases exposed the
# same classes under langchain.embeddings and langchain.vectorstores.
import logging

from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

logger = logging.getLogger(__name__)

class DocumentIndexer:
    def __init__(self, chunk_size=800, chunk_overlap=100):
        # Split on paragraph breaks first, then lines, sentences, and words
        self.splitter = RecursiveCharacterTextSplitter(
            chunk_size=chunk_size,
            chunk_overlap=chunk_overlap,
            separators=["\n\n", "\n", ".", " "]
        )
        self.embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

    def index_documents(self, docs: list, persist_path: str):
        chunks = self.splitter.split_documents(docs)
        logger.info(f"Indexing {len(chunks)} chunks...")
        # Recent Chroma versions write to persist_directory automatically,
        # so no explicit persist() call is needed.
        vectorstore = Chroma.from_documents(
            documents=chunks,
            embedding=self.embeddings,
            persist_directory=persist_path
        )
        return vectorstore
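
The hybrid search bullet in Phase 3 needs some way to merge a vector ranking with a BM25 ranking. One common option is reciprocal rank fusion (RRF), sketched below under the assumption that each retriever already returns an ordered list of document IDs. The IDs are hypothetical, and `k=60` is the conventional default for RRF rather than a tuned value.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists: each list adds 1 / (k + rank) per document."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by document ID for stability.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical outputs of a vector retriever and a BM25 retriever.
vector_hits = ["doc_a", "doc_b", "doc_c"]
keyword_hits = ["doc_b", "doc_d", "doc_a"]
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
```

RRF only uses ranks, so it sidesteps the problem of cosine similarities and BM25 scores living on different scales.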

4

Phase 4 — RAG Chain with Evaluation (Week 3)

THIS IS WHAT SEPARATES YOU. Don't just build the RAG — evaluate it properly:

Build a test set of 50 question-answer pairs manually
Measure retrieval precision@k and recall@k
Use RAGAS library to score faithfulness, answer relevancy, context precision
A/B test different chunking strategies and show results in a table
Add citations: every answer must show which document it came from
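
The retrieval metrics in the list above are simple enough to compute by hand before reaching for a library. A minimal sketch, assuming each test question comes with a set of relevant document IDs you labeled manually; the IDs below are hypothetical.

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k retrieved documents that are relevant."""
    if k <= 0:
        return 0.0
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / k


def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of all relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / len(relevant)


# One hypothetical test question: the retriever's ranking and the labeled answers.
retrieved = ["d1", "d7", "d3", "d9"]
relevant = {"d1", "d3", "d5"}
p_at_3 = precision_at_k(retrieved, relevant, k=3)
r_at_3 = recall_at_k(retrieved, relevant, k=3)
```

Averaging these two numbers over the 50-question test set gives one row of the chunking-strategy comparison table.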

5

Phase 5 — FastAPI Backend + Streamlit Frontend (Week 3–4)

REST API with /query, /upload, /health endpoints
Async processing for document uploads
Streamlit UI with chat interface, source display, and confidence scores
Rate limiting and basic auth (shows you think about production)
Dockerize everything — one docker-compose up should run the whole system
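
For the rate limiting bullet, a token bucket is one straightforward approach. The sketch below is framework-agnostic and uses only the standard library; wiring it into FastAPI (for example as a dependency keyed by API key) is left out. The class name and the injectable `clock` parameter are assumptions made so the example can be tested without sleeping.

```python
import time


class TokenBucket:
    """Allow bursts up to `capacity`, refilled at `rate_per_sec` tokens per second."""

    def __init__(self, rate_per_sec: float, capacity: int, clock=time.monotonic):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.clock = clock
        self.tokens = float(capacity)
        self.updated = clock()

    def allow(self) -> bool:
        now = self.clock()
        # Refill in proportion to elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


# A fake clock makes the behavior deterministic for the demonstration.
fake_now = [0.0]
bucket = TokenBucket(rate_per_sec=1.0, capacity=2, clock=lambda: fake_now[0])
burst = [bucket.allow(), bucket.allow(), bucket.allow()]
fake_now[0] = 1.0  # one second later, one token has been refilled
after_refill = bucket.allow()
```

In a multi-worker deployment the bucket state would need to live in shared storage such as Redis; this in-process version only covers a single worker.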

6

Phase 6 — Deploy + Document (Week 4)

Deploy to AWS EC2 (free tier) or Hugging Face Spaces for free
Write a proper README: problem → approach → results → demo link
Create a 2-minute Loom video demo — MASSIVE differentiator
Write a Medium post or LinkedIn article explaining your architecture decisions

💼 What to Highlight to Recruiters

"Built a RAG system handling 10K documents with <3s query latency"
"Improved retrieval precision by 23% using hybrid search vs pure vector search"
"Deployed production API handling 100 concurrent requests on AWS"
"Evaluated system using RAGAS framework — faithfulness score of 0.89"

02

End-to-End MLOps Pipeline with Drift Detection

The project that shows you can maintain models in production — not just train them once

Real-Time Fraud Detection System

Train → Track → Deploy → Monitor → Retrain — the full production lifecycle

⚙️ MLOps Gold

90% of ML projects fail in production — not because the model is bad, but because nobody built the infrastructure to maintain it. The data scientist who understands the full lifecycle — training through monitoring — is 3x more valuable than one who only knows modeling. This project teaches that lifecycle.

— Why this project matters

Problem Statement

Credit card fraud costs $32B annually. Build a system that detects fraud in real-time, tracks model performance over time, detects when the model degrades (data drift), and automatically triggers retraining. This is a system, not a model.

Python, XGBoost / LightGBM, MLflow, Apache Airflow, FastAPI, Evidently AI, Docker, PostgreSQL, Grafana

Architecture Overview

┌─────────────────── DATA LAYER ──────────────────────────┐
│  Raw Transactions → Feature Engineering → Feature Store │
└───────────────────────────────────┬─────────────────────┘
                                    │
┌──────────────── TRAINING LAYER ───▼─────────────────────┐
│  MLflow Tracking → Experiment → Model Registry          │
│  Hyperparameter Tuning → Best Model → Staging           │
└───────────────────────────────────┬─────────────────────┘
                                    │
┌──────────────── SERVING LAYER ────▼─────────────────────┐
│  FastAPI Endpoint → Real-time Predictions → Logging     │
└───────────────────────────────────┬─────────────────────┘
                                    │
┌──────────────── MONITORING LAYER ─▼─────────────────────┐
│  Evidently Reports → Data Drift → Performance Drift     │
│  Grafana Dashboard → Alerts → Retrain Trigger           │
└─────────────────────────────────────────────────────────┘

Step-by-Step Build Plan

1

Phase 1 — Data & Feature Engineering (Week 1)

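Start with the Kaggle Credit Card Fraud dataset, but don't stop at the raw features. Be aware that its columns are PCA-anonymized and it has no card identifier, so the per-card features below need a dataset that keeps card or user IDs. Add engineered features that show domain understanding: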

Rolling aggregations: "transactions in last 1h/6h/24h per card"
Velocity features: "amount deviation from user's median"
Time features: hour of day, day of week, time since last transaction
Build a Feature Store using a simple SQLite or Redis cache
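
The rolling aggregation idea ("transactions in last 1h per card") can be sketched with a per-card queue of timestamps. This assumes transactions arrive in time order and carry a card identifier; `card_id`, the integer timestamps in seconds, and the class name are all hypothetical. The count deliberately excludes the current transaction so the feature only uses information available before the prediction.

```python
from collections import defaultdict, deque


class RollingCounter:
    """Count a card's prior transactions inside a sliding time window."""

    def __init__(self, window_seconds: int):
        self.window = window_seconds
        self.events: dict[str, deque] = defaultdict(deque)

    def add_and_count(self, card_id: str, timestamp: int) -> int:
        queue = self.events[card_id]
        # Evict timestamps that have fallen out of the window.
        while queue and queue[0] <= timestamp - self.window:
            queue.popleft()
        prior_count = len(queue)
        queue.append(timestamp)
        return prior_count


# Hypothetical stream of (card_id, unix_seconds) pairs, already sorted by time.
stream = [("card_1", 0), ("card_1", 1800), ("card_1", 3000),
          ("card_1", 4000), ("card_2", 4100)]
counter = RollingCounter(window_seconds=3600)
tx_last_hour = [counter.add_and_count(card, ts) for card, ts in stream]
```

Running three counters with 1h, 6h, and 24h windows produces the full feature set; the same state could be held in Redis for the serving path.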

2

Phase 2 — Experiment Tracking with MLflow (Week 1–2)

This is the #1 thing missing in junior portfolios — experiment tracking:

Log every experiment: params, metrics, artifacts
Compare Logistic Regression vs Random Forest vs XGBoost vs LightGBM
Handle class imbalance properly: SMOTE, class weights, threshold tuning
Use F1-Score + Precision-Recall AUC as primary metrics (not accuracy)
Register your best model in MLflow Model Registry

import mlflow
import mlflow.sklearn
from sklearn.metrics import average_precision_score, f1_score, precision_score
from xgboost import XGBClassifier

def train_and_track(X_train, y_train, X_val, y_val, params, model_name):
    with mlflow.start_run(run_name=model_name):
        # Log all hyperparameters
        mlflow.log_params(params)

        # early_stopping_rounds is passed to the constructor here;
        # XGBoost 2.0 removed it as a fit() argument.
        model = XGBClassifier(**params, early_stopping_rounds=20)
        model.fit(X_train, y_train, eval_set=[(X_val, y_val)])

        # Log all metrics
        preds = model.predict(X_val)
        proba = model.predict_proba(X_val)[:, 1]

        mlflow.log_metrics({
            "f1": f1_score(y_val, preds),
            # Average precision summarizes the precision-recall curve;
            # roc_auc_score would report ROC AUC instead.
            "pr_auc": average_precision_score(y_val, proba),
            "precision": precision_score(y_val, preds),
        })

        # Log model artifact
        mlflow.sklearn.log_model(model, "model")
        return model
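
The threshold tuning mentioned in the Phase 2 list is easy to overlook, because `predict` silently uses 0.5. A minimal pure-Python sketch of picking the threshold that maximizes F1 on validation data follows; the labels and scores are hypothetical, and in practice you would likely use scikit-learn's `precision_recall_curve` for the sweep.

```python
def f1_at_threshold(y_true: list[int], scores: list[float], threshold: float) -> float:
    """F1 score when every score >= threshold is flagged as fraud."""
    tp = fp = fn = 0
    for label, score in zip(y_true, scores):
        flagged = score >= threshold
        if flagged and label == 1:
            tp += 1
        elif flagged and label == 0:
            fp += 1
        elif not flagged and label == 1:
            fn += 1
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def best_threshold(y_true: list[int], scores: list[float]) -> tuple[float, float]:
    """Try each observed score as a threshold and keep the best F1."""
    best_t, best_f1 = 0.5, -1.0
    for candidate in sorted(set(scores)):
        f1 = f1_at_threshold(y_true, scores, candidate)
        if f1 > best_f1:
            best_t, best_f1 = candidate, f1
    return best_t, best_f1


# Hypothetical validation labels and predicted fraud probabilities.
y_val = [0, 0, 1, 0, 1, 1]
proba = [0.05, 0.20, 0.35, 0.40, 0.70, 0.90]
threshold, f1 = best_threshold(y_val, proba)
```

Choose the threshold on validation data and report final metrics on a separate test set, otherwise the tuned F1 is optimistic. If false negatives and false positives have different dollar costs, swap F1 for an expected-cost objective.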

3

Phase 3 — FastAPI Serving with Prediction Logging (Week 2–3)

Load model from MLflow registry at startup
Log every prediction to PostgreSQL: input features, prediction, probability, timestamp
This logged data becomes your monitoring dataset
Add /health and /metrics endpoints
Implement a shadow mode: new model vs old model, both running simultaneously
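
Prediction logging and shadow mode fit together naturally: the live model decides, the shadow model only gets recorded. The sketch below uses the standard library's `sqlite3` as a stand-in for PostgreSQL so it runs anywhere. The table layout, function names, and the two lambda "models" are assumptions for illustration, not a prescribed schema.

```python
import json
import sqlite3
from datetime import datetime, timezone


def init_log(conn: sqlite3.Connection) -> None:
    conn.execute(
        """CREATE TABLE IF NOT EXISTS prediction_log (
               id INTEGER PRIMARY KEY AUTOINCREMENT,
               logged_at TEXT NOT NULL,
               features TEXT NOT NULL,
               live_probability REAL NOT NULL,
               shadow_probability REAL,
               prediction INTEGER NOT NULL)"""
    )


def predict_and_log(conn, features: dict, live_model, shadow_model=None,
                    threshold: float = 0.5) -> int:
    """Serve the live model's decision; record the shadow model's score only."""
    live_p = live_model(features)
    shadow_p = shadow_model(features) if shadow_model else None
    prediction = int(live_p >= threshold)
    conn.execute(
        "INSERT INTO prediction_log (logged_at, features, live_probability, "
        "shadow_probability, prediction) VALUES (?, ?, ?, ?, ?)",
        (datetime.now(timezone.utc).isoformat(), json.dumps(features),
         live_p, shadow_p, prediction),
    )
    conn.commit()
    return prediction


# Stand-in models: any callable that maps features to a fraud probability.
live = lambda f: 0.9 if f["amount"] > 1000 else 0.1
shadow = lambda f: 0.8 if f["amount"] > 500 else 0.2

conn = sqlite3.connect(":memory:")
init_log(conn)
decision = predict_and_log(conn, {"amount": 750.0}, live, shadow)
row = conn.execute(
    "SELECT live_probability, shadow_probability, prediction FROM prediction_log"
).fetchone()
```

Because both scores land in the same row, comparing the candidate against production later is a single SQL query over the log.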

4

Phase 4 — Data Drift Detection with Evidently (Week 3) ← THE STAR

This is what makes this project elite. Most junior DS don't know what drift is:

Data drift: input feature distributions have shifted (fraudsters change tactics)
Concept drift: the relationship between features and fraud has changed
Use Evidently AI to generate drift reports weekly
Set up alerts when drift exceeds threshold (e.g., PSI > 0.2)
Build an Airflow DAG that runs weekly drift checks automatically
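
The PSI threshold in the list above is easier to reason about once you have computed the statistic yourself. Below is a minimal sketch of the Population Stability Index for one numeric feature. The equal-width bins derived from the reference data and the `eps` floor for empty bins are assumptions of this example; Evidently and other libraries make their own binning choices, so their numbers may differ.

```python
import math


def psi(expected: list[float], actual: list[float],
        n_bins: int = 10, eps: float = 1e-6) -> float:
    """Population Stability Index: sum((a - e) * ln(a / e)) over bins."""
    low, high = min(expected), max(expected)
    width = (high - low) / n_bins or 1.0  # guard against a constant feature

    def bin_proportions(values: list[float]) -> list[float]:
        counts = [0] * n_bins
        for value in values:
            index = int((value - low) / width)
            # Values outside the reference range go to the edge bins.
            counts[min(max(index, 0), n_bins - 1)] += 1
        # Floor empty bins at eps so the logarithm stays defined.
        return [max(count / len(values), eps) for count in counts]

    e = bin_proportions(expected)
    a = bin_proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


# Hypothetical feature values: training-time reference vs. a shifted week.
reference = [i / 100 for i in range(100)]
shifted = [value + 0.3 for value in reference]
psi_same = psi(reference, reference)
psi_shifted = psi(reference, shifted)
drift_alert = psi_shifted > 0.2
```

A weekly job can run this per feature against the logged predictions and raise the alert that triggers retraining.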

5

Phase 5 — Grafana Monitoring Dashboard (Week 4)

Real-time metrics: predictions/second, fraud rate, model latency
Feature drift scores over time (line charts)
Precision/Recall over rolling windows
Alert panel: "Model degradation detected — retrain triggered"

💼 What to Highlight to Recruiters

"Designed automated retraining pipeline triggered by statistical drift detection"
"Reduced model performance degradation by 67% through proactive drift monitoring"
"Tracked 150+ experiments in MLflow, improving F1 score from 0.73 to 0.91"
"Built real-time monitoring dashboard serving 10K predictions/day"

03

Causal Inference & Business Intelligence Dashboard

The project that proves you think like a scientist, not just an engineer

A/B Test Analyzer + Causal Impact Dashboard

Did this product change CAUSE the revenue lift — or was it just correlation?

🧠 Strategic Thinking

Every senior DS role asks this: "How do you know the change caused the improvement?" Correlation is easy to find. Causation is what makes decisions. This project teaches causal inference, A/B testing rigor, and business storytelling — skills that separate DS from data analysts AND from ML engineers.

— Why this project matters

Problem Statement

An e-commerce company ran a pricing experiment: 50% of users saw a 10% discount badge. Did the discount badge cause more purchases? Build a rigorous analysis system that answers this question statistically, corrects for confounders, and presents findings to a non-technical executive audience.

Python, SQL, Statsmodels, DoWhy / CausalML, Plotly / Dash, dbt, BigQuery / Snowflake, Jupyter

Step-by-Step Build Plan

1

Phase 1 — Synthetic Data Generation + SQL Pipeline (Week 1)

Build your own realistic dataset — this itself demonstrates competence:

Generate user sessions with treatment/control assignments
Add realistic confounders: device type, user tenure, geography
Introduce novelty effects, seasonal bias, Simpson's paradox scenarios
Build dbt models to transform raw events → analysis-ready tables
Write clean SQL with window functions, CTEs, and proper documentation
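
A synthetic experiment generator for the steps above can start very small. The sketch below uses only the standard library and a fixed seed so runs are reproducible. All rates, the `assignment_bias` knob, and the field names are hypothetical. With the default `assignment_bias=0.0`, assignment is properly randomized and device is just a covariate; a positive value makes mobile users more likely to be treated, which turns device into a real confounder you can then practice correcting for.

```python
import random


def generate_sessions(n_users: int, base_rate: float = 0.05, lift: float = 0.01,
                      mobile_penalty: float = 0.02, assignment_bias: float = 0.0,
                      seed: int = 42) -> list[dict]:
    """Simulate one row per user with a known true treatment effect (`lift`)."""
    rng = random.Random(seed)
    rows = []
    for user_id in range(n_users):
        device = rng.choice(["desktop", "mobile"])
        # Biased assignment (if requested) depends on device.
        p_treat = 0.5 + (assignment_bias if device == "mobile" else 0.0)
        treated = rng.random() < p_treat
        p_convert = (base_rate
                     + (lift if treated else 0.0)
                     - (mobile_penalty if device == "mobile" else 0.0))
        converted = rng.random() < p_convert
        rows.append({"user_id": user_id, "device": device,
                     "treated": int(treated), "converted": int(converted)})
    return rows


sessions = generate_sessions(10_000)
treated_share = sum(row["treated"] for row in sessions) / len(sessions)
```

Because you set `lift` yourself, you can check whether each analysis method recovers the true effect, which is something real data never lets you do.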

2

Phase 2 — Pre-Analysis: SRM & Validity Checks (Week 1–2)

Junior analysts skip this. Senior scientists always start here:

Sample Ratio Mismatch (SRM): Is the 50/50 split actually 50/50?
Pre-experiment balance: Are treatment/control groups similar before the test?
Novelty effect check: Does the effect fade over time as users get used to the change?
CUPED: Use pre-experiment data to reduce variance and increase power
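
CUPED in the list above comes down to one regression-style adjustment: subtract from each user's outcome the part that their pre-experiment metric already predicts. A minimal sketch with the standard library, assuming one pre-period value and one in-experiment value per user; the numbers are hypothetical.

```python
from statistics import mean, variance


def cuped_adjust(outcome: list[float], pre_metric: list[float]) -> list[float]:
    """Return outcome - theta * (pre_metric - mean(pre_metric))."""
    x_bar = mean(pre_metric)
    y_bar = mean(outcome)
    covariance = sum((x - x_bar) * (y - y_bar)
                     for x, y in zip(pre_metric, outcome)) / (len(outcome) - 1)
    # theta is the slope of outcome on the pre-experiment metric.
    theta = covariance / variance(pre_metric)
    return [y - theta * (x - x_bar) for x, y in zip(pre_metric, outcome)]


# Hypothetical per-user spend before and during the experiment.
pre_spend = [10.0, 20.0, 30.0, 40.0, 50.0]
spend = [12.0, 19.0, 33.0, 41.0, 52.0]
adjusted = cuped_adjust(spend, pre_spend)
variance_before = variance(spend)
variance_after = variance(adjusted)
```

The adjustment leaves the mean unchanged and shrinks the variance in proportion to how well the pre-period metric correlates with the outcome. Estimate theta on treatment and control pooled together, then compare the adjusted group means.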

3

Phase 3 — Statistical Analysis (Week 2)

Frequentist: t-test with proper power analysis and multiple testing correction
Bayesian: Beta-binomial model — show posterior distributions, not just p-values
Bootstrap confidence intervals for non-normal metrics (revenue)
Segmented analysis: does the effect differ by user segment?
Minimum detectable effect and required sample size calculation BEFORE analysis

from dowhy import CausalModel
from scipy import stats

class ExperimentAnalyzer:
    def __init__(self, df, treatment_col, outcome_col, covariates):
        self.df = df
        self.treatment = treatment_col
        self.outcome = outcome_col
        self.covariates = covariates

    def check_srm(self):
        # Sample Ratio Mismatch check (assumes two groups and an intended 50/50 split)
        counts = self.df[self.treatment].value_counts()
        expected = [counts.sum() / 2] * 2
        chi2, p = stats.chisquare(counts, f_exp=expected)
        if p < 0.01:
            print(f"⚠️  SRM DETECTED: p={p:.4f}. Results may be invalid!")
        return {"chi2": chi2, "p_value": p, "srm_detected": p < 0.01}

    def causal_estimate(self):
        # DoWhy causal graph approach; propensity score methods need a binary treatment
        model = CausalModel(
            data=self.df,
            treatment=self.treatment,
            outcome=self.outcome,
            common_causes=self.covariates
        )
        identified = model.identify_effect()
        estimate = model.estimate_effect(
            identified, method_name="backdoor.propensity_score_matching"
        )
        # Refutation: adding a random common cause should barely move the estimate
        refutation = model.refute_estimate(identified, estimate,
                   method_name="random_common_cause")
        return {"estimate": estimate.value, "refutation": refutation}
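
Phase 3 also asks for a minimum detectable effect and sample size calculation before the analysis. For a conversion-rate metric, the standard normal-approximation formula for two proportions is n per group = (z_{1-α/2} + z_{power})² · [p1(1-p1) + p2(1-p2)] / (p2 - p1)². The sketch below implements it with the standard library; the 10% baseline and 2-point absolute lift are hypothetical inputs.

```python
import math
from statistics import NormalDist


def sample_size_per_group(baseline_rate: float, mde_abs: float,
                          alpha: float = 0.05, power: float = 0.8) -> int:
    """Users needed per arm to detect an absolute lift of mde_abs (two-sided test)."""
    p1 = baseline_rate
    p2 = baseline_rate + mde_abs
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    variance_sum = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_power) ** 2 * variance_sum / mde_abs ** 2)


# Hypothetical inputs: 10% baseline conversion, detect a lift to 12%.
n_needed = sample_size_per_group(baseline_rate=0.10, mde_abs=0.02)
n_for_smaller_effect = sample_size_per_group(baseline_rate=0.10, mde_abs=0.01)
```

With these inputs the requirement comes out to a few thousand users per arm, and halving the detectable effect roughly quadruples it, which is why the MDE has to be agreed with stakeholders before the test starts.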

4

Phase 4 — Causal Inference (Week 3) ← THE DIFFERENTIATOR

This is the 1% skill. Almost nobody in junior portfolios touches this:

Build a Causal DAG (Directed Acyclic Graph) showing confounders
Propensity Score Matching — correct for selection bias
Difference-in-Differences for observational data
Use DoWhy to verify causal estimates with refutation tests
Write a one-page "business decision memo" — not technical language, executive language

5

Phase 5 — Interactive Plotly Dash Dashboard (Week 3–4)

Executive summary: "The discount badge caused +8.3% conversion (95% CI: 5.1%–11.4%)"
Interactive metric selector: conversion rate, revenue, session time
Confidence interval visualizer with frequentist vs Bayesian comparison
Segment drill-down: filter by device, geography, user type
Deploy on Render (free tier) or Heroku (paid plans only; Heroku ended its free tier in November 2022)

💼 What to Highlight to Recruiters

"Applied causal inference (DoWhy, PSM) to validate A/B test results beyond p-values"
"Detected and corrected for Simpson's Paradox in segmented user analysis"
"Built executive BI dashboard translating statistical results into revenue impact estimates"
"Used CUPED to reduce metric variance by 31%, lowering required sample size by 40%"

04

The 2025 Data Scientist Toolkit

Tools you must know, organized by category and priority

Don't try to learn every tool. Learn the RIGHT tools deeply. Here's what's actually used in industry in 2025, with honest assessment of what's essential vs nice-to-have.

— Tool philosophy

Python

Non-negotiable. pandas, numpy, scikit-learn.

SQL

Window functions, CTEs, query optimization.

XGBoost / LightGBM

Still dominant for tabular data in 2025.

PyTorch

Deep learning. At least understand it.

Plotly / Seaborn

Visualization for EDA and presentations.

Statsmodels

Statistical testing, regression, time series.

Polars

Replacing pandas for large data. Learn it now.

Git / GitHub

Branching, PR workflow, GitHub Actions CI.

Skill Level Benchmarks

Python: Must be expert

SQL: Must be advanced

Statistics: Must be solid

ML Algorithms: Must be strong

MLflow

Experiment tracking, model registry. Essential.

Apache Airflow

Workflow orchestration. Standard in industry.

Docker

Containerization. Non-negotiable in 2025.

FastAPI

Model serving. Fast, modern, Pythonic.

Evidently AI

Data drift + model monitoring.

DVC

Data versioning. Like git for datasets.

Grafana

Dashboards for model monitoring.

GitHub Actions

CI/CD for ML. Auto-test on push.

LangChain

LLM application framework. Still dominant.

LlamaIndex

RAG and document indexing pipelines.

OpenAI API

GPT-4o. Function calling. Fine-tuning.

Anthropic API

Claude. Excellent for long-context tasks.

ChromaDB / Pinecone

Vector databases for RAG.

RAGAS

RAG evaluation framework. Use this!

Ollama

Local LLMs. Privacy-safe deployment.

Hugging Face

Open source models + model hub.

AWS (EC2/S3/Lambda)

Industry standard. Must know basics.

BigQuery / Snowflake

Cloud data warehouses. SQL at scale.

dbt

Data transformation. Rising fast in 2025.

Spark (PySpark)

Big data. Know basics for senior roles.

Kafka

Event streaming. Real-time pipelines.

Terraform

Infrastructure as code. Nice-to-have.

05

Your 16-Week Execution Roadmap

From zero to portfolio-ready — week by week, no fluff

Consistency beats intensity. 2 focused hours per day beats 14-hour weekend sprints. Here is the exact schedule I would give a protégé starting from scratch today.

— The plan

WEEKS 1–2

Foundation Setup

Set up local dev environment (VS Code, Docker, Python 3.11, Git). Complete Python refresh: OOP, decorators, context managers. SQL advanced: window functions, CTEs, performance. Build a data scraper for your chosen domain. Deliverable: GitHub profile set up, first repo with scraping pipeline.

WEEKS 3–5

Project 3 — A/B Testing & Causal Inference

Start here because statistics is the hardest and most overlooked. Generate synthetic experiment data, build the analysis pipeline, create the Dash dashboard. Deliverable: Live dashboard deployed + analysis notebook.

WEEKS 6–9

Project 2 — MLOps Pipeline

Fraud detection system. MLflow setup → model training → FastAPI serving → Evidently monitoring → Airflow DAG → Grafana dashboard. Deliverable: Dockerized system with monitoring, running locally + deployed.

WEEKS 10–13

Project 1 — RAG Document System

Choose domain, collect documents, build embedding pipeline, implement RAG chain, evaluate with RAGAS, deploy API + Streamlit UI. Deliverable: Live demo URL, 2-min Loom video walkthrough.

WEEKS 14–15

Portfolio Polish

Write READMEs for every project (problem → approach → results → demo). Write 3 technical Medium/LinkedIn posts. Clean up all code: typing, docstrings, unit tests. Record demo videos.

WEEK 16

Launch & Apply

Finalize LinkedIn profile + GitHub. Tailor resume for each application. Start applying. The portfolio does the talking.

Final Checklist — Before You Apply

Each project has a clear problem statement in the README
Each project has a live demo link or Loom video
Each project quantifies its business impact (%, $, time saved)
Code is modular, not one giant notebook
Unit tests exist (even basic ones)
Docker setup works (docker-compose up launches the system)
GitHub profile has a pinned projects section
LinkedIn headline mentions "ML Engineer" or "Data Scientist" with tech keywords
At least 1 technical article published explaining one project
You can talk about each project for 10 minutes in an interview without notes

3 Portfolio Projects That Will
Set You on Fire
