1 The CRISP-DM Process
CRISP-DM (Cross-Industry Standard Process for Data Mining) is one of the most widely used methodologies for planning and executing data mining and data science projects. It breaks a project into six phases that form an iterative life cycle: work moves backward between phases as well as forward, and the whole loop often repeats as new insight emerges.
1.1 The Six Phases
┌─────────────────────┐      ┌─────────────────────┐
│ 1. Business         │ ───▶ │ 2. Data             │
│    Understanding    │ ◀─── │    Understanding    │
└─────────────────────┘      └──────────┬──────────┘
          ▲                             │
          │                             ▼
          │                  ┌─────────────────────┐
          │                  │ 3. Data Preparation │
          │                  └──────────┬──────────┘
          │                             │
          │                             ▼
          │                  ┌─────────────────────┐
          │                  │ 4. Modeling         │
          │                  └──────────┬──────────┘
          │                             │
          │                             ▼
┌─────────┴───────────┐      ┌─────────────────────┐
│ 6. Deployment       │ ◀─── │ 5. Evaluation       │
└─────────────────────┘      └─────────────────────┘
1.1.1 1. Business Understanding
Focuses on understanding the project objectives and requirements from a business perspective, then converting this knowledge into a data mining problem definition and a preliminary plan.
- Determine business objectives and success criteria
- Assess the situation (resources, risks, constraints, costs/benefits)
- Define data mining goals
- Produce a project plan
1.1.2 2. Data Understanding
Starts with initial data collection and proceeds to activities that build familiarity with the data, identify quality problems, and form first hypotheses.
- Collect initial data
- Describe data (format, volume, fields)
- Explore data (summary stats, visualizations)
- Verify data quality
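The describe, explore, and verify tasks above can be illustrated with a small dependency-free sketch. The records, field names, and helper functions below are hypothetical and exist only for illustration; a real project would more likely use a data-frame library for the same checks.

```python
from statistics import mean, median

# Hypothetical sample records; None marks a missing value.
records = [
    {"customer_id": 1, "age": 34, "monthly_spend": 120.0},
    {"customer_id": 2, "age": None, "monthly_spend": 80.5},
    {"customer_id": 3, "age": 51, "monthly_spend": None},
    {"customer_id": 3, "age": 51, "monthly_spend": 95.0},
]


def describe(rows):
    """Describe data: number of rows and the field names present."""
    fields = sorted({key for row in rows for key in row})
    return {"n_rows": len(rows), "fields": fields}


def summarize(rows, field):
    """Explore data: summary statistics for one numeric field, ignoring missing values."""
    values = [row[field] for row in rows if row.get(field) is not None]
    return {
        "min": min(values),
        "max": max(values),
        "mean": mean(values),
        "median": median(values),
    }


def quality_report(rows, key_field):
    """Verify data quality: missing values per field and duplicated key values."""
    fields = describe(rows)["fields"]
    missing = {f: sum(1 for row in rows if row.get(f) is None) for f in fields}
    keys = [row[key_field] for row in rows]
    duplicate_keys = sorted({k for k in keys if keys.count(k) > 1})
    return {"missing": missing, "duplicate_keys": duplicate_keys}


print(describe(records))
print(summarize(records, "monthly_spend"))
print(quality_report(records, "customer_id"))
```

Findings such as the duplicated `customer_id` here are typically recorded as input for the Data Preparation phase rather than fixed on the spot.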
1.1.3 3. Data Preparation
Covers all activities to construct the final dataset that will be fed into the modeling tools. Typically the most time-consuming phase.
- Select data (rows and columns)
- Clean data (handle missing values, errors, outliers)
- Construct data (derived attributes, feature engineering)
- Integrate data (merge sources)
- Format data for modeling
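As a rough illustration of the tasks above, the following sketch cleans, integrates, constructs, and formats a tiny dataset using only the standard library. The data, the second source (`contracts`), and all function and field names are assumptions made for this example, and median imputation is just one possible cleaning choice.

```python
from statistics import median

# Hypothetical primary source; None marks a missing value.
customers = [
    {"customer_id": 1, "age": 34, "monthly_spend": 120.0},
    {"customer_id": 2, "age": None, "monthly_spend": 80.0},
    {"customer_id": 3, "age": 51, "monthly_spend": 95.0},
]
# Hypothetical second source: customer_id -> contract length in months.
contracts = {1: 12, 2: 24, 3: 1}


def clean(rows, field):
    """Clean data: fill missing values of a field with the median of the observed values."""
    observed = [row[field] for row in rows if row[field] is not None]
    fill = median(observed)
    return [{**row, field: fill if row[field] is None else row[field]} for row in rows]


def integrate(rows, lookup, key, new_field):
    """Integrate data: left-join a lookup table onto the rows by key."""
    return [{**row, new_field: lookup.get(row[key])} for row in rows]


def construct(rows):
    """Construct data: derive an annual spend attribute from monthly spend."""
    return [{**row, "annual_spend": row["monthly_spend"] * 12} for row in rows]


def to_matrix(rows, columns):
    """Select and format data: keep the chosen columns in a fixed order."""
    return [[row[column] for column in columns] for row in rows]


prepared = construct(
    integrate(clean(customers, "age"), contracts, "customer_id", "contract_months")
)
features = to_matrix(prepared, ["age", "contract_months", "annual_spend"])
print(features)
```

One design caveat: statistics used for cleaning, such as the median here, are usually computed on the training portion only once a test design exists, so that held-out data does not influence the prepared features.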
1.1.4 4. Modeling
Various modeling techniques are selected and applied, and their parameters are calibrated to optimal values. Often loops back to data preparation, because some techniques place specific requirements on the form of the data.
- Select modeling techniques
- Generate a test design (train/validation/test split)
- Build models
- Assess models (technical performance)
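A test design can be sketched as a reproducible shuffle followed by a three-way cut. The example below is a minimal illustration under assumed values: the 60/20/20 proportions, the seed, the toy data, and the majority-class baseline are hypothetical choices, and the baseline stands in for whatever modeling technique a project actually selects.

```python
import random
from collections import Counter


def split(rows, train=0.6, validation=0.2, seed=0):
    """Test design: shuffle a copy of the rows and cut it into train/validation/test."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_train = int(len(shuffled) * train)
    n_validation = int(len(shuffled) * validation)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_validation],
        shuffled[n_train + n_validation:],
    )


def fit_majority(labels):
    """Build a baseline 'model': always predict the most common training label."""
    return Counter(labels).most_common(1)[0][0]


def accuracy(prediction, labels):
    """Assess the model: share of labels matched by the constant prediction."""
    return sum(1 for label in labels if label == prediction) / len(labels)


# Hypothetical labeled data: (feature, label) pairs with an imbalanced label.
data = [(i, i % 5 == 0) for i in range(30)]

train_set, validation_set, test_set = split(data)
baseline = fit_majority([label for _, label in train_set])
validation_accuracy = accuracy(baseline, [label for _, label in validation_set])
print(len(train_set), len(validation_set), len(test_set), baseline, validation_accuracy)
```

A trivial baseline like this is useful mainly as a reference point: a candidate model that cannot beat it on the validation set is unlikely to be worth evaluating further.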
1.1.5 5. Evaluation
Evaluates the model(s) against the business objectives (not just technical metrics) to be certain the model properly achieves the goals before deployment.
- Evaluate results against business success criteria
- Review the process
- Determine next steps (proceed, iterate, or stop)
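The proceed, iterate, or stop decision can be made explicit by comparing measured results with the success criteria agreed during Business Understanding. The sketch below is one hypothetical way to encode that; the criterion names, thresholds, and the rule that remaining iteration budget decides between iterating and stopping are all assumptions for illustration.

```python
def next_step(measured, criteria, iterations_left):
    """Return a decision and the list of unmet criteria.

    criteria maps a criterion name to its minimum acceptable value;
    a criterion that was not measured counts as unmet.
    """
    unmet = [
        name
        for name, minimum in criteria.items()
        if measured.get(name, float("-inf")) < minimum
    ]
    if not unmet:
        return "proceed", unmet
    # Assumed rule: iterate while budget remains, otherwise stop.
    return ("iterate" if iterations_left > 0 else "stop"), unmet


# Hypothetical business success criteria and measured results.
criteria = {"recall": 0.80, "monthly_savings_eur": 10_000}
measured = {"recall": 0.70, "monthly_savings_eur": 12_000}

print(next_step(measured, criteria, iterations_left=2))
```

Keeping business measures such as the assumed `monthly_savings_eur` next to technical ones makes it harder to approve a model on technical performance alone.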
1.1.6 6. Deployment
Organizes and presents the knowledge so the customer can use it. Ranges from a report to a repeatable, production-grade scoring process.
- Plan deployment
- Plan monitoring and maintenance
- Produce a final report
- Review the project (lessons learned)
1.2 Key Characteristics
- Iterative, not linear — phases feed back into one another; you frequently return to earlier phases as you learn.
- Business-driven — the process begins and ends with business value, not algorithms.
- Tool- and industry-agnostic — applies to any domain and tech stack.
- Documentation-heavy — each phase produces deliverables that inform the next.
1.3 This Repository
This repository is organized as a Quarto book where each CRISP-DM phase is a chapter with its own working folder:
| Phase | Folder |
|---|---|
| 1. Business Understanding | 01_business_understanding/ |
| 2. Data Understanding | 02_data_understanding/ |
| 3. Data Preparation | 03_data_preparation/ |
| 4. Modeling | 04_modeling/ |
| 5. Evaluation | 05_evaluation/ |
| 6. Deployment | 06_deployment/ |
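A quick way to confirm that a checkout matches the layout in the table is to test for each folder. The helper below is a hypothetical sketch and not a script that is known to ship with this repository; only the folder names come from the table.

```python
from pathlib import Path

# Folder names taken from the table above.
PHASE_FOLDERS = [
    "01_business_understanding",
    "02_data_understanding",
    "03_data_preparation",
    "04_modeling",
    "05_evaluation",
    "06_deployment",
]


def missing_phase_folders(root):
    """Return the phase folders that do not exist under the given root directory."""
    root = Path(root)
    return [name for name in PHASE_FOLDERS if not (root / name).is_dir()]


# Run from the repository root; an empty list means every phase folder is present.
print(missing_phase_folders("."))
```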