4 Data Preparation
CRISP-DM Phase 3. Construct the final dataset to be fed into the modeling tools. This is usually the most time-consuming phase of a project.
4.1 Objectives
- Select data — Decide which records and attributes to include or exclude, and why.
- Clean data — Handle missing values, errors, and outliers; raise data quality to the level required by the chosen techniques.
- Construct data — Derive new attributes and engineer features; generate records where needed.
- Integrate data — Combine information from multiple tables or sources.
- Format data — Reformat as required by modeling tools (e.g., types, encoding).
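As a rough illustration of how the five objectives above might map onto code, here is a minimal sketch using only the Python standard library. Everything in it is an assumption made for the example: the `customers` and `orders` records, the field names, the selection rule (`min_signup_year`), median imputation, the income cap, and the output column order. A real project would substitute its own data and document its own choices.

```python
from statistics import median

# Hypothetical raw inputs; all field names and values are illustrative only.
customers = [
    {"customer_id": 1, "age": 34, "income": 52000.0, "signup_year": 2019},
    {"customer_id": 2, "age": None, "income": 61000.0, "signup_year": 2021},
    {"customer_id": 3, "age": 45, "income": 9900000.0, "signup_year": 2020},
    {"customer_id": 4, "age": 29, "income": 48000.0, "signup_year": 2022},
    {"customer_id": 5, "age": 51, "income": 75000.0, "signup_year": 2015},
]
orders = [
    {"customer_id": 1, "amount": 120.0},
    {"customer_id": 1, "amount": 80.0},
    {"customer_id": 2, "amount": 40.0},
    {"customer_id": 4, "amount": 300.0},
]

# Assumed column order expected by a hypothetical modeling tool.
COLUMNS = ["age", "income", "tenure_years", "total_spent"]


def prepare(customers, orders, min_signup_year=2019,
            reference_year=2024, income_cap=1000000.0):
    # Select: keep only records inside the period of interest (copied so the
    # raw input is left unchanged).
    selected = [dict(c) for c in customers
                if c["signup_year"] >= min_signup_year]

    # Clean: fill missing ages with the median of the known ages and cap
    # extreme incomes. Assumes at least one selected record has a known age.
    fill_age = median(c["age"] for c in selected if c["age"] is not None)
    for c in selected:
        if c["age"] is None:
            c["age"] = fill_age
        c["income"] = min(c["income"], income_cap)

    # Construct: derive a new attribute from an existing one.
    for c in selected:
        c["tenure_years"] = reference_year - c["signup_year"]

    # Integrate: aggregate the orders table and join it onto customers,
    # using 0.0 for customers without orders.
    totals = {}
    for o in orders:
        totals[o["customer_id"]] = totals.get(o["customer_id"], 0.0) + o["amount"]
    for c in selected:
        c["total_spent"] = totals.get(c["customer_id"], 0.0)

    # Format: emit numeric rows in a fixed column order.
    return [[float(c[name]) for name in COLUMNS] for c in selected]


rows = prepare(customers, orders)
```

Each comment names the objective it stands in for, so the same skeleton can serve as a checklist when the real transformations are written. The specific choices here (median imputation, capping rather than dropping outliers) are placeholders, not recommendations.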
4.2 Deliverables
- Dataset(s) and dataset description
- Record of selection, cleaning, construction, integration, and formatting steps
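One possible way to produce the record of steps is to run each transformation through a small wrapper that logs what it did. The sketch below is only an assumption about how such a record could be kept; `run_logged`, the step names, and the sample records are hypothetical, and row counts are just one of several things worth recording.

```python
def run_logged(records, steps):
    """Apply each (name, function) step in order and log row counts."""
    log = []
    for name, step in steps:
        rows_in = len(records)
        records = step(records)
        log.append({"step": name, "rows_in": rows_in, "rows_out": len(records)})
    return records, log


# Hypothetical records and steps, for illustration only.
raw = [
    {"id": 1, "value": 3.0},
    {"id": 2, "value": None},
    {"id": 3, "value": 7.5},
]
steps = [
    ("clean: drop records with missing value",
     lambda rs: [r for r in rs if r["value"] is not None]),
    ("construct: add value_squared",
     lambda rs: [{**r, "value_squared": r["value"] ** 2} for r in rs]),
]

prepared, step_log = run_logged(raw, steps)
for entry in step_log:
    print(entry)
```

The resulting `step_log` can be saved alongside the dataset description so that each selection, cleaning, construction, integration, and formatting decision stays traceable.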
4.3 Notes
Document transformation logic, feature engineering, and the final dataset here.