4  Data Preparation

CRISP-DM Phase 3. Construct the final dataset (or datasets) that will be fed into the modeling tools, starting from the initial raw data. This is usually the most time-consuming phase of a project.

4.1 Objectives

  • Select data — Decide which records and attributes to include or exclude, and why.
  • Clean data — Handle missing values, errors, and outliers; raise data quality to the level required by the chosen techniques.
  • Construct data — Derive new attributes and engineer features; generate records where needed.
  • Integrate data — Combine information from multiple tables or sources.
  • Format data — Reformat as required by modeling tools (e.g., types, encoding).
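
The five tasks above can be sketched end to end on toy data. This is a minimal illustration rather than a prescribed method: the tables, the column names (customer_id, age, segment, amount), the "test" segment filter, and the median imputation are all assumptions made up for the example, and a real project would typically use a dataframe library instead of plain dictionaries.

```python
from collections import defaultdict
from statistics import median

# Hypothetical toy tables; every name and value here is an illustrative assumption.
customers = [
    {"customer_id": 1, "age": 34, "segment": "retail"},
    {"customer_id": 2, "age": None, "segment": "retail"},
    {"customer_id": 3, "age": 51, "segment": "corporate"},
    {"customer_id": 4, "age": 29, "segment": "test"},
]
orders = [
    {"customer_id": 1, "amount": 20.0},
    {"customer_id": 1, "amount": 35.0},
    {"customer_id": 2, "amount": 15.0},
]

# Select: exclude internal test accounts (copy rows so the source stays untouched).
selected = [dict(row) for row in customers if row["segment"] != "test"]

# Clean: impute missing ages with the median of the known ages.
known_ages = [row["age"] for row in selected if row["age"] is not None]
fill_age = median(known_ages)
for row in selected:
    if row["age"] is None:
        row["age"] = fill_age

# Construct: derive per-customer order aggregates as new attributes.
totals = defaultdict(lambda: {"order_count": 0, "total_amount": 0.0})
for order in orders:
    agg = totals[order["customer_id"]]
    agg["order_count"] += 1
    agg["total_amount"] += order["amount"]

# Integrate: left-join the aggregates onto the customer rows;
# customers without orders receive zeros.
no_orders = {"order_count": 0, "total_amount": 0.0}
for row in selected:
    row.update(totals.get(row["customer_id"], no_orders))

# Format: one-hot encode the categorical attribute for the modeling tool.
segments = sorted({row["segment"] for row in selected})
prepared = []
for row in selected:
    out = {key: value for key, value in row.items() if key != "segment"}
    for name in segments:
        out[f"segment_{name}"] = int(row["segment"] == name)
    prepared.append(out)
```

Each stage reads only the output of the previous one, so the reason for every exclusion, imputation, or derived attribute can be written down next to the step that performs it.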

4.2 Deliverables

  • Dataset(s) and dataset description
  • Record of selection, cleaning, construction, integration, and formatting steps
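
One possible way to keep the record of steps is a small structured log that is filled in as each step runs. The sketch below is an assumption about how such a record could look, not a required format; the names PrepStep, PrepRecord, and the logged fields (task, description, row counts) are hypothetical.

```python
from dataclasses import dataclass, field

# The five data preparation tasks listed under Objectives.
TASKS = ("select", "clean", "construct", "integrate", "format")


@dataclass
class PrepStep:
    task: str
    description: str
    rows_before: int
    rows_after: int


@dataclass
class PrepRecord:
    steps: list = field(default_factory=list)

    def add(self, task, description, rows_before, rows_after):
        # Reject tasks outside the five known categories to keep the record consistent.
        if task not in TASKS:
            raise ValueError(f"unknown task: {task}")
        self.steps.append(PrepStep(task, description, rows_before, rows_after))

    def to_text(self):
        # Render one numbered line per step, in the order the steps were applied.
        return "\n".join(
            f"{i}. [{s.task}] {s.description} (rows: {s.rows_before} -> {s.rows_after})"
            for i, s in enumerate(self.steps, start=1)
        )


# Example entries; the descriptions and counts are made up for illustration.
record = PrepRecord()
record.add("select", "Dropped internal test accounts", 4, 3)
record.add("clean", "Imputed missing age with the median", 3, 3)
report = record.to_text()
```

The rendered text can be pasted into the Notes subsection alongside the dataset description.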

4.3 Notes

Document transformation logic, feature engineering, and the final dataset here.