DPE Label Prediction
Built an interpretable multiclass model to predict DPE energy labels (A–G) from open ADEME data, with evaluation focused on class imbalance, robustness, and next-step improvements.
Overview
This project documents a machine learning pipeline that predicts the DPE energy label (A–G) from a small, interpretable set of building and energy descriptors available in open ADEME DPE records.
The motivation is practical: infer a plausible label from structured inputs to support quality checks (detecting inconsistent records) and provide a reusable component in a broader product pipeline.
Data sources and scope
- Geography: Paris
- Time window: 2018–2021
- Primary source: ADEME open DPE datasets (pre-2021 and post-2021 methodology)
- Broader project context: Enedis open data on annual residential consumption by address (cleaned to Paris only)
Target and feature design
The supervised target is the DPE label, etiquette_dpe, with 7 classes: A, B, C, D, E, F, G. The feature set was intentionally kept lightweight for interpretability.
- Numerical: surface_habitable_logement, conso_5_usages_par_m2_ep, emission_ges_5_usages
- Categorical: type_batiment, periode_construction, type_energie_principale_chauffage
- After preprocessing: 24-dimensional vector (scaling + one-hot encoding)
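For reference, the feature design can be written down explicitly. The identifiers below are assumptions based on the open ADEME DPE schema; verify them against the actual dataset columns before reuse.

```python
# Column identifiers assumed from the open ADEME DPE schema; verify against
# the actual dataset before use.
NUMERICAL = [
    "surface_habitable_logement",        # living area (m2)
    "conso_5_usages_par_m2_ep",          # primary-energy use per m2 (5 usages)
    "emission_ges_5_usages",             # greenhouse-gas emissions (5 usages)
]
CATEGORICAL = [
    "type_batiment",
    "periode_construction",
    "type_energie_principale_chauffage",
]
TARGET = "etiquette_dpe"  # 7 classes: A..G

# 24 dimensions after preprocessing:
# 3 scaled numerical columns + 21 one-hot categorical columns.
N_FEATURES = 24
N_ONEHOT_COLUMNS = N_FEATURES - len(NUMERICAL)
```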
Preprocessing pipeline
The full workflow is implemented as a scikit-learn Pipeline, ensuring identical transformations at training and inference time.
Numerical variables are median-imputed and standardized. Categorical variables are imputed with the most frequent value, coerced to strings, and one-hot encoded with unknown categories ignored to keep inference robust.
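The transformations described above can be sketched as a ColumnTransformer. Column names are assumptions from the ADEME schema, and the string-coercion helper is deliberately a named, module-level function rather than a lambda (this matters for serialization, as discussed next).

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder, StandardScaler

NUM_COLS = ["surface_habitable_logement", "conso_5_usages_par_m2_ep",
            "emission_ges_5_usages"]
CAT_COLS = ["type_batiment", "periode_construction",
            "type_energie_principale_chauffage"]

def coerce_to_str(X):
    # Named module-level helper: picklable, unlike a notebook-local lambda.
    return X.astype(str)

numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
categorical = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("to_str", FunctionTransformer(coerce_to_str)),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])
preprocess = ColumnTransformer([
    ("num", numeric, NUM_COLS),
    ("cat", categorical, CAT_COLS),
])
```

`handle_unknown="ignore"` is what keeps inference robust: a category unseen at training time simply encodes as all zeros instead of raising an error.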
Reloading the serialized model (joblib) initially failed because a helper function used in preprocessing was defined inside a notebook. The robust fix is to move helpers into a dedicated Python module and import them consistently during training/inference.
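A minimal sketch of that fix, with a hypothetical module name: the helper lives in an importable file, and both training and inference import it from the same path so joblib can resolve the reference when deserializing.

```python
# dpe_helpers.py -- hypothetical module name. Any function referenced inside
# the serialized Pipeline must live here, not in the notebook, so that
# joblib.load() can re-import it in another process.
def coerce_to_str(X):
    """Cast imputed categorical values to strings before one-hot encoding."""
    return X.astype(str)

# Both train.py and inference.py then do:
#   from dpe_helpers import coerce_to_str
#   ...build the pipeline with FunctionTransformer(coerce_to_str)...
#   joblib.dump(pipeline, "model.joblib")    # training side
#   model = joblib.load("model.joblib")      # inference side: import resolves
```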
Train/test split and dataset sizes
A stratified 80/20 split was used to preserve the A–G label distribution under class imbalance.
- Dev set: 80,349 rows
- Test set: 150,155 rows
- Note: split is stratified but not grouped (possible leakage if multiple records per building/address)
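A stratified split of this kind is a one-liner in scikit-learn; the toy arrays below stand in for the real dataframe (sizes chosen so the per-label test counts come out exact).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-in: 140 rows, 20 per label A..G, so an 80/20 stratified split
# keeps exactly 4 of each label in the test fold.
y = np.array(list("ABCDEFG") * 20)
X = np.arange(len(y), dtype=float).reshape(-1, 1)

X_dev, X_test, y_dev, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
```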
Model choice and training
We trained a linear multiclass classifier using SGDClassifier with logistic loss (loss="log_loss"). Note that scikit-learn's SGDClassifier handles multiple classes one-versus-rest rather than with a true multinomial objective, but each label still gets its own linear scorer, so the model fits sparse one-hot features and remains interpretable.
Class imbalance is handled via class_weight="balanced". Training raised a non-convergence warning, pointing to tuning opportunities (increase max_iter, adjust tol and/or alpha).
Model form
After preprocessing, the input becomes a feature vector ϕ(x) ∈ ℝ²⁴. For each class k ∈ A–G, the model computes a linear score:
s_k(x) = b_k + Σ_{j=1..24} w_{k,j} ϕ_{j}(x)
Scores are converted to probabilities with softmax, and the prediction is the argmax over class scores.
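The model form above can be sketched directly in NumPy. The weight matrix W (7×24), bias vector b, and label ordering below are placeholders, not the fitted parameters.

```python
import numpy as np

LABELS = ("A", "B", "C", "D", "E", "F", "G")

def predict_label(phi_x, W, b, labels=LABELS):
    """Linear scores s_k = b_k + w_k . phi(x), softmax, then argmax."""
    scores = b + W @ phi_x              # shape (7,)
    scores = scores - scores.max()      # shift for numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return labels[int(np.argmax(probs))], probs
```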
Evaluation results
Held-out test set (n = 150,155): accuracy = 0.73, balanced accuracy = 0.798, macro F1 = 0.72, weighted F1 = 0.72.
Performance is strongest on high-support labels (notably C and D). Rare labels are harder to evaluate reliably (especially A), and class E is predicted conservatively (high precision, lower recall).
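The reported metrics map directly onto scikit-learn calls. The snippet below shows those calls on a toy label vector (the real y_test/y_pred are not reproduced here); the classification report is where the per-label precision/recall patterns noted above become visible.

```python
from sklearn.metrics import (accuracy_score, balanced_accuracy_score,
                             classification_report, f1_score)

# Toy labels, for illustration only; note class A is never predicted.
y_test = ["C", "D", "D", "E", "C", "A"]
y_pred = ["C", "D", "C", "E", "C", "D"]

acc = accuracy_score(y_test, y_pred)
bal_acc = balanced_accuracy_score(y_test, y_pred)       # mean per-class recall
macro_f1 = f1_score(y_test, y_pred, average="macro", zero_division=0)
weighted_f1 = f1_score(y_test, y_pred, average="weighted", zero_division=0)
print(classification_report(y_test, y_pred, zero_division=0))
```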
Limitations
- Non-convergence warning suggests further tuning and/or regularization adjustments.
- Potential leakage risk: split is not grouped by building/address identifier.
- Class A scarcity makes its precision/recall estimates noisy.
- DPE methodology shift in mid-2021 complicates cross-period comparability.
Next improvements
- Increase max_iter and tune alpha/tol to improve convergence stability.
- Validate robustness with a grouped split by building/address (a more conservative estimate of generalization).
- Consider merging rare labels or optimizing for broader risk bands.
- Roadmap: train a regression model on measured Enedis consumption + uncertainty intervals.
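The grouped split suggested above can be done with GroupShuffleSplit; the building identifiers below are hypothetical stand-ins for an address or parcel key shared by multiple DPE records.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical building identifiers: two records per building.
groups = np.array([0, 0, 1, 1, 2, 2, 3, 3, 4, 4])
X = np.arange(10, dtype=float).reshape(-1, 1)
y = np.array(list("AABBCCDDEE"))

gss = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(gss.split(X, y, groups=groups))

# No building appears on both sides of the split:
assert set(groups[train_idx]).isdisjoint(groups[test_idx])
```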
Report
Download PDF: Full model build report (Paris, 2018–2021), including dataset scope, feature design, preprocessing, explicit model form, and evaluation discussion.