Nicodème Westphalen
Research · 2026

DPE Label Prediction

Built an interpretable multiclass model to predict DPE energy labels (A–G) from open ADEME data, with evaluation focused on class imbalance, robustness, and next-step improvements.

Overview

This project documents a machine learning pipeline that predicts the DPE energy label (A–G) from a small, interpretable set of building and energy descriptors available in open ADEME DPE records.

The motivation is practical: infer a plausible label from structured inputs to support quality checks (detecting inconsistent records) and provide a reusable component in a broader product pipeline.

Data sources and scope

  • Geography: Paris
  • Time window: 2018–2021
  • Primary source: ADEME open DPE datasets (pre-2021 and post-2021 methodology)
  • Broader project context: Enedis open data on annual residential consumption by address (Paris-only cleaning)

Target and feature design

The supervised target is the DPE label etiquette_dpe, with seven classes: A, B, C, D, E, F, G. The feature set was intentionally kept lightweight for interpretability.

  • Numerical: surface_habitable_logement, conso_5_usages_par_m2_ep, emission_ges_5_usages
  • Categorical: type_batiment, periode_construction, type_energie_principale_chauffage
  • After preprocessing: 24-dimensional vector (scaling + one-hot encoding)

Preprocessing pipeline

The full workflow is implemented as a scikit-learn Pipeline, ensuring identical transformations at training and inference time.

Numerical variables are median-imputed and standardized. Categorical variables are imputed with the most frequent value, coerced to strings, and one-hot encoded with unknown categories ignored to keep inference robust.

Practical issue encountered

Reloading the serialized model (joblib) initially failed because a helper function used in preprocessing was defined inside a notebook: pickle records such functions by reference to their defining module (here the notebook's __main__), which a fresh process cannot resolve on load. The robust fix is to move helpers into a dedicated Python module and import them consistently during training and inference.
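A minimal illustration of why the module move works, using the standard library's pickle (joblib serializes functions the same way, by reference); the module name dpe_helpers and the helper as_str are hypothetical:

```python
import importlib
import pathlib
import pickle
import sys
import tempfile

# Hypothetical helper moved out of the notebook into its own module.
helper_src = 'def as_str(x):\n    return "" if x is None else str(x)\n'

mod_dir = pathlib.Path(tempfile.mkdtemp())
(mod_dir / "dpe_helpers.py").write_text(helper_src)
sys.path.insert(0, str(mod_dir))
dpe_helpers = importlib.import_module("dpe_helpers")

# Functions that live in an importable module are pickled by reference
# (module name + function name), so any process that can import
# dpe_helpers can deserialize them; a function defined only in a
# notebook's __main__ cannot be resolved that way on reload.
blob = pickle.dumps(dpe_helpers.as_str)
restored = pickle.loads(blob)
print(restored(None), restored(3))
```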

Train/test split and dataset sizes

A stratified 80/20 split was used to preserve the A–G label distribution under class imbalance.

  • Dev set: 80,349 rows
  • Test set: 150,155 rows
  • Note: split is stratified but not grouped (possible leakage if multiple records per building/address)
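A stratified split along these lines can be sketched with train_test_split on synthetic imbalanced labels (the class proportions below are illustrative, not the real A–G distribution):

```python
from collections import Counter

import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Synthetic stand-in for the A–G labels; proportions are illustrative.
y = rng.choice(list("ABCDEFG"), size=10_000,
               p=[0.01, 0.04, 0.20, 0.35, 0.25, 0.10, 0.05])
X = rng.normal(size=(y.size, 3))

X_dev, X_test, y_dev, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Stratification keeps the rare-class share nearly identical on both sides.
dev_share = Counter(y_dev)["A"] / y_dev.size
test_share = Counter(y_test)["A"] / y_test.size
print(y_dev.size, y_test.size)
```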

Model choice and training

We trained a linear multiclass classifier using SGDClassifier with multinomial logistic regression (loss="log_loss"). This choice fits sparse one-hot features and remains interpretable.

Class imbalance is handled via class_weight="balanced". Training raised a non-convergence warning, pointing to tuning opportunities (increase max_iter, adjust tol and/or alpha).

Model form

After preprocessing, the input becomes a feature vector ϕ(x) ∈ ℝ²⁴. For each class k ∈ A–G, the model computes a linear score:

s_k(x) = b_k + Σ_{j=1..24} w_{k,j} ϕ_{j}(x)

Scores are converted to probabilities with softmax, and the prediction is the argmax over class scores.
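The score-to-prediction step can be illustrated with a standalone softmax (the class scores below are made-up numbers, not model output):

```python
import numpy as np

def softmax(scores):
    """Numerically stable softmax: shift by the max before exponentiating."""
    z = scores - scores.max()
    e = np.exp(z)
    return e / e.sum()

labels = list("ABCDEFG")
# Illustrative class scores s_k(x) for one dwelling (made-up numbers).
s = np.array([-2.1, -0.4, 1.3, 2.0, 0.7, -1.0, -2.5])
p = softmax(s)
pred = labels[int(np.argmax(p))]
print(pred)
```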

Evaluation results

Held-out test set (n = 150,155): accuracy = 0.73, balanced accuracy = 0.798, macro F1 = 0.72, weighted F1 = 0.72.

Performance is strongest on high-support labels (notably C and D). Rare labels are harder to evaluate reliably (especially A), and class E is predicted conservatively (high precision, lower recall).
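The reported metrics can be reproduced on any pair of label vectors with scikit-learn; the tiny vectors below are illustrative, not the project's predictions. Macro F1 and balanced accuracy weight every class equally, which is why they matter under the A–G imbalance:

```python
from sklearn.metrics import (accuracy_score, balanced_accuracy_score,
                             f1_score)

# Tiny illustrative vectors; the real numbers come from the held-out set.
y_true = list("CCDDDEEFGAB")
y_pred = list("CCDDEEEFGAB")

acc = accuracy_score(y_true, y_pred)
bal = balanced_accuracy_score(y_true, y_pred)
mf1 = f1_score(y_true, y_pred, average="macro")
print(round(acc, 3), round(bal, 3), round(mf1, 3))
```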

Limitations

  • Non-convergence warning suggests further tuning and/or regularization adjustments.
  • Potential leakage risk: split is not grouped by building/address identifier.
  • Class A scarcity makes its precision/recall estimates noisy.
  • DPE methodology shift in mid-2021 complicates cross-period comparability.

Next improvements

  • Increase max_iter and tune alpha/tol to improve convergence stability.
  • Validate robustness with a grouped split (more conservative generalization).
  • Consider merging rare labels or optimizing for broader risk bands.
  • Roadmap: train a regression model on measured Enedis consumption + uncertainty intervals.
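The grouped-split idea above can be sketched with GroupShuffleSplit; the building identifier column is hypothetical, standing in for whatever building/address key the ADEME records provide:

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(0)
n = 1_000
# Hypothetical building identifier; several DPE records can share one.
groups = rng.integers(0, 300, size=n)
X = rng.normal(size=(n, 3))
y = rng.choice(list("ABCDEFG"), size=n)

splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
dev_idx, test_idx = next(splitter.split(X, y, groups=groups))

# No building appears on both sides of the split, which removes the
# per-building leakage risk noted earlier (at the cost of stratification).
overlap = set(groups[dev_idx]) & set(groups[test_idx])
print(len(overlap))
```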

This page is the full model build report (Paris, 2018–2021), covering dataset scope, feature design, preprocessing, the explicit model form, and the evaluation discussion.