DPE Label Prediction
Built an interpretable multiclass model to predict DPE energy labels (A–G) from open ADEME data, with evaluation focused on class imbalance, robustness, and next-step improvements.
Overview
This project documents a machine learning pipeline that predicts the DPE energy label (A–G) from a small, interpretable set of building and energy descriptors available in open ADEME DPE records.
The motivation is practical: infer a plausible label from structured inputs to support quality checks (detecting inconsistent records) and provide a reusable component in a broader product pipeline.
Data sources and scope
- Geography: Paris
- Time window: 2018–2021
- Primary source: ADEME open DPE datasets (pre-2021 and post-2021 methodology)
- Broader project context: Enedis open data on annual residential consumption by address (cleaned to Paris only)
Target and feature design
The supervised target is the DPE label, etiquette_dpe, with 7 classes: A, B, C, D, E, F, G. The feature set was intentionally kept lightweight for interpretability.
- Numerical: surface_habitable_logement, conso_5_usages_par_m2_ep, emission_ges_5_usages
- Categorical: type_batiment, periode_construction, type_energie_principale_chauffage
- After preprocessing: 24-dimensional vector (scaling + one-hot encoding)
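For reference, the feature design can be written down explicitly. The identifiers below are assumptions based on the open ADEME DPE schema; verify them against the actual dataset columns before reuse.

```python
# Column identifiers assumed from the open ADEME DPE schema; verify against
# the actual dataset before use.
NUMERICAL = [
    "surface_habitable_logement",        # living area (m2)
    "conso_5_usages_par_m2_ep",          # primary-energy use per m2 (5 usages)
    "emission_ges_5_usages",             # greenhouse-gas emissions (5 usages)
]
CATEGORICAL = [
    "type_batiment",
    "periode_construction",
    "type_energie_principale_chauffage",
]
TARGET = "etiquette_dpe"  # 7 classes: A..G

# 24 dimensions after preprocessing:
# 3 scaled numerical columns + 21 one-hot categorical columns.
N_FEATURES = 24
N_ONEHOT_COLUMNS = N_FEATURES - len(NUMERICAL)
```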
Preprocessing pipeline
The full workflow is implemented as a scikit-learn Pipeline, ensuring identical transformations at training and inference time.
Numerical variables are median-imputed and standardized. Categorical variables are imputed with the most frequent value, coerced to strings, and one-hot encoded with unknown categories ignored to keep inference robust.
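The transformations described above can be sketched as a ColumnTransformer. Column names are assumptions from the ADEME schema, and the string-coercion helper is deliberately a named, module-level function rather than a lambda (this matters for serialization, as discussed next).

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder, StandardScaler

NUM_COLS = ["surface_habitable_logement", "conso_5_usages_par_m2_ep",
            "emission_ges_5_usages"]
CAT_COLS = ["type_batiment", "periode_construction",
            "type_energie_principale_chauffage"]

def coerce_to_str(X):
    # Named module-level helper: picklable, unlike a notebook-local lambda.
    return X.astype(str)

numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
categorical = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("to_str", FunctionTransformer(coerce_to_str)),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])
preprocess = ColumnTransformer([
    ("num", numeric, NUM_COLS),
    ("cat", categorical, CAT_COLS),
])
```

`handle_unknown="ignore"` is what keeps inference robust: a category unseen at training time simply encodes as all zeros instead of raising an error.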
Reloading the serialized model (joblib) initially failed because a helper function used in preprocessing was defined inside a notebook. The robust fix is to move helpers into a dedicated Python module and import them consistently during training/inference.
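A minimal sketch of that fix, with a hypothetical module name: the helper lives in an importable file, and both training and inference import it from the same path so joblib can resolve the reference when deserializing.

```python
# dpe_helpers.py -- hypothetical module name. Any function referenced inside
# the serialized Pipeline must live here, not in the notebook, so that
# joblib.load() can re-import it in another process.
def coerce_to_str(X):
    """Cast imputed categorical values to strings before one-hot encoding."""
    return X.astype(str)

# Both train.py and inference.py then do:
#   from dpe_helpers import coerce_to_str
#   ...build the pipeline with FunctionTransformer(coerce_to_str)...
#   joblib.dump(pipeline, "model.joblib")    # training side
#   model = joblib.load("model.joblib")      # inference side: import resolves
```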
Train/test split and dataset sizes
A stratified 80/20 split was used to preserve the A–G label distribution under class imbalance.
- Dev set: 80,349 rows
- Test set: 150,155 rows
- Note: split is stratified but not grouped (possible leakage if multiple records per building/address)
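A stratified split of this kind is a one-liner in scikit-learn; the toy arrays below stand in for the real dataframe (sizes chosen so the per-label test counts come out exact).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-in: 140 rows, 20 per label A..G, so an 80/20 stratified split
# keeps exactly 4 of each label in the test fold.
y = np.array(list("ABCDEFG") * 20)
X = np.arange(len(y), dtype=float).reshape(-1, 1)

X_dev, X_test, y_dev, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
```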
Model choice and training
We trained a linear multiclass classifier using SGDClassifier with logistic loss (loss="log_loss"). Note that scikit-learn's SGDClassifier handles multiple classes one-versus-rest rather than with a true multinomial objective, but each label still gets its own linear scorer, so the model fits sparse one-hot features and remains interpretable.
Class imbalance is handled via class_weight="balanced". Training raised a non-convergence warning, pointing to tuning opportunities (increase max_iter, adjust tol and/or alpha).
Model form
After preprocessing, the input becomes a feature vector ϕ(x) ∈ ℝ²⁴. For each class k ∈ A–G, the model computes a linear score:
s_k(x) = b_k + Σ_{j=1..24} w_{k,j} ϕ_{j}(x)
Scores are converted to probabilities with softmax, and the prediction is the argmax over class scores.
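The model form above can be sketched directly in NumPy. The weight matrix W (7×24), bias vector b, and label ordering below are placeholders, not the fitted parameters.

```python
import numpy as np

LABELS = ("A", "B", "C", "D", "E", "F", "G")

def predict_label(phi_x, W, b, labels=LABELS):
    """Linear scores s_k = b_k + w_k . phi(x), softmax, then argmax."""
    scores = b + W @ phi_x              # shape (7,)
    scores = scores - scores.max()      # shift for numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return labels[int(np.argmax(probs))], probs
```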
Evaluation results
Held-out test set (n = 150,155): accuracy = 0.73, balanced accuracy = 0.798, macro F1 = 0.72, weighted F1 = 0.72.
Performance is strongest on high-support labels (notably C and D). Rare labels are harder to evaluate reliably (especially A), and class E is predicted conservatively (high precision, lower recall).
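The reported metrics map directly onto scikit-learn calls. The snippet below shows those calls on a toy label vector (the real y_test/y_pred are not reproduced here); the classification report is where the per-label precision/recall patterns noted above become visible.

```python
from sklearn.metrics import (accuracy_score, balanced_accuracy_score,
                             classification_report, f1_score)

# Toy labels, for illustration only; note class A is never predicted.
y_test = ["C", "D", "D", "E", "C", "A"]
y_pred = ["C", "D", "C", "E", "C", "D"]

acc = accuracy_score(y_test, y_pred)
bal_acc = balanced_accuracy_score(y_test, y_pred)       # mean per-class recall
macro_f1 = f1_score(y_test, y_pred, average="macro", zero_division=0)
weighted_f1 = f1_score(y_test, y_pred, average="weighted", zero_division=0)
print(classification_report(y_test, y_pred, zero_division=0))
```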
Limitations
- Non-convergence warning suggests further tuning and/or regularization adjustments.
- Potential leakage risk: split is not grouped by building/address identifier.
- Class A scarcity makes its precision/recall estimates noisy.
- DPE methodology shift in mid-2021 complicates cross-period comparability.
Next improvements
- Increase max_iter and tune alpha/tol to improve convergence stability.
- Validate robustness with a grouped split by building/address (a more conservative estimate of generalization).
- Consider merging rare labels or optimizing for broader risk bands.
- Roadmap: train a regression model on measured Enedis consumption + uncertainty intervals.
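The grouped split suggested above can be done with GroupShuffleSplit; the building identifiers below are hypothetical stand-ins for an address or parcel key shared by multiple DPE records.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical building identifiers: two records per building.
groups = np.array([0, 0, 1, 1, 2, 2, 3, 3, 4, 4])
X = np.arange(10, dtype=float).reshape(-1, 1)
y = np.array(list("AABBCCDDEE"))

gss = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(gss.split(X, y, groups=groups))

# No building appears on both sides of the split:
assert set(groups[train_idx]).isdisjoint(groups[test_idx])
```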
Report
Download PDF: Full model build report (Paris, 2018–2021), including dataset scope, feature design, preprocessing, explicit model form, and evaluation discussion.