MSc thesis · Bangor University · 2026
Predicting vehicle CO₂ emissions with machine learning, and making the result something a buyer can actually use.
For my MSc in Artificial Intelligence & Data Science at Bangor University, I built and compared machine-learning models that predict a car's CO₂ emissions from its features, then used SHAP to make every prediction explainable. The goal was simple: turn a technical model into clear, evidence-based guidance for people choosing a lower-emission vehicle.
6,282 vehicle records
12 input variables
0.9943 best R² (Random Forest)
2.35 MAE (g/km)

Aligned with SDG 13, Climate Action, through consumer-level transparency.
In short
What the study set out to do
Road transport is responsible for roughly 16% of global CO₂ emissions, and light-duty vehicles make up a large share of that. Emissions data is increasingly published, but it is usually too technical for an ordinary buyer to act on. I set out to close that gap: build models accurate enough to be trusted, then make them transparent enough that a non-expert can see why a given car emits what it does. I used a public dataset of 6,282 Canadian vehicles, trained several models, and explained the best one with SHAP so the reasoning is visible, not hidden inside a black box.
The gap I addressed
Most emissions tools are either accurate but opaque or simple but inaccurate. I aimed to combine the strengths of both: high accuracy, plus interpretability that supports a real purchasing decision.
Objectives
1. Predict vehicle CO₂ emissions from engine size, cylinder count and fuel efficiency.
2. Compare four ML algorithms: Linear Regression, Random Forest, KNN and SVR.
3. Interpret model behaviour and identify the key features using SHAP values.
4. Test whether extended attributes (fuel type, vehicle class, transmission) improve accuracy.
Research questions
1. How accurately can CO₂ emissions be predicted from engine size, cylinder count and fuel consumption?
2. Which model gives the best balance of accuracy and interpretability?
3. What is the relative influence of each vehicle feature on emissions?
4. Do extra features such as fuel type and vehicle class meaningfully improve performance?
How I built it
Data and methodology
I worked with a public dataset from Natural Resources Canada, cleaned it, engineered features, and trained a spread of models before interpreting the strongest one. The pipeline moves from raw data to a prediction a person can understand.
Step 1
Collect
7,385 records from Natural Resources Canada, reduced to 6,282 after removing 1,103 duplicates.
Step 2
Preprocess
One-hot encoding, feature selection, scaling, and an 80/20 split (5,025 train / 1,257 test).
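As a rough illustration of this step (not the thesis code), the sketch below runs duplicate removal, one-hot encoding, an 80/20 split and scaling on a tiny made-up table. The column names, the toy values and the use of pandas and scikit-learn's `train_test_split` and `StandardScaler` are assumptions made for the example. It also checks that an 80/20 split of 6,282 rows yields the 5,025 / 1,257 sizes quoted above.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy rows with hypothetical column names; the first two rows are exact duplicates.
raw = pd.DataFrame({
    "engine_size_l": [2.0, 2.0, 3.5, 5.0, 1.6, 2.4],
    "cylinders":     [4,   4,   6,   8,   4,   4],
    "fuel_type":     ["X", "X", "Z", "Z", "D", "E"],
})

# 1. Drop exact duplicate rows (assumed here to mean fully identical records).
clean = raw.drop_duplicates().reset_index(drop=True)

# 2. One-hot encode the categorical column into 0/1 indicator columns.
encoded = pd.get_dummies(clean, columns=["fuel_type"], dtype=float)

# 3. An 80/20 split of 6,282 row indices gives the reported set sizes.
train_idx, test_idx = train_test_split(
    np.arange(6282), test_size=0.2, random_state=0
)

# 4. Fit the scaler on training rows only, then reuse it on held-out rows,
#    so test-set statistics never leak into preprocessing.
numeric_cols = ["engine_size_l", "cylinders"]
toy_train, toy_test = train_test_split(encoded, test_size=0.2, random_state=0)
scaler = StandardScaler().fit(toy_train[numeric_cols])
toy_train_scaled = scaler.transform(toy_train[numeric_cols])
toy_test_scaled = scaler.transform(toy_test[numeric_cols])

print(len(clean), list(encoded.columns))
print(len(train_idx), len(test_idx))
```

Fitting the scaler on the training portion alone is a common convention; whether the original pipeline ordered these steps the same way is not stated in the write-up.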
Step 3
Train
Linear Regression, Random Forest, KNN and SVR, plus Decision Tree, Gradient Boosting and XGBoost.
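To make the comparison concrete, here is a minimal sketch of training the four headline models side by side with scikit-learn. It runs on synthetic stand-in data, not the Natural Resources Canada dataset, and the feature relationships, hyperparameters and variable names are all assumptions chosen for illustration. The ranking it prints says nothing about the results reported on this page.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in data (made up for this sketch).
rng = np.random.default_rng(42)
n = 400
engine_l = rng.uniform(1.0, 6.5, n)
cylinders = np.clip(np.round(engine_l * 1.5), 3, 12)
fuel_l_100km = 4.0 + 1.8 * engine_l + rng.normal(0.0, 0.8, n)
co2_g_km = 23.0 * fuel_l_100km + rng.normal(0.0, 5.0, n)

X = np.column_stack([engine_l, cylinders, fuel_l_100km])
X_train, X_test, y_train, y_test = train_test_split(
    X, co2_g_km, test_size=0.2, random_state=42
)

# Distance- and kernel-based models get a scaler in front of them;
# the tree ensemble and the linear model do not need one.
models = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=42),
    "KNN": make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=5)),
    "SVR": make_pipeline(StandardScaler(), SVR(C=100.0)),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    results[name] = {
        "mae": mean_absolute_error(y_test, pred),
        "r2": r2_score(y_test, pred),
    }

# Print models from lowest to highest mean absolute error.
for name, scores in sorted(results.items(), key=lambda kv: kv[1]["mae"]):
    print(f"{name:18s} MAE={scores['mae']:.2f}  R2={scores['r2']:.4f}")
```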
Step 4
Evaluate
Compared on MAE, RMSE, MAPE and R², with GridSearchCV tuning for Random Forest.
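For readers unfamiliar with these four metrics, the sketch below computes them from their standard textbook definitions in plain Python. The function name and the four sample values are made up for the example, and MAPE is expressed as a percentage, which is an assumption about the convention used.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return MAE, RMSE, MAPE (in percent) and R² for paired values."""
    n = len(y_true)
    errors = [pred - true for true, pred in zip(y_true, y_pred)]
    squared_error_sum = sum(e * e for e in errors)

    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(squared_error_sum / n)
    # MAPE divides each error by the true value, so true values must be non-zero.
    mape = 100.0 * sum(abs(e) / abs(t) for e, t in zip(errors, y_true)) / n

    mean_true = sum(y_true) / n
    total_sum_of_squares = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - squared_error_sum / total_sum_of_squares

    return {"mae": mae, "rmse": rmse, "mape": mape, "r2": r2}

# Made-up emissions in g/km, purely to exercise the function.
actual = [200.0, 250.0, 300.0, 150.0]
predicted = [198.0, 253.0, 297.0, 152.0]

metrics = regression_metrics(actual, predicted)
print(metrics)
```

MAE and RMSE are in the same units as the target (g/km here), which is why an MAE can be read directly as "grams per kilometre off, on average".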
Step 5
Explain
SHAP values for global and per-vehicle interpretability, so the predictions are readable.
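To show what a per-vehicle explanation means, the sketch below computes exact Shapley values by brute force for a three-feature toy model. It does not use the SHAP library and is not the thesis model: the formula in `toy_model`, the feature names and the single reference vehicle used as a baseline are all assumptions. The SHAP library averages over background data instead of using one reference point, and it uses faster algorithms for tree models.

```python
from itertools import combinations
from math import factorial

def toy_model(features):
    """Hypothetical emissions formula in g/km, with one interaction term."""
    return (
        23.0 * features["fuel_l_100km"]
        + 4.0 * features["engine_l"]
        + 0.5 * features["engine_l"] * features["cylinders"]
    )

def shapley_values(model, instance, baseline):
    """Exact Shapley values: features outside a coalition keep baseline values."""
    names = list(instance)
    n = len(names)

    def value(coalition):
        mixed = {
            name: instance[name] if name in coalition else baseline[name]
            for name in names
        }
        return model(mixed)

    contributions = {}
    for name in names:
        others = [other for other in names if other != name]
        total = 0.0
        for size in range(n):
            # Weight of a coalition of this size in the Shapley formula.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                without = set(subset)
                total += weight * (value(without | {name}) - value(without))
        contributions[name] = total
    return contributions

car = {"fuel_l_100km": 12.0, "engine_l": 5.0, "cylinders": 8.0}
reference = {"fuel_l_100km": 9.0, "engine_l": 3.0, "cylinders": 6.0}

phi = shapley_values(toy_model, car, reference)
gap = toy_model(car) - toy_model(reference)
print(phi, gap)
```

The contributions always sum to the gap between the car's prediction and the reference prediction, which is the property that lets each prediction be read as "baseline plus a push from each feature".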
Results
Which model predicted best
Random Forest came out on top, with a mean absolute error of just 2.35 g/km and an R² of 0.9943, which indicates near-perfect agreement between predicted and actual emissions. Lower bars are better.
Mean absolute error on the extended feature set. XGBoost led the tree-only models at R² 0.9747; SVR was the weakest overall.
Explainability
What actually drives a car's emissions
Using SHAP, I could measure how much each feature pushed a prediction up or down. One story dominates: how much fuel a car burns matters far more than its badge or its gearbox.
Importance by feature group
Mean absolute SHAP value
Usage pattern (how efficiently a car burns fuel) accounts for roughly 80 to 85% of total feature importance in the model's predictions.
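A group-level share like this can be derived by averaging the absolute SHAP value of each feature across vehicles, summing within each group, and dividing by the grand total. The sketch below shows that arithmetic on made-up numbers. The SHAP values, the feature names and the group assignments are hypothetical, and the output is not a thesis result.

```python
# Hypothetical per-vehicle SHAP values in g/km (made-up numbers).
shap_rows = [
    {"fuel_comb": 30.0, "engine_size": 4.0, "cylinders": 2.0, "fuel_type_E": -4.0},
    {"fuel_comb": -20.0, "engine_size": -2.0, "cylinders": -1.0, "fuel_type_E": 0.0},
    {"fuel_comb": 10.0, "engine_size": 3.0, "cylinders": 0.0, "fuel_type_E": -2.0},
]

# Hypothetical mapping from individual features to feature groups.
feature_groups = {
    "fuel_comb": "usage",
    "engine_size": "engine",
    "cylinders": "engine",
    "fuel_type_E": "fuel type",
}

def group_importance(rows, groups):
    """Share of total mean |SHAP| attributed to each feature group."""
    n = len(rows)
    totals = {}
    for feature, group in groups.items():
        mean_abs = sum(abs(row[feature]) for row in rows) / n
        totals[group] = totals.get(group, 0.0) + mean_abs
    grand_total = sum(totals.values())
    return {group: value / grand_total for group, value in totals.items()}

shares = group_importance(shap_rows, feature_groups)
for group, share in shares.items():
    print(f"{group:10s} {share:.1%}")
```

Taking absolute values before averaging matters: a feature that pushes some predictions up and others down would otherwise cancel out and look unimportant.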
How individual features move a prediction
Ethanol (E85) is the one major feature that lowers predicted emissions. Everything else above pushes them up.
The patterns
Where the emissions concentrate
Average CO₂ by fuel type (g/km)
Average CO₂ by vehicle class (g/km)
Highest emitters
Bugatti (522 g/km), Lamborghini (402 g/km) and SRT (389 g/km). Supercars emit nearly twice the dataset average.
Lowest emitters
Hyundai Ioniq Electric and Kia Soul EV came in under 100 g/km.
Cylinders
4-cylinder cars sit at 100 to 250 g/km; 8+ cylinders routinely pass 350; a 16-cylinder engine topped 500.
What this means if you are choosing a car
Instead of trusting a sticker or a marketing claim, you can see what really moves a car's emissions. Combined fuel consumption is by far the strongest signal, followed by fuel type. A smaller, more fuel-efficient engine, or an alternative fuel like ethanol, lowers the footprint far more than brand or transmission ever will. My contribution was not only an accurate model, but a transparent one a non-expert can actually read and act on.
Sustainability & ethics
Built for SDG 13, Climate Action
The work supports SDG 13 by making vehicle emissions transparent at the point of choice. I used only public, anonymised data, and analysed SHAP values across drivetrain and fuel categories to check for and reduce bias, so the guidance stays fair as well as accurate.
Public data
Anonymised, openly published Canadian records.
Fairness
SHAP checked across fuel and drivetrain groups.
Transparency
Every prediction is explainable, not a black box.
Impact
Lower-emission choices made easier for buyers.
Limitations
- The data is Canadian only, so it may not generalise to other fuel mixes or regulations.
- Interaction effects were not modelled explicitly; they were captured only implicitly, to the extent that tree ensembles learn them.
- Real-world factors like vehicle age, maintenance and driving behaviour were not captured.
Future work
- Add real-time telemetry or IoT sensor data for greater realism.
- Expand the dataset across regions and climates for broader generalisation.
- Combine deep learning with interpretable models to balance power and transparency.
- Extend to lifecycle CO₂, including manufacturing and disposal.

Supervision & acknowledgement
This research was completed under the supervision of Dr. Roggers Giddings at the School of Computer Science, Bangor University, whose guidance shaped the work throughout.
The full paper is available on request. A downloadable copy will be added here soon.