MSc thesis · Bangor University · 2026

Predicting vehicle CO₂ emissions with machine learning, and making the result something a buyer can actually use.

For my MSc in Artificial Intelligence & Data Science at Bangor University, I built and compared machine-learning models that predict a car's CO₂ emissions from its features, then used SHAP to make every prediction explainable. The goal was simple: turn a technical model into clear, evidence-based guidance for people choosing a lower-emission vehicle.

Richard Mensah · School of Computer Science, Bangor University · Supervisor: Dr. Roggers Giddings

6,282 vehicle records

12 input variables

0.9943 best R² (Random Forest)

2.35 MAE (g/km)

Aligned with SDG 13, Climate Action, through consumer-level transparency.

In short

What the study set out to do

Road transport is responsible for roughly 16% of global CO₂ emissions, and light-duty vehicles make up a large share of that. Emissions data is increasingly published, but it is usually too technical for an ordinary buyer to act on. I set out to close that gap: build models accurate enough to be trusted, then make them transparent enough that a non-expert can see why a given car emits what it does. I used a public dataset of 6,282 Canadian vehicles, trained several models, and explained the best one with SHAP so the reasoning is visible, not hidden inside a black box.

The gap I addressed

Most emissions tools are accurate but opaque, or simple but inaccurate. I aimed for both: high accuracy and interpretability that supports a real purchasing decision.

Objectives

  1. Predict vehicle CO₂ emissions from engine size, cylinder count and fuel efficiency.
  2. Compare four ML algorithms: Linear Regression, Random Forest, KNN and SVR.
  3. Interpret model behaviour and identify the key features using SHAP values.
  4. Test whether extended attributes (fuel type, vehicle class, transmission) improve accuracy.

Research questions

  1. How accurately can CO₂ emissions be predicted from engine size, cylinder count and fuel consumption?
  2. Which model gives the best balance of accuracy and interpretability?
  3. What is the relative influence of each vehicle feature on emissions?
  4. Do extra features such as fuel type and vehicle class meaningfully improve performance?

How I built it

Data and methodology

I worked with a public dataset from Natural Resources Canada, cleaned it, engineered features, and trained a spread of models before interpreting the strongest one. The pipeline moves from raw data to a prediction a person can understand.

Step 1

Collect

7,385 records from Natural Resources Canada, reduced to 6,282 after removing 1,103 duplicates.
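
The record counts above can be checked with a short sketch. It assumes a duplicate is a row identical in every column, which this page does not state. The sample rows and the helper name drop_exact_duplicates are made up for illustration and are not taken from the dataset.

```python
def drop_exact_duplicates(rows):
    """Return rows with exact repeats removed, keeping first occurrences in order."""
    # dict.fromkeys preserves insertion order, so the first copy of each row wins.
    return list(dict.fromkeys(rows))


# Hypothetical rows: (make, engine size in L, cylinders, combined L/100 km, CO2 g/km)
sample = [
    ("MakeA", 2.0, 4, 8.5, 196),
    ("MakeB", 3.5, 6, 11.1, 255),
    ("MakeA", 2.0, 4, 8.5, 196),  # exact repeat of the first row
]
deduped = drop_exact_duplicates(sample)

# Counts reported on this page: 7,385 raw records, 1,103 duplicates removed.
raw_records = 7385
duplicates_removed = 1103
remaining = raw_records - duplicates_removed
```

With pandas, the equivalent step would typically be DataFrame.drop_duplicates(), which also keeps the first occurrence by default.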

Step 2

Preprocess

One-hot encoding, feature selection, scaling, and an 80/20 split (5,025 train / 1,257 test).
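
The two mechanical parts of this step can be sketched in plain Python. This is an illustration, not the thesis code: the helper names are hypothetical, the category list is taken from the fuel types reported on this page, and the split sizes assume the test set is rounded up to a whole record, which is the convention scikit-learn's train_test_split follows for a fractional test_size.

```python
import math

# Fuel types as named on this page; the real column encoding is an assumption.
FUEL_TYPES = ["Ethanol (E85)", "Regular gasoline", "Diesel", "Premium gasoline", "Natural gas"]


def one_hot(value, categories):
    """Encode one categorical value as a 0/1 vector, one slot per category."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if value == category else 0 for category in categories]


def split_sizes(n_samples, test_fraction):
    """Return (n_train, n_test), rounding the test set up to a whole record."""
    n_test = math.ceil(n_samples * test_fraction)
    return n_samples - n_test, n_test


diesel_vector = one_hot("Diesel", FUEL_TYPES)
n_train, n_test = split_sizes(6282, 0.2)
```

Under that rounding assumption, 6,282 records split 80/20 give the 5,025 and 1,257 quoted above.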

Step 3

Train

Linear Regression, Random Forest, KNN and SVR, plus Decision Tree, Gradient Boosting and XGBoost.

Step 4

Evaluate

Compared on MAE, RMSE, MAPE and R², with GridSearchCV tuning for Random Forest.
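
For readers unfamiliar with the four metrics, here is a minimal sketch of how each is computed from actual and predicted values. The function name and the three sample values are made up for illustration and are not results from the study. scikit-learn provides equivalents in sklearn.metrics.

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute MAE, RMSE, MAPE (in percent) and R² for paired values."""
    n = len(y_true)
    errors = [pred - true for true, pred in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    # MAPE divides by the true value, so it assumes no true value is zero.
    mape = 100 * sum(abs(e) / abs(t) for e, t in zip(errors, y_true)) / n
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1 - ss_res / ss_tot
    return {"MAE": mae, "RMSE": rmse, "MAPE": mape, "R2": r2}


# Made-up emissions in g/km, purely to exercise the function.
actual = [200.0, 250.0, 300.0]
predicted = [202.0, 247.0, 301.0]
metrics = regression_metrics(actual, predicted)
```

MAE and RMSE are in the same unit as the target (g/km here), MAPE is a percentage, and R² is unitless, which is why the four are reported side by side.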

Step 5

Explain

SHAP values for global and per-vehicle interpretability, so the predictions are readable.
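
SHAP attributes a prediction to its features using Shapley values, which add up to the difference between the prediction and a baseline prediction. The sketch below is not the SHAP library and not the thesis model. It computes exact Shapley values by brute force for a made-up linear toy model, filling in baseline values for features that are "absent". The coefficients, feature values and function names are all hypothetical.

```python
from itertools import permutations


def toy_model(features):
    """Hypothetical linear emissions model in g/km (not the thesis model)."""
    return (
        20.0 * features["fuel_l_per_100km"]
        + 5.0 * features["engine_l"]
        - 10.0 * features["is_e85"]
    )


def shapley_values(model, instance, baseline):
    """Average each feature's marginal contribution over all feature orderings."""
    names = list(instance)
    totals = {name: 0.0 for name in names}
    orderings = list(permutations(names))
    for order in orderings:
        current = dict(baseline)  # start with every feature at its baseline value
        previous = model(current)
        for name in order:
            current[name] = instance[name]  # switch this feature to the car's value
            value = model(current)
            totals[name] += value - previous
            previous = value
    return {name: total / len(orderings) for name, total in totals.items()}


baseline = {"fuel_l_per_100km": 10.0, "engine_l": 3.0, "is_e85": 0}
car = {"fuel_l_per_100km": 12.0, "engine_l": 2.0, "is_e85": 1}
contributions = shapley_values(toy_model, car, baseline)
```

Brute force is only practical for a handful of features, since the number of orderings grows factorially. Libraries such as SHAP use faster algorithms for tree models, but the additive reading of the output is the same: each contribution is how far that feature moved this car's prediction away from the baseline.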

Python 3.10 · pandas · scikit-learn · SHAP · Jupyter · Random Forest · GridSearchCV

Results

Which model predicted best

Random Forest came out on top, with a mean absolute error of just 2.35 g/km and an R² of 0.9943, which indicates near-perfect agreement between predicted and actual emissions. Lower MAE is better.

Random Forest (best): 2.35 g/km · R² 0.9943
Linear Regression: 3.32 g/km · R² 0.9882
KNN: 5.1 g/km · R² 0.9831
SVR: 8.04 g/km · R² 0.8952

Mean absolute error on the extended feature set. XGBoost led the tree-only models at R² 0.9747; SVR was the weakest overall.

Explainability

What actually drives a car's emissions

Using SHAP, I could measure how much each feature pushed a prediction up or down. One story dominates: how much fuel a car burns matters far more than its badge or its gearbox.

Importance by feature group

Mean absolute SHAP value

Usage pattern: 49.7
Fuel type: 6.8
Vehicle class: 1.5
Engine specs: 1.2
Transmission: 0.9

Usage pattern, meaning how efficiently a car burns fuel, accounts for roughly 80 to 85% of the total SHAP importance across the five feature groups.
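
The percentage can be reproduced from the group values listed above. This sketch assumes a group's share is its mean absolute SHAP value divided by the sum over all five groups. The page does not spell out how the figure was derived, so treat that definition as an assumption.

```python
# Mean absolute SHAP value per feature group, as reported on this page.
group_importance = {
    "Usage pattern": 49.7,
    "Fuel type": 6.8,
    "Vehicle class": 1.5,
    "Engine specs": 1.2,
    "Transmission": 0.9,
}

total = sum(group_importance.values())
# Share of the total attributed to each group, in percent.
shares = {group: 100 * value / total for group, value in group_importance.items()}
usage_share = round(shares["Usage pattern"], 1)
```

On that definition, usage pattern comes to about 82.7% of the total, inside the 80 to 85% range quoted above, with fuel type a distant second at roughly 11%.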

How individual features move a prediction

Combined fuel consumption: +47.9 g/km
Ethanol fuel (E85): -5.4 g/km
City fuel consumption: +1.35 g/km
Highway fuel consumption: +0.5 g/km
Engine size: +0.46 g/km
Cylinder count: +0.24 g/km

Ethanol (E85) is the one major feature that lowers predicted emissions. Everything else above pushes them up.

The patterns

Where the emissions concentrate

Dataset mean: 251.16 g/km. Values below are above or below that average.

Average CO₂ by fuel type (g/km)

Ethanol (E85): 276
Regular gasoline: 266
Diesel: 235
Premium gasoline: 235
Natural gas: 213

Average CO₂ by vehicle class (g/km)

Luxury cars: 298.4
Sports cars: 285
Executive cars: 257.5
Family cars: 234.3

Highest emitters

Bugatti 522 g/km, Lamborghini 402 g/km, SRT 389 g/km. That is roughly 1.5 to 2.1 times the dataset mean of 251.16 g/km.

Lowest emitters

Hyundai Ioniq Electric and Kia Soul EV came in under 100 g/km.

Cylinders

4-cylinder cars sit at 100 to 250 g/km; 8+ cylinders routinely pass 350; a 16-cylinder engine topped 500.

What this means if you are choosing a car

Instead of trusting a sticker or a marketing claim, you can see what really moves a car's emissions. Combined fuel consumption is by far the strongest signal, followed by fuel type. A smaller, more fuel-efficient engine lowers the footprint far more than brand or transmission ever will. Ethanol (E85) also lowers the predicted value in the SHAP analysis, although E85 vehicles have the highest average CO₂ of any fuel type in this dataset (276 g/km), so fuel type is best read alongside fuel consumption rather than on its own. My contribution was not only an accurate model, but a transparent one a non-expert can actually read and act on.

Sustainability & ethics

Built for SDG 13, Climate Action

The work supports SDG 13 by making vehicle emissions transparent at the point of choice. I used only public, anonymised data, and analysed SHAP values across drivetrain and fuel categories to check for and reduce bias, so the guidance stays fair as well as accurate.

Public data

Anonymised, openly published Canadian records.

Fairness

SHAP checked across fuel and drivetrain groups.

Transparency

Every prediction is explainable, not a black box.

Impact

Lower-emission choices made easier for buyers.

Limitations

  • The data is Canadian only, so it may not generalise to other fuel mixes or regulations.
  • Interaction effects were captured only implicitly by the tree ensembles and were not modelled explicitly.
  • Real-world factors like vehicle age, maintenance and driving behaviour were not captured.

Future work

  • Add real-time telemetry or IoT sensor data for greater realism.
  • Expand the dataset across regions and climates for broader generalisation.
  • Combine deep learning with interpretable models to balance power and transparency.
  • Extend to lifecycle CO₂, including manufacturing and disposal.

Supervision & acknowledgement

This research was completed under the supervision of Dr. Roggers Giddings at the School of Computer Science, Bangor University, whose guidance shaped the work throughout.

The full paper is available on request. A downloadable copy will be added here soon.