Model Evaluation

Toyota Aygo Price Prediction: Random Forest

DriveDutch
July 21, 2025

Data Sources

Marktplaats.nl Toyota Aygo listings (July 2025), 307 cars in total. Random Forest hyperparameters were tuned with nested CV: 30 ShuffleSplits on the 80% outer‑train set (n=245), followed by a single final evaluation on the 20% outer‑test set (n=62). Features: age, mileage_km.


Executive Summary

We modeled Toyota Aygo listing prices (€) from age and mileage using a Random Forest with nested cross‑validation. After tuning on the training data only, the final model achieved R² = 0.9464 (adjusted 0.9445) on the untouched outer‑test set.

🔗 Source: Active listings on Marktplaats.nl (July 2025).

Data & Splits

  • Total sample: 307 cars
  • Outer‑train: 245 (80%)
  • Outer‑test: 62 (20%, never used for tuning)
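The split counts above can be reproduced with a few lines of arithmetic. This is a minimal sketch that assumes the hold-out size is rounded up (the convention scikit-learn style splitters use); the helper name `holdout_sizes` is hypothetical, and the inner-split sizes are derived here rather than stated in the report.

```python
import math


def holdout_sizes(n_samples: int, test_fraction: float) -> tuple[int, int]:
    """Return (n_train, n_test), rounding the held-out count up (assumed convention)."""
    n_test = math.ceil(test_fraction * n_samples)
    return n_samples - n_test, n_test


# Outer split of the 307 listings.
outer_train, outer_test = holdout_sizes(307, 0.20)
print(outer_train, outer_test)  # 245 62

# Each inner ShuffleSplit then divides the 245 outer-train cars again.
inner_train, inner_val = holdout_sizes(outer_train, 0.20)
print(inner_train, inner_val)  # 196 49
```

Under that assumption, each of the 30 inner splits would fit on 196 cars and validate on 49.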

Method (Nested CV)

  1. Outer split: Hold out 20% as the final test set.
  2. Inner model selection (on outer‑train only): For each hyperparameter candidate, run 30× ShuffleSplit (each split: 80% inner‑train / 20% inner‑val). Fit on inner‑train, score on inner‑val, then aggregate mean ± std (also record min/max).
  3. Pick best params: Choose the candidate with the highest mean inner‑CV R².
  4. Final fit: Refit a new Random Forest with the winning params on all outer‑train, then evaluate once on outer‑test.
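The four steps can be sketched with scikit-learn as follows. This is an illustration under stated assumptions, not the code behind this report: the data are synthetic stand-ins (not the Marktplaats listings), the candidate grid is a small hypothetical one, and the forest is kept at 50 trees so the example runs quickly.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import (
    ParameterGrid, ShuffleSplit, cross_val_score, train_test_split,
)

# Synthetic stand-in data (assumption): price falls with age and mileage.
rng = np.random.default_rng(0)
n = 307
age = rng.uniform(0, 18, n)
mileage_km = np.clip(age * 12_000 + rng.normal(0, 15_000, n), 0, None)
price = 14_000 * np.exp(-0.10 * age) - 0.005 * mileage_km + rng.normal(0, 300, n)
X = np.column_stack([age, mileage_km])

# 1. Outer split: hold out 20% as the final test set.
X_tr, X_te, y_tr, y_te = train_test_split(X, price, test_size=0.20, random_state=0)

# 2. Inner model selection on outer-train only: 30x ShuffleSplit per candidate.
inner_cv = ShuffleSplit(n_splits=30, test_size=0.20, random_state=0)
grid = ParameterGrid({"max_depth": [4, 8], "min_samples_leaf": [2, 5]})  # hypothetical grid
results = []
for params in grid:
    model = RandomForestRegressor(
        n_estimators=50, max_features="sqrt", random_state=0, **params
    )
    scores = cross_val_score(model, X_tr, y_tr, cv=inner_cv, scoring="r2")
    results.append({
        "params": params,
        "mean": scores.mean(), "std": scores.std(),
        "min": scores.min(), "max": scores.max(),
    })

# 3. Pick the candidate with the highest mean inner-CV R².
best = max(results, key=lambda r: r["mean"])

# 4. Refit a fresh forest on all of outer-train, then score once on outer-test.
final_model = RandomForestRegressor(
    n_estimators=50, max_features="sqrt", random_state=0, **best["params"]
)
final_model.fit(X_tr, y_tr)
test_r2 = r2_score(y_te, final_model.predict(X_te))
print(best["params"], round(best["mean"], 4), round(test_r2, 4))
```

The key design point is that the outer-test rows never enter the loop in step 2, so the final score is not influenced by hyperparameter selection.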

Inner‑CV Results (30 splits on outer‑train)

[Figure: Random Forest inner‑CV results]

Winner: {n_estimators: 400, max_depth: 8, min_samples_leaf: 2, max_features: 'sqrt'}
Mean inner‑CV R²: 0.9177 ± 0.0135

Final Model & Test Performance

  • Refit: Best params on outer‑train (n=245).
  • Feature importances: age 0.5632, mileage_km 0.4368.
  • Outer‑test (n=62): R² = 0.9464, adjusted R² = 0.9445.
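The adjusted R² can be checked from the reported R², the outer-test size, and the number of features. The sketch below assumes the standard adjustment formula with n = 62 and p = 2; the function name `adjusted_r2` is illustrative.

```python
def adjusted_r2(r2: float, n_samples: int, n_features: int) -> float:
    """Standard adjusted R²: penalizes R² for the number of predictors."""
    return 1.0 - (1.0 - r2) * (n_samples - 1) / (n_samples - n_features - 1)


# Reported outer-test R² (rounded to four decimals), n = 62 cars, p = 2 features.
adj = adjusted_r2(0.9464, n_samples=62, n_features=2)
print(f"{adj:.5f}")
```

Starting from the rounded R², the formula gives roughly 0.9446. The small gap to the reported 0.9445 is consistent with the report having been computed from an unrounded R².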

What the Plot Shows

The Predicted vs Actual scatter (outer‑test) clusters tightly around the dashed 45° line, with slight widening at the highest prices. This pattern indicates little systematic over‑ or under‑prediction and strong generalization using just age and mileage.

Takeaways

  • A moderately deep forest (max_depth 8) with sqrt feature sampling gave the highest mean inner‑CV R² among the candidates tried.
  • Age carries somewhat more importance than mileage (0.56 vs 0.44), but both contribute substantially.
  • With about 95% of price variance explained on the outer‑test set (R² = 0.9464), this approach is reliable within the observed ranges of age and mileage; predictions outside those ranges are untested.
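One way to act on the "observed ranges" caveat is to check a car against the training ranges before trusting a prediction. This is a minimal sketch: the range values below are hypothetical placeholders (the report does not list them), and the names `FeatureRange` and `in_observed_range` are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FeatureRange:
    low: float
    high: float

    def contains(self, value: float) -> bool:
        return self.low <= value <= self.high


def in_observed_range(car: dict, ranges: dict) -> bool:
    """True only if every feature of the car lies inside its training range."""
    return all(ranges[name].contains(car[name]) for name in ranges)


# Hypothetical ranges; in practice take the min/max of each feature in outer-train.
observed = {
    "age": FeatureRange(0.0, 18.0),
    "mileage_km": FeatureRange(5_000.0, 250_000.0),
}

print(in_observed_range({"age": 6, "mileage_km": 80_000}, observed))   # True
print(in_observed_range({"age": 25, "mileage_km": 80_000}, observed))  # False
```

Such a guard matters for tree ensembles in particular, because their predictions are averages of training prices and therefore cannot extend beyond the price levels seen in training.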

Explore another approach on the same data (polynomial regression):
Polynomial Regression

About This Research

This report is part of our ongoing analysis of the Dutch automotive market. Our research combines multiple data sources to provide comprehensive insights for industry professionals and market participants.

Category: Model Evaluation