Toyota Aygo Price Prediction: Random Forest
Data Sources
Marktplaats.nl Toyota Aygo listings (July 2025). Nested CV with 30 ShuffleSplits for Random Forest tuning on the 80% outer‑train (n=245); final evaluation on 20% outer‑test (n=62). Features: age, mileage_km.

Executive Summary
We modeled Toyota Aygo listing prices (€) from age and mileage using a Random Forest with nested cross‑validation. After tuning on the training data only, the final model achieved R² = 0.9464 (adjusted 0.9445) on the untouched outer‑test set.
🔗 Source: Active listings on Marktplaats.nl (July 2025).
Data & Splits
- Total sample: 307 cars
- Outer‑train: 245 (80%)
- Outer‑test: 62 (20%, never used for tuning)
Method (Nested CV)
- Outer split: Hold out 20% as the final test set.
- Inner model selection (on outer‑train only): For each hyperparameter candidate, run 30× ShuffleSplit (each split: 80% inner‑train / 20% inner‑val). Fit on inner‑train, score R² on inner‑val, then aggregate mean ± std (also record min/max).
- Pick best params: Choose the candidate with the highest mean inner‑CV R².
- Final fit: Refit a new Random Forest with the winning params on all outer‑train, then evaluate once on outer‑test.
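The four steps above can be sketched with scikit-learn as follows. This is a minimal illustration, not the original analysis code: the listing data is not included here, so `make_synthetic_listings` is a hypothetical stand-in that generates made-up prices, and `select_and_evaluate`, the candidate list, and the seeds are likewise assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import ShuffleSplit, cross_val_score, train_test_split


def make_synthetic_listings(n=307, seed=0):
    """Hypothetical stand-in data: price falls with age and mileage (not real listings)."""
    rng = np.random.default_rng(seed)
    age = rng.uniform(0, 18, n)  # years
    mileage_km = np.clip(age * 12_000 + rng.normal(0, 15_000, n), 0, None)
    price = 14_000 * np.exp(-0.09 * age) - 0.008 * mileage_km + rng.normal(0, 400, n)
    X = np.column_stack([age, mileage_km])
    return X, np.clip(price, 500, None)


def select_and_evaluate(X, y, candidates, n_splits=30, seed=0):
    """Tune on outer-train with repeated ShuffleSplit, then score once on outer-test."""
    # Outer split: 20% is held out and never used for tuning.
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=seed)

    # Inner splits: repeated 80/20 splits of outer-train, identical for every candidate.
    inner = ShuffleSplit(n_splits=n_splits, test_size=0.2, random_state=seed)
    results = []
    for params in candidates:
        model = RandomForestRegressor(random_state=seed, **params)
        scores = cross_val_score(model, X_tr, y_tr, cv=inner, scoring="r2")
        results.append({
            "params": params,
            "mean": scores.mean(), "std": scores.std(),
            "min": scores.min(), "max": scores.max(),
        })

    # Pick the candidate with the highest mean inner-CV R².
    best = max(results, key=lambda r: r["mean"])

    # Final fit on all of outer-train, single evaluation on outer-test.
    final = RandomForestRegressor(random_state=seed, **best["params"]).fit(X_tr, y_tr)
    test_r2 = r2_score(y_te, final.predict(X_te))
    return best, final, test_r2
```

A call might look like `best, final, test_r2 = select_and_evaluate(X, y, candidates)`, where `candidates` is a list of parameter dicts such as `{"n_estimators": 400, "max_depth": 8, "min_samples_leaf": 2, "max_features": "sqrt"}`. With 307 rows and `test_size=0.2`, `train_test_split` yields the same 245/62 split sizes reported here. Fixing `random_state` on the `ShuffleSplit` means every candidate is scored on the same 30 splits, which keeps the comparison fair.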
Inner‑CV Results (30 splits on outer‑train)
Winner: {n_estimators: 400, max_depth: 8, min_samples_leaf: 2, max_features: 'sqrt'}
Mean inner‑CV R²: 0.9177 ± 0.0135
Final Model & Test Performance
- Refit: Best params on outer‑train (n=245).
- Feature importances: age 0.5632, mileage_km 0.4368.
- Outer‑test (n=62): R² = 0.9464, adjusted R² = 0.9445.
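The adjusted figure can be checked against the standard formula, adjusted R² = 1 - (1 - R²)(n - 1)/(n - p - 1), with n = 62 test cars and p = 2 features. The sketch below assumes this standard definition was used; the function name `adjusted_r2` is illustrative.

```python
def adjusted_r2(r2: float, n_samples: int, n_features: int) -> float:
    """Adjusted R² = 1 - (1 - R²) * (n - 1) / (n - p - 1)."""
    dof = n_samples - n_features - 1
    if dof <= 0:
        raise ValueError("need n_samples > n_features + 1")
    return 1.0 - (1.0 - r2) * (n_samples - 1) / dof


# Reported outer-test values: R² = 0.9464, n = 62, p = 2 (age, mileage_km).
adj = adjusted_r2(0.9464, n_samples=62, n_features=2)
print(round(adj, 4))  # 0.9446
```

Starting from the rounded R² of 0.9464 this gives 0.9446, which agrees with the reported 0.9445 up to rounding of the input. With only two features and 62 test cars, the adjustment is small.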
What the Plot Shows
The Predicted vs Actual scatter (outer‑test) clusters tightly around the dashed 45° line, with slight widening at the highest prices—evidence of good calibration and strong generalization using just age and mileage.
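The visual impression of tight clustering with some widening at high prices can also be checked numerically. The sketch below groups outer‑test cars into equal‑count price bins and reports the mean error (a systematic offset from the 45° line) and the RMSE (the spread around it) per bin. The helper name `residual_spread_by_price_bin` and the choice of three bins are assumptions made for illustration.

```python
import numpy as np


def residual_spread_by_price_bin(y_true, y_pred, n_bins=3):
    """Summarise prediction errors within quantile bins of the actual price."""
    y_true = np.asarray(y_true, dtype=float)
    residuals = np.asarray(y_pred, dtype=float) - y_true

    # Quantile edges give bins with roughly equal numbers of cars.
    edges = np.quantile(y_true, np.linspace(0, 1, n_bins + 1))
    # Use interior edges only, so every car lands in a bin 0..n_bins-1.
    bin_idx = np.digitize(y_true, edges[1:-1], right=True)

    rows = []
    for b in range(n_bins):
        mask = bin_idx == b
        if not mask.any():
            continue  # can happen when many prices are tied
        r = residuals[mask]
        rows.append({
            "bin": b,
            "n": int(mask.sum()),
            "price_low": float(y_true[mask].min()),
            "price_high": float(y_true[mask].max()),
            "mean_error": float(r.mean()),            # bias relative to the 45° line
            "rmse": float(np.sqrt(np.mean(r ** 2))),  # spread around the 45° line
        })
    return rows
```

A mean error near zero in every bin supports the calibration reading, while an RMSE that grows in the top bin corresponds to the widening seen at the highest prices.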
Takeaways
- A moderately deep forest (max_depth 8) with sqrt feature sampling offers the best bias‑variance trade‑off among the candidates tested.
- Age is slightly more influential than mileage (importance 0.5632 vs 0.4368), but both are important.
- With about 95% of price variance explained on unseen data (R² = 0.9464), this approach is reliable within the observed ranges of age and mileage; it should not be trusted for cars outside those ranges.
Explore another approach on the same data: Polynomial Regression