XGBoost vs SARIMA for PM10 Forecasting: Rolling Validation Results
What Happened
Researchers have published a comparison of XGBoost and SARIMA (Seasonal AutoRegressive Integrated Moving Average) models for forecasting PM10 air quality levels, introducing a methodological shift in how time series forecasting models are evaluated. The study, available on arXiv, employs rolling-origin validation in place of a single static chronological split, revealing differences in model performance that a one-off holdout test can hide.
The research challenges a fundamental assumption in time series forecasting evaluation: that models tested on a single held-out period will perform similarly in continuous operation. By implementing rolling-origin validation—where the model is retrained and tested multiple times across different time windows—the study provides a more robust assessment of real-world performance characteristics.
Why This Matters for Time Series Model Development
This research addresses a critical gap in machine learning model evaluation that has practical implications for any developer working with time series forecasting. Traditional evaluation methods often use a single train-test split based on chronological ordering, which can lead to overly optimistic performance estimates and poor model selection decisions.
The distinction between XGBoost and SARIMA represents a broader choice between modern machine learning approaches and classical statistical methods. XGBoost, a gradient boosting framework, excels at capturing complex non-linear relationships in data but can be prone to overfitting on temporal patterns. SARIMA, by contrast, is designed specifically for time series data, with explicit modeling of seasonal patterns, trends, and autocorrelation.
For air quality forecasting specifically, this comparison is particularly relevant because PM10 concentrations exhibit complex seasonal patterns, meteorological dependencies, and non-linear relationships that challenge both approaches. The rolling-origin validation methodology helps expose which model maintains consistent performance as conditions change over time.
Technical Deep Dive: Rolling-Origin Validation
Rolling-origin validation, also known as walk-forward validation, addresses temporal data leakage issues that plague traditional cross-validation techniques. Instead of randomly splitting data, this method respects the temporal ordering and simulates real-world model deployment scenarios.
The process works by defining an initial training window, making predictions for the next time period, then rolling the origin forward so the just-evaluated period joins the training set. This creates multiple evaluation points across the entire dataset, providing a more comprehensive view of model stability and performance degradation over time.
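The expanding-window procedure described above can be sketched as a minimal index generator. This is a plain-Python illustration, not code from the study; the function name and parameters are our own:

```python
def rolling_origin_splits(n_obs, initial_train, horizon=1, step=1):
    """Yield (train_indices, test_indices) pairs for expanding-window
    rolling-origin validation over a series of n_obs observations."""
    origin = initial_train
    while origin + horizon <= n_obs:
        train = list(range(origin))                      # everything before the origin
        test = list(range(origin, origin + horizon))     # the next forecast horizon
        yield train, test
        origin += step                                   # roll the origin forward

# Example: 10 observations, 6 initial training points, 1-step-ahead forecasts.
splits = list(rolling_origin_splits(10, initial_train=6))
# First fold trains on indices 0-5 and tests on index 6;
# each later fold absorbs the previous test point into training.
```

Each yielded pair is one retrain-and-evaluate cycle, so the number of folds (here four) is the number of evaluation points mentioned above.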
For XGBoost in time series contexts, this validation approach is particularly revealing because gradient boosting models can memorize specific temporal patterns that don't generalize well. The rolling validation exposes whether the model's feature engineering and hyperparameter choices remain effective as new data arrives.
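As one illustration of the feature engineering in question, a common approach is to recast the series as a supervised-learning table of lagged values before fitting a tree model. The specific lag choices below are hypothetical, not taken from the paper:

```python
import pandas as pd

def make_lag_features(series, lags=(1, 2, 24)):
    """Build a supervised-learning frame from a univariate series:
    each row holds lagged values as features and the current value
    as the prediction target. Rows without full lag history are dropped."""
    df = pd.DataFrame({"y": series})
    for lag in lags:
        df[f"lag_{lag}"] = df["y"].shift(lag)
    return df.dropna()

# For hourly PM10, lag_1/lag_2 capture short-term persistence and
# lag_24 a daily cycle; these would feed an XGBoost regressor as X, y.
frame = make_lag_features(list(range(30)))
```

Under rolling validation, these features are recomputed inside each fold from training data only, which is exactly where leakage-prone engineering gets exposed.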
SARIMA models, being explicitly designed for temporal data, might show more consistent performance across rolling windows due to their built-in assumptions about stationarity and seasonal decomposition. However, they may struggle with sudden regime changes or non-linear relationships that XGBoost could potentially capture.
Implementation Considerations
When implementing rolling-origin validation for your own forecasting projects, several technical details become critical. The choice of initial training window size affects model stability: too small, and early predictions suffer from insufficient data; too large, and you reduce the number of evaluation points. The step size for rolling forward trades computational cost against the granularity of the performance assessment.
For production deployments, this validation methodology helps estimate model drift and retraining frequency. If performance degrades significantly across rolling windows, it suggests the need for more frequent model updates or different feature engineering approaches.
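One simple way to operationalize that drift check, assuming you already have a per-fold error series from a rolling-validation run (the window size and tolerance below are arbitrary choices, not values from the study):

```python
import statistics

def flag_drift(fold_errors, window=5, tolerance=1.5):
    """Flag drift when the mean error of the most recent `window` folds
    exceeds `tolerance` times the mean error of all earlier folds."""
    if len(fold_errors) <= window:
        return False  # not enough history to compare against
    recent = statistics.mean(fold_errors[-window:])
    baseline = statistics.mean(fold_errors[:-window])
    return recent > tolerance * baseline

# Hypothetical per-fold RMSEs: stable early folds, then degradation.
errors = [4.0, 4.2, 3.9, 4.1, 4.0, 4.1, 6.8, 7.0, 7.2, 6.9, 7.1]
flag_drift(errors)  # recent folds average ~7 vs. a ~4 baseline -> True
```

A flag like this can trigger retraining or a review of the feature pipeline before degraded forecasts reach users.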
Practical Implications for Air Quality Systems
Air quality forecasting systems face unique challenges that make this research particularly relevant. PM10 concentrations are influenced by meteorological conditions, seasonal patterns, industrial activity, and transportation patterns—creating a complex forecasting environment where model robustness is crucial.
The choice between XGBoost and SARIMA for operational air quality systems depends heavily on infrastructure constraints and accuracy requirements. XGBoost typically requires more computational resources and careful feature engineering but can potentially capture complex interactions between meteorological variables and pollution sources. SARIMA models are more interpretable and computationally efficient but may miss important non-linear relationships.
For developers building air quality monitoring platforms, this research suggests that model selection decisions should be based on rolling validation rather than single-period holdout testing. This is particularly important when models inform public health decisions or regulatory compliance monitoring.
Broader Implications for Machine Learning Model Evaluation
This study's methodology has applications far beyond air quality forecasting. Any time series prediction task—from financial forecasting to demand planning to IoT sensor monitoring—can benefit from more rigorous evaluation approaches that better simulate production conditions.
The research highlights a broader issue in machine learning evaluation: the gap between academic benchmarks and real-world performance. While static test sets provide convenient evaluation metrics, they often fail to capture the dynamic challenges of production deployment, including data drift, concept drift, and temporal distribution shifts.
For teams developing time series forecasting systems, incorporating rolling-origin validation into your evaluation pipeline provides several benefits: better model selection decisions, more accurate performance estimates, and insights into model stability over time. This approach can inform decisions about model retraining schedules and performance monitoring thresholds.
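A sketch of how fold-level results might feed those model selection decisions. The error values are invented purely for illustration:

```python
import statistics

def summarize_folds(fold_errors):
    """Aggregate per-fold errors into mean and spread for model comparison."""
    return {
        "mean": statistics.mean(fold_errors),
        "stdev": statistics.stdev(fold_errors),
    }

# Hypothetical per-fold RMSEs from rolling-origin runs of two candidates.
model_a = [5.1, 5.3, 4.9, 8.2, 5.0]   # better on average, one unstable fold
model_b = [5.9, 6.0, 5.8, 6.1, 5.9]   # worse on average, far more stable
```

A single holdout period landing on one of model A's good folds would favor it outright; the fold-level spread is what surfaces the stability trade-off and can also set monitoring thresholds.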
Looking Ahead
This research represents a growing trend toward more realistic machine learning evaluation methodologies. As AI systems increasingly operate in dynamic, real-world environments, evaluation techniques must evolve to better predict actual deployment performance.
The air quality forecasting domain will likely see continued evolution toward hybrid approaches that combine the interpretability of classical methods like SARIMA with the flexibility of modern machine learning techniques. Ensemble methods that leverage both approaches' strengths while mitigating their individual weaknesses may emerge as preferred solutions.
For practitioners, the key takeaway is methodological: robust evaluation techniques are as important as sophisticated models. Rolling-origin validation should become standard practice for time series forecasting projects, particularly those deployed in production environments where performance consistency matters.
The research also points toward the need for standardized evaluation frameworks in time series forecasting that better reflect real-world constraints. As the field matures, we can expect more rigorous benchmarks that account for temporal dependencies and operational realities.