## Overview
A comprehensive, production-grade sales forecasting solution developed during my internship at Decathlon Belgium. This project industrializes the prediction of 8 key sales KPIs (GMV and items sold) segmented by channel (InStore/OutStore, 1P/3P) for all 64 sports departments.
Replaced manual forecasting processes with an automated, scalable ML solution, achieving a 15% improvement in forecast accuracy (MAPE) through rigorous model comparison and advanced feature engineering.
## Technical Architecture

### ML & Data Processing
- Prophet - Facebook's time-series forecasting library for multi-output predictions (selected after benchmarking)
- Model Comparison - Rigorous evaluation of Prophet, XGBoost, LightGBM, Chronos-Bolt to select best approach
- Apache Spark / PySpark - Distributed data processing at scale
- Delta Lake - ACID transactions and versioned data storage
- Pandas / NumPy - Data manipulation and numerical computing
- scikit-learn - Feature engineering and preprocessing
- joblib - Parallel processing with threading backend for concurrent model training
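Prophet itself fits one target series at a time, so the "multi-output" per-sport model is presumably a bundle of per-KPI fits registered together. A minimal sketch of that wrapper, with the model factory injected (the class, column names, and the stub `MeanModel` are illustrative, not the pipeline's actual code):

```python
import pandas as pd

class MultiOutputForecaster:
    """Fit one univariate model per KPI and forecast all KPIs together."""

    def __init__(self, model_factory, targets):
        self.model_factory = model_factory   # e.g. lambda: Prophet()
        self.targets = targets               # the KPI column names
        self.models = {}

    def fit(self, df: pd.DataFrame):
        for t in self.targets:
            m = self.model_factory()
            # Prophet expects columns "ds" (date) and "y" (target)
            m.fit(df[["ds", t]].rename(columns={t: "y"}))
            self.models[t] = m
        return self

    def predict(self, future: pd.DataFrame) -> dict:
        return {t: m.predict(future) for t, m in self.models.items()}

# Demo with a stub model so the sketch runs without Prophet installed
class MeanModel:
    def fit(self, df):
        self.mean = df["y"].mean()
    def predict(self, future):
        return [self.mean] * len(future)

df = pd.DataFrame({"ds": pd.date_range("2024-01-01", periods=4),
                   "gmv_instore": [10.0, 20.0, 30.0, 40.0],
                   "items_instore": [1.0, 2.0, 3.0, 4.0]})
fc = MultiOutputForecaster(MeanModel, ["gmv_instore", "items_instore"]).fit(df)
preds = fc.predict(pd.DataFrame({"ds": pd.date_range("2024-01-05", periods=2)}))
```

Injecting the factory keeps the wrapper agnostic to the underlying library, which is also what made swapping candidates (Prophet, XGBoost, LightGBM) cheap during benchmarking.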
### MLOps Platform
- Databricks - Unified analytics platform for notebooks, compute, and orchestration
- Apache Airflow - Workflow orchestration for the entire data pipeline (ingestion to prediction delivery)
- MLflow - Complete ML lifecycle management:
  - Experiment tracking with metrics and parameters
  - Model Registry with versioning (`prophet-sport-<sport_code>`)
  - Model serving and deployment
- Databricks Bundles - Infrastructure-as-code for Databricks deployments
### Cloud Infrastructure
- AWS S3 - Data lake storage for inputs, intermediate outputs, and final predictions
- Amazon SageMaker - ML model training and experimentation environment
- Amazon Bedrock - Foundation models exploration
- AWS SSO - Identity and access management
### Data Pipeline
| Stage | Description |
|---|---|
| Feature Engineering | Parallel sub-jobs: SalesFeaturesJob (lag features, rolling averages), WeatherJob (temperature, precipitation), HolidaysJob (Belgian holidays, school vacations) |
| Model Training | One Prophet model per sport with multi-output forecasting, MLflow registration |
| Prediction | Latest model retrieval, forecast generation, YoY progression analysis |
| Export | Google Sheets integration with department-specific tabs |
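The lag and rolling-average features built by SalesFeaturesJob can be sketched with pandas (column names and the 365-day YoY lag are illustrative assumptions, not the pipeline's actual schema):

```python
import pandas as pd

def add_sales_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add a YoY lag and a 28-day rolling-average feature per sport.

    Assumes one row per (sport, date) at daily granularity.
    """
    df = df.sort_values(["sport", "date"]).copy()
    grouped = df.groupby("sport")["gmv"]
    # Same day last year (365-day lag) as a YoY reference
    df["gmv_lag_365"] = grouped.shift(365)
    # 28-day rolling average, shifted by 1 so the current day never leaks
    df["gmv_roll_28"] = grouped.transform(
        lambda s: s.shift(1).rolling(28, min_periods=1).mean()
    )
    return df

# Tiny illustrative frame: 400 days of one sport's GMV
dates = pd.date_range("2023-01-01", periods=400, freq="D")
df = pd.DataFrame({"sport": "running", "date": dates, "gmv": range(400)})
feats = add_sales_features(df)
```

Shifting before rolling is the important detail: it keeps the current day's sales out of its own feature, avoiding leakage at training time.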
### CI/CD & Quality
- GitHub Actions - Multiple workflows:
  - `checkers.yml` - Code quality checks
  - `sonarcloud.yml` - Security and coverage reports
  - `sphinxdocs.yml` - Auto-generated documentation
  - `publication-preprod.yml` / `publication-prod.yml` - Deployment pipelines
- SonarCloud - Code quality dashboards and vulnerability scanning
- Pre-commit hooks - Automated checks before commits
### Code Quality Tools
- Ruff - Fast Python linting and formatting
- Mypy - Static type checking
- Bandit - Security vulnerability scanner
- Sphinx - API documentation hosted on GitHub Pages
### Development Tools
- uv - Modern Python package manager
- Makefile - Task automation (install, format, test, deploy)
- VS Code Workspace - Configured IDE settings
- Cookiecutter / Cruft - Template-based project structure
### Project Management
- Confluence - Technical documentation and specs
- Jira - Sprint planning and task tracking
## Pipeline Architecture

### 1. Feature Engineering (Parallel Execution)

Three sub-jobs run concurrently:
- SalesFeaturesJob: Temporal attributes, YoY lag features, 28-day rolling averages
- WeatherJob: Historical and forecast weather alignment
- HolidaysJob: Belgian public holidays and school vacation features
### 2. Model Training
- Iterates through configured sports_to_train list
- Trains multi-output Prophet model per sport
- Registers in MLflow Model Registry with unique naming
- Logs transformation parameters for prediction context
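The per-sport registry naming and the parameter payload logged with each model can be sketched as small helpers. Only the `prophet-sport-<sport_code>` pattern comes from the pipeline; the function names and parameter keys are illustrative:

```python
def registry_name(sport_code: str) -> str:
    """Build the MLflow Model Registry name for one sport's Prophet model."""
    return f"prophet-sport-{sport_code}"

def training_params(sport_code: str, targets: list[str]) -> dict:
    """Collect the transformation parameters logged alongside the model
    so the prediction job can reproduce the same preprocessing."""
    return {
        "sport_code": sport_code,
        "targets": ",".join(targets),  # the KPIs forecast for this sport
        "log_transform": True,         # illustrative preprocessing flag
    }

name = registry_name("0071")
```

Logging the preprocessing parameters next to the model is what lets the prediction job stay stateless: it reads everything it needs from the registry entry.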
### 3. Prediction & Export
- Retrieves latest model version from registry
- Generates forecasts for specified date range
- Calculates YoY progression metrics
- Exports to Google Sheets with department-specific tabs
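The YoY progression metric can be sketched as a plain percentage-change formula (the actual pipeline's definition may differ, e.g. in how it handles missing baselines):

```python
def yoy_progression(forecast: float, same_period_last_year: float) -> float:
    """Year-over-year progression as a percentage.

    Relative change of the forecast vs. the same period one year
    earlier; raises if last year's value is zero.
    """
    if same_period_last_year == 0:
        raise ValueError("no YoY baseline: last year's value is zero")
    return (forecast - same_period_last_year) / same_period_last_year * 100.0

growth = yoy_progression(forecast=120_000.0, same_period_last_year=100_000.0)  # 20.0
```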
## Key Features
### Parallel Processing

Uses the joblib threading backend (`n_jobs=-1`) for concurrent sport processing, ideal for I/O-bound MLflow operations.

### Resilient Execution

Errors in individual sports don't halt the pipeline; the remaining sports continue processing.

### Environment Support
- Local: File-based I/O with SQLite MLflow backend
- Dev: S3 data lake with Databricks MLflow tracking
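The combination of threading-backend parallelism and per-sport failure isolation can be sketched with joblib (the `train_sport` body is a stand-in for the real train-and-register job):

```python
from joblib import Parallel, delayed

def train_sport(sport_code: str) -> dict:
    """Stand-in for one sport's train-and-register job."""
    if sport_code == "bad":  # simulate a failing sport
        raise RuntimeError("training failed")
    return {"sport": sport_code, "status": "ok"}

def safe_train(sport_code: str) -> dict:
    """Wrap one sport so its failure is recorded, not propagated."""
    try:
        return train_sport(sport_code)
    except Exception as exc:
        return {"sport": sport_code, "status": "error", "reason": str(exc)}

# Threading backend: cheap threads, fine for I/O-bound MLflow calls
results = Parallel(n_jobs=-1, backend="threading")(
    delayed(safe_train)(s) for s in ["running", "bad", "cycling"]
)
```

Catching inside the worker (rather than around the `Parallel` call) is what keeps one failing sport from cancelling the whole batch; the error lands in the results list where it can be reported afterwards.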
## Technologies Summary
| Category | Technologies |
|---|---|
| ML | Prophet, XGBoost, LightGBM, Chronos-Bolt, scikit-learn, joblib |
| Data | PySpark, Delta Lake, Pandas, NumPy |
| MLOps | Databricks, MLflow, Apache Airflow, Databricks Bundles |
| Cloud | AWS S3, SageMaker, Bedrock |
| CI/CD | GitHub Actions, SonarCloud |
| Quality | Ruff, Mypy, Bandit, Sphinx |
| Tools | uv, Makefile, Cookiecutter |
| Integrations | Google Sheets API |
## Results
- 15% improvement in forecast accuracy (MAPE) vs. the previous manual process
- Predicts sales for 64 sports categories
- Forecasts 8 KPIs per sport (GMV, items by channel)
- Replaced manual forecasting with automated, scalable ML pipeline
- Processes multi-source data (sales, weather, holidays)
- Runs on scheduled production jobs with automated retraining
- Exports to Google Sheets for business stakeholders
## Bonus: Jupiler Pro League Ball Launch
During this internship, I also contributed to the official launch of the new Jupiler Pro League ball: I performed a freestyle football demonstration and appeared in the official presentation video alongside Antoine Griezmann, showcasing the synergy between employee passions and brand projects at Decathlon.
