Overview
Desertflow Global Bundle is a noncommercial research project studying how vegetation responds to climate, soil, fire, and human-driven stressors across the world's terrestrial ecoregions. The core question: can per-region machine-learning specialists, combined with a principled cross-region router, beat traditional global vegetation models — and do so without secretly training on the answer?
The system fetches per-cell, per-year aggregates for 133 World Wildlife Fund (WWF) terrestrial ecoregions across 22 years (2002–2024), trains a CatBoost regression specialist for each region in a 89-region training split, then combines those specialists at prediction time via a per-cell Bayesian-ridge router. Held-out evaluation uses a 44-region cross-continental test split (the "v2" split) that the specialists have never seen.
The framework, all driver modules, the validation harness, and trained model artifacts are being prepared for open-source release alongside an EarthArXiv preprint.
Why this matters
Desertification and land degradation affect over 3.2 billion people and cost the global economy an estimated $490 billion per year (UNCCD figures). Accurate, regionally-tuned vegetation forecasting is foundational for:
- Land Degradation Neutrality (SDG 15.3): tracking SDG indicator 15.3.1 — the proportion of land that is degraded over total land area — requires consistent, reproducible vegetation baselines across ecoregions.
- Climate adaptation planning: identifying which regions are most vulnerable to vegetation collapse under projected 2030+ climate scenarios.
- Conservation prioritization: directing limited resources toward ecosystems with the highest combined risk and recovery potential.
- Policy verification: independent, methodologically-transparent baselines for verifying claims under the UN Convention to Combat Desertification (UNCCD) and REDD+ frameworks.
This project does not aim to replace existing operational systems (MODIS-based monitoring, Copernicus land services, etc.). It aims to provide a methodologically clean reference baseline that other studies can compare against — one that has been explicitly audited for the leakage failure modes that quietly inflate accuracy in much of the published NDVI forecasting literature.
The problem: cross-region NDVI forecasting
Predicting NDVI (Normalized Difference Vegetation Index) is the standard benchmark task for vegetation modeling. The challenge is that vegetation response to climate varies enormously between ecoregions:
- Sahara: NDVI is near-zero year-round; a single rainfall event causes brief, sharp greening.
- Amazon basin: NDVI is high and seasonally damped; the dominant signal is cloud cover, not climate.
- Yakutian larch forest: a strong but short summer growing season; winter NDVI is meaningless under snow.
- Indo-Gangetic plain: NDVI is dominated by human agriculture, not natural climate response.
Global vegetation models that train one set of weights across all biomes typically achieve cross-region R² in the 0.6–0.7 range — the global average smooths over what's actually happening in any one place. Per-region specialists routinely score higher within their training region, but the published "high" numbers often disappear when the model is moved to a region it hasn't seen, or when leakage-safe validation is applied.
Desertflow targets the middle ground: per-region specialists combined with a principled router that handles regions the specialists weren't trained on, evaluated on a held-out 44-region cross-continental test set with no NDVI-derived inputs.
System architecture
The system uses a two-layer architecture, internally referred to as the "B2" router pattern:
Key design choices:
- Specialists are isolated. Specialist N never sees data from specialist M's region. This eliminates the most common form of cross-region leakage.
- The router is the only place cross-region information is combined, and it does so via per-cell weights — not by sharing training data.
- Router calibration is also leakage-safe: it uses a calibration year window separate from the prediction year, so the router never sees future labels at calibration time.
Project details
- Spatial coverage: 133 WWF terrestrial ecoregions spanning approximately 45 million km² of land surface — including the Sahara, Sahel, Amazon basin, Congo basin, Indo-Gangetic plain, Siberian taiga, Patagonian steppe, Australian outback, Mongolian steppe, and 124 other ecoregions across all inhabited continents.
- Temporal coverage: 22 continuous years (2002–2024), monthly cadence, enabling decadal-scale trend analysis.
- Spatial resolution: ~4 km native grid (MODIS-derived), aggregated per ecoregion for training.
- Target variable: V11_mean_ndvi — region-mean MODIS NDVI per cell-year, computed from MOD13Q1.
- Driver candidates: ~250 leakage-safe candidate features per cell, of which ~33 final drivers were selected after rigorous leakage auditing.
- Train/test split: 89 training regions / 44 cross-continental held-out test regions (the "v2" split).
- Validation rule: the "2030 rule" — no model is allowed any feature that wouldn't be available for a 2030 forward prediction (rules out NDVI-derived features like LAI, fPAR, GPP, and most phenology metrics).
Methodology
Training proceeds in five clearly separated phases:
- Per-region data assembly. For each of the 89 training regions, the Earth Engine pipeline assembles a parquet file with all cell-year combinations and all 19 driver modules. Modules are pre-filtered for leakage at this stage.
- Per-region specialist training. Each region trains an independent CatBoost regressor. Optuna runs ~30 hyperparameter trials per region. The specialist is then frozen.
- Router calibration. On a calibration year window inside the training regions, the router learns how to weight specialists for a given cell's environmental profile. Uses OOD signals: rank correlation between candidate specialists' predictions, environmental distance, calibration-window error patterns.
- Held-out evaluation. On the 44 held-out test regions, the router produces per-cell weights without ever seeing test-region training data. The final R² is computed cell-by-cell, not region-mean.
- Audit pass. Every input column is checked against the leakage-feature list (
leakage_features.yaml) and the audit warning is logged if any leak slips through.
Driver modules (19)
The 19 modules group into five families. Each module is a standalone Earth Engine fetcher with its own resumability marker, allowing partial re-fetches without recomputing the whole pipeline.
- ERA5 monthly air temperature
- ERA5 monthly precipitation
- CHIRPS daily-aggregated precipitation
- TerraClimate climatic water deficit
- VPD extremes (anomaly module — E26 set)
- SST anomalies (anomaly module — E26 set)
- SoilGrids — sand, silt, clay fractions
- SoilGrids — soil organic carbon
- SMAP soil moisture (top-layer)
- GLDAS soil moisture (root-zone)
- MCD64A1 burned area
- Snow phenology (E26 set)
- Land surface temperature day–night range
- SRTM elevation
- SRTM slope
- Aspect (north–south, east–west components)
- Distance to coast (manual driver)
- V11_mean_ndvi — the target
- Cell & year identifiers (not features — for joins)
Notably not in the driver list (deliberately excluded by the leakage audit): LAI, fPAR, GPP, MODIS land-cover classification, sister vegetation cover percentages, NDVI-derived phenology start-of-season day-of-year, and any other feature derived from NDVI or NDVI-equivalent optical bands.
Validation protocol
Validation is the heart of this work. The protocol enforces four rules in sequence:
- The 2030 rule. No feature is allowed unless it would be computable for a 2030 forward prediction without seeing 2030 NDVI. This rules out everything NDVI-derived.
- V/O/C/P checks — four explicit leakage gates:
- V (Vegetation circularity): blocks features computed from NDVI time series (phenology start-of-season day, GPP, fPAR).
- O (Optical bands): blocks raw MODIS red/NIR bands and indices derived from them.
- C (Classifier): blocks features derived from supervised vegetation classifiers (MCD12Q1 land cover, tree cover %).
- P (Pheno-prior): blocks any feature whose value depends on knowing the target year's NDVI peak.
- Cross-region split. Specialists are trained on 89 regions; the 44 held-out test regions are never used in training, hyperparameter tuning, or router calibration.
- Temporal split. Within each region, calibration and prediction year windows are separated, so the router never sees the prediction year's labels at calibration time.
Every column loaded by the framework is checked against leakage_features.yaml at training time. Matched columns are auto-set to "(ignore)" with a FAIL-2030 marker, and a banner alerts the operator that leakage features were detected and skipped.
Technology stack
Leakage-aware validation: a methodological contribution
The distinguishing contribution of this work is a documented leakage-aware validation protocol. Many published NDVI forecasting studies inadvertently leak future information through one of four channels:
- Sister vegetation features. NDVI-derived land cover, GPP, fPAR, LAI, and phenology metrics are all computed from the same MODIS optical bands as NDVI itself. Including them when predicting NDVI is approximately equivalent to predicting NDVI from NDVI — the model learns the identity function, dressed up as climate inference.
- Shared cross-region statistics. If feature scalers, gap-fillers, or normalization layers are fit across all regions before splitting train/test, statistics from the test regions leak into the training signal.
- Improperly designed cross-validation splits. Random k-fold on cells within a region leaks neighboring cells' values; random splits across years leak seasonal patterns; even some "spatial" splits leave too-close test cells.
- Pheno-prior contamination. Some published features (start-of-season day-of-year, peak NDVI day) are computed from the target NDVI series itself but presented as if they were independent inputs.
Bundle Research documents each failure mode and tests against it before reporting metrics. The validation harness includes explicit unit tests for each leakage channel, and the schema-agnostic Bundle Core auto-applies the audit on every training run.
Results
Headline metrics on the v2 held-out test split (44 regions, not seen during training or router calibration):
| Metric | Value | Notes |
|---|---|---|
| Cross-continental mean R² | 0.93 | Cell-level, not region-mean |
| Mean Absolute Error (NDVI units) | ~0.015 | Roughly 1.5% of dynamic range |
| Regions with R² > 0.85 | 35 of 44 | ~80% of test regions |
| Regions with R² > 0.95 | 17 of 44 | Mostly arid + temperate |
| Regions with R² < 0.70 | 9 of 44 | The "hard regions" — see below |
| Specialist-only baseline (no router) | 0.88 | +0.05 lift from router |
| Global-model baseline (XGBoost, no per-region) | ~0.67 | Standard literature comparison |
The +0.05 router lift over specialist-only baseline is the empirical justification for the two-layer architecture: per-region specialists alone don't generalize cross-continent, but the router learns to combine them in regionally-appropriate ways.
Open problems: regions where the model struggles
Nine of the 44 held-out regions score R² < 0.7. These are the open problems being worked on:
Persistent cloud cover, weak seasonal signal, NDVI saturation at high values. Climate drivers explain little additional variance once you've controlled for time-of-year.
NDVI dominated by intensive agriculture rather than natural climate response. Need irrigation drivers + cropping-calendar features (deferred — manual drivers).
Extreme seasonality (winter NDVI is meaningless under snow). The masked-NDVI signal during the short summer is sharply nonlinear in temperature.
Same problem as Indo-Gangetic — agriculture dominates. Crop rotation cycles confuse a per-cell-year model that doesn't see crop type.
Cloud cover + saturation + degradation events (fire, clearing) that aren't well-captured by monthly aggregates.
The remaining four hard regions are still being analyzed — preliminary results suggest they share the agriculture-dominated or cloud-saturated profiles described above.
The E26 anomaly modules (VPD extremes, SST anomalies, snow phenology) currently being fetched are partly motivated by this — anomaly drivers may help in regions where mean climate isn't predictive enough.
Recent progress
leakage_features.yaml + SchemaWidget integration so the framework auto-detects and skips known leakage features. Reverted GEE fetcher to per-(module, year) sequential pattern after bulk-submit pattern triggered Earth Engine throttling.Earth Engine compute requirements
This work requires sustained Earth Engine compute for large-scale ecoregion processing. A single full bundle refresh requires approximately 30,000 batch export tasks across 133 regions × 22 years × 19 driver modules. Per-task EECU consumption ranges from ~0.05 (lightweight modules like VPD extremes) to ~0.15 (snow phenology). A full retrain consumes approximately 860 EECU-hours.
The project is registered under Google Earth Engine's noncommercial research program (project ID: celtic-science-493107-c6). The Community Tier monthly EECU budget (150 hours) covers roughly one-sixth of a full refresh, which is why iterative experiments are blocked at the current tier and a Contributor / Partner Tier uplift application is in review.
Roadmap
- Complete E26 anomaly-module fetch (VPD extremes, snow phenology, SST anomalies)
- Re-train with anomaly drivers; expect lift in agriculture-dominated and cloud-saturated regions
- E27 adaptive-λ OOD router experiments
- Stress-test the framework on synthetic-leakage adversarial datasets (does our audit actually catch every leak?)
- EarthArXiv preprint draft (methodology focus: the leakage audit protocol)
- Submit to a remote-sensing journal (target: Remote Sensing of Environment or similar)
- Open-source release: framework code + trained model weights + driver modules + validation harness
- Companion blog series explaining the leakage audit for practitioners
- Deforestation early warning — applying the same architecture to tropical-forest biomes (Amazon, Congo, Borneo, Atlantic Forest, New Guinea, Indonesian peat forests) with target lead time of 1–3 years.
- Wildfire burn-area forecasting — per-region specialists for fire-prone regions (Australian bushfire belt, Iberian wildfire zone, Siberian taiga, California chaparral, Mediterranean shrub).
- Coastal/mangrove degradation monitoring — extending to Sundarbans, Niger Delta, western Sahara coast, Madagascar mangroves.
- Methodology generalization — packaging the leakage audit as a standalone library usable for any remote-sensing forecasting task.
References & inspirations
Key data sources and methodological references:
- WWF Terrestrial Ecoregions of the World — Olson et al., 2001 — used for the 133-region biogeographic stratification.
- MOD13Q1 Vegetation Indices 16-Day L3 Global 250m — Didan, 2015 — primary NDVI source.
- ERA5 reanalysis — Hersbach et al., 2020 — primary climate driver source.
- CHIRPS precipitation — Funk et al., 2015 — daily precipitation aggregates.
- TerraClimate — Abatzoglou et al., 2018 — monthly climatic water deficit.
- SoilGrids 2.0 — Poggio et al., 2021 — soil texture and organic carbon.
- Google Earth Engine: Planetary-scale geospatial analysis for everyone — Gorelick et al., 2017 — the platform.
- CatBoost: gradient boosting with categorical features support — Prokhorenkova et al., 2018 — the chosen base learner.
- UN Convention to Combat Desertification (UNCCD) — policy framing for the SDG 15.3 connection.
Full citations and additional methodological references will be in the EarthArXiv preprint.
Frequently asked questions
What is NDVI and why predict it?
NDVI (Normalized Difference Vegetation Index) is a satellite-derived measure of how much photosynthetically active vegetation is present in a pixel. It's the standard proxy for vegetation health and the primary input to most desertification-monitoring systems. Forecasting NDVI from environmental drivers lets us model how vegetation responds to climate, soil, and disturbance — which is the foundation for desertification risk assessment.
Why per-region specialists instead of one global model?
Vegetation response to climate is non-stationary across ecoregions — the same rainfall produces very different NDVI responses in the Sahel vs the Amazon vs the Mongolian steppe. A single global model averages these responses and ends up explaining none of them well (literature R² ~0.6–0.7). Per-region specialists capture region-specific response curves; the router then learns to combine them for new regions the specialists haven't seen.
What is leakage in vegetation forecasting?
Leakage is when features that shouldn't be available at prediction time accidentally end up in the training set, inflating reported accuracy. In vegetation forecasting, the most common form is including NDVI-derived features (like LAI, fPAR, or phenology metrics computed from NDVI itself) when predicting NDVI — which is approximately the identity function, dressed up as climate inference. The leakage audit protocol systematically catches these.
How can I use this framework for my own region?
Once the open-source release lands (planned alongside the preprint), the framework will be installable as a Python package. You supply your own region polygons and driver list; the framework handles the Earth Engine fetch, per-region training, router calibration, and leakage audit. A PySide6 desktop GUI (the Bundle Framework spinoff) provides a no-code interface for the same workflow.
Is the code open source?
The full framework, trained model artifacts, driver modules, and validation harness are being prepared for open-source release alongside the EarthArXiv preprint. Target release date: Q4 2026 / Q1 2027. License: planned to be Apache 2.0.
Can I cite this work before the paper is out?
If you want to cite the project before the formal preprint, please email and I'll send a recommended citation. For now, citing this site URL and the GitHub repository is appropriate.
How long until results stabilize?
The current 0.93 R² figure is from the leakage-audited v2 split and is stable across multiple retraining runs (±0.01). The E26 anomaly module fetch (in progress) is expected to lift performance in the 9 hard regions but may not change the cross-continental headline number much. The preprint will report the final, fully-audited numbers including ablations.
Why is this run by a high school student?
I find the problem genuinely interesting and the existing literature has obvious methodological gaps that I wanted to address rigorously. Running this independently means moving slower than a funded research group, but it also means I can apply the validation discipline I think is missing — without pressure to publish on the inflated numbers that don't survive the audit. Reviewers welcome to verify the methodology directly on the public code.
Acknowledgments
Data: NASA (MOD13Q1, MCD64A1, MOD10A1, MCD12Q1, SMAP), ECMWF (ERA5), UCSB CHG (CHIRPS), TerraClimate, ISRIC (SoilGrids), WWF (Ecoregions). Compute: Google Earth Engine noncommercial research program. Software: open-source contributors to CatBoost, scikit-learn, PySide6, Optuna, rclone, and pyarrow.
This is independent research with no institutional affiliation; the work is published openly so others can verify, criticize, and extend the methodology.
Get in touch
For collaboration, citations, mentorship offers, or any questions about the methodology — (click to copy)
Code: github.com/mohtheghost
← Back to all projects