Research · Active

Per-region NDVI forecasting across 133 global ecoregions

A leakage-audited vegetation prediction framework for desertification monitoring. Cross-continental R² ≈ 0.93 on held-out test regions after rigorous leakage controls.

View Code →
133
WWF ecoregions covered
22
Years of data (2002–2024)
19
Earth Engine driver modules
0.93
R² cross-continental
Active development · Phase 3: validation hardening · Currently fetching E26 anomaly modules (VPD extremes, snow phenology, SST anomalies)
Abstract. Desertflow Global Bundle is a per-region NDVI prediction system built on Google Earth Engine. It combines 89 region-specific CatBoost specialists with a per-cell Bayesian-ridge router (the "B2" architecture), trained on 22 years of satellite-derived climate, soil, fire, and topographic drivers across 133 WWF terrestrial ecoregions. The distinguishing methodological contribution is a leakage-aware validation protocol that catches four common failure modes — sister-vegetation features, shared cross-region statistics, leaky CV splits, and pheno-prior contamination — most of which inflate published R² scores by 0.05–0.20. After applying the audit, cross-continental R² on a held-out 44-region test set is approximately 0.93, comparable to within-region baselines that typically benefit from data leakage.

Overview

Desertflow Global Bundle is a noncommercial research project studying how vegetation responds to climate, soil, fire, and human-driven stressors across the world's terrestrial ecoregions. The core question: can per-region machine-learning specialists, combined with a principled cross-region router, beat traditional global vegetation models — and do so without secretly training on the answer?

The system fetches per-cell, per-year aggregates for 133 World Wildlife Fund (WWF) terrestrial ecoregions across 22 years (2002–2024), trains a CatBoost regression specialist for each region in a 89-region training split, then combines those specialists at prediction time via a per-cell Bayesian-ridge router. Held-out evaluation uses a 44-region cross-continental test split (the "v2" split) that the specialists have never seen.

The framework, all driver modules, the validation harness, and trained model artifacts are being prepared for open-source release alongside an EarthArXiv preprint.

Why this matters

Desertification and land degradation affect over 3.2 billion people and cost the global economy an estimated $490 billion per year (UNCCD figures). Accurate, regionally-tuned vegetation forecasting is foundational for:

This project does not aim to replace existing operational systems (MODIS-based monitoring, Copernicus land services, etc.). It aims to provide a methodologically clean reference baseline that other studies can compare against — one that has been explicitly audited for the leakage failure modes that quietly inflate accuracy in much of the published NDVI forecasting literature.

The problem: cross-region NDVI forecasting

Predicting NDVI (Normalized Difference Vegetation Index) is the standard benchmark task for vegetation modeling. The challenge is that vegetation response to climate varies enormously between ecoregions:

Global vegetation models that train one set of weights across all biomes typically achieve cross-region R² in the 0.6–0.7 range — the global average smooths over what's actually happening in any one place. Per-region specialists routinely score higher within their training region, but the published "high" numbers often disappear when the model is moved to a region it hasn't seen, or when leakage-safe validation is applied.

Desertflow targets the middle ground: per-region specialists combined with a principled router that handles regions the specialists weren't trained on, evaluated on a held-out 44-region cross-continental test set with no NDVI-derived inputs.

System architecture

The system uses a two-layer architecture, internally referred to as the "B2" router pattern:

┌─────────────────────────────────────────────────────────────┐ │ RAW DATA · Earth Engine batch exports (~30,000 tasks) │ │ 19 driver modules × 133 regions × 22 years │ └──────────────────────┬──────────────────────────────────────┘ ▼ ┌─────────────────────────────────────────────────────────────┐ │ LAYER 1 — 89 PER-REGION CATBOOST SPECIALISTS │ │ Each specialist is trained only on its own region's │ │ cells & years. Hyperparameters tuned per-region via │ │ Optuna (~30 trials). No cross-region weight sharing. │ └──────────────────────┬──────────────────────────────────────┘ ▼ ┌─────────────────────────────────────────────────────────────┐ │ LAYER 2 — PER-CELL BAYESIAN-RIDGE ROUTER │ │ For each prediction cell, the router computes a │ │ prior weight over all 89 specialists from out-of-domain │ │ (OOD) signals — environmental similarity, rank-corr, │ │ λ-decay over calibration error. │ │ │ │ Final prediction = weighted ensemble of specialists. │ └──────────────────────┬──────────────────────────────────────┘ ▼ ┌─────────────────────────────────────────────────────────────┐ │ EVALUATION — 44-region held-out cross-continental test │ │ Never seen during specialist training or router calib. │ └─────────────────────────────────────────────────────────────┘

Key design choices:

Project details

Methodology

Training proceeds in five clearly separated phases:

  1. Per-region data assembly. For each of the 89 training regions, the Earth Engine pipeline assembles a parquet file with all cell-year combinations and all 19 driver modules. Modules are pre-filtered for leakage at this stage.
  2. Per-region specialist training. Each region trains an independent CatBoost regressor. Optuna runs ~30 hyperparameter trials per region. The specialist is then frozen.
  3. Router calibration. On a calibration year window inside the training regions, the router learns how to weight specialists for a given cell's environmental profile. Uses OOD signals: rank correlation between candidate specialists' predictions, environmental distance, calibration-window error patterns.
  4. Held-out evaluation. On the 44 held-out test regions, the router produces per-cell weights without ever seeing test-region training data. The final R² is computed cell-by-cell, not region-mean.
  5. Audit pass. Every input column is checked against the leakage-feature list (leakage_features.yaml) and the audit warning is logged if any leak slips through.

Driver modules (19)

The 19 modules group into five families. Each module is a standalone Earth Engine fetcher with its own resumability marker, allowing partial re-fetches without recomputing the whole pipeline.

Climate (6 modules)
  • ERA5 monthly air temperature
  • ERA5 monthly precipitation
  • CHIRPS daily-aggregated precipitation
  • TerraClimate climatic water deficit
  • VPD extremes (anomaly module — E26 set)
  • SST anomalies (anomaly module — E26 set)
Soil & moisture (4 modules)
  • SoilGrids — sand, silt, clay fractions
  • SoilGrids — soil organic carbon
  • SMAP soil moisture (top-layer)
  • GLDAS soil moisture (root-zone)
Disturbance (3 modules)
  • MCD64A1 burned area
  • Snow phenology (E26 set)
  • Land surface temperature day–night range
Topography & context (4 modules)
  • SRTM elevation
  • SRTM slope
  • Aspect (north–south, east–west components)
  • Distance to coast (manual driver)
Target & identifiers (2 modules)
  • V11_mean_ndvi — the target
  • Cell & year identifiers (not features — for joins)

Notably not in the driver list (deliberately excluded by the leakage audit): LAI, fPAR, GPP, MODIS land-cover classification, sister vegetation cover percentages, NDVI-derived phenology start-of-season day-of-year, and any other feature derived from NDVI or NDVI-equivalent optical bands.

Validation protocol

Validation is the heart of this work. The protocol enforces four rules in sequence:

  1. The 2030 rule. No feature is allowed unless it would be computable for a 2030 forward prediction without seeing 2030 NDVI. This rules out everything NDVI-derived.
  2. V/O/C/P checks — four explicit leakage gates:
    • V (Vegetation circularity): blocks features computed from NDVI time series (phenology start-of-season day, GPP, fPAR).
    • O (Optical bands): blocks raw MODIS red/NIR bands and indices derived from them.
    • C (Classifier): blocks features derived from supervised vegetation classifiers (MCD12Q1 land cover, tree cover %).
    • P (Pheno-prior): blocks any feature whose value depends on knowing the target year's NDVI peak.
  3. Cross-region split. Specialists are trained on 89 regions; the 44 held-out test regions are never used in training, hyperparameter tuning, or router calibration.
  4. Temporal split. Within each region, calibration and prediction year windows are separated, so the router never sees the prediction year's labels at calibration time.

Every column loaded by the framework is checked against leakage_features.yaml at training time. Matched columns are auto-set to "(ignore)" with a FAIL-2030 marker, and a banner alerts the operator that leakage features were detected and skipped.

Technology stack

Google Earth Engine Python 3.11 CatBoost XGBoost LightGBM Bayesian Ridge (scikit-learn) Optuna MODIS (MOD13Q1, MCD64A1, MOD10A1, MCD12Q1) ERA5 CHIRPS TerraClimate SMAP GLDAS SoilGrids PySide6 PowerShell rclone pyarrow / parquet

Leakage-aware validation: a methodological contribution

The distinguishing contribution of this work is a documented leakage-aware validation protocol. Many published NDVI forecasting studies inadvertently leak future information through one of four channels:

  1. Sister vegetation features. NDVI-derived land cover, GPP, fPAR, LAI, and phenology metrics are all computed from the same MODIS optical bands as NDVI itself. Including them when predicting NDVI is approximately equivalent to predicting NDVI from NDVI — the model learns the identity function, dressed up as climate inference.
  2. Shared cross-region statistics. If feature scalers, gap-fillers, or normalization layers are fit across all regions before splitting train/test, statistics from the test regions leak into the training signal.
  3. Improperly designed cross-validation splits. Random k-fold on cells within a region leaks neighboring cells' values; random splits across years leak seasonal patterns; even some "spatial" splits leave too-close test cells.
  4. Pheno-prior contamination. Some published features (start-of-season day-of-year, peak NDVI day) are computed from the target NDVI series itself but presented as if they were independent inputs.

Bundle Research documents each failure mode and tests against it before reporting metrics. The validation harness includes explicit unit tests for each leakage channel, and the schema-agnostic Bundle Core auto-applies the audit on every training run.

Headline result. Cross-continental mean R² ≈ 0.93 on the 44-region held-out test set after applying the full leakage audit. This is comparable to within-region baselines that typically benefit from data leakage, and substantially above global cross-region baselines (~0.6–0.7) reported in the literature for similar tasks.

Results

Headline metrics on the v2 held-out test split (44 regions, not seen during training or router calibration):

MetricValueNotes
Cross-continental mean R²0.93Cell-level, not region-mean
Mean Absolute Error (NDVI units)~0.015Roughly 1.5% of dynamic range
Regions with R² > 0.8535 of 44~80% of test regions
Regions with R² > 0.9517 of 44Mostly arid + temperate
Regions with R² < 0.709 of 44The "hard regions" — see below
Specialist-only baseline (no router)0.88+0.05 lift from router
Global-model baseline (XGBoost, no per-region)~0.67Standard literature comparison

The +0.05 router lift over specialist-only baseline is the empirical justification for the two-layer architecture: per-region specialists alone don't generalize cross-continent, but the router learns to combine them in regionally-appropriate ways.

Open problems: regions where the model struggles

Nine of the 44 held-out regions score R² < 0.7. These are the open problems being worked on:

Congo Basin

Persistent cloud cover, weak seasonal signal, NDVI saturation at high values. Climate drivers explain little additional variance once you've controlled for time-of-year.

Indo-Gangetic plain

NDVI dominated by intensive agriculture rather than natural climate response. Need irrigation drivers + cropping-calendar features (deferred — manual drivers).

Yakutia larch forest

Extreme seasonality (winter NDVI is meaningless under snow). The masked-NDVI signal during the short summer is sharply nonlinear in temperature.

US Corn Belt

Same problem as Indo-Gangetic — agriculture dominates. Crop rotation cycles confuse a per-cell-year model that doesn't see crop type.

Borneo lowland rainforest

Cloud cover + saturation + degradation events (fire, clearing) that aren't well-captured by monthly aggregates.

4 more

The remaining four hard regions are still being analyzed — preliminary results suggest they share the agriculture-dominated or cloud-saturated profiles described above.

The E26 anomaly modules (VPD extremes, SST anomalies, snow phenology) currently being fetched are partly motivated by this — anomaly drivers may help in regions where mean climate isn't predictive enough.

Recent progress

June 2026
Completed leakage audit and removed 8 sister-vegetation features from training set. Built leakage_features.yaml + SchemaWidget integration so the framework auto-detects and skips known leakage features. Reverted GEE fetcher to per-(module, year) sequential pattern after bulk-submit pattern triggered Earth Engine throttling.
May 2026
Finalized 89-region train / 44-region test split (v2). Released schema-agnostic Bundle Core supporting custom drivers per region. Memory-safe per-region training pipeline (eliminated the 26.9 GiB concat OOM).
April 2026
Bayesian-ridge router (E21d super-picker) selected as production router after sweep over 12 candidate architectures. Per-cell calibration outperforms global stacking by ~0.04 R².
March 2026
E16 base-learner algorithm sweep: CatBoost selected over XGBoost / LightGBM / random forest as the per-region specialist. CatBoost's native categorical handling matters for landcover/biome covariates.
February 2026
E14 master dataset assembled: 19 driver modules × 133 regions × 22 years, ~9.1 GB train + 2.1 GB test parquets. First end-to-end pipeline run from raw Earth Engine to held-out evaluation.

Earth Engine compute requirements

This work requires sustained Earth Engine compute for large-scale ecoregion processing. A single full bundle refresh requires approximately 30,000 batch export tasks across 133 regions × 22 years × 19 driver modules. Per-task EECU consumption ranges from ~0.05 (lightweight modules like VPD extremes) to ~0.15 (snow phenology). A full retrain consumes approximately 860 EECU-hours.

The project is registered under Google Earth Engine's noncommercial research program (project ID: celtic-science-493107-c6). The Community Tier monthly EECU budget (150 hours) covers roughly one-sixth of a full refresh, which is why iterative experiments are blocked at the current tier and a Contributor / Partner Tier uplift application is in review.

Roadmap

Phase 4 (Q3 2026) — Validation hardening
  • Complete E26 anomaly-module fetch (VPD extremes, snow phenology, SST anomalies)
  • Re-train with anomaly drivers; expect lift in agriculture-dominated and cloud-saturated regions
  • E27 adaptive-λ OOD router experiments
  • Stress-test the framework on synthetic-leakage adversarial datasets (does our audit actually catch every leak?)
Phase 5 (Q4 2026) — Publication
  • EarthArXiv preprint draft (methodology focus: the leakage audit protocol)
  • Submit to a remote-sensing journal (target: Remote Sensing of Environment or similar)
  • Open-source release: framework code + trained model weights + driver modules + validation harness
  • Companion blog series explaining the leakage audit for practitioners
Phase 6 (2027) — Extensions
  • Deforestation early warning — applying the same architecture to tropical-forest biomes (Amazon, Congo, Borneo, Atlantic Forest, New Guinea, Indonesian peat forests) with target lead time of 1–3 years.
  • Wildfire burn-area forecasting — per-region specialists for fire-prone regions (Australian bushfire belt, Iberian wildfire zone, Siberian taiga, California chaparral, Mediterranean shrub).
  • Coastal/mangrove degradation monitoring — extending to Sundarbans, Niger Delta, western Sahara coast, Madagascar mangroves.
  • Methodology generalization — packaging the leakage audit as a standalone library usable for any remote-sensing forecasting task.

References & inspirations

Key data sources and methodological references:

Full citations and additional methodological references will be in the EarthArXiv preprint.

Frequently asked questions

What is NDVI and why predict it?

NDVI (Normalized Difference Vegetation Index) is a satellite-derived measure of how much photosynthetically active vegetation is present in a pixel. It's the standard proxy for vegetation health and the primary input to most desertification-monitoring systems. Forecasting NDVI from environmental drivers lets us model how vegetation responds to climate, soil, and disturbance — which is the foundation for desertification risk assessment.

Why per-region specialists instead of one global model?

Vegetation response to climate is non-stationary across ecoregions — the same rainfall produces very different NDVI responses in the Sahel vs the Amazon vs the Mongolian steppe. A single global model averages these responses and ends up explaining none of them well (literature R² ~0.6–0.7). Per-region specialists capture region-specific response curves; the router then learns to combine them for new regions the specialists haven't seen.

What is leakage in vegetation forecasting?

Leakage is when features that shouldn't be available at prediction time accidentally end up in the training set, inflating reported accuracy. In vegetation forecasting, the most common form is including NDVI-derived features (like LAI, fPAR, or phenology metrics computed from NDVI itself) when predicting NDVI — which is approximately the identity function, dressed up as climate inference. The leakage audit protocol systematically catches these.

How can I use this framework for my own region?

Once the open-source release lands (planned alongside the preprint), the framework will be installable as a Python package. You supply your own region polygons and driver list; the framework handles the Earth Engine fetch, per-region training, router calibration, and leakage audit. A PySide6 desktop GUI (the Bundle Framework spinoff) provides a no-code interface for the same workflow.

Is the code open source?

The full framework, trained model artifacts, driver modules, and validation harness are being prepared for open-source release alongside the EarthArXiv preprint. Target release date: Q4 2026 / Q1 2027. License: planned to be Apache 2.0.

Can I cite this work before the paper is out?

If you want to cite the project before the formal preprint, please email and I'll send a recommended citation. For now, citing this site URL and the GitHub repository is appropriate.

How long until results stabilize?

The current 0.93 R² figure is from the leakage-audited v2 split and is stable across multiple retraining runs (±0.01). The E26 anomaly module fetch (in progress) is expected to lift performance in the 9 hard regions but may not change the cross-continental headline number much. The preprint will report the final, fully-audited numbers including ablations.

Why is this run by a high school student?

I find the problem genuinely interesting and the existing literature has obvious methodological gaps that I wanted to address rigorously. Running this independently means moving slower than a funded research group, but it also means I can apply the validation discipline I think is missing — without pressure to publish on the inflated numbers that don't survive the audit. Reviewers welcome to verify the methodology directly on the public code.

Acknowledgments

Data: NASA (MOD13Q1, MCD64A1, MOD10A1, MCD12Q1, SMAP), ECMWF (ERA5), UCSB CHG (CHIRPS), TerraClimate, ISRIC (SoilGrids), WWF (Ecoregions). Compute: Google Earth Engine noncommercial research program. Software: open-source contributors to CatBoost, scikit-learn, PySide6, Optuna, rclone, and pyarrow.

This is independent research with no institutional affiliation; the work is published openly so others can verify, criticize, and extend the methodology.

Get in touch

For collaboration, citations, mentorship offers, or any questions about the methodology — (click to copy)

Code: github.com/mohtheghost

← Back to all projects