Quantifying sample size sensitivity in machine learning models for digital mapping of soil organic matter: implications for MRV protocols

Poppiel, R. R.; Victorero, C.; Guzman, M.; Lazaro, M.; Ureta, A.; Demattê, J. A. M.

- 336914

Posters

Abstract

Monitoring soil organic matter (SOM) is essential for Monitoring, Reporting, and Verification (MRV) protocols in agricultural ecosystems. However, balancing sampling costs against fit-for-purpose model accuracy remains a critical bottleneck for scalable digital soil mapping (DSM). This study evaluates the sensitivity of the Random Forest (RF) algorithm to training sample density when predicting SOM (0–30 cm) across a 52-ha agricultural area in Chile. The methodology integrated georeferenced SOM data with multi-source geoenvironmental covariates at 30 m resolution. Predictors included terrain attributes (elevation, slope, curvatures), Sentinel-1 radar and Sentinel-2 multispectral bands, and ALOS PALSAR data. To quantify model sensitivity, multiple sample size scenarios (n) were generated using conditioned Latin Hypercube Sampling (cLHS) within stratified principal components (PC1–PC5). RF models were optimized via grid search and validated against an independent dataset using RMSE, R², MAPE, and RPIQ. K-means clustering (k = 3) was applied to the performance metrics to identify Low, Medium, and High performance tiers. Results indicated a clear performance plateau within the "Medium" cluster at an average of 26 samples, corresponding to a density of 1.98 ha/sample. High-performance stability was achieved at approximately 1.7 ha/sample (~30 samples). With fewer samples than these thresholds, model error increased significantly and R² became low. These findings demonstrate that sampling more densely than about one sample per 2 ha yields diminishing returns in predictive accuracy for this landscape. This study provides a data-driven framework for optimizing soil sampling designs, ensuring robust SOM predictions while minimizing operational costs. Furthermore, integrating regional datasets through similarity-based weighting can enhance local model performance, reducing the need for intensive primary and additional sampling.
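As an illustration only, the sampling-density arithmetic and the four validation metrics named in the abstract can be sketched in plain Python. This is not the authors' code: the function names `hectares_per_sample` and `validation_metrics` are hypothetical, the observed and predicted SOM values are made up, and the quartile method used for RPIQ (interquartile range of the observations divided by RMSE) is an assumption, since quartile conventions differ between tools.

```python
import math
import statistics


def hectares_per_sample(area_ha, n_samples):
    """Area represented by each sample (ha/sample); smaller means denser sampling."""
    return area_ha / n_samples


def validation_metrics(observed, predicted):
    """Return RMSE, R2, MAPE (%) and RPIQ for paired observed/predicted values."""
    n = len(observed)
    residuals = [o - p for o, p in zip(observed, predicted)]
    ss_res = sum(r * r for r in residuals)
    rmse = math.sqrt(ss_res / n)
    mean_obs = sum(observed) / n
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    r2 = 1.0 - ss_res / ss_tot
    # MAPE assumes no observed value is zero.
    mape = 100.0 * sum(abs(r / o) for r, o in zip(residuals, observed)) / n
    # Assumption: "inclusive" quartiles; other quartile conventions shift RPIQ slightly.
    q1, _, q3 = statistics.quantiles(observed, n=4, method="inclusive")
    rpiq = (q3 - q1) / rmse
    return {"rmse": rmse, "r2": r2, "mape": mape, "rpiq": rpiq}


# Densities for the 52-ha area at the two sample counts quoted in the abstract.
density_26 = hectares_per_sample(52, 26)
density_30 = hectares_per_sample(52, 30)

# Made-up SOM values (%), used only to exercise the metric function.
example = validation_metrics([2.0, 3.0, 4.0, 5.0], [2.5, 2.5, 4.5, 4.5])
```

For the 52-ha area, 26 samples give 2.0 ha/sample, close to the 1.98 ha/sample reported for the Medium cluster, and 30 samples give about 1.73 ha/sample, consistent with the stated ~1.7 ha/sample.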

Institutions

¹ ESALQ/USP
² Neutral Farming
³ Centro de Investigación y Desarrollo Agrícola

Track

Monitoring, Reporting and Verification (MRV) protocols

Keywords

Pedometrics

cLHS

Precision agriculture

Carbon sequestration

fit-for-purpose

SOM 2026

Book of abstracts of the 10th International Symposium on Soil Organic Matter
