Local regression algorithm improves NIRS predictions when the target constituent evolves in breeding populations
The limited nutritional quality of cassava roots has dire implications for large numbers of people. The CGIAR Harvest Plus Challenge Program began in the mid-2000s to support genetic improvement of nutritional quality in various crops, including the carotenoid content of cassava roots. As breeding for higher carotenoid levels in cassava advanced, selection became a major bottleneck. Only a few samples can be quantified each day for total carotenoid (TCC) and β-carotene (TBC) contents, which limits the gains from breeding, and cassava roots cannot be stored. This study describes the usefulness of near-infrared spectroscopy (NIRS) as a pre-selection tool. Because the breeding program produced cassava cultivars with higher carotenoid content year after year, predictions based on the classical approach had to be made by extrapolation, which resulted in a nonlinear response at the higher contents. To overcome this, the Local Regression algorithm was successfully used.
The cassava database used for calibration (6031 samples) was built over 6 years. Fresh root samples were ground with a food processor prior to NIRS analysis (FOSS 6500). Total carotenoids and β-carotene were quantified by HPLC with UV detection. Dry matter content was estimated on ground root tissue by drying in an oven at 105 °C for 24 h.
TCC values ranged from 0.07 to 26.1 µg g⁻¹, whereas TBC ranged from negligible values up to 20.1 µg g⁻¹. Between 2009 and 2014, TCC increased by 89% and TBC by 122%. The average dry matter content remained constant over the same period, with an overall average of 33.8%.
Classical calibration using PLS regression was compared with Local regression, using samples analyzed from 2009 to 2013 (n = 4468) for calibration and 2014 samples (n = 463) for validation. With the PLS model, the SEP was 2.17 µg g⁻¹ for TCC and 1.69 µg g⁻¹ for TBC; with Local regression, it was 1.59 µg g⁻¹ for TCC and 1.03 µg g⁻¹ for TBC. Moreover, Local regression detected 103 of the 131 samples with lab TCC content ≥ 18 µg g⁻¹, while PLS regression found only 28. For TBC, Local regression detected 150 of the 179 samples with TBC ≥ 12 µg g⁻¹, while PLS regression detected only 59. Local regression led to only 5 and 6 false positives for TCC and TBC, respectively.
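The validation statistics reported here can be illustrated with a minimal Python sketch. It assumes the SEP is the bias-corrected standard deviation of the prediction residuals, which is one common convention; the abstract does not state which definition was applied. The function names `sep` and `detection_rate` are hypothetical, and the only data used are the counts quoted in the text.

```python
import math


def sep(reference, predicted):
    """Bias-corrected standard error of prediction (assumed definition)."""
    residuals = [p - r for r, p in zip(reference, predicted)]
    n = len(residuals)
    bias = sum(residuals) / n
    return math.sqrt(sum((e - bias) ** 2 for e in residuals) / (n - 1))


def detection_rate(detected, total):
    """Fraction of truly high-content samples that the model also flags."""
    return detected / total


# Counts quoted in the text: (samples detected, samples above the lab threshold).
reported = {
    ("TCC", "Local"): (103, 131),
    ("TCC", "PLS"): (28, 131),
    ("TBC", "Local"): (150, 179),
    ("TBC", "PLS"): (59, 179),
}

for (trait, model), (detected, total) in reported.items():
    rate = detection_rate(detected, total)
    print(f"{trait} {model}: {detected}/{total} = {rate:.1%}")
```

Expressed this way, Local regression recovers roughly four out of five high-carotenoid samples for both traits, against about one in five (TCC) and one in three (TBC) for PLS regression.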
The specificity of the data, with the content of the constituent of interest increasing year after year, clearly showed the limitation of a classical PLS regression approach. The increasing range of the constituent forced the model to work in extrapolation, inducing a greater prediction error. The Local regression algorithm takes advantage of large databases: its single-sample prediction concept provides the highest level of accuracy by selecting calibration samples similar to the target sample. This study highlighted the efficiency of this concept, which led to models able to handle the non-linearity observed with PLS regression at high contents. NIRS coupled with Local regression led to more efficient models for breeding programs aiming to increase carotenoid content in fresh cassava roots.
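The single-sample prediction concept can be sketched in a few lines of Python. This is a simplified illustration, not the implementation used in the study: it assumes Euclidean distance between spectra for selecting similar samples, a fixed number of neighbours `k`, and a local PLS model with a single latent variable. The function name `local_predict` and the toy data are hypothetical.

```python
import math


def local_predict(x_new, spectra, values, k=4):
    """Predict one sample from a model fitted on its k most similar library samples."""
    # 1. Select the k library spectra closest to the target spectrum
    #    (Euclidean distance is an assumption made for this sketch).
    ranked = sorted(range(len(spectra)), key=lambda i: math.dist(x_new, spectra[i]))
    chosen = ranked[:k]
    X = [spectra[i] for i in chosen]
    y = [values[i] for i in chosen]
    n, p = len(X), len(x_new)

    # 2. Centre the local calibration set.
    x_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    Xc = [[row[j] - x_mean[j] for j in range(p)] for row in X]
    yc = [v - y_mean for v in y]

    # 3. One-component PLS1: weight vector proportional to X'y.
    w = [sum(Xc[i][j] * yc[i] for i in range(n)) for j in range(p)]
    norm = math.sqrt(sum(v * v for v in w))
    if norm == 0.0:
        return y_mean  # no local covariance: fall back to the neighbours' mean
    w = [v / norm for v in w]
    scores = [sum(Xc[i][j] * w[j] for j in range(p)) for i in range(n)]
    q = sum(t * v for t, v in zip(scores, yc)) / sum(t * t for t in scores)

    # 4. Project the target sample onto the local model and predict.
    t_new = sum((x_new[j] - x_mean[j]) * w[j] for j in range(p))
    return y_mean + q * t_new


# Toy library: two-point "spectra" whose reference value grows nonlinearly.
library_spectra = [[float(i), 2.0 * i] for i in range(12)]
library_values = [float(i * i) for i in range(12)]
print(local_predict([8.5, 17.0], library_spectra, library_values, k=4))
```

Because a new model is fitted around each target sample, only a locally linear relationship is needed, which is how such an approach can accommodate a response that is nonlinear over the full range of the database.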