An infrastructure for automatic creation and documentation of large-scale chemometrical classifiers

As a general rule, NIR spectroscopy requires reliable chemometrical classifiers or predictors.
Classifiers are used for qualitative, predictors for quantitative applications. The reliability of the
implemented (e.g. statistical) method must be proven by validation against the respective target
application. For many applications this validation procedure must be well and reproducibly
documented. Especially for qualitative analyses, the resulting disproportionate effort to create and
maintain validated and well-documented chemometrical classifiers often limits the use of the
method. This is particularly true for classifiers based on several hundred thousand NIR spectra
that have to be updated and extended regularly.

We have developed an automated infrastructure to create and maintain classifier modules which can
be used to identify a wide range of pharmaceutical substances. These modules contain multiple
chemometrical models, based on different types of spectral data. This includes 310 000 NIR
spectra of solids, measured in diffuse reflection, and 90 000 NIR transflectance spectra of semisolids
and fluids. This large collection of spectral data is used for calibration and validation of the
individual chemometrical models and of the complete classifier module. The results of the
very time-consuming validation procedure are summarized in an automatically generated,
revision-safe PDF document which typically comprises up to several thousand pages. The accuracy and
integrity of this documentation are essential for pharmaceutical applications.

Information that defines the structure and parameters of the models, as well as metadata such as
test certificates, manufacturer and purity, is managed in a dedicated database. This also includes
information regarding related substances, wavelength range, preconditioner, classifying algorithm
etc.
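
As an illustration of how such a model definition might be held and written out as a parameter file, here is a minimal Python sketch. The class name, field names, XML element names, the nanometre unit and the example values are all assumptions made for this example; the actual database schema and file format are not described here.

```python
from dataclasses import dataclass, field
import xml.etree.ElementTree as ET


@dataclass
class ModelDefinition:
    """Hypothetical record for one chemometrical model (field names are assumptions)."""
    substance: str
    wavelength_range_nm: tuple[float, float]  # unit chosen for illustration only
    preconditioner: str
    algorithm: str
    related_substances: list[str] = field(default_factory=list)


def to_parameter_xml(model: ModelDefinition) -> str:
    """Serialize a model definition into a small XML parameter document."""
    root = ET.Element("model", substance=model.substance)
    start, end = model.wavelength_range_nm
    ET.SubElement(root, "wavelengthRange", start=str(start), end=str(end), unit="nm")
    ET.SubElement(root, "preconditioner").text = model.preconditioner
    ET.SubElement(root, "algorithm").text = model.algorithm
    related = ET.SubElement(root, "relatedSubstances")
    for name in model.related_substances:
        ET.SubElement(related, "substance").text = name
    return ET.tostring(root, encoding="unicode")


# Illustrative values only; they are not taken from the described database.
example = ModelDefinition("lactose", (1000.0, 2500.0), "SNV", "PCA", ["glucose"])
parameter_xml = to_parameter_xml(example)
```

Keeping the definition in one typed record and deriving the file from it means the parameter file can always be regenerated from the stored data.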

For the creation of chemometrical models, an automated tool has been implemented, which creates
text- and XML-based parameter files. These are used by our chemometrics toolchain to perform a
Principal Component Analysis (PCA), resulting in chemometrical classifiers. This semi-automatic,
usually iterative process leads to a classifier module. The resulting module is then validated with a
large independent test set, as well as a large set of field data (more than 60 000 spectra). The very
detailed validation documentation is created automatically.
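
The following Python sketch illustrates one way in which a PCA can yield a classifier and how such a classifier can be checked against an independent test set. How the PCA results are actually turned into a classification decision is not specified here; the sketch assumes one PCA model per substance and assigns a spectrum to the model with the smallest reconstruction residual, similar in spirit to SIMCA. All function names and the synthetic spectra are assumptions for illustration.

```python
import numpy as np


def fit_pca(spectra, n_components):
    """Return (mean, loadings) of a PCA fitted to row-wise spectra."""
    mean = spectra.mean(axis=0)
    # Rows of vt are the principal axes, ordered by explained variance.
    _, _, vt = np.linalg.svd(spectra - mean, full_matrices=False)
    return mean, vt[:n_components]


def residual(spectrum, model):
    """Distance between a spectrum and its projection onto the PCA subspace."""
    mean, loadings = model
    centered = spectrum - mean
    reconstructed = loadings.T @ (loadings @ centered)
    return float(np.linalg.norm(centered - reconstructed))


def classify(spectrum, models):
    """Assign the substance whose PCA model leaves the smallest residual."""
    return min(models, key=lambda name: residual(spectrum, models[name]))


def validate(test_spectra, test_labels, models):
    """Fraction of independent test spectra that are identified correctly."""
    hits = sum(classify(s, models) == y for s, y in zip(test_spectra, test_labels))
    return hits / len(test_labels)


# Synthetic stand-in data: two "substances" with one absorption band each.
rng = np.random.default_rng(0)
axis = np.linspace(0.0, 1.0, 50)


def synthetic(center, n):
    """Generate n noisy spectra with a Gaussian band at the given position."""
    peak = np.exp(-((axis - center) ** 2) / 0.01)
    scale = rng.uniform(0.8, 1.2, size=(n, 1))
    return scale * peak + rng.normal(0.0, 0.01, size=(n, axis.size))


# Calibration set per substance, then a separately drawn test set.
models = {"A": fit_pca(synthetic(0.3, 40), 2), "B": fit_pca(synthetic(0.7, 40), 2)}
test_x = np.vstack([synthetic(0.3, 10), synthetic(0.7, 10)])
test_y = ["A"] * 10 + ["B"] * 10
accuracy = validate(test_x, test_y, models)
```

A production classifier would additionally need a rejection threshold, so that spectra matching none of the known substances are not forced into the nearest class.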

This rather complex process has been implemented as a platform-independent software infrastructure.
It currently runs on a Linux-based server system, with an overall processing time of
about 3 days per classifier module.