Machine learning classifier algorithms applied for sedimentary provenance based on detrital zircon ages

- 305641
Oral presentation
Abstract

The proper interpretation of detrital zircon geochronological data requires high-quality statistical analysis, which machine-learning algorithms can readily provide. In this study, five machine-learning classifier algorithms (Logistic Regression, K-Neighbors, Random Forest, Extra Trees, and Hist Gradient Boosting) were therefore tested for estimating the provenance apportionment of detrital zircon U-Pb ages between two source areas of the Cenozoic Resende half-graben, southeast Brazil.

The detrital zircon U-Pb age dataset corresponds to near-concordant (within 10 %) syn-crystallization, inherited, and metamorphic U-Pb zircon ages referenced in Carvalho et al. (2023). The sample dataset comprises published detrital zircon U-Pb ages from sandstones, with between 86 and 115 ages per sample. The source dataset is a compilation of 637 and 959 ages from the faulted (a) and flexural (b) margins of the surrounding basement rocks, respectively.

A training dataset was generated from the source database by data augmentation, resulting in a table with 3000 observations distributed over 6 apportionment classes. Each observation corresponds to a random selection of 100 zircon ages from the source database, split between the two sources according to its apportionment class. The classes range from 0 to 100 % of each source reservoir in increments of 20 % (100a-0b, 80a-20b, 60a-40b, 40a-60b, 20a-80b, 0a-100b), and the training dataset holds 500 different combinations of zircon ages per class. Ages were grouped and counted in 7 features, defined according to the geologic time scale (Cenozoic, Mesozoic, Paleozoic, Neoproterozoic, Mesoproterozoic, Paleoproterozoic, and Archean). Missing values in the sample dataset were filled with a Simple Imputer using the median, so that all columns had the same number of observations (= 115). Features were standardized with a Standard Scaler. A 70–30 % train-test split was applied. Tests were performed 50 times, and the final result consisted of the mode of the class predictions and the mean of the accuracy and error values.
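The augmentation step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the source ages are synthetic placeholders (only the array sizes, 637 and 959, come from the abstract), the era boundaries are rounded values in Ma with everything older than 2500 Ma counted as Archean, and drawing without replacement is an assumption. The names `bin_ages` and `build_training_table` are hypothetical.

```python
import numpy as np

# Feature names follow the abstract; the boundaries (Ma) are rounded
# assumptions, and ages older than 2500 Ma are lumped into "Archean".
ERA_NAMES = ["Cenozoic", "Mesozoic", "Paleozoic", "Neoproterozoic",
             "Mesoproterozoic", "Paleoproterozoic", "Archean"]
ERA_EDGES_MA = [0.0, 66.0, 252.0, 539.0, 1000.0, 1600.0, 2500.0, 4600.0]

# Percentage of grains drawn from source a in each apportionment class.
CLASS_PCT_A = [100, 80, 60, 40, 20, 0]


def bin_ages(ages_ma):
    """Count ages per era, giving one row of 7 features."""
    counts, _ = np.histogram(ages_ma, bins=ERA_EDGES_MA)
    return counts


def build_training_table(source_a, source_b, n_per_class=500,
                         n_grains=100, seed=0):
    """Return (X, y): era counts and class labels such as '80a-20b'."""
    rng = np.random.default_rng(seed)
    rows, labels = [], []
    for pct_a in CLASS_PCT_A:
        n_a = n_grains * pct_a // 100
        n_b = n_grains - n_a
        label = f"{pct_a}a-{100 - pct_a}b"
        for _ in range(n_per_class):
            # Shuffle and slice: a draw without replacement (assumption)
            # that also handles the 0-grain case cleanly.
            draw_a = rng.permutation(source_a)[:n_a]
            draw_b = rng.permutation(source_b)[:n_b]
            rows.append(bin_ages(np.concatenate([draw_a, draw_b])))
            labels.append(label)
    return np.array(rows), np.array(labels)


# Synthetic placeholder ages (Ma); NOT the published source compilation.
demo_rng = np.random.default_rng(42)
source_a = demo_rng.uniform(500.0, 700.0, size=637)
source_b = demo_rng.uniform(600.0, 2200.0, size=959)

X_train, y_train = build_training_table(source_a, source_b)
print(X_train.shape, np.unique(y_train))
```

Each row of `X_train` sums to 100 grains, so the features are directly comparable between observations before any scaling is applied.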

For the Resende Basin database, the classifier models allowed estimation of the proportions contributed by the faulted and flexural margins, with accuracy and precision both varying from ~ 59 % to ~ 67 %. The success rate (F1-score) fell in the same interval. Of the 5 tested algorithms, Logistic Regression performed best. Results show a predominance of zircon age contribution from the flexural margin reservoir, with a prevalence of the 0a-100b and 40a-60b apportionment classes (~ 67 % of the results).

Sample classification and model performance depend strongly on the input database: the more robust the training dataset, the greater the model accuracy, so the challenge is finding the best training dataset.

 

 


Institutions
  • 1 Universidade do Estado do Rio de Janeiro (UERJ)
  • 2 OptiMargin Software Co.
  • 3 Brasil
Track
  • 5. Isotopes in Sedimentary Systems: Stratigraphy, Provenance and Petroleum Systems
Keywords
Python
Scikit Learn
Geochronology