Proper interpretation of detrital zircon geochronological data requires high-quality statistical analysis, which machine-learning algorithms can readily provide. In this study, five machine-learning classifiers (Logistic Regression, K-Neighbors, Random Forest, Extra Trees, and Hist Gradient Boosting) were tested for estimating the detrital zircon U-Pb age provenance apportionment between two source areas of the Cenozoic Resende half-graben, southeast Brazil.
The detrital zircon U-Pb age dataset consists of near-concordant (within 10 %) syn-crystallization, inherited, and metamorphic zircon U-Pb ages referenced in Carvalho et al. (2023). The sample dataset comprises between 86 and 115 published detrital zircon U-Pb ages from sandstones. The source dataset is a compilation of 637 and 959 ages from the faulted (a) and flexural (b) margins, respectively, of the surrounding basement rocks.
A training dataset was generated from the source database by data augmentation, resulting in a table with 3000 observations distributed over 6 apportionment classes. Each observation corresponds to a random selection of 100 zircon ages from the source database, divided according to the 6 apportionment classes. The classes span 0 to 100 % of each source reservoir in increments of 20 % (100a-0b, 80a-20b, 60a-40b, 40a-60b, 20a-80b, 0a-100b). For each class, the training dataset had 500 different combinations of zircon ages. Ages were grouped and counted in 7 features, defined according to the geologic time scale (Cenozoic, Mesozoic, Paleozoic, Neoproterozoic, Mesoproterozoic, Paleoproterozoic, and Archean). Missing values in the sample dataset were filled with a simple imputer using the median, so that all columns had the same number of observations (115). Features were standardized with a standard scaler. A 70–30 % train-test split was applied. Tests were performed 50 times, and the final results are the mode of the class predictions and the mean of the accuracy and error values.
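The augmentation step described above can be sketched in Python with NumPy. This is a minimal illustration under stated assumptions, not the authors' code: the function name `make_training_table`, the era boundaries in `ERA_EDGES` (approximate values in Ma, with the oldest bin extended to 4600 Ma), sampling without replacement, and the uniformly distributed placeholder ages are all hypothetical choices made for this example.

```python
import numpy as np

# Era boundaries in Ma. Approximate values chosen for illustration (assumption);
# the last bin is stretched to 4600 Ma so that no age falls outside the bins.
ERA_EDGES = [0.0, 66.0, 252.0, 538.8, 1000.0, 1600.0, 2500.0, 4600.0]
ERA_NAMES = ["Cenozoic", "Mesozoic", "Paleozoic", "Neoproterozoic",
             "Mesoproterozoic", "Paleoproterozoic", "Archean"]

# Fraction of ages drawn from source a for the classes
# 100a-0b, 80a-20b, 60a-40b, 40a-60b, 20a-80b, 0a-100b.
CLASS_FRACTIONS_A = [1.0, 0.8, 0.6, 0.4, 0.2, 0.0]


def make_training_table(ages_a, ages_b, n_per_class=500, n_zircons=100, seed=0):
    """Build era-count features (X) and class labels (y) from two source age sets."""
    rng = np.random.default_rng(seed)
    rows, labels = [], []
    for label, frac_a in enumerate(CLASS_FRACTIONS_A):
        n_a = int(round(frac_a * n_zircons))
        n_b = n_zircons - n_a
        for _ in range(n_per_class):
            # Sampling without replacement is an assumption of this sketch.
            draw = np.concatenate([
                rng.choice(ages_a, size=n_a, replace=False),
                rng.choice(ages_b, size=n_b, replace=False),
            ])
            # Count how many of the 100 ages fall in each of the 7 eras.
            rows.append(np.histogram(draw, bins=ERA_EDGES)[0])
            labels.append(label)
    return np.array(rows), np.array(labels)


# Placeholder source ages (NOT the real data): 637 ages for margin a, 959 for b.
rng = np.random.default_rng(42)
ages_a = rng.uniform(500.0, 700.0, size=637)
ages_b = rng.uniform(500.0, 2800.0, size=959)

X, y = make_training_table(ages_a, ages_b)
print(X.shape, y.shape)  # (3000, 7) (3000,)
```

Each row of `X` sums to 100, one count per era, and `y` holds the class index from 0 (100a-0b) to 5 (0a-100b).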
For the Resende Basin database, the classifier models estimated the proportions contributed by the faulted and flexural margins with accuracy and precision both ranging from ~59 % to ~67 %. The F1-score fell within the same interval. Of the five algorithms tested, Logistic Regression performed best. The results show that zircon ages derive predominantly from the flexural margin reservoir, with the 0a-100b and 40a-60b apportionment classes prevailing (~67 % of the results).
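Accuracy, precision, and F1 values of the kind reported above could be obtained from a repeated train-test loop such as the following scikit-learn sketch. Several details are assumptions made for illustration and are not stated in the abstract: the helper `evaluate`, the stratified split, macro averaging of precision and F1, and the synthetic multinomial counts (with hypothetical era proportions `p_a` and `p_b`) that stand in for the real training table.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def evaluate(X, y, n_runs=50):
    """Return mean accuracy, macro precision, and macro F1 over repeated 70-30 splits."""
    scores = []
    for run in range(n_runs):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.3, stratify=y, random_state=run
        )
        # Standardize the era counts, then fit a logistic regression classifier.
        model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
        model.fit(X_tr, y_tr)
        pred = model.predict(X_te)
        scores.append([
            accuracy_score(y_te, pred),
            precision_score(y_te, pred, average="macro", zero_division=0),
            f1_score(y_te, pred, average="macro"),
        ])
    return np.mean(scores, axis=0)


# Synthetic stand-in data (NOT the real dataset): era proportions of two
# hypothetical sources, mixed at 100, 80, 60, 40, 20, and 0 % of source a.
rng = np.random.default_rng(0)
p_a = np.array([0.0, 0.0, 0.1, 0.8, 0.1, 0.0, 0.0])
p_b = np.array([0.0, 0.0, 0.1, 0.3, 0.1, 0.4, 0.1])
X_parts, y_parts = [], []
for label, frac_a in enumerate([1.0, 0.8, 0.6, 0.4, 0.2, 0.0]):
    p_mix = frac_a * p_a + (1.0 - frac_a) * p_b
    X_parts.append(rng.multinomial(100, p_mix, size=100))
    y_parts.append(np.full(100, label))
X = np.vstack(X_parts)
y = np.concatenate(y_parts)

accuracy, precision, f1 = evaluate(X, y)
print(f"accuracy={accuracy:.2f} precision={precision:.2f} f1={f1:.2f}")
```

Swapping `LogisticRegression` for another scikit-learn classifier in the pipeline would allow the same loop to compare several algorithms on equal terms.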
Sample classification and model performance depend strongly on the input database: the more robust the training dataset, the greater the model accuracy. The challenge is therefore to find the best training dataset.