Improving the Performance of Machine Learning Models Through a Data-centric Approach

Vol 55, 2023 - 161161

Machine learning (ML) systems rely heavily on training data, and any biases or limitations in these datasets can significantly affect the performance and trustworthiness of the systems. A data-centric approach is therefore crucial in guiding both the design of a problem and the creation of a dataset. This approach involves careful consideration of the problem's objectives, scope, and limitations, as well as the quality, diversity, and relevance of the data required to solve it effectively. By placing data at the centre of the decision-making process, we gain a deeper understanding of the characteristics and limitations of the training data, which supports a well-designed problem. Beyond problem design, a data-centric perspective can also be adopted during the model deployment stage: by examining the instances that degrade model performance, data scientists and data specialists can identify and inspect problematic cases. One way to accomplish this is to leverage the concept of instance hardness, which measures how difficult each individual instance is to classify. Instances that are consistently misclassified by multiple ML techniques are considered hard, while those that are classified correctly are considered easy. The instance hardness level is calculated using a data-driven measure that takes into account the likelihood of misclassification across different algorithms. In this paper, we propose a data-centric approach to ML systems that focuses on exploring the potential of data to better design classification problems. We adopt a COVID dataset assembled from a public repository, with the objective of predicting an aggravated condition from patients' blood test results on their first day of attendance. A set of data handling steps was carefully established, guided by the expertise of a doctor, to create a consistent dataset. We investigate how changes in the dataset design can improve the performance of the built models.
After establishing hypotheses, we investigated the relationship among the original class, the instance hardness level, and the information contained in the raw data source. Our findings suggest that using death as the sole criterion for defining severity and incorporating the place of hospitalisation as a feature both lead to improved predictive performance. By leveraging these insights and adopting a data-centric approach, ML systems can be enhanced for more accurate and reliable predictions in the healthcare domain.
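
The abstract describes instance hardness only in words, so the following is a minimal sketch of one common formulation, not necessarily the exact measure used in the paper: the hardness of an instance is one minus the average probability that a pool of classifiers assigns to its true class. The function name `instance_hardness`, the nested-list input layout, and the toy probabilities are all assumptions made for illustration; in practice the probabilities would come from held-out (for example, cross-validated) predictions of several different algorithms.

```python
from typing import List, Sequence


def instance_hardness(
    probas: Sequence[Sequence[Sequence[float]]],
    labels: Sequence[int],
) -> List[float]:
    """Estimate per-instance hardness from a pool of classifiers.

    probas[k][i][c] is the probability that classifier k assigns class c
    to instance i (assumed to come from held-out predictions).
    labels[i] is the true class index of instance i.
    Returns values in [0, 1]; higher means harder to classify.
    """
    n_classifiers = len(probas)
    hardness = []
    for i, true_class in enumerate(labels):
        # Average probability given to the true class across the pool.
        mean_true_prob = sum(p[i][true_class] for p in probas) / n_classifiers
        hardness.append(1.0 - mean_true_prob)
    return hardness


# Toy, made-up probabilities from three hypothetical classifiers
# on three instances of a binary problem.
probas = [
    [[0.9, 0.1], [0.2, 0.8], [0.7, 0.3]],
    [[0.8, 0.2], [0.4, 0.6], [0.6, 0.4]],
    [[1.0, 0.0], [0.3, 0.7], [0.8, 0.2]],
]
labels = [0, 1, 1]

ih = instance_hardness(probas, labels)
# The third instance gets low true-class probability from every
# classifier, so it receives the highest hardness of the three.
hardest = max(range(len(ih)), key=lambda i: ih[i])
```

Instances with high hardness are the ones worth inspecting against the raw data source, since they may reveal labelling choices or missing features rather than model weaknesses.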

  • 1 Universidade Federal de São Paulo
  • 2 Instituto Tecnológico de Aeronáutica
Thematic Track
  • 17. SA – Operations Research in Healthcare
Machine Learning; Data-centric; healthcare