Abstract

As cyberspace grows, so does the damage caused by malware, one of the main tools used by malicious actors. Machine learning algorithms have become established as important tools for detecting threats, and the models they use depend on data for training and testing. Malware datasets have therefore become valuable in the deployment of modern anti-malware systems. However, these datasets suffer from problems with sample quality, and they fail to keep pace with technological evolution and so become obsolete. In addition, many of the datasets used in research are not publicly accessible. This paper proposes a quality assessment framework based on metrics focused on sampling and data temporality. It also incorporates criteria aligned with the FAIR (Findable, Accessible, Interoperable, Reusable) principles, with the aim of encouraging the publication of more reliable and reusable datasets.

Institutions
  • 1 Instituto Militar de Engenharia (IME) and Diretoria de Comunicações e Tecnologia da Informação da Marinha (DCTIM)
  • 2 CASNAV
  • 3 Instituto Militar de Engenharia (IME)
  • 4 Universidade do Estado do Rio de Janeiro
  • 5 Universidade Federal Fluminense (UFF)
  • 6 University of Twente
Track
  • 25. SE-PODMAR
Keywords
Datasets
Malware Analysis
FAIR