A Formalization of Multilabel Classification in Terms of Lattice Theory and Information Theory: Concerning Datasets

Date

2024-01-21

Publisher

MDPI

Abstract

Multilabel classification is a recently conceptualized task in machine learning. Whereas most research so far has focused on classification machinery, we take a data-centric approach and provide an integrative framework that blends qualitative and quantitative descriptions of multilabel data sources. By combining lattice theory, in the form of formal concept analysis, with entropy triangles, obtained from information theory, we explain from first principles the fundamental issues of multilabel datasets, such as the dependencies among labels, their imbalances, and the effects of hapaxes. This allows us to provide guidelines for resampling and new data collection, and to relate them to broad modelling approaches. We have empirically validated our framework on 56 open datasets, challenging previous characterizations and showing that our formalization brings useful insights into the task of multilabel classification. Further work will extend this formalization to understand the relationship between the data sources, the classification methods, and ways to assess their performance.

Description

This research was funded by the Spanish Ministerio de Ciencia e Innovación grant number PID2021-125780NB-I00, EMERGE and Línea de Actuación No 3. Programa de Excelencia para Francisco José Valverde Albacete. Convenio Plurianual entre Comunidad de Madrid y la Universidad Rey Juan Carlos.

Citation

Valverde-Albacete, F.J.; Peláez-Moreno, C. A Formalization of Multilabel Classification in Terms of Lattice Theory and Information Theory: Concerning Datasets. Mathematics 2024, 12, 346. https://doi.org/10.3390/math12020346
Except where otherwise noted, this item's license is described as Attribution 4.0 International