Abstract
Partitioning a dataset into subsets, or clustering, is a fundamental technique in
data mining for grouping similar data points. A common approach is to formulate
clustering as an optimization problem, where an objective function evaluates
how well the data are grouped. A variance-based objective function sums the
dispersions of all clusters, which are measured as the sum of the squared distances
of all points in the cluster to the center of mass.
Finding the globally optimal partition into k clusters with this objective func tion is known to be NP-hard, even for k = 2 in dimensions greater than one.
These results fostered the proposal of heuristic algorithms, which outputs good
candidates for solution. However, these heuristic approaches do not guarantee
an optimal solution. In fact, it remains an open problem to ensure that the so lution found is close to the global optimum, even for small to moderately sized
datasets. This fundamental challenge motivates the development of exact algo rithms, which explicitly find the globally optimal clustering and can be used to
measure the fitness of heuristic algorithms—this being the central research topic
of this thesis.
In this work, we propose new strategies to find the globally optimal solution,
extending previous ideas and perspectives in the literature. Our analysis shows
that the complexity of our approach improves in terms of theoretical bounds and
experimental simulations. The strategy reduces the solution space, with a careful
enumeration of all possible candidates jointly with a tailored branch-and-bound
method.
Building upon this core approach, we propose additional strategies that in corporate randomization, specifically through the generation of Gaussian pseu dorandom numbers. One of these strategies consistently provides the exact solu tion efficiently, while others rely on random initialization within a metaheuristic
framework. Motivated by this approach, this thesis also investigates new algebraic
constructions for the efficient generation of pseudorandom number sequences with
strong statistical properties. Additionally, we offer new insights into the selection
of parameters for these generators, which may be useful for practitioners working
on similar problems.
Journal Title
Journal ISSN
Volume Title
Publisher
Universidad Rey Juan Carlos
DOI
Date
Description
Tesis Doctoral leída en la Universidad Rey Juan Carlos de Madrid en 2025. Directores: Ana Isabel Gómez Pérez
Domingo Gómez Pérez
Keywords
Citation
Soto Sánchez, F. J. (2025). New strategies for global optimum search in variance-based k-clustering (Tesis doctoral). Escuela Internacional de Doctorado, Programa de Doctorado en Tecnologías de la Información y las Comunicaciones.
Collections
Endorsement
Review
Supplemented By
Referenced By
Document viewer
Select a file to preview:
Reload



