Interpretable Data Science Methods for Knowledge Discovery from Ovarian Cancer Data

Date

2021

Publisher

Universidad Rey Juan Carlos

Abstract

Background. Ovarian cancer (OC) is the second most common gynecological malignancy, the gynecological tumor with the worst prognosis, and the fifth most common cause of cancer-related death. This situation is due, in part, to the advanced stage at presentation in most patients. Early detection of OC is difficult because specific signs and symptoms are scarce at early stages and reliable screening techniques are lacking. Biomarkers, and more specifically those based on omics data, therefore have great potential for early-stage detection. Modern data science techniques, such as big data and deep learning, help to interpret clinical and omics data by determining associations with the occurrence of disease, with a given prognosis, or even with a certain response to a defined therapeutic intervention, and are thus being used in the discovery of new biomarkers. In this regard, data interpretability analyses deserve special attention. These analyses are intended to understand the data, to find basic patterns in them, and to draw inferences from the most representative patterns. However, they are frequently overlooked in machine learning and deep learning, which focus mostly on accuracy, and this can result in missing information that is relevant to the practitioner or expert in the application field.

Objectives. The general objectives of this doctoral thesis are as follows: (1) to study existing and new data interpretability analysis methods with the intention of understanding the data, finding patterns in them, and drawing inferences from the underlying patterns observed in the data; and (2) to find relationships between clinical and genetic factors and disease progression in patients with OC, by applying data interpretability analysis methods to clinical and genomic data collected from patients diagnosed with OC.

Methodology. To reach the proposed objectives, we followed a general methodology consisting of: (1) performing an extensive review of the literature on data science methods; (2) obtaining an interpretation of OC data with univariate analysis methods, using descriptive statistics and statistical tests; and (3) obtaining an interpretation of OC data with multivariate analysis methods, using feature extraction methods, both linear and non-linear, and feature selection methods.

Results. An initial exploration of the current state of big data and deep learning, two major branches of data science, has resulted in an in-depth snapshot of these two areas. The application of a univariate analysis framework to a clinical and genomic OC dataset has identified some features that show statistical differences between disease progression groups, that is, between the platinum-resistant and platinum-sensitive groups; these individual differences also appear in the words of text features. Regarding the linear multivariate feature extraction analyses, the clinical data results have shown separability patterns, for the methods used, according to the degree of platinum sensitivity. They have confirmed the predictive and prognostic role of widely known clinical and genetic variables, and have demonstrated significant associations for other variables whose role in OC development has been studied to a lesser extent. The pattern of separability between disease progression groups in clinical data is also present in the results of the non-linear feature extraction method used. Finally, the results of the feature selection method used have shown predictive and prognostic capacities for both previously known relevant clinical variables and low-risk genetic features, highlighting the efficacy of the method for better understanding the clinical course of OC.

Conclusions. Regarding the general objective of studying existing and new data interpretability analysis methods, we conclude that their use is a necessary step towards a deeper understanding of the data at hand, since it reveals the quality of the data and uncovers intrinsic patterns that provide valuable information for later analysis stages. As for the general objective of finding relationships between clinical and genetic factors and disease progression in patients with OC, we can report that the separability patterns found in the OC dataset with respect to disease progression, in both univariate and multivariate problem statements, can be indicators of success for the task of classifying between disease progression groups. Additionally, features that have appeared relevant in some of the proposed methods could serve as potential biomarkers of the disease.

Description

Doctoral thesis defended at the Universidad Rey Juan Carlos, Madrid, in 2021. Thesis supervisors: Sergio Muñoz Romero and José Luis Rojo Álvarez

Except where otherwise noted, this item's license is described as Attribution-NonCommercial-NoDerivatives 4.0 International