Browsing by Author "Soguero Ruiz, Cristina"
Showing 1 - 8 of 8
Item: Data-Driven Visual Characterization of Patient Health-Status Using Electronic Health Records and Self-Organizing Maps (IEEE, 2020-07-27)
Chushig Muzo, David; Soguero Ruiz, Cristina; Engelbrecht, Andries; de Miguel Bohoyo, Pablo; Mora Jiménez, Inmaculada
Hypertension and diabetes have become a global health and economic issue and are among the major chronic conditions worldwide, particularly in developed countries. To face this global problem, better knowledge of these diseases is crucial for characterizing chronic patients. Our aim is two-fold: (1) to provide an efficient visual tool for identifying clinical patterns in high-dimensional data; and (2) to characterize patient health-status through a data-driven approach using electronic health records of healthy, hypertensive and diabetic populations. We propose a two-stage methodology that uses diagnosis and drug codes of healthy and chronic patients associated with the University Hospital of Fuenlabrada in Spain. The first stage applies the Self-Organizing Map to these data to obtain a set of prototype patients, which are projected onto a grid of nodes. Each node is associated with a prototype patient that captures relationships among clinical characteristics. In the second stage, clustering methods are applied to the prototype patients to find groups of patients with a similar health-status. Clusters with distinctive patterns linked to chronic conditions were found; the most remarkable were a cluster of pregnant women within the hypertensive population and two clusters of diabetic individuals with significant differences in drug therapy (insulin dependent and non-insulin dependent). The proposed methodology proved effective for exploring relationships within clinical data and for finding patterns related to diabetes and hypertension in a visual way.
Our methodology emerges as a suitable alternative for building appropriate clinical groups and, given its data-driven philosophy, is a promising approach that could be applied to any population. A thorough analysis of these groups could yield new and fruitful findings.

Item: Diagnóstico de modelo y selección de variables para métodos de aprendizaje estadístico aplicados a efectividad promocional [Model diagnosis and variable selection for statistical learning methods applied to promotional effectiveness] (Universidad Rey Juan Carlos, 2011)
Soguero Ruiz, Cristina
The economic instability of recent years is causing a general decline in sales, particularly of food products. This situation has led many retailers to launch promotional actions such as direct discounts and quantity promotions (3x2). The digital information available today has changed how these activities are developed and, given their potential, statistical learning methods have started to gain real importance as a means of increasing sales volume. This project uses statistical learning methods to analyze the behavior of promotions in terms of units sold, carrying out a detailed statistical comparison among different methods to determine objectively which one performs best. The purpose of this work is to propose an operational procedure for model diagnosis and variable selection using statistical techniques in promotional effectiveness applications. Specifically, the promotions run by a retailer over one year on 6 products in the milk category and 14 products in the beer category were analyzed, and several experiments were carried out for this purpose.
The first experiment compared the performance of four statistical learning methods, k-NN (k-Nearest Neighbors), GRNN (General Regression Neural Network), MLP (Multi Layer Perceptron) and SVM (Support Vector Machine), in absolute terms, using the MAE (Mean Absolute Error) as the figure of merit. Whether any method is significantly better than another was tested with a nonparametric statistical test based on bootstrap resampling. This methodology was subsequently used to compare the performance of the SVM scheme designed with an RBF kernel and with a semiparametric kernel, to analyze the design elements of the MLP, and to check whether it is advisable to include certain variables (generally dichotomous) in the promotional models. In conclusion, statistical learning techniques and the proposed bootstrap test make it possible to extract relevant information when analyzing promotional effectiveness.

Item: Ensemble feature selection and tabular data augmentation with generative adversarial networks to enhance cutaneous melanoma identification and interpretability (BioMed Central, 2024-10-30)
Gómez Martínez, Vanesa; Chushig-Muzo, David; Veierød, Marit B.; Granja, Conceição; Soguero Ruiz, Cristina
Cutaneous melanoma is the most aggressive form of skin cancer, responsible for most skin cancer-related deaths. Recent advances in artificial intelligence, together with the availability of public dermoscopy image datasets, have made it possible to assist dermatologists in melanoma identification. While image feature extraction holds potential for melanoma detection, it often leads to high-dimensional data.
Furthermore, most image datasets present the class imbalance problem, where a few classes have numerous samples, whereas others are under-represented.
Methods: In this paper, we propose to combine ensemble feature selection (FS) methods and data augmentation with the conditional tabular generative adversarial network (CTGAN) to enhance melanoma identification in imbalanced datasets. We employed dermoscopy images from two public datasets, PH2 and Derm7pt, which contain melanoma and not-melanoma lesions. To capture intrinsic information from skin lesions, we conducted two feature extraction (FE) approaches: handcrafted features and embedding features. For the former, color, geometric and first-, second-, and higher-order texture features were extracted, whereas for the latter, embeddings were obtained using ResNet-based models. To alleviate the high dimensionality resulting from FE, ensemble FS with filter methods was used and evaluated. For data augmentation, we conducted a progressive analysis of the imbalance ratio (IR), which is related to the number of synthetic samples created, and evaluated its impact on the predictive results. To gain interpretability of the predictive models, we used SHAP, bootstrap resampling statistical tests and UMAP visualizations.
Results: The combination of ensemble FS, CTGAN, and linear models achieved the best predictive results, yielding AUCROC values of 87% (with support vector machine and IR=0.9) and 76% (with LASSO and IR=1.0) for PH2 and Derm7pt, respectively. We also found that melanoma lesions were mainly characterized by features related to color, while not-melanoma lesions were characterized by texture features.
Conclusions: Our results demonstrate the effectiveness of ensemble FS and synthetic data in the development of models that accurately identify melanoma.
This research advances skin lesion analysis, contributing to both melanoma detection and the interpretation of the main features for its identification.

Item: Evaluation of Synthetic Categorical Data Generation Techniques for Predicting Cardiovascular Diseases and Post-Hoc Interpretability of the Risk Factors (MDPI, 2023-03-23)
García Vicente, Clara; Chushig-Muzo, David; Mora Jiménez, Inmaculada; Fabelo, Himar; Torhild Gram, Inger; Løchen, Maja-Lisa; Granja, Conceição; Soguero Ruiz, Cristina
Machine Learning (ML) methods have become important for enhancing the performance of decision-support predictive models. However, class imbalance is one of the main challenges in developing ML models, because it may bias the learning process and the model's generalization ability. In this paper, we consider oversampling methods for generating synthetic categorical clinical data, aiming to improve the predictive performance of ML models and the identification of risk factors for cardiovascular diseases (CVDs). We performed a comparative study of several categorical synthetic data generation methods, including Synthetic Minority Oversampling Technique Nominal (SMOTEN), Tabular Variational Autoencoder (TVAE) and Conditional Tabular Generative Adversarial Networks (CTGANs). Then, we assessed the impact of combining oversampling strategies with linear and nonlinear supervised ML methods. Lastly, we conducted a post-hoc model interpretability analysis based on the importance of the risk factors. Experimental results show the potential of GAN-based models for generating high-quality categorical synthetic data, yielding probability mass functions that are very close to those of the real data, preserving relevant insights, and contributing to increased predictive performance. The GAN-based model combined with a linear classifier outperforms the other oversampling techniques, improving the area under the curve by 2%.
These results demonstrate the capability of synthetic data to help with both determining risk factors and building models for CVD prediction.

Item: Interpretable Data-Driven Approach Based on Feature Selection Methods and GAN-Based Models for Cardiovascular Risk Prediction in Diabetic Patients (Institute of Electrical and Electronics Engineers, 2024-06-11)
Chushig Muzo, David; Calero Díaz, Hugo; Lara Abelenda, Francisco J.; Gómez Martínez, Vanesa; Granja, Conceição; Soguero Ruiz, Cristina
Noncommunicable diseases (NCDs) are the leading cause of morbidity and mortality worldwide. Cardiovascular diseases (CVDs) and diabetes are the most prevalent NCDs, causing 1.9 and 1.5 million deaths yearly, respectively. Individuals diagnosed with type 1 diabetes (T1D) are at high risk of developing CVDs. Machine learning (ML) models have provided outstanding results in different domains, including healthcare, making it possible to obtain models with high predictive performance. The aim of this study was to develop an interpretable data-driven approach to predict the 10-year CVD risk for older individuals with T1D, providing both reasonable predictive performance and the identification of risk factors associated with CVDs. Data from T1D individuals at the Steno Diabetes Center Copenhagen were used. Different ML-based models were considered, including KNN, decision tree, random forest, and multilayer perceptron (MLP). To enhance the predictive performance of the ML models, the conditional tabular generative adversarial network (CTGAN) was used to create synthetic data and increase the size of the training data. Several filter and wrapper feature selection (FS) techniques were considered for identifying the most relevant features involved in CVD risk and for enhancing the performance of the ML-based models. To gain interpretability of the predictive models, we used two post-hoc methods: SHAP and accumulated local effects.
The experimental results showed strong performance of the FS and ML-based models for predicting CVD risk. In particular, the MLP obtained the best results, with a mean absolute error of 0.0088 and a mean relative absolute error of 0.0817. Regarding risk factors, age, HbA1c, and albuminuria were identified as crucial in CVD risk prediction, which is in line with recent clinical evidence. Our study contributes to identifying CVD risk and associated risk factors in a data-driven manner, supporting early interventions and adequate treatments to prevent CVDs.

Item: Learning and visualizing chronic latent representations using electronic health records (Springer, 2022-09-05)
Chushig-Muzo, David; Soguero Ruiz, Cristina; de Miguel Bohoyo, Pablo; Mora Jiménez, Inmaculada
The number of patients with chronic diseases such as diabetes and hypertension has reached alarming levels worldwide. These diseases increase the risk of developing acute complications and involve a substantial economic burden and demand for health resources. The widespread adoption of Electronic Health Records (EHRs) is opening great opportunities for supporting decision-making. Nevertheless, data extracted from EHRs are complex (heterogeneous, high-dimensional and usually noisy), which hampers knowledge extraction with conventional approaches. We propose the use of the Denoising Autoencoder (DAE), a Machine Learning (ML) technique that transforms high-dimensional data into latent representations (LRs), thus addressing the main challenges posed by clinical data. In this work we explore how the combination of LRs with a visualization method can be used to map patient data into a two-dimensional space, gaining knowledge about the distribution of patients with different chronic conditions. Furthermore, this representation can also be used to characterize the evolution of the patient's health status, which is of paramount importance in the clinical setting.
To obtain clinical LRs, we considered real-world data extracted from EHRs linked to the University Hospital of Fuenlabrada in Spain. Experimental results showed the great potential of DAEs to identify patients with clinical patterns linked to hypertension, diabetes and multimorbidity. The procedure allowed us to find patients with the same main chronic disease but different clinical characteristics. Thus, we identified two kinds of diabetic patients with differences in their drug therapy (insulin dependent and non-insulin dependent), and also a group of women affected by hypertension and gestational diabetes. We also present a proof of concept for mapping the health status evolution of synthetic patients when considering the most significant diagnoses and drugs associated with chronic patients. Our results highlighted the value of ML techniques for extracting clinical knowledge, supporting the identification of patients with certain chronic conditions. Furthermore, the progression of the patient's health status in the two-dimensional space might be used as a tool by clinicians aiming to characterize health conditions and identify their most relevant clinical codes.

Item: Machine Learning and Knowledge Management for Decision Support. Applications in Promotional Efficiency and Healthcare (Universidad Rey Juan Carlos, 2015)
Soguero Ruiz, Cristina
The development of Information and Communication Technologies over recent decades has brought with it the ever-growing collection and storage of data in fields as diverse as marketing, health and security. The availability of large amounts of data calls for new machine learning paradigms capable of analyzing those data automatically and extracting information from them.
Specifically, machine learning techniques make it possible to design nonparametric statistical models that learn the relationships present in a sufficiently representative set of examples, each consisting of some observed variables (features) and the corresponding output. The resulting model should be able to generalize, that is, to produce a suitable output for input examples not considered during the design phase. In recent years, these techniques have advanced dramatically, both in their theoretical foundations and in their application to many different knowledge domains. The general objective of this Thesis is the theoretical development and implementation of machine learning methods, with emphasis on the feature selection and predictive model design stages, so that large amounts of data of diverse nature can be analyzed, creating procedures that are specific to each stage yet applicable in different fields. This Thesis addresses three specific areas of growing economic and social interest: (a) modeling the interactions among everyday consumer products and their promotional efficiency; (b) decision support for the early prediction of complications after colon cancer surgery; (c) risk stratification of sudden cardiac death from predictive indices obtained from the electrical signals of the heart, using a clinical knowledge model and a standardized terminology. The data analysis in each of these applications shares the use of machine learning techniques, in line with the general objective. However, the very different nature of these applications means that each one constitutes a specific objective of this Thesis in its own right.
The first specific objective is to deepen the evaluation and analysis of promotional sales, traditionally based on classical statistical techniques. Substantial decision support must necessarily come from the systematic analysis of massive data on the control and monitoring of promotions and their complex interactions. To this end, the analysis and statistical comparison of different machine learning techniques is proposed. Another field, very different in nature from the previous one but of unquestionable social interest, is health. The analysis of clinical data, both structured (vital signs or blood tests) and unstructured (free text in clinical documents), collected longitudinally and systematically in the electronic health records (EHRs) of a large set of patients, can substantially increase clinical knowledge and support decision-making. However, machine learning techniques and data analysis have so far had limited reach in this field. This is mainly due to the difficulty of extracting useful information from clinical data coming from heterogeneous sources. In addition, there are very few precedents of systems that allow the automatic exploitation of information at an aggregate level across different hospitals, and there is a great need for data that can serve as a basis for scientific progress with greater impact on clinical practice. This Thesis analyzes two health domains of high prevalence in the Western world, namely colon cancer and heart disease. The second specific objective is the adaptation and application of machine learning methods for the early detection of complications after colon cancer surgery, analyzing both individually and jointly variables from heterogeneous sources, all of them extracted from the EHR.
The third specific objective is the creation of clinical knowledge models that make it possible to exchange data and semantically understand the clinical information in different EHRs. In recent years, numerous predictive indices of cardiac risk have been proposed. Specifically, this Thesis analyzes the heart rate turbulence domain, because it is a predictor of sudden cardiac death with clear and concise clinical guidelines. The analysis of large amounts of data and the theoretical development of new statistical learning algorithms are today, without doubt, a very active research area in different domains. This Thesis contributes to improving knowledge and decision-making in real applications of very diverse nature that nonetheless share clear common denominators.

Item: Ontology for Heart Rate Turbulence Domain Applying the Conceptual Model of SNOMED-CT (2012-07-18)
Soguero Ruiz, Cristina
Although cardiovascular risk stratification (CVRS) based on ECG-derived indices has been studied in depth, many current findings are not widely used in clinical practice. We hypothesized that, in addition to the necessary scientific evidence, a clear and standardized connection among the current knowledge in the scientific literature, its availability to the cardiologist, and the actual patient data is also necessary for the practical implementation and refinement of these indices. For this purpose, we implemented a standardized framework for CVRS based on ECG-derived indices, focused on the current knowledge of Heart Rate Turbulence (HRT) indices (with concise guidelines and clear procedures for parameter calculation). An ontology for HRT was built according to a set of logical and relational rules, yielding the class hierarchy model and its corresponding inferred model (Protégé-OWL, 4.1) for completeness. Unlike other biomedical ontologies, ours was based on the international standard SNOMED-CT.
The SNOMED-CT model considers not only terminology but also properties and relationships, which guaranteed standardization and compatibility with current and emerging Electronic Health Records. Our HRT ontology consisted of 308 concepts (289 from SNOMED-CT, and 19 from a local extension to model the main concepts of the HRT domain). As an application example, a database of 27 instances of patients with HRT from 24-hour Holter monitoring recordings was considered, with basic HRT indices and also conventional and emergent signal processing calculations. A consistency of 86% and 77% was achieved between the averaged procedure for HRT index calculations given in the guidelines and with a filtering procedure.