Development and validation of data analysis automation methods using pattern recognition
Date
2020
Publisher
Universidad Rey Juan Carlos
Abstract
Data analysis automation is an area of growing interest thanks to the increasing need to
process large amounts of data in a timely fashion, the large volumes of labeled data generated
collectively, and the recent technological advances that enabled widespread adoption of
multicore computing. This thesis explores three main areas where analysis automation has
proven essential.
First, the field of cellular astronomy is studied. Cell astronomy is a specific type of
low-magnification imaging cytometry where fluorescent samples are imaged with a magnification
such that cells are only a couple of pixels in radius. While significantly increasing the field of
view and enabling cheap and quick analysis of thousands of cells, cell astronomy introduces
some important challenges. These challenges include the detection of bright spots at low
signal-to-noise ratios (SNRs), the estimation of cell diameter in the presence of partial volume
effects, and the estimation of fluorescence intensity despite local background fluorescence.
Fortunately, cell astronomy images resemble both astronomy and superresolution microscopy
images, so popular image analysis methods in these fields can be used to overcome some
of the main challenges. Using these fields as inspiration, a novel image analysis pipeline
was created, which estimates both fluorescence intensity and cell diameter by fitting a
heterogeneous mixture model using expectation maximization. This method is explained
thoroughly in the included journal publication, which also validates the proposed pipeline
using cell controls and microbeads.
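
As a rough illustration of fitting a heterogeneous mixture with expectation maximization, the sketch below fits a one-dimensional mixture of a single Gaussian and a uniform background. This is an assumption made for illustration only: the abstract does not specify the components of the thesis model, and the function name `fit_gaussian_plus_uniform` and the synthetic data are hypothetical.

```python
import math
import random


def fit_gaussian_plus_uniform(xs, lo, hi, n_iter=200):
    """EM for an illustrative two-component mixture (not the thesis model):
    one Gaussian plus one uniform background on [lo, hi].
    Returns (weight, mean, std) of the Gaussian component."""
    n = len(xs)
    w = 0.5                                   # initial Gaussian weight
    mu = sum(xs) / n                          # initial mean: sample mean
    sigma = math.sqrt(sum((x - mu) ** 2 for x in xs) / n)
    u = 1.0 / (hi - lo)                       # constant uniform density
    for _ in range(n_iter):
        # E-step: responsibility of the Gaussian for each sample
        resp = []
        for x in xs:
            g = math.exp(-0.5 * ((x - mu) / sigma) ** 2)
            g /= sigma * math.sqrt(2.0 * math.pi)
            resp.append(w * g / (w * g + (1.0 - w) * u))
        # M-step: responsibility-weighted parameter updates
        total = sum(resp)
        w = total / n
        mu = sum(r * x for r, x in zip(resp, xs)) / total
        var = sum(r * (x - mu) ** 2 for r, x in zip(resp, xs)) / total
        sigma = max(math.sqrt(var), 1e-6)     # guard against collapse
    return w, mu, sigma


# Synthetic data (hypothetical): 700 "signal" samples over a flat background.
random.seed(0)
samples = [random.gauss(5.0, 0.5) for _ in range(700)]
samples += [random.uniform(0.0, 10.0) for _ in range(300)]
weight, mean, std = fit_gaussian_plus_uniform(samples, 0.0, 10.0)
```

The two components have different functional forms, which is what makes the mixture heterogeneous; the E-step and M-step structure stays the same as for a purely Gaussian mixture.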
Second, we explore the task of automatic chromosome identification. Chromosomes
can be imaged using a technique called multiplex fluorescence in situ hybridization, where
chromosomes are labeled using at least five different fluorescent probes, and captured using
multispectral imaging. Despite the developments in chromosome labeling, the analysis of these
images remains a manual or semiautomated process, where karyotyping is performed
using both spectral and spatial information. Due to the recent popularity of convolutional
networks, and their near-human performance on multiple tasks, we hypothesize that
image segmentation using convolutional networks can achieve state-of-the-art results for the
analysis of multispectral chromosome images. To test this, we have published a paper
where a convolutional approach for chromosome identification is proposed. The attached
journal publication describes an end-to-end segmentation network for the interpretation of
multispectral chromosome images which uses both spectral and spatial information. The
proposed method was evaluated using a publicly available dataset, outperforming previous
automated methods, and achieving an average correct classification ratio (CCR) that has only
been previously achieved using semiautomated approaches.
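
The abstract does not define the CCR. The sketch below assumes one common reading, the fraction of chromosome (non-background) pixels that receive the correct class label; the function name, the background label `0`, and the toy label maps are hypothetical.

```python
def correct_classification_ratio(truth, predicted, background=0):
    """Fraction of non-background ground-truth pixels whose predicted
    label matches the ground truth (assumed definition of CCR)."""
    correct = 0
    total = 0
    for truth_row, pred_row in zip(truth, predicted):
        for t, p in zip(truth_row, pred_row):
            if t == background:
                continue          # background pixels are not scored
            total += 1
            if t == p:
                correct += 1
    if total == 0:
        raise ValueError("ground truth contains no chromosome pixels")
    return correct / total


# Toy 3x4 label maps: 0 is background, 1 and 2 are chromosome classes.
truth = [[0, 1, 1, 0],
         [0, 1, 2, 2],
         [0, 0, 2, 2]]
predicted = [[0, 1, 1, 0],
             [0, 2, 2, 2],
             [0, 0, 2, 1]]
ccr = correct_classification_ratio(truth, predicted)
```

In this toy example, 5 of the 7 chromosome pixels are labeled correctly.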
Third, we investigate seismic phase picking automation. Phase picking deals with the
identification of the arrival times of seismic waves, which is usually performed manually, or
in a semiautomated fashion. The process of performing manual picks is described, underlining
how this cumbersome process is often overlooked, leading to intra- and intersubject
biases. Additionally, while there are widely available algorithms that automate this task,
they are dated and do not offer the performance necessary to fully offload the job. On the
other hand, while convolutional network approaches have been proposed for the analysis
of seismic phases, we show some important issues that arise when directly applying regression
or segmentation networks. In order to overcome these issues, we propose a two-stage
convolutional network where the first step computes a rough segmentation mask,
the second step computes a distance map to pinpoint the precise location, and then both
steps are combined using an adaptation of the Hough transform. The proposed network
was evaluated on publicly available data collected by the Northern California Earthquake
Data Center (NCEDC), achieving a mean absolute error lower than previously proposed
convolutional networks.
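
One way to picture how a rough mask and a distance map could be combined by Hough-style voting is sketched below on a one-dimensional trace. This is an assumed simplification, not the adaptation used in the thesis: every sample inside the mask casts a vote for the position its predicted signed distance points to, and the pick is the position with the most votes. The function name and the toy inputs are hypothetical.

```python
def vote_for_pick(mask, signed_distance):
    """Hough-style voting on a 1D trace (illustrative assumption).
    mask[i] is truthy where the rough segmentation fires;
    signed_distance[i] is the predicted offset from sample i to the pick."""
    n = len(mask)
    votes = [0] * n
    for i in range(n):
        if not mask[i]:
            continue                      # only masked samples vote
        target = i + round(signed_distance[i])
        if 0 <= target < n:
            votes[target] += 1
    best = max(range(n), key=lambda k: votes[k])
    return best, votes


# Toy trace: 20 samples, arrival at index 12, mask covering indices 8..16.
n_samples, true_pick = 20, 12
mask = [8 <= i <= 16 for i in range(n_samples)]
distance = [true_pick - i for i in range(n_samples)]
distance[9] += 2      # two deliberately wrong predictions
distance[15] -= 1
pick, votes = vote_for_pick(mask, distance)
```

Because the result is decided by a majority of votes, a few inaccurate distance predictions do not move the pick, which is the usual motivation for voting schemes of this kind.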
Finally, an additional chapter includes some smaller contributions. On the one hand, an air
quality forecasting method is presented. This method uses long short-term memory (LSTM)
units to analyze a time series comprising both air quality and meteorological information.
Then, the method is compared with Caliope, a model-based air quality forecasting method,
achieving a lower mean squared error on the open data published by the city of Madrid.
On the other hand, we study wood conductivity assessment using xylem cross sections.
Traditionally, a specific set of hand-tuned parameters would be necessary to analyze each tree
species. This thesis shows that convolutional networks can learn the features used to segment
conductive elements and ring paths of multiple tree species simultaneously. Additionally, a
web application, Xyat (https://xyat.app), has been developed to enable researchers to use the
proposed method without installing any software.
Description
Doctoral thesis defended at the Universidad Rey Juan Carlos, Madrid, in 2020. Thesis supervisor: Norberto Malpica González de Vega
Except where otherwise noted, this item's license is described as Attribution-NonCommercial-NoDerivatives 4.0 International