Development and validation of data analysis automation methods using pattern recognition
Data analysis automation is an area of growing interest thanks to the increasing need to process large amounts of data in a timely fashion, the large volumes of labeled data generated collectively, and the recent technological advances that have enabled the widespread adoption of multicore computing. This thesis explores three main areas where analysis automation has proven essential. First, the field of cell astronomy is studied. Cell astronomy is a specific type of low-magnification imaging cytometry in which fluorescent samples are imaged at a magnification so low that cells are only a couple of pixels in radius. While it significantly increases the field of view and enables cheap and quick analysis of thousands of cells, cell astronomy introduces some important challenges. These include the detection of bright spots at low signal-to-noise ratios (SNRs), the estimation of cell diameter in the presence of partial volume effects, and the estimation of fluorescence intensity despite local background fluorescence. Fortunately, cell astronomy images resemble both astronomy and superresolution microscopy images, so popular image analysis methods from these fields can be used to overcome some of the main challenges. Using these fields as inspiration, a novel image analysis pipeline was created, which estimates both fluorescence intensity and cell diameter by fitting a heterogeneous mixture model using expectation maximization. This method is explained thoroughly in the included journal publication, which also validates the proposed pipeline using cell controls and microbeads. Second, we explore the task of automatic chromosome identification. Chromosomes can be imaged using a technique called multiplex fluorescence in situ hybridization, in which chromosomes are labeled using at least five different fluorescent probes and captured using multispectral imaging.
Despite the developments in chromosome labeling, the analysis of these images remains a manual or semiautomated process, in which karyotyping is performed using both spectral and spatial information. Given the recent popularity of convolutional networks and their near-human performance on multiple tasks, we hypothesize that image segmentation using convolutional networks can achieve state-of-the-art results for the analysis of multispectral chromosome images. To test this hypothesis, we have published a paper in which a convolutional approach for chromosome identification is proposed. The attached journal publication describes an end-to-end segmentation network for the interpretation of multispectral chromosome images that uses both spectral and spatial information. The proposed method was evaluated using a publicly available dataset, outperforming previous automated methods and achieving an average correct classification ratio (CCR) that had previously been achieved only with semiautomated approaches. Third, we investigate the automation of seismic phase picking. Phase picking deals with the identification of the arrival times of seismic waves, which is usually performed manually or in a semiautomated fashion. The process of performing manual picks is described, underlining how this cumbersome process is often overlooked, leading to intra- and intersubject biases. Additionally, while there are widely available algorithms that automate this task, they are dated and do not offer the performance necessary to fully offload the job. Moreover, although convolutional network approaches have been proposed for the analysis of seismic phases, we show some important issues that arise when regression or segmentation networks are applied directly.
To overcome these issues, we propose a two-stage convolutional network in which the first stage computes a rough segmentation mask, the second stage computes a distance map to pinpoint the precise location, and the two outputs are combined using an adaptation of the Hough transform. The proposed network was evaluated on publicly available data collected by the Northern California Earthquake Data Center (NCEDC), achieving a mean absolute error lower than that of previously proposed convolutional networks. Finally, an additional chapter includes some smaller contributions. On the one hand, an air quality forecasting method is presented. This method uses long short-term memory (LSTM) units to analyze a time series composed of both air quality and meteorological information. The method is then compared with Caliope, a model-based air quality forecasting method, achieving a lower mean squared error on the open data published by the city of Madrid. On the other hand, we study wood conductivity assessment using xylem cross sections. Traditionally, a specific set of hand-tuned parameters would be necessary to analyze each tree species. This thesis shows that convolutional networks can learn the features used to segment conductive elements and ring paths of multiple tree species simultaneously. Additionally, a web application, Xyat (https://xyat.app), has been developed to enable researchers to use the proposed method without installing any software.
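As a rough illustration of the two-stage phase picking idea described above, the following sketch shows one way a coarse mask and a distance map could be combined through Hough-style voting on a one-dimensional trace. It is not the thesis implementation: the function name `vote_for_pick`, the `min_votes` parameter, the signed-offset encoding of the distance map, and the synthetic inputs are all assumptions made for illustration.

```python
import numpy as np


def vote_for_pick(mask, distance, min_votes=1):
    """Hypothetical 1D Hough-style voting.

    mask:     boolean array, True where the coarse stage suspects an arrival.
    distance: float array, assumed signed offset (in samples) from each
              sample to the arrival it believes in.
    Returns the sample index with the most votes, or None if no sample
    gathers at least `min_votes` votes.
    """
    mask = np.asarray(mask, dtype=bool)
    distance = np.asarray(distance, dtype=float)
    n = mask.shape[0]

    # Only samples inside the coarse mask are allowed to vote.
    idx = np.flatnonzero(mask)
    targets = np.rint(idx + distance[idx]).astype(int)

    # Discard votes that fall outside the trace.
    targets = targets[(targets >= 0) & (targets < n)]
    if targets.size == 0:
        return None

    # Accumulate votes per sample and keep the strongest candidate.
    votes = np.bincount(targets, minlength=n)
    best = int(np.argmax(votes))
    return best if votes[best] >= min_votes else None


# Synthetic example (assumed data): the arrival is at sample 40.
n = 100
true_pick = 40
mask = np.zeros(n, dtype=bool)
mask[30:51] = True
distance = true_pick - np.arange(n, dtype=float)
distance[30] = 3.0  # one deliberately wrong offset, acting as an outlier
pick = vote_for_pick(mask, distance)
```

Because every masked sample casts a vote, a single wrong offset is outvoted by the consistent ones, which is the usual motivation for a voting scheme over taking a single regression output.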
Doctoral thesis defended at the Universidad Rey Juan Carlos, Madrid, in 2020. Thesis supervisor: Norberto Malpica González de Vega