Dr Marta Milo: Research Themes

The main focus of my professional career has been to develop truly interdisciplinary skills, complementing and refining my bioinformatics skills with a deep understanding of the biological nature of the data collected. This helps to better identify limitations in experimental designs and to better quantify variation in data collection and validation. My main stream of research has concentrated on the analysis and interpretation of high-throughput biological data, with the aim of producing feasible and robust hypotheses for a deeper understanding of the biological systems under study.

Quantifying Uncertainty in Biology with Probabilistic models:

In quantitative sciences, numerical knowledge alone is not enough to understand and predict the behaviour of systems that are only partially observed. Since the beginning of the 20th century it has been clear that predictions from data require additional “knowledge” to become meaningful. This knowledge needs to be quantified in a way that reflects our prior knowledge of the system and what we are able to measure. This marked the start of the concept of quantified uncertainty.

Advances in technology for the biological sciences enable us to apply the concept of uncertainty to complex biological data. Modern measurements, despite being complex, limited and restrictive at times, offer completely new insights into complex systems. My research interests focus mainly on exploring, developing and quantifying the concept of uncertainty in Biology. This becomes an important step when we want the predictions we make from complex data to be meaningful and satisfactory.

Here are some examples from my research where the use of uncertainty in modelling made a substantial difference in improving the accuracy and sensitivity of the analysis.

Microarray data analysis

Microarrays provided a practical method for measuring the expression of thousands of genes simultaneously. Although next generation sequencing has largely replaced these assays, a large amount of microarray data is still available in public databases, and this data can help to better design sequencing experiments with the insight of a high-throughput gene expression screen. For this reason, methods developed in the past to analyse microarray gene expression data are still a valuable resource. Microarray technology is associated with many significant sources of experimental uncertainty, which must be considered in order to make confident inferences from the data. Uncertainty cannot be estimated entirely from repeat experiments. Outliers are often due to flaws in the microarray technique or to problems in the hybridization of the biological material. In high-density oligonucleotide arrays, as well as in cDNA spotted arrays, the aim is to extract an estimate of gene expression levels from pixel intensity signals.

Specifically for oligonucleotide arrays, such as Affymetrix GeneChip®, multiple probes are associated with each target. They come in pairs: perfect match (PM) probes, designed to capture specific binding, and mismatch (MM) probes, designed to capture non-specific binding. The probe-set is used to measure the expression level of the target gene, and this measurement is then used to detect differentially expressed genes between conditions, or for visualisation, clustering or inference of gene networks. My research has mainly focused on developing computational tools that improve the accuracy of both low-level and downstream analysis of biological data. In collaboration with the PUMA (Propagating Uncertainty in Microarray Analysis) group, we have developed a family of probabilistic models that estimate gene expression levels with credibility intervals, which quantify the measurement variance associated with the estimates of target concentration within a sample.

The software puma is fully integrated in Bioconductor – Open Source Software for Bioinformatics.
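To make the idea of an expression estimate with a credibility interval concrete, here is a minimal Python sketch. It is not the model implemented in puma: it assumes a toy setting in which the log-scale signals of one probe-set are independent Gaussian measurements of a single expression level, with a known probe noise standard deviation and a Gaussian prior. The function name, its parameters and the example values are all hypothetical.

```python
import math
from statistics import NormalDist


def posterior_log_expression(log_probe_signals, probe_noise_sd,
                             prior_mean=0.0, prior_sd=10.0, level=0.95):
    """Toy conjugate-normal estimate of one gene's log expression.

    Assumption (not the puma model): each probe signal is the true
    log expression plus Gaussian noise with known standard deviation.
    Returns (posterior mean, posterior sd, (lower, upper) interval).
    """
    n = len(log_probe_signals)
    if n == 0:
        raise ValueError("at least one probe signal is required")

    prior_precision = 1.0 / prior_sd ** 2
    data_precision = n / probe_noise_sd ** 2
    posterior_precision = prior_precision + data_precision

    sample_mean = sum(log_probe_signals) / n
    posterior_mean = (prior_precision * prior_mean
                      + data_precision * sample_mean) / posterior_precision
    posterior_sd = math.sqrt(1.0 / posterior_precision)

    # Central credible interval of the Gaussian posterior.
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    interval = (posterior_mean - z * posterior_sd,
                posterior_mean + z * posterior_sd)
    return posterior_mean, posterior_sd, interval


if __name__ == "__main__":
    # Hypothetical log2 signals from four probes of one probe-set.
    mean, sd, (lower, upper) = posterior_log_expression(
        [8.0, 8.2, 7.9, 8.1], probe_noise_sd=0.2)
    print(f"estimate {mean:.3f}, 95% interval [{lower:.3f}, {upper:.3f}]")
```

The interval narrows as more probes agree, so downstream steps can weight each estimate by how well it is supported instead of treating every value as equally reliable.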

Next Generation Sequencing

I am currently working to extend the use of probabilistic models to Next Generation Sequencing (NGS) data, with particular focus on de novo isoform identification and data integration. In both cases we are integrating uncertainty into the models using probabilistic approaches, optimising computational time and accuracy. This research is done in close collaboration with the PUMA project, to extend it to NGS data applications.

I am involved in an NGS cross-faculty network in Sheffield and I am working with a consortium of 13 scientists from 8 different countries on the effect of splicing dysfunction on disease, using both RNA-Seq data and proteomics. Both networks provide questions and data for testing these developing methods.

Effect of splicing on disease

I am involved in a Network that was created with the aim of developing innovative and multidisciplinary approaches to investigate splicing dysfunction as a common mechanism of disease. The spliceosome generates different forms of mature RNA by splicing the precursor RNA molecule, called pre-mRNA, that is transcribed from DNA. This enables functionally diverse protein isoforms to be expressed according to different regulatory programs. By integrating data and expertise from this network of scientists, we focus on the disruption of normal splicing patterns, which is likely to contribute to the pathology underlying Motor Neuron Disease, Parkinson's Disease, Huntington's Disease, Rett Syndrome, X-linked agammaglobulinemia and Deafness. Despite the wealth of data generated and available in public databases, only convoluted signals of biological processes in disease can be identified and measured. Reverse engineering is required to produce informative knowledge from modern data. For this reason we are working on novel integrative data approaches for splicing dysfunction, based on synergy between computer models and experimental procedures, to:
• analyse RNA transcripts from the different compartments at cellular level; 
• identify new isoforms and quantify their expression from sequencing data;
• optimise methods of therapeutic splicing correction.

Effect of genetic mutations on selection in Embryonic Stem Cells

I am part of the 

Analysis of Single-cell population at whole genome level 

Experimental Biology:

I have also acquired skills in experimental biology that allow me to complement my bioinformatics skills with a deep understanding of the biological nature of the data and of the limitations and variations in the data collection.

Genetic profiling of mammalian inner ear:
Molecular mechanisms to stimulate sensory regeneration in the mammalian inner ear are commonly sought in studies of embryonic and post-natal development in animal models. These studies have revealed many genes that regulate the differentiation of sensory cells. A major challenge is to place these genes into the context of functional networks, in order to describe developmental processes in more detail and increase the chances of identifying useful therapeutic targets.

We used high-throughput gene expression assays, specifically microarray assays, to identify gene networks related to the transcription factors gata3 and gata2 during development in the mammalian inner ear, as well as to identify networks for sensory neural development in Igf-1 null mice, in collaboration with Prof Varela-Nieto's laboratory. The predictions made with probabilistic models for data collected in these studies were supported by extensive validation in vivo and in vitro, and opened the way to detailed questions that are still work in progress.

Figures 5 and 6

Figures 4, 5 and 6 in Milo et al., PLoS ONE 4(9): e7144. doi:10.1371/journal.pone.0007144

Gene expression profiling in Acute Coronary Syndrome:
Acute coronary syndrome (ACS) is the cause of over 114 000 UK hospital admissions and imposes large associated costs on the National Health Service. Advances in microarray technology allow a detailed understanding of genome-wide expression profiles of pathological processes. We hypothesised that analysis of ACS, at the time of an acute event and throughout recovery up to 90 days post event, would provide insight into pathology, as well as identify genes as potential drug targets and as both diagnostic and prognostic markers. Using microarray technology and screening of miRNA from whole blood samples, we aimed to identify specific biological pathways showing the late effects of acute events that can be used to discover biomarkers of coronary heart disease.

This study is still in its infancy and has identified a set of differentially expressed genes that are associated with relevant pathways, such as Rho GTPase cytoskeletal, endothelin signalling, integrin signalling, G-protein signalling and inflammation-mediated pathways. We used principal component analysis with propagated uncertainty (pumaPCA) to visualise and interpret the data. With clinical information incorporated, the data discriminated between patients, separating them into troponin-positive and troponin-negative groups across all time points.
Appropriate filtering of the data, together with the use of probabilistic models to combine replicates and define differential expression (pumaComb and PPLR), yielded a set of relevant genes. Hierarchical clustering, comparing the expression profiles between groups, identified different clusters of genes that increased in expression over time in the troponin-positive group.
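The two steps named above, combining replicates and scoring differential expression with a probability of positive log-ratio (PPLR), can be illustrated with a small Python sketch. This is a simplification under stated assumptions, not the models implemented in puma: each replicate is summarised by a Gaussian mean and standard deviation on the log scale, replicates are combined by precision weighting, and the two conditions are treated as independent. Function names and example numbers are hypothetical.

```python
import math
from statistics import NormalDist


def combine_replicates(means, sds):
    """Precision-weighted combination of replicate estimates.

    Assumption: each replicate gives an independent Gaussian estimate
    of the same log expression level, so noisier replicates get less
    weight. Returns (combined mean, combined sd).
    """
    if len(means) != len(sds) or not means:
        raise ValueError("means and sds must be non-empty and equal length")
    precisions = [1.0 / s ** 2 for s in sds]
    total_precision = sum(precisions)
    combined_mean = sum(p * m for p, m in zip(precisions, means)) / total_precision
    return combined_mean, math.sqrt(1.0 / total_precision)


def pplr(mean_a, sd_a, mean_b, sd_b):
    """Probability that log expression in condition A exceeds condition B.

    Assumption: independent Gaussian posteriors for the two conditions,
    so the log-ratio is Gaussian with the summed variance.
    """
    difference_sd = math.sqrt(sd_a ** 2 + sd_b ** 2)
    return NormalDist().cdf((mean_a - mean_b) / difference_sd)


if __name__ == "__main__":
    # Hypothetical replicate estimates for one gene in two groups.
    positive = combine_replicates([9.1, 8.8, 9.4], [0.3, 0.2, 0.6])
    negative = combine_replicates([8.2, 8.5, 8.1], [0.3, 0.4, 0.2])
    print("PPLR:", pplr(*positive, *negative))
```

Values near 1 suggest up-regulation in the first group, values near 0 suggest down-regulation, and values near 0.5 indicate that the data do not separate the two groups once measurement uncertainty is taken into account.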
Patient cohort:

50 patients presenting with chest pain consistent with ACS were recruited within 48 h of admission. 3 ml of peripheral whole blood was collected using Tempus RNA tubes at days 1, 3, 7, 30 and 90. Total RNA was extracted, cleared of globin mRNA and arrayed using Affymetrix HG-U133 Plus 2.0 GeneChips. Data were analysed using the open source software PUMA.
