In the last years, both sequencing and microarray have been widely used to search for relations between genetic variations and predisposition to complex pathologies such as diabetes or neurological disorders. In general, a combination of rare and common variants influencing the genotype with a protective or detrimental effect is likely to contribute to the disease, an ideal method should be robust to different MAF, to different direction of genotype effects and to the number of associated SNPs within the SNP-set being analyzed.
ABACUS, Algorithm based on a BivAriate CUmulative Statistic, is designed to identify SNPs significantly associated with a disease within predefined sets of SNPs such as pathways or genomic regions. Applied to a whole SNP dataset, ABACUS gives as output a list of SNP-sets associated with the disease and, for each SNP-set, the list of significant SNPs. ABACUS is robust to the concurrent presence of SNPs with protective and detrimental effects and of common and rare variants; moreover it is powerful even when few SNPs in the SNP-set are associated with the phenotype.
ABACUS first requires the definition of the SNP-sets, such as pathways, genes or genomic regions encoding a priori information on the potential point effects of the SNPs in each subset. We consider biological pathways as the preferred definition of SNP-sets, since studying the cumulative variation of SNPs mapping on genes in the same pathway (interacting genes) might fill in part the missing heritability and guide mechanistic studies helping uncovering the underlying disease pathways. Moreover, ABACUS is particularly suited for pathway analysis, given its ability of simultaneously considering common and rare variants and different direction of genotype effects.
An R package is available here
A rule-based model of insulin signalling pathway
The insulin signalling pathway (ISP) is an important biochemical pathway, which regulates some fundamental biological functions such as glucose and lipid metabolism, protein synthesis, cell proliferation, cell differentiation and apoptosis. In the last years, different mathematical models based on ordinary differential equations have been proposed in the literature to describe specific features of the ISP, thus providing a description of the behaviour of the system and its emerging properties. However, protein-protein interactions potentially generate a multiplicity of distinct chemical species, an issue referred to as “combinatorial complexity”, which results in defining a high number of state variables equal to the number of possible protein modifications. This often leads to complex, error prone and difficult to handle model definitions.
Here, we present a comprehensive model of the ISP, which integrates three models previously available in the literature by using the rule-based modelling (RBM) approach. RBM allows for a simple description of a number of signalling pathway characteristics, such as the phosphorylation of signalling proteins at multiple sites with different effects, the simultaneous interaction of many molecules of the signalling pathways with several binding partners, and the information about subcellular localization where reactions take place. Thanks to its modularity, it also allows an easy integration of different pathways.
The model is available here
Increasing attention has recently been devoted, in the bioinformatics community, to Bayesian Networks, which are probabilistic graphical models that encode in a graph-based form the joint probability distribution of a set of random variables. Given a dataset, consisting of several observations for a set of variables, a common problem is to learn the most probable network that may have generated the dataset. In biological contexts, the problem is often complicated by missing values in the data due to out-of-threshold measurements, lost observations or impossibility of taking measures.
We present an R package, bnstruct, that performs structure and parameter learning even in the presence of missing values using state-of-art algorithms for network learning and provides also methods for imputation, bootstrap re-sampling of the data and inference. bnstruct can handle both discrete and continuous variables in the dataset manipulation and imputation. However, as a design choice, learning is implemented with discrete variables alone, i.e. continuous variables are quantized after imputation. To our knowledge, there are no open source packages that use state-of-art algorithms for structure learning and inference in case of missing data.
Source code of the bnstruct algorithm is released under the GNU General Public Licence and is available here
Multifactorial diseases arise from complex patterns of interaction between a set of genetic traits and the environment. To fully capture the genetic biomarkers that jointly explain the heritability component of a disease, thus, all SNPs from a genome-wide association study should be analyzed simultaneously.
In this paper, we present Bag of Naïve Bayes (BoNB), an algorithm for genetic biomarker selection and subjects classification from the simultaneous analysis of genome-wide SNP data. BoNB is based on the Naïve Bayes classification framework, enriched by three main features: bootstrap aggregating of an ensemble of Naïve Bayes classifiers, a novel strategy for ranking and selecting the attributes used by each classifier in the ensemble and a permutation-based procedure for selecting significant biomarkers, based on their marginal utility in the classification process. BoNB is tested on the Wellcome Trust Case-Control study on Type 1 Diabetes and its performance is compared with the ones of both a standard Naïve Bayes algorithm and HyperLASSO, a penalized logistic regression algorithm from the state-of-the-art in simultaneous genome-wide data analysis.
The significantly higher classification accuracy obtained by BoNB, together with the significance of the biomarkers identified from the Type 1 Diabetes dataset, prove the effectiveness of BoNB as an algorithm for both classification and biomarker selection from genome-wide SNP data.
Source code of the BoNB algorithm is released under the GNU General Public Licence and is available here
Next-generation sequencing, and particularly 16S ribosomal RNA (16S rRNA) gene sequencing, is a powerful technique for the identification and quantification of human-resident microbes, collectively known as the human microbiota.
Once bacterial abundances are profiled via 16S rRNA gene sequencing and summarized in a count data set, diversity indices provide valuable mathematical tools to investigate the composition of the human microbiota. In brief, alpha diversity can be used to describe the taxonomical complexity of a single sample, whereas beta diversity can be used to identify differences between samples.
The DiversitySeq package implements in a unified framework the whole panel of diversity indices reviewed in (Finotello et al., 2016), enabling the assessment of diversity from count data sets. DiversitySeq also implements a simulator for the generation of synthetic count data sets from 16S rRNA gene sequencing.
Besides 16S rRNA gene sequencing data, this package can be employed with other data sets with similar characteristics, such as 5S rRNA gene sequencing, environmental metagenomics or, more generally, any kind of matrix were counts are computed for different non-overlapping classes.
DiversitySeq is available here (Latest update: 20/10/2016)
Dynamic expression data, nowadays obtained using high-throughput RNA sequencing (RNA-seq), are essential to monitor transient gene expression changes and to study the dynamics of their transcriptional activity in the cell or response to stimuli. FunPat is an R package designed to provide:
- a useful tool to analyze time series genomic data;
- a computational pipeline which integrates gene selection, clustering and functional annotations into a single framework to identify the main temporal patterns associated to functional groups of differentially expressed genes;
- an easy way to exploit different types of annotations from currently available databases (e.g. Gene Ontology) to extract the most meaningful information characterizing the main expression dynamics;
- a user-friendly organization and visualization of the outcome, automatically linking the differentially expressed genes and their temporal patterns to the functional information for an easy biological interpretation of the results.
R package is available here (Latest update: 23/06/2015)
Di Camillo B, Hakaste L, Sambo F, Gabriel R, Kravic J, Isomaa B, Tuomilehto J, Alonso M, Longato E, Facchinetti A, Groop LC, Cobelli C, Tuomi T. HAPT2D: High accuracy of prediction of T2D with a model combining basic and advanced data depending on availability. European Journal of Endocrinology. 2018 Apr; 178(4):331-341.
HAPT2D is a prediction tool to identify patients at risk of developing diabetes.
It is designed for predicting the risk of T2D using various amounts of background information depending on data availability.
Scenario 1 includes the fewest variables without any laboratory tests (age, gender, marital status, BMI, physical activity, medical treatment of hypertension, history of hyperglycemia, family history, …).
Scenario 2 adds to scenario 1 the usual information available to a health professional clinician after a general visit and routine laboratory tests (pulse, blood pressure, fasting glucose, triglycerides, …).
Scenario 3 complements scenario 2 with results from an Oral Glucose Tolerance Test.
The web tool is available here
Finotello F, Lavezzo E, Bianco L, Barzon L, Mazzon P, Fontana P, Toppo S, Di Camillo B. Reducing bias in RNA sequencing data: a novel approach to compute counts. BMC Bioinformatics. 2014 Jan; 15 (Suppl 1):S7.
Maxcounts is a novel approach to compute exon counts from RNA-seq reads aligned on a reference genome (Finotello at al., 2014). Once reads have been aligned to an exon, using any alignment tool, read coverage can be exploited to obtain counts for every position in its sequence (i.e. the number of reads covering each base along its sequence). maxcounts quantify exon expression as the maximum of its positional counts.
All the codes are freely available under a GNU Public License (Version 2) and can be downloaded here
In the last few years, 16S rRNA gene sequencing (16S rDNA-seq) has seen a surprisingly rapid increase in election rate as a methodology to perform microbial community studies. Despite the considerable popularity of this technique, an exiguous number of specific tools are currently available for 16S rDNA-seq count data pre-processing and simulation that consider their peculiar characteristics. In this work we present metaSPARSim, a sparse count matrix simulator intended for usage in development of pipelines 16S rRNA metagenomic data processing. metaSPARSim implements a new generative process that models the sequencing process with a Multivariate Hypergeometric in order to realistically reproduce these data considering their characteristic aspects, such as compositionality and sparsity. It provides ready-to-use count matrices and comes with the possibility to reproduce different pre-coded scenarios or to tune internal parameters in order to create a tailored count matrix that better fits some prior information or specific characteristic an expert user may want to consider. metaSPARSim was proven to be able to generate count matrices resembling real 16S rDNA-sequencing data. The availability of count data simulators is extremely valuable both for methods developers, for which a ground truth for tools validation is needed, and for tool users who want to assess state of the art analysis tools for choosing the most accurate one. Thus, we believe that metaSPARSim could be a valuable tool for researchers involved in developing, testing and using robust and reliable data analysis methods in the context of 16S rRNA gene sequencing.
All the codes are freely available and can be downloaded here
Sambo F, Finotello F, Lavezzo E, Baruzzo G, Masi G, Peta E, Falda M, Toppo S, Barzon L, Di Camillo B. Optimizing PCR primers targeting the bacterial 16S ribosomal RNA gene. BMC Bioinformatics. 2018 Sep; 19(1):343.
Targeted amplicon sequencing of the 16S ribosomal RNA gene is one of the key tools for studying microbial diversity. The accuracy of this approach strongly depends on the choice of primer pairs and, in particular, on the balance between efficiency and specificity in the amplification of the different bacterial 16S sequences contained in a sample. mopo16S is a command line tool for the design of primer sets, based on multi-objective optimization, which simultaneously: 1) maximizes efficiency and specificity of target amplification; 2) maximizes the number of different bacterial 16S sequences matched by at least one primer; 3) minimizes the differences in the number of primers matching each bacterial 16S.
Source code of the mopo16S algorithm is released under the GNU General Public Licence and is available here
In the context of reverse engineering of biological networks, simulators are helpful to test and compare the accuracy of different reverse-engineering approaches in a variety of experimental conditions. A novel gene-network simulator is presented that resembles some of the main features of transcriptional regulatory networks related to topology, interaction among regulators of transcription, and expression dynamics. The simulator generates network topology according to the current knowledge of biological network organization, including scale-free distribution of the connectivity and clustering coefficient independent of the number of nodes in the network. It uses fuzzy logic to represent interactions among the regulators of each gene, integrated with differential equations to generate continuous data, comparable to real data for variety and dynamic complexity. Finally, the simulator accounts for saturation in the response to regulation and transcription activation thresholds and shows robustness to perturbations. It therefore provides a reliable and versatile test bed for reverse engineering algorithms applied to microarray data. Since the simulator describes regulatory interactions and expression dynamics as two distinct, although interconnected aspects of regulation, it can also be used to test reverse engineering approaches that use both microarray and protein-protein interaction data in the process of learning.
A web application is available here
Daberdaku S, Tavazzi E, Di Camillo B. A Combined Interpolation and Weighted K-Nearest Neighbours Approach for the Imputation of Longitudinal ICU Laboratory Data. Journal of Healthcare Informatics Research (2020): 1-15.
The presence of missing data is a common problem that affects almost all clinical datasets. Since most available data mining and machine learning algorithms require complete datasets, accurately imputing (i.e. “filling in”) the missing data is an essential step. This paper presents a methodology for the missing data imputation of continuous longitudinal clinical data based on the integration of linear interpolation and a weighted K-Nearest Neighbours (KNN) algorithm. The Maximal Information Coefficient (MIC) values among features are employed as weights for the distance computation in the KNN algorithm in order to integrate intra- and inter-patient information. An interpolation-based imputation approach is also employed and tested both independently and in combination with the KNN algorithm. The final imputation is carried out by applying the best performing method for each feature.
All the codes are freely available and can be downloaded here
Microarray time series studies are essential to understand the dynamics of molecular events. In order to limit the analysis to those genes that change expression over time, a first necessary step is to select differentially expressed transcripts. A variety of methods have been proposed to this purpose; however, these methods are seldom applicable in practice since they require a large number of replicates, often available only for a limited number of samples. In this data-poor context, we evaluate the performance of three selection methods, using synthetic data, over a range of experimental conditions.
Method 1 uses a threshold on individual samples based on a model of the experimental error. Method 2 calculates the area of the region bounded by the time series expression profiles, and considers the gene differentially expressed if the area exceeds a threshold based on a model of the experimental error. These two methods are compared to Method 3, recently proposed in the literature, which exploits splines fit to compare time series profiles. Application of the three methods to synthetic data indicates that Method 2 outperforms the other two both in Precision and Recall when short time series are analyzed, while Method 3 outperforms the other two for long time series.
These results help to address the choice of the algorithm to be used in data-poor time series expression study, depending on the length of the time series.
R scripts are available here
Studying biological networks at topological level is a major issue in computational biology studies and simulation is often used in this context, either to assess reverse engineering algorithms or to investigate how topological properties depend on network parameters. In both contexts, it is desirable for a topology simulator to reproduce the current knowledge on biological networks, to be able to generate a number of networks with the same properties and to be flexible with respect to the possibility to mimic networks of different organisms. We propose a biological network topology simulator, SimBioNeT, in which module structures of different type and size are replicated at different level of network organization and interconnected, so to obtain the desired degree distribution, e.g. scale free, and a clustering coefficient constant with the number of nodes in the network, a typical characteristic of biological networks. Empirical assessment of the ability of the simulator to reproduce characteristic properties of biological network and comparison with E. coli and S. cerevisiae transcriptional networks demonstrates the effectiveness of our proposal.
A web application is available here
Di Camillo B, Sanavia T, Martini M, Jurman G, Sambo F, Barla A, Squillario M, Furlanello C, Toffolo G, Cobelli C. Effect of size and heterogeneity of samples on biomarker discovery: synthetic and real data assessment. PLoS One. 2012;7(3):e32200.
The identification of robust lists of molecular biomarkers related to a disease is a fundamental step for early diagnosis and treatment. However, methodologies for the discovery of biomarkers using microarray data often provide results with limited overlap. These differences are imputable to 1) dataset size (few subjects with respect to the number of features); 2) heterogeneity of the disease; 3) heterogeneity of experimental protocols and computational pipelines employed in the analysis. In this paper, we focus on the first two issues and assess, both on simulated (through an in silico regulation network model) and real clinical datasets, the consistency of candidate biomarkers provided by a number of different methods.
We extensively simulated the effect of heterogeneity characteristic of complex diseases on different sets of microarray data. Heterogeneity was reproduced by simulating both intrinsic variability of the population and the alteration of regulatory mechanisms. Population variability was simulated by modeling evolution of a pool of subjects; then, a subset of them underwent alterations in regulatory mechanisms so as to mimic the disease state. The simulated data allowed us to outline advantages and drawbacks of different methods across multiple studies and varying number of samples and to evaluate precision of feature selection on a benchmark with known biomarkers. Although comparable classification accuracy was reached by different methods, the use of external cross-validation loops is helpful in finding features with a higher degree of precision and stability. Application to real data confirmed these results.
Simulated Expression Data are available here
The increasing interest in rare genetic variants and epistatic genetic effects on complex phenotypic traits is currently pushing genome-wide association study design towards datasets of increasing size, both in the number of studied subjects and in the number of genotyped single nucleotide polymorphisms (SNPs). This, in turn, is leading to a compelling need for new methods for compression and fast retrieval of SNP data. SNPack are a novel algorithm and file format for compressing and retrieving SNP data, specifically designed for large-scale association studies. Our algorithm is based on two main ideas: (i) compress linkage disequilibrium blocks in terms of differences with a reference SNP and (ii) compress reference SNPs exploiting information on their call rate and minor allele frequency. Tested on two SNP datasets and compared with several state-of-the-art software tools, our compression algorithm is shown to be competitive in terms of compression rate and to outperform all tools in terms of time to load compressed data.
All the codes are freely available under a GNU Public License (Version 2) and can be downloaded here
Single cell RNA-seq (scRNA-seq) count data show many differences compared with bulk RNA-seq count data, making the application of many RNA-seq pre-processing/analysis methods not straightforward or even inappropriate. For this reason, the development of new methods for handling scRNA-seq count data is currently one of the most active research fields in bioinformatics. To help the development of such new methods, the availability of simulated data could play a pivotal role. However, only few scRNA-seq count data simulators are available, often showing poor or not demonstrated similarity with real data.
In this work we present SPARSim, a scRNA-seq count data simulator based on a Gamma-Multivariate Hypergeometric model. We demonstrate that SPARSim allows to generate count data that resemble real data in terms of counts intensity, variability and sparsity, performing comparable or better than one of the most used scRNA-seq simulator, Splat. In particular, SPARSim simulated count matrices well resemble the distribution of zeros across different expression intensities observed in real count data.
All the codes are freely available under a GNU Public License (Version 3) and can be downloaded here
Tavazzi E, Daberdaku S, Vasta R, Calvo A, Chiò C, Di Camillo B. Exploiting mutual information for the imputation of static and dynamic mixed-type clinical data with an adaptive k-nearest neighbours approach. BMC Medical Informatics and Decision Making (2020): 20(suppl.5), 174.
Clinical registers constitute an invaluable resource in the medical data-driven decision making context. Accurate machine learning and data mining approaches on these data can lead to faster diagnosis, definition of tailored interventions, and improved outcome prediction. A typical issue when implementing such approaches is the almost unavoidable presence of missing values in the collected data.
We present the implementation of an adaptive mutual information-weighted k-nearest neighbours (wk-NN) imputation algorithm for clinical register data developed to explicitly handle missing values of continuous/ordinal/categorical and static/dynamic features conjointly. For each subject with missing data to be imputed, the method creates a feature vector constituted by the information collected over his/her first ‘window_size’ time units of visits. This vector is used as sample in a k-nearest neighbours procedure, in order to select, among the other patients, the ones with the most similar temporal evolution of the disease over time. An ad hoc similarity metric was implemented for the sample comparison, capable of handling the different nature of the data, the presence of multiple missing values and include the cross-information among features.
Source code of the wkNNMI algorithm is released under the GNU General Public Licence and is available here