Software Notes: 

SimExpr2SampleData contains simulated gene expression data for benchmarking classification and feature selection methods. Each dataset is a matrix of simulated expression values of 10000 genes (G1, G2, ..., G10000) for 1000 subjects, partitioned into two groups of subjects (healthy vs. diseased, or two different variants of the same disease). SimExpr2SampleData also includes the list of genes selected as putative biomarkers, i.e. the genes whose expression is modified by the knock-out (or knock-down) in the diseased subjects.
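To make the layout described above concrete, the following is a minimal Python sketch that builds a random stand-in with the same gene naming (G1 to G10000) and a two-group partition of the subjects. It does not read the actual files. The function names, the orientation (genes as rows, subjects as columns), the equal-sized groups and the Gaussian values are all assumptions made for illustration; the README files in each folder describe the real format.

```python
import random


def make_placeholder_dataset(n_genes=10000, n_subjects=1000, seed=0):
    """Build a random stand-in with the layout described above (NOT the real data)."""
    rng = random.Random(seed)
    # Gene labels follow the G1, G2, ... naming used in the notes.
    gene_ids = [f"G{i}" for i in range(1, n_genes + 1)]
    # Assumption: the first half of the subjects is group 0, the rest group 1.
    half = n_subjects // 2
    groups = [0] * half + [1] * (n_subjects - half)
    # Assumption: one row per gene, one column per subject.
    matrix = [
        [rng.gauss(0.0, 1.0) for _ in range(n_subjects)]
        for _ in range(n_genes)
    ]
    return gene_ids, groups, matrix


def split_by_group(matrix, groups):
    """Split the subject columns of the matrix into the two groups."""
    cols0 = [j for j, g in enumerate(groups) if g == 0]
    cols1 = [j for j, g in enumerate(groups) if g == 1]
    group0 = [[row[j] for j in cols0] for row in matrix]
    group1 = [[row[j] for j in cols1] for row in matrix]
    return group0, group1


# Small sizes keep the demonstration fast; the real data are 10000 x 1000.
gene_ids, groups, matrix = make_placeholder_dataset(n_genes=5, n_subjects=4)
group0, group1 = split_by_group(matrix, groups)
print(gene_ids)
print(len(group0[0]), len(group1[0]))
```

A plain list of lists is used only to keep the sketch dependency-free; for the full-size matrices an array library would be a more practical choice.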

The .zip file contains two folders, “Heathy_vs_Diseased” and “Disease_variants_1vs2”, which correspond to two different case studies. The former contains the data used in Di Camillo et al. (PLoS One 2012). The latter uses the same scheme for simulating population variability as described in the PLoS One article but, to make the classification task more challenging, defines two different variants of the same disease (e.g. ER+ vs. ER- in breast cancer sample classification). The README file in each folder describes in detail how the data were generated.

For any questions, please contact us by sending an email to barbara.dicamillo @


If you download these data and use them in a published work, please cite the associated article:

Di Camillo B, Sanavia T, Martini M, Jurman G, Sambo F, Barla A, Squillario M, Furlanello C, Toffolo G, Cobelli C. Effect of size and heterogeneity of samples on biomarker discovery: synthetic and real data assessment. PLoS One. 2012;7(3):e32200.