In the last few years, 16S rRNA gene sequencing (16S rDNA-seq) has seen a surprisingly rapid increase in adoption as a methodology for microbial community studies. Despite the considerable popularity of this technique, few tools are currently available for pre-processing and simulating 16S rDNA-seq count data that take the peculiar characteristics of these data into account.
In this work we present metaSPARSim, a sparse count matrix simulator intended for use in the development of pipelines for 16S rRNA metagenomic data processing. metaSPARSim implements a new generative process that models the sequencing process with a multivariate hypergeometric distribution in order to realistically reproduce 16S rDNA-seq count data, taking into account their characteristic aspects, such as compositionality and sparsity. It provides ready-to-use count matrices and lets users either reproduce different pre-coded scenarios or tune internal parameters to create a tailored count matrix that better fits prior information or specific characteristics an expert user may want to consider.
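As a rough illustration of why a multivariate hypergeometric model yields compositional and sparse counts, the following minimal Python sketch draws reads without replacement from a finite pool of molecules. It is not metaSPARSim's implementation: the function name `simulate_counts`, the pool sizes and the library sizes are hypothetical values chosen for illustration, and the only library call used is NumPy's `Generator.multivariate_hypergeometric`.

```python
import numpy as np


def simulate_counts(abundances, library_sizes, seed=0):
    """Draw one count vector per sample by sampling reads without replacement.

    abundances    : number of molecules available for each taxon (hypothetical pool)
    library_sizes : number of reads to draw for each sample (sequencing depth)
    """
    rng = np.random.default_rng(seed)
    abundances = np.asarray(abundances, dtype=np.int64)
    # One row per sample, one column per taxon.
    return np.stack(
        [rng.multivariate_hypergeometric(abundances, n) for n in library_sizes]
    )


# Hypothetical pool for 6 taxa: two dominant taxa and several rare ones.
pool = [50_000, 20_000, 500, 20, 5, 1]
counts = simulate_counts(pool, library_sizes=[1_000, 2_000, 500])

print(counts)
print(counts.sum(axis=1))  # each row sums exactly to its library size
```

Because every row is constrained to sum to its library size, the counts carry only relative information (compositionality), and rare taxa are frequently drawn zero times at low sequencing depth (sparsity).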
metaSPARSim was shown to generate count matrices resembling real 16S rDNA-seq data. Count data simulators are extremely valuable both for method developers, who need a ground truth for tool validation, and for tool users, who want to assess state-of-the-art analysis tools in order to choose the most accurate one. Thus, we believe that metaSPARSim could be a valuable tool for researchers involved in developing, testing and using robust and reliable data analysis methods in the context of 16S rRNA gene sequencing.