SPARSim is an R tool for the simulation of single cell RNA-seq (scRNA-seq) count table. This vignette is an introduction to the use of SPARSim basic functions.

1. Installation


SPARSim can be downloaded at http://sysbiobig.dei.unipd.it/?q=SPARSim

It requires RCpp, scran and edgeR to work.

2. SPARSim input parameter


SPARSim simulation requires an input parameters to work, describing the macro characteristics of the desired synthetic count table. More specifically, for each experimental conditions to simulate SPARSim needs 3 information as input: gene expression level intensities, gene expression level variabilities and sample library sizes.

User change chose among 4 ways to provide SPARSim input parameter:

Section 2.1, 2.2, 2.3 and 2.4 describe each of the 4 input options.

2.1 Input parameter estimated from data

SPARSim allows user to estimate the input parameters from an existing count table. This is especially useful when the user want to simulate a count matrix with characteristic similar to an existing one.

As example of the estimation procedure, we will use a count matrix available in the SPARSim package (Example_count_matrix).

The count matrix contains 5000 genes and 150 cells. Cells belong to 3 experimental conditions (first 50 cells correspond to condition A, second 50 cells correspond to condition B and last 50 cells correspond to conditions C).

# Load data
load(Example_count_matrix)
dim(Example_count_matrix)

SPARSim provide the function estimate_parameter_from_data(raw_data, norm_data, conditions) to automatically estimate the SPARSim simulation parameter from a give count matrix. The function requires 3 input parameters

  • raw_data: the existing count matrix

  • norm_data: the existing normalized count matrix

  • conditions: a list indicating the experimental conditions. Each element in the list contains the index of the columns belonging to the experimental condition.

First, let normalize the count matrix. Here we use the scran normalization, but other normalization procedure could be used by the user. Function scran_normalization is SPARSim built-in function to perform the steps required by scran normalization (i.e. create a SingleCellExperiment object, compute normalization factor, perform normalization procedure and extract normalized count)

# Perform scran normalization
Example_count_matrix_norm <- scran_normalization(Example_count_matrix)

Then, let create the required conditions parameter:

# Get column index for each experimental condition
cond_A_column_index <- c(1:50) # Condition A column indices: from column 1 to column 50
cond_B_column_index <- c(51:100) # Condition B column indices: from column 51 to column 100
cond_C_column_index <- c(101:150) # Condition C column indices: from column 101 to column 150

# Create conditions param
Example_count_matrix_conditions <- list(cond_A = cond_A_column_index, 
                                        cond_B = cond_B_column_index, 
                                        cond_C = cond_C_column_index)

Now that all the required input parameters of function estimate_parameter_from_data() are available, the user could estimate the SPARSim simulation parameter from the existing count matrix and run the SPARSim simulation

# Create SPARSim simulation parameter through the estimation from an existing count matrix
SPARSim_sim_param <- estimate_parameter_from_data(raw_data = Example_count_matrix, 
                                                  norm_data = Example_count_matrix_norm, 
                                                  conditions = Example_count_matrix_conditions)
str(SPARSim_sim_param)

# Run SPARSim simulation using the just created simulation parameter
sim_result <- SPARSim(dataset_parameter = SPARSim_sim_param)

2.2 Input parameter from presets

SPARSim provides some parameters presets which allow the user to simulate count data resembling the characteristics of 6 existing count matrices (Bacher et al., Camp et al., Chu et al., Engel et al., Horning et al., Tung et al.), in terms of sparsity, library size and counts intensity/variability.

As example of using a parameter preset, we use the preset of Bacher dataset.

# Load Bacher preset
load(Bacher_preset) # Bacher_parameter_preset
str(Bacher_parameter_preset)

# Run SPARSim simulation using the just loaded parameter preset
sim_result <- SPARSim(dataset_parameter = Bacher_parameter_preset)

2.3 Input parameter specified by the user

SPARSim allows user to specify the input parameters by his/her own. As example, let simulate a count table with

  • 5000 genes

  • 2 experimental conditions A and B

  • condition A has 100 samples, with an average library size of 2M reads

  • condition B has 200 samples, with an average library size of 1M reads

The result of the simulation will be a count matrix of 5000 rows and 300 (100+200) columns.

For each experimental condition to simulate, user can use the function create_SPARSim_simulation_parameter_single_condition() to specify the required simulation information. The function has 3 mandatory input parameters intensity, variability and library_size which describe gene expression level intensities, gene expression level variabilities and sample library sizes, respectively. Moreover, 3 optional parameters are available to specify gene IDs, sample names and experimental condition ID (function parameters feature_names, sample_names and condition_name, respectively); if one or more of the optional parameter are not provided, the function automatically set some default values.

# Create simulation parameter for condition A
cond_A_param <- create_SPARSim_simulation_parameter_single_condition(
                                  intensity = runif(n = 5000, min = 0, max = 10000), 
                                  variability = runif(n = 5000, min = 0.001, max = 1), 
                                  library_size = round(rnorm(n = 100, mean = 2*10^6, sd = 10^3)),
                                  feature_names = paste0("Gene_",c(1:5000)), 
                                  sample_names = paste0("cond_A_cell_",c(1:100), 
                                  condition_name = "cond_A")

# Create simulation parameter for condition B
cond_B_param <- create_SPARSim_simulation_parameter_single_condition(
                                  intensity = runif(n = 5000, min = 0, max = 10000), 
                                  variability = runif(n = 5000, min = 0.001, max = 1), 
                                  library_size = round(rnorm(n = 200, mean = 1*10^6, sd = 10^3)),
                                  feature_names = paste0("Gene_",c(1:5000)), 
                                  sample_names = paste0("cond_B_cell_",c(1:200), 
                                  condition_name = "cond_B")

Then, the parameter of each experimental conditions must be collected in a list.

# Create SPARSim simulation parameter
SPARSim_sim_param <- list(cond_A_param, cond_B_param)

Now that the SPARSim input parameter is ready, it could be used to run the simulation

# Run SPARSim simulation using the just created simulation parameter
sim_result <- SPARSim(dataset_parameter = SPARSim_sim_param)

2.4 Combining different input parameters

SPARSim allows user to combine parameters obtained from the three ways described above. For additional details about SPARSim simulation parameter, please see section 2.5.

As example, let simulate a count table having

  • 17128 genes

  • 2 experimental conditions A and B

  • condition A has 50 samples, with intensity and variability from Bacher dataset preset (experimental condition Bacher_C1) and with library size estimated from Camp dataset count matrix

  • condition B has 100 samples, with intensity and variability from Bacher dataset preset (experimental condition Bacher_C2) and with an average library size of 0.5M reads (specified by user)

First, load Bacher parameter preset and Camp count table

# load Bacher data parameter preset
load(Bacher_preset)
Bacher_cond_1 <- Bacher_preset$Bacher_C1 # get parameter preset of experimental condition Bacher_C1
Bacher_cond_2 <- Bacher_preset$Bacher_C2 # get parameter preset of experimental condition Bacher_C2

# load Camp data count table
load(Camp_data_for_package)
Camp_lib_size <- estimate_library_size(Camp_data) # estimate library size from Camp data count table

Then, create simulation parameter for condition A, combining parameter from Bacher data preset (Bacher_C1) and direct estimation from Camp data count matrix

cond_A_param <- create_SPARSim_simulation_parameter_single_condition(
                                  intensity = Bacher_cond_1$intensity, 
                                  variability = Bacher_cond_1$variability, 
                                  library_size = sample(Camp_lib_size, size =  50), 
                                  condition_name = "cond_A")

Next, create simulation parameter for condition B, combining parameter from Bacher data preset (Bacher_C2) and parameter specified by the user

cond_B_param <- create_SPARSim_simulation_parameter_single_condition(
                                  intensity = Bacher_cond_2$intensity, 
                                  variability = Bacher_cond_2$variability, 
                                  library_size = round(rnorm(n = 100, mean = 0.5*10^6, sd = 0.01*10^6)), 
                                  condition_name = "cond_B")

Last, the parameter of each experimental conditions must be collected in a list. Once SPARSim input parameter is ready, it could be used to run the simulation

# Create SPARSim simulation parameter
SPARSim_sim_param <- list(cond_A = cond_A_param, cond_B = cond_B_param)

# Run SPARSim simulation using the just created simulation parameter
sim_result <- SPARSim(dataset_parameter = SPARSim_sim_param)

2.5 Additional details about SPARSim input parameter

SPARSim input parameter is implemented as an R list, where each element of the list contains the information to simulate a single experimental condition. As example, consider a SPARSim simulation parameter to create a synthetic count table describing two experimental conditions. Let sim_param be such SPARSim input parameter. Then sim_param is list of 2 elements, let call them cond_A and cond_B. Both cond_A and cond_B are list themselfs, containg the elements called intensity, variability and lib_size. So, using the just introduced notation, sim_param$cond_A$intensity would contain the gene expression level intensities used to simulate the first experimental condition, while sim_param$cond_B$lib_size would contain the sample library sizes used to simulate the second experimental condition. The just provided implementation details are useful only for users interested in particular simulation scenarios, such the one described in section 2.4. For the great majority of users, such low level implementation details are negligible, since SPARSim provides a complete set of functions to easly create the simulation parameters, as described in the previous sections.

3. Run SPARSim simulation


Once the input parameter are ready, SPARSim simulation can be launched calling the function SPARSim(dataset_parameter). Let SPARSim_sim_param be the input simulation parameter obtained as described in section 2, the SPARSim simulation could be launched as follow:

# Run SPARSim simulation using the created simulation parameter
sim_result <- SPARSim(dataset_parameter = SPARSim_sim_param)

sim_result will contain the output of the SPARSim simulation, details about the output structure are provided in the next section.

4. SPARSim simulation output


The output of SPARSim simulation is a list of two elements:

Please note that SPARSim uses a mixed model, made of a first step to simulate the biological variability (i.e. simulate the gene expression across cells belonging to the same experimental condition) followed by a second step to simulate the technical variability (i.e. given as input the gene expression, simulate the count table resulting from the experimental/sequencing procedure). In this framework, sim_abundance could be considered as the output the the first step, while count_matrix is the output of the second (i.e. last) step.

Considering an in-silico scRNA-seq experiment, sim_abundance provides the (simulated) gene expression level in the cells, and so the quantities of interest in a sequencing quantification study, while count_matrix provided the (simulated) measured gene expression level through the sequencing experiment, as in any real count table.

Therefore, sim_abundance represent the TRUE gene expression (unknown in a real experiment), while count_matrix represents the MEASURED gene expression (in terms of read counts).

Compared to a real sequencing experiment, where only the count table (i.e. the MEASURED gene expression) is available, the main advantage of simulated data is that they provide both the TRUE quantities of interest and the MEASURED ones.

5. SPARSim additional simulation features


SPARSim provides some built-in functions to simulate the presence to of batch effects, spike-ins and bimodal genes. The following sections show how to use these functions.

5.1 Simulation of batch effects

To simulate batch effect, the first step is create the desidered batchs using the function create_batch(). See documentation of create_batch() for the complete details on how to use it. We will create 2 batches, named “Lane_1” and “Lane_2”.

batch_lane_1 <- create_batch(name = "Lane_1", distribution = "normal", param_A = 0, param_B = 1)
batch_lane_2 <- create_batch(name = "Lane_2", distribution = "gamma", param_A = 1, param_B = 1)

Then, let create a batch set (i.e. a collection of all the desired batches) using the function create_batch_set():

batch_set <- create_batch_set(batch_list = list(batch_1 = batch_lane_1, batch_2 = batch_lane_2))

Next, SPARSim requires to associate each batch to a set of samples (i.e. cells). As example consider the preset of Bacher dataset, which by default simulate a total of 366 samples.

In that preset, let simulate the presence of batch “Lane_1” of the first and last 50 samples (i.e. samples from 1 to 50 and from 316 to 366), while the remaining 266 samples (i.e. samples from 51 to 315) will be affected by batch “Lane_2”. The just mentioned scenario could be simulated as follow:

batch_sample_association <- c(rep("Lane1", 50), rep("Lane2", 266), rep("Lane1", 50))

Last, let create the SPARSim simulation parameter using the function create_SPARSim_batch_parameter() :

SPARSim_batch_parameter <- create_SPARSim_batch_parameter(batch_set = batch_set, 
                                                          batch_sample = batch_sample_association)

The presence of the just created batch effects to the Bacher data could be perfomed as follow

# Load Bacher data preset
load(Bacher_preset) # Bacher_parameter_preset

# Run SPARSim simulation using the Bacher parameter preset and the batch parameter created above
sim_result <- SPARSim(dataset_parameter = Bacher_parameter_preset, batch_parameter = SPARSim_batch_parameter)

5.2 Simulation of spike-in presence

To simulate spike-in presence, the first step is create the desidered spike-in mix using the function create_spikein(). See documentation of create_spikein() for the complete details on how to use it.

For each spike-in mix, user must specify the name of the mix and the abundance of spike-ins in the mix (parameters mix_name and abundance, respectively). Optionally, user can specify the ID assigned to each spike-in and the presence of some extra variability in spike-in abundance (parameters spike_in_IDS and extra_variability, respectively). If not specified, spike-in IDs are set to “spikein_1”, “spikein_2”, …, “spikein_<S>” (with S be the number of spike-ins) and no extra variability is simulated.

We will create 2 spike-in mixes, named “spikein_M1” and “spikein_M2”, the first one containing 100 spike-ins and the second one containg 90 spike-ins. Spike-ins in “spikein_M2” will be simulated with some extra variability. (Values assigned to parameters abundance and extra_variability are used for example purposes only)

# First spike-in mix
spikein_mix1_abund <- runif(n = 100, min = 0.01, max = 1000); 
spikein_mix1 <- create_spikein(mix_name= "spikein_M1", abundance = spikein_mix1_abund)

# Second spike-in mix
spikein_mix2_abund <- runif(n = 90, min = 0.001, max = 10000); 
spikein_mix2_extra_var <- runif(n = 90, 0.01, 0.02)
spikein_mix2 <- create_spikein(mix_name= "spikein_M2", abundance = spikein_mix2_abund, 
                               extra_variability = spikein_mix2_extra_var)

Then, let create a spike-in set (i.e. a collection of all the desired spike-in mixes) using the function create_spikein_set():

spikein_set <- create_spikein_set(spikein_list = list(mix_1 = spikein_mix1, mix2 = spikein_mix2) )

Next, SPARSim requires to associate each spike-in mix to a set of samples (i.e. cells). As example consider the preset of Bacher dataset, which by default simulate a total of 366 samples.

In that preset, let simulate the presence of spike-in mix “spikein_M1” in the first 50 samples (i.e. samples from 1 to 50), the presence of spike-in mix “spikein_M2” in the last 50 samples (i.e samples from 316 to 366), while the remaining 266 samples (i.e. samples from 51 to 315) will contain no spike-in mixes. The just mentioned scenario could be simulated as follow:

spikein_sample_association <- c( rep("spikein_M1", 50) , rep(NA, 266) , rep("spikein_M2", 50) ) 

Spike-in addition were simulated additing a certain quantity to the indicated samples. The quantity to add is computed as percentage of the material present in reference samples (by default, the ones with the average aboundance). For each spike-in mix, the user must specify that percentage. Here as example, we set 3% to “spikein_M1” and 5% “spikein_M2”.

spikein_aboundance <- c(0.03,0.05)

Last, let create the SPARSim simulation parameter using the function create_SPARSim_spikein_parameter() :

SPARSim_spikein_parameter <- create_SPARSim_spikein_parameter(spikein_set = spikein_set, 
                                                              spikein_sample = spikein_sample_association, 
                                                              spikein_proportion = spikein_aboundance)

The presence of the just created spike-ins mixes to the Bacher data could be perfomed as follow

# Load Bacher data preset
load(Bacher_preset) # Bacher_parameter_preset

# Run SPARSim simulation using the Bacher parameter preset and the spike-in parameter created above
sim_result <- SPARSim(dataset_parameter = Bacher_parameter_preset, spikein_parameter = SPARSim_spikein_parameter)

SPARSim provides presets to emulate the spike-ins described in Jiang et al.

5.3 Simulation of bimodal genes

SPARSim allows to simulate genes having a bimodal expression level. Compared to the standard input parameter, user should specify additional intensity and variability values for the bimodal genes, specifing also the percentage of expression values belonging to the first/second mode.

Once these additional values are specified, SPARSim simulation could be performed as described in section 3.

6. Example of SPARSim applications


Simulated data are particulary useful to test the performance of bioinformatics preprocessing/analysis methods, since the known ground truth could be exploited to assess methods output.

In the following sections, we describe some applications of SPARSim simulated data to the assessment of common preprocessing/analysis methods such as normalization, zero-imputation, cell clustering and differential expression (DE) analysis.

6.1 Simulation DE genes and assessment/benchmarking of DE methods

In this section, it is provided a simple workflow to simulate DE genes with SPARSim and the use of such simulated data to assess the performance of DE methods. Please note that both the simulation of DE genes and the assessment of DE methods can be performed in many different ways. In this section it is described only one of the available options.

6.1.1 Simulation DE genes

To simulate DE genes, here we will use a very simple idea based on fold-change. Given a quantity x and a quantity y, the fold change is defined as the ratio x/y. For example, if x = 20 and y = 80, then the fold-change of x and y is x/y = 20/80 = 1/4 = 0.25. Analogously, if x = 80 and y = 20, the the fold change of x and y is x/y = 80/20 = 4.

Considering a scenario with two experimental conditions A and B. Considering a gene Z which is expressed with level x in experimental condition A and with expression level y in experimental condition B. Then we could consider the gene Z as differential expressed among the two conditions A and B if the fold change x/y is less than FC_1 = 0.25 (i.e. y is at least four times x) or greater than FC_2 = 4 (i.e. x is at least for times y).

As example, we will simulate a scenario similar to the one described in section 2.3:

  • 5000 genes

  • 2 experimental conditions A and B

  • 500 genes will be DE (250 genes having FC < 0.25 and 250 genes having FC > 4), 4500 genes will be not DE

  • condition A has 100 samples, with an average library size of 2M reads

  • condition B has 200 samples, with an average library size of 1M reads

The result of the simulation will be a count matrix of 5000 rows and 300 (100+200) columns.

## STEP 1: Create simulation parameter for condition A
cond_A_param <- list()
cond_A_param$intensity <- runif(n = 5000, min = 0, max = 10000)
cond_A_param$variability <- runif(n = 5000, min = 0.001, max = 1)
cond_A_param$lib_size <- round(rnorm(n = 5000, mean = 2*10^6, sd = 10^3))

## STEP 2: Prepare fold changes multipliers
# Without loss of generality, we will simulate the first 500 of the 5000 genes as the DE ones, 
# setting a FC < 0.25 for the first 250 genes and a FC > 4 for the remaining 250 genes
DE_multiplier <- c( runif(n = 250, min = 0.0001, max = 0.25), runif(n = 250, min = 4, max = 100) )

# The remaining 4500 genes will be simulated as not DE, setting a FC between 0.25 and 4.
not_DE_multiplier <- runif(n = 4500, min = 0.251, max = 3.999)

# Combine the FC multipliers
fold_change_multiplier <- c( DE_multiplier, not_DE_multiplier)


## STEP 3: Create simulation parameter for condition B
cond_B_param <- list()
cond_B_param$intensity <- cond_A_sim_param$intensity * fold_change_multiplier
cond_B_param$variability <- runif(n = 5000, min = 0.001, max = 1)
cond_B_param$lib_size <- round(rnorm(n = 5000, mean = 1*10^6, sd = 10^3))


## STEP 4: Run simulation **

SPARSim_sim_param_with_DE <- list(cond_A = cond_A_sim_param, cond_B = cond_B_sim_param)
sim_result_with_DE <- SPARSim(SPARSim_sim_param_with_DE)

Please note that the described simulation scenario is just a toy example. It is used only to explain in a easy way how the simulator can be used. A more realistic scenario is described above, where the same basic idea of fold-change is used but providing realistic values for the simulation parameters.

As example, we will use intensity, variability and library sizes values took from one of parameter preset available in SPARSim database. In particular, intensity, variability and library sizes will be taken from the first experimental condition of Bacher data preset (Bacher_C1: 17128 genes across 91 cells/samples, average library size ~4*10^6 reads). A total of 1000 genes will be simulated as DE between the two conditions: 600 genes will be simulated as DE with an upregulation (i.e FC > 4) in condition B compared to condition A, while 400 genes will be simulated as DE with a downregulation (i.e FC <0.25) in condition B compared to condition A.

Summarizing, we will simulate the following scenario:

  • 2 experimental conditions A and B

  • a total of 17128 genes

  • 1000 genes will be DE, 16128 genes will be not DE

  • 80 samples for condition A

  • 70 samples for conditions B

## STEP 0: load simulation preset and extract parameter values

# Load Bacher data preset
load("Bacher_preset")

# Extract intesity, variability and library size parameters from the first experimental condition of Bacher data preset (i.e. Bacher_C1)
param_preset <- Bacher_parameter$Bacher_C1
intesity <- param_preset$intensity
variability <- param_preset$variability
lib_size <- param_preset$lib_size


## STEP 1: Prepare simulation parameters for condition A**
cond_A_sim_param <- list()
cond_A_sim_param$intensity <- intesity
cond_A_sim_param$variability <- variability
cond_A_sim_param$lib_size <- sample(lib_size, size = 80)


## STEP 2: Prepare fold-change multipliers

# not DE genes will have a fold change between 0.25 and 4
not_DE_multiplier <- runif(n = 16128, min = 0.251, max = 3.999)

# DE genes will have a fold change less than 0.25 or greater than 4
# here we simulate 400 fold-changes lower thant 0.25 and 600 fold changes greater than 4
DE_multiplier <- c( runif(n = 400, min = 0.0001, max = 0.25), runif(n = 600, min = 4, max = 100) )

# In this example, the first 1000 genes will be the DE ones, while the last 16128 will be the not DE ones
fold_change_multiplier <- c(DE_multiplier, not_DE_multiplier)


## STEP 3: Prepare simulation parameters for condition B
cond_B_sim_param <- list()
cond_B_sim_param$intensity <- cond_A_sim_param$intensity * fold_change_multiplier # apply the fold-changes
cond_B_sim_param$variability <- cond_A_sim_param$variability
cond_B_sim_param$lib_size <- sample(lib_size, size = 70)


## STEP 4: Run simulation

# Create the global parameter
SPARSim_param <- list(cond_A = cond_A_sim_param, cond_B = cond_B_sim_param)

# Run SPARSim simulation
SPARSim_result <- SPARSim(SPARSim_param)

Compared to the first DE simulation scenario, the just described simulation procedure allows to generate more realistic scRNA-seq count data. However, even the above simulation procedure could be further improved. As example, it is well known that biological variability is related to gene expression level, so a change in the level of expression (as the one simulated in condition B due to the fold changes) would correspond to new levels of biological variability. A more realistic simulation procedure could take into account this phenomena, changing not only the gene expression level in condition B, but also their variability values. Another improvment could be done applying fold-change multipliers to only not zero and not too low expression level values, since these values are often ignored in DE analysis.

6.1.2 Assessment/benchmarking of DE methods

When simulating scRNA-seq count data containing DE genes with SPARSim, it is the user to define which genes (and so also how many genes) are simulated as DE (e.g. exploiting the fold-change method described in section 6.1.1). The information about which and how many genes are simulated as DE is fundamental for the DE tool assessment task, since they represent the ground truth to exploit in the performance evaluation.

DE tools usually required a minimal input: the count matrix (raw or normalized, depending on the specific DE tool) and some information about the association between matrix columns and experimental conditions. Using SPARSim, both the input of a DE tool are available: the count matrix is the main output of the SPARSim simulator, while the experimental conditions are part of SPARSim input, and so known by the user. As output, DE tools provide list of DE genes and often some other additional data (e.g. p-values, estimated fold-change, etc.).

Typical assessment metrics for DE tools are precision (i.e. Positive Predicted Value (PPV)), recall (i.e. True Positive Rate (TPR)) and accuracy, defined as follow:

\(precision = TP / (TP + FP)\)

\(recall = TP / (TP + FN)\)

\(accuracy = (TP + TN) / (TP + FP + TN + FN)\)

where:

  • TP (True Positive) is the number of genes SIMULATED as DE and CALLED as DE by the DE tool

  • FP (False Positive) is the number of genes SIMULATED as NOT DE and CALLED as DE by the DE tool

  • TN (True Negative) is the number of genes SIMULATED as NOT DE and CALLED as NOT DE by the DE tools

  • FN (False Negative) is the number of genes SIMULATED as DE and CALLED as NOT DE by the DE tool

Working with SPARSim simulated data, the computation of the above metrics is straightforward, since the TP, FP, TN and FN values could be easy computed from data.

If the DE tool provides also p-values for the analyzed genes, it would be possible to perfom additional performance evaluation, as for example compute the PR-curve (i.e. Precision-Recall curve) and the AUPRC (i.e. Area Under the PR-curve).

When performing the assessment of a DE tool and more generally the benchmarking of several DE methods, it is important to carefully design the simulated data, trying cover as many scenario as possible. In particular, it could be useful generate simulated data having

  • different number/percentage of DE genes

  • different number of samples in an experimental conditions

  • different fold change levels (if using a fold-change criteria to simulate DE genes)

Simulating different scenario allows studying strengths and weaknesses of DE methods, and provide a more fair and robust way to evalute tool performance.

6.2 Assessment/benchmarking of zero-imputation methods

One of the most characterizing features of scRNA-seq count tables is the strong sparsity they show. The zero values present in these data could be linked both to biological and technigal causes. Indeed, while a portion of zeros actually represents genes having null expression values (biological zeros), many of them are artifactual values introduced by the sequencing procedure (technical zeros) as a consequence of the co-occurrance of two main factors: the fixed, limited sequencing depth imposed by the instruments and the heavy skewness of gene expression values internally to each smaple. So, technical zeros represent a portion of information that got lost during gene expression measurement.

To address this issue, many zero-imputation tools have been released in recent years that try to recover the information masked by the sequencing step. In this framework, researchers may be interested in benchmarking already available tools for zero-imputation or to test for a newly developed method performance. SPARSim allows users to create their set of synthetic datasets on which to perform these tests, with the fundamental feature of giving them the golden standard expression values to use when evaluating zero-imputation methods results. In particular, the above mentioned sim_abundance matrix given as output when simulating a dataset via SPARSim procedure provides the TRUE gene expression values (unknown when performing a real experiment) prior to the sequencing experiment. As a consequence, sim_abundance count table can be used as the ground truth about gene expression level when assessing the count tables processed with one or more zero-imputation methods.

As an example, one may want to test for changes in different tools performance when varying the sparsity level of data, i.e. when simulating the sequencing of the same data with different different sequencing depth levels. This can be simulated producing a set of simulations in which, starting from the same intensity and variability values and the same number of conditions and replicates (and, as a consequence, of samples), the lib_size values diminish progressively from the first to the last scenario, causing the related dataset sparsity to grow. In this framework, the sim_abundance golden standard is the same for all the datasets. After having applied the selected zero-imputation tool(s), the obtained SPARSim count table(s) can be compared with the ground truth to test for the goodness in recovering lost information.

This could be done, for example

  • comparing the heatmaps on presence/absence data obtained from the ground truth and the zero-imputed data

  • computing similarity measure between ground truth and zero-imputed data (e.g. Sum of squared error, Pearson correlation coefficient, etc.)

  • studying how the biological structure of the data is maintained (e.g. comparing clustering and/or dimensionality reduction results based on sim_abundance matrix and the zero-imputed count tables).