SPARSim is an R tool for the simulation of single-cell RNA-seq (scRNA-seq) count tables. This vignette is an introduction to the basic functions of SPARSim.
SPARSim is available on GitLab at https://gitlab.com/sysbiobig/sparsim
To install SPARSim from GitLab, please use the following commands:
library(devtools)
install_gitlab("sysbiobig/sparsim", build_opts = c("--no-resave-data", "--no-manual"), build_vignettes = TRUE)
The above commands install SPARSim, the required dependencies and the SPARSim vignette.
To install SPARSim without its vignette, please use the following commands:
library(devtools)
install_gitlab("sysbiobig/sparsim")
The SPARSim R package can also be downloaded at http://sysbiobig.dei.unipd.it/?q=SPARSim
It requires the packages Rcpp, Matrix, scran and edgeR to work.
To install from source, please use the following command:
install.packages("SPARSim_0.9.5.tar.gz", repos = NULL, type = "source")
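Before installing from source, it can help to check whether the dependencies listed above are already available. The snippet below is a minimal sketch using only base R; the variable names are illustrative, and the installation hint in the comment assumes that scran and edgeR are obtained from Bioconductor.

```r
# Packages listed above as required by SPARSim
required_pkgs <- c("Rcpp", "Matrix", "scran", "edgeR")

# requireNamespace() returns TRUE when a package can be loaded, FALSE otherwise
is_installed <- vapply(required_pkgs, requireNamespace, logical(1), quietly = TRUE)

# Names of the packages that still need to be installed
missing_pkgs <- required_pkgs[!is_installed]

if (length(missing_pkgs) > 0) {
  message("Missing packages: ", paste(missing_pkgs, collapse = ", "))
  # Assumption: CRAN packages can be installed with install.packages(),
  # Bioconductor packages (scran, edgeR) with BiocManager::install()
} else {
  message("All SPARSim dependencies are installed")
}
```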
A SPARSim simulation requires an input parameter describing the macro characteristics of the desired synthetic count table. More specifically, for each experimental condition to simulate, SPARSim needs 3 pieces of information as input: gene expression level intensities, gene expression level variabilities and sample library sizes; the condition name can be specified as an optional field.
The user can choose among 4 ways to provide the SPARSim input parameter:
estimate the parameter from an existing count table
use one of the parameter presets available in the SPARSim package
specify the input parameter manually
use a combination of the above options
Sections 2.1, 2.2, 2.3 and 2.4 describe each of the 4 input options. Section 2.5 provides additional information about the structure of SPARSim simulation parameters.
SPARSim allows the user to estimate the input parameters from an existing count table. This is especially useful when the user wants to simulate a count matrix with characteristics similar to an existing one.
As an example of the estimation procedure, we will use a count matrix available in the SPARSim package (Example_count_matrix).
The count matrix contains 5000 genes and 150 cells. Cells belong to 3 experimental conditions (the first 50 cells correspond to condition A, the second 50 cells to condition B and the last 50 cells to condition C).
# Load data
data(Example_count_matrix)
dim(Example_count_matrix)
SPARSim provides the function SPARSim_estimate_parameter_from_data(raw_data, norm_data, conditions) to automatically estimate the SPARSim simulation parameter from a given count matrix. The function requires 3 input parameters:
raw_data: the existing count matrix
norm_data: the existing normalized count matrix
conditions: a list indicating the experimental conditions. Each element of the list contains the indices of the columns belonging to that experimental condition.
First, let us normalize the count matrix. Here we use the scran normalization, but other normalization procedures could be used. The function scran_normalization is the SPARSim built-in function that performs the steps required by scran normalization (i.e. create a SingleCellExperiment object, compute the normalization factors, perform the normalization and extract the normalized counts).
# Perform scran normalization
Example_count_matrix_norm <- scran_normalization(Example_count_matrix)
Then, let us create the required conditions parameter:
# Get column index for each experimental condition
cond_A_column_index <- c(1:50) # Condition A column indices: from column 1 to column 50
cond_B_column_index <- c(51:100) # Condition B column indices: from column 51 to column 100
cond_C_column_index <- c(101:150) # Condition C column indices: from column 101 to column 150
# Create conditions param
Example_count_matrix_conditions <- list(cond_A = cond_A_column_index,
cond_B = cond_B_column_index,
cond_C = cond_C_column_index)
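When the condition of each cell is stored as a vector of labels rather than as contiguous column ranges, the same kind of list can be built with base R. This is a minimal sketch: the vector cell_labels is a made-up stand-in for real cell annotations, and it reproduces the layout of the example matrix (50 cells per condition).

```r
# One condition label per column of the count matrix (150 cells in total)
cell_labels <- rep(c("cond_A", "cond_B", "cond_C"), each = 50)

# split() groups the column indices by label and returns a named list,
# with one integer vector of column indices per experimental condition
conditions_from_labels <- split(seq_along(cell_labels), cell_labels)

# Inspect the structure: 3 elements, each holding 50 column indices
str(conditions_from_labels)
```

The resulting list has the same shape as the one written by hand above (named elements holding column indices), so it is a convenient starting point when cells of the same condition are not stored in adjacent columns.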
Now that all the required input parameters of the function SPARSim_estimate_parameter_from_data() are available, the user can estimate the SPARSim simulation parameter from the existing count matrix and run the SPARSim simulation:
# Create SPARSim simulation parameter through the estimation from an existing count matrix
SPARSim_sim_param <- SPARSim_estimate_parameter_from_data(raw_data = Example_count_matrix,
norm_data = Example_count_matrix_norm,
conditions = Example_count_matrix_conditions)
# Run SPARSim simulation using the just created simulation parameter
sim_result <- SPARSim_simulation(dataset_parameter = SPARSim_sim_param)
SPARSim provides some parameter presets which allow the user to simulate count data resembling the characteristics of 12 existing count matrices (Bacher et al., Camp et al., Chu et al., Engel et al., Horning et al., Tung et al., Zheng et al., Macosko et al., Saunders et al., 10X Genomics example datasets) in terms of count intensity/variability, sparsity, read/UMI count per cell, number of cells and number of genes.
As an example of using a parameter preset, we use the preset of the Chu dataset.
# Load Chu preset
data(Chu_param_preset) # Chu_param_preset
# Run SPARSim simulation using the just loaded parameter preset
sim_result <- SPARSim_simulation(dataset_parameter = Chu_param_preset)
The commands above simulate a count table similar to the one in the Chu study (11782 genes, 758 cells across 6 different experimental conditions, average read/UMI count per cell of 1335606, sparsity ~51%).
The complete list of parameter presets and the characteristics of the count matrices that can be generated using them are available in Section 7.
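Summary figures such as the ones quoted above can be computed on any count matrix with base R. The sketch below assumes that sparsity means the fraction of zero entries in the matrix, and it uses a small random matrix (toy_counts, a made-up stand-in) instead of a real simulation output.

```r
set.seed(1)

# Toy count matrix: 200 genes (rows) x 30 cells (columns), Poisson counts
toy_counts <- matrix(rpois(200 * 30, lambda = 0.7), nrow = 200, ncol = 30)

# Fraction of zero entries (assumed definition of sparsity)
sparsity <- mean(toy_counts == 0)

# Average total count per cell (i.e. per column)
avg_count_per_cell <- mean(colSums(toy_counts))

c(genes = nrow(toy_counts), cells = ncol(toy_counts))
round(100 * sparsity, 1)   # sparsity as a percentage
avg_count_per_cell
```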
SPARSim allows the user to specify the input parameters manually.
For each experimental condition to simulate, the user can call the function SPARSim_create_simulation_parameter() to specify the required simulation information. The function has 3 mandatory input parameters, intensity, variability and library_size, which describe gene expression level intensities, gene expression level variabilities and sample library sizes, respectively. Moreover, 3 optional parameters are available to specify gene IDs, sample names and experimental condition ID (function parameters feature_names, sample_names and condition_name, respectively); if one or more of the optional parameters are not provided, the function automatically sets default values.
As an example, let us simulate a count table with
5000 genes
2 experimental conditions A and B
condition A has 100 samples (i.e. cells), with an average library size of 2M reads
condition B has 200 samples (i.e. cells), with an average library size of 1M reads
The result of the simulation will be a count matrix of 5000 rows and 300 (100+200) columns.
Gene IDs will be set to "Gene_1", "Gene_2", …, "Gene_5000". Sample names for condition A will be set to "cond_A_cell_1", "cond_A_cell_2", …, "cond_A_cell_100"; sample names for condition B will be set to "cond_B_cell_1", "cond_B_cell_2", …, "cond_B_cell_200".
As a toy example, let the gene expression level intensities and variabilities be sampled from a uniform distribution; library size values will be sampled from a normal distribution.
# Create simulation parameter for condition A
cond_A_param <- SPARSim_create_simulation_parameter(
intensity = runif(n = 5000, min = 0, max = 10000),
variability = runif(n = 5000, min = 0.001, max = 1),
library_size = round(rnorm(n = 100, mean = 2*10^6, sd = 10^3)),
feature_names = paste0("Gene_",c(1:5000)),
sample_names = paste0("cond_A_cell_",c(1:100)),
condition_name = "condition_A")
# Create simulation parameter for condition B
cond_B_param <- SPARSim_create_simulation_parameter(
intensity = runif(n = 5000, min = 0, max = 10000),
variability = runif(n = 5000, min = 0.001, max = 1),
library_size = round(rnorm(n = 200, mean = 1*10^6, sd = 10^3)),
feature_names = paste0("Gene_",c(1:5000)),
sample_names = paste0("cond_B_cell_",c(1:200)),
condition_name = "condition_B")
Then, the parameters of each experimental condition must be collected in a list.
# Create SPARSim simulation parameter
SPARSim_sim_param <- list(cond_A_param, cond_B_param)
Now that the SPARSim input parameter is ready, it can be used to run the simulation:
# Run SPARSim simulation using the just created simulation parameter
sim_result <- SPARSim_simulation(dataset_parameter = SPARSim_sim_param)
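When the input vectors are written by hand, a quick consistency check can catch mismatched lengths before the simulation is run. The helper below, check_condition_inputs(), is hypothetical and not part of SPARSim; the constraints it enforces are plausibility checks chosen for this sketch, not requirements documented by the package.

```r
# Hypothetical helper (not part of SPARSim): basic sanity checks on the
# vectors describing one experimental condition
check_condition_inputs <- function(intensity, variability, library_size) {
  stopifnot(
    length(intensity) == length(variability),  # one value per gene in both vectors
    all(intensity >= 0),                        # assumed: intensities are non-negative
    all(variability > 0),                       # assumed: variabilities are positive
    all(library_size > 0)                       # assumed: library sizes are positive
  )
  # Number of genes and cells implied by the inputs
  c(n_genes = length(intensity), n_cells = length(library_size))
}

set.seed(42)
check_condition_inputs(
  intensity    = runif(n = 5000, min = 0, max = 10000),
  variability  = runif(n = 5000, min = 0.001, max = 1),
  library_size = round(rnorm(n = 100, mean = 2 * 10^6, sd = 10^3))
)
```

For the toy scenario above, the check reports 5000 genes and 100 cells for condition A, matching the number of rows and the condition A share of the columns in the expected count matrix.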
SPARSim allows the user to combine parameters obtained in the three ways described above. For additional details about the SPARSim simulation parameter, please see Section 2.5.
As an example, let us simulate a count table having
17128 genes
2 experimental conditions A and B
condition A has 50 samples, with intensity and variability from Bacher dataset preset (experimental condition Bacher_C1) and with library size estimated from Camp dataset count matrix
condition B has 100 samples, with intensity and variability from Bacher dataset preset (experimental condition Bacher_C2) and with an average library size of 0.5M reads (specified by user)
First, load Bacher parameter preset and Camp count table
# load Bacher data parameter preset
data(Bacher_param_preset)
# Bacher data has 4 experimental conditions, so Bacher_param_preset is a list of 4 simulation parameters
# Let us take only Bacher_C1 and Bacher_C2
Bacher_cond_1 <- Bacher_param_preset$Bacher_C1 # get parameter preset of experimental condition Bacher_C1
Bacher_cond_2 <- Bacher_param_preset$Bacher_C2 # get parameter preset of experimental condition Bacher_C2
# load Camp data count table
data(Camp_data) # it will load Camp_count_matrix
Camp_lib_size <- SPARSim_estimate_library_size(Camp_count_matrix) # estimate library size from Camp count table
Then, create the simulation parameter for condition A, combining parameters from the Bacher data preset (Bacher_C1) with the library sizes estimated directly from the Camp count matrix
cond_A_param <- SPARSim_create_simulation_parameter(
intensity = Bacher_cond_1$intensity,
variability = Bacher_cond_1$variability,
library_size = sample(Camp_lib_size, size = 50),
condition_name = "cond_A")
Next, create the simulation parameter for condition B, combining parameters from the Bacher data preset (Bacher_C2) with library sizes specified by the user
cond_B_param <- SPARSim_create_simulation_parameter(
intensity = Bacher_cond_2$intensity,
variability = Bacher_cond_2$variability,
library_size = round(rnorm(n = 100, mean = 0.5*10^6, sd = 0.01*10^6)),
condition_name = "cond_B")
Last, the parameters of each experimental condition must be collected in a list. Once the SPARSim input parameter is ready, it can be used to run the simulation:
# Create SPARSim simulation parameter
SPARSim_sim_param <- list(cond_A = cond_A_param, cond_B = cond_B_param)
# Run SPARSim simulation using the just created simulation parameter
sim_result <- SPARSim_simulation(dataset_parameter = SPARSim_sim_param)
The SPARSim input parameter is implemented as an R list, where each element of the list contains the information needed to simulate a single experimental condition. As an example, consider a SPARSim simulation parameter to create a synthetic count table describing two experimental conditions. Let sim_param be such a SPARSim input parameter. Then sim_param is a list of 2 elements; let us call them cond_A and cond_B. Both cond_A and cond_B are lists themselves, containing the elements called intensity, variability and lib_size. So, using this notation, sim_param$cond_A$intensity contains the gene expression level intensities used to simulate the first experimental condition, while sim_param$cond_B$lib_size contains the sample library sizes used to simulate the second experimental condition. These implementation details are useful only for users interested in particular simulation scenarios, such as the one described in Section 2.4. For the great majority of users, such low-level details can be ignored, since SPARSim provides a complete set of functions to easily create the simulation parameters, as described in the previous sections.
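The nested structure described above can be mimicked with plain R lists. The sketch below is for illustration only: the values are made up, the lists are built by hand rather than by SPARSim functions, and they contain only the three elements named in the text, so they are not meant to be passed to a simulation.

```r
# Hand-built stand-in for a SPARSim input parameter with two conditions
# (3 genes; 2 cells in cond_A and 3 cells in cond_B; values are made up)
sim_param <- list(
  cond_A = list(intensity   = c(10, 200, 5),
                variability = c(0.1, 0.3, 0.2),
                lib_size    = c(1000, 1200)),
  cond_B = list(intensity   = c(12, 150, 8),
                variability = c(0.1, 0.4, 0.2),
                lib_size    = c(900, 1100, 1000))
)

names(sim_param)             # one element per experimental condition
sim_param$cond_A$intensity   # gene intensities of the first condition
sim_param$cond_B$lib_size    # library sizes of the second condition

# Number of cells per condition, read from the library size vectors
sapply(sim_param, function(cond) length(cond$lib_size))
```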
Once the input parameter is ready, the SPARSim simulation can be launched by calling the function SPARSim_simulation(dataset_parameter). Let SPARSim_sim_param be the input simulation parameter obtained as described in Section 2; the SPARSim simulation can be launched as follows:
# Run SPARSim simulation using the created simulation parameter
sim_result <- SPARSim_simulation(dataset_parameter = SPARSim_sim_param)
sim_result will contain the output of the SPARSim simulation; details about the output structure are provided in Section 4.
The input parameter dataset_parameter is the only mandatory parameter. However, additional optional parameters can be specified:
SPARSim_simulation(dataset_parameter,
batch_parameter = NULL,
spikein_parameter = NULL,
output_sim_param_matrices = FALSE,
output_batch_matrix = FALSE)
where:
batch_parameter and output_batch_matrix are related to simulating the presence of batch effects, see Section 5.1 for additional details
spikein_parameter is related to spike-in simulation, see Section 5.2 for additional details
output_sim_param_matrices controls the output options, see Section 4 for additional details
The output of a SPARSim simulation is a list of 5 elements:
count_matrix: the simulated count matrix (as an R matrix). It is a raw (i.e. not normalized) count table, having genes on rows and cells on columns, containing read/UMI count values.
gene_matrix: the simulated gene expression (as an R matrix). The matrix contains the simulated gene expression for each cell (genes on rows and cells on columns).
abundance_matrix: if output_sim_param_matrices is FALSE (default), it is set to NULL; if output_sim_param_matrices is TRUE, it contains the gene intensity values provided as input (genes on rows, samples on columns).
variability_matrix: if output_sim_param_matrices is FALSE (default), it is set to NULL; if output_sim_param_matrices is TRUE, it contains the gene variability values provided as input (genes on rows, samples on columns).
batch_factors_matrix: if output_batch_matrix is FALSE (default), it is set to NULL; if output_batch_matrix is TRUE, it contains the multiplicative factors used in batch generation (genes on rows, samples on columns).
For the great majority of users, the most useful outputs are count_matrix and gene_matrix.
SPARSim uses a mixed model, made of a first step that simulates the biological variability (i.e. the gene expression across cells belonging to the same experimental condition) followed by a second step that simulates the technical variability (i.e. given the gene expression as input, the count table resulting from the experimental/sequencing procedure). In this framework, gene_matrix can be considered the output of the first step, while count_matrix is the output of the second (i.e. last) step.
Considering an in-silico scRNA-seq experiment, gene_matrix provides the (simulated) gene expression levels in the cells, i.e. the quantities of interest in a sequencing quantification study, while count_matrix provides the (simulated) gene expression levels as measured through the sequencing experiment, as in any real count table.
Therefore, gene_matrix represents the TRUE gene expression (unknown in a real experiment), while count_matrix represents the MEASURED gene expression (in terms of read counts).
Compared to a real sequencing experiment, where only the count table (i.e. the MEASURED gene expression) is available, the main advantage of simulated data is that they provide both the TRUE quantities of interest and the MEASURED ones.
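The distinction between TRUE and MEASURED expression can be illustrated with a toy two-step simulation in base R. This sketch is not SPARSim's actual model: the gamma distribution for the first step, the Poisson sampling for the second step and the capture rate of 0.1 are all assumptions made purely for illustration.

```r
set.seed(123)

n_genes <- 4
n_cells <- 6
mean_expr <- c(5, 50, 500, 5000)   # made-up average expression per gene

# Step 1 (biological variability, assumed gamma): "true" expression values.
# The matrix is filled column by column, so repeating mean_expr once per cell
# gives every row (gene) its own mean; with shape 2 the mean is shape / rate.
true_expr <- matrix(
  rgamma(n_genes * n_cells, shape = 2, rate = 2 / rep(mean_expr, times = n_cells)),
  nrow = n_genes, ncol = n_cells
)

# Step 2 (technical variability, assumed Poisson sampling): measured counts,
# obtained by observing only a fraction (assumed 10%) of the true expression
measured_counts <- matrix(
  rpois(length(true_expr), lambda = 0.1 * true_expr),
  nrow = n_genes, ncol = n_cells
)

round(true_expr, 1)   # plays the role of gene_matrix (TRUE expression)
measured_counts       # plays the role of count_matrix (MEASURED expression)
```

In a real experiment only the second matrix would be observable; in a simulation both are available, which is what makes it possible to benchmark quantification or normalization methods against a known truth.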
If output_sim_param_matrices is TRUE, then abundance_matrix and variability_matrix contain the average gene intensity and gene variability used in the biological variability simulation. If no batch effects or spike-ins are simulated, these matrices just contain the gene intensity and gene variability specified in the input simulation parameter dataset_parameter. On the other hand, simulating batch effects and/or spike-ins changes the input gene intensity, so abundance_matrix and variability_matrix contain the gene intensity and gene variability actually used in the simulation process.
SPARSim allows the user to simulate the presence of batch effects (Section 5.1), spike-ins (Section 5.2), bimodal genes (Section 5.3), differentially expressed genes (Section 5.4) and multiple cell types (Section 5.5). The following sections provide detailed instructions on how to use SPARSim in each of these simulation scenarios.
To simulate batch effects, the first step is to create the desired batches using the function SPARSim_create_batch(). See the documentation of SPARSim_create_batch() for complete details on how to use it. We will create 2 batches, named “Lane_1” and “Lane_2”.
batch_lane_1 <- SPARSim_create_batch(name = "Lane_1", distribution = "normal", param_A = 0, param_B = 1)
batch_lane_2 <- SPARSim_create_batch(name = "Lane_2", distribution = "gamma", param_A = 1, param_B = 1)
Then, let us create a batch set (i.e. a collection of all the desired batches) using the function SPARSim_create_batch_set():
batch_set <- SPARSim_create_batch_set(batch_list = list(batch_1 = batch_lane_1,
batch_2 = batch_lane_2))
Next, SPARSim requires each batch to be associated with a set of samples (i.e. cells). As an example, consider the preset of the Bacher dataset, which by default simulates a total of 366 samples.
In that preset, let us simulate the presence of batch “Lane_1” in the first and last 50 samples (i.e. samples from 1 to 50 and from 317 to 366), while the remaining 266 samples (i.e. samples from 51 to 316) will be affected by batch “Lane_2”. This scenario can be encoded as follows:
batch_sample_association <- c(rep("Lane_1", 50), rep("Lane_2", 266), rep("Lane_1", 50))
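A quick base-R check confirms how many samples each batch covers and which sample indices they correspond to. The association vector is redefined here so that the snippet runs on its own; lane_1_idx and lane_2_idx are illustrative names.

```r
# Same association vector as above: one batch label per sample
batch_sample_association <- c(rep("Lane_1", 50), rep("Lane_2", 266), rep("Lane_1", 50))

# Total number of samples: must match the number of cells in the simulation
length(batch_sample_association)

# Number of samples assigned to each batch
table(batch_sample_association)

# Sample indices covered by each batch
lane_1_idx <- which(batch_sample_association == "Lane_1")
lane_2_idx <- which(batch_sample_association == "Lane_2")
range(lane_2_idx)        # first and last sample affected by Lane_2
range(lane_1_idx[-(1:50)])  # first and last sample of the second Lane_1 block
```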
Last, let us create the SPARSim batch parameter using the function SPARSim_create_batch_parameter():
SPARSim_batch_parameter <- SPARSim_create_batch_parameter(batch_set = batch_set,
batch_sample = batch_sample_association)
The just created batch effect can be simulated on the Bacher data as follows:
# Load Bacher data preset
data(Bacher_param_preset) # Bacher_param_preset
# Run SPARSim simulation using the Bacher parameter preset and the batch parameter created above
sim_result <- SPARSim_simulation(dataset_parameter = Bacher_param_preset,
batch_parameter = SPARSim_batch_parameter)
To simulate the presence of spike-ins, the first step is to create the desired spike-in mixes using the function SPARSim_create_spikein_mix(). See the documentation of SPARSim_create_spikein_mix() for complete details on how to use it.
For each spike-in mix, the user must specify the name of the mix and the abundance of the spike-ins in the mix (parameters mix_name and abundance, respectively). Optionally, the user can specify the ID assigned to each spike-in and the presence of some extra variability in spike-in abundance (parameters spike_in_IDS and extra_variability, respectively). If not specified, spike-in IDs are set to “spikein_1”, “spikein_2”, …, “spikein_<S>” (with S being the number of spike-ins) and no extra variability is simulated.
We will create 2 spike-in mixes, named “spikein_M1” and “spikein_M2”, the first one containing 100 spike-ins and the second one containing 90 spike-ins. Spike-ins in “spikein_M2” will be simulated with some extra variability. (The values assigned to parameters abundance and extra_variability are for example purposes only.)
# First spike-in mix
spikein_mix1_abund <- runif(n = 100, min = 0.01, max = 1000)
spikein_mix1 <- SPARSim_create_spikein_mix(mix_name= "spikein_M1",
abundance = spikein_mix1_abund)
# Second spike-in mix
spikein_mix2_abund <- runif(n = 90, min = 0.001, max = 10000)
spikein_mix2_extra_var <- runif(n = 90, 0.01, 0.02)
spikein_mix2 <- SPARSim_create_spikein_mix(mix_name= "spikein_M2",
abundance = spikein_mix2_abund,
extra_variability = spikein_mix2_extra_var)
Then, let us create a spike-in set (i.e. a collection of all the desired spike-in mixes) using the function SPARSim_create_spikein_set():
spikein_set <- SPARSim_create_spikein_set(spikein_mixes = list(mix_1 = spikein_mix1, mix_2 = spikein_mix2))
Next, SPARSim requires each spike-in mix to be associated with a set of samples (i.e. cells). As an example, consider the preset of the Bacher dataset, which by default simulates a total of 366 samples.
In that preset, let us simulate the presence of spike-in mix “spikein_M1” in the first 50 samples (i.e. samples from 1 to 50) and the presence of spike-in mix “spikein_M2” in the last 50 samples (i.e. samples from 317 to 366), while the remaining 266 samples (i.e. samples from 51 to 316) will contain no spike-in mix. This scenario can be encoded as follows:
spikein_sample_association <- c( rep("spikein_M1", 50) , rep(NA, 266) , rep("spikein_M2", 50) )
Spike-in addition is simulated by adding a certain quantity to the indicated samples. The quantity to add is computed as a percentage of the material present in reference samples (by default, the ones with the average abundance). For each spike-in mix, the user must specify that percentage. Here, as an example, we set 3% for “spikein_M1” and 5% for “spikein_M2”.
spikein_abundance <- c(0.03,0.05)
Last, let us create the SPARSim spike-in parameter using the function SPARSim_create_spikein_parameter():
SPARSim_spikein_parameter <- SPARSim_create_spikein_parameter(spikein_set = spikein_set,
spikein_sample = spikein_sample_association,
spikein_proportion = spikein_abundance)
The just created spike-in mixes can be added to the Bacher data simulation as follows:
# Load Bacher parameter preset
data(Bacher_param_preset) # Bacher_param_preset
# Run SPARSim simulation using the Bacher parameter preset and the spike-in parameter created above
sim_result <- SPARSim_simulation(dataset_parameter = Bacher_param_preset,
spikein_parameter = SPARSim_spikein_parameter)
SPARSim provides presets to emulate the spike-ins described in Jiang et al.
SPARSim allows the user to simulate genes having a bimodal expression level. Compared to the standard input parameter, the user should specify additional intensity and variability values for the bimodal genes, together with the percentage of expression values belonging to the first and second mode.
Once these additional values are specified, the SPARSim simulation can be performed as described in Section 3.
This section provides a simple workflow to simulate DE genes with SPARSim and to use such simulated data to assess the performance of DE methods. Please note that both the simulation of DE genes and the assessment of DE methods can be performed in many different ways; only one of the available options is described here.
SPARSim allows the user to simulate differentially expressed (DE) genes. There are several ways to simulate DE genes; a very common approach is based on applying multiplicative factors (i.e. fold change values) to gene expression levels. In the following sections, a simulation procedure based on fold change values will be presented.
Section 5.4.1 will introduce the idea of fold-change multipliers on a toy example dataset. Section 5.4.2 will use the same idea in a more realistic scenario. Section 5.4.3 will describe the SPARSim built-in function to simulate DE genes.
Given a quantity x and a quantity y, the fold change is defined as the ratio x/y. For example, if x = 20 and y = 80, then the fold-change of x and y is x/y = 20/80 = 1/4 = 0.25. Analogously, if x = 80 and y = 20, the fold change of x and y is x/y = 80/20 = 4.
Consider a scenario with two experimental conditions A and B, and a gene Z which is expressed with level x in experimental condition A and with level y in experimental condition B. Then we could consider the gene Z as differentially expressed between the two conditions A and B if the fold change x/y is less than FC_1 = 0.25 (i.e. y is more than four times x) or greater than FC_2 = 4 (i.e. x is more than four times y).
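The definition above can be written as two small R functions. This is a minimal sketch: fold_change() and is_DE() are illustrative helpers, not SPARSim functions, and the comparison uses strict inequalities, following the "less than 0.25 or greater than 4" wording.

```r
# Fold change of x with respect to y
fold_change <- function(x, y) x / y

# A gene is called DE when its fold change is below fc_low or above fc_high
is_DE <- function(x, y, fc_low = 0.25, fc_high = 4) {
  fc <- fold_change(x, y)
  fc < fc_low | fc > fc_high
}

# The two worked examples from the text
fold_change(20, 80)   # 20/80
fold_change(80, 20)   # 80/20

# Three toy genes: expression in condition A (x) and in condition B (y)
expr_A <- c(10, 30, 100)
expr_B <- c(80, 20, 20)
fold_change(expr_A, expr_B)   # 0.125, 1.5 and 5
is_DE(expr_A, expr_B)         # first and third gene are DE, second is not
```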
As an example, we will simulate a scenario similar to the one described in Section 2.3:
5000 genes
2 experimental conditions A and B
500 genes will be DE (250 genes having FC < 0.25 and 250 genes having FC > 4), 4500 genes will be not DE
condition A has 100 samples, with an average library size of 2M reads
condition B has 200 samples, with an average library size of 1M reads
The result of the simulation will be a count matrix of 5000 rows and 300 (100+200) columns.
## STEP 1: Create simulation parameter for condition A
cond_A_param <- SPARSim_create_simulation_parameter(
intensity = runif(n = 5000, min = 0, max = 10000),
variability = runif(n = 5000, min = 0.001, max = 1),
library_size = round(rnorm(n = 100, mean = 2*10^6, sd = 10^3)),
condition_name = "cond_A")
## STEP 2: Prepare fold change multipliers
# The multipliers are applied to the condition A intensities to obtain the condition B intensities.
# Without loss of generality, we will simulate the first 500 of the 5000 genes as the DE ones,
# setting a FC < 0.25 for the first 250 genes and a FC > 4 for the remaining 250 genes
DE_multiplier <- c( runif(n = 250, min = 0.0001, max = 0.25), runif(n = 250, min = 4, max = 100) )
# The remaining 4500 genes will be simulated as not DE, setting a FC between 0.25 and 4.
not_DE_multiplier <- runif(n = 4500, min = 0.251, max = 3.999)
# Combine the FC multipliers
fold_change_multiplier <- c( DE_multiplier, not_DE_multiplier)
## STEP 3: Create simulation parameter for condition B
cond_B_param <- SPARSim_create_simulation_parameter(
intensity = cond_A_param$intensity * fold_change_multiplier,
variability = runif(n = 5000, min = 0.001, max = 1),
library_size = round(rnorm(n = 200, mean = 1*10^6, sd = 10^3)),
condition_name = "cond_B")
## STEP 4: Run simulation
SPARSim_param_with_DE <- list(cond_A = cond_A_param, cond_B = cond_B_param)
sim_result_with_DE <- SPARSim_simulation(SPARSim_param_with_DE)
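Because the multipliers are generated explicitly, the ground truth of the simulation can be tabulated directly. The sketch below rebuilds the multipliers with base R only (with a fixed seed, so the values differ from the ones drawn above) and classifies each gene; the labels "down", "not_DE" and "up" are illustrative and describe the direction of the multiplier applied to condition A to obtain condition B.

```r
set.seed(7)

# Same construction as in STEP 2: 250 + 250 DE genes, then 4500 not DE genes
DE_multiplier     <- c(runif(n = 250, min = 0.0001, max = 0.25),
                       runif(n = 250, min = 4, max = 100))
not_DE_multiplier <- runif(n = 4500, min = 0.251, max = 3.999)
fold_change_multiplier <- c(DE_multiplier, not_DE_multiplier)

# Classify each multiplier using the thresholds 0.25 and 4
fc_class <- cut(fold_change_multiplier,
                breaks = c(0, 0.25, 4, Inf),
                labels = c("down", "not_DE", "up"))
table(fc_class)

# Logical ground-truth vector: TRUE for the genes simulated as DE
is_truly_DE <- fc_class != "not_DE"
which(is_truly_DE)[c(1, sum(is_truly_DE))]   # first and last DE gene index
```

Keeping such a ground-truth vector next to the simulated count matrix makes it straightforward to score the calls of a DE method afterwards.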
Please note that the described simulation scenario is just a toy example, used only to explain in an easy way how the simulator can be used. A more realistic scenario is described below, where the same basic idea of fold change is used but with realistic values for the simulation parameters.
As an example, we will use intensity, variability and library size values taken from one of the parameter presets available in the SPARSim database. In particular, intensity, variability and library sizes will be taken from the first experimental condition of the Bacher data preset (Bacher_C1: 17128 genes across 91 cells/samples, average library size ~4*10^6 reads). A total of 1000 genes will be simulated as DE between the two conditions: 600 genes will be simulated as DE with an upregulation (i.e. FC > 4) in condition B compared to condition A, while 400 genes will be simulated as DE with a downregulation (i.e. FC < 0.25) in condition B compared to condition A.
Summarizing, we will simulate the following scenario:
2 experimental conditions A and B
a total of 17128 genes
1000 genes will be DE, 16128 genes will be not DE
80 samples for condition A
70 samples for condition B
## STEP 0: load simulation preset and extract parameter values
# Load Bacher parameter preset
data(Bacher_param_preset) # Bacher_param_preset
# Extract intensity, variability and library size parameters from the first experimental condition of Bacher data preset (i.e. Bacher_C1)
param_preset <- Bacher_param_preset$Bacher_C1
intensity <- param_preset$intensity
variability <- param_preset$variability
lib_size <- param_preset$lib_size
## STEP 1: Prepare simulation parameters for condition A
cond_A_param <- SPARSim_create_simulation_parameter(
intensity = intensity,
variability = variability,
library_size = sample(lib_size, size = 80),
condition_name = "cond_A")
## STEP 2: Prepare fold-change multipliers
# not DE genes will have a fold change between 0.25 and 4
not_DE_multiplier <- runif(n = 16128, min = 0.251, max = 3.999)
# DE genes will have a fold change less than 0.25 or greater than 4
# here we simulate 400 fold-changes lower than 0.25 and 600 fold changes greater than 4
DE_multiplier <- c( runif(n = 400, min = 0.0001, max = 0.25), runif(n = 600, min = 4, max = 100) )
# In this example, the first 1000 genes will be the DE ones, while the last 16128 will be the not DE ones
fold_change_multiplier <- c(DE_multiplier, not_DE_multiplier)
## STEP 3: Prepare simulation parameters for condition B
cond_B_param <- SPARSim_create_simulation_parameter(
intensity = intensity * fold_change_multiplier, # apply the fold-changes
variability = variability,
library_size = sample(lib_size, size = 70),
condition_name = "cond_B")
## STEP 4: Run simulation
# Create the global parameter
SPARSim_param_with_DE <- list(cond_A = cond_A_param, cond_B = cond_B_param)
# Run SPARSim simulation
SPARSim_result <- SPARSim_simulation(SPARSim_param_with_DE)
Compared to the first DE simulation scenario, the procedure just described generates more realistic scRNA-seq count data. However, even this procedure could be further improved. For example, it is well known that gene biological variability is related to gene expression level, so a change in the level of expression (such as the one simulated in condition B through the fold changes) would correspond to new levels of biological variability. A more realistic simulation procedure would take this phenomenon into account, changing not only the gene expression levels in condition B, but also the gene variability values. Such a procedure is implemented in the function SPARSim_create_DE_genes_parameter(sim_param, fc_multiplier).
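The function takes as input the simulation parameter for condition A (sim_param) and a set of fold-change values (fc_multiplier), and provides as output a SPARSim simulation parameter for condition B in which, as outlined above, both the gene expression levels and the gene variability values are changed according to the fold changes.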
Optionally, the function takes as input additional parameters to specify the number of cells in condition B (parameter N_cells), the library size values for cells in condition B (parameter lib_size_DE), the sample IDs in condition B (parameter sample_names) and the name associated with condition B (parameter condition_name). For additional details about the optional parameters, please see the function documentation.
The code below shows how to use the function SPARSim_create_DE_genes_parameter() in the same simulation scenario described in Section 5.4.2, reported here for the reader's convenience:
use of Bacher preset
2 experimental conditions A and B
a total of 17128 genes
1000 genes will be DE, 16128 genes will be not DE
80 samples for condition A
70 samples for condition B
## STEP 0: load simulation preset and extract parameter values
# Load Bacher parameter preset
data(Bacher_param_preset) # Bacher_param_preset
# Extract intensity, variability and library size parameters from the first experimental condition of Bacher data preset (i.e. Bacher_C1)
param_preset <- Bacher_param_preset$Bacher_C1
intensity <- param_preset$intensity
variability <- param_preset$variability
lib_size <- param_preset$lib_size
## STEP 1: Prepare simulation parameters for condition A
cond_A_param <- SPARSim_create_simulation_parameter(
intensity = intensity,
variability = variability,
library_size = sample(lib_size, size = 80),
condition_name = "cond_A")
## STEP 2: Prepare fold-change multipliers
# not DE genes will have a fold change between 0.25 and 4
not_DE_multiplier <- runif(n = 16128, min = 0.251, max = 3.999)
# DE genes will have a fold change less than 0.25 or greater than 4
# here we simulate 400 fold-changes lower than 0.25 and 600 fold changes greater than 4
DE_multiplier <- c( runif(n = 400, min = 0.0001, max = 0.25), runif(n = 600, min = 4, max = 100) )
# In this example, the first 1000 genes will be the DE ones, while the last 16128 will be the not DE ones
fold_change_multiplier <- c(DE_multiplier, not_DE_multiplier)
## STEP 3: Use SPARSim built-in function to create a simulation parameter with DE genes
cond_B_param <- SPARSim_create_DE_genes_parameter(
sim_param = cond_A_param,
fc_multiplier = fold_change_multiplier,
N_cells = 70,
condition_name = "cond_B")
## STEP 4: Run simulation
# Create the global parameter
SPARSim_param_with_DE <- list(cond_A = cond_A_param, cond_B = cond_B_param)
# Run SPARSim simulation
SPARSim_result <- SPARSim_simulation(SPARSim_param_with_DE)
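Because the fold-change multipliers are built by the user, the ground truth about which genes are DE is known before the simulation runs. The following is a minimal sketch of how that ground truth could be stored for later use. It rebuilds the multipliers of STEP 2 so that it runs on its own; the seed and the names `is_DE_true` and `is_DE_from_fc` are illustrative assumptions, not part of SPARSim.

```r
# Arbitrary seed, used only to make this sketch reproducible
set.seed(1)

n_DE <- 1000       # number of DE genes in the scenario above
n_not_DE <- 16128  # number of not DE genes in the scenario above

# Same construction as in STEP 2
not_DE_multiplier <- runif(n = n_not_DE, min = 0.251, max = 3.999)
DE_multiplier <- c(runif(n = 400, min = 0.0001, max = 0.25),
                   runif(n = 600, min = 4, max = 100))
fold_change_multiplier <- c(DE_multiplier, not_DE_multiplier)

# Ground truth implied by the construction: the first 1000 genes are DE
is_DE_true <- c(rep(TRUE, n_DE), rep(FALSE, n_not_DE))

# The same labels can be recovered from the multipliers and the thresholds
is_DE_from_fc <- fold_change_multiplier <= 0.25 | fold_change_multiplier >= 4

# Sanity checks: one multiplier per gene, and both label vectors agree
stopifnot(length(fold_change_multiplier) == n_DE + n_not_DE)
stopifnot(identical(is_DE_true, is_DE_from_fc))
table(is_DE_true)
```

Keeping a vector such as `is_DE_true` next to the simulated count matrix makes the later comparison with the output of a DE tool a matter of simple logical operations.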
Depending on the simulation scenario of interest, there are several elements to consider.
First, the magnitude of the fold-change values considerably affects the resulting simulated data. Fold-change values very far from 1 result in clearly detectable DE genes, while values very close to 1 create genes with almost undetectable differences in expression level across conditions.
Second, the fold-change threshold used to separate DE from not DE genes is application specific. Common fold-change thresholds for up-regulation are 2 or 4 (with 0.5 or 0.25 for down-regulation, respectively), but specific scenarios may require ad hoc thresholds.
Third, genes having very low expression levels are often ignored in DE analysis. Thus, applying very small fold-change values to lowly expressed genes would create DE genes that would probably be ignored or filtered out by many DE analysis tools.
Fourth, fold-change values could be set according to gene expression levels. For example, it is reasonable to think that genes that are already lowly expressed would mainly increase their expression levels in a DE scenario, and similarly that very highly expressed genes would mainly decrease their expression levels. Thus, it would be reasonable to set the fold-change values such that lowly expressed genes have only fold-change values > 1, while highly expressed genes have only fold-change values < 1.
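The fourth consideration can be sketched as follows. This is only an illustration under stated assumptions: the intensity vector is a toy stand-in for the intensity element of a simulation parameter, and the 10% and 90% quantile cut-offs used to define "low" and "high" expression are arbitrary choices.

```r
# Arbitrary seed, used only to make this sketch reproducible
set.seed(42)

# Toy intensity vector (assumption): in practice this would be the
# intensity element of a SPARSim simulation parameter
n_genes <- 1000
intensity <- rexp(n_genes, rate = 0.01)

# Hypothetical definition of low and high expression via quantiles
is_low <- intensity <= quantile(intensity, 0.10)
is_high <- intensity >= quantile(intensity, 0.90)

# Start from "no change" (fold change 1) for every gene
fc_multiplier <- rep(1, n_genes)

# Lowly expressed genes are only up-regulated (fold change > 1)
fc_multiplier[is_low] <- runif(sum(is_low), min = 4, max = 100)

# Highly expressed genes are only down-regulated (fold change < 1)
fc_multiplier[is_high] <- runif(sum(is_high), min = 0.0001, max = 0.25)

# Number of up-regulated, down-regulated and unchanged genes
table(sign(log2(fc_multiplier)))
```

The resulting `fc_multiplier` has one value per gene and could be used as the fc_multiplier argument of SPARSim_create_DE_genes_parameter(), provided its length matches the number of genes in the template parameter.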
One of the main characteristics of scRNA-seq datasets is the presence of multiple cell types/conditions within the same sequencing experiment. In order to simulate datasets with different cell types, SPARSim requires the definition of different simulation parameter sets, one for each cell type to simulate. As already explained in the previous sections, a simulation parameter set (intensity, variability and library size vectors) can be obtained in four different ways: direct specification by the user, estimation from a real count table, use of one of the available parameter presets, or any combination of the previous 3 options. In the following, we explain how to use the obtained parameter sets to simulate different cell types.
The most immediate way to simulate multiple cell types consists in combining multiple parameter presets among the ones available in the SPARSim parameter database, or in using a parameter preset describing multiple cell types. For example, consider the parameter preset from Chu data, describing 6 time steps (0h, 12h, 24h, 36h, 72h, 96h) in the differentiation process from Human pluripotent stem cells to definitive endoderm. Using the Chu parameter preset, let us simulate a count matrix having 3 cell types corresponding to Human pluripotent stem cells (0h), definitive endoderm (96h) and an intermediate cell state (24h):
# Load Chu parameter preset
data(Chu_param_preset)
# Get parameter preset for the Human pluripotent stem cells (0h)
cell_type_1_param <- Chu_param_preset$Chu_C1
# Get parameter preset for the intermediate cell state (24h)
cell_type_2_param <- Chu_param_preset$Chu_C3
# Get parameter preset for the definitive endoderm (96h)
cell_type_3_param <- Chu_param_preset$Chu_C6
# Create the global parameter set
SPARSim_3_cell_types_params <- list(cell_type_1 = cell_type_1_param,
cell_type_2 = cell_type_2_param,
cell_type_3 = cell_type_3_param)
# Run SPARSim simulation
SPARSim_result <- SPARSim_simulation(SPARSim_3_cell_types_params)
If the parameter presets in the database are not sufficient or do not fit the user's needs, a second way to simulate multiple cell types consists in estimating simulation parameters from a real count table containing multiple cell types and/or from multiple count tables describing the cell types of interest.
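When a single real count table contains several cell types, a first practical step is to group its columns by cell type. The sketch below assumes that the estimation procedure needs, for each cell type, the indices of the columns (cells) belonging to it; the label vector is a toy example, and how the resulting list is handed to the estimation procedure of Section 2.1 is not shown here.

```r
# Toy cell type labels, one per column of a hypothetical count table
cell_labels <- c("typeA", "typeA", "typeB", "typeA",
                 "typeC", "typeB", "typeC", "typeC")

# Named list with the column indices of each cell type
conditions <- split(seq_along(cell_labels), cell_labels)
conditions

# Number of cells available for each cell type
lengths(conditions)
```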
A further alternative to the above-mentioned procedures consists in starting from a generic simulation parameter set and then generating one or more new parameter sets from it, using one or a combination of the following strategies:
changing average expression levels (i.e. introducing DE genes)
changing gene variability
introducing bimodality for some genes
Strategy A. - Simulating multiple cell types by introducing DE genes
This approach corresponds to the generation of DE genes, motivated by the fact that different cell types are often characterized by different expression levels for a set of marker genes. This can be easily done by applying the SPARSim_create_DE_genes_parameter()
built-in function to an existing simulation parameter used as template, following Section 5.4.3 of this guide.
For example, suppose that we want to use the definitive endoderm (96h) cells of Chu dataset as starting template parameter; let’s call this cell type “A”. Then, let’s simulate 2 cell types from it: 100 cells of type “B” and 150 cells of type “C”.
In order to do this, we could use the following code:
## STEP 1: load the simulation parameters used as template
# Load Chu data parameter preset
data(Chu_param_preset)
# Get parameter preset of definitive endoderm (96h) cells, i.e. cell type "A"
cell_type_A <- Chu_param_preset$Chu_C6
## STEP 2: define the fold change values for the marker genes
## (i.e. genes that are DE across different cell types)
# cell type B: assume that the first 500 genes are the marker genes between cell type "A" and cell type "B",
# while the remaining 17282 genes share a common expression level
DE_multiplier_B <- c( runif(n = 250, min = 0.0001, max = 0.25), runif(n = 250, min = 4, max = 100) )
fold_change_multiplier_B <- c(DE_multiplier_B, rep(1, 17282))
# cell type C: assume that the last 500 genes are the marker genes between cell type "A" and cell type "C",
# while the first 17282 genes share a common expression level
DE_multiplier_C <- c( runif(n = 250, min = 0.0001, max = 0.25), runif(n = 250, min = 4, max = 100) )
fold_change_multiplier_C <- c(rep(1, 17282), DE_multiplier_C)
## STEP 3: create simulation parameter for cell type "B" and "C"
cell_type_B <- SPARSim_create_DE_genes_parameter(
sim_param = cell_type_A,
fc_multiplier = fold_change_multiplier_B,
N_cells = 100,
condition_name = "cell_B")
cell_type_C <- SPARSim_create_DE_genes_parameter(
sim_param = cell_type_A,
fc_multiplier = fold_change_multiplier_C,
N_cells = 150,
condition_name = "cell_C")
## STEP 4: perform SPARSim simulation
# Create the global parameter set
SPARSim_3_cell_types_params <- list(cell_type_1 = cell_type_A,
cell_type_2 = cell_type_B,
cell_type_3 = cell_type_C)
# Run SPARSim simulation
SPARSim_result <- SPARSim_simulation(SPARSim_3_cell_types_params)
Strategy B. - Simulating multiple cell types by introducing biological variability
Strategy B mimics the fact that a cell type can constitute a variation of another one when its transcriptional noise increases. For example, increased variability in gene expression levels is observed in cancer cells compared to healthy cells, due to the increased gene expression noise induced by cancer.
To run a simulation that mimics this phenomenon, a strategy similar to the one proposed in the previous subsection can be adopted. Besides the creation of DE genes, the variability values can be modified to increase the biological noise of the existing genes. This can be obtained, for example, by multiplying the original variability vector by a vector of multipliers.
For example, suppose that we want to use the Human pluripotent stem cells (0h) of the Chu dataset as starting template parameter; let’s call this cell type “A”. Using cell type “A” as a reference, let’s simulate 100 cells of a new cell type “B”. Cell type “B” will have 400 DE genes, and the variability of every gene will be increased by up to 50% relative to its value in cell type “A”.
The scenario just described could be simulated as follows:
## STEP 1: load the simulation parameters used as template
# Load Chu data parameter preset
data(Chu_param_preset)
# Get parameter preset of Human pluripotent stem cells (0h) cells, i.e. cell type "A"
cell_type_A <- Chu_param_preset$Chu_C1
## STEP 2: define the fold change values for the 400 marker genes
## (i.e. genes that are DE across "A" and "B")
# assume that the first 400 genes are the marker genes between cell type "A" and cell type "B",
# while the remaining 17382 genes share a common expression level
DE_multiplier_B <- c( runif(n = 200, min = 0.0001, max = 0.25), runif(n = 200, min = 4, max = 100) )
fold_change_multiplier_B <- c(DE_multiplier_B, rep(1, 17382))
## STEP 3: create simulation parameter for cell type "B" with the DE genes
cell_type_B <- SPARSim_create_DE_genes_parameter(
sim_param = cell_type_A,
fc_multiplier = fold_change_multiplier_B,
N_cells = 100,
condition_name = "cell_B")
## STEP 4: increase transcriptional noise by up to 50% of the original values
# for each of the 17782 genes, increase the corresponding variability value by up to 50%
# by applying a multiplicative factor between 1 and 1.5
cell_type_B$variability <- cell_type_B$variability * runif(n = 17782, min = 1, max = 1.5)
## STEP 5: perform SPARSim simulation
# Create the global parameter set
SPARSim_2_cell_types_params <- list(cell_type_1 = cell_type_A,
cell_type_2 = cell_type_B)
# Run SPARSim simulation
SPARSim_result <- SPARSim_simulation(SPARSim_2_cell_types_params)
Strategy C. - Simulating multiple cell types by introducing bimodal gene expression
The last approach corresponds to forcing some genes to have a bimodal gene expression, which can be easily done using the built-in SPARSim function introduced in Section 5.3.
Simulated data are particularly useful to test the performance of bioinformatics preprocessing/analysis methods, since the known ground truth could be exploited to assess methods output.
In the following sections, we describe some applications of SPARSim simulated data to the assessment of common preprocessing/analysis methods such as differential expression (DE) analysis, zero-imputation, normalization and cells clustering.
When simulating scRNA-seq count data containing DE genes with SPARSim, it is the user who defines which genes (and hence how many genes) are simulated as DE (e.g. exploiting the fold-change method described in section 6.1.1). This information is fundamental for the DE tool assessment task, since it represents the ground truth to exploit in the performance evaluation.
DE tools usually require a minimal input: the count matrix (raw or normalized, depending on the specific DE tool) and some information about the association between matrix columns and experimental conditions. Using SPARSim, both inputs of a DE tool are available: the count matrix is the main output of the SPARSim simulator, while the experimental conditions are part of SPARSim input, and so known by the user. As output, DE tools provide a list of DE genes and often some additional data (e.g. p-values, estimated fold-changes, etc.).
Typical assessment metrics for DE tools are precision (i.e. Positive Predictive Value (PPV)), recall (i.e. True Positive Rate (TPR)) and accuracy, defined as follows:
\(precision = TP / (TP + FP)\)
\(recall = TP / (TP + FN)\)
\(accuracy = (TP + TN) / (TP + FP + TN + FN)\)
where:
TP (True Positive) is the number of genes SIMULATED as DE and CALLED as DE by the DE tool
FP (False Positive) is the number of genes SIMULATED as NOT DE and CALLED as DE by the DE tool
TN (True Negative) is the number of genes SIMULATED as NOT DE and CALLED as NOT DE by the DE tool
FN (False Negative) is the number of genes SIMULATED as DE and CALLED as NOT DE by the DE tool
Working with SPARSim simulated data, the computation of the above metrics is straightforward, since the TP, FP, TN and FN values can be easily computed from the data.
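As a minimal sketch of this computation, the example below uses two toy logical vectors: `simulated_DE` plays the role of the ground truth known from the simulation design, and `called_DE` plays the role of the output of a DE tool. Both vectors and their names are illustrative assumptions.

```r
# Toy ground truth: 10 genes, the first 4 simulated as DE (assumption)
simulated_DE <- c(TRUE, TRUE, TRUE, TRUE, FALSE,
                  FALSE, FALSE, FALSE, FALSE, FALSE)

# Hypothetical calls made by a DE tool on the same 10 genes
called_DE <- c(TRUE, TRUE, TRUE, FALSE, TRUE,
               FALSE, FALSE, FALSE, FALSE, FALSE)

# Entries of the confusion matrix
TP <- sum(simulated_DE & called_DE)    # simulated DE, called DE
FP <- sum(!simulated_DE & called_DE)   # simulated not DE, called DE
TN <- sum(!simulated_DE & !called_DE)  # simulated not DE, called not DE
FN <- sum(simulated_DE & !called_DE)   # simulated DE, called not DE

# Metrics as defined above
precision <- TP / (TP + FP)
recall <- TP / (TP + FN)
accuracy <- (TP + TN) / (TP + FP + TN + FN)

c(precision = precision, recall = recall, accuracy = accuracy)
```

With these toy vectors TP = 3, FP = 1, TN = 5 and FN = 1, giving a precision of 0.75, a recall of 0.75 and an accuracy of 0.8.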
If the DE tool also provides p-values for the analyzed genes, additional performance evaluations become possible, for example computing the PR curve (i.e. Precision-Recall curve) and the AUPRC (i.e. Area Under the PR curve).
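A PR curve can be obtained by ranking genes by p-value and computing precision and recall at every rank. The sketch below does this in base R on toy data and summarises the curve with the average precision, which is one common approximation of the AUPRC and not the only possible one. The data are invented for illustration, and ties among p-values are not treated specially.

```r
# Toy ground truth and hypothetical p-values returned by a DE tool
simulated_DE <- c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE, TRUE, FALSE)
p_values <- c(0.001, 0.004, 0.01, 0.02, 0.2, 0.35, 0.5, 0.9)

# Rank genes from most to least significant
ord <- order(p_values)
truth_sorted <- simulated_DE[ord]

# Precision and recall when calling DE the top k genes, for k = 1, 2, ...
tp_cum <- cumsum(truth_sorted)
precision <- tp_cum / seq_along(truth_sorted)
recall <- tp_cum / sum(truth_sorted)

# Average precision: mean of the precision values at the ranks
# where a truly DE gene is retrieved
AUPRC_approx <- sum(precision[truth_sorted]) / sum(truth_sorted)
AUPRC_approx

# The curve itself could be displayed with:
# plot(recall, precision, type = "s")
```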
When assessing a DE tool, and more generally when benchmarking several DE methods, it is important to carefully design the simulated data, trying to cover as many scenarios as possible. In particular, it could be useful to generate simulated data having:
different number/percentage of DE genes
different number of samples in an experimental condition
different fold change levels (if using a fold-change criterion to simulate DE genes)
Simulating different scenarios allows studying the strengths and weaknesses of DE methods, and provides a fairer and more robust way to evaluate tool performance.
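One simple way to organise such a benchmark is to enumerate all the combinations of the design factors listed above. The values below are arbitrary examples, not recommendations; each row of the resulting table describes one scenario for which a simulation parameter could then be built.

```r
# Hypothetical design factors for a DE benchmark
scenario_grid <- expand.grid(
  n_DE_genes = c(500, 1000, 2000),           # number of DE genes
  n_cells_per_condition = c(50, 100, 200),   # samples per condition
  fc_threshold = c(2, 4)                     # fold-change threshold for up-regulation
)

# One row per scenario: 3 x 3 x 2 = 18 scenarios
nrow(scenario_grid)
head(scenario_grid)
```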
One of the most characteristic features of scRNA-seq count tables is their strong sparsity. The zero values present in these data can be linked to both biological and technical causes. Indeed, while a portion of the zeros actually represents genes having null expression values (biological zeros), many of them are artificial values introduced by the sequencing procedure (technical zeros) as a consequence of the co-occurrence of two main factors: the fixed and limited sampling size (i.e. the combined effect of capture efficiency and sequencing depth) and the heavy skewness of gene expression values within each sample. So, technical zeros represent a portion of information that was lost during gene expression measurement.
To address this issue, many zero-imputation tools have been released in recent years that try to recover the information lost during the experimental procedure. In this framework, researchers may be interested in benchmarking already available zero-imputation tools or in testing the performance of a newly developed method. SPARSim allows users to create their own set of synthetic datasets on which to perform these tests, with the fundamental feature of providing the gold standard expression values to use when evaluating the results of zero-imputation methods. In particular, the above mentioned gene_matrix
matrix provides the TRUE gene expression values (unknown when performing a real experiment) prior to the sequencing experiment. As a consequence, gene_matrix
matrix can be used as the ground truth about gene expression level when assessing the count tables processed with one or more zero-imputation methods.
As an example, one may want to test how the performance of different tools changes when varying the sparsity level of the data, e.g. when simulating the sequencing of the same data with different sequencing depths. This can be simulated by producing a set of simulations in which, starting from the same intensity and variability values and the same number of conditions and replicates (and, as a consequence, of samples), the lib_size
values diminish progressively from the first to the last scenario, causing the related dataset sparsity to grow. After having applied the selected zero-imputation tool(s), the obtained SPARSim count table(s) can be compared with the ground truth to test for the goodness in recovering lost information.
This could be done, for example, by:
comparing the heatmaps on presence/absence data obtained from the ground truth and the zero-imputed data
computing similarity measure between ground truth and zero-imputed data (e.g. Sum of squared error, Pearson correlation coefficient, etc.)
studying how the biological structure of the data is maintained (e.g. comparing clustering and/or dimensionality reduction results based on gene_matrix
matrix and the zero-imputed count tables).
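The similarity measures mentioned above can be sketched on toy data as follows. Everything in this example is an assumption made for illustration: `true_expr` stands in for the gene_matrix matrix, the dropout is generated at random, and the "imputation" is a deliberately naive gene-wise mean replacement rather than a real zero-imputation tool.

```r
# Arbitrary seed, used only to make this sketch reproducible
set.seed(7)
n_genes <- 50
n_cells <- 20

# Toy stand-in for the ground truth expression values
true_expr <- matrix(rpois(n_genes * n_cells, lambda = 5), nrow = n_genes)

# Toy observed counts: about 40% of the entries are lost (technical zeros)
dropout <- matrix(runif(n_genes * n_cells) < 0.4, nrow = n_genes)
observed <- true_expr
observed[dropout] <- 0

# Naive, hypothetical imputation: replace each zero with the mean of the
# non-zero observed values of the same gene
imputed <- observed
for (g in seq_len(n_genes)) {
  zero_idx <- observed[g, ] == 0
  nonzero <- observed[g, !zero_idx]
  if (any(zero_idx) && length(nonzero) > 0) {
    imputed[g, zero_idx] <- mean(nonzero)
  }
}

# Evaluation measures against the ground truth
sparsity <- function(m) mean(m == 0)
sse <- function(a, b) sum((a - b)^2)
pearson <- function(a, b) cor(as.vector(a), as.vector(b))

data.frame(
  data = c("observed", "imputed"),
  sparsity = c(sparsity(observed), sparsity(imputed)),
  SSE = c(sse(true_expr, observed), sse(true_expr, imputed)),
  pearson = c(pearson(true_expr, observed), pearson(true_expr, imputed))
)
```

Note that this naive imputation also overwrites biological zeros, which is exactly the kind of behaviour that a comparison with the ground truth helps to reveal.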
A typical preprocessing step in the bioinformatics analysis of scRNA-seq count data is count normalization. Count normalization methods take as input a raw count matrix and provide as output a normalized count table. The main goal of count normalization is removing/mitigating technical effects that can negatively affect the accuracy of downstream analyses. Thus, a normalized count matrix should provide a more accurate representation of the “true” gene expression level compared to a raw count matrix. Since the “true” gene expression level is known using simulated data, SPARSim data can be easily used to assess/benchmark scRNA-seq count normalization methods.
A first way to assess normalization methods is measuring how close the normalized count matrix is to the real expression level (i.e. the gene_matrix
matrix provided as output of SPARSim simulation). The difference between the two matrices can be computed in several ways, for example
gene_matrix
matrix, the normalized count table and the raw count matrix).
An alternative way to assess normalization methods is in terms of how they improve downstream analyses. For example, different normalization methods could be studied in terms of how they improve the accuracy of DE analyses or cell clustering. In this case, the assessment metrics used to evaluate the downstream analyses act as a proxy of normalization method performance.
Please note that the above ways to assess count normalization methods are consistent with what is done in the literature for studying the performance of bulk RNA-seq normalization methods. Among the covariates of interest for studying normalization method performance, a major role is played by the level of technical noise (differences in sequencing depth, amount of zero entries, etc.), the number of different experimental conditions in the same count table and the number of cells belonging to each experimental condition.
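The first kind of assessment (closeness to the true expression level) can be sketched as follows. The data are a toy construction: `true_expr` stands in for the gene_matrix matrix, the measured values are obtained by scaling each cell by a random efficiency without any sampling noise, and the normalization is a plain library-size scaling chosen only as an example.

```r
# Arbitrary seed, used only to make this sketch reproducible
set.seed(11)
n_genes <- 200
n_cells <- 30

# Toy stand-in for the true expression levels
gene_means <- rgamma(n_genes, shape = 1, scale = 100)
true_expr <- matrix(rpois(n_genes * n_cells,
                          lambda = rep(gene_means, times = n_cells)),
                    nrow = n_genes)

# Toy technical effect: each cell is measured with a different efficiency
efficiency <- runif(n_cells, min = 0.1, max = 1)
raw_values <- sweep(true_expr, 2, efficiency, "*")

# Example normalization: scale every cell to the median library size
lib_size <- colSums(raw_values)
size_factor <- lib_size / median(lib_size)
norm_values <- sweep(raw_values, 2, size_factor, "/")

# Closeness to the ground truth, before and after normalization
cor_raw <- cor(as.vector(true_expr), as.vector(raw_values))
cor_norm <- cor(as.vector(true_expr), as.vector(norm_values))
c(raw = cor_raw, normalized = cor_norm)
```

In this toy setting the only technical effect is a cell-specific scaling, so library-size scaling is expected to recover the true values almost perfectly; on SPARSim simulated data the gap between methods would depend on the simulated technical noise.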
Among the several analyses available to explore and extract information from scRNA-seq datasets, clustering analysis is one of the most commonly performed. The principal aim of this class of methods is to group the input samples into sets of observations that share similar characteristics, i.e. samples that exhibit similar gene expression profiles. This allows researchers to detect the presence of cell subpopulations within the considered dataset or, when subpopulations (labels) are known, to investigate whether all the cells belonging to a specific cell type show similar expression profiles or further cluster into subgroups.
In this context, SPARSim could be a useful tool for clustering methods benchmarking. Indeed, it allows the user to generate datasets in which subpopulations are defined a priori, i.e. via simulation parameters specification. That is, data simulated with SPARSim carry the labels describing the subpopulation membership that can be used as ground truth when comparing different clustering methods performance.
Silhouette value, pair-wise Precision and Recall, Mutual Information, Rand Index are only a small part of the plethora of metrics that are available for clustering results evaluation. For the sake of brevity and without loss of generality, we report in the following a practical example to show how SPARSim can be applied for performing the assessment or benchmarking of clustering approaches.
Suppose we have already generated a dataset containing 15 samples from 3 different cell subpopulations (A, B, C) following the procedure introduced in Section 5.5 and suppose we simulated the following set of membership labels for the generated samples:
Samples 1-5: subpopulation A
Samples 6-10: subpopulation B
Samples 11-15: subpopulation C
These labels will represent the ground truth for clustering results performance testing.
Suppose now we have run two different clustering methods that resulted in the following clustering results:
Sample | Clustering results - Method 1 | Clustering results - Method 2 |
---|---|---|
Sample 1 | A | A |
Sample 2 | A | A |
Sample 3 | A | A |
Sample 4 | C | A |
Sample 5 | A | A |
Sample 6 | B | B |
Sample 7 | B | B |
Sample 8 | B | B |
Sample 9 | B | B |
Sample 10 | B | C |
Sample 11 | C | C |
Sample 12 | A | C |
Sample 13 | C | C |
Sample 14 | C | C |
Sample 15 | C | C |
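We can now evaluate and compare the accuracy of the two clustering methods based on one of the aforementioned metrics. For example, we could use the Adjusted Rand Index (computed by the adjustedRandIndex() function of the mclust package) to compare the labels estimated by each method with the ground truth labels:
library(mclust)
# Ground truth labels: samples 1-5 in A, samples 6-10 in B, samples 11-15 in C
true <- rep(c("A", "B", "C"), each = 5)
# Labels assigned by the two clustering methods (see the table above)
Cluster_method_1 <- c("A", "A", "A", "C", "A", "B", "B", "B", "B", "B", "C", "A", "C", "C", "C")
Cluster_method_2 <- c("A", "A", "A", "A", "A", "B", "B", "B", "B", "C", "C", "C", "C", "C", "C")
R_clu1 <- adjustedRandIndex(true, Cluster_method_1)
round(R_clu1, digits = 3)
## [1] 0.627
R_clu2 <- adjustedRandIndex(true, Cluster_method_2)
round(R_clu2, digits = 3)
## [1] 0.792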
Results indicate that Cluster Method 2 (Adjusted Rand Index 0.792) outperforms Cluster Method 1 (Adjusted Rand Index 0.627) in correctly assigning the subpopulation labels to the considered dataset. As already introduced, a broad set of metrics is available and could be used as evaluation measures in this testing framework. This simple toy example is intended to suggest how to take advantage of SPARSim simulated results when evaluating or comparing the performance of clustering methods.
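For readers who prefer not to depend on an extra package, the same comparison can be sketched in base R. The example below restates the label vectors from the table so that it runs on its own, prints the contingency tables, and computes the plain (unadjusted) Rand Index, i.e. the fraction of sample pairs on which the clustering and the ground truth agree. The helper `rand_index()` is a hypothetical function written for this illustration.

```r
# Ground truth and clustering results, as in the table above
true <- rep(c("A", "B", "C"), each = 5)
Cluster_method_1 <- c("A", "A", "A", "C", "A", "B", "B", "B", "B", "B",
                      "C", "A", "C", "C", "C")
Cluster_method_2 <- c("A", "A", "A", "A", "A", "B", "B", "B", "B", "C",
                      "C", "C", "C", "C", "C")

# Contingency tables: rows are true labels, columns are assigned labels
table(true, Cluster_method_1)
table(true, Cluster_method_2)

# Hypothetical helper: unadjusted Rand Index over all pairs of samples
rand_index <- function(x, y) {
  pairs <- combn(length(x), 2)
  same_x <- x[pairs[1, ]] == x[pairs[2, ]]
  same_y <- y[pairs[1, ]] == y[pairs[2, ]]
  mean(same_x == same_y)
}

rand_index(true, Cluster_method_1)  # 89 / 105
rand_index(true, Cluster_method_2)  # 96 / 105
```

The unadjusted index is higher than the adjusted one for both methods because it does not correct for agreement expected by chance, but the ranking of the two methods is the same.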
As previously introduced, SPARSim includes a parameter database containing the simulation parameters obtained from 12 different datasets (Tung et al., 2017; Camp et al., 2015; Engel et al., 2016; Horning et al., 2018; Chu et al., 2016; Bacher et al., 2017; Zheng et al., 2017; Macosko et al., 2015; Saunders et al., 2018; 10X Genomics example datasets) using SPARSim parameters estimation procedure. Considering the single experimental conditions of the 12 datasets, the database contains a total of 40 parameter presets.
The parameter presets allow simulating count tables resembling the characteristics (i.e. count intensity/variability, sparsity, read/UMI count per cell, number of cells and number of genes) of the original count matrices from which they were estimated.
If a user wants to start his/her simulation procedure from a given preset present in the database, he/she will only need to run SPARSim simulation giving the chosen preset as input. As an example, if the user wants to simulate a count matrix having similar characteristics to the one in Chu study, he/she would run the following code:
# Load Chu preset
data(Chu_param_preset)
# Run SPARSim simulation using the loaded parameter preset
sim_result <- SPARSim_simulation(dataset_parameter = Chu_param_preset)
Of course, as previously explained in Section 2.4, the user could also use a part of the pre-coded information that can be found in the preset and substitute the other elements of the list (e.g. lib_size
), to personalize the simulation according to his/her needs.
In the following, we include a summary table in which the main characteristics of each count table (from which the parameter presets were estimated) are reported. In particular, for each preset the user will find details on:
The preset accession name (i.e. name of the R object)
The number of features (i.e. genes) in the original count table
The number of samples (i.e. cells) in the original count table
The sparsity (i.e. percentage of zero entries) in the original count table
The number of experimental conditions in the original count table
The species in the original count table
The cell types in the original count table
The platform and protocol used to generate the original data
The information as to whether or not UMIs were used in the experiment.
Preset name | # features | # cells | Sparsity | # experimental conditions | Species | Cell types | Platform/ Protocol | UMIs |
---|---|---|---|---|---|---|---|---|
Tung_param_preset | 18464 | 564 | ~53.16% | 8 | Human | Induced pluripotent stem cells | Fluidigm C1, Modified SMARTer | Yes |
Camp_param_preset | 39913 | 434 | ~84.00% | 6 | Human | Whole brain organoids, Microdissected Cerebral Organoids | Fluidigm C1, SMARTer | No |
Engel_param_preset | 25865 | 203 | ~68.59% | 4 | Mouse | Natural killer T cells | Flow cytometry, Modified Smart-seq2 | No |
Chu_param_preset | 17782 | 758 | ~51.30% | 6 | Human | iPSC to Endoderm | Fluidigm C1, SMARTer | No |
Horning_param_preset | 37347 | 144 | ~39.25% | 3 | Human | LNCaP (prostate cancer) | Smart-seq2 | No |
Bacher_param_preset | 17128 | 366 | ~39.08% | 4 | Human | Embryonic stem cells | Fluidigm C1, SMARTer | No |
Brain_10X_param_preset | 21080 | 11843 | ~87.37% | 1 | Mouse | Brain cells | 10X Genomics | Yes |
T_10X_param_preset | 20267 | 8093 | ~94.59% | 1 | Human | Pan T cells | 10X Genomics | Yes |
PBMC_10X_param_preset | 17220 | 5419 | ~95.66% | 1 | Human | Peripheral blood mononuclear cells | 10X Genomics | Yes |
Zheng_param_preset | 19536 | 3388 | ~82.56% | 4 | Human | Jurkat and 293T cells | 10X Genomics | Yes |
Macosko_param_preset | 20230 | 9000 | ~96.54% | 1 | Mouse | Retinal cells | Drop-Seq | Yes |
Saunders_param_preset | 19940 | 5688 | ~94.10% | 1 | Mouse | Polydendrocytes | Drop-Seq | Yes |
An exhaustive description of each dataset used to generate the included presets can be found in the publication/website related to each original dataset, in the Supplementary File of SPARSim paper, or in the documentation of the R package using the command ?<preset name>
(e.g. the description of Chu parameter preset is available using the command ?Chu_param_preset
).
As explained in Section 2.5, SPARSim simulation requires the input parameters to be included in an R list, in which each element contains the information needed to simulate a single experimental condition. The parameter presets included in this database are already formatted to fit SPARSim requirements. In particular, every preset is a list with one element per experimental condition; each experimental condition has four elements: intensity
, variability
, lib_size
and name
. The intensity
and variability
elements contain the gene expression level intensities and variabilities used to simulate the related experimental condition; lib_size
contains the library sizes (i.e. overall read/UMI counts per cell) used to simulate the samples belonging to that experimental condition; name
contains the name associated to that experimental condition.
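The structure just described, and the kind of personalisation mentioned earlier (keeping part of a preset and substituting other elements), can be illustrated with a toy list. The object `toy_preset` below is a small stand-in built by hand for this sketch, with invented values and only 5 genes; a real preset would be loaded with data() and would contain thousands of genes.

```r
# Arbitrary seed, used only to make this sketch reproducible
set.seed(3)

# Toy stand-in mimicking the layout of a parameter preset (assumption):
# one list element per experimental condition, four elements in each
toy_preset <- list(
  cond_1 = list(intensity = runif(5, min = 0, max = 100),
                variability = runif(5, min = 0, max = 1),
                lib_size = round(runif(4, min = 1e4, max = 5e4)),
                name = "cond_1"),
  cond_2 = list(intensity = runif(5, min = 0, max = 100),
                variability = runif(5, min = 0, max = 1),
                lib_size = round(runif(6, min = 1e4, max = 5e4)),
                name = "cond_2")
)

# Number of cells that each condition would simulate
sapply(toy_preset, function(cond) length(cond$lib_size))

# Personalise the first condition: keep intensity and variability,
# replace the library sizes to simulate 10 cells of 20000 counts each
custom_param <- toy_preset
custom_param$cond_1$lib_size <- rep(20000, 10)
str(custom_param$cond_1)
```

Since the number of simulated cells in a condition follows from the length of its lib_size vector, changing that vector is enough to change both the sequencing depth and the number of cells.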