Object - Mock RNA-seq • HTRfit

library(HTRfit)

In this section, you will explore how to generate RNA-seq data based on the list var object. The mock_rnaseq() function enables you to manage parameters in your RNA-seq design, including number of genes, minimum and maximum number of replicates within your experimental setup, sequencing depth, basal expression of each gene, and dispersion of gene expression used for simulating counts.

In addition, we will introduce how to generate complex RNA-seq design with diverse gene behaviors by combining several mock_obj into a single one.

Minimal example

list_var <- init_variable( name = "varA", sd = 0.2, level = 2) %>%
            init_variable( name = "varB", sd = 0.43, level = 2) %>%
            add_interaction( between_var = c("varA", "varB"), sd = 0.2)

## -- Required parameters
N_GENES = 6000
MIN_REPLICATES = 2
MAX_REPLICATES = 10
########################

## -- simulate RNA-seq data based on list_var, minimum input required
## -- number of replicate randomly defined between MIN_REP and MAX_REP
mock_data <- mock_rnaseq(list_var, N_GENES,
                         min_replicates  = MIN_REPLICATES,
                         max_replicates = MAX_REPLICATES)

## -- simulate RNA-seq data based on list_var, minimum input required
## -- Same number of replicates between conditions
mock_data <- mock_rnaseq(list_var, N_GENES,
                         min_replicates  = MAX_REPLICATES,
                         max_replicates = MAX_REPLICATES)

Scaling genes counts based on sequencing depth

Sequencing depth is a critical parameter affecting the statistical power of an RNA-seq analysis. With the sequencing_depth option in the mock_rnaseq() function, you have the ability to control this parameter.

## -- Required parameters
N_GENES = 6000
MIN_REPLICATES = 2
MAX_REPLICATES = 10
########################

SEQ_DEPTH = c(100000, 5000000, 10000000)## -- Possible number of reads/sample
SEQ_DEPTH =  10000000 ## -- all samples have same number of reads
mock_data <- mock_rnaseq(list_var, N_GENES,
                         min_replicates  = MIN_REPLICATES,
                         max_replicates = MAX_REPLICATES,
                         sequencing_depth = SEQ_DEPTH)

Set dispersion of gene expression

The dispersion parameter (\(alpha_i\)), characterizes the relationship between the variance of the observed read counts and its mean value. In simple terms, it quantifies how much we expect observed counts to deviate from the mean value. You can specify the dispersion for individual genes using the dispersion parameter.


## -- Required parameters
N_GENES = 6000
MIN_REPLICATES = 2
MAX_REPLICATES = 4
########################

DISP = 0.1 ## -- Same dispersion for each genes
DISP = 1000 ## -- Same dispersion for each genes
DISP = runif(N_GENES, 0, 1000) ## -- Dispersion can vary between genes
mock_data <- mock_rnaseq(list_var, N_GENES,
                         min_replicates  = MIN_REPLICATES,
                         max_replicates = MAX_REPLICATES,
                         dispersion = DISP  )

Set basal gene expression

The basal gene expression parameter, defined using the basal_expression option, allows you to control the fundamental baseline of expression level of genes. When a single value is specified, the basal expression of all genes is the same and the effects of variables defined in list_var are applied from this basal expression level. When several values are specified, the basal expression of each gene is picked randomly among these values (with replacement). The effects of variables are then applied from the basal expression of each gene. This represents the fact that expression levels can vary among genes even when not considering the effects of the defined variables (for example, some genes are constitutively more highly expressed than others because they have stronger basal promoters).

## -- Required parameters
N_GENES = 6000
MIN_REPLICATES = 10
MAX_REPLICATES = 10
########################

BASAL_EXPR = -3 ## -- Value can be negative to simulate low expressed gene
BASAL_EXPR = 2 ## -- Same basal gene expression for the N_GENES
BASAL_EXPR = c( -3, -1, 2, 8, 9, 10 ) ## -- Basal expression can vary between genes
mock_data <- mock_rnaseq(list_var, N_GENES,
                         min_replicates  = MIN_REPLICATES,
                         max_replicates = MAX_REPLICATES,
                         basal_expression = BASAL_EXPR)

Mock RNA-seq object

str(mock_data, max.level = 1)
#> List of 6
#>  $ settings       :'data.frame': 4 obs. of  2 variables:
#>  $ init           :List of 3
#>  $ groundTruth    :List of 2
#>  $ counts         :'data.frame': 6000 obs. of  40 variables:
#>  $ metadata       :'data.frame': 40 obs. of  3 variables:
#>   ..- attr(*, "out.attrs")=List of 2
#>  $ scaling_factors: NULL

Combine mock RNA-seq object

To generate complex biological experiments with diverse gene behaviors, the combine_mock() function offers the capability to merge multiple mock RNA-seq objects.

N_GENES <- 50
REP_NB <- 4
SEQ_DEPTH = c(100000, 5000000, 10000000)## -- Possible number of reads/sample
BASAL_EXPR = c(2, 3, 4) 
DISP = runif(N_GENES, 0, 1000) ## -- Dispersion can vary between genes
LIST_SD <- c(0.01, 0.04 , 0.07, 0.1, 0.13)

list_mock <- list()
for (SD in LIST_SD) {
  input_var_list <- init_variable(name = "varA", sd = SD, level = 150) %>%
    init_variable(name = "varB", sd = SD, level = 20) %>%
    add_interaction(between_var = c("varA", "varB"), sd =  SD)
  
  ## Use generate_counts = FALSE 
  ## to save computing time and avoid generating something useless
  mock_data <- mock_rnaseq(
                  input_var_list,
                  n_genes = N_GENES/length(LIST_SD),
                  basal_expression = BASAL_EXPR,
                  dispersion = DISP,
                  generate_counts = FALSE) 
  ## append new mock_data in a list
  list_mock <- append(list_mock , list(mock_data))
}

## -- counts are generated with combine_mock from the list_mock obj
mock_data <- combine_mock(list_mock, REP_NB, REP_NB, SEQ_DEPTH)

The combine_mock() function adds a suffix index to each gene, enabling the attribution of genes to their respective group of origin. By plotting the standard deviation observed after simulation, we can easily differentiate each group of origin.

## -- Compute sd of each variable per genes
dt <- data.table::data.table(subset(mock_data$groundTruth$effects, 
                                    select = c("geneID", "varA", "varB", "varA:varB")))
dt_per_genes <- cbind( dt[ , sd(varA), by = 'geneID' ], 
                       dt[ , sd(varB), by = 'geneID' ]$V1, 
                       dt[ , sd(`varA:varB`), by = 'geneID' ]$V1 )
colnames(dt_per_genes) <- c('geneID', 'sd_varA', 'sd_varB', 'sd_varA.varB')

## suffix of geneID allows for reattribution of gene origin
dt_per_genes$group_sd <- sapply(dt_per_genes$geneID, 
                                           function(x) strsplit(x, "[.]")[[1]][2])

## discriminate group of sd
par(mfrow = c(1, 3))
plot(x = dt_per_genes$group_sd, y = dt_per_genes$sd_varA, col= '#ec7063', pch = 19) # red
plot(x = dt_per_genes$group_sd, y = dt_per_genes$sd_varB, col = '#5dade2', pch = 19) # blue
plot(x = dt_per_genes$group_sd, y = dt_per_genes$sd_varA.varB, col= '#58d68d', pch = 19) # green