Skip to contents

In this documentation, we will use HTRfit to simulate various experimental design involving multiple experimental condition. To do so, we will first need to estimate plausible parameter based on publicly available data.

Public dataset for parameters inference

In this section, we utilize publicly available RNAseq data from the public project to estimate parameters for simulating experimental designs using HTRfit. The dataset comprises count data for 30 genotypes across 6428 genes with 2-4 replicates per genotype.

## -- public data embedded in HTRfit
pub_data_fn <- system.file("extdata", "pub_data_rna.tsv", package = "HTRfit") 
pub_data <- read.table(file = pub_data_fn, header = TRUE)
pub_metadata_fn <- system.file("extdata", "pub_metadata.tsv", package = "HTRfit") 
pub_metadata <- read.table(file = pub_metadata_fn,  header = TRUE)

The public project offers comprehensive count data spanning various genotypes and genes. Key statistics about the dataset include:

genes samples genotypes average reads/sample
GSE175398 6428 111 30 6852120
Model fitting on GSE175398 project RNAseq dataset

We fit a generalized linear model (GLM) to the data to estimate plausible parameter distributions. The GLM allows us to estimate coefficients necessary for simulating experimental conditions. The tidy_pub_fit object contains HTRfit’s estimates for each coefficient and gene.

# -- prepare data
## apply + 1 to each counts
## remove genes i with all(k_ij) < 10
data2fit = prepareData2fit(countMatrix = pub_data, 
                           metadata = pub_metadata, 
                           normalization = 'MRN',   
                           response_name = "kij",
                           transform = "x+1", 
                           row_threshold = 10) 
## --  fit a model 
l_tmb <- fitModelParallel(
          formula = kij ~ genotype,
          data = data2fit, 
          group_by = "geneID",
          family = glmmTMB::nbinom2(link = "log"), 
          n.cores = 8)
## -- remove genes with very low AIC
l_genes <- identifyTopFit(l_tmb)
## -- extract results for best genes
tidy_pub_fit <- tidy_results(l_tmb[l_genes])

## Intercept ~= Basal expr 
idx_intercept <- tidy_pub_fit$term == '(Intercept)'
PUB_BASAL_EXPR <- tidy_pub_fit[ idx_intercept , 'estimate' ]
## sd(Beta genotype) = 0.2345902
MAD_OBSERVED <- mad(tidy_pub_fit[ !idx_intercept , 'estimate' ])
## -- dispersion
DISP_PUB  <- glance_tmb(l_tmb)$dispersion

EXPLAIN WHY WE USE MAD ***

Mimicking public RNAseq data

To replicate the experimental setup of the GSE175398 project, we initialize 30 levels of genotypes using the init_variable() function. The estimated intercept obtained from the prior analysis of publicly available data serves as the basal expression level for each gene in our simulated dataset, ensuring consistency with the basal expression levels observed in the public data. Moreover, we reuse the gene dispersion observed in the public data for our simulation. To simplify the example, only 100 genes are simulated. Consequently, the sequencing depth is adjusted to accommodate 100 genes by dividing the sequencing depth observed in the public data by 64.28 (6428/100).

## -- design to simulate
N_GENES <- 100 
MIN_REPLICATES <- 2
MAX_REPLICATES <- 4
SEQ_DEPTH <- mean(colSums(pub_data))/N_GENES

## -- init effects to simulate
input_var_list <- init_variable( name = "genotype", sd = MAD_OBSERVED, level = 30) 
## -- simulate RNAseq data to mimic public data
mock_data <- mock_rnaseq(input_var_list, 
                         n_genes = N_GENES,
                         min_replicates  = MIN_REPLICATES,
                         max_replicates = MAX_REPLICATES,
                         basal_expression = PUB_BASAL_EXPR,
                         sequencing_depth = SEQ_DEPTH,
                         dispersion = DISP_PUB )
Fit model to simulated RNAseq data

We fit a model to the simulated RNAseq data using the fitModelParallel() function. Our model will fit the 30 coefficients of genotypes we simulated.

## -- prepare data simulated to fit
## apply + 1 to each counts
## remove genes i with all(k_ij) < 10
data2fit = prepareData2fit(countMatrix = mock_data$counts, 
                           metadata =  mock_data$metadata, 
                           normalization = 'MRN',   
                           response_name = "kij",
                           transform = "x+1", 
                           row_threshold = 10) 
## -- fit a model to the simulated data
l_tmb <- fitModelParallel(
          formula = kij ~ genotype,
          data = data2fit, 
          group_by = "geneID",
          family = glmmTMB::nbinom2(link = "log"), 
          n.cores = 1, 
          control = glmmTMB::glmmTMBControl(optCtrl=list(iter.max=1e5, 
                                                         eval.max=1e5)))
Evaluate fit obtained from simulated data

Next, we evaluate the fitting results obtained from the simulated RNAseq data by comparing the expected results computed by evaluation_report() with those obtained with fitModelParallel().

## -- eval params for Wald test
ln_FC_threshold <- log(1.25) ## 25% of expression change
alternative_hypothesis <- "greaterAbs"
## -- remove genes with very low AIC
l_genes <- identifyTopFit(l_tmb)
## -- evaluation
resSimu <- evaluation_report( list_tmb = l_tmb,
                              list_gene = l_genes,
                              mock_obj = mock_data,
                              coeff_threshold = ln_FC_threshold, 
                              alt_hypothesis = alternative_hypothesis )
Explore evaluation results - Identity plot

The evaluation_report() function produces an identity plot, offering a visual tool to juxtapose the simulated effects (actual effects) with the model-inferred effects. This graphical representation simplifies assessing how well the model aligns with the values of the simulated effects, enabling a visual analysis of the model’s goodness of fit to the simulated data. For a more quantitative evaluation of inference quality, the R-squared (R^2) and RMSE metrics can be employed.

## -- Model params
resSimu$identity$params

Explore evaluation results - ROC curve

The Receiver Operating Characteristic (ROC) curve is a valuable tool for assessing the performance of classification models, particularly in the context of identifying differentially expressed genes. It provides a graphical representation of the model’s ability to distinguish between genes that are differentially expressed and those that are not, by varying the coeff_threshold and the alt_hypothesis parameters.

## -- ROC curve
resSimu$roc$params

Explore evaluation results - Precision-recall curve

The precision-recall curve (PR curve) illustrates the relationship between precision and recall at various classification thresholds. It is particularly valuable in the context of imbalanced data, where one class is significantly more prevalent than the other. Unlike the ROC curve, which can be influenced by class distribution, the PR curve focuses on the model’s ability to correctly identify examples of the minority class while maintaining high precision.

## -- precision recall by params
resSimu$precision_recall$params

Explore evaluation results - classification metrics

The area under the ROC curve (AUC) provides a single metric that summarizes the model’s overall performance in distinguishing between differentially expressed and non-differentially expressed genes. A higher AUC indicates better model performance. In the case of the ROC curve, this AUC should be compared to the value of 0.5, representing a random classifier. On the other hand, for the PR curve, we compute the pr_AUC_random, serving as a baseline for comparison.

## -- actual/estimate comparison
resSimu$performances$byparams[3, c('description','R2','RMSE')]
description R2 RMSE
genotype 0.8457501 0.1426307

The classification metrics are computed for each fixed parameters (excluding the intercept when skip_eval_intercept = TRUE).

## -- classification metrics
resSimu$performances$byparams[3, c('description',   'pr_AUC',   
                                   'pr_randm_AUC', 'pr_performance_ratio',
                                   'roc_AUC',   'roc_randm_AUC')]
description pr_AUC pr_randm_AUC pr_performance_ratio roc_AUC roc_randm_AUC
genotype 0.9034698 0.5083426 1.777285 0.8716835 0.5

Estimating sd(genotype) with random effects

In this section, we delve into the process of estimating the standard deviation of genotype effects (sd(genotype)) using random effects in HTRfit. By treating genotype as a random effect (kij ~ (1 | genotype)), HTRfit provides estimates for both the (Intercept) and sd_(Intercept) for each gene.

## fit genotype as random effect
l_tmb <- fitModelParallel(
          formula = kij ~ (1 | genotype),
          data = data2fit, 
          group_by = "geneID",
          family = glmmTMB::nbinom2(link = "log"), 
          n.cores = 1, 
          control = glmmTMB::glmmTMBControl(optCtrl=list(iter.max=1e5, 
                                                         eval.max=1e5)))
## -- eval params for Wald test
ln_FC_threshold <- log(1.25) ## 25% of expression change
alternative_hypothesis <- "greaterAbs"
## -- remove genes with very low AIC
l_genes <- identifyTopFit(l_tmb)
## -- evaluation
resSimu <- evaluation_report( list_tmb = l_tmb,
                              list_gene = l_genes,
                              mock_obj = mock_data,
                              coeff_threshold = ln_FC_threshold, 
                              alt_hypothesis = alternative_hypothesis )
Explore evaluation results - Violin plot

To evaluate the consistency between estimated and actual values of the random effect, we employ a violin plot. This graphical representation showcases how well the sd_(Intercept) estimates align with the simulated sd(genotype) value (MAD_OBSERVED = 0.2345902) across genes.

## -- violin
resSimu$violin_rand_params + ggplot2::ylim(c(0, 0.5))

Strategies for improving sd(genotype) estimation

One approach to enhance the estimation of sd(genotype) involves increasing the number of genotype levels. By expanding the number of genotype levels in the simulated data, the model becomes more adept at capturing the inherent variability in genotype effects.
Notice that all other parameters are conserved as before (sequencing depth, replicates number, number of genes, basal gene expression). Only the number of genotype levels is increased to 300 instead of 30.

## -- increase number of level to 300
input_var_list <- init_variable( name = "genotype", sd = MAD_OBSERVED, level = 300) 
## -- simulate RNAseq data 
mock_data <- mock_rnaseq(input_var_list, 
                         n_genes = N_GENES,
                         min_replicates  = MIN_REPLICATES,
                         max_replicates = MAX_REPLICATES,
                         basal_expression = PUB_BASAL_EXPR,
                         sequencing_depth = SEQ_DEPTH,
                         dispersion = DISP_PUB )
## -- prepare data
## apply + 1 to each counts
## remove genes i with all(k_ij) < 10
data2fit = prepareData2fit(countMatrix = mock_data$counts, 
                           metadata =  mock_data$metadata, 
                           normalization = 'MRN',   
                           response_name = "kij",
                           transform = "x+1", 
                           row_threshold = 10) 
## fit random effect
l_tmb <- fitModelParallel(
          formula = kij ~ (1 | genotype),
          data = data2fit, 
          group_by = "geneID",
          family = glmmTMB::nbinom2(link = "log"), 
          n.cores = 1, 
          control = glmmTMB::glmmTMBControl(optCtrl=list(iter.max=1e5, 
                                                         eval.max=1e5)))
## -- eval params for Wald test on fixed 
ln_FC_threshold <- log(1.25) ## 25% of expression change
alternative_hypothesis <- "greaterAbs"
## -- remove genes with very low AIC
l_genes <- identifyTopFit(l_tmb)
## -- evaluation
resSimu <- evaluation_report( list_tmb = l_tmb,
                              list_gene = l_genes,
                              mock_obj = mock_data,
                              coeff_threshold = ln_FC_threshold, 
                              alt_hypothesis = alternative_hypothesis )

Following the adjustment in genotype levels, we reevaluate the model’s performance using the evaluation_report() function. The resulting violin plot demonstrates the improvement in estimating sd(genotype), as evidenced by the reduction in RMSE.

## -- Model params
resSimu$violin_rand_params + ggplot2::ylim(c(0, 0.5))

It’s important to note that since we only fit random effects, classification metrics are not available in this case.

Simulating multiple main effects with HTRfit

In this section, we demonstrate how HTRfit can simulate multiple main effects by “piping” (%>%) the init_variable() function. To illustrate, we introduce a second main effect, environment, with two levels. This allows us to explore the combined effects of genotype and environment on gene expression.

Simulation setup

To maintain consistency with the public dataset from the public project, all effects are initialized around the observed value of MAD_OBSERVED (0.2345902). This ensures that the simulated data closely mirrors real-world observations. Additionally, gene dispersion, basal expression, and sequencing depth observed in the public data are preserved in our simulation. For simplicity, we simulate data for only 100 genes, adjusting the sequencing depth accordingly.

N_GENES <- 100
MIN_REPLICATES <- 4
MAX_REPLICATES <- 4
SEQ_DEPTH <- mean(colSums(pub_data))/N_GENES
## -- init design
input_var_list <- init_variable( name = "genotype", sd = MAD_OBSERVED, level = 30) %>%
                  init_variable( name = "environment", sd = MAD_OBSERVED, level = 2 )
## -- simulate RNAseq data 
mock_data <- mock_rnaseq(input_var_list, 
                         n_genes = N_GENES,
                         min_replicates  = MIN_REPLICATES,
                         max_replicates = MAX_REPLICATES,
                         basal_expression = PUB_BASAL_EXPR,
                         sequencing_depth = SEQ_DEPTH,
                         dispersion = DISP_PUB )
Model fitting and evaluation

We fit a model to the simulated RNAseq data, incorporating both genotype and environment as main effects. Subsequently, we evaluate the model’s performance using metrics and graphical representations available in HTRfit.

## -- prepare data
## apply + 1 to each counts
## remove genes i with all(k_ij) < 10
data2fit = prepareData2fit(countMatrix = mock_data$counts, 
                           metadata =  mock_data$metadata, 
                           normalization = 'MRN',   
                           response_name = "kij",
                           transform = "x+1", 
                           row_threshold = 10)
## fit complex model
l_tmb <- fitModelParallel(
          formula = kij ~ genotype + environment,
          data = data2fit, 
          group_by = "geneID",
          family = glmmTMB::nbinom2(link = "log"), 
          n.cores = 1, 
          control = glmmTMB::glmmTMBControl(optCtrl=list(iter.max=1e5, 
                                                         eval.max=1e5)))
## -- eval params for Wald test
ln_FC_threshold <- log(1.25) ## 25% of expression change
alternative_hypothesis <- "greaterAbs"
## -- remove genes with very low AIC
l_genes <- identifyTopFit(l_tmb)
## -- evaluation
resSimu <- evaluation_report( list_tmb = l_tmb,
                              list_gene = l_genes,
                              mock_obj = mock_data,
                              coeff_threshold = ln_FC_threshold, 
                              alt_hypothesis = alternative_hypothesis )
Explore evaluation results - Identity plot

The identity plot provides a visual tool to assess how well the model aligns with the values of the simulated effects. It enables a straightforward analysis of the model’s goodness of fit to the simulated data.

## -- Model params
resSimu$identity$params

Explore evaluation results - ROC curve

The Receiver Operating Characteristic (ROC) curve evaluates the model’s ability to distinguish between differentially expressed genes. By varying the classification threshold, this curve offers insights into the model’s discriminatory power.

## -- ROC curve
resSimu$roc$params

Explore evaluation results - Precision-recall curve

The precision-recall curve illustrates the trade-off between precision and recall at different classification thresholds. Particularly useful for imbalanced datasets, it highlights the model’s ability to correctly identify examples of the minority class while maintaining high precision.

## -- precision recall by params
resSimu$precision_recall$params

Explore evaluation results - Metrics summary

Finally, we summarize the model’s performance using various metrics, including R-squared (R²) and root mean square error (RMSE) for regression tasks, as well as area under the curve (AUC) for classification tasks.

## -- actual/estimate comparison
resSimu$performances$byparams[3:4, c('description',  'R2', 'RMSE')]
description R2 RMSE
environment 0.9954613 0.0262247
genotype 0.9137330 0.1065405
## -- classification metrics
resSimu$performances$byparams[3:4, c('description', 'pr_AUC',   
                                     'pr_randm_AUC', 'pr_performance_ratio',
                                     'roc_AUC', 'roc_randm_AUC')]
description pr_AUC pr_randm_AUC pr_performance_ratio roc_AUC roc_randm_AUC
environment 0.9944614 0.5384615 1.846857 0.9888241 0.5
genotype 0.9484602 0.5240621 1.809824 0.9213559 0.5

Estimating environment and sd(genotype)

In this section, we utilize a mixed model approach to estimate the fixed effect of environment and the variability of genotype using HTRfit. By specifying the model formula as kij ~ environment + (1 | genotype), we obtain estimates for the fixed effect of environment for each gene, along with an estimate of the variability of genotype (referred to as sd(Intercept) by the model).

Simulation setup

The mixed model incorporates both the fixed effect of environment and the random effect of genotype, allowing us to capture the influence of environmental factors on gene expression while accounting for genotype variability. We fit this model to the simulated RNAseq data and evaluate its performance using various metrics.

## fit complex model
l_tmb <- fitModelParallel(
          formula = kij ~  environment + ( 1 | genotype ) ,
          data = data2fit, 
          group_by = "geneID",
          family = glmmTMB::nbinom2(link = "log"), 
          n.cores = 1, 
          control = glmmTMB::glmmTMBControl(optCtrl=list(iter.max=1e5, 
                                                         eval.max=1e5)))
## -- eval params for Wald test
ln_FC_threshold <- log(1.25) ## 25% of expression change
alternative_hypothesis <- "greaterAbs"
## -- remove genes with very low AIC
l_genes <- identifyTopFit(l_tmb)
## -- evaluation
resSimu <- evaluation_report( list_tmb = l_tmb,
                              list_gene = l_genes,
                              mock_obj = mock_data,
                              coeff_threshold = ln_FC_threshold, 
                              alt_hypothesis = alternative_hypothesis )
Explore evaluation results - ROC curve

Given the inclusion of a fixed effect (environment), the ROC curve provides insights into the model’s ability to discriminate between differentially expressed genes based on environmental factors. By varying the classification threshold, we assess the model’s discriminatory power in distinguishing between gene expression patterns.

## -- ROC curve
resSimu$roc$params

Explore evaluation results - Precision-recall curve

Similarly, the precision-recall curve evaluates the model’s performance in identifying differentially expressed genes across various classification thresholds. This curve offers valuable insights, particularly in scenarios with imbalanced datasets, by highlighting the trade-off between precision and recall.

## -- precision recall by params
resSimu$precision_recall$params

Explore evaluation results - Violin plot

We also visualize the distribution of random parameters estimation. The violin plot provides a graphical representation of proximity between the actual sd(genotype) values and their corresponding estimations provided by the model.

## -- Model params
resSimu$violin_rand_params + ggplot2::ylim(c(0, 0.5))

Explore evaluation results - Metrics summary

Finally, we summarize the model’s performance using key metrics, including R-squared (R²) and root mean square error (RMSE) for regression tasks, as well as area under the curve (AUC) for classification tasks.

## -- actual/estimate comparison
resSimu$performances$byparams[3:4, c('description',  'R2', 'RMSE')]
description R2 RMSE
environment 0.9956017 0.0260911
sd_(Intercept) NA 0.0352088
## -- classification metrics
resSimu$performances$byparams[3, c('description',   'pr_AUC',   
                                   'pr_randm_AUC', 'pr_performance_ratio',
                                   'roc_AUC',   'roc_randm_AUC')]
description pr_AUC pr_randm_AUC pr_performance_ratio roc_AUC roc_randm_AUC
environment 0.9988191 0.5434783 1.837827 0.9985714 0.5

Simulating interactions with HTRfit

HTRfit offers the capability to initialize and simulate complex experimental designs involving multiple variables with varying numbers of levels, as well as interactions between previously initialized main effects. To illustrate this capability, let’s simulate an RNAseq experiment with 30 genotype levels and 2 environmental conditions. Using the add_interaction() function, interactions between each level can also be simulated.

To maintain consistency with observed effects in previous studies, all effects in our simulation are initialized around MAD_OBSERVED (0.2345902), as observed in the public data from the public project. Gene dispersion, basal expression, and sequencing depth observed in the public data are conserved in our simulation. For simplicity, only 100 genes are simulated, and consequently, the sequencing depth is adjusted to accommodate 100 genes.

N_GENES <- 100
MIN_REPLICATES <- 4
MAX_REPLICATES <- 4
SEQ_DEPTH <- mean(colSums(pub_data))/N_GENES

## -- init design
input_var_list <- init_variable( name = "genotype", sd = MAD_OBSERVED, level = 30) %>%
                  init_variable( name = "environment", sd = MAD_OBSERVED, level = 2 ) %>%
                  add_interaction(between_var = c("genotype", "environment"), 
                                  sd = MAD_OBSERVED)
## -- simulate RNAseq data 
mock_data <- mock_rnaseq(input_var_list, 
                         n_genes = N_GENES,
                         min_replicates  = MIN_REPLICATES,
                         max_replicates = MAX_REPLICATES,
                         basal_expression = PUB_BASAL_EXPR,
                         sequencing_depth = SEQ_DEPTH,
                         dispersion = DISP_PUB )
## -- prepare data
## apply + 1 to each counts
## remove genes i with all(k_ij) < 10
data2fit = prepareData2fit(countMatrix = mock_data$counts, 
                           metadata =  mock_data$metadata, 
                           normalization = 'MRN',   
                           response_name = "kij",
                           transform = "x+1",
                           row_threshold = 10) 
## fit complex model
l_tmb <- fitModelParallel(
          formula = kij ~ genotype + environment + genotype:environment,
          data = data2fit, 
          group_by = "geneID",
          family = glmmTMB::nbinom2(link = "log"), 
          n.cores = 1, 
          control = glmmTMB::glmmTMBControl(optCtrl=list(iter.max=1e5, 
                                                         eval.max=1e5)))
## -- eval params for Wald test
ln_FC_threshold <- log(1.25) ## 25% of expression change
alternative_hypothesis <- "greaterAbs"
## -- remove genes with very low AIC
l_genes <- identifyTopFit(l_tmb)
## -- evaluation
resSimu <- evaluation_report( list_tmb = l_tmb,
                              list_gene = l_genes,
                              mock_obj = mock_data,
                              coeff_threshold = ln_FC_threshold, 
                              alt_hypothesis = alternative_hypothesis )
Identity Plot, ROC Curve, and Precision-Recall Curve

These plots provide visual assessments of how well the model aligns with the values of the simulated effects and its ability to distinguish between differentially expressed genes. Each curve represents a different aspect of the model’s performance, including its goodness of fit and classification accuracy.

## -- Model params
resSimu$identity$params

## -- ROC curve
resSimu$roc$params

## -- precision recall by params
resSimu$precision_recall$params

Metrics summary

The following table summarizes the model’s performance metrics, including R-squared (R2), root mean squared error (RMSE), precision-recall area under the curve (PR AUC), and receiver operating characteristic area under the curve (ROC AUC).

resSimu$performances$byparams[3:5, c('description', 'R2', 'RMSE')]
description R2 RMSE
environment 0.8729899 0.1677684
genotype 0.8090228 0.1619318
genotype:environment 0.5141559 0.2266162
resSimu$performances$byparams[3:5, c('description', 'pr_AUC',   
                                     'pr_randm_AUC', 'pr_performance_ratio',
                                     'roc_AUC', 'roc_randm_AUC')]
description pr_AUC pr_randm_AUC pr_performance_ratio roc_AUC roc_randm_AUC
environment 0.9688483 0.5806452 1.668572 0.9458689 0.5
genotype 0.8992190 0.4838710 1.858386 0.8755637 0.5
genotype:environment 0.7011582 0.3385243 2.071220 0.7755076 0.5

Mixed model for estimating sd(genotype:env)

By employing the following mixed model kij ~ environment + ( environment | genotype ), HTRfit provides estimates for a fixed effect the environment for each gene. Additionally, we obtain in results an estimate of the sd(genotype) (referred to as sd_(Intercept) by the model), and the sd(genotype:environment) (referred to as sd_(environment) by the model).

## fit complex model
l_tmb <- fitModelParallel(
          formula = kij ~  environment + ( environment | genotype ) ,
          data = data2fit, 
          group_by = "geneID",
          family = glmmTMB::nbinom2(link = "log"), 
          n.cores = 1, 
          control = glmmTMB::glmmTMBControl(optCtrl=list(iter.max=1e5, 
                                                         eval.max=1e5)))
## -- eval params for Wald test
ln_FC_threshold <- log(1.25) ## 25% of expression change
alternative_hypothesis <- "greaterAbs"
## -- remove genes with very low AIC
l_genes <- identifyTopFit(l_tmb)
## -- evaluation
resSimu <- evaluation_report( list_tmb = l_tmb,
                              list_gene = l_genes,
                              mock_obj = mock_data,
                              coeff_threshold = ln_FC_threshold, 
                              alt_hypothesis = alternative_hypothesis )
ROC Curve and Precision-Recall Curve

Both the ROC curve and the precision-recall curve provide insights into the classification performance of the model, particularly regarding its ability to distinguish between differentially expressed genes and non-differentially expressed genes. ROC curve

## -- ROC curve
resSimu$roc$params

## -- precision recall by params
resSimu$precision_recall$params

Violin plot for random parameters

The violin plot visualizes the distribution of random parameter estimates, providing insights into the variability of the sd(genotype:environment) (referred to as sd_(environment) by the model) , and sd(genotype) (referred to as sd_(Intercept) by the model) estimates compared to the actual values. You can notice here that Additionally, cor__(Intercept).env is returned by the model. This random parameter refer to the correlation between the basal expression of the gene and the environment effect. As we did not observed such correlation in public data, this metric cannot bet set with HTRfit and is alway set to 0.

Additionally, the model returns cor__(Intercept).env, which represents the correlation between the basal expression of the gene and the environmental effect. However, as no such correlation was observed in the public data, this metric cannot be set in HTRfit simulation and is always expected to be 0. The violin plot allows to measure the proximity between the actual and inferred correlations

## -- Model params
resSimu$violin_rand_params + ggplot2::ylim(c(0, 0.5))

Metrics summary

The following table summarizes the model’s performance metrics, including R-squared (R2), root mean squared error (RMSE), precision-recall area under the curve (PR AUC), and receiver operating characteristic area under the curve (ROC AUC).

## -- actual/estimate comparison
resSimu$performances$byparams[c(2,4,5,6), c('description',   'R2', 'RMSE')]
description R2 RMSE
cor__(Intercept).environment NA 0.2976654
environment 0.9922531 0.0549256
sd_(Intercept) NA 0.0466697
sd_environment NA 0.0507376
## -- classification metrics
resSimu$performances$byparams[4, c('description',   'pr_AUC',   
                                   'pr_randm_AUC', 'pr_performance_ratio',
                                   'roc_AUC',   'roc_randm_AUC')]
description pr_AUC pr_randm_AUC pr_performance_ratio roc_AUC roc_randm_AUC
environment 0.9854788 0.5684211 1.733713 0.9724481 0.5

Simulating triple interaction with HTRfit

HTRfit’s capability extends to simulating triple interactions to evaluate the model’s performance in capturing such complex relationships. To demonstrate this, we introduce a categorical variable, age, assuming that the transcriptome evolves with the age of organisms. We initialize this variable with three levels.

N_GENES <- 100
MIN_REPLICATES <- 4
MAX_REPLICATES <- 4
SEQ_DEPTH <- mean(colSums(pub_data))/N_GENES
## -- init design
input_var_list <- init_variable( name = "genotype", sd = MAD_OBSERVED, level = 30) %>%
                  init_variable( name = "env", sd = MAD_OBSERVED, level = 2 ) %>%
                  init_variable( name = "age", sd = MAD_OBSERVED, level = 3 ) %>%
                  add_interaction(between_var = c("genotype", "env"), 
                                  sd = MAD_OBSERVED) %>% 
                  add_interaction(between_var = c("genotype", "age"), 
                                  sd = MAD_OBSERVED) %>% 
                  add_interaction(between_var = c("env", "age"), 
                                  sd = MAD_OBSERVED) %>%
                  add_interaction(between_var = c("env", "genotype", "age"), 
                                  sd = MAD_OBSERVED)

## -- simulate RNAseq data 
mock_data <- mock_rnaseq(input_var_list, 
                         n_genes = N_GENES,
                         min_replicates  = MIN_REPLICATES,
                         max_replicates = MAX_REPLICATES,
                         basal_expression = PUB_BASAL_EXPR,
                         sequencing_depth = SEQ_DEPTH,
                         dispersion = DISP_PUB )
## -- prepare data
## apply + 1 to each counts
## remove genes i with all(k_ij) < 10
data2fit = prepareData2fit(countMatrix = mock_data$counts, 
                           metadata =  mock_data$metadata, 
                           normalization = 'MRN',   
                           response_name = "kij",
                           transform = "x+1",
                           row_threshold = 10) 

model2fit <- kij ~ genotype + env + age + 
                   genotype:env + genotype:age + env:age + 
                   genotype:age:env

## fit complex model
l_tmb <- fitModelParallel(
          formula = model2fit,
          data = data2fit, 
          group_by = "geneID",
          family = glmmTMB::nbinom2(link = "log"), 
          n.cores = 1, 
          control = glmmTMB::glmmTMBControl(optCtrl=list(iter.max=1e5, 
                                                         eval.max=1e5)))
## -- eval params for Wald test
ln_FC_threshold <- log(1.25) ## 25% of expression change
alternative_hypothesis <- "greaterAbs"
## -- remove genes with very low AIC
l_genes <- identifyTopFit(l_tmb)
## -- evaluation
resSimu <- evaluation_report( list_tmb = l_tmb,
                              list_gene = l_genes,
                              mock_obj = mock_data,
                              coeff_threshold = ln_FC_threshold, 
                              alt_hypothesis = alternative_hypothesis )

The metrics returned by evaluation_report() indicate that capturing triple interactions is significantly more challenging than main and simple interaction effects (between two variables). The metrics comparing actual and inferred values are consistently lower for triple interactions, as well as the classification metrics (AUC)

## -- actual/estimate comparison
resSimu$performances$byparams[c(2,4,6,5,7,8,9), c('description','R2', 'RMSE')]
description R2 RMSE
age 0.7534232 0.1788698
env 0.8356512 0.1597085
genotype 0.7958937 0.1730013
env:age 0.5455451 0.2315923
genotype:age 0.4600377 0.2447068
genotype:env 0.4853566 0.2282988
genotype:env:age 0.3052543 0.3314682
## -- classification metrics
resSimu$performances$byparams[c(2,4,6,5,7,8,9), c('description',    'pr_AUC',   
                                                  'pr_randm_AUC', 'pr_performance_ratio',
                                                  'roc_AUC',    'roc_randm_AUC')]
description pr_AUC pr_randm_AUC pr_performance_ratio roc_AUC roc_randm_AUC
age 0.9081104 0.5250000 1.729734 0.8582957 0.5
env 0.9352924 0.6200000 1.508536 0.8601443 0.5
genotype 0.8754560 0.5079310 1.723573 0.8258266 0.5
env:age 0.6730168 0.3250000 2.070821 0.7089459 0.5
genotype:age 0.6551313 0.3386207 1.934705 0.7280042 0.5
genotype:env 0.6461839 0.3406897 1.896694 0.7216957 0.5
genotype:env:age 0.5438564 0.3263793 1.666332 0.6567084 0.5
Improving triple interaction estimation

To enhance the estimation of genotype:age:env triple interaction, one strategy is to increase the number of replicates in our simulation. In this scenario, we use 25 replicates for each experimental condition.

MIN_REPLICATES <- 25
MAX_REPLICATES <- 25

## -- simulate RNAseq data 
mock_data <- mock_rnaseq(input_var_list, 
                         n_genes = N_GENES,
                         min_replicates  = MIN_REPLICATES,
                         max_replicates = MAX_REPLICATES,
                         basal_expression = PUB_BASAL_EXPR,
                         sequencing_depth = SEQ_DEPTH,
                         dispersion = DISP_PUB )
## -- prepare data
## apply + 1 to each counts
## remove genes i with all(k_ij) < 10
data2fit = prepareData2fit(countMatrix = mock_data$counts, 
                           metadata =  mock_data$metadata, 
                           normalization = 'MRN',   
                           response_name = "kij",
                           transform = "x+1",
                           row_threshold = 10) 

model2fit <- kij ~ genotype + env + age + 
                   genotype:env + genotype:age + env:age + 
                   genotype:age:env

## fit complex model
l_tmb <- fitModelParallel(
          formula = model2fit,
          data = data2fit, 
          group_by = "geneID",
          family = glmmTMB::nbinom2(link = "log"), 
          n.cores = 1, 
          control = glmmTMB::glmmTMBControl(optCtrl=list(iter.max=1e5, 
                                                         eval.max=1e5)))
## -- eval params for Wald test
ln_FC_threshold <- log(1.25) ## 25% of expression change
alternative_hypothesis <- "greaterAbs"
## -- remove genes with very low AIC
l_genes <- identifyTopFit(l_tmb)
## -- evaluation
resSimu <- evaluation_report( list_tmb = l_tmb,
                              list_gene = l_genes,
                              mock_obj = mock_data,
                              coeff_threshold = ln_FC_threshold, 
                              alt_hypothesis = alternative_hypothesis )

We observe a significant improvement in performance across all parameters, including the genotype:age:env parameter. The inference of genotype:age:env is closer to the actual values (R2, RMSE), and the model’s ability to determine whether a parameter is differentially expressed or not (with a 25% expression change) is enhanced (pr_AUC, AUC).

## -- actual/estimate comparison
resSimu$performances$byparams[c(2,4,6,5,7,8,9), c('description','R2', 'RMSE')]
description R2 RMSE
age 0.9736151 0.0599414
env 0.9767616 0.0590054
genotype 0.9627687 0.0681128
env:age 0.8572421 0.0882152
genotype:age 0.8645111 0.0916272
genotype:env 0.8744517 0.0911279
genotype:env:age 0.7545189 0.1340333
## -- classification metrics
resSimu$performances$byparams[c(2,4,6,5,7,8,9), c('description', 'pr_AUC',  
                                                  'pr_randm_AUC', 'roc_AUC',    
                                                  'roc_randm_AUC')]
description pr_AUC pr_randm_AUC roc_AUC roc_randm_AUC
age 0.9824656 0.5550000 0.9665958 0.5
env 0.9461978 0.5800000 0.8850575 0.5
genotype 0.9572818 0.5237931 0.9202956 0.5
env:age 0.8560390 0.3000000 0.8446429 0.5
genotype:age 0.8642243 0.3455172 0.8518661 0.5
genotype:env 0.8728211 0.3444828 0.8599528 0.5
genotype:env:age 0.7815457 0.3348276 0.7940449 0.5
Triple interaction and mixed effect

Currently, HTRfit’s evaluation does not include a comparison between a simulation and a fit capable of capturing the sd(genotype:age:env) parameter. However, feel free to implement this feature.

Advance user

HTRfit’s simulation framework offers flexibility in adjusting input parameters, enabling iterative monitoring of their effects on model fit results.

Monitoring results as sequencing depth increases

Increasing sequencing depth is expected to improve the model’s performance, whether in accurately estimating parameters (measured by R2 or RMSE) or in the quality of classification (distinguishing DE and non-DE genes, measured by AUC and PR_AUC). In this section, we utilized HTRfit to assess the model’s performance across different expression levels (low, medium, high). It is anticipated that genes with higher expression levels will yield better results.

  • Fixed effects model : kij ~ genotype + environment + genotype:environment

Follow these instructions to reproduce the graphs :


  • Mixed effects model : kij ~ environment + ( environment | genotype )

Similar graphs can be obtained for mixed effect model with these instructions.

Monitoring results as replicates number increases

Increasing the number of replicates is also anticipated to enhance performance, contributing to more accurate parameter estimation (measured by R2 or RMSE) and improved classification quality (distinguishing DE and non-DE genes, measured by AUC and PR_AUC).

  • Fixed effects model : kij ~ genotype + environment + genotype:environment

Follow these instructions to reproduce the graphs :


  • Mixed effects model : kij ~ environment + ( environment | genotype )

Similar graphs can be obtained for mixed effect model with these instructions.

Monitoring accuracy in estimating random effect as number increases

To reproduce following graphs, follow this instructions

** TO complete **

Combining genes behavior

Up to this point, during the simulation of mock RNA-seq data using HTRfit, all genes exhibited uniform effect sizes (equivalent sd(genotype), sd(env), sd(genotype:env) across genes), leading to consistent behavior among genes. With HTRfit’s combine_mock() function, it becomes possible to generate intricate biological experiments with varied gene behaviors.
For further details about combine_mock() see object mock_rnaseq documentation

# N_GENES <- 100
SEQ_DEPTH <- mean(colSums(pub_data))/N_GENES
REP_NB <- 4
LIST_SD <- c(0.01, 0.04 , 0.07, 0.1, 0.13)

list_mock <- list()
for (SD in LIST_SD) {
  input_var_list <- init_variable(name = "genotype", sd = SD, level = 150) %>%
    init_variable(name = "env", sd = SD, level = 2) %>%
    add_interaction(between_var = c("genotype", "env"), sd =  SD)
  
  ## Use generate_counts = FALSE 
  ## to save computing time and avoid generating something useless
  mock_data <- mock_rnaseq(
                  input_var_list,
                  n_genes = N_GENES/length(LIST_SD),
                  basal_expression = PUB_BASAL_EXPR,
                  dispersion = DISP_PUB,
                  generate_counts = FALSE) 
  ## append new mock_data in a list
  list_mock <- append(list_mock , list(mock_data))
}

## -- counts are generated with combine_mock from the list_mock obj
mock_data <- combine_mock(list_mock, REP_NB, REP_NB, SEQ_DEPTH)

The capacity to merge genes with diverse behaviors into a unified experiment proves especially beneficial when assessing a mixed model (kij ~ environment + ( environment | genotype )) utilized to infer sd(genotype) and sd(GxE). In such instances, the identity plot generated by the evaluation_report() will exhibit an R² for the random effect (sd(genotype) and sd(GxE)) since the actual value varies across genes.

## -- prepare data & fit model exactly as before
data2fit = prepareData2fit(countMatrix = mock_data$counts, 
                           metadata =  mock_data$metadata, 
                           normalization = 'MRN',   
                           response_name = "kij",
                           transform = "x+1",
                           row_threshold = 10) 
l_tmb <- fitModelParallel(
          formula = kij ~  env + ( env | genotype ) ,
          data = data2fit, 
          group_by = "geneID",
          family = glmmTMB::nbinom2(link = "log"), 
          n.cores = 1, 
          control = glmmTMB::glmmTMBControl(optCtrl=list(iter.max=1e5, 
                                                         eval.max=1e5)))
## -- eval params for Wald test
ln_FC_threshold <- log(1.25) ## 25% of expression change
alternative_hypothesis <- "greaterAbs"
## -- remove genes with very low AIC
l_genes <- identifyTopFit(l_tmb)
## -- evaluation
resSimu <- evaluation_report( list_tmb = l_tmb,
                              list_gene = l_genes,
                              mock_obj = mock_data,
                              coeff_threshold = ln_FC_threshold, 
                              alt_hypothesis = alternative_hypothesis )
## -- Model params
resSimu$identity$params

## -- Model params
resSimu$violin_rand_params