RNA-seq analysis • HTRfit

library(HTRfit)

In RNA-seq, we employ Generalized Linear Models (GLM) to unravel how genes respond to various experimental conditions. These models assist in deciphering the specific impacts of experimental variables on gene expression.HTRfit can be utilized to analyze such RNA-seq data, providing a robust framework for exploring and interpreting the intricate relationships between genes and experimental conditions.

Input data

HTRfit analysis necessitates a count matrix and sample metadata, in the form of dataframes. Notice that gene_id have to be specified as rownames of count_matrix. In this example we use HTRfit to analyse a common RNA-seq data with 2 genotypes and 2 environments.

## -- gene count matrix
count_matrix[1:4, 1:2]
#>         genotype1_environment1_1 genotype1_environment1_2
#> gene1                         20                        6
#> gene10                         9                       15
#> gene100                       23                       24
#> gene11                        21                       20

## -- samples metadata
head(metaData)
#>    genotype  environment                 sampleID
#> 1 genotype1 environment1 genotype1_environment1_1
#> 2 genotype1 environment1 genotype1_environment1_2
#> 3 genotype1 environment1 genotype1_environment1_3
#> 4 genotype1 environment1 genotype1_environment1_4
#> 5 genotype1 environment2 genotype1_environment2_1
#> 6 genotype1 environment2 genotype1_environment2_2

Prepare data for fitting

The prepareData2fit() function serves the purpose of converting the counts matrix and sample metadata into a dataframe that is compatible with downstream HTRfit functions designed for model fitting. This function also includes an option to perform median ratio normalization and transformation on the data counts.

## -- convert counts matrix and samples metadatas in a data frame for fitting
data2fit = prepareData2fit(
             countMatrix = count_matrix, 
             metadata =  metaData, 
             normalization = NULL,
             response_name = "kij")


## -- median ratio normalization
data2fit = prepareData2fit(
             countMatrix = count_matrix, 
             metadata =  metaData, 
             normalization = 'MRN', 
             response_name = "kij")

## -- apply + 1 transformation on each counts
data2fit = prepareData2fit(
             countMatrix = count_matrix, 
             metadata =  metaData, 
             normalization = 'MRN', 
             transform = "x+1",
             response_name = "kij")

Fit model from your data

The fitModelParallel() function enables independent model fitting for each gene. The number of threads used for this process can be controlled by the n.cores parameter.

l_tmb <- fitModelParallel(
          formula = kij ~ genotype + environment  + genotype:environment,
          data = data2fit, 
          group_by = "geneID",
          family = glmmTMB::nbinom2(link = "log"), 
          n.cores = 1)

Use mixed effect in your model

HTRfit uses the glmmTMB functions for model fitting algorithms. This choice allows for the utilization of random effects within your formula design. For further details on how to specify your model, please refer to the mixed model documentation.

l_tmb <- fitModelParallel(
          formula = kij ~ genotype + ( 1 | environment ),
          data = data2fit, 
          group_by = "geneID",
          family = glmmTMB::nbinom2(link = "log"), 
          n.cores = 1)

Additional settings

The function provides precise control over model settings for fitting optimization, including options for specifying the model family and model control setting. By default, a Gaussian family model is fitted, but for RNA-seq data, it is highly recommended to specify family = glmmTMB::nbinom2(link = "log").

l_tmb <- fitModelParallel(
          formula = kij ~ genotype + environment  + genotype:environment,
          data = data2fit, 
          group_by = "geneID",
          n.cores = 1, 
          family = glmmTMB::nbinom2(link = "log"),
          control = glmmTMB::glmmTMBControl(optCtrl=list(iter.max=1e5,
                                                         eval.max=1e5)))

Extracts a tidy result table from a list tmb object

The tidy_results function extracts a data frame containing estimates of ln(fold changes), standard errors, test statistics, p-values, and adjusted p-values for fixed effects. Additionally, it provides access to correlation terms and standard deviations for random effects, offering a detailed view of HTRfit modeling results.

## -- get tidy results
my_tidy_res <- tidy_results(l_tmb, coeff_threshold = 0.1, 
                            alternative_hypothesis = "greaterAbs")
## -- head
my_tidy_res[1:3,]
#>      ID effect component group         term    estimate std.error   statistic
#> 1 gene1  fixed      cond    NA  (Intercept)  2.84081018 0.1209488  2.01401422
#> 2 gene1  fixed      cond    NA    genotype2 -0.02527887 0.1723802 -0.60539186
#> 3 gene1  fixed      cond    NA environment2  0.59941560 0.1504515 -0.06525054
#>         p.value         p.adj
#> 1 5.445603e-114 3.203296e-113
#> 2  9.013490e-01  9.268370e-01
#> 3  4.526502e-04  1.240138e-03

Update fit

The updateParallel() function updates and re-fits a model for each gene. It offers options similar to those in fitModelParallel(). In addition, it is possible to modify the reference level of the categorical variable used in your model in order to use different contrast.

## -- update your fit modifying the model family
l_tmb <- updateParallel(
          formula =  kij ~ genotype + environment  + genotype:environment,
          list_tmb = l_tmb ,
          family = gaussian(), 
          n.cores = 1)

## -- update fit using additional model control settings
l_tmb <- updateParallel(
          formula =  kij ~ genotype + environment  + genotype:environment ,
          list_tmb = l_tmb ,
          family = gaussian(), 
          n.cores = 1,
          control = glmmTMB::glmmTMBControl(optCtrl=list(iter.max=1e3,
                                                         eval.max=1e3)))


## -- update your model formula and your family model
l_tmb <- updateParallel(
            formula =   kij ~ genotype + environment  ,
            list_tmb = l_tmb ,
            family = glmmTMB::nbinom2(link = "log"), 
            n.cores = 1)

## -- modif reference levels 
## -- genotype reference = "genotype2"
## -- environment reference = "environment2"
l_tmb <- updateParallel(
            formula =   kij ~ genotype + environment  ,
            list_tmb = l_tmb ,
            family = glmmTMB::nbinom2(link = "log"), 
            n.cores = 1, 
            reference_labels = c("genotype2", "environment2"))

Struture of list tmb object

str(l_tmb$gene1, max.level = 1)
#> List of 8
#>  $ obj      :List of 10
#>  $ fit      :List of 7
#>  $ sdr      :List of 8
#>   ..- attr(*, "class")= chr "sdreport"
#>  $ call     : language glmmTMB::glmmTMB(formula = kij ~ genotype + environment, data = data, family = list(     family = "nbinom2", vari| __truncated__ ...
#>  $ frame    :'data.frame':   16 obs. of  5 variables:
#>  $ modelInfo:List of 15
#>  $ fitted   : NULL
#>  $ groupId  : chr "gene1"
#>  - attr(*, "class")= chr "glmmTMB"

Plot fit metrics

Visualizing fit metrics is essential for evaluating your models. Here, we show you how to generate various plots to assess the quality of your models. You can explore all metrics or focus on specific aspects like dispersion and log-likelihood.

## -- plot all metrics
diagnostic_plot(list_tmb = l_tmb)

## -- Focus on metrics
diagnostic_plot(list_tmb = l_tmb, focus = c("AIC"))

Select best fit based on diagnostic metrics

The identifyTopFit function allows for selecting genes that best fit models, based on various metrics such as AIC, BIC, LogLik, deviance, dispersion . This function provides the flexibility to employ multiple filtering methods to identify genes most relevant for further analysis. We implemented a filtering method based on the Median Absolute Deviation (MAD). For example, using the MAD method with a tolerance threshold of 3, the function identifies genes whose metric values are greater (“top”) or lower (“low”) than median(metric) - 3 * MAD(metric). The keep option allows choosing whether to retain genes with metric values above or below the MAD threshold.

identifyTopFit(list_tmb = l_tmb, metric = "AIC", 
               filter_method = "mad", keep = "top", 
               sort = TRUE, decreasing = TRUE, 
               mad_tolerance = 3)
#> Based on the specified metric (AIC) and the MAD filtering method, the following selection criteria were applied:
#> 1. The MAD-based threshold for considering outliers was calculated.
#> 2. Values above the threshold were identified, threshold: 85.2207505555877
#> 3. Summary of selection:
#> - 79 out of 100 observations had values above the threshold for the AIC metric.
#>  [1] "gene28" "gene50" "gene98" "gene39" "gene94" "gene61" "gene95" "gene93"
#>  [9] "gene38" "gene68" "gene30" "gene1"  "gene70" "gene22" "gene17" "gene63"
#> [17] "gene46" "gene27" "gene64" "gene7"  "gene75" "gene62" "gene29" "gene13"
#> [25] "gene78" "gene79" "gene88" "gene77" "gene52" "gene21" "gene35" "gene16"
#> [33] "gene25" "gene42" "gene58" "gene55" "gene9"  "gene43" "gene14" "gene18"
#> [41] "gene44" "gene32" "gene54" "gene5"  "gene10" "gene37" "gene2"  "gene99"
#> [49] "gene8"  "gene60" "gene19" "gene72" "gene23" "gene90" "gene40" "gene83"
#> [57] "gene20" "gene45" "gene3"  "gene96" "gene81" "gene11" "gene76" "gene80"
#> [65] "gene31" "gene51" "gene15" "gene24" "gene12" "gene92" "gene89" "gene87"
#> [73] "gene86" "gene4"  "gene84" "gene26" "gene6"  "gene69" "gene53"

Anova to select the best model

Utilizing the anovaParallel() function enables you to perform model selection by assessing the significance of the fixed effects. You can also include additional parameters like type. For more details, refer to car::Anova.

## -- update your fit modifying the model family
l_anova <- anovaParallel(list_tmb = l_tmb)

## -- additional settings
l_anova <- anovaParallel(list_tmb = l_tmb, type = "III" )