Identify top or low fitting observations based on a specified diagnostic metric and filtering method.

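This function evaluates the chosen metric for each fitted model in list_tmb and returns the names of the observations whose values fall above ("top") or below ("low") a threshold derived from the filtering method.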

Usage

identifyTopFit(
  list_tmb,
  metric = "AIC",
  filter_method = "mad",
  keep = "top",
  sort = FALSE,
  decreasing = TRUE,
  mad_tolerance = 3
)

Arguments

list_tmb: List of glmmTMB objects.
metric: The diagnostic metric to use (e.g., "AIC", "BIC", "logLik", "deviance", "dispersion").
filter_method: The filtering method to use (e.g., "mad"). Feel free to implement your own filtering method.
keep: Whether to keep "top" (values above the threshold) or "low" (values below the threshold) fitting observations.
sort: Logical indicating whether to sort the results.
decreasing: Logical indicating whether to sort in decreasing order.
mad_tolerance: Tolerance for MAD-based filtering.
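
To make the roles of keep and mad_tolerance concrete, here is a minimal sketch of one way a MAD-based filter could work. It is not the package's implementation: the helper mad_filter() is hypothetical, the metric values are made up, and the rule itself (a single cutoff at median - tolerance * MAD, with "top" keeping values above it and "low" keeping values below it) is an assumption for illustration only.

```r
# Hypothetical helper for illustration; not part of the package API.
# Assumption: a single cutoff at median - tolerance * MAD.
mad_filter <- function(values, tolerance = 3, keep = c("top", "low")) {
  keep <- match.arg(keep)
  # stats::mad() returns the median absolute deviation (scaled by 1.4826 by default)
  cutoff <- stats::median(values) - tolerance * stats::mad(values)
  # "top" keeps values above the cutoff, "low" keeps values below it
  selected <- if (keep == "top") values > cutoff else values < cutoff
  names(values)[selected]
}

# Made-up metric values, one per gene
aic <- c(gene1 = 100, gene2 = 102, gene3 = 101, gene4 = 60, gene5 = 103)

top_genes <- mad_filter(aic, tolerance = 3, keep = "top")
low_genes <- mad_filter(aic, tolerance = 3, keep = "low")
```

Under this assumed rule, a larger tolerance moves the cutoff further from the median, so fewer observations end up on the "low" side.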

Value

A character vector of row names corresponding to the top or low fitting observations.

Examples

input_var_list <- init_variable()
#> Variable name should not contain digits, spaces, or special characters.
#> If any of these are present, they will be removed from the variable name.
## -- simulate RNAseq data
mock_data <- mock_rnaseq(input_var_list,
                         n_genes = 5,
                         min_replicates = 3,
                         max_replicates = 3,
                         basal_expression = 2,
                         sequencing_depth = 1e5)
#> Building mu_ij matrix
#> INFO: The length of the sequencing_depth vector is shorter than the number of samples. Values will be recycled.
#> Scaling count table according to sequencing depth: Done
#> INFO: Scaling counts by sequencing depth may exhibit some randomness due to certain parameter combinations, resulting in erratic behavior. This can be minimized by simulating more genes. We advise verifying the simulated sequencing depth to avoid drawing incorrect conclusions.
#> k_ij ~ Nbinom(mu_ij, dispersion)
#> Counts simulation: Done
## -- prepare data & fit a model with mixed effect
data2fit <- prepareData2fit(countMatrix = mock_data$counts,
                            metadata = mock_data$metadata)
l_tmb <- fitModelParallel(formula = kij ~ myVariable, data = data2fit,
                          group_by = "geneID", family = glmmTMB::nbinom2(link = "log"),
                          n.cores = 1)
#> Log file location: /tmp/RtmpS86cq0/htrfit.log
#> CPU(s) number : 1
#> Cluster type : PSOCK
# Identify top fitting observations based on AIC with MAD filtering
identifyTopFit(l_tmb, metric = "AIC", filter_method = "mad", keep = "top", 
               sort = TRUE, decreasing = TRUE, mad_tolerance = 3)
#> Based on the specified metric (AIC) and the MAD filtering method, the following selection criteria were applied:
#> 1. The MAD-based threshold for considering outliers was calculated.
#> 2. Values above the threshold were identified, threshold: 87.2150684926346
#> 3. Summary of selection:
#> - 4 out of 5 observations had values above the threshold for the AIC metric.
#> [1] "gene5" "gene2" "gene3" "gene1"

# Identify low fitting observations based on BIC without sorting
identifyTopFit(l_tmb, metric = "BIC", filter_method = "mad", keep = "low", sort = FALSE)
#> Based on the specified metric (BIC) and the MAD filtering method, the following selection criteria were applied:
#> 1. The MAD-based threshold for considering outliers was calculated.
#> 2. Values bellow the threshold were identified, threshold: 86.5903469003188
#> 3. Summary of selection:
#> - 1 out of 5 observations had values bellow the threshold for the BIC metric.
#> [1] "gene4"

# Identify top fitting observations based on log-likelihood with MAD filtering and custom tolerance
identifyTopFit(l_tmb, metric = "logLik", filter_method = "mad", keep = "top", mad_tolerance = 2)
#> Based on the specified metric (logLik) and the MAD filtering method, the following selection criteria were applied:
#> 1. The MAD-based threshold for considering outliers was calculated.
#> 2. Values above the threshold were identified, threshold: -49.4817291872032
#> 3. Summary of selection:
#> - 5 out of 5 observations had values above the threshold for the logLik metric.
#> [1] "gene1" "gene2" "gene3" "gene4" "gene5"
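
The returned character vector can be used to subset the list of fits. The sketch below uses a plain named list as a stand-in for l_tmb so that it runs without the package; it assumes the list of fitted models is named by gene ID, which should be checked with names(l_tmb) before relying on it.

```r
# Stand-in for the list of fitted models (assumption: one element per gene,
# named by gene ID). Real elements would be glmmTMB objects.
fits <- list(gene1 = "fit1", gene2 = "fit2", gene3 = "fit3",
             gene4 = "fit4", gene5 = "fit5")

# Names as returned by the first identifyTopFit() call above
selected <- c("gene5", "gene2", "gene3", "gene1")

# Keep only the selected fits, in the order given by `selected`
fits_kept <- fits[selected]
names(fits_kept)
```

Indexing by name preserves the order of the selection, so a sorted result from identifyTopFit() yields a list sorted the same way.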