if (!requireNamespace("knitr", quietly = TRUE)) {
install.packages("knitr")
}
if(!requireNamespace("readxl", quietly = TRUE)) {
install.packages("readxl")
}
if(!requireNamespace("gt", quietly = TRUE)) {
install.packages("gt")
}
library(knitr)
library(RCurl)
library(gprofiler2)
library(dplyr)
library(tibble)
library(stringi)
library(BiocGenerics)
library(data.table)
library(GEOquery)
library(readxl)
library(biomaRt)
library(edgeR)
library(SummarizedExperiment)
library(DESeq2)
library(ggplot2)
library(ComplexHeatmap)
library(circlize)
library(tidyr)
library(gt)
This workbook uses the GEOQuery (Davis and Meltzer 2007), biomaRt (Durinck et al. 2005), tibble (Müller and Wickham 2025), readxl(Wickham and Bryan 2025), dplyr (Wickham et al. 2023), tidyr (Wickham et al. 2025), knitr (Xie 2014), DESeq2 (Love et al. 2014), edgeR (Chen et al. 2025), SummarizedExperiment(Morgan et al. 2025), ggplot2 (Wickham 2016), ComplexHeatmap (Gu 2022), data.table (Barrett et al. 2006), RCurl (Temple Lang 2025), BiocGenerics (Huber et al. 2015), stringi (Gagolewski 2022), gt (Iannone et al. 2026) and circlize (Gu et al. 2014) packages in its code chunks.
Septic shock is a complication of sepsis with severe and potentially fatal consequences. Treatment for septic shock is largely dependent on supportive hemodynamic therapy, but patient response to such therapies is variable, and while some septic shock patients, responders, exhibit significant improvement of organ function within a few days of treatment in the ICU, other patients, non-responders, do not exhibit early improvement. Changes in organ function during the early stages of treatment are highly predictive of final outcome, and fatal cases of septic shock are associated with major transcriptomic changes in the initial stages of the condition, so the distinction of gene expression profiles associated with responders and non-responders is informative as to the mechanisms underlying septic shock mortality.
The NCBI Gene Expression Omnibus (GEO) (Barrett et al. 2012) series GSE110487 is a dataset from 2018 consisting of RNA sequencing read counts of whole blood samples taken from 14 septic shock patients who responded to hemodynamic therapy (R group) and 17 non-responders (NR group), at two time points: immediately after ICU admission (T1), and 48 hours later (T2), after some time undergoing treatment.
In Assignment 1, we downloaded the dataset of 58096 genes and 62 samples. Since the data used HGNC identifiers for the genes, we mapped these to ENSG identifiers and cleaned out low expression genes, leaving us with 13673 genes. We visualized and compared raw and normalized read counts with MDS plots, and noted the similarity of their distribution. We then analyzed differential expression between the R patients and the NR patients at each time point, and we analyzed differential expression between T1 and T2 in each group. We noted no differentially expressed genes above our significance threshold between the R and NR groups at T1, and 67 differentially expressed genes between the patient groups at T2, so we determined that analyzing differential expression between R and NR at T2 would provide insight into the difference between the transcriptomic responses of the two groups to hemodynamic therapy. We concluded by visualizing the genes that were differentially expressed in R vs. NR at T2 with a volcano plot and a heatmap. The code used in assignment 1 is contained and hidden in this notebook.
We’ll make lists of the genes identified in Assignment 1 as differentially expressed, with corrected p-values below the significance threshold of 0.05.
# get list of significantly differentially expressed genes
DE_genes <- as.data.frame(sig_T2) %>% dplyr::arrange(padj)
DE_genelist <- rownames(DE_genes)
# get separate lists of the upregulated and downregulated genes
upreg_genelist <- rownames(DE_genes[DE_genes$log2FoldChange > 0,])
downreg_genelist <- rownames(DE_genes[DE_genes$log2FoldChange < 0,])
We have 67 differentially expressed genes, 29 of which are upregulated in the R group (R+ genes) and 38 of which are downregulated in the R group (R- genes).
Drawing from protocol demonstrated by (Reimand et al. 2019), we’ll use a one-tailed Fisher’s exact test, as implemented in the gprofiler2 package (Kolberg et al. 2020), to analyze enrichment. This method is suitable for thresholded over-representation analysis because we are calculating enrichment using a binary condition, whether or not a gene is significantly differentially expressed, meaning data fits the 2 x 2 contingency tables used by the Fisher’s exact test, and we wish to assess a directional effect, whether terms are more enriched in the results than expected (Rodchenkov et al. 2019). (1)
We use the Bader lab enrichment map gene set file (Merico et al. 2010) containing all human genesets from the most recent releases, as of 2026-03-02, of GO biological process (Aleksander et al. 2025) excluding annotations that are extracted from public pathway figures, inferred from electronic annotation, inferred from reviewed computational analysis, or have no biological data available, and all pathway resources, including Reactome (Croft et al. 2010), MSigDB (Liberzon et al. 2015), NetPath (Kandasamy et al. 2010), PANTHER (Mi et al. 2013), WikiPathways (Agrawal et al. 2023), PathBank (Wishart et al. 2023), and HumanCyc (Trupp et al. 2010). This is a highly comprehensive, up-to-date compilation of annotation sources that will allow for greater coverage than using any one database alone. (2)
data_dir = paste0(getwd(),"/data")
# access and parse the names of the bader lab human gene set files
gmt_url = "https://download.baderlab.org/EM_Genesets/current_release/Human/symbol/"
filenames = RCurl::getURL(gmt_url)
tc = textConnection(filenames)
contents = readLines(tc)
close(tc)
# find the gmt file with all GOBP and pathway terms, excluding those extracted
# from published pathway figures and inferred from electronic annotation
rx = gregexpr("(?<=<a href=\")(.*.GOBP_AllPathways_noPFOCR_no_GO_iea.*.)(.gmt)(?=\">)",
contents, perl = TRUE)
gmt_file = unlist(regmatches(contents, rx))
dest_gmt_file <- file.path(data_dir,gmt_file)
# if the file has not already been downloaded, download
if(!file.exists(dest_gmt_file)){
download.file(
paste(gmt_url,gmt_file,sep=""),
destfile=dest_gmt_file
)
}
# read and clean the downloaded gmt file so that it can be handled by gsea
cleaned_gmt_file = paste0(data_dir, "/cleaned_", BiocGenerics::basename(dest_gmt_file))
if (!file.exists(cleaned_gmt_file)) {
gmt_lines <- readLines(dest_gmt_file, warn = FALSE)
# remove non-ASCII characters
clean_lines <- stringi::stri_trans_general(gmt_lines, id = "Any-Latin; Latin-ASCII")
# remove duplicate terms
unique_clean_lines <- clean_lines[!data.table::duplicated(clean_lines)]
writeLines(unique_clean_lines, cleaned_gmt_file)
}
With all our data prepared, we use gprofiler2::gost to run ORA on each of our gene lists. For multiple testing correction we use gprofiler’s g:SCS (set counts and sizes) method. SCS correction specifically takes into account the structure behind functional annotations, and functional profiling methods where significance is determined from set intersections in 2 x 2 contingency tables. SCS estimations were demonstrated in simulations with randomly sampled queries to more closely align with empirical quantiles of p-values than the more general Bonferroni and Benjamini-Hochberg estimations (Reimand et al. 2007).
# upload the cleaned bader lab gmt file to g:profiler to use as annotation source
if (!exists("custom_gmt")) {
custom_gmt <- gprofiler2::upload_GMT_file(cleaned_gmt_file)
}
# define significance threshold as 0.05, term size bounds as 15 and 500
thresh <- 0.05
min_genes <- 15
max_genes <- 500
# ora of all differentially expressed genes
DE_ORA <- gprofiler2::gost(query = DE_genelist,
organism = custom_gmt,
ordered_query = FALSE,
user_threshold = thresh,
correction_method = "g_SCS")
# extract analysis results
DE_enriched_genesets <- DE_ORA$result
num_enriched <- nrow(DE_enriched_genesets)
# filter for geneset size between min_genes and max_genes
DE_enriched_filtered <- DE_enriched_genesets %>%
dplyr::filter(term_size >= min_genes & term_size <= max_genes)
num_enriched_filtered <- nrow(DE_enriched_filtered)
# ora of upregulated genes only and get results
upreg_ORA <- gprofiler2::gost(query = upreg_genelist,
organism = custom_gmt,
ordered_query = FALSE,
user_threshold = thresh,
correction_method = "g_SCS")
upreg_enriched_genesets <- upreg_ORA$result
num_enriched_up <- nrow(upreg_enriched_genesets)
upreg_enriched_filtered <- upreg_enriched_genesets %>%
dplyr::filter(term_size >= min_genes & term_size <= max_genes)
num_filtered_up <- nrow(upreg_enriched_filtered)
# ora of downregulated genes only and get results
downreg_ORA <- gprofiler2::gost(query = downreg_genelist,
organism = custom_gmt,
ordered_query = FALSE,
user_threshold = thresh,
correction_method = "g_SCS")
downreg_enriched_genesets <- downreg_ORA$result
num_enriched_down <- nrow(downreg_enriched_genesets)
downreg_enriched_filtered <- downreg_enriched_genesets %>%
dplyr::filter(term_size >= min_genes & term_size <= max_genes)
num_filtered_down <- nrow(downreg_enriched_filtered)
Filtering for SCS-corrected p-values below the threshold of 0.05, we get 9 terms enriched in the list of all our differentially expressed genes. 5 of these have between 15 and 500 genes, and the rest are larger. (3)
We get 5 terms enriched with SCS-corrected p-value < 0.05 in the list of R+ genes, and 5 terms enriched with SCS-corrected p-value < 0.05 in the list of R- genes. Filtering for term size between 15 and 500, we have 3 terms enriched in the R+ genes and 4 terms enriched in the R- genes. In Table 1, we compare the terms enriched in the three gene lists.
all_enriched <- dplyr::bind_rows(DE_enriched_genesets, upreg_enriched_genesets,
downreg_enriched_genesets) %>%
dplyr::distinct(term_id, .keep_all = TRUE)
results_table <- all_enriched %>%
dplyr::mutate(enriched_allDE = term_id %in% DE_enriched_genesets$term_id,
enriched_Rplus = term_id %in% upreg_enriched_genesets$term_id,
enriched_Rminus = term_id %in% downreg_enriched_genesets$term_id,
annot_source = sapply(strsplit(term_id, "%"), function(x) x[2]),
ID = sapply(strsplit(term_id, "%"), function(x) x[3])) %>%
dplyr::select(c(term_name, term_size, annot_source, enriched_allDE,
enriched_Rplus, enriched_Rminus, ID)) %>%
dplyr::arrange(!enriched_allDE, term_size)
#knitr::kable(results_table, format = "html", align = "c")
results_table %>%
gt::gt() %>%
gt::opt_table_lines(extent = "all")
| term_name | term_size | annot_source | enriched_allDE | enriched_Rplus | enriched_Rminus | ID |
|---|---|---|---|---|---|---|
| regulation of T cell receptor signaling pathway | 52 | GOBP | TRUE | FALSE | TRUE | GO:0050856 |
| regulation of antigen receptor-mediated signaling pathway | 67 | GOBP | TRUE | FALSE | TRUE | GO:0050854 |
| negative regulation of cell migration | 252 | GOBP | TRUE | TRUE | FALSE | GO:0030336 |
| negative regulation of cell motility | 263 | GOBP | TRUE | TRUE | FALSE | GO:2000146 |
| negative regulation of locomotion | 268 | GOBP | TRUE | TRUE | FALSE | GO:0040013 |
| regulation of immune response | 888 | GOBP | TRUE | FALSE | TRUE | GO:0050776 |
| regulation of immune system process | 1387 | GOBP | TRUE | FALSE | FALSE | GO:0002682 |
| regulation of cellular process | 10154 | GOBP | TRUE | FALSE | FALSE | GO:0050794 |
| regulation of biological process | 10596 | GOBP | TRUE | FALSE | FALSE | GO:0050789 |
| negative regulation of T cell receptor signaling pathway | 33 | GOBP | FALSE | FALSE | TRUE | GO:0050860 |
| Immunoregulatory interactions between a Lymphoid and a non-Lymphoid cell | 217 | REACTOME DATABASE ID RELEASE 95 | FALSE | FALSE | TRUE | 198933 |
| positive regulation of developmental process | 969 | GOBP | FALSE | TRUE | FALSE | GO:0051094 |
| regulation of multicellular organismal process | 2359 | GOBP | FALSE | TRUE | FALSE | GO:0051239 |
Among the three set of terms, the DE terms, the R+ terms, and
the R- terms, there are 13 unique terms. R+ terms and
R- terms each overlap partially with DE terms, and all DE terms with
term size between min_genes and
max_genes are in either R+ or R-, but
there are no shared terms between the R+ and R- groups. (4)
The disparity between the DE group and the R+ and R- groups highlights the fallibility of thresholded enrichment analysis: the sizes of small gene lists can impact the estimated significance of a signal. Large terms identified as enriched in the DE group are not identified in analysis of the smaller response-specific groups, and some terms identified as enriched in the R+ or R- group when analyzed separately do not pass the significance threshold when the groups are analyzed together. Terms identified in enrichment analysis of both the DE group and either the R+ or R- group may be especially significant. Terms identified in independent analysis of the R+ or R- group may aid in interpreting terms identified in the analysis of the full DE group.
All the terms identified as enriched in the DE genes are involved in regulation of biological processes. Specifically, several terms involving regulation of immune response and cell motility are enriched. Genes involved in regulation of immune response, T-cell receptor signalling pathways and antigen receptor-mediated signalling pathways in particular, are enriched, and are specifically enriched in R- genes. The term ‘negative regulation of T cell receptor signaling pathway,’ was furthermore identified as enriched in the R- gene list, though not in the full DE gene list due to the presence of R+ genes weakening the signal, and suggests that the R- genes involved in regulation of T-cell receptor signalling pathways specifically negatively regulate these pathways, and that their downregulation in R patients corresponds to positive regulation of T-cell related pathways. Genes involving negative regulation of cell migration, motility and locomotion are enriched, and are specifically enriched in R+ genes.
The enrichment of terms related to immune response regulation, especially negative regulation of T-cell pathways in R- genes, supports Barcella et al’s conclusions that R patients specifically modulate genes involved in T-cell activation while NR patients do not, including their identification of genes involved in signal transduction in the T-cell antigen receptor complex being upregulated in R patients, and genes inhibiting T-cell activation being downregulated.
On the other hand, the enrichment of terms related to negative regulation of cell migration in R+ genes does not support the paper’s conclusions or suggested mechanisms, as Barcella et al do not discuss enrichment of genes related to cell motility or their potential role in the biological mechanisms of therapy-modulated septic shock response. (1)
However, Barcella et al’s analysis identifies the GO terms ‘positive regulation of cell migration,’ and ‘regulation of cell migration’ as being enriched from T1 to T2 in NR patients exclusively, with the latter term highlighted as especially significant. Although this result is not factored into their conclusions, it stands in support of our observation of the enrichment of genes negatively regulating cell migration in the R- group.
Other studies have demonstrated that genes involved in activation of leukocyte migration are be overexpressed in the endothelial cells of septic shock patients (Langston et al. 2022) (Cao and Robinson 2014), and that the inflammatory cascade caused by sepsis is known to cause dysregulated neutrophil migration into critical organs(Joffre et al. 2020), where they can damage the endothelium and drive inflammation (Hattori et al. 2017). The role of increased cell migration in septic symptoms supports the enrichment of genes negatively regulating cell migration in patients who have responded positively to hemodynamic therapy. (2)
Next, we’ll perform non-thresholded gene set enrichment analysis on the entire list of genes in the data. We’ll analyze the distribution of terms throughout our entire list of genes. We begin by ranking the genes. We assign ranks according to significance, with genes that saw higher expression in R patients receiving a positive number, and genes that saw lower expression in R patients receiving a negative number.
#get all genes, calculate ranks, and arrange by rank
ranked_genes <- as.data.frame(response_T2) %>%
dplyr::mutate(rank = -log(padj, base = 10) * sign(log2FoldChange)) %>%
dplyr::arrange(desc(rank)) %>%
tibble::rownames_to_column("geneid") %>%
dplyr::select(c(geneid, rank))
#write the ranked gene list to a file to input into gsea algorithm
rank_file <- file.path(data_dir, "ranked_genes.rnk")
write.table(ranked_genes, rank_file, quote = FALSE, sep = "\t", row.names = FALSE)
To perform gene set enrichment analysis, we use the original GSEA algorithm (Subramanian et al. 2005), implemented via the GSEAPreRanked function in the GSEA 4.4.0 jar file for the command line. For the genesets, we’ll use the same Bader lab .gmt file that we used for the thresholded analysis. (1)
gsea_jar <- file.path(data_dir, "GSEA_4.4.0", "gsea-cli.sh")
analysis_name <- "T2_R_vs_NR"
gsea_cmd <- paste("", gsea_jar,
"GSEAPreRanked -gmx", cleaned_gmt_file,
"-rnk" , rank_file,
"-collapse false -nperm 1000 -scoring_scheme weighted",
"-rpt_label", analysis_name,
"-plot_top_x 20 -rnd_seed 6789 -set_max", max_genes,
"-set_min", min_genes, "-zip_report false",
"-out", data_dir,
"> gsea_output.txt",sep = " ")
gsea_run <- FALSE
if(length(list.files(data_dir, pattern = "^T2_R_vs_NR\\.GseaPreranked.")) == 0 & !gsea_run) {
suppressWarnings(system(gsea_cmd))
gsea_run <- TRUE
}
The most significantly enriched terms identified in the non-thresholded GSEA include many of the same terms identified in the thresholded ORA relate to RNA processing and DNA and protein synthesis. Most of the highly significant terms are enriched specifically among R- genes, with only few highly significant terms, related to insulin signalling and autophagy, enriched among the R+ genes. (2)
The terms identified in the thresholded analysis do not appear among the top significant terms identified in the non-thresholded analysis. However, as shown in figure 2, many terms relating to immune cell regulation are identified as significantly enriched in the GSEA. Although this comparison is not straightforward, since ORA results are binary while GSEA results are extensive and continuous, the high signal from terms related to RNA processing in the GSEA results that are not present in the ORA results show that the GSEA has identified genesets that were hidden by the arbitrary nature of a thresholded analysis. (3)
In Cytoscape (Shannon et al. 2003), we create an
EnrichmentMap (Merico
et al. 2010) of GSEA results with FDR-corrected p-value
lesser than 0.05 and edge similarity greater than 0.375. The resulting
network, shown in figure 1, has 249 nodes and 2547 edges. (1)
We annotate this network using AutoAnnotate (Kucera et al.
2016), using GS_DESCR as our label column, selecting ‘Layout
network to minimize cluster overlap’, and setting the amount of clusters
one notch from the fewest. After manually reorganizing nodes and
relabelling clusters, we get figure 2. (2)(3)
We collapse the network to a theme network in Cytoscape and note
major themes. The main themes are RNA processing, regulation of immune
cell mediated pathways, DNA repair and autophagy. The high enrichment of
terms relating to RNA processing pathways, DNA repair pathways and
autophagous pathways is novel to the non-thresholded analysis, but these
results do align with the model and the mechanisms involved in septic
shock therapy response. (4)
As with the thresholded ORA results, the enrichment of terms relating to regulation of immune cell response, especially regulation of T-cell and antigen receptor mediated pathways supports Barcella et al’s conclusions.
The non-thresholded analysis also detected enrichment of terms related to regulation of interleukin-10 production, which supports Barcella et al’s identification of interleukin receptor genes exclusively downregulated in the R group.
Terms related to cell migration were not highlighted in the non-thresholded analysis, but many terms related to RNA processing, DNA repair and autophagy, mechanisms which Barcella et al did not explicitly discuss or note, were identified as significant. (1)
Genes related to DNA repair have been identified as biomarkers for sepsis (Gu et al. 2025), which supports the enrichment of DNA repair related terms in R- genes in our analysis, suggesting that hemodynamic therapy reduces expression of these biomarkers among R patients, but not NR patients.
In general, the signalling pathways activated in cases of sepsis are associated with the upregulation of many common gene classes associated with inflamation and cellular metabolism (Bushra et al. 2024), which supports the highly significant enrichment of many nondescript terms related to RNA synthesis and processing in R- genes in our analysis, and the observation of more significantly enriched terms among our R- genes than among our R+ genes, suggesting that many sepsis-associated pathways causing dysfunctional upregulation of genes are downregulated by hemodynamic therapy in R patients, but not in NR patients.
Autophagy has been identified to play a multifarious role in sepsis cases, with potential both to modulate inflammation and promote cell survival (Iba et al. 2024), and to be dysregulated in sepsis signalling pathways and contribute to tissue damage(Yin et al. 2019). This supports our GSEA’s identification of terms related to the regulation of autophagy as enriched among our R+ genes, and suggests that hemodynamic therapy successfully mediates autophagy in responders, but not in non-responders. (2)