Preparing Gene Lists
for Enrichment Analysis
To prepare for ORA, we first split differential expression results
into upregulated only, downregulated only, and all DE genes combined.
These will each be analyzed individually, then compared. Additionally,
we maintain a continuous ranked list that is needed for GSEA.
I use the same filtering for upregulated/downregulated genes as was
used in A1 (FDR < 0.05, |logFC| > 1). After doing this, the
combined list is generated as the union of the up/down lists.
Additionally, I use for the “universe”/ background set all of the tested
genes. This is important, because not all human genes were tested, so
using all human genes as the background would potentially falsely
increase significance of enrichment.
The metric that I used for the GSEA ranked list is
-log10(p-value)*sign(logFC). This captures the statistical significance
in p-value with the biological information in the sign of the logFC. The
result is that low p-value and positive fold change genes are at the top
of the list (most upregulated), with high p-value genes in the middle,
then low p-value negative fold change genes at the end.
up_genes <- rownames(results_table)[results_table$FDR < fdr_threshold &
results_table$logFC > logfc_threshold]
down_genes <- rownames(results_table)[results_table$FDR < fdr_threshold &
results_table$logFC < -logfc_threshold]
all_de_genes <- c(up_genes, down_genes)
background_genes <- rownames(results_table)
results_table$rank_metric <- -log10(results_table$PValue) * sign(results_table$logFC)
ranked_genes <- sort(setNames(results_table$rank_metric, rownames(results_table)),
decreasing = TRUE)
Thresholded
Over-Representation Analysis
Method Choice
I chose to perform ORA using clusterProfiler (Yu et al.,
2012). clusterProfiler is a Bioconductor package for ORA that uses
Benjamini-Hochberg FDR correction. Some reasons I chose to user it
include that it is fully self-contained in R and it has extensive
documentation so that I can fully understand its mechanisms.
For the annotation database, I chose Gene Ontology Biological Process
(GO:BP), which is sourced from
org.Hs.eg.db. To my knowledge, GO:BP provides the broadest and most granular coverage
of biological processes, which is why I chose it.
For all ORA runs, I used the following thresholds: - Adjusted
p-value: < 0.05 - Minimum gene set size: 10 - Maximum gene set size:
500 - Background universe: all 19,735 tested genes (again, NOT all human
genes)
Converting Gene
Symbols to Entrez IDs
enrichGO() from clusterProfiler is going to convert to
entrez IDs anyways. This worried me, because I did not want to lose any
data or miss errors without knowing. Because of this, I chose to
manually convert to entrez IDs before running clusterProfiler. As shown
below, all genes were successfully converted.
symbol_to_entrez <- function(gene_symbols) {
mapping <- AnnotationDbi::select(org.Hs.eg.db,
keys = gene_symbols,
columns = "ENTREZID",
keytype = "SYMBOL")
mapping <- mapping[!is.na(mapping$ENTREZID), ]
mapping <- mapping[!duplicated(mapping$SYMBOL), ]
return(setNames(mapping$ENTREZID, mapping$SYMBOL))
}
up_entrez <- symbol_to_entrez(up_genes)
down_entrez <- symbol_to_entrez(down_genes)
all_entrez <- symbol_to_entrez(all_de_genes)
bg_entrez <- symbol_to_entrez(background_genes)
cat("Upregulated genes mapped:", length(up_entrez), "/", length(up_genes), "\n")
## Upregulated genes mapped: 313 / 313
cat("Downregulated genes mapped:", length(down_entrez), "/", length(down_genes), "\n")
## Downregulated genes mapped: 247 / 247
cat("All DE genes mapped:", length(all_entrez), "/", length(all_de_genes), "\n")
## All DE genes mapped: 560 / 560
cat("Background mapped:", length(bg_entrez), "/", length(background_genes), "\n")
## Background mapped: 19735 / 19735
ORA: Upregulated
Genes
Next, I run enrichGO() run on the upregulated gene list.
I use readable=true so that the Entrez IDs are converted
back for the final result for readability.
ora_up <- enrichGO(
gene = as.character(up_entrez),
OrgDb = org.Hs.eg.db,
ont = "BP",
pAdjustMethod = "BH",
pvalueCutoff = 0.05,
qvalueCutoff = 0.05,
universe = as.character(bg_entrez),
minGSSize = 10,
maxGSSize = 500,
readable = TRUE
)
cat("Upregulated ORA: genesets returned:", nrow(ora_up@result[ora_up@result$p.adjust < 0.05, ]), "\n")
## Upregulated ORA: genesets returned: 107
if (nrow(ora_up) > 0) {
p1 <- dotplot(ora_up, showCategory = 20, font.size = 10) +
ggtitle("GO:BP Over-Representation — Upregulated Genes") +
theme(plot.title = element_text(hjust = 0.5, size = 12))
print(p1)
}

Figure 1. This is a dot plot of the top 20 GO:BP terms enriched in upregulated genes
(logFC > 1, FDR < 0.05). The x-axis shows the proportion of
pathway genes present in the upregulated list, and the y-axis shows GO:BP term descriptions ordered by gene
ratio. Dot size is proportional to the count of DE genes in each term;
dot colour represents adjusted p-value (darker = more significant).
if (nrow(ora_up) > 0) {
up_table <- ora_up@result[ora_up@result$p.adjust < 0.05,
c("Description", "GeneRatio", "BgRatio", "pvalue", "p.adjust", "Count")]
up_table$pvalue <- formatC(up_table$pvalue, format = "e", digits = 2)
up_table$p.adjust <- formatC(up_table$p.adjust, format = "e", digits = 2)
knitr::kable(head(up_table, 20),
caption = "**Table 2.** Top 20 GO:BP terms enriched in upregulated genes (FDR < 0.05).",
row.names = FALSE)
}
Table 2. Top 20 GO:BP terms enriched in upregulated genes (FDR <
0.05).
| antibacterial humoral response |
10/275 |
51/15989 |
1.32e-08 |
4.42e-05 |
10 |
| defense response to Gram-positive bacterium |
12/275 |
95/15989 |
8.11e-08 |
1.36e-04 |
12 |
| fat cell differentiation |
17/275 |
231/15989 |
5.33e-07 |
5.04e-04 |
17 |
| humoral immune response |
17/275 |
233/15989 |
6.02e-07 |
5.04e-04 |
17 |
| antimicrobial humoral response |
12/275 |
128/15989 |
2.13e-06 |
1.42e-03 |
12 |
| intermediate filament organization |
8/275 |
56/15989 |
4.81e-06 |
2.20e-03 |
8 |
| epidermis development |
19/275 |
335/15989 |
5.72e-06 |
2.20e-03 |
19 |
| response to corticosteroid |
12/275 |
141/15989 |
5.86e-06 |
2.20e-03 |
12 |
| multi-organism reproductive process |
14/275 |
195/15989 |
7.25e-06 |
2.20e-03 |
14 |
| response to glucocorticoid |
11/275 |
121/15989 |
7.65e-06 |
2.20e-03 |
11 |
| response to steroid hormone |
18/275 |
311/15989 |
7.67e-06 |
2.20e-03 |
18 |
| cellular response to corticosteroid stimulus |
8/275 |
60/15989 |
8.16e-06 |
2.20e-03 |
8 |
| antimicrobial humoral immune response mediated by
antimicrobial peptide |
10/275 |
100/15989 |
8.53e-06 |
2.20e-03 |
10 |
| multi-multicellular organism process |
14/275 |
200/15989 |
9.69e-06 |
2.32e-03 |
14 |
| muscle system process |
21/275 |
426/15989 |
1.59e-05 |
3.42e-03 |
21 |
| keratinocyte differentiation |
11/275 |
131/15989 |
1.63e-05 |
3.42e-03 |
11 |
| innate immune response in mucosa |
5/275 |
20/15989 |
1.82e-05 |
3.55e-03 |
5 |
| monocyte chemotaxis |
7/275 |
49/15989 |
1.91e-05 |
3.55e-03 |
7 |
| defense response to bacterium |
15/275 |
245/15989 |
2.31e-05 |
4.07e-03 |
15 |
| cellular response to glucocorticoid stimulus |
7/275 |
51/15989 |
2.50e-05 |
4.14e-03 |
7 |
ORA: Downregulated
Genes
Next, enrichGO()was run on the downregulated gene list.
I use the exact same parameters as for the upregulated genes, so that
they two results can be meaningfully compared.
ora_down <- enrichGO(
gene = as.character(down_entrez),
OrgDb = org.Hs.eg.db,
ont = "BP",
pAdjustMethod = "BH",
pvalueCutoff = 0.05,
qvalueCutoff = 0.05,
universe = as.character(bg_entrez),
minGSSize = 10,
maxGSSize = 500,
readable = TRUE
)
cat("Downregulated ORA: genesets returned:", nrow(ora_down@result[ora_down@result$p.adjust < 0.05, ]), "\n")
## Downregulated ORA: genesets returned: 13
if (nrow(ora_down) > 0) {
p2 <- dotplot(ora_down, showCategory = 20, font.size = 10) +
ggtitle("GO:BP Over-Representation — Downregulated Genes") +
theme(plot.title = element_text(hjust = 0.5, size = 12))
print(p2)
}

Figure 2. This is a dot plot of the top 20 GO:BP terms enriched in downregulated genes
(logFC < -1, FDR < 0.05). he x-axis shows the proportion of
pathway genes present in the upregulated list, and the y-axis shows GO:BP term descriptions ordered by gene
ratio. Dot size is proportional to the count of DE genes in each term;
dot colour represents adjusted p-value (darker = more significant).
if (nrow(ora_down) > 0) {
down_table <- ora_down@result[ora_down@result$p.adjust < 0.05,
c("Description", "GeneRatio", "BgRatio", "pvalue", "p.adjust", "Count")]
down_table$pvalue <- formatC(down_table$pvalue, format = "e", digits = 2)
down_table$p.adjust <- formatC(down_table$p.adjust, format = "e", digits = 2)
knitr::kable(head(down_table, 20),
caption = "**Table 3.** Top 20 GO:BP terms enriched in downregulated genes (FDR < 0.05).",
row.names = FALSE)
}
Table 3. Top 20 GO:BP terms enriched in downregulated genes (FDR <
0.05).
| female pregnancy |
12/167 |
175/15989 |
3.05e-07 |
5.46e-04 |
12 |
| multi-organism reproductive process |
12/167 |
195/15989 |
9.73e-07 |
7.58e-04 |
12 |
| multi-multicellular organism process |
12/167 |
200/15989 |
1.27e-06 |
7.58e-04 |
12 |
| sensory perception of smell |
10/167 |
202/15989 |
5.25e-05 |
2.35e-02 |
10 |
| detection of chemical stimulus involved in sensory
perception |
10/167 |
214/15989 |
8.51e-05 |
2.51e-02 |
10 |
| detection of chemical stimulus involved in sensory
perception of smell |
9/167 |
175/15989 |
9.34e-05 |
2.51e-02 |
9 |
| sensory perception of chemical stimulus |
11/167 |
262/15989 |
9.83e-05 |
2.51e-02 |
11 |
| detection of stimulus involved in sensory
perception |
11/167 |
288/15989 |
2.25e-04 |
4.36e-02 |
11 |
| regulation of tooth mineralization |
3/167 |
13/15989 |
2.96e-04 |
4.36e-02 |
3 |
| detection of chemical stimulus |
10/167 |
250/15989 |
3.01e-04 |
4.36e-02 |
10 |
| detection of stimulus |
13/167 |
401/15989 |
3.05e-04 |
4.36e-02 |
13 |
| killing of cells of another organism |
6/167 |
88/15989 |
3.17e-04 |
4.36e-02 |
6 |
| disruption of cell in another organism |
6/167 |
88/15989 |
3.17e-04 |
4.36e-02 |
6 |
ORA: Combined, all DE
Genes
I also ran with the combined gene list. Again, I used all of the same
parameters to ensure comparability.
ora_all <- enrichGO(
gene = as.character(all_entrez),
OrgDb = org.Hs.eg.db,
ont = "BP",
pAdjustMethod = "BH",
pvalueCutoff = 0.05,
qvalueCutoff = 0.05,
universe = as.character(bg_entrez),
minGSSize = 10,
maxGSSize = 500,
readable = TRUE
)
cat("Combined ORA: genesets returned:", nrow(ora_all@result[ora_all@result$p.adjust < 0.05, ]), "\n")
## Combined ORA: genesets returned: 43
if (nrow(ora_all) > 0) {
p3 <- dotplot(ora_all, showCategory = 20, font.size = 10) +
ggtitle("GO:BP Over-Representation — All DE Genes") +
theme(plot.title = element_text(hjust = 0.5, size = 12))
print(p3)
}

Figure 3. This is a dot plot of the top 20 GO:BP terms enriched in all differentially
expressed genes (FDR < 0.05). The x-axis shows the proportion of
pathway genes present in the upregulated list, and the y-axis shows GO:BP term descriptions ordered by gene
ratio. Dot size is proportional to the count of DE genes in each term;
dot colour represents adjusted p-value (darker = more significant).
Comparing Up vs. Down
vs. Combined Results
To numerically compare these analyses, the set of significant GO:BP term descriptions is extracted from
each result object using a helper function, and set operations
(intersect(), setdiff(), union())
are applied to look into shared/unique terms.
get_sig_terms <- function(ora_result) {
if (is.null(ora_result) || nrow(ora_result) == 0) return(character(0))
ora_result@result$Description[ora_result@result$p.adjust < 0.05]
}
up_terms <- get_sig_terms(ora_up)
down_terms <- get_sig_terms(ora_down)
all_terms <- get_sig_terms(ora_all)
shared_up_down <- intersect(up_terms, down_terms)
only_up <- setdiff(up_terms, down_terms)
only_down <- setdiff(down_terms, up_terms)
only_combined <- setdiff(all_terms, union(up_terms, down_terms))
comparison_df <- data.frame(
Analysis = c("Upregulated only", "Downregulated only", "Combined (all DE)",
"Shared (up and down)", "Unique to combined"),
Gene_Sets = c(length(up_terms), length(down_terms), length(all_terms),
length(shared_up_down), length(only_combined))
)
knitr::kable(comparison_df,
caption = "**Table 4.** Count of significant GO:BP gene sets (FDR < 0.05) across the three ORA analyses.",
col.names = c("Analysis", "Significant Gene Sets"))
Table 4. Count of significant GO:BP gene sets (FDR < 0.05) across the three ORA
analyses.
| Upregulated only |
107 |
| Downregulated only |
13 |
| Combined (all DE) |
43 |
| Shared (up and down) |
3 |
| Unique to combined |
7 |
Non-Thresholded Gene
Set Enrichment Analysis
Method
For GSEA, I decided to use the fgsea package
(Korotkevich et al., 2021). I know this package to be computationally
efficient, and it will use the ranked gene list that was already
created.
The gene sets that I decided to use are the MSigDB Hallmark
collection (v2023.2.Hs). I specifically chose to use the Hallmark gene
sets over GO:BP because they are
manually created from lots of overlapping gene sets, so there is minimal
redundancy. I wanted the cytoscape visualization to be manageable, and
only include unique pathways. Additionally, these gene sets can be
accessed via the msigdbr package. I used the following
parameters: - Ranking metric: -log10(p-value) * sign(logFC) -
minSize = 15, maxSize = 500 -
nperm = 1000 - Significance threshold for reporting: FDR
< 0.25 (conventional for GSEA, Subramanian et al., 2005)
Running fgsea
msig_hallmark <- msigdbr(species = "Homo sapiens", category = "H")
hallmark_list <- split(msig_hallmark$gene_symbol, msig_hallmark$gs_name)
cat("Number of Hallmark gene sets:", length(hallmark_list), "\n")
## Number of Hallmark gene sets: 50
cat("Total genes in ranked list:", length(ranked_genes), "\n")
## Total genes in ranked list: 19735
set.seed(42)
fgsea_results <- fgsea(
pathways = hallmark_list,
stats = ranked_genes,
minSize = 15,
maxSize = 500,
nperm = 1000
)
fgsea_results <- fgsea_results[order(fgsea_results$NES, decreasing = TRUE), ]
cat("\nGSEA results summary:\n")
##
## GSEA results summary:
cat("Total pathways tested:", nrow(fgsea_results), "\n")
## Total pathways tested: 50
cat("Significant pathways (FDR < 0.25):", sum(fgsea_results$padj < 0.25, na.rm = TRUE), "\n")
## Significant pathways (FDR < 0.25): 38
cat("Significant pathways (FDR < 0.05):", sum(fgsea_results$padj < 0.05, na.rm = TRUE), "\n")
## Significant pathways (FDR < 0.05): 36
cat("\nTop 10 by NES (upregulated):\n")
##
## Top 10 by NES (upregulated):
print(head(fgsea_results[fgsea_results$NES > 0, c("pathway", "NES", "padj", "size")], 10))
## pathway NES padj size
## <char> <num> <num> <int>
## 1: HALLMARK_TNFA_SIGNALING_VIA_NFKB 2.667822 0.003154773 199
## 2: HALLMARK_MYC_TARGETS_V1 2.650398 0.003154773 199
## 3: HALLMARK_HYPOXIA 2.537022 0.003154773 196
## 4: HALLMARK_OXIDATIVE_PHOSPHORYLATION 2.359163 0.003154773 200
## 5: HALLMARK_EPITHELIAL_MESENCHYMAL_TRANSITION 2.313043 0.003154773 200
## 6: HALLMARK_MTORC1_SIGNALING 2.304354 0.003154773 199
## 7: HALLMARK_UV_RESPONSE_UP 2.179340 0.003154773 156
## 8: HALLMARK_TGF_BETA_SIGNALING 2.177949 0.003154773 54
## 9: HALLMARK_E2F_TARGETS 2.176156 0.003154773 200
## 10: HALLMARK_P53_PATHWAY 2.058446 0.003154773 197
cat("\nBottom 10 by NES (downregulated):\n")
##
## Bottom 10 by NES (downregulated):
print(head(fgsea_results[fgsea_results$NES < 0, c("pathway", "NES", "padj", "size")], 10))
## pathway NES padj size
## <char> <num> <num> <int>
## 1: HALLMARK_SPERMATOGENESIS -0.7731071 0.9645777 120
## 2: HALLMARK_INTERFERON_ALPHA_RESPONSE -0.9129588 0.6960163 97
## 3: HALLMARK_BILE_ACID_METABOLISM -1.0214033 0.4456654 106
Visualizing GSEA
Results
I visualized this with a horizontal bar plot, sorted by NES. The
names of the pathways have been cleaned by removing the
HALLMARK_ prefix and replacing the underscores with
spaces.
All 38 significant pathways have a positive NES, so all the bars are
red. This means that we donn’t see any pathways significantly enriched
at the baseline, as opposed to post-NAC.
fgsea_sig <- fgsea_results[!is.na(fgsea_results$padj) & fgsea_results$padj < 0.25, ]
fgsea_sig <- fgsea_sig[order(fgsea_sig$NES), ]
fgsea_sig$pathway_clean <- gsub("HALLMARK_", "", fgsea_sig$pathway)
fgsea_sig$pathway_clean <- gsub("_", " ", fgsea_sig$pathway_clean)
fgsea_sig$direction <- ifelse(fgsea_sig$NES > 0, "Upregulated", "Downregulated")
ggplot(fgsea_sig, aes(x = NES, y = reorder(pathway_clean, NES), fill = direction)) +
geom_bar(stat = "identity", alpha = 0.85) +
scale_fill_manual(values = c("Upregulated" = "#d73027", "Downregulated" = "#4575b4")) +
geom_vline(xintercept = 0, linetype = "dashed", color = "black") +
labs(
title = "GSEA: Hallmark Pathways (FDR < 0.25)",
subtitle = "Ranked by Normalized Enrichment Score (NES)",
x = "Normalized Enrichment Score (NES)",
y = "Hallmark Gene Set",
fill = "Direction",
caption = "Positive NES = enriched in FollowUp (post-NAC); Negative NES = enriched in Baseline (pre-NAC)"
) +
theme_bw(base_size = 11) +
theme(
plot.title = element_text(hjust = 0.5, face = "bold"),
plot.subtitle = element_text(hjust = 0.5),
legend.position = "bottom"
)

Figure 4. Bar plot of all Hallmark gene sets
significant at FDR < 0.25 from fgsea analysis, sorted by NES. All 38
significant pathways are enriched post-NAC.
gsea_table <- fgsea_sig[, c("pathway_clean", "NES", "padj", "size", "direction")]
gsea_table$NES <- round(gsea_table$NES, 3)
gsea_table$padj <- formatC(gsea_table$padj, format = "e", digits = 2)
gsea_table <- gsea_table[order(as.numeric(gsea_table$NES), decreasing = TRUE), ]
knitr::kable(gsea_table,
caption = "**Table 5.** All significantly enriched Hallmark gene sets (FDR < 0.25) from non-thresholded GSEA, sorted by NES.",
col.names = c("Pathway", "NES", "FDR (adj. p)", "Gene Set Size", "Direction"),
row.names = FALSE)
Table 5. All significantly enriched Hallmark
gene sets (FDR < 0.25) from non-thresholded GSEA, sorted by
NES.
| TNFA SIGNALING VIA NFKB |
2.668 |
3.15e-03 |
199 |
Upregulated |
| MYC TARGETS V1 |
2.650 |
3.15e-03 |
199 |
Upregulated |
| HYPOXIA |
2.537 |
3.15e-03 |
196 |
Upregulated |
| OXIDATIVE PHOSPHORYLATION |
2.359 |
3.15e-03 |
200 |
Upregulated |
| EPITHELIAL MESENCHYMAL TRANSITION |
2.313 |
3.15e-03 |
200 |
Upregulated |
| MTORC1 SIGNALING |
2.304 |
3.15e-03 |
199 |
Upregulated |
| UV RESPONSE UP |
2.179 |
3.15e-03 |
156 |
Upregulated |
| TGF BETA SIGNALING |
2.178 |
3.15e-03 |
54 |
Upregulated |
| E2F TARGETS |
2.176 |
3.15e-03 |
200 |
Upregulated |
| P53 PATHWAY |
2.058 |
3.15e-03 |
197 |
Upregulated |
| MYC TARGETS V2 |
2.045 |
3.15e-03 |
57 |
Upregulated |
| APOPTOSIS |
2.031 |
3.15e-03 |
161 |
Upregulated |
| UNFOLDED PROTEIN RESPONSE |
1.992 |
3.15e-03 |
111 |
Upregulated |
| ANDROGEN RESPONSE |
1.973 |
3.15e-03 |
100 |
Upregulated |
| ADIPOGENESIS |
1.971 |
3.15e-03 |
199 |
Upregulated |
| CHOLESTEROL HOMEOSTASIS |
1.942 |
3.15e-03 |
74 |
Upregulated |
| MYOGENESIS |
1.925 |
3.15e-03 |
193 |
Upregulated |
| PI3K AKT MTOR SIGNALING |
1.880 |
3.15e-03 |
103 |
Upregulated |
| UV RESPONSE DN |
1.868 |
3.15e-03 |
143 |
Upregulated |
| MITOTIC SPINDLE |
1.866 |
3.15e-03 |
198 |
Upregulated |
| G2M CHECKPOINT |
1.853 |
3.15e-03 |
199 |
Upregulated |
| DNA REPAIR |
1.830 |
3.15e-03 |
149 |
Upregulated |
| REACTIVE OXYGEN SPECIES PATHWAY |
1.782 |
3.15e-03 |
49 |
Upregulated |
| IL2 STAT5 SIGNALING |
1.743 |
3.15e-03 |
198 |
Upregulated |
| PROTEIN SECRETION |
1.734 |
3.15e-03 |
96 |
Upregulated |
| KRAS SIGNALING UP |
1.666 |
3.15e-03 |
197 |
Upregulated |
| GLYCOLYSIS |
1.652 |
3.15e-03 |
198 |
Upregulated |
| HEDGEHOG SIGNALING |
1.630 |
1.79e-02 |
36 |
Upregulated |
| FATTY ACID METABOLISM |
1.595 |
5.22e-03 |
153 |
Upregulated |
| ESTROGEN RESPONSE LATE |
1.593 |
5.22e-03 |
197 |
Upregulated |
| APICAL JUNCTION |
1.482 |
9.88e-03 |
198 |
Upregulated |
| HEME METABOLISM |
1.458 |
1.20e-02 |
192 |
Upregulated |
| ESTROGEN RESPONSE EARLY |
1.444 |
1.38e-02 |
197 |
Upregulated |
| COAGULATION |
1.433 |
2.78e-02 |
129 |
Upregulated |
| PEROXISOME |
1.427 |
2.71e-02 |
102 |
Upregulated |
| INFLAMMATORY RESPONSE |
1.369 |
2.41e-02 |
199 |
Upregulated |
| COMPLEMENT |
1.278 |
7.76e-02 |
197 |
Upregulated |
| ANGIOGENESIS |
1.239 |
1.95e-01 |
35 |
Upregulated |
Comparing GSEA
vs. Thresholded ORA
It would not be straightforward to do a direct term-by-term
comparison between the ORA and GSEA results. A big part of this is that
the methods use different annotation databases, so the pathway names
themselves may not overlap, even if they refer to similar things. Second
and more importantly, the imputs given to these methods are quite
different, since ORA gets a binary gene list, while GSEA sees a
continuous ranking. Because of this, GSEA may find signals that ORA
cannot. Despite this, the high-level results of these methods
can be compared. The ORA results pointed to stress response and
AP-1/MAPK-related terms. While GSEA points to TNF signaling via NF-KB
and inflammatory response Hallmark, this represents similar underlying
biology. Additionally, GSEA noted EMT, MYC targets, and metabolic
programs, while ORA did not. As previously stated, this is likely
because the signals were consistent, but too moderate for ORA to find
individually significant.
Cytoscape Enrichment
Map Visualization
Overview
Next, GSEA results were visualized using Cytoscape (v3.10) with the
EnrichmentMap (v3.3) and AutoAnnotate (v1.4) apps.
Exporting Results for
EnrichmentMap
First, the analysis was exported to get file to run EnrichmentMap
on.
fgsea_results$clean_name <- tools::toTitleCase(
tolower(gsub("_", " ", gsub("HALLMARK_", "", fgsea_results$pathway)))
)
em_generic_clean <- data.frame(
name = fgsea_results$clean_name,
description = fgsea_results$clean_name,
pvalue = fgsea_results$pval,
qvalue = ifelse(is.na(fgsea_results$padj), 1, fgsea_results$padj),
phenotype = ifelse(fgsea_results$NES > 0, 1, -1)
)
write.table(em_generic_clean, "gsea_results_clean.txt",
sep = "\t", row.names = FALSE, quote = FALSE)
write.table(
data.frame(Gene = names(ranked_genes), Score = ranked_genes),
"ranked_genelist_for_em.rnk",
sep = "\t", row.names = FALSE, col.names = FALSE, quote = FALSE
)
msig_hallmark$clean_name <- tools::toTitleCase(
tolower(gsub("_", " ", gsub("HALLMARK_", "", msig_hallmark$gs_name)))
)
gmt_lines <- tapply(msig_hallmark$gene_symbol, msig_hallmark$clean_name, function(genes) {
paste(unique(genes), collapse = "\t")
})
gmt_output <- paste0(names(gmt_lines), "\tna\t", gmt_lines)
writeLines(gmt_output, "hallmark_clean.gmt")
sig <- fgsea_results[!is.na(fgsea_results$padj) & fgsea_results$padj < 0.25, ]
EnrichmentMap
Parameters
The following parameters were used for EnrichmentMap (recorded for
reproducability):
| Analysis type |
Generic/gProfiler/Enrichr |
| Enrichments file |
gsea_results_clean.txt |
| GMT file |
hallmark_clean.gmt (MSigDB Hallmark, v2023.2.Hs) |
| Ranks file |
ranked_genelist_for_em.rnk |
| FDR q-value cutoff |
0.25 |
| Edge similarity metric |
Jaccard+Overlap Combined |
| Edge similarity cutoff |
0.1 |
This analysis was originally run with the edge similarity cutoff at
.375. However, this led to all nodes except 2 being singletons. Because
of this, I reverted this parameter back to the default of .1 so we can
observe patterns in the clustering.
Initial Network
(Prior to Manual Layout)
knitr::include_graphics("figures/cytoscape_raw.png")
Annotated
Network
The, I used AutoAnnotate to identify and label clusters. All
parameters were default (In “quickstart” page, “amount of clusters”
slider in the middle, “label column” = name, “layout network to minimize
cluster overlap” not checked)
knitr::include_graphics("figures/cytoscape_uncollapsed.png")
Theme Network
(Collapsed)
Then, I collapsed the network so that each cluster becomes a single
large node labeled with its theme name.
knitr::include_graphics("figures/cytoscape_collapsed.png")
Major Themes
This analysis yielded six clusters and twelve singletons (from 38
nodes with 23 edges). In the table below I have catalouged each theme,
and made notes on if it was present or inferrable from the original
paper.
themes_df <- data.frame(
Theme = c(
"TNFA Signaling via NF-kB",
"Epithelial Mesenchymal Transition",
"MYC Targets V2",
"Fatty Acid Metabolism",
"Coagulation / Complement",
"Estrogen Response (Early/Late)"
),
Type = c(
"Cluster", "Cluster", "Cluster", "Cluster", "Cluster", "Cluster"
),
Fit_with_Paper = c(
"Yes, paper reports NF-KB and TNF signaling enrichment",
"Yes,EMT is an chemotherapy resistance mechanism",
"Yes, MYC-driven proliferation consistent with aggressive resistant tumors",
"Novel, metabolic reprogramming not highlighted in original paper",
"Partially yes, complement relates to immune TME remodeling reported in paper",
"Yes, estrogen signaling central to luminal subtypes in this dataset"
)
)
knitr::kable(themes_df,
caption = "Table 7. Major biological themes identified in the enrichment map.",
col.names = c("Theme", "Type", "Fit with Original Paper"))
Table 7. Major biological themes identified in the enrichment
map.
| TNFA Signaling via NF-kB |
Cluster |
Yes, paper reports NF-KB and TNF signaling
enrichment |
| Epithelial Mesenchymal Transition |
Cluster |
Yes,EMT is an chemotherapy resistance mechanism |
| MYC Targets V2 |
Cluster |
Yes, MYC-driven proliferation consistent with
aggressive resistant tumors |
| Fatty Acid Metabolism |
Cluster |
Novel, metabolic reprogramming not highlighted in
original paper |
| Coagulation / Complement |
Cluster |
Partially yes, complement relates to immune TME
remodeling reported in paper |
| Estrogen Response (Early/Late) |
Cluster |
Yes, estrogen signaling central to luminal subtypes in
this dataset |
7.1 Comments on clustering
The cytoscape visualization yielded 38 nodes with 23 edges. There were 6 non-overlapping clusters. The vast majority of the clusters represent expected results given the original paper’s analysis. There is more novelty in the singletons, but they also have less statistical power. Overall, this clustering seems in line with what we know biologically. It is notable that there were only 38 significant pathways, and all of them were enriched in the treatment group. One possible explanation for the low number of significant pathways is the nature of the data: these patients did not respond to treatment, so perhaps we should expect them to have less change than the average patient. Additionally, the original paper is focused on how NAC induces change in non-responders, so it makes sense that we see most pathways enriched post-treatment.