if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

# A1
packages <- c("GEOquery", "edgeR", "limma", "knitr",
              "ggplot2", "pheatmap", "org.Hs.eg.db", "dplyr")

# A2 to be used
a2_packages <- c("clusterProfiler", "enrichplot", "fgsea", "msigdbr", "ggrepel")

all_packages <- c(packages, a2_packages)
for (pkg in all_packages) {
  if (!require(pkg, character.only = TRUE, quietly = TRUE)) {
    BiocManager::install(pkg, update = FALSE, ask = FALSE)
    library(pkg, character.only = TRUE)
  }
}
knitr::opts_chunk$set(
  echo = TRUE,
  warning = FALSE,
  message = FALSE,
  fig.width = 10,
  fig.height = 8,
  fig.path = "figures/"
)

1 Introduction

1.1 Dataset Overview

For this assignment, I will perform pathway enrichment analysis using previously generated differential expression analysis. This type of analysis will investigate the processes and pathways associated with neoadjuvant chemotherapy (NAC) resistance in breast cancer, which is the dataset used.

The dataset used is GSE309004, which can be accessed at: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE309004. The original study (Guevara-Nieto et al., 2025) that this data is from is a paired study, where the same patient would contribute a biopsy at two different times (before and after NAC). All 35 of the patients in this dataset are “non-responders”, meaning they showed no response to chemotherapy.

The original study focused on the transcriptome changes between samples to try to learn about treatment resistance. They report an array of findings, including genes FOS and NR4A1 being consistently differentially expressed, IL-17, MAPK, NF-κB, and TNF signaling pathways being enriched, and post-NAC infiltration of CD4 memory T cells, regulatory T cells, neutrophils, and macrophages being increased. Throughout this report, I will compare my findings to what the original authors found, both on a statistical level and a biological background level.

1.2 Reproducing the A1 Pipeline

Here, I re-create the full pipeline from assignment 1 that was used to do data processing and differential expression. I chose to do this rather than loading a saved file for the sake of reproducability. This means the notebook is self-contained and can be run start to finish. Note: the statistical tests and visualization done to justify which outliers were removed and which normalization method was chosen are not included. The purpose of this part is mostly so this report can be ran on the raw data.

1.2.1 Data Loading

First, the GEO data is downloaded using getGEO(). To make sure that we only keep gene symbols that are valid HUGO symbols, they are compared to org.Hs.eg.db. The three statistical outliers identified in A1 (X014Q.Tn, X083Q.Tn, X226B.Tn) are then removed.

geo_id <- "GSE309004"
if (!file.exists(paste0(geo_id, "_metadata.Rdata"))) {
  gse <- getGEO(geo_id, GSEMatrix = TRUE, getGPL = FALSE)
  save(gse, file = paste0(geo_id, "_metadata.Rdata"))
} else {
  load(paste0(geo_id, "_metadata.Rdata"))
}
sample_info <- pData(gse[[1]])
files <- list.files(geo_id, full.names = TRUE)
count_file <- files[1]
expression_data <- read.table(count_file,
                              header = TRUE,
                              row.names = 1,
                              sep = "\t")
valid_symbols <- keys(org.Hs.eg.db, keytype = "SYMBOL")
mapped_genes <- rownames(expression_data) %in% valid_symbols
expression_clean <- expression_data[mapped_genes, ]
clean_colnames <- gsub("^X", "", colnames(expression_clean))
clean_colnames <- gsub("\\.", "-", clean_colnames)
match_idx <- match(clean_colnames, sample_info$title)
sample_info <- sample_info[match_idx, ]
sample_info$condition <- ifelse(grepl("B", sample_info$title) &
                                !grepl("Q", sample_info$title),
                                "Baseline", "FollowUp")
outliers <- c("X014Q.Tn", "X083Q.Tn", "X226B.Tn")
expression_clean <- expression_clean[, !(colnames(expression_clean) %in% outliers)]
outlier_titles <- gsub("^X", "", gsub("\\.", "-", outliers))
sample_info <- sample_info[!(sample_info$title %in% outlier_titles), ]

1.2.2 Normalization and Differential Expression

Since edgeR’s TMM normalization needs raw counts, those are loaded. Then, intersect() is used to restrict this to only genes that appear in both the cleaned expression matrix and the raw counts file. TMM normalization is then applied.

For differential expression, “Baseline” condition is set as the reference level (this is pre-treatment). Differential expression is extracted with the pipeline (estimateDisp()glmQLFit()glmQLFTest()). Finally, the results are saved for all genes to prepare for using GSEA (which wants the FULL ranked list).

raw_file <- list.files(geo_id, pattern = "rawcounts", full.names = TRUE)
raw_counts <- read.table(gzfile(raw_file),
                         header = TRUE,
                         row.names = 1,
                         sep = "\t")
common_genes <- intersect(rownames(expression_clean), rownames(raw_counts))
raw_counts <- raw_counts[common_genes, colnames(expression_clean)]
dge <- DGEList(counts = raw_counts)
dge <- calcNormFactors(dge, method = "TMM")
final_expression_data <- cpm(dge, log = TRUE, prior.count = 1)
group <- factor(sample_info$condition, levels = c("Baseline", "FollowUp"))
design <- model.matrix(~group)
colnames(design) <- c("Intercept", "FollowUp_vs_Baseline")
dge_de <- DGEList(counts = raw_counts)
dge_de <- calcNormFactors(dge_de, method = "TMM")
dge_de <- estimateDisp(dge_de, design)
fit <- glmQLFit(dge_de, design)
qlf <- glmQLFTest(fit, coef = 2)
results_table <- as.data.frame(topTags(qlf, n = Inf))
fdr_threshold <- 0.05
logfc_threshold <- 1
sig_genes <- results_table[results_table$FDR < fdr_threshold &
                           abs(results_table$logFC) > logfc_threshold, ]

1.2.3 A1 Summary Statistics

Here, I generate a table to summarize some important statistics from the reproduction of A1. These numbers confirm that the reproduced pipeline matches the original results, and provide some more context about the data.

n_baseline <- sum(sample_info$condition == "Baseline")
n_followup <- sum(sample_info$condition == "FollowUp")
n_up <- sum(results_table$FDR < fdr_threshold & results_table$logFC > logfc_threshold)
n_down <- sum(results_table$FDR < fdr_threshold & results_table$logFC < -logfc_threshold)

stats_df <- data.frame(
  Category = c(
    "Original samples", "Samples after QC", "Baseline samples", "FollowUp samples",
    "Outliers removed", "Original genes", "Mapped to HUGO (97.28%)", "Unmapped genes",
    "Total genes tested", "DE genes (FDR<0.05, |logFC|>1)", "Upregulated", "Downregulated"
  ),
  Value = c(54, 51, n_baseline, n_followup, 3, 20287, 19736, 551,
            nrow(results_table), nrow(sig_genes), n_up, n_down)
)

knitr::kable(stats_df,
             caption = "**Table 1.** Summary statistics from the reproduced Assignment 1 pipeline.",
             col.names = c("Metric", "Count"))
Table 1. Summary statistics from the reproduced Assignment 1 pipeline.
Metric Count
Original samples 54
Samples after QC 51
Baseline samples 26
FollowUp samples 25
Outliers removed 3
Original genes 20287
Mapped to HUGO (97.28%) 19736
Unmapped genes 551
Total genes tested 19735
DE genes (FDR<0.05, |logFC|>1) 560
Upregulated 313
Downregulated 247

2 Preparing Gene Lists for Enrichment Analysis

To prepare for ORA, we first split differential expression results into upregulated only, downregulated only, and all DE genes combined. These will each be analyzed individually, then compared. Additionally, we maintain a continuous ranked list that is needed for GSEA.

I use the same filtering for upregulated/downregulated genes as was used in A1 (FDR < 0.05, |logFC| > 1). After doing this, the combined list is generated as the union of the up/down lists. Additionally, I use for the “universe”/ background set all of the tested genes. This is important, because not all human genes were tested, so using all human genes as the background would potentially falsely increase significance of enrichment.

The metric that I used for the GSEA ranked list is -log10(p-value)*sign(logFC). This captures the statistical significance in p-value with the biological information in the sign of the logFC. The result is that low p-value and positive fold change genes are at the top of the list (most upregulated), with high p-value genes in the middle, then low p-value negative fold change genes at the end.

up_genes      <- rownames(results_table)[results_table$FDR < fdr_threshold &
                                          results_table$logFC > logfc_threshold]
down_genes    <- rownames(results_table)[results_table$FDR < fdr_threshold &
                                          results_table$logFC < -logfc_threshold]
all_de_genes  <- c(up_genes, down_genes)
background_genes <- rownames(results_table)
results_table$rank_metric <- -log10(results_table$PValue) * sign(results_table$logFC)
ranked_genes <- sort(setNames(results_table$rank_metric, rownames(results_table)),
                     decreasing = TRUE)

3 Thresholded Over-Representation Analysis

3.1 Method Choice

I chose to perform ORA using clusterProfiler (Yu et al., 2012). clusterProfiler is a Bioconductor package for ORA that uses Benjamini-Hochberg FDR correction. Some reasons I chose to user it include that it is fully self-contained in R and it has extensive documentation so that I can fully understand its mechanisms.

For the annotation database, I chose Gene Ontology Biological Process (GO:BP), which is sourced from org.Hs.eg.db. To my knowledge, GO:BP provides the broadest and most granular coverage of biological processes, which is why I chose it.

For all ORA runs, I used the following thresholds: - Adjusted p-value: < 0.05 - Minimum gene set size: 10 - Maximum gene set size: 500 - Background universe: all 19,735 tested genes (again, NOT all human genes)

3.2 Converting Gene Symbols to Entrez IDs

enrichGO() from clusterProfiler is going to convert to entrez IDs anyways. This worried me, because I did not want to lose any data or miss errors without knowing. Because of this, I chose to manually convert to entrez IDs before running clusterProfiler. As shown below, all genes were successfully converted.

symbol_to_entrez <- function(gene_symbols) {
  mapping <- AnnotationDbi::select(org.Hs.eg.db,
                                   keys = gene_symbols,
                                   columns = "ENTREZID",
                                   keytype = "SYMBOL")
  mapping <- mapping[!is.na(mapping$ENTREZID), ]
  mapping <- mapping[!duplicated(mapping$SYMBOL), ]
  return(setNames(mapping$ENTREZID, mapping$SYMBOL))
}

up_entrez    <- symbol_to_entrez(up_genes)
down_entrez  <- symbol_to_entrez(down_genes)
all_entrez   <- symbol_to_entrez(all_de_genes)
bg_entrez    <- symbol_to_entrez(background_genes)

cat("Upregulated genes mapped:", length(up_entrez), "/", length(up_genes), "\n")
## Upregulated genes mapped: 313 / 313
cat("Downregulated genes mapped:", length(down_entrez), "/", length(down_genes), "\n")
## Downregulated genes mapped: 247 / 247
cat("All DE genes mapped:", length(all_entrez), "/", length(all_de_genes), "\n")
## All DE genes mapped: 560 / 560
cat("Background mapped:", length(bg_entrez), "/", length(background_genes), "\n")
## Background mapped: 19735 / 19735

3.3 ORA: Upregulated Genes

Next, I run enrichGO() run on the upregulated gene list. I use readable=true so that the Entrez IDs are converted back for the final result for readability.

ora_up <- enrichGO(
  gene          = as.character(up_entrez),
  OrgDb         = org.Hs.eg.db,
  ont           = "BP",
  pAdjustMethod = "BH",
  pvalueCutoff  = 0.05,
  qvalueCutoff  = 0.05,
  universe      = as.character(bg_entrez),
  minGSSize     = 10,
  maxGSSize     = 500,
  readable      = TRUE
)

cat("Upregulated ORA: genesets returned:", nrow(ora_up@result[ora_up@result$p.adjust < 0.05, ]), "\n")
## Upregulated ORA: genesets returned: 107
if (nrow(ora_up) > 0) {
  p1 <- dotplot(ora_up, showCategory = 20, font.size = 10) +
    ggtitle("GO:BP Over-Representation — Upregulated Genes") +
    theme(plot.title = element_text(hjust = 0.5, size = 12))
  print(p1)
}

Figure 1. This is a dot plot of the top 20 GO:BP terms enriched in upregulated genes (logFC > 1, FDR < 0.05). The x-axis shows the proportion of pathway genes present in the upregulated list, and the y-axis shows GO:BP term descriptions ordered by gene ratio. Dot size is proportional to the count of DE genes in each term; dot colour represents adjusted p-value (darker = more significant).

if (nrow(ora_up) > 0) {
  up_table <- ora_up@result[ora_up@result$p.adjust < 0.05,
                             c("Description", "GeneRatio", "BgRatio", "pvalue", "p.adjust", "Count")]
  up_table$pvalue   <- formatC(up_table$pvalue, format = "e", digits = 2)
  up_table$p.adjust <- formatC(up_table$p.adjust, format = "e", digits = 2)
  knitr::kable(head(up_table, 20),
               caption = "**Table 2.** Top 20 GO:BP terms enriched in upregulated genes (FDR < 0.05).",
               row.names = FALSE)
}
Table 2. Top 20 GO:BP terms enriched in upregulated genes (FDR < 0.05).
Description GeneRatio BgRatio pvalue p.adjust Count
antibacterial humoral response 10/275 51/15989 1.32e-08 4.42e-05 10
defense response to Gram-positive bacterium 12/275 95/15989 8.11e-08 1.36e-04 12
fat cell differentiation 17/275 231/15989 5.33e-07 5.04e-04 17
humoral immune response 17/275 233/15989 6.02e-07 5.04e-04 17
antimicrobial humoral response 12/275 128/15989 2.13e-06 1.42e-03 12
intermediate filament organization 8/275 56/15989 4.81e-06 2.20e-03 8
epidermis development 19/275 335/15989 5.72e-06 2.20e-03 19
response to corticosteroid 12/275 141/15989 5.86e-06 2.20e-03 12
multi-organism reproductive process 14/275 195/15989 7.25e-06 2.20e-03 14
response to glucocorticoid 11/275 121/15989 7.65e-06 2.20e-03 11
response to steroid hormone 18/275 311/15989 7.67e-06 2.20e-03 18
cellular response to corticosteroid stimulus 8/275 60/15989 8.16e-06 2.20e-03 8
antimicrobial humoral immune response mediated by antimicrobial peptide 10/275 100/15989 8.53e-06 2.20e-03 10
multi-multicellular organism process 14/275 200/15989 9.69e-06 2.32e-03 14
muscle system process 21/275 426/15989 1.59e-05 3.42e-03 21
keratinocyte differentiation 11/275 131/15989 1.63e-05 3.42e-03 11
innate immune response in mucosa 5/275 20/15989 1.82e-05 3.55e-03 5
monocyte chemotaxis 7/275 49/15989 1.91e-05 3.55e-03 7
defense response to bacterium 15/275 245/15989 2.31e-05 4.07e-03 15
cellular response to glucocorticoid stimulus 7/275 51/15989 2.50e-05 4.14e-03 7

3.4 ORA: Downregulated Genes

Next, enrichGO()was run on the downregulated gene list. I use the exact same parameters as for the upregulated genes, so that they two results can be meaningfully compared.

ora_down <- enrichGO(
  gene          = as.character(down_entrez),
  OrgDb         = org.Hs.eg.db,
  ont           = "BP",
  pAdjustMethod = "BH",
  pvalueCutoff  = 0.05,
  qvalueCutoff  = 0.05,
  universe      = as.character(bg_entrez),
  minGSSize     = 10,
  maxGSSize     = 500,
  readable      = TRUE
)

cat("Downregulated ORA: genesets returned:", nrow(ora_down@result[ora_down@result$p.adjust < 0.05, ]), "\n")
## Downregulated ORA: genesets returned: 13
if (nrow(ora_down) > 0) {
  p2 <- dotplot(ora_down, showCategory = 20, font.size = 10) +
    ggtitle("GO:BP Over-Representation — Downregulated Genes") +
    theme(plot.title = element_text(hjust = 0.5, size = 12))
  print(p2)
}

Figure 2. This is a dot plot of the top 20 GO:BP terms enriched in downregulated genes (logFC < -1, FDR < 0.05). he x-axis shows the proportion of pathway genes present in the upregulated list, and the y-axis shows GO:BP term descriptions ordered by gene ratio. Dot size is proportional to the count of DE genes in each term; dot colour represents adjusted p-value (darker = more significant).

if (nrow(ora_down) > 0) {
  down_table <- ora_down@result[ora_down@result$p.adjust < 0.05,
                                 c("Description", "GeneRatio", "BgRatio", "pvalue", "p.adjust", "Count")]
  down_table$pvalue   <- formatC(down_table$pvalue, format = "e", digits = 2)
  down_table$p.adjust <- formatC(down_table$p.adjust, format = "e", digits = 2)
  knitr::kable(head(down_table, 20),
               caption = "**Table 3.** Top 20 GO:BP terms enriched in downregulated genes (FDR < 0.05).",
               row.names = FALSE)
}
Table 3. Top 20 GO:BP terms enriched in downregulated genes (FDR < 0.05).
Description GeneRatio BgRatio pvalue p.adjust Count
female pregnancy 12/167 175/15989 3.05e-07 5.46e-04 12
multi-organism reproductive process 12/167 195/15989 9.73e-07 7.58e-04 12
multi-multicellular organism process 12/167 200/15989 1.27e-06 7.58e-04 12
sensory perception of smell 10/167 202/15989 5.25e-05 2.35e-02 10
detection of chemical stimulus involved in sensory perception 10/167 214/15989 8.51e-05 2.51e-02 10
detection of chemical stimulus involved in sensory perception of smell 9/167 175/15989 9.34e-05 2.51e-02 9
sensory perception of chemical stimulus 11/167 262/15989 9.83e-05 2.51e-02 11
detection of stimulus involved in sensory perception 11/167 288/15989 2.25e-04 4.36e-02 11
regulation of tooth mineralization 3/167 13/15989 2.96e-04 4.36e-02 3
detection of chemical stimulus 10/167 250/15989 3.01e-04 4.36e-02 10
detection of stimulus 13/167 401/15989 3.05e-04 4.36e-02 13
killing of cells of another organism 6/167 88/15989 3.17e-04 4.36e-02 6
disruption of cell in another organism 6/167 88/15989 3.17e-04 4.36e-02 6

3.5 ORA: Combined, all DE Genes

I also ran with the combined gene list. Again, I used all of the same parameters to ensure comparability.

ora_all <- enrichGO(
  gene          = as.character(all_entrez),
  OrgDb         = org.Hs.eg.db,
  ont           = "BP",
  pAdjustMethod = "BH",
  pvalueCutoff  = 0.05,
  qvalueCutoff  = 0.05,
  universe      = as.character(bg_entrez),
  minGSSize     = 10,
  maxGSSize     = 500,
  readable      = TRUE
)

cat("Combined ORA: genesets returned:", nrow(ora_all@result[ora_all@result$p.adjust < 0.05, ]), "\n")
## Combined ORA: genesets returned: 43
if (nrow(ora_all) > 0) {
  p3 <- dotplot(ora_all, showCategory = 20, font.size = 10) +
    ggtitle("GO:BP Over-Representation — All DE Genes") +
    theme(plot.title = element_text(hjust = 0.5, size = 12))
  print(p3)
}

Figure 3. This is a dot plot of the top 20 GO:BP terms enriched in all differentially expressed genes (FDR < 0.05). The x-axis shows the proportion of pathway genes present in the upregulated list, and the y-axis shows GO:BP term descriptions ordered by gene ratio. Dot size is proportional to the count of DE genes in each term; dot colour represents adjusted p-value (darker = more significant).

3.6 Comparing Up vs. Down vs. Combined Results

To numerically compare these analyses, the set of significant GO:BP term descriptions is extracted from each result object using a helper function, and set operations (intersect(), setdiff(), union()) are applied to look into shared/unique terms.

get_sig_terms <- function(ora_result) {
  if (is.null(ora_result) || nrow(ora_result) == 0) return(character(0))
  ora_result@result$Description[ora_result@result$p.adjust < 0.05]
}

up_terms   <- get_sig_terms(ora_up)
down_terms <- get_sig_terms(ora_down)
all_terms  <- get_sig_terms(ora_all)

shared_up_down <- intersect(up_terms, down_terms)
only_up        <- setdiff(up_terms, down_terms)
only_down      <- setdiff(down_terms, up_terms)
only_combined  <- setdiff(all_terms, union(up_terms, down_terms))

comparison_df <- data.frame(
  Analysis = c("Upregulated only", "Downregulated only", "Combined (all DE)",
               "Shared (up and down)", "Unique to combined"),
  Gene_Sets = c(length(up_terms), length(down_terms), length(all_terms),
                length(shared_up_down), length(only_combined))
)

knitr::kable(comparison_df,
             caption = "**Table 4.** Count of significant GO:BP gene sets (FDR < 0.05) across the three ORA analyses.",
             col.names = c("Analysis", "Significant Gene Sets"))
Table 4. Count of significant GO:BP gene sets (FDR < 0.05) across the three ORA analyses.
Analysis Significant Gene Sets
Upregulated only 107
Downregulated only 13
Combined (all DE) 43
Shared (up and down) 3
Unique to combined 7

4 Interpretation of ORA Results

4.1 Comparison with original study

From the original paper’s finding, we expect certain signals, like IL-17, MAPK, NF-KB and TNF signaling to be upregulated. Several genes from our results support this expectation. The genes that are most upregulated in our data include FOS, JUN, FOSB, ATF3, ZFP36, DUSP1, PER1, and NR4A1, many of which are implicated in stress response. Additionally, many are implicated in inflammatory response, which supports the NF-KB and TNF upregulation reported in the paper.

The genes that are downregulated like SPANXA2, CDH18, KISS1, and OPN1SW, are related to cell adhesion and organization. This fits with a process called epithelial-, to-mesenchymal transition (EMT) which often happens with chemotherapy resistance. The loss of cell adhesion genes after NAC suggests that the remaining tumor cells are changing under treatment pressure.

4.2 External Literature Support

The FOS/JUN AP-1 axis in chemotherapy resistance: The FOS and JUN transcription factor complex is really important for helping cells deal with stress and resist drugs. When the MAPK pathway activates the FOS and JUN complex it helps cells survive by turning on genes that keep them alive. This is often seen in cells that make it through treatments that are supposed to kill them. For example Daschner et al. (1999) found that having much FOS in breast cancer cells makes them resistant to a drug called doxorubicin, which is something we care about when looking at NAC.

Changes in the microenvironment and loss of cell adhesion: It is interesting that we see an increase in stress pathways and a decrease in cell adhesion genes after chemotherapy. This matches what other researchers like Galluzzi et al. (2017) found about how chemotherapy changes the tumor microenvironment. They showed that chemotherapy can cause cells to die in a way that attracts cells and that the extracellular matrix in the tumor can change to make cells more likely to spread which is consistent, with what we found in our analysis of the FOS and JUN AP-1 axis and NR4A1.


5 Non-Thresholded Gene Set Enrichment Analysis

5.1 Method

For GSEA, I decided to use the fgsea package (Korotkevich et al., 2021). I know this package to be computationally efficient, and it will use the ranked gene list that was already created.

The gene sets that I decided to use are the MSigDB Hallmark collection (v2023.2.Hs). I specifically chose to use the Hallmark gene sets over GO:BP because they are manually created from lots of overlapping gene sets, so there is minimal redundancy. I wanted the cytoscape visualization to be manageable, and only include unique pathways. Additionally, these gene sets can be accessed via the msigdbr package. I used the following parameters: - Ranking metric: -log10(p-value) * sign(logFC) - minSize = 15, maxSize = 500 - nperm = 1000 - Significance threshold for reporting: FDR < 0.25 (conventional for GSEA, Subramanian et al., 2005)

5.2 Running fgsea

msig_hallmark <- msigdbr(species = "Homo sapiens", category = "H")
hallmark_list <- split(msig_hallmark$gene_symbol, msig_hallmark$gs_name)

cat("Number of Hallmark gene sets:", length(hallmark_list), "\n")
## Number of Hallmark gene sets: 50
cat("Total genes in ranked list:", length(ranked_genes), "\n")
## Total genes in ranked list: 19735
set.seed(42)
fgsea_results <- fgsea(
  pathways = hallmark_list,
  stats    = ranked_genes,
  minSize  = 15,
  maxSize  = 500,
  nperm    = 1000
)

fgsea_results <- fgsea_results[order(fgsea_results$NES, decreasing = TRUE), ]

cat("\nGSEA results summary:\n")
## 
## GSEA results summary:
cat("Total pathways tested:", nrow(fgsea_results), "\n")
## Total pathways tested: 50
cat("Significant pathways (FDR < 0.25):", sum(fgsea_results$padj < 0.25, na.rm = TRUE), "\n")
## Significant pathways (FDR < 0.25): 38
cat("Significant pathways (FDR < 0.05):", sum(fgsea_results$padj < 0.05, na.rm = TRUE), "\n")
## Significant pathways (FDR < 0.05): 36
cat("\nTop 10 by NES (upregulated):\n")
## 
## Top 10 by NES (upregulated):
print(head(fgsea_results[fgsea_results$NES > 0, c("pathway", "NES", "padj", "size")], 10))
##                                        pathway      NES        padj  size
##                                         <char>    <num>       <num> <int>
##  1:           HALLMARK_TNFA_SIGNALING_VIA_NFKB 2.667822 0.003154773   199
##  2:                    HALLMARK_MYC_TARGETS_V1 2.650398 0.003154773   199
##  3:                           HALLMARK_HYPOXIA 2.537022 0.003154773   196
##  4:         HALLMARK_OXIDATIVE_PHOSPHORYLATION 2.359163 0.003154773   200
##  5: HALLMARK_EPITHELIAL_MESENCHYMAL_TRANSITION 2.313043 0.003154773   200
##  6:                  HALLMARK_MTORC1_SIGNALING 2.304354 0.003154773   199
##  7:                    HALLMARK_UV_RESPONSE_UP 2.179340 0.003154773   156
##  8:                HALLMARK_TGF_BETA_SIGNALING 2.177949 0.003154773    54
##  9:                       HALLMARK_E2F_TARGETS 2.176156 0.003154773   200
## 10:                       HALLMARK_P53_PATHWAY 2.058446 0.003154773   197
cat("\nBottom 10 by NES (downregulated):\n")
## 
## Bottom 10 by NES (downregulated):
print(head(fgsea_results[fgsea_results$NES < 0, c("pathway", "NES", "padj", "size")], 10))
##                               pathway        NES      padj  size
##                                <char>      <num>     <num> <int>
## 1:           HALLMARK_SPERMATOGENESIS -0.7731071 0.9645777   120
## 2: HALLMARK_INTERFERON_ALPHA_RESPONSE -0.9129588 0.6960163    97
## 3:      HALLMARK_BILE_ACID_METABOLISM -1.0214033 0.4456654   106

5.3 Visualizing GSEA Results

I visualized this with a horizontal bar plot, sorted by NES. The names of the pathways have been cleaned by removing the HALLMARK_ prefix and replacing the underscores with spaces.

All 38 significant pathways have a positive NES, so all the bars are red. This means that we donn’t see any pathways significantly enriched at the baseline, as opposed to post-NAC.

fgsea_sig <- fgsea_results[!is.na(fgsea_results$padj) & fgsea_results$padj < 0.25, ]
fgsea_sig <- fgsea_sig[order(fgsea_sig$NES), ]

fgsea_sig$pathway_clean <- gsub("HALLMARK_", "", fgsea_sig$pathway)
fgsea_sig$pathway_clean <- gsub("_", " ", fgsea_sig$pathway_clean)
fgsea_sig$direction <- ifelse(fgsea_sig$NES > 0, "Upregulated", "Downregulated")

ggplot(fgsea_sig, aes(x = NES, y = reorder(pathway_clean, NES), fill = direction)) +
  geom_bar(stat = "identity", alpha = 0.85) +
  scale_fill_manual(values = c("Upregulated" = "#d73027", "Downregulated" = "#4575b4")) +
  geom_vline(xintercept = 0, linetype = "dashed", color = "black") +
  labs(
    title = "GSEA: Hallmark Pathways (FDR < 0.25)",
    subtitle = "Ranked by Normalized Enrichment Score (NES)",
    x = "Normalized Enrichment Score (NES)",
    y = "Hallmark Gene Set",
    fill = "Direction",
    caption = "Positive NES = enriched in FollowUp (post-NAC); Negative NES = enriched in Baseline (pre-NAC)"
  ) +
  theme_bw(base_size = 11) +
  theme(
    plot.title    = element_text(hjust = 0.5, face = "bold"),
    plot.subtitle = element_text(hjust = 0.5),
    legend.position = "bottom"
  )

Figure 4. Bar plot of all Hallmark gene sets significant at FDR < 0.25 from fgsea analysis, sorted by NES. All 38 significant pathways are enriched post-NAC.

gsea_table <- fgsea_sig[, c("pathway_clean", "NES", "padj", "size", "direction")]
gsea_table$NES  <- round(gsea_table$NES, 3)
gsea_table$padj <- formatC(gsea_table$padj, format = "e", digits = 2)
gsea_table <- gsea_table[order(as.numeric(gsea_table$NES), decreasing = TRUE), ]

knitr::kable(gsea_table,
             caption = "**Table 5.** All significantly enriched Hallmark gene sets (FDR < 0.25) from non-thresholded GSEA, sorted by NES.",
             col.names = c("Pathway", "NES", "FDR (adj. p)", "Gene Set Size", "Direction"),
             row.names = FALSE)
Table 5. All significantly enriched Hallmark gene sets (FDR < 0.25) from non-thresholded GSEA, sorted by NES.
Pathway NES FDR (adj. p) Gene Set Size Direction
TNFA SIGNALING VIA NFKB 2.668 3.15e-03 199 Upregulated
MYC TARGETS V1 2.650 3.15e-03 199 Upregulated
HYPOXIA 2.537 3.15e-03 196 Upregulated
OXIDATIVE PHOSPHORYLATION 2.359 3.15e-03 200 Upregulated
EPITHELIAL MESENCHYMAL TRANSITION 2.313 3.15e-03 200 Upregulated
MTORC1 SIGNALING 2.304 3.15e-03 199 Upregulated
UV RESPONSE UP 2.179 3.15e-03 156 Upregulated
TGF BETA SIGNALING 2.178 3.15e-03 54 Upregulated
E2F TARGETS 2.176 3.15e-03 200 Upregulated
P53 PATHWAY 2.058 3.15e-03 197 Upregulated
MYC TARGETS V2 2.045 3.15e-03 57 Upregulated
APOPTOSIS 2.031 3.15e-03 161 Upregulated
UNFOLDED PROTEIN RESPONSE 1.992 3.15e-03 111 Upregulated
ANDROGEN RESPONSE 1.973 3.15e-03 100 Upregulated
ADIPOGENESIS 1.971 3.15e-03 199 Upregulated
CHOLESTEROL HOMEOSTASIS 1.942 3.15e-03 74 Upregulated
MYOGENESIS 1.925 3.15e-03 193 Upregulated
PI3K AKT MTOR SIGNALING 1.880 3.15e-03 103 Upregulated
UV RESPONSE DN 1.868 3.15e-03 143 Upregulated
MITOTIC SPINDLE 1.866 3.15e-03 198 Upregulated
G2M CHECKPOINT 1.853 3.15e-03 199 Upregulated
DNA REPAIR 1.830 3.15e-03 149 Upregulated
REACTIVE OXYGEN SPECIES PATHWAY 1.782 3.15e-03 49 Upregulated
IL2 STAT5 SIGNALING 1.743 3.15e-03 198 Upregulated
PROTEIN SECRETION 1.734 3.15e-03 96 Upregulated
KRAS SIGNALING UP 1.666 3.15e-03 197 Upregulated
GLYCOLYSIS 1.652 3.15e-03 198 Upregulated
HEDGEHOG SIGNALING 1.630 1.79e-02 36 Upregulated
FATTY ACID METABOLISM 1.595 5.22e-03 153 Upregulated
ESTROGEN RESPONSE LATE 1.593 5.22e-03 197 Upregulated
APICAL JUNCTION 1.482 9.88e-03 198 Upregulated
HEME METABOLISM 1.458 1.20e-02 192 Upregulated
ESTROGEN RESPONSE EARLY 1.444 1.38e-02 197 Upregulated
COAGULATION 1.433 2.78e-02 129 Upregulated
PEROXISOME 1.427 2.71e-02 102 Upregulated
INFLAMMATORY RESPONSE 1.369 2.41e-02 199 Upregulated
COMPLEMENT 1.278 7.76e-02 197 Upregulated
ANGIOGENESIS 1.239 1.95e-01 35 Upregulated

5.4 Comparing GSEA vs. Thresholded ORA

It would not be straightforward to do a direct term-by-term comparison between the ORA and GSEA results. A big part of this is that the methods use different annotation databases, so the pathway names themselves may not overlap, even if they refer to similar things. Second and more importantly, the imputs given to these methods are quite different, since ORA gets a binary gene list, while GSEA sees a continuous ranking. Because of this, GSEA may find signals that ORA cannot. Despite this, the high-level results of these methods can be compared. The ORA results pointed to stress response and AP-1/MAPK-related terms. While GSEA points to TNF signaling via NF-KB and inflammatory response Hallmark, this represents similar underlying biology. Additionally, GSEA noted EMT, MYC targets, and metabolic programs, while ORA did not. As previously stated, this is likely because the signals were consistent, but too moderate for ORA to find individually significant.


6 Cytoscape Enrichment Map Visualization

6.1 Overview

Next, GSEA results were visualized using Cytoscape (v3.10) with the EnrichmentMap (v3.3) and AutoAnnotate (v1.4) apps.

6.2 Exporting Results for EnrichmentMap

First, the analysis was exported to get file to run EnrichmentMap on.

fgsea_results$clean_name <- tools::toTitleCase(
  tolower(gsub("_", " ", gsub("HALLMARK_", "", fgsea_results$pathway)))
)

em_generic_clean <- data.frame(
  name        = fgsea_results$clean_name,
  description = fgsea_results$clean_name,
  pvalue      = fgsea_results$pval,
  qvalue      = ifelse(is.na(fgsea_results$padj), 1, fgsea_results$padj),
  phenotype   = ifelse(fgsea_results$NES > 0, 1, -1)
)
write.table(em_generic_clean, "gsea_results_clean.txt",
            sep = "\t", row.names = FALSE, quote = FALSE)

write.table(
  data.frame(Gene = names(ranked_genes), Score = ranked_genes),
  "ranked_genelist_for_em.rnk",
  sep = "\t", row.names = FALSE, col.names = FALSE, quote = FALSE
)

msig_hallmark$clean_name <- tools::toTitleCase(
  tolower(gsub("_", " ", gsub("HALLMARK_", "", msig_hallmark$gs_name)))
)
gmt_lines <- tapply(msig_hallmark$gene_symbol, msig_hallmark$clean_name, function(genes) {
  paste(unique(genes), collapse = "\t")
})
gmt_output <- paste0(names(gmt_lines), "\tna\t", gmt_lines)
writeLines(gmt_output, "hallmark_clean.gmt")


sig <- fgsea_results[!is.na(fgsea_results$padj) & fgsea_results$padj < 0.25, ]

6.3 EnrichmentMap Parameters

The following parameters were used for EnrichmentMap (recorded for reproducability):

Parameter Value
Analysis type Generic/gProfiler/Enrichr
Enrichments file gsea_results_clean.txt
GMT file hallmark_clean.gmt (MSigDB Hallmark, v2023.2.Hs)
Ranks file ranked_genelist_for_em.rnk
FDR q-value cutoff 0.25
Edge similarity metric Jaccard+Overlap Combined
Edge similarity cutoff 0.1

This analysis was originally run with the edge similarity cutoff at .375. However, this led to all nodes except 2 being singletons. Because of this, I reverted this parameter back to the default of .1 so we can observe patterns in the clustering.

6.4 Initial Network (Prior to Manual Layout)

knitr::include_graphics("figures/cytoscape_raw.png")
Figure 6. Enrichment Map network prior to manual layout.

Figure 6. Enrichment Map network prior to manual layout.

6.5 Annotated Network

The, I used AutoAnnotate to identify and label clusters. All parameters were default (In “quickstart” page, “amount of clusters” slider in the middle, “label column” = name, “layout network to minimize cluster overlap” not checked)

knitr::include_graphics("figures/cytoscape_uncollapsed.png")
Figure 7. PEnrichment Map with AutoAnnotate cluster labels after manual layout.

Figure 7. PEnrichment Map with AutoAnnotate cluster labels after manual layout.

6.6 Theme Network (Collapsed)

Then, I collapsed the network so that each cluster becomes a single large node labeled with its theme name.

knitr::include_graphics("figures/cytoscape_collapsed.png")
Figure 8. Collapsed theme network. Each node represents one AutoAnnotate cluster collapsed to a single theme node. Singleton pathways are retained as individual nodes.

Figure 8. Collapsed theme network. Each node represents one AutoAnnotate cluster collapsed to a single theme node. Singleton pathways are retained as individual nodes.

6.7 Major Themes

This analysis yielded six clusters and twelve singletons (from 38 nodes with 23 edges). In the table below I have catalouged each theme, and made notes on if it was present or inferrable from the original paper.

themes_df <- data.frame(
  Theme = c(
    "TNFA Signaling via NF-kB",
    "Epithelial Mesenchymal Transition",
    "MYC Targets V2",
    "Fatty Acid Metabolism",
    "Coagulation / Complement",
    "Estrogen Response (Early/Late)"
    
  ),
  Type = c(
    "Cluster", "Cluster", "Cluster", "Cluster", "Cluster", "Cluster"
  ),
  Fit_with_Paper = c(
    "Yes, paper reports NF-KB and TNF signaling enrichment",
    "Yes,EMT is an chemotherapy resistance mechanism",
    "Yes, MYC-driven proliferation consistent with aggressive resistant tumors",
    "Novel, metabolic reprogramming not highlighted in original paper",
    "Partially yes, complement relates to immune TME remodeling reported in paper",
    "Yes, estrogen signaling central to luminal subtypes in this dataset"
  )
)

knitr::kable(themes_df,
             caption = "Table 7. Major biological themes identified in the enrichment map.",
             col.names = c("Theme", "Type", "Fit with Original Paper"))
Table 7. Major biological themes identified in the enrichment map.
Theme Type Fit with Original Paper
TNFA Signaling via NF-kB Cluster Yes, paper reports NF-KB and TNF signaling enrichment
Epithelial Mesenchymal Transition Cluster Yes,EMT is an chemotherapy resistance mechanism
MYC Targets V2 Cluster Yes, MYC-driven proliferation consistent with aggressive resistant tumors
Fatty Acid Metabolism Cluster Novel, metabolic reprogramming not highlighted in original paper
Coagulation / Complement Cluster Partially yes, complement relates to immune TME remodeling reported in paper
Estrogen Response (Early/Late) Cluster Yes, estrogen signaling central to luminal subtypes in this dataset

7 Interpretation

7.1 Comments on clustering

The cytoscape visualization yielded 38 nodes with 23 edges. There were 6 non-overlapping clusters. The vast majority of the clusters represent expected results given the original paper’s analysis. There is more novelty in the singletons, but they also have less statistical power. Overall, this clustering seems in line with what we know biologically. It is notable that there were only 38 significant pathways, and all of them were enriched in the treatment group. One possible explanation for the low number of significant pathways is the nature of the data: these patients did not respond to treatment, so perhaps we should expect them to have less change than the average patient. Additionally, the original paper is focused on how NAC induces change in non-responders, so it makes sense that we see most pathways enriched post-treatment.

7.2 Enrichment vs original paper

I believe that these results support the original paper. The paper mentioned that certain pathways, like IL-17 NF-KB and TNF become active after NAC treatment. The top result from the GSEA analysis, TNF Signaling via NF-KB directly supports this. Additionally, the original paper found immune cells were more plentiful after NAC treatment. In our analysis, we found that pathways like complement/coagulation cascades and IL2/STAT5 signaling are active. These pathways are typical of immune cells being recruited and activated.

7.3 External Literature Support

The Hedgehog pathway is important for cancer stem cells to stay alive and for tumors to grow back after treatment. What happens after this treatment is that the Hedgehog pathway gets stronger which makes sense because the treatment may kills the cancer cells and leaves behind the stronger ones, which can resist the treatment. The Hedgehog pathway is very good at helping these stem-like cancer cells survive. Takebe et al. (2015) discuss the Hedgehog pathway and its role in cancer stem cells and what they say supports this idea about the Hedgehog pathway and cancer stem cells.


8 References

8.1 R Packages

  • GEOquery: Davis S, Meltzer PS (2007). Bioinformatics, 23(14), 1846-1847.
  • edgeR: Robinson MD, McCarthy DJ, Smyth GK (2010). Bioinformatics, 26(1), 139-140.
  • limma: Ritchie ME et al. (2015). Nucleic Acids Research, 43(7), e47.
  • org.Hs.eg.db: Carlson M (2024). R package version 3.19.1.
  • clusterProfiler: Yu G, Wang L, Han Y, He QY (2012). OMICS, 16(5), 284-287.
  • enrichplot: Yu G (2024). enrichplot: Visualization of Functional Enrichment Result. R package version 1.22.0.
  • fgsea: Korotkevich G et al. (2021). bioRxiv. doi:10.1101/060012
  • msigdbr: Dolgalev I (2022). msigdbr: MSigDB Gene Sets for Multiple Organisms. R package version 7.5.1.
  • ggplot2: Wickham H (2016). Springer-Verlag New York.
  • ggrepel: Slowikowski K (2024). ggrepel: Automatically Position Non-Overlapping Text Labels. R package version 0.9.5.
  • pheatmap: Kolde R (2019). R package version 1.0.12.
  • knitr: Xie Y (2024). R package version 1.48.
  • dplyr: Wickham H et al. (2023). R package version 1.1.4.
  • R Core Team (2024). R: A language and environment for statistical computing.
  • BiocManager: Morgan M (2024). R package version 1.30.25.

8.2 Bioinformatics Tools

  • Cytoscape: Shannon P et al. (2003). Genome Research, 13(11), 2498-2504.
  • EnrichmentMap: Merico D et al. (2010). PLoS ONE, 5(11), e13984.
  • AutoAnnotate: Kucera M et al. (2016). F1000Research, 5, 1717.
  • MSigDB: Subramanian A et al. (2005). PNAS, 102(43), 15545-15550.
  • Gene Ontology: Ashburner M et al. (2000). Nature Genetics, 25(1), 25-29.

8.3 Primary Data Source

Guevara-Nieto HM, Orozco-Castaño CA, Parra-Medina R, Poveda-Garavito JN, Garai J, Zabaleta J, López-Kleine L, Combita AL (2025). Molecular determinants of neoadjuvant chemotherapy resistance in breast cancer: An analysis of gene expression and tumor microenvironment. PLoS One 20(10): e0334335. https://doi.org/10.1371/journal.pone.0334335

8.4 Supporting Literature

  • Takebe N, Miele L, Harris PJ, Jeong W, Bando H, Kahn M, Yang SX, Ivy SP (2015). Targeting Notch, Hedgehog, and Wnt pathways in cancer stem cells: clinical update. Nature Reviews Clinical Oncology, 12(8), 445–464.
  • Galluzzi L et al. (2017). Immunogenic cell death in cancer and infectious disease. Nature Reviews Immunology, 17(2), 97-111.
  • Mohan HM et al. (2012). Molecular pathways: the role of NR4A orphan nuclear receptors in cancer. Clinical Cancer Research, 18(12), 3223-3228.
  • Subramanian A et al. (2005). Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. PNAS, 102(43), 15545-15550.
  • Daschner PJ, Ciolino HP, Plouzek CA, Yeh GC (1999). Increased AP-1 activity in drug resistant human breast cancer MCF-7 cells. Breast Cancer Research and Treatment, 53(3), 229–240. https://doi.org/10.1023/a:1006138803392