if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

# A1
packages <- c("GEOquery", "edgeR", "limma", "knitr",
              "ggplot2", "pheatmap", "org.Hs.eg.db", "dplyr")

# A2 to be used
a2_packages <- c("clusterProfiler", "enrichplot", "fgsea", "msigdbr", "ggrepel")

all_packages <- c(packages, a2_packages)
for (pkg in all_packages) {
  if (!require(pkg, character.only = TRUE, quietly = TRUE)) {
    BiocManager::install(pkg, update = FALSE, ask = FALSE)
    library(pkg, character.only = TRUE)
  }
}
knitr::opts_chunk$set(
  echo = TRUE,
  warning = FALSE,
  message = FALSE,
  fig.width = 10,
  fig.height = 8,
  fig.path = "figures/"
)

1 Introduction

1.1 Dataset Overview

For this assignment, I will perform pathway enrichment analysis using previously generated differential expression analysis. This type of analysis will investigate the processes and pathways associated with neoadjuvant chemotherapy (NAC) resistance in breast cancer, which is the dataset used.

The dataset used is GSE309004, which can be accessed at: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE309004. The original study (Guevara-Nieto et al., 2025) that this data is from is a paired study, where the same patient would contribute a biopsy at two different times (before and after NAC). All 35 of the patients in this dataset are “non-responders”, meaning they showed no response to chemotherapy.

The original study focused on the transcriptome changes between samples to try to learn about treatment resistance. They report an array of findings, including genes FOS and NR4A1 being consistently differentially expressed, IL-17, MAPK, NF-κB, and TNF signaling pathways being enriched, and post-NAC infiltration of CD4 memory T cells, regulatory T cells, neutrophils, and macrophages being increased. Throughout this report, I will compare my findings to what the original authors found, both on a statistical level and a biological background level.

1.2 Reproducing the A1 Pipeline

Here, I re-create the full pipeline from assignment 1 that was used to do data processing and differential expression. I chose to do this rather than loading a saved file for the sake of reproducability. This means the notebook is self-contained and can be run start to finish. Note: the statistical tests and visualization done to justify which outliers were removed and which normalization method was chosen are not included. The purpose of this part is mostly so this report can be ran on the raw data.

1.2.1 Data Loading

First, the GEO data is downloaded using getGEO(). To make sure that we only keep gene symbols that are valid HUGO symbols, they are compared to org.Hs.eg.db. The three statistical outliers identified in A1 (X014Q.Tn, X083Q.Tn, X226B.Tn) are then removed.

geo_id <- "GSE309004"
if (!file.exists(paste0(geo_id, "_metadata.Rdata"))) {
  gse <- getGEO(geo_id, GSEMatrix = TRUE, getGPL = FALSE)
  save(gse, file = paste0(geo_id, "_metadata.Rdata"))
} else {
  load(paste0(geo_id, "_metadata.Rdata"))
}
sample_info <- pData(gse[[1]])
files <- list.files(geo_id, full.names = TRUE)
count_file <- files[1]
expression_data <- read.table(count_file,
                              header = TRUE,
                              row.names = 1,
                              sep = "\t")
valid_symbols <- keys(org.Hs.eg.db, keytype = "SYMBOL")
mapped_genes <- rownames(expression_data) %in% valid_symbols
expression_clean <- expression_data[mapped_genes, ]
clean_colnames <- gsub("^X", "", colnames(expression_clean))
clean_colnames <- gsub("\\.", "-", clean_colnames)
match_idx <- match(clean_colnames, sample_info$title)
sample_info <- sample_info[match_idx, ]
sample_info$condition <- ifelse(grepl("B", sample_info$title) &
                                !grepl("Q", sample_info$title),
                                "Baseline", "FollowUp")
outliers <- c("X014Q.Tn", "X083Q.Tn", "X226B.Tn")
expression_clean <- expression_clean[, !(colnames(expression_clean) %in% outliers)]
outlier_titles <- gsub("^X", "", gsub("\\.", "-", outliers))
sample_info <- sample_info[!(sample_info$title %in% outlier_titles), ]

1.2.2 Normalization and Differential Expression

Since edgeR’s TMM normalization needs raw counts, those are loaded. Then, intersect() is used to restrict this to only genes that appear in both the cleaned expression matrix and the raw counts file. TMM normalization is then applied.

For differential expression, “Baseline” condition is set as the reference level (this is pre-treatment). Differential expression is extracted with the pipeline (estimateDisp() → glmQLFit() → glmQLFTest()). Finally, the results are saved for all genes to prepare for using GSEA (which wants the FULL ranked list).

raw_file <- list.files(geo_id, pattern = "rawcounts", full.names = TRUE)
raw_counts <- read.table(gzfile(raw_file),
                         header = TRUE,
                         row.names = 1,
                         sep = "\t")
common_genes <- intersect(rownames(expression_clean), rownames(raw_counts))
raw_counts <- raw_counts[common_genes, colnames(expression_clean)]
dge <- DGEList(counts = raw_counts)
dge <- calcNormFactors(dge, method = "TMM")
final_expression_data <- cpm(dge, log = TRUE, prior.count = 1)
group <- factor(sample_info$condition, levels = c("Baseline", "FollowUp"))
design <- model.matrix(~group)
colnames(design) <- c("Intercept", "FollowUp_vs_Baseline")
dge_de <- DGEList(counts = raw_counts)
dge_de <- calcNormFactors(dge_de, method = "TMM")
dge_de <- estimateDisp(dge_de, design)
fit <- glmQLFit(dge_de, design)
qlf <- glmQLFTest(fit, coef = 2)
results_table <- as.data.frame(topTags(qlf, n = Inf))
fdr_threshold <- 0.05
logfc_threshold <- 1
sig_genes <- results_table[results_table$FDR < fdr_threshold &
                           abs(results_table$logFC) > logfc_threshold, ]

1.2.3 A1 Summary Statistics

Here, I generate a table to summarize some important statistics from the reproduction of A1. These numbers confirm that the reproduced pipeline matches the original results, and provide some more context about the data.

n_baseline <- sum(sample_info$condition == "Baseline")
n_followup <- sum(sample_info$condition == "FollowUp")
n_up <- sum(results_table$FDR < fdr_threshold & results_table$logFC > logfc_threshold)
n_down <- sum(results_table$FDR < fdr_threshold & results_table$logFC < -logfc_threshold)

stats_df <- data.frame(
  Category = c(
    "Original samples", "Samples after QC", "Baseline samples", "FollowUp samples",
    "Outliers removed", "Original genes", "Mapped to HUGO (97.28%)", "Unmapped genes",
    "Total genes tested", "DE genes (FDR<0.05, |logFC|>1)", "Upregulated", "Downregulated"
  ),
  Value = c(54, 51, n_baseline, n_followup, 3, 20287, 19736, 551,
            nrow(results_table), nrow(sig_genes), n_up, n_down)
)

knitr::kable(stats_df,
             caption = "**Table 1.** Summary statistics from the reproduced Assignment 1 pipeline.",
             col.names = c("Metric", "Count"))

**Table 1.** Summary statistics from the reproduced Assignment 1 pipeline.
Metric	Count
Original samples	54
Samples after QC	51
Baseline samples	26
FollowUp samples	25
Outliers removed	3
Original genes	20287
Mapped to HUGO (97.28%)	19736
Unmapped genes	551
Total genes tested	19735
DE genes (FDR<0.05, \|logFC\|>1)	560
Upregulated	313
Downregulated	247

2 Preparing Gene Lists for Enrichment Analysis

To prepare for ORA, we first split differential expression results into upregulated only, downregulated only, and all DE genes combined. These will each be analyzed individually, then compared. Additionally, we maintain a continuous ranked list that is needed for GSEA.

I use the same filtering for upregulated/downregulated genes as was used in A1 (FDR < 0.05, |logFC| > 1). After doing this, the combined list is generated as the union of the up/down lists. Additionally, I use for the “universe”/ background set all of the tested genes. This is important, because not all human genes were tested, so using all human genes as the background would potentially falsely increase significance of enrichment.

The metric that I used for the GSEA ranked list is -log10(p-value)*sign(logFC). This captures the statistical significance in p-value with the biological information in the sign of the logFC. The result is that low p-value and positive fold change genes are at the top of the list (most upregulated), with high p-value genes in the middle, then low p-value negative fold change genes at the end.

up_genes      <- rownames(results_table)[results_table$FDR < fdr_threshold &
                                          results_table$logFC > logfc_threshold]
down_genes    <- rownames(results_table)[results_table$FDR < fdr_threshold &
                                          results_table$logFC < -logfc_threshold]
all_de_genes  <- c(up_genes, down_genes)
background_genes <- rownames(results_table)
results_table$rank_metric <- -log10(results_table$PValue) * sign(results_table$logFC)
ranked_genes <- sort(setNames(results_table$rank_metric, rownames(results_table)),
                     decreasing = TRUE)

3 Thresholded Over-Representation Analysis

3.1 Method Choice

I chose to perform ORA using clusterProfiler (Yu et al., 2012). clusterProfiler is a Bioconductor package for ORA that uses Benjamini-Hochberg FDR correction. Some reasons I chose to user it include that it is fully self-contained in R and it has extensive documentation so that I can fully understand its mechanisms.

For the annotation database, I chose Gene Ontology Biological Process (GO:BP), which is sourced from org.Hs.eg.db. To my knowledge, GO:BP provides the broadest and most granular coverage of biological processes, which is why I chose it.

For all ORA runs, I used the following thresholds: - Adjusted p-value: < 0.05 - Minimum gene set size: 10 - Maximum gene set size: 500 - Background universe: all 19,735 tested genes (again, NOT all human genes)

3.2 Converting Gene Symbols to Entrez IDs

enrichGO() from clusterProfiler is going to convert to entrez IDs anyways. This worried me, because I did not want to lose any data or miss errors without knowing. Because of this, I chose to manually convert to entrez IDs before running clusterProfiler. As shown below, all genes were successfully converted.

symbol_to_entrez <- function(gene_symbols) {
  mapping <- AnnotationDbi::select(org.Hs.eg.db,
                                   keys = gene_symbols,
                                   columns = "ENTREZID",
                                   keytype = "SYMBOL")
  mapping <- mapping[!is.na(mapping$ENTREZID), ]
  mapping <- mapping[!duplicated(mapping$SYMBOL), ]
  return(setNames(mapping$ENTREZID, mapping$SYMBOL))
}

up_entrez    <- symbol_to_entrez(up_genes)
down_entrez  <- symbol_to_entrez(down_genes)
all_entrez   <- symbol_to_entrez(all_de_genes)
bg_entrez    <- symbol_to_entrez(background_genes)

cat("Upregulated genes mapped:", length(up_entrez), "/", length(up_genes), "\n")

## Upregulated genes mapped: 313 / 313

cat("Downregulated genes mapped:", length(down_entrez), "/", length(down_genes), "\n")

## Downregulated genes mapped: 247 / 247

cat("All DE genes mapped:", length(all_entrez), "/", length(all_de_genes), "\n")

## All DE genes mapped: 560 / 560

cat("Background mapped:", length(bg_entrez), "/", length(background_genes), "\n")

## Background mapped: 19735 / 19735

3.3 ORA: Upregulated Genes

Next, I run enrichGO() run on the upregulated gene list. I use readable=true so that the Entrez IDs are converted back for the final result for readability.

ora_up <- enrichGO(
  gene          = as.character(up_entrez),
  OrgDb         = org.Hs.eg.db,
  ont           = "BP",
  pAdjustMethod = "BH",
  pvalueCutoff  = 0.05,
  qvalueCutoff  = 0.05,
  universe      = as.character(bg_entrez),
  minGSSize     = 10,
  maxGSSize     = 500,
  readable      = TRUE
)

cat("Upregulated ORA: genesets returned:", nrow(ora_up@result[ora_up@result$p.adjust < 0.05, ]), "\n")

## Upregulated ORA: genesets returned: 107

if (nrow(ora_up) > 0) {
  p1 <- dotplot(ora_up, showCategory = 20, font.size = 10) +
    ggtitle("GO:BP Over-Representation — Upregulated Genes") +
    theme(plot.title = element_text(hjust = 0.5, size = 12))
  print(p1)
}

Figure 1. This is a dot plot of the top 20 GO:BP terms enriched in upregulated genes (logFC > 1, FDR < 0.05). The x-axis shows the proportion of pathway genes present in the upregulated list, and the y-axis shows GO:BP term descriptions ordered by gene ratio. Dot size is proportional to the count of DE genes in each term; dot colour represents adjusted p-value (darker = more significant).

if (nrow(ora_up) > 0) {
  up_table <- ora_up@result[ora_up@result$p.adjust < 0.05,
                             c("Description", "GeneRatio", "BgRatio", "pvalue", "p.adjust", "Count")]
  up_table$pvalue   <- formatC(up_table$pvalue, format = "e", digits = 2)
  up_table$p.adjust <- formatC(up_table$p.adjust, format = "e", digits = 2)
  knitr::kable(head(up_table, 20),
               caption = "**Table 2.** Top 20 GO:BP terms enriched in upregulated genes (FDR < 0.05).",
               row.names = FALSE)
}

**Table 2.** Top 20 GO:BP terms enriched in upregulated genes (FDR < 0.05).
Description	GeneRatio	BgRatio	pvalue	p.adjust	Count
antibacterial humoral response	10/275	51/15989	1.32e-08	4.42e-05	10
defense response to Gram-positive bacterium	12/275	95/15989	8.11e-08	1.36e-04	12
fat cell differentiation	17/275	231/15989	5.33e-07	5.04e-04	17
humoral immune response	17/275	233/15989	6.02e-07	5.04e-04	17
antimicrobial humoral response	12/275	128/15989	2.13e-06	1.42e-03	12
intermediate filament organization	8/275	56/15989	4.81e-06	2.20e-03	8
epidermis development	19/275	335/15989	5.72e-06	2.20e-03	19
response to corticosteroid	12/275	141/15989	5.86e-06	2.20e-03	12
multi-organism reproductive process	14/275	195/15989	7.25e-06	2.20e-03	14
response to glucocorticoid	11/275	121/15989	7.65e-06	2.20e-03	11
response to steroid hormone	18/275	311/15989	7.67e-06	2.20e-03	18
cellular response to corticosteroid stimulus	8/275	60/15989	8.16e-06	2.20e-03	8
antimicrobial humoral immune response mediated by antimicrobial peptide	10/275	100/15989	8.53e-06	2.20e-03	10
multi-multicellular organism process	14/275	200/15989	9.69e-06	2.32e-03	14
muscle system process	21/275	426/15989	1.59e-05	3.42e-03	21
keratinocyte differentiation	11/275	131/15989	1.63e-05	3.42e-03	11
innate immune response in mucosa	5/275	20/15989	1.82e-05	3.55e-03	5
monocyte chemotaxis	7/275	49/15989	1.91e-05	3.55e-03	7
defense response to bacterium	15/275	245/15989	2.31e-05	4.07e-03	15
cellular response to glucocorticoid stimulus	7/275	51/15989	2.50e-05	4.14e-03	7

3.4 ORA: Downregulated Genes

Next, enrichGO()was run on the downregulated gene list. I use the exact same parameters as for the upregulated genes, so that they two results can be meaningfully compared.

ora_down <- enrichGO(
  gene          = as.character(down_entrez),
  OrgDb         = org.Hs.eg.db,
  ont           = "BP",
  pAdjustMethod = "BH",
  pvalueCutoff  = 0.05,
  qvalueCutoff  = 0.05,
  universe      = as.character(bg_entrez),
  minGSSize     = 10,
  maxGSSize     = 500,
  readable      = TRUE
)

cat("Downregulated ORA: genesets returned:", nrow(ora_down@result[ora_down@result$p.adjust < 0.05, ]), "\n")

## Downregulated ORA: genesets returned: 13

if (nrow(ora_down) > 0) {
  p2 <- dotplot(ora_down, showCategory = 20, font.size = 10) +
    ggtitle("GO:BP Over-Representation — Downregulated Genes") +
    theme(plot.title = element_text(hjust = 0.5, size = 12))
  print(p2)
}

Figure 2. This is a dot plot of the top 20 GO:BP terms enriched in downregulated genes (logFC < -1, FDR < 0.05). he x-axis shows the proportion of pathway genes present in the upregulated list, and the y-axis shows GO:BP term descriptions ordered by gene ratio. Dot size is proportional to the count of DE genes in each term; dot colour represents adjusted p-value (darker = more significant).

if (nrow(ora_down) > 0) {
  down_table <- ora_down@result[ora_down@result$p.adjust < 0.05,
                                 c("Description", "GeneRatio", "BgRatio", "pvalue", "p.adjust", "Count")]
  down_table$pvalue   <- formatC(down_table$pvalue, format = "e", digits = 2)
  down_table$p.adjust <- formatC(down_table$p.adjust, format = "e", digits = 2)
  knitr::kable(head(down_table, 20),
               caption = "**Table 3.** Top 20 GO:BP terms enriched in downregulated genes (FDR < 0.05).",
               row.names = FALSE)
}

**Table 3.** Top 20 GO:BP terms enriched in downregulated genes (FDR < 0.05).
Description	GeneRatio	BgRatio	pvalue	p.adjust	Count
female pregnancy	12/167	175/15989	3.05e-07	5.46e-04	12
multi-organism reproductive process	12/167	195/15989	9.73e-07	7.58e-04	12
multi-multicellular organism process	12/167	200/15989	1.27e-06	7.58e-04	12
sensory perception of smell	10/167	202/15989	5.25e-05	2.35e-02	10
detection of chemical stimulus involved in sensory perception	10/167	214/15989	8.51e-05	2.51e-02	10
detection of chemical stimulus involved in sensory perception of smell	9/167	175/15989	9.34e-05	2.51e-02	9
sensory perception of chemical stimulus	11/167	262/15989	9.83e-05	2.51e-02	11
detection of stimulus involved in sensory perception	11/167	288/15989	2.25e-04	4.36e-02	11
regulation of tooth mineralization	3/167	13/15989	2.96e-04	4.36e-02	3
detection of chemical stimulus	10/167	250/15989	3.01e-04	4.36e-02	10
detection of stimulus	13/167	401/15989	3.05e-04	4.36e-02	13
killing of cells of another organism	6/167	88/15989	3.17e-04	4.36e-02	6
disruption of cell in another organism	6/167	88/15989	3.17e-04	4.36e-02	6

3.5 ORA: Combined, all DE Genes

I also ran with the combined gene list. Again, I used all of the same parameters to ensure comparability.

ora_all <- enrichGO(
  gene          = as.character(all_entrez),
  OrgDb         = org.Hs.eg.db,
  ont           = "BP",
  pAdjustMethod = "BH",
  pvalueCutoff  = 0.05,
  qvalueCutoff  = 0.05,
  universe      = as.character(bg_entrez),
  minGSSize     = 10,
  maxGSSize     = 500,
  readable      = TRUE
)

cat("Combined ORA: genesets returned:", nrow(ora_all@result[ora_all@result$p.adjust < 0.05, ]), "\n")

## Combined ORA: genesets returned: 43

if (nrow(ora_all) > 0) {
  p3 <- dotplot(ora_all, showCategory = 20, font.size = 10) +
    ggtitle("GO:BP Over-Representation — All DE Genes") +
    theme(plot.title = element_text(hjust = 0.5, size = 12))
  print(p3)
}

Figure 3. This is a dot plot of the top 20 GO:BP terms enriched in all differentially expressed genes (FDR < 0.05). The x-axis shows the proportion of pathway genes present in the upregulated list, and the y-axis shows GO:BP term descriptions ordered by gene ratio. Dot size is proportional to the count of DE genes in each term; dot colour represents adjusted p-value (darker = more significant).

3.6 Comparing Up vs. Down vs. Combined Results

To numerically compare these analyses, the set of significant GO:BP term descriptions is extracted from each result object using a helper function, and set operations (intersect(), setdiff(), union()) are applied to look into shared/unique terms.

get_sig_terms <- function(ora_result) {
  if (is.null(ora_result) || nrow(ora_result) == 0) return(character(0))
  ora_result@result$Description[ora_result@result$p.adjust < 0.05]
}

up_terms   <- get_sig_terms(ora_up)
down_terms <- get_sig_terms(ora_down)
all_terms  <- get_sig_terms(ora_all)

shared_up_down <- intersect(up_terms, down_terms)
only_up        <- setdiff(up_terms, down_terms)
only_down      <- setdiff(down_terms, up_terms)
only_combined  <- setdiff(all_terms, union(up_terms, down_terms))

comparison_df <- data.frame(
  Analysis = c("Upregulated only", "Downregulated only", "Combined (all DE)",
               "Shared (up and down)", "Unique to combined"),
  Gene_Sets = c(length(up_terms), length(down_terms), length(all_terms),
                length(shared_up_down), length(only_combined))
)

knitr::kable(comparison_df,
             caption = "**Table 4.** Count of significant GO:BP gene sets (FDR < 0.05) across the three ORA analyses.",
             col.names = c("Analysis", "Significant Gene Sets"))

**Table 4.** Count of significant GO:BP gene sets (FDR < 0.05) across the three ORA analyses.
Analysis	Significant Gene Sets
Upregulated only	107
Downregulated only	13
Combined (all DE)	43
Shared (up and down)	3
Unique to combined	7

4 Interpretation of ORA Results

4.1 Comparison with original study

From the original paper’s finding, we expect certain signals, like IL-17, MAPK, NF-KB and TNF signaling to be upregulated. Several genes from our results support this expectation. The genes that are most upregulated in our data include FOS, JUN, FOSB, ATF3, ZFP36, DUSP1, PER1, and NR4A1, many of which are implicated in stress response. Additionally, many are implicated in inflammatory response, which supports the NF-KB and TNF upregulation reported in the paper.

The genes that are downregulated like SPANXA2, CDH18, KISS1, and OPN1SW, are related to cell adhesion and organization. This fits with a process called epithelial-, to-mesenchymal transition (EMT) which often happens with chemotherapy resistance. The loss of cell adhesion genes after NAC suggests that the remaining tumor cells are changing under treatment pressure.

4.2 External Literature Support

The FOS/JUN AP-1 axis in chemotherapy resistance: The FOS and JUN transcription factor complex is really important for helping cells deal with stress and resist drugs. When the MAPK pathway activates the FOS and JUN complex it helps cells survive by turning on genes that keep them alive. This is often seen in cells that make it through treatments that are supposed to kill them. For example Daschner et al. (1999) found that having much FOS in breast cancer cells makes them resistant to a drug called doxorubicin, which is something we care about when looking at NAC.

Changes in the microenvironment and loss of cell adhesion: It is interesting that we see an increase in stress pathways and a decrease in cell adhesion genes after chemotherapy. This matches what other researchers like Galluzzi et al. (2017) found about how chemotherapy changes the tumor microenvironment. They showed that chemotherapy can cause cells to die in a way that attracts cells and that the extracellular matrix in the tumor can change to make cells more likely to spread which is consistent, with what we found in our analysis of the FOS and JUN AP-1 axis and NR4A1.

5 Non-Thresholded Gene Set Enrichment Analysis

5.1 Method

For GSEA, I decided to use the fgsea package (Korotkevich et al., 2021). I know this package to be computationally efficient, and it will use the ranked gene list that was already created.

The gene sets that I decided to use are the MSigDB Hallmark collection (v2023.2.Hs). I specifically chose to use the Hallmark gene sets over GO:BP because they are manually created from lots of overlapping gene sets, so there is minimal redundancy. I wanted the cytoscape visualization to be manageable, and only include unique pathways. Additionally, these gene sets can be accessed via the msigdbr package. I used the following parameters: - Ranking metric: -log10(p-value) * sign(logFC) - minSize = 15, maxSize = 500 - nperm = 1000 - Significance threshold for reporting: FDR < 0.25 (conventional for GSEA, Subramanian et al., 2005)

5.2 Running fgsea

msig_hallmark <- msigdbr(species = "Homo sapiens", category = "H")
hallmark_list <- split(msig_hallmark$gene_symbol, msig_hallmark$gs_name)

cat("Number of Hallmark gene sets:", length(hallmark_list), "\n")

## Number of Hallmark gene sets: 50

cat("Total genes in ranked list:", length(ranked_genes), "\n")

## Total genes in ranked list: 19735

set.seed(42)
fgsea_results <- fgsea(
  pathways = hallmark_list,
  stats    = ranked_genes,
  minSize  = 15,
  maxSize  = 500,
  nperm    = 1000
)

fgsea_results <- fgsea_results[order(fgsea_results$NES, decreasing = TRUE), ]

cat("\nGSEA results summary:\n")

## 
## GSEA results summary:

cat("Total pathways tested:", nrow(fgsea_results), "\n")

## Total pathways tested: 50

cat("Significant pathways (FDR < 0.25):", sum(fgsea_results$padj < 0.25, na.rm = TRUE), "\n")

## Significant pathways (FDR < 0.25): 38

cat("Significant pathways (FDR < 0.05):", sum(fgsea_results$padj < 0.05, na.rm = TRUE), "\n")

## Significant pathways (FDR < 0.05): 36

cat("\nTop 10 by NES (upregulated):\n")

## 
## Top 10 by NES (upregulated):

print(head(fgsea_results[fgsea_results$NES > 0, c("pathway", "NES", "padj", "size")], 10))

##                                        pathway      NES        padj  size
##                                         <char>    <num>       <num> <int>
##  1:           HALLMARK_TNFA_SIGNALING_VIA_NFKB 2.667822 0.003154773   199
##  2:                    HALLMARK_MYC_TARGETS_V1 2.650398 0.003154773   199
##  3:                           HALLMARK_HYPOXIA 2.537022 0.003154773   196
##  4:         HALLMARK_OXIDATIVE_PHOSPHORYLATION 2.359163 0.003154773   200
##  5: HALLMARK_EPITHELIAL_MESENCHYMAL_TRANSITION 2.313043 0.003154773   200
##  6:                  HALLMARK_MTORC1_SIGNALING 2.304354 0.003154773   199
##  7:                    HALLMARK_UV_RESPONSE_UP 2.179340 0.003154773   156
##  8:                HALLMARK_TGF_BETA_SIGNALING 2.177949 0.003154773    54
##  9:                       HALLMARK_E2F_TARGETS 2.176156 0.003154773   200
## 10:                       HALLMARK_P53_PATHWAY 2.058446 0.003154773   197

cat("\nBottom 10 by NES (downregulated):\n")

## 
## Bottom 10 by NES (downregulated):

print(head(fgsea_results[fgsea_results$NES < 0, c("pathway", "NES", "padj", "size")], 10))

##                               pathway        NES      padj  size
##                                <char>      <num>     <num> <int>
## 1:           HALLMARK_SPERMATOGENESIS -0.7731071 0.9645777   120
## 2: HALLMARK_INTERFERON_ALPHA_RESPONSE -0.9129588 0.6960163    97
## 3:      HALLMARK_BILE_ACID_METABOLISM -1.0214033 0.4456654   106

5.3 Visualizing GSEA Results

I visualized this with a horizontal bar plot, sorted by NES. The names of the pathways have been cleaned by removing the HALLMARK_ prefix and replacing the underscores with spaces.

All 38 significant pathways have a positive NES, so all the bars are red. This means that we donn’t see any pathways significantly enriched at the baseline, as opposed to post-NAC.

fgsea_sig <- fgsea_results[!is.na(fgsea_results$padj) & fgsea_results$padj < 0.25, ]
fgsea_sig <- fgsea_sig[order(fgsea_sig$NES), ]

fgsea_sig$pathway_clean <- gsub("HALLMARK_", "", fgsea_sig$pathway)
fgsea_sig$pathway_clean <- gsub("_", " ", fgsea_sig$pathway_clean)
fgsea_sig$direction <- ifelse(fgsea_sig$NES > 0, "Upregulated", "Downregulated")

ggplot(fgsea_sig, aes(x = NES, y = reorder(pathway_clean, NES), fill = direction)) +
  geom_bar(stat = "identity", alpha = 0.85) +
  scale_fill_manual(values = c("Upregulated" = "#d73027", "Downregulated" = "#4575b4")) +
  geom_vline(xintercept = 0, linetype = "dashed", color = "black") +
  labs(
    title = "GSEA: Hallmark Pathways (FDR < 0.25)",
    subtitle = "Ranked by Normalized Enrichment Score (NES)",
    x = "Normalized Enrichment Score (NES)",
    y = "Hallmark Gene Set",
    fill = "Direction",
    caption = "Positive NES = enriched in FollowUp (post-NAC); Negative NES = enriched in Baseline (pre-NAC)"
  ) +
  theme_bw(base_size = 11) +
  theme(
    plot.title    = element_text(hjust = 0.5, face = "bold"),
    plot.subtitle = element_text(hjust = 0.5),
    legend.position = "bottom"
  )

Figure 4. Bar plot of all Hallmark gene sets significant at FDR < 0.25 from fgsea analysis, sorted by NES. All 38 significant pathways are enriched post-NAC.

gsea_table <- fgsea_sig[, c("pathway_clean", "NES", "padj", "size", "direction")]
gsea_table$NES  <- round(gsea_table$NES, 3)
gsea_table$padj <- formatC(gsea_table$padj, format = "e", digits = 2)
gsea_table <- gsea_table[order(as.numeric(gsea_table$NES), decreasing = TRUE), ]

knitr::kable(gsea_table,
             caption = "**Table 5.** All significantly enriched Hallmark gene sets (FDR < 0.25) from non-thresholded GSEA, sorted by NES.",
             col.names = c("Pathway", "NES", "FDR (adj. p)", "Gene Set Size", "Direction"),
             row.names = FALSE)

**Table 5.** All significantly enriched Hallmark gene sets (FDR < 0.25) from non-thresholded GSEA, sorted by NES.
Pathway	NES	FDR (adj. p)	Gene Set Size	Direction
TNFA SIGNALING VIA NFKB	2.668	3.15e-03	199	Upregulated
MYC TARGETS V1	2.650	3.15e-03	199	Upregulated
HYPOXIA	2.537	3.15e-03	196	Upregulated
OXIDATIVE PHOSPHORYLATION	2.359	3.15e-03	200	Upregulated
EPITHELIAL MESENCHYMAL TRANSITION	2.313	3.15e-03	200	Upregulated
MTORC1 SIGNALING	2.304	3.15e-03	199	Upregulated
UV RESPONSE UP	2.179	3.15e-03	156	Upregulated
TGF BETA SIGNALING	2.178	3.15e-03	54	Upregulated
E2F TARGETS	2.176	3.15e-03	200	Upregulated
P53 PATHWAY	2.058	3.15e-03	197	Upregulated
MYC TARGETS V2	2.045	3.15e-03	57	Upregulated
APOPTOSIS	2.031	3.15e-03	161	Upregulated
UNFOLDED PROTEIN RESPONSE	1.992	3.15e-03	111	Upregulated
ANDROGEN RESPONSE	1.973	3.15e-03	100	Upregulated
ADIPOGENESIS	1.971	3.15e-03	199	Upregulated
CHOLESTEROL HOMEOSTASIS	1.942	3.15e-03	74	Upregulated
MYOGENESIS	1.925	3.15e-03	193	Upregulated
PI3K AKT MTOR SIGNALING	1.880	3.15e-03	103	Upregulated
UV RESPONSE DN	1.868	3.15e-03	143	Upregulated
MITOTIC SPINDLE	1.866	3.15e-03	198	Upregulated
G2M CHECKPOINT	1.853	3.15e-03	199	Upregulated
DNA REPAIR	1.830	3.15e-03	149	Upregulated
REACTIVE OXYGEN SPECIES PATHWAY	1.782	3.15e-03	49	Upregulated
IL2 STAT5 SIGNALING	1.743	3.15e-03	198	Upregulated
PROTEIN SECRETION	1.734	3.15e-03	96	Upregulated
KRAS SIGNALING UP	1.666	3.15e-03	197	Upregulated
GLYCOLYSIS	1.652	3.15e-03	198	Upregulated
HEDGEHOG SIGNALING	1.630	1.79e-02	36	Upregulated
FATTY ACID METABOLISM	1.595	5.22e-03	153	Upregulated
ESTROGEN RESPONSE LATE	1.593	5.22e-03	197	Upregulated
APICAL JUNCTION	1.482	9.88e-03	198	Upregulated
HEME METABOLISM	1.458	1.20e-02	192	Upregulated
ESTROGEN RESPONSE EARLY	1.444	1.38e-02	197	Upregulated
COAGULATION	1.433	2.78e-02	129	Upregulated
PEROXISOME	1.427	2.71e-02	102	Upregulated
INFLAMMATORY RESPONSE	1.369	2.41e-02	199	Upregulated
COMPLEMENT	1.278	7.76e-02	197	Upregulated
ANGIOGENESIS	1.239	1.95e-01	35	Upregulated

5.4 Comparing GSEA vs. Thresholded ORA

It would not be straightforward to do a direct term-by-term comparison between the ORA and GSEA results. A big part of this is that the methods use different annotation databases, so the pathway names themselves may not overlap, even if they refer to similar things. Second and more importantly, the imputs given to these methods are quite different, since ORA gets a binary gene list, while GSEA sees a continuous ranking. Because of this, GSEA may find signals that ORA cannot. Despite this, the high-level results of these methods can be compared. The ORA results pointed to stress response and AP-1/MAPK-related terms. While GSEA points to TNF signaling via NF-KB and inflammatory response Hallmark, this represents similar underlying biology. Additionally, GSEA noted EMT, MYC targets, and metabolic programs, while ORA did not. As previously stated, this is likely because the signals were consistent, but too moderate for ORA to find individually significant.

6 Cytoscape Enrichment Map Visualization

6.1 Overview

Next, GSEA results were visualized using Cytoscape (v3.10) with the EnrichmentMap (v3.3) and AutoAnnotate (v1.4) apps.

6.2 Exporting Results for EnrichmentMap

First, the analysis was exported to get file to run EnrichmentMap on.

fgsea_results$clean_name <- tools::toTitleCase(
  tolower(gsub("_", " ", gsub("HALLMARK_", "", fgsea_results$pathway)))
)

em_generic_clean <- data.frame(
  name        = fgsea_results$clean_name,
  description = fgsea_results$clean_name,
  pvalue      = fgsea_results$pval,
  qvalue      = ifelse(is.na(fgsea_results$padj), 1, fgsea_results$padj),
  phenotype   = ifelse(fgsea_results$NES > 0, 1, -1)
)
write.table(em_generic_clean, "gsea_results_clean.txt",
            sep = "\t", row.names = FALSE, quote = FALSE)

write.table(
  data.frame(Gene = names(ranked_genes), Score = ranked_genes),
  "ranked_genelist_for_em.rnk",
  sep = "\t", row.names = FALSE, col.names = FALSE, quote = FALSE
)

msig_hallmark$clean_name <- tools::toTitleCase(
  tolower(gsub("_", " ", gsub("HALLMARK_", "", msig_hallmark$gs_name)))
)
gmt_lines <- tapply(msig_hallmark$gene_symbol, msig_hallmark$clean_name, function(genes) {
  paste(unique(genes), collapse = "\t")
})
gmt_output <- paste0(names(gmt_lines), "\tna\t", gmt_lines)
writeLines(gmt_output, "hallmark_clean.gmt")


sig <- fgsea_results[!is.na(fgsea_results$padj) & fgsea_results$padj < 0.25, ]

6.3 EnrichmentMap Parameters

The following parameters were used for EnrichmentMap (recorded for reproducability):

Parameter	Value
Analysis type	Generic/gProfiler/Enrichr
Enrichments file	gsea_results_clean.txt
GMT file	hallmark_clean.gmt (MSigDB Hallmark, v2023.2.Hs)
Ranks file	ranked_genelist_for_em.rnk
FDR q-value cutoff	0.25
Edge similarity metric	Jaccard+Overlap Combined
Edge similarity cutoff	0.1

This analysis was originally run with the edge similarity cutoff at .375. However, this led to all nodes except 2 being singletons. Because of this, I reverted this parameter back to the default of .1 so we can observe patterns in the clustering.

6.4 Initial Network (Prior to Manual Layout)

knitr::include_graphics("figures/cytoscape_raw.png")

Figure 6. Enrichment Map network prior to manual layout.

6.5 Annotated Network

The, I used AutoAnnotate to identify and label clusters. All parameters were default (In “quickstart” page, “amount of clusters” slider in the middle, “label column” = name, “layout network to minimize cluster overlap” not checked)

knitr::include_graphics("figures/cytoscape_uncollapsed.png")

Figure 7. PEnrichment Map with AutoAnnotate cluster labels after manual layout.

6.6 Theme Network (Collapsed)

Then, I collapsed the network so that each cluster becomes a single large node labeled with its theme name.

knitr::include_graphics("figures/cytoscape_collapsed.png")

Figure 8. Collapsed theme network. Each node represents one AutoAnnotate cluster collapsed to a single theme node. Singleton pathways are retained as individual nodes.

6.7 Major Themes

This analysis yielded six clusters and twelve singletons (from 38 nodes with 23 edges). In the table below I have catalouged each theme, and made notes on if it was present or inferrable from the original paper.

themes_df <- data.frame(
  Theme = c(
    "TNFA Signaling via NF-kB",
    "Epithelial Mesenchymal Transition",
    "MYC Targets V2",
    "Fatty Acid Metabolism",
    "Coagulation / Complement",
    "Estrogen Response (Early/Late)"
    
  ),
  Type = c(
    "Cluster", "Cluster", "Cluster", "Cluster", "Cluster", "Cluster"
  ),
  Fit_with_Paper = c(
    "Yes, paper reports NF-KB and TNF signaling enrichment",
    "Yes,EMT is an chemotherapy resistance mechanism",
    "Yes, MYC-driven proliferation consistent with aggressive resistant tumors",
    "Novel, metabolic reprogramming not highlighted in original paper",
    "Partially yes, complement relates to immune TME remodeling reported in paper",
    "Yes, estrogen signaling central to luminal subtypes in this dataset"
  )
)

knitr::kable(themes_df,
             caption = "Table 7. Major biological themes identified in the enrichment map.",
             col.names = c("Theme", "Type", "Fit with Original Paper"))

Table 7. Major biological themes identified in the enrichment map.
Theme	Type	Fit with Original Paper
TNFA Signaling via NF-kB	Cluster	Yes, paper reports NF-KB and TNF signaling enrichment
Epithelial Mesenchymal Transition	Cluster	Yes,EMT is an chemotherapy resistance mechanism
MYC Targets V2	Cluster	Yes, MYC-driven proliferation consistent with aggressive resistant tumors
Fatty Acid Metabolism	Cluster	Novel, metabolic reprogramming not highlighted in original paper
Coagulation / Complement	Cluster	Partially yes, complement relates to immune TME remodeling reported in paper
Estrogen Response (Early/Late)	Cluster	Yes, estrogen signaling central to luminal subtypes in this dataset

7 Interpretation

7.1 Comments on clustering

The cytoscape visualization yielded 38 nodes with 23 edges. There were 6 non-overlapping clusters. The vast majority of the clusters represent expected results given the original paper’s analysis. There is more novelty in the singletons, but they also have less statistical power. Overall, this clustering seems in line with what we know biologically. It is notable that there were only 38 significant pathways, and all of them were enriched in the treatment group. One possible explanation for the low number of significant pathways is the nature of the data: these patients did not respond to treatment, so perhaps we should expect them to have less change than the average patient. Additionally, the original paper is focused on how NAC induces change in non-responders, so it makes sense that we see most pathways enriched post-treatment.

7.2 Enrichment vs original paper

I believe that these results support the original paper. The paper mentioned that certain pathways, like IL-17 NF-KB and TNF become active after NAC treatment. The top result from the GSEA analysis, TNF Signaling via NF-KB directly supports this. Additionally, the original paper found immune cells were more plentiful after NAC treatment. In our analysis, we found that pathways like complement/coagulation cascades and IL2/STAT5 signaling are active. These pathways are typical of immune cells being recruited and activated.

7.3 External Literature Support

The Hedgehog pathway is important for cancer stem cells to stay alive and for tumors to grow back after treatment. What happens after this treatment is that the Hedgehog pathway gets stronger which makes sense because the treatment may kills the cancer cells and leaves behind the stronger ones, which can resist the treatment. The Hedgehog pathway is very good at helping these stem-like cancer cells survive. Takebe et al. (2015) discuss the Hedgehog pathway and its role in cancer stem cells and what they say supports this idea about the Hedgehog pathway and cancer stem cells.

8 References

8.1 R Packages

GEOquery: Davis S, Meltzer PS (2007). Bioinformatics, 23(14), 1846-1847.
edgeR: Robinson MD, McCarthy DJ, Smyth GK (2010). Bioinformatics, 26(1), 139-140.
limma: Ritchie ME et al. (2015). Nucleic Acids Research, 43(7), e47.
org.Hs.eg.db: Carlson M (2024). R package version 3.19.1.
clusterProfiler: Yu G, Wang L, Han Y, He QY (2012). OMICS, 16(5), 284-287.
enrichplot: Yu G (2024). enrichplot: Visualization of Functional Enrichment Result. R package version 1.22.0.
fgsea: Korotkevich G et al. (2021). bioRxiv. doi:10.1101/060012
msigdbr: Dolgalev I (2022). msigdbr: MSigDB Gene Sets for Multiple Organisms. R package version 7.5.1.
ggplot2: Wickham H (2016). Springer-Verlag New York.
ggrepel: Slowikowski K (2024). ggrepel: Automatically Position Non-Overlapping Text Labels. R package version 0.9.5.
pheatmap: Kolde R (2019). R package version 1.0.12.
knitr: Xie Y (2024). R package version 1.48.
dplyr: Wickham H et al. (2023). R package version 1.1.4.
R Core Team (2024). R: A language and environment for statistical computing.
BiocManager: Morgan M (2024). R package version 1.30.25.

8.2 Bioinformatics Tools

Cytoscape: Shannon P et al. (2003). Genome Research, 13(11), 2498-2504.
EnrichmentMap: Merico D et al. (2010). PLoS ONE, 5(11), e13984.
AutoAnnotate: Kucera M et al. (2016). F1000Research, 5, 1717.
MSigDB: Subramanian A et al. (2005). PNAS, 102(43), 15545-15550.
Gene Ontology: Ashburner M et al. (2000). Nature Genetics, 25(1), 25-29.

8.3 Primary Data Source

Guevara-Nieto HM, Orozco-Castaño CA, Parra-Medina R, Poveda-Garavito JN, Garai J, Zabaleta J, López-Kleine L, Combita AL (2025). Molecular determinants of neoadjuvant chemotherapy resistance in breast cancer: An analysis of gene expression and tumor microenvironment. PLoS One 20(10): e0334335. https://doi.org/10.1371/journal.pone.0334335

8.4 Supporting Literature

Takebe N, Miele L, Harris PJ, Jeong W, Bando H, Kahn M, Yang SX, Ivy SP (2015). Targeting Notch, Hedgehog, and Wnt pathways in cancer stem cells: clinical update. Nature Reviews Clinical Oncology, 12(8), 445–464.
Galluzzi L et al. (2017). Immunogenic cell death in cancer and infectious disease. Nature Reviews Immunology, 17(2), 97-111.
Mohan HM et al. (2012). Molecular pathways: the role of NR4A orphan nuclear receptors in cancer. Clinical Cancer Research, 18(12), 3223-3228.
Subramanian A et al. (2005). Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. PNAS, 102(43), 15545-15550.
Daschner PJ, Ciolino HP, Plouzek CA, Yeh GC (1999). Increased AP-1 activity in drug resistant human breast cancer MCF-7 cells. Breast Cancer Research and Treatment, 53(3), 229–240. https://doi.org/10.1023/a:1006138803392

BCB420 Assignment 2: Pathway Enrichment Analysis of NAC Resistance in Breast Cancer

Julia Gilley

2026-03-10