0 Package installation check

# Check installation
if(!requireNamespace("knitr", quietly = TRUE)) {
  install.packages("knitr")
}
if(!requireNamespace("dplyr", quietly = TRUE)) {
  install.packages("dplyr")
}
if(!requireNamespace("readr", quietly = TRUE)) {
  install.packages("readr")
}
if(!requireNamespace("rmarkdown", quietly = TRUE)) {
  install.packages("rmarkdown")
}
if(!requireNamespace("tidyr", quietly = TRUE)) {
  install.packages("tidyr")
}
if(!requireNamespace("tibble", quietly = TRUE)) {
  install.packages("tibble")
}
if(!requireNamespace("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")
}
if(!requireNamespace("ggplot2", quietly = TRUE)) {
  install.packages("ggplot2")
}

# Attach
library(knitr)
library(dplyr)
library(readr)
library(rmarkdown)
library(tidyr)
library(tibble)
library(BiocManager)
library(ggplot2)

1 Introduction

In Assignment 1, I selected the RNA-seq dataset from a study (Tan et al., 2025) investigating how nonsense-mediated RNA decay is regulated in a cell type-specific manner during human neural development (Karousis et al., 2016).

1.1 Dataset information

The dataset used in this analysis is GEO accession: GSE263406, a publicly available RNA sequencing dataset generated from Homo sapiens cells and released on May 21, 2025. This study (Tan et al., 2025) investigates the role of nonsense-mediated RNA decay (NMD), a conserved RNA quality-control pathway that regulates the stability and turnover of messenger RNAs. NMD plays an important role in gene expression regulation and is involved in multiple biological processes, including development and disease. The dataset focuses on identifying RNAs targeted by the NMD factor UPF3B (Kadlec et al., 2004) during neural differentiation.

graphic abstract from the study by Tan et al.

Specifically, bulk RNA-seq was performed on neural progenitor cells (NPCs) differentiated from human embryonic stem cells (hESCs). The experimental design compares two groups: a control (Ctrl) group and a UPF3B knockout (UPF3B-KO) group. By comparing gene expression profiles between these groups, the study aims to identify transcripts whose stability or abundance is affected by the loss of UPF3B. This dataset provides insight into how NMD sensitivity can change depending on cellular context (Buchwald et al., 2010), particularly during the transition from pluripotent stem cells to neural progenitor cells, and helps reveal how RNA-binding proteins may contribute to cell type-specific regulation of RNA decay.

The full dataset (after simple filtering) contains 43884 genes and 12 samples:
- (Ctrl_1-6): six control replicates where the UPF3B gene is intact and expressed.
- (UPF3B.KO_1-6): six replicates where UPF3B gene has been genetically disrupted (knockout).

See Figure in the A1 html file.

1.2 Dataset processing

The bulk mRNA-seq dataset used in this analysis was obtained from GEO and accessed via the GEOquery package in R. Genes with no expression across all samples were removed, leaving 43,884 genes across 12 samples. Library sizes were visualized to assess sample consistency, and Ensembl gene IDs were mapped to HGNC symbols using the biomaRt package (Durinck et al., 2005), resulting in 31,282 successfully annotated genes. Lowly expressed genes were filtered with edgeR’s filterByExpr() (Chen et al., 2025), retaining 19,197 genes for downstream analysis.

Normalization was performed using the TMM (Trimmed Mean of M-values) method (Robinson & Oshlack, 2010) to adjust for differences in library size and sequencing depth, assuming most genes are not differentially expressed. Raw count distributions were evaluated using log2-transformed boxplots and density plots, which highlighted potential outliers and confirmed consistency across replicates. Post-normalization comparisons showed improved uniformity in expression distributions, preparing the dataset for subsequent differential expression analysis.

1.3 Differential expression analysis

Differential expression analysis of TMM-normalized log2 CPM values showed clear separation between control and UPF3B knockout NPC samples in multidimensional scaling plots. The BCV plot indicated low variability among replicates (common dispersion ~0.1).

Using an FDR cutoff of 0.05 and applying an additional threshold of |log2FC| > 1 yielded 1209 high-confidence genes. UPF3B knockout led to more down-regulated genes (n = 727) than up-regulated (n = 482), suggesting a primarily positive regulatory role in NPCs and highlighting its cell type-specific effects. Volcano plot.

2 Over-representation analysis (ORA)

Here, we perform over-representation analysis (ORA) (Khatri et al., 2012) using package gprofiler2 on the significant genes identified in A1, focusing on biologically relevant gene sets.

2.1 Importing dataset from A1

In the A1 analysis file, additional code was added to identify significant genes using an FDR cutoff of 0.05 and an additional threshold of |log2FC| > 1. These filtered genes were then exported as a CSV file to the A2/data directory, making them readily available for subsequent analyses in the code chunk below. Subsets for up-regulated and down-regulated genes are also stored here.

library(dplyr)

setwd("/home/rstudio/projects/A2")

# read in significant genes list exported from Assignment 1
sig_genes_A1 <- read.csv("data/sig_genes.csv", stringsAsFactors = FALSE)

# set into different subsets
up_genes <- sig_genes_A1 %>%
  filter(logFC > 0)
up_genes_names <- up_genes$gene_name

down_genes <- sig_genes_A1 %>%
  filter(logFC < 0)
down_genes_names <- down_genes$gene_name

all_genes_names <- sig_genes_A1$gene_name

# summary message
cat("All significant genes:", length(all_genes_names),
    "\nUp regulated genes:", length(up_genes_names),
    "\nDown regulated genes:", length(down_genes_names))

## All significant genes: 1209 
## Up regulated genes: 482 
## Down regulated genes: 727

2.2 ORA methods and process

Our analysis workflow was adapted from the CBW Pathway Workshop framework (Isserlin, 2024), with several adjustments and customizations applied to suit the specific needs of this study.

2.2.1 Parameters

Data source selection: For the over-representation analysis (ORA) of UPF3B-related significant genes, we selected the following databases as sources: GO-BP (Biological Process) to capture the biological processes influenced by UPF3B, including RNA metabolism and neural development; GO-CC (Cellular Component) to identify subcellular localization patterns of the affected proteins; KEGG to map genes onto classical signaling and metabolic pathways; and REAC (Reactome) to include curated human pathways relevant to mRNA decay and neurodevelopmental processes. This combination provides a comprehensive view of the functional and pathway-level impacts of UPF3B knockout in neural progenitor cells.

Organism: Specify the analysis for human genes according to course requirements.

Correction method: Control for multiple testing using false discovery rate (or Benjamini-Hochberg method) (Benjamini & Hochberg, 1995).

Other parameters:
ordered_query = FALSE: perform strict thresholded ORA without considering gene ranking.
significant = TRUE: return only statistically significant enriched terms.
exclude_iea = TRUE: exclude automatic electronic annotations to focus on high-confidence terms.

# ensure the package is installed and attached
if(!requireNamespace("gprofiler2", quietly = TRUE)) {
  install.packages("gprofiler2")
}
library(gprofiler2)

sources_lst <- c("GO:BP",    # Gene Ontology - Biological Process
                 "GO:CC",    # Gene Ontology – Cellular Component
                 "KEGG",     # Kyoto Encyclopedia of Genes and Genomes
                 "REAC"      # Reactome Pathways
                 )

upreg_lst <- list("Up-regulated Genes" = up_genes_names)
downreg_lst <- list("Down-regulated Genes" = down_genes_names)

# Thresholded ORA on up-regulated genes
upreg_ora <- gost(
  query = upreg_lst,        # focus on upregulated genes
  sources = sources_lst,    # restricted to these sources
  organism = "hsapiens",    # focus on pathways in human
  correction_method = "fdr",# False Discovery Rate (Benjamini-Hochberg Method)
  ordered_query = FALSE,    # put FALSE for strict thresholded ORA
  significant = TRUE,       # statistically significant pathways
  exclude_iea = TRUE        # focus on high-confidence annotations
  )

# Thresholded ORA on down-regulated genes
downreg_ora <- gost(
  query = downreg_lst,
  sources = sources_lst,
  organism = "hsapiens",
  correction_method = "fdr",
  ordered_query = FALSE,
  significant = TRUE,
  exclude_iea = TRUE
  )

2.2.2 Visualization using Manhattan plots

Here we visualize the functional enrichment results of up- and down-regulated genes using Manhattan plots. Each point represents a significantly enriched term from GO-BP, GO-CC, KEGG, or REAC (Reactome), with the y-axis showing -log10(FDR). The patchwork package (Pedersen, 2025) is used to arrange the up- and down-regulated plots stacked for easy comparison.

if(!requireNamespace("patchwork", quietly = TRUE)) {
  install.packages("patchwork")
}
library(gprofiler2)
library(patchwork)

# Interactive plot (zoomable, hover for gene info)
p_up <- gostplot(
  upreg_ora,
  capped = TRUE,
  interactive = FALSE
)
p_down <- gostplot(
  downreg_ora,
  capped = TRUE,
  interactive = FALSE
)

p_up / p_down

Figure 1. Manhattan plots showing enriched biological processes and pathways for up-regulated (top) and down-regulated (bottom) genes in human neural progenitor cells (NPC). The analyzed genes represent those potentially regulated by UPF3B, identified through differential expression analysis between UPF3B-KO and Ctrl samples. ORA was performed using GO-BP, GO-CC, KEGG, and REAC Reactome databases with FDR < 0.05, excluding electronically inferred annotations (IEA). Each point represents a significantly enriched term, with the y-axis showing -log10(FDR).

2.3 Extract results

2.3.1 Filtering results according to thresholds

To improve the robustness of enrichment results, pathway terms were filtered using predefined thresholds. Gene sets with fewer than 5 or more than 500 genes were excluded, and only terms with an intersection size of at least 2 genes with the query list were retained. After applying these criteria, 73 significantly enriched pathways were identified for the up-regulated gene set and 388 for the down-regulated gene set.

# extract the results dataframe for up- and down-regulated genes
# set thresholds
min_gs_size = 5
max_gs_size = 500    # adjust to improve robustness of enrichment results
min_intersec = 2

# apply filtering thresholds on term sizes
up_result_filtered <- upreg_ora$result %>%
  filter(term_size >= min_gs_size, 
         term_size <= max_gs_size,
         intersection_size >= min_intersec)
down_result_filtered <- downreg_ora$result %>%
  filter(term_size >= min_gs_size, 
         term_size <= max_gs_size,
         intersection_size >= min_intersec)

# message
up_pathways <- nrow(up_result_filtered)
down_pathways <- nrow(down_result_filtered)
cat("Significant up-regulated pathways:", up_pathways,
    "\nSignificant down-regulated pathways:", down_pathways)

## Significant up-regulated pathways: 73 
## Significant down-regulated pathways: 388

2.3.2 Top 10 up-regulated pathways

if(!requireNamespace("kableExtra", quietly = TRUE)) {
  install.packages("kableExtra")
}
library(kableExtra)

# keep only columns needed (avoiding list columns)
keep_cols <- c("term_id", "term_name",
               "p_value", "intersection_size")
up_subset <- up_result_filtered[, keep_cols]

# select top 10
up_top10 <- up_subset %>%
  arrange(p_value) %>%
  slice(1:10)

# modify p_value format
up_top10$p_value <- formatC(up_top10$p_value, format = "e", digits = 2)

# render table of top 10
knitr::kable(up_top10, 
             row.names = FALSE, 
             format = "html",
             caption = "Top 10 significantly up-regulated functional terms/pathways") %>% 
  kable_styling(
    bootstrap_options = c("striped", "condensed"),
    full_width = FALSE) %>%
  column_spec(2, width = "30em")

Top 10 significantly up-regulated functional terms/pathways
term_id	term_name	p_value	intersection_size
KEGG:04080	Neuroactive ligand-receptor interaction	5.30e-06	21
REAC:R-HSA-500792	GPCR ligand binding	8.21e-04	22
KEGG:04020	Calcium signaling pathway	1.03e-03	14
REAC:R-HSA-416476	G alpha (q) signalling events	1.05e-03	14
GO:0007200	phospholipase C-activating G protein-coupled receptor signaling pathway	3.73e-03	10
GO:0048729	tissue morphogenesis	3.73e-03	17
GO:0048732	gland development	3.73e-03	11
GO:0048534	hematopoietic or lymphoid organ development	3.73e-03	6
GO:0050926	regulation of positive chemotaxis	3.73e-03	5
GO:0050918	positive chemotaxis	5.12e-03	6

Table 1. Top 10 significantly enriched functional terms and pathways for up-regulated genes. Terms were filtered by gene set size (5-500 genes) and minimum intersection size (>=2 genes), and ranked by p-value.

2.3.3 Top 10 down-regulated pathways

library(kableExtra)

down_subset <- down_result_filtered[, keep_cols]

# select top 10
down_top10 <- down_subset %>%
  arrange(p_value) %>%
  slice(1:10)

# modify p_value format
down_top10$p_value <- formatC(down_top10$p_value, format = "e", digits = 2)

# render table of top 10
knitr::kable(down_top10, 
             row.names = FALSE, 
             format = "html",
             caption = "Top 10 significantly down-regulated functional terms/pathways") %>% 
  kable_styling(
    bootstrap_options = c("striped", "condensed"),
    full_width = FALSE) %>%
  column_spec(2, width = "30em")

Top 10 significantly down-regulated functional terms/pathways
term_id	term_name	p_value	intersection_size
GO:0031012	extracellular matrix	4.22e-21	69
GO:0030312	external encapsulating structure	4.22e-21	69
GO:0062023	collagen-containing extracellular matrix	1.18e-15	52
GO:0001568	blood vessel development	1.42e-13	58
GO:0003013	circulatory system process	1.39e-10	50
REAC:R-HSA-1474244	Extracellular matrix organization	1.53e-10	41
GO:0048514	blood vessel morphogenesis	2.26e-10	50
GO:0030198	extracellular matrix organization	2.36e-10	35
GO:0043062	extracellular structure organization	2.54e-10	35
GO:0045229	external encapsulating structure organization	2.75e-10	35

Table 2. Top 10 significantly enriched functional terms and pathways for down-regulated genes. Terms were filtered by gene set size (5-500 genes) and minimum intersection size (>=2 genes), and ranked by p-value.

2.4 Interpretation of results

Here in this assignment, up-regulated genes refer to those with higher expression in the UPF3B-KO (knockout) group compared with the Ctrl (control) group, while down-regulated genes have lower expression in the UPF3B-KO group.

if(!requireNamespace("stringr", quietly = TRUE)) {
  install.packages("stringr")
}
library(dplyr)
library(stringr)

# try searching terms/pathways related to neuro-development
up_neuro_terms <- up_subset %>%
  filter(str_detect(term_name, regex("neuro", ignore_case = TRUE))) %>% 
  arrange(p_value)
down_neuro_terms <- down_subset %>%
  filter(str_detect(term_name, regex("neuro", ignore_case = TRUE))) %>% 
  arrange(p_value)

2.4.1 Summary of ORA Results

ORA was conducted separately for up-regulated and down-regulated genes in the UPF3B knockout versus control samples, with terms filtered by gene set size (5-500 genes) and a minimum intersection size of 2 genes. Using these criteria, 73 pathways were found to be significantly enriched among up-regulated genes, whereas 388 pathways were significantly enriched among down-regulated genes, reflecting a broader and more pronounced effect on gene repression than activation.

Notably, the down-regulated pathways included a substantial number of extra-cellular matrix (ECM) and vascular-related processes (as shown in Table 2.), such as extra-cellular matrix organization, collagen-containing ECM, blood vessel development, and circulatory system processes.

In addition, neural-related pathways, including neuron projection, dopaminergic neuron differentiation, and regulation of neurogenesis, were significantly down-regulated (as shown in Table 4.). Up-regulated pathways were enriched for neurotransmitter receptor complexes and synaptic signaling, suggesting selective transcript accumulation. Overall, these results indicate that UPF3B knockout affects diverse biological processes in a pathway-specific manner, with particularly strong effects on both neural and extra-cellular matrix-related functions.

library(kableExtra)
knitr::kable(up_neuro_terms, 
             row.names = FALSE, 
             format = "html",
             caption = "up-regulated neural-related pathways/functional terms") %>%
  kable_styling(
    bootstrap_options = c("striped", "condensed"),
    full_width = FALSE) %>%
  column_spec(2, width = "30em")

up-regulated neural-related pathways/functional terms
term_id	term_name	p_value	intersection_size
KEGG:04080	Neuroactive ligand-receptor interaction	0.0000053	21
REAC:R-HSA-112316	Neuronal System	0.0241016	16
GO:0098878	neurotransmitter receptor complex	0.0382932	4
REAC:R-HSA-112314	Neurotransmitter receptors and postsynaptic signal transmission	0.0421370	10

Table 3. ORA of up-regulated genes in UPF3B knockout samples highlighting neural-related pathways. The table lists significantly enriched terms associated with neurotransmitter receptor complexes, neuroactive ligand-receptor interactions, and postsynaptic signal transmission, indicating selective accumulation of synaptic and neuronal transcripts following UPF3B loss.

library(kableExtra)
knitr::kable(down_neuro_terms, 
             row.names = FALSE, 
             format = "html",
             caption = "down-regulated neural-related pathways/functional terms") %>%
  kable_styling(
    bootstrap_options = c("striped", "condensed"),
    full_width = FALSE) %>%
  column_spec(2, width = "30em")

down-regulated neural-related pathways/functional terms
term_id	term_name	p_value	intersection_size
KEGG:04080	Neuroactive ligand-receptor interaction	0.0000081	36
GO:0044306	neuron projection terminus	0.0003634	12
GO:0007272	ensheathment of neurons	0.0030640	11
GO:0043025	neuronal cell body	0.0053345	19
GO:0071542	dopaminergic neuron differentiation	0.0106984	6
GO:0050905	neuromuscular process	0.0167504	9
GO:0050767	regulation of neurogenesis	0.0187927	17
REAC:R-HSA-112316	Neuronal System	0.0206791	28
GO:0005883	neurofilament	0.0398329	3

Table 4. ORA of down-regulated genes in UPF3B knockout samples highlighting neural-related pathways. The table lists significantly enriched terms related to neuron projection, neuronal cell body, dopaminergic neuron differentiation, and regulation of neurogenesis, reflecting suppression of neural development and differentiation processes upon UPF3B loss.

2.4.2 Neural Pathway Dysregulation

Focusing on neuro-developmental processes, down-regulated gene sets were enriched for neuron projection terminus, neuronal cell body, dopaminergic neuron differentiation, and regulation of neuro-genesis, reflecting suppression of neural development and differentiation pathways (Alrahbeni et al., 2015).

Up-regulated gene sets, by contrast, were enriched for neurotransmitter receptor complexes, neuroactive ligand-receptor interactions, and post-synaptic signal transmission, suggesting that specific synaptic components may accumulate due to reduced NMD activity. This bidirectional pattern highlights that UPF3B knockout selectively impacts neural transcript stability, consistent with the notion that NMD targets are factor- and cell type-specific (Tan et al., 2025), rather than globally suppressed.

2.4.3 Integration with Published Evidence

Independent studies further validate these findings. UPF3B mutations have been associated with neuro-developmental disorders (Jolly et al., 2013), cognitive impairments (Alrahbeni et al., 2015), and synaptic dysfunction, supporting the observed down-regulation of neuronal differentiation pathways in our analysis.

Additionally, NMD has been shown to regulate synaptic receptor composition and post-synaptic signaling, aligning with the up-regulated neurotransmitter receptor complexes detected in our analysis. ECM and vascular-related transcript regulation has also been reported in contexts where NMD influences structural gene expression (Chamieh et al., 2008), further supporting the broader biological relevance of our ORA results. Together, these findings highlight the pathway-specific, pleiotropic effects of UPF3B knockout.

3 Non-thresholded gene set enrichment analysis (GSEA)

3.1 Fetching .gmt file

For the non-thresholded gene set enrichment analysis (GSEA), we selected the newest released .gmt file Human_GO_AllPathways_noPFOCR_no_GO_iea_September_01_2025_symbol.gmt” from BaderLab Gene Sets (Lab, 2026).

Note: Firstly, I attempted to use the current release version (Mar. 02, 2026); however, errors occurred when loading the GMT file in the GSEA application due to duplicate gene set names. Therefore, an alternative gene set collection without duplicate pathway names was used for the subsequent analysis.

This file contains all Gene Ontology (GO) pathways across biological processes, molecular functions, and cellular components, represented by gene symbols. We specifically chose the version without parent–child overlap correction (noPFOCR) to retain the full hierarchical structure of GO terms, and without IEA (Inferred from Electronic Annotation) evidence to ensure high-confidence annotations. This selection balances comprehensive coverage of relevant GO pathways with reliable, experimentally supported gene annotations, enhancing the interpretability of the enrichment results.

if(!requireNamespace("RCurl", quietly = TRUE)) {
  install.packages("RCurl")
}
library(RCurl)

# select current release - Mar.02
gmt_file_name <- "Human_GO_AllPathways_noPFOCR_no_GO_iea_September_01_2025_symbol.gmt"
setwd("/home/rstudio/projects/A2")
data_dir <- "./data"

# set download directory
if(!dir.exists(data_dir)) dir.create(data_dir)
dest_gmt_file <- file.path(data_dir, gmt_file_name)

# URL for download
gmt_url <- paste0("http://download.baderlab.org/EM_Genesets/September_01_2025/Human/symbol/", gmt_file_name)

# download GMT file if not already present
if(!file.exists(dest_gmt_file)){
  message("Downloading GMT file to: ", dest_gmt_file)
  download.file(gmt_url, destfile = dest_gmt_file, mode = "wb")
} else {
  message("GMT file already exists: ", dest_gmt_file)
}

3.2 Generating .rnk file

We generated a .rnk file containing all genes from the differential expression analysis in Assignment 1. Each gene was assigned a ranking score calculated as the sign of the log2 fold-change multiplied by the negative log10 of the p value corrected using FDR: \[ \text{rank} = \text{sign}(\text{logFC}) \times -\log_{10}(\text{FDR adjusted pValue}) \] Positive values indicate genes up-regulated in the condition of interest, whereas negative values indicate down-regulated genes.

This ranking strategy was chosen to capture both the direction and statistical significance of differential expression, ensuring that genes with small but highly significant changes contribute appropriately to enrichment analysis. Using this approach allows non-thresholded GSEA to leverage information from all measured genes, rather than only those passing arbitrary significance cutoffs, increasing the sensitivity and interpretability of pathway-level results.

library(dplyr)

setwd("~/projects/A2")
results_DEA <- read.csv("data/DEA_result.csv", stringsAsFactors = FALSE)

ranked_genes <- results_DEA %>%
  mutate(rank = sign(logFC) * (-log10(FDR))) %>%
  # keep only two required columns
  dplyr::select(gene_name, rank) %>%
  arrange(desc(rank))

# save .rnk file into data dir
data_dir <- "./data"
rnk_file <- file.path(data_dir, "pre_GSEA_rank.rnk")

if (!file.exists(rnk_file)) {
  write.table(ranked_genes, 
            file = rnk_file, 
            sep = "\t", row.names = FALSE, col.names = FALSE, quote = FALSE)
  message("DEA CSV file created: ", rnk_file)
} else {
  message("DEA File already exists, skipping write: ", rnk_file)
}

paged_table(ranked_genes[1:10, ] )

Table 5. Gene rank, calculated as the sign of the log2 fold-change multiplied by the negative log10 of the FDR. Positive values indicate genes up-regulated in the condition of interest (UPF3B-KO), while negative values indicate down-regulated genes. For clarity, only the first 10 rows and first 10 columns are shown.

3.3 GSEA process

Our analysis workflow was adapted from the CBW Pathway Workshop framework (Isserlin, 2024), with several adjustments and customizations applied to suit the specific needs of this study. The process was performed using the GSEA desktop software.

3.3.1 Parameters

GSEA desktop version: v4.4.0

Geneset database: Human_GO_AllPathways_noPFOCR_no_GO_iea_September_01_2025_symbol.gmt

Number of permutations: 2000

Ranked DEGs: pre_GSEA_rank.rnk

Collapse/Remap to gene symbols: No_Collapse

Max size: 500

Min size: 15

Scree shot for GSEA parameters

Figure 2. Screenshot of the parameter settings used for the GSEA analysis. The interface displays key configuration options including the number of permutations, selection of the ranked gene list file, and the option to disable gene symbol remapping since the ranked list already contains gene symbols. Additional settings shown include the specification of an analysis name to label the output directory, adjustment of gene set size filters to exclude excessively large or small pathways, and selection of the output directory where GSEA saves the analysis results.

3.3.2 Results

In this gene set enrichment analysis (GSEA), genes were ranked such that positive values indicate up-regulation in the condition of interest (e.g., UPF3B-KO), while negative values indicate down-regulation. Accordingly, GSEA reports enrichment separately for the positive and negative ends of the ranked list.

The na_pos category represents gene sets that are significantly enriched among genes up-regulated in UPF3B-KO, reflecting pathways more active in the experimental condition. In contrast, the na_neg category represents gene sets enriched among genes down-regulated in UPF3B-KO (or relatively active in the Ctrl condition). In the current analysis, 2,488 gene sets were up-regulated in na_pos, with 162 reaching FDR < 25% and 157 reaching nominal p-value < 1%, whereas 5,269 gene sets were up-regulated in na_neg, with 1,718 reaching FDR < 25% and 835 reaching nominal p-value < 1%. These results indicate that a substantial number of pathways are either activated or repressed in UPF3B-KO, providing insights into the biological processes associated with the experimental condition.

Scree shot for GSEA parameters

Figure 3. Screenshot of GSEA result overview. The top part (na_pos) denote up-regulation in UPF3B-KO samples, while the bottom part (na_neg) denote down-regulation in UPF3B-KOsamples. A snapshot of the enrichment results provides an overview, while detailed results are available in both HTML and tab-delimited (TSV) formats. These results allow for further interpretation of the biological pathways and processes associated with the na_neg phenotype.

Scree shot for Top 20 up-regulated pathways

Figure 4. Top 20 up-regulated pathways. These up-regulated gene sets (in UPF3B-KO) are primarily associated with mitochondrial function and energy metabolism, including pathways for mitochondrial translation, respiratory chain assembly, oxidative phosphorylation, and aerobic respiration. Some pathways also involve glycolysis, fatty acid catabolism, and cellular transport, indicating coordinated up-regulation of both mitochondrial and cytosolic metabolic processes. Overall, these results suggest enhanced cellular bio-energetics and metabolic activity in the condition of interest.

Scree shot for Top 20 down-regulated pathways

Figure 5. Top 20 down-regulated pathways. The top 20 down-regulated gene sets (in UPF3B-KO) are primarily associated with extracellular matrix structure, organization, and cell adhesion, including collagen fibril organization, cadherin signaling, and integrin-mediated pathways. Several pathways also involve Wnt and SMAD signaling, which regulate cell proliferation, differentiation, and tissue morphogenesis. Overall, these results suggest that the condition of interest (UPF3B-KO) is linked to reduced extra-cellular matrix activity and altered developmental signaling.

3.4 Interpretation of results

3.4.1 Comparison to Thresholded Analysis (ORA)

ORA (threshold-based) focuses only on genes passing specific significance cutoffs, whereas GSEA evaluates the entire ranked gene list, allowing even modest but coordinated changes to contribute to pathway enrichment. As a result, GSEA identified substantially more significant gene sets than ORA, with 363 and 1,483 gene sets reaching FDR < 5% in the na_pos (up-regulated in UPF3B-KO) and na_neg (down-regulated in UPF3B-KO) directions, far exceeding the number of significant pathways detected by ORA (73 up-regulated and 388 down-regulated pathways). This illustrates that GSEA is more sensitive to subtle but coordinated transcriptional changes, whereas ORA emphasizes pathways driven by strong, threshold-passing gene expression differences.

3.4.2 Qualitative similarities

Despite methodological differences, GSEA and ORA highlight consistent biological themes. Both analyses point to strong down-regulation of extracellular matrix (ECM) and neural differentiation pathways, and up-regulation of synaptic signaling and neurotransmitter receptor complexes in UPF3B knockout samples. The convergence of these findings reinforces the conclusion that UPF3B selectively impacts neural transcript stability and ECM-related processes, supporting observations from previous studies (Alrahbeni et al., 2015; Chamieh et al., 2008).

3.4.3 Qualitative Differences and Interpretation Challenges

Direct, quantitative comparison between ORA and GSEA is not straightforward due to their different underlying principles. ORA relies on a predefined significance cutoff, potentially overlooking pathways influenced by moderate but coordinated gene expression changes.

GSEA, in contrast, uses rank-based enrichment scores, capturing cumulative effects across the gene list and revealing additional pathways with smaller individual gene effects. As a result, GSEA often reports more extensive pathway enrichment, providing a broader perspective on transcriptional changes that may be missed in threshold-based ORA. Comparisons between the two methods are therefore qualitative, focusing on convergent biological patterns rather than exact overlaps in pathway lists.

4 Visualize gene set enrichment analysis in Cytoscape

To interpret and explore the results of gene set enrichment analysis (GSEA), we used Cytoscape (Shannon et al., 2003), a network-based visualization platform. Within Cytoscape, the EnrichmentMap app (Merico et al., 2010) was used to construct a network representation of the enriched pathways. Each node in the network represents a gene set, while edges indicate gene overlap between sets.

4.1 Creating the Enrichment Map

Parameters:

FDR adjusted p-Value (i.e., q-Value) Cutoff: 0.1.
Edge Cutoff: 0.375.

The EnrichmentMap network (shown in Figure 6.) consists of 412 nodes and 1078 edges, with an average of ~7 neighbors per node, indicating relatively high connectivity among gene sets. A clustering coefficient of 0.575 reflects pronounced modularity or functional clustering, while a network density of 0.097 indicates overall sparse connectivity. The network contains 114 connected components, highlighting multiple independent functional modules. Overall, this EnrichmentMap depicts a complex, moderately connected network with distinct clusters of enriched gene sets.

The majority of nodes in the EnrichmentMap are colored blue, indicating that most gene sets are down-regulated. This suggests that the observed biological processes are predominantly repressed under the experimental conditions (UPF3B-KO samples). Such a bias toward down-regulation may highlight key pathways that are suppressed and warrant further investigation.

Raw enrichment map

Figure 6. EnrichmentMap visualization of gene set enrichment results. The network contains 412 nodes and 1078 edges, showing moderately connected gene sets with distinct functional clusters. High clustering highlights modular organization, while multiple connected components reflect independent biological processes. Red nodes indicate up-regulated gene sets (in UPF3B-KO samples), while blue nodes indicate down-regulated gene sets. The color intensity typically reflects the magnitude or significance of the enrichment.

4.2 Finding clusters with Annotation

4.2.1 Generate an annotation set using AutoAnnotate

Cluster creation:

Perform clustering via clusterMaker2 app
cluster algorithm: MCL algorithm
edge weight column: similarity coefficient
create singleton clusters: No
layout network to minimize cluster overlap: No

Label creation:

Generate node labels using WordCloud app (most frequent word in cluster and adjacent words)
label column: GS_DESCR
maximum words per label of 3 (default)
minimum word occurrence of 1 (default)
adjacent word bonus of 8 (default)

Annotated enrichment map

Figure 7. Functional clustering of enriched gene sets using AutoAnnotate in Cytoscape. Gene sets were grouped into functional clusters using the clusterMaker2 app with the MCL clustering algorithm, based on the similarity coefficient between gene sets. Cluster labels were automatically generated using the WordCloud app from the GS_DESCR column, highlighting the most frequently occurring terms within each cluster. The resulting clusters were displayed using the AutoAnnotate grid layout to minimize overlap and improve visualization of functional modules within the enrichment network.

4.2.2 AutoAnnotate display - after layout

Enrichment map layout

Figure 8. Functional clustering of enriched gene sets after layout. Application of a network layout improved the spatial separation and organization of clusters, facilitating clearer visualization of functional themes and relationships among enriched gene sets.

Enrichment map - theme network

Figure 9. Summary network created after clustering. The network was collapsed to a theme network using AutoAnnotate, in which each cluster of enriched gene sets is represented by a single node. Applying a network layout improved the spatial organization of clusters, allowing clear visualization of functional themes. The major themes identified include extracellular matrix organization, neurogenesis and neuronal differentiation, synaptic signaling, immune response, and cell adhesion.

4.3 Results summary and interpretations

4.3.1 Up-regulated pathways/genesets (for UPF3B-KO samples)

The GSEA results for UPF3B knockout samples broadly support the conclusions reported by (Tan et al., 2025). Specifically, up-regulated pathways in the knockout samples were enriched for mitochondrial function and oxidative-reduction/respiration processes, suggesting enhanced cellular energy metabolism or compensatory mitochondrial activity. Although (Tan et al., 2025) focused on NMD-mediated transcript regulation and neural development, mitochondrial and metabolic processes are consistent with the reported role of NMD in maintaining neuronal homeostasis and synaptic function. For example, neuronal activity and differentiation require tightly regulated energy production (@ Karousis et al., 2016), and the up-regulation of these metabolic pathways may reflect indirect effects of UPF3B loss on neuronal cell physiology.

4.3.2 Down-regulated pathways/genesets (for UPF3B-KO samples)

In contrast, down-regulated pathways were more numerous and enriched for cardiac, neural, and extracellular matrix (ECM) development. This pattern aligns with (Tan et al., 2025) observations that NMD disruption selectively suppresses transcripts involved in neural differentiation and extracellular structural components. Notably, pathways such as Wnt signaling and GDNF/RET signaling – while not strictly annotated under neural development in the database – play crucial roles in neuron differentiation, axon guidance, and synapse formation, providing mechanistic support for impaired neurodevelopment in UPF3B knockout contexts. Similarly, down-regulation of ECM-related processes echoes previous reports that NMD regulates structural gene expression, potentially influencing tissue organization and vascular development.

4.3.3 Comparison to Thresholded Methods (ORA)

Compared to thresholded methods like ORA, GSEA provides a more comprehensive, non-binary view of pathway regulation. Whereas ORA identified a smaller subset of significant pathways (e.g., 73 up-regulated vs 388 down-regulated), GSEA highlighted thousands of gene sets with subtle but coordinated enrichment patterns. This is particularly useful for detecting mitochondrial and metabolic pathways that may not meet strict fold-change cutoffs but are consistently altered across many genes. Therefore, while the qualitative trends – stronger down-regulation in neural/ECM pathways – are consistent between ORA and GSEA, GSEA captures additional pathways that ORA may miss due to arbitrary thresholding, making the comparison informative but not straightforward.

4.4 Supporting Evidence from Literature

Multiple studies provide independent support for these findings. GDNF and RET signaling are well-known to regulate dopaminergic neuron survival, differentiation, and axon guidance (Chamieh et al., 2008), supporting the down-regulation observed in our analysis. Similarly, Wnt signaling is implicated in both skeletal and neural development, including neuron proliferation and axon outgrowth (Alrahbeni et al., 2015). Mitochondrial and oxidative phosphorylation pathways, up-regulated in UPF3B knockout samples, are critical for neuronal energy supply and synaptic function, consistent with studies showing that mitochondrial dynamics are influenced by NMD activity (Tan et al., 2025). Together, these publications reinforce that our GSEA results are biologically plausible and reflect mechanisms consistent with the original Tan paper.

5 Link to journal entry

A2 journal entry could be found here.

References

Alrahbeni, T., Sartor, F., Anderson, J., Miedzybrodzka, Z., McCaig, C., & Müller, B. (2015). Full UPF3B function is critical for neuronal differentiation of neural stem cells. Molecular Brain, 8, 33. https://doi.org/10.1186/s13041-015-0122-1

Benjamini, Y., & Hochberg, Y. (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B (Methodological), 57(1), 289–300.

Buchwald, G., Ebert, J., Basquin, C., Sauliere, J., Jayachandran, U., Bono, F., Le Hir, H., & Conti, E. (2010). Insights into the recruitment of the NMD machinery from the crystal structure of a core EJC-UPF3b complex. Proceedings of the National Academy of Sciences of the United States of America, 107, 10050–10055.

Chamieh, H., Ballut, L., Bonneau, F., & Le Hir, H. (2008). NMD factors UPF2 and UPF3 bridge UPF1 to the exon junction complex and stimulate its RNA helicase activity. Nature Structural & Molecular Biology, 15(1), 85–93. https://doi.org/10.1038/nsmb1330

Chen, Y., Chen, L., Lun, A. T. L., Baldoni, P., & Smyth, G. K. (2025). edgeR v4: Powerful differential analysis of sequencing data with expanded functionality and improved support for small counts and larger datasets. Nucleic Acids Research, 53(2), gkaf018. https://doi.org/10.1093/nar/gkaf018

Davis, S., & Meltzer, P. (2007). GEOquery: A bridge between the gene expression omnibus (GEO) and BioConductor. Bioinformatics, 14, 1846–1847. https://doi.org/10.1093/bioinformatics/btm254

Durinck, S., Moreau, Y., Kasprzyk, A., Davis, S., De Moor, B., Brazma, A., & Huber, W. (2005). BioMart and bioconductor: A powerful link between biological databases and microarray data analysis. Bioinformatics, 21, 3439–3440.

Gu, Z. (2022). Complex heatmap visualization. iMeta. https://doi.org/10.1002/imt2.43

Isserlin, R. (2024). CBW workshop example r notebooks. https://baderlab.github.io/CBW_Pathways_2024/

Jolly, L. A., Homan, C. C., Jacob, R., Barry, S., & Gecz, J. (2013). The UPF3B gene, implicated in intellectual disability, autism, ADHD and childhood onset schizophrenia regulates neural progenitor cell behaviour and neuronal outgrowth. Human Molecular Genetics, 22(23), 4673–4687. https://doi.org/10.1093/hmg/ddt315

Kadlec, J., Izaurralde, E., & Cusack, S. (2004). The structural basis for the interaction between nonsense-mediated mRNA decay factors UPF2 and UPF3. Nature Structural & Molecular Biology, 11, 330–337.

Karousis, E. D., Nasif, S., & Mühlemann, O. (2016). Nonsense-mediated mRNA decay: Novel mechanistic insights and biological impact. WIREs RNA, 7(5), 661–682. https://doi.org/10.1002/wrna.1357

Khatri, P., Sirota, M., & Butte, A. J. (2012). Ten years of pathway analysis: Current approaches and outstanding challenges. PLoS Comput Biol, 8(2), e1002375.

Kolberg, L., Raudvere, U., Kuzmin, I., Vilo, J., & Peterson, H. (2020). gprofiler2– an r package for gene list functional enrichment analysis and namespace conversion toolset g:profiler. F1000Research, 9 (ELIXIR)(709).

Lab, B. (2026). Enrichment map gene sets: Human gene set collections in GMT format. University of Toronto; Bader Lab GeneSets database, University of Toronto. http://baderlab.org/GeneSets

Leeds, P., Peltz, S. W., Jacobson, A., & Culbertson, M. R. (1991). The product of the yeast UPF1 gene is required for rapid turnover of mRNAs containing a premature translational termination codon. Genes & Development, 5, 2303–2314.

Merico, D., Isserlin, R., Stueker, O., Emili, A., & Bader, G. D. (2010). Enrichment map: A network-based method for gene-set enrichment visualization and interpretation. PLoS ONE, 5(11), e13984. https://doi.org/10.1371/journal.pone.0013984

Morgan, M., & Ramos, M. (2025). BiocManager: Access the bioconductor project package repository. https://doi.org/10.32614/CRAN.package.BiocManager

Müller, K., & Wickham, H. (2025). Tibble: Simple data frames. https://tibble.tidyverse.org/

Pedersen, T. L. (2025). Patchwork: The composer of plots. https://doi.org/10.32614/CRAN.package.patchwork

R Core Team. (2025). R: A language and environment for statistical computing. R Foundation for Statistical Computing. https://www.R-project.org/

Robinson, M. D., & Oshlack, A. (2010). A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biology, 11(3), R25. https://doi.org/10.1186/gb-2010-11-3-r25

Shannon, P., Markiel, A., Ozier, O., Baliga, N. S., Wang, J. T., Ramage, D., Amin, N., Schwikowski, B., & Ideker, T. (2003). Cytoscape: A software environment for integrated models of biomolecular interaction networks. Genome Research, 13(11), 2498–2504. https://doi.org/10.1101/gr.1239303

Tan, K., Sebat, J., & Wilkinson, M. F. (2025). Cell type- and factor-specific nonsense-mediated RNA decay. Nucleic Acids Res, 53(9), gkaf395. https://doi.org/10.1093/nar/gkaf395

Wickham, H. (2016). ggplot2: Elegant graphics for data analysis. Springer-Verlag New York. https://ggplot2.tidyverse.org

Wickham, H., François, R., Henry, L., Müller, K., & Vaughan, D. (2023). Dplyr: A grammar of data manipulation. https://doi.org/10.32614/CRAN.package.dplyr

BCB420 Assignment 2

Pathway and Network Analysis

Feiyang Sun

2026-03-10