Dataset: series GSE309004
Why is the dataset of interest to you? What are the control and test conditions of the dataset?: This dataset is of interest because it focuses on breast cancer, a widespread and robustly studied disease. The experiment is topical, addressing chemotherapy resistance, a major challenge in cancer treatment. Most of my interest, however, stems from the high quality of the dataset. I hope that the quality and volume of the data will allow interesting expression patterns to be found.
# Should not need to install packages, but good practice to include
# These are available within the BCB420 Docker image
if (!requireNamespace("BiocManager", quietly = TRUE))
  install.packages("BiocManager")
packages <- c("GEOquery", "edgeR", "biomaRt", "ggplot2", "pheatmap",
              "knitr", "limma", "ComplexHeatmap")
for (pkg in packages) {
  if (!require(pkg, character.only = TRUE)) {
    BiocManager::install(pkg)
    library(pkg, character.only = TRUE)
  }
}
knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)
Data Download
geo_id <- "GSE309004"
# Get GEO object using class suggested method
if (!file.exists(paste0(geo_id, "_metadata.Rdata"))) {
  gse <- getGEO(geo_id, GSEMatrix = TRUE, getGPL = FALSE)
  save(gse, file = paste0(geo_id, "_metadata.Rdata"))
} else {
  load(paste0(geo_id, "_metadata.Rdata"))
}
sample_info <- pData(gse[[1]])
# They also provide supplementary files! We will download these too
supp_files <- getGEOSuppFiles(geo_id, makeDirectory = TRUE,
                              baseDir = ".", fetch_files = TRUE)
# For reference, show the files we got
list.files(geo_id)
## [1] "GSE309004_NR_nac_Normalizedcounts_final.txt.gz"
## [2] "GSE309004_NR_nac_rawcounts_NR_final.txt.gz"
Data Assessment
files <- list.files(geo_id, full.names = TRUE)
count_file <- files[1]  # alphabetically first: the normalized counts file
expression_data <- read.table(count_file,
                              header = TRUE,
                              row.names = 1,
                              sep = "\t")
# View some verification info about the data
dim(expression_data)
## [1] 20287 54
head(expression_data[, 1:5])
## X013Q_Bn X014B.Tn X014Q.Tn X050B.A X050Q.A
## A1BG 2.1642486 3.0334871 3.731225 2.9118184 3.2262792
## A1BG-AS1 -0.3607066 0.7656241 -1.126756 1.0968503 -0.4386826
## A1CF -3.0452047 -2.8699644 -1.126756 -2.4356447 -2.3646820
## A2M 9.5246170 10.0651743 6.357060 7.9647773 8.6227241
## A2M-AS1 2.3421972 2.6669765 -1.126756 1.3863570 1.4219143
## A2ML1 -0.8983634 7.2539722 4.427833 -0.9950722 -0.4386826
# Check for missing values, defined as NA
cat("Missing values:", sum(is.na(expression_data)), "\n")
## Missing values: 0
# Take a look at a few samples
summary(expression_data[,1:5])
## X013Q_Bn X014B.Tn X014Q.Tn X050B.A
## Min. :-5.85256 Min. :-5.6773 Min. :-1.127 Min. :-5.243
## 1st Qu.: 0.07818 1st Qu.: 0.3888 1st Qu.:-1.127 1st Qu.:-0.385
## Median : 3.21084 Median : 3.4901 Median : 2.780 Median : 2.962
## Mean : 2.53313 Mean : 2.7406 Mean : 2.636 Mean : 2.246
## 3rd Qu.: 5.10961 3rd Qu.: 5.3485 3rd Qu.: 5.023 3rd Qu.: 5.001
## Max. :15.91502 Max. :15.0372 Max. :15.165 Max. :16.539
## X050Q.A
## Min. :-4.6866
## 1st Qu.:-0.4387
## Median : 3.0882
## Mean : 2.3060
## 3rd Qu.: 5.1336
## Max. :15.9869
Gene Mapping to HUGO Symbols
Current Gene Identifiers
# Take a look at the current state of the identifiers
head(rownames(expression_data), 20)
## [1] "A1BG" "A1BG-AS1" "A1CF" "A2M" "A2M-AS1" "A2ML1"
## [7] "A4GALT" "A4GNT" "AAAS" "AACS" "AADAC" "AADACL2"
## [13] "AADAT" "AAGAB" "AAK1" "AAMDC" "AAMP" "AANAT"
## [19] "AAR2" "AARD"
# Count total genes
cat("Total genes:", nrow(expression_data), "\n")
## Total genes: 20287
# Suspected HUGO pattern: uppercase letters, digits, and hyphens
hugo_pattern <- "^[A-Z0-9-]+$"
likely_hugo <- sum(grepl(hugo_pattern, rownames(expression_data)))
cat("Genes matching HUGO *pattern* (but not confirmed):", likely_hugo, "\n")
## Genes matching HUGO *pattern* (but not confirmed): 19997
Before moving on, I wanted to make sure that the current gene identifiers made sense. For this, I checked how many gene names were uppercase alphanumeric (allowing hyphens), since those could already be HUGO identifiers. The vast majority of genes match, so I moved on to actually validating the identifiers.
Verify Gene Symbol Validity
# Found this method online as an alternative to web requests (for time's sake)
if (!require("org.Hs.eg.db")) {
  BiocManager::install("org.Hs.eg.db")
  library(org.Hs.eg.db)
}
# We can just grab the valid hugo symbols from this database!
valid_symbols <- keys(org.Hs.eg.db, keytype = "SYMBOL")
# Then cross-check this with our data
mapped_genes <- rownames(expression_data) %in% valid_symbols
unmapped_genes <- rownames(expression_data)[!mapped_genes]
cat("Genes that are valid HUGO symbols:", sum(mapped_genes), "\n")
## Genes that are valid HUGO symbols: 19736
cat("Genes that are NOT valid HUGO symbols:", length(unmapped_genes), "\n")
## Genes that are NOT valid HUGO symbols: 551
cat("Which is a rate of :", round(sum(mapped_genes)/nrow(expression_data)*100, 2), "%\n")
## Which is a rate of : 97.28 %
97% of genes mapped is pretty good! It seems like the majority of gene identifiers are well-formed, so when handling the data we can treat the remainder as a small minority of cases.
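Before discarding the unmapped names outright, one optional rescue step is to check whether they are old aliases of current symbols. This is only a sketch (it assumes org.Hs.eg.db is loaded as above, and the number rescued depends on the annotation version in the Docker image); I did not apply it here, since dropping 2.7% of genes is acceptable.

```r
# Try to resolve unmapped names through the alias table of org.Hs.eg.db
# (sketch: select() warns on 1:many mappings and returns NA for unknown keys)
alias_map <- AnnotationDbi::select(org.Hs.eg.db,
                                   keys = unmapped_genes,
                                   keytype = "ALIAS",
                                   columns = "SYMBOL")
# Keep only aliases that resolve to exactly one current symbol
alias_counts <- table(alias_map$ALIAS)
unambiguous <- names(alias_counts)[alias_counts == 1]
cat("Rescuable via a unique alias:", length(unambiguous), "\n")
```

If this rescue were applied, the rescued rows would need to be renamed to their current symbols before the duplicate check below.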
Unmapped and Duplicate Genes
# First, print out some information about unmapped and duplicate genes
cat("Number of unmapped genes:\n")
## Number of unmapped genes:
print(length(unmapped_genes))
## [1] 551
duplicate_symbols <- names(table(rownames(expression_data)))[table(rownames(expression_data)) > 1]
cat("\nDuplicate gene symbols:", length(duplicate_symbols), "\n")
##
## Duplicate gene symbols: 0
Some unmapped genes, but no duplicates! Also, though there are 551
unmapped genes, this doesn’t actually comprise a huge proportion of the
data. For data cleaning, I will only keep genes that map to valid HUGO
symbols. I will also include code for removing duplicates, though this
will not result in any change.
Clean Expression Data
# Keep only valid HUGO symbols
expression_clean <- expression_data[mapped_genes, ]
# The loop below will not run here (no duplicates), but is kept for completeness
if (length(duplicate_symbols) > 0) {
  for (dup_gene in duplicate_symbols) {
    dup_rows <- which(rownames(expression_clean) == dup_gene)
    if (length(dup_rows) > 1) {
      variances <- apply(expression_clean[dup_rows, ], 1, var)
      # We would keep exactly one row: the one with the highest variance
      keep_row <- dup_rows[which.max(variances)]
      remove_rows <- dup_rows[dup_rows != keep_row]
      expression_clean <- expression_clean[-remove_rows, ]
    }
  }
}
cat("Final gene count:", nrow(expression_clean), "\n")
## Final gene count: 19736
The full count of genes has not decreased much across the cleaning, which is good. Next, I move on to some reordering and renaming. Note that the sample names use "B" for baseline samples and "Q" for follow-up samples.
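As a sketch of that naming convention (assuming the B/Q letter directly follows the leading patient digits, as in X013Q_Bn), the column names can be parsed like so:

```r
# Parse patient ID and timepoint from column names such as "X013Q_Bn"
# (sketch; names that do not match the pattern are left unchanged by sub())
sample_names <- colnames(expression_clean)
patient_id <- sub("^X([0-9]+)([BQ]).*$", "\\1", sample_names)
timepoint <- sub("^X([0-9]+)([BQ]).*$", "\\2", sample_names)
timepoint <- ifelse(timepoint == "B", "Baseline", "FollowUp")
table(timepoint)
```

Parsing the names this way also gives a quick cross-check against the condition labels in sample_info.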
Outlier Detection
# BOXPLOT
par(mfrow = c(1, 1), mar = c(10, 4, 4, 2))
boxplot(expression_clean, las = 2,
        main = "Expression Distributions",
        cex.axis = 0.4,
        col = ifelse(sample_info$condition == "Baseline", "lightblue", "lightpink"),
        ylab = "Expression",
        outline = FALSE)
legend("topright", legend = c("Baseline", "FollowUp"),
       fill = c("lightblue", "lightpink"))

# DENSITY PLOT
plot(density(expression_clean[, 1]),
     main = "Sample Expression Density Plots",
     xlab = "Expression",
     ylim = c(0, 0.3),
     col = ifelse(sample_info$condition[1] == "Baseline", "blue", "red"))
for (i in 2:ncol(expression_clean)) {
  lines(density(expression_clean[, i]),
        col = ifelse(sample_info$condition[i] == "Baseline", "blue", "red"))
}
legend("topright", legend = c("Baseline", "FollowUp"),
       col = c("blue", "red"), lwd = 2)

# MDS PLOT
# Filter low-expressed genes (as in paper)
keep_genes <- rowSums(expression_clean > 0) >= 5
expression_filtered <- expression_clean[keep_genes, ]
par(mar = c(5, 4, 4, 2))
plotMDS(expression_filtered,
        col = ifelse(sample_info$condition == "Baseline", "blue", "red"),
        pch = 16,
        main = "MDS Plot",
        cex = 1.2)
legend("topright", legend = c("Baseline", "FollowUp"),
       col = c("blue", "red"), pch = 16)

# View some info about correlation
sample_cor <- cor(expression_filtered)
cat("Min correlation:", min(sample_cor[lower.tri(sample_cor)], na.rm = TRUE), "\n")
## Min correlation: 0.7195318
cat("Max correlation:", max(sample_cor[lower.tri(sample_cor)], na.rm = TRUE), "\n")
## Max correlation: 0.9657983
cat("Mean correlation:", mean(sample_cor[lower.tri(sample_cor)], na.rm = TRUE), "\n")
## Mean correlation: 0.8481694
# Use median correlation to identify outlier candidates (to investigate by hand)
median_cor <- apply(sample_cor, 1, median)
outlier_threshold <- median(median_cor) - 2*sd(median_cor)
cat("\nPotential outliers (defined by median correlation):\n")
##
## Potential outliers (defined by median correlation):
potential_outliers <- which(median_cor < outlier_threshold)
print(colnames(expression_clean)[potential_outliers])
## [1] "X014Q.Tn" "X050B.A" "X226Q.Tn"
cat("Median correlations:", median_cor[potential_outliers], "\n")
## Median correlations: 0.7822334 0.8072393 0.7839399
For outlier detection, I generated three plots: a box plot, a density plot, and an MDS plot. For the MDS plot, I kept only genes expressed in 5 or more samples, so that the plot would be on a comprehensible scale. Looking at the plots, there are no outliers so severe that they merit immediate removal. However, the density plot does show some strange peaks. Additionally, some potential outliers are identified from the median correlations: a sample whose median correlation falls more than 2 standard deviations below the overall median is flagged as a potential outlier. On the basis of the density plot and this statistical investigation, I chose to remove these samples as outliers. The paper that this data is sourced from does not mention any outlier filtering, likely because the authors spent considerable effort filtering for data quality upstream. However, given the median correlations and density peaks of these samples, I feel confident that they are outliers, and thus chose to remove them.
Remove Outliers
# Remove the previously stated outliers
severe_outliers <- c("X014Q.Tn", "X083Q.Tn", "X226B.Tn")
expression_clean <- expression_clean[, !(colnames(expression_clean) %in% severe_outliers)]
# Find the corresponding GSM IDs and remove them from the metadata
outlier_gsm <- rownames(sample_info)[match(gsub("^X", "", gsub("\\.", "-", severe_outliers)),
                                           sample_info$title)]
sample_info <- sample_info[!(rownames(sample_info) %in% outlier_gsm), ]
# Reorder sample_info to match the expression matrix columns
clean_colnames <- gsub("^X", "", colnames(expression_clean))
clean_colnames <- gsub("\\.", "-", clean_colnames)
match_idx <- match(clean_colnames, sample_info$title)
sample_info <- sample_info[match_idx, ]
# Look at how this affected distribution
cat("Baseline samples after cleaning:", sum(sample_info$condition == "Baseline"), "\n")
## Baseline samples after cleaning: 26
cat("Followup samples after cleaning:", sum(sample_info$condition == "FollowUp"), "\n")
## Followup samples after cleaning: 25
# Density plot after outlier removal
plot(density(expression_clean[, 1]),
     main = "Sample Expression Density Plots After Outlier Removal",
     xlab = "Expression",
     ylim = c(0, 0.3),
     col = ifelse(sample_info$condition[1] == "Baseline", "blue", "red"))
for (i in 2:ncol(expression_clean)) {
  lines(density(expression_clean[, i]),
        col = ifelse(sample_info$condition[i] == "Baseline", "blue", "red"))
}
legend("topright", legend = c("Baseline", "FollowUp"),
       col = c("blue", "red"), lwd = 2)
The three samples removed did not all originate from the same condition, so the data remains as balanced as possible given an odd sample count. Additionally, the new density plot confirms that the anomalous peaks did come from the statistical outliers.
Normalization
The data provided by the study includes both raw and normalized
samples. Here, I start from the raw data, then normalize. However, I
compare this normalization to the log normalization that was done by the
paper authors.
Fetch and assess raw data
raw_file <- file.path("GSE309004", "GSE309004_NR_nac_rawcounts_NR_final.txt.gz")
raw_counts <- read.table(gzfile(raw_file),
                         header = TRUE,
                         row.names = 1,
                         sep = "\t")
raw_counts <- raw_counts[, colnames(expression_clean)]
# Match the gene sets between the raw and pre-normalized matrices
common_genes <- intersect(rownames(expression_clean), rownames(raw_counts))
raw_counts <- raw_counts[common_genes, ]
expression_prenorm <- expression_clean[common_genes, ]
Visualization of raw data
# BOXPLOT helper
plot_box <- function(mat, main = "", ylab = "Expression") {
  par(mar = c(10, 4, 4, 2))
  boxplot(mat, las = 2,
          main = main,
          ylab = ylab,
          cex.axis = 0.4,
          col = "lightblue",
          outline = FALSE)
}
# DENSITY PLOT helper
plot_density <- function(mat, main = "") {
  cols <- rainbow(ncol(mat))
  plot(density(mat[, 1]),
       main = main,
       xlab = "Expression Value",
       ylim = c(0, max(apply(mat, 2, function(x) max(density(x)$y)))),
       col = cols[1])
  for (i in 2:ncol(mat)) {
    lines(density(mat[, i]), col = cols[i])
  }
}
# Plot raw counts (log2 transformed for visualization)
plot_box(log2(raw_counts + 1),
         main = "Raw Counts Distribution",
         ylab = "Log2(counts + 1)")

plot_density(log2(raw_counts + 1),
             main = "Raw Counts Density")

Box plots of the raw data show clearly misaligned medians, which both confirms that this is raw data and emphasizes the need for normalization. I decided to use TMM normalization for this data, since it handles both library size and the compositional bias that the visualizations suggest could be present. Though this is not what the authors used, I believe it is the better choice here.
TMM normalize (edgeR)
library(edgeR)
dge <- DGEList(counts = raw_counts)
dge <- calcNormFactors(dge, method = "TMM")
cat("normalization factors:\n")
## normalization factors:
print(summary(dge$samples$norm.factors))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.8171 0.9675 0.9985 1.0032 1.0369 1.2441
tmm_normalized <- cpm(dge, log = TRUE, prior.count = 1)
I wanted to visually compare the post-normalization data to both the
raw data AND the normalized data provided from the associated study.
Comparison of
Normalization Methods
par(mfrow = c(3, 2), mar = c(10, 4, 4, 2))
plot_box(log2(raw_counts + 1),
         main = "1. Raw Counts",
         ylab = "Log2(counts + 1)")
plot_density(log2(raw_counts + 1),
             main = "1. Raw Counts Density")
plot_box(tmm_normalized,
         main = "2. TMM Normalized",
         ylab = "Log2 CPM")
plot_density(tmm_normalized,
             main = "2. TMM Normalized Density")
plot_box(expression_prenorm,
         main = "3. GEO Pre-Normalized (from original study)",
         ylab = "Log2 Expression")
plot_density(expression_prenorm,
             main = "3. GEO Pre-Normalized Density")
While both normalization methods correctly align the medians (as visible in the box plots), TMM normalization produces a less noisy density plot. Originally, I thought that the density curve from the pre-normalized data lacked the dip after the first peak, but I realized it is just hard to see beneath the other lines. To avoid over-relying on a qualitative assessment of the graphs, I also looked into some statistics behind these three datasets.
Comparison statistics
# Coefficient of variation, averaged across samples
cv_samples <- function(mat) {
  means <- colMeans(mat)
  sds <- apply(mat, 2, sd)
  mean(sds / means)
}
cat("\nCoefficient of variation:\n")
##
## Coefficient of variation:
cat("Raw counts:", round(cv_samples(log2(raw_counts + 1)), 3), "\n")
## Raw counts: 0.519
cat("TMM normalized:", round(cv_samples(tmm_normalized), 3), "\n")
## TMM normalized: 1.39
cat("Study's pre-normalized:", round(cv_samples(expression_prenorm), 3), "\n")
## Study's pre-normalized: 1.51
TMM normalization has a lower coefficient of variation than the study's pre-normalized data, which I am happy with. However, given both the graphs and the statistics, it is clear that the original study also correctly normalized their data. They did not include an explicit rationale behind their normalization choice.
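As one more quantitative check, per-sample correlations between my TMM normalization and the study's normalization can be computed. This is a sketch, assuming the two matrices share the same column order after the subsetting above (the stopifnot guards that assumption):

```r
# Per-sample Spearman correlation between TMM log-CPM and the study's
# pre-normalized values (rank-based, so the different scales do not matter)
stopifnot(identical(colnames(tmm_normalized), colnames(expression_prenorm)))
per_sample_cor <- sapply(seq_len(ncol(tmm_normalized)), function(i) {
  cor(tmm_normalized[, i], expression_prenorm[, i], method = "spearman")
})
summary(per_sample_cor)
```

High per-sample correlations would confirm that the two normalizations preserve essentially the same gene rankings within each sample.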
Final cleaned and normalized dataset
final_expression_data <- tmm_normalized
final_sample_info <- sample_info
cat("\nFinal dataset:\n")
##
## Final dataset:
cat("Genes:", nrow(final_expression_data), "\n")
## Genes: 19735
cat("Samples:", ncol(final_expression_data), "\n")
## Samples: 51
cat("Baseline samples:", sum(final_sample_info$condition == "Baseline"), "\n")
## Baseline samples: 26
cat("FollowUp samples:", sum(final_sample_info$condition == "FollowUp"), "\n")
## FollowUp samples: 25
Calculate Differential Expression
Set up, define, and fit the model. Baseline is the reference level (because we are interested in the change at follow-up).
library(edgeR)
group <- factor(sample_info$condition, levels = c("Baseline", "FollowUp"))
design <- model.matrix(~group)
colnames(design) <- c("Intercept", "FollowUp_vs_Baseline")
raw_counts <- raw_counts[, colnames(final_expression_data)]
dge_de <- DGEList(counts = raw_counts)
dge_de <- calcNormFactors(dge_de, method = "TMM")
dge_de <- estimateDisp(dge_de, design)
cat("\nBiological coefficient of variation (BCV):\n")
##
## Biological coefficient of variation (BCV):
cat("Common dispersion:", sqrt(dge_de$common.dispersion), "\n")
## Common dispersion: 0.7531688
cat("Trended dispersion range:", range(sqrt(dge_de$trended.dispersion)), "\n")
## Trended dispersion range: 0.5549423 1.172181
# FIT THE MODEL
fit <- glmQLFit(dge_de, design)
qlf <- glmQLFTest(fit, coef = 2)
# Extract results
results <- topTags(qlf, n = Inf)
results_table <- as.data.frame(results)
The trended BCV ranges from 0.55 to 1.17, indicating a reasonable and expected amount of biological variation for patient tumor samples. Dispersion estimates were also obtained for all genes, which is good.
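These dispersion estimates are easier to judge visually with edgeR's built-in diagnostic plot; a minimal sketch using the dge_de object fitted above:

```r
# Plot tagwise, trended, and common biological coefficients of variation
# against average log-CPM (standard edgeR diagnostic for the fitted object)
plotBCV(dge_de, main = "BCV Plot: FollowUp vs Baseline")
```

A trend that decreases with abundance and tagwise points scattered around it is the typical, healthy pattern for this plot.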
Set Significance Thresholds
Next, I moved on to setting significance thresholds. I decided on the following: P-value < 0.05, FDR < 0.05, and |log2 fold change| > 1, keeping the FDR threshold to control for multiple testing. These thresholds do not seem to deviate far from what is common.
# My chosen thresholds
pvalue_threshold <- 0.05
fdr_threshold <- 0.05
logfc_threshold <- 1
# Significant genes at different thresholds:
sig_pvalue <- sum(results_table$PValue < pvalue_threshold)
sig_fdr <- sum(results_table$FDR < fdr_threshold)
sig_fdr_fc <- sum(results_table$FDR < fdr_threshold & abs(results_table$logFC) > logfc_threshold)
cat("Genes at each threshold:\n")
## Genes at each threshold:
cat("P-value:", sig_pvalue, "genes\n")
## P-value: 4466 genes
cat("FDR:", sig_fdr, "genes\n")
## FDR: 1460 genes
cat("FDR and |logFC| :", sig_fdr_fc, "genes\n")
## FDR and |logFC| : 560 genes
cat("\nUpregulated (FDR < 0.05, logFC > 1):",
sum(results_table$FDR < fdr_threshold & results_table$logFC > logfc_threshold), "genes\n")
##
## Upregulated (FDR < 0.05, logFC > 1): 313 genes
cat("Downregulated (FDR < 0.05, logFC < -1):",
sum(results_table$FDR < fdr_threshold & results_table$logFC < -logfc_threshold), "genes\n")
## Downregulated (FDR < 0.05, logFC < -1): 247 genes
Gene investigation
After getting the significant genes, I wanted to check for FOS and
NR4A1, two genes that are mentioned in the paper.
# Get significant genes
sig_genes <- results_table[results_table$FDR < fdr_threshold &
                             abs(results_table$logFC) > logfc_threshold, ]
sig_genes <- sig_genes[order(sig_genes$FDR), ]
# FOS / NR4A1
if ("FOS" %in% rownames(results_table)) {
  cat("\nFOS:\n")
  print(results_table["FOS", c("logFC", "logCPM", "PValue", "FDR")])
}
##
## FOS:
## logFC logCPM PValue FDR
## FOS 3.556146 6.478184 1.341759e-12 2.647961e-08
if ("NR4A1" %in% rownames(results_table)) {
  cat("\nNR4A1:\n")
  print(results_table["NR4A1", c("logFC", "logCPM", "PValue", "FDR")])
}
##
## NR4A1:
## logFC logCPM PValue FDR
## NR4A1 2.940868 5.089892 8.653683e-12 8.539021e-08
The two genes of interest do show as significantly upregulated in my analysis. This matches what the paper found, and it also provides a good sanity check for the differential gene expression analysis.
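A quick visual sanity check on these two genes can complement the statistics; a sketch using the TMM log-CPM matrix and the group factor defined above (assuming their column/sample order still matches):

```r
# Strip charts of normalized expression for the two literature genes,
# split by condition (jittered points; one panel per gene)
par(mfrow = c(1, 2))
for (g in c("FOS", "NR4A1")) {
  stripchart(tmm_normalized[g, ] ~ group,
             vertical = TRUE, method = "jitter", pch = 16,
             main = g, ylab = "Log2 CPM")
}
```

A visible upward shift of the FollowUp points relative to Baseline would corroborate the positive logFC values reported above.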
Volcano Plot
par(mar = c(5, 5, 4, 2))
colors <- rep("gray", nrow(results_table))
colors[results_table$FDR < fdr_threshold & results_table$logFC > logfc_threshold] <- "red"
colors[results_table$FDR < fdr_threshold & results_table$logFC < -logfc_threshold] <- "blue"
plot(results_table$logFC,
     -log10(results_table$PValue),
     col = colors,
     pch = 16,
     cex = 0.6,
     xlab = "Log2 Fold Change (FollowUp vs Baseline)",
     ylab = "-Log10 P-value",
     main = "Volcano Plot: Differential Expression\nFollowUp vs Baseline")
# Add threshold lines
abline(h = -log10(0.05), col = "black", lty = 2, lwd = 2)
abline(v = c(-logfc_threshold, logfc_threshold), col = "black", lty = 2, lwd = 2)
# Add legend
legend("topright",
       legend = c(paste("Upregulated (", sum(colors == "red"), ")", sep = ""),
                  paste("Downregulated (", sum(colors == "blue"), ")", sep = ""),
                  "Not significant"),
       col = c("red", "blue", "gray"),
       pch = 16,
       cex = 1.2)
# Label top genes
top_genes <- rownames(sig_genes)[1:min(10, nrow(sig_genes))]
if (length(top_genes) > 0) {
  top_coords <- results_table[top_genes, ]
  text(top_coords$logFC,
       -log10(top_coords$PValue),
       labels = top_genes,
       pos = 4,
       cex = 0.7)
}
The volcano plot visualizes the distribution of upregulated and downregulated genes, with downregulated genes highlighted in blue and upregulated genes in red. I notice that the volcano plot is reasonably symmetrical, meaning that (1) there is general parity in the number of upregulated and downregulated genes, and (2) the log fold change at which genes become significant is generally similar on both sides. However, I do notice that the largest log fold changes all lie among downregulated outlier genes. This is probably in line with the underlying biology.
Heatmap of Top Differentially Expressed Genes
library(pheatmap)
# Top 50 genes by FDR
top_n <- 50
top_gene_names <- rownames(sig_genes)[1:min(top_n, nrow(sig_genes))]
if (length(top_gene_names) > 0) {
  heatmap_data <- final_expression_data[top_gene_names, ]
  gene_vars <- apply(heatmap_data, 1, var)
  zero_var_genes <- names(gene_vars[gene_vars == 0 | is.na(gene_vars)])
  if (length(zero_var_genes) > 0) {
    heatmap_data <- heatmap_data[!rownames(heatmap_data) %in% zero_var_genes, ]
  }
  heatmap_data_scaled <- t(scale(t(heatmap_data)))
  # Check for NAs
  if (any(is.na(heatmap_data_scaled))) {
    na_genes <- apply(heatmap_data_scaled, 1, function(x) any(is.na(x)))
    heatmap_data_scaled <- heatmap_data_scaled[!na_genes, ]
  }
  # Check for Inf values
  inf_genes <- apply(heatmap_data_scaled, 1, function(x) any(is.infinite(x)))
  heatmap_data_scaled <- heatmap_data_scaled[!inf_genes, ]
}
cat("\nFinal heatmap will show", nrow(heatmap_data_scaled), "genes\n")
##
## Final heatmap will show 50 genes
# Create annotation for samples
annotation_col <- data.frame(
  Condition = sample_info$condition,
  row.names = colnames(heatmap_data_scaled)
)
# Create annotation colors
ann_colors <- list(
  Condition = c(Baseline = "lightblue", FollowUp = "lightpink")
)
# Create heatmap
pheatmap(heatmap_data_scaled,
         annotation_col = annotation_col,
         annotation_colors = ann_colors,
         show_colnames = FALSE,
         show_rownames = TRUE,
         cluster_rows = TRUE,
         cluster_cols = TRUE,
         clustering_distance_rows = "euclidean",
         clustering_distance_cols = "euclidean",
         clustering_method = "complete",
         fontsize_row = 8,
         main = paste("Top", nrow(heatmap_data_scaled),
                      "Differentially Expressed Genes\n(Scaled Expression)"),
         color = colorRampPalette(c("blue", "white", "red"))(100))
The heatmap does show some degree of clustering. However, it seems that highly downregulated genes cluster together more tightly than highly upregulated genes. Comparing all upregulated genes against all downregulated genes, the downregulated genes separate and cluster cleanly, while the upregulated genes do cluster together but are also found near genes with no change in expression. This makes sense given the biological background of chemotherapy significantly downregulating its target genes.
Summary of Differential Expression Results
cat("Total genes analyzed:", nrow(results_table), "\n")
## Total genes analyzed: 19735
cat("Genes with FDR < 0.05:", sig_fdr, "\n")
## Genes with FDR < 0.05: 1460
cat("Genes with FDR < 0.05 AND |logFC| > 1:", sig_fdr_fc, "\n")
## Genes with FDR < 0.05 AND |logFC| > 1: 560
cat(" - Upregulated:", sum(results_table$FDR < fdr_threshold &
results_table$logFC > logfc_threshold), "\n")
## - Upregulated: 313
cat(" - Downregulated:", sum(results_table$FDR < fdr_threshold &
results_table$logFC < -logfc_threshold), "\n\n")
## - Downregulated: 247
Answers to misc. questions
What are the control and test conditions of the dataset?
Control condition: Baseline (pre-neoadjuvant chemotherapy treatment)
Test condition: FollowUp (post-neoadjuvant chemotherapy treatment)
Were there expression values that were not unique for specific genes? How did you handle these?
No, there were no duplicate gene symbols in this dataset. However, I left in code for duplicate handling.
Were there expression values that could not be mapped to current HUGO symbols?
Yes, 551 genes (2.72%) could not be mapped to current HUGO symbols. These were removed from the analysis. Reasons for failing to map included outdated gene symbols (like "C10orf82" or "C11orf1") and deprecated identifiers (like "ago-01", "ago-02", "ago-03").
How many outliers were removed?
3 outlier samples were removed from the analysis.
- X014Q.Tn (FollowUp sample)
- X083Q.Tn (FollowUp sample)
- X226B.Tn (Baseline sample)
In the outlier removal section, I elaborate further on the
rationale.
How did you handle replicates?
Each sample was treated as an independent observation, with no averaging or combining. Inherent to the design of the study is that each sample is paired (a baseline and a follow-up sample from the same patient).
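Given that pairing, one alternative worth noting is a paired additive model, which tests the follow-up effect within patients rather than across the whole cohort. This is only a sketch, assuming the patient ID can be parsed as the leading digits of sample_info$title; it was not used in the analysis above.

```r
# Paired design sketch: block on patient, test timepoint within patients
# (hypothetical alternative to the unpaired ~group design used above)
patient <- factor(sub("^([0-9]+).*$", "\\1", sample_info$title))
design_paired <- model.matrix(~ patient + group)
# The last coefficient would then be the within-patient FollowUp effect:
# dge_p <- estimateDisp(DGEList(counts = raw_counts), design_paired)
# qlf_p <- glmQLFTest(glmQLFit(dge_p, design_paired), coef = ncol(design_paired))
```

Note that after outlier removal some patients may be left with a single sample, which would need to be handled before fitting such a model.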
What is the final coverage of your dataset?
19,735 genes across 51 samples (as reported in the final dataset summary above).
References
R Packages
GEOquery: Davis S, Meltzer PS (2007). GEOquery: a bridge between the
Gene Expression Omnibus (GEO) and BioConductor. Bioinformatics, 23(14),
1846-1847. doi:10.1093/bioinformatics/btm254
edgeR: Robinson MD, McCarthy DJ, Smyth GK (2010). edgeR: a
Bioconductor package for differential expression analysis of digital
gene expression data. Bioinformatics, 26(1), 139-140. doi:10.1093/bioinformatics/btp616
limma: Ritchie ME, Phipson B, Wu D, Hu Y, Law CW, Shi W, Smyth GK
(2015). limma powers differential expression analyses for RNA-sequencing
and microarray studies. Nucleic Acids Research, 43(7), e47. doi:10.1093/nar/gkv007
biomaRt: Durinck S, Spellman PT, Birney E, Huber W (2009). Mapping
identifiers for the integration of genomic datasets with the
R/Bioconductor package biomaRt. Nature Protocols, 4, 1184-1191. doi:10.1038/nprot.2009.97
org.Hs.eg.db: Carlson M (2024). org.Hs.eg.db: Genome wide annotation
for Human. R package version 3.19.1.
ggplot2: Wickham H (2016). ggplot2: Elegant Graphics for Data
Analysis. Springer-Verlag New York. ISBN 978-3-319-24277-4, https://ggplot2.tidyverse.org
pheatmap: Kolde R (2019). pheatmap: Pretty Heatmaps. R package
version 1.0.12, https://CRAN.R-project.org/package=pheatmap
knitr: Xie Y (2024). knitr: A General-Purpose Package for Dynamic
Report Generation in R. R package version 1.48, https://yihui.org/knitr/
R Core Team (2024). R: A language and environment for statistical
computing. R Foundation for Statistical Computing, Vienna, Austria. URL
https://www.R-project.org/
BiocManager: Morgan M (2024). BiocManager: Access the Bioconductor Project Package Repository. R package version 1.30.25, https://CRAN.R-project.org/package=BiocManager
Primary Data Source
Guevara-Nieto HM, Orozco-Castaño CA, Parra-Medina R, Poveda-Garavito
JN, Garai J, Zabaleta J, López-Kleine L, Combita AL (2025). Molecular
determinants of neoadjuvant chemotherapy resistance in breast cancer: An
analysis of gene expression and tumor microenvironment. PLoS One 20(10):
e0334335. https://doi.org/10.1371/journal.pone.0334335
GEO Accession: GSE309004