Dataset: series GSE309004

Why is the dataset of interest to you? What are the control and test conditions of the dataset?: This dataset is of interest because it focuses on breast cancer, a widespread and extensively studied disease. The experiment is topical, addressing chemotherapy resistance, which is a major challenge in cancer treatment. Most of my interest, however, stems from the high quality of the dataset. I hope that, given the quality and volume of the data, interesting patterns can be found in the expression profiles.

# Should not need to install packages, but good practice to include
# This is within BCB420 Docker image
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

packages <- c("GEOquery", "edgeR", "biomaRt", "ggplot2", "pheatmap", 
              "knitr", "limma", "ComplexHeatmap")

for (pkg in packages) {
  if (!require(pkg, character.only = TRUE)) {
    BiocManager::install(pkg)
    library(pkg, character.only = TRUE)
  }
}

knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)

1 Data Download

geo_id <- "GSE309004"

# Get GEO object using class suggested method
if(!file.exists(paste0(geo_id, "_metadata.Rdata"))){
  gse <- getGEO(geo_id, GSEMatrix = TRUE, getGPL = FALSE)
  save(gse, file = paste0(geo_id, "_metadata.Rdata"))
} else {
  load(paste0(geo_id, "_metadata.Rdata"))
}
sample_info <- pData(gse[[1]])

# They also provide supplementary files! We will download these too
supp_files <- getGEOSuppFiles(geo_id, makeDirectory = TRUE, 
                               baseDir = ".", fetch_files = TRUE)

# For reference, show the files we got
list.files(geo_id)
## [1] "GSE309004_NR_nac_Normalizedcounts_final.txt.gz"
## [2] "GSE309004_NR_nac_rawcounts_NR_final.txt.gz"

2 Data Assessment

files <- list.files(geo_id, full.names = TRUE)

# The first file is the study's normalized counts; the raw counts file is used later
count_file <- files[1]
expression_data <- read.table(count_file, 
                                  header = TRUE, 
                                  row.names = 1,
                                  sep = "\t")
# View some verification info about the data 
dim(expression_data)
## [1] 20287    54
head(expression_data[, 1:5])
##            X013Q_Bn   X014B.Tn  X014Q.Tn    X050B.A    X050Q.A
## A1BG      2.1642486  3.0334871  3.731225  2.9118184  3.2262792
## A1BG-AS1 -0.3607066  0.7656241 -1.126756  1.0968503 -0.4386826
## A1CF     -3.0452047 -2.8699644 -1.126756 -2.4356447 -2.3646820
## A2M       9.5246170 10.0651743  6.357060  7.9647773  8.6227241
## A2M-AS1   2.3421972  2.6669765 -1.126756  1.3863570  1.4219143
## A2ML1    -0.8983634  7.2539722  4.427833 -0.9950722 -0.4386826

3 Missing Values and Summary Statistics

# Check for missing values, defined as NA
cat("Missing values:", sum(is.na(expression_data)), "\n")
## Missing values: 0
# Take a look at a few samples
summary(expression_data[,1:5])
##     X013Q_Bn           X014B.Tn          X014Q.Tn         X050B.A      
##  Min.   :-5.85256   Min.   :-5.6773   Min.   :-1.127   Min.   :-5.243  
##  1st Qu.: 0.07818   1st Qu.: 0.3888   1st Qu.:-1.127   1st Qu.:-0.385  
##  Median : 3.21084   Median : 3.4901   Median : 2.780   Median : 2.962  
##  Mean   : 2.53313   Mean   : 2.7406   Mean   : 2.636   Mean   : 2.246  
##  3rd Qu.: 5.10961   3rd Qu.: 5.3485   3rd Qu.: 5.023   3rd Qu.: 5.001  
##  Max.   :15.91502   Max.   :15.0372   Max.   :15.165   Max.   :16.539  
##     X050Q.A       
##  Min.   :-4.6866  
##  1st Qu.:-0.4387  
##  Median : 3.0882  
##  Mean   : 2.3060  
##  3rd Qu.: 5.1336  
##  Max.   :15.9869

4 Gene Mapping to HUGO Symbols

4.1 Current Gene Identifiers

# Take a look at the current state of the identifiers
head(rownames(expression_data), 20)
##  [1] "A1BG"     "A1BG-AS1" "A1CF"     "A2M"      "A2M-AS1"  "A2ML1"   
##  [7] "A4GALT"   "A4GNT"    "AAAS"     "AACS"     "AADAC"    "AADACL2" 
## [13] "AADAT"    "AAGAB"    "AAK1"     "AAMDC"    "AAMP"     "AANAT"   
## [19] "AAR2"     "AARD"
# Count total genes
cat("Total genes:", nrow(expression_data), "\n")
## Total genes: 20287
# Suspected HUGO pattern: uppercase alphanumeric characters, plus hyphens
hugo_pattern <- "^[A-Z0-9-]+$"
likely_hugo <- sum(grepl(hugo_pattern, rownames(expression_data)))
cat("Genes matching HUGO *pattern* (but not confirmed):", likely_hugo, "\n")
## Genes matching HUGO *pattern* (but not confirmed): 19997

Before moving on, I wanted to make sure that the current gene identifiers made sense. For this, I checked how many gene names consist of uppercase alphanumeric characters (and hyphens), since such names could already be HUGO identifiers. The vast majority of genes match this pattern, so I moved on to actually validating the identifiers.

4.2 Verify Gene Symbol Validity

# Found this method online as an alternative to web requests (for time's sake)
if (!require("org.Hs.eg.db")) {
  BiocManager::install("org.Hs.eg.db")
  library(org.Hs.eg.db)
}

# We can just grab the valid hugo symbols from this database!
valid_symbols <- keys(org.Hs.eg.db, keytype = "SYMBOL")

# Then cross-check this with our data
mapped_genes <- rownames(expression_data) %in% valid_symbols
unmapped_genes <- rownames(expression_data)[!mapped_genes]

cat("Genes that are valid HUGO symbols:", sum(mapped_genes), "\n")
## Genes that are valid HUGO symbols: 19736
cat("Genes that are NOT valid HUGO symbols:", length(unmapped_genes), "\n")
## Genes that are NOT valid HUGO symbols: 551
cat("Which is a rate of :", round(sum(mapped_genes)/nrow(expression_data)*100, 2), "%\n")
## Which is a rate of : 97.28 %

With about 97% of genes mapped, the vast majority of identifiers are well-formed HUGO symbols, so the remaining genes can be treated as a small minority of cases when handling the data.

4.3 Unmapped and Duplicate Genes

# First, print out some information about unmapped and duplicate genes
cat("Number of unmapped genes:\n")
## Number of unmapped genes:
print(length(unmapped_genes))
## [1] 551
duplicate_symbols <- names(table(rownames(expression_data)))[table(rownames(expression_data)) > 1]
cat("\nDuplicate gene symbols:", length(duplicate_symbols), "\n")
## 
## Duplicate gene symbols: 0

Some unmapped genes, but no duplicates! Also, though there are 551 unmapped genes, this doesn’t actually comprise a huge proportion of the data. For data cleaning, I will only keep genes that map to valid HUGO symbols. I will also include code for removing duplicates, though this will not result in any change.

4.4 Clean Expression Data

# Keep only rows with valid HUGO symbols
expression_clean <- expression_data[mapped_genes, ]

# This block will not run here (no duplicates found), but is kept for completeness
if(length(duplicate_symbols) > 0){
  for(dup_gene in duplicate_symbols){
    dup_rows <- which(rownames(expression_clean) == dup_gene)
    if(length(dup_rows) > 1){
      variances <- apply(expression_clean[dup_rows, ], 1, var)
      # We WOULD want to keep exactly one row, the one with highest variance
      keep_row <- dup_rows[which.max(variances)]
      remove_rows <- dup_rows[dup_rows != keep_row]
      expression_clean <- expression_clean[-remove_rows, ]
    }
  }
}

cat("Final gene count:", nrow(expression_clean), "\n")
## Final gene count: 19736

The total gene count has barely decreased through cleaning, which is good. Next, I move on to some reordering and renaming. Note that the sample titles use "B" for baseline samples and "Q" for follow-up samples.

4.5 Sample Information

# Make sure to match expression columns to sample_info rows
clean_colnames <- gsub("^X", "", colnames(expression_clean))
clean_colnames <- gsub("\\.", "-", clean_colnames)
match_idx <- match(clean_colnames, sample_info$title)
# IMPORTANT! Reorder sample_info to match expression_clean! 
sample_info_ordered <- sample_info[match_idx, ]

# This needs to print TRUE
cat("Have all the samples been matched? :", all(!is.na(match_idx)), "\n")
## Have all the samples been matched? : TRUE
# Assign conditions based on B or Q in title (explained above)
sample_info_ordered$condition <- ifelse(grepl("B", sample_info_ordered$title) & 
                                         !grepl("Q", sample_info_ordered$title), 
                                         "Baseline", "FollowUp")


sample_info <- sample_info_ordered

#Verify info
cat("Number of baseline samples:", sum(sample_info$condition == "Baseline"), "\n")
## Number of baseline samples: 27
cat("Number of followup samples:", sum(sample_info$condition == "FollowUp"), "\n")
## Number of followup samples: 27

As verified by the printed output, all samples were successfully matched, and there is parity between the number of baseline and follow-up samples.

5 Outlier Detection

# BOXPLOT 
par(mfrow = c(1, 1), mar = c(10, 4, 4, 2))
boxplot(expression_clean, las = 2, 
        main = "Expression Distributions",
        cex.axis = 0.4, 
        col = ifelse(sample_info$condition == "Baseline", "lightblue", "lightpink"),
        ylab = "Expression",
        outline = FALSE)
legend("topright", legend = c("Baseline", "FollowUp"), 
       fill = c("lightblue", "lightpink"))

# DENSITY PLOT
plot(density(expression_clean[,1]), 
     main = "Sample Expression Density Plots",
     xlab = "Expression",
     ylim = c(0, 0.3),
     col = ifelse(sample_info$condition[1] == "Baseline", "blue", "red"))

for(i in 2:ncol(expression_clean)){
  lines(density(expression_clean[,i]), 
        col = ifelse(sample_info$condition[i] == "Baseline", "blue", "red"))
}
legend("topright", legend = c("Baseline", "FollowUp"), 
       col = c("blue", "red"), lwd = 2)

# MDS PLOT
# Filter low-expressed genes (as in paper)
keep_genes <- rowSums(expression_clean > 0) >= 5
expression_filtered <- expression_clean[keep_genes, ]

par(mar = c(5, 4, 4, 2))
plotMDS(expression_filtered, 
        col = ifelse(sample_info$condition == "Baseline", "blue", "red"),
        pch = 16, 
        main = "MDS Plot",
        cex = 1.2)
legend("topright", legend = c("Baseline", "FollowUp"), 
       col = c("blue", "red"), pch = 16)

# View some info about correlation
sample_cor <- cor(expression_filtered)
cat("Min correlation:", min(sample_cor[lower.tri(sample_cor)], na.rm = TRUE), "\n")
## Min correlation: 0.7195318
cat("Max correlation:", max(sample_cor[lower.tri(sample_cor)], na.rm = TRUE), "\n")
## Max correlation: 0.9657983
cat("Mean correlation:", mean(sample_cor[lower.tri(sample_cor)], na.rm = TRUE), "\n")
## Mean correlation: 0.8481694
# Use median correlation to identify outlier candidates (to investigate by hand)
median_cor <- apply(sample_cor, 1, median)
outlier_threshold <- median(median_cor) - 2*sd(median_cor)

cat("\nPotential outliers (defined by median correlation):\n")
## 
## Potential outliers (defined by median correlation):
potential_outliers <- which(median_cor < outlier_threshold)

print(colnames(expression_clean)[potential_outliers])
## [1] "X014Q.Tn" "X050B.A"  "X226Q.Tn"
cat("Median correlations:", median_cor[potential_outliers], "\n")
## Median correlations: 0.7822334 0.8072393 0.7839399

For outlier detection, I generated three plots: a box plot, a density plot, and an MDS plot. For the MDS plot, I kept only genes expressed in 5 or more samples, so that the plot stays on a comprehensible scale. None of the plots show outliers so extreme that they merit immediate removal. However, the density plot does show some strange peaks, and the statistical check above flags samples whose median correlation falls more than 2 standard deviations below the overall median as potential outliers. On the basis of the density plot and this statistical investigation, I chose to remove these samples as outliers. The paper that this data is sourced from does not mention any outlier filtering, likely because the authors invested heavily in upstream quality filtering. However, given the low median correlations and anomalous density peaks of these samples, I feel confident that they are outliers, and thus choose to remove them.

5.1 Remove Outliers

# Remove the previously stated outliers

severe_outliers <- c("X014Q.Tn", "X083Q.Tn", "X226B.Tn")
expression_clean <- expression_clean[, !(colnames(expression_clean) %in% severe_outliers)]

# Need to find GSM IDs and remove
outlier_gsm <- rownames(sample_info)[match(gsub("^X", "", gsub("\\.", "-", severe_outliers)), 
                                            sample_info$title)]
sample_info <- sample_info[!(rownames(sample_info) %in% outlier_gsm), ]
# Reorder sample_info
clean_colnames <- gsub("^X", "", colnames(expression_clean))
clean_colnames <- gsub("\\.", "-", clean_colnames)
match_idx <- match(clean_colnames, sample_info$title)
sample_info <- sample_info[match_idx, ]

# Look at how this affected distribution
cat("Baseline samples after cleaning:", sum(sample_info$condition == "Baseline"), "\n")
## Baseline samples after cleaning: 26
cat("Followup samples after cleaning:", sum(sample_info$condition == "FollowUp"), "\n")
## Followup samples after cleaning: 25
# Density plot after outlier removal
plot(density(expression_clean[,1]), 
     main = "Sample Expression Density Plots After Outlier Removal",
     xlab = "Expression",
     ylim = c(0, 0.3),
     col = ifelse(sample_info$condition[1] == "Baseline", "blue", "red"))

for(i in 2:ncol(expression_clean)){
  lines(density(expression_clean[,i]), 
        col = ifelse(sample_info$condition[i] == "Baseline", "blue", "red"))
}

legend("topright", legend = c("Baseline", "FollowUp"), 
       col = c("blue", "red"), lwd = 2)

The three samples removed did not all originate from the same condition, so the dataset remains as balanced as possible given an odd sample count. Additionally, the new density plot confirms that the anomalous peaks did come from the statistical outliers.

6 Normalization

The data provided by the study includes both raw and normalized samples. Here, I start from the raw data, then normalize. However, I compare this normalization to the log normalization that was done by the paper authors.

6.1 Fetch and assess raw data

raw_file <- file.path("GSE309004", "GSE309004_NR_nac_rawcounts_NR_final.txt.gz")

raw_counts <- read.table(gzfile(raw_file), 
                           header = TRUE, 
                           row.names = 1,
                           sep = "\t")
raw_counts <- raw_counts[, colnames(expression_clean)]
# Match gene sets between the normalized and raw matrices
common_genes <- intersect(rownames(expression_clean), rownames(raw_counts))
raw_counts <- raw_counts[common_genes, ]
expression_prenorm <- expression_clean[common_genes, ]

6.2 Visualization of raw data

# BOXPLOT
plot_box <- function(mat, main = "", ylab = "Expression") {
  par(mar = c(10, 4, 4, 2))
  boxplot(mat, las = 2,
          main = main,
          ylab = ylab,
          cex.axis = 0.4,
          col = "lightblue",
          outline = FALSE)
}

# DENSITY PLOT
plot_density <- function(mat, main = "") {
  plot(density(mat[,1]), 
       main = main,
       xlab = "Expression Value",
       ylim = c(0, max(apply(mat, 2, function(x) max(density(x)$y)))),
       col = rainbow(ncol(mat))[1])
  
  for(i in 2:ncol(mat)){
    lines(density(mat[,i]), col = rainbow(ncol(mat))[i])
  }
}

# Plot raw counts (log2 transformed for visualization)
plot_box(log2(raw_counts + 1), 
         main = "Raw Counts Distribution",
         ylab = "Log2(counts + 1)")

plot_density(log2(raw_counts + 1), 
             main = "Raw Counts Density")

Box plots of the raw data show clearly misaligned medians, which both confirms that this is raw data and emphasizes the need for normalization. I decided to use TMM normalization for this data, since it handles both library size and the compositional bias suggested by the visualizations. Though this is not what the authors used, I think it is the right choice for this data.
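The compositional-bias problem that TMM addresses can be illustrated with a small base-R sketch. This is a toy illustration of the idea only, not edgeR's actual algorithm (which trims on both log-ratios and absolute intensity and applies precision weights); all numbers here are made up.

```r
# Toy illustration of the idea behind TMM (NOT edgeR's exact algorithm):
# two samples with identical expression except a handful of genes that
# dominate sample B's library, biasing naive library-size scaling.
set.seed(1)
common <- rpois(1000, lambda = 100)   # genes shared equally between samples
a <- c(common, rep(100, 5))           # sample A: nothing unusual
b <- c(common, rep(50000, 5))         # sample B: 5 dominant genes

# Naive library-size scaling is inflated by the 5 dominant genes
naive_factor <- sum(b) / sum(a)

# A trimmed mean of per-gene log-ratios (dropping the most extreme 5%
# on each side) discards the dominant genes and recovers a factor
# near 1, reflecting the genes the samples truly share
ratios <- log2(b / a)
trimmed_factor <- 2^mean(ratios, trim = 0.05)

naive_factor    # well above 1
trimmed_factor  # approximately 1
```

This is why TMM factors (shown in the next section) hover around 1 even when library compositions differ.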

6.3 TMM normalize (edgeR)

library(edgeR)

dge <- DGEList(counts = raw_counts)
dge <- calcNormFactors(dge, method = "TMM")

cat("normalization factors:\n")
## normalization factors:
print(summary(dge$samples$norm.factors))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.8171  0.9675  0.9985  1.0032  1.0369  1.2441
tmm_normalized <- cpm(dge, log = TRUE, prior.count = 1)

I wanted to visually compare the post-normalization data to both the raw data AND the normalized data provided from the associated study.

6.4 Comparison of Normalization Methods

par(mfrow = c(3, 2), mar = c(10, 4, 4, 2))

plot_box(log2(raw_counts + 1), 
         main = "1. Raw Counts",
         ylab = "Log2(counts + 1)")
plot_density(log2(raw_counts + 1), 
             main = "1. Raw Counts Density")
plot_box(tmm_normalized, 
         main = "2. TMM Normalized",
         ylab = "Log2 CPM")
plot_density(tmm_normalized, 
             main = "2. TMM Normalized Density")
plot_box(expression_prenorm, 
         main = "3. GEO Pre-Normalized (from original study)",
         ylab = "Log2 Expression")
plot_density(expression_prenorm, 
             main = "3. GEO Pre-Normalized Density")

While both normalization methods correctly align the medians (as visible in the box plots), TMM normalization produces a less noisy density graph. Originally, I thought that the density graph from the pre-normalized data lacked a dip after the first peak, but I realized it is just hard to see because of the other lines drawn over it. To avoid overreliance on qualitative assessment of graphs, I also looked into some statistics behind these three datasets.

6.5 Comparison statistics

# Coefficient of variation
cv_samples <- function(mat) {
  means <- colMeans(mat)
  sds <- apply(mat, 2, sd)
  mean(sds / means)
}

cat("\nCoefficient of variation:\n")
## 
## Coefficient of variation:
cat("Raw counts:", round(cv_samples(log2(raw_counts + 1)), 3), "\n")
## Raw counts: 0.519
cat("TMM normalized:", round(cv_samples(tmm_normalized), 3), "\n")
## TMM normalized: 1.39
cat("Study's pre-normalized:", round(cv_samples(expression_prenorm), 3), "\n")
## Study's pre-normalized: 1.51

TMM normalization has a lower coefficient of variation, which I am happy with. However, given both the graphs and statistics, it is clear that the original study also correctly normalized their data. They did not include an explicit rationale behind their normalization choice.

6.6 Final cleaned and normalized dataset

final_expression_data <- tmm_normalized
final_sample_info <- sample_info

cat("\nFinal dataset:\n")
## 
## Final dataset:
cat("Genes:", nrow(final_expression_data), "\n")
## Genes: 19735
cat("Samples:", ncol(final_expression_data), "\n")
## Samples: 51
cat("Baseline samples:", sum(final_sample_info$condition == "Baseline"), "\n")
## Baseline samples: 26
cat("FollowUp samples:", sum(final_sample_info$condition == "FollowUp"), "\n")
## FollowUp samples: 25

7 Calculate Differential Expression

Set up, define, and fit the model. Baseline is the reference level, because we are interested in the changes at follow-up.

library(edgeR)

group <- factor(sample_info$condition, levels = c("Baseline", "FollowUp"))
design <- model.matrix(~group)
colnames(design) <- c("Intercept", "FollowUp_vs_Baseline")
raw_counts <- raw_counts[, colnames(final_expression_data)]
dge_de <- DGEList(counts = raw_counts)
dge_de <- calcNormFactors(dge_de, method = "TMM")
dge_de <- estimateDisp(dge_de, design)

cat("\nBiological coefficient of variation (BCV):\n")
## 
## Biological coefficient of variation (BCV):
cat("Common dispersion:", sqrt(dge_de$common.dispersion), "\n")
## Common dispersion: 0.7531688
cat("Trended dispersion range:", range(sqrt(dge_de$trended.dispersion)), "\n")
## Trended dispersion range: 0.5549423 1.172181
# FIT THE MODEL 
fit <- glmQLFit(dge_de, design)
qlf <- glmQLFTest(fit, coef = 2)

#Extract results
results <- topTags(qlf, n = Inf)
results_table <- as.data.frame(results)

The trended BCV range of 0.55–1.17 indicates a reasonable, expected amount of biological variation for patient-derived samples, and all genes were carried through the dispersion estimation and fit, which is good.
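To build intuition for what a common BCV around 0.75 means, here is a small base-R simulation, separate from the edgeR pipeline above (the gene mean and replicate count are assumed values for illustration): for negative binomial counts, CV² = 1/μ + dispersion, so for well-expressed genes the between-replicate coefficient of variation approaches the BCV.

```r
# Illustrative simulation only (assumed values, not part of the edgeR fit):
# for NB counts, CV^2 = 1/mu + dispersion, so at high expression the
# coefficient of variation across replicates approaches sqrt(dispersion).
set.seed(42)
bcv <- 0.75                              # roughly the common BCV estimated above
dispersion <- bcv^2
mu <- 5000                               # a hypothetical well-expressed gene
counts <- rnbinom(1e5, mu = mu, size = 1 / dispersion)
observed_cv <- sd(counts) / mean(counts)
observed_cv  # close to 0.75
```

In other words, a BCV of 0.75 says a highly expressed gene's counts vary by roughly 75% between biological replicates, which is plausible for heterogeneous patient tumors.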

7.1 Set Significance Thresholds

Next, I moved on to setting significance thresholds. I decided on the following: P-value < 0.05, FDR < 0.05, and |log2 fold change| > 1, keeping FDR to control for multiple testing. These thresholds do not deviate far from what is commonly used.

# my decided on thresholds
pvalue_threshold <- 0.05
fdr_threshold <- 0.05
logfc_threshold <- 1

# Significant genes at different thresholds:
sig_pvalue <- sum(results_table$PValue < pvalue_threshold)
sig_fdr <- sum(results_table$FDR < fdr_threshold)
sig_fdr_fc <- sum(results_table$FDR < fdr_threshold & abs(results_table$logFC) > logfc_threshold)

cat("Genes at each threshold:\n")
## Genes at each threshold:
cat("P-value:", sig_pvalue, "genes\n")
## P-value: 4466 genes
cat("FDR:", sig_fdr, "genes\n")
## FDR: 1460 genes
cat("FDR and |logFC| :", sig_fdr_fc, "genes\n")
## FDR and |logFC| : 560 genes
cat("\nUpregulated (FDR < 0.05, logFC > 1):", 
    sum(results_table$FDR < fdr_threshold & results_table$logFC > logfc_threshold), "genes\n")
## 
## Upregulated (FDR < 0.05, logFC > 1): 313 genes
cat("Downregulated (FDR < 0.05, logFC < -1):", 
    sum(results_table$FDR < fdr_threshold & results_table$logFC < -logfc_threshold), "genes\n")
## Downregulated (FDR < 0.05, logFC < -1): 247 genes

7.2 Gene investigation

After getting the significant genes, I wanted to check for FOS and NR4A1, two genes that are mentioned in the paper.

# Get significant genes
sig_genes <- results_table[results_table$FDR < fdr_threshold & 
                            abs(results_table$logFC) > logfc_threshold, ]
sig_genes <- sig_genes[order(sig_genes$FDR), ]

# FOS / NR4A1 
if("FOS" %in% rownames(results_table)){
  cat("\nFOS:\n")
  print(results_table["FOS", c("logFC", "logCPM", "PValue", "FDR")])
}
## 
## FOS:
##        logFC   logCPM       PValue          FDR
## FOS 3.556146 6.478184 1.341759e-12 2.647961e-08
if("NR4A1" %in% rownames(results_table)){
  cat("\nNR4A1:\n")
  print(results_table["NR4A1", c("logFC", "logCPM", "PValue", "FDR")])
}
## 
## NR4A1:
##          logFC   logCPM       PValue          FDR
## NR4A1 2.940868 5.089892 8.653683e-12 8.539021e-08

The two genes of interest do show as significantly upregulated in my analysis. This matches what the paper found, and it also provides a good sanity check for the differential gene expression analysis.

7.3 Volcano Plot

par(mar = c(5, 5, 4, 2))
colors <- rep("gray", nrow(results_table))
colors[results_table$FDR < fdr_threshold & results_table$logFC > logfc_threshold] <- "red"
colors[results_table$FDR < fdr_threshold & results_table$logFC < -logfc_threshold] <- "blue"

plot(results_table$logFC, 
     -log10(results_table$PValue),
     col = colors,
     pch = 16,
     cex = 0.6,
     xlab = "Log2 Fold Change (FollowUp vs Baseline)",
     ylab = "-Log10 P-value",
     main = "Volcano Plot: Differential Expression\nFollowUp vs Baseline")

# Add threshold lines
abline(h = -log10(0.05), col = "black", lty = 2, lwd = 2)
abline(v = c(-logfc_threshold, logfc_threshold), col = "black", lty = 2, lwd = 2)

# Add legend
legend("topright", 
       legend = c(paste("Upregulated (", sum(colors == "red"), ")", sep=""),
                  paste("Downregulated (", sum(colors == "blue"), ")", sep=""),
                  "Not significant"),
       col = c("red", "blue", "gray"),
       pch = 16,
       cex = 1.2)

# Label top genes
top_genes <- rownames(sig_genes)[1:min(10, nrow(sig_genes))]
if(length(top_genes) > 0){
  top_coords <- results_table[top_genes, ]
  text(top_coords$logFC, 
       -log10(top_coords$PValue),
       labels = top_genes,
       pos = 4,
       cex = 0.7)
}

The volcano plot visualizes the distribution of differentially expressed genes, with downregulated genes highlighted in blue and upregulated genes highlighted in red. The plot is reasonably symmetrical, meaning that (1) there is general parity in the number of upregulated and downregulated genes, and (2) the fold-change magnitude required for significance is similar on both sides. However, the largest log fold changes all lie among downregulated outliers, which plausibly reflects strong chemotherapy-driven suppression of specific genes.

7.4 Heatmap of Top Differentially Expressed Genes

library(pheatmap)

# top 50 genes
top_n <- 50
top_gene_names <- rownames(sig_genes)[1:min(top_n, nrow(sig_genes))]

if(length(top_gene_names) > 0){
  heatmap_data <- final_expression_data[top_gene_names, ]
  gene_vars <- apply(heatmap_data, 1, var)
  zero_var_genes <- names(gene_vars[gene_vars == 0 | is.na(gene_vars)])
  
  if(length(zero_var_genes) > 0){
    heatmap_data <- heatmap_data[!rownames(heatmap_data) %in% zero_var_genes, ]
  }
  
  # Z-score each gene across samples
  heatmap_data_scaled <- t(scale(t(heatmap_data)))
  
  # Drop any genes left with NA values after scaling
  if(any(is.na(heatmap_data_scaled))){
    na_genes <- apply(heatmap_data_scaled, 1, function(x) any(is.na(x)))
    heatmap_data_scaled <- heatmap_data_scaled[!na_genes, ]
  }
  
  # Drop any genes left with Inf values after scaling
  inf_genes <- apply(heatmap_data_scaled, 1, function(x) any(is.infinite(x)))
  heatmap_data_scaled <- heatmap_data_scaled[!inf_genes, ]
}

cat("\nFinal heatmap will show", nrow(heatmap_data_scaled), "genes\n")
## 
## Final heatmap will show 50 genes
# Create annotation for samples
annotation_col <- data.frame(
  Condition = sample_info$condition,
  row.names = colnames(heatmap_data_scaled)
)

# Create annotation colors
ann_colors <- list(
  Condition = c(Baseline = "lightblue", FollowUp = "lightpink")
)

# Create heatmap
pheatmap(heatmap_data_scaled,
         annotation_col = annotation_col,
         annotation_colors = ann_colors,
         show_colnames = FALSE,
         show_rownames = TRUE,
         cluster_rows = TRUE,
         cluster_cols = TRUE,
         clustering_distance_rows = "euclidean",
         clustering_distance_cols = "euclidean",
         clustering_method = "complete",
         fontsize_row = 8,
         main = paste("Top", nrow(heatmap_data_scaled), 
                     "Differentially Expressed Genes\n(Scaled Expression)"),
         color = colorRampPalette(c("blue", "white", "red"))(100))

The heatmap does show some degree of clustering, though highly downregulated genes cluster together more tightly than highly upregulated ones. Overall, the downregulated genes separate into distinct clusters, while the upregulated genes cluster together but also sit near genes with little change in expression. This makes sense given the biological background of chemotherapy significantly downregulating its target genes.

7.5 Summary of Differential Expression Results

cat("Total genes analyzed:", nrow(results_table), "\n")
## Total genes analyzed: 19735
cat("Genes with FDR < 0.05:", sig_fdr, "\n")
## Genes with FDR < 0.05: 1460
cat("Genes with FDR < 0.05 AND |logFC| > 1:", sig_fdr_fc, "\n")
## Genes with FDR < 0.05 AND |logFC| > 1: 560
cat("  - Upregulated:", sum(results_table$FDR < fdr_threshold & 
                             results_table$logFC > logfc_threshold), "\n")
##   - Upregulated: 313
cat("  - Downregulated:", sum(results_table$FDR < fdr_threshold & 
                               results_table$logFC < -logfc_threshold), "\n\n")
##   - Downregulated: 247

8 Answers to misc. questions

8.1 What are the control and test conditions of the dataset?

Control condition: Baseline (pre-neoadjuvant chemotherapy treatment)
Test condition: FollowUp (post-neoadjuvant chemotherapy treatment)

8.2 Were there expression values that were not unique for specific genes? How did you handle these?

No, there were not any duplicates in this dataset. However, I left in code for duplicate handling.

8.3 Were there expression values that could not be mapped to current HUGO symbols?

Yes, 551 genes (2.72%) could not be mapped to current HUGO symbols and were removed from the analysis. Reasons for being unmapped included outdated gene symbols (like "C10orf82" or "C11orf1") and deprecated identifiers (like "ago-01", "ago-02", "ago-03").
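A possible refinement (not applied here) would be to rescue outdated symbols by mapping them through an alias table before dropping them; org.Hs.eg.db supports this via `AnnotationDbi::select(..., keytype = "ALIAS", columns = "SYMBOL")`. A minimal base-R sketch of the lookup step, using a hypothetical two-row alias table in place of the real database query:

```r
# Sketch of an alias-rescue step (not applied in this analysis).
# The mapping table here is HYPOTHETICAL; in practice it would come from
# AnnotationDbi::select(org.Hs.eg.db, keys = unmapped_genes,
#                       keytype = "ALIAS", columns = "SYMBOL")
alias_table <- data.frame(
  alias  = c("OLDSYM1", "OLDSYM2"),
  symbol = c("CURSYM1", "CURSYM2"),
  stringsAsFactors = FALSE
)

rescue_symbols <- function(genes, aliases) {
  # Replace any gene found in the alias column with its current symbol;
  # leave everything else untouched
  idx <- match(genes, aliases$alias)
  ifelse(is.na(idx), genes, aliases$symbol[idx])
}

rescue_symbols(c("OLDSYM1", "TP53"), alias_table)
# "CURSYM1" "TP53"
```

One caveat with real alias tables is that a single alias can map to multiple current symbols, so ambiguous hits would need to be dropped or resolved manually.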

8.4 How many outliers were removed?

3 outlier samples were removed from the analysis.

  1. X014Q.Tn (FollowUp sample)
  2. X083Q.Tn (FollowUp sample)
  3. X226B.Tn (Baseline sample)

In the outlier removal section, I elaborate further on the rationale.

8.5 How did you handle replicates?

Each sample was treated as an independent observation, with no averaging or combining. Inherent to the design of the study is that each sample is paired (baseline and follow-up from the same patient).
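Since each patient contributes both a baseline and a follow-up sample, a paired design that blocks on patient would be a natural extension of the ~group model fitted above. A sketch with toy titles, assuming the leading digits of each title identify the patient (this parsing rule is an assumption about the naming scheme):

```r
# Sketch of a paired design (NOT the model fitted above, which used ~group).
# Toy titles; the assumption is that leading digits identify the patient.
titles <- c("013B_Bn", "013Q_Bn", "014B-Tn", "014Q-Tn")
patient <- factor(sub("^([0-9]+).*", "\\1", titles))
condition <- factor(ifelse(grepl("^[0-9]+B", titles), "Baseline", "FollowUp"),
                    levels = c("Baseline", "FollowUp"))

# Blocking on patient absorbs between-patient variation, so the condition
# coefficient is estimated from within-patient changes
design_paired <- model.matrix(~ patient + condition)
colnames(design_paired)
# "(Intercept)" "patient014" "conditionFollowUp"
```

This design could be passed to `estimateDisp()` and `glmQLFit()` in place of the unpaired one, at the cost of one coefficient per additional patient.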


8.6 What is the final coverage of your dataset?

19,735 genes across 51 samples (after intersecting the normalized and raw count matrices).

9 References

9.1 R Packages

GEOquery: Davis S, Meltzer PS (2007). GEOquery: a bridge between the Gene Expression Omnibus (GEO) and BioConductor. Bioinformatics, 23(14), 1846-1847. doi:10.1093/bioinformatics/btm254

edgeR: Robinson MD, McCarthy DJ, Smyth GK (2010). edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics, 26(1), 139-140. doi:10.1093/bioinformatics/btp616

limma: Ritchie ME, Phipson B, Wu D, Hu Y, Law CW, Shi W, Smyth GK (2015). limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Research, 43(7), e47. doi:10.1093/nar/gkv007

biomaRt: Durinck S, Spellman PT, Birney E, Huber W (2009). Mapping identifiers for the integration of genomic datasets with the R/Bioconductor package biomaRt. Nature Protocols, 4, 1184-1191. doi:10.1038/nprot.2009.97

org.Hs.eg.db: Carlson M (2024). org.Hs.eg.db: Genome wide annotation for Human. R package version 3.19.1.

ggplot2: Wickham H (2016). ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York. ISBN 978-3-319-24277-4, https://ggplot2.tidyverse.org

pheatmap: Kolde R (2019). pheatmap: Pretty Heatmaps. R package version 1.0.12, https://CRAN.R-project.org/package=pheatmap

knitr: Xie Y (2024). knitr: A General-Purpose Package for Dynamic Report Generation in R. R package version 1.48, https://yihui.org/knitr/

R Core Team (2024). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/

BiocManager: Morgan M (2024). BiocManager: Access the Bioconductor Project Package Repository. R package version 1.30.25, https://CRAN.R-project.org/package=BiocManager

9.2 Primary Data Source

Guevara-Nieto HM, Orozco-Castaño CA, Parra-Medina R, Poveda-Garavito JN, Garai J, Zabaleta J, López-Kleine L, Combita AL (2025). Molecular determinants of neoadjuvant chemotherapy resistance in breast cancer: An analysis of gene expression and tumor microenvironment. PLoS One 20(10): e0334335. https://doi.org/10.1371/journal.pone.0334335

GEO Accession: GSE309004