By: Owen Lo

# Install and load required packages
if(!requireNamespace("readxl", quietly = TRUE)) {
  install.packages("readxl")
}
if (!requireNamespace("knitr", quietly = TRUE)) {
  install.packages("knitr")
}

library(GEOquery)
library(tibble)
library(readxl)
library(dplyr)
library(tidyr)
library(knitr)
library(biomaRt)
library(DESeq2)
library(edgeR)
library(SummarizedExperiment)
library(ggplot2)
library(ComplexHeatmap)
library(circlize)

This workbook uses the GEOQuery (Davis and Meltzer 2007), biomaRt (Durinck et al. 2005), tibble (Müller and Wickham 2025), readxl(Wickham and Bryan 2025), dplyr (Wickham et al. 2023), tidyr(Wickham et al. 2025), knitr (Xie 2014), DESeq2(Love et al. 2014), edgeR (Chen et al. 2025), SummarizedExperiment(Morgan et al. 2025), ggplot2 (Wickham 2016), ComplexHeatmap (Gu 2022) and circlize (Gu et al. 2014) packages in its code chunks.

Introduction

Septic shock is a potentially fatal complication of sepsis, and treatment largely depends on early supportive hemodynamic therapy. Patient response to hemodynamic therapy varies, and the NCBI Gene Expression Omnibus (GEO) (Barrett et al. 2012) series GSE110487 is a dataset from 2018 consisting of RNA sequencing read counts of whole blood samples taken from 14 septic shock patients who responded to hemodynamic therapy and 17 non-responders, at two time points: immediately after ICU admission, and 48 hours later, after hemodynamic therapy. (Barcella et al. 2018)

Fetching and assessing data

Downloading and formatting

We’ll use the GEOQuery package to retrieve the supplementary files and metadata pertaining to the experiment from the GEO with its series accession number, and then look at how it’s formatted.

geo_id <- "GSE110487"

# get metadata associated with gse and supplementary files
gse <- GEOquery::getGEO(geo_id, GSEMatrix=FALSE)
supp <- GEOquery::getGEOSuppFiles(geo_id, fetch_files=FALSE)

# download data if not already downloaded
current_dir <- file.path(getwd())
supp_file_path <- file.path(current_dir, geo_id, supp$fname[1])
if (!file.exists(supp_file_path)) {
  data_file <- GEOquery::getGEOSuppFiles(GEO = geo_id)
}

# read data into a dataframe
raw_counts <- readxl::read_xlsx(supp_file_path)

# take a peek
knitr::kable(raw_counts[1:5, 1:10], format = "html", align = "c")
Geneid E04T1 E04T2 E07T1 E07T2 E15T1 E15T2 E16T1 E16T2 E17T1
ENSG00000000003 1 0 1 1 2 4 1 4 4
ENSG00000000005 0 0 1 1 0 0 0 0 0
ENSG00000000419 259 208 599 576 734 534 630 653 283
ENSG00000000457 184 332 370 400 410 345 444 438 263
ENSG00000000460 89 81 245 185 117 167 177 134 127
# count genes and samples
cat(paste("Number of genes: ", nrow(raw_counts),
          "\nNumber of samples: ", ncol(raw_counts)-1))
Number of genes:  58096 
Number of samples:  62

Samples in the raw data are named according to patient identity and time point, but don’t contain any information on whether the patient was a responder or a non-responder, so we’ll map sample identifiers to both response and time point.

# retrieve list of samples in the series, and compile condition data for each
samples <- gse@gsms
sample_sources <- do.call(rbind, 
                             lapply(samples, FUN=function(x){
                               c(x@header$title,
                                 x@header$source_name_ch1)}))

# Use this data to create a dataframe mapping sample titles to responder status and time point
sample_metadata <- as.data.frame(sample_sources, stringsAsFactors = FALSE)
colnames(sample_metadata) = c("sample", "source")
sample_metadata <- sample_metadata %>% 
  magrittr::set_rownames(.$sample) %>% 
  mutate(
    response = ifelse(grepl("^Responder", source), "R", "NR"),
    time_point = ifelse(grepl("T1", source), "T1", "T2"),
  ) %>% 
  dplyr::select(response, time_point)

# take a peek
knitr::kable(sample_metadata[1:5,], format = "html", align = "c")
response time_point
E04T1 R T1
E04T2 R T2
E07T1 NR T1
E07T2 NR T2
E15T1 NR T1

Quality assessment

Now, we’ll determine whether data quality varies across conditions

# calculating summary statistics for each sample
sample_summaries <- data.frame(do.call(cbind, lapply(raw_counts[, 2:ncol(raw_counts)], summary)))
qual_overview <- as.data.frame(t(sample_summaries))

# Adding extra statistics for total reads, number of genes with zero expression, number of genes with low expression
qual_overview <- qual_overview %>% 
  dplyr::mutate(total_reads = colSums(raw_counts[2:ncol(raw_counts)]),
                zero_exp = colSums(raw_counts[2:ncol(raw_counts)] == 0),
                low_exp = colSums(raw_counts[2:ncol(raw_counts)]<10),
                )

# adding metadata to sort by conditions
qual_overview <- merge(qual_overview, sample_metadata, by = 0)

# overview stats for samples from non-responders and responders
stats_by_response <- qual_overview %>%
  dplyr::group_by(response,) %>%
  summarize(
    NumSamples = n(),
    MeanLibSize = mean(total_reads),
    SDLibSize = sd(total_reads),
    PropZeroExp = mean(zero_exp) / nrow(raw_counts),
    PropLowExp = mean(low_exp) / nrow(raw_counts),
  )

# overview stats for samples from time point 1 and time point 2
stats_by_time <- qual_overview %>%
  dplyr::group_by(time_point,) %>%
  summarize(
    NumSamples = n(),
    MeanLibSize = mean(total_reads),
    SDLibSize = sd(total_reads),
    PropZeroExp = mean(zero_exp) / nrow(raw_counts),
    PropLowExp = mean(low_exp) / nrow(raw_counts),
  )

# merging tables and displaying
stats_by_condition <- bind_rows(
    stats_by_response %>% 
        dplyr::mutate(group_type = "response") %>%
        dplyr::rename(condition = response),
    stats_by_time %>% 
        dplyr::mutate(group_type = "time_point") %>%
        dplyr::rename(condition = time_point)
    )
knitr::kable(stats_by_condition[,1:6], format = "html", align = "c")
condition NumSamples MeanLibSize SDLibSize PropZeroExp PropLowExp
NR 28 13414209 3691198 0.5809215 0.7772332
R 34 13675120 3772593 0.5990009 0.7761170
T1 31 13725546 3630149 0.5940995 0.7795584
T2 31 13389033 3836210 0.5875725 0.7736838

Mean, standard deviation, and proportions of unexpressed and lowly expressed genes are similar across patient responder status and time, so special correction for data quality won’t be necessary.

Identifier mapping and cleaning

Mapping to HUGO symbols

The data uses Ensembl Gene IDs beginning with ENSG.

# get ensembl version 100 and the human gene dataset
ensembl <- biomaRt::useEnsembl(biomart = "ensembl")
ensembl <- biomaRt::useDataset("hsapiens_gene_ensembl", mart = ensembl)

Rather than Ensembl Gene IDs, we want to use HUGO Gene Nomenclature Committee (HGNC) symbols to identify the genes in our dataset. We’ll use the Ensembl human dataset to map as many ENSG IDs as possible to HGNC symbols, and add those mappings to our read data.

# store ensembl IDs from the data as a vector
ensembl_ids <- raw_counts$Geneid

# check if an ensembl to hgnc conversion file exists already
conversion_stash <- "ensembl_to_hgnc.rds"
if (file.exists(conversion_stash)) {
  
  # if the file exists, read conversion data from that
  ensembl_to_hgnc <- readRDS(conversion_stash)
} else {
  
  # if the file does not exist, get conversion data from biomaRt and cache it
  ensembl_to_hgnc <- biomaRt::getBM(attributes = c("ensembl_gene_id", "hgnc_symbol"),
                         filters = c("ensembl_gene_id"),
                         values = ensembl_ids,
                         mart = ensembl)
  saveRDS(ensembl_to_hgnc, conversion_stash)
}

# merge conversion df with data and rearrange such that both id columns come before samples
raw_counts_annot <- dplyr::left_join(raw_counts, ensembl_to_hgnc, 
                                     by = c("Geneid" = "ensembl_gene_id"))
raw_counts_cols <- dplyr::setdiff(names(raw_counts), "Geneid")
colnames(raw_counts_annot)[1] = "ensembl_gene_id"
raw_counts_annot <- raw_counts_annot[, c("ensembl_gene_id", "hgnc_symbol", raw_counts_cols)]

# peek at annotated counts
knitr::kable(raw_counts_annot[1:5, 1:10], format = "html", align = "c")
ensembl_gene_id hgnc_symbol E04T1 E04T2 E07T1 E07T2 E15T1 E15T2 E16T1 E16T2
ENSG00000000003 TSPAN6 1 0 1 1 2 4 1 4
ENSG00000000005 TNMD 0 0 1 1 0 0 0 0
ENSG00000000419 DPM1 259 208 599 576 734 534 630 653
ENSG00000000457 SCYL3 184 332 370 400 410 345 444 438
ENSG00000000460 C1orf112 89 81 245 185 117 167 177 134

Now rows are identified by both ENSG ID and HGNC symbol.

Cleaning out low expression and unmapped rows

Assaying for rows where mapping one ENSG ID to one HGNC symbol failed:

# subset rows where one ensembl id maps to multiple hgnc symbols
ensembl_dups <- raw_counts_annot$ensembl_gene_id[duplicated(raw_counts_annot$ensembl_gene_id)]
multiple_hgnc <- raw_counts_annot %>% 
  filter(ensembl_gene_id %in% ensembl_dups)

# subset rows where multiple ensembl ids map to one hgnc symbol
hgnc_dups <- raw_counts_annot$hgnc_symbol[duplicated(raw_counts_annot$hgnc_symbol)]
same_hgnc <- raw_counts_annot %>% 
  filter(hgnc_symbol %in% hgnc_dups & hgnc_symbol != "")

# subset rows where ensembl id does not map to a hgnc symbol 
unmapped <- raw_counts_annot %>% 
  filter(is.na(hgnc_symbol) | hgnc_symbol == "")

cat(paste(nrow(unmapped), "unmapped rows\n",
          nrow(multiple_hgnc), 
          "rows where one ENSG ID maps to multiple HGNC symbols \n",
          nrow(same_hgnc), 
          "rows where multiple ENSG IDs map to the same HGNC symbol"))
20178 unmapped rows
 8 rows where one ENSG ID maps to multiple HGNC symbols 
 28 rows where multiple ENSG IDs map to the same HGNC symbol

A high proportion of the dataset’s rows do not map to any HGNC symbol. These are likely to correspond to outdated aliases of current symbols, or deprecated ORFs. By filtering low expression genes from the data and comparing the proportions of mapped and unmapped genes that remain, we can assess the significance of the unmapped rows.

# size of smallest condition group in the dataset: non-responders at time point 1/2
smallest_group_size <- 11

# keep only unmapped rows that have a count of at least 10 in 11 samples
rows_to_keep <- rowSums(unmapped >= 10) >= smallest_group_size
clean_unmapped <- unmapped[rows_to_keep,]

# subset mapped counts 
mapped <- raw_counts_annot %>% 
  filter(!is.na(hgnc_symbol) & hgnc_symbol != "")

# apply same expression filter to mapped counts 
rows_to_keep <- rowSums(mapped >= 10) >= smallest_group_size
clean_mapped <- mapped[rows_to_keep,]

# collate and display results of filtering
cleaning_results <- data.frame(
  Type = c("mapped genes", "unmapped genes"),
  PreCleanCount = c(nrow(mapped), nrow(unmapped)),
  PostCleanCount = c(nrow(clean_mapped), nrow(clean_unmapped)),
  SignificantProportion = c(nrow(clean_mapped)/nrow(mapped), 
                     nrow(clean_unmapped)/nrow(unmapped))
)
knitr::kable(cleaning_results, format = "html", align = "c")
Type PreCleanCount PostCleanCount SignificantProportion
mapped genes 37922 13673 0.3605559
unmapped genes 20178 1907 0.0945089

Less than a tenth of the unmapped rows had sufficient levels of expression to not be filtered in early data cleaning, compared to more than a third of the mapped rows.

Since unmapped rows are largely lowly expressed, we can discard them without losing a great amount of information. ENSG IDs mapping to multiple HGNC symbols may be transcripts that hybridize to more than one gene, and since they are in such small proportion in the dataset, we will treat them as multiple genes without significantly impacting analysis. HGNC symbols mapped to by multiple ENSG IDs may be genes with multiple splice variants. Since they are likewise in minuscule proportion in the dataset, we will sum together rows that map to the same HUGO gene.

# create a new dataframe where rows correspond to unique hgnc symbols
clean_counts_byhgnc <- clean_mapped %>% 
  dplyr::group_by(hgnc_symbol) %>% 
  # sum rows that map to the same hgnc symbol
  dplyr::summarise(dplyr::across(dplyr::where(is.numeric), ~ sum(.x, na.rm = TRUE)))

knitr::kable(clean_counts_byhgnc[1:5, 1:10], format = "html", align = "c")
hgnc_symbol E04T1 E04T2 E07T1 E07T2 E15T1 E15T2 E16T1 E16T2 E17T1
A1BG-AS1 6 7 19 11 20 9 13 20 7
A2M-AS1 2 1 2 2 2 5 2 2 3
AAAS 22 21 25 35 45 18 37 23 33
AACS 28 10 27 60 133 50 96 64 52
AAGAB 177 202 458 520 644 476 475 687 278
cat(paste("Number of genes: ", nrow(clean_counts_byhgnc),
          "\nNumber of samples: ", ncol(clean_counts_byhgnc)-1))
Number of genes:  13673 
Number of samples:  62

We now have a dataframe of read counts for genes with expression levels above a baseline, identified by unique HUGO symbols.

Normalization

Converting data for DESeq2

Let’s create a DESeqDataSet object from our data, and continue our analysis with the DESeq2 package, adapting methods from Love et al’s RNASeq workflow (Love et al. 2015).

# extract counts as a matrix and set row and column names
clean_count_matrix <- as.matrix(clean_counts_byhgnc[,2:ncol(clean_counts_byhgnc)])
rownames(clean_count_matrix) <- clean_counts_byhgnc$hgnc_symbol
colnames(clean_count_matrix) <- colnames(clean_counts_byhgnc)[2:ncol(clean_counts_byhgnc)]

#create a variable in the metadata from the interaction of response and time_point
sample_metadata$group = interaction(sample_metadata$response, 
                                      sample_metadata$time_point, 
                                      sep = "_")

# create a DESeqDataSet object from the counts matrix and the sample metadata, using the interaction of response and time point as the design. 
counts_dds <- DESeq2::DESeqDataSetFromMatrix(countData = clean_count_matrix,
                                             colData = sample_metadata,
                                             design = ~group)

Normalizing counts

We’ll implement relative log expression (RLE) normalization on the counts using the DESeq2 package. RLE normalization is based on the hypothesis that most genes are not differentially expressed, which is suitable for this dataset, as the associated publication identifies a low proportion of DE genes.

# estimate normalization factors and apply them to counts
counts_dds <- DESeq2::estimateSizeFactors(counts_dds)
norm_counts <- DESeq2::counts(counts_dds, normalized = TRUE)

Plotting sample clustering

In order to visualize the effects of normalization, we’ll generate Multi-Dimensional Scaling (MDS) plots from the un-normalized and normalized data.

# create SummarizedExperiment objects with un-normalized and normalized counts
unnorm_counts <- DESeq2::counts(counts_dds, normalized = FALSE)
unnorm_counts_se <- SummarizedExperiment(assays = list(counts = unnorm_counts),
                                         colData = sample_metadata)
norm_counts_se <- SummarizedExperiment(assays = list(counts = norm_counts), 
                                       colData = sample_metadata)

# generate MDS data with un-normalized counts
unnorm_mds <- edgeR::plotMDS.SummarizedExperiment(unnorm_counts_se, plot = FALSE)
unnorm_mds_df <- data.frame(
  Dim1 = unnorm_mds$x,
  Dim2 = unnorm_mds$y,
  Group = unnorm_counts_se$group,
  Sample = colnames(unnorm_counts_se)
)

# generate MDS data with normalized counts
norm_mds <- edgeR::plotMDS.SummarizedExperiment(norm_counts_se, plot = FALSE)
norm_mds_df <- data.frame(
  Dim1 = norm_mds$x,
  Dim2 = norm_mds$y,
  Group = norm_counts_se$group,
  Sample = colnames(norm_counts_se)
)

# plot un-normalized MDS  
ggplot2::ggplot(unnorm_mds_df, aes(x = Dim1, y = Dim2, color = Group)) +
  geom_point(size = 3) +
  theme_minimal() +
  labs(
    x = paste("Dim1 (", round(unnorm_mds$var.explained[1]*100, 1), "%)"),
    y = paste("Dim2 (", round(unnorm_mds$var.explained[2]*100, 1), "%)"),
    title = "MDS plot: un-normalized counts"
  )

Figure 1: MDS plot generated from filtered RNA sequencing read counts of whole blood samples from septic shock patients who were responsive (R) and non-responsive (NR) to hemodynamic therapy, taken at time points before and after treatment (T1 and T2), without normalization.

# plot normalized MDS
ggplot2::ggplot(norm_mds_df, aes(x = Dim1, y = Dim2, color = Group)) +
  geom_point(size = 3) +
  theme_minimal() +
  labs(
    x = paste("Dim1 (", round(norm_mds$var.explained[1]*100, 1), "%)"),
    y = paste("Dim2 (", round(norm_mds$var.explained[2]*100, 1), "%)"),
    title = "MDS plot: normalized counts"
  )

Figure 2: MDS plot generated from filtered RNA sequencing read counts of whole blood samples from septic shock patients who were responsive (R) and non-responsive (NR) to hemodynamic therapy, taken at time points before and after treatment (T1 and T2), normalized by relative log expression.

The MDS plots generated from the data pre and post normalization are nearly identical. The low impact of normalization indicates little technical bias in the dataset, and suggests that msot variation is biological

Differential gene expression

Evaluating differential expression

From Fig. 1 and Fig. 2, some grouping is visible of samples within the same set of conditions. Samples taken at the same time point from patients with the same response behaviour loosely cluster together. As such, we’ll use the four combinations of response and time point as factors to evaluate differential expression.

# evaluate differential expression. Appropriate design already specified in counts_dds
counts_dds <- DESeq2::DESeq(counts_dds)

We’ll look for genes that are differentially expressed based on patient response at each of the two time points, and look for genes that are differentially expressed over time within each of the two patient response groups.

# DE between R and NR at T1
response_T1 <- DESeq2::results(counts_dds, contrast = c("group", "NR_T1", "R_T1"), alpha = 0.05)
summary(response_T1)

# DE between R and NR at T2 
response_T2 <- DESeq2::results(counts_dds, contrast = c("group", "NR_T2", "R_T2"), alpha = 0.05)
summary(response_T2)

# DE between T1 and T2 for NR patients
time_NR <- DESeq2::results(counts_dds, contrast = c("group", "NR_T1", "NR_T2"), alpha =0.05)
summary(time_NR)

# DE between T1 and T2 for R patients
time_R <- DESeq2::results(counts_dds, contrast = c("group", "R_T1", "R_T2"), alpha = 0.05)
summary(time_R)
# get genes with adjusted p-value < 0.05 for each test
sig_T1 <- response_T1[response_T1$padj < 0.05,]
sig_T2 <- response_T2[response_T2$padj < 0.05,]
sig_NR <- time_NR[time_NR$padj < 0.05,]
sig_R <- time_R[time_R$padj < 0.05,]

cat(paste(nrow(sig_T1), "DE genes between R and NR at T1\n",
          nrow(sig_T2), "DE genes between R and NR at T2\n",
          nrow(sig_NR), "DE genes from T1 to T2 in NR\n",
          nrow(sig_R), "DE genes from T1 to T2 in R"))
0 DE genes between R and NR at T1
 67 DE genes between R and NR at T2
 846 DE genes from T1 to T2 in NR
 1613 DE genes from T1 to T2 in R

DESeq2 implements multiple hypothesis testing via Benjamini-Hochberg false discovery rate correction on p-values, producing adjusted p-values that more accurately reflect significance in bulk RNA-sequencing data than the original p-values given by the Wald test for differential expression. The threshold we’ve used for differentially expressed genes is genes with adjusted p-value < 0.05, a commonly used threshold [cite cite].

In this analysis, we’ll work with the genes that were differentially expressed between the R and NR groups at T2. Since no genes are significantly differentially expressed between the R and NR groups at T1, analyzing the genes that are differentially expressed between the groups at T2 will tell us which genes were regulated differently between responder and non-responder patients as a result of the same conditions, yielding information on the differences between the two patient group’s biological responses to the therapy.

Visualizing differentially expressed genes

We’ll create a volcano plot of the genes using the differential expression statistics calculated between the R and NR groups at T2.

# convert DE results to data frame
difexp_info <- as.data.frame(response_T2)

# add column indicating whether genes are significantly up or downregulated
difexp_info$DE <- "no"
difexp_info$DE[difexp_info$log2FoldChange > 0 & difexp_info$padj < 0.05] <- "up"
difexp_info$DE[difexp_info$log2FoldChange < 0 & difexp_info$padj < 0.05] <- "down"

# generate volcano plot, -log10 of adjusted p-val vs log2 of FC
ggplot2::ggplot(data = difexp_info, mapping = aes(x = log2FoldChange, y = -log10(padj), col = DE)) + 
  geom_point() +
  theme_minimal() + 
  geom_hline(yintercept = -log10(0.05), col = "black", linetype = 'dashed') + 
  scale_color_manual(values = c("blue", "grey", "red"),
                     labels = c("Downregulated in R group", "Not DE", "Upregulated in R group")) +
  labs(color = "Expression", 
       x = "log 2 fold change",
       y = "log 10 corrected p-value",
       title = "Expression of genes in R vs. NR samples at T2")

Figure 3: FDR-corrected p-value (-log10) vs. fold change (log2) in expression of genes in whole blood samples from septic shock patients who responded to early supportive hemodynamic therapy (R samples) and patients who did not respond (NR), taken 48 hours after treatment. Threshold for differential expression was corrected p-value < 0.05.

Next, we’ll create a heatmap of the genes differentially expressed between the R and NR groups at T2, and see how they cluster according to sample conditions.

# use normalized counts for heatmap matrix, filter for DE genes
normcounts_heatmap <- norm_counts[rownames(norm_counts) %in% rownames(sig_T2),]

# normalize rows
normcounts_heatmap <- t(scale(t(normcounts_heatmap)))

# set colours to scale from blue at min values to white at 0 to red at max values
heatmap_col = circlize::colorRamp2(c(min(normcounts_heatmap), 0, max(normcounts_heatmap)),
                                   c("blue", "white", "red"))

# set colours to correspond to experimental conditions
group_colours <- rainbow(n = 4)
names(group_colours) <- unique(sample_metadata$group)

# add annotation with experimental conditions and corresponding colours
hannot <- ComplexHeatmap::HeatmapAnnotation(df = data.frame(
  Group = sample_metadata$group),
  col = list(
    Group = group_colours
  ),
show_legend = TRUE)

# generate heatmap
DE_heatmap <- ComplexHeatmap::Heatmap(normcounts_heatmap, 
                                   col = heatmap_col,
                                   show_column_names = FALSE,
                                   show_row_names = FALSE,
                                   top_annotation = hannot,
                                   show_heatmap_legend = FALSE,
                                   column_title = "Genes differentially expressed in R vs. NR groups at T1")

DE_heatmap

Figure 4: Heatmap of genes differentially expressed in septic shock patients who responded to hemodynamic therapy and patients who did not respond 48 hours after treatment (groups R_T2 and NR_T2, respectively). Blue denotes low gene expression and red denotes high gene expression.

The R_T2 and NR_T2 samples largely clustered apart from each other, while the R_T1 and NR_T1 samples are not. This is as expected, since the genes being plotted are differentially expressed between R and NR at T2, while no genes are differentially expressed between R and NR at T1. There is also significant separation between the T1 and T2 groups for R, but not for NR, indicating that many of these genes change in expression in responders over the 48 hours after hemodynamic therapy, but not in non-responders.

Interpretation and documentation

Bibliography

Barcella, Matteo, Bernardo Bollen Pinto, Daniele Braga, et al. 2018. “Identification of a Transcriptome Profile Associated with Improvement of Organ Function in Septic Shock Patients After Early Supportive Therapy.” Critical Care 22 (1). https://doi.org/10.1186/s13054-018-2242-3.
Barrett, Tanya, Stephen E. Wilhite, Pierre Ledoux, et al. 2012. “NCBI GEO: Archive for Functional Genomics Data Sets—Update.” Nucleic Acids Research 41 (D1): D991–95. https://doi.org/10.1093/nar/gks1193.
Chen, Yunshun, Lizhong Chen, Aaron T L Lun, Pedro Baldoni, and Gordon K Smyth. 2025. edgeR V4: Powerful Differential Analysis of Sequencing Data with Expanded Functionality and Improved Support for Small Counts and Larger Datasets.” Nucleic Acids Research 53 (2): gkaf018. https://doi.org/10.1093/nar/gkaf018.
Davis, Sean, and Paul S. Meltzer. 2007. “GEOquery: A Bridge Between the Gene Expression Omnibus (GEO) and BioConductor.” Bioinformatics 23 (14): 1846–47. https://doi.org/10.1093/bioinformatics/btm254.
Durinck, Steffen, Yves Moreau, Arek Kasprzyk, et al. 2005. “BioMart and Bioconductor: A Powerful Link Between Biological Databases and Microarray Data Analysis.” Bioinformatics 21: 3439–40.
Gu, Zuguang. 2022. “Complex Heatmap Visualization.” iMeta, ahead of print. https://doi.org/10.1002/imt2.43.
Gu, Zuguang, Lei Gu, Roland Eils, Matthias Schlesner, and Benedikt Brors. 2014. “Circlize Implements and Enhances Circular Visualization in r.” Bioinformatics, ahead of print. https://doi.org/10.1093/bioinformatics/btu393.
Love, Michael I., Simon Anders, Vladislav Kim, and Wolfgang Huber. 2015. “RNA-Seq Workflow: Gene-Level Exploratory Analysis and Differential Expression.” F1000Research 4 (October): 1070. https://doi.org/10.12688/f1000research.7035.1.
Love, Michael I., Wolfgang Huber, and Simon Anders. 2014. “Moderated Estimation of Fold Change and Dispersion for RNA-Seq Data with DESeq2.” Genome Biology 15: 550. https://doi.org/10.1186/s13059-014-0550-8.
Morgan, Martin, Valerie Obenchain, Jim Hester, and Hervé Pagès. 2025. SummarizedExperiment: A Container (S4 Class) for Matrix-Like Assays. https://doi.org/10.18129/B9.bioc.SummarizedExperiment.
Müller, Kirill, and Hadley Wickham. 2025. Tibble: Simple Data Frames. https://tibble.tidyverse.org/.
Wickham, Hadley. 2016. Ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York. https://ggplot2.tidyverse.org.
Wickham, Hadley, and Jennifer Bryan. 2025. Readxl: Read Excel Files. https://readxl.tidyverse.org.
Wickham, Hadley, Romain François, Lionel Henry, Kirill Müller, and Davis Vaughan. 2023. Dplyr: A Grammar of Data Manipulation. https://doi.org/10.32614/CRAN.package.dplyr.
Wickham, Hadley, Davis Vaughan, and Maximilian Girlich. 2025. Tidyr: Tidy Messy Data. https://doi.org/10.32614/CRAN.package.tidyr.
Xie, Yihui. 2014. “Knitr: A Comprehensive Tool for Reproducible Research in R.” In Implementing Reproducible Computational Research, edited by Victoria Stodden, Friedrich Leisch, and Roger D. Peng. Chapman; Hall/CRC.
---
title: "Differential Gene Expression in Septic Shock Patients Based on Response to Early Supportive Hemodynamic Therapy"
output: 
  html_notebook:
    toc: true
    toc_depth: 2
    code_folding: show
bibliography: A1_bib.bib
---



By: **Owen Lo**
```{r message = FALSE, warning = FALSE, class.source = 'fold-hide'}
# Install and load required packages
if(!requireNamespace("readxl", quietly = TRUE)) {
  install.packages("readxl")
}
if (!requireNamespace("knitr", quietly = TRUE)) {
  install.packages("knitr")
}

library(GEOquery)
library(tibble)
library(readxl)
library(dplyr)
library(tidyr)
library(knitr)
library(biomaRt)
library(DESeq2)
library(edgeR)
library(SummarizedExperiment)
library(ggplot2)
library(ComplexHeatmap)
library(circlize)
```

This workbook uses the GEOQuery [@Davis2007], biomaRt [@Durinck2005], tibble [@Mueller2025], readxl[@Wickham2025], dplyr [@Wickham2023], tidyr[@Wickham2025a], knitr [@Xie2014], DESeq2[@Love2014], edgeR [@Chen2025], SummarizedExperiment[@Morgan2025], ggplot2 [@Wickham2016], ComplexHeatmap [@Gu2022] and circlize [@Gu2014] packages in its code chunks.

# Introduction

Septic shock is a potentially fatal complication of sepsis, and treatment largely depends on early supportive hemodynamic therapy. Patient response to hemodynamic therapy varies, and the NCBI Gene Expression Omnibus (GEO) [@Barrett2012] series GSE110487 is a dataset from 2018 consisting of RNA sequencing read counts of whole blood samples taken from 14 septic shock patients who responded to hemodynamic therapy and 17 non-responders, at two time points: immediately after ICU admission, and 48 hours later, after hemodynamic therapy. [@Barcella2018]

# Fetching and assessing data

## Downloading and formatting

We'll use the GEOQuery package to retrieve the supplementary files and metadata pertaining to the experiment from the GEO with its series accession number, and then look at how it's formatted.

```{r, message = FALSE, warning = FALSE, class.source = 'fold-show'}
geo_id <- "GSE110487"

# get metadata associated with gse and supplementary files
gse <- GEOquery::getGEO(geo_id, GSEMatrix=FALSE)
supp <- GEOquery::getGEOSuppFiles(geo_id, fetch_files=FALSE)

# download data if not already downloaded
current_dir <- file.path(getwd())
supp_file_path <- file.path(current_dir, geo_id, supp$fname[1])
if (!file.exists(supp_file_path)) {
  data_file <- GEOquery::getGEOSuppFiles(GEO = geo_id)
}

# read data into a dataframe
raw_counts <- readxl::read_xlsx(supp_file_path)

# take a peek
knitr::kable(raw_counts[1:5, 1:10], format = "html", align = "c")
```

```{r, message = FALSE, warning = FALSE}
# count genes and samples
cat(paste("Number of genes: ", nrow(raw_counts),
          "\nNumber of samples: ", ncol(raw_counts)-1))
```

Samples in the raw data are named according to patient identity and time point, but don't contain any information on whether the patient was a responder or a non-responder, so we'll map sample identifiers to both response and time point.

```{r, message = FALSE, warning = FALSE}
# retrieve list of samples in the series, and compile condition data for each
samples <- gse@gsms
sample_sources <- do.call(rbind, 
                             lapply(samples, FUN=function(x){
                               c(x@header$title,
                                 x@header$source_name_ch1)}))

# Use this data to create a dataframe mapping sample titles to responder status and time point
sample_metadata <- as.data.frame(sample_sources, stringsAsFactors = FALSE)
colnames(sample_metadata) = c("sample", "source")
sample_metadata <- sample_metadata %>% 
  magrittr::set_rownames(.$sample) %>% 
  mutate(
    response = ifelse(grepl("^Responder", source), "R", "NR"),
    time_point = ifelse(grepl("T1", source), "T1", "T2"),
  ) %>% 
  dplyr::select(response, time_point)

# take a peek
knitr::kable(sample_metadata[1:5,], format = "html", align = "c")
```

## Quality assessment

Now, we'll determine whether data quality varies across conditions

```{r, message = FALSE, warning = FALSE}
# calculating summary statistics for each sample
sample_summaries <- data.frame(do.call(cbind, lapply(raw_counts[, 2:ncol(raw_counts)], summary)))
qual_overview <- as.data.frame(t(sample_summaries))

# Adding extra statistics for total reads, number of genes with zero expression, number of genes with low expression
qual_overview <- qual_overview %>% 
  dplyr::mutate(total_reads = colSums(raw_counts[2:ncol(raw_counts)]),
                zero_exp = colSums(raw_counts[2:ncol(raw_counts)] == 0),
                low_exp = colSums(raw_counts[2:ncol(raw_counts)]<10),
                )

# adding metadata to sort by conditions
qual_overview <- merge(qual_overview, sample_metadata, by = 0)

# overview stats for samples from non-responders and responders
stats_by_response <- qual_overview %>%
  dplyr::group_by(response,) %>%
  summarize(
    NumSamples = n(),
    MeanLibSize = mean(total_reads),
    SDLibSize = sd(total_reads),
    PropZeroExp = mean(zero_exp) / nrow(raw_counts),
    PropLowExp = mean(low_exp) / nrow(raw_counts),
  )

# overview stats for samples from time point 1 and time point 2
stats_by_time <- qual_overview %>%
  dplyr::group_by(time_point,) %>%
  summarize(
    NumSamples = n(),
    MeanLibSize = mean(total_reads),
    SDLibSize = sd(total_reads),
    PropZeroExp = mean(zero_exp) / nrow(raw_counts),
    PropLowExp = mean(low_exp) / nrow(raw_counts),
  )

# merging tables and displaying
stats_by_condition <- bind_rows(
    stats_by_response %>% 
        dplyr::mutate(group_type = "response") %>%
        dplyr::rename(condition = response),
    stats_by_time %>% 
        dplyr::mutate(group_type = "time_point") %>%
        dplyr::rename(condition = time_point)
    )
knitr::kable(stats_by_condition[,1:6], format = "html", align = "c")
```

Mean, standard deviation, and proportions of unexpressed and lowly expressed genes are similar across patient responder status and time, so special correction for data quality won't be necessary.

# Identifier mapping and cleaning

## Mapping to HUGO symbols

The data uses Ensembl Gene IDs beginning with ENSG.

```{r, message = FALSE, warning = FALSE, class.source = 'fold-show'}
# get ensembl version 100 and the human gene dataset
ensembl <- biomaRt::useEnsembl(biomart = "ensembl")
ensembl <- biomaRt::useDataset("hsapiens_gene_ensembl", mart = ensembl)
```

Rather than Ensembl Gene IDs, we want to use HUGO Gene Nomenclature Committee (HGNC) symbols to identify the genes in our dataset. We'll use the Ensembl human dataset to map as many ENSG IDs as possible to HGNC symbols, and add those mappings to our read data.

```{r, message = FALSE, warning = FALSE}
# store ensembl IDs from the data as a vector
ensembl_ids <- raw_counts$Geneid

# check if an ensembl to hgnc conversion file exists already
conversion_stash <- "ensembl_to_hgnc.rds"
if (file.exists(conversion_stash)) {
  
  # if the file exists, read conversion data from that
  ensembl_to_hgnc <- readRDS(conversion_stash)
} else {
  
  # if the file does not exist, get conversion data from biomaRt and cache it
  ensembl_to_hgnc <- biomaRt::getBM(attributes = c("ensembl_gene_id", "hgnc_symbol"),
                         filters = c("ensembl_gene_id"),
                         values = ensembl_ids,
                         mart = ensembl)
  saveRDS(ensembl_to_hgnc, conversion_stash)
}

# merge conversion df with data and rearrange such that both id columns come before samples
raw_counts_annot <- dplyr::left_join(raw_counts, ensembl_to_hgnc, 
                                     by = c("Geneid" = "ensembl_gene_id"))
raw_counts_cols <- dplyr::setdiff(names(raw_counts), "Geneid")
colnames(raw_counts_annot)[1] = "ensembl_gene_id"
raw_counts_annot <- raw_counts_annot[, c("ensembl_gene_id", "hgnc_symbol", raw_counts_cols)]

# peek at annotated counts
knitr::kable(raw_counts_annot[1:5, 1:10], format = "html", align = "c")
```

Now rows are identified by both ENSG ID and HGNC symbol.

## Cleaning out low expression and unmapped rows

Assaying for rows where mapping one ENSG ID to one HGNC symbol failed:

```{r, message = FALSE, warning = FALSE}
# subset rows where one ensembl id maps to multiple hgnc symbols
ensembl_dups <- raw_counts_annot$ensembl_gene_id[duplicated(raw_counts_annot$ensembl_gene_id)]
multiple_hgnc <- raw_counts_annot %>% 
  filter(ensembl_gene_id %in% ensembl_dups)

# subset rows where multiple ensembl ids map to one hgnc symbol
hgnc_dups <- raw_counts_annot$hgnc_symbol[duplicated(raw_counts_annot$hgnc_symbol)]
same_hgnc <- raw_counts_annot %>% 
  filter(hgnc_symbol %in% hgnc_dups & hgnc_symbol != "")

# subset rows where ensembl id does not map to a hgnc symbol 
unmapped <- raw_counts_annot %>% 
  filter(is.na(hgnc_symbol) | hgnc_symbol == "")

cat(paste(nrow(unmapped), "unmapped rows\n",
          nrow(multiple_hgnc), 
          "rows where one ENSG ID maps to multiple HGNC symbols \n",
          nrow(same_hgnc), 
          "rows where multiple ENSG IDs map to the same HGNC symbol"))
```

A high proportion of the dataset's rows do not map to any HGNC symbol. These are likely to correspond to outdated aliases of current symbols, or deprecated ORFs. By filtering low expression genes from the data and comparing the proportions of mapped and unmapped genes that remain, we can assess the significance of the unmapped rows.

```{r, message = FALSE, warning = FALSE, class.source = 'fold-show'}
# size of smallest condition group in the dataset: non-responders at time point 1/2
smallest_group_size <- 11

# keep only unmapped rows that have a count of at least 10 in 11 samples
rows_to_keep <- rowSums(unmapped >= 10) >= smallest_group_size
clean_unmapped <- unmapped[rows_to_keep,]

# subset mapped counts 
mapped <- raw_counts_annot %>% 
  filter(!is.na(hgnc_symbol) & hgnc_symbol != "")

# apply same expression filter to mapped counts 
rows_to_keep <- rowSums(mapped >= 10) >= smallest_group_size
clean_mapped <- mapped[rows_to_keep,]

# collate and display results of filtering
cleaning_results <- data.frame(
  Type = c("mapped genes", "unmapped genes"),
  PreCleanCount = c(nrow(mapped), nrow(unmapped)),
  PostCleanCount = c(nrow(clean_mapped), nrow(clean_unmapped)),
  SignificantProportion = c(nrow(clean_mapped)/nrow(mapped), 
                     nrow(clean_unmapped)/nrow(unmapped))
)
knitr::kable(cleaning_results, format = "html", align = "c")
```

Less than a tenth of the unmapped rows had sufficient levels of expression to not be filtered in early data cleaning, compared to more than a third of the mapped rows.

Since unmapped rows are largely lowly expressed, we can discard them without losing a great amount of information. ENSG IDs mapping to multiple HGNC symbols may be transcripts that hybridize to more than one gene, and since they are in such small proportion in the dataset, we will treat them as multiple genes without significantly impacting analysis. HGNC symbols mapped to by multiple ENSG IDs may be genes with multiple splice variants. Since they are likewise in minuscule proportion in the dataset, we will sum together rows that map to the same HUGO gene.
```{r, message = FALSE, warning = FALSE}
# create a new dataframe where rows correspond to unique hgnc symbols
clean_counts_byhgnc <- clean_mapped %>% 
  dplyr::group_by(hgnc_symbol) %>% 
  # sum rows that map to the same hgnc symbol
  dplyr::summarise(dplyr::across(dplyr::where(is.numeric), ~ sum(.x, na.rm = TRUE)))

knitr::kable(clean_counts_byhgnc[1:5, 1:10], format = "html", align = "c")
```

```{r, message = FALSE, warning = FALSE}
cat(paste("Number of genes: ", nrow(clean_counts_byhgnc),
          "\nNumber of samples: ", ncol(clean_counts_byhgnc)-1))
```

We now have a dataframe of read counts for genes with expression levels above a baseline, identified by unique HUGO symbols.

# Normalization

## Converting data for DESeq2

Let's create a DESeqDataSet object from our data, and continue our analysis with the DESeq2 package, adapting methods from Love et al's RNASeq workflow [@Love2015].

```{r, message = FALSE, warning = FALSE, class.source = 'fold-show'}
# extract counts as a matrix and set row and column names
clean_count_matrix <- as.matrix(clean_counts_byhgnc[,2:ncol(clean_counts_byhgnc)])
rownames(clean_count_matrix) <- clean_counts_byhgnc$hgnc_symbol
colnames(clean_count_matrix) <- colnames(clean_counts_byhgnc)[2:ncol(clean_counts_byhgnc)]

#create a variable in the metadata from the interaction of response and time_point
sample_metadata$group = interaction(sample_metadata$response, 
                                      sample_metadata$time_point, 
                                      sep = "_")

# create a DESeqDataSet object from the counts matrix and the sample metadata, using the interaction of response and time point as the design. 
counts_dds <- DESeq2::DESeqDataSetFromMatrix(countData = clean_count_matrix,
                                             colData = sample_metadata,
                                             design = ~group)
```

## Normalizing counts

We'll implement relative log expression (RLE) normalization on the counts using the DESeq2 package. RLE normalization is based on the hypothesis that most genes are not differentially expressed, which is suitable for this dataset, as the associated publication identifies a low proportion of DE genes.

```{r, message = FALSE, warning = FALSE, class.source = 'fold-show'}

# estimate normalization factors and apply them to counts
counts_dds <- DESeq2::estimateSizeFactors(counts_dds)
norm_counts <- DESeq2::counts(counts_dds, normalized = TRUE)
```

## Plotting sample clustering

In order to visualize the effects of normalization, we'll generate Multi-Dimensional Scaling (MDS) plots from the un-normalized and normalized data.

```{r, message = FALSE, warning = FALSE}
# create SummarizedExperiment objects with un-normalized and normalized counts
unnorm_counts <- DESeq2::counts(counts_dds, normalized = FALSE)
unnorm_counts_se <- SummarizedExperiment(assays = list(counts = unnorm_counts),
                                         colData = sample_metadata)
norm_counts_se <- SummarizedExperiment(assays = list(counts = norm_counts), 
                                       colData = sample_metadata)

# generate MDS data with un-normalized counts
unnorm_mds <- edgeR::plotMDS.SummarizedExperiment(unnorm_counts_se, plot = FALSE)
unnorm_mds_df <- data.frame(
  Dim1 = unnorm_mds$x,
  Dim2 = unnorm_mds$y,
  Group = unnorm_counts_se$group,
  Sample = colnames(unnorm_counts_se)
)

# generate MDS data with normalized counts
norm_mds <- edgeR::plotMDS.SummarizedExperiment(norm_counts_se, plot = FALSE)
norm_mds_df <- data.frame(
  Dim1 = norm_mds$x,
  Dim2 = norm_mds$y,
  Group = norm_counts_se$group,
  Sample = colnames(norm_counts_se)
)

# plot un-normalized MDS  
ggplot2::ggplot(unnorm_mds_df, aes(x = Dim1, y = Dim2, color = Group)) +
  geom_point(size = 3) +
  theme_minimal() +
  labs(
    x = paste("Dim1 (", round(unnorm_mds$var.explained[1]*100, 1), "%)"),
    y = paste("Dim2 (", round(unnorm_mds$var.explained[2]*100, 1), "%)"),
    title = "MDS plot: un-normalized counts"
  )
```

<center>

**Figure 1:** MDS plot generated from filtered RNA sequencing read counts of whole blood samples from septic shock patients who were responsive (R) and non-responsive (NR) to hemodynamic therapy, taken at time points before and after treatment (T1 and T2), without normalization.

</center>

```{r, message = FALSE, warning = FALSE}
# plot normalized MDS
ggplot2::ggplot(norm_mds_df, aes(x = Dim1, y = Dim2, color = Group)) +
  geom_point(size = 3) +
  theme_minimal() +
  labs(
    x = paste("Dim1 (", round(norm_mds$var.explained[1]*100, 1), "%)"),
    y = paste("Dim2 (", round(norm_mds$var.explained[2]*100, 1), "%)"),
    title = "MDS plot: normalized counts"
  )
```

<center>

**Figure 2:** MDS plot generated from filtered RNA sequencing read counts of whole blood samples from septic shock patients who were responsive (R) and non-responsive (NR) to hemodynamic therapy, taken at time points before and after treatment (T1 and T2), normalized by relative log expression.

</center>

The MDS plots generated from the data pre and post normalization are nearly identical. The low impact of normalization indicates little technical bias in the dataset, and suggests that msot variation is biological 

# Differential gene expression

## Evaluating differential expression

From Fig. 1 and Fig. 2, some grouping is visible of samples within the same set of conditions. Samples taken at the same time point from patients with the same response behaviour loosely cluster together. As such, we'll use the four combinations of response and time point as factors to evaluate differential expression.
```{r, message = FALSE, warning = FALSE, class.source = 'fold-show'}
# evaluate differential expression. Appropriate design already specified in counts_dds
counts_dds <- DESeq2::DESeq(counts_dds)
```
We'll look for genes that are differentially expressed based on patient response at each of the two time points, and look for genes that are differentially expressed over time within each of the two patient response groups. 
```{r, message = FALSE, warning = FALSE, results = 'hide', class.source = 'fold-show'}
# DE between R and NR at T1
response_T1 <- DESeq2::results(counts_dds, contrast = c("group", "NR_T1", "R_T1"), alpha = 0.05)
summary(response_T1)

# DE between R and NR at T2 
response_T2 <- DESeq2::results(counts_dds, contrast = c("group", "NR_T2", "R_T2"), alpha = 0.05)
summary(response_T2)

# DE between T1 and T2 for NR patients
time_NR <- DESeq2::results(counts_dds, contrast = c("group", "NR_T1", "NR_T2"), alpha =0.05)
summary(time_NR)

# DE between T1 and T2 for R patients
time_R <- DESeq2::results(counts_dds, contrast = c("group", "R_T1", "R_T2"), alpha = 0.05)
summary(time_R)
```
```{r, message = FALSE, warning = FALSE}
# get genes with adjusted p-value < 0.05 for each test
sig_T1 <- response_T1[response_T1$padj < 0.05,]
sig_T2 <- response_T2[response_T2$padj < 0.05,]
sig_NR <- time_NR[time_NR$padj < 0.05,]
sig_R <- time_R[time_R$padj < 0.05,]

cat(paste(nrow(sig_T1), "DE genes between R and NR at T1\n",
          nrow(sig_T2), "DE genes between R and NR at T2\n",
          nrow(sig_NR), "DE genes from T1 to T2 in NR\n",
          nrow(sig_R), "DE genes from T1 to T2 in R"))
```


DESeq2 implements multiple hypothesis testing via Benjamini-Hochberg false discovery rate correction on p-values, producing adjusted p-values that more accurately reflect significance in bulk RNA-sequencing data than the original p-values given by the Wald test for differential expression. 
The threshold we've used for differentially expressed genes is genes with adjusted p-value < 0.05, a commonly used threshold [cite cite]. 

In this analysis, we'll work with the genes that were differentially expressed between the R and NR groups at T2. Since no genes are significantly differentially expressed between the R and NR groups at T1, analyzing the genes that are differentially expressed between the groups at T2 will tell us which genes were regulated differently between responder and non-responder patients as a result of the same conditions, yielding information on the differences between the two patient group's biological responses to the therapy.

# Visualizing differentially expressed genes

We'll create a volcano plot of the genes using the differential expression statistics calculated between the R and NR groups at T2.
```{r, message = FALSE, warning = FALSE, class.source = 'fold-show'}
# convert DE results to data frame
difexp_info <- as.data.frame(response_T2)

# add column indicating whether genes are significantly up or downregulated
difexp_info$DE <- "no"
difexp_info$DE[difexp_info$log2FoldChange > 0 & difexp_info$padj < 0.05] <- "up"
difexp_info$DE[difexp_info$log2FoldChange < 0 & difexp_info$padj < 0.05] <- "down"

# generate volcano plot, -log10 of adjusted p-val vs log2 of FC
ggplot2::ggplot(data = difexp_info, mapping = aes(x = log2FoldChange, y = -log10(padj), col = DE)) + 
  geom_point() +
  theme_minimal() + 
  geom_hline(yintercept = -log10(0.05), col = "black", linetype = 'dashed') + 
  scale_color_manual(values = c("blue", "grey", "red"),
                     labels = c("Downregulated in R group", "Not DE", "Upregulated in R group")) +
  labs(color = "Expression", 
       x = "log 2 fold change",
       y = "log 10 corrected p-value",
       title = "Expression of genes in R vs. NR samples at T2")

```
<center>

**Figure 3:** FDR-corrected p-value (-log10) vs. fold change (log2) in expression of genes in whole blood samples from septic shock patients who responded to early supportive hemodynamic therapy (R samples) and patients who did not respond (NR), taken 48 hours after treatment. Threshold for differential expression was corrected p-value < 0.05.

</center>

Next, we'll create a heatmap of the genes differentially expressed between the R and NR groups at T2, and see how they cluster according to sample conditions.
```{r, message = FALSE, warning = FALSE, class.source = 'fold-show'}
# use normalized counts for heatmap matrix, filter for DE genes
normcounts_heatmap <- norm_counts[rownames(norm_counts) %in% rownames(sig_T2),]

# normalize rows
normcounts_heatmap <- t(scale(t(normcounts_heatmap)))

# set colours to scale from blue at min values to white at 0 to red at max values
heatmap_col = circlize::colorRamp2(c(min(normcounts_heatmap), 0, max(normcounts_heatmap)),
                                   c("blue", "white", "red"))

# set colours to correspond to experimental conditions
group_colours <- rainbow(n = 4)
names(group_colours) <- unique(sample_metadata$group)

# add annotation with experimental conditions and corresponding colours
hannot <- ComplexHeatmap::HeatmapAnnotation(df = data.frame(
  Group = sample_metadata$group),
  col = list(
    Group = group_colours
  ),
show_legend = TRUE)

# generate heatmap
DE_heatmap <- ComplexHeatmap::Heatmap(normcounts_heatmap, 
                                   col = heatmap_col,
                                   show_column_names = FALSE,
                                   show_row_names = FALSE,
                                   top_annotation = hannot,
                                   show_heatmap_legend = FALSE,
                                   column_title = "Genes differentially expressed in R vs. NR groups at T1")

DE_heatmap
```
<center>

**Figure 4:** Heatmap of genes differentially expressed in septic shock patients who responded to hemodynamic therapy and patients who did not respond 48 hours after treatment (groups R_T2 and NR_T2, respectively). Blue denotes low gene expression and red denotes high gene expression.

</center>

The R_T2 and NR_T2 samples largely clustered apart from each other, while the R_T1 and NR_T1 samples are not. This is as expected, since the genes being plotted are differentially expressed between R and NR at T2, while no genes are differentially expressed between R and NR at T1. There is also significant separation between the T1 and T2 groups for R, but not for NR, indicating that many of these genes change in expression in responders over the 48 hours after hemodynamic therapy, but not in non-responders.

# Interpretation and documentation
* This dataset was of interest because the distinct conditions within it are not a result of differential treatment or circumstances, but of differential biological response to a treatment, and the inclusion of two different time points means that the dataset can be used to analyze how pre-existing biological differences lead to different epigenetic responses.

* In this dataset, the control condition is patients who were non-responders to early supportive hemodynamic therapy as treatment for septic shock and the test condition is patients who were responders. Samples collected at ICU admission also serve as a control for samples collected 48h later.

* There are 34 samples from responders and 28 samples from non responders. There are 32 samples taken at time point 1 and 32 sample staken at time point 2.

* There were expression values that were not unique for specific genes, but they were a very small proportion of the data. When multiple expression values mapped to one gene, we used their sum as the expression value for that gene. When an expression value mapped to multiple genes, we duplicated it and used it as the expression value for each gene it mapped to. Since there were very few such cases, the impact of this transformation on library size was negligible. 

* A large proportion of the expression values could not be mapped to HUGO symbols, but most of the unmapped values were very low, and were removed when filtering out low expression genes. 

* There were no outliers in the dataset, according to the threshold on Cook's distance applied by DESeq2. 

* Biological replicates were grouped together in the design model used for differential expression analysis. There were no technical replicates.

* The dataset covers 13673 genes across 62 samples.

# Bibliography
