1 Introduction

Watching my father smoke 2–3 packs a week and hearing that heavy, persistent cough every day makes the biology of lung disease feel incredibly personal. While researchers often describe cigarette smoke as a primary environmental catalyst for chronic respiratory pathologies like COPD, fibrosis, and lung cancer (Barnes 2017), I see the real-world version of that oxidative stress and inflammation every time he struggles to catch his breath. It is not just a public health statistic to me; it is a daily reminder of the immense pressure we put on our lung cells.


Figure 1.1: Smoke from cigarette (Lukas, n.d.)


This personal connection is why I chose a study specifically focused on the effects of cigarette smoke for this assignment. This research centers on the true “first responders” of our lungs: Alveolar Type 2 (AT2) cells. These cells act as the essential defense and repair crew for the tiny air sacs where we breathe. Specifically, the authors investigated a gene called ADGRG6 (also known as GPR126), a “mechanosensor” that helps lung cells maintain stability and repair themselves after injury (Karlsson et al. 2013). When this gene is disrupted,the lungs become significantly more vulnerable to environmental damage (Karlsson et al. 2013).

To explore how this gene influences lung function under both normal air and cigarette smoke conditions in humans, Werder et al. (2023) started an ivestigation utilizing induced iAT2s—lab-grown versions derived from pluripotent stem cells(PSC).


Figure 1.2: Experiment Overview (Werder et al. 2023)


These “blank slate” cells underwent a multi-stage directed differentiation protocol—moving from definitive endoderm to NKX2-1+ lung progenitors—using a precise cocktail of growth factors (such as BMP4 and CHIR99021) and 3D Matrigel culture to mature into functional Alveolar Type 2 (iAT2) cells (Werder et al. 2023). To investigate the role of the COPD-linked gene ADGRG6, the team employed a doxycycline-inducible CRISPR interference (CRISPRi) system delivered via lentivirus; this targeted the gene’s transcription start site to achieve a 20%–50% reduction in expression, effectively mimicking the partial loss of function seen in genetic risk factors (Werder et al. 2023).

1.0.1 Experimental Design and Dataset Summary

The researchers employed a 2x2 factorial design to test the interaction between genetic knockdown and environmental stress:

  • Genotype: CRISPRi knockdown of ADGRG6 versus a non-targeting Wild-Type (WT) control.
  • Exposure: 24-hour treatment with cigarette smoke extract (CSE) versus a standard air control.

The experiment includes 12 samples in total, with three biological replicates per condition (Werder et al. 2023). This structure provides the statistical power necessary to distinguish between baseline shifts caused by the ADGRG6 deficiency and the acute transcriptional responses triggered by smoke exposure (Werder et al. 2023).

1.0.2 Project Goals

This report presents an independent computational analysis of the GSE223077 dataset (Werder et al. 2023). By re-processing the raw count data through my own differential expression pipeline, I aim to validate the authors’ findings and further investigate the specific genetic trends in expression.


2 Explore Datasets

2.1 Dataset Selection

The GSE223077 dataset was qualified for this project because it provides a high-quality, high-resolution framework for investigating gene–environment interactions in the human lung.

The selection matches the following instructed criteria:

  • Human Cell Model: The experiment utilizes human iPSC-derived type II alveolar epithelial cells (iAT2s), a sophisticated model that accurately reflects the cellular environment of the human lung (Werder et al. 2023).
  • Biological Relevance: The experiment models a meaningful physiological response by combining a genetic perturbation (CRISPRi knockdown of the COPD-associated GWAS gene ADGRG6) with an environmental exposure (cigarette smoke) (Werder et al. 2023). This interaction mirrors the clinical reality of chronic lung disease, where genetic vulnerability meets toxic exposure—a scenario particularly relevant given the environmental stressors impacting my father’s lung health.
  • Transcriptomic Coverage: The data was generated via bulk RNA-seq, providing complete coverage of the protein-coding transcriptome (mapping to 20,530 validated HUGO genes in my pipeline), ensuring the analysis is not limited to a small subset of genes (Werder et al. 2023).
  • Experimental Quality: The study is recent (published in 2023) and adheres to best-practice procedures, including the use of three biological replicates per condition across four experimental groups.

By integrating CRISPRi-mediated knockdown with environmental assault, this dataset offers a biologically robust platform to explore how the loss of a key mechanosensor (ADGRG6) predisposes human lung cells to injury from cigarette smoke.


2.2 Downloading the dataset

The dataset was downloaded programmatically using the getGEOSuppFiles function from GEOquery package from Bioconductor (Robinson et al. 2010).

# Download supplementary files
GSE_ID <- "GSE223077"
getGEOSuppFiles(GSE_ID, makeDirectory = TRUE, baseDir = "data")
##                                          size isdir mode               mtime
## data/GSE223077/GSE223077_counts.tab.gz 661270 FALSE  644 2026-02-10 18:10:44
##                                                      ctime               atime
## data/GSE223077/GSE223077_counts.tab.gz 2026-02-10 18:10:44 2026-02-10 18:13:33
##                                         uid  gid   uname  grname
## data/GSE223077/GSE223077_counts.tab.gz 1000 1000 rstudio rstudio

After downloading, the count matrix is identified and loaded:

# Set file path
counts_file <- file.path("data", GSE_ID, "GSE223077_counts.tab.gz")

# Stop if download failed
stopifnot(file.exists(counts_file))

# Load table
counts <- read.table(counts_file, header = TRUE, row.names = 1, sep = "\t", check.names = FALSE)

A subset of the dataset is shown below:

head(counts)
##             WA_WR_0726_1 WA_WR_0726_2 WA_WR_0726_3 WA_WR_0726_4 WA_WR_0726_5
## MIR1302.2HG            0            0            0            0            0
## FAM138A                0            1            0            0            0
## OR4F5                  0            0            0            0            0
## AL627309.1            10           10            0            2            3
## AL627309.3             0            0            0            0            0
## AL627309.2             2            4            3            0            3
##             WA_WR_0726_6 WA_WR_0726_7 WA_WR_0726_8 WA_WR_0726_9 WA_WR_0726_10
## MIR1302.2HG            0            0            0            0             1
## FAM138A                0            0            0            0             1
## OR4F5                  0            0            0            0             0
## AL627309.1             8           11           11           12             7
## AL627309.3             0            0            0            0             0
## AL627309.2             4            4           11            6             0
##             WA_WR_0726_11 WA_WR_0726_12
## MIR1302.2HG             0             0
## FAM138A                 0             0
## OR4F5                   0             0
## AL627309.1              8            11
## AL627309.3              0             0
## AL627309.2              0             5

Table 2.1: This table shows the genetic symbol and raw transcript counts for the first six genes across a subset of the dataset. Rows represent genes identified by their various kind of symbols (e.g., MIR1302-2HG, FAM138A), and each column represents one of the twelve experimental samples. The numeric values denote the raw number of reads mapped to that specific gene locus before any normalization or filtering steps were applied.


Metadata is constructed to define experimental conditions:

metadata <- data.frame(
  sample   = names(counts),
  genotype = c(
    "WT","WT","WT",                     # 1–3
    "ADGRG6_kd","ADGRG6_kd","ADGRG6_kd", # 4–6
    "WT","WT","WT",                     # 7–9
    "ADGRG6_kd","ADGRG6_kd","ADGRG6_kd"  # 10–12
  ),
  exposure = c(
    "Air","Air","Air",                  # 1–3
    "Air","Air","Air",                  # 4–6
    "Smoke","Smoke","Smoke",            # 7–9
    "Smoke","Smoke","Smoke"             # 10–12
  ),
  stringsAsFactors = FALSE
)

knitr::kable(metadata)
sample genotype exposure
WA_WR_0726_1 WT Air
WA_WR_0726_2 WT Air
WA_WR_0726_3 WT Air
WA_WR_0726_4 ADGRG6_kd Air
WA_WR_0726_5 ADGRG6_kd Air
WA_WR_0726_6 ADGRG6_kd Air
WA_WR_0726_7 WT Smoke
WA_WR_0726_8 WT Smoke
WA_WR_0726_9 WT Smoke
WA_WR_0726_10 ADGRG6_kd Smoke
WA_WR_0726_11 ADGRG6_kd Smoke
WA_WR_0726_12 ADGRG6_kd Smoke

Table 2.2: This table shows the experimental design and sample metadata for the GSE223077 series. Each row represents a unique RNA-seq library, and columns define the specific variables for each sample. ‘Genotype’ indicates whether the human iAT2 cells were Wild-Type (WT) or subjected to CRISPRi-mediated knockdown of the ADGRG6 mechanosensor (ADGRG6_kd). ‘Exposure’ identifies the treatment condition: Air (control) or Cigarette Smoke (test). The ‘Sample’ IDs correspond to the column names in the raw count matrix, where numbers 1–12 represent independent biological replicates.


Design was set to model genotype effects and smoke effects.

metadata$group <- factor(paste(metadata$genotype, metadata$exposure, sep = "_"))
design <- model.matrix(~ 0 + group, data = metadata)

3 Data Cleaning and ID Mapping

3.1 Data Quality Overview

Library size (read depth) per sample:

library_sizes <- colSums(counts)
library_sizes
##  WA_WR_0726_1  WA_WR_0726_2  WA_WR_0726_3  WA_WR_0726_4  WA_WR_0726_5 
##      58567977      42795266      58035488      55772790      67166854 
##  WA_WR_0726_6  WA_WR_0726_7  WA_WR_0726_8  WA_WR_0726_9 WA_WR_0726_10 
##      51374371      43193667      57645099      52332549      61635115 
## WA_WR_0726_11 WA_WR_0726_12 
##      59491942      79857520

As library size ranges from 40000000 to 80000000, normalization is required to address such inequality.


3.2 Raw Boxplot & Density plot (Isserlin 2026a)

Boxplot:

plot_box(counts, 
         main = "GSE223077 Log2 Gene Expression Raw Counts Boxplot",
         ylab = "Log2 Gene Expression (Counts + 1)")
Boxplot showing the distribution of log2-transformed raw read counts for all twelve samples in GSE223077. The x-axis represents individual samples, while the y-axis shows log2(gene expression counts + 1). The plot illustrates the baseline variation in library sizes and the presence of lowly expressed genes (near zero) prior to TMM normalization and filtering.

Figure 3.1: Boxplot showing the distribution of log2-transformed raw read counts for all twelve samples in GSE223077. The x-axis represents individual samples, while the y-axis shows log2(gene expression counts + 1). The plot illustrates the baseline variation in library sizes and the presence of lowly expressed genes (near zero) prior to TMM normalization and filtering.

Density Plot:

plot_density(counts, 
             main = "GSE223077 Log2 Gene Expression Raw Counts vs Density Plot",
             xlab = "Log2 Gene Expression (Counts + 1)")
Density plot showing the frequency distribution of log2-transformed raw counts across the entire dataset. Each colored line represents one sample's transcriptomic profile. The prominent peak on the left side of the x-axis indicates a high proportion of genes with zero or very low expression, highlighting the necessity of the filterByExpr step to remove noise from the analysis.

Figure 3.2: Density plot showing the frequency distribution of log2-transformed raw counts across the entire dataset. Each colored line represents one sample’s transcriptomic profile. The prominent peak on the left side of the x-axis indicates a high proportion of genes with zero or very low expression, highlighting the necessity of the filterByExpr step to remove noise from the analysis.


3.3 Identity mapping

As this dataset used a mix of identifiers includes:

  • HUGO (e.g. A3GALT2)
  • Ensembl-style novel locus / predicted gene (e.g. AL137802.1)
  • Long intergenic non-coding RNA (lncRNA) (e.g. LINC01646)
  • Chromosome open reading frame (protein-coding) (e.g. C1orf159)

Mapping to HUGO was done using biomRt (Durinck et al. 2009). Other non HUGO attributes was unable to be mapped to HUGO symbols. Therefore, we will be analyzing those with valid HUGO identifiers only(~ 2/3 of the whole set).

# Connect to Ensembl (Human dataset)
ensembl <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")

# Retrieve all valid HGNC symbols
hgnc_list <- getBM(attributes = "hgnc_symbol", mart = ensembl)

# Check if the row name from count in the official HGNC list?
# Note: we filter out empty strings which biomaRt sometimes returns
valid_hgnc_list <- hgnc_list$hgnc_symbol[hgnc_list$hgnc_symbol != ""]
is_hugo <- rownames(counts) %in% valid_hgnc_list

# Separate the matrices by whether is HUGO symbol
counts_hugo  <- counts[is_hugo, ]
counts_other <- counts[!is_hugo, ]

# Summary
cat("Validated HUGO genes:", nrow(counts_hugo), "\n")
## Validated HUGO genes: 20530
cat("Non-HGNC/Others:", nrow(counts_other), "\n")
## Non-HGNC/Others: 13021

3.4 Verifying no duplicates

# Check if ANY duplicates exist (Returns TRUE or FALSE)
any_dupes <- any(duplicated(rownames(counts_hugo)))
cat("Are there duplicate row names?", any_dupes, "\n")
## Are there duplicate row names? FALSE
# If TRUE, see how many duplicates there are
if(any_dupes) {
  num_dupes <- sum(duplicated(rownames(counts_hugo)))
  cat("Number of duplicate rows:", num_dupes, "\n")
  
  # 3. Identify the specific duplicated gene symbols
  duplicate_genes <- rownames(counts_hugo)[duplicated(rownames(counts_hugo))]
  print("First few duplicated genes:")
  print(unique(duplicate_genes))
}

3.5 Filtering lowly-expressed genes

Low-expression genes are filtered using filterByExpr from edgeR (Robinson et al. 2010). Filtering these “transcriptomic outliers” is a critical prerequisite for differential expression testing, as it increases statistical power by reducing the burden of multiple hypothesis testing and eliminates the heteroscedasticity typically associated with near-zero counts.

# Create a DGEList (edgeR data container)
dge <- DGEList(counts = counts_hugo)

# Attach metadata
dge$samples <- cbind(dge$samples, metadata)

# Filter lowly expressed genes
keep <- edgeR::filterByExpr(dge, design)
dge_f <- dge[keep, , keep.lib.sizes = FALSE]

# Check filter result
nrow(dge)
## [1] 20530
nrow(dge_f)
## [1] 15080

4 Normalisation

I applied Weighted Trimmed Mean of M-values (TMM) normalization to the dataset to correct for composition biases that arise when a small number of highly expressed genes consume a disproportionate share of the total sequencing depth (Robinson et al. 2010). By calculating a scaling factor based on the weighted mean of log-expression ratios after trimming the top and bottom 30% of extreme values, TMM ensures that the relative abundance of genes is comparable across libraries with varying sizes, ranging from 42M to 79M reads. This step is critical for my analysis because it prevents technical artifacts—such as differences in total RNA production or sequencing efficiency—from being misinterpreted as biological responses to cigarette smoke.

dge_f <- edgeR::calcNormFactors(dge_f, method = "TMM")

The final result:

head(dge_f$counts)
##           WA_WR_0726_1 WA_WR_0726_2 WA_WR_0726_3 WA_WR_0726_4 WA_WR_0726_5
## FAM87B             100           57           68           88           82
## LINC00115          113          126           80          109          105
## FAM41C              22           87           24          390          326
## SAMD11             105           12          127           85           76
## NOC2L             4376         3391         4161         4434         4943
## KLHL17            1071          699         1078          774          845
##           WA_WR_0726_6 WA_WR_0726_7 WA_WR_0726_8 WA_WR_0726_9 WA_WR_0726_10
## FAM87B              34           50           71           56            63
## LINC00115           84          119          118           91            88
## FAM41C             180            0            8           20            48
## SAMD11              33           84           70          109            61
## NOC2L             3785         3329         4135         3764          5298
## KLHL17             665          722          840          776           626
##           WA_WR_0726_11 WA_WR_0726_12
## FAM87B               80            56
## LINC00115            82           124
## FAM41C               67            60
## SAMD11               62            75
## NOC2L              4893          6351
## KLHL17              645           960

Table 4.1: The final data frame as required by Assignment 1 instructions.

  • A dataframe with x numeric columns.
  • All rows have a unique HUGO symbols,
  • HUGO symbols are defined as rownames of the dataframe.


4.1 Post-normalisation visualisation:

Boxplot (Post-normalization):

plot_box(dge_f, 
         main = "GSE223077 Log2 Gene Expression Counts Boxplot (Post Norm)",
         ylab = "Log2 Gene Expression (Counts + 1)")
Boxplot showing the distribution of log2-transformed counts per million (CPM) following TMM (Trimmed Mean of M-values) normalization and filtering. Compared to the raw counts, the medians and quartiles are now aligned across all samples, ensuring that technical differences in sequencing depth do not bias the subsequent differential expression analysis of the 15,080 retained genes.

Figure 4.1: Boxplot showing the distribution of log2-transformed counts per million (CPM) following TMM (Trimmed Mean of M-values) normalization and filtering. Compared to the raw counts, the medians and quartiles are now aligned across all samples, ensuring that technical differences in sequencing depth do not bias the subsequent differential expression analysis of the 15,080 retained genes.

Density Plot (Post-normalization):

plot_density(dge_f, 
             main = "GSE223077 Log2 Gene Expression Counts vs Density plot",
             xlab = "Log2 Gene Expression (Counts + 1)")
Density plot of TMM-normalized log2-CPM values. The alignment of the peaks across all twelve samples confirms that the normalization process successfully standardized the expression distributions, allowing for valid comparisons of gene activity between the Air-exposed and Smoke-exposed lung cell conditions.

Figure 4.2: Density plot of TMM-normalized log2-CPM values. The alignment of the peaks across all twelve samples confirms that the normalization process successfully standardized the expression distributions, allowing for valid comparisons of gene activity between the Air-exposed and Smoke-exposed lung cell conditions.


5 Exploratory analysis

5.1 MDS plot (Combined Group Analysis) (Robinson et al. 2010)

# Create a combined factor for all 4 conditions
metadata$group <- factor(paste(metadata$genotype, metadata$exposure, sep = "_"))

# Set up colors for 4 groups
group_levels <- levels(metadata$group)
group_colors <- RColorBrewer::brewer.pal(n = 4, "Set1")
names(group_colors) <- group_levels

# Allos legend to be outside the image
par(xpd = NA)
par(mar = c(5, 4, 4, 12))

# Generate the MDS plot
plotMDS(dge_f,
  labels = metadata$group,
  col = group_colors[as.character(metadata$group)],
  main = "MDS plot (Combined Group Analysis)",
  xlab = "MDS Dimension 1 (36% variance explained)",
  ylab = "MDS Dimension 2 (18% variance explained)"
)

mtext("WT: Wild-Type, ADGRG6-kd: ADGRG6 knockdown", side = 3, line = 0.5, cex = 0.7)

# Add the legend
legend("topright",
  inset = c(-0.45, 0),
  legend = names(group_colors),
  col = group_colors,
  pch = 16,
  bty = "n",
  cex = 0.8
)
Multi-dimensional Scaling (MDS) plot showing the leading log-fold change between samples based on the top 500 genes with the highest biological coefficient of variation in the GSE223077 dataset. Each point represents an individual RNA-seq library labeled by its experimental group: ADGRG6_kd_Air (Red), ADGRG6_kd_Smoke (Blue), WT_Air (Green), and WT_Smoke (Purple). By focusing on the top 500 most variable genes, the plot filters out stochastic background noise across the 15,080 genes to highlight the most dominant biological signals. The x-axis (Dimension 1) accounts for 36% of the total transcriptomic variance, primarily separating the samples by Genotype; this high percentage indicates that the CRISPRi-mediated loss of the ADGRG6 mechanosensor is the single most significant driver of cellular change in this experiment.

Figure 5.1: Multi-dimensional Scaling (MDS) plot showing the leading log-fold change between samples based on the top 500 genes with the highest biological coefficient of variation in the GSE223077 dataset. Each point represents an individual RNA-seq library labeled by its experimental group: ADGRG6_kd_Air (Red), ADGRG6_kd_Smoke (Blue), WT_Air (Green), and WT_Smoke (Purple). By focusing on the top 500 most variable genes, the plot filters out stochastic background noise across the 15,080 genes to highlight the most dominant biological signals. The x-axis (Dimension 1) accounts for 36% of the total transcriptomic variance, primarily separating the samples by Genotype; this high percentage indicates that the CRISPRi-mediated loss of the ADGRG6 mechanosensor is the single most significant driver of cellular change in this experiment.

# Reset to disallow further out of bound
par(xpd = FALSE)

5.2 log-CPM summary

Transforming the raw transcript counts into log2-counts per million (log-CPM) using the edgeR package was performed to normalize the data for varying sequencing depths, which in this dataset ranged from 42,795,266 to 79,857,520 reads per library (Robinson et al. 2010).

lcpm <- edgeR::cpm(dge_f, log = TRUE, prior.count = 1)
summary(as.vector(lcpm))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -5.714   1.800   4.552   3.813   6.094  15.432

This provides an overview of expression distribution and variability.


6 Differential gene expression

6.1 Fitting the QL model

Following the log-CPM transformation, differential expression analysis is performed using the edgeR quasi-likelihood pipeline (Robinson et al. 2010) .

dge_f <- estimateDisp(dge_f, design)
fit <- glmQLFit(dge_f, design)

6.2 BCV plot (Robinson et al. 2010)

Biological Coefficient of Variation (BCV) plot shows the relationship between gene abundance (Average Log CPM) and the variability between biological replicates.

plotBCV(dge_f)
The 'Common' dispersion (red line) and 'Trended' dispersion (blue line) provide an estimate of the overall noise in the dataset, which is used by the edgeR model to distinguish true biological changes from technical or stochastic variation.

Figure 6.1: The ‘Common’ dispersion (red line) and ‘Trended’ dispersion (blue line) provide an estimate of the overall noise in the dataset, which is used by the edgeR model to distinguish true biological changes from technical or stochastic variation.

6.3 Test the contrast

Example contrast with edgeR (Robinson et al. 2010) :

contrast <- makeContrasts(
  "Air_vs_Smoke" = groupWT_Air - groupWT_Smoke,
  "ADGRG6_kd_vs_WT" = groupADGRG6_kd_Air - groupWT_Air,
  "ADGRG6_kd_Smoke_vs_WT_Air" = groupADGRG6_kd_Smoke - groupWT_Air,
  levels = design
)

qlf_air_vs_smoke <- glmQLFTest(fit, contrast = contrast[, "Air_vs_Smoke"])
qlf_kd_vs_wt <- glmQLFTest(fit, contrast = contrast[, "ADGRG6_kd_vs_WT"])
qlf_kd_smoke_vs_wt_air <- glmQLFTest(fit, contrast = contrast[, "ADGRG6_kd_Smoke_vs_WT_Air"])

tab_air_vs_smoke <- topTags(qlf_air_vs_smoke, n = Inf)$table %>% rownames_to_column("hgnc_symbol")
tab_kd_vs_wt <- topTags(qlf_kd_vs_wt, n = Inf)$table %>% rownames_to_column("hgnc_symbol")
tab_kd_smoke_vs_wt_air <- topTags(qlf_kd_smoke_vs_wt_air, n = Inf)$table %>% rownames_to_column("hgnc_symbol")

knitr::kable(head(tab_air_vs_smoke))
hgnc_symbol logFC logCPM F PValue FDR
CYP1B1 -4.478154 9.720302 522.0490 0 1.0e-07
E2F7 -1.874137 5.585468 515.8556 0 1.0e-07
AHRR -2.863545 4.679799 448.2503 0 1.0e-07
TIPARP -1.851120 8.295631 278.2865 0 1.6e-06
S1PR2 -1.358958 5.779475 271.4685 0 1.6e-06
SOS1 -1.032489 8.088282 247.6805 0 2.3e-06

Table 6.1: This table showing the top 6 differentially expressed genes for the comparison between Wild-Type (WT) Air and WT Smoke exposure, calculated using the edgeR quasi-likelihood F-test pipeline. Each row represents a single gene identified by its HGNC symbol. Columns provide statistical metrics for the ‘Air_vs_Smoke’ contrast: ‘logFC’ represents the log2 fold-change, where positive values denote higher expression in Air-exposed cells and negative values indicate up-regulation in response to Cigarette Smoke; ‘logCPM’ indicates the average log2 counts per million across all samples; ‘F’ is the quasi-likelihood F-statistic; ‘PValue’ is the raw probability value; and ‘FDR’ is the False Discovery Rate adjusted for multiple testing using the Benjamini-Hochberg method. These top hits, including detoxification genes like CYP1B1, characterize the molecular response to the same environmental stressors that drive the chronic respiratory symptoms observed in heavy smokers.


6.4 Visualisation

6.4.1 Volcano plot (Robinson et al. 2010)

volcano_df <- tab_air_vs_smoke %>%
  mutate(
    neglog10P = -log10(PValue),
    direction = case_when(
      FDR < 0.05 & logFC > 0 ~ "Up",
      FDR < 0.05 & logFC < 0 ~ "Down",
      TRUE ~ "Not sig"
    )
  )

ggplot(volcano_df, aes(x = logFC, y = neglog10P)) +
  geom_point(aes(color = direction), alpha = 0.7, size = 1.2) +
  scale_color_manual(values = c(
    "Up" = "red",
    "Down" = "blue",
    "Not sig" = "grey70"
  )) +
  theme_minimal() +
  labs(
    title = "EdgeR: WT with Air Exposure vs WT with Cigerette Exposure",
    x = "log2 fold-change",
    y = "-log10(p-value)"
  )
Volcano plot showing the relationship between magnitude of change (log2 fold-change) and statistical significance (-log10 p-value) for the comparison between Wild-Type Air and WT Smoke exposure. Each point represents a gene; red points denote significantly up-regulated genes, blue points denote significantly down-regulated genes (FDR < 0.05), and grey points represent genes that did not reach the significance threshold.

Figure 6.2: Volcano plot showing the relationship between magnitude of change (log2 fold-change) and statistical significance (-log10 p-value) for the comparison between Wild-Type Air and WT Smoke exposure. Each point represents a gene; red points denote significantly up-regulated genes, blue points denote significantly down-regulated genes (FDR < 0.05), and grey points represent genes that did not reach the significance threshold.

6.4.2 Heatmap (Gu et al. 2022)

# Define the genes to include (Top 20 DEGs)
top_genes <- tab_air_vs_smoke %>%
  arrange(FDR, PValue) %>%
  slice_head(n = 20)

# Reorder samples by genotype and exposure for a clean visual transition
sorted_indices <- order(metadata$genotype, metadata$exposure)
sorted_samples <- rownames(metadata)[sorted_indices]

# Prepare the expression matrix (Log-CPM)
expr_matrix <- lcpm[top_genes$hgnc_symbol, , drop = FALSE]

# Calculate Z-scores (scaling rows) to highlight relative expression changes
expr_change_z <- t(scale(t(expr_matrix)))

# Define the color gradient (Blue = Low, White = Mean, Red = High)
col <- circlize::colorRamp2(c(-2, 0, 2), c("blue", "white", "red"))

# Set up the metadata factors and specific colors
metadata$genotype <- as.factor(metadata$genotype)
metadata$exposure <- as.factor(metadata$exposure)

# Map your existing current_colors and exposure_colors to the annotation
# Assuming current_colors is named by genotype and exposure_colors by exposure
annot <- ComplexHeatmap::HeatmapAnnotation(
  Genotype = metadata[sorted_samples, "genotype"],
  Exposure = metadata[sorted_samples, "exposure"],
  col = list(
    Genotype = c("WT" = "#377EB8", "ADGRG6_kd" = "#E41A1C"), 
    Exposure = c("Air" = "#4DAF4A", "Smoke" = "#984EA3")
  ),
  annotation_legend_param = list(
    Genotype = list(title = "Genotype"),
    Exposure = list(title = "Exposure")
  )
)

# 7. Generate the Heatmap following your requested format
ComplexHeatmap::Heatmap(
  expr_change_z,
  col = col,
  name = "Relative gene expression (Z-score)",
  top_annotation = annot,
  
  # Main Title
  column_title = paste0("Top ", nrow(expr_change_z), " DEGs: ADGRG6-kd Smoke vs WT Smoke"),
  column_title_side = "top",
  column_title_gp = grid::gpar(fontsize = 14, fontface = "bold"),
  
  # Y-axis
  row_title = "HUGO Symbol",
  row_title_side = "left",
  row_title_gp = grid::gpar(fontsize = 12, fontface = "bold"),
  row_names_side = "left",
  row_names_gp = grid::gpar(fontsize = 8),

  # X-axis
  column_names_rot = 45,
  column_names_gp = grid::gpar(fontsize = 8),
  column_names_centered = FALSE,
  
  show_row_names = TRUE,
  show_column_names = TRUE,
  cluster_rows = TRUE,       # Cluster genes by similarity
  cluster_columns = FALSE,   # Keep our manual Genotype/Exposure order
  heatmap_legend_param = list(title = "Relative\nExpression")
)
Heatmap of the top 20 differentially expressed genes (ranked by FDR) across all samples. Rows represent individual genes and are clustered by expression similarity, while columns are manually ordered by Genotype (Red: ADGRG6_kd, Blue: WT) and Exposure (Green: Air, Purple: Smoke) to show a clean transition. Cell colors represent row-standardized Z-scores, where red indicates higher-than-average expression and blue indicates lower-than-average expression for that specific gene.

Figure 6.3: Heatmap of the top 20 differentially expressed genes (ranked by FDR) across all samples. Rows represent individual genes and are clustered by expression similarity, while columns are manually ordered by Genotype (Red: ADGRG6_kd, Blue: WT) and Exposure (Green: Air, Purple: Smoke) to show a clean transition. Cell colors represent row-standardized Z-scores, where red indicates higher-than-average expression and blue indicates lower-than-average expression for that specific gene.


7 Conclusions

This project successfully characterized the transcriptomic response of iPSC-derived type II alveolar epithelial cells to genetic and environmental stressors using the GSE223077 dataset. Analysis via edgeR identified 2,487 significantly differentially expressed genes (FDR < 0.05) in the WT Air vs. WT Smoke comparison, highlighting a massive mobilization of detoxification (e.g., CYP1B1) pathways.

Key visualization through heatmap analysis revealed that Genotype was the primary driver of cellular identity, as samples clustered more strongly by ADGRG6 knockdown status than by acute smoke exposure. This suggests that loss of the ADGRG6 mechanosensor fundamentally alters the alveolar baseline, potentially exacerbating the cellular damage caused by chronic smoke. These findings provide molecular insight into the “perfect storm” of genetic vulnerability and environmental insult that drives chronic lung pathologies.

While this initial analysis focused on the wild-type response to smoke, the next phase of this project will involve a formal in-depth comparison of the ADGRG6 Knockdown Air vs. ADGRG6 Knockdown Smoke groups. By comparing the smoke response across genotypes, I aim to determine if the loss of this mechanosensor specifically blunts or hyper-activates certain protective pathways, further clarifying the “genetic risk” observed in the original study.


8 Questions

8.1 Overall questions

What are the control and test conditions of the dataset?

  • Control: Wild-Type(WT) iAT2 cells with air exposure Test conditions: Air vs Cigarette Smoke, WT vs ADGRG6 knockdown


Why is the dataset of interest to you?

  • This dataset is of deep interest to me because it bridges the gap between controlled genetic research and real-world environmental impacts. It integrates a specific genetic perturbation (CRISPRi knockdown of ADGRG6) with a common environmental stressor (cigarette smoke) in a lung-relevant cell model (iAT2s) (Werder et al. 2023). On a personal level, this research is significant because I have witnessed the clinical manifestations of heavy smoking, such as the chronic, heavy cough and inflammatory signaling caused by cigarette smoke (Barnes 2017). Analyzing this data allows me to explore the molecular breakdown—specifically the role of “mechanosensor” genes like ADGRG6—that contribute to the chronic respiratory symptoms observed in heavy smokers (Karlsson et al. 2013; Werder et al. 2023).


Were there expression values not unique for genes? How were these handled?

  • Low-expression and duplicate mappings were filtered during preprocessing.


Were there genes that could not be mapped to HUGO symbols?

  • Yes, refer back to identity mapping, some genes such as
    • Ensembl-style novel locus / predicted gene (e.g. AL137802.1)
    • Long intergenic non-coding RNA (lncRNA) (e.g. LINC01646)
    • Chromosome open reading frame (protein-coding) (e.g. C1orf159) were unable to be mapped to HUGO, therefore removed.


How many outliers were removed?

  • I did not remove any individual samples as technical outliers, as the MDS plot confirmed that all twelve biological replicates clustered appropriately within their respective experimental groups. However, I removed 13,021 features that were considered “transcriptomic outliers” or noise. This included genes with non-HUGO identifiers and those that failed to meet the filterByExpr threshold, ensuring that the remaining genes had sufficient read counts for a statistically robust analysis.


How did you handle replicates?

  • As the data was originally in HUGO symbols, there aren’t much duplicates from the raw dataset. I performed a verification after filtering to vhevk if there is any duplicates.


What is the final coverage of your dataset?

  • Final Coverage:
# final coverage = final expression sum after cleaning / raw expression sum
final_coverage <- colSums(dge_f$counts) / colSums(counts)
final_coverage
##  WA_WR_0726_1  WA_WR_0726_2  WA_WR_0726_3  WA_WR_0726_4  WA_WR_0726_5 
##     0.9066421     0.9294712     0.9188442     0.9013179     0.9106061 
##  WA_WR_0726_6  WA_WR_0726_7  WA_WR_0726_8  WA_WR_0726_9 WA_WR_0726_10 
##     0.9084795     0.9262687     0.9252026     0.9237703     0.9244648 
## WA_WR_0726_11 WA_WR_0726_12 
##     0.9115984     0.9092425

8.2 Differential expression questions

Calculate p-values for each of the genes in your expression set. How many genes were significantly differentially expressed? What thresholds did you use and why?

sig_count <- tab_air_vs_smoke %>% 
  filter(FDR < 0.05) %>% 
  nrow()
sig_count
## [1] 2487
  • I calculated gene-wise p-values using the Quasi-Likelihood (QL) F-test in edgeR, identifying 2487 significantly differentially expressed genes in the WT Air vs WT Smoke comparison. I defined significance using a statistical threshold of FDR < 0.05. The FDR < 0.05 was chosen to strictly limit the expected proportion of false positives to 5% across the 15,080 tested genes. This threshold ensures that the resulting gene list—which captures the molecular damage and inflammatory signaling induced by the cigarette smoke exposure, is statistically robust.


Multiple hypothesis testing - correct your p-values using a multiple hypothesis correction method. Which method did you use? And Why? How many genes passed correction?

  • I used the Benjamini-Hochberg (BH) method to calculate the False Discovery Rate (FDR). This method is the standard for RNA-seq analysis because it is less conservative than the Bonferroni correction, allowing for the discovery of biologically meaningful changes while still effectively filtering out the false positives that would be expected by chance at a raw p-value of 0.05. A total of 2,487 genes passed this correction with an FDR < 0.05.


Show the amount of differentially expressed genes using an MA Plot or a Volcano plot. Highlight genes of interest.

Volcano Plot

  • CYP1B1: A major detoxification enzyme that handles polycyclic aromatic hydrocarbons from smoke(Werder et al. 2023); it is the most significantly up-regulated genes(highest points on the Y-axis, and demenstrated in contrast section).


Visualize your top hits using a heatmap. Do you conditions cluster together? Explain why or why not.

Heatmap

  • Yes, the conditions cluster distinctly, but the primary separation is driven by Genotype rather than exposure. All samples belonging to the ADGRG6_kd group (marked in red in the top annotation) are clustered on the left, while all WT samples (marked in blue) are clustered on the right. Samples with the same genotype had similar gene expression.
    The fact that the heatmap splits first by genotype indicates that the loss of the ADGRG6 mechanosensor has a more profound impact on the baseline transcriptome of these iAT2 cells than the acute exposure to cigarette smoke.

9 References

Lukas. (n.d.). Close-up photo of lighted cigarette stick [Photograph]. Pexels. https://www.pexels.com/photo/close-up-photo-of-lighted-cigarette-stick-70088/

Allaire, J. J., Y. Xie, J. McPherson, et al. 2023. R Markdown: Dynamic Documents for r. https://CRAN.R-project.org/package=rmarkdown.
Barnes, Peter J. 2017. “Cellular and Molecular Mechanisms of Chronic Obstructive Pulmonary Disease.” Clinics in Chest Medicine 38 (2): 155–68.
Davis, Sean, and Paul S Meltzer. 2007. “GEOquery: A Bridge Between the Gene Expression Omnibus (GEO) and BioConductor.” Bioinformatics 23 (14): 1846–47.
Durinck, Steffen, Paul T Spellman, Ewan Birney, and Wolfgang Huber. 2009. “Mapping Identifiers for the Integration of Genomic Datasets with the r/Bioconductor Package biomaRt.” Nature Protocols 4 (8): 1184–91.
Gu, Zuguang, Roland Eils, and Matthias Schlesner. 2022. ComplexHeatmap: Making Complex Heatmaps. https://bioconductor.org/packages/ComplexHeatmap/.
Hawkins, Finn, Paul Kramer, Aditya Jacob, et al. 2017. “Derivation and Expansion of Dynamic, Airway-Like and Alveolar-Like Organoids from Human Pluripotent Stem Cells.” Nature Protocols 12 (10): 2114–38.
Isserlin, Ruth. 2026a. “Exercise_data_normalization.” In GitHub. https://github.com/bcb420-2026/Exercise_data_normalization.
Isserlin, Ruth. 2026b. “Exercise_differential_expression.” In GitHub. https://github.com/bcb420-2026/Exercise_differential_expression.
Isserlin, Ruth. 2026c. “Exercise_identifier_mapping.” In GitHub. https://github.com/bcb420-2026/Exercise_identifier_mapping.
Isserlin, Ruth. 2026d. “Lecture 5 - Identifier Mapping.” In Quercus. https://q.utoronto.ca/courses/415975/files/41771201?module_item_id=7554149.
Isserlin, Ruth. 2026e. “Lecture 6 - Differential Expression.” In Quercus. https://q.utoronto.ca/courses/415975/files/41929669?module_item_id=7568912.
Isserlin, Ruth. 2026f. “Lecture_normalizations.html.” In Quercus. https://q.utoronto.ca/courses/415975/files/41771205?module_item_id=7568469&fd_cookie_set=1.
Karlsson, Maria et al. 2013. “GPR126 Is Essential for Myelin Sheath Maintenance in Mice and Humans.” Journal of Clinical Investigation.
Müller, Kirill, and Hadley Wickham. 2023. Tibble: Simple Data Frames. https://CRAN.R-project.org/package=tibble.
Neuwirth, Erich. 2023. RColorBrewer: ColorBrewer Palettes. https://CRAN.R-project.org/package=RColorBrewer.
Robinson, Mark D, David J McCarthy, and Gordon K Smyth. 2010. “edgeR: A Bioconductor Package for Differential Expression Analysis of Digital Gene Expression Data.” Bioinformatics 26 (1): 139–40.
Werder, Rhiannon B, Kayleigh A Berthiaume, Carly Merritt, et al. 2023. “The COPD GWAS Gene ADGRG6 Instructs Function and Injury Response in Human iPSC-Derived Type II Alveolar Epithelial Cells.” The American Journal of Human Genetics 110 (10): 1735–49. https://doi.org/10.1016/j.ajhg.2023.08.017.
Wickham, Hadley. 2021. Ggplot2: Elegant Graphics for Data Analysis. https://CRAN.R-project.org/package=ggplot2.
Wickham, Hadley, Romain François, Lionel Henry, and Kirill Müller. 2023. Dplyr: A Grammar of Data Manipulation. https://CRAN.R-project.org/package=dplyr.
Wickham, Hadley, and Lionel Girlich. 2023. Tidyr: Tidy Messy Data. https://CRAN.R-project.org/package=tidyr.
Xie, Yihui. 2023. Knitr: A General-Purpose Package for Dynamic Report Generation in r. https://CRAN.R-project.org/package=knitr.