2.2 Demo data

In this tutorial, we will use a ground truth dataset called PBMC14k for demonstration purposes. It was generated from a Peripheral Blood Mononuclear Cells (PBMCs) dataset containing 10 known cell types, with 2,000 cells per type [Zheng et al., 2017]:

  1. We first rectified the gene symbol issues in the original dataset by referring to the gene annotation file (GRCh37.82) used in the original study. These issues included dash-to-dot conversion (e.g. “RP11-34P13.7” changed to “RP11.34P13.7”) and an “X” prefixed to symbols that start with a digit (e.g. “7SK” changed to “X7SK”).
  2. We then removed three cell populations (CD34+ cells, CD4+ helper T cells, and total CD8+ cytotoxic cells) from the dataset because of either low sorting purity or substantial overlap with other immune cells under the sorting strategy. The resulting dataset contains seven known cell types and 14,000 cells in total.
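
The symbol changes described in step 1 match what base R's `make.names()` does to strings that are not syntactically valid names; `read.csv()` applies it to column headers when `check.names = TRUE` (the default). The sketch below reproduces the effect on a few symbols and then undoes it with a lookup table. The toy vector `official` is hypothetical and stands in for the symbols in the annotation file, and the lookup is a simplified illustration rather than the exact procedure used to build PBMC14k. Whether the original file was altered on import or upstream is also an assumption here.

```r
# Toy gene symbols standing in for the annotation file (hypothetical subset)
official <- c("RP11-34P13.7", "7SK", "7SK", "MIR1302-10")

# make.names() replaces invalid characters with "." and prefixes "X" to
# names starting with a digit; unique = TRUE adds .1, .2, ... to duplicates
mangled <- make.names(official, unique = TRUE)
print(mangled)
# "RP11.34P13.7" "X7SK" "X7SK.1" "MIR1302.10"

# Reverse the change: map each mangled symbol back to its official symbol,
# then make duplicated official symbols unique again
lookup <- setNames(official, mangled)
restored <- make.unique(unname(lookup[mangled]))
print(restored)
# "RP11-34P13.7" "7SK" "7SK.1" "MIR1302-10"
```

Passing `check.names = FALSE` to `read.csv()` avoids this conversion at import time, at the cost of column names that may need backticks when used in R expressions.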

The original dataset is freely available from the Sequence Read Archive under accession number SRP073767 and from Zenodo.

How was the PBMC14k dataset generated from the original dataset?
## Step 1: rectify the invalid gene symbols
# "Filtered_DownSampled_SortedPBMC_data.csv" is the raw count matrix directly downloaded from Zenodo
counts <- read.csv("Filtered_DownSampled_SortedPBMC_data.csv", row.names = 1) 
d <- t(counts); dim(d) # it includes 21592 genes and 20000 cells

# "genesymbol_from_GTF_GRCh37.txt" contains the official gene ids and symbols extracted from the GTF file downloaded from https://ftp.ensembl.org/pub/grch37/current/gtf/homo_sapiens/
officialGene <- read.table("genesymbol_from_GTF_GRCh37.txt", header = TRUE, sep = "\t", quote = "", stringsAsFactors = FALSE); head(officialGene)
officialGene$dotted_symbol <- gsub("-", "\\.", officialGene$gene_name); officialGene$dotted_symbol <- make.unique(officialGene$dotted_symbol)
table(row.names(d) %in% officialGene$dotted_symbol); row.names(d)[! row.names(d) %in% officialGene$dotted_symbol] # two genes are not in: X7SK.1 and X7SK.2
# exact matches are used so that "." is not treated as a regex wildcard
row.names(d)[row.names(d) == "X7SK.1"] <- "7SK"; row.names(d)[row.names(d) == "X7SK.2"] <- "7SK.1"
table(row.names(d) %in% officialGene$dotted_symbol) # all true
row.names(officialGene) <- officialGene$dotted_symbol
officialGene <- officialGene[row.names(d),]
row.names(d) <- make.unique(officialGene$gene_name)

# "Labels.csv" contains the true labels of cell types and was directly downloaded from Zenodo
celltype <- read.csv("Labels.csv"); head(celltype)
table(celltype$x) # 2000 cells for each of 10 cell types: CD14+ Monocyte, CD19+ B, CD34+, CD4+ T Helper2, CD4+/CD25 T Reg, CD4+/CD45RA+/CD25- Naive T, CD4+/CD45RO+ Memory, CD56+ NK, CD8+ Cytotoxic T, CD8+/CD45RA+ Naive Cytotoxic
df <- data.frame(cell_barcode = colnames(d), trueLabel_full = celltype$x); dim(df)
truelabel_map <- c(`CD14+ Monocyte`="Monocyte", `CD19+ B`="B", `CD34+`="CD34pos", `CD4+ T Helper2`="CD4Th2", `CD4+/CD25 T Reg`="CD4Treg",
                  `CD4+/CD45RA+/CD25- Naive T`="CD4TN", `CD4+/CD45RO+ Memory`="CD4TCM", `CD56+ NK`="NK", `CD8+ Cytotoxic T`="CD8CTL", `CD8+/CD45RA+ Naive Cytotoxic`="CD8TN")
df$trueLabel <- as.character(truelabel_map[df$trueLabel_full])

## Step 2: extract 7 populations
df.14k <- df[df$trueLabel_full %in% c("CD14+ Monocyte", "CD19+ B", "CD4+/CD25 T Reg", "CD4+/CD45RA+/CD25- Naive T", "CD4+/CD45RO+ Memory", "CD56+ NK", "CD8+/CD45RA+ Naive Cytotoxic"),]
write.table(df.14k, file = "PBMC14k_trueLabel.txt", sep = "\t", row.names = FALSE, col.names = TRUE, quote = FALSE, append = FALSE) # row names are only row indices here, so they are not written

d.14k <- d[,df.14k$cell_barcode]
d.14k <- d.14k[rowSums(d.14k) > 0,]
write.table(d.14k, file = "PBMC14k_rawCount.txt", sep = "\t", row.names = TRUE, col.names = TRUE, quote = FALSE, append = FALSE) # row names are kept so the gene symbols are written; 17986 genes, 14000 cells
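
The effect of Step 2 can be seen on a small example. The sketch below uses a hypothetical 4 x 6 count matrix and made-up labels (none of the values come from the real data) to show why the gene count drops after cell populations are removed: a gene whose only counts fall in removed cells becomes all-zero and is filtered out by the `rowSums()` step.

```r
# Toy count matrix: 4 genes x 6 cells (hypothetical values)
d <- matrix(c(1, 0, 0, 2, 0, 3,
              0, 0, 0, 0, 0, 0,
              0, 4, 0, 0, 0, 0,
              5, 0, 1, 0, 2, 0),
            nrow = 4, byrow = TRUE,
            dimnames = list(paste0("gene", 1:4), paste0("cell", 1:6)))

# Toy labels, one per cell, in the same order as the matrix columns
labels <- data.frame(cell_barcode = colnames(d),
                     trueLabel = c("B", "CD34pos", "NK", "B", "NK", "CD34pos"),
                     stringsAsFactors = FALSE)

# Keep only the populations of interest, then subset the matrix by barcode
keep_types <- c("B", "NK")
labels_kept <- labels[labels$trueLabel %in% keep_types, ]
d_kept <- d[, labels_kept$cell_barcode, drop = FALSE]

# Drop genes with zero counts across the remaining cells:
# gene2 was always zero, gene3 was only expressed in a removed cell
d_kept <- d_kept[rowSums(d_kept) > 0, , drop = FALSE]
dim(d_kept)
rownames(d_kept)
```

Using `drop = FALSE` keeps the result a matrix even if only one gene or one cell survives the filter, which the original script does not need at its scale but is a safe habit for small inputs.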


The PBMC14k dataset is embedded in the scMINER R package and can be easily loaded by:

library(scMINER)
data("pbmc14k_rawCount")
dim(pbmc14k_rawCount)
## [1] 17986 14000
pbmc14k_rawCount[1:5,1:5]
## 5 x 5 sparse Matrix of class "dgCMatrix"
##               CACTTTGACGCAAT GTTACGGAAACGAA AGTCACGACAGGAG TTCGAGGACCAGTA
## AL627309.1                 .              .              .              .
## AP006222.2                 .              .              .              .
## RP11-206L10.3              .              .              .              .
## RP11-206L10.2              .              .              .              .
## RP11-206L10.9              .              .              .              .
##               CACTTATGAGTCGT
## AL627309.1                 .
## AP006222.2                 .
## RP11-206L10.3              .
## RP11-206L10.2              .
## RP11-206L10.9              .
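
Because the matrix is stored in sparse `dgCMatrix` form, the first rows print mostly as dots (zeros), so simple summaries are often more informative than a corner of the matrix. The sketch below computes a few such summaries with the Matrix package on a small hypothetical stand-in, `toy`; the same calls should apply to `pbmc14k_rawCount`, assuming it is a `dgCMatrix` as the printed output indicates.

```r
library(Matrix)

# Toy stand-in for pbmc14k_rawCount: 4 genes x 3 cells (hypothetical values)
toy <- sparseMatrix(i = c(1, 3, 3, 4), j = c(1, 1, 2, 3), x = c(2, 1, 5, 3),
                    dims = c(4, 3),
                    dimnames = list(paste0("gene", 1:4), paste0("cell", 1:3)))

# Fraction of entries that are non-zero
frac_nonzero <- nnzero(toy) / prod(dim(toy))

# Total counts per cell and number of detected genes per cell
umi_per_cell <- colSums(toy)
genes_per_cell <- colSums(toy > 0)

print(frac_nonzero)
print(umi_per_cell)
print(genes_per_cell)
```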