This script uses ComBat to remove batch effects from the bulk RNA-seq data (which was generated in three batches) and visualizes the results.

The script reads in values from bulk_tpm.txt, a TSV where each column is a sample and each row is a gene. For your convenience, you can download this file from http://mitra.stanford.edu/kundaje/pangwei/mesoderm_data/bulk_tpm.txt. The batch correction procedure might take a few minutes to run.

We first load in the data.

library(sva)
library(biomaRt)
source('analysis.r')
source('metadata.r')

bulkTPMData <- read.table('bulk_tpm.txt', header = TRUE)

# Remove celltypes in celltypesToExclude,
# remove celltypes in experimentMetadata that aren't actually in countData,
# and remove celltypes in countData that aren't in experimentMetadata.
# Then take the log of the data for subsequent analysis.
celltypesToExclude <- getCelltypesToExclude()
bulkExperimentMetadata <- getExperimentMetadata()
bulkExperimentMetadata <- filter(bulkExperimentMetadata,
                                 !Celltype %in% celltypesToExclude,
                                 SampleID %in% names(bulkTPMData))
row.names(bulkExperimentMetadata) <- bulkExperimentMetadata$SampleID
bulkTPMData <- dplyr::select(bulkTPMData,
                             match(bulkExperimentMetadata$SampleID, names(bulkTPMData)))
logBulkTPM <- log2(bulkTPMData + 1)

# The order of samples in countData should now match those in experimentMetadata. 
assert_that(all(bulkExperimentMetadata$SampleID == colnames(bulkTPMData)))

Next, we filter out all genes where the difference between the samples with the highest and lowest expression was less than 2 (in log2 TPM units, i.e., a 4-fold difference in expression).

maxTPM <- apply(logBulkTPM, 1, max)
minTPM <- apply(logBulkTPM, 1, min)
logBulkTPM <- logBulkTPM[(maxTPM >= 2) & ((maxTPM - minTPM) >= 2), ]

We then use the ComBat algorithm, as implemented in the sva package, to remove batch effects. This sometimes results in small negative values for the corrected gene expressions, which we set to zero.

TFExprSet <- ExpressionSet(
  assayData = as.matrix(logBulkTPM),
  phenoData = new(
    "AnnotatedDataFrame", 
    data = bulkExperimentMetadata))

phenoData <- pData(TFExprSet)
batch <- phenoData$Batch
modcombat <- model.matrix(~as.factor(Celltype), data = phenoData)
combatExpr <- ComBat(
  dat = exprs(TFExprSet), 
  batch = batch, 
  mod = modcombat, 
  par.prior = FALSE, 
  prior.plots = FALSE)
## Found 3 batches
## Adjusting for 9 covariate(s) or covariate level(s)
## Standardizing Data across genes
## Fitting L/S model and finding priors
## Finding nonparametric adjustments
## Adjusting the Data
combatExpr[combatExpr < 0] <- 0

This is the PCA plot of the samples before batch correction, using the top 500 genes by variance. You can see that the hESC / APS / MPS samples are quite separated between batches 1 and 3; the somites, dermomyotome, and sclerotome also differ between batches.

numTopGenes <- 500
idxOfTop <- order(apply(logBulkTPM, 1, var), decreasing = TRUE)[1:numTopGenes]
PCA(t(logBulkTPM[idxOfTop, ]), bulkExperimentMetadata, PCs=c(1,2), intgroup=c('Celltype', 'Batch'))

After batch correction, the different cell types cluster more tightly. As expected, cell types that are more highly differentiated (Somites, Dermomyotome, Sclerotome) are more dispersed.

numTopGenes <- 500
idxOfTop <- order(apply(combatExpr, 1, var), decreasing = TRUE)[1:numTopGenes]
PCA(t(combatExpr[idxOfTop, ]), bulkExperimentMetadata, PCs=c(1,2), intgroup=c('Celltype', 'Batch'))

sessionInfo()
## R version 3.2.3 (2015-12-10)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 14.04.3 LTS
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] stringr_1.0.0             RCurl_1.95-4.7           
##  [3] bitops_1.0-6              gdata_2.17.0             
##  [5] org.Hs.eg.db_3.1.2        assertthat_0.1           
##  [7] dplyr_0.4.3               GO.db_3.1.2              
##  [9] AnnotationDbi_1.30.1      Biobase_2.28.0           
## [11] goseq_1.20.0              RSQLite_1.0.0            
## [13] DBI_0.3.1                 geneLenDataBase_1.4.0    
## [15] BiasedUrn_1.07            cluster_2.0.3            
## [17] tsne_0.1-2                ggplot2_2.0.0            
## [19] gplots_2.17.0             RColorBrewer_1.1-2       
## [21] DESeq2_1.8.2              RcppArmadillo_0.6.400.2.2
## [23] Rcpp_0.12.3               GenomicRanges_1.20.8     
## [25] GenomeInfoDb_1.4.3        IRanges_2.2.9            
## [27] S4Vectors_0.6.6           BiocGenerics_0.14.0      
## [29] biomaRt_2.24.1            sva_3.14.0               
## [31] genefilter_1.50.0         mgcv_1.8-10              
## [33] nlme_3.1-124             
## 
## loaded via a namespace (and not attached):
##  [1] VGAM_1.0-0              splines_3.2.3          
##  [3] gtools_3.5.0            Formula_1.2-1          
##  [5] latticeExtra_0.6-26     Rsamtools_1.20.5       
##  [7] yaml_2.1.13             lattice_0.20-33        
##  [9] digest_0.6.9            XVector_0.8.0          
## [11] colorspace_1.2-6        htmltools_0.3          
## [13] Matrix_1.2-3            plyr_1.8.3             
## [15] XML_3.98-1.3            zlibbioc_1.14.0        
## [17] xtable_1.8-0            scales_0.3.0           
## [19] BiocParallel_1.2.22     annotate_1.46.1        
## [21] GenomicFeatures_1.20.6  lazyeval_0.1.10        
## [23] nnet_7.3-11             survival_2.38-3        
## [25] magrittr_1.5            evaluate_0.8           
## [27] foreign_0.8-66          tools_3.2.3            
## [29] formatR_1.2.1           munsell_0.4.2          
## [31] locfit_1.5-9.1          lambda.r_1.1.7         
## [33] Biostrings_2.36.4       caTools_1.17.1         
## [35] futile.logger_1.4.1     grid_3.2.3             
## [37] labeling_0.3            rmarkdown_0.9.2        
## [39] gtable_0.1.2            R6_2.1.1               
## [41] GenomicAlignments_1.4.2 gridExtra_2.0.0        
## [43] knitr_1.12.3            rtracklayer_1.28.10    
## [45] Hmisc_3.17-1            futile.options_1.0.0   
## [47] KernSmooth_2.23-15      stringi_1.0-1          
## [49] geneplotter_1.46.0      rpart_4.1-10           
## [51] acepack_1.3-3.3