August 2, 2017  |  Cytobank, DROP, Education, viSNE

Understanding Sample Heterogeneity from RNA-seq Data
with DROP + viSNE

As scientists, we often want to understand the heterogeneity in a group of samples across multiple biomarkers, whether those biomarkers are cell populations and their functional protein expression or RNA transcripts. With Cytobank’s upcoming DROP feature, you can build your understanding of this sample heterogeneity using many new data types!

In this example, we’ll show you how you can use viSNE in Cytobank to separate primary tumor tissue samples and paired normal tissue samples from subjects with 5 different kidney cancers based on bulk RNA-seq data from The Cancer Genome Atlas project.

We downloaded FPKM data from TCGA for 1011 samples from 5 different types of kidney cancers, as well as matched normal tissue samples for 139 of the subjects. The tumors ranged from stage I to stage IV, and subjects varied by age, gender, and whether they were living when the sample was taken. We chose the 806 most variable genes to “DROP” into Cytobank.
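If you want to do a similar filtering step on your own data before using DROP, here is a minimal sketch in Python. It assumes that “most variable” means highest variance of FPKM across samples; the exact variability measure used for this example is not stated here, and the function name and toy gene names are hypothetical.

```python
import statistics


def top_variable_genes(fpkm_by_gene, k):
    """Return the names of the k genes with the largest variance across samples.

    fpkm_by_gene maps a gene name to a list of FPKM values, one per sample.
    Assumption: variability is measured as population variance of raw FPKM.
    """
    variances = {
        gene: statistics.pvariance(values)
        for gene, values in fpkm_by_gene.items()
    }
    # Sort by variance, largest first; ties are broken by gene name
    # so that the result is deterministic.
    ranked = sorted(variances, key=lambda gene: (-variances[gene], gene))
    return ranked[:k]


# Toy data: three made-up genes measured in four samples.
toy_fpkm = {
    "GENE_A": [0.0, 10.0, 20.0, 30.0],
    "GENE_B": [5.0, 5.0, 5.0, 5.0],
    "GENE_C": [1.0, 2.0, 1.0, 2.0],
}

selected = top_variable_genes(toy_fpkm, k=2)
print(selected)
```

For a data set like the one in this example, you would call the function with k=806 on the full gene-by-sample table.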

In Cytobank, we applied viSNE to arcsinh-transformed transcript data and found that the primary tumor samples were grouped by disease type (Figure 1, left), but the normal tissue samples from individuals with all disease types were grouped together in a separate island (Figure 1, right).
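For readers less familiar with the arcsinh transform, the short Python sketch below shows what it does to a list of expression values. The cofactor (scale) value of 5.0 is only a placeholder, since the setting used for this analysis is not given here, and the function name is hypothetical. This is not Cytobank's implementation, just an illustration of the math.

```python
import math


def arcsinh_transform(values, cofactor=5.0):
    """Apply asinh(value / cofactor) to each value.

    Assumption: cofactor=5.0 is a placeholder, not the setting used in the post.
    The transform is roughly linear near zero and roughly logarithmic for
    large values, so it compresses high expression while keeping zeros at zero.
    """
    if cofactor <= 0:
        raise ValueError("cofactor must be positive")
    return [math.asinh(value / cofactor) for value in values]


# Toy FPKM values spanning several orders of magnitude.
raw = [0.0, 1.0, 10.0, 100.0, 1000.0]
transformed = arcsinh_transform(raw)

for before, after in zip(raw, transformed):
    print(f"{before:>8.1f} -> {after:.3f}")
```

Because the transform is defined at zero, transcripts with no detected expression do not need a pseudocount the way they would with a plain log transform.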

Figure 1. viSNE separates groups of different disease and normal tissue samples based on their expression signatures across all 806 transcripts at once.

When we color the viSNE map by channel (marker), we can see that individual markers are differentially expressed across the groups of samples (Figure 2).

Figure 2. Markers are differentially expressed in certain groups of samples. IGF2 and CA9 are expressed only in Wilms Tumor and Renal Clear Cell Carcinoma, respectively. In contrast, EPCAM has lower expression in all of the tumor samples than the normal tissue samples, and MET is expressed in all of the tumor samples aside from Wilms Tumor.

RNA-seq data is only one of many ‘bulk’ biomarker measurements you might want to use to understand sample heterogeneity. How will you know whether you can analyze your other data types with DROP and viSNE? Good candidate ‘bulk’ data sets will have:

  • Roughly 100 or more samples to compare
  • Measurements that are continuous, such as transcript levels, cell population abundances, or methylation detection levels (although there is some evidence that discrete genotype data can be analyzed with viSNE, too!)
  • A feature set that has been filtered down to ~800 biomarkers that are most interesting for looking at sample heterogeneity (e.g. most variable, as in this example)
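
As a rough illustration, the size guidelines above can be turned into a quick pre-flight check in Python. This is only a sketch: the function name is hypothetical, the 10% tolerance around 800 biomarkers is an assumption made here to reflect the "~", and it does not check whether your measurements are continuous.

```python
def check_bulk_dataset(n_samples, n_biomarkers,
                       min_samples=100, target_biomarkers=800, tolerance=0.10):
    """Return a list of notes about how a data set compares to the guidelines.

    An empty list means neither size guideline was flagged.
    Assumption: "~800" is read as "up to 800 plus a 10% tolerance".
    """
    notes = []
    if n_samples < min_samples:
        notes.append(
            f"Only {n_samples} samples; around {min_samples} or more is suggested."
        )
    if n_biomarkers > target_biomarkers * (1 + tolerance):
        notes.append(
            f"{n_biomarkers} biomarkers; consider filtering to about "
            f"{target_biomarkers} (e.g. the most variable)."
        )
    return notes


# Sample and transcript counts quoted in this post.
print(check_bulk_dataset(n_samples=1011, n_biomarkers=806))

# A hypothetical small, unfiltered data set.
for note in check_bulk_dataset(n_samples=40, n_biomarkers=20000):
    print(note)
```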

DROP will be shipping this month to Enterprise servers!