In an ideal world RNA-seq reads would unambiguously align to the species from which they originate. This would allow cancer biologists to simply extract RNA from xenografts, sequence it and then align it to the graft genome and trust their results. Unfortunately, reality isn't so simple (see figure). Instead there are reads that will align well to both, say, the human and the mouse genome. This can cause systematic noise in the data leading to inappropriate conclusions.
Imagine you have a treatment that you're interested in because it shrinks the size of xenografted human tumors. Then you extract RNA from treated and untreated xenografts and run a naive RNA-seq pipeline (aligning only to the human genome). The issue with this strategy is that removing the smaller, treated tumor could be more difficult and result in a greater proportion of host (mouse) cell contamination. If the mouse cells react differently to the treatment, or if they simply express a different transcriptional program, reads that originate from mouse cells and align ambiguously to both the mouse and the human genome could confound the analysis. For example, a gene that is down-regulated in treated tumors may be missed because the ambiguous mouse reads from the mouse orthologue of that gene add to the FPKM in the treated tumor sample.
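To make that last point concrete, here is a toy calculation in Python. Every number in it is invented purely for illustration, and plain normalized counts stand in for FPKM. The sketch assumes a human gene whose true expression halves on treatment, while the share of ambiguous mouse reads grows because the treated tumor is harder to excise cleanly.

```python
def apparent_fold_change(human_untreated, mouse_untreated,
                         human_treated, mouse_treated):
    """Treated/untreated ratio when ambiguous mouse reads are counted as human."""
    observed_untreated = human_untreated + mouse_untreated
    observed_treated = human_treated + mouse_treated
    return observed_treated / observed_untreated


# Hypothetical normalized counts for one gene (not real data).
human_untreated, human_treated = 100, 50   # true 2-fold down-regulation
mouse_untreated, mouse_treated = 5, 45     # more host contamination after treatment

true_fc = human_treated / human_untreated
naive_fc = apparent_fold_change(human_untreated, mouse_untreated,
                                human_treated, mouse_treated)

print(f"true fold change:  {true_fc:.2f}")
print(f"naive fold change: {naive_fc:.2f}")
```

With these made-up inputs, the naive human-only pipeline sees a ratio close to 1, and the real halving of expression all but disappears.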
Fortunately, there are several possible solutions to this problem. (I guess if it were otherwise I wouldn't be writing this post.)
First, one could address the purity of the sample directly. In preparing the xenograft sample for sequencing, one could stain with human-specific and mouse-specific antibodies and FACS-sort only the human+/mouse- population. This would potentially reduce the contamination enough to safely map only to the human genome. Such an approach, however, is not always feasible, nor is it always sufficient to fully resolve the contamination. For example, in the case of a patient-derived xenograft, one may not know all the specific markers that are expressed in that particular tumor.
In cases where purification by FACS is not a viable solution, one can take a bioinformatic approach. I would group the bioinformatic approaches to dealing with host-graft contamination into three major categories:
- Host-alignment Filtering
- Filtering ambiguous reads
- Resolving ambiguous reads
What I call Host-alignment Filtering is perhaps the most intuitive, albeit somewhat naive, approach. The idea is simply to first align all reads to the host species, and then use only those reads that do not align to the host in the expression analysis (i.e. align them to the graft genome and process normally). This method is conceptually and technically simple, but it is a very crude and strict filter: any genuine graft read that also happens to align to the host is thrown away, which can degrade the data.
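As a rough sketch of that filter, the Python below works on read IDs only. It assumes you have already run your aligner against the host genome and collected the IDs of the reads that aligned; the function name and the toy read IDs are hypothetical.

```python
def host_alignment_filter(all_read_ids, host_aligned_ids):
    """Keep only the reads that did NOT align to the host genome."""
    host_aligned = set(host_aligned_ids)
    return [read_id for read_id in all_read_ids if read_id not in host_aligned]


# Toy example: r2 and r4 aligned to the host (mouse) genome.
all_reads = ["r1", "r2", "r3", "r4"]
aligned_to_host = ["r2", "r4"]

kept = host_alignment_filter(all_reads, aligned_to_host)
print(kept)  # these reads would go on to alignment against the graft genome
```

Note that the filter never asks whether r2 or r4 would have aligned better to the graft genome, which is exactly why it is so strict.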
The second approach is to first generate a host-graft hybrid reference genome, which contains contigs (or chromosomes) annotated by species. After aligning all the reads to this hybrid genome, one can select only those reads that align uniquely to the graft species. This approach is still quite stringent and can result in some data loss. However, it is more reasonable than the first approach because each read is at least "exposed" to both genomes, so the aligner has the opportunity to make a unique call from the full sample space.
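Here is a minimal sketch of the selection step in Python. It assumes the hybrid reference was built with species-prefixed contig names (`human_chr1`, `mouse_chr1`, and so on; that naming is my own convention for this example) and that the alignments have been reduced to (read ID, contig) pairs, one pair per reported hit. In a real pipeline you would more likely decide uniqueness from the mapping quality or hit count your aligner reports.

```python
from collections import defaultdict


def unique_graft_reads(alignments, graft_prefix="human_"):
    """Return IDs of reads with exactly one hit, located on a graft contig."""
    hits = defaultdict(list)
    for read_id, contig in alignments:
        hits[read_id].append(contig)
    return sorted(
        read_id
        for read_id, contigs in hits.items()
        if len(contigs) == 1 and contigs[0].startswith(graft_prefix)
    )


# Toy alignments against a hypothetical human+mouse hybrid reference.
alignments = [
    ("r1", "human_chr1"),
    ("r2", "mouse_chr3"),   # unique, but host
    ("r3", "human_chr2"),   # r3 hits both species, so it is ambiguous
    ("r3", "mouse_chr2"),
    ("r4", "human_chrX"),
]

selected = unique_graft_reads(alignments)
print(selected)
```

Only r1 and r4 survive here: r2 is a host read and r3 is dropped because it is not unique.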
Finally, a third approach is to align all the reads to each genome independently and then resolve the "ambiguously" aligned reads using the alignment information. In other words, for those reads that can be aligned to both the host and the graft genomes, you keep the alignment that has the best score and discard any other alignment for that read. This approach is the most involved, but it is the preferred method, especially when sensitivity is a particular concern or when sequencing depth is limiting.
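The core of that decision rule fits in a few lines of Python. This is a sketch of the idea only, not the algorithm of any particular tool. It assumes you have extracted one alignment score per read from each of the two alignments, that higher scores are better, and that exact ties are simply labeled ambiguous. The function name and the scores are made up.

```python
def resolve_by_score(graft_scores, host_scores):
    """Assign each read to 'graft', 'host' or 'ambiguous' by best alignment score."""
    calls = {}
    for read_id in sorted(set(graft_scores) | set(host_scores)):
        graft = graft_scores.get(read_id)  # None if the read did not align to graft
        host = host_scores.get(read_id)    # None if the read did not align to host
        if host is None or (graft is not None and graft > host):
            calls[read_id] = "graft"
        elif graft is None or host > graft:
            calls[read_id] = "host"
        else:
            calls[read_id] = "ambiguous"   # aligned to both with equal scores
    return calls


# Hypothetical per-read alignment scores from two independent alignments.
graft_scores = {"r1": 98, "r2": 60, "r3": 90}
host_scores = {"r2": 95, "r3": 90, "r4": 80}

calls = resolve_by_score(graft_scores, host_scores)
for read_id, call in calls.items():
    print(read_id, call)
```

Unlike the two filters above, r2 is not discarded just because it aligns to both genomes; it is assigned to the species where it aligns best. A real implementation would also have to handle paired-end reads and multiple hits per genome.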
All of these approaches can be done manually (i.e. using a combination of alignment strategies and Unix utilities or a short script), but there are a couple of tools that make the third method, resolving ambiguously aligned reads, easier. The first tool that I would recommend is Xenome (a tool included in the Gossamer bioinformatics suite) from Conway et al. This tool starts from the unaligned reads (FASTQ files) and the host and graft genomes and classifies reads as host, graft or ambiguous. Another great tool for implementing this strategy is Disambiguate, a freely available (MIT License) tool from AstraZeneca (Ahdesmäki et al.). Disambiguate takes aligned reads (BAM files) and classifies them as host, graft or ambiguous based on their alignment score. Otherwise, if you're feeling adventurous or just want to get your hands dirty, it's not too difficult to whip up your own version of these tools. If you do, be sure to send me your Git repo and I'll include it here!