Chapter 8: RNA-seq analysis#
“Do the difficult things while they are easy and do the great things while they are small. A journey of a thousand miles must begin with a single step.”
—Lao Tzu
RNA-seq sequences the RNA molecules in a cell or tissue to determine their structure and to identify which genes are being expressed. Numerous computational approaches have been developed to process and analyze the short reads that RNA-seq produces. Short reads are first mapped to the reference genome and then counted per feature (exon, coding region, etc.). After normalization, the counts are used to identify significant genes through differential expression (DE) analysis. Pathway and gene network analyses then provide further biological insight. Compared with microarrays, RNA-seq has higher sensitivity and lower technical variation. More importantly, RNA-seq can discover novel transcribed regions, allele-specific expression, RNA editing, and alternative splicing.
Mapping#
One common task in RNA-seq analysis is mapping the reads produced by the sequencing machine back to the reference genome or transcriptome to determine which genes are being transcribed and to quantify their expression levels. There are several tools available for mapping RNA-seq reads to a reference genome or transcriptome, including:
HISAT2
TopHat
STAR
Subread
Bowtie
These tools align the reads to the reference using various algorithms and heuristics, such as indexing the reference and using dynamic programming to find the best alignment for each read. The output of the mapping step is a set of alignments in a standardized format, such as BAM or SAM, which can then be processed further to identify differentially expressed genes, alternative splicing events, and other features of interest.
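To make the aligner output more concrete, here is a minimal sketch that tallies mapped reads per reference sequence from uncompressed SAM text. It relies only on the SAM layout (tab-separated records, header lines starting with `@`, the FLAG in column 2 and the reference name in column 3). The function `count_mapped_reads`, the helper `sam_record`, and the toy records are hypothetical names and data introduced for illustration, not part of any aligner.

```python
from collections import Counter

# FLAG bits defined by the SAM format.
UNMAPPED = 0x4
SECONDARY = 0x100
SUPPLEMENTARY = 0x800


def count_mapped_reads(sam_lines):
    """Count primary mapped alignments per reference name."""
    counts = Counter()
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        flag, rname = int(fields[1]), fields[2]
        if flag & (UNMAPPED | SECONDARY | SUPPLEMENTARY):
            continue  # ignore unmapped reads and non-primary alignments
        counts[rname] += 1
    return counts


def sam_record(qname, flag, rname, pos, cigar="8M", seq="ACGTACGT"):
    """Build a toy SAM line with the 11 mandatory fields (illustrative helper)."""
    return "\t".join(
        [qname, str(flag), rname, str(pos), "60", cigar, "*", "0", "0",
         seq, "I" * len(seq)]
    )


# Toy input: one header line and five alignment records.
example_sam = [
    "@SQ\tSN:chr1\tLN:1000",
    sam_record("read1", 0, "chr1", 100),
    sam_record("read2", 16, "chr1", 250),       # reverse-strand alignment
    sam_record("read3", 4, "*", 0, cigar="*"),  # unmapped read
    sam_record("read4", 0, "chr2", 40),
    sam_record("read4", 256, "chr1", 700),      # secondary alignment of read4
]
mapped_counts = count_mapped_reads(example_sam)
```

Skipping secondary and supplementary records is one possible design choice; it keeps a read that aligns to several places from being counted more than once.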
The process of read mapping, also known as alignment, involves identifying the unique location in a reference where a short read sequence matches. However, in practice, the reference may not be an exact representation of the RNA being sequenced due to sample-specific variations such as single nucleotide polymorphisms (SNPs) and insertions or deletions (indels). Additionally, reads may come from spliced transcripts rather than a genome, and may contain sequencing errors or align perfectly to multiple locations. Therefore, the goal of read mapping is to find the best match for each short read in the reference, taking into account these potential errors and variations.
Most short read aligners use a two-step approach: first, a fast heuristic identifies a list of candidate alignment locations; then a more computationally intensive local alignment algorithm evaluates each candidate. Current aligners implement the fast heuristic step with either hash tables or the Burrows-Wheeler transform (BWT). Hash-table aligners can detect complex differences between the read and the reference. BWT-based aligners, on the other hand, efficiently map reads that closely match the reference but struggle with more complex mismatches and may become slow in those cases.
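The two-step idea can be sketched with a toy hash-table aligner. This is a simplified illustration, not how any particular tool is implemented: the seeding step looks up exact k-mers in a dictionary, and the verification step uses a plain mismatch count (Hamming distance) as a stand-in for the gapped local alignment that real aligners perform. The function names, the parameters `k` and `max_mismatches`, and the sequences are assumptions made for the example.

```python
from collections import defaultdict


def build_kmer_index(reference, k):
    """Hash table mapping every k-mer of the reference to its start positions."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def map_read(read, reference, index, k, max_mismatches=2):
    """Return (start, mismatches) of the best ungapped hit, or None."""
    # Step 1 (fast heuristic): exact k-mer hits propose candidate start positions.
    candidates = set()
    for offset in range(len(read) - k + 1):
        for pos in index.get(read[offset:offset + k], ()):
            start = pos - offset
            if 0 <= start <= len(reference) - len(read):
                candidates.add(start)

    # Step 2 (verification): score every candidate; here by counting mismatches.
    best = None
    for start in sorted(candidates):
        window = reference[start:start + len(read)]
        mismatches = sum(a != b for a, b in zip(read, window))
        if mismatches <= max_mismatches and (best is None or mismatches < best[1]):
            best = (start, mismatches)
    return best


# Toy reference and a read copied from position 10 with one substitution.
reference = "ACGTACGTTAGCCGATAGGCTTAACGGATCCA"
kmer_index = build_kmer_index(reference, k=4)
read = reference[10:15] + "C" + reference[16:22]
hit = map_read(read, reference, kmer_index, k=4)
```

Because the read still shares several exact 4-mers with the reference, the seeding step finds the correct location even though the read contains a mismatch.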
Normalization#
At the same expression level, longer transcripts produce more reads than shorter ones, and samples sequenced more deeply produce more reads overall. To correct for this, read counts are normalized within and between samples, for example by scaling the count for each gene to the total number of reads obtained from that sample. Between-sample methods include upper quartile normalization and TMM (trimmed mean of M-values) normalization. RPKM (reads per kilobase of exon model per million mapped reads) is a widely used method for within-sample normalization.
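As a worked example of within-sample normalization, the following sketch computes RPKM from its definition: the read count for a gene, divided by the gene length in kilobases and by the number of mapped reads in millions. The gene names, counts, and lengths are made up, and the total number of mapped reads is assumed to equal the sum of the counts shown.

```python
def rpkm(counts, lengths_bp):
    """RPKM per gene: count * 1e9 / (length in bp * total mapped reads)."""
    total_reads = sum(counts.values())
    if total_reads == 0:
        raise ValueError("no mapped reads in this sample")
    return {
        gene: count * 1e9 / (lengths_bp[gene] * total_reads)
        for gene, count in counts.items()
    }


# Toy sample: geneC has twice the reads of geneB but is also twice as long.
counts = {"geneA": 500, "geneB": 500, "geneC": 1000}
lengths_bp = {"geneA": 1000, "geneB": 2000, "geneC": 4000}
values = rpkm(counts, lengths_bp)
```

In this toy sample geneB and geneC end up with the same RPKM, because the larger raw count of geneC is fully explained by its length.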
Differential expression#
Differential expression analysis identifies changes in gene expression between two or more groups of samples. It involves quantifying the abundance of RNA molecules in each sample and then comparing the results to determine which genes are differentially expressed between the groups. Several tools are available for differential expression analysis, including edgeR, DESeq2, and limma. These methods typically test each gene for statistical significance, in the spirit of a t-test or ANOVA but with models suited to the data (for example, negative binomial models for read counts), and then correct for multiple testing using techniques such as Bonferroni correction or false discovery rate (FDR) control. It is important to design the experiment carefully and to choose appropriate statistical methods to ensure accurate results.
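The multiple-testing step can be illustrated with a short sketch of the two corrections named above. Bonferroni multiplies each p-value by the number of tests; the FDR version shown here is the Benjamini-Hochberg procedure, which is one common way to control the FDR (an assumption on our part, since the text does not name a specific procedure). The p-values are invented for the example.

```python
def bonferroni(pvalues):
    """Bonferroni-adjusted p-values: p * m, capped at 1."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so the adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Toy p-values for four genes.
pvals = [0.01, 0.04, 0.03, 0.005]
bonf = bonferroni(pvals)
bh = benjamini_hochberg(pvals)
```

With a 0.05 threshold, all four toy genes pass after Benjamini-Hochberg adjustment, while only two pass after Bonferroni, which shows why FDR control is usually preferred when thousands of genes are tested.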
A list of methods for DE analysis
DEGseq, assuming Poisson with a dispersion parameter
edgeR, baySeq, DESeq, assuming negative binomial with additional dispersion parameters
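The difference between the two distributional assumptions comes down to the variance. Under a Poisson model the variance equals the mean, whereas under a negative binomial model the variance is the mean plus a dispersion term, commonly written as mean + dispersion × mean². The sketch below estimates that dispersion for a single gene with a simple method-of-moments formula. This is only an illustration under that parameterization; the packages listed above use more elaborate estimators that share information across genes. The function name and the counts are hypothetical.

```python
from statistics import mean, variance


def moment_dispersion(counts):
    """Method-of-moments dispersion: (variance - mean) / mean^2, floored at 0."""
    mu = mean(counts)
    if mu == 0:
        return 0.0
    s2 = variance(counts)  # sample variance; needs at least two replicates
    return max(0.0, (s2 - mu) / (mu * mu))


# Replicate counts for one gene: more variable than a Poisson model would allow.
overdispersed = moment_dispersion([10, 20, 30])

# Identical replicates show no extra variance, so the estimate is floored at 0.
flat = moment_dispersion([5, 5, 5])
```

A dispersion of zero corresponds to the Poisson case; values above zero indicate the extra variability between replicates that the negative binomial model is meant to capture.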
Detection of alternative splicing#
Aligners that identify splice sites from sequence reads (also known as splice-aware aligners): TopHat, MapSplice, SpliceMap, HMMsplicer, GSNAP, STAR, RUM, SOAPsplice, etc.
Transcriptome assemblers: Cufflinks, Scripture, Trinity, Trans-ABySS, GRIT.
Alternative expression tools that identify isoform expression: Cuffdiff, ALEXA-seq, MISO, SplicingCompass, Flux Capacitor, JuncBASE, DEXSeq, MATS, SpliceR, FineSplice, ARH-seq, etc.
Detection of gene fusion#
Fusion genes are detected by looking for pairs of genes supported by at least two or three short read pairs in which one end aligns to one gene and the other end aligns to the other gene. To identify the exact fusion point (the exon-exon junction), a database of potential exon-exon combinations between the two genes is created. Short reads that do not align to either the genome or the transcriptome are then aligned against this junction database to locate the fusion point. The alignment must span at least 10 base pairs of one exon for the fusion gene to be considered valid.
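The first step, collecting gene pairs supported by enough discordant read pairs, can be sketched as follows. The input is assumed to be a list of (gene of mate 1, gene of mate 2) tuples already produced by mapping and annotation; the function name, the `min_support` parameter, and the gene names are hypothetical. The junction-database step is not shown.

```python
from collections import Counter


def candidate_fusions(read_pairs, min_support=2):
    """Return gene pairs whose mates map to two different genes often enough."""
    support = Counter()
    for gene1, gene2 in read_pairs:
        # Keep only pairs where both mates hit a gene and the genes differ.
        if gene1 is None or gene2 is None or gene1 == gene2:
            continue
        # Sort so that (A, B) and (B, A) count as the same candidate.
        support[tuple(sorted((gene1, gene2)))] += 1
    return {pair: n for pair, n in support.items() if n >= min_support}


# Toy read pairs, each annotated with the gene hit by either mate.
pairs = [
    ("GENE_A", "GENE_B"),
    ("GENE_B", "GENE_A"),
    ("GENE_A", "GENE_B"),
    ("GENE_A", "GENE_A"),   # concordant pair, ignored
    ("GENE_C", "GENE_D"),   # only one supporting pair
    ("GENE_C", None),       # one mate outside any gene, ignored
]
fusions = candidate_fusions(pairs, min_support=2)
```

Raising `min_support` to 3 makes the filter stricter, mirroring the "two or three" read pairs mentioned above.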
A list of methods for gene fusion detection
RNA editing#
RNA editing is a process in which cells modify specific nucleotides within an RNA molecule after it has been synthesized by RNA polymerase. As a result, the sequence of a messenger RNA may not exactly match the sequence of the gene from which it was transcribed.
A list of methods for RNA editing
REDITOOLS: efficient RNA editing detection by RNA-SE
RADAR: a rigorously annotated database of A-to-I RNA editing