Here are all the steps involved in the RNA-Seq data analysis



Historically there were lots of methods for measuring gene expression in cells. For example, not so long ago microarrays were very popular, but now they are being gradually replaced by NGS (Next-Generation Sequencing) methods. These methods allow one to figure out the sequences of nucleotides (ATGC) in DNA or RNA molecules. In this article, we will discuss only RNA sequencing protocols and RNA-seq analysis. Let’s figure out what standard protocol of RNA-seq is? We can divide it into two parts: RNA-seq data generation and RNA-seq analysis.

I. Data generation

To generate RNA-seq dataset, RNA is first extracted from the sample (tissue or cell). After that specific enzyme called DNAse is used in order to get rid of DNA contamination. As a result, we obtain only RNAs. But the problem is that there are lots of different types of RNA (mRNA, tRNA, miRNA, and lots of rRNA). In order to estimate level of gene expression we only need mRNA. How we can isolate mRNAs? The most effective method is poly-A selection. This method takes into consideration specific structure of mRNA, which has poly-A sequence on one of the ends. Other types of RNA lack this sequence, thus only mRNAs can be extracted. After that all mRNAs are fragmented, reverse transcribed into cDNA (complementary DNA) and legated with sequencing adaptors. Finally, cDNAs are sequenced using next-generation sequencing technologies to produce many short reads. RNA-seq data ready for analysis. RNA-seq analysis starts by mapping reads to a reference genome. You can do RNA seq analysis with Basepair who is an expert in this field.

Figure 2. A schematic representation of the RNA-Seq protocol. RNA transcripts are converted into double-stranded cDNA, which are then fragmented and their ends sequenced.

II. Data analysis

Figure 3. A schematic representation of RNA-seq analysis. The goal is to identify differentially expressed genes across conditions. RNA-seq reads are mapped to the reference genome, gene expression are quantified, differentially expressed genes are identified, and enriched function or pathways are discovered. Quality is checked at every step.

a. RNA-seq data

At this step quality of data is checked. Machine gives you .fastq files that contain all the reads and their IDs. For reads quality check fastQC tool is often used. It gives you QC report, where you can find lots of quality metrics. Usually the following parameters are important: base quality score, read length distribution, GC content, presence of adapters sequences. If there are many short reads in your data, if it is contaminated by other genome or if adapters sequences bring bias to your data, you will notice it in the report. According to this you can go back to preprocessing data, trim unnecessary parts of reads, get rid of adapters and continue to analyse RNA-seq data. By the end of this step all reads are stored in .fastq files, which additionally contain information about quality of each read (Phred scale). 

b. Alignment 

After quality check, reads are ready for alignment. Alignment procedure allows one to map each read to the most similar genome region and thus reveal the origin of this read. RNA-seq reads alignment procedure differs from alignment of genomic data. As you may remember, genes in human have exons and introns. During specific process called splicing introns are removed and mature mRNA consistins only of exons is produced. The problem is that reads from RNA-seq data also store only exon sequences

c. Quantification

Quantifying gene expression by RNA-seq is to count the number of reads that map (i.e. align) to each gene. Based on the number of reads we can estimate expression level of this gene. Sometimes quantification can be very complicated due to strand specific genes. For example, when one gene is located on forward strand and second gene is located on reversed strand. They can overlap and lead to ambiguity in the alignment. There are several strategies how we can count reads.   

d. Differential expression (DE) analysis

It can be done in several ways. One of them is limma. Limma is a package for the analysis of gene expression data arising from microarray or RNA-Seq. It uses matrix of gene expression values and applys linear models (specific statistical algorithms) which measures how expression of each gene in case differs from control. Usually such difference is illustrated using color intensity (Fig. 7). By the end of the analysis we obtain several statistical coefficients which we use for sorting genes and plotting heatmap. Using heatmap, we can observe DE and see what genes are overexpressed and what genes are underexpressed in different conditions (Figure 7).

Figure 7. Heatmap of gene expression data. Rows are samples. Columns are genes. Color range shows how gene expression increases or decreases between case and control samples. 

e. Enrichment analysis

Heatmap only shows you information about particular genes, but if you are interested in more complicated processes such as signaling pathway analysis you need to operate gene sets. How we understand that some particular pathway is upregulated under specific condition? We need to do enrichment analysis. After limma analysis you have genes with increased expression. Now you can figure out in what pathways they are involved. There are several web-based tools and databases that can do this (MSigDB, Enrichr etc). They calculate how many genes in our set overlap with gene sets of signaling pathways. As a result, you will have list of pathways what are significantly similar with your set. In order to check it you can take gene set of some of those pathways and map them to your heatmap (GSEA plot). You will get some distribution of those genes. If distribution is random – there is no any pathway regulation under this condition. If this gene set distribution is not random (shifted to some end), congrats! Despite the fact that all your classmates have already gone home, you find your differentially regulated pathway! 

Figure 8. GSEA (gene set enrichment analysis) plot. On X axis there are genes, on Y axis is enrichment score. It is calculated based on gene names distribution. Black lines scattered through X axis are gene names of some pathway.


Please enter your comment!
Please enter your name here

20 + nine =

This site uses Akismet to reduce spam. Learn how your comment data is processed.