Monday, December 29, 2014

FASTQ and Sequencing Quality



The output from an Illumina platform is the FASTQ file. To understand this file, we’ll jump right into a real example from our data:

QD_0_ACTTGA_L005_R1_001.fastq.gz

First thing to note is the name given by the Illumina platform. The file name is composed of 6 fields separated by underscores:
          - QD_0 = sample name (defined by our lab in this case)
          - ACTTGA = barcode sequence ligated to each DNA fragment
          - L005 = sequencing lane on the flow cell
          - 001 = set number

Each file is can only hold a certain amount of sequencing data and anything exceeding that size is split into multiple files denoted by the set number. This means that files that differ only by set number are part of the same sample and must be combined before analysis.

FASTQ files have an extension of either .fastq or .fq but as you may have noticed, the file name I gave above is technically not a FASTQ file. The .gz extension denotes that this file is compressed by the Gzip algorithm to minimize the space each file takes up.

To unzip the file, simply navigate to the file in a terminal window and use the following command:

gunzip <filename>.gz

To gzip a file, use this command:

gzip <filename>.gz

It is not recommended to unzip the FASTQ file unless necessary to keep file sizes as small as possible. However, certain actions such as viewing the file will require you to first unzip it. To view the first 12 lines of the FASTQ file without unzipping the whole file, do this:

gunzip -cd <filename>.gz | head -12

For the file I mentioned above, the output looks like this:

@HWI-ST765:225:C5K6GACXX:5:1101:1500:2093 1:N:0:ACTTGA
GCATTAGGTGCGTTAGAAGCAACAAAAGCACACGGTAAAAAATTACCTATCTTTGGTGTGGATGCGTTACCAGAAGCATTACAATTGATTAGTAAAGGCGA
+
CCCFFFFFFHHGHJIJJJJJJGIIJJJJJJJJJJJGIJJJIJJIIJJJJIJIIJJJGHEHEHFFBCCDDDDDDDDCCDDDDDDDDDDDEDECCDEDCC<@>
@HWI-ST765:225:C5K6GACXX:5:1101:1365:2094 1:N:0:ACTTGA
TTTCAAAGTCATAATTGATTAATTTGAGTTTATTTGGATCAAATGCAGTTGGCGATTGTTCTGCGGTCGTGTTAGTTAAGATTTGTAAAGAATCCCCTTGT
+
CCCFFFFFGHHHHJJIJIJIJHIJIIIGHIJIJJJJIIJFHIFJJJIJGGIIJIIIHGGHIIEH/B@BHHFBD@CCAC;>CECCCACEDC>ADD@CDDDDA
@HWI-ST765:225:C5K6GACXX:5:1101:1435:2137 1:N:0:ACTTGA
AGCCCTTAACCGTATTTATACGACCTAGTGGGACAAGTAAACGAGAGGAAAGTCCGAGCTACACAGGGCAGAGTGCCGGATAACGTCCGGGCGGCGTGAGC
+
CCCFFFFFHHHHDGIJJJJIJJJJJJJJIJIIJJJJIFGIJJJJJJJJJIJJHHIIJHHFFFFFEDDDDDDDDCDACBDBDDDDDDDDDDDBDBBD<@BD9

This is what the guts of a FASTQ file looks like. Each sequenced fragment – called a read – is contained in 4 lines of the file. Since this file is 12 lines, it describes 3 reads. The first line in the file:

@HWI-ST765:225:C5K6GACXX:5:1101:1500:2093 1:N:0:ACTTGA

The first line always begins with @ and must be a unique identifier of that read. The information contained here is given by the Illumina platform and tells the exact X and Y coordinate on the flow cell where this read was sequenced. For more detailed information, go here.

GCATTAGGTGCGTTAGAAGCAACAAAAGCACACGGTAAAAAATTACCTATCTTTGGTGTGGATGCGTTACCAGAAGCATTACAATTGATTAGTAAAGGCGA

The second line contains the raw sequence for the read.

+

The third line is a “+” followed by an optional description (nothing, in this case).

CCCFFFFFFHHGHJIJJJJJJGIIJJJJJJJJJJJGIJJJIJJIIJJJJIJIIJJJGHEHEHFFBCCDDDDDDDDCCDDDDDDDDDDDEDECCDEDCC<@>

The final line is the quality score for each base. Each score is ASCII encoded, where the higher the score, the higher the quality. The following table can be used to calculate the score. Take the value of the character and subtract 33 to find the quality score.


Ex. for the character C, 34 is the quality score (67 – 33 = 34).

But what does this score mean? The score is actually a Phred quality score where:
Q = –10 log P
Where Q = quality score and P = probability base was incorrectly called

A Phred score of 34, therefore, means that a base has a 99.96% chance of being correct whereas a score of 10 has a 90% probability of being correct. We can use this scoring system to evaluate the quality of our sequencing results. An efficient way of doing this is using the tool FastQC.  To run this tool, simply use the following command: 

fastqc <filename>

The program computes and graphs a bunch of statistics on the quality of the data.


This image shows quality distribution at each base position in a read. As typical of Illumina sequencing data, the quality drops towards the end of the read due to phasing and pre-phasing. Recall that a single read is sequenced from a cluster of DNA strand. Phasing and pre-phasing occurs when strands in a sequencing cluster fail to insert a nucleotide or add multiple nucleotides during a sequencing cycle. These strands fall out of sync with the rest of the cluster and reduce the purity of the wavelength emitted by the cluster as a whole, thus resulting in a greater probability of the wrong base being called. Because this problem increases with every cycle, end of the read tends to have lower quality.


Overall, FastQC shows that most bases have a quality score greater than 30 (0.1% chance of error for each base). This suggests that the sequencing data has sufficiently high quality. For more information on the output from FastQC, please see this PDF.

The next post will attempt to detail how to map these reads to a reference genome. 

Sunday, December 28, 2014

Library construction and Illumina sequencing


This will be the first of a series of posts that will detail the process I have used to analyze a set of RNA seq data for our lab. 

Library Preparation

We’ll start by assuming that RNA has been successfully extracted from some biological source and ribosomal RNA has been removed. 

Step 1A: RNA -> cDNA 
To create a DNA library that can be sequenced, it is first necessary to convert our pool of RNA into complementary DNA (cDNA). One technique is to use a small DNA primer to target the poly-A tail on strands of mRNA and extend the primer to create cDNA.

For species that don’t attach a poly-A tail to mRNA, a strategy called random priming is used. The RNA is combined with synthetic DNA hexamer primers (444444 = 4096 different sequences) and the cDNA is constructed by extending off wherever a primer anneals to the RNA molecule.

Step 1B: Strand specificity (optional)
During the creation of the cDNA library, the initial strand cDNA is synthesized from a primer annealed to an RNA strand. During the synthesis of the complementary cDNA strand, it is possible to use dUTP instead of dTTP to mark the second strand with “U”s. This method results in double stranded cDNA where each strand is distinguishable from each other (where one contains T and the other contains U). The second strand can be targeted and degraded leaving a single strand that can be sequenced. The benefit of this method is that it allows each sequenced fragment to be mapped uniquely to one specific strand of the reference genome. For instance, if two genes overlap on opposite strands of a genome then this strand-specific sequencing strategy will be able to determine which strand (and therefore gene) each RNA fragment was derived from.


Step 2: Size selection

At this point, some protocol is used to break down all the cDNA fragments to a common size. Typically, this is some physical process such as sonication. The ends of these fragments are repaired and additional segments of synthetic DNA are ligated on the ends of the fragments. These include primers required to initiate sequencing and barcode sequences to identify which fragments belong to which sample (crucial for multiplexed experiments where multiple samples are sequenced at the same time). 


Step 3: Amplification

Finally, the library is PCR amplified and is ready to be sequenced.

Illumina Sequencing

For those out there (including myself) that are visually-orientated, this video will demonstrate the basic steps: 
http://youtu.be/77r5p8IBwJk?t=45s

After the library is prepared, the first step is to wash and mount the fragments on a flow cell. Fragments will be randomly distributed over this surface. Next, a technique called bridge amplification duplicates the annealed fragments to create a monoclonal cluster of DNA. 

This is where the sequencing begins. First a primer is annealed to every DNA segment attached to the flow cell. A series of cycles is preformed where a single (reversible terminators) fluorescent nucleotide is added to each growing DNA fragment. The each cluster of fragments emits a specific wavelength corresponding to one of four nucleotides. By measuring the wavelength of light emitted from each cluster after each cycle, the sequence of each fragment can be determined.

Finally, a technique known as paired-end sequencing can be used to double the sequencing information derived from each sample. The general idea is that each strand of DNA is sequenced from 5’ to 3’.  This results in two sets of sequences, one corresponding to each DNA strand. Each pair may or may not overlap depending on the size of the fragment and the length of each read.

(For further details and figures, please check out this page: http://nextgen.mgh.harvard.edu/IlluminaChemistry.html)

Once all this is done, the Illumina platform analyzes a huge set of images to derive the sequence of all the fragments. Our RNA-seq experiment uses random priming, pair-end sequencing, and is not strand specific. The blog posts that follow this will be written for this type of analysis.

In the next post, I will explain the output of this sequencing and how to analyze its quality.