Sequencing Data

Publicly Available Data on NCBI

All sequencing data is stored on NCBI in two databases called GEO and SRA. These are essentially the "Pubmed"-equivalent of data storage.

GEO: Stores the meta-information about sequencing experiments, along with processed sequencing data (e.g. bedgraph files)
SRA (short-read archive): Linked to the GEO database, where the FASTQ files from the next-generation sequencers are stored

Typically (or hopefully) a manuscript will list the accession number either at the end of the paper or in the Methods. The Zika paper lists this as their accession number: GSE78711

GSE: The "GSE" prefix indicates that this is a GEO Series, meaning that the database entry serves as a summary of all the experiments contained in the experiment. Typically this lists information about the associated paper (authors, abstract), study design, sequencing platforms utilized, and a list of samples.
GSM: Each sample has a "GSM" prefix, and goes into further detail about how the sample was experimentally treated, how the DNA was extracted, and what type of experiment this is (e.g. RNA-seq). At the bottom of the GSM page, there is an ftp link to obtain the SRA reads. Right click on the link and select "Copy Link Address" to copy the web address to your clipboard.
SRR: The "SRR" prefix followed by a 7 digit accession denotes a sequencing run. To facilitate more efficient storage and faster transfer, sequencing files are compressed (typically by gzip compression, with the extension *.gz). NCBI uses their own compression format called SRA (more on this later).

Fetching Reads from the NCBI SRA Website to your server

To obtain the SRA reads, follow these steps:

Change into your home directory (cd ~).
Use the wget Linux utility to fetch the reads from the NCBI website. wget is essentially a command-line downloader.
- wget <web_address_of_filename>
- Example: wget ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByStudy/sra/SRP/SRP070/SRP070895/SRR3191542/SRR3191542.sra
You will see a few lines that establish the connection to the FTP server, followed by a progress bar that shows the status of the download. The download is complete when you have returned to your Linux prompt.
If you now list all SRA files, you should see the following:
jabelsky@dinesh-mdh:~$ ls *.sra
SRR3191542.sra
jabelsky@dinesh-mdh:~$ ls -lh *.sra
-rw-r--r-- 1 jabelsky jabelsky 465M Apr 7 18:21 SRR3191542.sra
jabelsky@dinesh-mdh:~$
Now that we have the SRA document, we need to convert it to a FASTQ format

Converting SRA to FASTQ

The NCBI provides a utility called the SRA Toolkit that will "unzip" the SRA file into the component FASTQ files. As with many Linux tools, this tool is not installed on our virtual machine, but it is available for download on the NCBI website.

Go to the SRA Tookit website, and locate the second link under the heading "1. NCBI SRA Toolkit latest release": Ubuntu Linux 64 bit architecture
As with obtaining the SRA reads, right click on "Ubuntu Linux 64 bit architecture" and select "Copy Link Address"
Now go to the command line on your virtual machine. First, change into your home directory (cd ~)
Now, use wget Linux utility to download the software.
- wget http://ftp-trace.ncbi.nlm.nih.gov/sra/sdk/2.5.7/sratoolkit.2.5.7-ubuntu64.tar.gz
- You'll notice that the filename has the extension *.tar.gz. tar stands for archive, and it's the Windows-equivalent of zip (putting multiple files/folders into one archive file). gz stands for "gzipped", and indicates that the tar archive file (also known as a tar-ball) is compressed.
To access the software embedded in the tar-ball, we have to unpack the tar ball. We can use the tar command in conjunction with several options:
- tar -x -z -f sratoolkit.2.5.7-ubuntu64.tar.gz
- -x: indicates that the tar-ball needs to be uncompressed
- -z: indicates that the compression format is gzip
- -f <filename>: indicates the file to perform the action
Once your cursor returns to your command prompt, the operation is complete. If you list all the files in your directory, you should now see a directory entitled sratoolkit.2.5.7-ubuntu64 d
Change into the directory, list the files, and then change into the bin directory. If you list the files, you will see many files that are colored green. These are executables, and indicate software utilities that are now available on your virtual machine.
1. The command we are primarily interested in is called fastq-dump.2.5.7. If you are in the bin directory, issue the command ./fastq-dump.2.5.7 to run the command:
  
  jabelsky@dinesh-mdh:~/sratoolkit.2.5.7-ubuntu64/bin$ ./fastq-dump.2.5.7
  
  Usage:
  ./fastq-dump.2.5.7 [options] <path> [<path>...]
  ./fastq-dump.2.5.7 [options] <accession>
  
  Use option --help for more information
  
  ./fastq-dump.2.5.7 : 2.5.7
2. Because this is a utility that we installed locally, we always have to specify the specific path to the command. If we are in the sratoolkit.2.5.7-ubuntu64/bin subdirectory, we can use the "./" notation to mean "execute this utility in the current directory"
  1. We could also use the ~/sratoolkit.2.5.7-ubuntu64/bin/fastq-dump.2.5.7 . Remember that ~ is equivalent to /home/duke_biochemistry_grad d
  2. Likewise, we could also specify the full path: /home/duke_biochemistry_grad/sratoolkit.2.5.7-ubuntu64/bin/fastq-dump.2.5.7

Using the fastq-dump utility to obtain the FASTQ reads

Change into your home directory (cd ~)
Just like with the tar command, we will use the fastq-dump command on the SRA file to extract the FASTQ reads. Like with the tar command, we need to include several options as well. For simplicity, let's just extract the first 10 reads.
- ~/sratoolkit.2.5.7-ubuntu64/bin/fastq-dump.2.5.7 -X 10 --split-files SRR3191542.sra
- -X <#>: Extract the first # reads from the SRA file
- --split-files: Generate separate "R1" and "R2" files for a paired-end experiment
If you now list at all the files beginning with SRR, you should now see addition *.fastq files:

jabelsky@dinesh-mdh:~$ ls SRR*
SRR3191542_1.fastq SRR3191542_2.fastq SRR3191542.sra

An even easier way to obtain the FASTQ reads (without having to download the *.sra file directly)

In the newest version of the sratoolkit (version 2.5.7), you can just specify the SRR accession number as the "input" argument to fastq-dump. fastq-dump will automatically fetch the SRA and convert it to the FASTQ format without having to explicitly download the *.sra file. For example, if you want to download the first 1,000,000 reads from the Mock2-1 dataset (SRR3194428), you can issue the following:

jabelsky@dinesh-mdh:~$ sratoolkit.2.5.7-ubuntu64/bin/fastq-dump -X 1000000 SRR3194428 Read 1000000 spots for SRR3194428 Written 1000000 spots for SRR3194428 jabelsky@dinesh-mdh:~$ ls SRR3194428* SRR3194428.fastq

This creates the output file SRR3194428.fastq, which contains the first 1,000,000 reads from the SRR3194428 file.

What is a FASTQ file?

The FASTQ format is a standard way to store sequencing read information. Check the wikipedia page on the specifications: https://en.wikipedia.org/wiki/FASTQ_format

jabelsky@dinesh-mdh:~$ head -12 SRR3191542_1.fastq
@SRR3191542.1 1 length=76
TGGCACATGTATACATATGTAACTAACCTGCCCGTTCCGCACTTCTACCCTCCCCCTTTTTTTCCCCCCACCTACC
+SRR3191542.1 1 length=76
8--8-C,C,C,C9C,C,C,C,,CC,,CCC,;,;,8;,;,6+8+;,;,<6;;,,,,67;,,;8+,,,,+7++8,,9,
@SRR3191542.2 2 length=76
CCTGGTTTAATTGCAGCATTATCTGTTATATTAACGTTTTTAACTGTCTCTTTCTTCCTCTTTCCTTCTTCCCTTT
+SRR3191542.2 2 length=76
88A99EFE8,@C,C,,C,CE,CFE,CC,C,<;,,,,,;;6:,,<<,<6<,66;;,,6,,6,,<6;,,,6<;;6,,;
@SRR3191542.3 3 length=76
CGAATTGTAGCCAGGATACTTCACCAAAGAAACATAATTTATATACTCCCTCCCCCCTACTCTTCCTCTCTTTTTC
+SRR3191542.3 3 length=76
8---AC,C,,CC,,,,C,CEEC,CC,,,,,,,;,;,,,,<,<,,,;,6;;,;88678+,;6;6;;,,66:,<<,66

A FASTQ file normally uses four lines per sequence.

Line 1 begins with a '@' character and is followed by a sequence identifier and an optional description (like a FASTA title line).
Line 2 is the raw sequence letters.
Line 3 begins with a '+' character and is optionally followed by the same sequence identifier (and any description) again.
Line 4 encodes the quality values for the sequence in Line 2, and must contain the same number of symbols as letters in the sequence.

Quality Score

The quality score is encoded in ASCII, meaning that every character is equivalent to a numeric value. This is useful as it allows a single character to correspond to a number that would otherwise be equal to 2 digits.

!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJ

33…………………………………………………….74

To obtain the quality score, first all the scores are shifted so that they are in the range from 0 to 41. This is done by subtracting 33 from every score.

The quality score is directly related to the probability that the sequencer has "called" the nucleotide incorrectly.

Q_score= -10 log₁₀(p)

p = 10^{-(Qscore / 10)}

For example, if the sequencer gives us a "#", that is equal to an ascii score of 35. We subtract 33 from that number and get 2. Then the probability that the sequencer has called this nucleotide incorrectly is:

p = 10^-(^{2 / 10)} = 10^-1/5= 0.6310

So there is a 63.1% chance that the nucleotide called is wrong.

How about for a quality score of an "C". The ascii score of "C" is 67, and subtracting 33 gives us 24. Then the probability that the sequencer has called this nucleotide incorrectly is:

p = 10^{-(24 / 10)}= 0.00398

So there is a 0.398% chance that the nucleotide called is wrong.