Sequencing Data

GEO Project Link: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE78711

SRA data: ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByStudy/sra/SRP/SRP070/SRP070895/SRR3191542/

SRA Toolkit: http://www.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?cmd=show&f=software&m=software&s=software

Choose "Ubuntu Linux 64-bit Architecture"

~/sratoolkit.2.5.7-ubuntu64/bin/fastq-dump -N 1 -X 10 --split-3 SRR3191542.sra

FASTQ format: https://en.wikipedia.org/wiki/FASTQ_format

jabelsky@dinesh-mdh:~$ head -12 SRR3191542_1.fastq
@SRR3191542.1 1 length=76
TGGCACATGTATACATATGTAACTAACCTGCCCGTTCCGCACTTCTACCCTCCCCCTTTTTTTCCCCCCACCTACC
+SRR3191542.1 1 length=76
8--8-C,C,C,C9C,C,C,C,,CC,,CCC,;,;,8;,;,6+8+;,;,<6;;,,,,67;,,;8+,,,,+7++8,,9,
@SRR3191542.2 2 length=76
CCTGGTTTAATTGCAGCATTATCTGTTATATTAACGTTTTTAACTGTCTCTTTCTTCCTCTTTCCTTCTTCCCTTT
+SRR3191542.2 2 length=76
88A99EFE8,@C,C,,C,CE,CFE,CC,C,<;,,,,,;;6:,,<<,<6<,66;;,,6,,6,,<6;,,,6<;;6,,;
@SRR3191542.3 3 length=76
CGAATTGTAGCCAGGATACTTCACCAAAGAAACATAATTTATATACTCCCTCCCCCCTACTCTTCCTCTCTTTTTC
+SRR3191542.3 3 length=76
8---AC,C,,CC,,,,C,CEEC,CC,,,,,,,;,;,,,,<,<,,,;,6;;,;88678+,;6;6;;,,66:,<<,66

A FASTQ file normally uses four lines per sequence.

Line 1 begins with a '@' character and is followed by a sequence identifier and an optional description (like a FASTA title line).
Line 2 is the raw sequence letters.
Line 3 begins with a '+' character and is optionally followed by the same sequence identifier (and any description) again.
Line 4 encodes the quality values for the sequence in Line 2, and must contain the same number of symbols as letters in the sequence.
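
The four-line layout described above can be read with a short loop. The following is a minimal Python sketch rather than a full parser: it assumes every record occupies exactly four lines (no wrapped sequences), the function name `read_fastq` is hypothetical, and the sample record is made up for illustration.

```python
import io
from typing import Iterator, TextIO, Tuple


def read_fastq(handle: TextIO) -> Iterator[Tuple[str, str, str]]:
    """Yield (identifier, sequence, quality) for each four-line FASTQ record."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break  # end of input (a blank line also stops parsing)
        sequence = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@"):
            raise ValueError("Line 1 must begin with '@': " + header)
        if not plus.startswith("+"):
            raise ValueError("Line 3 must begin with '+': " + plus)
        if len(sequence) != len(quality):
            raise ValueError("Lines 2 and 4 differ in length for " + header)
        yield header[1:], sequence, quality


# Made-up record, used only to exercise the function.
example = io.StringIO("@read1 example\nACGT\n+\n8C,#\n")
records = list(read_fastq(example))
print(records)  # [('read1 example', 'ACGT', '8C,#')]
```

To read a real file, the same function could be called with an open file handle, for example `open("SRR3191542_1.fastq")`.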

Quality Score

The quality score is encoded as ASCII characters, where every character corresponds to a numeric code. This lets a single character represent a score that would otherwise need up to two digits.

!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJ

ASCII codes: 33 (for "!") through 74 (for "J")
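
As a quick check of this range, the following Python sketch rebuilds the character list from the codes 33 to 74 and reads the codes back with the built-in `ord`. The variable names are illustrative only.

```python
# Build the quality characters from their ASCII codes, 33 ('!') to 74 ('J').
quality_chars = "".join(chr(code) for code in range(33, 75))

# Map each character back to its ASCII code.
codes = [ord(ch) for ch in quality_chars]

print(quality_chars)                    # !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJ
print(codes[0], codes[-1], len(codes))  # 33 74 42
```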

To obtain the quality score, subtract 33 from the ASCII code of each character. This shifts the values into the range from 0 to 41.

The quality score is directly related to the probability that the sequencer has "called" the nucleotide incorrectly.

Q_score = -10 log₁₀(p)

p = 10^{-(Q_score / 10)}
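
The two steps (subtract 33, then apply the formula) can be sketched in Python as follows. This is an illustrative sketch that assumes the offset of 33 described above. The names `phred_scores`, `error_probability`, and `quality_from_probability` are hypothetical, and the quality string is made up.

```python
import math
from typing import List

PHRED_OFFSET = 33  # offset described in the text


def phred_scores(quality: str) -> List[int]:
    """Convert a quality string into numeric quality scores."""
    return [ord(ch) - PHRED_OFFSET for ch in quality]


def error_probability(q_score: float) -> float:
    """p = 10^(-Q/10): probability that the base call is wrong."""
    return 10 ** (-q_score / 10)


def quality_from_probability(p: float) -> float:
    """Q = -10 * log10(p): the inverse of error_probability."""
    return -10 * math.log10(p)


quality = "#5I"  # made-up quality string: ASCII codes 35, 53, 73
scores = phred_scores(quality)
probs = [error_probability(q) for q in scores]

for ch, q, p in zip(quality, scores, probs):
    print(ch, q, round(p, 6))
```

For this input the scores are 2, 20, and 40, so the error probabilities fall from about 0.63 to 0.01 to 0.0001: every 10 points of quality score reduce the error probability by a factor of 10.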

For example, if the sequencer gives us a "#", that corresponds to an ASCII code of 35. We subtract 33 from that number and get 2. Then the probability that the sequencer has called this nucleotide incorrectly is:

p = 10^{-(2 / 10)} = 10^{-1/5} = 0.6310

So there is a 63.1% chance that the nucleotide called is wrong.

Now consider a quality character of "C". The ASCII code of "C" is 67, and subtracting 33 gives us 34. Then the probability that the sequencer has called this nucleotide incorrectly is:

p = 10^{-(34 / 10)} = 0.000398

So there is a 0.0398% chance that the nucleotide called is wrong.