Publicly Available Data on NCBI

...

  1. Change into your home directory (cd ~)
  2. As with the tar command, we will run the fastq-dump command on the SRA file to extract the FASTQ reads, and we need to pass it a few options.  For simplicity, let's extract only the first 10 reads.
    • ~/sratoolkit.2.5.7-ubuntu64/bin/fastq-dump.2.5.7 -X 10 --split-files SRR3191542.sra
    • -X <#>: Extract the first # reads from the SRA file
    • --split-files: Generate separate "R1" and "R2" files for a paired-end experiment
  3. If you now list all the files beginning with SRR, you should see the additional *.fastq files:

    jabelsky@dinesh-mdh:~$ ls SRR*
    SRR3191542_1.fastq SRR3191542_2.fastq SRR3191542.sra
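
As a quick sanity check on the extraction, the number of reads in each file can be estimated by dividing its line count by four.  This is only a sketch: it assumes every record occupies exactly four lines (no wrapped sequence lines), and count_fastq_reads is a hypothetical helper name, not part of the sratoolkit.

```shell
# Hypothetical helper: estimate the number of reads in a FASTQ file.
# Assumption: each record is exactly four lines long.
count_fastq_reads() {
    local lines
    lines=$(wc -l < "$1")
    echo $(( lines / 4 ))
}

# With -X 10, each split file is expected to report 10 reads
# (assuming every extracted spot contains both mates).
for f in SRR3191542_1.fastq SRR3191542_2.fastq; do
    if [ -f "$f" ]; then
        echo "$f: $(count_fastq_reads "$f") reads"
    fi
done
```

If a file reports fewer reads than expected, it is worth re-running fastq-dump before moving on.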

An even easier way to obtain the FASTQ reads (without having to download the *.sra file directly)

With sratoolkit version 2.5.7, you can specify the SRR accession number directly as the "input" argument to fastq-dump.  fastq-dump will automatically fetch the SRA data and convert it to the FASTQ format, so you do not have to download the *.sra file yourself.  For example, to download the first 1,000,000 reads from the Mock2-1 dataset (SRR3194428), you can issue the following:

jabelsky@dinesh-mdh:~$ sratoolkit.2.5.7-ubuntu64/bin/fastq-dump -X 1000000 SRR3194428
Read 1000000 spots for SRR3194428
Written 1000000 spots for SRR3194428

jabelsky@dinesh-mdh:~$ ls SRR3194428*
SRR3194428.fastq

This creates the output file SRR3194428.fastq, which contains the first 1,000,000 reads from the SRR3194428 dataset.


What is a FASTQ file?

The FASTQ format is a standard way to store sequencing read information.  See the Wikipedia page for the specification: https://en.wikipedia.org/wiki/FASTQ_format

...

  • Line 1 begins with a '@' character and is followed by a sequence identifier and an optional description (like a FASTA title line).

  • Line 2 is the raw sequence letters.

  • Line 3 begins with a '+' character and is optionally followed by the same sequence identifier (and any description) again.

  • Line 4 encodes the quality values for the sequence in Line 2, and must contain the same number of symbols as letters in the sequence.
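
The four-line layout described above can be checked mechanically.  The following is a minimal sketch, assuming each record is exactly four lines (no wrapped sequence lines); validate_fastq is a hypothetical helper name, and the sample read is made up for illustration.

```shell
# Hypothetical helper: check that a file follows the four-line FASTQ layout.
# Assumption: records are never wrapped across more than four lines.
# Returns 0 if the file looks well-formed, 1 otherwise.
validate_fastq() {
    awk '
        NR % 4 == 1 && substr($0, 1, 1) != "@" { print "line " NR ": expected @"; bad = 1 }
        NR % 4 == 2 { seqlen = length($0) }
        NR % 4 == 3 && substr($0, 1, 1) != "+" { print "line " NR ": expected +"; bad = 1 }
        NR % 4 == 0 && length($0) != seqlen { print "line " NR ": quality length mismatch"; bad = 1 }
        END {
            if (NR % 4 != 0) { print "incomplete final record"; bad = 1 }
            exit bad + 0
        }
    ' "$1"
}

# Build a one-record example file (made-up read) and validate it.
example=$(mktemp)
printf '@read1 optional description\nGATTACA\n+\nIIIIIII\n' > "$example"
if validate_fastq "$example"; then
    echo "example record is well-formed"
fi
rm -f "$example"
```

The same function can be pointed at the files produced by fastq-dump, for example validate_fastq SRR3191542_1.fastq.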

...