3. Day 2: Introduction to basic NGS pipelines

3.1. Overview

3.1.1. Lead Instructor

3.1.2. Helper(s)

  • Maria Tsagiopoulou | email
  • Ola Karrar | [@ZeinOla][https://twitter.com/ZeinOla] | email

3.1.3. General Topics

  • Introduction to basic NGS pipelines
    • Quality Assessment
      • fastqc / multiqc
      • trimmomatic
    • Mapping
      • bwa mem
    • Example pipeline: Variant Calling
      • samtools / bcftools

3.2. Schedule

  • 09:00 - 10:00 Intro to NGS pipelines
  • 10:00 - 11:00 Short read quality and trimming (part 1)
  • 11:00 - 11:30 Coffee break
  • 11:30 - 12:30 Short read quality and trimming (part 2)
  • 12:30 - 14:00 Lunch break
  • 14:00 - 15:00 Mapping
  • 15:00 - 16:00 Variant calling (part 1)
  • 16:00 - 16:15 Coffee break
  • 16:15 - 18:00 Variant calling (part 1)

3.3. Learning Objectives

  • Use fastqc and multiqc
  • Visualize read quality
  • Quality filter and trim reads
  • Distinguishing good/bad quality reads

3.4. Introduction to basic NGS pipelines

The slides of the introductory talk are available here

3.4.1. Basic DNASeq analysis/Variant calling workflow

Alt text

The following material and exercises are based on the Next-Gen Sequence Analysis Workshop that took place during the Data Intensive Biology Summer Institute at UC Davis, USA, 2-14/07/2018 (link here)

3.4.2. Quality Assessment

Learning objectives

  1. Use fastqc and multiqc
  2. Visualize read quality
  3. Quality filter and trim reads
  4. Distinguishing good/bad quality reads

We will use the following dataset. Run the command to download it locally:

curl -L -o ecoli_ref-5m.fastq.gz https://osf.io/frdz5/download

This is a FASTQ file – let’s take a look at it:

less ecoli_ref-5m.fastq.gz

(use the spacebar to scroll down, and type q to exit less)


  • where does the filename come from?

Links: FastQC

We’re going to use FastQC to summarize the data. We already installed fastqc above, with the conda command.

Now, run FastQC on the file:

fastqc ecoli_ref-5m.fastq.gz

Now type ls:

ls -d *fastqc.zip*

to list the files, and you should see:


Inside the fastqc directory you will find the report from the fastqc program. You can also download my versions of the html and zip outputs.


  • What should you pay attention to in the FastQC report?


There are several caveats about FastQC - the main one is that it only calculates certain statistics (like duplicated sequences) for subsets of the data (e.g. duplicate sequences are only analyzed for the first 100,000 sequences in each file Trimmomatic

Now we’re going to do some trimming! We’ll be using Trimmomatic, which (as with fastqc) we’ve already installed via conda.

The first thing we’ll need are the adapters to trim off (you might need to change the following command to reflect your own setup):

cp /home/fpsom/miniconda3/pkgs/trimmomatic-0.38-1/share/trimmomatic-0.38-1/adapters/TruSeq2-PE.fa .

You can look at the contents of this file with cat TruSeq2-PE.fa (you can download the same file from here)

Now, to run Trimmomatic on both of them:

trimmomatic SE ecoli_ref-5m.fastq.gz \
        ecoli_ref-5m.qc.fq.gz \
        ILLUMINACLIP:TruSeq2-PE.fa:2:40:15 \
        LEADING:2 TRAILING:2 \
        SLIDINGWINDOW:4:2 \

You should see output that looks like this:

Input Reads: 5000000 Surviving: 4996146 (99.92%) Dropped: 3854 (0.08%)
TrimmomaticSE: Completed successfully


  • How do you figure out what the parameters mean?
  • How do you figure out what parameters to use?
  • What adapters do you use?
  • What version of Trimmomatic are we using here? (And FastQC?)
  • Do you think parameters are different for RNAseq and genomic data sets?
  • What’s with these annoyingly long and complicated filenames?

For a discussion of optimal trimming strategies, see MacManes, 2014 – it’s about RNAseq but similar arguments should apply to metagenome assembly.

Links: FastQC again

Run FastQC again on the trimmed files:

fastqc ecoli_ref-5m.qc.fq.gz

Let’s take a look at the output files:

less ecoli_ref-5m.qc.fq.gz                                                    

(again, use spacebar to scroll, q to exit less).

You can also download my versions of the html and zip outputs. MultiQc

MultiQC aggregates results across many samples into a single report for easy comparison.

Run mulitqc on both the untrimmed and trimmed files

multiqc .

And now you should see output that looks like this:

[INFO   ]         multiqc : This is MultiQC v1.6
[INFO   ]         multiqc : Template    : default
[INFO   ]         multiqc : Searching '.'
Searching 7 files..  [####################################]  100%
[INFO   ]          fastqc : Found 2 reports
[INFO   ]         multiqc : Compressing plot data
[INFO   ]         multiqc : Report      : multiqc_report.html
[INFO   ]         multiqc : Data        : multiqc_data
[INFO   ]         multiqc : MultiQC complete


  • is the quality trimmed data “better” than before?
  • Does it matter that you still have adapters!?

Exercise 1

Try running trimmomatic with different paramaters in order to further improve / salvage your dataset. Try with different setups and identify the impact they have.

Exercise 2

Run the same steps for the following dataset that comes from the Schurch et al, 2016 yeast RNAseq study:

curl -L https://osf.io/5daup/download -o ERR458493.fastq.gz
curl -L https://osf.io/8rvh5/download -o ERR458494.fastq.gz
curl -L https://osf.io/2wvn3/download -o ERR458495.fastq.gz
curl -L https://osf.io/xju4a/download -o ERR458500.fastq.gz
curl -L https://osf.io/nmqe6/download -o ERR458501.fastq.gz
curl -L https://osf.io/qfsze/download -o ERR458502.fastq.gz

3.4.3. Mapping and Variant Calling

Learning objectives

  1. Explore coverage
  2. Understand the basics of mapping-based variant calling
  3. Learn basics of actually calling variants & visualizing

We will use the dataset you cleaned (trimmed and filtered) during the previous exercise. You can make sure you have the correct data loaded if you get the following output:

ls -la

-rwxrwxrwx 1 fpsom fpsom  59532325 Aug 19 11:56 ERR458493.fastq.gz
-rwxrwxrwx 1 fpsom fpsom  59459449 Aug 19 12:02 ERR458493.qc.fq.gz
-rwxrwxrwx 1 fpsom fpsom  58566854 Aug 19 11:57 ERR458494.fastq.gz
-rwxrwxrwx 1 fpsom fpsom  58496906 Aug 19 12:02 ERR458494.qc.fq.gz
-rwxrwxrwx 1 fpsom fpsom  58114810 Aug 19 11:58 ERR458495.fastq.gz
-rwxrwxrwx 1 fpsom fpsom  58048466 Aug 19 12:03 ERR458495.qc.fq.gz
-rwxrwxrwx 1 fpsom fpsom 102201086 Aug 19 11:59 ERR458500.fastq.gz
-rwxrwxrwx 1 fpsom fpsom 102160038 Aug 19 12:03 ERR458500.qc.fq.gz
-rwxrwxrwx 1 fpsom fpsom 101222099 Aug 19 11:59 ERR458501.fastq.gz
-rwxrwxrwx 1 fpsom fpsom 101182595 Aug 19 12:04 ERR458501.qc.fq.gz
-rwxrwxrwx 1 fpsom fpsom 100585843 Aug 19 12:01 ERR458502.fastq.gz
-rwxrwxrwx 1 fpsom fpsom 100545339 Aug 19 12:10 ERR458502.qc.fq.gz Download and gunzip the reference:

curl -O https://downloads.yeastgenome.org/sequence/S288C_reference/orf_dna/orf_coding.fasta.gz
gunzip orf_coding.fasta.gz

and look at it:

head orf_coding.fasta Prepare it for mapping:

bwa index orf_coding.fasta Map!

bwa mem -t 4 orf_coding.fasta ERR458493.qc.fq.gz  > ERR458493.sam Observe!

head ERR458493.sam

what does all this mean?? Visualize mapping

Goal: make it possible to go look at a specific bit of the genome. Index the reference genome:

samtools faidx orf_coding.fasta Convert the SAM file into a BAM file:

samtools import orf_coding.fasta.fai ERR458493.sam ERR458493.bam Sort the BAM file by position in genome:

samtools sort ERR458493.bam -o ERR458493.sorted.bam

3.4.4. Index the BAM file so that we can randomly access it quickly:

samtools index ERR458493.sorted.bam Visualize with tview:

samtools tview ERR458493.sorted.bam orf_coding.fasta

tview commands of relevance:

  • left and right arrows scroll
  • q to quit
  • CTRL-h and CTRL-l do “big” scrolls
  • Typing g allows you to go to a specific location, in this format chromosome:location. Here are some locations you can try out:
    • YLR162W:293 (impressive pileup, shows two clear variants and three other less clear)
    • YDR034C-A:98 (impressive pileup, shows two clear variants)
    • YDR366C:310 (impressive pileup, less clear variants)
    • YLR256W:4420 (impressive pileup, less clear variants)
    • YBL105C:2179 (less depth, shows two clear variants)
    • YDR471W:152 (impressive pileup, shows one clear variant)

Get some summary statistics as well:

samtools flagstat ERR458493.sorted.bam

The output should look like this:

1092728 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
13 + 0 supplementary
0 + 0 duplicates
989419 + 0 mapped (90.55% : N/A)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (N/A : N/A)
0 + 0 with itself and mate mapped
0 + 0 singletons (N/A : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5) Call variants!

Goal: find places where the reads are systematically different from the genome.

Now we can call variants using samtools mpileup:

samtools mpileup -u -t DP -f orf_coding.fasta ERR458493.sorted.bam | \
    bcftools call -mv -Ov > variants.vcf

You can download my version of the variants.vcf from here.

To look at the entire variants.vcf file you can do cat variants.vcf; all of the lines starting with # are comments. You can use tail variants.vcf to see the last ~10 lines, which should be all of the called variants.

Exercise 1

Try to run the variant calling pipeline using one of the other ERR458*** samples (i.e. bwa mem, samtools etc). Can you identify differences in the new vcf file?

Exercise 2

In a realistic scenario, you don’t want the variants per sample, but rather all the samples together (the mapping process is still done per sample). What would be the new samtools mpileup command that will incorporate multiple files in the variant calling process?


samtools mpileup -D -u -t DP -f orf_coding.fasta ERR458493.sorted.bam ERR458494.sorted.bam ERR458495.sorted.bam ERR458500.sorted.bam ERR458501.sorted.bam ERR458502.sorted.bam | bcftools call -mv -Ov > variantsAll.vcf

You can download the new “full” vcf file from here. Discussion points / extra things to cover

  • What are the drawbacks to mapping-based variant calling? What are the positives?
  • Where do reference genomes come from?