Slide 1: Genome assembly with SPAdes
Center for Algorithmic Biotechnology
SPbU
Slide 5: Why assemble?
Sequencing data
Billions of short reads
Sequencing errors
Contaminants
Hard to perform analysis
Assembly
Corrects sequencing errors
Much longer sequences
Each genomic region is represented only once
May introduce errors
Slide 9: De novo whole genome assembly
Slide 11: Genomic repeats
TATTCTTCCACGTAGGGCCTTCCACGCTTCG
Slide 12: Genomic repeats
TATTCTTC
CTTCCACG
CACGTAGG
GGCCTTCC
CTTCCACG
CACGCTTCG
TATTCTTCCACGTAGGGCCTTCCACGCTTCG
Slide 13: Genomic repeats
TATTCTTC
CTTCCACG
CACGTAGG
GGCCTTCC
CTTCCACG
CACGCTTCG
Slide 14: Genomic repeats
TATTCTTCCACGTAGG
GGCCTTCCACGCTTCG
TATTCTTCCACGCTTCG
GGCCTTCCACGTAGG
Slide 15: Genomic repeats
TATTCTTCCACGTAGG
ACGTAGGGCCTT
GCCTTCCACGCTTCG
TATTCTTCCACGTAGGGCCTTCCACGCTTCG
Slide 16: Genomic repeats
TATTCTTCCACGTAGG
ACGTAGGGCCTT
GCCTTCCACGCTTCG
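The repeat example on the slides above can be checked mechanically. The sketch below is only an illustration and not part of SPAdes: the helper name `explains` is hypothetical, and a set of contigs is treated as consistent with the reads if every read occurs as a substring of some contig. Under that assumption, the short reads fit both reconstructions, while the longer reads that span the repeat CTTCCACG rule out the swapped one.

```python
# Reads and candidate reconstructions copied from the slides above.
short_reads = ["TATTCTTC", "CTTCCACG", "CACGTAGG",
               "GGCCTTCC", "CTTCCACG", "CACGCTTCG"]
long_reads = ["TATTCTTCCACGTAGG", "ACGTAGGGCCTT", "GCCTTCCACGCTTCG"]

genome = ["TATTCTTCCACGTAGGGCCTTCCACGCTTCG"]
true_contigs = ["TATTCTTCCACGTAGG", "GGCCTTCCACGCTTCG"]
swapped_contigs = ["TATTCTTCCACGCTTCG", "GGCCTTCCACGTAGG"]


def explains(contigs, reads):
    """Return True if every read occurs as a substring of some contig."""
    return all(any(read in contig for contig in contigs) for read in reads)


# Short reads cannot tell the two reconstructions apart.
short_true = explains(true_contigs, short_reads)
short_swapped = explains(swapped_contigs, short_reads)

# Longer reads span the repeat and fit only the true genome.
long_genome = explains(genome, long_reads)
long_swapped = explains(swapped_contigs, long_reads)

print(short_true, short_swapped, long_genome, long_swapped)
```

Real assemblers work on graphs rather than substring checks, so this only demonstrates why reads shorter than a repeat leave the order of its flanking regions ambiguous.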
Slide 20: SPAdes first steps
spades.py
spades.py --help
spades.py --test
-o output_dir
Slide 21: Input data formats
FASTA: .fasta / .fa
FASTQ: .fastq / .fq
Gzipped: .gz
Slide 22: Input data options
Unpaired reads
Illumina unpaired
-s single.fastq
-s single1.fastq -s single2.fastq ...
Slide 23: Input data options
Paired-end reads
Interlaced pairs in one file
>left_read_id
ACGTGCAGG…
>right_read_id
GCTTCGAGG…
Separate files
file1.fastq:
>left_read_id
ACGTGCAGG…
file2.fastq:
>right_read_id
GCTTCGAGG…
Slide 25: Input data options
Paired-end reads
Interlaced pairs in one file
--pe1-12 file.fastq
Separate files
--pe1-1 file1.fastq --pe1-2 file2.fastq
--pe1-s unpaired.fastq
Slide 26: SPAdes performance options
Number of threads
-t N
Maximum available RAM (GB); SPAdes terminates if the limit is exceeded
-m M
Slide 27: Pipeline options
Run only the assembler (input reads are already corrected or quality-trimmed)
--only-assembler
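The options introduced so far can be combined into a single command line. The sketch below only builds the argument list and does not run SPAdes. The function name `build_spades_command` and the file names are hypothetical; the flags themselves (`--pe1-1`, `--pe1-2`, `-o`, `-t`, `-m`, `--only-assembler`) are the ones shown on the slides.

```python
import shlex


def build_spades_command(out_dir, left, right,
                         threads=None, memory_gb=None, only_assembler=False):
    """Build a spades.py argument list for one paired-end library."""
    cmd = ["spades.py", "--pe1-1", left, "--pe1-2", right, "-o", out_dir]
    if threads is not None:
        cmd += ["-t", str(threads)]      # number of threads
    if memory_gb is not None:
        cmd += ["-m", str(memory_gb)]    # RAM limit in GB
    if only_assembler:
        cmd.append("--only-assembler")   # skip read error correction
    return cmd


cmd = build_spades_command("spades_out", "file1.fastq", "file2.fastq",
                           threads=8, memory_gb=32, only_assembler=True)
print(shlex.join(cmd))
```

To launch the run, the list could be passed to `subprocess.run(cmd, check=True)`, assuming `spades.py` is on the PATH.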
Slide 28: Input data options
Mate-pair reads
Cannot be used on their own
Interlaced pairs in one file
--mp1-12 mp.fastq
Separate files
--mp1-1 mp1.fastq --mp1-2 mp2.fastq
Slide 29: Hybrid assembly options
PacBio CLR
--pacbio pb.fastq
Oxford Nanopore reads
--nanopore nanopore_reads.fastq
Slide 30: Restarting SPAdes
If SPAdes or the system crashed
--continue -o your_output_dir
Slide 31: Genome assembly evaluation with QUAST
Center for Algorithmic Biotechnology
SPbU
Slide 32: In reality
SPAdes
ABySS
IDBA
Ray
Velvet
…
Slide 33: Which assembler to use?
ABySS
ALLPATHS-LG
CLC
IDBA-UD
MaSuRCA
MIRA
Ray
SOAPdenovo
SPAdes
Velvet
and many more...
Slide 34: Which assembler to use?
Sequencing technology (Illumina, 454, IonTorrent, ...)
Genome type and size (bacteria, insects, mammals, plants, ...)
Type of prepared libraries (single reads, paired-end, mate-pairs, combinations)
Type of data (multi-cell, metagenomic, single-cell)
Slide 36: Which assembler to use?
Assemblathon 1 & 2
Simulated and real datasets
More than 30 teams competing
Independent studies
Papers (GAGE, GAGE-B, GABenchToB)
Websites (nucleotid.es, …)
Surveys
Genome assembly evaluation tools
QUAST
GAGE
Slide 37: Assembly evaluation
Basic evaluation
No extra input
Very quick
Reference-based evaluation
A lot of metrics
Very accurate
De novo evaluation
Advanced analysis of de novo assemblies
Slide 38: Basic statistics
Only assemblies are needed (no additional input)
Very fast to compute
Slide 42: Contig sizes
Number of contigs
Number of large contigs (i.e. > 1000 bp)
Largest contig length
Total assembly length
Slide 50: N50
The maximum length X for which the collection of all contigs of length >= X covers at least 50% of the assembly
N50 = 60
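The definition can be turned into code almost word for word. The sketch below is a literal, unoptimized reading of it. The contig lengths are hypothetical (the lengths in the slide figure are not recoverable from the text) and were chosen so that the result matches the N50 = 60 quoted on the slide.

```python
def n50(lengths):
    """Largest X such that contigs of length >= X cover >= 50% of the assembly."""
    total = sum(lengths)
    # Only actual contig lengths need to be tried as candidate values of X.
    candidates = [
        x for x in set(lengths)
        if 2 * sum(length for length in lengths if length >= x) >= total
    ]
    return max(candidates)


# Hypothetical contig lengths, total assembly length 380.
contigs = [100, 80, 60, 40, 40, 30, 20, 10]
print(n50(contigs))  # 60
```

Contigs of length >= 60 sum to 240, which is at least half of 380, while contigs of length >= 80 sum to only 180, so 60 is the largest qualifying X.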
Slide 51: L50
The minimum number X such that the X longest contigs cover at least 50% of the assembly
L50 = 3
Slide 53: N50 variations
N25, N75
L25, L75
N25 = 100, N75 = 40
L25 = 1, L75 = 5
Slide 56: N50 variations
N25, N75
L25, L50, L75
Nx, Lx
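Nx and Lx for any percentage x can be computed in one pass over the contigs sorted from longest to shortest. This is a minimal sketch, not QUAST's implementation. The function name `nx_lx` is hypothetical, and the contig lengths are made up so that the results agree with the values quoted on the slides (N25 = 100, N50 = 60, N75 = 40, L25 = 1, L50 = 3, L75 = 5).

```python
def nx_lx(lengths, x):
    """Return (Nx, Lx) for a percentage x in the range (0, 100]."""
    threshold = sum(lengths) * x / 100
    covered = 0
    for count, length in enumerate(sorted(lengths, reverse=True), start=1):
        covered += length
        if covered >= threshold:
            # 'length' is Nx, and 'count' contigs were needed, which is Lx.
            return length, count
    raise ValueError("x must be in (0, 100] and lengths must be non-empty")


# Hypothetical contig lengths, total assembly length 380.
contigs = [100, 80, 60, 40, 40, 30, 20, 10]
for x in (25, 50, 75):
    print(x, nx_lx(contigs, x))
```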
Slide 59: Other
Number of N's per 100 kbp
GC %
Distribution of GC % in small windows:
GC=37
GC=44
GC=41
GC=...
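Both statistics are simple counts over the assembled sequence. The sketch below is an illustration under stated assumptions, not QUAST's code: GC % is computed over A, C, G and T only (N's are ignored), windows do not overlap, and a trailing partial window is dropped. All function names are hypothetical.

```python
def gc_percent(seq):
    """GC content in percent, computed over A/C/G/T bases only."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / acgt


def windowed_gc(seq, window):
    """GC % of consecutive non-overlapping windows; a partial tail is dropped."""
    return [gc_percent(seq[i:i + window])
            for i in range(0, len(seq) - window + 1, window)]


def n_per_100kbp(seq):
    """Number of uncalled bases (N) per 100,000 bases of sequence."""
    return seq.upper().count("N") * 100_000 / len(seq)


print(gc_percent("ATGC"))
print(windowed_gc("GGCCATAT", 4))
print(n_per_100kbp("ACGTNACGTN"))
```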
Slide 61: Reference-based metrics
A lot of metrics
Accurate assessment
Slide 62: Basic reference statistics
Reference length
Reference GC %
Number of chromosomes
Slide 63: Basic reference statistics
NGx, LGx
NG50 = 40
LG50 = 4
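NGx and LGx follow the same procedure as Nx and Lx, except that the threshold is a percentage of the reference genome length rather than of the assembly length. The sketch below assumes a hypothetical reference of length 500 and hypothetical contig lengths, chosen so that the result matches NG50 = 40 and LG50 = 4 from the slide. The function name `ngx_lgx` is also an assumption.

```python
def ngx_lgx(lengths, reference_length, x=50):
    """Return (NGx, LGx), or (None, None) if the assembly is too short."""
    threshold = reference_length * x / 100
    covered = 0
    for count, length in enumerate(sorted(lengths, reverse=True), start=1):
        covered += length
        if covered >= threshold:
            return length, count
    # The contigs together never reach x% of the reference length.
    return None, None


# Hypothetical data: assembly length 380, reference length 500.
contigs = [100, 80, 60, 40, 40, 30, 20, 10]
print(ngx_lgx(contigs, reference_length=500))
```

Unlike N50, NG50 cannot be inflated by leaving sequence out of the assembly, because the threshold depends only on the reference.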
Slide 66: Alignment statistics
Assembly
Reference genome
Slide 74: Alignment statistics
Genome fraction %
Duplication ratio
Number of gaps
Largest alignment length
Number of unaligned contigs (fully and partially unaligned)
Number of mismatches/indels per 100 kbp
Number of genes/operons (complete and partial)
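The first two metrics can be illustrated on a toy set of alignments. The sketch assumes the usual definitions: genome fraction is the percentage of reference bases covered by at least one alignment, and duplication ratio is the total number of aligned assembly bases divided by the number of covered reference bases. The interval representation (half-open reference coordinates on a single chromosome) and the function name are hypothetical simplifications.

```python
def alignment_stats(alignments, reference_length):
    """Return (genome fraction in %, duplication ratio).

    alignments: list of (start, end) reference intervals, half-open.
    """
    covered = set()
    aligned_bases = 0
    for start, end in alignments:
        covered.update(range(start, end))  # reference positions hit
        aligned_bases += end - start       # assembly bases aligned
    genome_fraction = 100.0 * len(covered) / reference_length
    duplication_ratio = aligned_bases / len(covered) if covered else 0.0
    return genome_fraction, duplication_ratio


# Hypothetical alignments on a reference of length 1000:
# two overlapping blocks and one isolated block.
fraction, duplication = alignment_stats([(0, 400), (300, 600), (800, 900)], 1000)
print(fraction, round(duplication, 3))
```

A duplication ratio above 1 means some reference regions are covered by more than one contig, as with positions 300 to 399 here.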
Slide 76: Misassemblies
[Figure: a contig aligned to a reference genome with Chromosome 1 and Chromosome 2, one panel per misassembly type]
Relocation (> 1 kbp)
Inversion
Translocation
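The three misassembly types can be told apart by comparing two adjacent aligned blocks of the same contig. The sketch below is a simplified reading of the slide, not QUAST's actual rules: the `Alignment` record, the function name, and the exact handling of gaps and overlaps are assumptions. The 1 kbp threshold is taken from the slide.

```python
from dataclasses import dataclass


@dataclass
class Alignment:
    chromosome: str
    ref_start: int
    ref_end: int
    strand: str  # "+" or "-"


def classify_breakpoint(left, right, max_gap=1000):
    """Classify the junction between two adjacent alignments of one contig."""
    if left.chromosome != right.chromosome:
        return "translocation"   # the blocks map to different chromosomes
    if left.strand != right.strand:
        return "inversion"       # same chromosome, opposite strands
    distance = right.ref_start - left.ref_end
    if abs(distance) > max_gap:
        return "relocation"      # same chromosome, more than 1 kbp apart
    return "none"


a = Alignment("chr1", 0, 5000, "+")
print(classify_breakpoint(a, Alignment("chr1", 9000, 12000, "+")))
print(classify_breakpoint(a, Alignment("chr1", 5010, 8000, "-")))
print(classify_breakpoint(a, Alignment("chr2", 100, 3000, "+")))
```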
Slide 81: NA50
[Figure: Assembly A and Assembly B aligned to the reference genome, with lengths labeled 200 and 100]
Assembly A: N50 = 200, # misassemblies = 2, NA50 = 100
Assembly B: N50 = 100, # misassemblies = 0, NA50 = 100
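NA50 is N50 computed on aligned blocks rather than whole contigs, so each contig is first broken at its misassembly breakpoints. The sketch below reproduces the numbers on the slide with hypothetical data: Assembly A is modeled as two contigs of length 200, each misassembled in the middle, and Assembly B as four correct contigs of length 100. Unaligned regions, which would also be excluded in practice, are ignored here.

```python
def n50(lengths):
    """N50 via a cumulative sum over lengths sorted from longest to shortest."""
    total = sum(lengths)
    covered = 0
    for length in sorted(lengths, reverse=True):
        covered += length
        if 2 * covered >= total:
            return length
    return 0


def na50(contig_lengths, breakpoints):
    """N50 after splitting each contig at its misassembly breakpoints."""
    blocks = []
    for length, cuts in zip(contig_lengths, breakpoints):
        edges = [0, *sorted(cuts), length]
        blocks.extend(end - start for start, end in zip(edges, edges[1:]))
    return n50(blocks)


# Hypothetical assemblies consistent with the values on the slide.
assembly_a = ([200, 200], [[100], [100]])         # 2 misassemblies
assembly_b = ([100, 100, 100, 100], [[], [], [], []])

print(n50(assembly_a[0]), na50(*assembly_a))
print(n50(assembly_b[0]), na50(*assembly_b))
```

Assembly A looks twice as contiguous by N50, but after breaking at misassemblies the two assemblies are equivalent.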
Slide 82: QUality ASsessment Tool for Genome Assemblies
Slide 83: QUAST
Assembly statistics
Basic statistics
Reference-based evaluation
Simple de novo evaluation
Available as a web-based and a command-line tool
quast.sf.net
Slide 84: QUAST: console tool
quast.py
quast.py --help
Slide 85: QUAST basics
quast.py
quast.py --help
quast.py contigs.fasta
quast.py [options] contigs.fasta
quast.py -o out_dir contigs.fasta
Slide 86: Reference options
Reference genome
-R reference.fasta
Gene annotation
-G genes.gff
Operon annotation
-O operons.gff
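The reference options combine with the basic invocation into one command. The sketch below only assembles the argument list. The function name `build_quast_command` and the file names are hypothetical, and the flags (`-o`, `-R`, `-G`, `-O`) are copied from the slides. Option spelling may differ between QUAST versions, so `quast.py --help` for the installed version is the authority.

```python
import shlex


def build_quast_command(contigs, out_dir=None, reference=None,
                        genes=None, operons=None):
    """Build a quast.py argument list using the flags shown on the slides."""
    cmd = ["quast.py"]
    if out_dir:
        cmd += ["-o", out_dir]
    if reference:
        cmd += ["-R", reference]  # reference genome
    if genes:
        cmd += ["-G", genes]      # gene annotation
    if operons:
        cmd += ["-O", operons]    # operon annotation
    cmd += list(contigs)          # one or more assemblies to evaluate
    return cmd


cmd = build_quast_command(["contigs.fasta"], out_dir="out_dir",
                          reference="reference.fasta", genes="genes.gff")
print(shlex.join(cmd))
```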
Slide 87: QUAST output
Reports in different formats
Plain text table
Tab-separated values (Excel, Google Spreadsheets)
Interactive HTML
Plots (PDF/PNG/SVG)
Nx, NGx, NAx
Genes
Cumulative length
Interactive contig viewers (Icarus)
Contig alignment viewer
Contig size viewer
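The tab-separated report is convenient for further processing in scripts. The sketch below assumes a layout with a header row naming the assemblies followed by one row per metric. Both that layout and the sample text are assumptions for illustration, not real QUAST output, and the function name `read_quast_tsv` is hypothetical.

```python
import csv
import io


def read_quast_tsv(handle):
    """Parse a tab-separated report into {assembly: {metric: value}}."""
    rows = list(csv.reader(handle, delimiter="\t"))
    header, body = rows[0], rows[1:]
    assemblies = header[1:]  # first column holds the metric names
    report = {name: {} for name in assemblies}
    for row in body:
        metric, values = row[0], row[1:]
        for name, value in zip(assemblies, values):
            report[name][metric] = value  # values are kept as strings
    return report


# Hypothetical report text for two assemblies.
sample = "Assembly\tA\tB\nN50\t200\t100\n# misassemblies\t2\t0\n"
report = read_quast_tsv(io.StringIO(sample))
print(report["A"]["N50"], report["B"]["# misassemblies"])
```

For a real run, the same function could be given an open file handle for the TSV report in the QUAST output directory.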
Slide 88: Contig alignment viewer
All alignments for each contig
Misassembly details
Contig ordering along the genome
Overlaps / gaps
Slide 90: Contig size viewer
Contigs ordered from longest to shortest
N50, N75 (NG50, NG75)
Filtration by contig size
Gene prediction results
Available without a reference
Slide 94: Read-based statistics
Number of aligned/unaligned reads
% of assembly covered by reads
Points with low coverage
Points with multiple read clipping
Points with incorrect insert sizes
Slide 97: Annotation-based statistics
Number of ORFs
Number of gene/operon-like regions
GeneMarkS (Borodovsky et al.)
GlimmerHMM (Majoros et al.)
Number of conserved genes
BUSCO (Simão et al.)
CEGMA (Korf et al., no longer supported)