Genome assembly with SPAdes

Slide 1: Genome assembly with SPAdes
Center for Algorithmic Biotechnology
SPbU

Slide 2: Introduction

Slides 3-5: Why assemble?
Sequencing data
  Billions of short reads
  Sequencing errors
  Contaminants
  Hard to perform analysis

Assembly
  Corrects sequencing errors
  Much longer sequences
  Each genomic region is represented only once
  May introduce errors

Slide 6: Assembly basics

Slide 7: Assembly in a perfect world

Slide 8: Assembly in the real world

Slides 9-10: De novo whole genome assembly

Slide 11: Genomic repeats
TATTCTTCCACGTAGGGCCTTCCACGCTTCG

Slide 12: Genomic repeats
TATTCTTC
    CTTCCACG
        CACGTAGG
               GGCCTTCC
                  CTTCCACG
                      CACGCTTCG
TATTCTTCCACGTAGGGCCTTCCACGCTTCG

Slide 13: Genomic repeats
TATTCTTC
CTTCCACG
CACGTAGG
GGCCTTCC
CTTCCACG
CACGCTTCG

Slide 14: Genomic repeats
TATTCTTCCACGTAGG
GGCCTTCCACGCTTCG

TATTCTTCCACGCTTCG
GGCCTTCCACGTAGG

Slide 15: Genomic repeats
TATTCTTCCACGTAGG
         ACGTAGGGCCTT
                GCCTTCCACGCTTCG
TATTCTTCCACGTAGGGCCTTCCACGCTTCG

Slide 16: Genomic repeats
TATTCTTCCACGTAGG
ACGTAGGGCCTT
GCCTTCCACGCTTCG
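
The repeat example on these slides can be checked with a short script. This is only a sketch: it assumes the repeat is the 8-base string CTTCCACG (which occurs twice in the slide's genome), and the helper name `spanning_reads` is made up for illustration. It reports the reads that contain the whole repeat with at least one base on each side, since only such reads can tell which left flank belongs with which right flank.

```python
def spanning_reads(reads, repeat):
    """Return (left_flank, right_flank) for every read that fully spans the repeat."""
    spans = []
    for read in reads:
        start = read.find(repeat)
        if start == -1:
            continue  # the read does not contain the whole repeat
        left = read[:start]
        right = read[start + len(repeat):]
        if left and right:  # unique sequence is needed on both sides to link them
            spans.append((left, right))
    return spans


REPEAT = "CTTCCACG"  # assumed repeat unit, read off the slide's genome
short_reads = ["TATTCTTC", "CTTCCACG", "CACGTAGG", "GGCCTTCC", "CTTCCACG", "CACGCTTCG"]
long_reads = ["TATTCTTCCACGTAGG", "ACGTAGGGCCTT", "GCCTTCCACGCTTCG"]

print(spanning_reads(short_reads, REPEAT))  # []
print(spanning_reads(long_reads, REPEAT))   # [('TATT', 'TAGG'), ('GC', 'CTTCG')]
```

No short read spans the repeat, so both pairings of flanks are consistent with the short reads. Two of the longer reads span it and settle the choice.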

Slide 17: SPAdes assembler

Slides 18-20: SPAdes first steps
spades.py
spades.py --help
spades.py --test
-o

Slide 21: Input data formats
FASTA: .fasta / .fa
FASTQ: .fastq / .fq
Gzipped: .gz

Slide 22: Input data options
Unpaired reads
Illumina unpaired
  -s single.fastq
  -s single1.fastq -s single2.fastq ...

Slide 23: Input data options
Paired-end reads
Interlaced pairs in one file
  >left_read_id
  ACGTGCAGG…
  >right_read_id
  GCTTCGAGG…

Separate files
  file1.fastq       file2.fastq
  >left_read_id     >right_read_id
  ACGTGCAGG…        GCTTCGAGG…

Slides 24-25: Input data options
Paired-end reads
Interlaced pairs in one file
  --pe1-12 file.fastq
Separate files
  --pe1-1 file1.fastq --pe1-2 file2.fastq
  --pe1-s unpaired.fastq

Slide 26: SPAdes performance options
Number of threads
  -t N
Maximal available RAM (GB); SPAdes will terminate if exceeded
  -m M

Slide 27: Pipeline options
Run only assembler (input reads are already corrected or quality-trimmed)
  --only-assembler
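
The options introduced so far can be combined into one command line. The sketch below only builds the argument list and does not run anything. It uses just the flags shown on the slides; the helper name `spades_command` and the thread and memory values are illustrative assumptions.

```python
import shlex


def spades_command(out_dir, left, right, unpaired=None, threads=None,
                   memory_gb=None, only_assembler=False):
    """Build an argument list for one paired-end library (library number 1)."""
    cmd = ["spades.py", "-o", out_dir, "--pe1-1", left, "--pe1-2", right]
    if unpaired is not None:
        cmd += ["--pe1-s", unpaired]      # unpaired reads of the same library
    if threads is not None:
        cmd += ["-t", str(threads)]       # number of threads
    if memory_gb is not None:
        cmd += ["-m", str(memory_gb)]     # RAM limit in GB
    if only_assembler:
        cmd.append("--only-assembler")    # skip read error correction
    return cmd


# Hypothetical values: 8 threads and a 32 GB memory limit.
cmd = spades_command("your_output_dir", "file1.fastq", "file2.fastq",
                     threads=8, memory_gb=32)
print(shlex.join(cmd))
```

To actually launch the assembler, the list could be passed to `subprocess.run(cmd, check=True)`.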

Slide 28: Input data options
Mate-pair reads
Cannot be used on their own
Interlaced pairs in one file
  --mp1-12 mp.fastq
Separate files
  --mp1-1 mp1.fastq --mp1-2 mp2.fastq

Slide 29: Hybrid assembly options
PacBio CLR
  --pacbio pb.fastq
Oxford Nanopore reads
  --nanopore nanopore_reads.fastq

Slide 30: Restarting SPAdes
SPAdes / system crashed
  --continue -o your_output_dir

Slide 31: Genome assembly evaluation with QUAST
Center for Algorithmic Biotechnology
SPbU

Slide 32: In reality
SPAdes
ABySS
IDBA
Ray
Velvet
...

Slide 33: Which assembler to use?
ABySS
ALLPATHS-LG
CLC
IDBA-UD
MaSuRCA
MIRA
Ray
SOAPdenovo
SPAdes
Velvet
and many more...

Slide 34: Which assembler to use?
Different technologies (Illumina, 454, IonTorrent, ...)
Genome type and size (bacteria, insects, mammals, plants, ...)
Type of prepared libraries (single reads, paired-end, mate-pairs, combinations)
Type of data (multicell, metagenomic, single-cell)

Slide 35: There is no best assembler

Slide 36: Which assembler to use?
Assemblathon 1 & 2
  Simulated and real datasets
  More than 30 teams competing
Independent studies
  Papers (GAGE, GAGE-B, GABenchToB)
  Websites (nucleotid.es, …)
  Surveys
Genome assembly evaluation tools
  QUAST
  GAGE

Slide 37: Assembly evaluation
Basic evaluation
  No extra input
  Very quick
Reference-based evaluation
  A lot of metrics
  Very accurate
De novo evaluation
  Advanced analysis of de novo assemblies

Slide 38: Basic statistics
Only assemblies are needed (no additional input)
Very fast to compute

Slides 39-42: Contig sizes
Number of contigs
Number of large contigs (i.e. > 1000 bp)
Largest contig length
Total assembly length

Slides 43-50: N50
The maximum length X for which the collection of all contigs of length >= X covers at least 50% of the assembly

N50 = 60
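
The N50 definition translates into a few lines of code: sort the contigs from longest to shortest and add them up until half of the total length is reached. The contig lengths below are hypothetical. The slide's own contigs were in a figure that is not available here, so these values were chosen only so that the result matches the N50 = 60 shown on the slide.

```python
def n50(contig_lengths):
    """Largest X such that contigs of length >= X cover at least half of the assembly."""
    total = sum(contig_lengths)
    covered = 0
    for length in sorted(contig_lengths, reverse=True):
        covered += length
        if 2 * covered >= total:  # integer comparison avoids rounding issues
            return length
    raise ValueError("no contigs given")


contigs = [100, 80, 60, 40, 40, 30, 20]  # hypothetical lengths, total 370
print(n50(contigs))  # 60
```

The two longest contigs cover 180 of 370 bases, which is just under half. Adding the third (60) reaches 240, so N50 is 60.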

Slides 51-52: L50
The minimum number X such that the X longest contigs cover at least 50% of the assembly

L50 = 3
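
L50 counts contigs instead of reporting a length. A minimal sketch follows, again with hypothetical contig lengths chosen so that the result agrees with the L50 = 3 on the slide.

```python
def l50(contig_lengths):
    """Smallest number of longest contigs that together cover at least half of the assembly."""
    total = sum(contig_lengths)
    covered = 0
    ordered = sorted(contig_lengths, reverse=True)
    for count, length in enumerate(ordered, start=1):
        covered += length
        if 2 * covered >= total:
            return count
    raise ValueError("no contigs given")


contigs = [100, 80, 60, 40, 40, 30, 20]  # hypothetical lengths, total 370
print(l50(contigs))  # 3
```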

Slides 53-56: N50 variations
N25, N75
L25, L50, L75
Nx, Lx

N25 = 100, N75 = 40
L25 = 1, L75 = 5

Slides 57-59: Other
Number of N’s per 100 kbp
GC %
Distributions of GC % in small windows:
  GC=37
  GC=44
  GC=41
  GC=...
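
GC % and its distribution over windows are simple to compute. The sketch below makes two assumptions that the slides do not state: windows do not overlap, and bases other than A, C, G, T (such as N) are left out of the denominator. The window size of 10 is only for demonstration on the short sequence from the repeat slides.

```python
def gc_percent(seq):
    """Percentage of G and C among the A/C/G/T bases of seq."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / acgt


def gc_windows(seq, window):
    """GC % of consecutive non-overlapping windows (the last one may be shorter)."""
    return [gc_percent(seq[i:i + window]) for i in range(0, len(seq), window)]


genome = "TATTCTTCCACGTAGGGCCTTCCACGCTTCG"  # sequence from the repeat slides
print(round(gc_percent(genome), 1))  # 54.8
print(gc_windows(genome, 10))        # [30.0, 70.0, 60.0, 100.0]
```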

Slide 61: Reference-based metrics
A lot of metrics
Accurate assessment

Slide 62: Basic reference statistics
Reference length
Reference GC %
Number of chromosomes

Slides 63-65: Basic reference statistics
NGx, LGx

NG50 = 40
LG50 = 4
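
NGx and LGx follow the same idea as Nx and Lx, but the threshold is a percentage of the reference genome length rather than of the assembly length. In the sketch below the contig lengths and the reference length of 500 are hypothetical, picked so that the result matches NG50 = 40 and LG50 = 4 from the slide.

```python
def ngx_lgx(contig_lengths, reference_length, x=50):
    """Return (NGx, LGx), or (None, None) if the contigs cover less than x% of the reference."""
    covered = 0
    ordered = sorted(contig_lengths, reverse=True)
    for count, length in enumerate(ordered, start=1):
        covered += length
        if 100 * covered >= x * reference_length:
            return length, count
    return None, None  # the assembly is too short to reach x% of the reference


contigs = [100, 80, 60, 40, 40, 30, 20]  # hypothetical lengths, total 370
reference_length = 500                   # hypothetical reference size
print(ngx_lgx(contigs, reference_length))  # (40, 4)
```

Because the reference (500) is longer than the assembly (370), four contigs are needed to reach 250 bases. That is why NG50 is lower than N50 for the same contigs.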

Slides 66-74: Alignment statistics
Figure labels: Assembly, Reference genome
Genome fraction %
Duplication ratio
Number of gaps
Largest alignment length
Number of unaligned contigs (full & partial)
Number of mismatches/indels per 100 kbp
Number of genes/operons (full & partial)
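
Two of these metrics can be sketched from a list of alignment intervals. The code assumes that genome fraction is the share of reference bases covered by at least one alignment, and that duplication ratio is the number of aligned assembly bases divided by the number of covered reference bases. It also assumes a single-chromosome reference and ignores indels. The intervals are made-up example data.

```python
def merge_intervals(intervals):
    """Merge overlapping or touching half-open (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def alignment_stats(alignments, reference_length):
    """alignments: half-open (start, end) reference intervals of aligned contig blocks."""
    covered = sum(end - start for start, end in merge_intervals(alignments))
    aligned = sum(end - start for start, end in alignments)
    genome_fraction = 100.0 * covered / reference_length
    duplication_ratio = aligned / covered if covered else 0.0
    return genome_fraction, duplication_ratio


# Hypothetical alignments on a 1000 bp reference: two overlap by 100 bp,
# and the regions 600-700 and 900-1000 are not covered at all.
alignments = [(0, 400), (300, 600), (700, 900)]
print(alignment_stats(alignments, 1000))  # (80.0, 1.125)
```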

Slides 75-76: Misassemblies
Figure labels: Contig, Reference genome, Chromosome 1, Chromosome 2
Relocation (> 1 kbp)
Inversion
Translocation
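
The three misassembly types can be told apart by comparing two consecutive alignments of the same contig. The sketch below is a simplification under stated assumptions: the two alignments are given in contig order, a same-strand pair is compared in forward-strand reference order, and the 1 kbp threshold from the slide applies to both gaps and overlaps. The `Alignment` class and the chromosome names are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Alignment:
    chromosome: str
    ref_start: int  # 0-based, inclusive
    ref_end: int    # exclusive
    strand: str     # "+" or "-"


def classify_breakpoint(left, right, max_shift=1000):
    """Classify the junction between two consecutive alignments of one contig."""
    if left.chromosome != right.chromosome:
        return "translocation"
    if left.strand != right.strand:
        return "inversion"
    # Same chromosome and strand: positive shift is a gap, negative is an overlap.
    shift = right.ref_start - left.ref_end
    if abs(shift) > max_shift:
        return "relocation"
    return None  # close enough to count as a regular continuation


left = Alignment("chr1", 0, 5000, "+")
print(classify_breakpoint(left, Alignment("chr2", 0, 3000, "+")))      # translocation
print(classify_breakpoint(left, Alignment("chr1", 5000, 8000, "-")))   # inversion
print(classify_breakpoint(left, Alignment("chr1", 9000, 12000, "+")))  # relocation
print(classify_breakpoint(left, Alignment("chr1", 5010, 8000, "+")))   # None
```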

Slide 77: There is no best metric
NB!

Slides 78-81: NA50
Figure labels: Assembly A (200), Assembly B (100), Reference genome
Assembly A: N50 = 200, # misassemblies = 2, NA50 = 100
Assembly B: N50 = 100, # misassemblies = 0, NA50 = 100
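
NA50 can be thought of as N50 computed after every contig is broken at its misassembly breakpoints. The sketch below assumes each contig is already described by the lengths of its aligned blocks, and it ignores unaligned sequence. The two assemblies are hypothetical and were built to reproduce the numbers on the slide: Assembly A has two 200 bp contigs with one breakpoint each, and Assembly B has four correct 100 bp contigs.

```python
def n50(lengths):
    """Largest X such that pieces of length >= X cover at least half of the total."""
    total = sum(lengths)
    covered = 0
    for length in sorted(lengths, reverse=True):
        covered += length
        if 2 * covered >= total:
            return length
    raise ValueError("no lengths given")


def na50(contigs_as_blocks):
    """N50 over aligned blocks; each contig is a list of its block lengths."""
    blocks = [block for contig in contigs_as_blocks for block in contig]
    return n50(blocks)


assembly_a = [[100, 100], [100, 100]]      # hypothetical: each contig broken once
assembly_b = [[100], [100], [100], [100]]  # hypothetical: no misassemblies

for name, assembly in (("A", assembly_a), ("B", assembly_b)):
    contig_lengths = [sum(blocks) for blocks in assembly]
    print(name, n50(contig_lengths), na50(assembly))
# A 200 100
# B 100 100
```

Assembly A looks twice as contiguous by N50, but once its misassembled contigs are split the two assemblies have the same NA50.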

Slide 82: QUality ASsessment Tool for Genome Assemblies

Slide 83: QUAST
Assembly statistics
  Basic statistics
  Reference-based evaluation
  Simple de novo evaluation
Available as a web-based and a command-line tool
quast.sf.net

Slide 84: QUAST: console tool
quast.py
quast.py --help

Slide 85: QUAST basics
quast.py
quast.py --help
quast.py contigs.fasta
quast.py [options] contigs.fasta
quast.py -o out_dir contigs.fasta

Slide 86: Reference options
Reference genome
  -R reference.fasta
Gene annotation
  -G genes.gff
Operon annotation
  -O operons.gff
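
The QUAST options above combine in the same way as the SPAdes ones. The sketch below only assembles the argument list, using the flags exactly as written on the slides. Other QUAST releases may spell some of them differently, so `quast.py --help` is the place to check. The helper name `quast_command` is an assumption.

```python
import shlex


def quast_command(contigs, out_dir=None, reference=None, genes=None, operons=None):
    """Build a quast.py argument list; contigs is a list of assembly FASTA files."""
    cmd = ["quast.py"]
    if out_dir is not None:
        cmd += ["-o", out_dir]
    if reference is not None:
        cmd += ["-R", reference]  # reference genome, flag as on the slide
    if genes is not None:
        cmd += ["-G", genes]      # gene annotation
    if operons is not None:
        cmd += ["-O", operons]    # operon annotation
    return cmd + list(contigs)


cmd = quast_command(["contigs.fasta"], out_dir="out_dir",
                    reference="reference.fasta", genes="genes.gff")
print(shlex.join(cmd))
```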

Slide 87: QUAST output
Reports in different formats
  Plain text table
  Tab-separated values (Excel, Google Spreadsheets)
  Interactive HTML
Plots (PDF/PNG/SVG)
  Nx, NGx, NAx
  Genes
  Cumulative length
Interactive contig viewers (Icarus)
  Contig alignment viewer
  Contig size viewer

Slides 88-89: Contig alignment viewer
All alignments for each contig
Misassembly details
Contig ordering along the genome
Overlaps / gaps

Slides 90-91: Contig size viewer
Contigs ordered from longest to shortest
N50, N75 (NG50, NG75)
Filtering by contig size
Gene prediction results
Available without a reference

Slide 92: De novo evaluation

Slides 93-94: Read-based statistics
Number of aligned/unaligned reads
% of assembly covered by reads

Points with low coverage
Points with multiple read clipping
Points with incorrect insert sizes

Slides 95-97: Annotation-based statistics
Number of ORFs
Number of gene/operon-like regions
  GeneMarkS (Borodovsky et al.)
  GlimmerHMM (Majoros et al.)
Number of conserved genes
  BUSCO (Simão et al.)
  CEGMA (Korf et al., no longer supported)

Slide 98: Thank you!
Questions?
