Commit 8ca8e5db, authored by Sergey Koren

Merge branch 'master' of github.com:marbl/canu

......@@ -25,4 +25,4 @@ Canu is a hierachical assembly pipeline which runs in four steps:
* Trim corrected sequences
* Assemble trimmed corrected sequences
Read the [documentation](http://canu.readthedocs.org/en/latest/ "docs")
Read the [documentation](http://canu.readthedocs.org/ "docs")
......@@ -68,9 +68,16 @@ Find the Output
Outputs from the assembly tasks are in:
- ecoli*/ecoli.layout
- ecoli*/ecoli.consensus.fasta
- ecoli*/ecoli.gfa
- ecoli*/ecoli.contigs.fasta
- ecoli*/ecoli.bubbles.fasta
- ecoli*/ecoli.unassembled.fasta
The canu progress chatter records statistics such as an input read histogram, corrected read histogram, and overlap types.
The canu progress chatter records statistics such as an input read histogram, corrected read histogram, and overlap types. The layout provides information on where each read ended up in the final assembly. The `GFA <http://lh3.github.io/2014/07/19/a-proposal-of-the-grapical-fragment-assembly-format/>`_ is the assembly graph generated by Canu. The fasta output is split into three types:
- contigs: everything which could be assembled and is part of the primary assembly. This includes both unique and repetitive elements
- bubbles: alternate paths in the graph which could not be merged into the primary assembly.
- unassembled: reads which could not be incorporated into the primary or bubble assemblies.
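As a rough illustration of how the three FASTA outputs might be inspected, the sketch below counts the sequences and total bases in each category. It is not part of Canu: the helper names `fasta_lengths` and `summarize` are hypothetical, and the file-name pattern `<prefix>.<category>.fasta` is assumed from the listing above.

```python
import os

# Output categories named in the text. The file-name pattern
# "<prefix>.<category>.fasta" is an assumption based on the listing above.
CATEGORIES = ("contigs", "bubbles", "unassembled")


def fasta_lengths(path):
    """Return a list with the length of every record in a FASTA file."""
    lengths = []
    current = None  # length of the record being read, None before the first header
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if current is not None:
                    lengths.append(current)
                current = 0
            elif current is not None:
                current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def summarize(assembly_dir, prefix):
    """Map each category that exists on disk to (sequence count, total bases)."""
    summary = {}
    for category in CATEGORIES:
        path = os.path.join(assembly_dir, f"{prefix}.{category}.fasta")
        if os.path.exists(path):
            lengths = fasta_lengths(path)
            summary[category] = (len(lengths), sum(lengths))
    return summary
```

For example, `summarize("ecoli-auto", "ecoli")` would report on whichever of the three files are present, where `ecoli-auto` stands in for the actual assembly directory.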
Correct, Trim and Assemble, Manually
......@@ -178,24 +185,29 @@ After the run completes, we can check the assembly statistics::
tgStoreDump -sizes -s 12100000 -T yeast/unitigging/asm.tigStore 2 -G yeast/unitigging/asm.gkpStore
::
lenSingleton n10 siz 7013 sum 1210884 idx 116
lenSingleton sum 2338725 (genomeSize 12100000)
lenSingleton num 416
lenSingleton ave 5621
lenAssembled n10 siz 696203 sum 1453015 idx 1
lenAssembled n20 siz 575091 sum 2646269 idx 3
lenAssembled n30 siz 550579 sum 3755422 idx 5
lenAssembled n40 siz 455083 sum 5250476 idx 8
lenAssembled n50 siz 392191 sum 6088423 idx 10
lenAssembled n60 siz 205069 sum 7342769 idx 15
lenAssembled n70 siz 140204 sum 8504891 idx 22
lenAssembled n80 siz 99777 sum 9693133 idx 32
lenAssembled n90 siz 64744 sum 10949303 idx 48
lenAssembled n100 siz 15639 sum 12100894 idx 89
lenAssembled sum 12607682 (genomeSize 12100000)
lenAssembled num 150
lenAssembled ave 84051
lenSuggestRepeat sum 1135952 (genomeSize 12100000)
lenSuggestRepeat num 159
lenSuggestRepeat ave 7144
lenUnassembled ng10 12635 bp lg10 76 sum 1220193 bp
lenUnassembled ng20 9372 bp lg20 188 sum 2424217 bp
lenUnassembled ng30 7287 bp lg30 333 sum 3632625 bp
lenUnassembled ng40 4941 bp lg40 534 sum 4841897 bp
lenUnassembled ng50 2069 bp lg50 883 sum 6050798 bp
lenUnassembled sum 6321159 (genomeSize 12100000)
lenUnassembled num 1061
lenUnassembled ave 5957
lenContig ng10 761684 bp lg10 2 sum 1544923 bp
lenContig ng20 667922 bp lg20 4 sum 2942290 bp
lenContig ng30 567305 bp lg30 6 sum 4156720 bp
lenContig ng40 550087 bp lg40 8 sum 5271140 bp
lenContig ng50 446812 bp lg50 10 sum 6197825 bp
lenContig ng60 251216 bp lg60 14 sum 7356841 bp
lenContig ng70 183999 bp lg70 20 sum 8624087 bp
lenContig ng80 120025 bp lg80 28 sum 9744795 bp
lenContig ng90 84512 bp lg90 40 sum 10907525 bp
lenContig sum 11922888 (genomeSize 12100000)
lenContig num 67
lenContig ave 177953
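The `ng`/`lg` columns above can be read as follows: sort the contigs from longest to shortest and accumulate their lengths until the running sum reaches the given percentage of `genomeSize`. The sketch below reproduces that calculation under stated assumptions. It is not Canu's implementation; `ng_stats` is a hypothetical helper, and whether the real tool uses "reaches" or "exceeds" at the threshold is an assumption here.

```python
def ng_stats(lengths, genome_size, percents=(10, 20, 30, 40, 50, 60, 70, 80, 90)):
    """Return {percent: (ng_length, lg_count, cumulative_sum)}.

    ng_length is the length of the contig at which the running sum of
    contig lengths (longest first) first reaches percent% of genome_size;
    lg_count is how many contigs were needed. Percentages the assembly
    never reaches are omitted from the result.
    """
    ordered = sorted(lengths, reverse=True)
    stats = {}
    total = 0
    used = 0
    for pct in sorted(percents):
        # Integer comparison avoids floating-point rounding at the threshold.
        while used < len(ordered) and total * 100 < genome_size * pct:
            total += ordered[used]
            used += 1
        if total * 100 < genome_size * pct:
            break  # assembly is too small to reach this percentage
        stats[pct] = (ordered[used - 1], used, total)
    return stats
```

This reading is consistent with the `lenContig ng50` row above, where the sum of 6197825 bp is the first to reach half of the 12100000 bp genome size. Because the thresholds are measured against the genome size rather than the assembly size, rows for higher percentages disappear when the assembly is smaller than the genome.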
Consensus Accuracy
-------------------
......@@ -203,9 +215,25 @@ While Canu corrects sequences and has 99% identity or greater with PacBio or Nan
If you have Illumina sequences available, `Pilon <http://www.broadinstitute.org/software/pilon/>`_ can also be used to polish either PacBio or Oxford Nanopore assemblies.
Changes
-------------------
- Support for reads up to 2Mbp in size (up from 130Kbp).
- Incorporate MHAP 2.0, which is 3-5X faster than the previous version and has higher specificity
- Add `GFA <http://lh3.github.io/2014/07/19/a-proposal-of-the-grapical-fragment-assembly-format/>`_ output
- Improve diploid-aware assembly by categorizing output as primary contigs or unmerged bubbles. Annotate repeat and unique contigs in the output.
- Enable parallel overlap store construction on large genomes
- Enable `minimap <https://github.com/lh3/minimap>`_ as an option for generating overlaps during correction step. Corrected reads are generated as before with falcon_sense.
- Fix bug using shorter rather than longer reads for corrected reads/consensus computation
- Fix bug resuming without providing input sequences which would incorrectly set error rates
- Fix bug in bogart which would demote contained sequences as spurs incorrectly
- Fix bugs in falcon_sense which would hang when input had N bases and limit corrected reads to 65Kbp
Known Issues
-------------------
- Bogart (unitigger) has false positives in repeat breaking. Currently, the workaround is to increase the minimum overlap size to avoid detecting false repeats caused by short overlaps. Canu will automatically do this for large (>100MB) genomes.
- LSF support has limited testing
- Large memory usage during unitig consensus calling on unitigs over 100MB in size (a 140Mb contig uses approximately 75GB).
- Distributed file systems (such as GPFS) cause issues with memory mapped files, slowing down parts of Canu, including meryl (0-mercounts) and falcon-sense (2-correction).
......@@ -154,6 +154,15 @@ The tags are:
|utgmhap | the mhap overlapper, as used in the assembly phase |
+--------+-------------------------------------------------------------------+
+--------+-------------------------------------------------------------------+
|mmap | the `minimap <https://github.com/lh3/minimap>`_ overlapper |
+--------+-------------------------------------------------------------------+
|cormmap | the minimap overlapper, as used in the correction phase |
+--------+-------------------------------------------------------------------+
|obtmmap | the minimap overlapper, as used in the trimming phase |
+--------+-------------------------------------------------------------------+
|utgmmap | the minimap overlapper, as used in the assembly phase |
+--------+-------------------------------------------------------------------+
+--------+-------------------------------------------------------------------+
|ovb | the bucketizing phase of overlap store building |
+--------+-------------------------------------------------------------------+
|ovs | the sort phase of overlap store building |
......@@ -254,8 +263,8 @@ utgOvlErrorRate
Do not compute overlaps used for unitig construction above this error rate. Applies
to the standard overlapper, and realigning mhap overlaps.
(ADVANCED) It is possible to convert the mhap overlaps to alignment based overlaps using
``obtMhapReAlign=true`` or ``ovlMhapReAlign=true``. If so, the overlaps will be computed using
(ADVANCED) It is possible to convert the mhap or minimap overlaps to alignment based overlaps using
``obtReAlign=true`` or ``ovlReAlign=true``. If so, the overlaps will be computed using
either ``obtOvlErrorRate`` or ``utgOvlErrorRate``, depending on which overlaps are being generated.
Be sure to not confuse ``obtOvlErrorRate`` with ``obtErrorRate``:
......@@ -314,6 +323,7 @@ For example:
- To change the k-mer size for just the ovl overlapper used during correction, 'corMerSize=16' would be used.
- To change the mhap k-mer size for all instances, 'mhapMerSize=18' would be used.
- To change the mhap k-mer size just during correction, 'corMhapMerSize=15' would be used.
- To use minimap for overlap computation just during correction, 'corOverlapper=minimap' would be used.
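The naming rule in these examples (an optional stage tag prepended to a capitalized option name) can be sketched as a small helper for building `key=value` strings. This is only an illustration: `canu_option` is a hypothetical function, not part of Canu, and the set of stage tags is assumed to be `cor`, `obt`, and `utg`, as used in the tag table.

```python
# Stage tags assumed from the tag table: correction, trimming, assembly.
STAGE_TAGS = ("cor", "obt", "utg")


def canu_option(name, value, tag=None):
    """Build a 'key=value' string, optionally scoped to one stage.

    Without a tag the option applies to all instances, e.g. 'mhapMerSize=18'.
    With a tag the option name is capitalized and prefixed with the tag,
    e.g. 'corMhapMerSize=15'.
    """
    if tag is not None:
        if tag not in STAGE_TAGS:
            raise ValueError(f"unknown stage tag: {tag!r}")
        name = tag + name[0].upper() + name[1:]
    return f"{name}={value}"
```

With this helper, `canu_option("mhapMerSize", 15, tag="cor")` yields `corMhapMerSize=15` and `canu_option("overlapper", "minimap", tag="cor")` yields `corOverlapper=minimap`, matching the examples in the list.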
Ovl Overlapper Configuration
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
......@@ -384,11 +394,18 @@ Mhap Overlapper Parameters
Chunk of reads that can fit into 1GB of memory. Combined with memory to compute the size of chunk the reads are split into.
<tag>MhapMerSize
Use k-mers of this size for detecting overlaps.
<tag>MhapReAlign
<tag>ReAlign
After computing overlaps with mhap, compute a sequence alignment for each overlap.
<tag>MhapSensitivity
Either 'normal' or 'high'.
Either 'normal', 'high', or 'fast'.
Mhap will also down-weight frequent k-mers (using tf-idf), but its selection of frequent k-mers is not exposed.
Minimap Overlapper Parameters
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
<tag>MMapBlockSize
Chunk of reads that can fit into 1GB of memory. Combined with memory to compute the size of chunk the reads are split into.
<tag>MMapMerSize
Use k-mers of this size for detecting overlaps.
Minimap will also ignore high-frequency minimizers, but its selection of frequent minimizers is not exposed.