Merge branch 'master' of github.com:marbl/canu

8ca8e5db · Sergey Koren · 812c161f · 553a612b
--- a/README.md
+++ b/README.md
@@ -25,4 +25,4 @@ Canu is a hierachical assembly pipeline which runs in four steps:
 * Trim corrected sequences
 * Assemble trimmed corrected sequences

-Read the [documentation](http://canu.readthedocs.org/en/latest/ "docs")
+Read the [documentation](http://canu.readthedocs.org/ "docs")
--- a/documentation/source/quick-start.rst
+++ b/documentation/source/quick-start.rst
@@ -68,9 +68,16 @@ Find the Output
 Outputs from the assembly tasks are in:

 - ecoli*/ecoli.layout
- ecoli*/ecoli.consensus.fasta
+- ecoli*/ecoli.gfa
+- ecoli*/ecoli.contigs.fasta
+- ecoli*/ecoli.bubbles.fasta
+- ecoli*/ecoli.unassembled.fasta

-The canu progress chatter records statistics such as an input read histogram, corrected read histogram, and overlap types.
+The canu progress chatter records statistics such as an input read histogram, corrected read histogram, and overlap types. The layout provides information on where each read ended up in the final assembly. The `GFA <http://lh3.github.io/2014/07/19/a-proposal-of-the-grapical-fragment-assembly-format/>`_ is the assembly graph generated by Canu. The fasta output is split into three types:
+
+- contigs: everything which could be assembled and is part of the primary assembly. This includes both unique and repetitive elements.
+- bubbles: alternate paths in the graph which could not be merged into the primary assembly.
+- unassembled: reads which could not be incorporated into the primary or bubble assemblies.


 Correct, Trim and Assemble, Manually
@@ -178,24 +185,29 @@ After the run completes, we can check the assembly statistics::
 tgStoreDump -sizes -s 12100000 -T yeast/unitigging/asm.tigStore 2 -G yeast/unitigging/asm.gkpStore

 ::
-
- lenSingleton n10 siz       7013 sum    1210884 idx        116
- lenSingleton sum    2338725 (genomeSize 12100000)
- lenSingleton num        416
- lenSingleton ave       5621
- lenAssembled n10 siz     696203 sum    1453015 idx          1
- lenAssembled n20 siz     575091 sum    2646269 idx          3
- lenAssembled n30 siz     550579 sum    3755422 idx          5
- lenAssembled n40 siz     455083 sum    5250476 idx          8
- lenAssembled n50 siz     392191 sum    6088423 idx         10
- lenAssembled n60 siz     205069 sum    7342769 idx         15
- lenAssembled n70 siz     140204 sum    8504891 idx         22
- lenAssembled n80 siz      99777 sum    9693133 idx         32
- lenAssembled n90 siz      64744 sum   10949303 idx         48
- lenAssembled n100 siz      15639 sum   12100894 idx         89
- lenAssembled sum   12607682 (genomeSize 12100000)
- lenAssembled num        150
- lenAssembled ave      84051
+ lenSuggestRepeat sum    1135952 (genomeSize 12100000)
+ lenSuggestRepeat num        159
+ lenSuggestRepeat ave       7144
+ lenUnassembled ng10       12635 bp   lg10      76   sum    1220193 bp
+ lenUnassembled ng20        9372 bp   lg20     188   sum    2424217 bp
+ lenUnassembled ng30        7287 bp   lg30     333   sum    3632625 bp
+ lenUnassembled ng40        4941 bp   lg40     534   sum    4841897 bp
+ lenUnassembled ng50        2069 bp   lg50     883   sum    6050798 bp
+ lenUnassembled sum    6321159 (genomeSize 12100000)
+ lenUnassembled num       1061
+ lenUnassembled ave       5957
+ lenContig ng10      761684 bp   lg10       2   sum    1544923 bp
+ lenContig ng20      667922 bp   lg20       4   sum    2942290 bp
+ lenContig ng30      567305 bp   lg30       6   sum    4156720 bp
+ lenContig ng40      550087 bp   lg40       8   sum    5271140 bp
+ lenContig ng50      446812 bp   lg50      10   sum    6197825 bp
+ lenContig ng60      251216 bp   lg60      14   sum    7356841 bp
+ lenContig ng70      183999 bp   lg70      20   sum    8624087 bp
+ lenContig ng80      120025 bp   lg80      28   sum    9744795 bp
+ lenContig ng90       84512 bp   lg90      40   sum   10907525 bp
+ lenContig sum   11922888 (genomeSize 12100000)
+ lenContig num         67
+ lenContig ave     177953

 Consensus Accuracy
 -------------------
@@ -203,9 +215,25 @@ While Canu corrects sequences and has 99% identity or greater with PacBio or Nan

 If you have Illumina sequences available, `Pilon <http://www.broadinstitute.org/software/pilon/>`_ can also be used to polish either PacBio or Oxford Nanopore assemblies.

+Changes
+-------------------
+
+- Support for reads up to 2Mbp in size (up from 130Kbp).
+- Incorporate MHAP 2.0, which is 3-5X faster than the previous version and has higher specificity.
+- Add `GFA <http://lh3.github.io/2014/07/19/a-proposal-of-the-grapical-fragment-assembly-format/>`_ output
+- Improve diploid-aware assembly by categorizing output as primary contigs or unmerged bubbles. Annotate repeat and unique contigs in the output.
+- Enable parallel overlap store construction on large genomes
+- Enable `minimap <https://github.com/lh3/minimap>`_ as an option for generating overlaps during the correction step. Corrected reads are generated as before with falcon_sense.
+- Fix bug that used shorter rather than longer reads for corrected reads/consensus computation
+- Fix bug where resuming without providing input sequences would incorrectly set error rates
+- Fix bug in bogart which would incorrectly demote contained sequences as spurs
+- Fix bugs in falcon_sense which would hang when input had N bases and limit corrected reads to 65Kbp
+
 Known Issues
 -------------------

+- Bogart (unitigger) has false positives in repeat breaking. Currently, the workaround is to increase the minimum overlap size to avoid detecting false repeats
+  caused by short overlaps. Canu will automatically do this for large (>100MB) genomes.
 - LSF support has limited testing
 - Large memory usage while unitig consensus calling on unitigs over 100MB in size (140Mb contig uses approximate 75GB).
 - Distributed file systems (such as GPFS) causes issues with memory mapped files, slowing down parts of Canu, including meryl (0-mercounts) and falcon-sense (2-correction).
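The ``ngXX`` / ``lgXX`` columns in the updated ``tgStoreDump -sizes`` output appear to follow the usual NG/LG convention: sort the sequences longest first, accumulate lengths, and report the length (ng) and count (lg) at which the running sum first reaches XX% of ``genomeSize``. The sketch below illustrates that convention on toy data. The function name ``ng_stat`` and the ``>=`` threshold are assumptions made for illustration, not taken from the Canu source.

```python
from typing import List, Optional, Tuple


def ng_stat(lengths: List[int], genome_size: int,
            percent: int) -> Optional[Tuple[int, int, int]]:
    """Return (ngX length, lgX count, cumulative sum) for the given percent.

    Returns None when the sequences never cover percent% of genome_size,
    which would explain why a category can stop before reaching ng90.
    Hypothetical helper; the exact tie-breaking used by Canu is assumed.
    """
    total = 0
    # Walk the sequences from longest to shortest.
    for count, length in enumerate(sorted(lengths, reverse=True), start=1):
        total += length
        # Integer comparison avoids floating-point rounding at the threshold.
        if total * 100 >= genome_size * percent:
            return length, count, total
    return None


if __name__ == "__main__":
    toy_contigs = [500, 400, 300, 200, 100]  # toy lengths, not real data
    print(ng_stat(toy_contigs, 2000, 50))
    print(ng_stat(toy_contigs, 2000, 90))
```

Because the threshold is a fraction of ``genomeSize`` rather than of the assembled total, a category whose total length falls short of the genome size reports fewer rows, as the ``lenUnassembled`` block does when it stops at ng50.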
--- a/documentation/source/tutorial.rst
+++ b/documentation/source/tutorial.rst
@@ -154,6 +154,15 @@ The tags are:
 |utgmhap | the mhap overlapper, as used in the assembly phase                |
 +--------+-------------------------------------------------------------------+
 +--------+-------------------------------------------------------------------+
+|mmap    | the `minimap <https://github.com/lh3/minimap>`_ overlapper        |
+--------+-------------------------------------------------------------------+
+|cormmap | the minimap overlapper, as used in the correction phase           |
+--------+-------------------------------------------------------------------+
+|obtmmap | the minimap overlapper, as used in the trimming phase             |
+--------+-------------------------------------------------------------------+
+|utgmmap | the minimap overlapper, as used in the assembly phase             |
+--------+-------------------------------------------------------------------+
+--------+-------------------------------------------------------------------+
 |ovb     | the bucketizing phase of overlap store building                   |
 +--------+-------------------------------------------------------------------+
 |ovs     | the sort phase of overlap store building                          |
@@ -254,8 +263,8 @@ utgOvlErrorRate
  Do not compute overlaps used for unitig construction above this error rate.  Applies
  to the standard overlapper, and realigning mhap overlaps.

-(ADVANCED) It is possible to convert the mhap overlaps to alignment based overlaps using
-``obtMhapReAlign=true`` or ``ovlMhapReAlign=true``.  If so, the overlaps will be computed using
+(ADVANCED) It is possible to convert the mhap or minimap overlaps to alignment-based overlaps using
+``obtReAlign=true`` or ``ovlReAlign=true``.  If so, the overlaps will be computed using
 either ``obtOvlErrorRate`` or ``utgOvlErrorRate``, depending on which overlaps are being generated.

 Be sure to not confuse ``obtOvlErrorRate`` with ``obtErrorRate``:
@@ -314,6 +323,7 @@ For example:
 - To change the k-mer size for just the ovl overlapper used during correction, 'corMerSize=16' would be used.
 - To change the mhap k-mer size for all instances, 'mhapMerSize=18' would be used.
 - To change the mhap k-mer size just during correction, 'corMhapMerSize=15' would be used.
+- To use minimap for overlap computation just during correction, 'corOverlapper=minimap' would be used.

 Ovl Overlapper Configuration
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -384,11 +394,18 @@ Mhap Overlapper Parameters
  Chunk of reads that can fit into 1GB of memory. Combined with memory to compute the size of chunk the reads are split into.
 <tag>MhapMerSize
  Use k-mers of this size for detecting overlaps.
-<tag>MhapReAlign
+<tag>ReAlign
  After computing overlaps with mhap, compute a sequence alignment for each overlap.
 <tag>MhapSensitivity
-  Either 'normal' or 'high'.
+  Either 'normal', 'high', or 'fast'.

 Mhap also will down-weight frequent kmers (using tf-idf), but it's selection of frequent is not exposed.

+Minimap Overlapper Parameters
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+<tag>MMapBlockSize
+  Chunk of reads that can fit into 1GB of memory. Combined with memory to compute the size of chunk the reads are split into.
+<tag>MMapMerSize
+  Use k-mers of this size for detecting overlaps.

+Minimap also will ignore high-frequency minimizers, but its selection of frequent minimizers is not exposed.
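
The tutorial hunks above name options by prefixing a stage tag (``cor``, ``obt``, ``utg``) to an option name, as in ``corMhapMerSize=15`` and ``corOverlapper=minimap``. A minimal sketch of that naming pattern follows. The helper ``canu_option`` is hypothetical and only reproduces the examples quoted in the documentation. It does not check whether a given option exists in Canu.

```python
def canu_option(name: str, value: object, stage: str = "") -> str:
    """Compose a 'name=value' string, optionally scoped to one stage tag.

    Hypothetical helper: with a stage tag such as 'cor', the first letter
    of the option name is upper-cased and the tag is prepended.
    """
    if stage:
        name = stage + name[0].upper() + name[1:]
    return f"{name}={value}"


if __name__ == "__main__":
    # Examples taken from the tutorial text in this diff.
    print(canu_option("mhapMerSize", 18))                # all mhap instances
    print(canu_option("mhapMerSize", 15, stage="cor"))   # correction only
    print(canu_option("overlapper", "minimap", stage="cor"))
```

The last call yields ``corOverlapper=minimap``; whether an unscoped ``overlapper`` option also exists is not stated in this diff.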