v0.8.2
7b78a276 · Bump version to 0.8.2 · December 14, 2016
Version 0.8.2

This release covers a number of internal changes to improve the stability and
consistency of CNVkit, as well as new and improved command options to make more
features available from the command line.

Due to a slight change in the binning procedure (see `target` and `antitarget`
below), newly generated target and antitarget BED files, or a reference
generated with `batch`, may not use the same bin boundaries as earlier versions.
CNVkit will check these files for consistency and alert you if your BED or .cnn
files do not match because of this change, for example after running `batch`
from scratch with the same panel under two different CNVkit versions. If you
want to update CNVkit mid-project, either keep using the same reference.cnn file
as before for all new samples (as always), or regenerate all your
targetcoverage.cnn and antitargetcoverage.cnn files and build a new reference
from them.

Dependencies
------------

- pyvcf: No longer needed. Instead, parse VCFs with pysam, which is noticeably
  faster and better able to handle newer VCF and gVCF features. (#159)
- pysam: Raise minimum version to 0.9.1.4.

Global changes
--------------

- When extracting a sample ID from a filename, instead of trimming everything
  after the first '.' character, only drop known or single-part extensions.  For
  example, "Case1.exome.tumor.bam" and "Case1.exome.tumor.vcf.gz" will now
  resolve to the sample ID "Case1.exome.tumor" instead of "Case1". Output files
  will be named like "Case1.exome.tumor.cnr" instead of "Case1.cnr", avoiding
  potential naming conflicts in the `batch` command when processing multiple
  samples. (#48)
- Always sort regions by genomic coordinates after reading a file. This doesn't
  modify the input file in-place, but ensures the output files are always sorted
  the same way.
- Gender detection is more robust. It now uses Mood's median test instead of the
  Mann-Whitney rank test. As a fallback for edge cases, e.g. only one segment
  per chromosome, it compares the difference of weighted medians between
  autosomes and sex chromosomes.
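
The sample ID rule in the first item above can be sketched as follows. This is a
minimal illustration rather than CNVkit's own code: the function name
`sample_id` and the list of multi-part extensions are assumptions chosen for the
example.

```python
import os

# Hypothetical list of multi-part extensions to strip as a unit (an assumption
# for this sketch; CNVkit's actual list may differ).
KNOWN_MULTIPART_EXTS = (".targetcoverage.cnn", ".antitargetcoverage.cnn")


def sample_id(path):
    """Derive a sample ID by dropping only a known or single-part extension."""
    base = os.path.basename(path)
    # A trailing ".gz" normally wraps another extension, so remove it first.
    if base.endswith(".gz"):
        base = base[: -len(".gz")]
    # Strip a known multi-part extension as a whole, if one matches.
    for ext in KNOWN_MULTIPART_EXTS:
        if base.endswith(ext):
            return base[: -len(ext)]
    # Otherwise drop just the last dot-separated part.
    return base.rsplit(".", 1)[0]


print(sample_id("Case1.exome.tumor.bam"))     # Case1.exome.tumor
print(sample_id("Case1.exome.tumor.vcf.gz"))  # Case1.exome.tumor
```

Trimming at the first '.' instead would map both filenames to "Case1", which is
the naming conflict the change avoids.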

VCF parsing:

- Improve handling of VCFs from Mutect2 (#122, #153) and bcftools (#146).
- Don't reject records where FILTER is 'PASS' or '.'.
- VCF options are now consistent across the commands that can use them (`call`,
  `scatter`, `segment`, `export theta` and `export nexus-ogt`).
- New VCF option -z/--zygosity-freq to override VCF genotype calls. (#153, #132)

Commands
--------

target, antitarget:

- Both commands now divide bins evenly, using the same internal mechanism (the
  new GenomicArray.subdivide() method). Previously, subdivided regions were not
  always equal-sized as they should have been. As a result, the coordinates of
  targets newly generated from a baits BED file may differ slightly from those
  produced by earlier versions.
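
One way to picture even subdivision is the sketch below. It is not the
`GenomicArray.subdivide()` implementation; the function name, its parameters,
and the rounding rule are assumptions made for illustration.

```python
def subdivide_evenly(start, end, avg_size):
    """Split [start, end) into bins whose sizes differ by at most 1 bp.

    Hypothetical helper: the number of bins is the region length divided by
    the requested average size, rounded, with a minimum of one bin.
    """
    if avg_size <= 0:
        raise ValueError("avg_size must be positive")
    length = end - start
    n_bins = max(1, round(length / avg_size))
    # Place integer boundaries as evenly as possible across the region.
    bounds = [start + (length * i) // n_bins for i in range(n_bins + 1)]
    return list(zip(bounds[:-1], bounds[1:]))


print(subdivide_evenly(0, 1000, 300))
# [(0, 333), (333, 666), (666, 1000)]
```

Under this scheme a 1000 bp region with a 300 bp target size yields three bins
of 333, 333 and 334 bp, rather than three 300 bp bins plus a 100 bp remainder.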

target:

- Drop zero-width bins (#167).
- Improve assignment of gene names to targets in WGS datasets. (#164)
- Accept any supported region format for --annotate, including BED, interval
  list and GFF, in addition to the already supported UCSC refFlat. The format is
  detected automatically. (#163)
- Raise an error if the given annotations file (refFlat or equivalent) and the
  given baited/targeted intervals do not have any overlapping chromosomes.

antitarget:

- Set the default average bin size to 150kb. Previously, the CLI default was
  200kb, but the API default was 100kb; experience shows 150kb works well.

access:

- Avoid a possible error when more than 1000 small regions are excluded from a
  single sequencing-accessible region. (#150)

coverage:

- Fix a unicode vs. bytes incompatibility on Python 3. (#147)
- Fix a crash if the input BED has more than 4 columns.

reference:

- Add -g/--gender option to declare the chromosomal sex of the input sample(s)
  (same for all), instead of detecting/guessing for each sample. (#161)
- Ensure printed table of bad bins is a reasonable width. (#140)

segment:

- With a VCF (`-v`), don't output 'cn1' and 'cn2' columns; calculate the 'baf'
  column the same as in `call`. (#148)
- Improve memory efficiency somewhat when using a VCF. (#162)
- Fix possible 1-base overlap of output segments when using the `cbs` or
  `flasso` methods. Specifically, the start positions were erroneously all
  shifted 1 base to the left before. (#158)

scatter, heatmap:

- Improve rendering of genomes much smaller than the human genome, e.g. yeast,
  by scaling telomere padding to the total genome size. The blank space at
  chromosome boundaries was previously set to a fixed number of base pairs, but
  is now calculated as 0.3% of the whole genome size (sum of chromosome
  lengths), which works out the same for the human genome. (#155)
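
The padding rule amounts to a single multiplication. The sketch below uses a
hypothetical function name and made-up chromosome lengths; only the 0.3%
fraction comes from the note above.

```python
def chromosome_padding(chrom_lengths, fraction=0.003):
    """Blank space (in bp) at chromosome boundaries: 0.3% of the genome size."""
    total = sum(chrom_lengths.values())
    return round(fraction * total)


# Made-up lengths for a small genome, for illustration only.
small_genome = {"I": 200_000, "II": 800_000}
print(chromosome_padding(small_genome))  # 3000
```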

scatter:

- Add option `--segment-color`. Now you can choose 'red' if you like.

metrics:

- Input `-s`/`--segments` is now optional. If not given, compare bin log2 values
  to chromosome medians instead of segment means.
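
The fallback comparison can be sketched as below: each bin's log2 value is
measured against the median of its own chromosome. The function name and the
input layout (a list of chromosome/log2 pairs) are assumptions for this
example, not the `metrics` implementation.

```python
from collections import defaultdict
from statistics import median


def residuals_from_chrom_medians(bins):
    """Return each bin's deviation from its chromosome's median log2 value.

    `bins` is a list of (chromosome, log2) pairs; output keeps input order.
    """
    by_chrom = defaultdict(list)
    for chrom, log2 in bins:
        by_chrom[chrom].append(log2)
    centers = {chrom: median(values) for chrom, values in by_chrom.items()}
    return [log2 - centers[chrom] for chrom, log2 in bins]


bins = [("chr1", 0.0), ("chr1", 0.5), ("chr1", 1.0),
        ("chr2", -1.0), ("chr2", -1.0)]
print(residuals_from_chrom_medians(bins))
# [-0.5, 0.0, 0.5, 0.0, 0.0]
```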

import-theta, export theta:

- Drop sex chromosomes, since THetA2 doesn't handle them well. (#103, #153)

API changes
-----------

tabio:

- Read new formats: GFF (simple parsing), UCSC genePred refFlat, and the BED
  sub-formats bed3 and bed4.
- Detect more formats with `tabio.read_auto`: BED, interval list, text
  coordinates (chr:start-end), refFlat, GFF, TSV with column names.
- Remove module `ngfrills.regions`, no longer needed.
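
As an illustration of the text coordinate format listed above (chr:start-end),
the sketch below parses one such string. It does not show how
`tabio.read_auto` works internally; the function name and the regular
expression are assumptions, and thousands separators are not handled.

```python
import re

# Hypothetical pattern: chromosome name, colon, start, hyphen, end.
COORD_RE = re.compile(r"^(?P<chrom>[^:\s]+):(?P<start>\d+)-(?P<end>\d+)$")


def parse_text_coord(text):
    """Parse 'chr:start-end' into a (chromosome, start, end) tuple."""
    match = COORD_RE.match(text.strip())
    if match is None:
        raise ValueError("Not a chr:start-end coordinate: %r" % text)
    return match.group("chrom"), int(match.group("start")), int(match.group("end"))


print(parse_text_coord("chr1:100-200"))  # ('chr1', 100, 200)
```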

GenomicArray:

- Moved to new sub-package 'genome'.
- Rename method `select` to `filter`.
- Rename method `match_to_bins` to `into_ranges` and generalize.
- New methods `flatten`, `merge`, `resize_ranges`, `subdivide`, `subtract`.

In general, the 'genome' functionality can be reached by using the `tabio`
sub-package to load a GenomicArray instance and then calling its methods
directly:

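    from cnvlib import tabio

    # Load regions from any supported format (BED, interval list, GFF, ...)
    filename = "targets.bed"  # example path; substitute your own file
    regions = tabio.read_auto(filename)
    # Generate 500bp flanking regions: widen, merge overlaps, remove originals
    flanks = regions.resize_ranges(500).merge().subtract(regions)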
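
To show what the widen, merge, subtract chain computes, here is a pure-Python
sketch on plain (start, end) tuples for a single chromosome. It does not use
cnvlib; the helper names and the clamping of starts at 0 are assumptions made
for this illustration.

```python
def merge(intervals):
    """Merge overlapping or adjacent (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def subtract(intervals, others):
    """Remove the parts of `intervals` covered by `others` (both merged)."""
    result = []
    for start, end in intervals:
        pos = start
        for o_start, o_end in others:
            if o_end <= pos or o_start >= end:
                continue  # no overlap with the remaining piece
            if o_start > pos:
                result.append((pos, o_start))
            pos = max(pos, o_end)
        if pos < end:
            result.append((pos, end))
    return result


def flanks(intervals, width):
    """Regions within `width` bp of the intervals, excluding the intervals."""
    targets = merge(intervals)
    widened = merge([(max(0, s - width), e + width) for s, e in targets])
    return subtract(widened, targets)


print(flanks([(1000, 2000), (2300, 3000)], 500))
# [(500, 1000), (2000, 2300), (3000, 3500)]
```

The gap between the two input regions is narrower than twice the flank width,
so merging before subtracting yields one shared 300 bp flank there instead of
two overlapping ones.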