Tags mark specific points in the commit history as important.
  • v0.9.12
    dd834b0b · Update version to 0.9.12 ·
    Version 0.9.12
    ==============
    
    Bug fixes
    ---------
    
    - Re-enable `coverage -q/--min-mapq` option. (#912; thanks @rach-kennedy)
    - Prevent CBS segmentation failures due to nulls in input .cnr (#914, #436, #582, maybe #760, #896, #901 and nf-core/sarek#1625)
    - Raise max pomegranate dependency version from <=0.14.9 to <1.0.0 to avoid conflicts
      during installation (#911, #890)
  • v0.9.11
    450726e0 · Bump version to 0.9.11 ·
    Version 0.9.11
    ==============
    
    New features
    ------------
    
    - Most commands include a new option, `--diploid-parx-genome`, to treat the
      pseudoautosomal regions (PAR1/2) of human chromosome X as autosomal, i.e. diploid
      regardless of sample sex. The value it takes is a human reference genome ID such as
      "grch38". This feature should help reduce false calls on sex chromosomes in human
      samples. (Thanks @rollf; #789)
    - The `fix` command takes a new option `--smoothing-window-fraction` to allow manual
      tuning of the smoothing window used in GC and other automatic bias corrections.
      (Thanks @kkchau; #859)
    - hg38 refFlat and genome accessibility data files are now included in the source tree.
      (Thanks @berguner; #822, #837)
    
    Bug fixes
    ---------
    
    - The Docker image once again includes the additional scripts beyond cnvkit.py.
    - User-specified sample sex with `-x` now works properly. (Thanks @28rietd and @ccoo22;
      #843, #851)
    - User-specified smoothing window size now applies in HMM segmentation. (Thanks
      @zhuying412; #833, #835)
    - An error in `export vcf` has been fixed. (Thanks @pwwang; #818)
    
    Other updates
    -------------
    
    - Dependency versions are updated to match Ubuntu 23.04 Lunar, more or less.
    - Automated testing is done on Python version 3.8 through 3.12 -- these are the
      "supported" versions.
    - Small documentation fixes.
  • v0.9.10
    8d477b00 · Bumb version to 0.9.10 ·
    Version 0.9.10
    ==============
    
    This long-awaited release includes major plotting enhancements in the `heatmap`,
    `scatter`, and `diagram` commands, as well as a new `export gistic` command, thanks to
    joint work by @tetedange13 and @tskir (see below).
    
    There are also significant infrastructure improvements including bug fixes, modernized
    packaging, and build/test automation.
    
    New features
    ------------
    
    `diagram`:
    
    - New options `--no-gene-labels` to not display gene labels on the plot, and `-c` /
      `--chromosome` to plot a single chromosome (#628, #629, #634; thanks @tetedange13)
    
    `heatmap`:
    
    New CLI options (#35, #625, #632, #652; thanks @tetedange13 and @tskir):
    
    - `--vertical`: Transpose the plot, displaying the genome axis vertically instead of horizontally
    - `--delimit-samples`: Add a delimiting line between each sample row (or column, with
      `--vertical`)
    - `--title`: Set the plot title
    
    `scatter`:
    
    - New option `--fig-size`: Set the output image dimensions (#600, #641; thanks
      @tetedange13 and @tskir)
    - Show triangles at the bottom of the plot to indicate where segments are hidden below
      the plotted region by automatic pruning at 'ymin=-5'. Also log a warning when this
      happens. (#385, #643, #645; thanks @tetedange13, @tskir, and @micknudsen)
    
    `export gistic`:
    
    - New export command to generate an unsegmented "markers" file for use with GISTIC.
      GISTIC also takes a second input file with corresponding segments in SEG format, which
      CNVkit can generate with `export seg`. (#622, #623, #776; thanks @tetedange13, @tskir,
      @BioComSoftware)
    
    API and CLI changes
    -------------------
    
    - Running `cnvkit.py` without any arguments will now display the full help text instead
      of an error message.
    - Supporting scripts (aside from `cnvkit.py`) are no longer installed automatically.
      They are still available in the source tree.
    
    Documentation
    -------------
    
    - Clarified `bintest` usage, provided an example, and explained outputs. (#646; thanks
      @tetedange13 and @tskir)
    
    Bugfixes
    --------
    
    - Fixed several errors and warnings due to outdated usage of dependencies, e.g. pandas,
      pysam.
    - Fixed the Dockerfile and Docker image to install R packages properly for CNVkit to use
      internally. (#765; thanks @28rietd)
    - Made the Makefile example/test workflow more portable across environments. (#661,
      #666, #695, #699; thanks @tetedange13)
    - `batch`: Apply --drop-low-coverage option in the segmetrics step. (#694)
    - `bintest`: Include 'probes' column in .cns output so that it is valid .cns (closes #693)
    - `fix`: Condense the error message when coordinate set contains duplicate values. (#637,
      #638; thanks @tskir)
    - `fix`: Choose a smoothing window fraction based on the data size to help correct
      biases better at the extremes of the GC range, where previously some residual GC bias
      could still be present after correction. (#379)
    - BED inputs: Handle UCSC BED 'browser' header line, as used in Agilent BED files with a
      2-line header. (closes #696, #618)
    
    Internal
    --------
    
    - Modernized the packaging configuration with pyproject.toml, leaving a stub setup.py
      for legacy setuptools compatibility. (#790)
    - Set up automated testing through GitHub Actions (GHA) to verify Python versions 3.7
      through 3.10 using pytest and tox. The latter makes local testing with multiple
      Python versions more reliable, too. (#792, #793, #794)
    - Updated minimum dependency versions to roughly match Ubuntu 22.04 LTS packages; these
      are used in CI, too.
    - Applied black and pylint to reformat the codebase consistently and replace deprecated
      calls to libraries. (#795)
    - Remove joblib pinning (#589, #770; thanks @DavidCain and @risicle)
    - Remove networkx pinning (#606, #771; thanks @DavidCain)
    - Make the extreme-GC filters more easily configurable via `params.py` (#738, #752, #753,
      #764; thanks @tetedange13 and @tsivaarumugam)
  • v0.9.9
    aff02f9f · Bump version to 0.9.9 ·
    Version 0.9.9
    -------------
    
    This release contains a new script and, more importantly, a volley of bug fixes
    by @tskir, a new CNVkit collaborator.
    
    New script `genome_instability_index.py`:
    - For each given sample (.cnr or .cns, ideally .call.cns), this script reports
      two values, the number of non-neutral segments and the fraction of the total
      sequencing-accessible genome that they cover. Together, these values have been
      described as the Genome Instability Index (G2I) by [Bonnet et al.
      (2012)](https://doi.org/10.1186/1755-8794-5-54). These numbers are not
      difficult to calculate directly from .cns files, but they are frequently
      requested, so here you go.
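
    As a rough illustration of the two values described above, the sketch below computes them from a list of called segments. The `Segment` tuple, the `accessible_size` argument, and the rule that "neutral" means a copy number of 2 are assumptions made for this example; they are not taken from `genome_instability_index.py` itself.

```python
from typing import NamedTuple, Sequence, Tuple


class Segment(NamedTuple):
    """Hypothetical stand-in for one row of a .call.cns file."""
    chromosome: str
    start: int
    end: int
    cn: int  # called integer copy number


def instability_index(segments: Sequence[Segment], accessible_size: int,
                      neutral_cn: int = 2) -> Tuple[int, float]:
    """Return (number of non-neutral segments, fraction of genome they cover)."""
    altered = [s for s in segments if s.cn != neutral_cn]
    altered_bases = sum(s.end - s.start for s in altered)
    return len(altered), altered_bases / accessible_size


segments = [
    Segment("chr1", 0, 1000, 2),
    Segment("chr1", 1000, 1500, 3),
    Segment("chr2", 0, 2000, 1),
    Segment("chr2", 2000, 2500, 2),
]
count, fraction = instability_index(segments, accessible_size=5000)
print(count, fraction)  # 2 0.5
```

    A real analysis would also need to account for sex chromosomes, where the neutral copy number depends on the sample.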
    
    Bug fixes by @tskir:
    
    Installation:
    - Set NetworkX minimum version to work with pomegranate on Python 3.9.
      (#614, #606; thanks @auberginekenobi)
    
    genemetrics, diagram, scatter:
    
    - Fix an error in iterating over chromosomes during gene-wise operations or
      gene selection. (#580, #573, #576, #579; thanks @diushiguzhi @eriktoo
      @hrkemp @drmrgd @HYan-lei)
    
    access:
    
    - Fix an error when all chromosomes listed in the exclusion BED file appear
      only once. (#581, #574; thanks @dajana17)
    
    autobin:
    
    - Allow specifying explicit output filenames via -o/--output. If this option is
      not used, the behavior is the same as before. Some pipeline frameworks such
      as Snakemake require output filenames to be explicit in wrapped commands.
      (#608, #607; thanks @enes-ak)
    - Fix median-size file selection. (#613, #611; thanks @michaelsykes)
    
    coverage:
    
    - Fix a potential crash with the -c option; generally make the -c option's
      results more stable. This changes the results you'd get with `coverage -c`
      compared to previous CNVkit versions, but in any case -c isn't recommended
      for production use, only for algorithm exploration. (#598, #593; thanks
      @joys8998)
    
    genemetrics:
    
    - Rename column `n_bins` to `probes` in output, for compatibility with 'call'
      and 'export' commands. (#586, #585; thanks @eriktoo)
    
    scatter:
    
    - Avoid losing short segments in rasterized PNG output, depending on DPI
      settings.  (#615, #604; thanks @jimmy200340)
    - Allow NCBI-style chromosome names that contain a ".", e.g. "NC_039902.1".
      (#603, #602; thanks @amora197)
    
    segment:
    
    - Fix an IndexError during smoothing when the signal is shorter than a window,
      e.g. on chrY where the chromosome contains few bins. (#590, #587; thanks
      @tetedange13)
    
    Improvements from other contributors:
    
    - scripts/guess_baits.py: Fix a copy-paste error on script launch.  (#588; thanks @sssimonyang)
    - Documentation: Link to the Debian package alongside other packages. (#562; thanks @mr-c)
  • v0.9.8
    b218280e · Bump version to 0.9.8 ·
    Version 0.9.8
    -------------
    
    Continuing a focus on stability and compatibility with other software:
    
    * Support for reading CRAM files with an optional user-provided local FASTA
      file for the reference genome sequence. (#555; thanks @johnegarza)
    * Call Rscript subprocess with safer flags for the R environment. Previously,
      `--vanilla` ignored R environments with the library path in a non-default
      location specified in the user's .Rprofile. Now, `--no-restore` and
      `--no-environ` ensure a clean environment but still respect the user's
      .Rprofile settings beyond that. (#491; thanks @pablo-gar)
    * Compatibility with the latest release of pandas. (#502, #523)
    
    This release also fixes some regressions reported since the release of CNVkit
    0.9.7 (which introduced a number of new performance optimizations).
    
    * `scatter`: A bug when plotting a region of a chromosome. (#536, #457; thanks @tskir)
    * `scatter`: An IndexError when plotting entire chromosomes, e.g. chr7. (#541,
      #461, #535; thanks @tskir)
    * `fix`: A bug that occurred after automatic bias corrections, introducing
      NaN-valued rows in place of rejected bins, leading to a downstream crash in
      CBS segmentation. (#551, #436, #547; thanks @johnegarza)
  • v0.9.7
    Version 0.9.7
    
    Stable release with only minor changes from the previous beta release 0.9.7.b1.
    
    New contributions:
    
    - Cram support: Look for and use .cram + .crai alignment and index file pairs,
      in addition to .bam + .bai. (#495, #434; thanks @sridhar0605)
    - Update Docker file to use Python 3 apt packages and pip3 (#493; thanks
      @keiranmraine)
    - Documentation fix (#496; thanks @rollf)
  • v0.9.7.b1
  • v0.9.7.b0
    ac14e5ac · Bump version to 0.9.7.b0 ·
    Version 0.9.7-beta
    
    This release contains several major enhancements particularly relevant to germline
    analysis. If used in production pipelines, further evaluation and benchmarking would be
    wise. Highlights:
    
    **Control sample clustering**: To make better use of larger reference sample pools,
    `reference --cluster` will correlate the given normal samples' bin-wise coverage depths
    to extract clusters to be used as reference profiles. The reference .cnn file produced
    this way will then contain the `log2` and `spread` summary statistics for each cluster,
    in addition to the global summary stats. Given this "clustered reference" profile, `fix
    --cluster` will then correlate each test sample to each clustered `log2` profile in the
    reference to choose the most relevant control pool for normalization. The `batch` option
    `--cluster` will perform both these steps. Nod to Gambin lab and the authors of
    ExomeDepth, CoNVaDING, CLAMMS, and others for inspiration. (#308)
    
    Calculation of bin weights has changed. **This will change your segmentation results**,
    hopefully for the better. Details below. (#429)
    
    The `batch` pipeline now performs some **segmentation post-processing** automatically:
    calculating and filtering segmentation calls by 50% confidence intervals of the segment
    mean log2 ratios, in order to reduce false positives, followed by separate bin-level
    testing to detect small (e.g. exon-size) CNVs that were not caught by segmentation.
    The bin- and segment-level results are returned as separate .cns files; deciding whether
    and how to combine or use these results together is left as an exercise for the user.
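
    As a rough sketch of the confidence-interval filter described above, the code below treats a segment as neutral when its interval spans zero. The `SegmentCI` tuple and its `ci_lo`/`ci_hi` fields are hypothetical stand-ins for a .cns row annotated by `segmetrics`, and resetting `log2` to 0.0 is a simplifying assumption; the actual `call` filter may merge or handle such segments differently.

```python
from typing import List, NamedTuple


class SegmentCI(NamedTuple):
    """Hypothetical segment row with a confidence interval on its mean log2."""
    chromosome: str
    start: int
    end: int
    log2: float
    ci_lo: float
    ci_hi: float


def is_supported(seg: SegmentCI) -> bool:
    """True when the whole interval lies above or below zero."""
    return seg.ci_lo > 0 or seg.ci_hi < 0


def neutralize_unsupported(segments: List[SegmentCI]) -> List[SegmentCI]:
    """Keep supported segments; set log2 to 0.0 where the interval spans zero."""
    return [seg if is_supported(seg) else seg._replace(log2=0.0)
            for seg in segments]


segments = [
    SegmentCI("chr1", 0, 1000, 0.45, 0.30, 0.60),      # gain, CI above zero
    SegmentCI("chr1", 1000, 2000, 0.05, -0.10, 0.20),  # CI spans zero
    SegmentCI("chr2", 0, 3000, -0.80, -1.00, -0.60),   # loss, CI below zero
]
filtered = neutralize_unsupported(segments)
print([s.log2 for s in filtered])  # [0.45, 0.0, -0.8]
```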
    
    We've **dropped Python 2.7 support**. Python version 3.5 or later is now required.
    
    This is a beta release. Please let me know how it works for you via the Issues page. If
    this release contains any issues that are blocking your work, try installing one of the
    previous stable versions 0.9.6 or 0.9.5::
    
        conda install cnvkit=0.9.6
    
    Dependencies
    ------------
    
    - Remove all Python 2.7 compatibility shims.
    - Raise minimum pandas version from 0.20.1 to 0.23.3.
    - Add scikit-learn (dependency of pomegranate, for HMM segmentation). Remove the older
      hmmlearn implementation.
    
    Commands
    --------
    
    `batch`:
    
    - Post-process segments with `segmetrics` (50% CI), `call` (filter by CI, but don't call
      integer copy number), and `bintest`.
    - Return `bintest` result as a separate, independent .cns output.
    - Add option '--segment-method', equivalent to `segment -m`.
    - Rename option '--method' to '--seq-method' (but '--method' still accepted for now).
    - Add option `--cluster`, passed to `reference` and `fix` if given. (#308)
    
    `bintest`:
    
    - New command superseding `cnv_ztest.py` script.
    - Report p-value as a column `p_bintest` (previously `ztest`) in the .cns output.
    - Fix probabilities for positive log2 values, i.e. gains, which previously always had
      p-value = 1.0. (#429)
    
    `fix`:
    
    - Change calculation of bin weights to be more consistent with `1-var` meaning,
      with more emphasis on reference spread. It is now simpler, more consistent with
      `import-rna`, and particularly improves the accuracy of `bintest`. (#429)
    - Squeeze the range of reference-free weights
    - Drop bins with gc outside [.3, .7]. CLAMMS paper shows these bins carry no useful
      signal.
    - With `--cluster` and a clustered reference input, calculate the test sample's Pearson
      correlation versus each cluster's log2, and take the best one for normalization.
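
    The cluster selection in the last item can be sketched in a few lines. This is a minimal illustration using plain Python lists; the function names and the `profiles` dictionary are hypothetical and do not reflect how `fix` stores or reads a clustered reference.

```python
import math
from typing import Dict, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def best_cluster(sample_log2: Sequence[float],
                 cluster_profiles: Dict[str, Sequence[float]]) -> str:
    """Name of the cluster whose log2 profile correlates best with the sample."""
    return max(cluster_profiles,
               key=lambda name: pearson(sample_log2, cluster_profiles[name]))


# Hypothetical bin-level log2 profiles for two reference clusters.
profiles = {
    "cluster_0": [0.1, -0.2, 0.3, 0.0, -0.1],
    "cluster_1": [-0.3, 0.2, -0.1, 0.4, 0.0],
}
sample = [0.2, -0.3, 0.4, 0.1, -0.2]
chosen = best_cluster(sample, profiles)
print(chosen)  # cluster_0
```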
    
    `reference`:
    
    - With `--cluster`, do k-means clustering of the sample bin-level read depth correlation
      matrix, per [Kusmirek et al. 2018](https://doi.org/10.1101/478313).
      Parameter k defaults to the cube root of number of samples. Only clusters of at least
      4 samples are kept for emitting summary statistics in the reference profile.
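
    The two numeric rules above (k defaults to the cube root of the sample count, and clusters need at least 4 samples) can be illustrated as follows. Rounding the cube root to the nearest integer is an assumption made for this sketch, as are the function names; the clustering step itself is omitted.

```python
from collections import Counter
from typing import Dict, List

MIN_CLUSTER_SIZE = 4  # minimum cluster size stated in the release note


def default_k(n_samples: int) -> int:
    """Cube root of the sample count, rounded (assumed), and at least 1."""
    return max(1, round(n_samples ** (1 / 3)))


def clusters_to_keep(labels: List[int]) -> Dict[int, int]:
    """Map each sufficiently large cluster label to its sample count."""
    sizes = Counter(labels)
    return {label: size for label, size in sizes.items()
            if size >= MIN_CLUSTER_SIZE}


# Hypothetical cluster labels for 11 normal samples.
labels = [0, 0, 0, 0, 0, 1, 1, 2, 2, 2, 2]
kept = clusters_to_keep(labels)
print(default_k(27), kept)  # 3 {0: 5, 2: 4}
```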
    
    `segment`:
    
    - hmm: Fix pomegranate-based implementation. Use iterative Savitzky-Golay smoothing with
      a narrow bandwidth.
    - Use HMM for post-TCN segmentation on VCF allele freqs
    - Add parameter for smoothing before CBS (thanks @EwaMarek)
    
    `segmetrics`:
    
    - Add 'ttest' option for 1-sample t-test p-value.
    - Implement & expose --smooth-bootstrap option.  For smoothing, KDE bandwidth is based
      on each bin's weight as a proxy for the SD of its log2 ratio values.  To reduce the
      risk of over-smoothing on larger sample sizes, we use a loose interpretation of
      Silverman's Rule to reduce the bandwidth as the number of bins in a segment increases
      (k^-1/4).
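
    A loose sketch of one smoothed-bootstrap replicate follows, reading `k^-1/4` as k raised to the power -1/4. The per-bin `sds` input is a hypothetical stand-in for whatever spread estimate is derived from the bin weights, and the Gaussian kernel is an assumption; this is not the `segmetrics` implementation.

```python
import random
from typing import Sequence


def bandwidth_scale(n_bins: int) -> float:
    """Shrink factor k**(-1/4): more bins in a segment means less smoothing."""
    return n_bins ** -0.25


def smoothed_bootstrap_mean(values: Sequence[float], sds: Sequence[float],
                            rng: random.Random) -> float:
    """Mean of one replicate: resample bins, then jitter each draw with noise."""
    k = len(values)
    scale = bandwidth_scale(k)
    total = 0.0
    for _ in range(k):
        i = rng.randrange(k)                          # ordinary bootstrap draw
        total += rng.gauss(values[i], sds[i] * scale)  # kernel smoothing step
    return total / k


rng = random.Random(0)
log2_values = [0.40, 0.50, 0.45, 0.55, 0.50, 0.42, 0.48, 0.52]
sds = [0.1] * len(log2_values)
replicates = [smoothed_bootstrap_mean(log2_values, sds, rng) for _ in range(200)]
print(min(replicates) < sum(log2_values) / len(log2_values) < max(replicates))
```

    Confidence or prediction intervals would then be read off as percentiles of `replicates`.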
    
    API
    ---
    
    - `do_heatmap`: Add 'ax' parameter (thanks @fbrundu)
    - `CNA.residuals()`: speed; keep index intact in returned pd.Series
    - smoothing: Linearly roll-off weights in mirrored wings.  Affects CNA.smoothed() /
      savgol, but not rolling median bias correction.
    - Rename `CNA.smoothed()` to `CNA.smooth_log2()`, since it returns the smoothed log2
      values, not a new/altered CNA.
    
    Bug fixes
    ---------
    
    - `batch`: Fix argparse formatting issue (#466)
    - `import-rna`: Fix a regression in reading 2-column per-gene counts (`-f counts`).
    - `reference`: Fix sex inference/usage when creating haploid-x reference (#459; thanks
      @duartemolha)
    - `scatter`: Use a safe matplotlib backend on OS X to avoid crash
    - VariantArray: Fix/streamline indexing of variants by bin/segment
  • v0.9.6
    1c8d69d7 · Bump version to 0.9.6 ·
    Version 0.9.6
    =============
    
    Much-needed maintenance and bug fixes, for the most part. Some key dependencies
    have changed, though this should be generally painless for you, and one or two
    regressions introduced by recent optimizations have been fixed.
    
    This will be the last CNVkit version to run on Python 2.7. The next major
    release of pandas (0.25.0) will remove support for Python 2.7, and once that
    happens it will become increasingly difficult to install future versions of
    CNVkit on Python 2.7 -- so we're not going to try.
    
    The segmentation method `flasso` depends on the R package `cghFLasso`, which is
    unmaintained and has been removed from CRAN.  For now, `segment -m flasso` is
    still supported if you already have `cghFLasso` installed. But given the above,
    `flasso` will be removed from the next CNVkit version in favor of the HMM-based
    methods.
    
    Dependencies
    ------------
    
    - Raised minimum pandas version from 0.18.1 to 0.20.1, and support up to 0.24.2,
      resolving some warnings and an error in pandas 0.22+. (#413; thanks @chapmanb)
    - The soft dependency on `hmmlearn` is replaced with an explicit dependency on
      `pomegranate` for the HMM-based segmentation methods. This dependency will now
      be pulled in automatically when installing via `pip` or `conda`.
    - The R package `cghFLasso` has been removed from CRAN, and therefore is no
      longer a dependency of CNVkit and will not be installed automatically through
      the standard `conda` installation method. (#419)
    
    Commands
    --------
    
    `antitarget`:
    
    - Be more specific in removing noncanonical chromosomes (e.g. alternate
      contigs, mitochondria) from the binned regions. This avoids skipping
      chromosomes of interest in some non-human genomes with non-numeric contig
      names, like yeast. (#388; credit for regexes to @brentp)
    
    `coverage`:
    
    - With `--count-reads`, use query aligned length to handle soft-clipped reads
      properly. Now the results with and without this option should be similar.
      (#411; thanks @desnar)
    
    `segment`:
    
    - For `-m flasso`, partition array by chromosome to avoid edge effects. (#409, #412; thanks @giladmishne)
    - Removed the deprecated option `--rlibpath`; use `--rscript-path` instead.
    - Note that the HMM methods are still provisional. A stable, supported version
      of these methods will be provided in the next CNVkit release.
    
    Python API
    ----------
    
    - `do_scatter` now returns a figure (#408; thanks @jeremy9959)
    
    Bug fixes
    ---------
    
    - `scatter`: Whole chromosomes can once again be specified with `-c`. (In the
      previous release, a chromosome without coordinates would cause an IndexError.)
      (#393)
    - `import-rna`: Option --max-log2 can now be specified by users. (Previously,
      only the default value of +3.0 worked.)
    - VCF I/O (`skgenome.tabio`): Support GATK 4's VCF files that contain records
      with empty ALT alleles, substituting zero if ALT AD is missing. (#391; thanks
      @chapmanb)
    - Due to a certain versioning-dependent interaction between numpy, pandas,
      cython, and conda (details [here](https://github.com/numpy/numpy/pull/432)),
      CNVkit may have printed spurious RuntimeWarning messages which could be safely
      ignored. The current release attempts to silence these messages if they occur.
      (#390).
  • v0.9.5
    fd355525 · Bump version to 0.9.5 ·
    Minor bugfix and usability improvement.
    
    `autobin`:
        Ensure targets are non-empty and match BAM chrom names (closes #371)
    
    `segment`:
        segment: Suppress help text for deprecated --rlibpath (#317)
        segment: Fix help text display (#380)
  • v0.9.4
    6a6266b1 · Bump version to 0.9.4 ·
  • v0.9.3
    9bdb0831 · Bump version to 0.9.3 ·
    Version 0.9.3
    
    This release fixes a single bug that caused the `segmetrics` command to crash
    (#325).
    
    Specifically, the command would crash unless at least one option from each of
    the following option sets was specified:
    
    - Location statistics: --mean, --median, --mode
    - Spread statistics: --stdev, --sem, --mad, --mse, --iqr, --bivar
    - Interval statistics: --ci, --pi
    
    This bug would not be triggered by calling `cnvlib.do_segmetrics` through the
    Python API, which is why it was not caught in automated testing.
  • v0.9.2
    7faa7b08 · Bump version to 0.9.2 ·
    Version 0.9.2
    
    This release contains a new command `import-rna` to infer coarse-grained copy
    number from RNA expression data. (#151)
    
    Three new HMM-based segmentation methods are offered: 'hmm', 'hmm-germline', and
    'hmm-tumor'. These should be considered experimental and used with caution; the
    implementations are likely to change in the next release.
    
    The option `--male-reference` in the commands `batch`, `reference`, `fix`,
    `call`, and `export` (at least) has been renamed to `--haploid-x-reference`
    everywhere to reduce user confusion. A shim is in place so `--male-reference`
    will continue to work.
    
    Documentation, logging, and some error messages are improved.
    
    Thanks to @chapmanb, @MajoroMask, and others for contributing to this release.
    
    Dependencies
    ------------
    
    - 'pandas' version 0.22 is supported.
    - 'pysam' version 0.13.0 is supported.
    - 'hmmlearn' version 0.2 is a run-time requirement to use the new HMM-based
      segmentation methods. The rest of CNVkit can be run without it. To ensure the
      right version is installed, install CNVkit with conda as usual, then install
      hmmlearn with pip within the CNVkit conda environment.
    - Assume and require pip/setuptools for installation. (This is included with
      stock Python 2.7 and later.)
    
    Scripts
    -------
    
    - New script "skg_convert.py" to convert between BED, GATK interval list, GFF,
      VCF, and tabular formats using the 'skgenome.tabio' sub-package, with options
      for simple post-processing.
    - Removed the deprecated script refFlat2bed.py. (Use skg_convert.py instead.)
    
    Commands
    --------
    
    `access`:
    
    - Drop noncanonical, untargeted contigs/chromosomes by default. This affects
      analyses run from scratch with `batch`, too. (#169, #299)
    
    `segment`:
    
    - Three new methods can be specified with `-m`: `hmm`, `hmm-germline`, and
      `hmm-tumor`.
    - With `-m flasso`, force a breakpoint at centromeres, as was already done for
      the default 'cbs' method.
    
    `reference`:
    
    - The option `--antitargets` is no longer required to build a flat reference.
      Previously, building a flat reference for WGS or TAS required creating an
      empty file to use as antitargets alongside the target BED.
    - Print a warning if the sample sex inferred from targets does not match that of
      antitargets. (#281)
    
    `scatter`:
    
    - Removed the deprecated, invisible option `--background-marker`. (Use
      `--antitarget-marker` instead.)
    - Trendlines should reflect small CNVs better, while preserving overall
      smoothing. The implementation now uses the Savitzky-Golay method instead of a
      Kaiser window, and the smoothing bandwidth is better-tuned. (This can also
      slightly improve outlier filtering in `segment`.)
    
    `export seg`:
    
    - Add option `--enumerate-chroms` to replace chromosome or contig names with
      sequential integers. Previously, this renumbering was always done, following
      some version of the SEG format. But since most tools don't require the contigs
      to be sequential integers, and this behavior causes trouble for users, it's
      now disabled by default. (#282)
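
    To make the renumbering concrete, here is a minimal sketch of what replacing contig names with sequential integers could look like. Numbering from 1 in order of first appearance is an assumption for illustration; the function name is hypothetical and this is not the `export seg` code.

```python
from typing import Dict, List


def enumerate_chroms(names: List[str]) -> List[int]:
    """Replace each distinct name with 1, 2, 3, ... by first appearance."""
    mapping: Dict[str, int] = {}
    for name in names:
        if name not in mapping:
            mapping[name] = len(mapping) + 1
    return [mapping[name] for name in names]


chroms = ["chr1", "chr1", "chr2", "chrX"]
print(enumerate_chroms(chroms))  # [1, 1, 2, 3]
```

    The example also shows why the behavior can cause trouble: after renumbering, "chrX" becomes 3, which no longer identifies the chromosome without the original mapping.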
    
    `gainloss`/`genemetrics`:
    
    - Rename `gainloss` command to `genemetrics`. A shim is in place so `cnvkit.py
      gainloss` will continue to work. (#278)
    - Report segment- and bin-level weight and probes separately. (#107, #278)
    
    Bug fixes
    ---------
    
    - autobin: Require -g/--access for WGS (#289)
    - batch: Use the "access" regions for the WGS workflow to choose bin size; these
      were previously being ignored, so bin sizes were too large, being based on the
      size of the whole genome, not just sequencing-accessible regions.
    - call: Safely handle bins with zero weight when running `call --filter cn`.
      (chapmanb/bcbio-nextgen#2112; thanks @chapmanb)
    - coverage, guess_baits.py: Handle input BED files containing >4 columns. (#301)
    - gainloss: Without `-s`, make 'depth' the weighted mean of bins, not just the
      first bin's value.
    - segment: Ensure the .cns output file's columns are sorted properly (#291)
    - vcfio: Don't crash if a record has no ALT values (#279)
    - tabio:
    
        - Recognize BED format with decimal in chromosome name (#293)
        - Improvements to GFF/GTF/GFF3 parsing. The new options are mostly
          accessible through the Python API and the script 'skg_convert.py'. (#311)
        - In 'read_auto' (and all CNVkit commands that take regions as input),
          determine the file format first by checking the file extension and
          verifying the format of the first(-ish) line. Only if that doesn't work,
          fallback to the original method of testing the first(-ish) line against a
          brittle series of regular expressions. (#315)
    
    Python API
    ----------
    
    - cnvlib.write: Newly available at the top level to write tabular files (like
      .cnr and .cns), symmetric with 'cnvlib.read()'. The 'cnvlib.tabio' alias to
      'skgenome.tabio' has been removed; to read and write formats other than
      TSV-with-header ('tab'), import and use 'skgenome.tabio' directly.
    - CopyNumArray.squash_genes: remove deprecated keyword argument 'squash_background'. Use 'squash_antitarget' instead.
    - segmetrics: Move the functions supporting this command from 'cnvlib.command' to
      a new module 'cnvlib.segmetrics'.
  • v0.9.1
    b20dc5be · Bump version to 0.9.1 ·
    Version 0.9.1
    
    Highlights: Useful enhancements and changes to plotting and segmentation, and a
    new script for single-exon CNV testing. Plus, bug fixes and usability
    improvements to avoid unexpected errors. (#250, #255, #262, etc.)
    
    Dependencies
    ------------
    
    - Compatible with the most recent pandas version 0.21.0
      (#273, #274; thanks @chapmanb)
    - R dependencies were reduced to simplify installation
    
    Scripts
    -------
    
    - Renamed "cnn_*.py" to "cnv_*.py"
    - New script "cnv_ztest.py" to detect single-bin (e.g. single exon) deep
      deletions and high-level amplifications.
    - In "cnv_updater.py", rename "Background" (i.e. off-target) bins to
      "Antitarget", in addition to adding a "depth" column if it's missing.
    
    Commands
    --------
    
    `autobin`:
    
    - Raise the maximum target/antitarget bin sizes to 50kb/1Mb.
    
    `fix`:
    
    - Allow specifying sample_id via ``--sample-id``/``-id``, in case the input
      coverage filenames do not have the expected form
      "sample_id.targetcoverage.cnn" and "sample_id.antitargetcoverage.cnn".
      (#269; thanks @chapmanb)
    
    `segment`:
    
    - Process each chromosome arm separately (with 'cbs' and 'haar', but not
      'flasso'). Centromere locations are guessed from the largest gap between
      sequencing-accessible regions, and are not necessarily the true locations,
      although they do match fairly well on the human genome.
    - Logging of dropped bins is streamlined somewhat.
    - New method `-m none` to only calculate arm-level segment means (for testing
      and experimentation).
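
    The centromere heuristic in the first item above can be sketched as a search for the largest gap between consecutive accessible regions. The function name is hypothetical, and the input is assumed to be a sorted, non-overlapping list of (start, end) pairs for one chromosome.

```python
from typing import List, Optional, Tuple

Region = Tuple[int, int]


def guess_centromere(regions: List[Region]) -> Optional[Region]:
    """Return (gap_start, gap_end) of the largest gap, or None if no gap."""
    best: Optional[Region] = None
    for (_, prev_end), (next_start, _) in zip(regions, regions[1:]):
        gap = next_start - prev_end
        if gap > 0 and (best is None or gap > best[1] - best[0]):
            best = (prev_end, next_start)
    return best


# Hypothetical accessible regions on one chromosome.
accessible = [(0, 100), (150, 400), (1000, 1500), (1520, 2000)]
print(guess_centromere(accessible))  # (400, 1000)
```

    Bins on either side of the returned gap would then be segmented as separate arms.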
    
    `scatter`:
    
    - Highlight non-neutral segments from .call.cns. If segments have the columns
      'cn' and potentially also 'cn1' and 'cn2' (as added by the `call` command),
      use those fields to display copy number alterations, LOH and allelic imbalance
      with colorized segments (orange by default), and use gray for neutral
      segments. If a VCF is also given, the same is done for SNVs in the lower
      panel.  Otherwise, all segments are colorized as before. (#18, #157)
    - New option `--by-bins` to display x-axis positions by sequential bin number on
      each chromosome, rather than genomic coordinates. This makes the plots much
      more useful with targeted amplicon sequencing data, or very small gene panels.
      (#63)
    - Trend line (`--trend`) now accounts for bin weights, which generally results
      in a better fit.
    - Improved interaction of -c and -g options:
    
        - Only apply the window margin (-w) if -g is used alone, or -c specifies a small
          chromosomal region with no genes.
        - Allow an empty gene list (-g '' or -g ',') to prevent highlighting and
          labeling of any genes / small non-genic "Selection" in the -c region.
        - If any gene in -g is not fully within the region specified by -c, name that
          gene and its coordinates in the error message.
        - If the -c region has size <=0, show a specific error message.
        - Handle NaN log2 values when calculating y-axis limits.
    
    `heatmap`:
    
    - Incorporate the `--by-bins` argument to match `scatter`. (#63)
    - Warn if selected region contains no data for a sample. This helps troubleshoot
      if a chromosome name was mis-specified on the command line. (#268)
    
    `export seg`:
    
    - Change column headers to match DNAcopy output. The column headers generally
      don't matter in the SEG format, but the DNAcopy dataframe is considered the
      canonical form.
    
    Python API
    ----------
    
    - cnvlib.do_segment -- new keyword argument min_weight to drop bins with
      'weight' below the specified value. If not used, then only bins with weight 0
      will be dropped. This feature is not recommended for normal usage and is not
      available on the command line.
    - cnvlib.do_scatter -- Remove deprecated keyword argument 'background_marker' in
      favor of 'antitarget_marker', corresponding to `scatter` options deprecated in
      v0.9.0.
    - cnvlib.cnary.CopyNumArray: Add method 'smoothed', which calculates the
      trendline displayed by the `scatter` command.
    - skgenome.tabio: Add read support for samtools 'dict' format, which resembles the
      plain-text SAM header and can contain chromosome names and sizes.
    - skgenome.gary.GenomicArray: Add magic methods __bool__ (Py3) and __nonzero__
      (Py2) to ensure an empty GenomicArray, i.e. 0 rows, is treated as false-ish on
      both Python 2.7 and 3.x.
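
    The truth-value behavior in the last item can be illustrated with a small stand-in class. `TinyArray` is hypothetical and is not the real `GenomicArray`; it only shows how defining both method names gives the same result on either Python version.

```python
class TinyArray:
    """Hypothetical minimal container; not the real GenomicArray."""

    def __init__(self, rows):
        self.rows = list(rows)

    def __bool__(self):
        # Python 3 truth-value hook: an array with zero rows is false.
        return len(self.rows) > 0

    # Python 2 looks up __nonzero__ instead, so alias it to the same method.
    __nonzero__ = __bool__


empty = TinyArray([])
filled = TinyArray([("chr1", 0, 100)])
print(bool(empty), bool(filled))  # False True
```

    With this in place, a check such as `if not empty:` behaves the same way on both interpreters.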
  • v0.9.0
    87a0c6ed · Bump version to 0.9.0 ·
    Version 0.9.0
    =============
    
    In addition to bug fixes, documentation updates, and usability improvements,
    this release includes some larger changes:
    
    - The off-target bins in .cnn and .cnr files are now assigned the label
      "Antitarget" instead of "Background" in the "gene" column. The label
      "Background" in existing files will still be handled the same way, but new
      output files generated with CNVkit 0.9.0 and later will use the "Antitarget"
      label -- so, earlier versions of CNVkit may have problems with files produced
      by CNVkit 0.9.0. Some command line options and API keyword arguments similarly
      replace "background" with "antitarget", with shims in place for compatibility
      with existing scripts. (#171)
    
    - The sub-packages 'genome' and 'tabio' are now in a separate top-level package
      'skgenome', still included in the CNVkit distribution. (See "Python API"
      below.) This does not affect the command-line usage of CNVkit, but clears the
      way to extract a scikit-genome package that can be installed and used
      separately from CNVkit for computing with genomic intervals.
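
    The label compatibility described in the first item above could be handled in
    a downstream script roughly as follows. This is a minimal sketch; the helper
    names `is_off_target` and `split_bins` are hypothetical and not part of
    cnvlib.

```python
# Both labels mark off-target bins; "Background" is the pre-0.9.0 spelling.
OFF_TARGET_LABELS = ("Antitarget", "Background")


def is_off_target(gene_label):
    """Return True if a bin's 'gene' value marks it as off-target."""
    return gene_label in OFF_TARGET_LABELS


def split_bins(bins):
    """Partition (gene, log2) tuples into on-target and off-target lists."""
    on_target = [b for b in bins if not is_off_target(b[0])]
    off_target = [b for b in bins if is_off_target(b[0])]
    return on_target, off_target
```

    Accepting both spellings lets one script read files written before and after
    the rename.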
    
    Documentation
    -------------
    
    - Link to example VCF in the test suite
    - Describe the 'breaks' command's output columns (#220)
    - Show an example customizing a plot with pyplot (#196)
    
    Dependencies
    ------------
    
    - pysam: raise minimum to 0.10; support new version 0.11.2.1 (#218; thanks
      @chapmanb)
    - pandas: support new version 0.20.1 (#215)
    - numpy: support new version 1.13 (#235, #238)
    
    Commands
    --------
    
    `batch`:
    
    - Log the CNVkit version number at the start of the run
    - Print a message at the end if no tumor/test samples specified. (#214)
    - Clarify error messages for bad option combinations (#216)
    - Removed deprecated, suppressed/invisible option `--split`. It was a shim in
      the 0.8 series to support old scripts.
    
    `reference`:
    
    - Ensure the inferred chromosomal sex matches between the targets and
      antitargets for the same sample. If the inferences do not match, prefer
      antitargets. (#234, #237)
    
    `fix`:
    
    - Warn & don't reweight bins if most antitargets have no/low coverage. This
      avoids a variety of surprising downstream problems when the input was
      specified as hybrid capture (the default), but is actually from
      targeted amplicon sequencing, or otherwise has no reads mapped to most
      off-target bins.
    
    `segment`:
    
    - Log the segmentation and p-value/q-value threshold
    
    `call`:
    
    - Add option --center-at
    - Let --center without an argument default to 'median'
    
    `diagram`:
    
    - New option `--title` to add a custom title to the top of the generated figure
      (#239; thanks @micknudsen)
    
    `export vcf`:
    
    - When given a .cnr file corresponding to the usual segmented input file (.cns),
      emit the CIPOS and CIEND tags in the generated VCF. These indicate the
      "fuzzy" coordinates of segment breakpoints. Here, the ranges are simply the
      widths of the underlying bins adjacent to each segment breakpoint. These tags
      can help meta-methods aggregate/harmonize CNVkit's calls with those of other
      structural variant callers. (#72)
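
    One way to read the bin-width rule above is sketched below. It assumes each
    interval extends only into the segment's own first or last bin; the real
    exporter may define the ranges differently, and the function names are
    hypothetical.

```python
def breakpoint_intervals(segment, bins):
    """Return (CIPOS, CIEND) for one segment.

    segment: (start, end) tuple.
    bins: sorted list of (start, end) tuples from the matching .cnr file.
    Assumption: each interval spans the width of the segment's own first
    or last bin. Illustrative only, not CNVkit's exact calculation.
    """
    seg_start, seg_end = segment
    inner = [(s, e) for s, e in bins if s >= seg_start and e <= seg_end]
    if not inner:
        # No bins to inform the uncertainty: report exact coordinates.
        return (0, 0), (0, 0)
    first_start, first_end = inner[0]
    last_start, last_end = inner[-1]
    cipos = (0, first_end - first_start)
    ciend = (-(last_end - last_start), 0)
    return cipos, ciend


def format_info(cipos, ciend):
    """Render the two tags as a VCF INFO fragment."""
    return "CIPOS={},{};CIEND={},{}".format(
        cipos[0], cipos[1], ciend[0], ciend[1])
```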
    
    `import-picard`:
    
    - Don't accept directory as an argument (was deprecated).
    - Be a little more flexible in filenames accepted: instead of requiring input
      files to be named `*.targetcoverage.???` or `*.antitargetcoverage.???`, strip
      the full suffix and default to 'targetcoverage.cnn' output suffix, or
      'antitargetcoverage.cnn' if input filename contains 'antitarget'. Works the
      same for filenames following the earlier convention, but now pretty safe for
      amplicon targets with arbitrary filenames, and slightly less spooky.
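
    The filename rule above can be sketched as a small function. It assumes that
    "the full suffix" means everything after the first '.' in the base filename;
    `picard_output_name` is a hypothetical name, not the command's own code.

```python
import os


def picard_output_name(input_path):
    """Derive a .cnn output filename from a Picard per-target coverage file.

    Assumption: the full suffix is everything after the first '.' in the
    base filename.
    """
    base = os.path.basename(input_path)
    stem = base.split(".", 1)[0]
    # Check for 'antitarget' anywhere in the name, as the notes describe.
    if "antitarget" in base:
        suffix = "antitargetcoverage.cnn"
    else:
        suffix = "targetcoverage.cnn"
    return "{}.{}".format(stem, suffix)
```

    Names following the earlier convention map to the same outputs as before,
    while an arbitrary amplicon filename falls back to the target suffix.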
    
    Bug fixes
    ---------
    
    - `antitarget`: Don't crash if -g/--access is not given (#207)
    - `batch`: Don't crash in 'wgs' mode when given just targets (-t) without a
      FASTA reference genome sequence (-f)
    - `call --filter ampdel`: Drop segments with copy number (`cn` field) between 0
      and 5, exclusive, as the documentation indicates. Previously, it was just
      merging adjacent segments with copy number 1--4, but not dropping them. (#222)
    - `export cdt`: Match the CDT spec. Fix a regression in which columns could be
      swapped/misaligned versus the header. Add a dummy "EWEIGHT" row to ensure Java
      TreeView starts reading data from the correct line in the file.
    - `export theta`: Don't crash on bins where reference is NaN. (#168)
    - `metrics`, `descriptives`: Handle degenerate/trivial cases consistently. (#202)
    - `segment`: Handle sample names that are integers with leading zeros (#213)
    - `sex`: Don't crash if chrX and chrY are both missing (#236)
    - VCF parsing (`call`, `scatter`, `segment`):
        - Safely handle small or empty VCF files that previously could trigger a
          crash during BAF calculation. Now, with an empty VCF an all-blank "baf"
          will be emitted. (#218, #224; thanks @chapmanb)
        - Improve handling of Mutect2 VCF files, somewhat. Mutect2 VCFs are still
          not recommended as input to CNVkit; try FreeBayes or GATK HaplotypeCaller
          instead. (#195)
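
    The corrected `call --filter ampdel` behavior in the list above amounts to
    keeping only segments with copy number 0 or at least 5. A minimal sketch,
    using plain dicts rather than CNVkit's own segment objects (the function name
    is hypothetical):

```python
def filter_ampdel(segments):
    """Keep deep deletions (cn == 0) and high-level amplifications (cn >= 5).

    segments: list of dicts with at least a 'cn' key (absolute copy number).
    Segments with 0 < cn < 5 are dropped.
    """
    return [seg for seg in segments if seg["cn"] == 0 or seg["cn"] >= 5]
```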
    
    Python API
    ----------
    
    Moved sub-packages 'genome' and 'tabio' to separate top-level package 'skgenome'
    (#201). The top-level `cnvlib` API is mostly the same otherwise, but supporting
    modules were refactored to decouple `skgenome` from `cnvlib` and remove
    redundancies. In particular:
    
    - Split module `cnvlib.core` into `skgenome.tabio` and `cnvlib.cmdutil`
    - Remove GenomicArray static method `row2label` in favor of functions `to_label`
      and `from_label` in new module `skgenome.rangelabel`.
    - The SEG writer in 'tabio' now replaces chromosome names with 1-based integer
      indices, per SEG spec/convention. The `export seg` command now uses this
      writer directly.
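
    The chromosome renaming in the SEG writer can be pictured with the sketch
    below. It assumes indices are assigned in order of first appearance in the
    already sorted segments; the function names and tuple layout are hypothetical.

```python
def chromosome_indices(chrom_names):
    """Map chromosome names to 1-based integers in order of first appearance."""
    index = {}
    for name in chrom_names:
        if name not in index:
            index[name] = len(index) + 1
    return index


def to_seg_rows(sample_id, segments):
    """Build SEG-style rows with integer chromosome indices.

    segments: list of (chromosome, start, end, probes, log2) tuples.
    """
    index = chromosome_indices(seg[0] for seg in segments)
    return [(sample_id, index[chrom], start, end, probes, log2)
            for chrom, start, end, probes, log2 in segments]
```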
    
    Scripts
    -------
    
    - Remove the script `coverage_bin_size.py`, previously deprecated in favor of
      the `autobin` command.
    - Add `skg_convert.py` to convert between tabular formats.
    - Add `cnn_annotate.py` to replace the 'gene' field for each bin in a .cnn or
      .cnr file, given a gene annotation database like refFlat.txt. The need for
      this comes up occasionally when users notice at the end of an analysis that
      vendor-annotated targets are not the desired gene names.
  • v0.8.5
    e8e4c9ac · Bump version to 0.8.5 ·
    Version 0.8.5
    
    New 'autobin' command, replacing the script `coverage_bin_size.py`. Fixed some
    bugs and usability issues. Unit tests improved, especially for the
    'cnvlib.genome' sub-package.
    
    Dependencies
    ------------
    
    - Pandas 0.18.1 is once again supported. Previously the minimum version was
      0.19.1. (chapmanb/bcbio-nextgen#1836)
    - Pysam minimum version is still 0.9.1.4, but slightly older versions in the
      0.9 series may still work too. (#192)
    
    Commands
    --------
    
    `autobin`:
    
    - New command, replacing and extending the script `coverage_bin_size.py`. The
      script is still included (and shares most of the same code), but is considered
      deprecated and will be removed in the 0.9.0 release. (#170)
    - In 'amplicon' and 'hybrid' modes, ensure the regions sampled for coverage are
      the same in every run by setting the random seed. (#191)
    
    `antitarget`, `autobin`, `batch`:
    
    - Fix an issue in GenomicArray.subtract() that caused some of the expected
      output regions to be missing. In cases where this caused an entire chromosome
      to be lost, the `coverage_bin_size.py` script and the `autobin` and `batch`
      commands in `hybrid` mode would crash. (chapmanb/bcbio-nextgen#1799)
    
    `batch`, `diagram`:
    
    - Fix creation of chromosomal diagrams with `--diagram` and the `diagram`
      command. (#190)
    
    `export`:
    
    - In `export seg`, use 1-based indexing in the SEG output. (#197)
    - Fix `export cdt` format; it was generating Java TreeView (jtv) earlier.
  • v0.8.4
    8a5e3c28 · Bump version to 0.8.4 ·
    Version 0.8.4
    
    This minor release focuses on improving usability and fixing some bugs.
    Documentation is updated (thanks @kyleabeauchamp for #186).
    
    Dependencies
    ------------
    
    - Raise minimum pandas version from 0.18.1 to 0.19.0
    - Raise minimum matplotlib version to 1.3.1
    
    Commands
    --------
    
    `fix`, `metrics`:
    
    - Set PRNG seed to ensure reproducible results. The pipeline is now fully
      repeatable with identical results if run in serial, i.e. without `-p`.
    
    `fix`, `reference`:
    
    - Ensure bias smoothing window size is at least 5. This reduces the occurrence of
      0-log2, 0-spread bins on a 32-bin dataset, but doesn't eliminate it. (#181)
    
    `fix`:
    
    - Don't complain about mismatched sample IDs if antitargets are blank. This
      allows reusing a blank "MT" file in a shell loop for WGS and amplicon data.
    
    `reference`:
    
    - Make antitargets (antitarget.bed or \*.antitargetcoverage.cnn) an optional
      argument. Previously this argument was required, so processing WGS or amplicon
      data, which has no off-target regions or reads, required the user to create
      and provide a blank BED file or appropriately named, empty .cnn files. (#183)
    
    `segment`:
    
    - Don't log "Dropped 0 low-coverage bins". Only log when it actually drops bins.
    
    `diagram`, `heatmap`:
    
    - Add option `--no-shift-xy`. Shifting X and Y according to reference and sample
      sex was done in diagram, but not heatmap. Now it's optional in both.
    
    `heatmap`:
    
    - Add a legend of log2 ratio colors to the plot. (#36)
    - Add options `-x`/`--sample-sex` and `-y`/`--male-reference`. (#172)
    
    `gender`/`sex`:
    
    - Rename 'gender' command to 'sex', with shim for backward compatibility. (#182)
    - In other commands, the `-g`/`--gender` argument is renamed to
      `-x`/`--sample-sex`, also with a compatibility shim. Argument values `x` and
      `y` are accepted in addition to `f`/`female` and `m`/`male`, respectively.
    
    `import-picard`:
    
    - Deprecate searching a directory tree for files. It was a vestige of early lab
      work, and makes a shaky assumption about Picard CalculateHsMetrics
      ``--PER_TARGET_COVERAGE`` output filenames.
    
    API
    ---
    
    - The ``do_*`` function implementations moved to their named modules. The
      ``do_*`` functions can still be called or imported from the `cnvlib` and
      `cnvlib.commands` modules.
    - All parsing and serialization of "chr:start-end" genomic region labels is
      consolidated under a new module, `cnvlib.genome.rangelabel`. These functions
      are used in tabio.textcoord, GenomicArray.labels(), and elsewhere to ensure
      consistent behavior.
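
    Parsing and serializing "chr:start-end" labels in one place might look like
    the sketch below. The names `parse_label` and `format_label` are hypothetical,
    and the coordinate convention (1-based labels, 0-based half-open tuples) is an
    assumption made for the example.

```python
import re

LABEL_RE = re.compile(r"^(?P<chrom>[^:]+):(?P<start>\d+)-(?P<end>\d+)$")


def parse_label(text):
    """Parse 'chr:start-end' into (chromosome, start, end).

    Assumption: labels are 1-based and inclusive, while the returned
    start is 0-based (half-open), so 1 is subtracted from the start.
    """
    match = LABEL_RE.match(text.strip())
    if match is None:
        raise ValueError("Not a region label: {!r}".format(text))
    return (match.group("chrom"),
            int(match.group("start")) - 1,
            int(match.group("end")))


def format_label(chrom, start, end):
    """Inverse of parse_label: 0-based half-open tuple to 1-based label."""
    return "{}:{}-{}".format(chrom, start + 1, end)
```

    Keeping both directions side by side makes it easy to check that a label
    survives a round trip unchanged.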
    
    Internal
    --------
    
    - `cnvlib.genome`: Handle nested bins correctly in the `merge`, `flatten`, and
      `intersect` modules, functions and GenomicArray methods. Verified with
      thorough unit tests.
    - VCF: If the paired normal sample's genotypes are all 0/0 or missing, fall back
      to `--zygosity-freq` (inference from b-allele frequency) rather than marking
      all variants as somatic.  Then infer and drop additional somatic SNVs based on
      genotype after parsing, and only if that wouldn't drop all records.  This
      allows CNVkit to safely distinguish somatic vs. germline in VCFs from Mutect2,
      though Mutect2 is still not recommended. (#184)
  • v0.8.3
    fa444809 · Bump version to 0.8.3 ·
  • v0.8.2
    7b78a276 · Bump version to 0.8.2 ·
    Version 0.8.2
    
    This release covers a number of internal changes to improve the stability and
    consistency of CNVkit, as well as new and improved command options to make more
    features available from the command line.
    
    Due to a slight change in the binning procedure (see `target` and `antitarget`
    below), newly generated target and antitarget BED files, or a reference
    generated with `batch`, may not use the same bin boundaries as earlier versions.
    CNVkit will check these files for consistency and alert you if your BED or .cnn
    files do not match because of this change, e.g. running `batch` from scratch
    with the same panel but with two different CNVkit versions. If you want to
    update CNVkit mid-project, either keep using the same reference.cnn file as
    before for all new samples (as always), or regenerate all your
    targetcoverage.cnn and antitargetcoverage.cnn files to build a new reference.
    
    Dependencies
    ------------
    
    - pyvcf: No longer needed. Instead, parse VCFs with pysam, which is noticeably
      faster and better able to handle newer VCF and gVCF features. (#159)
    - pysam: Raise minimum version to 0.9.1.4.
    
    Global changes
    --------------
    
    - When extracting a sample ID from a filename, instead of trimming everything
      after the first '.' character, only drop known or single-part extensions.  For
      example, "Case1.exome.tumor.bam" and "Case1.exome.tumor.vcf.gz" will now
      resolve to the sample ID "Case1.exome.tumor" instead of "Case1". Output files
      will be named like "Case1.exome.tumor.cnr" instead of "Case1.cnr", avoiding
      potential naming conflicts in the `batch` command when processing multiple
      samples. (#48)
    - Always sort regions by genomic coordinates after reading a file. This doesn't
      modify the input file in-place, but ensures the output files are always sorted
      the same way.
    - Gender detection is more robust. It now uses Mood's median test instead of the
      Mann-Whitney rank test. As a fallback for edge cases, e.g. only one segment
      per chromosome, it compares difference of weighted medians in autosomes versus
      sex chromosomes.
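
    The sample ID rule in the first item above can be sketched as follows. The
    list of known extensions is an illustrative assumption, not CNVkit's actual
    list, and `sample_id_from_filename` is a hypothetical name.

```python
import os

# Assumption: an illustrative set of "known" multi-part extensions.
KNOWN_EXTENSIONS = (".vcf.gz", ".targetcoverage.cnn", ".antitargetcoverage.cnn")


def sample_id_from_filename(path):
    """Derive a sample ID without truncating at the first '.' character."""
    base = os.path.basename(path)
    for ext in KNOWN_EXTENSIONS:
        if base.endswith(ext):
            return base[:-len(ext)]
    # Otherwise drop only the last, single-part extension.
    stem, _ext = os.path.splitext(base)
    return stem
```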
    
    VCF parsing:
    
    - Improve handling of VCFs from Mutect2 (#122, #153) and bcftools (#146).
    - Don't reject records where FILTER is 'PASS' or '.'.
    - VCF options are now consistent across the commands that can use them (`call`,
      `scatter`, `segment`, `export theta` and `export nexus-ogt`).
    - New VCF option -z/--zygosity-freq to override VCF genotype calls. (#153, #132)
    
    Commands
    --------
    
    target, antitarget:
    
    - Divide bins evenly, using the same internal mechanism (the new
      GenomicArray.subdivide() method). Previously, subdivided regions were not
      always equal-sized as they should have been. Now, the coordinates of newly
      generated targets from a baits BED file may be a little different than before.
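
    Even subdivision of a region can be sketched as below. This is not the
    GenomicArray.subdivide() implementation; `subdivide_region` is a hypothetical
    standalone function showing one way to keep bin sizes within 1 base of each
    other.

```python
def subdivide_region(start, end, avg_size):
    """Split [start, end) into near-equal bins close to avg_size.

    Boundaries are computed with integer arithmetic over the whole
    region, so bin sizes differ by at most 1 base.
    """
    length = end - start
    if length <= 0:
        return []
    n_bins = max(1, int(round(length / float(avg_size))))
    edges = [start + (length * i) // n_bins for i in range(n_bins + 1)]
    return list(zip(edges[:-1], edges[1:]))
```

    Computing every edge from the region's total length, rather than repeatedly
    adding a fixed step, avoids leaving one short leftover bin at the end.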
    
    target:
    
    - Drop zero-width bins (#167).
    - Improve assignment of gene names to targets in WGS datasets. (#164)
    - Accept any supported region format for --annotate, including BED, interval
      list and GFF, in addition to the already supported UCSC refFlat. The format is
      detected automatically. (#163)
    - Raise an error if the given annotations file (refFlat or equivalent) and the
      given baited/targeted intervals do not have any overlapping chromosomes.
    
    antitarget:
    
    - Set the default average bin size to 150kb. Previously, the CLI default was
      200kb, but the API default was 100kb; experience shows 150kb works well.
    
    access:
    
    - Avoid a possible error when more than 1000 small regions are excluded from a
      single sequencing-accessible region. (#150)
    
    coverage:
    
    - Fix a unicode vs. bytes incompatibility on Python 3. (#147)
    - Fix a crash if the input BED has more than 4 columns.
    
    reference:
    
    - Add -g/--gender option to declare the chromosomal sex of the input sample(s)
      (same for all), instead of detecting/guessing for each sample. (#161)
    - Ensure printed table of bad bins is a reasonable width. (#140)
    
    segment:
    
    - With a VCF (`-v`), don't output 'cn1' and 'cn2' columns; calculate the 'baf'
      column the same as in `call`. (#148)
    - Improve memory efficiency somewhat when using a VCF. (#162)
    - Fix possible 1-base overlap of output segments when using the `cbs` or
      `flasso` methods. Specifically, the start positions were erroneously all
      shifted 1 base to the left before. (#158)
    
    scatter, heatmap:
    
    - Improve rendering of genomes much smaller than the human genome, e.g. yeast,
      by scaling telomere padding to the total genome size.  The blank space at
      chromosome boundaries was set to a fixed number of basepairs, but is now
      calculated as 0.3% of the whole genome size (sum of chromosome lengths) --
      which works out the same for the human genome. (#155)
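
    The padding calculation described above reduces to a one-line formula. A
    minimal sketch, with a hypothetical function name and toy chromosome sizes:

```python
def telomere_padding(chrom_sizes, per_mille=3):
    """Blank space (in bp) to leave at each chromosome boundary.

    chrom_sizes: dict of chromosome name -> length in bp.
    per_mille=3 corresponds to the 0.3% figure in the notes above.
    """
    total = sum(chrom_sizes.values())
    # Integer arithmetic avoids floating-point rounding surprises.
    return total * per_mille // 1000


# Toy sizes for illustration only, not a real genome.
toy_genome = {"I": 1000, "II": 2000}
print(telomere_padding(toy_genome))
```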
    
    scatter:
    
    - Add option `--segment-color`. Now you can choose 'red' if you like.
    
    metrics:
    
    - Input `-s`/`--segments` is now optional. If not given, compare bin log2 values
      to chromosome medians instead of segment means.
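
    Without segments, the per-bin deviations could be computed as in the sketch
    below. It covers only the centering step (bin log2 minus the chromosome
    median); how `metrics` then summarizes the deviations is not shown, and the
    function name is hypothetical.

```python
from collections import defaultdict
from statistics import median


def residuals_from_chrom_medians(bins):
    """Return each bin's log2 minus the median log2 of its chromosome.

    bins: list of (chromosome, log2) tuples. Output keeps the input order.
    """
    by_chrom = defaultdict(list)
    for chrom, log2 in bins:
        by_chrom[chrom].append(log2)
    medians = {chrom: median(vals) for chrom, vals in by_chrom.items()}
    return [log2 - medians[chrom] for chrom, log2 in bins]
```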
    
    import-theta, export theta:
    
    - Drop sex chromosomes, since THetA2 doesn't handle them well. (#103, #153)
    
    API changes
    -----------
    
    tabio:
    
    - Read new formats: GFF (simply); UCSC genePred refFlat; sub-formats bed3, bed4
    - Detect more formats with `tabio.read_auto`: BED, interval list, text
      coordinates (chr:start-end), refFlat, GFF, TSV with column names.
    - Remove module `ngfrills.regions`, no longer needed.
    
    GenomicArray:
    
    - Moved to new sub-package 'genome'
    - Rename method `select` to `filter`
    - Rename method `match_to_bins` to `into_ranges` and generalize.
    - New methods `flatten`, `merge`, `resize_ranges`, `subdivide`, `subtract`
    
    In general, the 'genome' functionality can be reached by using the `tabio`
    sub-package to load a GenomicArray instance and use its methods
    directly:
    
        from cnvlib import tabio
        # `filename` is the path to a BED, interval list or other supported file
        regions = tabio.read_auto(filename)
        # Generate 500bp flanking regions
        flanks = regions.resize_ranges(500).merge().subtract(regions)
  • v0.8.1
    fb83f230 · Bump version to 0.8.1 ·
    Version 0.8.1
    
    This is primarily a bugfix release.
    The [documentation](https://cnvkit.readthedocs.io/) is also improved,
    particularly covering the cnvlib API.
    
    API:
    
    - For convenience in scripting, the relevant functions for running each CLI
      command (`cnvlib.commands.do_*`) are exported to the top level. For example:
      `import cnvlib; cnvlib.do_batch(...)`
    
    Bug fixes:
    
    - `access`: Avoid a type-validation error on Python 3. (#141)
    
    - `batch`: Parallel processing now selects an appropriate number of workers for
      each step of the pipeline, reducing CPU contention when processing multiple
      samples in parallel. (#138)
    
    - `call`: Apply the `ci` and `sem` filters before calculating b-allele
      frequencies and absolute copy number, as these filters can alter the final
      calls.
    
    - `reference`: Safely handle an edge case in detecting gender from sample
      coverage depths when all bins have identical coverage depth, e.g. no coverage.
      (#144)
    
    - `segment`: Fix handling and segmentation of SNV allele frequencies from a VCF.
      Ensure output column ordering is correct. Avoid a crash that could occur when
      SNV segmentation produces a segment that does not cover any coverage bins.
      (chapmanb/bcbio-nextgen#1590)
    
    - tabio: Improve handling of empty files, including VCFs with no samples and/or
      no locus records. If records and samples are present but genotypes are missing
      or undetectable, `scatter`, `call` and `export` would previously reject all
      records when filtering for SNPs, but will now accept all records instead.