Version 0.9.7-beta

This release contains several major enhancements that are particularly relevant to
germline analysis. Before using it in production pipelines, further evaluation and
benchmarking would be wise. Highlights:

**Control sample clustering**: To make better use of larger reference sample pools,
`reference --cluster` will correlate the given normal samples' bin-wise coverage depths
to extract clusters to be used as reference profiles. The reference .cnn file produced
this way will then contain the `log2` and `spread` summary statistics for each cluster,
in addition to the global summary stats. Given this "clustered reference" profile, `fix
--cluster` will then correlate each test sample to each clustered `log2` profile in the
reference to choose the most relevant control pool for normalization. The `batch` option
`--cluster` will perform both of these steps. A nod to the Gambin lab and the authors of
ExomeDepth, CoNVaDING, CLAMMS, and others for inspiration. (#308)

Calculation of bin weights has changed. **This will change your segmentation results**,
hopefully for the better. Details below. (#429)

The `batch` pipeline now performs some **segmentation post-processing** automatically:
calculating and filtering segmentation calls by 50% confidence intervals of the segment
mean log2 ratios, in order to reduce false positives, followed by separate bin-level
testing to detect small (e.g. exon-size) CNVs that were not caught by segmentation.
The bin- and segment-level results are returned as separate .cns files; deciding whether
and how to combine or use these results together is left as an exercise for the user.
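
As a rough illustration of the confidence-interval filter, here is a minimal sketch. It assumes each segment carries the lower and upper bounds of its 50% CI, and that a segment whose interval spans zero is treated as copy-neutral. The `Segment` record, the `classify_by_ci` function, and the toy values are hypothetical and not part of CNVkit's API.

```python
from collections import namedtuple

# Hypothetical record type for illustration only; real segments live in .cns tables.
Segment = namedtuple("Segment", "chrom start end log2 ci_lo ci_hi")


def classify_by_ci(seg):
    """Label a segment by where its confidence interval sits relative to zero."""
    if seg.ci_lo > 0:
        return "gain"
    if seg.ci_hi < 0:
        return "loss"
    return "neutral"  # interval overlaps zero, so no change is supported


# Toy values, made up for this example.
segments = [
    Segment("chr1", 1000, 50000, 0.45, 0.38, 0.52),
    Segment("chr1", 50000, 90000, 0.08, -0.03, 0.15),
    Segment("chr2", 2000, 70000, -0.60, -0.71, -0.49),
]
labels = [classify_by_ci(s) for s in segments]
```

A narrower interval such as 50% lets more segments clear the zero line than a 95% interval would, so the filter is a fairly lenient one.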

We've **dropped Python 2.7 support**. Python version 3.5 or later is now required.

This is a beta release. Please let me know how it works for you via the Issues page. If
this release contains any issues that are blocking your work, try installing one of the
previous stable versions 0.9.6 or 0.9.5::

    conda install cnvkit=0.9.6

Dependencies
------------

- Remove all Python 2.7 compatibility shims.
- Raise minimum pandas version from 0.20.1 to 0.23.3.
- Add scikit-learn (dependency of pomegranate, for HMM segmentation). Remove the older
  hmmlearn implementation.

Commands
--------

`batch`:

- Post-process segments with `segmetrics` (50% CI), `call` (filter by CI, but don't call
  integer copy number), and `bintest`.
- Return `bintest` result as a separate, independent .cns output.
- Add option `--segment-method`, equivalent to `segment -m`.
- Rename option `--method` to `--seq-method` (`--method` is still accepted for now).
- Add option `--cluster`, passed to `reference` and `fix` if given. (#308)

`bintest`:

- New command superseding `cnv_ztest.py` script.
- Report p-value as a column `p_bintest` (previously `ztest`) in the .cns output.
- Fix probabilities for positive log2 values, i.e. gains, which previously always had
  p-value = 1.0. (#429)
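
The notes do not spell out what caused the old behavior for gains. One way a positive z-score can end up with a p-value near 1.0 is when only the lower tail of the normal distribution is tested, and the sketch below contrasts that with a two-sided test. It is an illustration under that assumption, not the `bintest` implementation, and the function names are hypothetical.

```python
import math


def lower_tail_p(z):
    """One-sided P(Z <= z) for a standard normal variable."""
    return 0.5 * math.erfc(-z / math.sqrt(2))


def two_sided_p(z):
    """Two-sided P(|Z| >= |z|) for a standard normal variable."""
    return math.erfc(abs(z) / math.sqrt(2))


# A strongly positive z-score (a gain) looks insignificant in the lower tail
# but is just as significant as the matching loss in a two-sided test.
gain_lower = lower_tail_p(3.0)
gain_two_sided = two_sided_p(3.0)
loss_two_sided = two_sided_p(-3.0)
```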

`fix`:

- Change calculation of bin weights to be more consistent with `1-var` meaning,
  with more emphasis on reference spread. It is now simpler, more consistent with
  `import-rna`, and particularly improves the accuracy of `bintest`. (#429)
- Squeeze the range of reference-free weights.
- Drop bins with GC content outside [0.3, 0.7]. The CLAMMS paper shows these bins carry
  no useful signal.
- With `--cluster` and a clustered reference input, calculate the test sample's Pearson
  correlation versus each cluster's log2, and take the best one for normalization.
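
The cluster selection step can be sketched in plain Python as follows. This is a minimal illustration of "correlate, then take the best", assuming the test sample and each cluster profile are log2 values over the same bins; `pearson`, `best_cluster`, and the toy profiles are hypothetical names and data, not CNVkit's API.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences.

    Assumes neither sequence is constant (nonzero variance).
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def best_cluster(test_log2, cluster_profiles):
    """Return the name of the cluster profile most correlated with the test sample."""
    return max(cluster_profiles,
               key=lambda name: pearson(test_log2, cluster_profiles[name]))


# Toy values, made up for this example.
test_log2 = [0.1, -0.2, 0.4, 0.0, -0.5]
cluster_profiles = {
    "cluster_0": [0.2, -0.1, 0.5, 0.1, -0.4],  # test sample shifted by +0.1
    "cluster_1": [-0.3, 0.4, -0.1, 0.2, 0.3],  # moves mostly in the opposite direction
}
chosen = best_cluster(test_log2, cluster_profiles)
```

Because Pearson correlation ignores constant offsets, a cluster that differs from the test sample only by a uniform shift in log2 still scores as a perfect match.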

`reference`:

- With `--cluster`, do k-means clustering of the sample bin-level read depth correlation
  matrix, per [Kusmirek et al. 2018](https://doi.org/10.1101/478313).
  Parameter k defaults to the cube root of number of samples. Only clusters of at least
  4 samples are kept for emitting summary statistics in the reference profile.
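
The default `k` and the minimum cluster size can be sketched as below. The notes do not say how the cube root is rounded, so rounding to the nearest integer (with a floor of 1) is an assumption here, and `default_k` and `clusters_to_keep` are hypothetical helper names. The k-means step itself is omitted.

```python
from collections import Counter


def default_k(n_samples):
    """Cube root of the sample count, rounded to the nearest integer (assumed rounding)."""
    return max(1, int(round(n_samples ** (1.0 / 3.0))))


def clusters_to_keep(labels, min_size=4):
    """Return the cluster labels that have at least `min_size` member samples."""
    counts = Counter(labels)
    return sorted(label for label, size in counts.items() if size >= min_size)


# Toy cluster assignments for 12 samples: cluster 1 has only 3 members.
labels = [0, 0, 0, 0, 0, 1, 1, 1, 2, 2, 2, 2]
kept = clusters_to_keep(labels)
```

Under these assumptions a pool of 27 normal samples would start from 3 clusters, and any cluster that ends up with fewer than 4 samples contributes no summary statistics.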

`segment`:

- hmm: Fix pomegranate-based implementation. Use iterative Savitzky-Golay smoothing with
  a narrow bandwidth.
- Use HMM for post-TCN segmentation on VCF allele frequencies.
- Add parameter for smoothing before CBS (thanks @EwaMarek).

`segmetrics`:

- Add `ttest` option for 1-sample t-test p-value.
- Implement and expose the `--smooth-bootstrap` option. For smoothing, the KDE bandwidth
  is based on each bin's weight as a proxy for the SD of its log2 ratio values. To reduce
  the risk of over-smoothing on larger sample sizes, we use a loose interpretation of
  Silverman's Rule to reduce the bandwidth as the number of bins k in a segment increases
  (by a factor of k^(-1/4)).
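
A smoothed bootstrap along these lines can be sketched as follows. This is a minimal illustration, not the `segmetrics` implementation: it assumes a Gaussian kernel, takes a per-bin SD estimate as input rather than deriving it from the bin weights, and uses hypothetical names (`bandwidth_scale`, `smoothed_bootstrap_ci`).

```python
import random


def bandwidth_scale(n_bins):
    """Shrink factor k^(-1/4) applied to the kernel bandwidth for a segment of k bins."""
    return n_bins ** -0.25


def smoothed_bootstrap_ci(values, sds, alpha=0.5, n_boot=1000, seed=0):
    """Confidence interval of the segment mean from a smoothed bootstrap.

    `sds` holds a per-bin standard deviation estimate, assumed here to have
    been derived from the bin weights beforehand.
    """
    rng = random.Random(seed)
    k = len(values)
    scale = bandwidth_scale(k)
    means = []
    for _rep in range(n_boot):
        total = 0.0
        for _draw in range(k):
            i = rng.randrange(k)  # resample a bin with replacement
            # Jitter the resampled value with Gaussian noise of shrunken width.
            total += rng.gauss(values[i], sds[i] * scale)
        means.append(total / k)
    means.sort()
    lo_idx = int((alpha / 2) * (n_boot - 1))
    hi_idx = int((1 - alpha / 2) * (n_boot - 1))
    return means[lo_idx], means[hi_idx]


# Toy log2 ratios and SD estimates, made up for this example.
ci_lo, ci_hi = smoothed_bootstrap_ci(
    [0.30, 0.42, 0.25, 0.38, 0.33, 0.29, 0.41, 0.36],
    [0.10, 0.12, 0.08, 0.10, 0.09, 0.11, 0.10, 0.10],
)
```

With `alpha=0.5` the returned bounds are the 25th and 75th percentiles of the bootstrap means, i.e. a 50% confidence interval.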

API
---

- `do_heatmap`: Add `ax` parameter (thanks @fbrundu).
- `CNA.residuals()`: Speed up; keep the index intact in the returned pd.Series.
- smoothing: Linearly roll off weights in mirrored wings. Affects `CNA.smoothed()` /
  savgol, but not rolling median bias correction.
- Rename `CNA.smoothed()` to `CNA.smooth_log2()`, since it returns the smoothed log2
  values, not a new/altered CNA.
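
To picture the "mirrored wings" change, here is a minimal sketch of padding a signal with reflected edges and giving the padded points linearly decreasing weights. The exact reflection and ramp endpoints are assumptions, and `mirror_pad_with_weights` is a hypothetical name rather than a CNVkit function.

```python
def mirror_pad_with_weights(values, wing):
    """Pad a signal with mirrored wings and weights that taper linearly outward."""
    values = list(values)
    if not 0 < wing <= len(values):
        raise ValueError("wing must be between 1 and len(values)")
    left = values[:wing][::-1]    # reflection of the first `wing` points
    right = values[-wing:][::-1]  # reflection of the last `wing` points
    # Weights rise toward 1 next to the real data and fall toward 0 at the outer edge.
    ramp = [(i + 1) / (wing + 1) for i in range(wing)]
    padded = left + values + right
    weights = ramp + [1.0] * len(values) + ramp[::-1]
    return padded, weights


padded, weights = mirror_pad_with_weights([1, 2, 3, 4], wing=2)
```

Down-weighting the reflected points means the smoother leans less on duplicated data near the ends of the array.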

Bug fixes
---------

- `batch`: Fix argparse formatting issue (#466)
- `import-rna`: Fix a regression in reading 2-column per-gene counts (`-f counts`).
- `reference`: Fix sex inference/usage when creating haploid-x reference (#459; thanks
  @duartemolha)
- `scatter`: Use a safe matplotlib backend on OS X to avoid crash
- VariantArray: Fix/streamline indexing of variants by bin/segment