Version 0.8.2
=============

This release covers a number of internal changes to improve the stability and consistency of CNVkit, as well as new and improved command options to make more features available from the command line.

Due to a slight change in the binning procedure (see `target` and `antitarget` below), newly generated target and antitarget BED files, or a reference generated with `batch`, may not use the same bin boundaries as earlier versions. CNVkit will check these files for consistency and alert you if your BED or .cnn files do not match because of this change, e.g. when running `batch` from scratch with the same panel but with two different CNVkit versions.

If you want to update CNVkit mid-project, either keep using the same reference.cnn file as before for all new samples (as always), or regenerate all your targetcoverage.cnn and antitargetcoverage.cnn files to build a new reference.

Dependencies
------------

- pyvcf: No longer needed. Instead, parse VCFs with pysam, which is noticeably faster and better able to handle newer VCF and gVCF features. (#159)
- pysam: Raise minimum version to 0.9.1.4.

Global changes
--------------

- When extracting a sample ID from a filename, instead of trimming everything after the first '.' character, only drop known or single-part extensions. For example, "Case1.exome.tumor.bam" and "Case1.exome.tumor.vcf.gz" will now resolve to the sample ID "Case1.exome.tumor" instead of "Case1". Output files will be named like "Case1.exome.tumor.cnr" instead of "Case1.cnr", avoiding potential naming conflicts in the `batch` command when processing multiple samples. (#48)
- Always sort regions by genomic coordinates after reading a file. This doesn't modify the input file in-place, but ensures the output files are always sorted the same way.
- Gender detection is more robust. It now uses Mood's median test instead of the Mann-Whitney rank test. As a fallback for edge cases, e.g.
  only one segment per chromosome, it compares the difference of weighted medians in autosomes versus sex chromosomes.

VCF parsing:

- Improve handling of VCFs from Mutect2 (#122, #153) and bcftools (#146).
- Don't reject records where FILTER is 'PASS' or '.'.
- VCF options are now consistent across the commands that can use them (`call`, `scatter`, `segment`, `export theta` and `export nexus-ogt`).
- New VCF option -z/--zygosity-freq to override VCF genotype calls. (#153, #132)

Commands
--------

target, antitarget:

- Divide bins evenly, using the same internal mechanism (the new GenomicArray.subdivide() method). Previously, subdivided regions were not always equal-sized as they should have been. Now, the coordinates of newly generated targets from a baits BED file may be a little different than before.

target:

- Drop zero-width bins (#167).
- Improve assignment of gene names to targets in WGS datasets. (#164)
- Accept any supported region format for --annotate, including BED, interval list and GFF, in addition to the already supported UCSC refFlat. The format is detected automatically. (#163)
- Raise an error if the given annotations file (refFlat or equivalent) and the given baited/targeted intervals do not have any overlapping chromosomes.

antitarget:

- Set the default average bin size to 150kb. Previously, the CLI default was 200kb, but the API default was 100kb; experience shows 150kb works well.

access:

- Avoid a possible error when more than 1000 small regions are excluded from a single sequencing-accessible region. (#150)

coverage:

- Fix a unicode vs. bytes incompatibility on Python 3. (#147)
- Fix a crash if the input BED has more than 4 columns.

reference:

- Add -g/--gender option to declare the chromosomal sex of the input sample(s) (same for all), instead of detecting/guessing for each sample. (#161)
- Ensure printed table of bad bins is a reasonable width.
  (#140)

segment:

- With a VCF (`-v`), don't output 'cn1' and 'cn2' columns; calculate the 'baf' column the same as in `call`. (#148)
- Improve memory efficiency somewhat when using a VCF. (#162)
- Fix possible 1-base overlap of output segments when using the `cbs` or `flasso` methods. Specifically, the start positions were erroneously all shifted 1 base to the left before. (#158)

scatter, heatmap:

- Improve rendering of genomes much smaller than the human genome, e.g. yeast, by scaling telomere padding to the total genome size. The blank space at chromosome boundaries was set to a fixed number of basepairs, but is now calculated as 0.3% of the whole genome size (sum of chromosome lengths) -- which works out the same for the human genome. (#155)

scatter:

- Add option `--segment-color`. Now you can choose 'red' if you like.

metrics:

- Input `-s`/`--segments` is now optional. If not given, compare bin log2 values to chromosome medians instead of segment means.

import-theta, export theta:

- Drop sex chromosomes, since THetA2 doesn't handle them well. (#103, #153)

API changes
-----------

tabio:

- Read new formats: GFF (simply); UCSC genePred refFlat; sub-formats bed3, bed4
- Detect more formats with `tabio.read_auto`: BED, interval list, text coordinates (chr:start-end), refFlat, GFF, TSV with column names.
- Remove module `ngfrills.regions`, no longer needed.

GenomicArray:

- Moved to new sub-package 'genome'
- Rename method `select` to `filter`
- Rename method `match_to_bins` to `into_ranges` and generalize.
- New methods `flatten`, `merge`, `resize_ranges`, `subdivide`, `subtract`

In general, the 'genome' functionality can be reached by using the `tabio` sub-package to load a GenomicArray instance and use its methods directly:

    from cnvlib import tabio
    regions = tabio.read_auto(filename)
    # Generate 500bp flanking regions
    regions.resize_ranges(500).merge().subtract(regions)
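
To make the flanking-region one-liner above more concrete, here is a minimal plain-Python sketch of the same interval arithmetic on a single chromosome. It does not use CNVkit: the helpers `resize`, `merge` and `subtract` are hypothetical stand-ins, and the sketch assumes that `resize_ranges(500)` pads each region by 500 bases on both sides and that coordinates are half-open (start, end) pairs. Check the GenomicArray documentation for the actual behavior.

```python
# Illustrative stand-ins only; these are NOT the CNVkit GenomicArray methods.

def resize(ranges, pad):
    """Grow each (start, end) range by `pad` bases on both sides, clipping at 0."""
    return [(max(0, start - pad), end + pad) for start, end in ranges]


def merge(ranges):
    """Combine overlapping or touching ranges into single ranges."""
    merged = []
    for start, end in sorted(ranges):
        if merged and start <= merged[-1][1]:
            # Overlaps (or touches) the previous range: extend it
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def subtract(ranges, excluded):
    """Remove every part of `ranges` that is covered by `excluded`."""
    excluded = merge(excluded)
    result = []
    for start, end in ranges:
        cursor = start
        for ex_start, ex_end in excluded:
            if ex_end <= cursor:
                continue  # excluded range lies entirely before the cursor
            if ex_start >= end:
                break  # excluded range lies entirely after this range
            if ex_start > cursor:
                result.append((cursor, ex_start))
            cursor = max(cursor, ex_end)
        if cursor < end:
            result.append((cursor, end))
    return result


# Made-up target coordinates on one chromosome
targets = [(1000, 1200), (1500, 1800), (5000, 5300)]

# Pad, merge the padded ranges, then cut the original targets back out
flanks = subtract(merge(resize(targets, 500)), targets)
print(flanks)
```

The first two targets sit less than 1000 bases apart, so their padded ranges overlap and are merged before the targets are subtracted. This is why the gap between them appears once in the result rather than as two overlapping flanks.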