.. currentmodule:: tangermeme

===============
Release History
===============

Version 1.0.0
=============

Highlights
----------

- Our first major release, corresponding to the paper publication.
- Changes the `recursive_seqlet` calling algorithm slightly to be more
  principled
- Adds in new design methods and features

seqlets
-------

- The `recursive_seqlet` algorithm has been slightly altered to make the
  calculated p-values more faithful. Rather than calculating a null as the
  empirically observed attribution sum across different lengths, where the
  "p-value" is just the probability that the observed attribution is higher,
  null distributions for different lengths are inferred from the previous
  lengths.

design
------

- `screen` is added in as a new design method that randomly generates
  sequences and chooses the one with the best predictions. Each batch is fast
  because nothing special is done, but each batch is also independent of the
  others, so there is no guarantee that each iteration yields better results.
- Design methods now allow you to not pass in a `y` target value and instead
  will try to just maximize the predictions.

Version 0.5.1
=============

Highlights
----------

- Add `summits` to `extract_loci` to center on summits when a BED10 file is
  provided
- Improve casting of indexes for variant effect predictions
- Slight improvement to the usability of `deep_lift_shap`

Version 0.5.0
=============

Highlights
----------

- The Tomtom and FIMO tools have been moved to memesuite-lite so they can be
  used without a PyTorch dependency
- All internal tools that used Tomtom and FIMO now call the memesuite-lite
  versions

annotate
--------

- The call to tomtom now goes to memesuite-lite

io
----

- `read_meme` now calls the memesuite-lite function and wraps the numpy arrays
  into torch tensors.
- `return_filtered` has been added as an optional parameter to `extract_loci`
  where, if set to true, returns a list of indexes for the loci that are kept
  or discarded.
  Note that the indexes are into the INTERLEAVED LOCI, not the original set of
  indices.

plot
----

- Improved the placement of annotation labels in `plot_logo`. Thanks Nikolaus
  Mandlburger!
- Fixed a bug where annotations were extended an additional basepair to the
  right

seqlet
------

- The `recursive_seqlet` algorithm has been slightly modified to more closely
  match the provided description. This change involves using the calculated
  p-values instead of the maximum p-value for each position across all seqlets
  of smaller size. As a consequence, motifs should not be shifted to the right
  anymore.

utils
-----

- Added `example_to_fasta_coordinates`, which converts the relative
  coordinates in examples to exact coordinates on the genome when provided
  with a BED file of examples and a FASTA file. This is useful if you have
  seqlet coordinates for each example and need to convert them to positions on
  the genome.

Version 0.4.4
=============

io
----

- Added `one_hot_to_fasta` which takes a 3D one-hot encoded tensor and an
  optional list of headers and outputs a FASTA file with those sequences.

plot
----

- Added `plot_attributions` which wraps the calculation and the visualization
  of attributions between multiple models and multiple sequences.
- Added `show_score` to `plot_logo` where you can optionally hide the score
  from the visualization

predict
-------

- Added `dtype` to `predict`, which will autocast the model and the data to
  the desired dtype to increase speed. Currently only supports the dtypes
  supported by torch.autocast. This allows datasets to be represented as
  torch.uint8 and only converted to higher precision in each batch, yielding
  significant memory savings.

tools/fimo
----------

- Fixed a bug to allow dict[str: numpy.ndarray] to be used for the motifs.
  Thanks @SeppeDeWinter!

Version 0.4.3
=============

ersatz
------

- Substitute now accepts Ns or all-zeroes positions as inputs and, at those
  positions, will not alter the original sequence.
  If only one motif is given, this will be the same across all background
  sequences. If one motif is given per background sequence, this is done on a
  per-background-example basis.
- The above change means that higher-level functions like `marginalize` can
  now be run with motifs that contain missing characters, without any changes
  needed.
- The default `start` and `end` of `dinucleotide_shuffle` have been set to
  `None` because using `0` and `-1` meant that the last provided position
  never got shuffled.

design
------

- Changed `mask` parameter to `output_mask`
- Added `input_mask` which restricts what positions can be the start of
  motifs, so design can be restricted to subsets of the sequence or certain
  important elements can be ignored.
- Significantly sped up the creation of sequences with tiled motifs implanted
  using a numba function, which can speed up design 3-10x.
- Added in `greedy_marginalize` which designs constructs using
  marginalizations

Version 0.4.1
=============

Highlights
----------

plots
-----

- Fixed a bug where `plot_logo` raises an error when `start` and `end` are not
  provided but `annotations` are.
- Fixed a bug where `plot_logo` plots annotations using calls to `plt` instead
  of directly on the provided artboard.

tools
-----

- Sped up `tomtom` by using more compact dtypes and avoiding cache misses
- Added `symmetric_tomtom` which takes in a set of items and orders them such
  that the smaller item is always the query and the larger one is always the
  target. This reduces the number of background distributions that need to be
  made from a quadratic number to a linear one, significantly speeding up the
  algorithm.

utils
-----

- Added `reverse_complement` function that can convert one-hot encodings and
  strings. Thanks @Al-Murphy!

Version 0.4.0
==============

Highlights
----------

- At a high level, this release focuses on quick ways to understand what a
  model has learned.
  This means extending seqlet calling functionality as well as introducing
  handling of annotations, which are any sort of notation of span along the
  genome: seqlet calls, motif matches, and hit calls.

annotate
--------

- Added in a new file for handling annotations.
- Includes a `count_annotations` function for converting a sparse list of
  annotations into a dense matrix of counts.
- Also includes a `pairwise_annotations` function for looking at pairs of
  motifs that are learned.
- Also includes a `pairwise_annotations_space` function for looking at spacing
  between pairs of functions.
- Also includes an `annotate_seqlet` function for annotating seqlets using
  TOMTOM and a reference database.

seqlet
------

- Added in `recursive_seqlets`, which calls seqlets using a recursive
  definition that all spans within a seqlet must also be independently called
  as seqlets.

plot
----

- Added in `plot_pwm` that takes in a PWM whose rows sum to 1 and plots the
  information content weighted characters as well as the reverse complement.

utils
-----

- Added in a `pwm_consensus` function that takes in a single PWM and returns a
  one-hot encoded version of the consensus sequence.
- Added in an `extract_signal` function for extracting sums over
  variable-length spans from tensors.

Version 0.3.0
==============

Highlights
----------

- Added in a new TOMTOM implementation and a revamped FIMO implementation
- TOMTOM and FIMO both have command-line tools in `tangermeme`

FIMO
----

- The PyTorch implementation has been exchanged for a numba based one.
- The new signature is a single function called `fimo`
- A command-line tool can be used with the signature `tangermeme fimo ...`

TOMTOM
------

- A numba-based implementation has been added in the function `tomtom`
- A command-line tool can be used with the signature `tangermeme tomtom ...`

utils
-----

- `chunk` and `unchunk` have been added in to chunk long sequences into blocks
  that can be operated on by methods with fixed-window inputs, such as machine
  learning models, and for converting the predictions from these approaches
  back into a contiguous format.

match
-----

- Implemented updates to substantially reduce memory use and runtime of
  `extract_matching_loci`. This was mainly achieved by:

  1) Avoiding `io.extract_loci`, which one-hot encodes all loci into a single
     large tensor. Instead, the locus sequences are extracted one by one,
     keeping only one in memory at a time. The N and GC percentages are
     calculated directly from the sequence, and only those values are stored.
  2) Calculating genome-wide N and GC percentages by taking slices of the
     chromosomal DNA sequences and using the count method of Python strings.
     This is significantly faster than the previous approach using numpy isin,
     and avoids keeping several copies of the sequence in memory at the same
     time.

- Various other changes:

  1) Counts from regions that cannot be extracted from a provided bigwig file
     (such as for a missing chromosome) are now set to nan rather than 0. This
     will affect the threshold value used for filtering background regions.
  2) Small change to the binning strategy for GC values, which could mean that
     matching loci generated in a previous version will not be reproduced
     exactly in all cases, even when using the same random seed.
  3) Enable the handling of 'N' in sequences or [0,0,0,0], i.e. an ambiguous
     genomic position. Updated the `characters()` and the `_validate_input()`
     functions in the `utils` module to enable this.
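
As a rough illustration of the string-counting approach described for
`extract_matching_loci` above, the sketch below computes GC and N fractions
for a slice of a chromosome string using ``str.count``. The function names
``gc_and_n_fraction`` and ``windowed_fractions`` are hypothetical and are not
part of tangermeme, and dividing by the full window length (rather than by the
number of non-N characters) is an assumption made for this example.

.. code-block:: python

    def gc_and_n_fraction(sequence, start, end):
        """Return (GC fraction, N fraction) for sequence[start:end].

        Hypothetical helper: fractions are taken over the full window
        length, which is an assumption, not tangermeme's documented choice.
        """
        # Slice once and normalize case so soft-masked bases are counted.
        window = sequence[start:end].upper()
        if len(window) == 0:
            raise ValueError("window is empty")

        # str.count scans the slice in C, without building arrays or masks.
        n_count = window.count("N")
        gc_count = window.count("G") + window.count("C")
        return gc_count / len(window), n_count / len(window)


    def windowed_fractions(sequence, width):
        """Apply gc_and_n_fraction to consecutive non-overlapping windows."""
        return [
            gc_and_n_fraction(sequence, start, start + width)
            for start in range(0, len(sequence) - width + 1, width)
        ]


    chrom = "ACGTNNGGCC"
    whole = gc_and_n_fraction(chrom, 0, len(chrom))
    halves = windowed_fractions(chrom, 5)

Only the two fractions per window are kept, so the sequence slice can be
discarded as soon as it has been counted.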

Version 0.2.3
==============

match
-----

- Expanded the `ignore` parameter to ignore all non-ACGT characters.

Version 0.2.2
==============

plot
----

- Fixed issue in `plot_logo` raised by @sandyfloren where passing in
  annotations without passing in `start` or `end` would raise an error. Now,
  `start` defaults to 0 and `end` defaults to the length of the sequence.

tools
-----

- FIMO is now base 2 instead of base e, to better match the MEME-suite tool.
  p-values should remain the same but scores will change.
- FIMO `hits` will now return p-values, and will no longer return an
  uninformative `attr` column

product
-------

- `apply_pairwise` has been added along with documentation and unit tests

match
-----

- Fixes an issue with trying to calculate the mean over an array of integers
  by changing the array to be dtype float. via @adamyhe

Version 0.2.1
==============

deep_lift_shap
--------------

- Removed the autocasting to 32-bit floats, enabling attributions to be
  calculated at other resolutions
- Removes ~100 LOC and the DeepLiftShap object, integrating that code directly
  into the `deep_lift_shap` function
- Only assigns hooks once at the beginning of the function and clears them
  upon an error or completion of function, instead of assigning and clearing
  hooks every batch

Version 0.2.0
==============

Highlights
----------

- Alters the API of several functions to make them more general, with the
  option of taking in a function to apply instead of defaulting to predict,
  while remaining backward compatible
- Adds in `deep_lift_shap` and `seqlet` to operate on attributions

deep_lift_shap
--------------

- Added in a stand-alone implementation of deep_lift_shap
- This implementation resolves several issues with Captum, e.g., with pooling
  layers
- Allows batching of example-reference pairs across examples (so batch_size
  can be greater than n_shuffles)
- Allows batch_size to be much smaller than n_shuffles with the results
  aggregated once all references have been processed to allow large models to
  be run
- Allows additional non_linear operations to be registered by passing in a
  dictionary
- Allows the raw multipliers to be returned with `raw_output=True` or the
  aggregated attribution scores

ism
----

- Changes the default output from the raw output (which you can get with
  `raw_output=True`) to aggregated attribution values to make the API
  compatible

marginalize
------------

- Change the signature to take in an optional function that gets applied
  before/after the substitution, default is predict
- Change the signature to take in `**kwargs` that get passed into the optional
  function
- Change the signature to take in `additional_func_kwargs` that is an
  alternative and safer way to pass arguments into the function

ablate
------

- Change the signature to take in an optional function that gets applied
  before/after the ablation, default is predict
- Change the signature to take in `**kwargs` that get passed into the optional
  function
- Change the signature to take in `additional_func_kwargs` that is an
  alternative and safer way to pass arguments into the function

space
------

- Change the signature to take in an optional function that gets applied
  before/after the substitutions, default is predict
- Change the signature to take in `**kwargs` that get passed into the optional
  function
- Change the signature to take in `additional_func_kwargs` that is an
  alternative and safer way to pass arguments into the function

variant_effect
--------------

- Change the name of `marginal_substitution_effect` to `substitution_effect`
- Change the API of `substitution_effect` to take in a tensor of original
  sequences and a tensor of substitutions
- Change the API of `substitution_effect` to take in an optional function and
  `**kwargs` and `additional_func_kwargs` to pass into `func`
- Change the name of `marginal_deletion_effect` to `deletion_effect`
- Change the API of `deletion_effect` to take in a tensor of original
  sequences and a tensor of deletions
- Change the API of
  `deletion_effect` to take in an optional function and `**kwargs` and
  `additional_func_kwargs` to pass into `func`
- Change the name of `marginal_insertion_effect` to `insertion_effect`
- Change the API of `insertion_effect` to take in a tensor of original
  sequences and a tensor of insertions
- Change the API of `insertion_effect` to take in an optional function and
  `**kwargs` and `additional_func_kwargs` to pass into `func`

seqlet
------

- Add a new file for the identification of seqlets
- Add `tfmodisco_seqlets` which is a simplified and documented version of the
  seqlet calling in `tfmodisco` that returns dataframes

Version 0.1.0
==============

Highlights
----------

- This is the first major release of tangermeme and contains the first version
  of the core functionality.

ersatz
------

- This module implements common sequence manipulation methods such as
  substitutions, insertions, deletions, and shufflings of sequences.

predict
-------

- This module implements efficient batched prediction that can handle models
  that accept multiple inputs or multiple outputs.

marginalize
-----------

- This module implements marginalization experiments, where predictions are
  made for a set of sequences, a motif is substituted into the middle, and
  then new predictions are made for the new sequences.

space
-----

- This module implements spacing experiments where predictions are made for a
  set of sequences, a set of motifs are inserted with a given spacing, and
  then new predictions are made for the new sequences.

io
---

- This module implements I/O functions for common data types as well as for
  extracting examples for machine learning models.

ism
---

- This module implements in silico saturated mutagenesis (ISM).

variant_effect
--------------

- This module implements functions for evaluating the marginal effect of
  variants on model predictions.

Version 0.0.1
==============

Highlights
----------

- Initial release