Release History
Version 1.0.0
Highlights
Our first major release, corresponding to the paper publication.
Changes the recursive_seqlet calling algorithm slightly to be more principled
Adds in new design methods and features
seqlets
The recursive_seqlet algorithm has been slightly altered to make the calculated p-values more faithful. Previously, the null was the empirically observed attribution sum across different lengths, and the “p-value” was just the probability that the observed attribution is higher. Now, null distributions for different lengths are inferred from the previous lengths.
design
screen has been added as a new design method that randomly generates sequences and chooses the one with the best predictions. Each batch is fast because nothing special is done, but each batch is also independent of the others, so there is no guarantee that each iteration yields better results.
Design methods now allow you to omit the y target value, in which case they will try to maximize the predictions.
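The random-screening idea described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not tangermeme's implementation: screen_sketch and toy_score are hypothetical names, sequences are plain strings rather than one-hot tensors, and the "model" is just the GC fraction.

```python
import random

ALPHABET = "ACGT"


def toy_score(seq):
    """Hypothetical stand-in for a model prediction: the GC fraction."""
    return (seq.count("G") + seq.count("C")) / len(seq)


def screen_sketch(score_fn, length, batch_size, n_batches, seed=0):
    """Generate random sequences in independent batches and keep the best one."""
    rng = random.Random(seed)
    best_seq, best_score = None, float("-inf")
    history = []  # best score seen within each individual batch

    for _ in range(n_batches):
        batch = ["".join(rng.choice(ALPHABET) for _ in range(length))
                 for _ in range(batch_size)]
        scores = [score_fn(s) for s in batch]
        batch_best = max(scores)
        history.append(batch_best)

        # Only the running best is kept; a later batch may well be worse.
        if batch_best > best_score:
            best_score = batch_best
            best_seq = batch[scores.index(batch_best)]

    return best_seq, best_score, history


best_seq, best_score, history = screen_sketch(
    toy_score, length=20, batch_size=32, n_batches=5)
```

Because no batch uses information from an earlier one, the per-batch values in history need not increase; only the running best is monotone.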
Version 0.5.1
Highlights
Add summits to extract_loci to center on summits when a BED10 file is provided
Improve casting of indexes for variant effect predictions
Slight improvement to the usability of deep_lift_shap
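For the summit centering mentioned above, the following sketch shows how a window could be centered on a summit. It assumes the ENCODE narrowPeak convention, where the tenth column is the summit offset from the start coordinate and -1 means no summit was called. The function name centered_window and its arguments are hypothetical and are not the extract_loci signature.

```python
def centered_window(bed_line, in_window, use_summit=True):
    """Return (chrom, start, end) of a window of width in_window.

    Centers on the summit when a usable tenth column exists, and on the
    midpoint of the interval otherwise.
    """
    fields = bed_line.rstrip("\n").split("\t")
    chrom, start, end = fields[0], int(fields[1]), int(fields[2])

    if use_summit and len(fields) >= 10 and int(fields[9]) >= 0:
        center = start + int(fields[9])  # summit offset is relative to start
    else:
        center = (start + end) // 2

    half = in_window // 2
    return chrom, center - half, center - half + in_window


line = "chr1\t100\t300\tpeak1\t0\t.\t5.0\t-1\t-1\t40"
summit_window = centered_window(line, in_window=50)
midpoint_window = centered_window(line, in_window=50, use_summit=False)
```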
Version 0.5.0
Highlights
The Tomtom and FIMO tools have been moved to memesuite-lite so they can be used without a PyTorch dependency
All internal tools that used Tomtom and FIMO now call the memesuite-lite versions
annotate
The call to tomtom now goes to memesuite-lite
io
read_meme now calls the memesuite-lite function and wraps the numpy arrays into torch tensors.
return_filtered has been added as an optional parameter to extract_loci where, if set to true, returns a list of indexes for the loci that are kept or discarded. Note that the indexes are into the INTERLEAVED LOCI, not the original set of indices.
plot
Improved the placement of annotation labels in plot_logo. Thanks Nikolaus Mandlburger!
Fixed a bug where annotations were extended an additional basepair to the right
seqlet
The recursive_seqlet algorithm has been slightly modified to more closely match the provided description. This change involves using the calculated p-values instead of the maximum p-value for each position across all seqlets of smaller size. As a consequence, motifs should no longer be shifted to the right.
utils
Added example_to_fasta_coordinates, which converts the relative coordinates in examples to exact coordinates on the genome when provided with a BED file of examples and a FASTA file. This is useful if you have seqlet coordinates for each example and need to convert them to positions on the genome.
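The coordinate conversion amounts to adding each example's genomic start to the relative positions. The sketch below is a simplification under stated assumptions: it assumes each example begins exactly at the start of its BED interval and ignores the FASTA file entirely, which may not match how the library resolves windows. The function name to_genome_coordinates is hypothetical.

```python
def to_genome_coordinates(seqlets, bed_intervals):
    """Convert example-relative spans to genome coordinates.

    seqlets: list of (example_idx, start, end), relative to each example.
    bed_intervals: list of (chrom, start, end), one entry per example.
    """
    converted = []
    for example_idx, rel_start, rel_end in seqlets:
        chrom, example_start, _ = bed_intervals[example_idx]
        converted.append(
            (chrom, example_start + rel_start, example_start + rel_end))
    return converted


bed_intervals = [("chr1", 1000, 2000), ("chr2", 500, 1500)]
seqlets = [(0, 10, 25), (1, 100, 110)]
genome_spans = to_genome_coordinates(seqlets, bed_intervals)
```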
Version 0.4.4
io
Added one_hot_to_fasta which takes a 3D one-hot encoded tensor and an optional list of headers and outputs a FASTA file with those sequences.
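As an illustration of the decoding step, the sketch below writes FASTA records from a one-hot encoding held in nested lists. The layout (examples, 4, length) with rows ordered A, C, G, T is an assumption, as are writing all-zero columns as N and using the example index as the default header. The name one_hot_to_fasta_sketch is hypothetical and is not the library function.

```python
import io

ALPHABET = "ACGT"


def one_hot_to_fasta_sketch(X, handle, headers=None):
    """Write one FASTA record per example in X to an open text handle."""
    for i, example in enumerate(X):
        length = len(example[0])
        chars = []
        for pos in range(length):
            column = [example[c][pos] for c in range(len(ALPHABET))]
            # An all-zero column has no base set, so it is written as N.
            chars.append(ALPHABET[column.index(1)] if 1 in column else "N")

        header = headers[i] if headers is not None else str(i)
        handle.write(">{}\n{}\n".format(header, "".join(chars)))


X = [[[1, 0, 0],   # A
      [0, 1, 0],   # C
      [0, 0, 0],   # G
      [0, 0, 0]]]  # T
buffer = io.StringIO()
one_hot_to_fasta_sketch(X, buffer, headers=["seq1"])
fasta_text = buffer.getvalue()
```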
plot
Added plot_attributions which wraps the calculation and the visualization of attributions between multiple models and multiple sequences.
Added show_score to plot_logo where you can optionally hide the score from the visualization
predict
Added dtype to predict, which will autocast the model and the data to the desired dtype to increase speed. Currently only supports the dtypes supported by torch.autocast. This allows datasets to be represented as torch.uint8 and only converted to higher precision in each batch, yielding significant memory savings.
tools/fimo
Fixed a bug to allow dict[str: numpy.ndarray] to be used for the motifs. Thanks @SeppeDeWinter!
Version 0.4.3
ersatz
Substitute now accepts Ns or all-zeroes positions as inputs and, at those positions, will not alter the original sequence. If only one motif is given, the same positions are left unaltered across all background sequences. If one motif is given per background sequence, this is done separately for each background sequence.
The above change means that higher-level functions like marginalize can now be run with motifs that contain missing characters, without any changes needed.
The default start and end of dinucleotide_shuffle have been set to None because using 0 and -1 meant that the last provided position never got shuffled.
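The dinucleotide_shuffle default change follows from ordinary Python slicing, assuming the region to shuffle is selected with a start:end slice. An end of -1 stops before the final element, whereas None runs to the end of the sequence.

```python
seq = "ACGTACGT"

# With start=0 and end=-1, the slice stops before the final character,
# so that character would never be part of the shuffled region.
old_region = seq[0:-1]

# With start=None and end=None, the slice covers the whole sequence.
new_region = seq[None:None]
```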
design
Changed mask parameter to output_mask
Added input_mask which restricts what positions can be the start of motifs, so design can be restricted to subsets of the sequence or certain important elements can be ignored.
Significantly sped up the creation of sequences with tiled motifs implanted using a numba function, which can speed up design 3-10x.
Added in greedy_marginalize, which designs constructs using marginalizations
Version 0.4.1
Highlights
plots
Fixed a bug where plot_logo raises an error when start and end are not provided but annotations are.
Fixed a bug where plot_logo plots annotations using calls to plt instead of directly on the provided artboard.
tools
Sped up tomtom by using more compact dtypes and avoiding cache misses
Added symmetric_tomtom which takes in a set of items and orders them such that the smaller item is always the query and the larger one is always the target. This reduces the number of background distributions that need to be made from a quadratic number to a linear one, significantly speeding up the algorithm.
utils
Added reverse_complement function that can convert one-hot encodings and strings. Thanks @Al-Murphy!
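A reverse complement can be computed for both representations as sketched below. The one-hot version assumes a (4, length) layout with rows ordered A, C, G, T, in which case flipping both axes yields the reverse complement. The function names are illustrative and are not the library's signature.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}


def reverse_complement_str(seq):
    """Reverse the string and complement each character."""
    return "".join(COMPLEMENT[c] for c in reversed(seq))


def reverse_complement_one_hot(X):
    """X is a (4, length) nested list with rows ordered A, C, G, T."""
    # Reversing the row order swaps A<->T and C<->G; reversing each row
    # reverses the positions.
    return [row[::-1] for row in X[::-1]]


one_hot_aacg = [[1, 1, 0, 0],   # A
                [0, 0, 1, 0],   # C
                [0, 0, 0, 1],   # G
                [0, 0, 0, 0]]   # T
rc_string = reverse_complement_str("AACG")
rc_one_hot = reverse_complement_one_hot(one_hot_aacg)
```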
Version 0.4.0
Highlights
At a high level, this release focuses on quick ways to understand what a model has learned. This means extending seqlet calling functionality as well as introducing handling of annotations, which are any sort of notation of a span along the genome, such as seqlet calls, motif matches, and hit calls.
annotate
Added in a new file for handling annotations.
Includes a count_annotations function for converting a sparse list of annotations into a dense matrix of counts.
Also includes a pairwise_annotations function for looking at pairs of motifs that are learned.
Also includes a pairwise_annotations_space function for looking at spacing between pairs of annotations.
Also includes an annotate_seqlet function for annotating seqlets using TOMTOM and a reference database.
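To make the sparse-to-dense conversion behind count_annotations concrete, here is a minimal sketch. It assumes annotations arrive as (example_idx, motif_idx) pairs; the real function's input format and return type may differ, and the name count_annotations_sketch is hypothetical.

```python
def count_annotations_sketch(annotations, n_examples, n_motifs):
    """Turn a sparse list of (example_idx, motif_idx) pairs into a dense
    n_examples x n_motifs matrix of counts."""
    counts = [[0] * n_motifs for _ in range(n_examples)]
    for example_idx, motif_idx in annotations:
        counts[example_idx][motif_idx] += 1
    return counts


annotations = [(0, 1), (0, 1), (2, 0)]
counts = count_annotations_sketch(annotations, n_examples=3, n_motifs=2)
```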
seqlet
Added in recursive_seqlets, which calls seqlets using a recursive definition that all spans within a seqlet must also be independently called as seqlets.
plot
Added in plot_pwm that takes in a PWM whose rows sum to 1 and plots the information content weighted characters as well as the reverse complement.
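Information-content weighting, as commonly used for sequence logos, scales each character's probability by log2(4) minus the entropy of its position. The sketch below uses that standard formula without a small-sample correction; whether plot_pwm computes exactly this is an assumption, and the function name is hypothetical.

```python
import math


def information_content_heights(pwm):
    """pwm: list of positions, each a list of probabilities summing to 1.

    Returns per-character heights in bits for each position.
    """
    heights = []
    for row in pwm:
        entropy = -sum(p * math.log2(p) for p in row if p > 0)
        ic = math.log2(len(row)) - entropy  # bits of information
        heights.append([p * ic for p in row])
    return heights


pwm = [[1.0, 0.0, 0.0, 0.0],      # fully conserved position
       [0.25, 0.25, 0.25, 0.25],  # uninformative position
       [0.5, 0.5, 0.0, 0.0]]      # two equally likely characters
heights = information_content_heights(pwm)
```

A fully conserved position reaches the maximum of 2 bits, while a uniform position contributes nothing to the logo.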
utils
Added in a pwm_consensus function that takes in a single PWM and returns a one-hot encoded version of the consensus sequence.
Added in an extract_signal function for extracting sums over variable-length spans from tensors.
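The two utilities above can be sketched as follows. Both are illustrations under stated assumptions, not the library code: the PWM is assumed to be a (4, length) nested list, spans are assumed to be (example_idx, start, end) with an exclusive end, and the function names are hypothetical. The span sums use prefix sums so that each span costs a constant-time lookup regardless of its length.

```python
def pwm_consensus_sketch(pwm):
    """Return a one-hot (4, length) encoding of the per-position argmax."""
    n_chars, length = len(pwm), len(pwm[0])
    one_hot = [[0] * length for _ in range(n_chars)]
    for pos in range(length):
        column = [pwm[c][pos] for c in range(n_chars)]
        one_hot[column.index(max(column))][pos] = 1
    return one_hot


def extract_signal_sketch(spans, signal):
    """Sum signal over variable-length spans.

    spans: list of (example_idx, start, end), end exclusive.
    signal: one list of values per example.
    """
    prefix = []
    for track in signal:
        running = [0]
        for value in track:
            running.append(running[-1] + value)
        prefix.append(running)
    return [prefix[i][end] - prefix[i][start] for i, start, end in spans]


pwm = [[0.7, 0.1], [0.1, 0.2], [0.1, 0.6], [0.1, 0.1]]
consensus = pwm_consensus_sketch(pwm)

signal = [[1, 2, 3, 4], [10, 20, 30, 40]]
span_sums = extract_signal_sketch([(0, 1, 3), (1, 0, 4)], signal)
```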
Version 0.3.0
Highlights
Added in a new TOMTOM implementation and a revamped FIMO implementation
TOMTOM and FIMO both have command-line tools in tangermeme
FIMO
The PyTorch implementation has been exchanged for a numba based one.
The new signature is a single function called fimo
A command-line tool can be used with the signature tangermeme fimo …
TOMTOM
A numba-based implementation has been added in the function tomtom
A command-line tool can be used with the signature tangermeme tomtom …
utils
chunk and unchunk have been added in to chunk long sequences into blocks that can be operated on by methods with fixed-window inputs, such as machine learning models, and for converting the predictions from these approaches back into a contiguous format.
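The round trip can be illustrated with plain lists. This sketch uses non-overlapping blocks and leaves the last block short; the real chunk and unchunk may pad, overlap, or operate on tensors, so treat the names and behavior here as assumptions.

```python
def chunk_sketch(seq, size):
    """Split a sequence into consecutive, non-overlapping blocks."""
    return [seq[i:i + size] for i in range(0, len(seq), size)]


def unchunk_sketch(blocks):
    """Concatenate per-block outputs back into one contiguous list."""
    out = []
    for block in blocks:
        out.extend(block)
    return out


values = list(range(10))
blocks = chunk_sketch(values, size=4)

# Stand-in for a fixed-window model: double every value in a block.
block_outputs = [[v * 2 for v in block] for block in blocks]
contiguous = unchunk_sketch(block_outputs)
```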
match
Implemented updates to substantially reduce memory use and runtime of extract_matching_loci. This was mainly achieved by:
Avoiding io.extract_loci, which one-hot encodes all loci into a single large tensor. Instead, the locus sequences are extracted one by one, keeping only one in memory at a time. The N and GC percentages are calculated directly from the sequence, and only those values are stored.
Calculating genome-wide N and GC percentages by taking slices of the chromosomal DNA sequences and using the count method of Python strings. This is significantly faster than the previous approach using numpy isin, and avoids keeping several copies of the sequence in memory at the same time.
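The string-counting approach described above looks roughly like this. The sketch divides by the full slice length for both fractions; whether the library excludes N positions from the GC denominator is not stated here, so that choice is an assumption, as is the function name.

```python
def gc_and_n_fraction(seq):
    """Return (gc_fraction, n_fraction) for a DNA string using str.count."""
    seq = seq.upper()
    n_fraction = seq.count("N") / len(seq)
    gc_fraction = (seq.count("G") + seq.count("C")) / len(seq)
    return gc_fraction, n_fraction


chromosome = "ACGTNNGCacgtnngc"
# Only a slice of the chromosome is examined at a time.
gc_fraction, n_fraction = gc_and_n_fraction(chromosome[0:8])
```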
Various other changes:
Counts from regions that cannot be extracted from a provided bigwig file (such as for a missing chromosome) are now set to nan rather than 0. This will affect the threshold value used for filtering background regions.
Small change to the binning strategy for gc values, which could mean that matching loci generated in a previous version will not be reproduced exactly in all cases, even when using the same random seed.
Enabled the handling of ‘N’ characters in sequences, or [0,0,0,0] columns in one-hot encodings, i.e., ambiguous genomic positions. Updated characters() and _validate_input() in the utils module to enable this.
Version 0.2.3
match
Expanded the ignore parameter to ignore all non-ACGT characters.
Version 0.2.2
plot
Fixed issue in plot_logo raised by @sandyfloren where passing in annotations without passing in start or end would raise an error. Now, start defaults to 0 and end defaults to the length of the sequence.
tools
FIMO is now base 2 instead of base e, to better match the MEME-suite tool. p-values should remain the same but scores will change.
FIMO hits will now return p-values, and will no longer return an uninformative attr column
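The reason scores change while p-values stay the same is that switching the logarithm base rescales every log-odds score by the same constant, 1 / ln(2), which preserves their ordering. The numbers below are made up for illustration and the helper log_odds is hypothetical.

```python
import math


def log_odds(prob, background, base):
    """Log-odds of a motif probability against a background frequency."""
    return math.log(prob / background) / math.log(base)


score_e = log_odds(0.7, 0.25, math.e)  # natural-log score
score_2 = log_odds(0.7, 0.25, 2)       # base-2 score, in bits

# The two differ only by a constant factor.
rescaled = score_e / math.log(2)
```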
product
apply_pairwise has been added along with documentation and unit tests
match
Fixes an issue with trying to calculate the mean over an array of integers by changing the array to be dtype float (via @adamyhe).
Version 0.2.1
deep_lift_shap
Removed the autocasting to 32-bit floats, enabling attributions to be calculated at other resolutions
Removes ~100 LOC and the DeepLiftShap object, integrating that code directly into the deep_lift_shap function
Only assigns hooks once at the beginning of the function and clears them upon an error or completion of function, instead of assigning and clearing hooks every batch
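The hook lifecycle described in the last item maps naturally onto a try/finally block. The sketch below uses a toy class in place of real PyTorch modules, so ToyLayer, run_with_hooks, and the string used as a hook are all hypothetical; it shows only the register-once, always-clean-up pattern.

```python
class ToyLayer:
    """Hypothetical stand-in for a model layer that can hold hooks."""

    def __init__(self):
        self.hooks = []


def run_with_hooks(layers, batches, process_batch):
    # Register hooks once, before any batch is processed.
    for layer in layers:
        layer.hooks.append("attribution_hook")

    try:
        return [process_batch(batch) for batch in batches]
    finally:
        # Runs on completion and on error, so hooks never leak.
        for layer in layers:
            layer.hooks.clear()


layers = [ToyLayer(), ToyLayer()]
results = run_with_hooks(layers, [1, 2, 3], lambda batch: batch * 2)
```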
Version 0.2.0
Highlights
Alters the API of several functions to make them more general, with the option of taking in a function to apply instead of defaulting to predict, while remaining backward compatible
Adds in deep_lift_shap and seqlet to operate on attributions
deep_lift_shap
Added in a stand-alone implementation of deep_lift_shap
This implementation resolves several issues with Captum, e.g., with pooling layers
Allows batching of example-reference pairs across examples (so batch_size can be greater than n_shuffles)
Allows batch_size to be much smaller than n_shuffles with the results aggregated once all references have been processed to allow large models to be run
Allows additional non-linear operations to be registered by passing in a dictionary
Allows either the aggregated attribution scores or, with raw_output=True, the raw multipliers to be returned
ism
Changes the default output from the raw output (which you can still get with raw_output=True) to aggregated attribution values, to make the API compatible
marginalize
Change the signature to take in an optional function that gets applied before/after the substitution, default is predict
Change the signature to take in **kwargs that get passed into the optional function
Change the signature to take in additional_func_kwargs that is an alternative and safer way to pass arguments into the function
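The calling pattern shared by marginalize, ablate, and space can be sketched with strings and a toy model. Everything here is illustrative: marginalize_sketch and toy_predict are hypothetical, the motif is placed in the middle of each sequence, and letting the explicit dictionary override loose keyword arguments is a choice made for this sketch. One plausible reason a dictionary is safer is that its keys cannot collide with the wrapper's own parameter names.

```python
def toy_predict(seqs, scale=1.0):
    """Hypothetical model: scaled count of 'G' in each sequence."""
    return [scale * s.count("G") for s in seqs]


def marginalize_sketch(seqs, motif, func=toy_predict,
                       additional_func_kwargs=None, **kwargs):
    """Apply func before and after substituting motif into each sequence."""
    additional_func_kwargs = additional_func_kwargs or {}
    # In this sketch, explicit dictionary entries win over loose kwargs.
    func_kwargs = {**kwargs, **additional_func_kwargs}

    before = func(seqs, **func_kwargs)

    substituted = []
    for s in seqs:
        start = (len(s) - len(motif)) // 2
        substituted.append(s[:start] + motif + s[start + len(motif):])

    after = func(substituted, **func_kwargs)
    return before, after


seqs = ["AAAAAA", "CCGCCC"]
before, after = marginalize_sketch(seqs, "GG")
before_scaled, after_scaled = marginalize_sketch(
    seqs, "GG", additional_func_kwargs={"scale": 2.0})
```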
ablate
Change the signature to take in an optional function that gets applied before/after the ablation, default is predict
Change the signature to take in **kwargs that get passed into the optional function
Change the signature to take in additional_func_kwargs that is an alternative and safer way to pass arguments into the function
space
Change the signature to take in an optional function that gets applied before/after the substitutions, default is predict
Change the signature to take in **kwargs that get passed into the optional function
Change the signature to take in additional_func_kwargs that is an alternative and safer way to pass arguments into the function
variant_effect
Change the name of marginal_substitution_effect to substitution_effect
Change the API of substitution_effect to take in a tensor of original sequences and a tensor of substitutions
Change the API of substitution_effect to take in an optional function and **kwargs and additional_func_kwargs to pass into func
Change the name of marginal_deletion_effect to deletion_effect
Change the API of deletion_effect to take in a tensor of original sequences and a tensor of deletions
Change the API of deletion_effect to take in an optional function and **kwargs and additional_func_kwargs to pass into func
Change the name of marginal_insertion_effect to insertion_effect
Change the API of insertion_effect to take in a tensor of original sequences and a tensor of insertions
Change the API of insertion_effect to take in an optional function and **kwargs and additional_func_kwargs to pass into func
seqlet
Add a new file for the identification of seqlets
Add tfmodisco_seqlets which is a simplified and documented version of the seqlet calling in tfmodisco that returns dataframes
Version 0.1.0
Highlights
This is the first major release of tangermeme and contains the first version of the core functionality.
ersatz
This module implements common sequence manipulation methods such as substitutions, insertions, deletions, and shufflings of sequences.
predict
This module implements efficient batched prediction that can handle models that accept multiple inputs or multiple outputs.
marginalize
This module implements marginalization experiments, where predictions are made for a set of sequences, a motif is substituted into the middle, and then new predictions are made for the new sequences.
space
This module implements spacing experiments where predictions are made for a set of sequences, a set of motifs are inserted with a given spacing, and then new predictions are made for the new sequences.
io
This module implements I/O functions for common data types as well as for extracting examples for machine learning models.
ism
This module implements in silico saturated mutagenesis (ISM).
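In silico saturated mutagenesis substitutes every possible character at every position and records how the prediction changes relative to the unmodified sequence. The sketch below works on strings with a toy model; ism_sketch and toy_motif_model are hypothetical names, and the library itself operates on one-hot tensors in batches.

```python
ALPHABET = "ACGT"


def toy_motif_model(seq):
    """Hypothetical model: number of occurrences of the motif 'GATA'."""
    return seq.count("GATA")


def ism_sketch(seq, predict_fn):
    """Return effects[pos][char] = prediction(mutant) - prediction(reference)."""
    reference = predict_fn(seq)
    effects = []
    for pos in range(len(seq)):
        row = {}
        for char in ALPHABET:
            mutant = seq[:pos] + char + seq[pos + 1:]
            row[char] = predict_fn(mutant) - reference
        effects.append(row)
    return effects


effects = ism_sketch("CCGATACC", toy_motif_model)
```

Positions inside the motif show a drop for every non-reference character, while flanking positions show no change.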
variant_effect
This module implements functions for evaluating the marginal effect of variants on model predictions.
Version 0.0.1
Highlights
Initial release