annotate
- tangermeme.annotate.annotate_seqlets(X, seqlets, motifs, n_nearest=1, n_jobs=-1, **kwargs)
Annotate a set of seqlets according to a motif database using TOMTOM.
This function takes in a set of seqlets and a motif database and assigns to each seqlet the most significant motif match. This match is done using TOMTOM on just the one-hot encoded sequence for the seqlet and the PWM for the motifs. If no motifs have a statistical significance below the provided threshold, an index of -1 is returned.
The computation is independent for each seqlet, so ordering of seqlets should not matter, not should running this function on a subset of seqlets versus the full set.
Parameters
- X: torch.Tensor, shape=(n_examples, len(alphabet), length)
A one-hot encoded set of sequences. These are only relevant for extracting the sequences spanned by each seqlet.
- seqlets: pandas.DataFrame, shape=(n_seqlets, 3+)
A BED-formatted set of seqlets to annotate. The first three columns must correspond to the example index in X, the start, and the end (both base-0). The annotations can have additional columns, but those are ignored.
- motifs: dict or str
A dictionary of motifs where the keys are motif names and the values are PWMs with shape (len(alphabet), len(motif)), or a string to a MEME-formatted file.
- n_nearest: int, optional
The number of motif matches to return for each seqlet, starting with the most significant hit. Default is 1.
- n_jobs: int, optional
The number of threads to run TOMTOM on in parallel. -1 means use all available threads. Default is -1.
- **kwargs: arguments, optional
Any other arguments to pass into the TOMTOM algorithm.
Returns
- idxs: torch.Tensor, shape=(len(seqlets), n_nearest)
The index of the most significant motif hit(s) for each seqlet.
- pvals: torch.Tensor, shape=(len(seqlets), n_nearest)
The p-value of the returned significant hits.
- tangermeme.annotate.count_annotations(X, dtype=torch.uint8, shape=None, dim=None)
A method for counting the annotations for each example.
This function takes in a tensor of (example_idx, annotation_idx) pairs and returns a tensor of counts where each row is an example, each column is an annotation, and the value is the number of times that annotation appears in that example. The dimensions of the returned matrix will then be the maximum example_idx and motif_idx value even if intermediary values are not observed.
Parameters
- X: torch.Tensor, shape=(n_annotations, 2) or tuple of two vectors
A tensor of annotations where the first column is the example_idx and the second column is the annotation_idx. Both should be integers.
- dtype: torch.dtype, optional
The dtype of the returned matrix. Default is torch.uint8.
- shape: tuple or None, optional
A user-defined shape of the returned count matrix. Use this if you are not sure whether all examples or annotations are observed in X but you want to have a constant shape. If None, derive shape from the maximum values in X. Default is None.
- dim: None, 0, or 1, optional
Whether to aggregate the counts along one of the axes. If set to None, the full count matrix will be returned. If set to 0 the first dimension is summed over, returning the number of occurences of each annotation. If set to 1, the second dimension is summed over, returning the number of annotations per example. Default is None.
Returns
- y: torch.Tensor, shape=(max(example_idx), max(motif_idx))
A sparse tensor where each row is an example and each column is an annotation and the values within are the number of times each annotation appears in each example.
- tangermeme.annotate.pairwise_annotations(X, dtype=torch.int64, unique=True, symmetric=True, shape=None)
Returns the number of times pairs of annotations occur in an example.
This function takes in a tensor of (example_idx, annotation_idx) pairs and returns a tensor of counts where each row is an annotation_idx and each column is also an annotation_idx, and the values within are the number of times that the each pair of annotations appears in the same example.
The returned matrix will be a symmetric matrix that has a total number of counts equal to twice the number of provided examples.
Parameters
- X: torch.Tensor, shape=(n_annotations, 2)
A tensor of annotations where the first column is the example_idx and the second column is the annotation_idx. Both should be integers.
- dtype: torch.dtype, optional
The dtype of the returned matrix. Default is torch.int64.
- unique: bool, optional
Whether to count only unique pairs within each example or each instance. For example, if set to True, an example with 3 instances of KL4 would contribute only a single KLF4-KLF4 interaction to the matrix, whereas if set to True, would contribute many. Default is True.
- symmetric: bool, optional
Whether to return a symmetric matrix or one where the row is the first element in X and the column is the subsequent element in X. If symmetric, the diagonal is NOT double counted. Default is True.
- shape: int or None, optional
The number of rows and columns to use in the matrix. If None, infer the number from X. Default is None.
Returns
- y: torch.Tensor, shape=(max(example_idx), max(motif_idx))
A sparse tensor where each row is an example and each column is an annotation and the values within are the number of times each annotation appears in each example.
- tangermeme.annotate.pairwise_annotations_spacing(X, max_distance=100, dtype=torch.uint8, symmetric=True, shape=None)
Finds the number of times each annotation pairs happens at each distance.
This function takes in a tensor of (example_idx, annotation_idx, start, end) tuples and returns a tensor of counts where the first two dimensions are annotation indexes and the third dimension is the spacing between the pair. The values within the tensor are the count of the number of times that pair of annotations is found with that spacing. Importantly, distance is calculated between the end of the motif on the left and the start of the motif on the right. Put another way, it is the number of nucleotides between each motif and so is invariant to their lengths.
The returned tensor will be extremely sparse, so be careful about setting the maximum distance to a value that is very high. A sparse tensor could be used here, but given how cheap memory is getting this function is meant to be a quick solution that handles most cases.
Parameters
- X: torch.Tensor, shape=(n_annotations, 4)
A tensor of annotations where the columns should be the example_idx, annotation_idx, start of the annotation, and end of the annotation. The end of the annotation should not be inclusive, so two adjacent annotations with zero spacing between them should have the same integer index for the end of one motif and the start of the next.
- max_distance: int, optional
The maximum distance between two annotations in the same example to consider. Default is 100.
- dtype: torch.dtype, optional
The dtype of the returned matrix. Default is torch.uint8.
- symmetric: bool, optional
Whether to return a symmetric matrix or one where the row is the first element in X and the column is the subsequent element in X. If symmetric, the diagonal is NOT double counted. Default is True.
- shape: int or None, optional
The number of rows and columns to use in the matrix. If None, infer the number from X. Default is None.
Returns
- y: torch.Tensor, shape=(max(example_idx), max(motif_idx))
A sparse tensor where each row is an example and each column is an annotation and the values within are the number of times each annotation appears in each example.