seqlet

tangermeme.seqlet.recursive_seqlets(X, threshold=0.01, min_seqlet_len=4, max_seqlet_len=25, additional_flanks=0, n_bins=1000)

A seqlet caller implementing the recursive seqlet algorithm.

NOTE: Currently only positive seqlets will be identified. The easiest way to get negative seqlets is to run this on the absolute value of the attribution values and then re-extract the attribution sums given the boundaries.
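The workaround in the note can be sketched as follows. This is a minimal illustration, not library code: `signed_sums` is a hypothetical helper, and the `(example_idx, start, end)` tuples stand in for the boundaries you would get from running `recursive_seqlets` on the absolute attributions. End coordinates are assumed to be exclusive, as in BED.

```python
def signed_sums(X, boundaries):
    """Re-sum the signed attributions inside spans that were called on abs(X).

    X: list of per-example attribution lists (signed values).
    boundaries: iterable of (example_idx, start, end); end is exclusive (assumed).
    """
    return [sum(X[idx][start:end]) for idx, start, end in boundaries]


X = [[0.1, -2.0, -3.0, -1.5, 0.0, 0.2]]

# Hypothetical boundaries, as if they had been called on the absolute values of X.
boundaries = [(0, 1, 4)]

sums = signed_sums(X, boundaries)

# Spans whose signed sum is below zero are the negative seqlets.
negative = [b for b, s in zip(boundaries, sums) if s < 0]
```

Spans with a positive signed sum are the same seqlets that would have been found on the original values.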

This algorithm identifies spans of high attribution characters, called seqlets, using a simple approach derived from the Tomtom/FIMO algorithms. First, distributions of attribution sums are created for all potential seqlet lengths by discretizing the attribution sums into integers. Then, CDFs are calculated for each distribution (or, more specifically, 1-CDFs). Finally, p-values are calculated via lookup into these 1-CDFs for all potential spans, yielding a (n_positions, n_lengths) matrix of p-values.
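The three steps above (discretize, build a 1-CDF, look up p-values) can be sketched for a single span length as follows. This is a simplified pure-Python illustration that bins the empirical window sums directly; how the library actually builds its distributions is not described here, and `window_sums` and `upper_tail_pvalues` are hypothetical names.

```python
def window_sums(x, length):
    """Attribution sum of every span of the given length in one example."""
    return [sum(x[i:i + length]) for i in range(len(x) - length + 1)]


def upper_tail_pvalues(sums, n_bins=1000):
    """Discretize sums into integer bins, build a 1-CDF, and look up each sum."""
    lo, hi = min(sums), max(sums)
    width = (hi - lo) / n_bins or 1.0  # guard against all sums being equal
    bins = [min(int((s - lo) / width), n_bins - 1) for s in sums]

    counts = [0] * n_bins
    for b in bins:
        counts[b] += 1

    # survival[b] = fraction of sums that fall in bin b or a higher bin
    survival = [0.0] * n_bins
    running = 0
    for b in range(n_bins - 1, -1, -1):
        running += counts[b]
        survival[b] = running / len(sums)

    return [survival[b] for b in bins]


x = [0, 0, 0, 5, 5, 0, 0, 0]
sums = window_sums(x, 2)
pvals = upper_tail_pvalues(sums, n_bins=10)
```

Repeating this for every length from `min_seqlet_len` to `max_seqlet_len` gives one column per length, which is the shape of the p-value matrix described above.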

This algorithm then identifies seqlets by requiring a key property: every internal span of a seqlet must itself have been called a seqlet. That is, every span of length min_seqlet_len to max_seqlet_len that starts at any position in the seqlet and is fully contained within its borders must have a p-value below the threshold. Functionally, this means finding entries of the p-value matrix for which the upper-left triangle rooted at that entry is composed entirely of values below the threshold. Graphically, for a candidate seqlet starting at X and ending at Y to be called a seqlet, all the values within the bounds (in addition to X) must also have a p-value below the threshold.

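                  min_seqlet_len

    . . . . . . . | . . . . / . . . . . . . .
    . . . . . . . | . . . / . . . . . . . . .
    . . . . . . . | . . / . . . . . . . . . .
    . . . . . . . | . / . . . . . . . . . . .
    . . . . . . . | / . . . . . . . . . . . .
    . . . . . . . X . . . . . . . . Y . . . .
    . . . . . . . . . . . . . . . . . . . . .
    . . . . . . . . . . . . . . . . . . . . .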
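The recursive property can be checked directly from a p-value matrix. The sketch below is an illustration only: `is_recursive_seqlet` is a hypothetical helper, and it assumes `pvals[i][k]` holds the p-value of the span that starts at position `i` and has length `min_len + k`, with end coordinates exclusive.

```python
def is_recursive_seqlet(pvals, start, end, min_len, threshold):
    """Return True if every sub-span of [start, end) is below the threshold."""
    length = end - start
    max_len = min_len + len(pvals[0]) - 1
    if length < min_len or length > max_len:
        return False

    # Every span of at least min_len that lies fully inside [start, end).
    for s in range(start, end - min_len + 1):
        for e in range(s + min_len, end + 1):
            if pvals[s][e - s - min_len] >= threshold:
                return False
    return True


min_len, max_len, n_positions = 2, 4, 10
pvals = [[1.0] * (max_len - min_len + 1) for _ in range(n_positions)]

# Mark every sub-span of [3, 7) as significant.
for s in range(3, 6):
    for e in range(s + min_len, 8):
        pvals[s][e - s - min_len] = 0.001

called = is_recursive_seqlet(pvals, 3, 7, min_len, threshold=0.01)

# A single internal span above the threshold breaks the property.
pvals[4][0] = 0.5
broken = is_recursive_seqlet(pvals, 3, 7, min_len, threshold=0.01)
```

The brute-force double loop is for clarity; a real implementation would avoid re-checking the same sub-spans for every candidate.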

The seqlets identified by this approach will usually be much smaller than those identified by the TF-MoDISco approach, including sometimes missing important characters on the flanks. You can set additional_flanks to a higher value if you want to include additional positions on either side. Importantly, the initial seqlet calls cannot overlap, but these additional characters are not considered when making that determination. This means that seqlets may appear to overlap when additional_flanks is set to a higher value.
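A small worked example of why widened seqlets can appear to overlap. The helpers `add_flanks` and `overlaps` are hypothetical, and the sketch does not clip coordinates at the sequence edges; whether the library clips is not stated here.

```python
def add_flanks(seqlets, additional_flanks):
    """Widen each (start, end) call by the same amount on both sides."""
    return [(s - additional_flanks, e + additional_flanks) for s, e in seqlets]


def overlaps(a, b):
    """True if two half-open (start, end) intervals share any position."""
    return a[0] < b[1] and b[0] < a[1]


# Two adjacent, non-overlapping initial calls.
calls = [(5, 9), (10, 14)]

# Widening is applied after the non-overlap rule, so the results may overlap.
widened = add_flanks(calls, 2)
```

Here the initial calls are disjoint, but the widened spans (3, 11) and (8, 16) share positions 8 through 10.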

Parameters

X: torch.Tensor or numpy.ndarray, shape=(-1, length)

Attributions for each position in each example. The identity of the characters is not relevant for seqlet calling, so this should be the “projected” attributions, i.e., the attribution of the observed characters.
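One way to obtain projected attributions from a full (alphabet, length) attribution matrix is to multiply by the one-hot encoding and sum over the alphabet axis. The sketch below uses nested lists for one example; `project_attributions` is a hypothetical helper, and in practice this would be a single tensor operation.

```python
def project_attributions(attr, one_hot):
    """Keep only the attribution of the observed character at each position.

    attr, one_hot: nested lists of shape (alphabet, length) for one example.
    """
    length = len(attr[0])
    return [
        sum(attr[a][i] * one_hot[a][i] for a in range(len(attr)))
        for i in range(length)
    ]


# Rows are A, C, G, T; the observed sequence is "ACT".
attr = [
    [0.5, -0.1, 0.0],
    [0.2, 0.9, 0.1],
    [-0.3, 0.0, 0.4],
    [0.1, 0.2, -0.7],
]
one_hot = [
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 0],
    [0, 0, 1],
]

projected = project_attributions(attr, one_hot)
```

The result has one value per position, which matches the (-1, length) shape expected for X once examples are stacked.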

threshold: float, optional

The p-value threshold for calling seqlets. All positions within the triangle (as detailed above) must be below this threshold. Default is 0.01.

min_seqlet_len: int, optional

The minimum length that a seqlet must be, and the minimal length of span that must be identified as a seqlet in the recursive property. Default is 4.

max_seqlet_len: int, optional

The maximum length that a seqlet can be. Default is 25.

additional_flanks: int, optional

An additional value to subtract from the start, and to add to the end, of all called seqlets. Does not affect which seqlets are called. Default is 0.

n_bins: int, optional

The number of bins to use when estimating the PDFs and CDFs. Default is 1000.

Returns

seqlets: pandas.DataFrame, shape=(-1, 5)

A BED-formatted dataframe containing the called seqlets, ranked from lowest to highest p-value. The returned p-value is the p-value of the (location, length) span and is not influenced by the other values within the triangle.

tangermeme.seqlet.tfmodisco_seqlets(X_attr, window_size=21, flank=10, target_fdr=0.2, min_passing_frac=0.03, max_passing_frac=0.2, weak_threshold_for_counting_sign=0.8)

Extract seqlets using the procedure from TF-MoDISco.

Seqlets are contiguous spans of high attribution characters. This method for identifying them is the one implemented in the TF-MoDISco algorithm. Importantly, TF-MoDISco performs several post-processing steps on these seqlets that are interleaved with the pattern identification procedure, so the final set of seqlets actually used by patterns in TF-MoDISco will be smaller than the set returned here.

The seqlets returned by this procedure have been optimized to be useful for motif discovery, and so are generally much longer and less sensitive than one might initially expect. The seqlets are longer because the local context that patterns occur in might be useful, and because uninformative characters on the flanks can easily be trimmed off. The seqlets are also less sensitive, in the sense that sometimes spans that one might call a seqlet by eye are missed, to prevent noise from contaminating the found patterns.

Parameters

X_attr: torch.Tensor, shape=(-1, length)

A tensor of attribution values for each position in the sequence. The attributions here will be summed across the length of the alphabet so the values must be amenable to that. This means that, most likely, it should be attribution values multiplied by the one-hot encodings so only the present characters have attributions.

window_size: int, optional

The size of the window of attribution values to sum over when identifying seqlets. This is not the only component of seqlet size but is the most important. Default is 21.

flank: int, optional

A number of characters on either end of the window to add to each seqlet. This is done primarily to remove the effect of surrounding positions and not have overlapping seqlets. Default is 10.
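To make the relationship between window_size and flank concrete, the sketch below sums attributions over sliding windows and expands one window by the flank on each side, giving a span of window_size + 2 * flank positions (41 with the defaults). It is an illustration under assumptions: `window_sums` and `window_to_seqlet` are hypothetical helpers, end coordinates are exclusive, and the way the library selects windows and handles sequence edges is not shown.

```python
def window_sums(x, window_size):
    """Attribution sum of every window of the given size in one example."""
    return [sum(x[i:i + window_size]) for i in range(len(x) - window_size + 1)]


def window_to_seqlet(window_start, window_size=21, flank=10):
    """Expand a window into a (start, end) span by adding the flank to each side."""
    return window_start - flank, window_start + window_size + flank


# A toy example: ten high-attribution positions in a length-60 sequence.
x = [0.0] * 60
x[25:35] = [1.0] * 10

sums = window_sums(x, 21)

# First window with the maximal sum (ties resolve to the leftmost window).
best = max(range(len(sums)), key=lambda i: sums[i])

start, end = window_to_seqlet(best, window_size=21, flank=10)
```

The ten informative positions end up inside a 41-position span, which illustrates why these seqlets tend to be long.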

target_fdr: float, optional

An FDR value to set on attribution score sums over windows when separating called seqlets from background. Default is 0.2.

min_passing_frac: float, optional

Require that at least this proportion of windows pass seqlet identification. Default is 0.03.

max_passing_frac: float, optional

Require that no more than this proportion of windows pass seqlet identification. Default is 0.2.
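One plausible way to enforce the two passing-fraction bounds is to move the score threshold to the window score at the corresponding rank whenever too few or too many windows pass. The sketch below is an assumption about the mechanism, not a description of the library's code; `clamp_threshold` is a hypothetical helper, and tied scores can make the resulting fraction inexact.

```python
import math


def clamp_threshold(scores, threshold, min_passing_frac=0.03, max_passing_frac=0.2):
    """Adjust a threshold so the fraction of passing windows stays within bounds."""
    ranked = sorted(scores, reverse=True)
    n = len(ranked)
    frac = sum(s >= threshold for s in ranked) / n

    if frac < min_passing_frac:
        # Too few windows pass: lower the threshold to admit the top min fraction.
        k = max(1, math.ceil(min_passing_frac * n))
        return ranked[k - 1]
    if frac > max_passing_frac:
        # Too many windows pass: raise the threshold to keep only the top max fraction.
        k = max(1, math.floor(max_passing_frac * n))
        return ranked[k - 1]
    return threshold


scores = [1, 2, 3, 4, 5, 6, 7, 8]

too_strict = clamp_threshold(scores, 100, min_passing_frac=0.25, max_passing_frac=0.5)
too_loose = clamp_threshold(scores, 0, min_passing_frac=0.25, max_passing_frac=0.5)
in_range = clamp_threshold(scores, 6, min_passing_frac=0.25, max_passing_frac=0.5)
```

In this toy example the strict threshold is lowered to 7 (two of eight windows pass), the loose one is raised to 5 (four of eight pass), and a threshold already within bounds is returned unchanged.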

weak_threshold_for_counting_sign: float, optional

A minimal threshold to use when setting the final threshold value separating seqlets from non-seqlets. Default is 0.8.

Returns

seqlets: pandas.DataFrame, shape=(-1, 4)

A dataframe containing the example index, start position, end position, and attribution sum for each seqlet that passes the thresholds.