design

tangermeme.design.screen.screen(model: ~torch.nn.modules.module.Module, shape: tuple[int, ...], y: ~torch.Tensor | list[~torch.Tensor] | None = None, loss: ~collections.abc.Callable[[...], ~typing.Any] = MSELoss(), tol: float = 0.001, max_iter: int = -1, args: tuple | None = None, n_best: int = 1, alphabet: list[str] = ['A', 'C', 'G', 'T'], batch_size: int = 32, func: ~collections.abc.Callable[[...], ~typing.Any] = <function random_one_hot>, additional_func_kwargs: dict | None = None, dtype: str | ~torch.dtype | None = None, device: str | ~torch.device | None = None, random_state: int | ~numpy.random.mtrand.RandomState | None = None, verbose: bool = False) → Tensor

Screen randomly generated sequences and choose the best one.

Perhaps the conceptually simplest method for design is to randomly generate batches of examples, evaluate them with the provided model, and keep only the n_best top hits according to the loss function. This is called “screening” because one screens a large pool of random potential designs for activity and keeps only those that score well under the loss function.

This function will likely be slow to reach a good design because each batch is independent of the others, i.e., you are not guaranteed to get closer to the goal with each step. Even so, you may be surprised by how good the generated sequences are.
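The screening idea can be sketched in plain Python. This is an illustration only, not tangermeme's implementation: `toy_model` and `screen_sketch` are hypothetical names, sequences are strings rather than one-hot tensors, and the loss is assumed to be squared error against a scalar target.

```python
import random

def toy_model(seq):
    # Hypothetical stand-in for a trained model: fraction of G/C characters.
    return sum(ch in "GC" for ch in seq) / len(seq)

def screen_sketch(length, y, n_best=1, batch_size=32, max_iter=10, seed=0):
    rng = random.Random(seed)
    kept = []  # (loss, sequence) pairs, lowest loss first
    for _ in range(max_iter):
        # Each batch is drawn independently of every previous batch.
        batch = ["".join(rng.choice("ACGT") for _ in range(length))
                 for _ in range(batch_size)]
        scored = [((toy_model(s) - y) ** 2, s) for s in batch]
        # Merge with the hits kept so far and retain the n_best lowest losses.
        kept = sorted(kept + scored)[:n_best]
    return kept

hits = screen_sketch(length=20, y=0.5, n_best=3)
```

Because nothing learned in one batch informs the next, the only ways to improve the result are to draw more batches or larger ones.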

Parameters

model: torch.nn.Module: A PyTorch model to use for making predictions. These models can take in any number of inputs and make any number of outputs. The additional inputs must be specified in the args parameter.
shape: tuple: Dimensions for the randomly generated sequences, excluding the batch dimension. For a model expecting an input like (32, 4, 2114), where 32 is the batch size, shape should be (4, 2114).
y: torch.Tensor or list of torch.Tensors or None: A tensor or list of Tensors providing the desired output from the model. The type and shape must be compatible with the provided loss function and comparable to the output from model. Each tensor should have a shape of (1, n) where n is the number of outputs from the model. The first dimension is 1 to make broadcasting work correctly. If None, simply choose the sequence that yields the strongest response from the model. Default is None.
loss: function, optional: This function must take in y and y_hat where y is the desired output from the model and y_hat is the current prediction from the model given the substitutions. By default, this is the torch.nn.MSELoss().
tol: float, optional: A threshold on the loss below which the screening procedure terminates. Termination requires the loss of the worst kept candidate (i.e., the n_best-th best so far) to fall below tol. When n_best > 1, the current best may therefore have been below tol for many iterations before this condition triggers. Default is 1e-3.
max_iter: int, optional: The maximum number of iterations to run before terminating the procedure. Set to -1 for no limit. Default is -1.
args: tuple or list or None, optional: An optional set of additional arguments to pass into the model. If provided, each element in the tuple or list is one input to the model and the element must be formatted to be the same batch size as X. If None, no additional arguments are passed into the forward function. Default is None.
n_best: int, optional: The number of sequences to return at the end, ranked from the lowest loss to the highest loss. Setting to 1 means only return the very best sequence observed across all generation batches. Default is 1.
batch_size: int, optional: The number of sequences to generate (via func) and evaluate per iteration. This controls the size of each generation/screening batch; it is NOT forwarded to predict’s own batch_size (which retains its default). Default is 32.
func: function, optional: The function to use to generate sequences. The signature of this function must be that it takes in a tuple of the shape of the batch to generate, e.g. (32, 4, 2114), and also a random state. Default is random_one_hot.
additional_func_kwargs: dict or None, optional: Additional named arguments to pass into the function when it is called. This is provided as an alternate path to route arguments into the function in case they overlap, name-wise, with those in this function, or if you want to be absolutely sure that the arguments are making their way into the function. The dict is not modified in place. Default is None.
dtype: str or torch.dtype or None, optional: The dtype to use with mixed precision autocasting. If None, use the dtype of the model. This allows you to use int8 to represent large data sets and only convert batches to the higher precision, saving memory. Default is None.
device: str or torch.device or None, optional: The device to move the model and batches to when making predictions. If None, use CUDA when available and fall back to CPU otherwise. Default is None.
random_state: int or None, optional: The random seed to use to ensure determinism of the generation function. The seed is incremented by 1 at the end of each iteration so successive iterations draw different sequences while remaining reproducible. If None, not deterministic. Default is None.
verbose: bool, optional: Whether to display a progress bar during predictions. Default is False.

Returns

X: torch.Tensor, shape=(n_best, len(alphabet), length): The screened examples with the lowest loss.
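The interaction between tol and n_best described in the parameter list can be sketched as a small predicate. `should_stop` is a hypothetical helper written for this illustration, and the requirement that n_best candidates have already been collected is an assumption of the sketch.

```python
def should_stop(kept_losses, n_best, tol=1e-3):
    # kept_losses: losses of the candidates kept so far, sorted ascending.
    # Assumption: stop only once n_best candidates are kept and the worst
    # of them (the n_best-th best) is below tol.
    return len(kept_losses) >= n_best and kept_losses[n_best - 1] < tol

print(should_stop([0.0005, 0.2], n_best=1))     # True
print(should_stop([0.0005, 0.2], n_best=2))     # False
print(should_stop([0.0005, 0.0008], n_best=2))  # True
```

In the second call the best candidate is already below tol, but screening would continue because the second-best is not.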

tangermeme.design.greedy_substitution.greedy_substitution(model: Module, X: Tensor, y: Tensor | list[Tensor] | None = None, motifs: list[str] | None = None, loss: Callable[[...], Any] = MSELoss(), reverse_complement: bool = True, input_mask: Tensor | None = None, output_mask: Tensor | None = None, tol: float = 0.001, max_iter: int = -1, args: tuple | None = None, alphabet: list[str] = ['A', 'C', 'G', 'T'], batch_size: int = 32, device: str | device | None = None, verbose: bool = False) → Tensor

Greedily add motifs to achieve a desired goal.

This design function greedily adds motifs to achieve a desired output from the model. Each round, the function iterates through all possible motifs, substitutes each one in at every allowed position, and keeps the edit with the smallest loss. This process continues until either the maximum number of iterations is reached (at which point max_iter motifs will have been inserted into the sequence) or the improvement in loss falls below tol.
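A minimal sketch of this greedy loop on plain strings is shown here. It is not the library's code: `toy_model`, `substitute`, and `greedy_sketch` are hypothetical names, the loss is assumed to be squared error against a scalar target, and the stopping rule follows the tol parameter description (stop when the improvement is below tol).

```python
def toy_model(seq):
    # Hypothetical stand-in for a model: number of "GATA" occurrences.
    return float(seq.count("GATA"))

def substitute(seq, motif, start):
    # Overwrite len(motif) characters beginning at `start`; length is unchanged.
    return seq[:start] + motif + seq[start + len(motif):]

def greedy_sketch(seq, y, motifs, tol=1e-3, max_iter=5):
    def loss_of(s):
        return (toy_model(s) - y) ** 2

    loss = loss_of(seq)
    for _ in range(max_iter):
        # Try every motif at every start position where it fits.
        candidates = [substitute(seq, m, i)
                      for m in motifs
                      for i in range(len(seq) - len(m) + 1)]
        best = min(candidates, key=loss_of)
        best_loss = loss_of(best)
        if loss - best_loss < tol:  # improvement too small, so stop
            break
        seq, loss = best, best_loss  # commit to the single best edit
    return seq, loss

designed, final_loss = greedy_sketch("A" * 20, y=2.0, motifs=["GATA", "CCCC"])
```

Each round costs one model evaluation per (motif, position) pair, which is why the real function batches predictions.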

Accordingly, the choice of loss function and desired output from the model is crucial for good design. Usually, the loss function can be Euclidean distance, but for models with more complex outputs or for subtle design tasks one may want to use something else, such as Jensen-Shannon divergence.
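For reference, the Jensen-Shannon divergence between two discrete distributions can be written in a few lines. This plain-Python version only illustrates the formula; a loss passed to this function would need to be a callable that accepts y and y_hat as tensors, and `js_divergence` is a hypothetical name.

```python
import math

def js_divergence(p, q):
    # p and q are discrete probability distributions of equal length,
    # each assumed to be non-negative and to sum to 1.
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(a, b):
        # Terms where a_i == 0 contribute 0 by convention.
        return sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

same = js_divergence([0.25, 0.75], [0.25, 0.75])  # identical distributions
far = js_divergence([1.0, 0.0], [0.0, 1.0])       # disjoint support
```

Unlike squared error, this measure compares the shape of two normalized profiles and is bounded above by log 2 when natural logarithms are used.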

Parameters

model: torch.nn.Module: A PyTorch model to use for making predictions. These models can take in any number of inputs and make any number of outputs. The additional inputs must be specified in the args parameter.
X: torch.Tensor, shape=(1, len(alphabet), length): A one-hot encoded sequence to use as the base for design. This must be a single sequence; the leading dimension of size 1 is kept for broadcasting.
y: torch.Tensor or list of torch.Tensors or None: A tensor or list of Tensors providing the desired output from the model. The type and shape must be compatible with the provided loss function and comparable to the output from model. Each tensor should have a shape of (1, n) where n is the number of outputs from the model. The first dimension is 1 to make broadcasting work correctly. If None, simply choose the edit that yields the strongest response from the model. Default is None.
motifs: list of strings or None: A list of strings where each string is a motif that can be inserted into the sequence. These strings will be one-hot encoded according to the provided alphabet. If None, use the provided alphabet as the motifs to only change one character at a time. Default is None.
loss: function, optional: This function must take in y and y_hat where y is the desired output from the model and y_hat is the current prediction from the model given the substitutions. By default, this is the torch.nn.MSELoss().
reverse_complement: bool, optional: Whether to augment the provided list of motifs with their reverse complements. This will double the runtime. Default is True.
input_mask: torch.Tensor or None, optional: A mask on input positions that can be the start of substitution. Any motif can be substituted in starting at each allowed position even if the contiguous span of the mask is shorter than the motif. True means that a motif can be substituted in starting at that position and False means that it cannot be. Default is None.
output_mask: torch.Tensor or None, optional: A mask on the outputs from the model to consider. True means to include the outputs in the loss, False means to exclude those outputs from the loss. If None, use all outputs. Default is None.
tol: float, optional: A threshold on the amount of improvement required according to loss. The procedure stops once the improvement from one round to the next falls below this value. Default is 1e-3.
max_iter: int, optional: The maximum number of iterations to run before terminating the procedure. The loop condition is iteration < max_iter, so any non-positive value (including the -1 default) causes the loop body to never execute and the original X to be returned unchanged. Pass a positive integer (or a large sentinel such as 10**9) for actual iteration. Default is -1.
args: tuple or list or None, optional: An optional set of additional arguments to pass into the model. If provided, each element in the tuple or list is one input to the model and the element must be formatted to be the same batch size as X. If None, no additional arguments are passed into the forward function. Default is None.
alphabet: set or tuple or list, optional: A pre-defined alphabet where the ordering of the symbols is the same as the index into the returned tensor, i.e., for the alphabet ['A', 'B'] the returned tensor will have a 1 at index 0 if the character was 'A'. Characters outside the alphabet are ignored and none of the indexes are set to 1. This is not necessary or used if a one-hot encoded tensor is provided for the motif. Default is ['A', 'C', 'G', 'T'].
batch_size: int, optional: The number of examples to make predictions for at a time. Default is 32.
device: str or torch.device or None, optional: The device to move the model and batches to when making predictions. If None, use CUDA when available and fall back to CPU otherwise. Default is None.
verbose: bool, optional: Whether to display a progress bar during predictions. Default is False.
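How reverse_complement and input_mask together shape the set of candidate edits can be sketched as follows. The helper names are hypothetical, and skipping start positions where the motif would run past the end of the sequence is an assumption of this sketch rather than documented behavior.

```python
def reverse_complement(motif):
    comp = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(comp[ch] for ch in reversed(motif))

def enumerate_edits(length, motifs, input_mask=None, use_rc=True):
    if use_rc:
        # Doubles the motif list, and therefore the number of candidates.
        motifs = motifs + [reverse_complement(m) for m in motifs]
    if input_mask is None:
        input_mask = [True] * length
    edits = []
    for m in motifs:
        for start in range(length - len(m) + 1):
            # The mask is consulted only at the start position, so a motif
            # may extend over positions whose mask value is False.
            if input_mask[start]:
                edits.append((m, start))
    return edits

mask = [False] * 10
mask[2] = True
edits = enumerate_edits(10, ["GATA"], input_mask=mask)
print(edits)  # [('GATA', 2), ('TATC', 2)]
```

Here only position 2 is allowed, yet the 4-character motif still overwrites positions 2 through 5.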

Returns

X: torch.Tensor, shape=(-1, len(alphabet), length): The edited sequence.
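The effect of the `iteration < max_iter` loop condition noted for max_iter can be seen in a toy loop. `rounds_run` and `rounds_until_converged` are hypothetical names used only for this illustration.

```python
def rounds_run(max_iter, rounds_until_converged=3):
    iteration = 0
    while iteration < max_iter:
        if iteration == rounds_until_converged:
            break  # stands in for the tol-based stopping rule
        iteration += 1
    return iteration

print(rounds_run(-1))      # 0: the loop body never executes
print(rounds_run(2))       # 2: stopped by max_iter
print(rounds_run(10**9))   # 3: stopped by convergence
```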

tangermeme.design.beam_substitution.beam_substitution(model: Module, X: Tensor, y: Tensor | list[Tensor] | None = None, motifs: list[str] | None = None, loss: Callable[[...], Any] = MSELoss(), reverse_complement: bool = True, input_mask: Tensor | None = None, output_mask: Tensor | None = None, beam_size: int = 4, n_best: int = 1, tol: float = 0.001, max_iter: int = -1, args: tuple | None = None, alphabet: list[str] = ['A', 'C', 'G', 'T'], batch_size: int = 32, device: str | device | None = None, verbose: bool = False) → Tensor

Beam search over motif substitutions to achieve a desired goal.

This is a generalization of greedy_substitution. Rather than committing to the single best edit each round, beam search keeps the beam_size best complete sequences (the “beam”) and expands all of them. The classic difficulty with applying beam search to a sequence-to-function model is that the model only produces a meaningful output once a full, fixed-length input is filled, so there is no natural notion of scoring a partially-built sequence the way one scores a partial sentence. This implementation sidesteps that entirely: every candidate in the beam is always a complete, fixed-length sequence, because each step substitutes a motif into an existing sequence rather than growing one. What is searched over is therefore not positions in a growing string but trajectories through edit-space.

Each round, every beam member is expanded by tiling every motif at every allowed position (exactly as greedy_substitution does for its single sequence), all resulting complete sequences are scored by their absolute loss, and the global top-beam_size are kept as the next beam. The current beam members are themselves carried forward as candidates, so the beam never regresses. Identical sequences are de-duplicated before pruning so that the beam does not collapse onto a single sequence. Setting beam_size=1 recovers greedy_substitution; larger beams hedge across multiple trajectories and can recover good multi-edit combinations that the greedy method prunes away after a locally-suboptimal first edit.

As with greedy_substitution, the choice of loss function and desired output is crucial. Usually the loss can be Euclidean distance, but for models with more complex outputs one may want something else, such as Jensen-Shannon divergence.
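The beam procedure described above can be sketched on plain strings. This is a toy illustration under stated assumptions, not the library's implementation: `toy_model`, `substitute`, and `beam_sketch` are hypothetical names, the loss is squared error against a scalar target, and ties are broken alphabetically so the result is deterministic.

```python
def toy_model(seq):
    # Hypothetical stand-in for a model: number of "GATA" occurrences.
    return float(seq.count("GATA"))

def substitute(seq, motif, start):
    return seq[:start] + motif + seq[start + len(motif):]

def beam_sketch(seq, y, motifs, beam_size=4, n_best=1, tol=1e-3, max_iter=5):
    def loss_of(s):
        return (toy_model(s) - y) ** 2

    beam = [seq]
    best = loss_of(seq)
    for _ in range(max_iter):
        # Current members are carried forward, so the beam never regresses.
        # Using a set de-duplicates identical sequences before pruning.
        pool = set(beam)
        for member in beam:
            for m in motifs:
                for i in range(len(member) - len(m) + 1):
                    pool.add(substitute(member, m, i))
        # Keep the global top beam_size sequences by absolute loss.
        beam = sorted(pool, key=lambda s: (loss_of(s), s))[:beam_size]
        new_best = loss_of(beam[0])
        if best - new_best < tol:  # best member barely improved, so stop
            break
        best = new_best
    return beam[:min(n_best, len(beam))]

top = beam_sketch("A" * 12, y=2.0, motifs=["GATA"], beam_size=3, n_best=2)
```

With beam_size=1 the loop reduces to the greedy procedure, since only the single best sequence survives each round.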

Parameters

model: torch.nn.Module: A PyTorch model to use for making predictions. These models can take in any number of inputs and make any number of outputs. The additional inputs must be specified in the args parameter.
X: torch.Tensor, shape=(1, len(alphabet), length): A one-hot encoded sequence to use as the base for design. This must be a single sequence; the leading dimension of size 1 is kept for broadcasting.
y: torch.Tensor or list of torch.Tensors or None: A tensor or list of Tensors providing the desired output from the model. The type and shape must be compatible with the provided loss function and comparable to the output from model. Each tensor should have a shape of (1, n) where n is the number of outputs from the model. The first dimension is 1 to make broadcasting work correctly. If None, simply choose the edits that yield the strongest response from the model. Default is None.
motifs: list of strings or None: A list of strings where each string is a motif that can be inserted into the sequence. These strings will be one-hot encoded according to the provided alphabet. If None, use the provided alphabet as the motifs to only change one character at a time. Default is None.
loss: function, optional: This function must take in y and y_hat where y is the desired output from the model and y_hat is the current prediction from the model given the substitutions. By default, this is the torch.nn.MSELoss().
reverse_complement: bool, optional: Whether to augment the provided list of motifs with their reverse complements. This will double the runtime. Default is True.
input_mask: torch.Tensor or None, optional: A mask on input positions that can be the start of substitution. Any motif can be substituted in starting at each allowed position even if the contiguous span of the mask is shorter than the motif. True means that a motif can be substituted in starting at that position and False means that it cannot be. Default is None.
output_mask: torch.Tensor or None, optional: A mask on the outputs from the model to consider. True means to include the outputs in the loss, False means to exclude those outputs from the loss. If None, use all outputs. Default is None.
beam_size: int, optional: The number of complete sequences to keep in the beam each round. Setting this to 1 recovers greedy_substitution. Larger values explore more trajectories at a cost that scales linearly with beam_size. Default is 4.
n_best: int, optional: The number of sequences to return at the end, ranked from the lowest loss to the highest loss. This should be no larger than beam_size; a larger value is clamped to the number of distinct sequences in the final beam. Default is 1.
tol: float, optional: A threshold on the amount of improvement necessary according to loss, where the procedure will stop once the round-over-round improvement of the best beam member is below it. Default is 1e-3.
max_iter: int, optional: The maximum number of iterations (edits) to run before terminating the procedure. Set to -1 for no limit. Default is -1.
args: tuple or list or None, optional: An optional set of additional arguments to pass into the model. If provided, each element in the tuple or list is one input to the model and the element must be formatted to be the same batch size as X. If None, no additional arguments are passed into the forward function. Default is None.
alphabet: set or tuple or list, optional: A pre-defined alphabet where the ordering of the symbols is the same as the index into the returned tensor, i.e., for the alphabet ['A', 'B'] the returned tensor will have a 1 at index 0 if the character was 'A'. Characters outside the alphabet are ignored and none of the indexes are set to 1. This is not necessary or used if a one-hot encoded tensor is provided for the motif. Default is ['A', 'C', 'G', 'T'].
batch_size: int, optional: The number of examples to make predictions for at a time. Default is 32.
device: str or torch.device or None, optional: The device to move the model and batches to when making predictions. If None, use CUDA when available and fall back to CPU otherwise. Default is None.
verbose: bool, optional: Whether to display a progress bar during predictions. Default is False.

Returns

X: torch.Tensor, shape=(n_best, len(alphabet), length): The designed sequences, ranked from lowest loss to highest loss.

tangermeme.design.greedy_marginalize.greedy_marginalize(model: Module, X: Tensor, y: Tensor | list[Tensor], motifs: list[str], loss: Callable[[...], Any] = MSELoss(), max_spacing: int = 12, reverse_complement: bool = True, output_mask: Tensor | None = None, tol: float = 0.001, max_iter: int = -1, args: tuple | None = None, alphabet: list[str] = ['A', 'C', 'G', 'T'], batch_size: int = 32, device: str | device | None = None, verbose: bool = False) → Tensor

Greedily builds a construct and evaluates it using marginalizations.

This approach attempts to find a set of motifs and their orientations and spacings (a “construct”) that yield a desired objective. Rather than editing an initial sequence, this approach is just trying to greedily build a construct and evaluate its performance by marginalizing all other positions. Accordingly, rather than trying every motif at every position, it only tries motifs at all positions within a given spacing from the flanks of the construct that has been built so far.

The algorithm proceeds as follows. First, it implants each motif in the middle of each sequence and keeps the one that achieves the best improvement. Then, each motif is tried at every position between the left edge of the construct minus max_spacing and the right edge of the construct plus max_spacing. With the default max_spacing of 12, this span is 24 nucleotides longer than the construct, so after the first round 24 + the original motif length positions are considered. This allows subsequent motifs to edit parts of previously implanted motifs while still significantly restricting the search space.
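The span arithmetic in this paragraph can be checked with a short sketch. `search_span` is a hypothetical helper; clipping the span to the sequence boundaries is an assumption of the sketch, and how the library treats motifs that would overhang the span is not specified here.

```python
def search_span(left, right, seq_len, max_spacing=12):
    # [left, right) is the region currently covered by the construct.
    lo = max(0, left - max_spacing)          # assumed clip at the sequence start
    hi = min(seq_len, right + max_spacing)   # assumed clip at the sequence end
    return lo, hi

# An 8 bp motif implanted at positions 1000..1007 of a 2114 bp sequence.
lo, hi = search_span(left=1000, right=1008, seq_len=2114)
print(lo, hi, hi - lo)  # 988 1020 32
```

The width of 32 is the 8 bp motif plus 12 bp on each side, matching the "24 + original motif length" count in the text.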

This method is useful when you want to generally know what set of motifs and their other properties achieve your desired goal, without considering a specific sequence. For instance, if you train a model to predict accessibility and then want to design accessible regions generally, this approach may be more appropriate than greedy_substitution.

Parameters

model: torch.nn.Module: A PyTorch model to use for making predictions. These models can take in any number of inputs and make any number of outputs. The additional inputs must be specified in the args parameter.
X: torch.Tensor, shape=(1, len(alphabet), length): A one-hot encoded sequence to use as the base for design. This must be a single sequence; the leading dimension of size 1 is kept for broadcasting.
y: torch.Tensor or list of torch.Tensors: A tensor or list of Tensors providing the desired output from the model. The type and shape must be compatible with the provided loss function and comparable to the output from model. Each tensor should have a shape of (1, n) where n is the number of outputs from the model. The first dimension is 1 to make broadcasting work correctly.
motifs: list of strings: A list of strings where each string is a motif that can be inserted into the sequence. These strings will be one-hot encoded according to the provided alphabet.
loss: function, optional: This function must take in y and y_hat where y is the desired output from the model and y_hat is the current prediction from the model given the substitutions. By default, this is the torch.nn.MSELoss().
reverse_complement: bool, optional: Whether to augment the provided list of motifs with their reverse complements. This will double the runtime. Default is True.
max_spacing: int, optional: The maximum spacing on either side of the existing construct at which a new motif can be implanted. Each iteration considers positions in a window extending max_spacing nucleotides past the current left and right flanks of the construct. Default is 12.
output_mask: torch.Tensor or None, optional: A mask on the outputs from the model to consider. True means to include the outputs in the loss, False means to exclude those outputs from the loss. If None, use all outputs. Default is None.
tol: float, optional: A threshold on the amount of improvement required according to loss. The procedure stops once the improvement from one round to the next falls below this value. Default is 1e-3.
max_iter: int, optional: The maximum number of iterations to run before terminating the procedure. Set to -1 for no limit. Default is -1.
args: tuple or list or None, optional: An optional set of additional arguments to pass into the model. If provided, each element in the tuple or list is one input to the model and the element must be formatted to be the same batch size as X. If None, no additional arguments are passed into the forward function. Default is None.
alphabet: set or tuple or list, optional: A pre-defined alphabet where the ordering of the symbols is the same as the index into the returned tensor, i.e., for the alphabet ['A', 'B'] the returned tensor will have a 1 at index 0 if the character was 'A'. Characters outside the alphabet are ignored and none of the indexes are set to 1. This is not necessary or used if a one-hot encoded tensor is provided for the motif. Default is ['A', 'C', 'G', 'T'].
batch_size: int, optional: The number of examples to make predictions for at a time. Default is 32.
device: str or torch.device or None, optional: The device to move the model and batches to when making predictions. If None, use CUDA when available and fall back to CPU otherwise. Default is None.
verbose: bool, optional: Whether to display a progress bar during predictions. Default is False.

Returns

X: torch.Tensor, shape=(len(alphabet), length): The designed construct.
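The alphabet convention used when motifs are one-hot encoded (row index follows alphabet order, characters outside the alphabet leave their column all zeros) can be sketched with nested lists. `one_hot` here is a hypothetical pure-Python helper for illustration, not the library's encoder, which returns tensors.

```python
def one_hot(seq, alphabet=("A", "C", "G", "T")):
    index = {ch: i for i, ch in enumerate(alphabet)}
    # Rows follow the alphabet order; columns are sequence positions,
    # giving a (len(alphabet), len(seq)) layout.
    encoded = [[0] * len(seq) for _ in alphabet]
    for pos, ch in enumerate(seq):
        if ch in index:  # characters outside the alphabet are ignored
            encoded[index[ch]][pos] = 1
    return encoded

print(one_hot("ACNT"))
# [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 0], [0, 0, 0, 1]]
```

The third column is all zeros because 'N' is not in the alphabet.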