Tutorial A1: Sequence Manipulation

Being able to manipulate sequences is central to analyzing what machine learning models have learned because they allow you to easily ask hypotheticals. Many common downstream analyses, such as in silico marginalizations and feature attributions rely on sequence manipulations to handle motif insertion or construct background sequences, respectively. But, these forms of manipulations are also invaluable tools for dissecting individual regions and doing sequence design. For instance, given a locus of interest, one can shuffle regions (such as known motifs) and see what impact the shuffling has on the prediction. Alternatively, one can iterative insert motifs into a sequence to see whether a predictive model yields a desired result.

Given the foundational nature of these operations, they are central to tangermeme and are implemented in tangermeme.ersatz. Why ersatz?

407fcb445f8c4888a42877c8f5857af9

Accordingly, “ersatz sequences” in this context are those that are not natural sequences – either because they are alterations of a native sequence, or they are fully synthetic.

Substitutions

The simplest operation is the substitution, where some positions are switched out for others. Consider the following where an AP-1 motif (https://jaspar.elixir.no/matrix/MA1144.1/) is inserted into uniformly random nucleotides. Colloquially, “substitutions” are sometimes called “insertions”, and while there are similarities between the two they are not exactly the same thing. Let’s begin by randomly generating one sequence. Following the PyTorch format, the one-hot encoded sequences should have the shape (batch_size, alphabet_size, sequence_length). Generation can be made deterministic using the random_state parameter.

As a note, all of the operations in tangermeme.ersatz happen on one-hot encoded tensors but, for simplicity of seeing the effect of each operation in this tutorial, we will convert these tensors back to characters.

[1]:

from tangermeme.utils import random_one_hot
from tangermeme.utils import characters

X = random_one_hot((1, 4, 20), random_state=0)
characters(X[0])

[1]:

'ATCATTTTCTCGATGAAAGC'

Now, let’s put the motif in. We can either pass in a string, in which case the same motif is added in to each sequence in the batch at the same position, or we can pass in a one-hot encoding. If the one-hot encoding has shape (1, alphabet_size, motif_size), the same motif is added to each sequence similarly to if a string were used, but if the one-hot encoding has the shape (batch_size, alphabet_size, motif_size) then the i-th sequence in the batch has the i-th one-hot encoding substituted in.

[2]:

from tangermeme.ersatz import substitute

X_ap1 = substitute(X, "TGACTCA")
characters(X_ap1[0])

[2]:

'ATCATTTTGACTCAGAAAGC'

By default, the substitution will happen in the middle of the sequence. If you’d like to control where it happens you can pass in a parameter start with the index to start the substitution.

[3]:

X_ap1 = substitute(X, "TGACTCA", start=2)
characters(X_ap1[0])

[3]:

'ATTGACTCATCGATGAAAGC'

If we have a one-hot encoding rather than a string sequence, we can pass that in instead without needing to first convert it back to characters.

[4]:

from tangermeme.utils import one_hot_encode

motif = one_hot_encode("TGACTCA").unsqueeze(0)

X_ap1 = substitute(X, motif)
characters(X_ap1[0])

[4]:

'ATCATTTTGACTCAGAAAGC'

Multisubstitutions

Sometimes one would like to make multiple substitutions in the same sequence given some spacing between substitutions. Although this could be achieved by calling substitute multiple times, we can provide a convenient wrapper for this with the multisubstitue function. This function has a very similar signature to substitute except that it takes in a list of motifs and the spacing between them. We can try it out first given no spacing between the two motifs.

[5]:

from tangermeme.ersatz import multisubstitute

X_ap12 = multisubstitute(X, ["TGACTCA", "TGACTCA"], 0, start=2)
characters(X_ap12[0])

[5]:

'ATTGACTCATGACTCAAAGC'

Now, let’s add a little bit of spacing.

[6]:

X_ap12 = multisubstitute(X, ["TGACTCA", "TGACTCA"], 2, start=2)
characters(X_ap12[0])

[6]:

'ATTGACTCATCTGACTCAGC'

Finally, if we have more than two motifs we can optionally provide spacing values between each set of motifs. Note that if we keep the spacing value as an integer but provide more than two motifs that the same spacing is used between all motif pairs.

[7]:

X_ap12 = multisubstitute(X, ["TGA", "TCA", "TGA", "TGAC"], [0, 2, 1], start=2)
characters(X_ap12[0])

[7]:

'ATTGATCACTTGATTGACGC'

Insertions

Related to substitutions are insertions. As mentioned before, sometimes people say “insertions” when what they mean are “substitutions”, but insertions involve adding the new sequence without modifying or deleting any of the existing sequence. Essentially, the returned sequence will be longer than the original sequence because it now additionally contains the sequence being added. In contrast, substitutions preserve the length of the sequence because characters are explicitly being changed from those in the original sequence to those in the new sequence. Let’s see that in action.

[8]:

from tangermeme.ersatz import insert

X_ap1 = insert(X, "TGACTCA")
characters(X_ap1[0])

[8]:

'ATCATTTTCTTGACTCACGATGAAAGC'

[9]:

X.shape[-1], len("TGACTCA"), len(characters(X_ap1[0]))

[9]:

(20, 7, 27)

Deletion

In direct contrast to insertions are deletions: insertions involve adding new sequence to an existing sequence and deletions involve deleting existing sequence and not replacing it with anything. Accordingly, one only passes in sequences and the coordinates to delete instead of passing in any form of new sequence.

[10]:

from tangermeme.ersatz import delete

X_del = delete(X, start=0, end=5)
characters(X[0]), characters(X_del[0])

[10]:

('ATCATTTTCTCGATGAAAGC', 'TTTCTCGATGAAAGC')

Next, let’s try deleting a portion from the middle.

[11]:

X_del = delete(X, start=10, end=15)
characters(X_del[0])

[11]:

'ATCATTTTCTAAAGC'

[12]:

X.shape[-1], X_del.shape[-1]

[12]:

(20, 15)

Randomize

Deleting positions is one way that we can remove information from a sequence. However, this can pose some issues – both practically, in terms of needing a sequence of the same length when using machine learning models, and conceptually, in that removing a motif entirely isn’t really a biologically plausible alternative to observed sequence. An alternative approach to deleting positions is to replace those positions with randomly generated characters. This would keep the sequence the same length but remove the motif.

Here, we replace the first five positions with randomly generated characters and keep the remaining characters the same.

[13]:

from tangermeme.ersatz import randomize

X_rand = randomize(X, start=0, end=5, random_state=0)
characters(X[0]), characters(X_rand[0, 0])

[13]:

('ATCATTTTCTCGATGAAAGC', 'GGGGCTTTCTCGATGAAAGC')

Frequently, when using randomly generated sequences, one wishes to generate many randomizations so that one can average over the randomness induces by the sequences. To make this easy, tangermeme allows you to pass in a parameter n specifying the number of randomizations to perform, and returns a tensor with one more dimension than the original tensor whenever randomness is involved. Specifically, the returned tensor will have shape (n_orig_sequences, n_shuffles, alphabet_size, seq_len) so that you can shuffle each of many sequence many times.

[14]:

X_rand = randomize(X, start=0, end=5, n=10, random_state=0)

for i in range(10):
    print(characters(X_rand[0, i]))

GGGGCTTTCTCGATGAAAGC
GCTTCTTTCTCGATGAAAGC
TGGTATTTCTCGATGAAAGC
AATTTTTTCTCGATGAAAGC
TTCTATTTCTCGATGAAAGC
GATGCTTTCTCGATGAAAGC
CTCGATTTCTCGATGAAAGC
GGGTGTTTCTCGATGAAAGC
CCGAGTTTCTCGATGAAAGC
GAACCTTTCTCGATGAAAGC

Shuffle

A problem with independently generating sequence at each position is that the sampled sequences might have unrealistic compositions. For instance, when you use uniformly randomly generated sequences, the GC content is fairly high compared to naturally occuring sequences. When trying to create backgrounds for specific loci, you might prefer to instead shuffle the positions to ensure that the composition of these backgrounds are the same, while any types of motif are disrupted.

We can use the shuffle function to completely shuffle a batch of sequences. Each returned sequence will have the same composition as the original sequence that was shuffled, but a different ordering of the elements. In the genomics setting, this means that the same number of each nucleotide will be present but motifs will likely be removed.

[15]:

from tangermeme.ersatz import shuffle

X_shuf = shuffle(X, random_state=0)
characters(X[0]), characters(X_shuf[0, 0])

[15]:

('ATCATTTTCTCGATGAAAGC', 'GTCCCATTTCTGTTAGAAAA')

Similar to the randomize function, if we want to shuffle a sequence many times, we can use the parameter n.

[16]:

X_shuf = shuffle(X, n=10, random_state=0)

for i in range(10):
    print(characters(X_shuf[0, i]))

GTCCCATTTCTGTTAGAAAA
GTGACACACAATACTTTGTT
ATATGCCTAATCAGTTATGC
GATCAATAGAGCTATCCTTT
TTCCTGCAAATTGTCTAAGA
ATCCATTCGGGACTTTATAA
GGTTTACTAACGCCAATTAT
TGTAAGTACCCTATTCTGAA
TTTTTATACAGCAGAGACTC
CAAACTTTCGCTTAGTATAG

Furthermore, we can restrict our shuffling to only a portion of the sequence. This can be valuable if you want to knock out a portion of the sequence, such as a known motif or broader regulatory region. All you need to do is specify the start (inclusive) and end (not inclusive) positions that the shuffling should occur.

[17]:

X_shuf = shuffle(X, start=5, end=15, random_state=3)
characters(X[0]), characters(X_shuf[0, 0])

[17]:

('ATCATTTTCTCGATGAAAGC', 'ATCATCTTTGGATCTAAAGC')

Dinucleotide Shuffle

In the genomics setting the CG dinucleotide plays an outsized role compared to other dinucleotides and so is significantly underrepresented in the genome. Because normal shuffling will disrupt dinucleotide content, and hence change the proportion of CGs in the sequence, sometimes one wants to use a shuffling strategy that explicitly preserves dinucleotide content.

In tangermeme, the dinucleotide_shuffle function operates similarly to the shuffle function. For instance, we can shuffle entire sequences:

[18]:

from tangermeme.ersatz import dinucleotide_shuffle

X_shuf = dinucleotide_shuffle(X, random_state=0)
characters(X[0]), characters(X_shuf[0, 0])

[18]:

('ATCATTTTCTCGATGAAAGC', 'ATCATTTCTCGATTGAAAGC')

We can generate many shuffles using the n parameter.

[19]:

X_shuf = dinucleotide_shuffle(X, n=10, random_state=0)

for i in range(10):
    print(characters(X_shuf[0, i]))

ATCATTTCTCGATTGAAAGC
ATCTCAAATTTTCGATGAGC
ATTTTCTCAATCGAATGAGC
AAATTTTCTCATCGATGAGC
ATCTTCATCGATTTGAAAGC
ATCAATTCTTTCGAATGAGC
AATTCTCAATTCGATTGAGC
AAATCTCATTTCGATTGAGC
ATCATTTCTTCGAAATGAGC
ATCTTTCAATTCGAATGAGC

And we can shuffle only a portion of the sequence.

[20]:

X_shuf = dinucleotide_shuffle(X, start=5, end=15, random_state=3)
characters(X[0]), characters(X_shuf[0, 0])

[20]:

('ATCATTTTCTCGATGAAAGC', 'ATCATTCTTTCGATGAAAGC')

As a note, the strategy for doing dinucleotide shuffling that is implemented will always keep the first and last nucleotides the same. Depending on the sequence composition and the length of the region being shuffled, it can be impossible to produce new dinucleotide shuffled sequences. Passing in verbose will raise a warning when at least one position (other than the first and last positions) are always the same character. Regardless of the value of verbose, an error will be raised when all returned sequences are identical.

[21]:

X_shuf = dinucleotide_shuffle(X, start=10, end=15, random_state=0, n=100, verbose=True)
characters(X[0]), characters(X_shuf[0, 0])

Warning: At least one position in dinucleotide shuffle is identical across all positions.

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[21], line 1
----> 1 X_shuf = dinucleotide_shuffle(X, start=10, end=15, random_state=0, n=100, verbose=True)
      2 characters(X[0]), characters(X_shuf[0, 0])

File /shared/aemurphy/tangermeme/tangermeme/ersatz.py:612, in dinucleotide_shuffle(X, start, end, n, random_state, verbose)
    610 X_shufs = []
    611 for i in range(X.shape[0]):
--> 612   insert_ = _dinucleotide_shuffle(X[i, :, start:end], n_shuffles=n, 
    613               random_state=random_state+i, verbose=verbose)
    615    X_shuf = torch.clone(X[i:i+1]).repeat(n, 1, 1)
    616    X_shuf[:, :, start:end] = insert_

File /shared/aemurphy/tangermeme/tangermeme/ersatz.py:552, in _dinucleotide_shuffle(X, n_shuffles, random_state, verbose)
    549            print("Warning: At least one position in dinucleotide shuffle " +
    550                    "is identical across all positions.")
    551 if conserved.max(dim=0).values.min() == n_shuffles and n_shuffles > 1:
--> 552   raise ValueError("All dinucleotide shuffles yield identical " +
    553            "sequences, potentially due to a lack of diversity in sequence.")
    555 return shuffled_sequences

ValueError: All dinucleotide shuffles yield identical sequences, potentially due to a lack of diversity in sequence.

Reverse Complement

A common approach to training/testing genomic deep learning models is to use the reverse complement of a DNA sequence.

In tangermeme this can be done with character represnetations of DNA or with one-hot encodings using reverse_complement

[22]:

from tangermeme.ersatz import reverse_complement

reverse_complement("ATCG",ohe=False)

[22]:

'CGAT'

In the same manner as one_hot_encode, reverse_complement is designed to be used on one sequence (Tensor shape (alphabet length,sequence length))

[23]:

motif = one_hot_encode("TGACTCA")
print(motif.shape)
rev_comp_motif = reverse_complement(motif)
print(rev_comp_motif.shape)
characters(rev_comp_motif)

torch.Size([4, 7])
torch.Size([4, 7])

[23]:

'TGAGTCA'

This function can also handle missing values in DNA sequences, represented by ‘N’ or by [0,0,0,0]:

[24]:

reverse_complement("ATCGN",ohe=False)

[24]:

'NCGAT'

[25]:

motif = one_hot_encode("TGACTCAN")
characters(reverse_complement(motif),allow_N=True)

[25]:

'NTGAGTCA'