{ "cells": [ { "cell_type": "markdown", "id": "2f3e190e-21ec-4587-be02-a0751578a5b8", "metadata": {}, "source": [ "### Tutorial C1: I/O and Data Loading\n", "\n", "Like most machine learning applications, a crucial component of genomics/proteomics workflows is being able to easily load data stored in common formats and being able to cross-reference multiple files. Accordingly, `tangermeme` has built-in functions for reading most common data types with the goal being to make loading data for training and evaluating machine learning models simple. These functions are implemented in `tangermeme.io`." ] }, { "cell_type": "markdown", "id": "72081d9a-e3a9-4c0c-ae7d-5732e010706f", "metadata": {}, "source": [ "#### MEME Files\n", "\n", "MEME files are text files that contain position-weight matrices (PWMs) that contain the probability (or sometimes frequency) of each position being each character in a motif. They are one of the most common data formats for storing protein binding motifs due to their simplicity. In `tangermeme`, we can read a MEME file using `read_meme` and get a dictionary where the keys are the names of the motif and the value is a numpy array of PWM." ] }, { "cell_type": "code", "execution_count": 1, "id": "b5d87f48-c7cd-45cb-9570-cf4177edecae", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "dict_keys(['MEOX1_homeodomain_1', 'HIC2_MA0738.1', 'GCR_HUMAN.H11MO.0.A', 'FOSL2+JUND_MA1145.1', 'TEAD3_TEA_2', 'ZN263_HUMAN.H11MO.0.A', 'PAX7_PAX_2', 'SMAD3_MA0795.1', 'MEF2D_HUMAN.H11MO.0.A', 'FOXQ1_MOUSE.H11MO.0.C', 'TBX19_MA0804.1', 'Hes1_MA1099.1'])" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from tangermeme.io import read_meme\n", "\n", "memes = read_meme(\"../../tests/data/test.meme\")\n", "memes.keys()" ] }, { "cell_type": "code", "execution_count": 2, "id": "32e22384-69ea-4406-834c-a9649d9594aa", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[0.47 , 0.046, 0.354, 0.13 ],\n", " [0.092, 0.01 , 0.838, 0.06 ],\n", " [0.444, 0.156, 0.218, 0.182],\n", " [0.898, 0.026, 0.028, 0.048],\n", " [0.012, 0.948, 0.026, 0.014],\n", " [0.762, 0.028, 0.09 , 0.12 ],\n", " [0.158, 0.298, 0.352, 0.192],\n", " [0.632, 0.368, 0. , 0. ],\n", " [0.402, 0.284, 0.218, 0.096],\n", " [0.18 , 0.054, 0.036, 0.73 ],\n", " [0.014, 0.022, 0.954, 0.01 ],\n", " [0.052, 0.038, 0.024, 0.886],\n", " [0.19 , 0.266, 0.182, 0.362],\n", " [0.062, 0.848, 0.006, 0.084],\n", " [0.176, 0.352, 0.026, 0.446]])" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "memes['GCR_HUMAN.H11MO.0.A']" ] }, { "cell_type": "markdown", "id": "406827db-4046-4925-b6a6-412000bdba4b", "metadata": {}, "source": [ "Sometimes, we might have a very large MEME file, like one containing an entire motif database, and so it would be useful to only load the first few motifs. This is usually the case when you are trying to prototype or debug some code and want to work faster initially. We can limit the number of motifs read in using the `n_motifs` parameter." ] }, { "cell_type": "code", "execution_count": 3, "id": "eb07b924-b71c-4c6f-bc4c-1445742b9a29", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "dict_keys(['MEOX1_homeodomain_1', 'HIC2_MA0738.1'])" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "memes = read_meme(\"../../tests/data/test.meme\", n_motifs=2)\n", "memes.keys()" ] }, { "cell_type": "markdown", "id": "22886198-d00a-4b9a-ab1d-967257b8d30b", "metadata": {}, "source": [ "#### Extract Loci\n", "\n", "Potentially, the most important IO function is the one that loads data examples for training or evaluating machine learning models. In our setting, this involves taking in a FASTA file, a BED file, and optionally a list of bigwig files, and returns examples centered at the coordinates in the BED file. Specifically, `extract_loci` will return examples centered at the midpoint of the coordinates in the bed file with a fixed `in_window` -- when extracting from the FASTA file -- and a fixed `out_window` -- when extracting from the bigwig files. Basically, this function will construct examples from the FASTA and optionally bigwig files that are properly aligned. The sequence is one-hot encoded according to an alphabet (by default the nucleotide alphabet) and returned as a `(n_loci, alphabet_size, in_window)` dimensional tensor of dtype int8 to save on memory.\n", "\n", "If only the FASTA and BED files are provided, only one-hot encoded sequences will be returned centered at the midpoints specified in the BED file." ] }, { "cell_type": "code", "execution_count": 4, "id": "2f58791b-eaf3-4779-bffa-4a18d0238e9c", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "tensor([[[1, 1, 0, 0, 0, 1, 0, 0, 0, 1],\n", " [0, 0, 1, 0, 0, 0, 1, 0, 0, 0],\n", " [0, 0, 0, 0, 1, 0, 0, 0, 1, 0],\n", " [0, 0, 0, 1, 0, 0, 0, 1, 0, 0]],\n", "\n", " [[0, 0, 1, 0, 0, 1, 0, 0, 1, 0],\n", " [1, 0, 0, 1, 1, 0, 0, 0, 0, 1],\n", " [0, 0, 0, 0, 0, 0, 0, 1, 0, 0],\n", " [0, 1, 0, 0, 0, 0, 1, 0, 0, 0]],\n", "\n", " [[0, 1, 0, 0, 0, 0, 1, 0, 1, 0],\n", " [0, 0, 1, 0, 1, 0, 0, 1, 0, 0],\n", " [1, 0, 0, 0, 0, 0, 0, 0, 0, 0],\n", " [0, 0, 0, 1, 0, 1, 0, 0, 0, 1]],\n", "\n", " [[0, 0, 1, 0, 0, 0, 0, 0, 1, 0],\n", " [1, 0, 0, 1, 0, 0, 1, 0, 0, 0],\n", " [0, 0, 0, 0, 0, 1, 0, 0, 0, 0],\n", " [0, 1, 0, 0, 1, 0, 0, 1, 0, 1]],\n", "\n", " [[1, 0, 0, 0, 1, 0, 1, 0, 0, 0],\n", " [0, 1, 0, 1, 0, 0, 0, 0, 1, 0],\n", " [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],\n", " [0, 0, 1, 0, 0, 1, 0, 1, 0, 1]]], dtype=torch.int8)" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from tangermeme.io import extract_loci\n", "\n", "extract_loci(\"../../tests/data/test.bed\", \"../../tests/data/test.fa\", in_window=10)" ] }, { "cell_type": "markdown", "id": "181b90cf-8937-444d-b559-d0eaf0afeefd", "metadata": {}, "source": [ "If we have multiple BED files that we want to load from we can pass in a list of BED files. The coordinates specified in the BED files will be interleaved until one of the files is exhausted, and then the remainder of the coordinates from the other file will be used." ] }, { "cell_type": "code", "execution_count": 5, "id": "26aa288a-749d-4fed-8c7c-7ed1f13bf10d", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "torch.Size([14, 4, 10])" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "extract_loci([\"../../tests/data/test.bed\", \"../../tests/data/test2.bed\"], \"../../tests/data/test.fa\", in_window=10).shape" ] }, { "cell_type": "markdown", "id": "adfb05a4-aea9-4292-b18d-bc7c7d8f9ee9", "metadata": {}, "source": [ "Similar to the `read_meme` function, we can also limit the number of loci that are returned using the `n_loci` parameter, which will result in only the first entries in the file -- or the interleaved list of loci. This is useful when trying to debug or prototype functions that rely on these examples." ] }, { "cell_type": "code", "execution_count": 6, "id": "719a18fb-68c7-4bcf-b050-c4391291b098", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "torch.Size([6, 4, 10])" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "extract_loci([\"../../tests/data/test.bed\", \"../../tests/data/test2.bed\"], \"../../tests/data/test.fa\", in_window=10, n_loci=6).shape" ] }, { "cell_type": "markdown", "id": "f80b5772-3a65-4052-b7ac-3832ae8546e9", "metadata": {}, "source": [ "If we want to extract nucleotide sequence and also the corresponding signal centered at these coordinates, we can provide a list of `signals` and get back two tensors -- one of sequence centered at the loci, and one of signal centered at the same loci. The windows for the sequence and signals can be different if desired, with `in_window` controlling the size of the sequence being returned and `out_window` controlling the size of the signal being returned." ] }, { "cell_type": "code", "execution_count": 7, "id": "6ae316ca-bc4a-4e1a-a574-4eb132386a54", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(torch.Size([5, 4, 10]), torch.Size([5, 1, 6]))" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X, y = extract_loci(\"../../tests/data/test.bed\", \"../../tests/data/test.fa\", signals=[\"../../tests/data/test.bw\"], \n", " in_window=10, out_window=6)\n", "\n", "X.shape, y.shape" ] }, { "cell_type": "markdown", "id": "4c4dc256-6514-49de-bc77-b46ce5783fff", "metadata": {}, "source": [ "$y$ will be of shape `(batch_size, n_signals, out_window)`, even when only one signal is provided. If multiple signals files are provided, the extracted signal in the returned tensor has the same order as the signal files provided in the list.\n", "\n", "Sometimes when training a sequence-based machine learning model we would like to add jitter to the inputs. Basically, even when we extract examples at the midpoint of each pair of coordinates in the bed file, we would like the actual example to be offset by some amount. This can be viewed as a form of data augmentation and can help prevent the models from overfitting to the exact positioning of the signal. For example, if a model is trained on only peaks it may end up learning to predict a peak-like shape regardless of the input because the center of the region being predicted is, by definition, a peak.\n", "\n", "The `extract_loci` function allows you to pass in a `max_jitter` argument. Importantly, using this argument does not extract regions with a random jitter added to the position. Rather, the function will extract examples with a wider `in_window` and `out_window` value, with `2*max_jtiter` added to each. This will allow you to write data generators that randomly sample windows from the selected examples so that you can have a different jitter at each iteration while still minimizing the memory cost of extracting all the examples." ] }, { "cell_type": "code", "execution_count": 8, "id": "5916771f-2d74-40b3-bb4d-fa72431b03f1", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(torch.Size([5, 4, 18]), torch.Size([5, 1, 14]))" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X, y = extract_loci(\"../../tests/data/test.bed\", \"../../tests/data/test.fa\", signals=[\"../../tests/data/test.bw\"], \n", " in_window=10, out_window=6, max_jitter=4)\n", "\n", "X.shape, y.shape" ] }, { "cell_type": "markdown", "id": "0081e65f-caa0-42c9-8acf-094d4718fdb4", "metadata": {}, "source": [ "In your data generation function, you would now want to extract windows of size 10 from the input (potentially by randomly selecting a number between 0 and `2*max_jitter` as the starting point and extracting the next 10 positions) and windows of size 6 from the output. This will give you all the benefits of jittering in a practical, memory-efficient solution." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.7" } }, "nbformat": 4, "nbformat_minor": 5 }