io

tangermeme.io.extract_loci(loci: str | list[str] | DataFrame | list[DataFrame], sequences: str | Fasta | dict, signals: list | None = None, in_signals: list | None = None, chroms: list[str] | None = None, in_window: int = 2114, out_window: int = 1000, max_jitter: int = 0, min_counts: float | None = None, max_counts: float | None = None, target_idx: int = 0, n_loci: int | None = None, summits: bool = False, alphabet: list[str] = ['A', 'C', 'G', 'T'], ignore: list[str] = ['N'], exclusion_lists: list | None = None, return_mask: bool = False, verbose: bool = False) → tuple

Extract sequence and signal information for each provided locus.

This function will take in a set of loci, sequences, and optionally signals, and return the sequences and signals at each of the loci. Each of these parameters can be a filename, which is loaded internally, or an appropriate Python object (see below for details). The nomenclature in/out refers to the expected inputs and outputs of the downstream machine learning model, not this function.

For each locus a sequence window of size in_window will be extracted from the sequences file and each of the in_signals files if provided, and a window of size out_window will be extracted from each of the signals files if provided. These windows are centered at the middle of the provided regions but will all be of the same size, regardless of the size of the peak.

If max_jitter is greater than 0, both the input and output windows are expanded by max_jitter on each side. The results are not actually jittered, but the expanded window allows downstream data generators to create jittered data while reducing the memory footprint of the returned data.
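
The window arithmetic described above can be sketched in plain Python. This is an illustration under stated assumptions, not the library's code: window_bounds is a hypothetical helper, the center is assumed to be the integer midpoint of the region, and the window is assumed to extend window // 2 on each side before jitter is added. The optional summit_offset argument stands in for the narrowPeak summit column used when summits=True.

```python
def window_bounds(start, end, window, max_jitter=0, summit_offset=None):
    """Return (lo, hi) for a fixed-size window around a locus.

    Assumptions in this sketch: the center is the integer midpoint of the
    region (or start + summit_offset when given), and the window spans
    window // 2 on each side, plus max_jitter.
    """
    if summit_offset is None:
        mid = start + (end - start) // 2
    else:
        mid = start + summit_offset
    half = window // 2
    return mid - half - max_jitter, mid + half + max_jitter


# A 300 bp peak and a 50 bp peak yield windows of identical width.
in_lo, in_hi = window_bounds(10000, 10300, window=2114, max_jitter=128)
out_lo, out_hi = window_bounds(10000, 10300, window=1000, max_jitter=128)
small_lo, small_hi = window_bounds(50000, 50050, window=2114, max_jitter=128)

print(in_hi - in_lo)        # 2114 + 2 * 128 = 2370
print(out_hi - out_lo)      # 1000 + 2 * 128 = 1256
print(small_hi - small_lo)  # 2370
```

The widths match the in_window+2*max_jitter and out_window+2*max_jitter shapes listed under Returns.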

There are a few reasons that the returned elements may not match one-to-one with the provided loci:

If any of the coordinates fall off the end of chromosomes after accounting for jitter, the locus will be removed.

If any of the loci fall on chromosomes not in a provided list, they will be removed.

If min_counts or max_counts are specified and the locus has a number of counts not in those boundaries, the locus will be removed.

If exclusion lists are provided, they will be used to filter out loci that fall in 100bp chunks that also include any of the regions in any of the exclusion lists. For example, if one of the exclusion lists has an element that is

chr7 108 234

loci will be removed if any of their bp fall within chr7 100 300.
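
The 100 bp chunking in this example can be sketched as follows. This is a simplified illustration rather than the library's implementation: excluded_bins and is_excluded are hypothetical helpers, and regions are assumed to be half-open [start, end) intervals as in BED files.

```python
BIN = 100  # chunk size in bp, taken from the description above


def excluded_bins(exclusion_regions):
    """Collect every (chrom, bin_index) touched by an exclusion region."""
    bins = set()
    for chrom, start, end in exclusion_regions:
        # Assumption: regions are half-open [start, end), as in BED files.
        for b in range(start // BIN, (end - 1) // BIN + 1):
            bins.add((chrom, b))
    return bins


def is_excluded(locus, bins):
    """True if any bp of the locus falls in an excluded 100 bp chunk."""
    chrom, start, end = locus
    return any((chrom, b) in bins
               for b in range(start // BIN, (end - 1) // BIN + 1))


bins = excluded_bins([("chr7", 108, 234)])  # covers chr7 100-300

loci = [("chr7", 250, 280), ("chr7", 90, 101),
        ("chr7", 300, 400), ("chr8", 150, 160)]
kept = [locus for locus in loci if not is_excluded(locus, bins)]
print(kept)  # [('chr7', 300, 400), ('chr8', 150, 160)]
```

Note that the locus at chr7 250-280 is removed even though it does not overlap 108-234 directly, because it shares the 200-300 chunk with the excluded region.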

Parameters

loci: str or pandas.DataFrame or list/tuple of such: Either the path to a bed file or a pandas DataFrame object containing three columns: the chromosome, the start, and the end, of each locus to train on. Alternatively, a list or tuple of strings/DataFrames where the intention is to train on the interleaved concatenation, i.e., when you want to train on peaks and negatives.
sequences: str or dictionary: Either the path to a fasta file to read from or a dictionary where the keys are the unique set of chromosomes and the values are one-hot encoded sequences as numpy arrays or memory maps.
signals: list of strs or list of dictionaries or None, optional: A list of filepaths to bigwig files, where each filepath will be read using pybigtools, or a list of dictionaries where the keys are the same set of unique chromosomes and the values are numpy arrays or memory maps. If None, no signal tensor is returned. Default is None.
in_signals: list of strs or list of dictionaries or None, optional: A list of filepaths to bigwig files, where each filepath will be read using pybigtools, or a list of dictionaries where the keys are the same set of unique chromosomes and the values are numpy arrays or memory maps. If None, no tensor is returned. Default is None.
chroms: list or None, optional: A set of chromosomes to extract loci from. Loci in other chromosomes in the locus file are ignored. If None, all loci are used. Default is None.
in_window: int, optional: The input window size. Default is 2114.
out_window: int, optional: The output window size. Default is 1000.
max_jitter: int, optional: The maximum amount of jitter to add, in either direction, to the midpoints that are passed in. Default is 0.
min_counts: float or None, optional: The minimum number of counts, summed across the length of each example and across all tasks, needed for a locus to be kept. If None, no minimum. Default is None.
max_counts: float or None, optional: The maximum number of counts, summed across the length of each example and across all tasks, allowed for a locus to be kept. If None, no maximum. Default is None.
target_idx: int, optional: When specifying min_counts or max_counts, the single signal file to use when determining if a region has a number of counts in that range. Default is 0.
n_loci: int or None, optional: A cap on the number of loci to return. Note that this is not the number of loci that are considered. The difference is that some loci may be filtered out for various reasons, and those are not counted towards the total. If None, no cap. Default is None.
summits: bool, optional: Whether to return a region centered around the summit instead of the center between the start and end. If True, it will add the 10th column (index 9) to the start to get the center of the window, and so the data must be in narrowPeak format. Default is False.
alphabet: set or tuple or list, optional: A pre-defined alphabet where the ordering of the symbols is the same as the index into the returned tensor, i.e., for the alphabet ['A', 'B'] the returned tensor will have a 1 at index 0 if the character was 'A'. Characters outside the alphabet are ignored and none of the indexes are set to 1. Default is ['A', 'C', 'G', 'T'].
ignore: list, optional: A list of characters to ignore in the sequence, meaning that no bits are set to 1 in the returned one-hot encoding. Put another way, the sum across characters is equal to 1 for all positions except those where the original sequence is in this list. Default is ['N'].
exclusion_lists: list or None, optional: A list of strings of filenames to BED-formatted files containing exclusion lists, i.e., regions where overlapping loci should be filtered out. If None, no filtering is performed based on exclusion zones. Default is None.
return_mask: bool, optional: Whether to return a boolean tensor indicating, for each of the provided loci, whether it was kept or filtered out, e.g., because it fell off the edge of a chromosome or its signal did not fall within the specified boundaries. Default is False.
verbose: bool, optional: Whether to display a progress bar while loading. Default is False.
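
To make the alphabet and ignore parameters concrete, here is a minimal pure-Python sketch of the encoding they describe. The one_hot helper is hypothetical and uses nested lists instead of tensors. Following the parameter description, characters outside the alphabet are left as all-zero columns here, which is an assumption about behavior rather than a statement about the library's code.

```python
def one_hot(sequence, alphabet=("A", "C", "G", "T"), ignore=("N",)):
    """Encode a string as a len(alphabet) x len(sequence) nested list."""
    index = {char: i for i, char in enumerate(alphabet)}
    encoded = [[0] * len(sequence) for _ in alphabet]
    for pos, char in enumerate(sequence):
        if char in ignore:
            continue  # ignored characters leave the whole column at zero
        if char in index:
            encoded[index[char]][pos] = 1
        # Assumption: other characters are also left as all-zero columns.
    return encoded


encoded = one_hot("ACGTN")
for row in encoded:
    print(row)
# [1, 0, 0, 0, 0]
# [0, 1, 0, 0, 0]
# [0, 0, 1, 0, 0]
# [0, 0, 0, 1, 0]
```

Each column sums to 1 except the last, where the ignored 'N' leaves every row at 0.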

Returns

seqs: torch.tensor, shape=(n, 4, in_window+2*max_jitter): The extracted sequences in the same order as the loci in the locus file after optional filtering by chromosome.
signals: torch.tensor, shape=(n, len(signals), out_window+2*max_jitter): The extracted signals where the first dimension is in the same order as loci in the locus file after optional filtering by chromosome and the second dimension is in the same order as the list of signal files. If no signal files are given, this is not returned.
in_signals: torch.tensor, shape=(n, len(in_signals), in_window+2*max_jitter): The extracted in signals where the first dimension is in the same order as loci in the locus file after optional filtering by chromosome and the second dimension is in the same order as the list of in signal files. If no in signal files are given, this is not returned.
kept_mask: torch.tensor, shape=(n0,), dtype=bool: A boolean vector of length equal to the number of pre-filtered peaks, with entries being True if they were kept and False if they were filtered out. Applying this mask to the complete set of interleaved peaks will yield the returned values. Only returned if return_mask=True.
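
The relationship between the count filters and kept_mask can be illustrated with a small sketch. The count_filter_mask helper, the locus names, and the count values are all made up for illustration, and treating the bounds as inclusive is an assumption. Only the count-based filter is shown, not the chromosome-edge or exclusion-list filters.

```python
def count_filter_mask(counts, min_counts=None, max_counts=None):
    """Return one bool per locus: True if its total count is within bounds."""
    mask = []
    for total in counts:
        keep = True
        if min_counts is not None and total < min_counts:
            keep = False
        if max_counts is not None and total > max_counts:
            keep = False
        mask.append(keep)
    return mask


loci = ["peak_0", "peak_1", "peak_2", "peak_3"]  # hypothetical names
counts = [12.0, 950.0, 48000.0, 300.0]           # hypothetical summed counts

kept_mask = count_filter_mask(counts, min_counts=50, max_counts=10000)
kept_loci = [locus for locus, keep in zip(loci, kept_mask) if keep]

print(kept_mask)  # [False, True, False, True]
print(kept_loci)  # ['peak_1', 'peak_3']
```

The mask has one entry per provided locus (n0), while the filtered list has one entry per returned example (n).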

tangermeme.io.read_meme(filename: str, n_motifs: int | None = None) → dict[str, Tensor]

Read a MEME file and return a dictionary of PWMs.

This method takes in the filename of a MEME-formatted file to read in and returns a dictionary of the PWMs where the keys are the metadata line and the values are the PWMs.

This function is a wrapper around the corresponding memelite function, except that it returns torch tensors instead of numpy arrays.

Parameters

filename: str: The filename of the MEME-formatted file to read in.
n_motifs: int or None, optional: If provided, stop reading after this many motifs have been parsed. If None, read all motifs in the file. Default is None.

Returns

motifs: dict: A dictionary of the motifs in the MEME file, where the keys are the metadata lines and the values are the PWMs as torch tensors.
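
For readers unfamiliar with the file layout, the sketch below parses a small MEME-formatted string with the standard library only. It is not the memelite or tangermeme implementation: parse_meme_text is a hypothetical helper, it assumes the "w= N" field is written with a space as in standard MEME output, and it returns nested lists with one row per motif position, which may differ from the orientation of the tensors the library returns.

```python
MEME_TEXT = """MEME version 4

ALPHABET= ACGT

Background letter frequencies
A 0.25 C 0.25 G 0.25 T 0.25

MOTIF motif_a
letter-probability matrix: alength= 4 w= 2 nsites= 20 E= 0
0.7 0.1 0.1 0.1
0.1 0.1 0.1 0.7

MOTIF motif_b
letter-probability matrix: alength= 4 w= 1 nsites= 20 E= 0
0.25 0.25 0.25 0.25
"""


def parse_meme_text(text, n_motifs=None):
    """Parse MEME-formatted text into {motif name: rows of probabilities}."""
    motifs = {}
    name, rows_left = None, 0
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("MOTIF"):
            if n_motifs is not None and len(motifs) == n_motifs:
                break  # stop once the requested number of motifs is read
            name = line[len("MOTIF"):].strip()
            motifs[name] = []
        elif line.startswith("letter-probability matrix") and name is not None:
            fields = line.split()
            rows_left = int(fields[fields.index("w=") + 1])
        elif rows_left > 0 and line:
            motifs[name].append([float(x) for x in line.split()])
            rows_left -= 1
    return motifs


motifs = parse_meme_text(MEME_TEXT)
print(list(motifs))                                # ['motif_a', 'motif_b']
print(len(motifs["motif_a"]))                      # 2
print(list(parse_meme_text(MEME_TEXT, n_motifs=1)))  # ['motif_a']
```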

tangermeme.io.read_vcf(filename: str) → DataFrame

Read a VCF file into a pandas DataFrame.

This function takes in the name of a file that is VCF formatted and returns a pandas DataFrame with the comments filtered out. This will only return the first 9 columns (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO, FORMAT); any per-sample genotype columns past column 9 are silently dropped.

Compressed VCFs are read transparently when pandas detects the compression from the filename extension (e.g., .vcf.gz works via pandas.read_csv). BCF (binary VCF) files are NOT supported by this function.

Parameters

filename: str: The path to the VCF-formatted file to read in. May be plain .vcf or gzip-compressed .vcf.gz.

Returns

vcf: pandas.DataFrame: A pandas DataFrame containing the rows, with columns CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO, FORMAT.
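
The behavior described above (comment lines skipped, only the first nine columns kept, gzip handled by file extension) can be mimicked with the standard library alone. This sketch is a stand-in for illustration, not the function's implementation: read_vcf_rows is a hypothetical helper that returns a list of dicts rather than a pandas DataFrame, and the example record is made up.

```python
import gzip
import os
import tempfile

VCF_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT",
               "QUAL", "FILTER", "INFO", "FORMAT"]


def read_vcf_rows(filename):
    """Return a list of dicts holding the first nine fields of each record."""
    opener = gzip.open if filename.endswith(".gz") else open
    records = []
    with opener(filename, "rt") as handle:
        for line in handle:
            if line.startswith("#") or not line.strip():
                continue  # skip meta-information, the header line, and blanks
            fields = line.rstrip("\n").split("\t")[:len(VCF_COLUMNS)]
            records.append(dict(zip(VCF_COLUMNS, fields)))
    return records


# Write a tiny, made-up VCF with one sample column to demonstrate the drop.
example = (
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsample1\n"
    "chr1\t12345\trs1\tA\tG\t50\tPASS\tDP=10\tGT\t0/1\n"
)
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.vcf.gz")
    with gzip.open(path, "wt") as handle:
        handle.write(example)
    records = read_vcf_rows(path)

print(records[0]["POS"], records[0]["FORMAT"])  # 12345 GT
print("0/1" in records[0].values())             # False
```

The genotype value from the sample1 column does not appear in the result, mirroring how per-sample columns past column 9 are dropped.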