match

Provides functions for calculating genome-wide GC content and for sampling GC-matched negatives.

tangermeme.match.extract_matching_loci(loci: str | DataFrame, fasta: str, in_window: int = 2114, out_window: int = 1000, max_n_perc: float = 0.1, gc_bin_width: float = 0.02, bigwig: str | None = None, signal_beta: float = 0.5, chroms: list[str] | None = None, random_state: int | RandomState | None = None, n_jobs: int = 1, verbose: bool = False) → DataFrame

Extract matching loci given a fasta file.

This function takes a set of loci (a bed file or a pandas dataframe in bed format) and returns a GC-matched set of negatives. It also performs basic filtering to ignore regions of the genome that contain too many Ns. Optionally, it can take a bigwig and select only regions whose signal falls below a threshold derived from the signal at the input loci (see signal_beta).
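The optional signal filter can be pictured with a small sketch. It assumes the "robust minimum" is a low quantile of the per-locus counts (the 1st percentile is used here purely for illustration, since this page does not say how it is computed) and that a candidate region is kept only when its count is strictly below `robust_min * signal_beta`. The helper names and the count values are hypothetical, and no bigwig is actually read.

```python
def robust_minimum(counts, q=0.01):
    """Nearest-rank low quantile of `counts`.

    Assumption: this stands in for the 'robust minimum' signal of the
    input loci; the real definition may differ.
    """
    ordered = sorted(counts)
    idx = int(q * (len(ordered) - 1))
    return ordered[idx]


def passes_signal_filter(candidate_count, loci_counts, signal_beta=0.5):
    """True if a candidate background region has fewer reads than the threshold."""
    threshold = robust_minimum(loci_counts) * signal_beta
    return candidate_count < threshold


# Hypothetical read counts in the out_window of each input locus.
loci_counts = [120.0, 80.0, 200.0, 95.0, 150.0]

# Hypothetical read counts for candidate background regions.
candidate_counts = {"bg_a": 10.0, "bg_b": 39.9, "bg_c": 40.0, "bg_d": 300.0}

kept = [
    name
    for name, count in candidate_counts.items()
    if passes_signal_filter(count, loci_counts, signal_beta=0.5)
]
print(kept)
```

With these made-up numbers the threshold is 80.0 * 0.5 = 40.0, so only candidates with fewer than 40 reads survive. A smaller signal_beta makes the filter stricter.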

Importantly, max_n_perc is applied both to the loci that are passed in and to the candidate regions that can be selected. If a locus that is passed in has a higher fraction of unspecified positions (Ns) than max_n_perc, it is filtered out, and correspondingly fewer matched regions are returned. This is done because the GC content of a region with many Ns is not trustworthy.
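As a rough illustration of the N filter, the sketch below computes the N fraction and GC fraction of a few toy windows and drops those that exceed max_n_perc. The helper names and sequences are hypothetical, and computing GC over the full window length (Ns included in the denominator) is an assumption rather than a statement about how tangermeme does it.

```python
def n_fraction(seq):
    """Fraction of positions in `seq` that are N (case-insensitive)."""
    return seq.upper().count("N") / len(seq)


def gc_fraction(seq):
    """Fraction of positions in `seq` that are G or C.

    Assumption: the denominator is the full window length, Ns included.
    """
    s = seq.upper()
    return (s.count("G") + s.count("C")) / len(s)


max_n_perc = 0.1

# Toy windows standing in for in_window-sized sequences from a FASTA file.
windows = {
    "locus_1": "ACGTGGCCAT",            # no Ns
    "locus_2": "ACGNNNNCAT",            # 4 of 10 positions are N
    "locus_3": "GGCCGNATATGGCCGAATAT",  # 1 of 20 positions is N
}

# Keep only windows whose N fraction does not exceed max_n_perc.
kept_gc = {
    name: gc_fraction(seq)
    for name, seq in windows.items()
    if n_fraction(seq) <= max_n_perc
}
print(kept_gc)
```

Here locus_2 is dropped, so only two of the three input loci would contribute to the matched set.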

Parameters

loci: str or pandas dataframe: A filepath to a bed file, or a pandas dataframe in bed format.
fasta: str: The filepath to the FASTA file to extract sequences from.
in_window: int: The window to calculate the GC content over, corresponding to the input window of the downstream model that will be trained.
out_window: int: The window to calculate signal for and apply the signal threshold to, corresponding to the output window of the downstream model that will be trained.
max_n_perc: float, range=(0, 1.0), optional: The maximum fraction of N characters allowed in a window, expressed as a proportion rather than a percentage (0.1 means 10%). All windows with a higher fraction are discarded. Default is 0.1.
gc_bin_width: float, range=(0, 1.0), optional: The width of the bins used to discretize GC content, as a proportion (0.02 means each bin spans 2% GC). Default is 0.02.
bigwig: str or None, optional: If filtering regions based on signal strength, calculate the signal from this bigwig. If None, do not filter based on signal strength. Default is None.
signal_beta: float, optional: A multiplier applied to the robust minimum signal calculated from loci. Each background region must have fewer reads than the resulting value (robust_min * signal_beta). Only relevant if a bigwig is passed in. Must be a number when bigwig is set; passing None together with a bigwig raises a TypeError. Default is 0.5.
chroms: list, tuple, or None, optional: A set of chromosomes to use when choosing matching loci. If None, only use chromosomes that the loci themselves are drawn from. Default is None.
random_state: numpy.random.RandomState, int or None, optional: A random state to use for sampling loci. If a RandomState object or an integer, this will produce deterministic sampling. If None, sampling will be different each time. Default is None.
n_jobs: integer, optional: Number of parallel processes to use for extracting background GC content. -1 means use all available CPUs. Default is 1.
verbose: bool, optional: Whether to display progress bars and print diagnostics to check that the sampling is reasonable. When set to True, there may be a large amount of output. Default is False.
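To make the roles of gc_bin_width and random_state concrete, here is a minimal sketch of bin-matched sampling. It is not tangermeme's implementation: the helper names are hypothetical, the standard library's `random.Random` stands in for a numpy RandomState, and sampling without replacement while skipping loci whose bin has no candidates are assumptions made for the example.

```python
import random
from collections import defaultdict


def gc_bin(gc, gc_bin_width=0.02):
    """Index of the GC bin that a GC fraction falls into."""
    return int(gc / gc_bin_width)


def sample_matched(loci_gc, background_gc, gc_bin_width=0.02, seed=0):
    """Pick one background region from the same GC bin for each locus."""
    rng = random.Random(seed)  # fixed seed gives deterministic sampling

    # Group candidate background regions by GC bin.
    pools = defaultdict(list)
    for name, gc in sorted(background_gc.items()):
        pools[gc_bin(gc, gc_bin_width)].append(name)

    chosen = []
    for gc in loci_gc:
        pool = pools[gc_bin(gc, gc_bin_width)]
        if pool:
            # Remove the pick so that no region is selected twice.
            chosen.append(pool.pop(rng.randrange(len(pool))))
    return sorted(chosen)


# Hypothetical GC fractions of the input loci and of candidate regions.
loci_gc = [0.411, 0.415, 0.533]
background_gc = {
    "bg_1": 0.401,
    "bg_2": 0.409,
    "bg_3": 0.417,
    "bg_4": 0.531,
    "bg_5": 0.711,
}

first = sample_matched(loci_gc, background_gc, seed=0)
second = sample_matched(loci_gc, background_gc, seed=0)
print(first, first == second)
```

Two of the loci share a bin with bg_1, bg_2 and bg_3, so two of those three are drawn, while bg_4 is the only candidate for the third locus and bg_5 is never eligible. Repeating the call with the same seed reproduces the same selection.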

Returns

matched_loci: pandas.DataFrame: A bed-formatted set of matched loci sorted first by chromosome and then by position on the chromosome. Note that the rows are not paired with the input: the i-th row of the output is not necessarily a GC match for the i-th locus in the original input.
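Because the output rows are not paired with the input rows, a sanity check on the match is better done at the level of GC-bin counts than row by row. The sketch below uses hypothetical GC fractions and a hypothetical helper name to show two sets that agree as distributions even though their first rows fall in different bins.

```python
from collections import Counter


def gc_bin_counts(gc_values, gc_bin_width=0.02):
    """Count how many GC fractions fall into each GC bin."""
    return Counter(int(gc / gc_bin_width) for gc in gc_values)


# Hypothetical GC fractions, in the row order of each table.
loci_gc = [0.411, 0.415, 0.533]
matched_gc = [0.531, 0.401, 0.417]  # sorted by position, not by input order

# Row-wise comparison: the first rows land in different bins.
same_first_bin = int(loci_gc[0] / 0.02) == int(matched_gc[0] / 0.02)

# Distribution-wise comparison: the per-bin counts agree.
same_distribution = gc_bin_counts(loci_gc) == gc_bin_counts(matched_gc)

print(same_first_bin, same_distribution)
```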