match
Provides functions for the calculation of GC-content genome-wide and the sampling of GC-matched negatives.
- tangermeme.match.extract_matching_loci(loci, fasta, in_window=2114, out_window=1000, max_n_perc=0.1, gc_bin_width=0.02, bigwig=None, signal_beta=0.5, chroms=None, random_state=None, n_jobs=-1, verbose=False)
Extract matching loci given a fasta file.
This function takes in a set of loci (a bed file or a pandas dataframe in bed format) and returns a GC-matched set of negatives. This will also perform basic filtering to ignore regions of the genome that are too high in Ns. Optionally, it can take in a bigwig and only select background regions whose signal falls below a threshold derived from the signal at the provided loci (see signal_beta).
Importantly, it will apply max_n_perc both to the loci that are passed in and to the candidate regions that can be selected. This means that if a locus passed in has a fraction of unspecified positions (Ns) higher than max_n_perc, it will be filtered out, and correspondingly fewer matched regions will be selected. This is done because the GC content of a region with many Ns in it is not trustworthy.
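To make the N filter concrete, here is a minimal sketch in plain Python. It is not the library's implementation: the helper names (`n_fraction`, `gc_content`, `filter_by_n`) are hypothetical, and computing GC over only the A/C/G/T positions is an assumption made for illustration.

```python
def n_fraction(seq):
    """Fraction of positions in `seq` that are N (unspecified)."""
    seq = seq.upper()
    return seq.count("N") / len(seq) if seq else 0.0


def gc_content(seq):
    """GC fraction over the specified (A/C/G/T) positions only.

    Assumption for this sketch: Ns are excluded from the denominator.
    """
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    acgt = gc + seq.count("A") + seq.count("T")
    return gc / acgt if acgt else 0.0


def filter_by_n(seqs, max_n_perc=0.1):
    """Keep only windows whose N fraction does not exceed `max_n_perc`."""
    return [s for s in seqs if n_fraction(s) <= max_n_perc]


windows = ["ACGTACGTAC", "GCNNNNNNNN", "ACGTACGTAN"]
kept = filter_by_n(windows, max_n_perc=0.1)
```

In this toy input, the window "GCNNNNNNNN" would report a GC fraction of 1.0 from only two known bases, which is the kind of unreliable estimate the filter is meant to exclude.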
Parameters
- loci: str or pandas dataframe
A filepath to a bed file, or a pandas dataframe in bed format.
- fasta: str
The filepath to the FASTA file to extract sequences from.
- in_window: int
The window to calculate the GC content over, corresponding to the input window of the downstream model that will be trained.
- out_window: int
The window to calculate signal for and apply the signal threshold to, corresponding to the output window of the downstream model that will be trained.
- max_n_perc: float, range=(0, 1.0), optional
The maximum fraction of N characters (expressed as a value between 0 and 1) allowed in each window. All windows with a higher fraction are discarded. Default is 0.1.
- gc_bin_width: float, range=(0, 1.0), optional
The bin size to discretize GC content. Default is 0.02.
- bigwig: str or None, optional
If filtering regions based on signal strength, calculate the signal from this bigwig. If None, do not filter based on signal strength. Default is None.
- signal_beta: float or None, optional
A multiplier applied to the robust minimum signal calculated from loci; each background region must have fewer reads than the resulting threshold. Only relevant if a bigwig is passed in. Default is 0.5.
- chroms: list, tuple, or None, optional
A set of chromosomes to use when choosing matching loci. If None, only use chromosomes that the loci themselves are drawn from. Default is None.
- random_state: numpy.random.RandomState, int or None, optional
A random state to use for sampling loci. If a RandomState object or an integer, this will produce deterministic sampling. If None, sampling will be different each time. Default is None.
- n_jobs: integer, optional
Number of parallel processes to use for extracting background GC content. -1 means use all available CPUs. Default is -1.
- verbose: bool, optional
Whether to print progress bars and diagnostics to ensure that the sampling is reasonable. When set to True, there may be a large amount of output. Default is False.
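As an illustration of how these parameters fit together, the sketch below wraps a call using only the signature documented above. The wrapper name `sample_gc_matched_negatives` and the file paths in the commented-out usage are hypothetical; the keyword values are simply the documented defaults, with `random_state` set for reproducible sampling and `n_jobs=1` chosen here to keep the run single-process.

```python
def sample_gc_matched_negatives(peaks_bed, genome_fasta, seed=0):
    """Return GC-matched background regions for the loci in `peaks_bed`."""
    # Imported lazily so this sketch can be loaded without tangermeme installed.
    from tangermeme.match import extract_matching_loci

    return extract_matching_loci(
        peaks_bed,            # bed file path or bed-formatted DataFrame
        genome_fasta,         # FASTA file to extract sequences from
        in_window=2114,       # window used for GC content
        out_window=1000,      # window used for signal (only with a bigwig)
        max_n_perc=0.1,       # discard windows with more than 10% Ns
        gc_bin_width=0.02,    # GC content discretized into 2% bins
        random_state=seed,    # integer seed gives deterministic sampling
        n_jobs=1,             # single process; -1 would use all CPUs
        verbose=False,
    )


# Hypothetical usage (paths are placeholders, not files shipped with the library):
# negatives = sample_gc_matched_negatives("peaks.bed", "hg38.fa")
# negatives.to_csv("negatives.bed", sep="\t", header=False, index=False)
```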
Returns
- matched_loci: pandas.DataFrame
A bed-formatted set of matched loci sorted first by chromosome and then by position on the chromosome. Note that the rows are not ordered to correspond to the input: the i-th row in this output is not necessarily a GC match for the i-th row in the original locus file.
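The idea of bin-based GC matching, and why the output order does not line up with the input, can be sketched in plain Python. This is a toy illustration under stated assumptions, not the library's algorithm: the helpers `gc_bin` and `sample_matched` are hypothetical, candidates are given as pre-computed `(chrom, start, end, gc)` tuples, and chromosomes are sorted lexically.

```python
import random
from collections import defaultdict


def gc_bin(gc, gc_bin_width=0.02):
    """Map a GC fraction in [0, 1] to an integer bin index."""
    # The small epsilon guards against floating-point error at bin edges.
    return int(gc / gc_bin_width + 1e-9)


def sample_matched(loci_gc, candidates, gc_bin_width=0.02, random_state=0):
    """Sample one candidate per input locus from the same GC bin.

    loci_gc: GC fractions of the input loci.
    candidates: (chrom, start, end, gc) tuples for background regions.
    """
    rng = random.Random(random_state)

    # Group the background candidates by GC bin.
    pool = defaultdict(list)
    for cand in candidates:
        pool[gc_bin(cand[3], gc_bin_width)].append(cand)

    # Count how many negatives each bin needs.
    needed = defaultdict(int)
    for gc in loci_gc:
        needed[gc_bin(gc, gc_bin_width)] += 1

    # Draw without replacement; a bin may run short of candidates.
    chosen = []
    for b, k in needed.items():
        available = pool.get(b, [])
        chosen.extend(rng.sample(available, min(k, len(available))))

    # Sort by chromosome, then start, which discards any input ordering.
    return sorted(chosen, key=lambda c: (c[0], c[1]))


loci_gc = [0.41, 0.415, 0.63]
candidates = [
    ("chr2", 500, 600, 0.632),
    ("chr1", 900, 1000, 0.405),
    ("chr1", 100, 200, 0.418),
    ("chr1", 300, 400, 0.10),
]
matched = sample_matched(loci_gc, candidates)
```

Because the result is sorted by coordinate, matching happens at the level of the GC distribution as a whole; pairing a specific input locus with a specific negative would require recomputing GC content for both sets.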