Gene cooccurrence and marker selection¶
cooccurrence computes gene co-occurrence within a specified radius and coloc2markers selects markers from the co-occurrence matrix.
cooccurrence¶
The cooccurrence command computes how frequently each pair of genes appears together within a specified radius, either as a binary relation or weighted by distance with exponential decay. (A merge-mtx command is also available for adding up multiple co-occurrence matrices, for example ones computed from different datasets.)
Usage¶
punkst cooccurrence --in-tsv tiles.tsv --in-index tiles.index \
--feature-dict features.tsv --out ./out \
--icol-x 0 --icol-y 1 --icol-feature 2 --icol-val 3 \
--radius 15 --halflife 10 --binary --threads 4
Required¶
--in-tsv - The tiled data created by pts2tiles.
--in-index - The index file created by pts2tiles.
--icol-x - Column index for x coordinate (0-based).
--icol-y - Column index for y coordinate (0-based).
--icol-feature - Column index for feature list (0-based).
--icol-val - Column index for the integer count values (0-based).
--radius - Radius within which to count co-occurrence.
--out - Output prefix for generated files.
--feature-dict - If feature column contains non-integer values, provide a dictionary/list of all feature names.
Optional¶
--bounding-boxes - Rectangular query regions, each specified as xmin ymin xmax ymax (plain numbers separated by spaces, without parentheses). A single co-occurrence matrix is created using pixels in the union of the regions. Multiple regions can be specified by listing multiple 4-tuples, e.g. --bounding-boxes 5040 2300 8830 4640 100 200 1500 1600 for two rectangles.
--weight-by-count - Weight co-occurrence by the product of transcript counts at each pixel. Default: false. (This is unlikely to have a noticeable impact except for very dense sequencing data.)
--halflife - Half-life for exponential decay weighting: the weight is 1 at zero distance and 0.5 at a distance equal to the half-life. Default: -1 (pairs within the radius are not weighted by distance).
--min-neighbor - Minimum number of neighboring pixels within radius for a pixel to be included (meant to be used to reduce the influence from sparse/non-tissue regions). Default: 1.
--local-min - Minimum co-occurrence value within a tile to record. Default: 0.
--threads - Number of threads to use. Default: 1.
--binary - Output results in binary format. Default: false (TSV output). The binary format is more efficient, especially if you plan to run the coloc2markers command later.
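The interaction between --radius and --halflife can be illustrated with a minimal sketch. It assumes the weight follows 0.5 ** (distance / halflife), which is the exponential curve implied by "1 at zero distance, 0.5 at the half-life", and that pairs at exactly the radius are still counted. The function name distance_weight is hypothetical and is not part of punkst.

```python
def distance_weight(distance, radius, halflife=-1.0):
    """Weight given to one neighboring pixel at the given distance.

    Assumed form, not taken from the punkst source code.
    """
    if distance > radius:
        return 0.0  # outside the radius: the pair is not counted
    if halflife <= 0:
        return 1.0  # default (-1): unweighted within the radius
    return 0.5 ** (distance / halflife)  # 1 at distance 0, 0.5 at the half-life


# Same settings as the usage example: --radius 15 --halflife 10
for d in (0, 5, 10, 15, 20):
    print(d, round(distance_weight(d, radius=15, halflife=10), 3))
```

Under these assumptions, a pair separated by one half-life contributes half as much as a pair at the same location, and anything beyond the radius contributes nothing.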
Output files¶
- Co-occurrence matrix named {prefix}.mtx.bin (if --binary is used) or {prefix}.mtx.tsv (otherwise).
- Marginal information per gene named {prefix}.marginals.tsv. The columns are feature index (corresponding to the row and column index in the matrix), name, total counts, total number of pixels, used pixels, used neighbors.
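As a rough illustration of how the marginals file might be inspected, the sketch below parses the six columns listed above and lists the features whose total count reaches a threshold. It assumes the file is tab-separated with the columns in the stated order, and it skips any line whose first field is not an integer (such as a possible header). The helper names and the toy rows are hypothetical.

```python
import csv
import io

# Column order as described for {prefix}.marginals.tsv.
COLUMNS = ["index", "name", "total_count", "n_pixels", "used_pixels", "used_neighbors"]


def read_marginals(handle):
    """Parse marginals rows into dictionaries (assumed tab-separated layout)."""
    rows = []
    for fields in csv.reader(handle, delimiter="\t"):
        if len(fields) < len(COLUMNS) or not fields[0].isdigit():
            continue  # skip blank lines and a possible header line
        row = dict(zip(COLUMNS, fields))
        row["index"] = int(row["index"])
        for key in COLUMNS[2:]:
            row[key] = float(row[key])
        rows.append(row)
    return rows


def features_above(rows, min_count):
    """Names of features whose total count is at least min_count."""
    return [r["name"] for r in rows if r["total_count"] >= min_count]


# Toy rows for illustration only; real files come from `punkst cooccurrence`.
toy = "0\tGeneA\t1200\t900\t850\t4000\n1\tGeneB\t15\t14\t10\t30\n"
rows = read_marginals(io.StringIO(toy))
print(features_above(rows, min_count=100))
```

For a real file, pass an open file handle instead of the io.StringIO wrapper.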
Adding up multiple matrices¶
For convenience, the merge-mtx command adds up multiple co-occurrence matrices:
punkst merge-mtx --in-list mtx_list.txt --binary --binary-output --shared-nrows $M --out cooccur.merged
List the file paths of the input matrices in a file and pass that file to --in-list.
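One way to prepare the list file is sketched below. It assumes one matrix path per line, which is a common convention but is not stated in this documentation. The helper write_matrix_list and the sample paths are hypothetical.

```python
import tempfile
from pathlib import Path


def write_matrix_list(matrix_paths, list_path):
    """Write one matrix path per line (assumed layout) and return the lines."""
    lines = [str(p) for p in matrix_paths]
    Path(list_path).write_text("\n".join(lines) + "\n")
    return lines


# Hypothetical outputs of two separate `punkst cooccurrence --binary` runs.
matrices = ["sample1/out.mtx.bin", "sample2/out.mtx.bin"]

# A temporary directory keeps the sketch self-contained; in practice,
# write mtx_list.txt wherever merge-mtx will be run.
with tempfile.TemporaryDirectory() as tmp:
    list_file = Path(tmp) / "mtx_list.txt"
    write_matrix_list(matrices, list_file)
    written = list_file.read_text()

print(written, end="")
```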
coloc2markers¶
The coloc2markers command selects optimal marker features from a co-occurrence matrix generated by the cooccurrence command. It can optionally find neighbors for selected or specified markers and recover the corresponding expression profiles.
Usage¶
Just finding markers:
punkst coloc2markers --input out.mtx.bin --binary --info out.marginals.tsv \
--out ./markers --K 24 --neighbors 10
Recovering gene expression profiles around each set of markers:
punkst coloc2markers --input out.mtx.bin --binary --info out.marginals.tsv \
--out ./markers --K 24 --neighbors 10 --recover-factors --weight-by-counts --threads 4
Required¶
--input - Input co-occurrence matrix (binary or TSV format) from the cooccurrence command.
--info - Input gene/feature information file from the cooccurrence command.
--K - Number of markers to select. Note that marker selection is deterministic and sequential, so the set of markers generated with a small K is almost certainly contained in the set generated with a larger K. Specifying a large K and then trimming down to the desired number is perhaps the recommended approach.
--out - Output prefix for generated files.
Optional¶
--binary - Specify that input matrix is in binary format. Default: false (assumes TSV).
--value-bytes - Number of bytes per value in the binary matrix. Default: 8, matching the output of cooccurrence. Only used with --binary.
--min-count - Minimum count for a feature to be considered as a marker. This should be a reasonably large number; otherwise the results are driven by rare genes.
--fixed - List of markers that must be included in the selection (strings, separated by spaces). Currently assumes the inputs are distinct markers.
--find-neighbors - Find neighbors (often co-localized genes) for each selected marker.
--neighbors - Number of top neighbors to find for each marker. Default: 10. Can be used instead of the --find-neighbors flag.
--neighbor-max-rank-fraction - Maximum rank, expressed as a fraction (quantile) among all genes, for two genes to count as mutual neighbors. Default: 0.1. For example, if gene A is among the top 10% of genes observed around gene B, but gene B is not among the top 10% of genes observed around gene A (say it falls in the bottom 50%), then A and B are not considered neighbors.
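The mutual-rank criterion behind --neighbor-max-rank-fraction can be sketched as follows. This is an illustration of the idea only, not punkst's implementation: it assumes each gene ranks all other genes by their co-occurrence value with it, and that a pair qualifies only when both rank fractions fall within the threshold. The function names and the toy matrix are hypothetical, and the thresholds are larger than the default 0.1 because the toy has only four genes.

```python
def rank_fractions(cooc, gene):
    """Map every other gene to its rank fraction among `gene`'s neighbors."""
    others = [g for g in cooc[gene] if g != gene]
    ordered = sorted(others, key=lambda g: cooc[gene][g], reverse=True)
    n = len(ordered)
    return {g: (i + 1) / n for i, g in enumerate(ordered)}


def mutual_neighbors(cooc, a, b, max_rank_fraction=0.1):
    """True only if b ranks high around a AND a ranks high around b."""
    return (rank_fractions(cooc, a)[b] <= max_rank_fraction
            and rank_fractions(cooc, b)[a] <= max_rank_fraction)


# Toy symmetric co-occurrence values, for illustration only.
cooc = {
    "A": {"A": 0, "B": 9, "C": 5, "D": 1},
    "B": {"A": 9, "B": 0, "C": 6, "D": 3},
    "C": {"A": 5, "B": 6, "C": 0, "D": 8},
    "D": {"A": 1, "B": 3, "C": 8, "D": 0},
}

print(mutual_neighbors(cooc, "A", "B", max_rank_fraction=0.5))
print(mutual_neighbors(cooc, "A", "C", max_rank_fraction=0.7))
```

In the toy matrix, A and B are each other's top neighbor, so the first check passes. C is A's second neighbor, but A is only C's third, so the second check fails even though the underlying value is symmetric.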
Recover factors¶
--recover-factors - Recover underlying factors from the co-occurrence matrix after marker selection.
--threads - Number of threads to use for factor recovery. Default: -1 (auto). Only used with --recover-factors.
--max-iter - Maximum iterations for factor recovery. Default: 500.
--tol - Convergence tolerance for factor recovery. Default: 1e-6.
--weight-by-counts - Weight factors by gene counts. Default: false.
--verbose - Verbosity level for output messages. Default: 0.
Output files¶
- {prefix}.top.tsv - List of selected markers.
- If --find-neighbors or --neighbors is used:
  - {prefix}.pairs.tsv - Detailed pairwise relationships. Columns: index, name of gene 1, total count of gene 1, name of gene 2, total count of gene 2, (weighted) proportion of gene 2 in gene 1's neighbors, vice versa, rank of gene 2 as gene 1's neighbor, vice versa.
  - {prefix}.short.txt - Compact neighbor lists. The first gene on each line is the selected marker, followed by its neighbors.
- If --recover-factors is used: {prefix}.factors.tsv for the recovered factor matrix.