Gene cooccurrence and marker selection

The cooccurrence command computes gene co-occurrence within a specified radius, and the coloc2markers command selects markers from the resulting co-occurrence matrix.

cooccurrence

The cooccurrence command computes how frequently each pair of genes appears together within a specified radius, either as a binary relation or weighted by distance with exponential decay. (There is also a merge-mtx command for adding up multiple co-occurrence matrices, possibly computed from different datasets.)
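To make the two weighting modes concrete, here is a minimal Python sketch of the weight a single pair of pixels might contribute. It is not punkst's implementation: pair_weight is a hypothetical helper, the decay form 0.5^(distance/halflife) is inferred from the description of --halflife (weight 1 at zero distance, 0.5 at the half-life), and treating the radius boundary as inclusive is an assumption.

```python
def pair_weight(distance, radius, halflife=-1.0):
    """Weight for two pixels `distance` apart (hypothetical helper, not punkst code)."""
    if distance > radius:
        # Assumption: a pair exactly at the radius still counts as inside.
        return 0.0
    if halflife <= 0:
        # Default (-1): every pair within the radius counts equally.
        return 1.0
    # Exponential decay: 1 at zero distance, 0.5 at one half-life.
    return 0.5 ** (distance / halflife)
```

With --radius 15 --halflife 10, this gives a weight of 1 at distance 0, 0.5 at distance 10, about 0.354 at distance 15, and 0 beyond the radius.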

Usage

punkst cooccurrence --in-tsv tiles.tsv --in-index tiles.index \
--feature-dict features.tsv --out ./out \
--icol-x 0 --icol-y 1 --icol-feature 2 --icol-val 3 \
--radius 15 --halflife 10 --binary --threads 4

Required

--in-tsv - The tiled data created by pts2tiles.

--in-index - The index file created by pts2tiles.

--icol-x - Column index for x coordinate (0-based).

--icol-y - Column index for y coordinate (0-based).

--icol-feature - Column index for feature list (0-based).

--icol-val - Column index for the integer count values (0-based).

--radius - Radius within which to count co-occurrence.

--out - Output prefix for generated files.

--feature-dict - If the feature column contains non-integer values, provide a dictionary/list of all feature names.

Optional

--bounding-boxes - Rectangular query regions, each specified as xmin ymin xmax ymax (plain numbers separated by spaces, without parentheses). A single co-occurrence matrix is created using the pixels in the union of the regions. To specify multiple regions, append more 4-tuples, e.g. --bounding-boxes 5040 2300 8830 4640 100 200 1500 1600 for two rectangles.

--weight-by-count - Weight co-occurrence by the product of transcript counts at each pixel. Default: false. (This is unlikely to have a noticeable impact except on very dense sequencing data.)

--halflife - Half-life for exponential decay weighting: the distance at which the weight falls to 0.5, starting from 1 at zero distance. Default: -1 (pairs within the radius are not weighted by distance).

--min-neighbor - Minimum number of neighboring pixels within the radius for a pixel to be included (intended to reduce the influence of sparse or non-tissue regions). Default: 1.

--local-min - Minimum co-occurrence value within a tile to record. Default: 0.

--threads - Number of threads to use. Default: 1.

--binary - Output results in binary format. Default: false (TSV output). The binary format is more efficient, especially if you plan to run the coloc2markers command afterwards.
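As an illustration of how the flat number list given to --bounding-boxes maps to rectangles, the following Python sketch groups the values into 4-tuples and tests whether a point lies in the union of the regions. Both function names are hypothetical, and treating the rectangle edges as inclusive is an assumption; punkst's own boundary handling may differ.

```python
def parse_boxes(values):
    """Group a flat list of numbers into (xmin, ymin, xmax, ymax) tuples."""
    if len(values) % 4 != 0:
        raise ValueError("expected a multiple of four numbers")
    return [tuple(values[i:i + 4]) for i in range(0, len(values), 4)]


def in_union(x, y, boxes):
    """Return True if (x, y) falls inside at least one rectangle."""
    # Assumption: rectangle edges are inclusive.
    return any(
        xmin <= x <= xmax and ymin <= y <= ymax
        for xmin, ymin, xmax, ymax in boxes
    )


# The two rectangles from the example above.
boxes = parse_boxes([5040, 2300, 8830, 4640, 100, 200, 1500, 1600])
```

Here boxes holds two rectangles; a pixel at (6000, 3000) falls in the first, one at (500, 1000) in the second, and one at (2000, 2000) in neither.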

Output files

  • Co-occurrence matrix named {prefix}.mtx.bin (if --binary is used) or {prefix}.mtx.tsv (otherwise).

  • Marginal information per gene named {prefix}.marginals.tsv. The columns are feature index (corresponding to the row and column index in the matrix), name, total counts, total number of pixels, used pixels, used neighbors.

Adding up multiple matrices

For convenience, the merge-mtx command adds up multiple co-occurrence matrices:

punkst merge-mtx --in-list mtx_list.txt --binary --binary-output --shared-nrows $M --out cooccur.merged

List the file paths of the input matrices in the file passed to --in-list.
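If the matrices come from a scripted pipeline, the list file can be generated rather than written by hand. The Python sketch below assumes the file holds one path per line, which the text above does not state explicitly; write_matrix_list and the example file names are hypothetical.

```python
from pathlib import Path


def write_matrix_list(matrix_paths, list_path):
    """Write one matrix path per line (assumed layout) and return the count."""
    lines = [str(Path(p)) for p in matrix_paths]
    Path(list_path).write_text("\n".join(lines) + "\n")
    return len(lines)


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    write_matrix_list(["sample1.mtx.bin", "sample2.mtx.bin"], "mtx_list.txt")
```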

coloc2markers

The coloc2markers command selects optimal marker features from a co-occurrence matrix generated by the cooccurrence command. It can optionally find neighbors for selected or specified markers and recover the corresponding expression profiles.

Usage

Just finding markers:

punkst coloc2markers --input out.mtx.bin --binary --info out.marginals.tsv \
--out ./markers --K 24 --neighbors 10

Recovering gene expression profiles around each set of markers:

punkst coloc2markers --input out.mtx.bin --binary --info out.marginals.tsv \
--out ./markers --K 24 --neighbors 10 --recover-factors --weight-by-counts --threads 4

Required

--input - Input co-occurrence matrix (binary or TSV format) from the cooccurrence command.

--info - Input gene/feature information file from the cooccurrence command.

--K - Number of markers to select. Note that marker selection is deterministic and sequential, so the set of markers generated with a small K is almost certainly contained in the set generated with a larger K. It may therefore be preferable to specify a large K and then trim the result down to the desired number.

--out - Output prefix for generated files.

Optional

--binary - Specify that input matrix is in binary format. Default: false (assumes TSV).

--value-bytes - Number of bytes per value in the binary matrix. Default: 8, which matches the output of cooccurrence. Only used with --binary.

--min-count - Minimum count for a feature to be considered as a marker. This should be a reasonably large number; otherwise the results are driven by rare genes.

--fixed - List of markers that must be included in the selection (strings, separated by spaces). The input markers are currently assumed to be distinct.

--find-neighbors - Find neighbors (often co-localized genes) for each selected marker.

--neighbors - Number of top neighbors to find for each marker. Default: 10. Can be used instead of the --find-neighbors flag.

--neighbor-max-rank-fraction - Maximum rank, expressed as a fraction (quantile) of all genes, for two genes to count as mutual neighbors. Default: 0.1. For example, if gene A is among the top 10% of genes observed around gene B, but gene B is not among the top 10% of genes observed around gene A (say it falls in the bottom 50%), then A and B are not considered neighbors.
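The mutual-rank condition behind --neighbor-max-rank-fraction can be sketched in a few lines of Python. This is an illustration under stated assumptions, not punkst's code: passes_mutual_rank is a hypothetical helper, ranks are taken to be 1-based, and the cutoff is taken to be the fraction multiplied by the total number of genes.

```python
def passes_mutual_rank(rank_b_around_a, rank_a_around_b, n_genes,
                       max_rank_fraction=0.1):
    """Return True if each gene ranks within the top fraction around the other."""
    # Assumption: ranks are 1-based and the cutoff is fraction * number of genes.
    cutoff = max_rank_fraction * n_genes
    return rank_b_around_a <= cutoff and rank_a_around_b <= cutoff
```

With 1000 genes and the default fraction, ranks of 50 and 80 satisfy the condition, whereas ranks of 50 and 500 do not, because only one direction is within the top 10%.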

Recover factors

--recover-factors - Recover underlying factors from the co-occurrence matrix after marker selection.

--threads - Number of threads to use for factor recovery. Default: -1 (auto). Only used with --recover-factors.

--max-iter - Maximum iterations for factor recovery. Default: 500.

--tol - Convergence tolerance for factor recovery. Default: 1e-6.

--weight-by-counts - Weight factors by gene counts. Default: false.

--verbose - Verbosity level for output messages. Default: 0.

Output files

  • {prefix}.top.tsv - List of selected markers

  • if --find-neighbors or --neighbors is used:

    • {prefix}.pairs.tsv - Detailed pairwise relationships. Columns: index, name of gene 1, total count of gene 1, name of gene 2, total count of gene 2, (weighted) proportion of gene 2 among gene 1's neighbors, (weighted) proportion of gene 1 among gene 2's neighbors, rank of gene 2 as gene 1's neighbor, rank of gene 1 as gene 2's neighbor.
    • {prefix}.short.txt - Compact neighbor lists. The first gene on each line is the selected marker, followed by its neighbors.
  • if --recover-factors is used: {prefix}.factors.tsv for recovered factor matrix.
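For downstream scripting, the {prefix}.short.txt layout (selected marker first, then its neighbors) can be loaded into a dictionary. The Python sketch below assumes the fields are separated by whitespace, which the description does not specify; parse_short is a hypothetical helper and the gene names in the sample text are made up.

```python
def parse_short(text):
    """Map each selected marker to the list of neighbors on its line."""
    neighbors = {}
    for line in text.splitlines():
        fields = line.split()  # assumption: whitespace-separated fields
        if not fields:
            continue  # skip blank lines
        neighbors[fields[0]] = fields[1:]
    return neighbors


# Made-up gene names, for illustration only.
sample = "GeneA GeneB GeneC\nGeneD GeneE\n"
markers = parse_short(sample)
```

To read a real file, pass its contents instead, for example parse_short(open("markers.short.txt").read()), assuming the output prefix was ./markers.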