Gene co-occurrence and marker selection¶
cooccurrence computes gene co-occurrence within a specified radius, and coloc2markers selects markers from the co-occurrence matrix.
cooccurrence¶
The cooccurrence command computes how frequently each pair of genes appears together within a specified radius, either as a binary relation or weighted by distance with exponential decay. (A merge-mtx command is also available for adding up multiple co-occurrence matrices, for example ones computed from different datasets.)
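To make the binary (unweighted) case concrete, here is a minimal brute-force sketch in Python. It is not the tool's implementation: the function name, the input layout, the inclusive radius test, and the handling of same-gene pairs are all assumptions made for illustration, and the real command works on tiled data rather than an all-pairs loop.

```python
from collections import defaultdict
from itertools import combinations
from math import hypot


def count_cooccurrence(points, radius):
    """Count gene pairs whose pixels lie within `radius` of each other.

    points: list of (x, y, gene) tuples (hypothetical input layout).
    Returns a symmetric dict {(gene_a, gene_b): count}.
    """
    counts = defaultdict(float)
    for (x1, y1, g1), (x2, y2, g2) in combinations(points, 2):
        # Assumption: a pair counts when the distance is <= radius.
        if hypot(x1 - x2, y1 - y2) <= radius:
            counts[(g1, g2)] += 1.0
            if g1 != g2:
                counts[(g2, g1)] += 1.0  # keep the matrix symmetric
    return dict(counts)


# Made-up coordinates: A and B are 5 apart, the second A and C are 2 apart.
points = [(0, 0, "A"), (3, 4, "B"), (30, 0, "A"), (32, 0, "C")]
result = count_cooccurrence(points, radius=15)
```

With these points, only the A/B pair and the A/C pair fall within the radius, so B and C never co-occur. A distance-weighted variant would add a decay weight in place of the constant 1.0.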
Usage¶
punkst cooccurrence --in-tsv tiles.tsv --in-index tiles.index \
--feature-dict features.tsv --out ./out \
--icol-x 0 --icol-y 1 --icol-feature 2 --icol-val 3 \
--radius 15 --halflife 10 --binary --threads 4
Required¶
--in-tsv
- The tiled data created by pts2tiles.
--in-index
- The index file created by pts2tiles.
--icol-x
- Column index for x coordinate (0-based).
--icol-y
- Column index for y coordinate (0-based).
--icol-feature
- Column index for feature list (0-based).
--icol-val
- Column index for the integer count values (0-based).
--radius
- Radius within which to count co-occurrence.
--out
- Output prefix for generated files.
--feature-dict
- If the feature column contains non-integer values, provide a dictionary/list of all feature names.
Optional¶
--bounding-boxes
- Rectangular query regions specified as xmin ymin xmax ymax coordinates (plain numbers separated by spaces, without parentheses). A single co-occurrence matrix is created using pixels in the union of the regions. Multiple regions can be specified by adding more 4-tuples, e.g. --bounding-boxes 5040 2300 8830 4640 100 200 1500 1600 for two rectangles.
--weight-by-count
- Weight co-occurrence by the product of transcript counts at each pixel. Default: false. (This is unlikely to have a noticeable impact except on very dense sequencing data.)
--halflife
- Half-life for exponential decay weighting: the distance at which the weight drops to 0.5 (the weight is 1 at zero distance). Default: -1, meaning pairs within the radius are not weighted by distance.
--min-neighbor
- Minimum number of neighboring pixels within radius for a pixel to be included (meant to be used to reduce the influence from sparse/non-tissue regions). Default: 1.
--local-min
- Minimum co-occurrence value within a tile to record. Default: 0.
--threads
- Number of threads to use. Default: 1.
--binary
- Output results in binary format. Default: false (TSV output). The binary format is more efficient, especially if you plan to run the coloc2markers command later.
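As a worked example of the --halflife option, the sketch below uses the standard half-life form of exponential decay, which is what the description implies (weight 1 at zero distance, 0.5 at the half-life). The function name is hypothetical, and treating any non-positive half-life as "unweighted" is an assumption based on the documented default of -1.

```python
def decay_weight(distance, halflife):
    """Exponential decay weight: 1 at distance 0, 0.5 at distance == halflife."""
    if halflife <= 0:
        # Assumption: non-positive values mean "no distance weighting".
        return 1.0
    return 0.5 ** (distance / halflife)


# Distances up to the radius of 15 used in the usage example, with halflife 10.
for d in (0, 5, 10, 15):
    print(d, round(decay_weight(d, halflife=10), 3))
# prints:
# 0 1.0
# 5 0.707
# 10 0.5
# 15 0.354
```

So with --radius 15 --halflife 10, a pair at the edge of the radius contributes roughly a third of what a pair at the same location would.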
Output files¶
- Co-occurrence matrix named {prefix}.mtx.bin (if --binary is used) or {prefix}.mtx.tsv (otherwise).
- Marginal information per gene named {prefix}.marginals.tsv. The columns are feature index (corresponding to the row and column index in the matrix), name, total counts, total number of pixels, used pixels, used neighbors.
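The marginals file can be loaded with a few lines of Python. This is a sketch under assumptions: the column order follows the list above, the file is tab-separated, and any line whose first field is not an integer (such as a header, if the file has one) is skipped. The function name and the demo rows are made up for illustration.

```python
import io


def read_marginals(handle):
    """Parse marginals rows into dicts, skipping non-data lines."""
    rows = []
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 6 or not fields[0].isdigit():
            continue  # header or malformed line (assumption: index is an integer)
        rows.append({
            "index": int(fields[0]),
            "name": fields[1],
            "total_count": float(fields[2]),
            "total_pixels": float(fields[3]),
            "used_pixels": float(fields[4]),
            "used_neighbors": float(fields[5]),
        })
    return rows


# Made-up content standing in for {prefix}.marginals.tsv.
demo = io.StringIO(
    "#index\tname\tcount\tpixels\tused_pixels\tused_neighbors\n"
    "0\tGeneA\t1200\t900\t880\t15000\n"
    "1\tGeneB\t35\t30\t28\t410\n"
)
marginals = read_marginals(demo)
abundant = [row["name"] for row in marginals if row["total_count"] >= 100]
```

Filtering on the total count like this gives a quick preview of which features would survive a count threshold before running marker selection.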
Adding up multiple matrices¶
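For convenience, the merge-mtx command adds up multiple co-occurrence matrices (for example, ones computed from different datasets):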
punkst merge-mtx --in-list mtx_list.txt --binary --binary-output --shared-nrows $M --out cooccur.merged
List the file paths of the input matrices in the file passed to --in-list.
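Conceptually, adding up matrices is an element-wise sum. The Python sketch below illustrates that idea on sparse matrices stored as dictionaries; it is not the merge-mtx implementation, the function name is hypothetical, and it assumes every input uses the same feature indexing.

```python
from collections import Counter


def merge_matrices(matrices):
    """Element-wise sum of sparse matrices given as {(row, col): value} dicts."""
    total = Counter()
    for matrix in matrices:
        total.update(matrix)  # Counter.update adds values for matching keys
    return dict(total)


# Made-up entries from two datasets that share feature indices.
dataset_1 = {(0, 1): 2.0, (1, 1): 1.0}
dataset_2 = {(0, 1): 3.0, (2, 0): 4.0}
merged = merge_matrices([dataset_1, dataset_2])
```

Entries present in only one input are carried over unchanged, and entries present in both are summed.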
coloc2markers¶
The coloc2markers command selects optimal marker features from a co-occurrence matrix generated by the cooccurrence command. It can optionally find neighbors for selected or specified markers and recover the corresponding expression profiles.
Usage¶
Just finding markers:
punkst coloc2markers --input out.mtx.bin --binary --info out.marginals.tsv \
--out ./markers --K 24 --neighbors 10
Recovering gene expression profiles around each set of markers:
punkst coloc2markers --input out.mtx.bin --binary --info out.marginals.tsv \
--out ./markers --K 24 --neighbors 10 --recover-factors --weight-by-counts --threads 4
Required¶
--input
- Input co-occurrence matrix (binary or TSV format) from the cooccurrence command.
--info
- Input gene/feature information file from the cooccurrence command.
--K
- Number of markers to select. Note that marker selection is deterministic and sequential, so the set of markers generated with a small K is almost certainly contained in the set generated with a larger K. A reasonable approach is to specify a large K and then trim the list down to the desired number.
--out
- Output prefix for generated files.
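The note on --K follows from how sequential selection behaves in general: each pick depends only on the picks before it. The toy Python sketch below shows this with a deliberately simple criterion (always pick the item farthest from everything chosen so far). The criterion, the function name, and the data are hypothetical and are not what coloc2markers uses; since the note above says "almost certainly", the real tool may not be strictly nested.

```python
def greedy_select(positions, k):
    """Toy deterministic greedy selection on 1-D positions."""
    if k <= 0 or not positions:
        return []
    selected = [0]  # deterministic start: always the first item
    while len(selected) < min(k, len(positions)):
        best = max(
            (i for i in range(len(positions)) if i not in selected),
            # Farthest from the current selection; ties go to the lowest index.
            key=lambda i: (min(abs(positions[i] - positions[j]) for j in selected), -i),
        )
        selected.append(best)
    return selected


positions = [0.0, 1.0, 9.0, 4.0, 10.0, 5.5]  # made-up values
small = greedy_select(positions, 3)
large = greedy_select(positions, 5)
```

Here the first three entries of the K=5 result equal the K=3 result, which is why running once with a large K and truncating the list can stand in for several separate runs.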
Optional¶
--binary
- Specify that input matrix is in binary format. Default: false (assumes TSV).
--value-bytes
- Number of bytes per value in the binary matrix. Default: 8, matching the output of cooccurrence. Only used with --binary.
--min-count
- Minimum count for a feature to be considered as a marker. This should be a reasonably large number; otherwise the results are driven by rare genes.
--fixed
- List of markers that must be included in the selection (strings, separated by spaces). Currently assumes the inputs are distinct markers.
--find-neighbors
- Find neighbors (often co-localized genes) for each selected marker.
--neighbors
- Number of top neighbors to find for each marker. Default: 10. Can be used instead of the --find-neighbors flag.
--neighbor-max-rank-fraction
- Maximum rank (as a quantile/fraction among all genes) for two genes to be considered mutual neighbors. Default: 0.1. For example, if gene A is among the top 10% of genes found around gene B, but gene B is not among the top 10% of genes found around gene A (say it falls in the bottom 50%), then A and B are not considered neighbors.
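The mutual-rank rule for --neighbor-max-rank-fraction can be written as a small predicate. This Python sketch assumes 1-based ranks and an inclusive cutoff; the function and argument names are hypothetical, and the tool's exact tie and rounding behavior is not documented here.

```python
def are_mutual_neighbors(rank_a_around_b, rank_b_around_a, n_genes,
                         max_rank_fraction=0.1):
    """True only if each gene ranks within the top fraction around the other."""
    cutoff = max_rank_fraction * n_genes  # e.g. 0.1 * 1000 genes = rank 100
    return rank_a_around_b <= cutoff and rank_b_around_a <= cutoff


# With 1000 genes, both ranks must be within the top 100.
both_close = are_mutual_neighbors(40, 75, n_genes=1000)
one_sided = are_mutual_neighbors(40, 600, n_genes=1000)
```

The first pair passes because both ranks are inside the top 10%. The second fails because one direction ranks far outside it, which matches the one-sided case described for this option.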
Recover factors¶
--recover-factors
- Recover underlying factors from the co-occurrence matrix after marker selection.
--threads
- Number of threads to use for factor recovery. Default: -1 (auto). Only used with --recover-factors.
--max-iter
- Maximum iterations for factor recovery. Default: 500.
--tol
- Convergence tolerance for factor recovery. Default: 1e-6.
--weight-by-counts
- Weight factors by gene counts. Default: false.
--verbose
- Verbosity level for output messages. Default: 0.
Output files¶
- {prefix}.top.tsv - List of selected markers.
- If --find-neighbors or --neighbors is used:
  - {prefix}.pairs.tsv - Detailed pairwise relationships. Columns: index, name of gene 1, total count of gene 1, name of gene 2, total count of gene 2, (weighted) proportion of gene 2 in gene 1's neighbors, vice versa, rank of gene 2 as gene 1's neighbor, vice versa.
  - {prefix}.short.txt - Compact neighbor lists. The first gene on each line is the selected marker, followed by its neighbors.
- If --recover-factors is used: {prefix}.factors.tsv for the recovered factor matrix.
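The compact neighbor list is easy to load for downstream use. The Python sketch below assumes the genes on each line are separated by whitespace (the exact delimiter is not stated above) and that gene names contain no spaces; the function name and the demo content are made up.

```python
import io


def read_neighbor_lists(handle):
    """Map each selected marker to the list of neighbors that follow it."""
    neighbors = {}
    for line in handle:
        genes = line.split()  # assumption: whitespace-delimited gene names
        if not genes:
            continue  # skip blank lines
        neighbors[genes[0]] = genes[1:]
    return neighbors


# Made-up content standing in for {prefix}.short.txt.
demo = io.StringIO("GeneA GeneB GeneC\nGeneD GeneE\n")
marker_neighbors = read_neighbor_lists(demo)
```

Each key is a selected marker and each value is its ordered neighbor list, so `marker_neighbors["GeneA"]` gives the genes listed after GeneA on its line.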