Differential expression tests¶

punkst multi-conditional-de-pixel: Multi-sample (pairwise), cell type-specific DE using pixel-level annotations.
punkst conditional-de-region-pixel: Cell type-specific DE between two GeoJSON-defined regions using one annotation file and one transcript file.
punkst multi-conditional-de-pois: Multi-sample (pairwise), cell type-specific DE using Poisson regression on spot-level data.
punkst de-chisq: Chi-squared DE on pseudobulk matrices.

de-chisq¶

de-chisq runs \(\chi^2\) tests on a pseudobulk matrix for a quick and dirty check on gene enrichment at factor level. Provide one input for a 1-vs-rest test per factor, or two inputs for a pairwise comparison of each factor between datasets.

1-vs-rest per factor:
```
punkst de-chisq --input pseudobulk.tsv --out de.tsv --min-fc 1.5 --max-pval 1e-3
```
By default, it also reports 1-vs-neighbors results that highlight more specific genes enriched in this factor compared to k (default 3) other most similar factors.

Pairwise comparison between two pseudobulk matrices:

punkst de-chisq --input sampleA.pseudobulk.tsv sampleB.pseudobulk.tsv \
  --out de.pair.tsv --min-fc 1.5 --max-pval 1e-4

Input Format¶

The input is a TSV file where the first row must be a header including an arbitrary label for the feature name column (e.g. "Gene") followed by K factor / cell type names. Then each row starts with the gene name followed by counts of that gene for each factor.

Such matrices will be generated by punkst topic-model and pixel-decode.

For pairwise input, the two matrices should have matching factor columns. Feature names are intersected between the two files.

Required Parameters¶

--input - Input TSV file. Provide one file for 1-vs-rest tests, or two files for pairwise comparison. --out - Output TSV file.

Optional Parameters¶

--min-count-per-feature - Minimum total count for a feature to be considered. Default: 100.

--max-pval - Max p-value for output. Default: 1e-3.

--min-fc - Minimum fold change. Default: 1.5.

--min-count - Minimum observed count for a (feature, factor) pair to be tested. Default: 10.

--pseudocount - Pseudocount added to each cell for the Chi-squared test. Default: 0.5.

--threads - Number of threads. Default: 1. (normally unnecessary)

--neighbor-k - Number of nearest neighbor columns to aggregate as background (single input only). Default: 3. Must be smaller than K-1. When set, writes an additional output file with 1-vs-neighbors results.

--confusion - (Pairwise mode only; experimental heuristics) Two confusion matrices to deconvolve the two input pseudobulk matrices before comparison, meant to reduce the bias due to the difference of spatial environment or contaminations (from neighboring factors) between datasets.

Output¶

Single input output columns:

Feature, Factor, Chi2, FoldChange, log10pval

When --neighbor-k > 0, an additional file is written with the same columns and the suffix .1vsNeighbors.tsv (if --out ends with .tsv, that extension is stripped first).

Two-input output columns:

Feature, Factor, Chi2, FoldChange, log10pval, Count1, Count2

Rows are sorted by factor (ascending) and Chi2 (descending) within each factor.

multi-conditional-de-pixel¶

multi-conditional-de-pixel joins pixel-level annotation results with the original transcript data on the fly and performs cell-type-specific DE between dataset groups. It aggregates transcripts into grid units of size --grid-size stratified by pixel level cell type assignments then runs pairwise tests for each cell type using a Binomial model. By default it builds all pairwise contrasts from --labels (or numeric indices); for more general comparisons, use a contrast design file. The main p-values are computed from robust sandwich estimators of the standard errors of the estimated effect sizes. Optional permutation p-values can also be added with --perm.

\(X^{(k)}_{im} \sim\) Binom \((N^{(k)}_{im}, \pi^{(k)}_{im})\) for cell type \(k\), bin \(i\) and gene \(m\), where both \(X\) and \(N\) are soft-aggregated counts from pixel level cell type assignments.

logit \((\pi^{(k)}_{im}) = a^{(k)}_m + y_i b^{(k)}_m\), where \(y_i\) is a binary indicator for the dataset (0/1). The null hypothesis is \(b^{(k)}_m = 0\).

Note: the intended use is when the cell type model is from external sources (e.g. apply pixel-decode with a reference-based cell type pseudobulk matrix). The interpretation is tricky if the cell types are learned from the same datasets. Either way, this is only a data exploration tool and we do not claim that it is statistically rigorous.

Example usage (pairwise contrasts between input datasets):

punkst multi-conditional-de-pixel \
  --anno sampleA/pixel sampleB/pixel --binary \
  --pts sampleA/transcripts.tiled sampleB/transcripts.tiled \
  --labels sampleA sampleB \
  --K 12 --features features.tsv \
  --grid-size 20 --min-count 10 --perm 5000 \
  --icol-x 0 --icol-y 1 --icol-feature 2 --icol-val 3 \
  --out de/pixel_de --threads 8 --seed 1

Example usage with explicit contrast file:

punkst multi-conditional-de-pixel \
  --contrast contrast.tsv \
  --K 12 --features features.tsv \
  --grid-size 20 --min-count 10 --perm 5000 \
  --icol-x 0 --icol-y 1 --icol-feature 2 --icol-val 3 \
  --out de/pixel_de --threads 8 --seed 1

contrast.tsv format (tab-delimited):

anno_prefix pts_prefix  B_vs_A  C_vs_A
sampleA/pixel   sampleA/transcripts.tiled   -1  -1
sampleB/pixel   sampleB/transcripts.tiled   1   0
sampleC/pixel   sampleC/transcripts.tiled   0   1

Each contrast column uses -1 (group 0), 1 (group 1), or 0 (exclude). Contrast names come from the header.

Required Parameters¶

Provide either --anno, --pts (or --anno-data/--anno-index, --pts) or a --contrast design file.

--anno - Prefixes of pixel annotation files (from punkst pixel-decode). For each prefix, the tool expects <prefix>.tsv (or <prefix>.bin if --binary) and <prefix>.index. Cannot be combined with --contrast.

--anno-data, --anno-index - Alternative to --anno. Provide explicit data and index files (same count for each).

--pts - Prefixes of transcript data files (from punkst pts2tiles). For each prefix, the tool expects <prefix>.tsv and <prefix>.index. Cannot be combined with --contrast.

--contrast - Contrast design TSV with columns: anno_prefix, pts_prefix, and one or more contrast columns (values -1/0/1). When provided, it replaces --anno/--anno-data/--anno-index/--pts and --labels.

--K - Number of factors in the annotation files.

--features - A list of features to test, one feature name per line (lines starting with # are ignored).

--grid-size - Grid size used to aggregate transcripts into units.

--icol-x, --icol-y, --icol-feature, --icol-val - Column indices for X/Y coordinates, feature name, and count in the transcript files (0-based).

--out - Output prefix.

Annotation and transcript tiles must use the same tile size for each dataset pair (this is guaranteed if the pixel level decoding results are generated from the corresponding transcript tiles by punkst pixel-decode).

Optional Parameters¶

--labels - Labels for datasets used in pairwise output; defaults to 0..N-1. Ignored when --contrast is supplied.

--binary - Indicates the annotation data files are binary (.bin).

--confusion - Optional per-dataset confusion matrices (K x K TSV). When omitted, the command estimates one confusion matrix from each annotation dataset on the fly.

--min-count-per-feature - Minimum total count (in each pairwise comparison) for a feature to be considered. Default: 100.

--max-pval - Max p-value for output. Default: 1 (output all).

--max-pval-deconv - If at least two cell types (slices) have a p-value less than this threshold, the tool runs a deconvolution to estimate effects adjusted for other cell types. Default: 0.05.

--min-or - Minimum odds ratio for output (bidirectional, so OR\(>x\) and OR\(<1/x\) are kept). Default: 1 (output all).

--min-count - Minimum observed cell-type-specific count for a unit to be included. Default: 10.

--perm - Number of permutations for beta calibration (two-group contrasts). Default: 0 (model-based p-values only).

--min-or-perm - Minimum odds ratio for permutation testing (bidirectional). Default: 1.2.

--seed - Random seed for permutation.

--threads - Number of threads. Default: 1. Almost the entire runtime is spent in data loading, so multi-threading is very useful for large datasets since tiles of data are loaded simultaneously.

--bounded, --xmin, --xmax, --ymin, --ymax - Optional transcript-side bounding-box subsetting before aggregation.

--debug - Enable debug logging.

Output Files¶

Given --out PREFIX, the tool generates one main output file per contrast, plus auxiliary files.

For each contrast, PREFIX.CONTRAST.tsv is generated, where CONTRAST is from the --contrast file header or labelA_vs_labelB for auto-pairwise.
Main output columns:
Slice: Cell type index (0-based).
Feature: Feature name.
Beta: Observed log odds ratio, i.e. logit(Pi1) - logit(Pi0).
log10p: -log10(p-value) for Beta.
Pi0, Pi1: Observed feature proportions in group 0 and group 1.
TotalCount: Total feature count in this cell type and contrast.
Beta_deconv: Deconvolved log odds ratio adjusted for cell type mixing. This is experimental and is only non-zero when deconvolution is triggered and succeeds.
FC_deconv: Deconvolved fold change computed from the deconvolved latent proportions.
log10p_deconv: -log10(p-value) for Beta_deconv.
p_perm: Included only when --perm > 0.
Auxiliary files:
PREFIX.nobs.tsv: The number of units and total pixel counts per cell type and dataset.
PREFIX.sums.tsv: Total feature counts per cell type and dataset for units passing filters.

conditional-de-region-pixel¶

conditional-de-region-pixel runs the same pixel-based conditional DE test, but for a single annotation dataset and a single transcript dataset split into two groups by two polygon-defined regions. The two groups are defined by --region-neg and --region-pos, and the command computes one confusion matrix from each region on the fly from the annotation probabilities that are actually used during aggregation.

Region membership is defined by transcript coordinates. Annotation pixels are used to annotate transcripts after region filtering, with a raster-assisted fast path to avoid exact polygon checks for most transcripts.

Example usage:

punkst conditional-de-region-pixel \
  --anno sample/pixel --binary \
  --pts sample/transcripts.tiled \
  --region-neg regions/control.geojson \
  --region-pos regions/treatment.geojson \
  --region-label-neg control \
  --region-label-pos treatment \
  --K 12 --features features.tsv \
  --grid-size 20 --min-count 10 --perm 2000 \
  --icol-x 0 --icol-y 1 --icol-feature 2 --icol-val 3 \
  --out de/region_de --threads 8

Required Parameters¶

Provide either --anno or both --anno-data and --anno-index.

--anno - Prefix of the pixel annotation files. The tool expects <prefix>.tsv (or <prefix>.bin if --binary) and <prefix>.index.

--anno-data, --anno-index - Alternative to --anno. Provide the explicit annotation data and index files.

--pts - Prefix of the transcript tiled dataset. The tool expects <prefix>.tsv and <prefix>.index.

--region-neg, --region-pos - GeoJSON files defining the negative/reference and positive/comparison regions. See GeoJSON Region Input for the accepted file format.

--K - Number of factors in the annotation file.

--features - A list of features to test, one feature name per line (lines starting with # are ignored).

--grid-size - Grid size used to aggregate transcripts into units.

--icol-x, --icol-y, --icol-feature, --icol-val - Column indices for X/Y coordinates, feature name, and count in the transcript file (0-based).

--out - Output prefix.

The annotation and transcript tiled inputs must use the same tile size.

Optional Parameters¶

--region-label-neg, --region-label-pos - Labels used in the contrast name and auxiliary outputs. Defaults: 0, 1.

--region-scale - Integer coordinate scale used during GeoJSON preprocessing. Default: 10.

--binary - Indicates the annotation data file is binary (.bin).

--pseudo-rel - Relative pseudo count fraction used in the Binomial model. Default: 0.05.

--min-prob - Minimum annotation probability for a pixel to contribute to a slice. Default: 0.01.

--min-count-per-feature - Minimum total count for a feature to be considered. Default: 100.

--max-pval - Max p-value for output. Default: 1.

--max-pval-deconv - If at least two cell types reach this p-value threshold, run deconvolution. Default: 0.05.

--min-or - Minimum odds ratio for output. Default: 1.

--min-or-perm - Minimum odds ratio for doing permutation. Default: 1.2.

--min-count - Minimum observed factor-specific count for a unit to be included. Default: 10.

--perm - Number of permutations for beta calibration. Default: 0.

--seed - Random seed for permutation.

--threads - Number of threads. Default: 1.

--aux-suff - Optional suffix inserted before .nobs.tsv and .sums.tsv.

--debug - Enable debug logging.

Output Files¶

Given --out PREFIX, the command generates:

PREFIX.REGIONNEG_vs_REGIONPOS.tsv, where the names come from --region-label-neg and --region-label-pos.
PREFIX.nobs.tsv
PREFIX.sums.tsv

The main output columns are the same as multi-conditional-de-pixel:

Slice
Feature
Beta
log10p
Pi0
Pi1
TotalCount
Beta_deconv
FC_deconv
log10p_deconv
p_perm when --perm > 0

If the two regions overlap, transcripts in the overlap contribute to both groups.

multi-conditional-de-pois¶

multi-conditional-de-pois performs cell-type-specific differential expression on pre-defined spatial units (e.g. hexagons from tiles2hex, other binned data, single cells, or low resolution data like Visium). It leverages a pre-trained topic model (e.g., LDA from punkst topic-model) to estimate cell type mixture proportions within each unit first, then for each gene and cell type, it fits a Poisson regression model to estimate differential expression between specified groups of datasets.

For gene \(m\) and unit \(i\), the model is: \(Y_{im} \sim \text{Pois}(\lambda_{im})\) with rate \(\lambda_{im}\) defined by a mixture across cell types: \(\lambda_{im} = \sum_k \theta_{ik} g^{-1}( \eta_{mk}^0 + x_i b_{mk} )\) where \(\theta_{ik}\) is the proportion of cell type \(k\) in unit \(i\), \(x_i \in \set{\pm 1}\) indicates the group that unit \(i\) belongs to. The null hypothesis is \(b_{mk}=0\), and \(g\) is a link function. The log1p link \(g^{-1}(\eta)=c_i (\exp(\eta) - 1) \ (c_i=n_i/L)\) is much more efficient than the canonical log link with similar performance in practice.

Example usage with explicit contrast file:

punkst multi-conditional-de-pois \
  --contrast contrast.tsv \
  --model my_lda_model.bin \
  --link log1p \
  --out de/unit_de --threads 8

contrast.tsv format (tab-delimited):

in-data in-meta B_vs_A
sampleA/hex.tsv sampleA/hex.json    -1
sampleB/hex.tsv sampleB/hex.json    1

Each contrast column uses -1 (group 0), 1 (group 1), or 0 (exclude). Contrast names come from the header. The in-data column points to unit-level count data (e.g. from tiles2hex) and in-meta points to the corresponding metadata file.

Required Parameters¶

Provide either --in-data and --in-meta (for automatic pairwise contrasts) or a --contrast design file.

--model - Path to the trained topic model file (e.g. LDA from topic-model).

--out - Output prefix.

--contrast - Contrast design TSV with columns: in-data, in-meta, and one or more contrast columns (values -1/0/1). When provided, it replaces --in-data, --in-meta and --labels.

Optional Parameters¶

--in-data - Input data files (.tsv from tiles2hex).

--in-meta - Metadata files (.json from tiles2hex).

--labels - Labels for datasets used in pairwise output; defaults to 0..N-1. Ignored when --contrast is supplied.

--link - Regression link function: log or log1p. Default: log1p.

--min-count - Minimum total count per unit to be included. Default: 50.

--min-count-per-feature - Minimum total count for a feature to be tested. Default: 100.

--min-units-per-feature - Minimum number of units with a non-zero count for a feature to be tested. Default: K * 10 where K is number of cell types.

--max-pval - Max p-value for output. Default: 1.

--se-method - Method for calculating standard errors. 1: Fisher (model-based), 2: Sandwich (robust), 3: both. Default: 1.

--size-factor - Size factor \(L\) for the Poisson regression model. Default: 10000.0.

--threads - Number of threads. Default: 1.

Output Files¶

Given --out PREFIX, the tool generates one main output file per contrast, plus an auxiliary file.

For each contrast, PREFIX.CONTRAST.tsv is generated, where CONTRAST is from the --contrast file header or labelA_vs_labelB for auto-pairwise.
Columns:
- Feature: Feature name.
- Factor: Cell type name.
- Beta: log fold change for the feature in this cell type.
- SE: Standard error of Beta. If se-method=3, an additional SE0 column with the Fisher-based SE is included.
- log10p: -log10(p-value) for Beta.
- Dist2bd: Distance of the beta estimate to the optimization boundary. Small values may indicate unstable estimates.
- n: Number of units included in the test for this feature.
- Count1, Count2: Total counts of the feature in group 1 and group 2.
An auxiliary file PREFIX.CONTRAST.eta0.tsv is also generated, containing the fitted baseline (log) expression rates for each gene and cell type.