Visualization

High-resolution image of pixel-level factorization results

draw-pixel-factors visualizes the results of pixel-decode.

punkst draw-pixel-factors --in-tsv ${path}/pixel.decode.tsv --header-json ${path}/pixel.decode.json --in-color ${path}/color.rgb.tsv --out ${path}/pixel.png --scale 100 --xmin ${xmin} --xmax ${xmax} --ymin ${ymin} --ymax ${ymax}

--in-tsv specifies the input data file created by pixel-decode.

--header-json specifies the header created by pixel-decode.

--in-color specifies a TSV file with the colors for each factor. The first three columns are interpreted as R, G, B values in the range 0-255. Valid lines are assigned to factors in the order they appear in this file.

--xmin, --xmax, --ymin, --ymax specify the range of the coordinates.

--scale scales input coordinates to pixels in the output image: a point with horizontal coordinate x is drawn at horizontal pixel coordinate int((x-xmin)/scale).

--out specifies the output png file.
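The coordinate-to-pixel mapping described for --scale can be sketched in a few lines of Python. This is only an illustration of the formula int((x-xmin)/scale); the function name to_pixel is hypothetical, and applying the same formula to the vertical axis with ymin is an assumption made by analogy, not something taken from the punkst source.

```python
def to_pixel(x, y, xmin, ymin, scale):
    """Map input coordinates to integer pixel indices.

    The horizontal formula int((x - xmin) / scale) is the one stated above;
    the vertical formula is assumed to be analogous.
    """
    col = int((x - xmin) / scale)
    row = int((y - ymin) / scale)
    return col, row


# With --scale 100 and --xmin 1000 --ymin 0, every 100 x 100 block of
# input units collapses into a single pixel.
print(to_pixel(1250, 300, xmin=1000, ymin=0, scale=100))
```

A larger --scale therefore produces a smaller, coarser image, and a smaller --scale produces a larger, finer one.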

If you specified --transform in lda4hex, one way to create the color table is to use the helper Python script:

python punkst/ext/py/color_helper.py --input ${path}/prefix.results.tsv --output ${path}/color
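If you would rather build a color table by hand, the following is a minimal sketch of writing one in the layout described for --in-color: one line per factor, with R, G, B integers in the first three tab-separated columns. The function names, the evenly spaced hue palette, and the absence of a header line are all assumptions for illustration; this is not what color_helper.py does.

```python
import colorsys


def make_color_table(n_factors):
    """Return n_factors (R, G, B) tuples with evenly spaced hues, values 0-255."""
    rows = []
    for k in range(n_factors):
        r, g, b = colorsys.hsv_to_rgb(k / n_factors, 0.8, 0.9)
        rows.append((round(r * 255), round(g * 255), round(b * 255)))
    return rows


def color_table_text(rows):
    """Format the tuples as tab-separated lines, one factor per line, in order."""
    return "".join(f"{r}\t{g}\t{b}\n" for r, g, b in rows)


# Print a table for 4 factors; redirect or write this text to a file
# and pass that file to --in-color.
print(color_table_text(make_color_table(4)), end="")
```

Because lines are assigned to factors in order, the k-th line of the file colors the k-th factor.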

Pseudobulk differential expression analysis

de_bulk.py performs naive differential expression analysis on pseudobulk count data.

The script performs chi-square tests to identify genes that are significantly enriched in each factor compared to the background, outputting results with fold changes, p-values, and chi-square statistics.

Note that the statistics are not calibrated when the pseudobulk table is generated by punkst pixel-decode; in that case the results are only meant for exploratory analysis.
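To make the idea of a per-gene, per-factor chi-square test concrete, here is a minimal sketch of a Pearson chi-square test on a 2x2 table (gene vs. other genes, factor vs. other factors). It is not the code in de_bulk.py: the function names, the construction of the 2x2 table, and the fold change definition (the gene's fraction of counts in the factor divided by its fraction in all other factors) are assumptions for illustration.

```python
import math


def chi2_2x2(a, b, c, d):
    """Pearson chi-square statistic and p-value (1 d.f.) for [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0, 1.0  # a margin is empty, so there is nothing to test
    stat = n * (a * d - b * c) ** 2 / denom
    # Survival function of chi-square with 1 d.f.: erfc(sqrt(stat / 2)).
    pval = math.erfc(math.sqrt(stat / 2.0))
    return stat, pval


def gene_vs_background(gene_in_factor, gene_total, factor_total, grand_total):
    """Test one gene in one factor against all other factors (assumed setup)."""
    a = gene_in_factor                   # this gene, this factor
    b = gene_total - a                   # this gene, other factors
    c = factor_total - a                 # other genes, this factor
    d = grand_total - gene_total - c     # other genes, other factors
    stat, pval = chi2_2x2(a, b, c, d)
    rest_total = grand_total - factor_total
    # Assumed fold change: fraction in factor / fraction in background.
    fold = (a / factor_total) / (b / rest_total) if b > 0 else float("inf")
    return fold, stat, pval


print(gene_vs_background(20, 30, 30, 60))
```

Pseudobulk counts from pixel-decode are sums of soft assignments rather than independent observations, which is one way to see why the resulting p-values should be read as a ranking aid rather than as calibrated significance levels.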

python ext/py/de_bulk.py --input ${path}/pseudobulk.tsv --output ${path}/de_bulk.tsv --feature_label Feature --thread 4

--input specifies the input pseudobulk count table with genes as rows and factors as columns. punkst pixel-decode generates one such file with the suffix pseudobulk.tsv. The file must have a header row with column names, one column with gene names, and one column for each factor.

--output specifies the output file for differential expression results.

--feature_label specifies the column name for feature names (default: "Feature").

--min_ct_per_feature specifies the minimum total count for a feature to be included in the analysis (default: 50).

--max_pval_output specifies the maximum p-value threshold for output (default: 1e-3).

--min_fold_output specifies the minimum fold change threshold for output (default: 1.5).

--min_output_per_factor specifies the minimum number of top genes to output per factor, even if they are not significant (default: 10).

--thread specifies the number of threads for parallel processing (default: 1).

--use_input_header if specified, all columns except the one named by --feature_label are treated as factors, and the column names are preserved as factor IDs in the output. Otherwise (default), only columns whose names contain integers are considered factors, and the numbers are extracted and used as factor IDs.

--feature specifies an optional file listing the features to restrict the analysis to. The column containing feature names must be named as specified by --feature_label.
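The two ways of choosing factor columns described for --use_input_header can be sketched as follows. This is an illustration of the documented behavior only: the function name select_factor_columns is hypothetical, and taking the first run of digits in a column name as the factor ID is an assumption, since the exact extraction rule in de_bulk.py is not stated here.

```python
import re


def select_factor_columns(header, feature_label="Feature", use_input_header=False):
    """Map each factor column name to the factor ID used in the output."""
    factors = {}
    for name in header:
        if name == feature_label:
            continue  # the feature-name column is never a factor
        if use_input_header:
            factors[name] = name  # keep the column name as the factor ID
        else:
            match = re.search(r"\d+", name)  # assumed rule: first integer found
            if match:
                factors[name] = match.group(0)
    return factors


header = ["Feature", "0", "1", "Factor_2", "total"]
print(select_factor_columns(header))
print(select_factor_columns(header, use_input_header=True))
```

In this sketch, a column such as "total" is silently dropped in the default mode, so --use_input_header is the safer choice when factor columns carry descriptive names without numbers.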

HTML report for factor weights and top genes

factor_report.py generates HTML reports summarizing factor characteristics and top genes.

The script generates an interactive HTML report (${output_pref}.factor.info.html) and a TSV summary (${output_pref}.factor.info.tsv) containing factor weights, top differentially expressed genes, and visualization colors.

python ext/py/factor_report.py --de ${path}/de_bulk.tsv --pseudobulk ${path}/pseudobulk.tsv --color_table ${path}/color.rgb.tsv --output_pref ${path}/report

--de specifies the differential expression results file from de_bulk.py.

--pseudobulk specifies the pseudobulk count table.

--color_table specifies the RGB color table for factors. This should normally be the same file as the one used for punkst draw-pixel-factors.

--output_pref specifies the output prefix for generated files.

--feature_label specifies the column name for features (default: "Feature").

--n_top_gene specifies the maximum number of top genes to include in the report (default: 20).

--min_top_gene specifies the minimum number of top genes to show per factor (default: 10).

--max_pval specifies the maximum p-value threshold for significant genes (default: 0.001).

--min_fc specifies the minimum fold change threshold for significant genes (default: 1.5).

--annotation specifies an optional file with factor annotations to display instead of the factor IDs. It is a TSV file where the first column contains factor IDs as they appear in the header of the pseudobulk table, and the second column contains the annotation.

--anchor specifies an optional file with anchor genes chosen to represent each factor. It is a TSV file where the first column contains factor IDs as they appear in the header of the pseudobulk table, and the second column contains the anchor gene names, separated by any delimiter other than a tab.
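The layout of the --annotation and --anchor files can be illustrated with a small parsing sketch. It only shows the two-column structure described above and is not the parser in factor_report.py: the function names are hypothetical, and treating commas, semicolons, and whitespace as the gene separators is an assumption (the documentation only says the separator is not a tab).

```python
import re


def parse_annotation(lines):
    """Read 'factor_id<TAB>annotation' lines into a dict."""
    annotation = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2 or not fields[0]:
            continue  # skip blank or malformed lines
        annotation[fields[0]] = fields[1]
    return annotation


def parse_anchors(lines, sep=r"[,;\s]+"):
    """Read 'factor_id<TAB>gene1,gene2,...' lines into a dict of gene lists."""
    anchors = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2 or not fields[0]:
            continue
        genes = [g for g in re.split(sep, fields[1].strip()) if g]
        anchors[fields[0]] = genes
    return anchors


print(parse_annotation(["0\tT cell\n", "1\tB cell\n"]))
print(parse_anchors(["0\tCD3E,CD3D\n", "1\tMS4A1 CD79A\n"]))
```

The factor IDs in the first column must match the pseudobulk header exactly, so a table with columns "0" and "1" needs the IDs "0" and "1" in these files.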