Visualization¶
High resolution image of pixel level factorization results¶
draw-pixel-factors
visualizes the results of pixel-decode
punkst draw-pixel-factors --in-tsv ${path}/pixel.decode.tsv --header-json ${path}/pixel.decode.json --in-color ${path}/color.rgb.tsv --out ${path}/pixel.png --scale 100 --xmin ${xmin} --xmax ${xmax} --ymin ${ymin} --ymax ${ymax}
--in-tsv
specifies the input data file created by pixel-decode
.
--header-json
specifies the header created by pixel-decode
.
--in-color
specifies a TSV file with the colors for each factor. The first three columns are interpreted as R, G, B values in the range 0 to 255. Valid lines are assigned to factors in the order they appear in this file.
--xmin
, --xmax
, --ymin
, --ymax
specify the range of the coordinates.
--scale
scales input coordinates to pixels in the output image. int((x-xmin)/scale)
equals the horizontal pixel coordinate in the image.
--out
specifies the output png file.
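As an illustration of how --scale and the coordinate range interact, the following minimal sketch applies the horizontal formula stated above to a single point. Applying the same formula to the vertical axis with ymin is an assumption, and the function name and numeric values are hypothetical.

```python
def to_pixel(x, y, xmin, ymin, scale):
    """Map input coordinates to integer pixel indices.

    The horizontal rule int((x - xmin) / scale) is the one documented for
    --scale; using the same rule for y is an assumption of this sketch.
    """
    col = int((x - xmin) / scale)
    row = int((y - ymin) / scale)  # assumed to mirror the horizontal rule
    return col, row


# Hypothetical values, for illustration only
xmin, ymin, scale = 1000.0, 2000.0, 100
print(to_pixel(1250.0, 2999.0, xmin, ymin, scale))  # (2, 9)
```

With --scale 100, as in the command above, every 100 units of input coordinates collapse into one pixel along the horizontal axis.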
If you specified --transform
in lda4hex
, one way to create the color table is to use the helper python script
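Separately from the helper script, a color table in the format expected by --in-color can also be written by hand. The sketch below is not part of punkst; it spaces hues evenly and prints one R, G, B line per factor. It assumes that a header-less file with exactly three tab-separated integer columns is acceptable, and the function name and the number of factors are hypothetical.

```python
import colorsys


def make_color_rows(n_factors):
    """Return n_factors (R, G, B) tuples with values in 0..255, evenly spaced in hue."""
    rows = []
    for k in range(n_factors):
        r, g, b = colorsys.hsv_to_rgb(k / n_factors, 0.75, 0.9)
        rows.append((round(r * 255), round(g * 255), round(b * 255)))
    return rows


rows = make_color_rows(12)  # hypothetical number of factors
# One line per factor; line order determines which factor gets which color.
table_text = "".join(f"{r}\t{g}\t{b}\n" for r, g, b in rows)
print(table_text, end="")
```

Redirecting the printed output to a file (for example color.rgb.tsv) gives a table whose i-th line is used for the i-th factor.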
Pseudobulk differential expression analysis¶
de_bulk.py
performs naive differential expression analysis on pseudobulk count data.
The script performs chi-square tests to identify genes that are significantly enriched in each factor compared to the background, outputting results with fold changes, p-values, and chi-square statistics.
Note that the statistics are not calibrated when the pseudobulk table is generated by punkst pixel-decode
, so the results are meant only for exploratory analysis.
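To make the kind of test concrete, here is a minimal sketch of a 2x2 Pearson chi-square test for one gene in one factor. It is an illustration under stated assumptions, not the code in de_bulk.py: the background is assumed to be all other factors pooled, no continuity correction is applied, and the function name and counts are hypothetical.

```python
import math


def chi2_enrichment(gene_in_factor, total_in_factor, gene_in_rest, total_in_rest):
    """Return (fold_change, chi2, p_value) for one gene in one factor vs. the rest.

    Assumption: the background is every other factor pooled together.
    """
    a = gene_in_factor                    # gene counts in the factor
    b = total_in_factor - gene_in_factor  # other counts in the factor
    c = gene_in_rest                      # gene counts in the background
    d = total_in_rest - gene_in_rest      # other counts in the background
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return float("nan"), float("nan"), float("nan")
    chi2 = n * (a * d - b * c) ** 2 / denom
    # Upper tail of a chi-square with 1 degree of freedom: erfc(sqrt(chi2 / 2))
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    fold = (a / (a + b)) / (c / (c + d)) if c > 0 else float("inf")
    return fold, chi2, p_value


# Hypothetical counts, for illustration only
fold, chi2, p_value = chi2_enrichment(300, 10_000, 600, 90_000)
print(f"fold={fold:.2f} chi2={chi2:.1f} p={p_value:.3g}")
```

Because pseudobulk counts from pixel-level decoding are not independent draws, p-values from a test like this are best read as a ranking aid rather than as calibrated significance levels.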
python ext/py/de_bulk.py --input ${path}/pseudobulk.tsv --output ${path}/de_bulk.tsv --feature_label Feature --thread 4
--input
specifies the input pseudobulk count table with genes as rows and factors as columns. punkst pixel-decode
generates one such file with suffix pseudobulk.tsv
. This file must have a header row with column names, one column with gene names, and one column for each factor.
--output
specifies the output file for differential expression results.
--feature_label
specifies the column name for feature names (default: "Feature").
--min_ct_per_feature
minimum total count for a feature to be included in analysis (default: 50).
--max_pval_output
maximum p-value threshold for output (default: 1e-3).
--min_fold_output
minimum fold change threshold for output (default: 1.5).
--min_output_per_factor
minimum number of top genes to output per factor even if not significant (default: 10).
--thread
number of threads for parallel processing (default: 1).
--use_input_header
if specified, all columns except for the column named --feature_label
are treated as factors, and their column names are preserved as factor IDs in the output. Otherwise (default), only columns whose names contain integers are treated as factors, and those integers are extracted and used as factor IDs.
--feature
optional file listing the features to which the analysis is restricted. The column containing feature names should be named as specified by --feature_label
.
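The two ways of choosing factor columns described for --use_input_header can be sketched as follows. This is a hypothetical re-implementation for illustration only: the rule "take the first run of digits in the column name" is an assumption about how the integers are extracted, and the header names are made up.

```python
import re


def select_factor_columns(header, feature_label="Feature", use_input_header=False):
    """Return (column_name, factor_id) pairs for the columns treated as factors."""
    selected = []
    for name in header:
        if name == feature_label:
            continue  # the feature-name column is never a factor
        if use_input_header:
            selected.append((name, name))  # keep the column name as the factor ID
        else:
            match = re.search(r"\d+", name)  # assumed extraction rule
            if match:
                selected.append((name, match.group()))
    return selected


header = ["Feature", "Factor_0", "Factor_1", "note"]  # hypothetical header
print(select_factor_columns(header))
# [('Factor_0', '0'), ('Factor_1', '1')]
print(select_factor_columns(header, use_input_header=True))
# [('Factor_0', 'Factor_0'), ('Factor_1', 'Factor_1'), ('note', 'note')]
```

In the default mode a column without digits in its name is ignored, so tables with descriptive column names would need --use_input_header.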
HTML report for factor weights and top genes¶
factor_report.py
generates HTML reports summarizing factor characteristics and top genes.
The script generates an interactive HTML report (${output_pref}.factor.info.html
) and a TSV summary (${output_pref}.factor.info.tsv
) containing factor weights, top differentially expressed genes, and visualization colors.
python ext/py/factor_report.py --de ${path}/de_bulk.tsv --pseudobulk ${path}/pseudobulk.tsv --color_table ${path}/color.rgb.tsv --output_pref ${path}/report
--de
specifies the differential expression results file from de_bulk.py
.
--pseudobulk
specifies the pseudobulk count table.
--color_table
specifies the RGB color table for factors. This should normally be the same file as the one used for punkst draw-pixel-factors
.
--output_pref
specifies the output prefix for generated files.
--feature_label
specifies the column name for features (default: "Feature").
--n_top_gene
maximum number of top genes to include in report (default: 20).
--min_top_gene
minimum number of top genes to show per factor (default: 10).
--max_pval
maximum p-value threshold for significant genes (default: 0.001).
--min_fc
minimum fold change threshold for significant genes (default: 1.5).
--annotation
optional file with factor annotations to display instead of the factor IDs. It is a TSV file where the first column contains factor IDs as they appear in the header of the pseudobulk table, and the second column contains the annotation.
--anchor
optional file with anchor genes chosen to represent each factor. It is a TSV file where the first column contains factor IDs as they appear in the header of the pseudobulk table, and the second column contains the anchor gene names (separated by any delimiter other than a tab).
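As an illustration of the two optional files, the sketch below builds the text of an --annotation file and an --anchor file. The factor IDs, annotations, and gene names are hypothetical, commas are used as one possible non-tab separator for anchor genes, and the absence of a header line is an assumption.

```python
def to_two_column_tsv(rows):
    """Join (factor_id, value) pairs into tab-separated lines."""
    return "".join(f"{factor_id}\t{value}\n" for factor_id, value in rows)


# Hypothetical factor IDs, matching the header of the pseudobulk table
annotation_rows = [("0", "Hepatocyte"), ("1", "Endothelial")]
anchor_genes = {"0": ["Alb", "Apoa1"], "1": ["Pecam1", "Kdr"]}

annotation_text = to_two_column_tsv(annotation_rows)
# Anchor genes share the second column, joined by a non-tab separator
anchor_text = to_two_column_tsv(
    (factor_id, ",".join(genes)) for factor_id, genes in anchor_genes.items()
)

print(annotation_text, end="")
print(anchor_text, end="")
```

Each printed block can be saved to its own file and passed to --annotation and --anchor respectively.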