Notes on processing data from different technologies/platforms

These notes describe how to convert raw data from different platforms into the generic pixel/transcript-level input format for punkst. This input format is a plain TSV file that stores transcripts/pixels by spatial tile, accompanied by an index file. The minimum information is x and y coordinates (in microns), transcript IDs (gene names), and counts. If you have auxiliary information or annotations that may be useful for downstream analysis, such as z-plane, cell IDs, or subcellular compartment annotations, keep them as additional columns.

For spot/single-cell level data, topic-model now accepts 10X-style DGE files directly, but you can also convert them to our custom format (see the last section below).

After the conversion, you can follow the standard workflow described in the quick start page to run the pipeline (specify all coordinate/size related parameters in microns). For platforms that provide cell coordinates, we also extract the cell centers, and you can try the experimental workflow in examples/with_cell_centers.

(We currently document examples for 10X Genomics Visium HD, Xenium, NanoString CosMx SMI, and Vizgen MERSCOPE data. We have also applied punkst to Seq-scope, Stereo-seq, and other similar platforms, and we are working on providing more information.)

Visium HD

First we need to locate the "binned output" directory and the subdirectory with the original resolution. For data downloaded from the 10X website it is called binned_outputs/square_002um. Let's call it RAWDIR. In this directory, you should find the subdirectories spatial and filtered_feature_bc_matrix (the data in raw_feature_bc_matrix may also work, but this has not been tested yet).

In the spatial directory, you should have a JSON file named scalefactors_json.json that contains the scaling factor of the coordinates. Let's grep the scaling factor (or set it manually).

microns_per_pixel=$(grep -w microns_per_pixel ${RAWDIR}/spatial/scalefactors_json.json | perl -lane '$_ =~ m/.*"microns_per_pixel": ([0-9.]+)/; print $1' )
#  "microns_per_pixel": 0.2737129726047599 in an example data
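If the JSON file is missing or laid out differently, the command above leaves microns_per_pixel empty without any error, and the later steps then receive an empty value. A minimal sketch of a sanity check follows. It uses sed instead of perl for the extraction, and the demo file is a made-up stand-in for ${RAWDIR}/spatial/scalefactors_json.json that contains only the example value quoted above.

```shell
# Made-up demo file; in practice point this at
# ${RAWDIR}/spatial/scalefactors_json.json instead.
scale_json=$(mktemp)
printf '{\n  "microns_per_pixel": 0.2737129726047599\n}\n' > "${scale_json}"

# Same idea as the perl one-liner above: keep only the numeric value.
microns_per_pixel=$(grep -w microns_per_pixel "${scale_json}" \
  | sed -E 's/.*"microns_per_pixel": *([0-9.]+).*/\1/')

# Accept the value only if it is non-empty and made of digits and dots.
case "${microns_per_pixel}" in
  ''|*[!0-9.]*)
    scale_ok=no
    echo "could not parse microns_per_pixel from ${scale_json}" >&2
    ;;
  *)
    scale_ok=yes
    echo "microns_per_pixel=${microns_per_pixel}"
    ;;
esac
rm -f "${scale_json}"
```

If the check fails, open the JSON file and set microns_per_pixel by hand.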

The spatial coordinates for each barcode are stored in Parquet format, which we do not support directly, so let's convert the file to a plain TSV file. duckdb seems fast and easy to use:

cd ${RAWDIR}/spatial/
duckdb -c "COPY (SELECT * FROM read_parquet('tissue_positions.parquet')) TO 'tissue_positions.tsv' (HEADER, DELIMITER '\t');"

Alternatively, you can use pyarrow and pandas in Python.

Next, punkst has a command that merges the 10X-style DGE files and the spatial coordinates into a single file in our standard input format:

brc_raw=${RAWDIR}/spatial/tissue_positions.tsv # the one converted from parquet
mtx_path=${RAWDIR}/filtered_feature_bc_matrix # path to the 10X style dge files
punkst convert-dge \
--microns-per-pixel ${microns_per_pixel} \
--exclude-regex '^mt-' --in-tissue-only \
--in-positions ${brc_raw} \
--in-dge-dir ${mtx_path} \
--output-dir ${path} \
--coords-precision 4

Here the optional flag --exclude-regex takes a regular expression and excludes every gene whose name matches it. In the above example we exclude all mitochondrial genes.
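To preview what a pattern would remove before running the conversion, you can test it on a few gene names with grep. This is only a sketch: it assumes that punkst interprets the pattern the same way grep -E does (case-sensitive, anchored by ^), which is not confirmed here, and the gene list is a made-up sample. Mouse mitochondrial gene symbols start with lowercase mt- (for example mt-Co1), while human ones start with MT- (for example MT-CO1), so under this assumption the pattern needs to match the species of your data.

```shell
# Made-up sample of gene names (mouse-style and human-style symbols mixed).
genes=$(printf '%s\n' mt-Co1 mt-Nd1 Mtor MT-CO1 Actb)

# Names the pattern would exclude, and names it would keep.
excluded=$(printf '%s\n' "${genes}" | grep -E '^mt-')
kept=$(printf '%s\n' "${genes}" | grep -Ev '^mt-')

echo "excluded: $(printf '%s\n' "${excluded}" | paste -s -d ' ' -)"
echo "kept:     $(printf '%s\n' "${kept}" | paste -s -d ' ' -)"
```

In practice you could feed the gene names from the features file in ${mtx_path} into the same grep commands.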

The optional flag --in-tissue-only will exclude all barcodes that are labeled as not in the tissue.

The command writes transcripts.tsv, with coordinates in microns, to the output directory (${path} above, which you need to set beforehand).

The first column of the output file is the row index (0-based) of the barcode in the input barcode file (${mtx_path}/barcodes.tsv.gz), in case we want to recover the original barcode ID later.
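Recovering a barcode from its row index can be sketched as follows, assuming the barcode file holds one barcode per line with no header. The demo file and the barcode strings are made up for illustration; in practice read ${mtx_path}/barcodes.tsv.gz instead.

```shell
# Made-up demo barcode file standing in for ${mtx_path}/barcodes.tsv.gz.
barcode_file=$(mktemp)
printf '%s\n' s_002um_00000_00000-1 s_002um_00000_00001-1 s_002um_00001_00000-1 \
  | gzip -c > "${barcode_file}"

# Row index as stored in the first column of the output (0-based).
idx=1

# sed counts lines from 1, so a 0-based index k is line k+1.
barcode=$(gzip -dc < "${barcode_file}" | sed -n "$((idx + 1))p")
echo "${barcode}"
rm -f "${barcode_file}"
```

For many lookups at once, it is cheaper to decompress the barcode file once and join it with the indices than to rerun sed for every index.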

CosMx SMI