Notes on processing data from different technologies/platforms¶
Here are instructions on how to convert raw data from different platforms to the generic input format for FICTURE.
Except for Visium HD (see below), we provide a pair of template Makefile and config_prepare.json files in the examples/format_input directory for each platform. You can copy config_prepare.json to your working directory, modify the parameters, then generate the concrete Makefile and run it:
python /path/to/punkst/ext/py/generate_workflow.py \
    -t /path/to/punkst/examples/format_input/cosmx/Makefile \
    -c config_prepare.json -m Makefile
make
After the conversion, you can follow the standard workflow as described in the quick start page to run the pipeline (and specify all coordinate/size related parameters in microns). For platforms that provide cell coordinates, we also extract the cell centers, so you can try the experimental workflow in examples/with_cell_centers.
Visium HD¶
First we need to locate the "binned output" directory and the subdirectory with the original resolution. For data downloaded from the 10X website it is called binned_outputs/square_002um; let's call it RAWDIR. In this directory, you should find the subdirectories spatial and filtered_feature_bc_matrix (you could probably also use the data in raw_feature_bc_matrix, but we have not tried it). In the spatial directory, you should have a JSON file named scalefactors_json.json that contains the scaling factor for the coordinates.
Let's grep the scaling factor (or set it manually).
microns_per_pixel=$(grep -w microns_per_pixel ${RAWDIR}/spatial/scalefactors_json.json | perl -lane '$_ =~ m/.*"microns_per_pixel": ([0-9.]+)/; print $1' )
# "microns_per_pixel": 0.2737129726047599 in an example data
The spatial coordinates for each barcode are stored in Parquet format, which we do not support directly, so let's convert it to a plain TSV file. duckdb is fast and easy to use:
cd ${RAWDIR}/spatial/
duckdb -c "COPY (SELECT * FROM read_parquet('tissue_positions.parquet')) TO 'tissue_positions.tsv' (HEADER, DELIMITER '\t');"
Alternatively, you can use pyarrow and pandas in Python, as sketched below.
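A minimal sketch (assuming pandas with pyarrow support is installed; run it from ${RAWDIR}/spatial/ like the duckdb command above):
python - <<'EOF'
import pandas as pd  # requires pandas with a Parquet engine (pyarrow)
df = pd.read_parquet("tissue_positions.parquet")
df.to_csv("tissue_positions.tsv", sep="\t", index=False)
EOF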
Next, punkst has a command to merge the 10X style dge files (and the spatial coordinates) into a single file as our standard input:
brc_raw=${RAWDIR}/spatial/tissue_positions.tsv # the file converted from parquet
mtx_path=${RAWDIR}/filtered_feature_bc_matrix # path to the 10X style dge files
path=/path/to/output_dir # directory where the converted files will be written
punkst convert-dge \
--microns-per-pixel ${microns_per_pixel} \
--exclude-regex '^mt-' --in-tissue-only \
--in-positions ${brc_raw} \
--in-dge-dir ${mtx_path} \
--output-dir ${path} \
--coords-precision 4
Here the optional flag --exclude-regex takes a regular expression; genes matching the regex are excluded. In the example above we exclude all mitochondrial genes. The optional flag --in-tissue-only excludes all barcodes that are labeled as not in the tissue. The command writes transcripts.tsv with coordinates in microns.
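Optionally, a quick sanity check on the converted file:
head -3 ${path}/transcripts.tsv # inspect the first few records
wc -l ${path}/transcripts.tsv # count the number of records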
CosMx SMI¶
You can use the template Makefile and config_prepare.json in punkst/examples/format_input/cosmx to convert CosMx raw output files to the generic input format. Alternatively, see the bash commands below.
Copy the config file to your directory and set the raw file names. For example, here is a config for the public mouse half brain data:
{
"workflow": {
"raw_tx" : "Run1000_S1_Half_tx_file.csv",
"raw_meta": "Run1000_S1_Half_metadata_file.csv",
"microns_per_pixel": 0.12,
"datadir": "/output/test"
}
}
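With this config in place, generate the concrete Makefile with generate_workflow.py as shown at the top of this page, then preview and run it (make -n prints the commands without executing them):
python /path/to/punkst/ext/py/generate_workflow.py \
    -t /path/to/punkst/examples/format_input/cosmx/Makefile \
    -c config_prepare.json -m Makefile
make -n # dry run: show the commands that would be executed
make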
You can find "microns_per_pixel" in the ReadMe.html; it may say something like "To convert to microns multiply the pixel value by 0.12 um per pixel".
The following are the commands run in the Makefile:
# Extract cell center coordinates (pixels converted to microns)
cut -d',' -f 7-8 ${RAW_META} | tail -n +2 | \
    awk -F',' -v mu=${MICRONS_PER_PIXEL} \
    '{printf "%.2f\t%.2f\n", mu * $1, mu * $2}' > cell_coordinates.tsv
# Extract transcripts (coordinates converted to microns, one count per molecule)
awk -F',' -v OFS="\t" -v mu=${MICRONS_PER_PIXEL} '
    NR==1 {gsub(/"/, "", $0); print $3, $4, $8, "count", $7, $9}
    NR>1  {gsub(/"/, "", $8); gsub(/"/, "", $9); printf "%.2f\t%.2f\t%s\t%d\t%d\t%s\n", mu*$3, mu*$4, $8, 1, $7, $9}
' ${RAW_TX} > transcripts.tsv
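If you prefer Python over awk, here is a roughly equivalent pandas sketch (the column positions are copied from the awk command above and the two extra columns are dropped; verify them against your tx file's header):
python - <<'EOF'
import pandas as pd
mu = 0.12  # microns per pixel, from the ReadMe.html
tx = pd.read_csv("Run1000_S1_Half_tx_file.csv")
out = pd.DataFrame({
    "X": tx.iloc[:, 2] * mu,   # column 3 in the awk command (global x, pixels)
    "Y": tx.iloc[:, 3] * mu,   # column 4 (global y, pixels)
    "gene": tx.iloc[:, 7],     # column 8 (gene/target name)
    "count": 1,
})
out.to_csv("transcripts.tsv", sep="\t", index=False, float_format="%.2f")
EOF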
MERSCOPE¶
You can use the template Makefile and config_prepare.json in punkst/examples/format_input/merscope to convert MERSCOPE raw output files to the generic input format.
Set "rawdir" to the path that contains the MERSCOPE output files; we will need cell_metadata.csv.gz and detected_transcripts.csv.gz. If your data is compressed, set "compressed" to 1; otherwise (plain csv) set it to 0. Set "datadir" to the output directory.
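For example (a hypothetical config mirroring the structure of the CosMx example above; the template config_prepare.json in the merscope directory lists the exact keys):
{
  "workflow": {
    "rawdir": "/data/merscope_region_0",
    "compressed": 1,
    "datadir": "/output/test"
  }
}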
The following are the commands run in the Makefile:
# Extract cell coordinates (MERSCOPE coordinates are already in microns)
zcat ${RAWDIR}/cell_metadata.csv.gz | cut -d',' -f 4-9 | tail -n +2 | awk -F',' -v OFS="\t" '{ print $1, $2; }' > cell_coordinates.tsv
# Extract transcripts: keep columns 2-5 and 9, convert to TSV, append a count column of 1, and turn the first line into a commented header
zcat ${RAWDIR}/detected_transcripts.csv.gz \
| cut -d',' -f2-5,9 \
| sed \
-e '0,/barcode/{s/barcode/#barcode/}' \
-e 's/,/\t/g' \
-e 's/$/\t1/' \
-e '0,/barcode/{s/\t1$/\tcount/}' \
> transcripts.tsv
Xenium¶
You can use the template Makefile and config_prepare.json in punkst/examples/format_input/xenium to convert Xenium raw output files to the generic input format.
In the config file, you need to specify "raw_transcripts" as the path of the transcript file transcripts.csv.gz and "raw_cells" as the path of the cell metadata cells.csv.gz.
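For example (a hypothetical config mirroring the CosMx example above; check the template in the xenium directory for the full set of keys, such as the output directory):
{
  "workflow": {
    "raw_transcripts": "/data/xenium_output/transcripts.csv.gz",
    "raw_cells": "/data/xenium_output/cells.csv.gz"
  }
}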
The following are the commands run in the Makefile:
# Extract transcripts (Xenium coordinates are already in microns)
zcat transcripts.csv.gz \
| cut -d',' -f4-6 | sed 's/"//g' \
| awk -F',' -v OFS="\t" '{ print $2, $3, $1, "1" }' \
> transcripts.tsv
# Extract cell coordinates
zcat cells.csv.gz \
| cut -d',' -f2-3 \
| tail -n +2 \
| awk -F',' -v OFS="\t" '{printf "%.4f\t%.4f\n", $1,$2;}' > cell_coordinates.tsv