tile-op¶
tile-op provides utilities to view and manipulate the tiled data files created by punkst pixel-decode or punkst pts2tiles.
It supports:
- inspecting the index
- converting binary tiled files to TSV
- reorganizing fragmented tiles into a regular grid
- merging multiple inference results
- annotating a (tiled) point-level file with inference results
- computing joint probability distributions of factors
(Except for printing the index, all operations are intended to be used separately.)
Usage¶
Main input & output¶
The main input is the set of tiled pixel-level files created by punkst pixel-decode, in either the custom binary format or plain TSV format.
You can specify the pair of data and index files using --in-data and --in-index, or specify the prefix using --in.
With --in and no --binary, the tool assumes the data file is <in>.tsv and the index file is <in>.index; with --binary, it assumes the data file is <in>.bin and the index file is <in>.index.
Use --out to specify the output prefix. Some operations also accept --binary-out to write the output in binary format.
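The naming rule for --in can be summarized in a few lines of Python. This is only an illustration of the convention described above, not code from punkst; the helper name resolve_inputs is hypothetical.

```python
def resolve_inputs(prefix: str, binary: bool = False) -> tuple[str, str]:
    """Return (data_path, index_path) implied by an --in prefix.

    Mirrors the documented convention: <in>.tsv or <in>.bin for the data,
    and <in>.index for the index in both cases.
    """
    data_ext = ".bin" if binary else ".tsv"
    return prefix + data_ext, prefix + ".index"


if __name__ == "__main__":
    print(resolve_inputs("path/result1"))
    print(resolve_inputs("path/result1", binary=True))
```

If your files do not follow this naming, pass them explicitly with --in-data and --in-index instead.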
Basic Inspection and Conversion¶
To inspect the index of a tiled file:
To dump a binary tiled file to a plain TSV file:
The output includes path/prefix.dump.tsv and path/prefix.dump.index.
Fix Fragmented Tiles¶
The output of punkst pixel-decode is organized into non-overlapping rectangular tiles that jointly cover the entire space, but the tiles do not fit into a regular grid.
To merge multiple sets of inference results, or to join inference results with point-level data, you currently have to reorganize the data into a regular grid first. (The tile size should already be stored in the input's index file (path/prefix.index); generic reorganization is not currently supported.)
Note that this is not required for visualization with draw-pixel-factors.
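To make "regular grid" concrete, the sketch below assigns a coordinate to a tile on a grid of equally sized square tiles. It is an illustration only: anchoring the grid at the origin, the (row, col) key order, and the helper name tile_key are all assumptions, not a description of punkst internals.

```python
import math


def tile_key(x: float, y: float, tile_size: float) -> tuple[int, int]:
    """Map a coordinate to the (row, col) of the square grid tile containing it.

    Assumes the grid is anchored at the origin (an assumption for this sketch).
    """
    return (math.floor(y / tile_size), math.floor(x / tile_size))


if __name__ == "__main__":
    # Two datasets tiled this way share keys, so they can be joined tile by tile.
    print(tile_key(1250.0, 300.0, 500.0))
```

Because every coordinate maps to exactly one key, two files organized on the same grid can be matched tile by tile, which is what merging and annotation rely on.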
Merge Multiple Inference Results¶
You can merge multiple inference files (e.g., from different models) into a single file. This finds the intersection of tiles and, for each pixel, concatenates the results, i.e. the (factor, probability) pairs from each file.
punkst tile-op --in path/result1 [--binary] \
--merge-emb path/result2.tsv path/result3.bin --k2keep 3 1 2 \
--out path/merged_result --binary-out
--merge-emb - One or more other inference files (created by pixel-decode) to merge with the main input file. They can be in either TSV or binary format, but each must have an index file stored as <prefix>.index.
--k2keep - (Optional) A list of integers specifying how many top factors to keep from each source file (including the main input). If not provided, all factors are kept.
--binary-out - (Optional) Save the merged output in binary format instead of TSV.
In the above example, the top 3 factors are kept from result1.bin (or .tsv), the top 1 factor from result2.tsv, and the top 2 factors from result3.bin. If the specified number exceeds the number of factors available in the corresponding file, all factors in that file are kept.
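The effect of --k2keep on a single pixel can be sketched as follows. This is a minimal illustration under stated assumptions, not the tool's implementation: it assumes "top" means highest probability, represents each source as a list of (factor, probability) pairs, and uses the hypothetical helper name merge_pixel.

```python
def merge_pixel(sources, k2keep=None):
    """Concatenate per-source (factor, prob) pairs for one pixel.

    sources: one list of (factor, prob) pairs per input file, main input first.
    k2keep:  optional list with the number of top factors to keep per source.
    """
    merged = []
    for i, pairs in enumerate(sources):
        # Rank by probability, highest first (assumed meaning of "top").
        ranked = sorted(pairs, key=lambda fp: fp[1], reverse=True)
        # Asking for more factors than available keeps all of them.
        k = len(ranked) if k2keep is None else min(k2keep[i], len(ranked))
        merged.extend(ranked[:k])
    return merged


if __name__ == "__main__":
    pixel = [
        [(0, 0.5), (3, 0.3), (7, 0.2)],  # result1
        [(1, 0.9), (2, 0.1)],            # result2
        [(4, 0.6)],                      # result3 has fewer than 2 factors
    ]
    print(merge_pixel(pixel, k2keep=[3, 1, 2]))
```

With k2keep set to 3 1 2 as in the command above, the third source contributes its single factor because fewer than 2 are available.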
Annotate Points with Inference Results¶
You can annotate a transcript file with the inference results. The query file must be generated by punkst pts2tiles with the same tile structure as the result file so that the tool can join the two efficiently. You can apply pts2tiles to any TSV file that contains X and Y coordinates as two of its columns.
punkst tile-op --in path/prefix [--binary] \
--annotate-pts path/transcripts --icol-x 0 --icol-y 1 \
--out path/merged
--annotate-pts - Prefix of the points file (the tool expects <prefix>.tsv and <prefix>.index) to be annotated.
--icol-x - 0-based column index for X coordinate in the points file.
--icol-y - 0-based column index for Y coordinate in the points file.
Compute Joint Probability Distributions¶
You can compute the correlations or co-occurrences between factors, either from a single model or between inference results from multiple models applied to the same dataset. This is approximated by the sum of products of posterior probabilities across all pixels, although for each pixel only the top-K factors are considered (those stored in the inference result file).
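The quantities described above can be written out for a single source as a short sketch. It is not the tool's code: the in-memory representation (one list of (factor, probability) pairs per pixel), the inclusion of the diagonal terms, and the function name marginal_and_joint are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations_with_replacement


def marginal_and_joint(pixels):
    """Accumulate marginal sums and pairwise sums of products over pixels.

    pixels: iterable of lists of (factor, prob) pairs, one list per pixel,
            holding only the top-K factors stored for that pixel.
    """
    marginal = defaultdict(float)
    joint = defaultdict(float)
    for pairs in pixels:
        for k, p in pairs:
            marginal[k] += p
        # Each unordered factor pair once, including (k, k) on the diagonal.
        for (k1, p1), (k2, p2) in combinations_with_replacement(sorted(pairs), 2):
            joint[(k1, k2)] += p1 * p2
    return dict(marginal), dict(joint)


if __name__ == "__main__":
    pixels = [[(0, 0.5), (1, 0.5)], [(0, 1.0)]]
    marginal, joint = marginal_and_joint(pixels)
    print(marginal)
    print(joint)
```

Factors that are not among a pixel's stored top-K contribute nothing for that pixel, which is why the result is an approximation.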
Single Input¶
For a single inference result file:
Output:
- path/out_prefix.marginal.tsv: Marginal sums of posterior probabilities for each factor.
- path/out_prefix.joint.tsv: Sum of products for each pair of factors.
If the file contains multiple sets of results (e.g., a merged file), the output is the same as in the multi-input case below: the marginal and within-model joint outputs are stored for each source separately, and cross-source products are produced as well (e.g., path/out_prefix.0v1.cross.tsv).
Merging and Computing on the Fly¶
You can also compute these statistics while merging multiple inference result files on the fly, without writing the large merged file to disk.
punkst tile-op --prob-dot --in path/result1 [--binary] \
--merge-emb path/result2.tsv path/result3.bin \
--out path/out_prefix
This supports --k2keep to reduce the number of top-K factors used in each source before computing the products.
Output:
- path/out_prefix.0.marginal.tsv, path/out_prefix.1.marginal.tsv, ... (one per input source)
- path/out_prefix.0.joint.tsv, ... (internal dot products for each source)
- path/out_prefix.0v1.cross.tsv, path/out_prefix.0v2.cross.tsv, ... (cross-source dot products)
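A cross-source dot product between two inference results can be sketched as below. This is an illustration under assumptions rather than the tool's implementation: each source is modeled as a dictionary from a pixel key to its (factor, probability) pairs, only pixels present in both sources are used, and the name cross_products is hypothetical.

```python
from collections import defaultdict


def cross_products(pixels_a, pixels_b):
    """Sum of products of probabilities between factors of two sources.

    pixels_a, pixels_b: dict mapping a pixel key to a list of (factor, prob).
    Returns a dict keyed by (factor_in_a, factor_in_b).
    """
    cross = defaultdict(float)
    # Only pixels present in both sources contribute.
    for key in pixels_a.keys() & pixels_b.keys():
        for ka, pa in pixels_a[key]:
            for kb, pb in pixels_b[key]:
                cross[(ka, kb)] += pa * pb
    return dict(cross)


if __name__ == "__main__":
    a = {(0, 0): [(0, 0.5), (1, 0.5)], (0, 1): [(0, 1.0)]}
    b = {(0, 0): [(2, 1.0)], (5, 5): [(3, 1.0)]}
    print(cross_products(a, b))
```

Unlike the within-source joint table, the result is not symmetric: the first index refers to a factor of the first source and the second to a factor of the other.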