System Projection on Genomes#

The write_systems command projects systems previously detected at the pangenome level (see the systems command) onto individual genomes. Projection relies on the system detection results and on the genomic context of gene families within each organism.

Projection Workflow#

Projection proceeds in five steps:

1. Load Detected Systems and Metadata#

Detected systems from the .h5 pangenome file are loaded
Required metadata and gene families are retrieved
System-to-family mappings are established for efficient processing

2. Build Gene Context Components#

For each organism and functional unit, the workflow builds components directly from genomic windows rather than constructing a graph:

Identify Model Genes: Extract genes belonging to system families in each organism
Group by Contig: Organize genes by their chromosomal/plasmid location
Extract Windows: Use extract_contig_window() to identify genomic regions containing system genes within the specified window size
Create Components: Each window becomes a component containing all genes (model + context) within that region

This approach directly identifies co-localized gene clusters.
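
The window step can be sketched in a few lines of Python. This is a minimal illustration, not PANORAMA's implementation: the helper name `windows_around`, its arguments, and the representation of a contig as 0-based gene ranks are assumptions made for the example. The real `extract_contig_window()` may have a different signature and handle cases ignored here, such as circular contigs.

```python
from typing import List, Tuple


def windows_around(model_positions: List[int], contig_length: int,
                   window_size: int) -> List[Tuple[int, int]]:
    """Return merged (start, end) gene-rank ranges around model genes.

    Hypothetical helper: positions are 0-based gene ranks on one linear
    contig, and each model gene claims `window_size` neighbours on both sides.
    """
    windows: List[Tuple[int, int]] = []
    for pos in sorted(model_positions):
        start = max(0, pos - window_size)
        end = min(contig_length - 1, pos + window_size)
        if windows and start <= windows[-1][1] + 1:
            # Overlapping or adjacent to the previous window: extend it.
            windows[-1] = (windows[-1][0], max(windows[-1][1], end))
        else:
            windows.append((start, end))
    return windows


# Model genes at ranks 3, 5 and 20 on a 30-gene contig, window of 2 genes.
components = windows_around([3, 5, 20], contig_length=30, window_size=2)
print(components)  # [(1, 7), (18, 22)]
```

Each returned range corresponds to one component: every gene whose rank falls inside it, model or context, belongs to that component.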

3. Project System Units#

Each system unit is evaluated in organisms through the following steps:

Unit Requirements Validation#

Family Requirements: Check if required families from the model are present
Completeness Calculation: Determine the fraction of model families that are found
Context Analysis: Identify additional families within the same genomic context
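
As a rough illustration, the three checks above reduce to set operations. The function `evaluate_unit` and its argument names are hypothetical, and real models may carry additional rules (for example minimum counts or forbidden families) that this sketch does not cover.

```python
from typing import Dict, Set


def evaluate_unit(model: Set[str], required: Set[str],
                  observed: Set[str]) -> Dict[str, object]:
    """Summarise one unit in one genomic context (hypothetical helper)."""
    found = model & observed
    return {
        # Family requirements: every required family must be observed.
        "valid": required <= observed,
        # Completeness: fraction of model families that were found.
        "completeness": len(found) / len(model) if model else 0.0,
        # Context: co-localized families that are not part of the model.
        "context": observed - model,
    }


result = evaluate_unit(
    model={"famA", "famB", "famC", "famD"},
    required={"famA", "famB"},
    observed={"famA", "famB", "famC", "famX"},
)
print(result["valid"], result["completeness"], sorted(result["context"]))
# True 0.75 ['famX']
```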

System State Classification#

Each projected unit is assigned one of three genomic organization states:

strict: All model families are found within the same connected component/window
split: Model families are present but spread across multiple disconnected components
extended: All model families are in the same context with additional intervening families
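
One possible reading of these three states is sketched below. The decision rule is an assumption made for illustration: a unit is `split` when its model families occur in more than one component, `extended` when they share one component but non-model families lie between the first and last model gene, and `strict` otherwise. The function name and the representation of a component as an ordered list of family names are hypothetical.

```python
from typing import List, Set


def classify_organization(components: List[List[str]], model: Set[str]) -> str:
    """Classify a unit as 'strict', 'extended' or 'split' (illustrative rule)."""
    # Keep only components that carry at least one model family.
    hits = [comp for comp in components if any(fam in model for fam in comp)]
    if not hits:
        raise ValueError("no model family found in any component")
    if len(hits) > 1:
        return "split"
    comp = hits[0]
    idx = [i for i, fam in enumerate(comp) if fam in model]
    core = comp[idx[0]:idx[-1] + 1]  # span from first to last model gene
    return "strict" if all(fam in model for fam in core) else "extended"


model = {"famA", "famB", "famC"}
print(classify_organization([["x", "famA", "famB", "famC"]], model))    # strict
print(classify_organization([["famA", "x", "famB", "famC"]], model))    # extended
print(classify_organization([["famA", "famB"], ["famC", "y"]], model))  # split
```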

Gene Categorization#

Each projected gene is categorized as:

model: Gene belongs to a family defined in the system model
context: Gene is co-localized with model genes but not part of the system definition
filtered: Gene was excluded during filtering steps

4. Aggregate and Filter Projections#

The projection includes advanced filtering options:

Standard Projection#

Collects all valid projections for each organism
Calculates completeness metrics
Maintains full system context information

One-Unit-Per-Family Filtering#

This filter resolves overlapping system units:

Overlap Resolution: When multiple units contain the same gene family, only the unit with the highest completeness is kept
Overlapping Units Tracking: Information about the filtered units is recorded in the overlapping_units column
System Elimination Options:
- eliminate_filtered_systems: Remove entire systems if any model families were filtered
- eliminate_empty_systems: Remove systems with no remaining model families
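
The overlap resolution can be pictured as a greedy pass over units sorted by completeness. The sketch below is an assumption about how such a filter could work, not PANORAMA's code: the function `resolve_overlaps`, its input layout, the alphabetical tie-break, and the two-decimal `unit:completeness` annotation are all illustrative choices. The elimination options are not modelled.

```python
from typing import Dict, List, Set, Tuple


def resolve_overlaps(units: Dict[str, Tuple[Set[str], float]]):
    """Keep, for each contested family, only the most complete unit.

    `units` maps a unit name to (model families, completeness).
    Returns (kept unit names, overlapping_units annotations per kept unit).
    """
    kept: List[str] = []
    claimed: Dict[str, str] = {}         # family -> unit that owns it
    overlaps: Dict[str, List[str]] = {}  # kept unit -> ["unit:completeness"]
    # Visit units from most to least complete; ties are broken by name.
    ordered = sorted(units.items(), key=lambda kv: (-kv[1][1], kv[0]))
    for name, (families, completeness) in ordered:
        owners = sorted({claimed[f] for f in families if f in claimed})
        if owners:
            # A more complete unit already owns one of these families:
            # drop this unit and record it on the unit(s) that won.
            for owner in owners:
                overlaps[owner].append(f"{name}:{completeness:.2f}")
            continue
        kept.append(name)
        overlaps[name] = []
        for fam in families:
            claimed[fam] = name
    return kept, overlaps


units = {
    "unit1": ({"famA", "famB"}, 1.0),
    "unit2": ({"famB", "famC"}, 0.5),
    "unit3": ({"famD"}, 0.8),
}
kept, overlaps = resolve_overlaps(units)
print(kept)      # ['unit1', 'unit3']
print(overlaps)  # {'unit1': ['unit2:0.50'], 'unit3': []}
```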

5. Write Output#

Projection results are written as TSV files, one per organism, alongside a pangenome-level summary. See Projection Output Files for details on their organization and contents.

Projection Command Line Usage#

Basic Projection#

panorama write_systems \
    --pangenomes pangenomes.tsv \
    --models models.tsv \
    --sources defense_finder \
    --projection \
    --threads 8 \
    --output results/

Advanced Options#

# --association RGPs spots   associate systems with RGPs and hotspots
# --partition                write partition heatmap files
# --canonical                also project canonical versions of systems
# --organisms A B            project only these organisms
# (comments are listed here because a comment after a trailing backslash
#  breaks the shell line continuation)
panorama write_systems \
    --pangenomes pangenomes.tsv \
    --models models.tsv \
    --sources defense_finder immune_system \
    --projection \
    --association RGPs spots \
    --partition \
    --canonical \
    --organisms organism_A organism_B \
    --threads 16 \
    --force \
    --output results/
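
When the command is driven from a script, for instance to loop over several sources, it can help to assemble the argument list programmatically. The sketch below only uses flags documented on this page; the wrapper function `write_systems_command` is hypothetical and not part of PANORAMA.

```python
import shlex
from typing import List, Optional


def write_systems_command(pangenomes: str, models: str, sources: List[str],
                          output: str, organisms: Optional[List[str]] = None,
                          threads: int = 1) -> List[str]:
    """Build an argument list for `panorama write_systems --projection`."""
    cmd = ["panorama", "write_systems",
           "--pangenomes", pangenomes,
           "--models", models,
           "--sources", *sources,
           "--projection",
           "--threads", str(threads),
           "--output", output]
    if organisms:
        # Restrict projection to the listed organisms (default: all).
        cmd += ["--organisms", *organisms]
    return cmd


cmd = write_systems_command("pangenomes.tsv", "models.tsv", ["defense_finder"],
                            "results/", organisms=["organism_A"], threads=8)
print(shlex.join(cmd))  # shell-safe rendering of the command
# To execute it: import subprocess; subprocess.run(cmd, check=True)
```

Passing a list to `subprocess.run` avoids shell quoting issues with organism names that contain spaces.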

Projection Command Line Arguments#

Projection-Specific Arguments#

Argument	Type	Default	Description
`--projection`	flag	False	Enables the projection of systems onto genomes
`--organisms`	list	None	List of organisms to project (defaults to all)
`--canonical`	flag	False	Also project canonical versions of systems

Required Arguments#

Argument	Type	Description
`--pangenomes`	Path	TSV file listing pangenome .h5 files to process
`--output`	Path	Output directory for results
`--models`	Path	Path(s) to model list files
`--sources`	str	Name(s) of the systems sources

Optional Arguments#

Argument	Type	Default	Description
`--threads`	int	1	Number of parallel threads to use
`--verbose`	int (choice)	1	Verbose level: 0 (warnings/errors), 1 (info), 2 (debug)
`--log`	Path	stdout	Log output file
`-d`, `--disable_prog_bar`	flag	False	Disable the progress bars
`--force`	flag	False	Force writing in the output directory and pangenome file

Projection Output Files#

Output is organized in the specified --output directory with subdirectories for each pangenome and source combination:

output/
├── pangenome_1/
│   └── source_1/
│       ├── systems.tsv                    # Pangenome summary
│       └── projection/
│           ├── organism_A.tsv             # Per-organism detailed results
│           ├── organism_B.tsv
│           └── ...
└── pangenome_2/
    └── source_1/
        ├── systems.tsv
        └── projection/
            └── ...

1. Pangenome Systems Summary (`systems.tsv`)#

This file provides a high-level summary of all detected systems across the pangenome:

Column	Description
system number	Unique numeric ID for the system
system name	Name of the system (corresponds to model name)
functional unit name	Name of the functional unit within the system
organism	Organism name where the system is detected
model_GF	Comma-separated list of gene families encoding system functions
context_GF	Comma-separated list of gene families found in genomic context but not part of the model
partition	Pangenome partition of the system (persistent, shell, cloud, or combinations)
completeness	Average proportion of model families found across organisms (0.0-1.0)
strict	Number of organisms with strict genomic organization
split	Number of organisms with split genomic organization
extended	Number of organisms with extended genomic organization

Additional columns (when using --association):

RGPs: Associated Regions of Genomic Plasticity
spots: Associated hotspots of genome evolution
modules: Associated functional modules

2. Organism Projection Files (`projection/<organism>.tsv`)#

Each organism gets a detailed file with gene-level projections:

Column	Description
system number	Unique system ID
system name	System name from the model
functional unit name	Functional unit name
subsystem number	ID for the genomic component/subgraph
organism	Organism name
gene family	Gene family identifier
partition	Pangenome partition (persistent/shell/cloud)
annotation	Functional annotation from metadata
secondary_names	Alternative names for the gene family
gene.ID	Unique gene identifier
gene.name	Gene name/locus tag
contig	Contig/chromosome name
start	Gene start position
stop	Gene stop position
strand	Gene orientation (+/-)
is_fragment	Whether gene is fragmented
category	Gene category: `model`, `context`, or `filtered`
genomic organization	System organization: `strict`, `split`, or `extended`
completeness	Proportion of model families present in this organism
product	Gene product description
overlapping_units	Information about overlapping units (format: `unit:completeness`)
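
A per-organism file can be read with Python's standard `csv` module. The sketch below assumes the header row uses exactly the column names listed in the table above; the two rows are made-up data restricted to a subset of the columns, and the helper `model_genes_by_system` is hypothetical.

```python
import csv
import io
from collections import defaultdict

# Two made-up rows using a subset of the documented columns.
example = (
    "system number\tsystem name\tgene family\tgene.ID\tcategory\tgenomic organization\n"
    "1\tsysX\tfamA\tgene_001\tmodel\tstrict\n"
    "1\tsysX\tfamZ\tgene_002\tcontext\tstrict\n"
)


def model_genes_by_system(handle):
    """Group the IDs of `model` genes by system number."""
    genes = defaultdict(list)
    for row in csv.DictReader(handle, delimiter="\t"):
        if row["category"] == "model":
            genes[row["system number"]].append(row["gene.ID"])
    return dict(genes)


print(model_genes_by_system(io.StringIO(example)))  # {'1': ['gene_001']}
```

For a real file, replace the `io.StringIO` object with `open(path, newline="")`, where `path` points to one of the `projection/<organism>.tsv` files.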