Conserved Spots Comparison Across Pangenomes#

The compare_spots command identifies and analyzes conserved genomic spots across multiple pangenomes by comparing their gene family composition and genomic organization patterns. This analysis builds upon previously computed spots from individual pangenomes and uses Gene Family Relatedness Relationship (GFRR) metrics to identify regions that are conserved across different bacterial populations. Optionally, it can integrate systems detection results to analyze biological systems within conserved regions.

Conserved Spots Detection Workflow#

The conserved spots comparison process runs as follows:

  1. Load and Validate Pangenomes

    • Multiple pangenomes are loaded from .h5 files based on a .tsv file.

    • Each pangenome is validated to ensure that spots and RGPs have been computed.

  2. Create Spots Graph

    • All spots from all pangenomes are represented as nodes in a unified NetworkX graph.

    • Each spot is characterized by its bordering gene families.

  3. Compute GFRR-based Edges

    • For each pair of spots from different pangenomes:

      • Gene families at spot borders are extracted and compared.

      • GFRR metrics (min_gfrr and max_gfrr) are computed based on shared families.

    • Edges are added between spots that exceed both GFRR cutoff thresholds.

  4. Cluster Conserved Spots

    • Graph clustering algorithms identify groups of similar spots that represent conserved genomic regions across pangenomes based on the selected GFRR metric.

  5. Systems Integration (Optional)

    • When enabled, systems analysis creates linkage graphs showing relationships between biological systems through their association with conserved spots.

  6. Write Results to Files

    • Conserved spots are saved as detailed TSV files and optional graph formats (GEXF, GraphML) for visualization.

Compare spots command Line Usage#

Basic conserved spots comparison:

panorama compare_spots \
--pangenomes pangenomes.tsv \
--output conserved_spots_results \
--gfrr_metrics min_gfrr \
--gfrr_cutoff 0.8 0.8 \
--threads 8

With system analysis enabled:

panorama compare_spots \
--pangenomes pangenomes.tsv \
--output conserved_spots_results \
--systems \
--models defense_systems.tsv \
--sources defense_finder \
--gfrr_cutoff 0.8 0.8 \
--graph_formats gexf graphml \
--threads 8

Key Options 📋#

Shortcut

Argument

Type

Required/Optional

Description

-p

–pangenomes

File path

Required

TSV file listing .h5 pangenomes with computed spots

-o

–output

Directory path

Required

Output directory for conserved spots results

–gfrr_metrics

String

Optional

GFRR metric for clustering: ‘min_gfrr’ (conservative) or ‘max_gfrr’ (liberal)

–gfrr_cutoff

Float Float

Optional

Two thresholds for min_gfrr and max_gfrr values (default: 0.8 0.8)

–seed

Int

Optional

Random seed to guarantee reproductibility (default 42)

–dup_margin

Float

Optional

Minimum ratio for multigenic family detection (default: 0.05)

–systems

Flag

Optional

Enable systems analysis within conserved spots

-m

–models

File path(s)

Required with –systems

Path(s) to system model files (required with –systems)

-s

–sources

String(s)

Required with –systems

System source names corresponding to models (required with –systems)

–canonical

Flag

Optional with –systems

Include canonical systems in analysis

–graph_formats

String(s)

Optional

Export graph formats: gexf, graphml

Advanced Configuration Arguments#

Shortcut

Argument

Type

Optional

Description

–cluster

str (file path)

True

Tab-separated file with pre-computed clustering results (cluster_name family_id format)

–tmpdir

str (directory path)

True

Directory for temporary files (default: /tmp)

–keep_tmp

bool (flag)

True

Keep temporary files after completion

-c

–cpus

int

True

Number of CPU threads for parallel processing (default: 1)

–verbose

int (choice)

True

Verbose level: 0 (warnings/errors), 1 (info), 2 (debug) (default: 1)

–log

str (file path)

True

Log output file (default: stdout)

-d

–disable_prog_bar

bool (flag)

True

Disable the progress bars

–force

bool (flag)

True

Force writing in output directory and pangenome file

Note

PANORAMA can perform the clustering step first thing, but it’s also possible to use pre-computed clustering results with the --cluster argument. If you use let PANORAMA perform the clustering, you can look at the Clustering section for more details about options.

GFRR Metrics#

Metric

Formula

Description

min_gfrr

shared_families / min(families_spot1, families_spot2)

Conservative metric requiring high overlap relative to smaller set

max_gfrr

shared_families / max(families_spot1, families_spot2)

Liberal metric allowing partial overlap relative to larger set

Sensitivity Control#

The dual cutoff system provides fine-grained control over conservation stringency:

Cutoff Level

min_gfrr

max_gfrr

Behavior

Strict

0.8

0.8

High-confidence conserved spots only

Moderate

0.6

0.7

Balanced sensitivity and specificity

Permissive

0.4

0.5

Detects distant conservation patterns

Output#

PANORAMA generates multiple outputs: detailed spot information files, summary tables, and optional graph visualizations.

File Organization#

output_directory/
├── conserved_spots/
│ ├── conserved_spot_1.tsv
│ ├── conserved_spot_2.tsv
| |── ....................
│ └── conserved_spot_N.tsv
├── all_conserved_spots.tsv
├── conserved_spots.gexf (optional)
├── conserved_spots.graphml (optional)
├── systems_link_with_conserved_spots_louvain.gexf (optional)
└── systems_link_with_conserved_spots_mst.gexf (optional)

Individual Conserved Spot Files#

Each conserved_spot_X.tsv contains detailed RGP-level information:

Column

Description

Spot_ID

Original spot identifier from source pangenome

Pangenome

Source pangenome name

RGP_Name

Region of Genomic Plasticity name within the spot

Gene_Families

Comma-separated list of gene families in the RGP

Summary File#

all_conserved_spots.tsv provides an overview of all conserved spots:

Column

Description

Conserved_Spot_ID

Unique identifier for the conserved spot group

Spot_ID

Individual spot identifier from source pangenome

Pangenome

Source pangenome name

Num_RGPs

Number of RGPs in this spot

Num_Gene_Families

Total number of gene families in this spot

Conserved spots Graph Files (Optional)#

When --graph_formats is enabled, additional graph files are generated:

  • Conserved spots graph in GEXF format

  • Conserved spots graph in GraphML format

Node attributes include conserved spot ID, pangenome name, spot ID, the number of gene families, and the number of RGPs. Edge attributes include GFRR metric and the number of shared gene families.

[PLACEHOLDER: Example conserved spots visualization across pangenomes]

Systems Analysis Files (Optional)#

When --systems is specified, generate systems_link_with_conserved_spots_louvain.gexf/graphml Network graphs of conserved system clusters. These graphs are generated using the Louvain algorithm.

[PLACEHOLDER: Systems linkage graph showing relationships through conserved spots]