Systems Comparison Across Pangenomes#

The compare_systems command identifies conserved biological systems across multiple pangenomes by comparing their gene family composition and computing similarity metrics. The analysis builds on systems previously detected in each individual pangenome and uses Gene Family Repertoire Relatedness (GFRR) metrics to find systems that are conserved across different bacterial populations. It produces heatmaps of system distribution patterns and graphs of conserved system clusters.

Systems Comparison Workflow#

The systems comparison process runs as follows:

  1. Load and Validate Pangenomes

    • Multiple pangenomes are loaded from the .h5 files listed in a .tsv file.

    • Each pangenome is validated to ensure that systems have been detected for the specified sources.

  2. Create Systems

    • All systems from all pangenomes are represented as nodes in a unified NetworkX graph.

    • Each system is characterized by its gene families and model families for similarity assessment.

  3. Compute GFRR-based Edges

    • For each pair of systems from different pangenomes:

      • Model gene families are compared using GFRR metrics.

      • If the GFRR values of the model families exceed the --gfrr_models_cutoff thresholds, all gene families of the two systems are compared.

    • Edges are added between systems that exceed both GFRR cutoff thresholds.

  4. Cluster Conserved Systems

    The Louvain graph clustering algorithm groups similar systems into clusters that represent biological systems conserved across pangenomes, based on the selected GFRR metric.

  5. Generate Visualizations

    Heatmaps showing system distribution patterns across pangenomes are generated in HTML format for interactive exploration.

  6. Write Results to Files

    Conserved systems are saved as graph files (GEXF, GraphML) and summary tables for further analysis and visualization.
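Steps 2 and 3 can be sketched in a few lines of Python. This is a minimal illustration, not PANORAMA's implementation: the gfrr, passes, and conserved_edges helpers and the dictionary layout are hypothetical. It assumes that min_gfrr and max_gfrr divide the number of shared families by the size of the smaller and the larger family set, respectively, that thresholds are inclusive, and that family identifiers are already comparable across pangenomes (for example after clustering).

```python
from itertools import combinations


def gfrr(a, b):
    """Return (min_gfrr, max_gfrr) for two sets of family identifiers.

    Assumed definition: shared families divided by the size of the
    smaller set (min_gfrr) or of the larger set (max_gfrr).
    """
    if not a or not b:
        return 0.0, 0.0
    shared = len(a & b)
    return shared / min(len(a), len(b)), shared / max(len(a), len(b))


def passes(scores, cutoff):
    """True when both values reach their threshold (inclusive, by assumption)."""
    return scores[0] >= cutoff[0] and scores[1] >= cutoff[1]


def conserved_edges(systems, models_cutoff=(0.4, 0.6), cutoff=(0.5, 0.8)):
    """Compare systems from different pangenomes with a two-stage filter."""
    edges = []
    for s1, s2 in combinations(systems, 2):
        if s1["pangenome"] == s2["pangenome"]:
            continue  # only pairs from different pangenomes are compared
        model_scores = gfrr(s1["model_families"], s2["model_families"])
        if not passes(model_scores, models_cutoff):
            continue  # first stage: model families only
        all_scores = gfrr(s1["families"], s2["families"])
        if passes(all_scores, cutoff):  # second stage: all families
            edges.append((s1["id"], s2["id"], {
                "min_gfrr_models": model_scores[0],
                "max_gfrr_models": model_scores[1],
                "min_gfrr": all_scores[0],
                "max_gfrr": all_scores[1],
            }))
    return edges


# Toy systems with made-up family names.
systems = [
    {"id": "A_1", "pangenome": "A",
     "model_families": {"cas1", "cas2", "cas9"},
     "families": {"cas1", "cas2", "cas9", "f10", "f11"}},
    {"id": "B_1", "pangenome": "B",
     "model_families": {"cas1", "cas2", "cas9"},
     "families": {"cas1", "cas2", "cas9", "f10", "f11", "f12"}},
    {"id": "B_2", "pangenome": "B",
     "model_families": {"cas1", "x"},
     "families": {"cas1", "x", "y"}},
]
edges = conserved_edges(systems)
print(edges)
```

In this toy input only A_1 and B_1 end up connected: A_1 and B_2 fail the model-family stage, and B_1 and B_2 belong to the same pangenome, so they are never compared.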

Systems Comparison Command Line Usage#

Basic systems comparison with heatmap generation:

panorama compare_systems \
--pangenomes pangenomes.tsv \
--models defense_systems.tsv \
--sources defense_finder \
--output systems_comparison_results \
--heatmap \
--threads 8

Full analysis with conserved systems clustering:

panorama compare_systems \
--pangenomes pangenomes.tsv \
--models defense_systems.tsv cas_systems.tsv \
--sources defense_finder CasFinder \
--output systems_comparison_results \
--heatmap \
--gfrr_metrics min_gfrr_models \
--gfrr_cutoff 0.8 0.8 \
--gfrr_models_cutoff 0.2 0.2 \
--graph_formats gexf graphml \
--threads 8

Key Options#

| Shortcut | Argument | Type | Optional | Description |
|----------|----------|------|----------|-------------|
| -p | --pangenomes | str (file path) | False | TSV file listing .h5 pangenomes with detected systems |
| -m | --models | List[str] (file paths) | False | Path(s) to system model files (must match --sources order) |
| -s | --sources | List[str] | False | Name(s) of systems sources (must match --models order) |
| -o | --output | str (directory path) | False | Output directory for comparison results |
| | --gfrr_cutoff | List[float] (2 values) | True | Two thresholds for min_gfrr and max_gfrr values (default: 0.5 0.8) |
| | --seed | int | True | Random seed to guarantee reproducibility (default: 42) |
| | --heatmap | bool (flag) | True | Generate heatmaps showing system distribution across pangenomes |
| | --gfrr_metrics | str (choice) | True | GFRR metric for clustering conserved systems (min_gfrr_models, max_gfrr_models, min_gfrr, max_gfrr) |
| | --gfrr_models_cutoff | List[float] (2 values) | True | GFRR thresholds for model gene families (default: 0.4 0.6) |
| | --graph_formats | List[str] | True | Export graph formats: gexf, graphml |
| | --canonical | bool (flag) | True | Include canonical system versions in analysis |
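Because --models and --sources must be given in matching order, it can help to assemble the command programmatically and check the arguments first. The sketch below is only an illustration: build_compare_systems_cmd is a hypothetical helper, not part of PANORAMA, and the check that cutoffs lie between 0 and 1 assumes that GFRR values are ratios.

```python
import shlex


def build_compare_systems_cmd(pangenomes, models, sources, output,
                              gfrr_cutoff=None, heatmap=False):
    """Build an argument list for `panorama compare_systems` (hypothetical helper)."""
    if len(models) != len(sources):
        raise ValueError("--models and --sources need the same number of "
                         "entries, listed in the same order")
    cmd = ["panorama", "compare_systems",
           "--pangenomes", pangenomes,
           "--models", *models,
           "--sources", *sources,
           "--output", output]
    if gfrr_cutoff is not None:
        # Assumption: GFRR values are ratios, so cutoffs should lie in [0, 1].
        if len(gfrr_cutoff) != 2 or not all(0.0 <= c <= 1.0 for c in gfrr_cutoff):
            raise ValueError("--gfrr_cutoff expects two values between 0 and 1")
        cmd += ["--gfrr_cutoff", *map(str, gfrr_cutoff)]
    if heatmap:
        cmd.append("--heatmap")
    return cmd


cmd = build_compare_systems_cmd(
    "pangenomes.tsv",
    ["defense_systems.tsv", "cas_systems.tsv"],
    ["defense_finder", "CasFinder"],
    "systems_comparison_results",
    gfrr_cutoff=(0.8, 0.8),
    heatmap=True,
)
print(shlex.join(cmd))
```

The returned list can be passed directly to subprocess.run(cmd, check=True), which avoids shell quoting issues with file paths.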

Advanced Configuration Arguments#

| Shortcut | Argument | Type | Optional | Description |
|----------|----------|------|----------|-------------|
| | --cluster | str (file path) | True | Tab-separated file with pre-computed clustering results (cluster_name\tfamily_id format) |
| | --method | str (choice) | True | MMseqs2 clustering method: linclust or cluster (default: linclust) |
| | --tmpdir | str (directory path) | True | Directory for temporary files (default: /tmp) |
| | --keep_tmp | bool (flag) | True | Keep temporary files after completion |
| -c | --cpus | int | True | Number of CPU threads for parallel processing (default: 1) |
| | --verbose | int (choice) | True | Verbose level: 0 (warnings/errors), 1 (info), 2 (debug) (default: 1) |
| | --log | str (file path) | True | Log output file (default: stdout) |
| -d | --disable_prog_bar | bool (flag) | True | Disable the progress bars |
| | --force | bool (flag) | True | Force writing in output directory and pangenome file |

Note

PANORAMA can run the clustering step itself before the comparison, or it can use pre-computed clustering results supplied with the --cluster argument. If you let PANORAMA perform the clustering, see the Clustering section for details about the available options.
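As an illustration of the cluster_name\tfamily_id layout expected by --cluster, the sketch below writes such a file from a Python dictionary. The write_cluster_file helper and the family identifiers are hypothetical, and writing the file without a header row is an assumption that should be checked against the Clustering section.

```python
import csv
import tempfile
from pathlib import Path


def write_cluster_file(clusters, path):
    """Write one `cluster_name<TAB>family_id` row per family (no header, by assumption)."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        for cluster_name, family_ids in clusters.items():
            for family_id in family_ids:
                writer.writerow([cluster_name, family_id])


# Made-up cluster and family names, for illustration only.
clusters = {
    "cluster_1": ["pangenomeA_fam_001", "pangenomeB_fam_417"],
    "cluster_2": ["pangenomeA_fam_002"],
}
cluster_path = Path(tempfile.mkdtemp()) / "clusters.tsv"
write_cluster_file(clusters, cluster_path)
rows = cluster_path.read_text().splitlines()
print("\n".join(rows))
```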

GFRR Metrics for Systems#

| Metric | Target Families | Description |
|--------|-----------------|-------------|
| min_gfrr_models | Model families only | Conservative metric using core functional families |
| max_gfrr_models | Model families only | Liberal metric using core functional families |
| min_gfrr | All families | Conservative metric using complete gene repertoire |
| max_gfrr | All families | Liberal metric using complete gene repertoire |
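The effect of choosing one of these metrics for clustering can be sketched with NetworkX, which the workflow already uses for the systems graph. This is an illustration under stated assumptions rather than PANORAMA's code: cluster_conserved_systems is a hypothetical helper, the edge attribute names simply mirror the metric names, and the selected metric is assumed to act as the edge weight for Louvain.

```python
import networkx as nx
from networkx.algorithms.community import louvain_communities

METRICS = ("min_gfrr_models", "max_gfrr_models", "min_gfrr", "max_gfrr")


def cluster_conserved_systems(edges, metric="min_gfrr_models", seed=42):
    """Cluster systems with Louvain, weighting edges by the chosen metric."""
    if metric not in METRICS:
        raise ValueError(f"unknown GFRR metric: {metric}")
    graph = nx.Graph()
    graph.add_edges_from(edges)  # each edge: (system_1, system_2, attributes)
    communities = louvain_communities(graph, weight=metric, seed=seed)
    # Sort members and clusters so the result is stable to print and compare.
    return sorted((sorted(c) for c in communities), key=lambda c: c[0])


def scores(value):
    """Toy helper: give all four metrics the same value on an edge."""
    return {name: value for name in METRICS}


# Two groups of systems with no edge between them (made-up identifiers).
edges = [
    ("A_1", "B_1", scores(0.9)),
    ("B_1", "C_1", scores(0.85)),
    ("A_1", "C_1", scores(0.8)),
    ("A_2", "B_2", scores(0.95)),
]
clusters = cluster_conserved_systems(edges, metric="min_gfrr")
print(clusters)
```

Fixing the seed, as the --seed option does for the command, keeps the Louvain partition reproducible between runs.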

Cutoff Configuration#

The dual-cutoff system provides hierarchical filtering:

| Filtering Stage | Cutoffs | Purpose |
|-----------------|---------|---------|
| Model families | gfrr_models_cutoff | Primary filter using core functional genes |
| All families | gfrr_cutoff | Secondary filter using complete gene repertoire |

Output#

PANORAMA generates multiple outputs: interactive heatmaps, network graphs, and summary tables for comprehensive systems analysis.

File Organization#

output_directory/
├── heatmap_number_systems.html
├── heatmap_normalized_systems.html
├── conserved_systems.gexf (optional)
└── conserved_systems.graphml (optional)

File Descriptions#

Heatmap Visualizations#

Interactive HTML heatmaps showing system distribution patterns:

| File | Description |
|------|-------------|
| heatmap_number_systems.html | Raw counts of each system type per pangenome |
| heatmap_normalized_systems.html | Normalized percentages showing relative abundance |
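The difference between the two heatmaps can be illustrated with a small count matrix. The normalization used here, each count divided by the total number of systems in the same pangenome, is only one plausible choice and is an assumption; PANORAMA may normalize differently. The count_matrix and normalize helpers are hypothetical.

```python
from collections import Counter


def count_matrix(systems):
    """Count systems of each type per pangenome from (pangenome, type) pairs."""
    counts = {}
    for pangenome, system_type in systems:
        counts.setdefault(pangenome, Counter())[system_type] += 1
    return counts


def normalize(counts):
    """Convert counts to percentages of all systems in each pangenome.

    Assumption: normalization is per pangenome; the real tool may differ.
    """
    normalized = {}
    for pangenome, row in counts.items():
        total = sum(row.values())
        normalized[pangenome] = {
            name: 100.0 * n / total for name, n in row.items()
        }
    return normalized


# Made-up detections: (pangenome name, system type).
detected = [
    ("A", "CRISPR-Cas"), ("A", "CRISPR-Cas"), ("A", "RM"), ("A", "RM"),
    ("B", "RM"),
]
counts = count_matrix(detected)
percentages = normalize(counts)
print(counts)
print(percentages)
```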

[PLACEHOLDER: Heatmap showing system distribution across multiple pangenomes]

[PLACEHOLDER: Normalized heatmap showing relative system abundance patterns]

Conserved System Clustering#

Network Graphs#

When --gfrr_metrics and --graph_formats are specified, PANORAMA writes network graphs of the conserved system clusters to conserved_systems.gexf and/or conserved_systems.graphml. Node attributes include system metadata, pangenome information, and cluster assignments. Edge attributes contain GFRR similarity scores and the number of shared gene families.
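An exported graph can be loaded back into NetworkX for further analysis. The sketch below first writes a small stand-in file so that it runs on its own; the attribute names (pangenome, cluster, min_gfrr, shared_families) are hypothetical and should be replaced with the names found in the actual export.

```python
import tempfile
from collections import defaultdict
from pathlib import Path

import networkx as nx

# Stand-in for a PANORAMA export; attribute names here are hypothetical.
demo = nx.Graph()
demo.add_node("A_1", pangenome="A", cluster="cluster_0")
demo.add_node("B_1", pangenome="B", cluster="cluster_0")
demo.add_node("C_7", pangenome="C", cluster="cluster_1")
demo.add_edge("A_1", "B_1", min_gfrr=0.9, shared_families=5)
gexf_path = Path(tempfile.mkdtemp()) / "conserved_systems.gexf"
nx.write_gexf(demo, str(gexf_path))

# Load the graph and group systems by their cluster assignment.
graph = nx.read_gexf(str(gexf_path))
by_cluster = defaultdict(list)
for node, attrs in graph.nodes(data=True):
    by_cluster[attrs["cluster"]].append(node)

for cluster, members in sorted(by_cluster.items()):
    print(cluster, sorted(members))
```

For the GraphML export, nx.read_graphml can be used in the same way.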

[PLACEHOLDER: Network graph of conserved systems clusters with different colors]