Gene Family Alignment Across Pangenomes#

The align command uses MMseqs2 to align gene families from multiple pangenomes, identifying homologous relationships and sequence similarities across bacterial populations. It supports two modes: targeted inter-pangenome comparison, which excludes intra-pangenome alignments, and all-against-all alignment, which captures both inter- and intra-pangenome relationships.

Alignment Workflow#

The gene family alignment process runs as follows:

1. Load and Validate Pangenomes
   - Multiple pangenomes are loaded from the .h5 files listed in a .tsv file.
   - Each pangenome is validated to ensure gene families have been clustered and sequences are available.
2. Extract Gene Family Sequences
   - Gene family sequences are extracted from each pangenome and written to individual FASTA files.
   - Sequences are compressed and organized in temporary directories for processing.
3. Create MMseqs2 Databases
   - Individual sequence databases are created for each pangenome using MMseqs2.
   - For all-against-all mode, sequences are combined into a single unified database.
4. Perform Sequence Alignments
   - Inter-pangenome mode: Pairwise alignments between all pangenome combinations, excluding self-alignments.
   - All-against-all mode: Comprehensive alignment including both inter- and intra-pangenome comparisons.
   - MMseqs2 search applies the identity and coverage thresholds to retain significant matches.
5. Process and Merge Results
   - Alignment results are converted from MMseqs2's binary format to human-readable TSV files.
   - Multiple alignment files are merged into consolidated output files.
6. Write Results to Files
   - Final alignment results are saved as TSV files containing sequence similarity metrics.
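
The database, search, and conversion steps above can be sketched as a sequence of MMseqs2 calls. This is a minimal illustration, not PANORAMA's actual implementation: the function name, the file names (`pg1.faa`, `pg2.faa`, `tmp_align`), and the chosen output columns are assumptions. Only the MMseqs2 subcommands and flags (`createdb`, `search`, `convertalis`, `--min-seq-id`, `-c`, `--cov-mode`, `--threads`, `--format-output`) are real. The commands are built but not executed.

```python
from pathlib import Path


def mmseqs_commands(query_fasta, target_fasta, workdir, identity=0.5,
                    coverage=0.8, cov_mode=0, threads=1):
    """Return the MMseqs2 invocations for one query/target alignment.

    Nothing is executed here; each command is an argument list that could be
    passed to subprocess.run(cmd, check=True).
    """
    workdir = Path(workdir)
    query_db = str(workdir / "query_db")      # hypothetical database names
    target_db = str(workdir / "target_db")
    result_db = str(workdir / "result_db")
    result_tsv = str(workdir / "result.tsv")
    tmp = str(workdir / "tmp")
    # Columns chosen to mirror the output table documented on this page.
    columns = "query,target,fident,qlen,tlen,alnlen,evalue,bits"
    return [
        ["mmseqs", "createdb", str(query_fasta), query_db],
        ["mmseqs", "createdb", str(target_fasta), target_db],
        ["mmseqs", "search", query_db, target_db, result_db, tmp,
         "--min-seq-id", str(identity), "-c", str(coverage),
         "--cov-mode", str(cov_mode), "--threads", str(threads)],
        ["mmseqs", "convertalis", query_db, target_db, result_db,
         result_tsv, "--format-output", columns],
    ]


commands = mmseqs_commands("pg1.faa", "pg2.faa", "tmp_align",
                           identity=0.8, threads=8)
for cmd in commands:
    print(" ".join(cmd))
```

Returning argument lists instead of running them keeps the sketch safe to try on a machine without MMseqs2 installed.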

Alignment Command Line Usage#

Basic inter-pangenome alignment:

panorama align \
    --pangenomes pangenomes.tsv \
    --output alignment_results \
    --inter_pangenomes \
    --align_identity 0.8 \
    --align_coverage 0.8 \
    --threads 8

Comprehensive all-against-all alignment:

panorama align \
    --pangenomes pangenomes.tsv \
    --output alignment_results \
    --all_against_all \
    --align_identity 0.5 \
    --align_coverage 0.8 \
    --align_cov_mode 0 \
    --threads 8 \
    --keep_tmp
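
When the command is launched from a script, it can help to assemble the argument list programmatically. The sketch below uses only the flags documented on this page; the helper name `build_align_command` is hypothetical and is not part of PANORAMA. It builds the command without running it.

```python
import shlex

MODES = ("inter_pangenomes", "all_against_all")


def build_align_command(pangenomes, output, mode, identity=0.5, coverage=0.8,
                        cov_mode=0, threads=1, keep_tmp=False):
    """Assemble a `panorama align` argument list from the documented flags."""
    if mode not in MODES:
        raise ValueError(f"mode must be one of {MODES}, got {mode!r}")
    if not 0.0 <= identity <= 1.0 or not 0.0 <= coverage <= 1.0:
        raise ValueError("identity and coverage must lie between 0.0 and 1.0")
    cmd = [
        "panorama", "align",
        "--pangenomes", str(pangenomes),
        "--output", str(output),
        f"--{mode}",
        "--align_identity", str(identity),
        "--align_coverage", str(coverage),
        "--align_cov_mode", str(cov_mode),
        "--threads", str(threads),
    ]
    if keep_tmp:
        cmd.append("--keep_tmp")
    return cmd


cmd = build_align_command("pangenomes.tsv", "alignment_results",
                          "inter_pangenomes", identity=0.8, threads=8)
print(shlex.join(cmd))  # pass cmd to subprocess.run(cmd, check=True) to execute
```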

Key Options#

Shortcut	Argument	Type	Required/Optional	Description
-p	--pangenomes	File path	Required	TSV file listing .h5 pangenomes with gene families and sequences
-o	--output	Directory path	Required	Output directory for alignment results
none	--inter_pangenomes	Flag	Required (one of the two modes)	Align gene families between pangenomes only (excludes intra-pangenome)
none	--all_against_all	Flag	Required (one of the two modes)	Align all gene families including intra-pangenome comparisons
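
The pangenome list can be generated with a few lines of Python. This page does not specify the layout of the TSV file, so the two-column form used here (pangenome name, then path to the .h5 file, no header) is an assumption to verify against your PANORAMA version. The names and paths are placeholders.

```python
import csv
import io


def pangenome_list_tsv(entries):
    """Render (name, h5_path) pairs as tab-separated text.

    Assumed layout: one pangenome per line, name then path, no header row.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    for name, h5_path in entries:
        writer.writerow([name, h5_path])
    return buffer.getvalue()


# Placeholder names and paths for illustration only.
listing = pangenome_list_tsv([
    ("pg1", "pangenomes/pg1.h5"),
    ("pg2", "pangenomes/pg2.h5"),
])
print(listing, end="")
# To save it: pathlib.Path("pangenomes.tsv").write_text(listing)
```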

MMseqs2 Alignment Parameters#

Shortcut	Argument	Type	Optional	Description
none	--align_identity	Float	True	Minimum sequence identity, as a fraction (0.0-1.0, default: 0.5)
none	--align_coverage	Float	True	Minimum alignment coverage, as a fraction (0.0-1.0, default: 0.8)
none	--align_cov_mode	Int	True	Coverage mode: 0=query, 1=target, 2=shorter seq, 3=longer seq, 4=both, 5=all (default: 0)

Advanced Configuration Arguments#

Shortcut	Argument	Type	Optional	Description
none	--tmpdir	str (directory path)	True	Directory for temporary files (default: system temp directory)
none	--keep_tmp	bool (flag)	True	Keep temporary files after completion (useful for debugging)
none	--threads	int	True	Number of CPU threads for parallel processing (default: 1)
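
The interaction between the temporary directory and the keep-temporary-files flag can be pictured with the standard library. This is a sketch of the general pattern, not PANORAMA's code: the context manager name and the `align_` prefix are assumptions.

```python
import shutil
import tempfile
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def working_directory(tmpdir=None, keep_tmp=False):
    """Yield a fresh scratch directory and remove it unless keep_tmp is set.

    tmpdir=None lets tempfile pick the system temporary directory.
    """
    path = Path(tempfile.mkdtemp(prefix="align_", dir=tmpdir))
    try:
        yield path
    finally:
        if not keep_tmp:
            shutil.rmtree(path, ignore_errors=True)


with working_directory() as scratch:
    (scratch / "families.faa").write_text(">fam1\nMKT\n")
    print(scratch.is_dir())
print(scratch.exists())  # the directory is gone once the block exits
```

With `keep_tmp=True` the directory survives, which is what makes intermediate FASTA files and databases available for debugging.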

Alignment Modes#

Inter-Pangenome Alignment#

This mode performs alignments only between different pangenomes, excluding intra-pangenome comparisons:

- Use case: Identifying shared gene families between populations
- Results: Focus on inter-population relationships

All-Against-All Alignment#

This mode performs comprehensive alignments including both inter- and intra-pangenome comparisons:

- Use case: Complete similarity analysis including within-population diversity
- Results: Complete gene family relationship matrix
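
The difference between the two modes comes down to which pangenome pairs are compared. The sketch below enumerates those pairs for a hypothetical set of three pangenomes. It illustrates the scope of each mode only and says nothing about how PANORAMA schedules the searches (all-against-all mode uses a single combined database).

```python
from itertools import combinations, combinations_with_replacement


def compared_pairs(names, all_against_all=False):
    """List the pangenome pairs covered by each alignment mode.

    Inter-pangenome mode skips self-pairs; all-against-all mode includes them.
    """
    pairing = combinations_with_replacement if all_against_all else combinations
    return list(pairing(names, 2))


names = ["pg1", "pg2", "pg3"]  # hypothetical pangenome names
inter = compared_pairs(names)
full = compared_pairs(names, all_against_all=True)
print(len(inter), inter)
print(len(full), full)
```

For n pangenomes this gives n(n-1)/2 pairs in inter-pangenome mode and n(n+1)/2 in all-against-all mode, so the second mode always adds exactly n self-comparisons.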

Parameter Guidelines#

Identity Thresholds#

Threshold	Use Case
0.9-1.0	Nearly identical sequences
0.7-0.9	Highly similar homologs
0.5-0.7	Moderate similarity
0.3-0.5	Low similarity (use with caution)

Coverage Thresholds#

Threshold	Description
0.8-1.0	High coverage requirement
0.6-0.8	Moderate coverage
0.4-0.6	Permissive coverage

Coverage Modes#

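Mode	Coverage Measured On
0	Query coverage
1	Target coverage
2	Shorter sequence coverage

Modes 3 to 5 are listed with the --align_cov_mode option under MMseqs2 Alignment Parameters. MMseqs2's own --cov-mode numbering differs from the labels above (there, 0 is coverage of both query and target, 1 of the target, and 2 of the query), so verify which convention your PANORAMA version applies before relying on a specific mode.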
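
To make the modes in this table concrete, the following sketch computes an approximate coverage from the lengths reported in the output file. It follows the labels in the table above, and it approximates aligned residues with the alignment length, which can include gap columns. Both points are simplifying assumptions, and the function name is hypothetical.

```python
def approximate_coverage(qlength, tlength, alnlength, cov_mode=0):
    """Approximate alignment coverage for the modes listed in the table.

    0: relative to the query, 1: relative to the target,
    2: relative to the shorter of the two sequences.
    """
    denominators = {
        0: qlength,
        1: tlength,
        2: min(qlength, tlength),
    }
    if cov_mode not in denominators:
        raise ValueError(f"unsupported coverage mode in this sketch: {cov_mode}")
    # Alignment length may include gaps, so cap the ratio at 1.0.
    return min(1.0, alnlength / denominators[cov_mode])


# Lengths taken from the example row of the results table (150, 145, 142).
for mode in (0, 1, 2):
    value = approximate_coverage(150, 145, 142, cov_mode=mode)
    print(mode, round(value, 3), value >= 0.8)
```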

Output Files#

PANORAMA writes alignment results as TSV files that report similarity metrics for each alignment.

File Organization#

output_directory/
├── inter_pangenomes.tsv        (inter-pangenome mode)
└── all_against_all.tsv         (all-against-all mode)

Alignment Results Format#

Each alignment file contains the following columns:

Column	Description	Example
query	Query gene family identifier	PG1_FAM_001
target	Target gene family identifier	PG2_FAM_045
identity	Fractional sequence identity (0.0-1.0)	0.85
qlength	Query sequence length in amino acids	150
tlength	Target sequence length in amino acids	145
alnlength	Alignment length in amino acids	142
e_value	E-value of the alignment	1.2e-45
bits	Bit score of the alignment	185.2
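
A result file with these columns can be loaded with the standard csv module. The sketch below assumes the columns appear in the order listed above and tolerates an optional header row. Whether PANORAMA writes a header is not stated here, so that handling is a precaution. The `Hit` class and `read_hits` function are illustrative names, and the sample row reuses the example values from the table.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class Hit:
    query: str
    target: str
    identity: float
    qlength: int
    tlength: int
    alnlength: int
    e_value: float
    bits: float


def read_hits(handle):
    """Parse tab-separated alignment rows into Hit records."""
    hits = []
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0] == "query":  # skip blank lines and a header row
            continue
        query, target, identity, qlen, tlen, alnlen, e_value, bits = row
        hits.append(Hit(query, target, float(identity), int(qlen), int(tlen),
                        int(alnlen), float(e_value), float(bits)))
    return hits


sample = (
    "query\ttarget\tidentity\tqlength\ttlength\talnlength\te_value\tbits\n"
    "PG1_FAM_001\tPG2_FAM_045\t0.85\t150\t145\t142\t1.2e-45\t185.2\n"
)
hits = read_hits(io.StringIO(sample))
strong = [h for h in hits if h.identity >= 0.8 and h.e_value < 1e-10]
print(strong)
```

To read a real file, replace the `io.StringIO(sample)` argument with an open file handle, for example `open("alignment_results/inter_pangenomes.tsv", newline="")`.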