Gene Family Clustering Across Pangenomes#

The cluster command groups related gene families from multiple pangenomes into homologous clusters based on sequence similarity using MMseqs2. This analysis identifies gene families that share common evolutionary origins across different bacterial populations, enabling comparative genomics studies and the construction of meta-pangenomes. The clustering supports both fast (linclust) and sensitive (cluster) methods to accommodate different accuracy and performance requirements.

Clustering Workflow#

The gene family clustering process runs as follows:

Load and Validate Pangenomes
- Multiple pangenomes are loaded from .h5 files based on a .tsv file.
- Each pangenome is validated to ensure gene families have been clustered and sequences are available.
Extract Gene Family Sequences
- Gene family sequences are extracted from each pangenome and written to compressed FASTA files.
- All sequences are combined while maintaining pangenome-specific identifiers.
Create Unified MMseqs2 Database
- All gene family sequences are combined into a single MMseqs2 database.
- Database indexing optimizes subsequent clustering operations.
Perform Sequence Clustering
- Linclust method: Fast linear-time clustering suitable for large datasets with moderate sensitivity requirements.
- Cluster method: Sensitive clustering with comprehensive sequence comparisons for high-accuracy results.
- MMseqs2 applies identity and coverage thresholds to group similar sequences.
Process Clustering Results
- Binary clustering results are converted to human-readable TSV format.
- Cluster IDs are assigned to group-related gene families.
Write Results to Files
- Final clustering results are saved with cluster assignments and membership details.

Clustering command Line Usage#

Fast clustering with linclust:

panorama cluster \
--pangenomes pangenomes.tsv \
--output clustering_results \
--method linclust \
--cluster_identity 0.8 \
--cluster_coverage 0.8 \
--threads 8

Sensitive clustering with comprehensive parameters:

panorama cluster \
--pangenomes pangenomes.tsv \
--output clustering_results \
--method cluster \
--cluster_identity 0.5 \
--cluster_coverage 0.8 \
--cluster_sensitivity 7.5 \
--cluster_max_seqs 500 \
--threads 8 \
--keep_tmp

Key Options#

Shortcut	Argument	Type	Required/Optional	Description
-p	–pangenomes	File path	Required	TSV file listing .h5 pangenomes with gene families and sequences
-o	–output	Directory path	Required	Output directory for clustering results
-m	–method	String	Required	Clustering method: ‘linclust’ (fast) or ‘cluster’ (sensitive)

MMseqs2 Core Clustering Parameters#

Shortcut	Argument	Type	Optional	Description
—	–cluster_identity	Float	True	Minimum sequence identity threshold (0.0-1.0, default: 0.5)
—	–cluster_coverage	Float	True	Minimum coverage threshold (0.0-1.0, default: 0.8)
—	–cluster_cov_mode	Int	True	Coverage mode: 0=query, 1=target, 2=shorter, 3=longer, 4=both, 5=all (default: 0)
—	–cluster_eval	Float	True	E-value threshold for sequence similarity (default: 1e-3)

Cluster method-specific parameters#

Argument	Type	Description
–cluster_sensitivity	Float	Search sensitivity (higher = more sensitive but slower)
–cluster_max_seqs	Int	Maximum sequences per cluster representative
–cluster_min_ungapped	Int	Minimum ungapped alignment score

Advanced Parameters (Both Methods)#

Argument	Type	Description
–cluster_comp_bias_corr	Int	Compositional bias correction: 0=disabled, 1=enabled
–cluster_kmer_per_seq	Int	Number of k-mers per sequence
–cluster_align_mode	Int	Alignment mode: 0=auto, 1=score only, 2=extended, 3=both, 4=fast
–cluster_max_seq_len	Int	Maximum sequence length
–cluster_max_reject	Int	Maximum number of rejected sequences
–cluster_mode	Int	Clustering algorithm: 0=Set Cover, 1=Connected Component, 2=Greedy, 3=Greedy Low Memory

Advanced Configuration Arguments#

Shortcut	Argument	Type	Optional	Description
—	–tmpdir	str (directory path)	True	Directory for temporary files (default: system temp directory)
—	–keep_tmp	bool (flag)	True	Keep temporary files after completion (useful for debugging)
—	–threads	int	True	Number of CPU threads for parallel processing (default: 1)

Clustering Methods Comparison#

Linclust (Fast Method)#

Algorithm: Linear time complexity clustering Performance: Fast execution suitable for large datasets Sensitivity: Moderate - may miss distant relationships

Parameter Range	Recommendation
Identity: 0.7-0.9	High confidence clusters
Identity: 0.5-0.7	Balanced approach
Identity: 0.3-0.5	Permissive clustering

Cluster (Sensitive Method)#

Algorithm: Comprehensive pairwise comparisons Performance: Slower but more thorough Sensitivity: High - detects distant homologs

Parameter	Recommended Range	Impact
Sensitivity	4.0-6.0	Standard sensitivity
Sensitivity	6.0-8.0	High sensitivity (slower)
Max Seqs	100-300	Balanced performance
Max Seqs	500-1000	Comprehensive but slow

Parameter Guidelines#

Identity and Coverage Combinations#

Identity	Coverage	Clustering Behavior	Biological Interpretation
0.9	0.9	Very strict	Nearly identical sequences
0.8	0.8	Strict	Highly conserved families
0.6	0.8	Moderate	Homologous families
0.5	0.7	Permissive	Distant homologs
0.3	0.5	Very permissive	Potential remote homology

Coverage Mode Selection#

Mode	Coverage Target	Best Use Case
0	Query coverage	Find complete matches to smaller sequences
1	Target coverage	Find sequences that completely match targets
2	Shorter sequence	Balanced approach (recommended)
3	Longer sequence	Conservative clustering

Output Files#

PANORAMA generates clustering results with detailed cluster membership information.

File Organization#

output_directory/
└── clustering.tsv

Clustering Results Format#

The clustering results file contains the following columns:

Column	Description	Example
cluster_id	Unique cluster identifier (integer)	0
referent	Representative gene family ID	PG1_FAM_001
in_clust	Gene family member of this cluster	PG2_FAM_045

Results Interpretation#

cluster_id: Sequential integer identifying each cluster group
referent: The representative (typically first or most abundant) family in the cluster
in_clust: All gene families assigned to this cluster (including the referent)

Example clustering results:

cluster_id  referent      in_clust
         PG1_FAM_001   PG1_FAM_001
         PG1_FAM_001   PG2_FAM_045
         PG1_FAM_001   PG3_FAM_078
         PG1_FAM_002   PG1_FAM_002
         PG1_FAM_002   PG2_FAM_123