Gene Family Clustering Across Pangenomes#
The cluster command groups related gene families from multiple pangenomes into homologous clusters based on sequence similarity using MMseqs2. This analysis identifies gene families that share common evolutionary origins across different bacterial populations, enabling comparative genomics studies and the construction of meta-pangenomes. The clustering supports both fast (linclust) and sensitive (cluster) methods to accommodate different accuracy and performance requirements.
Clustering Workflow#
The gene family clustering process runs as follows:
Load and Validate Pangenomes
Multiple pangenomes are loaded from .h5 files based on a .tsv file.
Each pangenome is validated to ensure gene families have been clustered and sequences are available.
Extract Gene Family Sequences
Gene family sequences are extracted from each pangenome and written to compressed FASTA files.
All sequences are combined while maintaining pangenome-specific identifiers.
Create Unified MMseqs2 Database
All gene family sequences are combined into a single MMseqs2 database.
Database indexing optimizes subsequent clustering operations.
Perform Sequence Clustering
Linclust method: Fast linear-time clustering suitable for large datasets with moderate sensitivity requirements.
Cluster method: Sensitive clustering with comprehensive sequence comparisons for high-accuracy results.
MMseqs2 applies identity and coverage thresholds to group similar sequences.
Process Clustering Results
Binary clustering results are converted to human-readable TSV format.
Cluster IDs are assigned to group-related gene families.
Write Results to Files
Final clustering results are saved with cluster assignments and membership details.
Clustering command Line Usage#
Fast clustering with linclust:
panorama cluster \ --pangenomes pangenomes.tsv \ --output clustering_results \ --method linclust \ --cluster_identity 0.8 \ --cluster_coverage 0.8 \ --threads 8
Sensitive clustering with comprehensive parameters:
panorama cluster \ --pangenomes pangenomes.tsv \ --output clustering_results \ --method cluster \ --cluster_identity 0.5 \ --cluster_coverage 0.8 \ --cluster_sensitivity 7.5 \ --cluster_max_seqs 500 \ --threads 8 \ --keep_tmp
Key Options#
Shortcut |
Argument |
Type |
Required/Optional |
Description |
|---|---|---|---|---|
-p |
–pangenomes |
File path |
Required |
TSV file listing .h5 pangenomes with gene families and sequences |
-o |
–output |
Directory path |
Required |
Output directory for clustering results |
-m |
–method |
String |
Required |
Clustering method: ‘linclust’ (fast) or ‘cluster’ (sensitive) |
MMseqs2 Core Clustering Parameters#
Shortcut |
Argument |
Type |
Optional |
Description |
|---|---|---|---|---|
— |
–cluster_identity |
Float |
True |
Minimum sequence identity threshold (0.0-1.0, default: 0.5) |
— |
–cluster_coverage |
Float |
True |
Minimum coverage threshold (0.0-1.0, default: 0.8) |
— |
–cluster_cov_mode |
Int |
True |
Coverage mode: 0=query, 1=target, 2=shorter, 3=longer, 4=both, 5=all (default: 0) |
— |
–cluster_eval |
Float |
True |
E-value threshold for sequence similarity (default: 1e-3) |
Cluster method-specific parameters#
Argument |
Type |
Description |
|---|---|---|
–cluster_sensitivity |
Float |
Search sensitivity (higher = more sensitive but slower) |
–cluster_max_seqs |
Int |
Maximum sequences per cluster representative |
–cluster_min_ungapped |
Int |
Minimum ungapped alignment score |
Advanced Parameters (Both Methods)#
Argument |
Type |
Description |
|---|---|---|
–cluster_comp_bias_corr |
Int |
Compositional bias correction: 0=disabled, 1=enabled |
–cluster_kmer_per_seq |
Int |
Number of k-mers per sequence |
–cluster_align_mode |
Int |
Alignment mode: 0=auto, 1=score only, 2=extended, 3=both, 4=fast |
–cluster_max_seq_len |
Int |
Maximum sequence length |
–cluster_max_reject |
Int |
Maximum number of rejected sequences |
–cluster_mode |
Int |
Clustering algorithm: 0=Set Cover, 1=Connected Component, 2=Greedy, 3=Greedy Low Memory |
Advanced Configuration Arguments#
Shortcut |
Argument |
Type |
Optional |
Description |
|---|---|---|---|---|
— |
–tmpdir |
str (directory path) |
True |
Directory for temporary files (default: system temp directory) |
— |
–keep_tmp |
bool (flag) |
True |
Keep temporary files after completion (useful for debugging) |
— |
–threads |
int |
True |
Number of CPU threads for parallel processing (default: 1) |
Clustering Methods Comparison#
Linclust (Fast Method)#
Algorithm: Linear time complexity clustering Performance: Fast execution suitable for large datasets Sensitivity: Moderate - may miss distant relationships
Parameter Range |
Recommendation |
|---|---|
Identity: 0.7-0.9 |
High confidence clusters |
Identity: 0.5-0.7 |
Balanced approach |
Identity: 0.3-0.5 |
Permissive clustering |
Cluster (Sensitive Method)#
Algorithm: Comprehensive pairwise comparisons Performance: Slower but more thorough Sensitivity: High - detects distant homologs
Parameter |
Recommended Range |
Impact |
|---|---|---|
Sensitivity |
4.0-6.0 |
Standard sensitivity |
Sensitivity |
6.0-8.0 |
High sensitivity (slower) |
Max Seqs |
100-300 |
Balanced performance |
Max Seqs |
500-1000 |
Comprehensive but slow |
Parameter Guidelines#
Identity and Coverage Combinations#
Identity |
Coverage |
Clustering Behavior |
Biological Interpretation |
|---|---|---|---|
0.9 |
0.9 |
Very strict |
Nearly identical sequences |
0.8 |
0.8 |
Strict |
Highly conserved families |
0.6 |
0.8 |
Moderate |
Homologous families |
0.5 |
0.7 |
Permissive |
Distant homologs |
0.3 |
0.5 |
Very permissive |
Potential remote homology |
Coverage Mode Selection#
Mode |
Coverage Target |
Best Use Case |
|---|---|---|
0 |
Query coverage |
Find complete matches to smaller sequences |
1 |
Target coverage |
Find sequences that completely match targets |
2 |
Shorter sequence |
Balanced approach (recommended) |
3 |
Longer sequence |
Conservative clustering |
Output Files#
PANORAMA generates clustering results with detailed cluster membership information.
File Organization#
output_directory/
└── clustering.tsv
Clustering Results Format#
The clustering results file contains the following columns:
Column |
Description |
Example |
|---|---|---|
cluster_id |
Unique cluster identifier (integer) |
0 |
referent |
Representative gene family ID |
PG1_FAM_001 |
in_clust |
Gene family member of this cluster |
PG2_FAM_045 |
Results Interpretation#
cluster_id: Sequential integer identifying each cluster group
referent: The representative (typically first or most abundant) family in the cluster
in_clust: All gene families assigned to this cluster (including the referent)
Example clustering results:
cluster_id referent in_clust
0 PG1_FAM_001 PG1_FAM_001
0 PG1_FAM_001 PG2_FAM_045
0 PG1_FAM_001 PG3_FAM_078
1 PG1_FAM_002 PG1_FAM_002
1 PG1_FAM_002 PG2_FAM_123