panorama.alignment package#

Submodules#

panorama.alignment.align module#

PANORAMA alignment module for inter-pangenome gene family comparisons.

This module provides functionality to align gene families between pangenomes using MMseqs2, supporting both inter-pangenome-only and all-against-all alignment modes. It handles sequence database creation, alignment execution, and result processing with comprehensive error handling and progress tracking.

class panorama.alignment.align.AlignmentConfig(ALIGN_FORMAT: List[str] | None = None, ALIGN_COLUMNS: List[str] | None = None, DEFAULT_IDENTITY: float = 0.8, DEFAULT_COVERAGE: float = 0.8, DEFAULT_COV_MODE: int = 0, DEFAULT_THREADS: int = 1, INTER_PANGENOMES_OUTPUT: str = 'inter_pangenomes.tsv', ALL_AGAINST_ALL_OUTPUT: str = 'all_against_all.tsv')#

Bases: object

Configuration constants for alignment operations.

ALIGN_COLUMNS: List[str] | None = None#
ALIGN_FORMAT: List[str] | None = None#
ALL_AGAINST_ALL_OUTPUT: str = 'all_against_all.tsv'#
DEFAULT_COVERAGE: float = 0.8#
DEFAULT_COV_MODE: int = 0#
DEFAULT_IDENTITY: float = 0.8#
DEFAULT_THREADS: int = 1#
INTER_PANGENOMES_OUTPUT: str = 'inter_pangenomes.tsv'#
exception panorama.alignment.align.AlignmentError#

Bases: Exception

Custom exception for alignment-related errors.

exception panorama.alignment.align.AlignmentValidationError#

Bases: AlignmentError

Custom exception for alignment parameter validation errors.

panorama.alignment.align._execute_alignment(query_db: Path, target_db: Path, aln_db: Path, tmpdir: Path, identity: float, coverage: float, cov_mode: int, threads: int) Path#

Execute the actual MMseqs2 alignment command.

Parameters:
  • query_db – Query database path.

  • target_db – Target database path.

  • aln_db – Alignment database path.

  • tmpdir – Temporary directory.

  • identity – Identity threshold.

  • coverage – Coverage threshold.

  • cov_mode – Coverage mode.

  • threads – Number of threads.

Returns:

Path – Path to the alignment database.

Raises:

AlignmentError – If alignment execution fails.

panorama.alignment.align._validate_alignment_parameters(identity: float, coverage: float, cov_mode: int, threads: int) None#

Validate alignment parameters.

Parameters:
  • identity – Sequence identity threshold (0.0-1.0).

  • coverage – Coverage threshold (0.0-1.0).

  • cov_mode – Coverage mode for MMseqs2 (0-5).

  • threads – Number of threads (positive integer).

Raises:

AlignmentValidationError – If any parameter is invalid.
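
The range checks listed above can be sketched as follows. This is a minimal illustration that assumes the documented ranges are enforced literally; the function name validate_alignment_parameters and the two exception classes are local stand-ins, not PANORAMA's actual implementation.

```python
class AlignmentError(Exception):
    """Local stand-in for panorama.alignment.align.AlignmentError."""


class AlignmentValidationError(AlignmentError):
    """Local stand-in for the validation-specific subclass."""


def validate_alignment_parameters(identity: float, coverage: float,
                                  cov_mode: int, threads: int) -> None:
    """Raise AlignmentValidationError if a value is outside its documented range."""
    if not 0.0 <= identity <= 1.0:
        raise AlignmentValidationError(f"identity must be in [0, 1], got {identity}")
    if not 0.0 <= coverage <= 1.0:
        raise AlignmentValidationError(f"coverage must be in [0, 1], got {coverage}")
    if cov_mode not in range(6):
        raise AlignmentValidationError(f"cov_mode must be in 0-5, got {cov_mode}")
    if not isinstance(threads, int) or threads < 1:
        raise AlignmentValidationError(f"threads must be a positive integer, got {threads}")
```

Because AlignmentValidationError derives from AlignmentError, a caller that catches AlignmentError also handles validation failures.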

panorama.alignment.align._validate_directory_access(directory: Path, create_if_missing: bool = False) None#

Validate directory exists and is accessible.

Parameters:
  • directory – Directory path to validate.

  • create_if_missing – Whether to create a directory if it doesn’t exist.

Raises:

AlignmentValidationError – If directory validation fails.
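
A directory check of this kind might look like the sketch below. It assumes that "accessible" means readable and writable; DirectoryValidationError is a hypothetical stand-in for AlignmentValidationError, and the body is not taken from PANORAMA.

```python
import os
from pathlib import Path


class DirectoryValidationError(Exception):
    """Hypothetical stand-in for AlignmentValidationError."""


def validate_directory_access(directory: Path, create_if_missing: bool = False) -> None:
    """Check that `directory` exists, is a directory, and is readable and writable."""
    if not directory.exists():
        if not create_if_missing:
            raise DirectoryValidationError(f"{directory} does not exist")
        # Create the directory and any missing parents.
        directory.mkdir(parents=True, exist_ok=True)
    if not directory.is_dir():
        raise DirectoryValidationError(f"{directory} is not a directory")
    if not os.access(directory, os.R_OK | os.W_OK):
        raise DirectoryValidationError(f"{directory} is not readable and writable")
```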

panorama.alignment.align.align_db(query_db: Path, target_db: Path, tmpdir: Path, aln_db: Path | None = None, query_name: str = '', target_name: str = '', identity: float = 0.8, coverage: float = 0.8, cov_mode: int = 0, threads: int = 1) Path#

Perform sequence alignment between query and target databases using MMseqs2.

This function executes MMseqs2 search to align sequences from the query database against the target database, applying specified identity and coverage thresholds.

Parameters:
  • query_name – Name of the query database

  • target_name – Name of the target database

  • query_db – Path to MMseqs2 query sequences database.

  • target_db – Path to MMseqs2 target sequences database.

  • aln_db – Optional path for the alignment results database. If None, a temporary file will be created.

  • tmpdir – Temporary directory for MMseqs2 operations. If None, the system temp directory will be used.

  • identity – Minimum sequence identity threshold (0.0-1.0). Defaults to 0.8.

  • coverage – Minimum coverage threshold (0.0-1.0). Defaults to 0.8.

  • cov_mode

    Coverage mode for MMseqs2 (0-5). Defaults to 0.
    • 0: coverage of both the query and the target,

    • 1: coverage of the target,

    • 2: coverage of the query,

    • 3-5: sequence-length ratio modes (see the MMseqs2 documentation).

  • threads – Number of threads for alignment. Defaults to 1.

Returns:

Path – Path to the MMseqs2 alignment results database.

Raises:
  • AlignmentError – If alignment execution fails.

  • AlignmentValidationError – If parameters are invalid.

  • FileNotFoundError – If input databases don’t exist.
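
To make the parameter mapping concrete, the sketch below builds an MMseqs2 search command line from the same arguments. The flags (--min-seq-id, -c, --cov-mode, --threads) are standard MMseqs2 options, but the helper build_search_command is hypothetical, and whether align_db assembles its command exactly this way is an assumption.

```python
from pathlib import Path
from typing import List


def build_search_command(query_db: Path, target_db: Path, aln_db: Path, tmpdir: Path,
                         identity: float = 0.8, coverage: float = 0.8,
                         cov_mode: int = 0, threads: int = 1) -> List[str]:
    """Return an `mmseqs search` argument list (hypothetical helper, nothing is executed)."""
    return [
        "mmseqs", "search",
        str(query_db), str(target_db), str(aln_db), str(tmpdir),
        "--min-seq-id", str(identity),   # minimum sequence identity
        "-c", str(coverage),             # minimum coverage
        "--cov-mode", str(cov_mode),     # how coverage is measured
        "--threads", str(threads),
    ]
```

The resulting list can be passed to subprocess.run once MMseqs2 is installed and the databases exist.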

panorama.alignment.align.align_pangenomes(pangenome2db: Dict[str, Path], tmpdir: Path, identity: float = 0.8, coverage: float = 0.8, cov_mode: int = 0, threads: int = 1, disable_bar: bool = False) List[Path]#

Perform all pairwise alignments between multiple pangenomes.

This function executes alignments between all possible pairs of pangenomes using their MMseqs2 databases, with progress tracking and error handling.

Parameters:
  • pangenome2db – Dictionary mapping pangenome names to their MMseqs2 database paths.

  • identity – Minimum sequence identity threshold (0.0-1.0). Defaults to 0.8.

  • coverage – Minimum coverage threshold (0.0-1.0). Defaults to 0.8.

  • cov_mode – Coverage mode for MMseqs2 (0-5). Defaults to 0.

  • tmpdir – Temporary directory for operations. If None, uses system temp.

  • threads – Number of threads per alignment. Defaults to 1.

  • disable_bar – Whether to disable the progress bar. Defaults to False.

Returns:

List[Path] – List of paths to alignment result files for each pangenome pair.

Raises:
  • AlignmentError – If any alignment fails.

  • AlignmentValidationError – If parameters are invalid.
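
Enumerating "all possible pairs" can be sketched with itertools.combinations. The helper pangenome_pairs is hypothetical, and the sketch assumes each unordered pair is aligned once; the text above does not say whether PANORAMA also runs the reverse direction.

```python
from itertools import combinations
from pathlib import Path
from typing import Dict, List, Tuple

Pair = Tuple[Tuple[str, str], Tuple[Path, Path]]


def pangenome_pairs(pangenome2db: Dict[str, Path]) -> List[Pair]:
    """Return ((name_a, name_b), (db_a, db_b)) for every unordered pair of pangenomes."""
    pairs: List[Pair] = []
    # Sorting the names makes the pair order deterministic.
    for name_a, name_b in combinations(sorted(pangenome2db), 2):
        pairs.append(((name_a, name_b), (pangenome2db[name_a], pangenome2db[name_b])))
    return pairs
```

Each element has the shape expected by the pangenomes_pair and db_pair arguments of align_pangenomes_pair; n pangenomes yield n(n-1)/2 pairs.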

panorama.alignment.align.align_pangenomes_pair(pangenomes_pair: Tuple[str, str], tmpdir: Path, db_pair: Tuple[Path, Path], identity: float = 0.8, coverage: float = 0.8, cov_mode: int = 0, threads: int = 1) Path#

Align gene families between two specific pangenomes.

This function performs pairwise alignment between gene families from two different pangenomes, executing both the alignment and conversion to human-readable format.

Parameters:
  • pangenomes_pair – Tuple containing names of the two pangenomes to align.

  • db_pair – Tuple containing paths to MMseqs2 databases for the pangenome pair.

  • identity – Minimum sequence identity threshold (0.0-1.0). Defaults to 0.8.

  • coverage – Minimum coverage threshold (0.0-1.0). Defaults to 0.8.

  • cov_mode – Coverage mode for MMseqs2 (0-5). Defaults to 0.

  • tmpdir – Temporary directory for operations. If None, uses system temp.

  • threads – Number of threads for processing. Defaults to 1.

Returns:

Path – Path to the alignment results file in TSV format.

Raises:
  • AlignmentError – If alignment or conversion fails.

  • AlignmentValidationError – If parameters are invalid.

panorama.alignment.align.all_against_all_align(families_seq: List[Path], output: Path, identity: float = 0.8, coverage: float = 0.8, cov_mode: int = 0, tmpdir: Path | None = None, threads: int = 1) DataFrame#

Perform all-against-all alignment of gene families including intra-pangenome comparisons.

This function combines all gene family sequences into a single database and performs self-alignment, capturing both inter- and intra-pangenome similarities.

Parameters:
  • families_seq – List of paths to gene family sequence files from all pangenomes.

  • output – Directory where alignment results will be written. Will be created if it doesn’t exist.

  • identity – Minimum sequence identity threshold (0.0-1.0). Defaults to 0.8.

  • coverage – Minimum coverage threshold (0.0-1.0). Defaults to 0.8.

  • cov_mode – Coverage mode for MMseqs2 (0-5). Defaults to 0.

  • tmpdir – Temporary directory for operations. If None, uses system temp.

  • threads – Number of threads for processing. Defaults to 1.

Returns:

pd.DataFrame – DataFrame containing all alignment results with columns defined in CONFIG.ALIGN_COLUMNS.

Raises:
  • AlignmentError – If the alignment process fails.

  • AlignmentValidationError – If parameters are invalid.

  • FileNotFoundError – If sequence files don’t exist.

Notes

The output and temporary directories are assumed to have been validated already. See the launch function for how validation is done.

panorama.alignment.align.check_align_parameters(args: Namespace) None#

Validate command line arguments for alignment operations.

This function performs comprehensive validation of all alignment parameters provided via command line arguments, ensuring they meet the required constraints before proceeding with alignment operations.

Parameters:

args – Parsed command line arguments containing alignment parameters. Expected attributes: tmpdir, align_identity, align_coverage.

Raises:
  • AlignmentValidationError – If any parameter validation fails.

  • NotADirectoryError – If tmpdir is not a valid directory.

panorama.alignment.align.check_pangenome_align(pangenome: Pangenome) None#

Validate that a pangenome is ready for alignment operations.

This function checks that the pangenome has been properly processed and contains the necessary data for gene family alignment operations.

Parameters:

pangenome – Pangenome object to validate. Must have clustered genes and associated gene family sequences.

Raises:

AttributeError – If pangenome is missing required data or processing steps.

panorama.alignment.align.inter_pangenome_align(pangenome2families_seq: Dict[str, Path], output: Path, tmpdir: Path, identity: float = 0.8, coverage: float = 0.8, cov_mode: int = 0, threads: int = 1, disable_bar: bool = False) None#

Perform inter-pangenome alignment without intra-pangenome comparisons.

This function aligns gene families between different pangenomes while excluding alignments within the same pangenome. It creates MMseqs2 databases for each pangenome and performs all pairwise comparisons.

Parameters:
  • pangenome2families_seq – Dictionary mapping pangenome names to their respective gene family sequence files.

  • output – Directory where alignment results will be written. Will be created if it doesn’t exist.

  • identity – Minimum sequence identity threshold (0.0-1.0). Defaults to 0.8.

  • coverage – Minimum coverage threshold (0.0-1.0). Defaults to 0.8.

  • cov_mode – Coverage mode for MMseqs2 (0-5). Defaults to 0.

  • tmpdir – Temporary directory for operations. If None, uses system temp.

  • threads – Number of threads for processing. Defaults to 1.

  • disable_bar – Whether to disable progress bars. Defaults to False.

Raises:
  • AlignmentError – If the alignment process fails.

  • AlignmentValidationError – If parameters are invalid.

  • FileNotFoundError – If sequence files don’t exist.

Notes

The output and temporary directories are assumed to have been validated already. See the launch function for how validation is done.

panorama.alignment.align.launch(args: Namespace) None#

Main entry point for alignment operations.

This function orchestrates the complete alignment workflow, including parameter validation, pangenome loading, sequence writing, and alignment execution based on the specified mode (inter-pangenome or all-against-all).

Parameters:

args – Parsed command line arguments containing all configuration parameters for the alignment operation.

Raises:
  • AlignmentError – If any step of the alignment process fails.

  • AlignmentValidationError – If parameter validation fails.

  • NotADirectoryError – If specified directories are invalid.

panorama.alignment.align.launch_pangenomes_alignment(pangenomes: Pangenomes, output: Path, tmpdir: Path, inter_pangenomes: bool = False, all_against_all: bool = False, identity: float = 0.8, coverage: float = 0.8, cov_mode: int = 0, threads: int = 1, lock: Lock = None, force: bool = False, disable_bar: bool = False) None#

Launches the alignment of pangenome families based on the specified mode.

Parameters:
  • pangenomes (Pangenomes) – A collection of pangenomes to be aligned.

  • output (Path) – Path to the directory where the alignment results will be stored.

  • tmpdir (Path) – Path to the temporary directory for intermediate files during the alignment process.

  • inter_pangenomes (bool) – If True, runs the alignment in inter-pangenome mode.

  • all_against_all (bool) – If True, runs the alignment in all-against-all mode.

  • identity (float) – Minimum sequence identity threshold for the alignment.

  • coverage (float) – Minimum sequence coverage threshold for the alignment.

  • cov_mode (int) – Coverage mode to dictate how the coverage threshold is applied.

  • threads (int) – Number of threads to be used for parallel processing.

  • lock (Lock) – A multiprocessing lock to synchronize access to certain operations.

  • force (bool) – If True, allows overwriting or recreating the output directory.

  • disable_bar (bool) – If True, disables progress bars in the alignment process.

Raises:

AlignmentValidationError – If neither alignment mode (inter_pangenomes or all_against_all) is specified.

panorama.alignment.align.merge_aln_res(align_results: List[Path], outfile: Path) None#

Merge multiple alignment result files into a single consolidated file.

This function reads multiple TSV alignment files and combines them into a single file with proper headers and consistent formatting.

Parameters:
  • align_results – List of paths to individual alignment result files. All files must be in TSV format with consistent columns.

  • outfile – Path where the merged results will be written. Directory must exist and be writable.

Raises:
  • AlignmentError – If merging fails due to file I/O or format issues.

  • FileNotFoundError – If any input file doesn’t exist.
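
A merge of this kind can be sketched with the standard csv module. The sketch assumes the per-pair files carry no header row (the default for MMseqs2 convertalis output) and that the caller supplies the column names; merge_tsv and its columns argument are hypothetical, not PANORAMA's API.

```python
import csv
from pathlib import Path
from typing import List, Sequence


def merge_tsv(align_results: List[Path], outfile: Path, columns: Sequence[str]) -> int:
    """Concatenate headerless TSV files into `outfile`, writing `columns` once as header.

    Returns the number of data rows written. A missing input file raises
    FileNotFoundError from open().
    """
    n_rows = 0
    with open(outfile, "w", newline="") as out:
        writer = csv.writer(out, delimiter="\t", lineterminator="\n")
        writer.writerow(columns)
        for path in align_results:
            with open(path, newline="") as handle:
                for row in csv.reader(handle, delimiter="\t"):
                    if not row:  # skip blank lines
                        continue
                    if len(row) != len(columns):
                        raise ValueError(
                            f"{path}: expected {len(columns)} columns, got {len(row)}"
                        )
                    writer.writerow(row)
                    n_rows += 1
    return n_rows
```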

panorama.alignment.align.parser_align(parser: ArgumentParser) None#

Configure the argument parser for the alignment command.

This function adds all necessary command-line arguments for the alignment functionality, including required arguments, alignment modes, and optional parameters.

Parameters:

parser – ArgumentParser to configure with alignment arguments.

panorama.alignment.align.parser_mmseqs2_align(parser: ArgumentParser) _ArgumentGroup#

Add MMseqs2-specific arguments to the parser.

Parameters:

parser – ArgumentParser to add MMseqs2 arguments to.

Returns:

argparse._ArgumentGroup – The argument group containing MMseqs2 options.
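
Returning an argument group lets the caller attach further options to the same help section. The sketch below shows the pattern with argparse; the group title and the option names --align_identity and --align_coverage are assumptions inferred from the attribute names listed under check_align_parameters, not confirmed flags.

```python
import argparse


def add_mmseqs2_align_group(parser: argparse.ArgumentParser) -> argparse._ArgumentGroup:
    """Attach a hypothetical MMseqs2 option group to `parser` and return the group."""
    group = parser.add_argument_group("MMseqs2 arguments")
    # Option names and defaults are illustrative assumptions.
    group.add_argument("--align_identity", type=float, default=0.8,
                       help="Minimum sequence identity (0.0-1.0)")
    group.add_argument("--align_coverage", type=float, default=0.8,
                       help="Minimum coverage (0.0-1.0)")
    return group
```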

panorama.alignment.align.subparser(sub_parser: _SubParsersAction) ArgumentParser#

Create the argument subparser for the alignment command.

This function sets up the command-line interface for the alignment functionality within the PANORAMA tool suite.

Parameters:

sub_parser – The subparser object from argparse to add alignment command to.

Returns:

argparse.ArgumentParser – Configured parser for alignment command.

panorama.alignment.align.write_alignment(query_db: Path, target_db: Path, aln_db: Path, outfile: Path, threads: int = 1) None#

Convert MMseqs2 alignment database to human-readable format.

This function uses MMseqs2’s convertalis command to convert binary alignment results into a tab-separated values file with specified output format for downstream analysis.

Parameters:
  • query_db – Path to MMseqs2 query database.

  • target_db – Path to MMseqs2 target database.

  • aln_db – Path to MMseqs2 alignment results database.

  • outfile – Path where the converted alignment results will be written.

  • threads – Number of threads for the conversion process. Defaults to 1.

Raises:
  • AlignmentError – If alignment conversion fails.

  • FileNotFoundError – If input databases don’t exist.
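
The conversion step corresponds to an MMseqs2 convertalis call. The sketch below only assembles the argument list; build_convertalis_command is a hypothetical helper, and the field list in the usage note is an example of valid convertalis fields, since the actual ALIGN_FORMAT value is not shown in this reference.

```python
from pathlib import Path
from typing import List, Sequence


def build_convertalis_command(query_db: Path, target_db: Path, aln_db: Path,
                              outfile: Path, fields: Sequence[str],
                              threads: int = 1) -> List[str]:
    """Return an `mmseqs convertalis` argument list (hypothetical helper, nothing is executed)."""
    return [
        "mmseqs", "convertalis",
        str(query_db), str(target_db), str(aln_db), str(outfile),
        "--format-output", ",".join(fields),  # comma-separated output columns
        "--threads", str(threads),
    ]
```

For example, fields=["query", "target", "fident", "evalue", "bits"] requests five tab-separated columns per alignment.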

panorama.alignment.cluster module#

PANORAMA clustering module for pangenome gene family clustering.

This module provides functionality to cluster gene families across multiple pangenomes using MMseqs2 with support for both linclust (fast) and cluster (sensitive) methods. It handles sequence processing, clustering execution, and result formatting with comprehensive error handling and progress tracking.

class panorama.alignment.cluster.ClusteringConfig(CLUSTER_COLUMN_NAMES: ~typing.List[str] = <factory>, DEFAULT_THREADS: int = 1, DEFAULT_IDENTITY: float = 0.5, DEFAULT_COVERAGE: float = 0.8, DEFAULT_COV_MODE: int = 0, DEFAULT_EVAL: float = 0.001, DEFAULT_MMSEQS2_OPTIONS: ~typing.Dict[str, str | int | float] = <factory>, CLUSTERING_OUTPUT: str = 'clustering.tsv')#

Bases: object

Configuration constants for clustering operations.

CLUSTERING_OUTPUT: str = 'clustering.tsv'#
CLUSTER_COLUMN_NAMES: List[str]#
DEFAULT_COVERAGE: float = 0.8#
DEFAULT_COV_MODE: int = 0#
DEFAULT_EVAL: float = 0.001#
DEFAULT_IDENTITY: float = 0.5#
DEFAULT_MMSEQS2_OPTIONS: Dict[str, str | int | float]#
DEFAULT_THREADS: int = 1#
exception panorama.alignment.cluster.ClusteringError#

Bases: Exception

Custom exception for clustering-related errors.

class panorama.alignment.cluster.ClusteringMethod#

Bases: object

Names of the supported clustering methods and the list of valid choices.

CHOICES = ['linclust', 'cluster']#
CLUSTER = 'cluster'#
LINCLUST = 'linclust'#
exception panorama.alignment.cluster.ClusteringValidationError#

Bases: ClusteringError

Custom exception for clustering parameter validation errors.

panorama.alignment.cluster._execute_cluster(seq_db: Path, mmseqs2_opt: Dict[str, str | int | float], cluster_db: Path, tmpdir: Path, threads: int) Path#

Execute the actual cluster command.

Parameters:
  • seq_db – Sequence database path.

  • mmseqs2_opt – MMseqs2 options.

  • cluster_db – Cluster database path.

  • tmpdir – Temporary directory.

  • threads – Number of threads.

Returns:

Path – Path to the clustering database.

Raises:

ClusteringError – If clustering execution fails.

panorama.alignment.cluster._execute_clustering_command(cmd: List[str], method_name: str) float#

Execute MMseqs2 clustering command with timing and error handling.

Parameters:
  • cmd – Command to execute as a list of strings.

  • method_name – Name of the clustering method for logging.

Returns:

float – Execution time in seconds.

Raises:

ClusteringError – If command execution fails.
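
Timing an external command and converting its failure into a domain exception can be sketched as below. This is an illustration with the standard library only; run_timed and the local ClusteringError are stand-ins, and the real function's logging behaviour is not reproduced.

```python
import subprocess
import time
from typing import List


class ClusteringError(Exception):
    """Local stand-in for panorama.alignment.cluster.ClusteringError."""


def run_timed(cmd: List[str], method_name: str) -> float:
    """Run `cmd`, return the elapsed wall-clock seconds, raise ClusteringError on failure."""
    start = time.perf_counter()
    try:
        # check=True raises CalledProcessError on a non-zero exit code.
        subprocess.run(cmd, check=True, capture_output=True, text=True)
    except FileNotFoundError as error:
        raise ClusteringError(f"{method_name}: executable not found: {cmd[0]}") from error
    except subprocess.CalledProcessError as error:
        raise ClusteringError(
            f"{method_name} failed with exit code {error.returncode}: {error.stderr}"
        ) from error
    return time.perf_counter() - start
```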

panorama.alignment.cluster._execute_linclust(seq_db: Path, mmseqs2_opt: Dict[str, str | int | float], lclust_db: Path, tmpdir: Path, threads: int) Path#

Execute the actual linclust command.

Parameters:
  • seq_db – Sequence database path.

  • mmseqs2_opt – MMseqs2 options.

  • lclust_db – Linclust database path.

  • tmpdir – Temporary directory.

  • threads – Number of threads.

Returns:

Path – Path to the clustering database.

Raises:

ClusteringError – If clustering execution fails.

panorama.alignment.cluster._prepare_mmseqs2_options(args: Namespace) Dict[str, str | int | float]#

Prepare MMseqs2 options dictionary from command line arguments.

Parameters:

args – Command line arguments containing MMseqs2 parameters.

Returns:

Dict[str, Union[int, float, str]] – Dictionary of MMseqs2 options.

panorama.alignment.cluster._validate_clustering_parameters(threads: int, mmseqs2_options: Dict[str, str | int | float]) None#

Validate clustering parameters.

Parameters:
  • threads – Number of threads (positive integer).

  • mmseqs2_options – Dictionary containing MMseqs2 parameters.

Raises:

ClusteringValidationError – If any parameter is invalid.

panorama.alignment.cluster._validate_directory_access(directory: Path, create_if_missing: bool = False) None#

Validate directory exists and is accessible.

Parameters:
  • directory – Directory path to validate.

  • create_if_missing – Whether to create a directory if it doesn’t exist.

Raises:

ClusteringValidationError – If directory validation fails.

panorama.alignment.cluster.check_cluster_parameters(args: Namespace) None#

Validate command line arguments for clustering operations.

This function performs comprehensive validation of all clustering parameters provided via command line arguments, ensuring they meet the required constraints before proceeding with clustering operations.

Parameters:

args – Parsed command line arguments containing clustering parameters. Expected attributes: tmpdir and clustering-specific parameters.

Raises:
  • ClusteringValidationError – If any parameter validation fails.

  • NotADirectoryError – If tmpdir is not a valid directory.

panorama.alignment.cluster.check_pangenome_cluster(pangenome: Pangenome) None#

Validate that a pangenome is ready for clustering operations.

This function checks that the pangenome has been properly processed and contains the necessary data for gene family clustering operations.

Parameters:

pangenome – Pangenome object to validate. Must have clustered genes and associated gene family sequences.

Raises:

AttributeError – If pangenome is missing required data or processing steps.

panorama.alignment.cluster.cluster_gene_families(pangenomes: Pangenomes, method: str, mmseqs2_opt: Dict[str, str | int | float], tmpdir: Path, threads: int = 1, lock: Lock | None = None, disable_bar: bool = False) Path#

Cluster gene families from multiple pangenomes using MMseqs2.

This is the main function that orchestrates the complete clustering workflow: writing sequences, creating databases, performing clustering, and formatting results.

Parameters:
  • pangenomes – Pangenomes object containing multiple pangenome instances. All pangenomes must have clustered genes and family sequences.

  • method

    Clustering method to use. Must be “linclust” or “cluster”.
    • “linclust” is faster but less sensitive,

    • “cluster” is more sensitive but slower.

  • mmseqs2_opt – Dictionary containing MMseqs2 clustering parameters. Must include all required parameters for the chosen method.

  • tmpdir – Temporary directory for operations. If None, uses system temp.

  • threads – Number of threads for processing. Defaults to 1.

  • lock – Optional multiprocessing Lock for thread safety. If None, operations may not be thread-safe in multiprocessing contexts.

  • disable_bar – Whether to disable progress bars. Defaults to False.

Returns:

Path – Path to the final clustering results file in TSV format with columns: cluster_id, referent, in_clust.

Raises:
  • ClusteringError – If the clustering process fails.

  • ClusteringValidationError – If parameters are invalid.

  • ValueError – If the method is not “linclust” or “cluster”.

Notes

The temporary directory is assumed to have been validated already.
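
The method check and the "required parameters" constraint can be sketched together. The key lists are copied from the cluster_launcher and linclust_launcher parameter descriptions in this reference; the helper missing_options is hypothetical and not part of PANORAMA.

```python
from typing import Dict, List, Union

# Required keys as listed for linclust_launcher.
LINCLUST_KEYS = (
    "comp_bias_corr", "kmer_per_seq", "identity", "coverage", "cov_mode",
    "eval", "align_mode", "max_seq_len", "max_reject", "clust_mode",
)
# cluster_launcher lists three additional keys.
CLUSTER_KEYS = ("max_seqs", "min_ungapped", "sensitivity") + LINCLUST_KEYS

REQUIRED_KEYS = {"linclust": LINCLUST_KEYS, "cluster": CLUSTER_KEYS}


def missing_options(method: str, mmseqs2_opt: Dict[str, Union[str, int, float]]) -> List[str]:
    """Return the required keys absent from `mmseqs2_opt` for the given method."""
    if method not in REQUIRED_KEYS:
        raise ValueError(f"method must be 'linclust' or 'cluster', got {method!r}")
    return [key for key in REQUIRED_KEYS[method] if key not in mmseqs2_opt]
```

An empty result means the options dictionary is complete for the chosen method.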

panorama.alignment.cluster.cluster_launcher(seq_db: Path, mmseqs2_opt: Dict[str, str | int | float], cluster_db: Path | None = None, tmpdir: Path | None = None, threads: int = 1) Path#

Launch MMseqs2 cluster (sensitive clustering) on gene family sequences.

Cluster is a more sensitive but slower clustering method that provides better clustering quality through more thorough sequence comparison.

Parameters:
  • seq_db – Path to MMseqs2 sequence database containing gene families.

  • mmseqs2_opt – Dictionary containing MMseqs2 clustering parameters. Required keys: max_seqs, min_ungapped, comp_bias_corr, sensitivity, kmer_per_seq, identity, coverage, cov_mode, eval, align_mode, max_seq_len, max_reject, clust_mode.

  • cluster_db – Optional path for the clustering results database. If None, a temporary file will be created.

  • tmpdir – Temporary directory for MMseqs2 operations. If None, the system temp directory will be used.

  • threads – Number of threads for clustering. Defaults to 1.

Returns:

Path – Path to the MMseqs2 clustering results database.

Raises:
  • ClusteringError – If clustering execution fails.

  • ClusteringValidationError – If parameters are invalid.

  • FileNotFoundError – If the sequence database doesn’t exist.

panorama.alignment.cluster.create_tsv(db: Path, clust: Path, output: Path, threads: int = 1) None#

Convert MMseqs2 clustering database to TSV format.

This function uses MMseqs2’s createtsv command to convert binary clustering results into a human-readable tab-separated values file.

Parameters:
  • db – Path to the MMseqs2 sequence database used for clustering.

  • clust – Path to the MMseqs2 clustering results database.

  • output – Path where the TSV file will be written.

  • threads – Number of threads for conversion. Defaults to 1.

Raises:
  • ClusteringError – If TSV creation fails.

  • FileNotFoundError – If input databases don’t exist.

panorama.alignment.cluster.launch(args: Namespace) None#

Main entry point for clustering operations.

This function orchestrates the complete clustering workflow, including parameter validation, pangenome loading, clustering execution, and result processing based on the specified method and parameters.

Parameters:

args – Parsed command line arguments containing all configuration parameters for the clustering operation. Expected attributes: pangenomes, output, method, tmpdir, threads, keep_tmp, disable_prog_bar, force, and MMseqs2 parameters.

Raises:
  • ClusteringError – If any step of the clustering process fails.

  • ClusteringValidationError – If parameter validation fails.

  • NotADirectoryError – If specified directories are invalid.

panorama.alignment.cluster.linclust_launcher(seq_db: Path, mmseqs2_opt: Dict[str, str | int | float], lclust_db: Path | None = None, tmpdir: Path | None = None, threads: int = 1) Path#

Launch MMseqs2 linclust (fast clustering) on gene family sequences.

Linclust is a faster clustering method suitable for large datasets where speed is more important than sensitivity. Its runtime scales linearly with the number of input sequences.

Parameters:
  • seq_db – Path to MMseqs2 sequence database containing gene families.

  • mmseqs2_opt – Dictionary containing MMseqs2 clustering parameters. Required keys: comp_bias_corr, kmer_per_seq, identity, coverage, cov_mode, eval, align_mode, max_seq_len, max_reject, clust_mode.

  • lclust_db – Optional path for the clustering results database. If None, a temporary file will be created.

  • tmpdir – Temporary directory for MMseqs2 operations. If None, the system temp directory will be used.

  • threads – Number of threads for clustering. Defaults to 1.

Returns:

Path – Path to the MMseqs2 clustering results database.

Raises:
  • ClusteringError – If clustering execution fails.

  • ClusteringValidationError – If parameters are invalid.

  • FileNotFoundError – If the sequence database doesn’t exist.

panorama.alignment.cluster.parser_cluster(parser: ArgumentParser) None#

Configure the argument parser for the clustering command.

This function adds all necessary command-line arguments for the clustering functionality, including required arguments, clustering methods, and optional parameters.

Parameters:

parser – ArgumentParser to configure with clustering arguments.

panorama.alignment.cluster.parser_mmseqs2_cluster(parser: ArgumentParser) _ArgumentGroup#

Add MMseqs2-specific clustering arguments to the parser.

Parameters:

parser – ArgumentParser to add MMseqs2 arguments to.

Returns:

argparse._ArgumentGroup – The argument group containing MMseqs2 clustering options.

panorama.alignment.cluster.subparser(sub_parser: _SubParsersAction) ArgumentParser#

Create the argument subparser for the clustering command.

This function sets up the command-line interface for the clustering functionality within the PANORAMA tool suite.

Parameters:

sub_parser – The subparser object from argparse to add clustering command to.

Returns:

argparse.ArgumentParser – Configured parser for clustering command.

panorama.alignment.cluster.write_clustering(clust_res: Path, outfile: Path) None#

Process and write clustering results with proper cluster IDs.

This function reads raw MMseqs2 clustering results, assigns unique cluster IDs, and writes the processed results in a standardized format with proper headers.

Parameters:
  • clust_res – Path to the raw clustering results file from MMseqs2. Must be a TSV file with referent and member columns.

  • outfile – Path where the processed clustering results will be written. Directory must exist and be writable.

Raises:
  • ClusteringError – If processing or writing fails.

  • FileNotFoundError – If the input file doesn’t exist.
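
Assigning cluster IDs to a two-column referent/member file can be sketched as below. The output column names come from the cluster_gene_families return description (cluster_id, referent, in_clust); numbering clusters from 0 in order of first appearance is an assumption, and assign_cluster_ids is a hypothetical helper.

```python
import csv
from pathlib import Path
from typing import Dict


def assign_cluster_ids(clust_res: Path, outfile: Path) -> int:
    """Rewrite a referent/member TSV with a leading cluster_id column.

    Returns the number of distinct clusters found.
    """
    cluster_ids: Dict[str, int] = {}
    with open(clust_res, newline="") as src, open(outfile, "w", newline="") as dst:
        writer = csv.writer(dst, delimiter="\t", lineterminator="\n")
        writer.writerow(["cluster_id", "referent", "in_clust"])
        for row in csv.reader(src, delimiter="\t"):
            if not row:  # skip blank lines
                continue
            referent, member = row[0], row[1]
            # A referent seen for the first time gets the next free integer ID.
            cluster_id = cluster_ids.setdefault(referent, len(cluster_ids))
            writer.writerow([cluster_id, referent, member])
    return len(cluster_ids)
```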

panorama.alignment.utils module#

Module for creating MMseqs2 databases and processing pangenome family sequences.

This module provides functionality to create MMseqs2 sequence databases and write pangenome family sequences using multithreading for improved performance.

class panorama.alignment.utils.MMSeqsConfig(DEFAULT_DB_TYPE: int = 0, PROTEIN_FAMILIES_FILENAME: str = 'all_protein_families.faa.gz', FAMILY_FILTER_ALL: str = 'all')#

Bases: object

Configuration constants for MMseqs2 operations.

DEFAULT_DB_TYPE: int = 0#
FAMILY_FILTER_ALL: str = 'all'#
PROTEIN_FAMILIES_FILENAME: str = 'all_protein_families.faa.gz'#
exception panorama.alignment.utils.MMSeqsError#

Bases: Exception

Custom exception for MMseqs2 related errors.

exception panorama.alignment.utils.PangenomeProcessingError#

Bases: Exception

Custom exception for pangenome processing errors.

panorama.alignment.utils._validate_seq_files(seq_files: List[Path]) None#

Validate sequence files exist and are readable.

Parameters:

seq_files – List of sequence file paths to validate.

Raises:

PangenomeProcessingError – If any file is invalid or doesn’t exist.

panorama.alignment.utils.createdb(seq_files: List[Path], output: Path, db_name: str = 'mmseqs_db', db_type: int = 0) Path#

Create an MMseqs2 sequence database from the given FASTA files.

This function creates an MMseqs2 database by combining multiple sequence files into a single database that can be used for sequence searches and clustering.

Parameters:
  • seq_files – List of FASTA file paths to include in the database. All files must exist and be readable.

  • output – Temporary directory where the database will be created. Directory must exist and be writable.

  • db_name – Name of the MMseqs2 database. Defaults to 'mmseqs_db'.

  • db_type – Type of MMseqs2 database to create. Defaults to 0. MMseqs2’s createdb --dbtype option accepts 0 (auto-detect), 1 (amino acid), and 2 (nucleotide).

Returns:

Path – Path to the created MMseqs2 database file.

Raises:
  • PangenomeProcessingError – If input validation fails or file operations fail.

  • MMSeqsError – If MMseqs2 command execution fails.
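
Combining several FASTA files into one database maps onto a single MMseqs2 createdb call, which accepts multiple input files followed by the database path. The sketch below only builds the argument list; build_createdb_command is hypothetical, and placing the database at output / db_name is an assumption about how the two parameters combine.

```python
from pathlib import Path
from typing import List, Sequence, Tuple


def build_createdb_command(seq_files: Sequence[Path], output: Path,
                           db_name: str = "mmseqs_db",
                           db_type: int = 0) -> Tuple[List[str], Path]:
    """Return the `mmseqs createdb` argument list and the database path (nothing is executed)."""
    db_path = output / db_name  # assumed location of the new database
    cmd = [
        "mmseqs", "createdb",
        *(str(path) for path in seq_files),  # all input FASTA files
        str(db_path),
        "--dbtype", str(db_type),
    ]
    return cmd, db_path
```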

panorama.alignment.utils.write_pangenomes_families_sequences(pangenomes: Pangenomes, output: Path, threads: int = 1, lock: Lock | None = None, disable_bar: bool = False) Dict[str, Path]#

Write protein family sequences for multiple pangenomes using multithreading.

This function processes multiple pangenomes concurrently to extract and write protein family sequences. Each pangenome’s sequences are written to a separate subdirectory within the output directory.

Parameters:
  • pangenomes – Pangenomes object containing multiple pangenome instances. Must contain at least one pangenome.

  • output – Output directory where sequence files will be written. Directory will be created if it doesn’t exist.

  • threads – Number of worker threads to use for parallel processing. Defaults to 1. Must be a positive integer.

  • lock – Optional multiprocessing Lock object for thread-safe operations. If None, a new lock will be initialized for thread safety.

  • disable_bar – Whether to disable the progress bar display. Defaults to False (progress bar will be shown).

Returns:

Dict[str, Path] – Dictionary mapping pangenome names to their corresponding protein families sequence file paths.

Raises:

PangenomeProcessingError – If input validation fails or processing errors occur.

Note

  • Each pangenome’s sequences are written to output/{pangenome_name}/

  • Sequence names are prefixed with pangenome name to avoid duplicates

  • Files are compressed (.gz) to save space
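
The concurrent pattern described above can be sketched with concurrent.futures. The worker is passed in as a callable so the example runs without PANORAMA; write_all and its write_one argument are hypothetical, and the use of ThreadPoolExecutor is an assumption about the implementation.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path
from typing import Callable, Dict, Iterable, Tuple


def write_all(names: Iterable[str], output: Path,
              write_one: Callable[[str, Path], Tuple[str, Path]],
              threads: int = 1) -> Dict[str, Path]:
    """Run `write_one(name, output / name)` for each name in a thread pool.

    Returns a mapping from pangenome name to the path reported by the worker.
    """
    results: Dict[str, Path] = {}
    with ThreadPoolExecutor(max_workers=threads) as executor:
        futures = [executor.submit(write_one, name, output / name) for name in names]
        for future in as_completed(futures):
            # result() re-raises any exception raised inside the worker.
            name, path = future.result()
            results[name] = path
    return results
```

Having the worker return a (name, path) tuple, as write_protein_families_sequences does, lets results be collected in completion order without tracking which future belongs to which pangenome.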

panorama.alignment.utils.write_protein_families_sequences(pangenome: Pangenome, output_dir: Path) Tuple[str, Path]#

Write protein family sequences for a single pangenome.

This is a wrapper function designed for use in multithreading contexts.

Parameters:
  • pangenome – Pangenome object containing gene families data.

  • output_dir – Directory where sequences will be written.

Returns:

Tuple[str, Path] – Pangenome name and path to the generated sequences file.

Raises:

PangenomeProcessingError – If sequence writing fails.

Note

This function prefixes sequence names with pangenome name to ensure uniqueness across different pangenomes when combining sequences.
