panorama.annotate package#

Submodules#

panorama.annotate.annotate module#

panorama.annotate.annotate.annot_pangenomes(pangenomes, source=None, table=None, hmm=None, threads=1, k_best_hit=None, lock=None, force=False, disable_bar=False, **hmm_kwgs)#

Gene families annotation with HMM or TSV files for multiple pangenomes in multiprocessing.

Parameters:

pangenomes (Pangenomes) – Pangenomes object containing all the pangenomes to annotate.
source (str) – Name of the annotation source. Defaults to None.
table (Path) – Path to the metadata file for gene families annotation. Defaults to None.
hmm (Path) – Path to hmm list file. Defaults to None.
threads (int) – Number of available threads. Defaults to 1.
k_best_hit (int) – Number of best hits to keep. Defaults to None.
lock (Lock) – Lock for multiprocessing. Defaults to None.
force (bool) – Flag to allow force overwrite in pangenomes. Defaults to False.
disable_bar (bool) – Flag to disable progress bar. Defaults to False.
**hmm_kwgs (Any) – Arbitrary keyword arguments for hmm alignment.

Raises:

AssertionError – If neither HMM nor TSV are provided.

panorama.annotate.annotate.annot_pangenomes_with_hmm(pangenomes, hmm=None, source='', mode='fast', threads=1, disable_bar=False, **hmm_kwgs)#

Main function to add annotation to pangenomes using HMM profiles.

Parameters:

pangenomes (Pangenomes) – Pangenomes object containing all the pangenomes to annotate.
hmm (Path) – Path to hmm list file. Defaults to None.
source (str) – Name of the annotation source. Defaults to “”.
mode (str) – Which mode to use to annotate gene families with HMM. Defaults to “fast”.
threads (int) – Number of available threads. Defaults to 1.
disable_bar (bool) – Flag to disable progress bar. Defaults to False.
**hmm_kwgs – Arbitrary keyword arguments for HMM annotation.

Returns:

Dict[str, pd.DataFrame] – Dictionary mapping each pangenome to a dataframe containing the gene family metadata given by HMM.

Return type:

Dict[str, DataFrame]

panorama.annotate.annotate.check_annotate_args(args, silence_warning=False)#

Checks the provided arguments to ensure that they are valid.

Parameters:

args (Namespace) – The parsed arguments.
silence_warning (bool) – Flag to silence warning messages. Defaults to False. This option is used by the pansystems workflow to avoid unwanted warnings.

Returns:

Tuple[Dict[str, Any], Dict[str, Any]]
Two dictionaries containing necessary information and HMM keyword arguments.

Raises:

argparse.ArgumentError – If any required arguments are missing or invalid.

Return type:

Tuple[Dict[str, Any], Dict[str, Any]]

panorama.annotate.annotate.check_pangenome_annotation(pangenome, source, force=False)#

Check pangenome information before adding annotation.

Parameters:

pangenome (Pangenome) – Pangenome object that will be checked.
source (str) – Source of annotation to check if already in pangenome.
force (bool) – Flag to allow overwriting/erasing annotation. Defaults to False.

Raises:

KeyError – If a source with the same name already exists, and force is False.

panorama.annotate.annotate.get_k_best_hit(group, k_best_hit)#

Get the K best hits for a given group in a dataframe.

Parameters:

group – Dataframe group.
k_best_hit (int) – Number of best hits to keep.

Returns:

pd.DataFrame – K best hits per group.

Return type:

DataFrame

panorama.annotate.annotate.keep_best_hit(metadata, k_best_hit)#

Keep the k best hits for a given metadata dataframe.

Parameters:

metadata (DataFrame) – Metadata dataframe with multiple annotations for gene families.
k_best_hit (int) – Number of best hits to keep.

Returns:

pd.DataFrame – Filtered metadata dataframe with only the k best hits.

Return type:

DataFrame
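
As a rough illustration of keeping the k best hits per gene family, the sketch below uses plain Python records instead of a pandas DataFrame. The function name `keep_k_best`, the field names (`family`, `annotation`, `score`), and ranking by descending score alone are assumptions made for this example; they are not taken from PANORAMA's implementation.

```python
from collections import defaultdict
from typing import Dict, List


def keep_k_best(hits: List[dict], k: int) -> List[dict]:
    """Keep the k highest-scoring hits for each gene family (illustrative)."""
    by_family: Dict[str, List[dict]] = defaultdict(list)
    for hit in hits:
        by_family[hit["family"]].append(hit)

    kept: List[dict] = []
    for family in sorted(by_family):
        # Rank the hits of one family by descending score and keep the first k.
        ranked = sorted(by_family[family], key=lambda h: h["score"], reverse=True)
        kept.extend(ranked[:k])
    return kept


# Hypothetical hits: three for fam1, one for fam2.
hits = [
    {"family": "fam1", "annotation": "A", "score": 120.0},
    {"family": "fam1", "annotation": "B", "score": 95.5},
    {"family": "fam1", "annotation": "C", "score": 40.0},
    {"family": "fam2", "annotation": "A", "score": 60.0},
]
best = keep_k_best(hits, k=2)
```

With a DataFrame, the same idea would typically be expressed as a group-by on the family column followed by a per-group selection of the top rows.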

panorama.annotate.annotate.launch(args)#

Launch functions to annotate pangenomes

Parameters:

args (Namespace) – Arguments given in the CLI.

Return type:

None

panorama.annotate.annotate.parser_annot(parser)#

Add arguments to the parser for the annot command.

Parameters:

parser – Parser for the annot arguments.

panorama.annotate.annotate.parser_annot_hmm(parser)#

Add arguments to the parser for HMM annotation.

Parameters:

parser – Parser for the annot arguments.

panorama.annotate.annotate.read_families_metadata(pangenome, metadata)#

Read gene families metadata for one pangenome.

Parameters:

pangenome (Pangenome) – Pangenome object for which metadata will be associated.
metadata (Path) – Path to metadata file containing metadata to add to pangenome.

Returns:

Tuple[pd.DataFrame, str] – The metadata dataframe and the name of the pangenome.

Return type:

Tuple[DataFrame, str]

panorama.annotate.annotate.read_families_metadata_mp(pangenomes, table, threads=1, lock=None, disable_bar=False)#

Read gene families metadata for multiple pangenomes in multiprocessing.

Parameters:

pangenomes (Pangenomes) – Pangenomes object containing all the pangenomes to annotate.
table (Path) – Path to the metadata file for gene families.
threads (int) – Number of available threads. Defaults to 1.
lock (Lock) – Lock for multiprocessing execution. Defaults to None.
disable_bar (bool) – Flag to disable progress bar. Defaults to False.

Returns:

Dict[str, pd.DataFrame] – Dictionary with the metadata linked to pangenome by its name.

Return type:

Dict[str, DataFrame]
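
The `threads` and `lock` parameters suggest a worker pool in which a shared lock protects a common resource. The sketch below shows that general pattern with the standard library only. It assumes a thread pool and uses hypothetical helpers (`load_one`, `load_all`) with a placeholder instead of real metadata parsing; PANORAMA's actual executor and locking strategy are not documented here.

```python
from concurrent.futures import ThreadPoolExecutor
from threading import Lock
from typing import Dict, List, Tuple


def load_one(name: str, lock: Lock, log: List[str]) -> Tuple[str, int]:
    """Stand-in for reading one pangenome's metadata; returns (name, row count)."""
    n_rows = len(name)  # placeholder for the real parsing work
    with lock:
        # The lock serializes access to the shared resource (here, a log list).
        log.append(name)
    return name, n_rows


def load_all(names: List[str], threads: int = 1) -> Dict[str, int]:
    """Run load_one for every name in a pool and collect results by name."""
    lock = Lock()
    log: List[str] = []
    with ThreadPoolExecutor(max_workers=threads) as executor:
        results = executor.map(lambda n: load_one(n, lock, log), names)
        return dict(results)


loaded = load_all(["pangenome_A", "pangenome_B"], threads=2)
```

The returned dictionary is keyed by pangenome name, mirroring the `Dict[str, DataFrame]` shape described above.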

panorama.annotate.annotate.remove_redundant_annotation(metadata)#

Remove redundant annotation based on score, e-value, and bias.

Parameters:

metadata (DataFrame) – Metadata dataframe containing annotations.

Returns:

pd.DataFrame – Dataframe with redundant annotations removed.

Return type:

DataFrame
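
One plausible reading of "redundant" is several hits assigning the same annotation to the same gene family. The sketch below keeps a single hit per (family, annotation) pair, preferring higher score, then lower e-value, then lower bias. Both that definition of redundancy and the tie-breaking order are assumptions for illustration, as are the function and field names.

```python
from typing import Dict, List, Tuple


def drop_redundant(hits: List[dict]) -> List[dict]:
    """Keep one hit per (family, annotation) pair, preferring the best one."""
    best: Dict[Tuple[str, str], dict] = {}
    for hit in hits:
        key = (hit["family"], hit["annotation"])
        # Higher score wins; ties fall back to lower e-value, then lower bias.
        rank = (-hit["score"], hit["e_value"], hit["bias"])
        current = best.get(key)
        if current is None or rank < (-current["score"], current["e_value"], current["bias"]):
            best[key] = hit
    return list(best.values())


# Hypothetical hits: annotation "A" is reported twice for fam1.
hits = [
    {"family": "fam1", "annotation": "A", "score": 50.0, "e_value": 1e-10, "bias": 0.1},
    {"family": "fam1", "annotation": "A", "score": 80.0, "e_value": 1e-20, "bias": 0.3},
    {"family": "fam1", "annotation": "B", "score": 30.0, "e_value": 1e-5, "bias": 0.0},
]
unique = drop_redundant(hits)
```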

panorama.annotate.annotate.subparser(sub_parser)#

Subparser to launch PANORAMA Command line

Parameters:

sub_parser – Subparser for the annot command.

Returns:

argparse.ArgumentParser – Parser arguments for the annot command.

Return type:

ArgumentParser
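
For readers unfamiliar with the argparse subcommand pattern that `subparser`, `parser_annot`, and `launch` follow, here is a minimal standalone sketch. The function `add_annot_subparser` and the option names (`--source`, `--threads`, `--force`) are hypothetical and chosen only to echo the parameters documented on this page; they are not PANORAMA's actual command-line flags.

```python
import argparse


def add_annot_subparser(sub_parsers) -> argparse.ArgumentParser:
    """Register a hypothetical 'annot' subcommand and return its parser."""
    parser = sub_parsers.add_parser("annot", help="Annotate gene families")
    # Option names below are illustrative only.
    parser.add_argument("--source", required=True, help="Name of the annotation source")
    parser.add_argument("--threads", type=int, default=1, help="Number of available threads")
    parser.add_argument("--force", action="store_true", help="Overwrite existing annotations")
    return parser


main_parser = argparse.ArgumentParser(prog="demo")
sub_parsers = main_parser.add_subparsers(dest="command")
add_annot_subparser(sub_parsers)

# Parse an example command line; the resulting Namespace is what a launch-style
# function would receive.
args = main_parser.parse_args(["annot", "--source", "my_source", "--threads", "4"])
```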

panorama.annotate.annotate.write_annotations_to_pangenome(pangenome, metadata, source, k_best_hit=None, force=False, disable_bar=False)#

Write gene families annotation for one pangenome.

Parameters:

pangenome (Pangenome) – Pangenome linked to metadata.
metadata (DataFrame) – Metadata dataframe.
source (str) – Metadata source.
k_best_hit (int) – Number of best hits to keep. Defaults to None.
force (bool) – Boolean to allow force writing in pangenomes. Defaults to False.
disable_bar (bool) – Allow disabling the progress bar. Defaults to False.

panorama.annotate.annotate.write_annotations_to_pangenomes(pangenomes, pangenomes2metadata, source, k_best_hit=None, threads=1, lock=None, force=False, disable_bar=False)#

Write gene families annotation for multiple pangenomes in multiprocessing.

Parameters:

pangenomes (Pangenomes) – Pangenomes object containing all the pangenomes to annotate.
pangenomes2metadata (Dict[str, DataFrame]) – Dictionary mapping each pangenome to its associated metadata dataframe.
source (str) – Metadata source.
k_best_hit (int) – Number of best hits to keep. Defaults to None.
threads (int) – Number of available threads. Defaults to 1.
lock (Lock) – Lock for multiprocessing execution. Defaults to None.
force (bool) – Boolean to allow forced writing in pangenomes. Defaults to False.
disable_bar (bool) – Allow disabling the progress bar. Defaults to False.

panorama.annotate.hmm_search module#

panorama.annotate.hmm_search.annot_with_hmm(pangenome, hmms, meta=None, source='', mode='fast', msa=None, msa_format='afa', tblout=False, Z=4000, domtblout=False, pfamtblout=False, output=None, threads=1, tmp=None, disable_bar=False)#

Annotate a pangenome using a collection of HMM profiles and return best hits per family.

Supports three annotation modes:

- “fast”: Uses referent sequences or representative sequences.
- “sensitive”: Uses all genes in gene families with HMMScan/HMMSearch.
- “profile”: Not yet implemented; intended to use family alignments.

Parameters:

pangenome (Pangenome) – Pangenome object with gene families.
hmms (Dict[str, List[HMM]]) – Dictionary of annotation source → list of HMMs.
meta (DataFrame) – Optional metadata for the HMMs.
source (str) – Name of the annotation source.
mode (str) – Annotation mode to use (“fast”, “sensitive”, “profile”).
msa (Path) – Path to a file listing MSAs (only used in “profile” mode).
msa_format (str) – Format of MSAs if provided (default: “afa”).
tblout (bool) – If True, write per-sequence hits.
pfamtblout (bool) – If True, write Pfam-format hits.
Z (int) – Effective number of comparisons (default: 4000).
domtblout (bool) – If True, write per-domain hits.
output (Path) – Directory to write annotation results.
threads (int) – Number of threads for parallel processing.
tmp (Path) – Temporary directory for intermediate files.
disable_bar (bool) – If True, disable progress bars.

Returns:

pd.DataFrame – DataFrame containing best hit per gene or per family.

Raises:

ValueError – If the number of MSAs doesn’t match the number of gene families.
AssertionError – If output is required but not provided.
NotImplementedError – If “profile” mode is selected (not yet available).

Return type:

DataFrame
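
The three modes and the documented `NotImplementedError` can be pictured as a simple dispatch. The sketch below is a simplified stand-in, not the function's real body: `pick_sequences` is a hypothetical helper, and raising `ValueError` for an unrecognized mode is an assumption.

```python
def pick_sequences(mode: str) -> str:
    """Describe which sequences a given mode would use (illustrative only)."""
    if mode == "fast":
        return "one representative sequence per gene family"
    if mode == "sensitive":
        return "all gene sequences in each gene family"
    if mode == "profile":
        # Documented above as not yet available.
        raise NotImplementedError("profile mode is not yet available")
    raise ValueError(f"Unrecognized mode: {mode!r}")


fast_choice = pick_sequences("fast")

try:
    pick_sequences("profile")
    profile_available = True
except NotImplementedError:
    profile_available = False
```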

panorama.annotate.hmm_search.annot_with_hmmscan(hmms, gf_sequences, meta=None, Z=4000, threads=1, tmp=None, disable_bar=False)#

Annotate sequences by scanning them against HMM profiles using HMMER’s hmmscan.

Parameters:

hmms (Dict[str, List[HMM]]) – Dictionary of HMMs grouped by cutoff type.
gf_sequences (Union[SequenceFile, List[DigitalSequence]]) – Digital sequences to annotate.
meta (DataFrame) – Optional metadata for evaluating hit criteria.
Z (int) – Effective number of database comparisons. Default is 4000.
threads (int) – Number of threads to use. Default is 1.
tmp (Path) – Temporary directory for intermediate files. Default is system temp dir.
disable_bar (bool) – If True, disables the progress bar. Default is False.

Returns:

Tuple[List[Tuple[str, str, str, float, float, float, float, str, str]], List[TopHits]] –

Annotation results that pass filters.
All raw TopHits results from HMMER.

Return type:

Tuple[List[Tuple[str, str, str, float, float, float, float, str, str]], List[TopHits]]

panorama.annotate.hmm_search.annot_with_hmmsearch(hmms, gf_sequences, meta=None, Z=4000, threads=1, disable_bar=False)#

Annotate sequences by searching HMM profiles against a block of target sequences using HMMER’s hmmsearch.

Parameters:

hmms (Dict[str, List[HMM]]) – Dictionary of HMMs grouped by cutoff type.
gf_sequences (SequenceBlock) – Digital sequence block representing target sequences.
meta (DataFrame) – Optional metadata for evaluating hit criteria.
Z (int) – Effective number of database comparisons. Default is 4000.
threads (int) – Number of threads to use. Default is 1.
disable_bar (bool) – If True, disables the progress bar. Default is False.

Returns:

Tuple[List[Tuple[str, str, str, float, float, float, float, str, str]], List[TopHits]] –

Filtered annotation results.
All TopHits results returned by HMMER.

Return type:

Tuple[List[Tuple[str, str, str, float, float, float, float, str, str]], List[TopHits]]

panorama.annotate.hmm_search.assign_hit(hit, meta)#

Evaluate whether a hit from HMMER alignment meets filtering criteria.

Parameters:

hit (Hit) – HMMER hit object representing a match between a query and a model.
meta (DataFrame) – DataFrame containing annotation thresholds and metadata.

Returns:

Union[Tuple[str, str, str, float, float, float, float, str, str], None] – A tuple of annotation
information if the hit passes filters, otherwise None.

Return type:

Optional[Tuple[str, str, str, float, float, float, float, str, str]]
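
The "tuple if the hit passes, otherwise None" contract can be sketched as follows. `SimpleHit` is a hypothetical stand-in for a HMMER hit (it is not the pyhmmer `Hit` class), the thresholds are held in a plain dictionary rather than a DataFrame, and the threshold names `score_threshold` and `eval_threshold` are assumptions for this example. The returned tuple is also shorter than the nine-field tuple documented above.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class SimpleHit:
    """Hypothetical stand-in for a HMMER hit; not the pyhmmer Hit class."""
    target: str
    model: str
    score: float
    e_value: float


def assign(
    hit: SimpleHit, thresholds: Dict[str, Dict[str, float]]
) -> Optional[Tuple[str, str, float, float]]:
    """Return hit information if it passes the model's thresholds, else None."""
    limits = thresholds.get(hit.model, {})
    # Missing thresholds default to "no constraint".
    min_score = limits.get("score_threshold", float("-inf"))
    max_e_value = limits.get("eval_threshold", float("inf"))
    if hit.score >= min_score and hit.e_value <= max_e_value:
        return hit.target, hit.model, hit.score, hit.e_value
    return None


thresholds = {"hmm_A": {"score_threshold": 50.0, "eval_threshold": 1e-5}}
passing = assign(SimpleHit("fam1", "hmm_A", 72.0, 1e-12), thresholds)
failing = assign(SimpleHit("fam2", "hmm_A", 20.0, 1e-3), thresholds)
```

Callers can then drop the `None` results to obtain the filtered annotation list.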

panorama.annotate.hmm_search.digit_family_sequences(pangenome, disable_bar=False)#

Convert each gene family’s consensus or sequence into a digital format for HMM profile creation.

Parameters:

pangenome (Pangenome) – Pangenome object containing gene families.
disable_bar (bool) – If True, disables the progress bar. Default is False.

Returns:

Tuple[List[DigitalSequence], bool] – A list of digitalized sequences for each gene family,
and a boolean indicating whether the total size is below available system memory.

Return type:

Tuple[List[DigitalSequence], bool]

panorama.annotate.hmm_search.digit_gene_sequences(pangenome, threads=1, tmp=None, keep_tmp=False, disable_bar=False)#

Convert all gene sequences from the pangenome into a digital format for HMMER alignment.

Parameters:

pangenome (Pangenome) – Pangenome object containing annotated gene data.
threads (int) – Number of threads to use when exporting sequences. Default is 1.
tmp (Path) – Temporary directory where intermediate files will be written. If None, uses the system temp directory.
keep_tmp (bool) – Whether to keep temporary files after execution. Default is False.
disable_bar (bool) – If True, disables progress bars. Default is False.

Returns:

Tuple[SequenceFile, bool] – A tuple containing:

- The digitalized SequenceFile object for downstream HMMER processing.
- A boolean indicating whether the size of the sequence file is below 10% of available system memory.

Return type:

Tuple[SequenceFile, bool]
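
The boolean in the return value amounts to comparing a file size with a fraction of available memory. A minimal sketch of that comparison is shown below. The helper name is hypothetical, and the amount of available memory is passed in explicitly because how PANORAMA measures it is not stated here.

```python
import os
import tempfile


def below_memory_fraction(path: str, available_bytes: int, fraction: float = 0.1) -> bool:
    """True if the file at `path` is smaller than `fraction` of `available_bytes`."""
    return os.path.getsize(path) < fraction * available_bytes


# Write a tiny FASTA-like file (12 bytes) to test the check.
with tempfile.NamedTemporaryFile(suffix=".faa", delete=False) as handle:
    handle.write(b">gene_1\nMKT\n")
    fasta_path = handle.name

fits_large_budget = below_memory_fraction(fasta_path, available_bytes=8 * 1024**3)
fits_tiny_budget = below_memory_fraction(fasta_path, available_bytes=50)
os.remove(fasta_path)
```

A caller could use such a flag to decide between loading all sequences into memory at once and streaming them from disk.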

panorama.annotate.hmm_search.get_metadata_df(result, mode='fast', gene2family=None)#

Refactor HMM alignment results into a structured metadata DataFrame.

Handles basic cleaning and optionally joins gene-family metadata in “sensitive” mode to allow grouping by family.

Parameters:

result (List[Tuple[str, str, str, float, float, float, float, str, str]]) – List of raw alignment results from HMM search.
mode (str) – Annotation mode used (“fast” or “sensitive”).
gene2family (Dict[str, str]) – Required for “sensitive” mode. Maps gene IDs to family names.

Returns:

pd.DataFrame – Cleaned and optionally merged metadata DataFrame.

Return type:

DataFrame
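
To show why `gene2family` is needed in “sensitive” mode, the sketch below maps gene-level hits back to their gene families using plain Python records. It assumes that in “fast” mode the hit target already names a gene family, and it uses a simplified three-field tuple with hypothetical names rather than the nine-field result tuple documented above.

```python
from typing import Dict, List, Optional, Tuple


def to_records(
    results: List[Tuple[str, str, float]],
    mode: str = "fast",
    gene2family: Optional[Dict[str, str]] = None,
) -> List[dict]:
    """Turn raw (target, annotation, score) tuples into records with a family key."""
    if mode == "sensitive" and gene2family is None:
        raise ValueError("gene2family is required in sensitive mode")

    records = []
    for target, annotation, score in results:
        # Assumption: in "fast" mode the target already names a gene family,
        # while in "sensitive" mode it names a gene that must be mapped.
        family = gene2family[target] if mode == "sensitive" else target
        records.append({"family": family, "annotation": annotation, "score": score})
    return records


raw = [("gene_1", "hmm_A", 88.0), ("gene_2", "hmm_A", 75.0)]
records = to_records(raw, mode="sensitive", gene2family={"gene_1": "fam1", "gene_2": "fam1"})
```

Once every record carries a family key, hits can be grouped by family regardless of the mode that produced them.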

panorama.annotate.hmm_search.get_msa(pangenome, tmpdir, threads=1, disable_bar=False)#

Compute multiple sequence alignments (MSA) for all gene families in the pangenome.

Parameters:

pangenome (Pangenome) – Pangenome object containing gene family information.
tmpdir (Path) – Directory to store temporary MSA output files.
threads (int) – Number of threads to use for parallel execution. Default is 1.
disable_bar (bool) – If True, disables the progress bar. Default is False.

Returns:

pd.DataFrame – A DataFrame mapping each gene family ID to its corresponding MSA file path.

Return type:

DataFrame

panorama.annotate.hmm_search.profile_gf(gf, msa_path, msa_format='afa')#

Build an HMM profile for a single gene family using its MSA.

Parameters:

gf (GeneFamily) – Gene family object to be profiled.
msa_path (Path) – Path to the MSA file.
msa_format (str) – Format of the MSA file (e.g., “afa”). Default is “afa”.

Raises:

Exception – If the MSA file is unreadable or if building the HMM profile fails.

panorama.annotate.hmm_search.profile_gfs(pangenome, msa_df, msa_format='afa', threads=1, disable_bar=False)#

Generate HMM profiles for all gene families in the pangenome.

Parameters:

pangenome (Pangenome) – Pangenome object containing gene families.
msa_df (DataFrame) – DataFrame mapping gene family IDs to MSA file paths.
msa_format (str) – Format used to read MSA files. Default is “afa”.
threads (int) – Number of threads for parallel processing. Default is 1.
disable_bar (bool) – If True, disables the progress bar. Default is False.

panorama.annotate.hmm_search.read_hmms(hmm_db, disable_bar=False)#

Read a set of HMM files and categorize them based on available cutoffs.

Parameters:

hmm_db (Path) – Path to the tab-delimited file listing HMM metadata.
disable_bar (bool) – If True, disables the progress bar. Default is False.

Returns:

Tuple[Dict[str, List[HMM]], pd.DataFrame] – A dictionary categorizing HMMs by cutoff type
(gathering, trusted, noise, or None), and a DataFrame with metadata.

Raises:

Exception – If reading an HMM file fails unexpectedly.

Return type:

Tuple[Dict[str, List[HMM]], DataFrame]
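
Grouping profiles by the cutoffs they carry can be sketched as below. `ProfileInfo` is a hypothetical record rather than a pyhmmer `HMM` object, the precedence gathering, then trusted, then noise is an assumption, and the key `"other"` stands in for the "None" category mentioned above.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class ProfileInfo:
    """Hypothetical summary of one HMM profile and its optional cutoffs."""
    name: str
    gathering: Optional[float] = None
    trusted: Optional[float] = None
    noise: Optional[float] = None


def group_by_cutoff(profiles: List[ProfileInfo]) -> Dict[str, List[str]]:
    """Assign each profile to the first cutoff type it defines."""
    groups: Dict[str, List[str]] = {"gathering": [], "trusted": [], "noise": [], "other": []}
    for profile in profiles:
        if profile.gathering is not None:
            groups["gathering"].append(profile.name)
        elif profile.trusted is not None:
            groups["trusted"].append(profile.name)
        elif profile.noise is not None:
            groups["noise"].append(profile.name)
        else:
            # No cutoff available for this profile.
            groups["other"].append(profile.name)
    return groups


profiles = [
    ProfileInfo("hmm_A", gathering=25.0, trusted=30.0),
    ProfileInfo("hmm_B", noise=10.0),
    ProfileInfo("hmm_C"),
]
groups = group_by_cutoff(profiles)
```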

panorama.annotate.hmm_search.write_top_hits(all_top_hits, output, source, tblout=False, domtblout=False, pfamtblout=False, name='panorama', mode='fast')#

Write pyhmmer search hits to file in various tabular formats.

Depending on the flags provided, writes per-sequence (tbl), per-domain (domtbl), and/or Pfam-style (pfamtbl) formatted results.

Parameters:

all_top_hits (List[TopHits]) – List of pyhmmer TopHits objects.
output (Path) – Directory where output files will be written.
source (str) – Name of the annotation source (used in subfolder naming).
tblout (bool) – If True, write per-sequence hits (*.tbl).
domtblout (bool) – If True, write per-domain hits (*.domtbl).
pfamtblout (bool) – If True, write hits in Pfam format (*.pfamtbl).
name (str) – Name of the pangenome (used for folder structure).
mode (str) – Alignment mode used for the annotation (e.g., “fast”, “sensitive”).