panorama.annotate package#
Submodules#
panorama.annotate.annotate module#
- panorama.annotate.annotate.annot_pangenomes(pangenomes: Pangenomes, source: str = None, table: Path = None, hmm: Path = None, threads: int = 1, k_best_hit: int = None, lock: Lock = None, force: bool = False, disable_bar: bool = False, **hmm_kwgs: Any)#
Gene families annotation with HMM or TSV files for multiple pangenomes in multiprocessing.
- Parameters:
pangenomes (Pangenomes) – Pangenomes object containing all the pangenomes to annotate.
source (str, optional) – Name of the annotation source. Defaults to None.
table (Path, optional) – Path to the metadata file for gene families annotation. Defaults to None.
hmm (Path, optional) – Path to hmm list file. Defaults to None.
threads (int, optional) – Number of available threads. Defaults to 1.
k_best_hit (int, optional) – Number of best hits to keep. Defaults to None.
lock (Lock, optional) – Lock for multiprocessing. Defaults to None.
force (bool, optional) – Flag to allow force overwrite in pangenomes. Defaults to False.
disable_bar (bool, optional) – Flag to disable progress bar. Defaults to False.
**hmm_kwgs (Any) – Arbitrary keyword arguments for hmm alignment.
- Raises:
AssertionError – If neither HMM nor TSV are provided.
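The input requirement described under Raises can be sketched as follows. This is a minimal illustration, not PANORAMA's own code: `pick_annotation_input` is a hypothetical helper, and preferring the table when both inputs are given is an assumption made only for this sketch.

```python
from pathlib import Path
from typing import Optional


def pick_annotation_input(table: Optional[Path] = None, hmm: Optional[Path] = None) -> str:
    """Hypothetical helper: report which annotation input would be used."""
    # Documented behaviour: at least one of the two inputs must be provided.
    assert table is not None or hmm is not None, "Provide either a TSV table or an HMM list"
    # Assumption for this sketch only: the table wins when both are given.
    return "table" if table is not None else "hmm"
```

Calling `pick_annotation_input()` with no argument raises AssertionError, which matches the documented failure mode.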
- panorama.annotate.annotate.annot_pangenomes_with_hmm(pangenomes: Pangenomes, hmm: Path = None, source: str = '', mode: str = 'fast', threads: int = 1, disable_bar: bool = False, **hmm_kwgs) Dict[str, DataFrame]#
Main function to add annotation to pangenomes with HMM.
- Parameters:
pangenomes (Pangenomes) – Pangenomes object containing all the pangenomes to annotate.
hmm (Path, optional) – Path to hmm list file. Defaults to None.
source (str, optional) – Name of the annotation source. Defaults to “”.
mode (str, optional) – Which mode to use to annotate gene families with HMM. Defaults to “fast”.
threads (int, optional) – Number of available threads. Defaults to 1.
disable_bar (bool, optional) – Flag to disable progress bar. Defaults to False.
**hmm_kwgs – Arbitrary keyword arguments for HMM annotation.
- Returns:
Dict[str, pd.DataFrame] – Dictionary mapping each pangenome name to a dataframe containing the gene family metadata given by HMM.
- panorama.annotate.annotate.check_annotate_args(args: Namespace, silence_warning: bool = False) Tuple[Dict[str, Any], Dict[str, Any]]#
Checks the provided arguments to ensure that they are valid.
- Parameters:
args (argparse.Namespace) – The parsed arguments.
silence_warning (bool, optional) – Flag to silence warning messages. Defaults to False. This option is used by the pansystems workflow to avoid unwanted warnings.
- Returns:
Tuple[Dict[str, Any], Dict[str, Any]]
Two dictionaries containing necessary information and HMM keyword arguments.
- Raises:
argparse.ArgumentError – If any required arguments are missing or invalid.
- panorama.annotate.annotate.check_pangenome_annotation(pangenome: Pangenome, source: str, force: bool = False)#
Check pangenome information before adding annotation.
- Parameters:
pangenome (Pangenome) – Pangenome object that will be checked.
source (str) – Source of annotation to check if already in pangenome.
force (bool, optional) – Flag to allow overwriting/erasing annotation. Defaults to False.
- Raises:
KeyError – If a source with the same name already exists, and force is False.
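The check above can be sketched without a Pangenome object. In this minimal sketch, `check_source` and the `existing_sources` set are hypothetical stand-ins for the pangenome's recorded annotation sources. Erasing the old source when force is True is an assumption based on the description of the force flag.

```python
from typing import Set


def check_source(existing_sources: Set[str], source: str, force: bool = False) -> None:
    """Hypothetical stand-in for the source check described above."""
    if source in existing_sources and not force:
        # Documented behaviour: refuse to overwrite unless force is set.
        raise KeyError(f"Annotation source '{source}' already exists; use force to overwrite it")
    if source in existing_sources:
        # Assumption: with force=True the previous annotations are erased first.
        existing_sources.discard(source)
```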
- panorama.annotate.annotate.get_k_best_hit(group, k_best_hit: int) DataFrame#
Get the K best hits for a given group in a dataframe.
- Parameters:
group – Dataframe group.
k_best_hit (int) – Number of best hits to keep.
- Returns:
pd.DataFrame – K best hits per group.
- panorama.annotate.annotate.keep_best_hit(metadata: DataFrame, k_best_hit: int) DataFrame#
Keep the k best hits for each gene family in a metadata dataframe.
- Parameters:
metadata (pd.DataFrame) – Metadata dataframe with multiple annotations for gene families.
k_best_hit (int) – Number of best hits to keep.
- Returns:
pd.DataFrame – Filtered metadata dataframe with only the k best hits.
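The idea behind get_k_best_hit and keep_best_hit can be illustrated with plain dictionaries instead of a DataFrame, so the sketch runs without dependencies. The column names (`families`, `score`, `e_value`, `bias`) and the ranking order (higher score, then lower e-value, then lower bias) are assumptions for illustration. The source only states that the k best hits are kept per group.

```python
from collections import defaultdict
from typing import Dict, List


def keep_k_best(rows: List[dict], k: int) -> List[dict]:
    """Keep at most k rows per gene family, best hits first (illustrative sketch)."""
    by_family: Dict[str, List[dict]] = defaultdict(list)
    for row in rows:
        by_family[row["families"]].append(row)

    kept: List[dict] = []
    for hits in by_family.values():
        # Assumed ranking: higher score first, then lower e-value, then lower bias.
        hits.sort(key=lambda h: (-h["score"], h["e_value"], h["bias"]))
        kept.extend(hits[:k])
    return kept
```

With pandas, the same grouping step would typically be expressed as a groupby on the family column followed by a per-group selection.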
- panorama.annotate.annotate.launch(args: Namespace) None#
Launch functions to annotate pangenomes
- Parameters:
args (argparse.Namespace) – Arguments given on the command line.
- panorama.annotate.annotate.parser_annot(parser)#
Add argument to parser for annot command
- Parameters:
parser – parser for annot argument
- panorama.annotate.annotate.parser_annot_hmm(parser)#
Add argument to parser for HMM annotation
- Parameters:
parser – parser for annot argument
- panorama.annotate.annotate.read_families_metadata(pangenome: Pangenome, metadata: Path) Tuple[DataFrame, str]#
Read gene families metadata for one pangenome.
- Parameters:
pangenome (Pangenome) – Pangenome object for which metadata will be associated.
metadata (Path) – Path to metadata file containing metadata to add to pangenome.
- Returns:
Tuple[pd.DataFrame, str] – The metadata dataframe and the name of the pangenome.
- panorama.annotate.annotate.read_families_metadata_mp(pangenomes: Pangenomes, table: Path, threads: int = 1, lock: Lock = None, disable_bar: bool = False) Dict[str, DataFrame]#
Read gene families metadata for multiple pangenomes in multiprocessing.
- Parameters:
pangenomes (Pangenomes) – Pangenomes object containing all the pangenomes to annotate.
table (Path) – Path to the metadata file for gene families.
threads (int, optional) – Number of available threads. Defaults to 1.
lock (Lock, optional) – Lock for multiprocessing execution. Defaults to None.
disable_bar (bool, optional) – Flag to disable progress bar. Defaults to False.
- Returns:
Dict[str, pd.DataFrame] – Dictionary with the metadata linked to pangenome by its name.
- panorama.annotate.annotate.remove_redundant_annotation(metadata: DataFrame) DataFrame#
Remove redundant annotation based on score, e-value, and bias.
- Parameters:
metadata (pd.DataFrame) – Metadata dataframe containing annotations.
- Returns:
pd.DataFrame – Dataframe with redundant annotations removed.
- panorama.annotate.annotate.subparser(sub_parser) ArgumentParser#
Subparser to launch PANORAMA from the command line.
- Parameters:
sub_parser – sub_parser for annot command
- Returns:
argparse.ArgumentParser – parser arguments for annot command
- panorama.annotate.annotate.write_annotations_to_pangenome(pangenome: Pangenome, metadata: DataFrame, source: str, k_best_hit: int = None, force: bool = False, disable_bar: bool = False)#
Write gene families annotation for one pangenome.
- Parameters:
pangenome (Pangenome) – Pangenome linked to metadata.
metadata (pd.DataFrame) – Metadata dataframe.
source (str) – Metadata source.
k_best_hit (int, optional) – Number of best hits to keep. Defaults to None.
force (bool, optional) – Boolean to allow force writing in pangenomes. Defaults to False.
disable_bar (bool, optional) – Allow disabling the progress bar. Defaults to False.
- panorama.annotate.annotate.write_annotations_to_pangenomes(pangenomes: Pangenomes, pangenomes2metadata: Dict[str, DataFrame], source: str, k_best_hit: int = None, threads: int = 1, lock: Lock = None, force: bool = False, disable_bar: bool = False)#
Write gene families annotation for multiple pangenomes in multiprocessing.
- Parameters:
pangenomes (Pangenomes) – Pangenomes object containing all the pangenomes to annotate.
pangenomes2metadata (Dict[str, pd.DataFrame]) – Dictionary with, for each pangenome, the associated metadata dataframe.
source (str) – Metadata source.
k_best_hit (int, optional) – Number of best hits to keep. Defaults to None.
threads (int, optional) – Number of available threads. Defaults to 1.
lock (Lock, optional) – Lock for multiprocessing execution. Defaults to None.
force (bool, optional) – Flag to allow force writing in pangenomes. Defaults to False.
disable_bar (bool, optional) – Allow disabling the progress bar. Defaults to False.
panorama.annotate.hmm_search module#
- panorama.annotate.hmm_search.annot_with_hmm(pangenome: Pangenome, hmms: Dict[str, List[HMM]], meta: DataFrame = None, source: str = '', mode: str = 'fast', msa: Path = None, msa_format: str = 'afa', tblout: bool = False, Z: int = 4000, domtblout: bool = False, pfamtblout: bool = False, output: Path = None, threads: int = 1, tmp: Path = None, disable_bar: bool = False) DataFrame#
Annotate a pangenome using a collection of HMM profiles and return best hits per family.
Supports three annotation modes:
“fast”: Uses referent sequences or representative sequences.
“sensitive”: Uses all genes in gene families with HMMScan/HMMSearch.
“profile”: Not yet implemented; intended to use family alignments.
- Parameters:
pangenome (Pangenome) – Pangenome object with gene families.
hmms (Dict[str, List[HMM]]) – Dictionary of annotation source → list of HMMs.
meta (pd.DataFrame, optional) – Optional metadata for the HMMs.
source (str) – Name of the annotation source.
mode (str) – Annotation mode to use (“fast”, “sensitive”, “profile”).
msa (Path, optional) – Path to a file listing MSAs (only used in “profile” mode).
msa_format (str) – Format of MSAs if provided (default: “afa”).
tblout (bool) – If True, write per-sequence hits.
pfamtblout (bool) – If True, write Pfam-format hits.
Z (int) – Effective number of comparisons (default: 4000).
domtblout (bool) – If True, write per-domain hits.
output (Path, optional) – Directory to write annotation results.
threads (int) – Number of threads for parallel processing.
tmp (Path, optional) – Temporary directory for intermediate files.
disable_bar (bool) – If True, disable progress bars.
- Returns:
pd.DataFrame – DataFrame containing best hit per gene or per family.
- Raises:
ValueError – If the number of MSAs doesn’t match the number of gene families.
AssertionError – If output is required but not provided.
NotImplementedError – If “profile” mode is selected (not yet available).
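The three modes and the documented NotImplementedError can be summarised in a small dispatch sketch. `describe_mode` is a hypothetical helper that only returns a description of the sequences each mode would use. Raising ValueError for an unknown mode is an assumption, not documented behaviour.

```python
def describe_mode(mode: str) -> str:
    """Hypothetical helper summarising which sequences each mode aligns."""
    if mode == "fast":
        return "one referent or representative sequence per gene family"
    if mode == "sensitive":
        return "all gene sequences of every gene family"
    if mode == "profile":
        # Documented behaviour: profile mode is not available yet.
        raise NotImplementedError("profile mode is not available yet")
    # Assumption for this sketch: reject anything else explicitly.
    raise ValueError(f"Unknown annotation mode: {mode!r}")
```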
- panorama.annotate.hmm_search.annot_with_hmmscan(hmms: Dict[str, List[HMM]], gf_sequences: SequenceFile | List[DigitalSequence], meta: DataFrame = None, Z: int = 4000, threads: int = 1, tmp: Path = None, disable_bar: bool = False) Tuple[List[Tuple[str, str, str, float, float, float, float, str, str]], List[TopHits]]#
Annotate sequences by scanning them against HMM profiles using HMMER’s hmmscan.
- Parameters:
hmms (Dict[str, List[HMM]]) – Dictionary of HMMs grouped by cutoff type.
gf_sequences (Union[SequenceFile, List[DigitalSequence]]) – Digital sequences to annotate.
meta (pd.DataFrame, optional) – Optional metadata for evaluating hit criteria.
Z (int, optional) – Effective number of database comparisons. Default is 4000.
threads (int, optional) – Number of threads to use. Default is 1.
tmp (Path, optional) – Temporary directory for intermediate files. Default is system temp dir.
disable_bar (bool, optional) – If True, disables the progress bar. Default is False.
- Returns:
Tuple[List[Tuple[str, str, str, float, float, float, float, str, str]], List[TopHits]] – The annotation results that pass filters, and all raw TopHits results from HMMER.
- panorama.annotate.hmm_search.annot_with_hmmsearch(hmms: Dict[str, List[HMM]], gf_sequences: SequenceBlock, meta: DataFrame = None, Z: int = 4000, threads: int = 1, disable_bar: bool = False) Tuple[List[Tuple[str, str, str, float, float, float, float, str, str]], List[TopHits]]#
Annotate HMM profiles by searching them against a block of target sequences using HMMER’s hmmsearch.
- Parameters:
hmms (Dict[str, List[HMM]]) – Dictionary of HMMs grouped by cutoff type.
gf_sequences (SequenceBlock) – Digital sequence block representing target sequences.
meta (pd.DataFrame, optional) – Optional metadata for evaluating hit criteria.
Z (int, optional) – Effective number of database comparisons. Default is 4000.
threads (int, optional) – Number of threads to use. Default is 1.
disable_bar (bool, optional) – If True, disables the progress bar. Default is False.
- Returns:
Tuple[List[Tuple[str, str, str, float, float, float, float, str, str]], List[TopHits]] – The filtered annotation results, and all TopHits results returned by HMMER.
- panorama.annotate.hmm_search.assign_hit(hit: Hit, meta: DataFrame) Tuple[str, str, str, float, float, float, float, str, str] | None#
Evaluate whether a hit from HMMER alignment meets filtering criteria.
- Parameters:
hit (Hit) – HMMER hit object representing a match between a query and a model.
meta (pd.DataFrame) – DataFrame containing annotation thresholds and metadata.
- Returns:
Union[Tuple[str, str, str, float, float, float, float, str, str], None] – A tuple of annotation
information if the hit passes filters, otherwise None.
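The "tuple or None" contract of assign_hit can be illustrated with a simplified filter. Everything here is hypothetical: `FakeHit` is a stand-in rather than the pyhmmer Hit class, the threshold names `min_score` and `max_evalue` are invented for the sketch, and the returned tuple is shorter than the nine-field tuple in the real signature.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class FakeHit:
    """Hypothetical stand-in for a HMMER hit; not the pyhmmer Hit class."""
    target: str
    model: str
    score: float
    evalue: float


def assign_hit_sketch(
    hit: FakeHit, thresholds: Dict[str, Dict[str, float]]
) -> Optional[Tuple[str, str, float, float]]:
    """Return annotation information if the hit passes its model's thresholds."""
    limits = thresholds.get(hit.model, {})
    # A missing threshold means "no constraint" in this sketch.
    if hit.score < limits.get("min_score", float("-inf")):
        return None
    if hit.evalue > limits.get("max_evalue", float("inf")):
        return None
    return (hit.target, hit.model, hit.score, hit.evalue)
```

Callers can then drop the None results and collect the remaining tuples into a metadata table.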
- panorama.annotate.hmm_search.digit_family_sequences(pangenome: Pangenome, disable_bar: bool = False) Tuple[List[DigitalSequence], bool]#
Convert each gene family’s consensus or sequence into a digital format for HMM profile creation.
- Parameters:
pangenome (Pangenome) – Pangenome object containing gene families.
disable_bar (bool, optional) – If True, disables the progress bar. Default is False.
- Returns:
Tuple[List[DigitalSequence], bool] – A list of digitalized sequences for each gene family,
and a boolean indicating whether the total size is below available system memory.
- panorama.annotate.hmm_search.digit_gene_sequences(pangenome: Pangenome, threads: int = 1, tmp: Path = None, keep_tmp: bool = False, disable_bar: bool = False) Tuple[SequenceFile, bool]#
Convert all gene sequences from the pangenome into a digital format for HMMER alignment.
- Parameters:
pangenome (Pangenome) – Pangenome object containing annotated gene data.
threads (int, optional) – Number of threads to use when exporting sequences. Default is 1.
tmp (Path, optional) – Temporary directory where intermediate files will be written. If None, uses the system temp directory.
keep_tmp (bool, optional) – Whether to keep temporary files after execution. Default is False.
disable_bar (bool, optional) – If True, disables progress bars. Default is False.
- Returns:
Tuple[SequenceFile, bool] – A tuple containing the digitalized SequenceFile object for downstream HMMER processing, and a boolean indicating whether the size of the sequence file is below 10% of available system memory.
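The boolean in the return value corresponds to a simple size comparison. In this sketch, `fits_in_memory` is a hypothetical helper and the available memory is passed in as a parameter. How PANORAMA actually measures available memory is not stated here.

```python
def fits_in_memory(size_bytes: int, available_bytes: int, fraction: float = 0.1) -> bool:
    """Return True if the sequence data is below a fraction of available memory."""
    # The 10% default follows the description above; both sizes are in bytes.
    return size_bytes < available_bytes * fraction
```

A True result suggests the sequences can be loaded in memory at once, while False suggests reading them from the file instead.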
- panorama.annotate.hmm_search.get_metadata_df(result: List[Tuple[str, str, str, float, float, float, float, str, str]], mode: str = 'fast', gene2family: Dict[str, str] = None) DataFrame#
Refactor HMM alignment results into a structured metadata DataFrame.
Handles basic cleaning and optionally joins gene-family metadata in “sensitive” mode to allow grouping by family.
- Parameters:
result (List[Tuple]) – List of raw alignment results from HMM search.
mode (str) – Annotation mode used (“fast” or “sensitive”).
gene2family (Dict[str, str], optional) – Required for “sensitive” mode. Maps gene IDs to family names.
- Returns:
pd.DataFrame – Cleaned and optionally merged metadata DataFrame.
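In “sensitive” mode the hits are reported per gene, so the gene2family mapping is needed to group them by family. The sketch below shows that step with plain Python structures. The three-field result tuples and the output keys are simplifications chosen for illustration, not the real nine-field layout.

```python
from typing import Dict, List, Tuple


def attach_families(
    results: List[Tuple[str, str, float]], gene2family: Dict[str, str]
) -> List[dict]:
    """Add the gene family of each gene-level hit (illustrative sketch)."""
    rows = []
    for gene_id, annotation, score in results:
        rows.append(
            {
                "families": gene2family[gene_id],  # family owning this gene
                "gene": gene_id,
                "annotation": annotation,
                "score": score,
            }
        )
    return rows
```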
- panorama.annotate.hmm_search.get_msa(pangenome: Pangenome, tmpdir: Path, threads: int = 1, disable_bar: bool = False) DataFrame#
Compute multiple sequence alignments (MSA) for all gene families in the pangenome.
- Parameters:
pangenome (Pangenome) – Pangenome object containing gene family information.
tmpdir (Path) – Directory to store temporary MSA output files.
threads (int, optional) – Number of threads to use for parallel execution. Default is 1.
disable_bar (bool, optional) – If True, disables the progress bar. Default is False.
- Returns:
pd.DataFrame – A DataFrame mapping each gene family ID to its corresponding MSA file path.
- panorama.annotate.hmm_search.profile_gf(gf: GeneFamily, msa_path: Path, msa_format: str = 'afa')#
Build an HMM profile for a single gene family using its MSA.
- Parameters:
gf (GeneFamily) – Gene family object to be profiled.
msa_path (Path) – Path to the MSA file.
msa_format (str, optional) – Format of the MSA file (e.g., “afa”). Default is “afa”.
- Raises:
Exception – If the MSA file is unreadable or if building the HMM profile fails.
- panorama.annotate.hmm_search.profile_gfs(pangenome: Pangenome, msa_df: DataFrame, msa_format: str = 'afa', threads: int = 1, disable_bar: bool = False)#
Generate HMM profiles for all gene families in the pangenome.
- Parameters:
pangenome (Pangenome) – Pangenome object containing gene families.
msa_df (pd.DataFrame) – DataFrame mapping gene family IDs to MSA file paths.
msa_format (str, optional) – Format used to read MSA files. Default is “afa”.
threads (int, optional) – Number of threads for parallel processing. Default is 1.
disable_bar (bool, optional) – If True, disables the progress bar. Default is False.
- panorama.annotate.hmm_search.read_hmms(hmm_db: Path, disable_bar: bool = False) Tuple[Dict[str, List[HMM]], DataFrame]#
Read a set of HMM files and categorize them based on available cutoffs.
- Parameters:
hmm_db (Path) – Path to the tab-delimited file listing HMM metadata.
disable_bar (bool, optional) – If True, disables the progress bar. Default is False.
- Returns:
Tuple[Dict[str, List[HMM]], pd.DataFrame] – A dictionary categorizing HMMs by cutoff type
(gathering, trusted, noise, or None), and a DataFrame with metadata.
- Raises:
Exception – If reading an HMM file fails unexpectedly.
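The categorisation by cutoff type can be sketched with a stand-in class. `FakeHMM` is hypothetical and is not the pyhmmer HMM class. The priority order (gathering, then trusted, then noise) and the `"none"` label for models without cutoffs are assumptions for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class FakeHMM:
    """Hypothetical stand-in for an HMM profile with optional cutoffs."""
    name: str
    gathering: Optional[float] = None
    trusted: Optional[float] = None
    noise: Optional[float] = None


def group_by_cutoff(hmms: List[FakeHMM]) -> Dict[str, List[FakeHMM]]:
    """Group profiles by the first cutoff type they define."""
    groups: Dict[str, List[FakeHMM]] = defaultdict(list)
    for hmm in hmms:
        # Assumed priority: gathering, then trusted, then noise, else no cutoff.
        if hmm.gathering is not None:
            key = "gathering"
        elif hmm.trusted is not None:
            key = "trusted"
        elif hmm.noise is not None:
            key = "noise"
        else:
            key = "none"  # placeholder label for profiles without cutoffs
        groups[key].append(hmm)
    return dict(groups)
```

Grouping profiles this way lets each group be searched with the matching bit score cutoff option.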
- panorama.annotate.hmm_search.write_top_hits(all_top_hits: List[TopHits], output: Path, source: str, tblout: bool = False, domtblout: bool = False, pfamtblout: bool = False, name: str = 'panorama', mode: str = 'fast')#
Write pyhmmer search hits to file in various tabular formats.
Depending on the flags provided, writes per-sequence (tbl), per-domain (domtbl), and/or Pfam-style (pfamtbl) formatted results.
- Parameters:
all_top_hits (List[TopHits]) – List of pyhmmer TopHits objects.
output (Path) – Directory where output files will be written.
source (str) – Name of the annotation source (used in subfolder naming).
tblout (bool) – If True, write per-sequence hits (*.tbl).
domtblout (bool) – If True, write per-domain hits (*.domtbl).
pfamtblout (bool) – If True, write hits in Pfam format (*.pfamtbl).
name (str) – Name of the pangenome (used for folder structure).
mode (str) – Alignment mode used for the annotation (e.g., “fast”, “sensitive”).