Gene Family annotation#

The annotation command adds functional annotations to gene families in pangenomes. You can choose between:

A TSV file with metadata
A HMM database, searched with pyhmmer

Annotation Modes#

TSV-based annotation#

This mode injects gene family metadata from a .tsv file.

Expected format: A TSV file where each row lists a pangenome name and the path to its gene family annotation file.

These annotation files contain functional details (e.g., protein name, accession, score, etc.). The only mandatory column is families, which correspond to the gene families identifier. See metadata format PPanGGOLiN documentation, for more information.

HMM-based annotation#

To annotate with a HMM database, you must provide a HMM metadata file (TSV format), containing:

Column	Description	Type	Mandatory
name	The name of the HMM	string	True
accession	Identifier of the HMM	string	True
path	Path to the HMM file	string	True
length	Length of the profile. Automatically recover by pyhmmer if necessary	int	False
protein_name	Name of the protein/function corresponding to the HMM	string	True
secondary_name	Secondary name of the protein	string	True
score_threshold	Threshold used on the score to filter the profile	float	False
eval_threshold	Threshold used on the E-value to filter the profile	float	False
ieval_threshold	Threshold used on the iE-value to filter the profile	float	False
hmm_cov_threshold	Threshold used on the HMM covering to filter the profile	float	False
target_cov_threshold	Threshold used on the target covering to filter the profile	float	False
description	Description of the HMM, its protein function, or any other information	float	False

Warning

Not all the columns need to be filled with value as indicated by the mandatory column, but they should exist in the metadata file.

Tip

To keep all assignations possible between a profile and a gen family, you can let the threshold columns empty.

Note

You can generate the input files expected by PANORAMA using panorama utils --hmm.

To align gene families against a HMM database, you can use different modes:

Mode	Description
`fast`	Aligns representative sequences of each family to the HMMs
`profile`	Builds HMMs for each family from MSAs
`sensitive`	Aligns all genes from each family to the HMMs

Command Line Usage#

To annotate gene families with precomputed metadata, do as such:

panorama annotation \
  --pangenomes pangenomes.tsv \
  --source KEGG \
  --table annotations.tsv
  --threads 8

To annotate with a HMM database, do as such:

panorama annotation \
  --pangenomes pangenomes.tsv \
  --source defensefinder \
  --hmm hmms.tsv \
  --mode sensitive \
  --k_best_hit 3 \    # <-- or use the alias -b to keep only the best hit
  --save_hits tblout \
  --output results/ \
  --threads 8

Tip

More options are available to annotate with a HMM database. See below.

Note

Source name should not contain a special character. They could interfere with the .h5 writing.

Warning

You must provide either --table or --hmm, but not both. These options are mutually exclusive.

Key options#

Shortcut	Argument	Description
`-p`	`--pangenomes`	TSV file listing `.h5` pangenomes
`-s`	`--source`	Name of the annotation source (e.g. `KO2024`, `Pfam`)
—	`--table`	Mutually exclusive with `--hmm`. TSV linking pangenome names to annotation files
—	`--hmm`	Mutually exclusive with `--table`. HMM metadata TSV (from `panorama utils --hmm`)
—	`--mode`	Required with `--hmm`. Alignment strategy: `fast`, `profile`, or `sensitive`
—	`--msa`	(Used only in `profile` mode) TSV listing MSAs per gene family
`-b`	`--only_best_hit`	Equivalent to `--k_best_hit 1`
—	`--k_best_hit`	Keep up to `k` best hits per gene family
—	`--output`	Output directory for HMM result files (optional, used with `--save_hits`)
—	`--save_hits`	Save HMM alignment results in formats: `tblout`, `domtblout`, `pfamtblout`
—	`--tmp`	Temporary directory (used with HMM mode)
—	`--keep_tmp`	Keep temporary files after HMM alignment
—	`--Z`	Custom Z value for e-value scaling (advanced HMMER option)
—	`--msa-format`	Format of MSA files (default: `afa`) — rarely changed
—	`--threads`	Number of threads to use

Annotation Workflow#

Load pangenomes

Pangenomes are loaded from .h5 files. Only necessary information is retrieved based on the mode.
Retrieve annotations
- With --table: loads metadata from TSV
- With --hmm: aligns families via annot_with_hmm() from hmm_search.py
Filter HMM hits (only for the hmm option)

Each hit is filtered using the thresholds defined in the HMM metadata:
- e-value
- i-evalue
- score
- target coverage
- HMM coverage

Tip

Prefer to use the score instead of the e-value or the i-evalue to ensure reproducibility of the results even if the size of your targets changes.

Write annotations

Filtered annotations are stored in the .h5 files, under the given –source name.

Note

Annotations can be viewed or reused with PANORAMA, PPanGGOLiN, or custom tools (e.g., vitables).

HMM Search Details#

Annotation relies on the pyhmmer Python API.

Depending on sequence size, PANORAMA chooses the best method:

Method	Use case
hmmsearch	In-memory, fast
hmmscan	Streaming, used when memory is limited

Note

If sequences exceed 10% of available RAM, PANORAMA uses hmmscan, as recommended by pyhmmer documentation here

Minimal example#

Annotate gene families based on the reference sequence with COG HMM#

panorama annotation \
  -p pangenomes.tsv \
  -s COG \
  --hmm hmms.tsv \
  --mode fast \
  --only_best_hit   # <-- or use the alias: -b