Gene Family annotation#

The annotation command adds functional annotations to gene families in pangenomes. You can choose between:

  • A TSV file with metadata

  • An HMM database, searched with pyhmmer


Annotation Modes#

TSV-based annotation#

This mode injects gene family metadata from a .tsv file.

Expected format: A TSV file where each row lists a pangenome name and the path to its gene family annotation file.

These annotation files contain functional details (e.g., protein name, accession, score). The only mandatory column is families, which holds the gene family identifiers. See the metadata format section of the PPanGGOLiN documentation for more information.
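As a rough illustration of the two files involved, the sketch below writes one per-pangenome annotation file and the table that points to it. Only the families column comes from the description above. The other column names, the family identifiers, the file names, and the absence of a header line in the listing file are assumptions made for the example, so check the PPanGGOLiN metadata format before relying on them.

```python
import csv
import tempfile
from pathlib import Path


def write_tsv(path, rows):
    """Write rows (lists of strings) to a tab-separated file."""
    with open(path, "w", newline="") as handle:
        csv.writer(handle, delimiter="\t", lineterminator="\n").writerows(rows)


def build_example(workdir):
    """Create one annotation file and the table that lists it."""
    workdir = Path(workdir)
    annotation_file = workdir / "pangenome_A_annotations.tsv"
    # 'families' is the only mandatory column; the others are illustrative.
    write_tsv(annotation_file, [
        ["families", "protein_name", "accession", "score"],
        ["family_0001", "example protein A", "ACC0001", "312.5"],
        ["family_0002", "example protein B", "ACC0002", "540.1"],
    ])
    # One row per pangenome: its name, then the path to its annotation file.
    table = workdir / "annotations.tsv"
    write_tsv(table, [["pangenome_A", str(annotation_file)]])
    return table, annotation_file


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        table, annotation_file = build_example(tmp)
        print(table.read_text())
        print(annotation_file.read_text())
```

The resulting annotations.tsv is the kind of file passed to --table.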

HMM-based annotation#

To annotate with an HMM database, you must provide an HMM metadata file (TSV format) containing the following columns:

| Column | Description | Type | Mandatory |
|---|---|---|---|
| name | The name of the HMM | string | True |
| accession | Identifier of the HMM | string | True |
| path | Path to the HMM file | string | True |
| length | Length of the profile; recovered automatically by pyhmmer if missing | int | False |
| protein_name | Name of the protein/function corresponding to the HMM | string | True |
| secondary_name | Secondary name of the protein | string | True |
| score_threshold | Threshold used on the score to filter the profile | float | False |
| eval_threshold | Threshold used on the E-value to filter the profile | float | False |
| ieval_threshold | Threshold used on the iE-value to filter the profile | float | False |
| hmm_cov_threshold | Threshold used on the HMM coverage to filter the profile | float | False |
| target_cov_threshold | Threshold used on the target coverage to filter the profile | float | False |
| description | Description of the HMM, its protein function, or any other information | string | False |

Warning

Only the columns marked as mandatory need a value, but every column must be present in the metadata file.

Tip

To keep every possible assignment between a profile and a gene family, leave the threshold columns empty.

Note

You can generate the input files expected by PANORAMA using panorama utils --hmm.
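The rules above (every column present, only mandatory ones filled) can be checked before launching a run. The following is a minimal sketch, not part of PANORAMA: the function name check_hmm_metadata is hypothetical, the column list is copied from the table above, and it assumes the file starts with a header line naming the columns.

```python
import csv
import io

# Column names and mandatory flags as listed in the table above.
ALL_COLUMNS = [
    "name", "accession", "path", "length", "protein_name", "secondary_name",
    "score_threshold", "eval_threshold", "ieval_threshold",
    "hmm_cov_threshold", "target_cov_threshold", "description",
]
MANDATORY = ["name", "accession", "path", "protein_name", "secondary_name"]


def check_hmm_metadata(handle):
    """Return a list of problems found in an HMM metadata TSV (empty if none)."""
    reader = csv.DictReader(handle, delimiter="\t")
    header = reader.fieldnames or []
    missing = [column for column in ALL_COLUMNS if column not in header]
    if missing:
        return [f"missing columns: {', '.join(missing)}"]
    problems = []
    # Line 1 is the header, so data rows start at line 2.
    for line_no, row in enumerate(reader, start=2):
        for column in MANDATORY:
            if not (row[column] or "").strip():
                problems.append(f"line {line_no}: empty mandatory column '{column}'")
    return problems


if __name__ == "__main__":
    header = "\t".join(ALL_COLUMNS)
    # Length, thresholds, and description are left empty on purpose.
    row = "\t".join(["HMM_A", "ACC0001", "hmms/HMM_A.hmm", "",
                     "example protein A", "prot_A", "", "", "", "", "", ""])
    print(check_hmm_metadata(io.StringIO(f"{header}\n{row}\n")))
```

An empty list means the file satisfies both rules; leaving the threshold columns blank is accepted, in line with the tip above.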

To align gene families against an HMM database, choose one of the following modes:

| Mode | Description |
|---|---|
| fast | Aligns representative sequences of each family to the HMMs |
| profile | Builds HMMs for each family from MSAs |
| sensitive | Aligns all genes from each family to the HMMs |


Command Line Usage#

To annotate gene families with precomputed metadata, run:

panorama annotation \
  --pangenomes pangenomes.tsv \
  --source KEGG \
  --table annotations.tsv \
  --threads 8

To annotate with an HMM database, run:

# Keeps the 3 best hits per family; replace "--k_best_hit 3" with -b (--only_best_hit) to keep only the best hit
panorama annotation \
  --pangenomes pangenomes.tsv \
  --source defensefinder \
  --hmm hmms.tsv \
  --mode sensitive \
  --k_best_hit 3 \
  --save_hits tblout \
  --output results/ \
  --threads 8

Tip

More options are available to annotate with an HMM database. See below.

Note

The source name should not contain special characters, which could interfere with writing to the .h5 file.

Warning

You must provide either --table or --hmm, but not both. These options are mutually exclusive.
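To make these constraints concrete, here is a small argparse sketch that enforces them. It is not PANORAMA's actual parser: the program name, the parse helper, treating --pangenomes and --source as required, and reading "special character" as anything outside letters, digits, and underscore are all assumptions for illustration.

```python
import argparse
import re


def build_parser():
    """Hypothetical parser mirroring the constraints described above."""
    parser = argparse.ArgumentParser(prog="annotation-sketch")
    parser.add_argument("-p", "--pangenomes", required=True)
    parser.add_argument("-s", "--source", required=True)
    # Exactly one of --table / --hmm must be given.
    inputs = parser.add_mutually_exclusive_group(required=True)
    inputs.add_argument("--table")
    inputs.add_argument("--hmm")
    parser.add_argument("--mode", choices=["fast", "profile", "sensitive"])
    return parser


def parse(argv):
    """Parse argv and apply the checks argparse cannot express on its own."""
    parser = build_parser()
    args = parser.parse_args(argv)
    # Assumption: "special character" means anything outside [A-Za-z0-9_].
    if not re.fullmatch(r"[A-Za-z0-9_]+", args.source):
        parser.error(f"source name {args.source!r} contains special characters")
    # --mode is documented as required whenever --hmm is used.
    if args.hmm and args.mode is None:
        parser.error("--mode is required with --hmm")
    return args


if __name__ == "__main__":
    args = parse(["-p", "pangenomes.tsv", "-s", "KEGG", "--table", "annotations.tsv"])
    print(args.table, args.hmm)
```

Passing both --table and --hmm, or --hmm without --mode, makes the sketch exit with a usage error.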

Key options#

| Shortcut | Argument | Description |
|---|---|---|
| -p | --pangenomes | TSV file listing .h5 pangenomes |
| -s | --source | Name of the annotation source (e.g. KO2024, Pfam) |
| none | --table | Mutually exclusive with --hmm. TSV linking pangenome names to annotation files |
| none | --hmm | Mutually exclusive with --table. HMM metadata TSV (from panorama utils --hmm) |
| none | --mode | Required with --hmm. Alignment strategy: fast, profile, or sensitive |
| none | --msa | (Used only in profile mode) TSV listing MSAs per gene family |
| -b | --only_best_hit | Equivalent to --k_best_hit 1 |
| none | --k_best_hit | Keep up to k best hits per gene family |
| none | --output | Output directory for HMM result files (optional, used with --save_hits) |
| none | --save_hits | Save HMM alignment results in formats: tblout, domtblout, pfamtblout |
| none | --tmp | Temporary directory (used with HMM mode) |
| none | --keep_tmp | Keep temporary files after HMM alignment |
| none | --Z | Custom Z value for e-value scaling (advanced HMMER option) |
| none | --msa-format | Format of MSA files (default: afa); rarely changed |
| none | --threads | Number of threads to use |
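The relationship between --k_best_hit and --only_best_hit can be illustrated with a short sketch. This is not PANORAMA's code: the function name keep_k_best, the tuple layout, and the choice to rank hits by bit score are assumptions made for the example.

```python
from collections import defaultdict


def keep_k_best(hits, k=1):
    """Keep up to k highest-scoring hits per gene family.

    hits is an iterable of (family, hmm_name, score) tuples.
    Assumption: hits are ranked by score, highest first.
    """
    by_family = defaultdict(list)
    for family, hmm_name, score in hits:
        by_family[family].append((hmm_name, score))
    return {
        family: sorted(candidates, key=lambda hit: hit[1], reverse=True)[:k]
        for family, candidates in by_family.items()
    }


if __name__ == "__main__":
    hits = [
        ("fam1", "HMM_A", 50.0),
        ("fam1", "HMM_B", 80.0),
        ("fam1", "HMM_C", 20.0),
        ("fam2", "HMM_A", 10.0),
    ]
    print(keep_k_best(hits, k=2))  # comparable to --k_best_hit 2
    print(keep_k_best(hits, k=1))  # comparable to --only_best_hit
```

A family with fewer than k hits simply keeps all of them, which matches the "up to k" wording in the table.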

Annotation Workflow#

  1. Load pangenomes

    Pangenomes are loaded from .h5 files. Only necessary information is retrieved based on the mode.

  2. Retrieve annotations

    • With --table: loads metadata from TSV

    • With --hmm: aligns families via annot_with_hmm() from hmm_search.py

  3. Filter HMM hits (only for the hmm option)

    Each hit is filtered using the thresholds defined in the HMM metadata:

    • e-value

    • i-evalue

    • score

    • target coverage

    • HMM coverage

Tip

Prefer the score over the E-value or the iE-value. E-values depend on the size of the target set, so a score threshold keeps results reproducible even when the number of targets changes.

  4. Write annotations

    Filtered annotations are stored in the .h5 files, under the given --source name.

Note

Annotations can be viewed or reused with PANORAMA, PPanGGOLiN, or custom tools (e.g., vitables).
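The hit filtering in step 3 can be sketched as follows. This is an illustration, not the logic in hmm_search.py: the hit field names, the passes helper, and the comparison directions (score and coverages must be at least the threshold, E-values at most) are assumptions. An unset threshold is skipped, matching the advice to leave threshold columns empty to keep all assignments.

```python
import operator

# Threshold column (from the HMM metadata) -> (hit field, comparison to satisfy).
# The hit field names and comparison directions are assumptions for this sketch.
RULES = {
    "score_threshold": ("score", operator.ge),
    "eval_threshold": ("evalue", operator.le),
    "ieval_threshold": ("ievalue", operator.le),
    "hmm_cov_threshold": ("hmm_cov", operator.ge),
    "target_cov_threshold": ("target_cov", operator.ge),
}


def passes(hit, thresholds):
    """Return True if the hit satisfies every threshold that is set."""
    for column, (field, compare) in RULES.items():
        limit = thresholds.get(column)
        # A missing or empty threshold does not filter anything.
        if limit is not None and not compare(hit[field], limit):
            return False
    return True


if __name__ == "__main__":
    hit = {"score": 45.0, "evalue": 1e-10, "ievalue": 1e-9,
           "hmm_cov": 0.8, "target_cov": 0.6}
    print(passes(hit, {}))                                          # no thresholds set
    print(passes(hit, {"score_threshold": 50.0}))                   # score too low
    print(passes(hit, {"eval_threshold": 1e-5, "hmm_cov_threshold": 0.7}))
```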

HMM Search Details#

Annotation relies on the pyhmmer Python API.

Depending on sequence size, PANORAMA chooses the best method:

| Method | Use case |
|---|---|
| hmmsearch | In-memory, fast |
| hmmscan | Streaming, used when memory is limited |

Note

If the sequences exceed 10% of the available RAM, PANORAMA uses hmmscan, as recommended in the pyhmmer documentation.
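The 10% rule can be expressed as a small decision function. This is a sketch of the rule as stated above, not PANORAMA's implementation: the function name, its parameters, and passing the available memory in explicitly (rather than querying the system) are assumptions.

```python
def choose_search_method(sequence_bytes, available_ram_bytes, ram_fraction=0.10):
    """Pick 'hmmsearch' or 'hmmscan' following the 10% rule described above."""
    if available_ram_bytes <= 0:
        raise ValueError("available_ram_bytes must be positive")
    # Sequences too large to hold comfortably in memory: stream with hmmscan.
    if sequence_bytes > ram_fraction * available_ram_bytes:
        return "hmmscan"
    return "hmmsearch"


if __name__ == "__main__":
    GIB = 1024 ** 3
    MIB = 1024 ** 2
    print(choose_search_method(500 * MIB, 16 * GIB))  # well under 10% of RAM
    print(choose_search_method(4 * GIB, 16 * GIB))    # above 10% of RAM
```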

Minimal example#

Annotate gene families from their representative sequences with COG HMMs#

panorama annotation \
  -p pangenomes.tsv \
  -s COG \
  --hmm hmms.tsv \
  --mode fast \
  --only_best_hit   # <-- or use the alias: -b