Gene Family annotation#
The annotation command adds functional annotations to gene families in pangenomes.
You can choose between:
A TSV file with metadata
An HMM database, searched with pyhmmer
Annotation Modes#
TSV-based annotation#
This mode injects gene family metadata from a .tsv file.
Expected format: A TSV file where each row lists a pangenome name and the path to its gene family annotation file.
These annotation files contain functional details (e.g., protein name, accession, score, etc.).
The only mandatory column is families, which corresponds to the gene family identifiers.
See the PPanGGOLiN documentation on the metadata format for more information.
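As a quick sanity check before running the command, the following sketch verifies that an annotation file exposes the mandatory families column. It is only an illustration: the helper name `has_families_column` is hypothetical, the file is assumed to be tab-separated with a header row, and the sample columns other than families are placeholders.

```python
import csv
import io


def has_families_column(handle) -> bool:
    """Return True if the TSV header contains the mandatory 'families' column."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader, [])  # empty file -> empty header
    return "families" in header


# Illustrative content only: 'protein_name' and 'score' stand in for optional columns.
sample = io.StringIO("families\tprotein_name\tscore\nfam_1\tCas9\t120.5\n")
print(has_families_column(sample))
```

The same function can be applied to a real file with `open(path, newline="")` in place of the in-memory sample.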
HMM-based annotation#
To annotate with an HMM database, you must provide an HMM metadata file (TSV format) containing:
| Column | Description | Type | Mandatory |
|---|---|---|---|
| name | The name of the HMM | string | True |
| accession | Identifier of the HMM | string | True |
| path | Path to the HMM file | string | True |
| length | Length of the profile. Recovered automatically by pyhmmer if necessary | int | False |
| protein_name | Name of the protein/function corresponding to the HMM | string | True |
| secondary_name | Secondary name of the protein | string | True |
| score_threshold | Threshold used on the score to filter the profile | float | False |
| eval_threshold | Threshold used on the E-value to filter the profile | float | False |
| ieval_threshold | Threshold used on the iE-value to filter the profile | float | False |
| hmm_cov_threshold | Threshold used on the HMM coverage to filter the profile | float | False |
| target_cov_threshold | Threshold used on the target coverage to filter the profile | float | False |
| description | Description of the HMM, its protein function, or any other information | string | False |
Warning
Every column must exist in the metadata file, but only the columns marked as mandatory need a value; the others can be left empty.
Tip
To keep every possible assignment between a profile and a gene family, leave the threshold columns empty.
Note
You can generate the input files expected by PANORAMA using panorama utils --hmm.
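The rules above (all columns present, mandatory ones filled) can be checked with a short script before launching the annotation. This is a minimal sketch, not PANORAMA's own validation: the function name `check_hmm_metadata` is hypothetical, and the file is assumed to have a header row whose names match the column names listed above.

```python
import csv
import io

# Every column of the HMM metadata file; all must appear in the header.
ALL_COLUMNS = [
    "name", "accession", "path", "length", "protein_name", "secondary_name",
    "score_threshold", "eval_threshold", "ieval_threshold",
    "hmm_cov_threshold", "target_cov_threshold", "description",
]
# Columns documented as mandatory (a value is required on every row).
MANDATORY = ["name", "accession", "path", "protein_name", "secondary_name"]


def check_hmm_metadata(handle) -> list[str]:
    """Return a list of human-readable problems found in an HMM metadata TSV."""
    reader = csv.DictReader(handle, delimiter="\t")
    header = reader.fieldnames or []
    problems = [f"missing column: {col}" for col in ALL_COLUMNS if col not in header]
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        for col in MANDATORY:
            if col in header and not (row[col] or "").strip():
                problems.append(f"line {line_no}: empty mandatory value '{col}'")
    return problems


# Illustrative row: thresholds and description are left empty on purpose.
values = ["PF1", "PF00001", "hmm/PF1.hmm", "", "protA", "alt", "", "", "", "", "", ""]
good = "\t".join(ALL_COLUMNS) + "\n" + "\t".join(values) + "\n"
print(check_hmm_metadata(io.StringIO(good)))
```

An empty list means the file satisfies both rules; each string otherwise names the missing column or the offending line.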
To align gene families against an HMM database, you can use different modes:
| Mode | Description |
|---|---|
| fast | Aligns representative sequences of each family to the HMMs |
| | Builds HMMs for each family from MSAs |
| sensitive | Aligns all genes from each family to the HMMs |
Command Line Usage#
To annotate gene families with precomputed metadata, do as such:
panorama annotation \
--pangenomes pangenomes.tsv \
--source KEGG \
--table annotations.tsv \
--threads 8
To annotate with an HMM database, do as such:
# Keep up to 3 best hits; use --only_best_hit (alias -b) to keep only the best hit
panorama annotation \
--pangenomes pangenomes.tsv \
--source defensefinder \
--hmm hmms.tsv \
--mode sensitive \
--k_best_hit 3 \
--save_hits tblout \
--output results/ \
--threads 8
Tip
More options are available to annotate with an HMM database. See below.
Note
Source names should not contain special characters, which could interfere with writing to the .h5 file.
Warning
You must provide either --table or --hmm, but not both.
These options are mutually exclusive.
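For readers scripting around the command, the "exactly one of --table or --hmm" rule can be illustrated with Python's argparse. This is a standalone sketch of the constraint, not PANORAMA's actual parser: the program name `annotation-sketch` and the function `build_parser` are hypothetical.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a toy parser that accepts exactly one of --table or --hmm."""
    parser = argparse.ArgumentParser(prog="annotation-sketch")
    # required=True rejects a call with neither option;
    # the group itself rejects a call with both.
    source = parser.add_mutually_exclusive_group(required=True)
    source.add_argument("--table", help="TSV with precomputed annotations")
    source.add_argument("--hmm", help="HMM metadata TSV")
    return parser


args = build_parser().parse_args(["--hmm", "hmms.tsv"])
print(args.hmm, args.table)
```

Passing both options, or neither, makes argparse exit with a usage error, which mirrors the behaviour described in the warning.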
Key options#
| Shortcut | Argument | Description |
|---|---|---|
| -p | --pangenomes | TSV file listing … |
| -s | --source | Name of the annotation source (e.g. …) |
| - | --table | Mutually exclusive with --hmm |
| - | --hmm | Mutually exclusive with --table |
| - | --mode | Required with --hmm |
| - | | (Used only in …) |
| -b | --only_best_hit | Equivalent to … |
| - | --k_best_hit | Keep up to … |
| - | --output | Output directory for HMM result files (optional, used with …) |
| - | --save_hits | Save HMM alignment results in formats: … |
| - | | Temporary directory (used with HMM mode) |
| - | | Keep temporary files after HMM alignment |
| - | | Custom Z value for e-value scaling (advanced HMMER option) |
| - | | Format of MSA files (default: …) |
| | --threads | Number of threads to use |
Annotation Workflow#
Load pangenomes
Pangenomes are loaded from .h5 files. Only the information required by the selected mode is retrieved.
Retrieve annotations
With --table: loads metadata from the TSV file
With --hmm: aligns families via annot_with_hmm() from hmm_search.py
Filter HMM hits (only for the hmm option)
Each hit is filtered using the thresholds defined in the HMM metadata:
e-value
i-evalue
score
target coverage
HMM coverage
Tip
Prefer the score over the e-value or the i-evalue: e-values depend on the number of targets searched, so filtering on the score keeps results reproducible even if the size of your targets changes.
Write annotations
Filtered annotations are stored in the .h5 files, under the given --source name.
Note
Annotations can be viewed or reused with PANORAMA, PPanGGOLiN, or custom tools (e.g., vitables).
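The hit-filtering step of the workflow can be sketched as follows. This is an illustration under stated assumptions rather than PANORAMA's implementation: the class and function names are hypothetical, coverages are assumed to be fractions between 0 and 1, e-values are kept when at or below their threshold, and scores and coverages are kept when at or above theirs. An unset threshold (None, i.e. an empty column in the HMM metadata) disables that filter.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Hit:
    """One HMM hit; field names are hypothetical."""
    evalue: float
    ievalue: float
    score: float
    target_cov: float  # fraction of the target covered, assumed in [0, 1]
    hmm_cov: float     # fraction of the HMM covered, assumed in [0, 1]


@dataclass
class Thresholds:
    """Per-HMM thresholds; None means the column was left empty."""
    eval_threshold: Optional[float] = None
    ieval_threshold: Optional[float] = None
    score_threshold: Optional[float] = None
    target_cov_threshold: Optional[float] = None
    hmm_cov_threshold: Optional[float] = None


def passes(hit: Hit, thr: Thresholds) -> bool:
    """Return True if the hit satisfies every threshold that is set."""
    # (threshold, observed value, True if larger values are better)
    checks = [
        (thr.eval_threshold, hit.evalue, False),
        (thr.ieval_threshold, hit.ievalue, False),
        (thr.score_threshold, hit.score, True),
        (thr.target_cov_threshold, hit.target_cov, True),
        (thr.hmm_cov_threshold, hit.hmm_cov, True),
    ]
    for threshold, value, higher_is_better in checks:
        if threshold is None:
            continue  # empty threshold: no filtering on this criterion
        if higher_is_better and value < threshold:
            return False
        if not higher_is_better and value > threshold:
            return False
    return True


hit = Hit(evalue=1e-10, ievalue=1e-8, score=85.0, target_cov=0.9, hmm_cov=0.7)
print(passes(hit, Thresholds()))                       # no thresholds set
print(passes(hit, Thresholds(score_threshold=100.0)))  # score too low
```

With no thresholds set, every hit is kept, which matches the tip about leaving the threshold columns empty.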
HMM Search Details#
Annotation relies on the pyhmmer Python API.
Depending on sequence size, PANORAMA chooses the best method:
| Method | Use case |
|---|---|
| hmmsearch | In-memory, fast |
| hmmscan | Streaming, used when memory is limited |
Note
If sequences exceed 10% of available RAM, PANORAMA uses hmmscan, as recommended by the pyhmmer documentation.
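The 10% rule described in the note can be expressed as a small decision function. This is a sketch of the rule only, not PANORAMA's code: the function name `choose_search_method` and its parameters are hypothetical, and both sizes are assumed to be given in bytes.

```python
def choose_search_method(sequence_bytes: int, available_bytes: int,
                         max_fraction: float = 0.1) -> str:
    """Pick a search method from the sequence size and the available memory.

    Returns "hmmscan" when the sequences exceed `max_fraction` of the
    available memory, and "hmmsearch" otherwise.
    """
    if sequence_bytes > available_bytes * max_fraction:
        return "hmmscan"
    return "hmmsearch"


# Example with 8 GiB of available memory (value chosen for illustration).
# On a real system the available memory could be read with the third-party
# psutil package: psutil.virtual_memory().available
available = 8 * 1024**3
print(choose_search_method(200 * 1024**2, available))  # 200 MiB of sequences
print(choose_search_method(2 * 1024**3, available))    # 2 GiB of sequences
```

In the first call the sequences stay under the 10% limit, so the in-memory hmmsearch path is selected; in the second they exceed it, so hmmscan is selected.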
Minimal example#
Annotate gene families based on the reference sequence with COG HMM#
panorama annotation \
-p pangenomes.tsv \
-s COG \
--hmm hmms.tsv \
--mode fast \
--only_best_hit # <-- or use the alias: -b