(gene-family-annotation)=
# Gene Family annotation

The `annotation` command adds **functional annotations** to gene families in pangenomes.
You can choose between:

- A **TSV file** with metadata
- A **HMM database**, searched with [pyhmmer](https://pyhmmer.readthedocs.io/en/stable/index.html)

---

## Annotation Modes

### TSV-based annotation

This mode injects gene family metadata from a `.tsv` file.

**Expected format**:
A TSV file where each row lists a pangenome name and the path to its gene family annotation file.

These annotation files contain functional details (e.g., protein name, accession, score, etc.).
The only mandatory column is `families`, which correspond to the gene families identifier.
See [metadata format](https://ppanggolin.readthedocs.io/en/latest/user/metadata.html#metadata-format) PPanGGOLiN
documentation, for more information.

### HMM-based annotation

To annotate with a HMM database, you must provide a HMM metadata file (TSV format), containing:

| Column               | Description                                                            | Type   | Mandatory |
|----------------------|------------------------------------------------------------------------|--------|-----------|
| name                 | The name of the HMM                                                    | string | True      |     
| accession            | Identifier of the HMM                                                  | string | True      |
| path                 | Path to the HMM file                                                   | string | True      |    
| length               | Length of the profile. Automatically recover by pyhmmer if necessary   | int    | False     |
| protein_name         | Name of the protein/function corresponding to the HMM                  | string | True      |
| secondary_name       | Secondary name of the protein                                          | string | True      |
| score_threshold      | Threshold used on the score to filter the profile                      | float  | False     |
| eval_threshold       | Threshold used on the E-value to filter the profile                    | float  | False     |
| ieval_threshold      | Threshold used on the iE-value to filter the profile                   | float  | False     |
| hmm_cov_threshold    | Threshold used on the HMM covering to filter the profile               | float  | False     |
| target_cov_threshold | Threshold used on the target covering to filter the profile            | float  | False     |
| description          | Description of the HMM, its protein function, or any other information | float  | False     |

```{warning}
Not all the columns need to be filled with value as indicated by the mandatory column, but they should exist in the metadata file.
```

```{tip}
To keep all assignations possible between a profile and a gen family, you can let the threshold columns empty.
```

```{note}
You can generate the input files expected by PANORAMA using `panorama utils --hmm`.

[//]: # (TODO Ajouter le lien vers la documentation quand écrit)
```

To align gene families against a HMM database, you can use different modes:

| Mode        | Description                                                |
|-------------|------------------------------------------------------------|
| `fast`      | Aligns representative sequences of each family to the HMMs |
| `profile`   | Builds HMMs for each family from MSAs                      |
| `sensitive` | Aligns **all genes** from each family to the HMMs          |

---

## Command Line Usage

To annotate gene families with precomputed metadata, do as such:

```bash
panorama annotation \
  --pangenomes pangenomes.tsv \
  --source KEGG \
  --table annotations.tsv
  --threads 8
```

To annotate with a HMM database, do as such:

```bash
panorama annotation \
  --pangenomes pangenomes.tsv \
  --source defensefinder \
  --hmm hmms.tsv \
  --mode sensitive \
  --k_best_hit 3 \    # <-- or use the alias -b to keep only the best hit
  --save_hits tblout \
  --output results/ \
  --threads 8
```

```{tip}
More options are available to annotate with a HMM database. See below.
```

```{note}
Source name should not contain a special character. They could interfere with the `.h5` writing.
```

```{warning}
You **must provide either** `--table` **or** `--hmm`, **but not both**. 
These options are mutually exclusive.
```

### Key options

| Shortcut | Argument          | Description                                                                           |
|----------|-------------------|---------------------------------------------------------------------------------------|
| `-p`     | `--pangenomes`    | TSV file listing `.h5` pangenomes                                                     |
| `-s`     | `--source`        | Name of the annotation source (e.g. `KO2024`, `Pfam`)                                 |
| —        | `--table`         | **Mutually exclusive with `--hmm`**. TSV linking pangenome names to annotation files  |
| —        | `--hmm`           | **Mutually exclusive with `--table`**. HMM metadata TSV (from `panorama utils --hmm`) |
| —        | `--mode`          | Required with `--hmm`. Alignment strategy: `fast`, `profile`, or `sensitive`          |
| —        | `--msa`           | (Used only in `profile` mode) TSV listing MSAs per gene family                        |
| `-b`     | `--only_best_hit` | Equivalent to `--k_best_hit 1`                                                        |
| —        | `--k_best_hit`    | Keep up to `k` best hits per gene family                                              |
| —        | `--output`        | Output directory for HMM result files (optional, used with `--save_hits`)             |
| —        | `--save_hits`     | Save HMM alignment results in formats: `tblout`, `domtblout`, `pfamtblout`            |
| —        | `--tmp`           | Temporary directory (used with HMM mode)                                              |
| —        | `--keep_tmp`      | Keep temporary files after HMM alignment                                              |
| —        | `--Z`             | Custom Z value for e-value scaling (advanced HMMER option)                            |
| —        | `--msa-format`    | Format of MSA files (default: `afa`) — rarely changed                                 |
| —        | `--threads`       | Number of threads to use                                                              |

## Annotation Workflow

1. Load pangenomes

   Pangenomes are loaded from .h5 files. Only necessary information is retrieved based on the mode.

2. Retrieve annotations

    - With `--table`: loads metadata from TSV

    - With `--hmm`: aligns families via annot_with_hmm() from hmm_search.py

3. Filter HMM hits (only for the hmm option)

   Each hit is filtered using the thresholds defined in the HMM metadata:
    - e-value
    - i-evalue
    - score
    - target coverage
    - HMM coverage

```{tip}
Prefer to use the score instead of the e-value or the i-evalue to ensure reproducibility of the results even if the size of your targets changes.
```

4. Write annotations

   Filtered annotations are stored in the .h5 files, under the given --source name.

```{note}
Annotations can be viewed or reused with PANORAMA, PPanGGOLiN, or custom tools (*e.g.*, [vitables](https://vitables.org/index.html)).
```

## HMM Search Details

Annotation relies on the [pyhmmer](https://pyhmmer.readthedocs.io/en/stable/index.html) Python API.

Depending on sequence size, PANORAMA chooses the best method:

| Method    | Use case                               |
|-----------|----------------------------------------|
| hmmsearch | In-memory, fast                        |
| hmmscan   | Streaming, used when memory is limited |

```{note}
If sequences exceed 10% of available RAM, PANORAMA uses `hmmscan`, as recommended by pyhmmer documentation 
[here](https://pyhmmer.readthedocs.io/en/stable/examples/performance_tips.html#Performance-tips-and-tricks)
```

## Minimal example

### Annotate gene families based on the reference sequence with COG HMM

```bash
panorama annotation \
  -p pangenomes.tsv \
  -s COG \
  --hmm hmms.tsv \
  --mode fast \
  --only_best_hit   # <-- or use the alias: -b
```