panorama.format package

Submodules

panorama.format.read_binaries module

Pangenome Data Reader Module

This module provides comprehensive functions to read and load pangenome data from HDF5 files. It supports parallel loading of multiple pangenomes with various data components, including annotations, gene families, systems, spots, modules, and metadata.

The module is designed to work with the PPanGGOLiN pangenome analysis toolkit and provides robust error handling and progress tracking for large-scale pangenome analyses.

panorama.format.read_binaries.check_pangenome_info(pangenome, need_families_info=False, need_families_sequences=False, need_systems=False, models=None, systems_sources=None, read_canonical=False, disable_bar=False, **kwargs)

Determine and load required pangenome information based on analysis needs.

This function analyzes what information is needed and automatically loads the required components from the pangenome file.

Parameters:

pangenome (Pangenome) – Target pangenome object.
need_families_info (bool) – Whether gene family partition info is needed. Defaults to False.
need_families_sequences (bool) – Whether gene family sequences are needed. Defaults to False.
need_systems (bool) – Whether biological systems are needed. Defaults to False.
models (Optional[List[Models]]) – Model definitions for systems. Required if need_systems=True.
systems_sources (Optional[List[str]]) – System source identifiers. Required if need_systems=True.
read_canonical (bool) – Whether to read canonical system representations. Defaults to False.
disable_bar (bool) – Whether to disable progress bars. Defaults to False.
**kwargs – Additional parameters

Raises:

AssertionError – If systems are requested but required parameters are missing.
ValueError – If gene families info/sequences are requested without gene families.

Return type:

None

panorama.format.read_binaries.get_status(pangenome, pangenome_file)

Check which elements are present in the HDF5 file and update pangenome status.

This function extends the base status checking functionality to include systems-specific status information.

Parameters:

pangenome (Pangenome) – The pangenome object to update with status information.
pangenome_file (Path) – Path to the pangenome HDF5 file to examine.

Return type:

None

Note

This function modifies the pangenome object in-place by updating its status dictionary.

panorama.format.read_binaries.load_pangenome(name, path, taxid, need_info, check_function=None, disable_bar=False, **kwargs)

Load a single pangenome from the file with specified information requirements.

This function creates a new pangenome object, associates it with the given HDF5 file, performs optional validation, and loads the requested information.

Parameters:

name (str) – Descriptive name for the pangenome.
path (Path) – Path to the pangenome HDF5 file.
taxid (int) – NCBI taxonomic identifier for the pangenome.
need_info (Dict[str, bool]) – Dictionary specifying what information to load. Keys can include: ‘annotation’, ‘gene_families’, ‘graph’, ‘rgp’, ‘spots’, ‘gene_sequences’, ‘modules’, ‘metadata’, ‘systems’, etc.
check_function (Optional[Callable[[Pangenome, ...], None]]) – Custom validation function to run before loading information. Should raise exceptions on failure.
disable_bar (bool) – Whether to disable progress bars. Defaults to False.
**kwargs – Additional parameters passed to check_function and check_pangenome_info.

Returns:

Pangenome – Fully loaded pangenome object with requested information.

Raises:

FileNotFoundError – If the specified path does not exist.
Exception – If check_function validation fails or loading encounters errors.

Return type:

Pangenome
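The contract described for check_function (a callable that receives the pangenome plus keyword arguments and raises on failure) can be illustrated without importing PANORAMA. The sketch below is only an assumption-laden illustration: `StubPangenome`, the `status` dictionary, and the "No"/"inFile" status values are stand-ins, not the library's confirmed API.

```python
from typing import Dict


class StubPangenome:
    """Hypothetical stand-in for a Pangenome; it only carries a status dict."""

    def __init__(self, name: str, status: Dict[str, str]):
        self.name = name
        self.status = status


def check_has_spots(pangenome: StubPangenome, **kwargs) -> None:
    """Example check_function: raise if spots are absent (assumed status value "No")."""
    if pangenome.status.get("spots", "No") == "No":
        raise ValueError(f"Pangenome {pangenome.name} has no spots")


# Keys mirror those listed for need_info above.
need_info = {"annotation": True, "gene_families": True, "spots": True}

ok = StubPangenome("ecoli", {"spots": "inFile"})
check_has_spots(ok)  # returns None: validation passed

missing = StubPangenome("kpneumo", {"spots": "No"})
try:
    check_has_spots(missing)
    failed = False
except ValueError:
    failed = True  # a failing check surfaces as an exception
```

A function shaped like `check_has_spots` is what would be passed as check_function, with need_info passed alongside it.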

panorama.format.read_binaries.load_pangenomes(pangenome_list, need_info, check_function=None, max_workers=1, lock=None, disable_bar=False, **kwargs)

Load multiple pangenomes in parallel from a configuration file.

This function provides efficient parallel loading of multiple pangenomes using a thread pool. It reads pangenome specifications from a TSV file and loads each pangenome with the same information requirements.

Parameters:

pangenome_list (Path) – Path to TSV file containing pangenome specifications. Expected format: name path taxid (tab-separated values).
need_info (Dict[str, bool]) – Information requirements applied to all pangenomes. See load_pangenome documentation for available keys.
check_function (Optional[Callable]) – Validation function applied to each pangenome before loading. Defaults to None.
max_workers (int) – Maximum number of concurrent loading threads. Defaults to 1 (sequential loading).
lock (Optional[Lock]) – Multiprocessing lock for thread synchronization. If None, a new lock will be created. Defaults to None.
disable_bar (bool) – Whether to disable the main progress bar. Individual pangenome progress bars are always disabled in parallel mode. Defaults to False.
**kwargs – Additional parameters passed to each pangenome’s load_pangenome call.

Returns:

Pangenomes – Collection containing all successfully loaded pangenomes.

Raises:

FileNotFoundError – If the pangenome_list file does not exist.
ValueError – If the pangenome_list file format is invalid.
Exception – If any pangenome fails to load (will stop all loading).

Return type:

Pangenomes

Note

Progress bars for individual pangenomes are disabled during parallel loading to avoid output conflicts.
All pangenomes must load successfully; if any fails, the entire operation fails.
The lock parameter is used to synchronize access to shared resources during parallel execution.
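The loading pattern described above (a three-column TSV, a thread pool, a shared lock, and fail-fast error handling) can be sketched with the standard library alone. This is not PANORAMA's implementation: `parse_pangenome_list`, `fake_load`, `load_all`, and the file paths are hypothetical names introduced for illustration.

```python
import csv
import tempfile
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path
from threading import Lock
from typing import Dict, List, Tuple


def parse_pangenome_list(tsv_path: Path) -> List[Tuple[str, Path, int]]:
    """Parse 'name<TAB>path<TAB>taxid' rows; raise ValueError on malformed rows."""
    entries = []
    with open(tsv_path, newline="") as handle:
        for line_no, row in enumerate(csv.reader(handle, delimiter="\t"), start=1):
            if len(row) != 3:
                raise ValueError(f"Line {line_no}: expected 3 columns, got {len(row)}")
            name, path, taxid = row
            entries.append((name, Path(path), int(taxid)))
    return entries


def fake_load(name: str, path: Path, taxid: int, lock: Lock, loaded: Dict[str, int]) -> str:
    """Hypothetical stand-in for load_pangenome; records the taxid under the lock."""
    with lock:
        loaded[name] = taxid
    return name


def load_all(tsv_path: Path, max_workers: int = 1) -> Dict[str, int]:
    """Load every listed pangenome; any worker exception aborts the whole call."""
    lock = Lock()
    loaded: Dict[str, int] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(fake_load, n, p, t, lock, loaded)
                   for n, p, t in parse_pangenome_list(tsv_path)]
        for future in as_completed(futures):
            future.result()  # re-raises any exception raised in the worker
    return loaded


with tempfile.TemporaryDirectory() as tmp:
    listing = Path(tmp) / "pangenomes.tsv"
    listing.write_text("ecoli\t/data/ecoli.h5\t562\nkpneumo\t/data/kp.h5\t573\n")
    result = load_all(listing, max_workers=2)
```

Calling `future.result()` on each future is what turns a single failed load into a failure of the whole operation, matching the all-or-nothing behaviour noted above.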

panorama.format.read_binaries.read_gene_families(pangenome, h5f, disable_bar=False)

Read gene family associations from the HDF5 file.

This function creates gene families and associates genes with them. If genome annotations are already loaded, it will link to existing gene objects; otherwise, it creates minimal gene objects.

Parameters:

pangenome (Pangenome) – Target pangenome object.
h5f (File) – Open HDF5 file handle containing gene family data.
disable_bar (bool) – Whether to disable the progress bar. Defaults to False.

Return type:

None
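The "link to an existing gene or create a minimal one" behaviour described for read_gene_families is a get-or-create pattern. The sketch below shows that pattern on plain dictionaries; `build_families` and the record layout are hypothetical and do not reflect PANORAMA's or PPanGGOLiN's actual classes.

```python
from typing import Dict, Iterable, Set, Tuple


def build_families(rows: Iterable[Tuple[str, str]],
                   known_genes: Dict[str, dict]) -> Dict[str, Set[str]]:
    """Group gene IDs by family, creating placeholder genes when none exist yet."""
    families: Dict[str, Set[str]] = {}
    for family_name, gene_id in rows:
        if gene_id not in known_genes:
            # No annotation was loaded for this gene: keep a minimal record only.
            known_genes[gene_id] = {"id": gene_id, "minimal": True}
        families.setdefault(family_name, set()).add(gene_id)
    return families


# g1 stands for a gene already loaded from annotations; g2 and g3 are new.
genes = {"g1": {"id": "g1", "minimal": False}}
families = build_families([("famA", "g1"), ("famA", "g2"), ("famB", "g3")], genes)
```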

panorama.format.read_binaries.read_gene_families_info(pangenome, h5f, information=False, sequences=False, disable_bar=False)

Read additional information about gene families from the HDF5 file.

This function can read partition information and/or protein sequences for gene families already present in the pangenome.

Parameters:

pangenome (Pangenome) – Target pangenome object containing gene families.
h5f (File) – Open HDF5 file handle with gene family information.
information (bool) – Whether to read partition information. Defaults to False.
sequences (bool) – Whether to read protein sequences. Defaults to False.
disable_bar (bool) – Whether to disable the progress bar. Defaults to False.

Return type:

None

Note

At least one of ‘information’ or ‘sequences’ should be True for this function to perform any meaningful work.

panorama.format.read_binaries.read_modules(pangenome, h5f, disable_bar=False)

Read functional modules from the HDF5 file.

Modules represent sets of gene families that are consistently found together and likely represent functional units.

Parameters:

pangenome (Pangenome) – Target pangenome object.
h5f (File) – Open HDF5 file handle with precomputed modules.
disable_bar (bool) – Whether to disable the progress bar. Defaults to False.

Raises:

Exception – If gene families have not been loaded into the pangenome.

Return type:

None

panorama.format.read_binaries.read_pangenome(pangenome, annotation=False, gene_families=False, graph=False, rgp=False, spots=False, gene_sequences=False, modules=False, metadata=False, systems=False, disable_bar=False, **kwargs)

Read a pangenome from its HDF5 file with specified components.

This is the main function for loading pangenome data. It reads only the requested components and validates that they are available in the file.

Parameters:

pangenome (Pangenome) – Target pangenome object with an associated HDF5 file.
annotation (bool) – Read genome annotations. Defaults to False.
gene_families (bool) – Read gene family associations. Defaults to False.
graph (bool) – Read gene neighborhood graph. Defaults to False.
rgp (bool) – Read regions of genomic plasticity. Defaults to False.
spots (bool) – Read genomic hotspots. Defaults to False.
gene_sequences (bool) – Read gene DNA sequences. Defaults to False.
modules (bool) – Read functional modules. Defaults to False.
metadata (bool) – Read associated metadata. Defaults to False.
systems (bool) – Read biological systems. Defaults to False.
disable_bar (bool) – Disable all progress bars. Defaults to False.
**kwargs – Additional parameters

Raises:

FileNotFoundError – If pangenome has no associated HDF5 file.
ValueError – If requested data is not available in the file.
AttributeError – If required graph/spots/modules data is missing.
KeyError – If required metadata is not present.

Return type:

None

panorama.format.read_binaries.read_spots(pangenome, h5f, disable_bar=False)

Read genomic hotspots (spots) from the HDF5 file.

Spots represent clusters of regions of genomic plasticity (RGPs) that occur in similar genomic contexts across multiple genomes.

Parameters:

pangenome (Pangenome) – Target pangenome object.
h5f (File) – Open HDF5 file handle with precomputed spots.
disable_bar (bool) – Whether to disable the progress bar. Defaults to False.

Return type:

None

panorama.format.read_binaries.read_systems(pangenome, h5f, models, sources, read_canonical=False, disable_bar=False)

Read system information from all sources in the pangenome HDF5 file.

Parameters:

pangenome (Pangenome) – Target pangenome object.
h5f (File) – Open HDF5 file handle containing pangenome data.
models (List[Models]) – List of model definitions for each source.
sources (List[str]) – List of source identifiers to process.
read_canonical (bool) – Whether to read canonical representations. Defaults to False.
disable_bar (bool) – Whether to disable progress bars. Defaults to False.

Returns:

Set[str] – Combined set of metadata sources from all processed sources.

Raises:

ValueError – If the number of models doesn’t match the number of sources.

Return type:

Set[str]
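The pairing of models with sources, the length check, and the merged return value can be sketched as follows. This is an illustration only: `read_all_sources`, the `metadata_by_source` mapping, and the model and source names are assumptions, and the real function reads from the HDF5 file rather than from a dictionary.

```python
from typing import Dict, List, Set


def read_all_sources(models: List[str], sources: List[str],
                     metadata_by_source: Dict[str, Set[str]]) -> Set[str]:
    """Pair each source with its model and merge the metadata sources found."""
    if len(models) != len(sources):
        raise ValueError(
            f"Got {len(models)} models for {len(sources)} sources; counts must match"
        )
    combined: Set[str] = set()
    for model, source in zip(models, sources):
        # A real reader would load the systems of `source` using `model` here.
        combined |= metadata_by_source.get(source, set())
    return combined


found = read_all_sources(
    ["defense_models", "secretion_models"],
    ["defensefinder", "txsscan"],
    {"defensefinder": {"defensefinder"}, "txsscan": {"txsscan", "pfam"}},
)

try:
    read_all_sources(["only_one_model"], ["a", "b"], {})
    mismatch_detected = False
except ValueError:
    mismatch_detected = True
```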

panorama.format.read_binaries.read_systems_by_source(pangenome, source_group, models, read_canonical=True, disable_bar=False)

Read systems from a specific source and integrate them into the pangenome.

This function processes system data from a single source, including both regular systems and their canonical representations if requested.

Parameters:

pangenome (Pangenome) – Target pangenome object to populate with systems.
source_group (Group) – HDF5 group containing system tables for one source.
models (Models) – Model definitions associated with the systems.
read_canonical (bool) – Whether to read canonical system representations. Defaults to True.
disable_bar (bool) – Whether to disable the progress bar display. Defaults to False.

Return type:

None

Note

Systems are sorted by complexity before addition to ensure consistent ordering.

panorama.format.write_binaries module

This module provides functions to write, update and erase pangenome system data from HDF5 files.

The module extends the base PPanGGOLiN functionality to handle system-specific data structures, including canonical systems, system units, and their metadata relationships.

panorama.format.write_binaries._calculate_max_lengths_for_system(system, current_max_sys, current_max_unit)

Calculate maximum string lengths for a single system and its units.

Parameters:

system – System object to analyze
current_max_sys (Tuple[int, int]) – Current maximum (id_len, name_len) for systems
current_max_unit (Tuple[int, int, int]) – Current maximum (name_len, gf_name_len, annot_source_len) for units

Returns:

Tuple[Tuple[int, int], Tuple[int, int, int]] – Updated maximum lengths for systems and units

Return type:

Tuple[Tuple[int, int], Tuple[int, int, int]]
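The running-maximum update described here can be sketched on stand-in classes. `StubSystem`, `StubUnit`, their attribute names, and `update_max_lengths` are hypothetical; only the tuple layouts, (id_len, name_len) and (name_len, gf_name_len, annot_source_len), are taken from the parameter descriptions above.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class StubUnit:
    """Hypothetical unit: a name plus (gene family name, annotation source) pairs."""
    name: str
    families: List[Tuple[str, str]] = field(default_factory=list)


@dataclass
class StubSystem:
    """Hypothetical system with an identifier, a name, and its units."""
    ID: str
    name: str
    units: List[StubUnit] = field(default_factory=list)


def update_max_lengths(system, current_max_sys, current_max_unit):
    """Fold one system into the running maxima and return the updated tuples."""
    max_id, max_name = current_max_sys
    max_unit_name, max_gf, max_src = current_max_unit
    max_id = max(max_id, len(system.ID))
    max_name = max(max_name, len(system.name))
    for unit in system.units:
        max_unit_name = max(max_unit_name, len(unit.name))
        for gf_name, source in unit.families:
            max_gf = max(max_gf, len(gf_name))
            max_src = max(max_src, len(source))
    return (max_id, max_name), (max_unit_name, max_gf, max_src)


systems = [
    StubSystem("1", "CRISPR-Cas", [StubUnit("cas_core", [("fam_0001", "padloc")])]),
    StubSystem("12", "RM", [StubUnit("rm", [("fam_42", "defensefinder")])]),
]
max_sys, max_unit = (1, 1), (1, 1, 1)  # same starting values as the documented defaults of 1
for system in systems:
    max_sys, max_unit = update_max_lengths(system, max_sys, max_unit)
```

Maxima accumulated this way are the kind of values that fixed-width string columns in an HDF5 table need up front.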

panorama.format.write_binaries.calculate_system_table_sizes(pangenome, source)

Calculate maximum sizes and expected row counts for all system-related HDF5 tables.

This function analyzes all systems from a given source to determine the optimal table sizes for efficient HDF5 storage.

Parameters:

pangenome (Pangenome) – Pangenome object containing the systems data
source (str) – Name of the system source to analyze

Returns:

SystemTableSizes – Dataclass containing all calculated size information for system tables, unit tables, canonical tables and cross-reference tables.

Return type:

SystemTableSizes

panorama.format.write_binaries.create_system_description(max_id_len=1, max_name_len=1, include_canonical_count=False)

Create the dictionary that describes the HDF5 table structure for systems.

Parameters:

max_id_len (int) – Maximum size of system identifier. Defaults to 1.
max_name_len (int) – Maximum size of system name. Defaults to 1.
include_canonical_count (bool) – Whether to include the canonical count column. Defaults to False.

Returns:

Dict[str, Union[tables.StringCol, tables.Int64Col]] – HDF5 table description dictionary containing column definitions for the system table.

Return type:

Dict[str, Union[tables.StringCol, tables.Int64Col]]

panorama.format.write_binaries.create_system_tables(h5f, source_group, table_sizes)

Create all HDF5 tables needed for storing system data.

Parameters:

h5f (File) – Open HDF5 file handle
source_group (Group) – HDF5 group for this source
table_sizes (SystemTableSizes) – Size information for table creation

Returns:

Tuple[tables.Table, …] – Tuple of created tables (system, unit, canonical, canonical_unit, sys2canonical)

Return type:

Tuple[Table, ...]

panorama.format.write_binaries.create_system_unit_description(max_name_len=1, max_gf_name_len=1, max_metadata_source=1)

Create the dictionary that describes the HDF5 table structure for system units.

Parameters:

max_name_len (int) – Maximum size of system unit name. Defaults to 1.
max_gf_name_len (int) – Maximum size of gene family name. Defaults to 1.
max_metadata_source (int) – Maximum size of annotation source name. Defaults to 1.

Returns:

Dict[str, Union[tables.StringCol, tables.Int64Col]] – HDF5 table description dictionary containing column definitions for the system unit table.

Return type:

Dict[str, Union[tables.StringCol, tables.Int64Col]]

panorama.format.write_binaries.erase_pangenome(pangenome, graph=False, gene_families=False, partition=False, rgp=False, spots=False, modules=False, metadata=False, systems=False, source=None)

Erase specific data tables from a pangenome HDF5 file.

This function provides selective deletion of pangenome data components, extending the base functionality to handle system-specific data.

Parameters:

pangenome (Pangenome) – Pangenome object to modify
graph (bool) – Remove graph information. Defaults to False.
gene_families (bool) – Remove gene families information. Defaults to False.
partition (bool) – Remove partition information. Defaults to False.
rgp (bool) – Remove RGP information. Defaults to False.
spots (bool) – Remove spots information. Defaults to False.
modules (bool) – Remove module information. Defaults to False.
metadata (bool) – Remove metadata. Defaults to False.
systems (bool) – Remove systems data. Defaults to False.
source (str) – Specific source to remove (required for systems/metadata). Defaults to None.

Returns:

None

Raises:

AssertionError – If the source is None when systems=True
FileNotFoundError – If the pangenome file doesn’t exist

Return type:

None

panorama.format.write_binaries.write_pangenome(pangenome, file_path, source=None, force=False, disable_bar=False)

Write or update a complete pangenome to an HDF5 file.

This function handles the complete workflow of writing pangenome data, including both standard components and system-specific data.

Parameters:

pangenome (Pangenome) – Pangenome object containing all data to write
file_path (str) – Path to the HDF5 file for the output
source (str) – Source identifier for systems or metadata. Defaults to None.
force (bool) – Whether to overwrite existing files. Defaults to False.
disable_bar (bool) – Whether to disable progress bars. Defaults to False.

Returns:

None

Raises:

AssertionError – If the source is None when systems need to be written
IOError – If the file cannot be created or accessed

Return type:

None

panorama.format.write_binaries.write_pangenome_status(pangenome, h5f)

Write pangenome processing status to the HDF5 file.

This function extends the base status writing functionality to include system-specific status information.

Parameters:

pangenome (Pangenome) – Pangenome object with current status
h5f (File) – Open HDF5 file handle

Returns:

None

Return type:

None

panorama.format.write_binaries.write_system_data_to_tables(system, system_row, unit_row, is_canonical=False)

Write a single system’s data to the appropriate HDF5 table rows.

Parameters:

system – System object containing the data to write
system_row – HDF5 table row object for system data
unit_row – HDF5 table row object for unit data
is_canonical (bool) – Whether this system is a canonical version. Defaults to False.

Returns:

None

Return type:

None

Note

This function modifies the table rows in-place and calls append() to commit the data.

panorama.format.write_binaries.write_systems_to_hdf5(pangenome, h5f, source, disable_bar=False)

Write all systems from a specific source to the HDF5 file.

This function creates the necessary HDF5 table structure and writes all system data, including regular systems, canonical systems and their relationships.

Parameters:

pangenome (Pangenome) – Pangenome object containing systems to write
h5f (File) – Open HDF5 file handle for writing
source (str) – Source identifier for the systems being written
disable_bar (bool) – Whether to disable the progress bar. Defaults to False.

Returns:

None

Raises:

AssertionError – If systems data is not properly formatted or accessible

Return type:

None

SystemTableSizes

Container for system table size information.

panorama.format.write_flat module

panorama.format.write_flat.check_flat_parameters(args)

Check that the given command-line arguments are valid and, if so, return the information needed to load the pangenomes.

Parameters:

args (Namespace) – Command-line arguments

Returns:

Information needed to load the pangenomes

Return type:

Tuple[Dict[str, Union[bool, Any]], Dict[str, Union[bool, Any]]]

panorama.format.write_flat.check_pangenome_write_flat(pangenome, *args, **kwargs)

Wrapper function

Parameters:

pangenome – pangenome to check
*args – all possible necessary args
**kwargs – all possible necessary kwargs

Returns:

the function wrapped

panorama.format.write_flat.check_pangenome_write_flat_annotations(func)

Decorator that checks the pangenome before annotations are written

Parameters:

func – Function to decorate

Returns:

wrapped function

panorama.format.write_flat.check_pangenome_write_flat_hmm(func)

Decorator that checks the pangenome before HMMs are written

Parameters:

func – Function to decorate

Returns:

wrapped function
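A checking decorator of this shape can be sketched with functools.wraps. The example is illustrative only: the pangenome is modelled as a plain dictionary, and `check_annotations`, `check_before_write`, and the "metadata" status key with its "No"/"inFile" values are assumptions rather than PANORAMA's actual checks.

```python
from functools import wraps


def check_annotations(func):
    """Hypothetical decorator: refuse to call func when metadata is absent."""
    @wraps(func)
    def wrapper(pangenome, *args, **kwargs):
        if pangenome["status"].get("metadata", "No") == "No":
            raise ValueError("Annotations were requested but no metadata is available")
        return func(pangenome, *args, **kwargs)
    return wrapper


@check_annotations
def check_before_write(pangenome, *args, **kwargs):
    """Stand-in for the wrapped check; returns the name when all checks pass."""
    return pangenome["name"]


passed = check_before_write({"name": "ecoli", "status": {"metadata": "inFile"}})

try:
    check_before_write({"name": "kpneumo", "status": {}})
    rejected = False
except ValueError:
    rejected = True
```

Using `wraps` keeps the decorated function's name and docstring, which matters when the function is later reported in logs or documentation.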

panorama.format.write_flat.launch(args)

Launch functions to write flat files from pangenomes

Parameters:

args – Arguments given on the command line

panorama.format.write_flat.parser_write(parser)

Add arguments to the parser for the write command

Parameters:

parser – Parser for the write command arguments

panorama.format.write_flat.subparser(sub_parser)

Subparser to launch PANORAMA from the command line

Parameters:

sub_parser – Subparser for the write command

Returns:

argparse.ArgumentParser – Parser arguments for the write command

Return type:

ArgumentParser
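The subparser, parser_write, and launch functions follow a common argparse layout, which can be sketched as below. The flag names (`--pangenomes`, `--output`, `--threads`), the `write` subcommand name, and the body of `launch` are assumptions for illustration; the real command may define different options.

```python
import argparse


def parser_write(parser: argparse.ArgumentParser) -> None:
    """Attach illustrative options; these flag names are assumptions."""
    parser.add_argument("--pangenomes", required=True, help="TSV list of pangenomes")
    parser.add_argument("--output", required=True, help="Output directory")
    parser.add_argument("--threads", type=int, default=1)


def subparser(sub_parser) -> argparse.ArgumentParser:
    """Register a 'write' subcommand and return its parser."""
    parser = sub_parser.add_parser("write", description="Write flat files")
    parser_write(parser)
    return parser


def launch(args: argparse.Namespace) -> str:
    """Stand-in for the real launch; it only reports what would run."""
    return f"writing {args.pangenomes} to {args.output} with {args.threads} thread(s)"


main = argparse.ArgumentParser(prog="panorama")
commands = main.add_subparsers(dest="command")
subparser(commands).set_defaults(func=launch)

args = main.parse_args(["write", "--pangenomes", "list.tsv", "--output", "out"])
message = args.func(args)
```

Binding `launch` through `set_defaults(func=...)` lets the top-level program dispatch to whichever subcommand was selected without a chain of conditionals.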

panorama.format.write_flat.write_flat_files(pangenomes, output, annotation=False, hmm=False, threads=1, lock=None, force=False, disable_bar=False, **kwargs)

Global function to write all flat files from pangenomes.

Parameters:

pangenomes (Pangenomes) – Pangenomes object containing the pangenomes
output (Path) – Path to the output directory for all flat files
annotation (bool) – Flag to indicate whether annotations should be written or not (default: False)
hmm (bool) – Flag to indicate whether hmms should be written or not (default: False)
threads (int) – Number of available threads (default: 1)
lock (Lock) – Global lock for multiprocessing execution (default: None)
force (bool) – Flag to indicate if a path can be overwritten (default: False)
disable_bar (bool) – Disable progress bar (default: False)
**kwargs – Additional keyword arguments required by the requested flat file type

panorama.format.write_flat.write_hmm(family, output, msa_file_path=None, msa_format='afa')

Write an HMM profile for a gene family

Parameters:

family (GeneFamily) – A pangenome gene family
output (Path) – Path to the directory to save the HMM
msa_file_path (Path) – Path to the msa file to compute the HMM
msa_format (str) – format of the msa file to read it correctly (default: “afa”)

panorama.format.write_flat.write_hmm_profile(pangenomes, msa_tsv_path, output, msa_format='afa', threads=1, lock=None, force=False, disable_bar=False)

Write an HMM profile for all gene families in pangenomes

Parameters:

pangenomes (Pangenomes) – Pangenomes object with all pangenomes
msa_tsv_path (Path) – Path to the TSV file with the MSAs
output (Path) – Path to the output directory
msa_format (str) – format of the msa file to read it correctly (default: “afa”)
threads (int) – Number of available threads (default: 1)
lock (Lock) – Global lock for multiprocessing execution (default: None)
force (bool) – Flag to indicate if a path can be overwritten (default: False)
disable_bar (bool) – Disable progress bar (default: False)

panorama.format.write_flat.write_pangenome_families_annotations(pangenome, output, sources, disable_bar=False)

Write a TSV file with all annotations and sources present in the pangenome

Parameters:

pangenome (Pangenome) – Pangenome with annotation loaded
output (Path) – Output directory to save the tsv file
sources (List[str]) – sources to write
disable_bar (bool) – Flag to disable the progress bar (default: False)
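Writing per-family annotations to a TSV file can be sketched with the csv module. The column layout here (one `family` column followed by one column per source, with multiple annotations joined by commas) is an assumption made for illustration and is not the documented output format; `write_annotations_tsv` and the identifiers are hypothetical.

```python
import csv
import io
from typing import Dict, List


def write_annotations_tsv(handle, annotations: Dict[str, Dict[str, List[str]]],
                          sources: List[str]) -> None:
    """Write one row per family with one column per requested source."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["family"] + sources)
    for family in sorted(annotations):
        per_source = annotations[family]
        # Sources with no annotation for this family yield an empty cell.
        writer.writerow([family] + [",".join(per_source.get(s, [])) for s in sources])


buffer = io.StringIO()
write_annotations_tsv(
    buffer,
    {
        "fam_2": {"pfam": ["PF00001"]},
        "fam_1": {"pfam": ["PF00002", "PF00003"], "cog": ["COG0001"]},
    },
    ["pfam", "cog"],
)
table = buffer.getvalue()
```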

panorama.format.write_flat.write_pangenomes_families_annotations(pangenomes, output, sources, threads=1, lock=None, force=False, disable_bar=False)

Function to write annotations from multiple pangenomes

Parameters:

pangenomes (Pangenomes) – Pangenomes object containing all pangenomes
output (Path) – Path to the output directory
sources (List[str]) – List of sources to write annotations for families
threads (int) – Number of available threads (default: 1)
lock (Lock) – Global lock for multiprocessing execution (default: None)
force (bool) – Flag to indicate if a path can be overwritten (default: False)
disable_bar (bool) – Disable progress bar (default: False)

panorama.format.write_proksee module

panorama.format.write_proksee.palette()

Return type:

List[Tuple[int]]

panorama.format.write_proksee.read_data(template, features, sources=None)

Return type:

dict

panorama.format.write_proksee.read_settings(settings_data)

panorama.format.write_proksee.write_contig(organism)

panorama.format.write_proksee.write_genes(organism, sources)

panorama.format.write_proksee.write_legend_items(legend_data, features, sources)

panorama.format.write_proksee.write_modules(pangenome, organism, gf2genes)

panorama.format.write_proksee.write_partition(organism)

panorama.format.write_proksee.write_proksee(pangenome, output, features=None, sources=None, template=None, organisms_list=None, threads=1, disable_bar=False)

panorama.format.write_proksee.write_proksee_organism(pangenome, organism, output, template, features=None, sources=None)

panorama.format.write_proksee.write_rgp(pangenome, organism)

panorama.format.write_proksee.write_spots(pangenome, organism, gf2genes)

panorama.format.write_proksee.write_systems(pangenome, organism, gf2genes, sources)

panorama.format.write_proksee.write_tracks(features)