Tips and Tricks for PANORAMA Developers#

Table of Contents#

  1. Code Quality Fundamentals

  2. Testing Strategy

  3. Debugging and Profiling

  4. Git Workflow

  5. Common Pitfalls


Code Quality Fundamentals#

Naming Conventions#

Good names make code self-documenting:

# Variables and functions: snake_case
gene_family_count = len(gene_families)
def calculate_coverage(genes):
    pass

# Classes: CamelCase
class GeneFamily:
    pass

# Constants: UPPER_SNAKE_CASE
MAX_ITERATIONS = 1000
DEFAULT_THRESHOLD = 0.95

# Private attributes: leading underscore
class System:
    def __init__(self):
        self._internal_cache = {}

Be descriptive but concise:

# Good
def merge_systems(system1, system2):
    pass

# Too vague
def merge(s1, s2):
    pass

# Too verbose
def merge_two_system_objects_together(first_system, second_system):
    pass

Error Handling#

Handle Errors Gracefully#

Always anticipate what could go wrong and handle it appropriately:

def load_pangenome(filepath: str) -> Pangenome:
    """Load a pangenome from an HDF5 file."""
    if not Path(filepath).exists():
        raise FileNotFoundError(f"Pangenome file not found: {filepath}")
    
    try:
        pangenome = Pangenome.from_file(filepath)
    except Exception as e:
        raise RuntimeError(f"Failed to load pangenome: {e}") from e
    
    return pangenome

Choose the Right Exception Type#

Use appropriate exception types:

# Invalid input from user
raise ValueError("threshold must be between 0 and 1")

# File doesn't exist
raise FileNotFoundError(f"Model file not found: {path}")

# Wrong type provided
raise TypeError(f"Expected GeneFamily, got {type(obj)}")

# Feature not implemented yet
raise NotImplementedError("Clustering method X not yet supported")

# Key doesn't exist in dict
raise KeyError(f"System '{system_id}' not found")

Validate Input Early#

Check inputs at the start of functions:

def calculate_similarity(family1: GeneFamily, family2: GeneFamily) -> float:
    """Calculate Jaccard similarity between two gene families."""
    # Validate inputs
    if not isinstance(family1, GeneFamily):
        raise TypeError(f"family1 must be GeneFamily, got {type(family1)}")
    
    if not isinstance(family2, GeneFamily):
        raise TypeError(f"family2 must be GeneFamily, got {type(family2)}")
    
    if len(family1) == 0 or len(family2) == 0:
        raise ValueError("Cannot calculate similarity for empty families")
    
    # Now we know inputs are valid
    intersection = len(family1.genes & family2.genes)
    union = len(family1.genes | family2.genes)
    return intersection / union

Logging#

Use Python’s logging module for informational messages:

import logging

logger = logging.getLogger(__name__)

def process_pangenome(pangenome):
    """Process a pangenome with logging."""
    logger.info(f"Processing pangenome with {len(pangenome.gene_families)} families")
    
    try:
        result = complex_operation(pangenome)
        logger.debug(f"Complex operation completed: {result}")
    except Exception as e:
        logger.error(f"Failed to process pangenome: {e}")
        raise
    
    logger.info("Processing completed successfully")
    return result

Logging levels:

  • DEBUG - Detailed diagnostic information

  • INFO - General informational messages

  • WARNING - Something unexpected but not an error

  • ERROR - Something failed

  • CRITICAL - Serious failure

Performance Best Practices#

Use Generators for Large Datasets#

# Good: Memory efficient
def iter_gene_families(pangenome):
    for family in pangenome.families:
        yield family

# Less efficient: Loads everything into memory
def get_all_families(pangenome):
    return [family for family in pangenome.families]

Cache Expensive Computations#

from functools import lru_cache

@lru_cache(maxsize=1000)
def calculate_similarity(family_id1, family_id2):
    """Calculate similarity with caching."""
    # Expensive computation here
    pass

Use Built-in Functions#

They’re optimized in C and much faster:

# Fast
total = sum(len(family) for family in families)

# Slower
total = 0
for family in families:
    total += len(family)

Use Sets for Membership Testing#

# Fast: O(1) lookup
gene_ids = set(family.gene_ids)
if gene_id in gene_ids:
    pass

# Slow: O(n) lookup
gene_ids = list(family.gene_ids)
if gene_id in gene_ids:
    pass

Testing Strategy#

Unit Testing 🔬#

Unit tests are your first line of defense against bugs. They test individual pieces of code in isolation.

What Makes a Good Unit Test?#

  1. Isolation - Each test stands alone and doesn’t depend on external systems or other tests

  2. Speed - Unit tests should be fast (ideally under a second)

  3. Focused - Test one thing at a time

  4. Reliable - The Same input always produces the same output

Test Both Success and Failure#

def test_valid_gene_family(self):
    """Test that valid gene families are accepted."""
    gf = GeneFamily(name="valid_family", family_id=1)
    assert gf.name == "valid_family"

def test_invalid_gene_family_raises_error(self):
    """Test that invalid input raises appropriate error."""
    with pytest.raises(ValueError, match="Invalid family ID"):
        GeneFamily(name="test", family_id=-1)

Test Edge Cases#

Always test:

  • Empty inputs

  • Very large inputs

  • Boundary values

  • None/null values

  • Duplicate entries

Use Descriptive Names#

# Good
def test_merge_fails_when_models_differ(self):
    pass

# Less helpful
def test_merge_2(self):
    pass

Functional Testing 🚀#

Functional tests verify that complete features work as users would actually use them.

What Makes a Good Functional Test?#

  1. Realistic - Test real workflows with realistic data

  2. End-to-end - Test the full pipeline, not just pieces

  3. User-focused - Test what users actually do

  4. Thorough - Verify outputs, not just that commands don’t crash

Use Session-Scoped Fixtures#

@pytest.fixture(scope="session")
def test_pangenome():
    """Create test pangenome once for all tests."""
    # This might take a while, so we only do it once
    return create_test_pangenome()

Mark Tests That Need External Data#

@pytest.mark.requires_test_data
def test_annotation_pipeline(test_data_path):
    """Test the annotation pipeline with real data."""
    command = f"panorama annotate --pangenome {test_data_path}/test.h5"
    run_command(command)

Test Command-Line Interfaces#

def test_systems_command():
    """Test the systems command with typical user parameters."""
    command = (
        f"panorama systems "
        f"--pangenomes {pangenome_list} "
        f"--models {model_file} "
        f"--source defensefinder"
    )
    result = run_command(command)
    assert result.returncode == 0

Creating Reusable Test Components 🔧#

Fixtures Are Your Friends#

Fixtures let you reuse test setup code without repeating yourself:

class TestFixture:
    """Base class for shared fixtures."""

    @pytest.fixture
    def model(self):
        """Create a test model."""
        return Model(
            name="test_model",
            min_mandatory=1,
            min_total=1,
            canonical=["canonical_1", "canonical_2"],
        )

    @pytest.fixture
    def functional_unit(self, model):
        """Create a test functional unit (depends on model fixture)."""
        fu = FuncUnit(name="test_unit", presence="mandatory", min_total=2)
        fu.model = model
        return fu

Helper Methods#

For complex setup that’s specific to a test class:

class TestGeneFamily:
    def create_gene_family(self, name, num_organisms=5):
        """Helper to create a gene family with organisms."""
        gf = GeneFamily(name=name, family_id=next_id())
        for i in range(num_organisms):
            org = Organism(name=f"org_{i}")
            gf.add_organism(org)
        return gf

    def test_family_with_many_organisms(self):
        """Test gene family with many organisms."""
        gf = self.create_gene_family("test", num_organisms=100)
        assert len(gf.organisms) == 100

Testing Errors and Edge Cases ⚠️#

Always Test Error Conditions#

def test_division_by_zero_raises_error(self):
    """Test that division by zero is handled properly."""
    with pytest.raises(ZeroDivisionError, match="cannot divide by zero"):
        calculator.divide(10, 0)

Parametrize Error Tests#

Test multiple invalid inputs efficiently:

@pytest.mark.parametrize("invalid_input", [
    "not_a_number",
    None,
    [],
    {},
    -1,
    float('inf'),
])
def test_rejects_invalid_input(self, invalid_input):
    """Test that various invalid inputs are rejected."""
    with pytest.raises(TypeError):
        process_data(invalid_input)

This is much cleaner than writing six separate test methods!


Debugging and Profiling#

Using the Python Debugger 🔍#

Don’t just add print statements - use Python’s debugger:

# Add this line where you want to break
import pdb; pdb.set_trace()

# Or in Python 3.7+
breakpoint()

Common debugger commands:

  • n - Next line

  • s - Step into function

  • c - Continue execution

  • p variable - Print variable value

  • l - List surrounding code

  • q - Quit debugger

Hint

Your editor might integrate a debugger, that way you don’t have to type commands manually.

Better Print Debugging#

If you must use print statements:

# Basic print
print(f"DEBUG: family_count = {family_count}")

# Pretty print complex objects
from pprint import pprint
pprint(complex_dict)

# Print with context
import sys
print(f"DEBUG [{sys._getframe().f_code.co_name}]: value = {value}")

Caution

Remember to remove debug prints before committing!

Performance Profiling with VizTracer#

VizTracer creates visual timelines showing exactly where your code spends time:

# Profile a script
viztracer my_script.py --output profile.json

# Profile with specific arguments
viztracer panorama systems --pangenomes data.txt --models models.yml

# Open the visualization
vizviewer profile.json
# This opens a browser showing an interactive timeline

What to look for in the timeline:

  • Functions that take a long time

  • Functions are called very frequently

  • Unexpected I/O operations

  • Nested loops that could be optimized


Git Workflow#

Writing Good Commits 📝#

Good commit messages are like good lab notes - they help everyone (including future you) understand what happened and why.

The Basic Format#

One-line commits for simple changes:

git commit -m "Fix off-by-one error in gene counting"
git commit -m "Add validation for empty gene families"
git commit -m "Update installation instructions for conda"

Multi-line commits when you need to explain more:

git commit -m "Optimize system clustering for large datasets

The previous implementation used nested loops that didn't scale well.
This commit introduces vectorized operations and caching that reduce
runtime from 2 hours to 25 minutes on 10k+ genome datasets.

Tested on: E. coli dataset (15k genomes), Staphylococcus (8k genomes)"

Commit Message Tips#

  • Use imperative mood: “Add feature” not “Added feature”

  • Be specific: “Fix memory leak in clustering” beats “Fix bug”

  • Keep the first line under 50 characters when possible

  • Explain the ‘why’ not the ‘how’ - code shows how, commits explain why

  • Make atomic commits - one logical change per commit

Tip

If you find yourself using “and” in a commit message, you might want to split it into multiple commits!

Small, Focused Commits#

Break your work into digestible pieces:

# Good: Three clear, focused commits
git commit -m "Add merge method to System class"
git commit -m "Add unit tests for System.merge()"  
git commit -m "Document System.merge() in API reference"

# Less ideal: One big blob
git commit -m "Add merge feature with tests and docs"

This makes it easier to review, debug, and potentially revert changes if needed.

Before You Push: The Checklist ✅#

We’ve all pushed code and then immediately realized something was wrong. This checklist helps catch issues before they become embarrassing! 😅

1. Run the Tests#

# Quick check
pytest

# Full check with coverage (recommended)
pytest --cov=panorama

# Just test what you changed
pytest tests/test_my_feature.py

All tests should pass. If something fails, fix it before pushing. Your future self will thank you!

2. Format with Black#

We use Black to keep the code style consistent. No more debates about spaces and brackets!

# Format everything
black panorama/ tests/

# Check what would change (without modifying files)
black --check panorama/ tests/

Black makes code reviews smoother since we’re focusing on logic, not style.

3. Linting with Flake8#

flake8 catches potential bugs and style issues:

# Check the entire project
flake8 panorama/ tests/

# Check specific files
flake8 panorama/systems/system.py

Fix the issues flake8 reports before pushing. Most are quick fixes!

4. Update Documentation#

Documentation is code too! If you:

  • Added a new feature → Update user documentation

  • Changed public APIs → Update API reference

  • Added/modified functions → Write/update docstrings

More information on how to write good documentation can be found in the “how to build the documentation”.

5. Review Your Own Changes#

Before asking others to review, review yourself:

# What changed compared to dev?
git diff origin/dev

# Check your commit history
git log origin/dev..HEAD --oneline

# Make sure you didn't leave any debug code
grep -r "print(" panorama/  # Just an example!

6. Update the VERSION File#

Don’t forget to bump the patch version! See the Versioning section above.

Handling Merge Conflicts#

Conflicts happen - they’re not a failure, just Git asking for your help to combine changes.

# Start the rebase
git rebase dev
# Git pauses on conflicts

# Open conflicting files and look for markers:
<<<<<<< HEAD
Your code
=======
Their code
>>>>>>> dev

# Edit to keep what you want, then:
git add resolved_file.py
git rebase --continue

# If things get messy, you can always abort and try again
git rebase --abort

Stuck on conflicts? Don’t hesitate to ask for help! Ping a maintainer or open a draft PR and explain where you’re stuck.

Useful Git Commands 🛠️#

Some handy commands to make your life easier:

# Beautiful commit history
git log --oneline --graph --all

# What changed in a specific commit?
git show abc123

# Temporarily save changes without committing
git stash
git stash pop  # Get them back

# Oops, need to change the last commit message?
git commit --amend -m "Better message"

# Interactive rebase to clean up commits before pushing
git rebase -i HEAD~3  # Last 3 commits

# Find which commit introduced a bug
git bisect start

Common Pitfalls 🚧#

Mutable Default Arguments#

# Bad: Default list is shared between calls!
def add_family(families=[]):
    families.append(new_family)
    return families

# Good: Use None and create new list
def add_family(families=None):
    if families is None:
        families = []
    families.append(new_family)
    return families

Catching All Exceptions#

# Bad: Hides all errors, even bugs!
try:
    result = risky_operation()
except:
    pass

# Good: Catch specific exceptions
try:
    result = risky_operation()
except ValueError as e:
    logger.error(f"Invalid value: {e}")
    raise

Hardcoded Paths#

# Bad: Won't work on other systems
file_path = "/home/user/data/pangenome.h5"

# Good: Use Path and relative paths
from pathlib import Path
file_path = Path(__file__).parent / "data" / "pangenome.h5"

String Formatting#

# Old style (avoid)
message = "Found %d genes in %s" % (count, family_name)

# Good: f-strings (Python 3.6+)
message = f"Found {count} genes in {family_name}"

# Also good: .format() for complex cases
message = "Found {count} genes in {name}".format(count=count, name=family_name)