Tips and Tricks for PANORAMA Developers#
Code Quality Fundamentals#
Naming Conventions#
Good names make code self-documenting:
# Variables and functions: snake_case
gene_family_count = len(gene_families)
def calculate_coverage(genes):
    pass
# Classes: CamelCase
class GeneFamily:
    pass
# Constants: UPPER_SNAKE_CASE
MAX_ITERATIONS = 1000
DEFAULT_THRESHOLD = 0.95
# Private attributes: leading underscore
class System:
    def __init__(self):
        self._internal_cache = {}
Be descriptive but concise:
# Good
def merge_systems(system1, system2):
    pass
# Too vague
def merge(s1, s2):
    pass
# Too verbose
def merge_two_system_objects_together(first_system, second_system):
    pass
Error Handling#
Handle Errors Gracefully#
Always anticipate what could go wrong and handle it appropriately:
from pathlib import Path
def load_pangenome(filepath: str) -> Pangenome:
    """Load a pangenome from an HDF5 file."""
    if not Path(filepath).exists():
        raise FileNotFoundError(f"Pangenome file not found: {filepath}")
    try:
        pangenome = Pangenome.from_file(filepath)
    except Exception as e:
        raise RuntimeError(f"Failed to load pangenome: {e}") from e
    return pangenome
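To see what `from e` preserves, here is a minimal, self-contained sketch. `load_config`, `demo_chaining`, and the JSON file are hypothetical stand-ins (not part of PANORAMA); the only point is that the original low-level exception stays attached to the new one as `__cause__`, so the full story shows up in the traceback.

```python
import json
import tempfile
from pathlib import Path
from typing import Optional


def load_config(filepath: str) -> dict:
    """Hypothetical loader used only to illustrate exception chaining."""
    path = Path(filepath)
    if not path.exists():
        raise FileNotFoundError(f"Config file not found: {filepath}")
    try:
        return json.loads(path.read_text())
    except json.JSONDecodeError as e:
        # "from e" keeps the low-level error attached as __cause__
        raise RuntimeError(f"Failed to load config: {e}") from e


def demo_chaining() -> Optional[BaseException]:
    """Write an invalid file, try to load it, and return the chained cause."""
    with tempfile.TemporaryDirectory() as tmp:
        bad_file = Path(tmp) / "broken.json"
        bad_file.write_text("{not valid json")
        try:
            load_config(str(bad_file))
        except RuntimeError as err:
            return err.__cause__
    return None


cause = demo_chaining()
print(type(cause).__name__)
```

Callers can catch the high-level `RuntimeError` while the traceback still shows the parsing error that triggered it.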
Choose the Right Exception Type#
Use appropriate exception types:
# Invalid input from user
raise ValueError("threshold must be between 0 and 1")
# File doesn't exist
raise FileNotFoundError(f"Model file not found: {path}")
# Wrong type provided
raise TypeError(f"Expected GeneFamily, got {type(obj)}")
# Feature not implemented yet
raise NotImplementedError("Clustering method X not yet supported")
# Key doesn't exist in dict
raise KeyError(f"System '{system_id}' not found")
Validate Input Early#
Check inputs at the start of functions:
def calculate_similarity(family1: GeneFamily, family2: GeneFamily) -> float:
    """Calculate Jaccard similarity between two gene families."""
    # Validate inputs
    if not isinstance(family1, GeneFamily):
        raise TypeError(f"family1 must be GeneFamily, got {type(family1)}")
    if not isinstance(family2, GeneFamily):
        raise TypeError(f"family2 must be GeneFamily, got {type(family2)}")
    if len(family1) == 0 or len(family2) == 0:
        raise ValueError("Cannot calculate similarity for empty families")
    # Now we know inputs are valid
    intersection = len(family1.genes & family2.genes)
    union = len(family1.genes | family2.genes)
    return intersection / union
Logging#
Use Python’s logging module for informational messages:
import logging
logger = logging.getLogger(__name__)
def process_pangenome(pangenome):
    """Process a pangenome with logging."""
    logger.info(f"Processing pangenome with {len(pangenome.gene_families)} families")
    try:
        result = complex_operation(pangenome)
        logger.debug(f"Complex operation completed: {result}")
    except Exception as e:
        logger.error(f"Failed to process pangenome: {e}")
        raise
    logger.info("Processing completed successfully")
    return result
Logging levels:
DEBUG - Detailed diagnostic information
INFO - General informational messages
WARNING - Something unexpected but not an error
ERROR - Something failed
CRITICAL - Serious failure
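The levels act as a threshold: a logger set to WARNING drops DEBUG and INFO messages. The sketch below illustrates this with the standard library only. The helper `build_logger` and the logger name `panorama.demo` are hypothetical choices for this example, not PANORAMA's actual logging setup.

```python
import io
import logging


def build_logger(name: str, level: int, stream) -> logging.Logger:
    """Return a logger that writes '<LEVEL>: <message>' lines to stream."""
    logger = logging.getLogger(name)
    logger.setLevel(level)
    logger.handlers.clear()    # avoid duplicate handlers on repeated calls
    logger.propagate = False   # keep the demo output out of the root logger
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(levelname)s: %(message)s"))
    logger.addHandler(handler)
    return logger


buffer = io.StringIO()
log = build_logger("panorama.demo", logging.WARNING, buffer)

log.debug("detailed diagnostics")          # below threshold: dropped
log.info("general progress message")       # below threshold: dropped
log.warning("unexpected but recoverable")  # kept
log.error("operation failed")              # kept

captured = buffer.getvalue().splitlines()
print(captured)
```

Switching the level to `logging.DEBUG` while investigating a problem would let all four messages through without touching the calls themselves.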
Performance Best Practices#
Use Generators for Large Datasets#
# Good: Memory efficient
def iter_gene_families(pangenome):
    for family in pangenome.families:
        yield family
# Less efficient: Loads everything into memory
def get_all_families(pangenome):
    return [family for family in pangenome.families]
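A small runnable sketch of why laziness matters: the generator below only does work for the items the consumer actually requests. `fake_families` is a hypothetical data source standing in for a real pangenome, and `iter_large_families` is an illustrative name, not an existing PANORAMA function.

```python
from typing import Iterable, Iterator, List


def iter_large_families(families: Iterable[List[str]], min_size: int) -> Iterator[List[str]]:
    """Lazily yield families with at least min_size genes."""
    for family in families:
        if len(family) >= min_size:
            yield family


def fake_families(n: int) -> Iterator[List[str]]:
    """Hypothetical data source: family i has i genes."""
    for i in range(n):
        yield [f"gene_{j}" for j in range(i)]


# Nothing is computed until we iterate, and we stop after the first match,
# so only four of the one million families are ever built.
first_large = next(iter_large_families(fake_families(1_000_000), min_size=3))
print(first_large)
```

Building the full list first would construct all one million families before returning the same answer.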
Cache Expensive Computations#
from functools import lru_cache
@lru_cache(maxsize=1000)
def calculate_similarity(family_id1, family_id2):
    """Calculate similarity with caching."""
    # Expensive computation here
    pass
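A runnable sketch of the same idea, assuming the similarity is a Jaccard index over hashable `frozenset` arguments (an assumption for this example; the function above takes family IDs). `call_count` exists only to show how often the body really runs.

```python
from functools import lru_cache

call_count = 0  # counts real computations, for demonstration only


@lru_cache(maxsize=1000)
def similarity(genes_a: frozenset, genes_b: frozenset) -> float:
    """Jaccard similarity; arguments must be hashable to be cached."""
    global call_count
    call_count += 1
    union = genes_a | genes_b
    return len(genes_a & genes_b) / len(union) if union else 0.0


a = frozenset({"g1", "g2", "g3"})
b = frozenset({"g2", "g3", "g4"})

first = similarity(a, b)    # computed
second = similarity(a, b)   # served from the cache
info = similarity.cache_info()
print(first, second, info.hits, info.misses, call_count)
```

Note that `lru_cache` keys on argument order, so `similarity(b, a)` would be computed and stored separately. Sorting the arguments before the cached call is one way around that.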
Use Built-in Functions#
Built-ins such as sum() are implemented in C and are often faster than an equivalent hand-written Python loop:
# Fast
total = sum(len(family) for family in families)
# Slower
total = 0
for family in families:
    total += len(family)
Use Sets for Membership Testing#
# Fast: O(1) lookup
gene_ids = set(family.gene_ids)
if gene_id in gene_ids:
    pass
# Slow: O(n) lookup
gene_ids = list(family.gene_ids)
if gene_id in gene_ids:
    pass
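Rather than taking the difference on faith, you can measure it. The sketch below uses `timeit` from the standard library on made-up gene IDs. The sizes and repeat counts are arbitrary, and the absolute timings depend on your machine.

```python
import timeit

gene_list = [f"gene_{i}" for i in range(100_000)]
gene_set = set(gene_list)
target = "gene_99999"  # worst case for the list: the last element

# Run each membership test 50 times and record the total time
list_time = timeit.timeit(lambda: target in gene_list, number=50)
set_time = timeit.timeit(lambda: target in gene_set, number=50)

print(f"list: {list_time:.4f}s  set: {set_time:.6f}s")
```

Converting a list to a set has its own one-time cost, so the set pays off when you test membership many times against the same collection.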
Testing Strategy#
Unit Testing 🔬#
Unit tests are your first line of defense against bugs. They test individual pieces of code in isolation.
What Makes a Good Unit Test?#
Isolation - Each test stands alone and doesn’t depend on external systems or other tests
Speed - Unit tests should be fast (ideally under a second)
Focused - Test one thing at a time
Reliable - The same input always produces the same output
Test Both Success and Failure#
def test_valid_gene_family(self):
    """Test that valid gene families are accepted."""
    gf = GeneFamily(name="valid_family", family_id=1)
    assert gf.name == "valid_family"
def test_invalid_gene_family_raises_error(self):
    """Test that invalid input raises appropriate error."""
    with pytest.raises(ValueError, match="Invalid family ID"):
        GeneFamily(name="test", family_id=-1)
Test Edge Cases#
Always test:
Empty inputs
Very large inputs
Boundary values
None/null values
Duplicate entries
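The checklist above can be sketched as one small test per edge case. `jaccard` is a hypothetical function under test, and `raises` is a tiny stand-in for `pytest.raises` so the example runs without pytest installed. In a real test suite you would use `pytest.raises` directly.

```python
def jaccard(genes_a, genes_b) -> float:
    """Hypothetical function under test: Jaccard similarity of two gene collections."""
    if genes_a is None or genes_b is None:
        raise TypeError("gene collections must not be None")
    set_a, set_b = set(genes_a), set(genes_b)  # duplicates collapse here
    if not set_a or not set_b:
        raise ValueError("cannot compare empty gene collections")
    return len(set_a & set_b) / len(set_a | set_b)


def raises(exc_type, func, *args) -> bool:
    """Return True if func(*args) raises exc_type (stand-in for pytest.raises)."""
    try:
        func(*args)
    except exc_type:
        return True
    return False


def test_empty_input():
    assert raises(ValueError, jaccard, [], ["g1"])


def test_none_input():
    assert raises(TypeError, jaccard, None, ["g1"])


def test_duplicates_are_ignored():
    assert jaccard(["g1", "g1", "g2"], ["g1", "g2"]) == 1.0


def test_boundary_values():
    assert jaccard(["g1"], ["g1"]) == 1.0  # identical collections
    assert jaccard(["g1"], ["g2"]) == 0.0  # disjoint collections


def test_large_input():
    big = [f"g{i}" for i in range(100_000)]
    assert jaccard(big, big) == 1.0
```

Because the functions follow the `test_*` naming convention and use bare `assert`, pytest would collect and run them as written.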
Use Descriptive Names#
# Good
def test_merge_fails_when_models_differ(self):
    pass
# Less helpful
def test_merge_2(self):
    pass
Functional Testing 🚀#
Functional tests verify that complete features work as users would actually use them.
What Makes a Good Functional Test?#
Realistic - Test real workflows with realistic data
End-to-end - Test the full pipeline, not just pieces
User-focused - Test what users actually do
Thorough - Verify outputs, not just that commands don’t crash
Use Session-Scoped Fixtures#
@pytest.fixture(scope="session")
def test_pangenome():
    """Create test pangenome once for all tests."""
    # This might take a while, so we only do it once
    return create_test_pangenome()
Mark Tests That Need External Data#
@pytest.mark.requires_test_data
def test_annotation_pipeline(test_data_path):
    """Test the annotation pipeline with real data."""
    command = f"panorama annotate --pangenome {test_data_path}/test.h5"
    run_command(command)
Test Command-Line Interfaces#
def test_systems_command():
    """Test the systems command with typical user parameters."""
    command = (
        f"panorama systems "
        f"--pangenomes {pangenome_list} "
        f"--models {model_file} "
        f"--source defensefinder"
    )
    result = run_command(command)
    assert result.returncode == 0
Creating Reusable Test Components 🔧#
Fixtures Are Your Friends#
Fixtures let you reuse test setup code without repeating yourself:
class TestFixture:
    """Base class for shared fixtures."""
    @pytest.fixture
    def model(self):
        """Create a test model."""
        return Model(
            name="test_model",
            min_mandatory=1,
            min_total=1,
            canonical=["canonical_1", "canonical_2"],
        )
    @pytest.fixture
    def functional_unit(self, model):
        """Create a test functional unit (depends on model fixture)."""
        fu = FuncUnit(name="test_unit", presence="mandatory", min_total=2)
        fu.model = model
        return fu
Helper Methods#
For complex setup that’s specific to a test class:
class TestGeneFamily:
    def create_gene_family(self, name, num_organisms=5):
        """Helper to create a gene family with organisms."""
        gf = GeneFamily(name=name, family_id=next_id())
        for i in range(num_organisms):
            org = Organism(name=f"org_{i}")
            gf.add_organism(org)
        return gf
    def test_family_with_many_organisms(self):
        """Test gene family with many organisms."""
        gf = self.create_gene_family("test", num_organisms=100)
        assert len(gf.organisms) == 100
Testing Errors and Edge Cases ⚠️#
Always Test Error Conditions#
def test_division_by_zero_raises_error(self):
    """Test that division by zero is handled properly."""
    with pytest.raises(ZeroDivisionError, match="cannot divide by zero"):
        calculator.divide(10, 0)
Parametrize Error Tests#
Test multiple invalid inputs efficiently:
@pytest.mark.parametrize("invalid_input", [
    "not_a_number",
    None,
    [],
    {},
    -1,
    float('inf'),
])
def test_rejects_invalid_input(self, invalid_input):
    """Test that various invalid inputs are rejected."""
    with pytest.raises(TypeError):
        process_data(invalid_input)
This is much cleaner than writing six separate test methods!
Debugging and Profiling#
Using the Python Debugger 🔍#
Don’t just add print statements - use Python’s debugger:
# Add this line where you want to break
import pdb; pdb.set_trace()
# Or in Python 3.7+
breakpoint()
Common debugger commands:
n - Next line
s - Step into function
c - Continue execution
p variable - Print variable value
l - List surrounding code
q - Quit debugger
Hint
Your editor may have an integrated debugger, so you don’t have to type these commands manually.
Better Print Debugging#
If you must use print statements:
# Basic print
print(f"DEBUG: family_count = {family_count}")
# Pretty print complex objects
from pprint import pprint
pprint(complex_dict)
# Print with context
import sys
print(f"DEBUG [{sys._getframe().f_code.co_name}]: value = {value}")
Caution
Remember to remove debug prints before committing!
Performance Profiling with VizTracer#
VizTracer creates visual timelines showing exactly where your code spends time:
# Profile a script (VizTracer options go before the script name;
# anything after it is passed to the script itself)
viztracer -o profile.json my_script.py
# Profile with specific arguments
viztracer panorama systems --pangenomes data.txt --models models.yml
# Open the visualization
vizviewer profile.json
# This opens a browser showing an interactive timeline
What to look for in the timeline:
Functions that take a long time
Functions that are called very frequently
Unexpected I/O operations
Nested loops that could be optimized
Git Workflow#
Writing Good Commits 📝#
Good commit messages are like good lab notes - they help everyone (including future you) understand what happened and why.
The Basic Format#
One-line commits for simple changes:
git commit -m "Fix off-by-one error in gene counting"
git commit -m "Add validation for empty gene families"
git commit -m "Update installation instructions for conda"
Multi-line commits when you need to explain more:
git commit -m "Optimize system clustering for large datasets

The previous implementation used nested loops that didn't scale well.
This commit introduces vectorized operations and caching that reduce
runtime from 2 hours to 25 minutes on 10k+ genome datasets.

Tested on: E. coli dataset (15k genomes), Staphylococcus (8k genomes)"
Commit Message Tips#
Use imperative mood: “Add feature” not “Added feature”
Be specific: “Fix memory leak in clustering” beats “Fix bug”
Keep the first line under 50 characters when possible
Explain the ‘why’ not the ‘how’ - code shows how, commits explain why
Make atomic commits - one logical change per commit
Tip
If you find yourself using “and” in a commit message, you might want to split it into multiple commits!
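Some of these tips are mechanical enough to check automatically. The following is a rough sketch of such a check, for example as the core of a commit-msg hook. `check_subject` and the `NON_IMPERATIVE` word list are hypothetical and deliberately simplistic, not an existing PANORAMA tool.

```python
from typing import List

# Hypothetical word list; extend it to match your team's habits.
NON_IMPERATIVE = {"added", "fixed", "updated", "removed", "changed"}


def check_subject(subject: str, max_length: int = 50) -> List[str]:
    """Return a list of style problems found in a commit subject line."""
    words = subject.split()
    if not words:
        return ["subject is empty"]
    problems = []
    if len(subject) > max_length:
        problems.append(f"subject is {len(subject)} characters (limit {max_length})")
    if words[0].lower() in NON_IMPERATIVE:
        problems.append(f"use imperative mood instead of '{words[0]}'")
    if " and " in f" {subject.lower()} ":
        problems.append("contains 'and': consider splitting the commit")
    return problems


print(check_subject("Fix off-by-one error in gene counting"))
print(check_subject("Added merge feature and tests"))
```

The first call reports no problems; the second flags both the past tense and the "and". Treat the output as a nudge, since a legitimate subject can contain "and".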
Small, Focused Commits#
Break your work into digestible pieces:
# Good: Three clear, focused commits
git commit -m "Add merge method to System class"
git commit -m "Add unit tests for System.merge()"
git commit -m "Document System.merge() in API reference"
# Less ideal: One big blob
git commit -m "Add merge feature with tests and docs"
This makes it easier to review, debug, and potentially revert changes if needed.
Before You Push: The Checklist ✅#
We’ve all pushed code and then immediately realized something was wrong. This checklist helps catch issues before they become embarrassing! 😅
1. Run the Tests#
# Quick check
pytest
# Full check with coverage (recommended)
pytest --cov=panorama
# Just test what you changed
pytest tests/test_my_feature.py
All tests should pass. If something fails, fix it before pushing. Your future self will thank you!
2. Format with Black#
We use Black to keep the code style consistent. No more debates about spaces and brackets!
# Format everything
black panorama/ tests/
# Check what would change (without modifying files)
black --check panorama/ tests/
Black makes code reviews smoother since we’re focusing on logic, not style.
3. Linting with Flake8#
flake8 catches potential bugs and style issues:
# Check the entire project
flake8 panorama/ tests/
# Check specific files
flake8 panorama/systems/system.py
Fix the issues flake8 reports before pushing. Most are quick fixes!
4. Update Documentation#
Documentation is code too! If you:
Added a new feature → Update user documentation
Changed public APIs → Update API reference
Added/modified functions → Write/update docstrings
For more on writing good documentation, see “how to build the documentation”.
5. Review Your Own Changes#
Before asking others to review, review yourself:
# What changed compared to dev?
git diff origin/dev
# Check your commit history
git log origin/dev..HEAD --oneline
# Make sure you didn't leave any debug code
grep -r "print(" panorama/ # Just an example!
6. Update the VERSION File#
Don’t forget to bump the patch version! See the Versioning section above.
Handling Merge Conflicts#
Conflicts happen - they’re not a failure, just Git asking for your help to combine changes.
# Start the rebase
git rebase dev
# Git pauses on conflicts
# Open conflicting files and look for markers:
<<<<<<< HEAD
Your code
=======
Their code
>>>>>>> dev
# Edit to keep what you want, then:
git add resolved_file.py
git rebase --continue
# If things get messy, you can always abort and try again
git rebase --abort
Stuck on conflicts? Don’t hesitate to ask for help! Ping a maintainer or open a draft PR and explain where you’re stuck.
Useful Git Commands 🛠️#
Some handy commands to make your life easier:
# Beautiful commit history
git log --oneline --graph --all
# What changed in a specific commit?
git show abc123
# Temporarily save changes without committing
git stash
git stash pop # Get them back
# Oops, need to change the last commit message?
git commit --amend -m "Better message"
# Interactive rebase to clean up commits before pushing
git rebase -i HEAD~3 # Last 3 commits
# Find which commit introduced a bug
git bisect start
Common Pitfalls 🚧#
Mutable Default Arguments#
# Bad: Default list is shared between calls!
def add_family(new_family, families=[]):
    families.append(new_family)
    return families
# Good: Use None and create new list
def add_family(new_family, families=None):
    if families is None:
        families = []
    families.append(new_family)
    return families
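The effect is easy to reproduce. In the sketch below the two variants get distinct, illustrative names (`add_family_shared` and `add_family_fresh`) and take the new family as an explicit parameter, which is an assumption made so the example runs on its own.

```python
def add_family_shared(new_family, families=[]):
    """Buggy: the default list is created once, at function definition."""
    families.append(new_family)
    return families


def add_family_fresh(new_family, families=None):
    """Fixed: a new list is created on every call that omits families."""
    if families is None:
        families = []
    families.append(new_family)
    return families


shared_first = add_family_shared("fam_A")
shared_second = add_family_shared("fam_B")  # same list object as shared_first

fresh_first = add_family_fresh("fam_A")
fresh_second = add_family_fresh("fam_B")    # independent list

print(shared_second)                  # ['fam_A', 'fam_B']
print(shared_first is shared_second)  # True
print(fresh_first, fresh_second)      # ['fam_A'] ['fam_B']
```

The second call to the buggy version silently inherits "fam_A" from the first, which is the kind of bug that only shows up after the function has been called more than once.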
Catching All Exceptions#
# Bad: Hides all errors, even bugs!
try:
    result = risky_operation()
except:
    pass
# Good: Catch specific exceptions
try:
    result = risky_operation()
except ValueError as e:
    logger.error(f"Invalid value: {e}")
    raise
Hardcoded Paths#
# Bad: Won't work on other systems
file_path = "/home/user/data/pangenome.h5"
# Good: Use Path and relative paths
from pathlib import Path
file_path = Path(__file__).parent / "data" / "pangenome.h5"
String Formatting#
# Old style (avoid)
message = "Found %d genes in %s" % (count, family_name)
# Good: f-strings (Python 3.6+)
message = f"Found {count} genes in {family_name}"
# Also good: .format() for complex cases
message = "Found {count} genes in {name}".format(count=count, name=family_name)