
Add DGEB evaluation integration for UME models#137

Merged
ncfrey merged 8 commits into main from n/add-dgeb
Jul 7, 2025

Conversation


@ncfrey ncfrey commented Jul 7, 2025

Summary

This PR adds comprehensive DGEB (Diverse Genomic Embedding Benchmark) evaluation integration for UME models, enabling standardized benchmarking of protein and DNA language models.

Key Features

  • DGEB Integration: Full integration with the DGEB framework for protein and DNA sequence benchmarking
  • UME Model Support: Direct support for UME models via the Ume.from_pretrained() method
  • Local Evaluation: Run evaluations locally without requiring Hugging Face Hub integration
  • Comprehensive Reporting: Generate both JSON and Markdown reports for easy sharing
  • CLI Interface: New lobster_dgeb_eval command for easy evaluation runs
  • Complete Test Suite: Full test coverage with fixtures for both pretrained and checkpoint models

Implementation Details

  • UMEAdapter Class: Bridges UME models with DGEB's BioSeqTransformer interface
  • Evaluation Runner: Orchestrates evaluation across multiple tasks with proper logging
  • Report Generation: Creates detailed reports with task summaries and model metadata
  • Error Handling: Robust error handling for model loading and evaluation failures
  • Device Management: Automatic CPU/GPU device detection and management
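The adapter's core batching contract, encoding a list of sequences in fixed-size batches and stacking the results, can be sketched as follows. `UMEAdapterSketch` and `toy_embed` are illustrative stand-ins, not the actual `UMEAdapter` implementation:

```python
import numpy as np

class UMEAdapterSketch:
    """Illustrative adapter: batches sequences and stacks embeddings.

    `embed_fn` stands in for a UME model's embedding call.
    """

    def __init__(self, embed_fn, batch_size=32):
        self.embed_fn = embed_fn
        self.batch_size = batch_size

    def encode(self, sequences):
        chunks = []
        for i in range(0, len(sequences), self.batch_size):
            batch = sequences[i : i + self.batch_size]
            chunks.append(np.asarray(self.embed_fn(batch)))
        return np.concatenate(chunks, axis=0)

# Toy embedder: maps each sequence to [length, GC count]
def toy_embed(batch):
    return [[len(s), s.count("G") + s.count("C")] for s in batch]

adapter = UMEAdapterSketch(toy_embed, batch_size=2)
emb = adapter.encode(["ACGT", "GGCC", "AT"])
# emb.shape == (3, 2): one row per input sequence
```

The real adapter additionally handles tokenization, device placement, and layer selection, but the encode-in-batches-then-concatenate shape contract is the part DGEB relies on.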

Files Added

  • src/lobster/evaluation/dgeb_adapter.py - Core adapter implementation
  • src/lobster/evaluation/dgeb_runner.py - Evaluation orchestration
  • src/lobster/evaluation/README.md - Comprehensive documentation
  • src/lobster/cmdline/dgeb_eval.py - CLI entry point
  • tests/lobster/evaluation/test_dgeb_integration.py - Complete test suite

Usage

# Evaluate a pretrained model on protein tasks
lobster_dgeb_eval ume-mini-base-12M --modality protein

# Evaluate on specific tasks
lobster_dgeb_eval ume-mini-base-12M --modality protein --tasks ProteinKNN ProteinGym

# Evaluate from checkpoint
lobster_dgeb_eval /path/to/checkpoint.ckpt --modality dna --output-dir my_results

Testing

All tests pass including:

  • DGEB task loading and compatibility
  • UME model adapter functionality
  • Embedding generation and shape validation
  • Error handling for edge cases
  • Both pretrained and checkpoint model support

Test Plan

  • Unit tests for adapter functionality
  • Integration tests with DGEB framework
  • CLI interface testing
  • Report generation validation
  • Error handling verification
  • Documentation completeness check

@ncfrey ncfrey requested a review from taylormjs July 7, 2025 16:37
)

# Convert to numpy
batch_embeddings = batch_embeddings.detach().cpu().numpy()
@ncfrey (author):

TODO: run on gpu as well

Co-authored-by: freyn6 <freyn6@gene.com>
@ncfrey ncfrey temporarily deployed to test.pypi.org July 7, 2025 17:27 — with GitHub Actions Inactive
@taylormjs taylormjs left a comment

Looks great overall! Left some minor comments

--output-dir OUTPUT_DIR # Optional: results directory (default: dgeb_results)
--batch-size BATCH_SIZE # Optional: encoding batch size (default: 32)
--max-seq-length MAX_LENGTH # Optional: max sequence length (default: 1024)
--use-flash-attn # Optional: enable flash attention
We may only want one of these two flags --use-flash-attn or --no-flash-attn
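One way to get a single paired flag is the standard library's `argparse.BooleanOptionalAction` (Python 3.9+), which generates both `--use-flash-attn` and `--no-use-flash-attn` from one definition. A minimal sketch, not the PR's actual CLI code:

```python
import argparse

parser = argparse.ArgumentParser()
# One definition yields both --use-flash-attn and --no-use-flash-attn
parser.add_argument(
    "--use-flash-attn",
    action=argparse.BooleanOptionalAction,
    default=False,
    help="Enable flash attention",
)

args = parser.parse_args(["--use-flash-attn"])
# args.use_flash_attn is True
args = parser.parse_args(["--no-use-flash-attn"])
# args.use_flash_attn is False
```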

### Performance Tips

- **Batch Size**: Increase `--batch-size` for faster evaluation on GPU (try 64-128)
- **Sequence Length**: Reduce `--max-seq-length` if memory is limited (try 512)
We should consider indicating if any tasks require longer sequence lengths, especially for DNA tasks. Having a flashback to the AAV task where all sequence variation was after index 512

Embeddings of shape [batch_size, num_layers, embedding_dim].
"""
# For now, use the high-level embed_sequences method which gives us aggregated embeddings
# TODO: In the future, we could implement proper layer-wise extraction by calling
Agreed! This is good for now

logger = logging.getLogger(__name__)


class UMEAdapter(BioSeqTransformer):
Maybe change to UMEAdapterDGEB or something more specific? In case we add more of these, and to avoid confusion in case we add LoRAdapters

output_path : Path
Path to save the report.
"""
report_path = output_path / "evaluation_report.md"
Should this include a timestamp to avoid overwriting reports? I suppose that could also be in the output_path
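A minimal sketch of the timestamped-filename idea (the helper name is hypothetical, not code from this PR):

```python
from datetime import datetime
from pathlib import Path

def report_path_with_timestamp(output_path: Path) -> Path:
    # e.g. evaluation_report_20250707-213000.md
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    return output_path / f"evaluation_report_{stamp}.md"

path = report_path_with_timestamp(Path("dgeb_results"))
```

Putting the timestamp in output_path instead, as the comment suggests, keeps each run's artifacts (JSON and Markdown) grouped in one directory.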


# Get available tasks
all_tasks = dgeb.get_all_task_names()
assert len(all_tasks) > 0, "Should find some tasks"
Assert len(all_tasks) = num_tasks instead of > 0? Just to be sure we're capturing all of them. Same for protein & dna tasks

@ncfrey (author) replied:

kept this as > 0 to be robust to dgeb dataset updates

assert max_diff > 1e-6, f"Rotary embedding did not modify the tensor. Max diff: {max_diff}"


def test_rotary_embedding_positional_invariance():
Nice test case!
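The invariance that test targets can be demonstrated standalone: after rotary embedding, the attention score between a query and a key depends only on their relative offset, not their absolute positions. A numpy sketch assuming the split-half RoPE pairing convention (not the PR's actual implementation):

```python
import numpy as np

def rope(x: np.ndarray, pos: int, base: float = 10000.0) -> np.ndarray:
    """Apply rotary position embedding (split-half pairing convention)."""
    half = x.shape[-1] // 2
    freqs = base ** (-np.arange(half) / half)
    ang = pos * freqs
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[..., :half], x[..., half:]
    # Rotate each (x1[i], x2[i]) plane by pos * freqs[i]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

rng = np.random.default_rng(0)
q = rng.standard_normal(64)
k = rng.standard_normal(64)

# Same relative offset (7 - 5 == 107 - 105) gives the same score
score_near = rope(q, 5) @ rope(k, 7)
score_far = rope(q, 105) @ rope(k, 107)
assert np.allclose(score_near, score_far)
```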

@taylormjs commented:

@ncfrey Just saw the merge conflicts in uv.lock

@ncfrey ncfrey merged commit 84ba732 into main Jul 7, 2025
5 checks passed
@ncfrey ncfrey deleted the n/add-dgeb branch July 7, 2025 21:40
from pathlib import Path

# Add the evaluation module to the path
sys.path.insert(0, str(Path(__file__).parent.parent))
we probably want to get rid of this? @ncfrey
