
Add DGEB evaluation integration for UME models#137

Merged
ncfrey merged 8 commits into main from n/add-dgeb
Jul 7, 2025

Conversation


@ncfrey ncfrey commented Jul 7, 2025

Summary

This PR adds comprehensive DGEB (Diverse Genomic Embedding Benchmark) evaluation integration for UME models, enabling standardized benchmarking of protein and DNA language models.

Key Features

  • DGEB Integration: Full integration with the DGEB framework for protein and DNA sequence benchmarking
  • UME Model Support: Direct support for UME models via the Ume.from_pretrained() method
  • Local Evaluation: Run evaluations locally without requiring Hugging Face Hub integration
  • Comprehensive Reporting: Generate both JSON and Markdown reports for easy sharing
  • CLI Interface: New lobster_dgeb_eval command for easy evaluation runs
  • Complete Test Suite: Full test coverage with fixtures for both pretrained and checkpoint models

Implementation Details

  • UMEAdapter Class: Bridges UME models with DGEB's BioSeqTransformer interface
  • Evaluation Runner: Orchestrates evaluation across multiple tasks with proper logging
  • Report Generation: Creates detailed reports with task summaries and model metadata
  • Error Handling: Robust error handling for model loading and evaluation failures
  • Device Management: Automatic CPU/GPU device detection and management
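The adapter's core batching contract, encoding a list of sequences in fixed-size batches and stacking the results, can be sketched as follows. `UMEAdapterSketch` and `toy_embed` are illustrative stand-ins, not the actual `UMEAdapter` implementation:

```python
import numpy as np

class UMEAdapterSketch:
    """Illustrative adapter: batches sequences and stacks embeddings.

    `embed_fn` stands in for a UME model's embedding call.
    """

    def __init__(self, embed_fn, batch_size=32):
        self.embed_fn = embed_fn
        self.batch_size = batch_size

    def encode(self, sequences):
        chunks = []
        for i in range(0, len(sequences), self.batch_size):
            batch = sequences[i : i + self.batch_size]
            chunks.append(np.asarray(self.embed_fn(batch)))
        return np.concatenate(chunks, axis=0)

# Toy embedder: maps each sequence to [length, GC count]
def toy_embed(batch):
    return [[len(s), s.count("G") + s.count("C")] for s in batch]

adapter = UMEAdapterSketch(toy_embed, batch_size=2)
emb = adapter.encode(["ACGT", "GGCC", "AT"])
# emb.shape == (3, 2): one row per input sequence
```

The real adapter additionally handles tokenization, device placement, and layer selection, but the encode-in-batches-then-concatenate shape contract is the part DGEB relies on.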

Files Added

  • src/lobster/evaluation/dgeb_adapter.py - Core adapter implementation
  • src/lobster/evaluation/dgeb_runner.py - Evaluation orchestration
  • src/lobster/evaluation/README.md - Comprehensive documentation
  • src/lobster/cmdline/dgeb_eval.py - CLI entry point
  • tests/lobster/evaluation/test_dgeb_integration.py - Complete test suite

Usage

# Evaluate a pretrained model on protein tasks
lobster_dgeb_eval ume-mini-base-12M --modality protein

# Evaluate on specific tasks
lobster_dgeb_eval ume-mini-base-12M --modality protein --tasks ProteinKNN ProteinGym

# Evaluate from checkpoint
lobster_dgeb_eval /path/to/checkpoint.ckpt --modality dna --output-dir my_results

Testing

All tests pass including:

  • DGEB task loading and compatibility
  • UME model adapter functionality
  • Embedding generation and shape validation
  • Error handling for edge cases
  • Both pretrained and checkpoint model support

Test Plan

  • Unit tests for adapter functionality
  • Integration tests with DGEB framework
  • CLI interface testing
  • Report generation validation
  • Error handling verification
  • Documentation completeness check

@ncfrey ncfrey requested a review from taylormjs July 7, 2025 16:37
)

# Convert to numpy
batch_embeddings = batch_embeddings.detach().cpu().numpy()
@ncfrey (author):

TODO: run on gpu as well

Co-authored-by: freyn6 <freyn6@gene.com>
@ncfrey ncfrey temporarily deployed to test.pypi.org July 7, 2025 17:27 — with GitHub Actions Inactive
@taylormjs taylormjs left a comment

Looks great overall! Left some minor comments

--output-dir OUTPUT_DIR # Optional: results directory (default: dgeb_results)
--batch-size BATCH_SIZE # Optional: encoding batch size (default: 32)
--max-seq-length MAX_LENGTH # Optional: max sequence length (default: 1024)
--use-flash-attn # Optional: enable flash attention
We may only want one of these two flags --use-flash-attn or --no-flash-attn
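One way to get a single paired flag is the standard library's `argparse.BooleanOptionalAction` (Python 3.9+), which generates both `--use-flash-attn` and `--no-use-flash-attn` from one definition. A minimal sketch, not the PR's actual CLI code:

```python
import argparse

parser = argparse.ArgumentParser()
# One definition yields both --use-flash-attn and --no-use-flash-attn
parser.add_argument(
    "--use-flash-attn",
    action=argparse.BooleanOptionalAction,
    default=False,
    help="Enable flash attention",
)

args = parser.parse_args(["--use-flash-attn"])
# args.use_flash_attn is True
args = parser.parse_args(["--no-use-flash-attn"])
# args.use_flash_attn is False
```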

### Performance Tips

- **Batch Size**: Increase `--batch-size` for faster evaluation on GPU (try 64-128)
- **Sequence Length**: Reduce `--max-seq-length` if memory is limited (try 512)
We should consider indicating if any tasks require longer sequence lengths, especially for DNA tasks. Having a flashback to the AAV task where all sequence variation was after index 512

Embeddings of shape [batch_size, num_layers, embedding_dim].
"""
# For now, use the high-level embed_sequences method which gives us aggregated embeddings
# TODO: In the future, we could implement proper layer-wise extraction by calling
Agreed! This is good for now

logger = logging.getLogger(__name__)


class UMEAdapter(BioSeqTransformer):
Maybe change to UMEAdapterDGEB or something more specific? In case we add more of these, and to avoid confusion in case we add LoRAdapters

output_path : Path
Path to save the report.
"""
report_path = output_path / "evaluation_report.md"
Should this include a timestamp to avoid overwriting reports? I suppose that could also be in the output_path
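A minimal sketch of the timestamped-filename idea (the helper name is hypothetical, not code from this PR):

```python
from datetime import datetime
from pathlib import Path

def report_path_with_timestamp(output_path: Path) -> Path:
    # e.g. evaluation_report_20250707-213000.md
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    return output_path / f"evaluation_report_{stamp}.md"

path = report_path_with_timestamp(Path("dgeb_results"))
```

Putting the timestamp in output_path instead, as the comment suggests, keeps each run's artifacts (JSON and Markdown) grouped in one directory.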


# Get available tasks
all_tasks = dgeb.get_all_task_names()
assert len(all_tasks) > 0, "Should find some tasks"
Assert len(all_tasks) = num_tasks instead of > 0? Just to be sure we're capturing all of them. Same for protein & dna tasks

@ncfrey (author) replied:

kept this as > 0 to be robust to dgeb dataset updates

assert max_diff > 1e-6, f"Rotary embedding did not modify the tensor. Max diff: {max_diff}"


def test_rotary_embedding_positional_invariance():
Nice test case!
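The invariance that test targets can be demonstrated standalone: after rotary embedding, the attention score between a query and a key depends only on their relative offset, not their absolute positions. A numpy sketch assuming the split-half RoPE pairing convention (not the PR's actual implementation):

```python
import numpy as np

def rope(x: np.ndarray, pos: int, base: float = 10000.0) -> np.ndarray:
    """Apply rotary position embedding (split-half pairing convention)."""
    half = x.shape[-1] // 2
    freqs = base ** (-np.arange(half) / half)
    ang = pos * freqs
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[..., :half], x[..., half:]
    # Rotate each (x1[i], x2[i]) plane by pos * freqs[i]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

rng = np.random.default_rng(0)
q = rng.standard_normal(64)
k = rng.standard_normal(64)

# Same relative offset (7 - 5 == 107 - 105) gives the same score
score_near = rope(q, 5) @ rope(k, 7)
score_far = rope(q, 105) @ rope(k, 107)
assert np.allclose(score_near, score_far)
```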

@taylormjs commented:

@ncfrey Just saw the merge conflicts in uv.lock

@ncfrey ncfrey merged commit 84ba732 into main Jul 7, 2025
5 checks passed
@ncfrey ncfrey deleted the n/add-dgeb branch July 7, 2025 21:40
from pathlib import Path

# Add the evaluation module to the path
sys.path.insert(0, str(Path(__file__).parent.parent))
we probably want to get rid of this? @ncfrey
