Add 'from_pretrained' to Ume #113

Merged

karinazad merged 10 commits into main from k/ume-from-pretrained on Jun 18, 2025

Conversation


@karinazad karinazad commented Jun 18, 2025

MR #113: Add 'from_pretrained' to Ume

Overview

This merge request adds a convenient from_pretrained method to the Universal Molecular Encoder (Ume) model, making it easier to load pre-trained models without manually specifying checkpoint paths. This follows the familiar pattern used by other popular model libraries like Hugging Face Transformers.

Key Changes

1. New Constants File

  • File: src/lobster/constants/_ume_models.py
  • Purpose: Defines available pre-trained model checkpoints
  • Models Available:
    • ume-mini-base-12M (12M parameters)
    • ume-medium-base-480M (480M parameters)
    • ume-large-base-740M (740M parameters)
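The contents of the constants file are not shown in this PR description, but a checkpoint registry of this kind is commonly a name-to-path mapping plus a resolver that fails fast on unknown names. The dictionary name, placeholder paths, and `get_ume_checkpoint` helper below are all assumptions for illustration, not the actual code in `_ume_models.py`:

```python
# Hypothetical sketch of a checkpoint registry; the real dict name and
# S3 paths in src/lobster/constants/_ume_models.py are not shown in this PR.

UME_CHECKPOINTS = {
    "ume-mini-base-12M": "s3://<bucket>/ume/ume-mini-base-12M.ckpt",
    "ume-medium-base-480M": "s3://<bucket>/ume/ume-medium-base-480M.ckpt",
    "ume-large-base-740M": "s3://<bucket>/ume/ume-large-base-740M.ckpt",
}


def get_ume_checkpoint(model_name: str) -> str:
    """Resolve a model name to its checkpoint path, raising on unknown names."""
    try:
        return UME_CHECKPOINTS[model_name]
    except KeyError:
        raise ValueError(
            f"Unknown model {model_name!r}. Available: {sorted(UME_CHECKPOINTS)}"
        ) from None
```

Keeping the registry in a constants module means new model variants can be added in one place without touching the loading logic.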

2. Enhanced Ume Model

  • File: src/lobster/model/_ume.py
  • New Method: from_pretrained() class method
  • Features:
    • Automatic model name resolution to checkpoint paths
    • Device placement control (cpu/cuda)
    • Flash attention configuration
    • Custom cache directory support
    • Automatic retry on corrupted downloads
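The exact signature is not shown in the PR description; the classmethod pattern the feature list describes might be sketched roughly as below. The parameter names (`device`, `use_flash_attn`, `cache_dir`), the inline registry, and the constructor are illustrative assumptions, not the actual API in `src/lobster/model/_ume.py`:

```python
# Hypothetical sketch of the from_pretrained pattern described above;
# parameter names and the constructor are placeholders, not the real Ume API.
from pathlib import Path
from typing import Optional


class Ume:
    def __init__(self, checkpoint_path: str, device: str = "cpu",
                 use_flash_attn: bool = False):
        self.checkpoint_path = checkpoint_path
        self.device = device
        self.use_flash_attn = use_flash_attn

    @classmethod
    def from_pretrained(
        cls,
        model_name: str,
        *,
        device: str = "cpu",
        use_flash_attn: bool = False,
        cache_dir: Optional[str] = None,
    ) -> "Ume":
        # 1. Resolve the model name to a checkpoint file (placeholder registry).
        registry = {"ume-mini-base-12M": "ume-mini-base-12M.ckpt"}
        if model_name not in registry:
            raise ValueError(f"Unknown model: {model_name}")
        # 2. Locate the checkpoint in the cache directory (the real code would
        #    download from S3 here, retrying once on a corrupted file).
        cache = Path(cache_dir or "~/.cache/ume").expanduser()
        local_path = cache / registry[model_name]
        # 3. Load weights and move to the requested device (omitted in this sketch).
        return cls(str(local_path), device=device, use_flash_attn=use_flash_attn)
```

A keyword-only signature like this keeps the one positional argument (the model name) unambiguous while leaving room for more loading options later.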

3. Checkpoint Utilities

  • File: src/lobster/model/_utils_checkpoint.py
  • Purpose: Handles S3 downloads and checkpoint loading with error recovery
  • Features:
    • Automatic download from S3
    • Corruption detection and recovery
    • Proper error handling for credential issues
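The "corruption detection and recovery" behavior described above typically follows a download-load-retry loop: if loading the cached file fails, delete it, re-download once, and load again. The helper below is a generic sketch of that pattern, not the actual code in `_utils_checkpoint.py` (which works against S3 rather than injected callables):

```python
# Hypothetical sketch of the download/corruption-recovery pattern described
# above; the real S3 helpers in _utils_checkpoint.py are not shown in this PR.
import os
from typing import Callable


def load_with_retry(
    path: str,
    download: Callable[[str], None],
    load: Callable[[str], object],
) -> object:
    """Download `path` if missing, then load it; if loading fails
    (e.g. a truncated download), delete the cached copy and retry once."""
    if not os.path.exists(path):
        download(path)
    try:
        return load(path)
    except Exception:
        # Likely a corrupted or partial download: clear the cache and retry.
        os.remove(path)
        download(path)
        return load(path)
```

Retrying exactly once distinguishes a transient bad download from a persistent problem (such as missing credentials), which should surface to the user as an error rather than loop forever.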

4. Comprehensive Testing

  • Files:
    • tests/lobster/model/test__ume.py - Tests for from_pretrained method
    • tests/lobster/model/test__utils_checkpoint.py - Tests for checkpoint utilities
  • Coverage: Unit tests for all new functionality including error cases
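An error-case unit test of the kind mentioned above might look like the following standalone sketch. The `resolve_checkpoint` stand-in is hypothetical; the actual tests in `test__ume.py` exercise the real lobster API (and would normally use pytest idioms such as `pytest.raises`):

```python
# Hypothetical sketch of an error-case unit test; `resolve_checkpoint` is a
# stand-in for the real name-resolution logic, which is not shown in this PR.

def resolve_checkpoint(model_name: str) -> str:
    """Stand-in for the name-to-checkpoint resolution under test."""
    known = {"ume-mini-base-12M": "ume-mini-base-12M.ckpt"}
    if model_name not in known:
        raise ValueError(f"Unknown model: {model_name}")
    return known[model_name]


def test_unknown_model_name_raises():
    # An unrecognized model name should fail fast with a clear error message.
    try:
        resolve_checkpoint("ume-nonexistent-1B")
    except ValueError as err:
        assert "Unknown model" in str(err)
    else:
        raise AssertionError("expected ValueError for unknown model name")
```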

Usage Examples

Basic Usage

```python
from lobster.model import Ume

# Load a pre-trained model
ume = Ume.from_pretrained("ume-mini-base-12M")

# Check model properties
print(f"Supported modalities: {ume.modalities}")
print(f"Vocab size: {len(ume.get_vocab())}")
print(f"Embedding dimension: {ume.embedding_dim}")

# Protein sequences
protein_sequences = ["MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG"]
protein_embeddings = ume.embed_sequences(protein_sequences, modality="amino_acid")
```

Security Notes

  • Models are currently only available to Prescient Design members
  • S3 credentials required for download
  • Clear error messages for unauthorized access attempts

Future Enhancements

  • Support for external users (planned)
  • Additional model variants
  • Integration with Hugging Face Hub
  • More sophisticated caching strategies

@karinazad karinazad requested a review from ncfrey June 18, 2025 01:47
@karinazad karinazad merged commit d695dfa into main Jun 18, 2025
5 checks passed
@karinazad karinazad deleted the k/ume-from-pretrained branch June 18, 2025 18:02
