Add PTM dataset support (addresses #81) #156

Merged
ncfrey merged 19 commits into prescient-design:main from etherealsunshine:add-ptm-dataset
Jul 30, 2025

Conversation

@etherealsunshine
Contributor

Description

Adds support for Post-Translational Modification (PTM) datasets from the PTM-mamba paper, addressing issue #81.

What's added

  • PTMDataset class that auto-downloads PTM data from Zenodo
  • Support for 311,350 PTM samples with protein sequences and modification annotations
  • Integration with lobster's existing dataset patterns and transforms
  • Basic unit tests

Features

  • Auto-download: Automatically downloads PTM-mamba dataset from Zenodo on first use
  • Flexible columns: Support for selecting specific data columns (protein_id, position, ptm_type, sequence, tokens)
  • Lobster integration: Follows existing dataset patterns, works with transforms and DataLoaders
  • PTM-mamba compatibility: Uses the official PTM-mamba dataset format
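
The auto-download feature above amounts to a download-once cache: fetch the file on first use, then reuse the local copy. Below is a minimal sketch of that pattern using only the standard library. The commit history indicates the PR itself calls pooch.retrieve() with known_hash=None, so urllib stands in here purely to keep the example self-contained. The names fetch_once and fake_download, the download parameter, and the example.invalid URL are hypothetical and not part of lobster.

```python
import tempfile
import urllib.request
from pathlib import Path
from typing import Callable, Optional


def fetch_once(
    url: str,
    cache_dir: Path,
    fname: str,
    download: Optional[Callable[[str, Path], None]] = None,
) -> Path:
    """Return the cached file path, downloading only if the file is missing."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    target = cache_dir / fname
    if not target.exists():
        if download is None:
            # Default: plain HTTP download via the standard library.
            urllib.request.urlretrieve(url, target)
        else:
            # Injected downloader (hypothetical hook, handy for offline tests).
            download(url, target)
    return target


# Demo with a stand-in downloader so nothing touches the network.
calls = []


def fake_download(url: str, target: Path) -> None:
    calls.append(url)
    target.write_text("placeholder\n")


with tempfile.TemporaryDirectory() as tmp:
    first = fetch_once("https://example.invalid/ptm.csv", Path(tmp), "ptm.csv", fake_download)
    second = fetch_once("https://example.invalid/ptm.csv", Path(tmp), "ptm.csv", fake_download)
    same_path = first == second
    content = first.read_text()
```

The second call finds the file already in the cache directory and skips the downloader, which is the "on first use" behavior the description refers to.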

Usage

from lobster.datasets import PTMDataset

# Load full PTM dataset (auto-downloads on first use)
dataset = PTMDataset()
print(f"Loaded {len(dataset)} PTM samples")

# Access individual samples
sample = dataset[0]  # Returns (protein_id, position, ptm_type, sequence, tokens)

# Use specific columns only
seq_dataset = PTMDataset(columns=["ori_seq"])
sequence = seq_dataset[0]  # Returns just the sequence string
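
The usage above suggests that indexing returns a tuple when several columns are selected and a bare value when only one is. Here is a minimal sketch of that selection logic, assuming this is how the columns argument behaves. ColumnDataset is a hypothetical stand-in, not the PTMDataset implementation, and the rows are invented toy data that borrow the column names from the feature list.

```python
from typing import Any, Dict, List, Optional, Sequence


class ColumnDataset:
    """Toy map-style dataset that returns only the requested columns."""

    def __init__(self, rows: List[Dict[str, Any]], columns: Optional[Sequence[str]] = None):
        self.rows = list(rows)
        if columns is not None:
            self.columns = list(columns)
        else:
            # Default to every column, in the order of the first row.
            self.columns = list(self.rows[0]) if self.rows else []

    def __len__(self) -> int:
        return len(self.rows)

    def __getitem__(self, index: int) -> Any:
        row = self.rows[index]
        values = tuple(row[name] for name in self.columns)
        # Assumption: a single selected column is unwrapped to a bare value.
        return values[0] if len(values) == 1 else values


# Invented toy rows, for illustration only.
rows = [
    {"protein_id": "P1", "position": 3, "ptm_type": "Phosphorylation",
     "sequence": "MKSAY", "tokens": "MK<PTM>AY"},
    {"protein_id": "P2", "position": 1, "ptm_type": "Acetylation",
     "sequence": "KLMNP", "tokens": "<PTM>LMNP"},
]

full = ColumnDataset(rows)
seq_only = ColumnDataset(rows, columns=["sequence"])
```

With these definitions, full[0] yields all five fields as a tuple, while seq_only[0] yields just the sequence string.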

Type of Change

  • Bug fix
  • New feature
  • Documentation update
  • Performance improvement
  • Code refactoring

Testing

  • Tests pass locally
  • Added new tests for new functionality
  • Updated existing tests if needed

Checklist

  • Code follows style guidelines
  • Self-review completed
  • Documentation updated if needed
  • No breaking changes (or clearly documented)

@ncfrey
Contributor

ncfrey commented Jul 25, 2025

@etherealsunshine thanks for implementing this! could you push something to the branch again? we just updated the GitHub actions to trigger the CI on pull requests coming from external forks

@etherealsunshine
Contributor Author

etherealsunshine commented Jul 25, 2025

hey @ncfrey thanks for the update! Just a heads up: one of the test failures (the AttributeError: 'LobsterCRMPLM' object has no attribute 'concept_names') is due to a dependency on another PR that still needs to be merged. I believe I have fixed the other issues :)

@ncfrey
Contributor

ncfrey commented Jul 25, 2025

> hey @ncfrey thanks for the update! Just a heads up: one of the test failures (the AttributeError: 'LobsterCRMPLM' object has no attribute 'concept_names') is due to a dependency on another PR that still needs to be merged. I believe I have fixed the other issues :)

great! i merged #155 so you can rebase and run the checks

@etherealsunshine
Contributor Author

Hi @ncfrey, I think the CI is failing due to a GitHub Actions permissions issue with OIDC tokens for forks. I think it's from the changes to trigger CI on external forks. Could you help resolve this workflow permission issue? Thanks!

@etherealsunshine
Contributor Author

should resolve the failing test on main as well as add support for the dataset now.

@ncfrey
Contributor

ncfrey commented Jul 28, 2025

> should resolve the failing test on main as well as add support for the dataset now.

#162 should resolve this!

@ncfrey
Contributor

ncfrey commented Jul 29, 2025

@etherealsunshine this looks good to merge!

ncfrey merged commit 81690fe into prescient-design:main on Jul 30, 2025
4 checks passed
taylormjs pushed a commit that referenced this pull request Jul 31, 2025
* added wrapper for concept names

* Add unit test for concept_names property

* added support for PTM Dataset

* Add unit test for concept_names property

* added tests and integration for ptm datasets

* Fix pooch.retrieve() call - add known_hash=None

* updated column names

* fixed column names to include token

* added code to download the PTM Dataset

* Fix code formatting issues (ruff)

* Move PTM test to correct datasets directory and fix column names

* Clean up commit history

* added fixes for code formatting

* Remove duplicate concept_names code

* Fix typo in concept_names property (quick fix for failing tests on main)

* Update test to match concepts_name property

* Format code with ruff
taylormjs pushed a commit that referenced this pull request Aug 1, 2025