@gpauloski gpauloski commented Jul 5, 2025

Description

The file/Globus connectors disabled file buffering, which caused silently truncated writes for very large files (>=2.14GB). Both connectors now use the unset buffering policy, which falls back to the system default, with an option to override the policy on construction. This issue was identified in #704.
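For context, a minimal stdlib sketch (not ProxyStore code) of the difference between the two buffering modes: opening a file with `buffering=0` returns a raw `io.FileIO`, where a single `write()` can be truncated by the OS for very large buffers, while the default policy wraps the file in an `io.BufferedWriter` that retries partial raw writes.

```python
import io
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'demo.bin')

# Default buffering (-1): writes go through an io.BufferedWriter,
# which retries partial raw writes until every byte is flushed.
with open(path, 'wb') as f:
    assert isinstance(f, io.BufferedWriter)

# buffering=0: writes go directly to the raw io.FileIO object; a
# single write() may return a short count for very large buffers
# (>= 2 GiB on some platforms), which is easy to miss silently.
with open(path, 'wb', buffering=0) as f:
    assert isinstance(f, io.FileIO)
```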

That issue also benefited from using a custom serializer/deserializer to serialize large torch models more efficiently. I changed the serialization protocols to be more flexible, operating on BytesLike objects rather than just bytes, so it is easier for people to implement new serializers. I also updated the example in the docs.
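To illustrate why accepting bytes-like input helps (a minimal sketch independent of the ProxyStore API): `io.BytesIO` already accepts any object supporting the buffer protocol, so a deserializer written against BytesLike works on `bytes`, `bytearray`, and `memoryview` without an explicit `bytes()` conversion.

```python
import io


def read_header(raw) -> bytes:
    # io.BytesIO accepts any bytes-like object (bytes, bytearray,
    # memoryview), so the deserializer need not call bytes(raw) first.
    return io.BytesIO(raw).readline()


assert read_header(b'PT\npayload') == b'PT\n'
assert read_header(bytearray(b'PT\npayload')) == b'PT\n'
assert read_header(memoryview(b'PT\npayload')) == b'PT\n'
```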

Fixes

Type of Change

  • Breaking Change (fix or enhancement which changes existing semantics of the public interface)
  • Enhancement (new features or improvements to existing functionality)
  • Bug (fixes for a bug or issue)
  • Internal (refactoring, style changes, testing, optimizations)
  • Documentation update (changes to documentation or examples)
  • Package (dependencies, versions, package metadata)
  • Development (CI workflows, pre-commit, linters, templates)
  • Security (security related changes)

Testing

To replicate the issue, I used this script:

from __future__ import annotations

import logging

import torch
from transformers import OPTForCausalLM

from proxystore.connectors.file import FileConnector
from proxystore.store import Store


def main() -> None:
    logging.basicConfig(level=logging.DEBUG)

    model = OPTForCausalLM.from_pretrained(
        'facebook/opt-1.3b',
        cache_dir='/tmp/proxystore-issue-704/cache',
        torch_dtype=torch.float32,
    )

    connector = FileConnector('/tmp/proxystore-issue-704/store')
    with Store(name='issue-704', connector=connector) as store:
        key = store.put(model)
        store.get(key)


if __name__ == '__main__':
    main()

And I tested the custom serializers with this:

from __future__ import annotations

import io
import logging
from typing import Any

import torch
from transformers import OPTForCausalLM

from proxystore.connectors.file import FileConnector
from proxystore.serialize import serialize, deserialize, BytesLike, SerializationError
from proxystore.store import Store


def serialize_torch_model(obj: Any) -> bytes:
    if isinstance(obj, torch.nn.Module):
        buffer = io.BytesIO()
        buffer.write(b'PT\n')
        torch.save(obj, buffer)
        return buffer.getvalue()
    else:
        return serialize(obj)


def deserialize_torch_model(raw: BytesLike) -> Any:
    try:
        return deserialize(raw)
    except SerializationError:
        buffer = io.BytesIO(raw)
        assert buffer.readline() == b'PT\n'
        return torch.load(buffer, weights_only=False)


def main() -> None:
    logging.basicConfig(level=logging.DEBUG)

    model = OPTForCausalLM.from_pretrained(
        'facebook/opt-1.3b',
        cache_dir='/tmp/proxystore-issue-704/cache',
        torch_dtype=torch.float32,
    )

    connector = FileConnector('/tmp/proxystore-issue-704/store')
    with Store(name='issue-704', connector=connector) as store:
        key = store.put(model, serializer=serialize_torch_model)
        store.get(key, deserializer=deserialize_torch_model)


if __name__ == '__main__':
    main()

Pull Request Checklist

Please confirm the PR meets the following requirements.

  • Tags added to PR (e.g., breaking, bug, enhancement, internal, documentation, package, development, security).
  • Code changes pass pre-commit (e.g., mypy, ruff, etc.).
  • Tests have been added to show the fix is effective or that the new feature works.
  • New and existing unit tests pass locally with the changes.
  • Docs have been updated and reviewed if relevant.

@gpauloski added the bug and enhancement labels Jul 5, 2025
The default buffering was set to 0 to disable buffering when opening
files so that file writes would appear immediately on the file system.
However, writes larger than 2GB were getting silently truncated by the OS
as noted in Issue #704. To preserve the ability to disable buffering for
unit tests, a buffering parameter is added to the connectors.
@gpauloski gpauloski merged commit 200a3f1 into main Jul 5, 2025
12 checks passed
@gpauloski gpauloski deleted the issue-704 branch July 5, 2025 05:53


Development

Successfully merging this pull request may close these issues.

Proxied model is truncated at 2.14 GB
