@gpauloski gpauloski commented Jul 5, 2025

Description

The file/Globus connectors disabled file buffering, which caused silently truncated writes for very large files (>=2.14GB). Both connectors now use the unset buffering policy, which falls back to the system default, with an option to override the policy on construction. This issue was identified in #704.
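For context, a minimal stdlib sketch (not ProxyStore code) of the difference between the two buffering modes: opening a file with `buffering=0` returns a raw `io.FileIO`, where a single `write()` can be truncated by the OS for very large buffers, while the default policy wraps the file in an `io.BufferedWriter` that retries partial raw writes.

```python
import io
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'demo.bin')

# Default buffering (-1): writes go through an io.BufferedWriter,
# which retries partial raw writes until every byte is flushed.
with open(path, 'wb') as f:
    assert isinstance(f, io.BufferedWriter)

# buffering=0: writes go directly to the raw io.FileIO object; a
# single write() may return a short count for very large buffers
# (>= 2 GiB on some platforms), which is easy to miss silently.
with open(path, 'wb', buffering=0) as f:
    assert isinstance(f, io.FileIO)
```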

That issue also benefited from using a custom serializer/deserializer to serialize large torch models more efficiently. I changed the serialization protocols to be more flexible, operating on BytesLike objects rather than just bytes, so it is easier for people to implement new serializers. I also updated the example in the docs.
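To illustrate why accepting bytes-like input helps (a minimal sketch independent of the ProxyStore API): `io.BytesIO` already accepts any object supporting the buffer protocol, so a deserializer written against BytesLike works on `bytes`, `bytearray`, and `memoryview` without an explicit `bytes()` conversion.

```python
import io


def read_header(raw) -> bytes:
    # io.BytesIO accepts any bytes-like object (bytes, bytearray,
    # memoryview), so the deserializer need not call bytes(raw) first.
    return io.BytesIO(raw).readline()


assert read_header(b'PT\npayload') == b'PT\n'
assert read_header(bytearray(b'PT\npayload')) == b'PT\n'
assert read_header(memoryview(b'PT\npayload')) == b'PT\n'
```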

Fixes

Type of Change

  • Breaking Change (fix or enhancement which changes existing semantics of the public interface)
  • Enhancement (new features or improvements to existing functionality)
  • Bug (fixes for a bug or issue)
  • Internal (refactoring, style changes, testing, optimizations)
  • Documentation update (changes to documentation or examples)
  • Package (dependencies, versions, package metadata)
  • Development (CI workflows, pre-commit, linters, templates)
  • Security (security related changes)

Testing

To replicate the issue, I used this script:

from __future__ import annotations

import logging

import torch
from transformers import OPTForCausalLM

from proxystore.connectors.file import FileConnector
from proxystore.store import Store


def main() -> None:
    logging.basicConfig(level=logging.DEBUG)

    model = OPTForCausalLM.from_pretrained(
        'facebook/opt-1.3b',
        cache_dir='/tmp/proxystore-issue-704/cache',
        torch_dtype=torch.float32,
    )

    connector = FileConnector('/tmp/proxystore-issue-704/store')
    with Store(name='issue-704', connector=connector) as store:
        key = store.put(model)
        store.get(key)


if __name__ == '__main__':
    main()

And I tested the custom serializers with this:

from __future__ import annotations

import io
import logging
from typing import Any

import torch
from transformers import OPTForCausalLM

from proxystore.connectors.file import FileConnector
from proxystore.serialize import serialize, deserialize, BytesLike, SerializationError
from proxystore.store import Store


def serialize_torch_model(obj: Any) -> bytes:
    if isinstance(obj, torch.nn.Module):
        buffer = io.BytesIO()
        buffer.write(b'PT\n')
        torch.save(obj, buffer)
        return buffer.getvalue()
    else:
        return serialize(obj)


def deserialize_torch_model(raw: BytesLike) -> Any:
    try:
        return deserialize(raw)
    except SerializationError:
        buffer = io.BytesIO(raw)
        assert buffer.readline() == b'PT\n'
        return torch.load(buffer, weights_only=False)


def main() -> None:
    logging.basicConfig(level=logging.DEBUG)

    model = OPTForCausalLM.from_pretrained(
        'facebook/opt-1.3b',
        cache_dir='/tmp/proxystore-issue-704/cache',
        torch_dtype=torch.float32,
    )

    connector = FileConnector('/tmp/proxystore-issue-704/store')
    with Store(name='issue-704', connector=connector) as store:
        key = store.put(model, serializer=serialize_torch_model)
        store.get(key, deserializer=deserialize_torch_model)


if __name__ == '__main__':
    main()

Pull Request Checklist

Please confirm the PR meets the following requirements.

  • Tags added to PR (e.g., breaking, bug, enhancement, internal, documentation, package, development, security).
  • Code changes pass pre-commit (e.g., mypy, ruff, etc.).
  • Tests have been added to show the fix is effective or that the new feature works.
  • New and existing unit tests pass locally with the changes.
  • Docs have been updated and reviewed if relevant.

@gpauloski added the bug and enhancement labels Jul 5, 2025
The default buffering was set to 0 to disable buffering when opening
files so that file writes would appear immediately on the file system.
However, writes larger than 2GB were getting silently truncated by the OS
as noted in Issue #704. To preserve the ability to disable buffering for
unit tests, a buffering parameter is added to the connectors.
@gpauloski gpauloski merged commit 200a3f1 into main Jul 5, 2025
12 checks passed
@gpauloski gpauloski deleted the issue-704 branch July 5, 2025 05:53


Development

Successfully merging this pull request may close these issues.

Proxied model is truncated at 2.14 GB
