refactor(lobster): Rewrite LOBSTER parsers with memory-efficient streaming architecture#4592

Merged
shyuep merged 14 commits into materialsproject:master from tomdemeyere:future
Feb 24, 2026
@tomdemeyere
Contributor

Description

This PR introduces a complete rewrite of the LOBSTER output file parsers in pymatgen.io.lobster. The new implementation lives in pymatgen.io.lobster.future and provides a cleaner, more maintainable, and more memory-efficient architecture.

Motivation

The existing LOBSTER parsers have several limitations:

  • Memory inefficiency: Files are fully loaded into memory via _get_lines(), then parsed, resulting in duplicate data (raw text + parsed structures)
  • Inconsistent APIs: Each parser has its own conventions for attributes, methods, and initialization
  • No centralized version handling: Version-specific parsing logic is scattered across parsers and hard to maintain
  • Limited extensibility: Adding support for new file formats or LOBSTER versions requires significant code changes
  • Poor separation of concerns: Parsing, data storage, filtering, and serialization logic are intertwined, making the code difficult to test, maintain, and extend

Key Changes

New Base Class Architecture

LobsterFile (MSONable)
  ├── Version processor registry via @version_processor decorator
  ├── Streaming via iterate_lines() generator
  ├── Unified serialization (as_dict/from_dict)
  └── LobsterInteractionsHolder
        ├── Shared filtering API (get_interactions_by_properties, get_data_by_properties)
        ├── COXXCAR (COHPCAR, COOPCAR, COBICAR)
        ├── ICOXXLIST (ICOHPLIST, ICOOPLIST, ICOBILIST)
        └── NcICOBILIST

New Features

  • NcICOBILIST orbital-wise parsing: The new NcICOBILIST parser now fully supports orbital-resolved multi-center COBI data, which was previously ignored with a warning
  • Explicit version specification: Users working with older LOBSTER files can now specify the version directly at initialization:
    obj = ICOHPLIST("old_file.lobster", lobster_version="3.1.0")
  • Streaming file parsing: Large files (Wavefunction, BandOverlaps, COXXCAR, BWDF) are now parsed via streaming or two-pass approaches with pre-allocated numpy arrays, eliminating the 2-3x memory overhead from keeping both raw text and parsed data in memory
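
The two-pass idea described above can be sketched with the standard library alone. This is a minimal illustration, not the code in this PR: the name iterate_lines is borrowed from the architecture diagram, but its body here is an assumption, parse_column is a hypothetical helper, and a stdlib array stands in for the pre-allocated numpy arrays the PR uses.

```python
import os
import tempfile
from array import array
from collections.abc import Iterator


def iterate_lines(path: str) -> Iterator[str]:
    """Yield non-empty lines one at a time; the file is never fully loaded."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            stripped = line.strip()
            if stripped:
                yield stripped


def parse_column(path: str, column: int) -> array:
    """Hypothetical two-pass parser for one whitespace-separated column."""
    # Pass 1: count the data rows so the output can be allocated once.
    n_rows = sum(1 for _ in iterate_lines(path))
    values = array("d", [0.0]) * n_rows  # pre-allocated buffer of doubles
    # Pass 2: stream the file again and fill the buffer in place.
    for row, line in enumerate(iterate_lines(path)):
        values[row] = float(line.split()[column])
    return values


# Small demonstration on a throwaway file with a blank line in the middle.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.dat")
    with open(demo_path, "w", encoding="utf-8") as handle:
        handle.write("-1.0 0.25\n\n0.0 0.50\n1.0 0.75\n")
    energies = parse_column(demo_path, column=0)
    densities = parse_column(demo_path, column=1)
```

Reading the file twice trades some I/O for a flat memory profile: only one line of raw text is alive at any moment, and the parsed buffer is never resized.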

Version Processor System

class COXXCAR(LobsterInteractionsHolder):
    @version_processor(min_version="5.1")
    def parse_file(self) -> None:
        """Modern format parser."""
        ...

    @version_processor(max_version="5.0")
    def parse_file_legacy(self) -> None:
        """Legacy format parser."""
        ...

Version processors are automatically registered via __init_subclass__ and selected at runtime based on file version detection or user-specified version.
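
For readers unfamiliar with this pattern, here is a minimal, self-contained sketch of how a decorator plus __init_subclass__ can build such a registry. It is an assumption about the general mechanism, not the implementation in versioning.py: VersionedFile, process, and _version_range are hypothetical names, and the version comparison rule (compare only as many components as the bound specifies) is a simplification chosen for this example.

```python
from __future__ import annotations


def _as_tuple(version: str) -> tuple[int, ...]:
    """Turn '5.1.0' into (5, 1, 0) for component-wise comparison."""
    return tuple(int(part) for part in version.split("."))


def version_processor(min_version: str | None = None, max_version: str | None = None):
    """Tag a method with the version range it can handle (hypothetical sketch)."""
    def decorate(func):
        func._version_range = (min_version, max_version)
        return func
    return decorate


class VersionedFile:
    """Hypothetical base class that collects tagged methods per subclass."""

    _processors: list = []

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        # Only methods defined directly on the subclass are collected here.
        cls._processors = [
            member for member in vars(cls).values()
            if callable(member) and hasattr(member, "_version_range")
        ]

    def __init__(self, version: str):
        self.version = version

    def process(self):
        """Run the first registered processor whose range covers the version."""
        current = _as_tuple(self.version)
        for func in self._processors:
            low, high = func._version_range
            # Compare only as many components as the bound specifies,
            # so that 5.0.2 still counts as "at most 5.0".
            if low is not None and current[: len(_as_tuple(low))] < _as_tuple(low):
                continue
            if high is not None and current[: len(_as_tuple(high))] > _as_tuple(high):
                continue
            return func(self)
        raise ValueError(f"No processor registered for version {self.version}")


class Demo(VersionedFile):
    @version_processor(min_version="5.1")
    def parse_file(self) -> str:
        return "modern"

    @version_processor(max_version="5.0")
    def parse_file_legacy(self) -> str:
        return "legacy"
```

With this sketch, Demo("5.1.0").process() dispatches to parse_file, while Demo("3.1.0").process() dispatches to parse_file_legacy. The benefit of the pattern is that adding support for a new format means adding one decorated method rather than editing a central if/else chain.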

Interaction Filtering API

The get_interactions_by_properties and get_data_by_properties methods provide filtering across all interaction-based parsers (COXXCAR, ICOXXLIST, NcICOBILIST):

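# Import paths for ICOHPLIST and NcICOBILIST are assumed to mirror COHPCAR
from pymatgen.electronic_structure.core import Spin
from pymatgen.io.lobster.future import COHPCAR, ICOHPLIST, NcICOBILIST

cohpcar = COHPCAR("COHPCAR.lobster")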

# Get all Fe-O interactions with bond lengths between 1.8 and 2.2 Å
fe_o_bonds = cohpcar.get_interactions_by_properties(
    centers=["Fe", "O"],
    length=(1.8, 2.2)
)

# One specific center and all oxygens between 1.8 and 2.2 Å
fe_o_bonds = cohpcar.get_interactions_by_properties(
    centers=["Fe3", "O"],
    length=(1.8, 2.2)
)

# Get all d-d interactions (d orbital on BOTH sides)
# Passing duplicate values requires ALL to match
d_d_interactions = cohpcar.get_interactions_by_properties(
    orbitals=["d", "d"]
)

# Get only s-p interactions
s_p_interactions = cohpcar.get_interactions_by_properties(
    orbitals=["s", "p"]
)

# Combine multiple filters: Fe-Fe d-d interactions, 2.4-2.6 Å, spin up only
fe_fe_d_d_data = cohpcar.get_data_by_properties(
    centers=["Fe", "Fe"],
    orbitals=["d", "d"],
    length=(2.4, 2.6),
    spins=[Spin.up],
    data_type="cohp"
)

# Get integrated COHP (ICOHP) for specific bond indices
icohp_data = cohpcar.get_data_by_properties(
    indices=[1, 2, 3],
    data_type="icoxx"
)

# Works identically for ICOXXLIST files
icohplist = ICOHPLIST("ICOHPLIST.lobster")
strong_bonds = icohplist.get_interactions_by_properties(
    centers=["Fe"],  # Any interaction involving Fe
    length=(0.0, 2.5)
)

# Generate LOBSTER-compatible, human-readable labels for interactions
for interaction in strong_bonds:
    label = ICOHPLIST.get_label_from_interaction(
        interaction,
        include_centers=True,
        include_orbitals=True,
        include_cells=True,
        include_length=True
    )
    # Output: "Fe1[3d_xy]->O2[0 0 1][2p_x](1.982)"

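# Multi-center files use the same filters (file name assumed for illustration)
ncicobi = NcICOBILIST("NcICOBILIST.lobster")

# Find all interactions involving at least 3 Fe atoms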
fe_fe_fe = ncicobi.get_interactions_by_properties(
    centers=["Fe", "Fe", "Fe"]
)

# Same works for orbitals on multi-center interactions
d_d_p = ncicobi.get_interactions_by_properties(
    orbitals=["d", "d", "p"]
)
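
The "duplicates require ALL to match" rule for centers can be read as multiset matching: every requested pattern must be satisfied by a distinct center of the interaction. The sketch below illustrates that reading only. centers_match and split_label are hypothetical helpers, and the actual matching logic in LobsterInteractionsHolder may differ.

```python
import re
from collections import Counter

# Site labels are assumed to look like "Fe3": element symbol plus optional index.
_LABEL = re.compile(r"^([A-Z][a-z]?)(\d*)$")


def split_label(label: str) -> tuple[str, str]:
    """Split 'Fe3' into ('Fe', '3'); a bare element gives ('Fe', '')."""
    match = _LABEL.match(label)
    if match is None:
        raise ValueError(f"Unrecognized site label: {label!r}")
    return match.group(1), match.group(2)


def centers_match(centers: list[str], wanted: list[str]) -> bool:
    """Return True if every wanted pattern is covered by a distinct center."""
    remaining = Counter(centers)   # exact labels still available
    element_needed = Counter()     # bare-element patterns, tallied for later
    for pattern in wanted:
        element, index = split_label(pattern)
        if index:
            # A specific site such as "Fe3" must consume that exact label.
            if remaining[pattern] == 0:
                return False
            remaining[pattern] -= 1
        else:
            element_needed[element] += 1
    # Bare elements may use any leftover center of the right element.
    available = Counter()
    for label, count in remaining.items():
        available[split_label(label)[0]] += count
    return all(available[el] >= n for el, n in element_needed.items())
```

Under this reading, centers_match(["Fe1", "O2"], ["Fe"]) is true (any interaction involving Fe), centers_match(["Fe1", "O2"], ["Fe", "Fe"]) is false, and a three-center interaction needs three Fe sites to satisfy ["Fe", "Fe", "Fe"].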

New Module Structure

pymatgen/io/lobster/future/
├── __init__.py
├── constants.py          # LOBSTER_VERSION, LOBSTER_ORBITALS
├── core.py               # LobsterFile, LobsterInteractionsHolder
├── types.py              # TypedDicts for type safety
├── utils.py              # Shared utilities
├── versioning.py         # @version_processor decorator
└── outputs/
    ├── __init__.py
    ├── bands.py          # BandOverlaps, Fatband, Fatbands
    ├── charges.py        # Charge, Grosspop
    ├── coxxcar.py        # COXXCAR, COHPCAR, COOPCAR, COBICAR
    ├── dos.py            # DOSCAR
    ├── icoxxlist.py      # ICOXXLIST, ICOHPLIST, etc., NcICOBILIST
    ├── lobsterout.py     # LobsterOut
    └── misc.py           # Wavefunction, MadelungEnergies, SitePotentials, etc.

Migration Path

The existing pymatgen.io.lobster.outputs module remains unchanged for backward compatibility. Users can migrate gradually:

# Old (still works)
from pymatgen.io.lobster.outputs import Cohpcar

# New
from pymatgen.io.lobster.future import COHPCAR

Testing

  • 63 tests covering all file types
  • Spin-polarized and non-spin-polarized files
  • Orbital-resolved and non-orbital-resolved parsing
  • LCFO format variants
  • Version-specific parsing (legacy vs modern)
  • Complete JSON serialization round trips (MSONable interface)
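
A serialization round trip of the kind listed above checks that an object survives as_dict, JSON encoding, JSON decoding, and from_dict unchanged. The following is a generic sketch of that check with a hypothetical stand-in class (TinyParser). It does not use MSONable itself or any of the parsers in this PR.

```python
import json


class TinyParser:
    """Hypothetical stand-in for a parser exposing as_dict/from_dict."""

    def __init__(self, filename: str, energies: list[float]):
        self.filename = filename
        self.energies = list(energies)

    def as_dict(self) -> dict:
        # Only JSON-friendly types go into the dictionary.
        return {"filename": self.filename, "energies": self.energies}

    @classmethod
    def from_dict(cls, d: dict) -> "TinyParser":
        return cls(d["filename"], d["energies"])


original = TinyParser("COHPCAR.lobster", [-1.0, 0.0, 1.0])
# Round trip: object -> dict -> JSON text -> dict -> object
restored = TinyParser.from_dict(json.loads(json.dumps(original.as_dict())))
```

Comparing restored.as_dict() with original.as_dict() is then enough to detect fields that were dropped or changed type along the way.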

Breaking Changes

None for existing code. The new module is in pymatgen.io.lobster.future and does not modify existing APIs.

Future Work

  • Gather feedback from users, if possible, and adjust the API accordingly
  • Deprecate the old parsers in favor of the new ones if the community is interested
  • Performance benchmarks on large files...?

@JaGeo @naik-aakash

@shyuep shyuep merged commit cc57da9 into materialsproject:master Feb 24, 2026
31 of 44 checks passed
@shyuep
Member

shyuep commented Feb 24, 2026

Thanks!
