
A (light) refactor inspired by modern compiler theory #1124

@schneiderfelipe

Description


TL;DR: Let's make cclib closer to a compiler and improve separation of concerns in the process!
This is not an April Fools' joke.

In my view, some features are hard to implement with the current architecture. This is mainly due to a sort of information duplication shared by all parsers: they all have to know about the final data structure, which is part of the public API. Why call this information duplication? Because if we ever want to change ccData, we have to change every single parser.

For instance, #89 or #988 would require updating all parsers just to change units in ccData. And it seems reasonable to me that parsers should have no responsibility for the units in ccData (beyond those output by each computational chemistry package). To me, this is a case of double duty.

Even if we don't change its structure, any manipulation of ccData before returning it has to happen at the parser level. This makes it harder to implement things like #50 or #789 (see also this comment by @awvwgk), since those would require parsing a second file, which would produce another ccData that would then have to be merged within the requesting parser itself. So parsers would have to both consume and produce ccData, which seems to be double duty as well.

An example of that can be found in the excellent work of @shivupa in #1076. Ideally, we would like to reuse parsers (which would help with #808 and #1012, for instance), but as they accumulate more and more responsibilities (and complexity, and statefulness), this becomes harder and harder.

To solve those issues, we could use some ideas from modern compiler theory.

What I'm proposing

This is related to #335.

I propose creating the ccData object outside parsers. This is similar to how most compilers work. Our parsers become tokenizers, i.e., iterators emitting tokens (a.k.a. events) as the log file is read. A package-agnostic object then consumes this token stream and produces a ccData object. The main point is that iterators are easy to compose, so multi-job/multi-file can be done in a general way.
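As a minimal sketch of this idea (the trigger line and the dict-shaped tokens are made up for illustration, not cclib's actual API), a tokenizer is just a generator over log-file lines, and multi-job/multi-file parsing reduces to iterator composition:

```python
from itertools import chain

def tokenize(lines):
    """Hypothetical tokenizer: scan log-file lines and yield one token
    (event) per recognized piece of information."""
    for line in lines:
        if line.startswith("Number of atoms:"):
            yield {"set_attribute": "natom",
                   "value": int(line.split(":")[1])}

# Because tokenizers are plain iterators, a multi-job run is just a
# chain of per-file token streams:
job1 = ["Number of atoms: 3\n"]
job2 = ["Number of atoms: 5\n"]
tokens = list(chain(tokenize(job1), tokenize(job2)))
```

Here `chain` stitches the two streams together without either tokenizer knowing about the other, which is exactly the kind of composition that is hard to get with stateful parsers.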

What would change? Code like

self.set_attribute("natom", len(atomnos))
self.set_attribute("atomnos", atomnos)
self.append_attribute("atomcoords", atomcoords)
would change to

    yield {"set_attribute": "natom", "value": len(atomnos)}
    yield {"set_attribute": "atomnos", "value": atomnos}
    yield {"append_attribute": "atomcoords", "value": atomcoords}
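To illustrate the other half, here is a sketch of what a package-agnostic consumer of that token stream could look like, assuming the dict-shaped tokens above (build_data and the plain dict are hypothetical stand-ins for the real ccData machinery):

```python
def build_data(tokens):
    """Hypothetical consumer: fold a token stream into an attribute
    dictionary, standing in for constructing a real ccData object."""
    attributes = {}
    for token in tokens:
        if "set_attribute" in token:
            attributes[token["set_attribute"]] = token["value"]
        elif "append_attribute" in token:
            attributes.setdefault(token["append_attribute"], [])
            attributes[token["append_attribute"]].append(token["value"])
    return attributes

data = build_data([
    {"set_attribute": "natom", "value": 2},
    {"append_attribute": "atomcoords",
     "value": [[0.0, 0.0, 0.0], [0.0, 0.0, 1.1]]},
])
```

Note that only this one consumer needs to know the shape of ccData; the tokenizers never touch it.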

(Or something equivalent with a dedicated class for tokens.) As long as the token stream consumer produces a valid ccData, nothing would break, but composability increases greatly. For instance, we could transform

cclib/io/ccio.py, lines 79 to 100 at 693bd40:

triggers = [
(ADF, ["Amsterdam Density Functional"], True),
(DALTON, ["Dalton - An Electronic Structure Program"], True),
(FChk, ["Number of atoms", "I"], True),
(GAMESS, ["GAMESS"], False),
(GAMESS, ["GAMESS VERSION"], True),
(GAMESSUK, ["G A M E S S - U K"], True),
(Gaussian, ["Gaussian, Inc."], True),
(Jaguar, ["Jaguar"], True),
(Molcas, ["MOLCAS"], True),
(Molpro, ["PROGRAM SYSTEM MOLPRO"], True),
(Molpro, ["1PROGRAM"], False),
(MOPAC, ["MOPAC20"], True),
(NWChem, ["Northwest Computational Chemistry Package"], True),
(ORCA, ["O R C A"], True),
(Psi3, ["PSI3: An Open-Source Ab Initio Electronic Structure Package"], True),
(Psi4, ["Psi4: An Open-Source Ab Initio Electronic Structure Package"], True),
(QChem, ["A Quantum Leap Into The Future Of Chemistry"], True),
(Turbomole, ["TURBOMOLE"], True),
]
into a special tokenizer that detects which log file we have, which could trigger a tokenizer switch.
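That detection step can be sketched as a function over the first lines of the file (a simplified, hypothetical version: the boolean do_break flag from the real trigger table is dropped, and the tokenizers are placeholder strings):

```python
def detect(lines, triggers, header=50):
    """Hypothetical detection tokenizer: scan the first `header` lines of
    a log file and return the first entry whose trigger phrases all
    appear, mirroring the trigger table in cclib/io/ccio.py."""
    head = lines[:header]
    for tokenizer, phrases in triggers:
        if all(any(phrase in line for line in head) for phrase in phrases):
            return tokenizer
    return None

# Placeholder triggers; the real table maps phrases to parser classes.
triggers = [
    ("gaussian_tokenizer", ["Gaussian, Inc."]),
    ("orca_tokenizer", ["O R C A"]),
]
found = detect(["* O R C A *\n", "..."], triggers)
```

Once the right tokenizer is found, the detection pass could simply hand the remaining line stream over to it, i.e., trigger the tokenizer switch mentioned above.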

How does this solve anything?

Examples:
