TL;DR: Let's make cclib closer to a compiler and improve separation of concerns in the process!
This is not an April Fools' joke.
In my view, some features are hard to implement with the current architecture. This is mainly due to a sort of information duplication shared by all parsers: they all have to know about the final data structure, which is part of the public API. Why is this duplication a problem? Because if we ever want to change ccData, we have to change every single parser.
For instance, #89 or #988 would require updating all parsers just to change units in ccData. It seems reasonable to me that parsers should have no responsibility for units in ccData beyond the ones output by each computational chemistry package. To me, this is a case of double duty.
Even if we don't change its structure, manipulating ccData before returning it has to be done at the parser level. This makes it harder to implement things like #50 or #789 (see also this comment by @awvwgk), since these would require parsing a second file, which would return another ccData that would then have to be merged within the requesting parser itself. So parsers would have to both consume and produce ccData, which is double duty as well.
An example of that can be found in the excellent work of @shivupa in #1076. Ideally, we would like to reuse parsers (which would help with #808 and #1012, for instance), but as they accumulate more and more responsibilities (and complexity, and statefulness), this becomes harder and harder.
To solve those issues, we could use some ideas from modern compiler theory.
What I'm proposing
This is related to #335.
I propose creating the ccData object outside parsers, similar to how most compilers work. Our parsers become tokenizers, i.e., iterators emitting tokens (a.k.a. events) as the log file is read. A package-agnostic object then consumes this token stream and produces a ccData object. The main point is that iterators are easy to compose, so multi-job and multi-file support can be implemented in a general way.
What would change? Code like this, from cclib/cclib/parser/orcaparser.py (lines 736 to 738 at 5798fdc),

```python
self.set_attribute('natom', len(atomnos))
self.set_attribute('atomnos', atomnos)
self.append_attribute("atomcoords", atomcoords)
```

would become

```python
yield {"set_attribute": "natom", "value": len(atomnos)}
yield {"set_attribute": "atomnos", "value": atomnos}
yield {"append_attribute": "atomcoords", "value": atomcoords}
```

(Or something equivalent with a dedicated class for tokens.) As long as the token stream consumer produces a valid ccData, nothing would break, but composability greatly increases.
For instance, we could transform the program-detection triggers (lines 79 to 100 at 693bd40),

```python
triggers = [
    (ADF, ["Amsterdam Density Functional"], True),
    (DALTON, ["Dalton - An Electronic Structure Program"], True),
    (FChk, ["Number of atoms", "I"], True),
    (GAMESS, ["GAMESS"], False),
    (GAMESS, ["GAMESS VERSION"], True),
    (GAMESSUK, ["G A M E S S - U K"], True),
    (Gaussian, ["Gaussian, Inc."], True),
    (Jaguar, ["Jaguar"], True),
    (Molcas, ["MOLCAS"], True),
    (Molpro, ["PROGRAM SYSTEM MOLPRO"], True),
    (Molpro, ["1PROGRAM"], False),
    (MOPAC, ["MOPAC20"], True),
    (NWChem, ["Northwest Computational Chemistry Package"], True),
    (ORCA, ["O R C A"], True),
    (Psi3, ["PSI3: An Open-Source Ab Initio Electronic Structure Package"], True),
    (Psi4, ["Psi4: An Open-Source Ab Initio Electronic Structure Package"], True),
    (QChem, ["A Quantum Leap Into The Future Of Chemistry"], True),
    (Turbomole, ["TURBOMOLE"], True),
]
```

into a dispatch that merely selects which tokenizer's stream feeds the shared consumer.
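As a rough sketch of that dispatch (everything here is hypothetical: no cclib parser has a `tokenize` generator today, and the matching below simplifies the real trigger logic):

```python
from typing import Iterable, Iterator, Optional

# Hypothetical sketch: `tokenize` methods do not exist in cclib yet, and the
# matching below is simplified (the do_break flag is ignored).


def guess_program(lines: Iterable[str], triggers) -> Optional[type]:
    """Return the first parser class whose trigger phrases all occur in one line."""
    for line in lines:
        for parser, phrases, _do_break in triggers:
            if all(phrase in line for phrase in phrases):
                return parser
    return None


def tokenize(lines: list, triggers) -> Iterator[dict]:
    """Detect the program that wrote `lines`, then delegate to its tokenizer."""
    parser = guess_program(lines, triggers)
    if parser is None:
        raise ValueError("could not determine which program wrote this output")
    yield from parser.tokenize(lines)  # the assumed per-parser generator
```

A driver for nested or multi-job logs could call this once per detected job and chain the resulting streams, in line with the examples below.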
How does this solve anything?
Examples:
- Dedicated token stream consumers (i.e., the actual parser) could filter tokens and only produce user-requested data or format (Refactor parsers #335).
- Token streams can be manipulated before the ccData is assembled. For instance, units could be requested by the caller (Units for parsed energies #998); see the sketch after this list.
- Nested and multiple-job log files become easier by chaining/interspersing token streams from different sources (Multiple jobs in one output file #1076).
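As promised above, a unit conversion can be written as a pure filter over the token stream. A minimal sketch, assuming the dict-shaped tokens from earlier and energies arriving in eV (cclib's current convention); `to_atomic_units` is a hypothetical name:

```python
from typing import Iterable, Iterator

EV_TO_HARTREE = 1.0 / 27.211386245988  # 1 hartree = 27.211386245988 eV


def to_atomic_units(tokens: Iterable[dict]) -> Iterator[dict]:
    """Hypothetical filter: rescale energy tokens, pass everything else through."""
    energy_attributes = {"scfenergies", "mpenergies", "ccenergies"}
    for token in tokens:
        name = token.get("set_attribute") or token.get("append_attribute")
        if name in energy_attributes:
            # assumes scalar values; array-valued tokens would need elementwise scaling
            token = {**token, "value": token["value"] * EV_TO_HARTREE}
        yield token
```

The whole pipeline would then compose as `assemble(to_atomic_units(tokenize(lines, triggers)))`, with each stage testable in isolation.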