TL;DR: Let's make cclib closer to a compiler and improve separation of concerns in the process!
This is not an April Fools' joke.
In my view, some features are hard to implement with the current architecture. This is mainly due to a sort of information duplication shared by all parsers: they all have to know about the final data structure, which is part of the public API. Why is this duplication a problem? Because if we ever want to change ccData, we have to change every single parser.
For instance, #89 or #988 would require updating all parsers just to change units in ccData. It seems reasonable to me that parsers should have no responsibility for units in ccData beyond the ones output by each computational chemistry package. To me, this is a case of double duty.
Even if we don't change its structure, manipulating ccData before returning it has to be done at the parser level. This makes it harder to implement things like #50 or #789 (see also this comment by @awvwgk), since these would require parsing a second file, which would return another ccData that would then have to be merged within the requesting parser itself. So parsers would have to both consume and produce ccData, which is double duty as well.
An example of that can be found in the excellent work of @shivupa in #1076. Ideally, we would like to reuse parsers (which would help with #808 and #1012, for instance), but as they accumulate more and more responsibilities (and complexity, and statefulness), this becomes harder and harder.
To solve those issues, we could use some ideas from modern compiler theory.
What I'm proposing
This is related to #335.
I propose creating the ccData object outside parsers, similar to how most compilers work. Our parsers become tokenizers, i.e., iterators emitting tokens (a.k.a. events) as the log file is read. A package-agnostic object then consumes this token stream and produces a ccData object. The main point is that iterators are easy to compose, so multi-job and multi-file support can be implemented in a general way.
What would change? Code like this, from cclib/cclib/parser/orcaparser.py (lines 736 to 738 at 5798fdc),

```python
self.set_attribute('natom', len(atomnos))
self.set_attribute('atomnos', atomnos)
self.append_attribute("atomcoords", atomcoords)
```

would become

```python
yield {"set_attribute": "natom", "value": len(atomnos)}
yield {"set_attribute": "atomnos", "value": atomnos}
yield {"append_attribute": "atomcoords", "value": atomcoords}
```

(Or something equivalent with a dedicated class for tokens.) As long as the token stream consumer produces a valid ccData, nothing would break, but composability greatly increases.
For instance, we could transform the program-detection triggers (lines 79 to 100 at 693bd40),

```python
triggers = [
    (ADF, ["Amsterdam Density Functional"], True),
    (DALTON, ["Dalton - An Electronic Structure Program"], True),
    (FChk, ["Number of atoms", "I"], True),
    (GAMESS, ["GAMESS"], False),
    (GAMESS, ["GAMESS VERSION"], True),
    (GAMESSUK, ["G A M E S S - U K"], True),
    (Gaussian, ["Gaussian, Inc."], True),
    (Jaguar, ["Jaguar"], True),
    (Molcas, ["MOLCAS"], True),
    (Molpro, ["PROGRAM SYSTEM MOLPRO"], True),
    (Molpro, ["1PROGRAM"], False),
    (MOPAC, ["MOPAC20"], True),
    (NWChem, ["Northwest Computational Chemistry Package"], True),
    (ORCA, ["O R C A"], True),
    (Psi3, ["PSI3: An Open-Source Ab Initio Electronic Structure Package"], True),
    (Psi4, ["Psi4: An Open-Source Ab Initio Electronic Structure Package"], True),
    (QChem, ["A Quantum Leap Into The Future Of Chemistry"], True),
    (Turbomole, ["TURBOMOLE"], True),
]
```

into a dispatch that merely selects which tokenizer's stream feeds the shared consumer.
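As a rough sketch of that dispatch (everything here is hypothetical: no cclib parser has a `tokenize` generator today, and the matching below simplifies the real trigger logic):

```python
from typing import Iterable, Iterator, Optional

# Hypothetical sketch: `tokenize` methods do not exist in cclib yet, and the
# matching below is simplified (the do_break flag is ignored).


def guess_program(lines: Iterable[str], triggers) -> Optional[type]:
    """Return the first parser class whose trigger phrases all occur in one line."""
    for line in lines:
        for parser, phrases, _do_break in triggers:
            if all(phrase in line for phrase in phrases):
                return parser
    return None


def tokenize(lines: list, triggers) -> Iterator[dict]:
    """Detect the program that wrote `lines`, then delegate to its tokenizer."""
    parser = guess_program(lines, triggers)
    if parser is None:
        raise ValueError("could not determine which program wrote this output")
    yield from parser.tokenize(lines)  # the assumed per-parser generator
```

A driver for nested or multi-job logs could call this once per detected job and chain the resulting streams, in line with the examples below.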
How does this solve anything?
Examples:
- Dedicated token stream consumers (i.e., the actual parser) could filter tokens and only produce user-requested data or format (Refactor parsers #335).
- Token streams can be manipulated before the ccData is assembled. For instance, units could be requested by the caller (Units for parsed energies #998); see the sketch after this list.
- Nested and multiple-job log files become easier by chaining/interspersing token streams from different sources (Multiple jobs in one output file #1076).
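As promised above, a unit conversion can be written as a pure filter over the token stream. A minimal sketch, assuming the dict-shaped tokens from earlier and energies arriving in eV (cclib's current convention); `to_atomic_units` is a hypothetical name:

```python
from typing import Iterable, Iterator

EV_TO_HARTREE = 1.0 / 27.211386245988  # 1 hartree = 27.211386245988 eV


def to_atomic_units(tokens: Iterable[dict]) -> Iterator[dict]:
    """Hypothetical filter: rescale energy tokens, pass everything else through."""
    energy_attributes = {"scfenergies", "mpenergies", "ccenergies"}
    for token in tokens:
        name = token.get("set_attribute") or token.get("append_attribute")
        if name in energy_attributes:
            # assumes scalar values; array-valued tokens would need elementwise scaling
            token = {**token, "value": token["value"] * EV_TO_HARTREE}
        yield token
```

The whole pipeline would then compose as `assemble(to_atomic_units(tokenize(lines, triggers)))`, with each stage testable in isolation.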