Major reorganization of pymatgen repo #4595
Conversation
---
@DanielYang59 If you'd like, pls review this. If there are no further comments from anyone by Thursday, I will merge into master.
---
@shyuep We will discuss this shortly internally and will try to get back to you by tomorrow.
---
@shyuep To make the move less painful, we would ideally get #4148 merged first. We assume that the failing tests are related to the recent changes in the repo. Additionally, from our side, preserving the contribution history would be very important, also for these modules. We really love to contribute to open-source software on a voluntary basis, including to software maintained by other groups. The visibility of the contributions is, at least in my opinion, even more important in the latter case. This is also important with regard to funding justifications. I hope you can understand this. We also plan further improvements to these modules in the future, and we are happy to contribute to pymatgen/pymatgen-core in general.
---
@JaGeo I think one of the difficulties I had was that the lobster and cohp classes depend on chemenv, which I am reluctant to move into the core since this is the only package that relies on it. Is there a refactoring you can do to move the chemenv-dependent portions out of lobster?
---
@shyuep Ah, I see! We could, for example, move https://github.com/materialsproject/pymatgen/blob/master/src/pymatgen/io/lobster/lobsterenv.py to one of the analysis modules. It likely fits better there anyhow. This would be a breaking change. Likely, however, one could find a sensible way to deprecate this as well.
---
Sure, can you please give me a bit more time (until this weekend maybe)? I have some work meetings this week, and I will try to go through the checklist. What should go into
| Module | Count |
|---|---|
| core | 487 |
| analysis | 153 |
| io | 127 |
| entries | 71 |
| symmetry | 54 |
| util | 23 |
| electronic_structure | 16 |
| ext | 15 |
| optimization | 6 |
I guess this should give us a rough estimate of what is "important" to users?
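For illustration only (an assumption about how such counts could be produced, not how the table above was actually generated), here is a small sketch that tallies `pymatgen.<subpackage>` imports in Python source using the stdlib `ast` module:

```python
import ast
from collections import Counter


def count_subpackage_imports(source: str) -> Counter:
    """Count references to pymatgen.<subpackage> in a chunk of Python source."""
    counts: Counter = Counter()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.ImportFrom) and node.module:
            parts = node.module.split(".")
            if parts[0] == "pymatgen" and len(parts) > 1:
                counts[parts[1]] += 1
        elif isinstance(node, ast.Import):
            for alias in node.names:
                parts = alias.name.split(".")
                if parts[0] == "pymatgen" and len(parts) > 1:
                    counts[parts[1]] += 1
    return counts


sample = """
from pymatgen.core import Structure
from pymatgen.core.lattice import Lattice
from pymatgen.analysis.local_env import CrystalNN
import pymatgen.io.vasp
"""
print(count_subpackage_imports(sample))
# Counter({'core': 2, 'analysis': 1, 'io': 1})
```

Running something like this over a corpus of downstream code would yield a table of the shape shown above.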
The current core package
- I could see `alchemy` is included, but as far as I can see there has been no update here in the past two years
- `command_line` is also included; I'm a bit unsure if it should be "core" to pymatgen, because these are mostly wrappers for command-line programs (maybe their position is a bit similar to `io`, so they're also moved?)
- I could see there are some visualization/plotting things in `util`; I guess they should be moved back?
The non-core side
- surely you're discussing this, just adding it to my checklist: `electronic_structure/cohp` and lobster are in the non-core package
- I guess `entries` is closer to `analysis`, so it's not moved?
Should test files be separated?
I'm really unsure about this, because it would make it hard for people to contribute: they would need to split one atomic change across two places (i.e. if I want to add a test that needs a test file, I have to get the file merged first). Also, I don't think this brings much benefit to the core side, because most of the big test files come from there (mostly electronic_structure and io):
```
# Core
 12K  alchemy/
388K  cif/
 29M  command_line/
212K  core/
143M  electronic_structure/
313M  io/
 12K  performance/
4.3M  phonon/
 16M  surfaces/
 36K  symmetry/
 12K  transformations/
# Non-core
380K  apps/
 60M  analysis/
1.0M  entries/
```
Some other important things
- git history on the core package needs to be rewritten to preserve history (cc @esoteric-ephemera): Align history with pymatgen pymatgen-core#1
  - also git history on test files
- I would check that all current pymatgen APIs are still available in the new layout:
  - core APIs
  - non-core APIs
  - relocated APIs are available with a deprecation warning
About versioning...
I could see the first pymatgen-core release was published using calendar versioning, and I guess we shouldn't jump back and forth between semantic and calendar versioning on the pymatgen side (otherwise pymatgen would need to use version epochs like 1!5.0.0, which is quite awkward), but I still want to ask whether we should consider starting pymatgen-core with semantic versioning. As far as I know, calendar versioning makes it a bit easier for us to decide on the version number, but to the user/tooling side it doesn't communicate compatibility, i.e. if we are currently on version 1.0, a user could pin >=1.0,<2 to avoid breaking changes, which is not possible with calendar versioning.
Side note: I checked the top 10 most downloaded packages on PyPI in the past year, and only one (certifi) uses calendar versioning.
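To illustrate the pinning argument, here is a toy sketch of range pinning (a deliberate simplification; real PEP 440 resolution also handles pre-releases, epochs, local versions, etc.):

```python
def parse(version: str) -> tuple:
    """Parse a dotted version string into a comparable tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def satisfies_pin(version: str, lower: str, upper: str) -> bool:
    """Check lower <= version < upper, the shape of a '>=1.0,<2' pin."""
    return parse(lower) <= parse(version) < parse(upper)


# Under semantic versioning, the pin communicates API compatibility:
print(satisfies_pin("1.4.2", "1.0", "2"))  # True: minor/patch updates allowed
print(satisfies_pin("2.0.0", "1.0", "2"))  # False: a major bump signals breakage

# Under calendar versioning, the same mechanics work, but the boundary is a
# date, not an API contract, so the pin says nothing about breakage:
print(satisfies_pin("2025.6.15", "2025.1.1", "2026"))  # True
```

The mechanical difference is nil; the semantic difference is that only semver gives the `<2` boundary a compatibility meaning.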
Some other non-urgent things that could be easily fixed with PRs
- The dependencies in both places need to be cleaned up, especially on the core side
- The non-core repo should probably add `core` as a workspace member for easy development
- After `core` is extracted, where should the API docs live?
- Changes to the `core` repo should probably trigger the non-core pipeline (similar to atomate2) under some condition (e.g. when a commit goes into `main`)?
Sure. The reason why I want this done fast is to avoid divergence from PRs.
Yes. But there are cross-dependencies. For example, command_line is used extensively in transformations (enumlib). So that will stay. Alchemy is definitely old code and we can consider moving it back into non-core (I seem to remember an alchemy dependency somewhere else, though).
Entries have been moved to core. The only Entry class not moved is things related to compatibility.
This needs to be balanced against the cost of cloning. Quite a number of people actually clone pymatgen directly from GitHub on HPCs. Those do not need the test files. My clone of pymatgen with test files from Singapore took 30 mins yesterday. Devs actually clone with recursive to get the test files without a separate command. When they add a new test file, they can test locally without problems. The only extra step is a separate PR for the test files.
To resolve in a separate PR.
Honestly, this is really not important. People want credit for the code and tests. But for putting in a VASP output file for testing? Really? But sure, if someone does a PR on the git history, I will merge it.
That is tested. As you can see, I retained all analysis and entries tests for now. Those were the only things I moved to pymatgen/core. The location of everything else remains the same.
Too late to move back to semantic versioning, regardless of the arguments (which I don't believe in any case). And it will be hellishly confusing to ask users to go from installing pymatgen 2025.1.1 to pymatgen-core 1.0. We will close this discussion.
Sure.
We will wait on this. I am not sure many people do dev on both core and non-core codes. (https://docs.astral.sh/uv/concepts/projects/workspaces/)
pymatgen.org, which will still be the pymatgen (non-core) repo.
Not sure about this. Maybe. But pymatgen is going to use core as a dependency, which is pinned. Any errors will be tested when that pin gets updated.
---
Re git history on test files, this should already be partially taken care of. It can be useful for understanding test regressions. I would exclude
Will also try to take a closer look at the split and give further comments.
---
Just as a follow-up: @naik-aakash addressed the move to the analysis module (thanks!)
Thanks, fully understood.
I see, looks like the warning might be missing from the non-core stub?

```
uv run --with git+https://github.com/materialyzeai/pymatgen.git@master python -c "from pymatgen.analysis.elasticity.strain import Deformation"
<string>:1: DeprecationWarning: This module has been moved to the pymatgen.core.elasticity module. This stub will be removed v2027.1.

uv run --with git+https://github.com/materialyzeai/pymatgen.git@master python -c "from pymatgen.entries import Entry"
>> no warning
```
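As a point of reference, here is a self-contained sketch of the relocated-module stub pattern (simulated; this is not the actual pymatgen stub code). It also demonstrates that `DeprecationWarning` is ignored by Python's default warning filters outside of `__main__` and test runners, which is one way a stub's warning can silently "go missing":

```python
import warnings


def load_stub() -> None:
    """Simulate importing a relocated-module stub: warn once at import time.

    In a real stub, this warning would sit at module top level (e.g. in the
    old package's __init__.py), followed by re-exports from the new home.
    """
    warnings.warn(
        "This module has been moved to pymatgen-core; "
        "this stub will be removed in a future release.",
        DeprecationWarning,
        stacklevel=2,
    )


# Surface the warning explicitly for the demo; under default filters a
# DeprecationWarning raised from an imported module is typically suppressed.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    load_stub()

print(caught[0].category.__name__)  # DeprecationWarning
```

So when testing stub warnings, it is worth running with `-W always::DeprecationWarning` (or equivalent) to rule out filter suppression before concluding the warning is absent.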
That's a solid reason, and I think we have options other than separating the files into another repo. For example, I usually do a partial clone (I cannot remember the full command, so I just use my clipboard manager to fuzzy-search it):

```shell
# I used Andrew's fork because the test files are still there
# Clone the complete repo
time git clone --no-checkout --depth 1 https://github.com/Andrew-S-Rosen/pymatgen
>> Receiving objects: 100% (263529/263529), 937.70 MiB | 18.33 MiB/s, done.
>> git clone https://github.com/Andrew-S-Rosen/pymatgen  40.44s user 11.79s system 91% cpu 57.107 total

# Shallow clone
time git clone https://github.com/Andrew-S-Rosen/pymatgen --depth 1
>> Receiving objects: 100% (2818/2818), 249.23 MiB | 12.93 MiB/s, done.
>> git clone https://github.com/Andrew-S-Rosen/pymatgen --depth 1  10.70s user 4.00s system 60% cpu 24.249 total

# Partial shallow clone
git clone --filter=blob:none --no-checkout --depth 1 https://github.com/Andrew-S-Rosen/pymatgen && cd pymatgen
git sparse-checkout set --no-cone '/*' '!/tests/files/'
time git checkout
>> Receiving objects: 100% (1094/1094), 12.84 MiB | 7.53 MiB/s, done.
>> git checkout  1.00s user 0.44s system 22% cpu 6.330 total
```
I think the extra step would actually break the connection between a test and its test files, i.e. one might later modify only the test or only the test file, and then it would be hard to figure out the relationship between them.
For me, I think the reason is more about tracking "metadata"... when I submit a PR I usually try to include the reason/data source in the commit message (or PR description); if git history is wiped, it would be quite hard to find when/why a certain file was added/changed.
I'm not aware of any way to rewrite git history without a force push, but I believe Aaron is working on this.
Sure, it's not urgent, and after this refactor I could play around with this (and other small improvements).
---
Does anyone still use
Haven't had a chance to review all of this yet; would also like the opportunity to do so.
---
Agreed, but I am setting a hard deadline of this weekend. Speak your mind by then or forever hold your peace.
@shyuep I guess we should have a warning for the
---
@shyuep what was the nature of the deadline? Are follow-up PRs OK, or is there a reason this structure needs to be frozen?
---
I'm actually a little confused. Where is
---
Ok, I found it: https://github.com/materialyzeai/pymatgen-core. I'm confused why this is not in the
It would make sense to keep
Edited to add: I am also concerned about the confusion this will cause for developers by moving to a separate repo. A lot of git history, issues, documentation, etc. will no longer resolve and will be more difficult to find.
Of course the structure is never frozen. Follow-up PRs are fine. I just wanted to make sure the PR is merged in a timely manner.
We discussed this. The expectation is that the current pymatgen repo will mostly function as a meta-package of sorts. The git history issue has already been resolved in pymatgen-core.
---
I understand I am late to the discussion and am only now grasping the full context, but I also understand this is a major change that was done under a short timeframe. There is no mention on the pymatgen.org homepage or in the repo README.md, for example. The documentation has not been updated. We have discussed and explored a
The bigger question I think we need to resolve is the location of
---
Correct. But at the present moment, there are 0 implications for anyone using pymatgen. There are implications for developers only. Any other user doing
I want to separate out the two questions: 1) What is technically the best structure, monorepo or multi-repo, especially given the huge technical debt (a full clone of pymatgen with history took 30 mins for me); 2) provenance. In my view, a mono-repo renders most of the argument for splitting out pymatgen-core moot, plus it will create significant complexity in management. Note that my preferences on management take precedence, since I spend a lot of time maintaining this. If we want a mono repo, I'd rather reverse the entire change, leave it as before, and treat my many hours doing the separation as wasted effort.
---
Yes, my concern is definitely with the developer experience, and also with propriety. The
I definitely advocate for the monorepo. The reason here is historical as much as technical: pymatgen has many hundreds of contributors, many of whom use their contributions to pymatgen as evidence of their impact for career statements and similar. I fear we would be doing these contributors a grave disservice by forking pymatgen to a separate repo in this manner. If the separate-repo route were the route forward, this should still be within the
I'm fully on board with expanding the pool of maintainers. We have several people who have been consistently doing work on maintaining pymatgen, including @JaGeo, @DanielYang59, and others, and I would love to give them an equal seat at the table to help with the maintenance burden. I agree it can be a lot.
Unfortunately, I think this is the best course forward. This is such a large change that I think it is better for us to revert and consider our options together. I'm sorry for the wasted effort, and I would be glad to help resolve it. On a more hopeful note, I have been finding agentic workflows with the recent frontier models very powerful, and I think they can definitely help us with maintenance tasks: I expect I could start to become more frequently involved myself again thanks to these tools.
---
I am just seeing this conversation now. In my conversations with the MP development team:
---
Dependency management can be separated while still living in the same repository. It is technically easier to keep e.g.
But I think the bigger aspect, which I welcome comment on from others, is how this split into multiple repositories might harm the community. I believe the community that has been established around pymatgen has really been a critical part of its success, and this is a very big change to make without community buy-in and support.
---
From a downstream developer/user and (lesser) contributor: yes, I would agree that having
Both choices of multi-repo/mono-repo under the same organisation have valid arguments; I guess the question is whether the pain of
---
@kavanase Thanks for the input.
---
Update: the pymatgen-core repo has been moved to the materialsproject org: https://github.com/materialsproject/pymatgen-core
---
If the only concern is
It's true that a standard
takes ~5 minutes and is very large. However, this is readily addressable without switching to a new repo: we can update documentation,
The repo size itself is also fixable. The biggest culprits are ~4.9 GB of old sphinx-built HTML docs (not even the documentation source .rst) and ~1.4 GB of old test files. These can be scrubbed. This is a big change in itself because it changes SHAs and would create orphaned PRs, so I do not recommend it as our first option and would prefer the
We can also move the sphinx-built HTML pages to an orphaned
---
Yes, I totally agree we have more options to handle

```shell
git clone --filter=blob:none --no-checkout --depth 1 https://github.com/Andrew-S-Rosen/pymatgen && cd pymatgen
git sparse-checkout set --no-cone '/*' '!/tests/files/'
time git checkout
>> Receiving objects: 100% (1094/1094), 12.84 MiB | 7.53 MiB/s, done.
>> git checkout  1.00s user 0.44s system 22% cpu 6.330 total
```

Regarding the docs issue, it has been tracked; perhaps we could also build and serve the docs in CI without saving static pages at all?

```yaml
- name: Build Sphinx
  run: sphinx-build -b html docs/source docs/build/html
- name: Upload artifact
  uses: actions/upload-pages-artifact@v4
  with:
    path: docs/build/html
- name: Deploy to GitHub Pages
  uses: actions/deploy-pages@v4
```
---
About the docs issue: unless there's some strong reason I'm missing to keep the pre-built HTML files (which I guess are essentially a legacy artifact?), I agree with @DanielYang59 that they should be removed and built in CI, to significantly reduce the repo size.
---
Docs are not a big contributor to the repo size, but I agree with moving them to a different branch. I think the whole mono-repo argument is colored by status-quo bias, without considering the future evolution of pymatgen. As a reference software suite that uses multi-repos, pls look at the jupyter organization (I am sure all here are familiar with this): jupyter, jupyterlab, notebook, jupyter-server, etc. It is multi-repo. We already started on the path towards multi-repos years ago when we decided to make pymatgen into a namespace package. pymatgen-analysis-diffusion and pymatgen-analysis-alloys are in separate repos maintained by different teams. So in that context, what makes pymatgen-analysis-optics or pymatgen-apps-borg so special that they should be in the same mono repo as pymatgen-core? If a mono repo is such a wonderful thing, why not dump the entire MP codebase, including custodian, emmet, mp-api, atomate, etc., into one giant repo? Note that I originally wanted the new MP API to be implemented in the pymatgen repo because MPRester was already in pymatgen and a key component used to retrieve crystal structures, but @mkhorton et al. steadfastly refused, which led to the current schism between pymatgen's MPRester and mp-api's. Quite clearly, the commitment to the benefits of a mono-repo seems to be situationally dependent. This is a reorg that has multiple aims.
This is the first (albeit the most significant) of many steps. So I would (again) ask that everyone take a step back and look at the whole tapestry, rather than being consumed by administrative stuff.
P.S. For administrative stuff, I will listen to opinions and suggestions. But unless there are others who are willing to commit the same amount and regularity of effort to helping maintain pymatgen, my preferences as lead maintainer need to take precedence.
---
Sphinx-built HTML docs are contributing ~4.8 GB to the repo size (~1 GB compressed). They're the main reason the
This can be verified with:
The current
---
Ok, for a route forward, I want to propose:
---
@shyuep would you be OK with us putting together a draft PR for recombining
There are plenty of examples where multiple namespaces for one parent are combined into a single GH repo. I think this makes the most sense, and it can be organized to cleanly differentiate the major namespaces.
This PR introduces a major reorganization of pymatgen. Based on comprehensive testing, the changes should be fully backward compatible.
Key Changes
• analysis
• apps
• cli
• entries / compatibility
• ext
• vis
• All other core modules have been relocated to pymatgen-core.