
Major reorganization of pymatgen repo #4595

Merged
shyuep merged 13 commits into materialsproject:master from materialyzeai:master
Mar 2, 2026

Conversation

@shyuep
Member

@shyuep shyuep commented Feb 24, 2026

This PR introduces a major reorganization of pymatgen. Based on comprehensive testing, the changes should be fully backward compatible.

Key Changes

  1. Core modularization
  • Core functionality has been moved to a new repository: pymatgen-core, which has now been released.
  • This separation is intended to clearly distinguish foundational data structures and utilities from higher-level functionality.
  2. Scope of this repository
  • The main pymatgen repo will now primarily contain:
    • analysis
    • apps
    • cli
    • entries / compatibility
    • ext
    • vis
    • All other core modules have been relocated to pymatgen-core.
  3. Test files
  • Test files have been extracted into a separate submodule: pymatgen-test-files.
  4. Modules moved to core
  • Several foundational modules (e.g., structure_matcher, among others) have been moved into pymatgen-core.
  5. Open question
  • io.lobster has not been moved to pymatgen-core. @JaGeo — your input would be appreciated on whether this module should also be relocated.

@shyuep shyuep requested a review from mkhorton as a code owner February 24, 2026 07:56
@shyuep shyuep marked this pull request as draft February 24, 2026 07:56
@shyuep shyuep changed the title from Pymatgen-Core Dependency to Pymatgen Reorg Feb 24, 2026
@shyuep shyuep requested review from JaGeo and computron February 24, 2026 11:18
@shyuep shyuep marked this pull request as ready for review February 24, 2026 11:18
@shyuep shyuep changed the title from Pymatgen Reorg to Major reorganization of pymatgen repo Feb 24, 2026
@shyuep
Member Author

shyuep commented Feb 25, 2026

@DanielYang59 If you'd like, please review this.

If there are no further comments from anyone by Thursday, I will merge into master.

@JaGeo
Member

JaGeo commented Feb 25, 2026

@shyuep We will discuss this shortly internally and will try to get back to you by tomorrow.

@JaGeo
Member

JaGeo commented Feb 25, 2026

@shyuep
We believe it would make a lot of sense to also move the io.lobster modules to pymatgen-core, as "cohp.py" is already a part of pymatgen-core, and we also think that the modules would be easier to discover in this way.

To make the move less painful, it would be ideal to get #4148 merged first. We assume that the failing tests are related to the recent changes in the repo.

Additionally, from our side, preserving the contribution history would be very important, also for these modules. We really love contributing to open-source software on a voluntary basis, including software maintained by other groups. The visibility of the contributions is, at least in my opinion, even more important in the latter case. This is also important with regard to funding justifications. I hope you can understand this. We also plan further improvements to these modules in the future, and we are happy to contribute to pymatgen/pymatgen-core in general.

@shyuep
Member Author

shyuep commented Feb 25, 2026

@JaGeo I think one of the difficulties I had was that the lobster and cohp classes depend on chemenv, which I am reluctant to move into the core since this is the only package that relies on it. Is there a refactoring you can do to move the chemenv dependent portions out of lobster?

@JaGeo
Member

JaGeo commented Feb 25, 2026

@shyuep Ah, I see!

We could, for example, move https://github.com/materialsproject/pymatgen/blob/master/src/pymatgen/io/lobster/lobsterenv.py to one of the analysis modules. It likely fits better there anyhow. This would be a breaking change. Likely, however, one could find a sensible way to deprecate this as well.

@DanielYang59
Contributor

DanielYang59 commented Feb 25, 2026

If you'd like, please review this.

Sure. Could you please give me a bit more time (until this weekend, maybe)? I have some work meetings this week, and I would like to go through the checklist.


What should go into core?

A question about the layout (i.e., which modules should go into the core package). Here I enclose the data collected for the "dependent usage of pmg API" figure in the 2nd pmg paper (I won't include the figure here as it's not published). The most frequently used APIs in pymatgen are (counting only the top 10 dependents ranked by GitHub stars, so it's far from a complete picture):

| Module | Count |
| --- | --- |
| core | 487 |
| analysis | 153 |
| io | 127 |
| entries | 71 |
| symmetry | 54 |
| util | 23 |
| electronic_structure | 16 |
| ext | 15 |
| optimization | 6 |

I guess this should give us a rough estimate of what is "important" to users?

The current core package

  • I could see alchemy is included, but as far as I can see there has been no update here in the past two years
  • command_line is also included; I'm a bit unsure whether it should be "core" to pymatgen because these are mostly wrappers for command-line programs (maybe their position is a bit similar to io, so they were also moved?)
  • I could see there are some visualization/plotting things in util; I guess they should be moved back?

The non-core side

  • surely you're discussing this, just adding it to my checklist: electronic_structure/cohp and lobster are in the non-core package
  • guess entries are closer to analysis, so they're not moved?

Should test files be separated?

I'm really unsure about this, because it would make it hard for people to contribute: they would need to split one atomic change across two places (i.e., if I want to add a test that needs a test file, I have to get the file merged first). Also, I don't think this brings much benefit to the core side, because most of the big test files come from there (mostly electronic_structure and io):

# Core
 12K    alchemy/
388K    cif/
 29M    command_line/
212K    core/
143M    electronic_structure/
313M    io/
 12K    performance/
4.3M    phonon/
 16M    surfaces/
 36K    symmetry/
 12K    transformations/

# Non-core
380K    apps/
 60M    analysis/
1.0M    entries/

Some other important things

  • git history on core package needs to be rewritten to preserve history (@esoteric-ephemera cc): Align history with pymatgen pymatgen-core#1
  • also git history on test files
  • I would check if all current pymatgen APIs are still available in the new layout
    • core APIs
    • non-core APIs
    • relocated APIs are available with a deprecation warning
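A check along these lines can be scripted. A minimal sketch, assuming nothing about the actual pymatgen layout (the module list below uses stdlib stand-ins; a real run would list the pre-reorg pymatgen import paths):

```python
import importlib
import warnings

def check_api(module_paths):
    """Try importing each dotted path; classify it as 'ok', 'missing',
    or 'deprecated stub' (importable, but warning about relocation)."""
    results = {}
    for path in module_paths:
        with warnings.catch_warnings(record=True) as caught:
            warnings.simplefilter("always")
            try:
                importlib.import_module(path)
            except ImportError:
                results[path] = "missing"
                continue
        if any(issubclass(w.category, DeprecationWarning) for w in caught):
            results[path] = "deprecated stub"
        else:
            results[path] = "ok"
    return results

# Stdlib stand-ins; the real check would use the full pymatgen module list:
print(check_api(["json", "definitely_not_a_module"]))
# → {'json': 'ok', 'definitely_not_a_module': 'missing'}
```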

About versioning...

I could see the first pymatgen-core was published using calendar versioning, and I guess we shouldn't jump back and forth between semantic and calendar versioning on the pymatgen side (otherwise pymatgen would need to use version epochs like 1!5.0.0, which is quite awkward). Still, I want to ask whether we should consider starting pymatgen-core with semantic versioning. As far as I know, calendar versioning is a bit easier for us when deciding on a version number, but to the user/tooling side it doesn't communicate compatibility, i.e., if we are currently at version 1.0, a user could pin >=1.0, <2 to avoid breaking changes, which is not possible with calendar versioning.

Side note: I checked the ten most downloaded packages on PyPI in the past year, and only one (certifi) uses calendar versioning.
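To illustrate the pinning point: under semantic versioning a range pin encodes a compatibility promise, while under calendar versioning the same range is just a date window. A toy comparison (hand-rolled dotted-version parsing purely for illustration; real tooling would use the `packaging` library):

```python
def in_range(version, lower, upper):
    """True if lower <= version < upper, comparing dotted versions numerically."""
    key = lambda v: tuple(int(part) for part in v.split("."))
    return key(lower) <= key(version) < key(upper)

# Semantic versioning: ">=1.0, <2.0" means "no breaking changes accepted".
print(in_range("1.7.3", "1.0", "2.0"))             # True, and safe by convention

# Calendar versioning: the analogous pin is only a date window; a release
# inside it may still contain breaking changes.
print(in_range("2025.6.1", "2025.1.1", "2026.0"))  # True, but no safety implied
```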

Some other non-urgent things could be easily fixed with PR

  • The dependencies in both places need to be cleaned up, especially on the core side
  • The non-core repo should probably add core as a workspace member for easy development
  • After the core is extracted, where should the API docs live?
  • Changes to the core repo should probably trigger the non-core pipeline (similar to atomate2) under some conditions (e.g., when commits go into main)?

@shyuep
Member Author

shyuep commented Feb 25, 2026

Sure. Could you please give me a bit more time (until this weekend, maybe)? I have some work meetings this week, and I would like to go through the checklist.

Sure. The reason why I want this done fast is to avoid divergence from PRs.

What should go into core?

Yes. But there are cross-dependencies. For example, Structure.from_file("Li2O.cif") would use the io class, even if there is no specific import of io.
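The implicit coupling described here, a core class method dispatching to an IO parser based on file extension, can be sketched as follows (a hypothetical illustration, not the actual pymatgen internals; json stands in for the CIF parser):

```python
import pathlib

class Structure:
    """Minimal stand-in for a core class; not the real pymatgen Structure."""

    def __init__(self, data):
        self.data = data

    @classmethod
    def from_file(cls, filename):
        # The parser module is imported lazily, inside the method, so merely
        # importing the core class pulls in no IO code -- but calling
        # from_file() does, which is the hidden core -> io dependency.
        suffix = pathlib.Path(filename).suffix.lower()
        if suffix == ".json":
            import json  # stand-in for a pymatgen.io parser module
            with open(filename) as fh:
                return cls(json.load(fh))
        raise ValueError(f"no parser registered for {suffix!r}")

# Demo: round-trip through a temporary file.
import json, os, tempfile

with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as fh:
    json.dump({"formula": "Li2O"}, fh)
    path = fh.name
try:
    s = Structure.from_file(path)
    print(s.data)  # → {'formula': 'Li2O'}
finally:
    os.remove(path)
```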

The current core package

  • I could see alchemy is included, but as far as I can see there has been no update here in the past two years
  • command_line is also included; I'm a bit unsure whether it should be "core" to pymatgen because these are mostly wrappers for command-line programs (maybe their position is a bit similar to io, so they were also moved?)
  • I could see there are some visualization/plotting things in util; I guess they should be moved back?

Command line is used extensively in transformations (enumlib). So that will stay. Alchemy is definitely old code and we can consider moving it back into non-core (I seem to remember some alchemy dependency somewhere else though).

The non-core side

  • surely you're discussing this, just adding it to my checklist: electronic_structure/cohp and lobster are in the non-core package
  • guess entries are closer to analysis, so they're not moved?

Entries have been moved to core. The only Entry classes not moved are those related to compatibility.

Should test files be separated?

This needs to be balanced against the cost of cloning. Quite a number of people actually clone pymatgen directly from Github on HPCs. Those do not need the test files.

My clone of pymatgen with test files from Singapore took 30 mins yesterday.

Devs actually clone with recursive to get the test files without a separate command. When they add a new test file, they can test locally without problems. The only extra step is a separate PR for the test files.

Some other important things

To resolve in separate PR.

  • also git history on test files

Honestly, this is really not important. People want credit for the code and tests. But putting in a VASP output file for testing? Really? But sure, if someone does a PR on the git history, I will merge it.

  • I would check if all current pymatgen APIs are still available in the new layout

That is tested. As you can see, I retained all analysis and entries tests for now. Those were the only things I moved to pymatgen-core. The location of everything else remains the same.

About versioning...

Too late to move back to semantic versioning, regardless of the arguments (which I don't believe in any case). And it would be extremely confusing to ask users to go from installing pymatgen 2025.1.1 to pymatgen-core 1.0. We will close this discussion.

Some other non-urgent things could be easily fixed with PR

  • The dependencies in both places need to be cleaned up, especially on the core side

Sure.

  • The non-core repo should probably add core as a [workspace member](https://docs.astral.sh/uv/concepts/projects/workspaces/) for easy development

We will wait on this. I am not sure many people do dev on both core and non-core codes.

  • After the core is extracted, where should the API docs live?

pymatgen.org, which will still be the pymatgen (non-core) repo.

  • Changes to the core repo should probably trigger the non-core pipeline

Not sure about this. Maybe. But pymatgen is going to use core as a dependency which is pinned. Any errors will be tested when that pin gets updated.

@esoteric-ephemera
Contributor

Re: git history on test files, this should already be partially taken care of. It can be useful for understanding test regressions.

I would exclude ext and much of analysis from "core". ext is mostly database interactions, which shouldn't be in "core". analysis I think is generally too broad to include, and @shyuep's choice to move, e.g., StructureMatcher into "core" is a good idea. I would also make sure that a lot of the key data structures (PhaseDiagram, PourbaixDiagram, etc.) get moved into a core analysis package.

Will also try to take a closer look at the split and give further comments

@JaGeo
Member

JaGeo commented Feb 26, 2026

Just as a follow-up: @naik-aakash addressed the move to the analysis module (thanks!)

#4148 (comment)

@DanielYang59
Contributor

DanielYang59 commented Feb 26, 2026

Sure. The reason why I want this done fast is to avoid divergence from PRs.

Thanks, fully understood.

Entries have been moved to core. The only Entry classes not moved are those related to compatibility.

I see, looks like the warning might be missing from the non-core stub?

uv run --with git+https://github.com/materialyzeai/pymatgen.git@master python -c "from pymatgen.analysis.elasticity.strain import Deformation"
<string>:1: DeprecationWarning: This module has been moved to the pymatgen.core.elasticity module. This stub will be removed v2027.1.

uv run --with git+https://github.com/materialyzeai/pymatgen.git@master python -c "from pymatgen.entries import Entry"
>> no warning

This needs to be balanced against the cost of cloning. Quite a number of people actually clone pymatgen directly from Github on HPCs. Those do not need the test files.

My clone of pymatgen with test files from Singapore took 30 mins yesterday.

That's a solid reason, and I think we have options other than separating the files into another repo. For example, I usually do a partial clone (I cannot remember the full command, so I just use my clipboard manager to fuzzy-search it):

# I used Andrew's fork because test files are still there

# Clone the complete repo 
time git clone --no-checkout --depth 1 https://github.com/Andrew-S-Rosen/pymatgen
>> Receiving objects: 100% (263529/263529), 937.70 MiB | 18.33 MiB/s, done.
>> git clone https://github.com/Andrew-S-Rosen/pymatgen  40.44s user 11.79s system 91% cpu 57.107 total

# Shallow clone 
time git clone  https://github.com/Andrew-S-Rosen/pymatgen --depth 1
>> Receiving objects: 100% (2818/2818), 249.23 MiB | 12.93 MiB/s, done.
>> git clone https://github.com/Andrew-S-Rosen/pymatgen --depth 1  10.70s user 4.00s system 60% cpu 24.249 total

# Partial shallow clone
git clone --filter=blob:none --no-checkout --depth 1 https://github.com/Andrew-S-Rosen/pymatgen && cd pymatgen
git sparse-checkout set --no-cone '/*' '!/tests/files/'
time git checkout
>> Receiving objects: 100% (1094/1094), 12.84 MiB | 7.53 MiB/s, done.
>> git checkout  1.00s user 0.44s system 22% cpu 6.330 total

Devs actually clone with recursive to get the test files without a separate command. When they add a new test file, they can test locally without problems. The only extra step is a separate PR for the test files.

I think the extra step would actually break the connection between tests and test files, i.e., one might later modify only the test or only the test file, and then it would be hard to figure out the relationship between them.

People want credit for the code and tests. But putting in a VASP output file for testing? Really?

For me, I think the reason is more to track "metadata"... when I submit a PR I usually try to include the reason/data source in the commit message (or PR description); if the git history is wiped, it would be quite hard to find when/why a certain file was added or changed.

To resolve in separate PR.

I'm not aware of any way to rewrite git history without a force push, but I believe Aaron is working on this.

The non-core repo should probably add core as a workspace member

We will wait on this. I am not sure many people do dev on both core and non-core codes.

Sure, it's not urgent, and after this refactor I could play around with this (and other small improvements).

@mkhorton
Member

Does anyone still use .vis, @shyuep? I'm not sure whether we should formally deprecate it. There are many other options now.

Haven't had a chance to review all of this yet; I would also like the opportunity to do so.

@shyuep
Member Author

shyuep commented Feb 27, 2026

Does anyone still use .vis, @shyuep? I'm not sure whether we should formally deprecate it. There are many other options now.

Haven't had a chance to review all of this yet; I would also like the opportunity to do so.

Agreed, but I am setting a hard deadline of this weekend. Speak your mind by then or forever hold your peace.

@shyuep shyuep merged commit ed1867a into materialsproject:master Mar 2, 2026
7 checks passed
@DanielYang59
Contributor

Entries have been moved to core. The only Entry classes not moved are those related to compatibility.

I see, looks like the warning might be missing from the non-core stub?

@shyuep I guess we should have a warning for the entries stub?

@mkhorton
Member

mkhorton commented Mar 3, 2026

@shyuep what was the nature of the deadline, are follow-up PRs ok or is there a reason this structure needs to be frozen?

@mkhorton
Member

mkhorton commented Mar 3, 2026

I'm actually a little confused. Where is pymatgen-core living right now? I see the deleted files but not the added ones. I don't see the new repo in the materialsproject org.

@mkhorton
Member

mkhorton commented Mar 3, 2026

Ok, I found it https://github.com/materialyzeai/pymatgen-core.

I'm confused why this is not in the materialsproject org.

It would make sense to keep pymatgen-core inside the same repo as pymatgen to reduce development friction.

Edited to add: I am also concerned about the confusion this will cause to developers by moving to a separate repo. A lot of git history, issues, documentation, etc. will no longer resolve and be more difficult to find.

@shyuep
Member Author

shyuep commented Mar 3, 2026

@shyuep what was the nature of the deadline, are follow-up PRs ok or is there a reason this structure needs to be frozen?

Of course the structure is never frozen. Follow-up PRs are fine. I just wanted to make sure the PR is merged in a timely manner.

It would make sense to keep pymatgen-core inside the same repo as pymatgen to reduce development friction.
Edited to add: I am also concerned about the confusion this will cause to developers by moving to a separate repo. A lot of git history, issues, documentation, etc. will no longer resolve and be more difficult to find.

We discussed this. The expectation is that the current pymatgen repo will mostly function as a meta package of sorts. The git history issue has already been resolved in pymatgen-core.

@mkhorton
Member

mkhorton commented Mar 3, 2026

I understand I am late to the discussion and only now understanding the full context, but I also understand this is a major change that was done under a short timeframe. There is no mention on the pymatgen.org homepage or repo README.md for example. The documentation has not been updated.

We have discussed and explored a pymatgen-core refactor several times. This is very reasonable.

The bigger question I think we need to resolve is the location of pymatgen-core. I think it will cause an extreme amount of confusion to move the pymatgen-core source code out and into a different repository. I think it also causes confusion given the provenance of pymatgen as part of Materials Project and as an MP work product.

@shyuep
Member Author

shyuep commented Mar 3, 2026

I understand I am late to the discussion and only now understanding the full context, but I also understand this is a major change that was done under a short timeframe. There is no mention on the pymatgen.org homepage or repo README.md for example. The documentation has not been updated.

Correct. But at the present moment, there are 0 implications for anyone using pymatgen. There are implications for developers only. Any other user doing pip install pymatgen will get exactly the same functionality. There is no release of the new pymatgen yet and even after it is released, we expect the same functionality.

The bigger question I think we need to resolve is the location of pymatgen-core. I think it will cause an extreme amount of confusion to move the pymatgen-core source code out and into a different repository. I think it also causes confusion given the provenance of pymatgen as part of Materials Project and as an MP work product.

I want to separate out the two questions. 1) What is technically the best structure - monorepo or multi-repo, especially given the huge technical debt (a full clone of pymatgen with history took 30 mins for me), 2) provenance. In my view, a mono-repo renders most of the argument for splitting out pymatgen-core moot, plus it will create significant complexity in management. Note that my preferences of management take precedence since I spend a lot of time maintaining this. If we want a mono repo, I'd rather reverse the entire change and leave it as before and treat my many hours doing the separation as wasted effort.

@mkhorton
Member

mkhorton commented Mar 3, 2026

Yes, my concern is definitely with the developer experience, and also with propriety.

The git clone time for pymatgen is definitely an issue. We can encourage shallow clones etc. in README.md to help with this and see what other cleanup we can do.

I definitely advocate for the monorepo. The reason here is historical as much as technical: pymatgen has many hundreds of contributors, many of whom use their contributions to pymatgen as evidence of their impact for career statements and similar. I fear we would be doing these contributors a grave disservice by forking pymatgen to a separate repo in this manner.

If the separate repo route was the route forward, this should still be within the materialsproject org.

Note that my preferences of management take precedence since I spend a lot of time maintaining this.

I'm fully on board with expanding the pool of maintainers. We have several people who have been consistently doing work on maintaining pymatgen, including @JaGeo, @DanielYang59 and others, and I would love to give them an equal seat at the table to help with the maintenance burden. I agree it can be a lot.

If we want a mono repo, I'd rather reverse the entire change and leave it as before and treat my many hours doing the separation as wasted effort.

Unfortunately I think this is the best course forwards. This is such a large change, I think it is better for us to revert and consider our options together. I'm sorry for the wasted effort and I would be glad to help resolve.

On a more hopeful note, I have been finding agentic workflows with the recent frontier models very powerful, and I think they can definitely help us with maintenance tasks: I expect I could start to become more frequently involved myself again thanks to these tools.

@computron
Member

I am just seeing this conversation now. In my conversations with MP development team:

  1. the separation of pymatgen and pymatgen-core with history retained is actually the preferred solution. The separation is nice in terms of dependency management and other aspects. Retention of git history was not there initially but is now the case; I am not sure whether this is clear to everyone commenting on the git history aspect.
  2. Both repos should still reside within the materialsproject GitHub org. This was at one point the plan but seems to have gotten sidetracked into a split between the materialyzeai and materialsproject orgs, so we would like to revisit it.

@mkhorton
Member

mkhorton commented Mar 3, 2026

Dependency management can be separated while still living in the same repository.

It is technically easier to keep e.g. /src/pymatgen and /src/pymatgen-core in the same repository. If a PR needs to make a change that requires changes to both packages (which is a common scenario), this makes it a lot easier for contributors. The alternative is two separate PRs in two separate repositories along with a potential drift in CI. Keeping them in the same repository also makes releasing new versions of pymatgen and pymatgen-core simultaneously easier.

But I think the bigger aspect, which I welcome comment on from others, is how this split into multiple repositories might harm the community. I believe the community that has been established around pymatgen has really been a critical part of its success and this is a very big change to make without community buy-in and support.

@kavanase
Contributor

kavanase commented Mar 3, 2026

But I think the bigger aspect, which I welcome comment on from others, is how this split into multiple repositories (presumably organisations) might harm the community. I believe the community that has been established around pymatgen has really been a critical part of its success and this is a very big change to make without community buy-in and support.

From a downstream developer/user and (lesser) contributor, yes I would agree that having pymatgen and pymatgen-core under the same organisation (whether or not same repo) would be best for continued community buy-in; making it clear that it is a community code (despite uneven development and maintenance burdens of course) with a stable and persistent home.

Both choices of multi-repo/mono-repo under the same organisation have valid arguments – I guess the question is whether the pain of git cloning pymatgen outweighs the potential CI/development issues @mkhorton raises. I would vote for the mono-repo case, as I think most users can just pip install, with git clone only necessary for the small fraction who modify the codebase, while the headaches (from experience!) of co-dependent developments across separate repos I think outweigh this – just my opinion!

@shyuep
Member Author

shyuep commented Mar 3, 2026

@kavanase Thanks for the input.

  1. The pymatgen-core repo will be moved under materialsproject org. The only reason why it was initially done in another org was because I did not have the permissions to create a new repo in the materialsproject org. This will be sorted in a few hours.
  2. The intent has always been that the main (non-core) pymatgen repo will still be the front, with the entire git history. This git history has already been ported (and merged) to pymatgen-core with the efforts of @esoteric-ephemera and @DanielYang59.
  3. The mono- vs multi-repo options both have pros and cons that are too long to get into here - a simple search will bring up a summary, but it boils down to complexity of maintenance vs modularity. But I want to be clear on this part: regardless of how the repos are organized, the use of pymatgen remains unchanged. pip install pymatgen still gives the exact same functionality as before the change. The only thing that has changed is that people now have the option of pip install pymatgen-core if they want the main functionality and don't need the many analyses (with the corresponding dependency bloat) that come with pymatgen (non-core). Similarly, downstream packages that have pymatgen as a dependency can opt to use pymatgen-core if they use only the core functionality, such as structure representations, manipulations and IO (there are very few downstream packages that depend on pymatgen for the analysis package).

@shyuep
Member Author

shyuep commented Mar 3, 2026

Update: pymatgen-core repo has been moved to materialsproject org - https://github.com/materialsproject/pymatgen-core

@mkhorton
Member

mkhorton commented Mar 4, 2026

If the only concern is git clone time, I think we should still keep everything in the same repo, unless there is broad community support for the change to a separate repository.

It's true that a standard:

git clone git@github.com:materialsproject/pymatgen.git

takes ~5 minutes and is very large.

However, this is readily addressable without switching to a new repo:

# 3.8 seconds for a shallow clone
git clone --depth=1 git@github.com:materialsproject/pymatgen.git

# 5.7 seconds with a lazy blob fetch
git clone --filter=blob:none git@github.com:materialsproject/pymatgen.git

We can update documentation, README.md and CI to use these patterns where appropriate.

The repo size itself is also fixable. The biggest culprits are ~4.9 GB of old sphinx-built HTML docs (not even the documentation source .rst) and ~1.4 GB of old test files. These can be scrubbed. This is a big change in itself because it changes SHAs and would create orphaned PRs - so I do not recommend it as our first option and would prefer the git --filter approach to start - but we do have viable options that do not require a separate repository.

We can also move the sphinx-built HTML pages to an orphaned gh-pages branch and this will help prevent future bloat.

@DanielYang59
Contributor

DanielYang59 commented Mar 4, 2026

Yes, I totally agree we have more options for handling git clone (note the current pymatgen repo doesn't have test files, so shallow/blobless clone timings might not reflect the real situation if we get the test files back). If test files are really the concern, we could still try the fairly complex partial clone I tested here (and we could certainly document it):

git clone --filter=blob:none --no-checkout --depth 1 https://github.com/Andrew-S-Rosen/pymatgen && cd pymatgen
git sparse-checkout set --no-cone '/*' '!/tests/files/'
time git checkout
>> Receiving objects: 100% (1094/1094), 12.84 MiB | 7.53 MiB/s, done.
>> git checkout  1.00s user 0.44s system 22% cpu 6.330 total

Regarding the docs issue, it has been tracked; perhaps we could also build and serve the docs in CI without saving static pages at all?

      - name: Build Sphinx
        run: sphinx-build -b html docs/source docs/build/html

      - name: Upload artifact
        uses: actions/upload-pages-artifact@v4
        with:
          path: docs/build/html

      - name: Deploy to GitHub Pages
        uses: actions/deploy-pages@v4

@kavanase
Contributor

kavanase commented Mar 4, 2026

About the docs issue, unless there's some strong reason I'm missing to keep the pre-built HTML files (which I guess is essentially a legacy artifact?), I agree with @DanielYang59 that they should be removed and built in CI, to significantly drop the repo size.
Building docs in CI is the standard practice for most codes these days.

@shyuep
Member Author

shyuep commented Mar 4, 2026

Docs are not a big contributor to the repo size but I agree with moving it to a different branch.

I think the whole monorepo argument is colored by status-quo bias, without considering the future evolution of pymatgen. As a reference software suite that uses multiple repos, please look at the Jupyter organization (I am sure all here are familiar with it): jupyter, jupyterlab, notebook, jupyter-server, etc. It is multi-repo.

We already started on the path towards multi-repos years ago when we decided to make pymatgen into a namespace package. pymatgen-analysis-diffusion and pymatgen-analysis-alloys are in separate repos maintained by different teams. So in that context, what makes pymatgen-analysis-optics or pymatgen-apps-borg so special that they should be in the same monorepo as pymatgen-core? If a monorepo is such a wonderful thing, why not dump the entire MP codebase, including custodian, emmet, mp-api, atomate, etc., into one giant repo? Note that I originally wanted the new MP API to be implemented in the pymatgen repo because MPRester was already in pymatgen and a key component used to retrieve crystal structures, but @mkhorton et al. steadfastly refused, which led to the current schism between pymatgen's MPRester and mp-api's. Quite clearly, the commitment to the benefits of a monorepo seems to be situationally dependent.

This is a reorg that has multiple aims.

  1. Keep core code in a separate repository for maintenance.
  2. Begin decentralizing non-core parts to separate repositories and maintainers who have an active research interest in maintaining those codes.
  3. Allow for extension of pymatgen's capabilities, e.g., pymatgen-analysis-mechanics, pymatgen-ext-matpes, pymatgen-ai, etc.

This is the first (albeit the most significant) of many steps. So I would (again) ask that everyone take a step back and look at the whole tapestry, rather than being consumed by administrative stuff.

P.S. For administrative stuff, I will listen to opinions and suggestions. But unless there are others who are willing to commit the same amount and regularity of effort into helping maintain pymatgen, my preferences as lead maintainer need to take precedence.

@mkhorton
Member

mkhorton commented Mar 6, 2026

Docs are not a big contributor to the repo size but I agree with moving it to a different branch.

Sphinx-built HTML docs are contributing ~4.8 GB to the repo size (~1 GB compressed). They're the main reason the git clone is so slow.

This can be verified with:

  git rev-list --all --objects -- docs/ \
    | git cat-file --batch-check='%(objecttype) %(objectsize) %(rest)' \
    | awk '/^blob/{sum+=$2} END{printf "%.2f GB\n", sum/1073741824}'

Note that I originally wanted the new MP API to be implemented in the pymatgen repo because MPRester was already in pymatgen and a key component used to retrieve crystal structures

The current mp-api depends on emmet-core which in turn depends on pymatgen. To put the new MPRester inside pymatgen would require pymatgen to depend on emmet-core: a circular dependency. This was the only reason it was spun-off, as I recall. My preference would have been to keep it in pymatgen otherwise.

@mkhorton
Member

mkhorton commented Mar 10, 2026

Ok, for a route forwards, I want to propose:

  1. Let's get pymatgen-core back into pymatgen. I do not see enough support for fragmenting into two repos, and there is a lot of downside.

    I agree there is a bias toward the status quo in preferring the single repo, but I think the downside is also real: we under-estimate how many people use pymatgen because of stability and trust. This includes the developer experience and being able to navigate and find code easily. I am also concerned that our delineation between "core" and "non-core" doesn't quite make sense right now: it feels rushed and lacks good justification.

  2. Let us help you resolve this @shyuep:

    But unless there are others who are willing to commit the same amount and regularity of effort into helping maintain pymatgen, my preferences as lead maintainer need to take precedence.

    I agree the maintenance burden is a lot - let's formally expand the pymatgen maintainer team to include those with a long-term track record of high-quality contributions. I hope this helps with your maintainer burden and also my own.

    To start, I would nominate @DanielYang59 and @JaGeo if they would be willing, and if others are interested or willing I would encourage them to nominate themselves too. I've been consistently impressed by @kavanase's development of doped for example and would also love to see Seán be able to maintain pymatgen too, if he was interested. @esoteric-ephemera too has been doing great work on the Materials Project code stack. I am sure there are other equally good candidates too, I do not present this as a complete list. Absolutely no pressure to any of those named, but if you would be willing to join as co-maintainers, let us know.

    If we are still struggling with maintenance burden after expanding the maintainer pool, I think that would be a good time to consider more radical options.

@esoteric-ephemera
Contributor

@shyuep would you be OK with us putting together a draft PR for recombining pymatgen-core and pymatgen into a single repo (+ GH workflows for testing, release)?

There are plenty of examples where multiple namespaces for one parent are combined into a single GH repo. I think this makes the most sense and can be organized to cleanly differentiate the major namespaces
