fix: check only 1 pssm for variant queries#430
Conversation
|
I tried with running preprocessing with this branch. It seems it still tries to check for all chains in PDB? |
|
The changes work. This pull request can be closed. |
It works because it now fills the pssm feature with 0s. So this PR actually deals with both #428 and #429, as they are closely related. In retrospect, maybe that wasn't ideal, because now things are getting a bit confused. |
|
I am making a small dataset with variants that have neighbours in multiple chains. |
I think this will only tell you something about your particular use case, but we want a solution that is generally applicable to anyone using the package. |
|
My test was for the general applicability. By omitting pssm files and adding 0s we would treat such variants as though they are solvent-exposed variants- which is not quite right, although I am not sure how many such cases are there. I have never tested such a thing with deeprank-mut, but it is expected and I do note that solvent-exposed variants are treated differently in network. I would prefer to include all chains within the radius, so may be user can definitely include all relevant chains. Rest can be 0 |
|
so do I understand correctly that all residues from chains within the radius should be included in the network and that for those chains a pssm file should be required (with an error message telling you which files are missing if they are not provided)? |
|
@rgayatri , can you please test on the current branch if it still works as intended for you? Note that |
| def _load_ppi_pssms(pssm_paths: Union[Dict[str, str], None], | ||
| chain_id1: str, chain_id2: str, | ||
| def _load_ppi_pssms(pssm_paths: Optional[Dict[str, str]], | ||
| chains: List[str], |
There was a problem hiding this comment.
Right now we don't support more than 2 chains for PPI, so I think it's better to have chain_id1 and chain_id2 to make it more explicit
There was a problem hiding this comment.
I made the change mainly because the linter was complaining, but then realized it's nice because a few lines below we anyway loop through the two chains and turn them into a list there.
I think because it's an internal function, it doesn't matter too much how explicit it is now. And your comment makes me realize that if in the future we would want to allow more than 2 chains in a PPI, then this is one less issue to think about :)
There was a problem hiding this comment.
Very nice edits, and great job with the tests!
I have only a minor comment about an internal function in query.py, and I think we're missing a test that uses suppress_pssm_errors=True. Indeed I was wondering, what happens if the errors are suppressed and pssm files are missing or incorrect? What will conservation-related features contain?
For _check_pssm, I think that the docstring so far is enough.
About the verbosity, I do agree that it's okay if it's accessible internally only.
lines 390 and 413 of test_query.py ;-) Good question what the hdf5 file will then contain. I will look into it. |
Oh right! What if we explicitly set |
I did it this way to make sure that the test uses the same instance of the object that failed if the error suppression was not turned on (a few lines above). This way, if someone modifies the Query object in the test, we are sure that both tests run the same instance. |
OK, I looked into it:
The latter point made me realize that it still checks all residues from the chain, irrespective of whether they are a part of the graph. I will rewrite so that it only checks those residues that are part of the graph. @rgayatri , what behavior do you expect if a residue is missing from a pssm file? All 0s? At the end of the day, all these tests make me feel that we just don't want people to use bad pssm files. I don't like the concept of being able to suppress the error, because now we need to hack ad hoc solutions to deal with bad data. |
I agree, ideally, a user shouldn't be able to suppress the error indeed and should take care of providing the correct files. But if @rgayatri thinks that it makes sense to fill bad data with 0s, then we can do that. Good that you noticed that all residues were checked instead of the relevant ones only @DaniBodor, and indeed better to make it check only those residues that are part of the graph. |
|
This PR is stale because it has been open for 14 days with no activity. |
|
I believe that at this point the current PR is functional. If fed with accurate data, everything works as expected. Otherwise errors can be suppressed if absolutely necessary, but it can still fail if this touches data points that are part of the graph. This is okay behavior, as we do not want users to train with unexpected/unexplainable results. |
_check_pssmand give a warning instead.radiusradius_check_pssmto throw more verbose errorscloses #428, #429