Skip to content

fix: check only 1 pssm for variant queries#430

Merged
DaniBodor merged 27 commits intomainfrom
428_checkpssm_singlechain_dbodor
Jul 4, 2023
Merged

fix: check only 1 pssm for variant queries#430
DaniBodor merged 27 commits intomainfrom
428_checkpssm_singlechain_dbodor

Conversation

@DaniBodor
Copy link
Copy Markdown
Collaborator

@DaniBodor DaniBodor commented May 22, 2023

  • problem with check_pssm  #428 fixed by only checking for residues on chains for which a pssm file was provided.
  • problem with get_pssm when preprocessing variants #429 will not be fixed, because we want users to provide pssm files for all residues that are included in the graph.
  • Created an option to suppress the error thrown by _check_pssm and give a warning instead.
    • does this require additional documentation, beyond the docstrings?
    • created tests for error suppression
  • Create tests for variant query where there is a second chain for which no pssm file is provided:
    • throw error if second chain is inside radius
    • no error if second chain is NOT inside radius
  • Refactor existing tests relating to _check_pssm
  • Allow for _check_pssm to throw more verbose errors
    • this is only accessible from source code, which I think is ok because it's more of a troubleshooting/debugging mode thing than something that we expect the user to want to touch, right?

closes #428, #429

@DaniBodor DaniBodor linked an issue May 24, 2023 that may be closed by this pull request
@rgayatri
Copy link
Copy Markdown
Collaborator

rgayatri commented May 25, 2023

I tried with running preprocessing with this branch. It seems it still tries to check for all chains in PDB?
Unfortunately some PDBs have errors in chain names, so there are inconsistencies between no. of PDB chains and no. of PSSMs.
I'll do the checks again

@rgayatri
Copy link
Copy Markdown
Collaborator

The changes work. This pull request can be closed.
However, there should be a provision to include other pssm chains in case the radius includes multiple chains. can be dealt in #429

@DaniBodor
Copy link
Copy Markdown
Collaborator Author

The changes work. This pull request can be closed. However, there should be a provision to include other pssm chains in case the radius includes multiple chains. can be dealt in #429

It works because it now fills the pssm feature with 0s. So this PR actually deals with both #428 and #429, as they are closely related. In retrospect, maybe that wasn't ideal, because now things are getting a bit confused.
I will await decision on #429 to know whether we actually want to fill it with 0s, or want to force pssm files to be present for all chains and then either create separate PRs or resolve both issues here.

@rgayatri
Copy link
Copy Markdown
Collaborator

I am making a small dataset with variants that have neighbours in multiple chains.
I suppose that may be the only way to test consequence of 0s or non-0s in predictions (non-0s will be ideal for these cases)

@DaniBodor
Copy link
Copy Markdown
Collaborator Author

DaniBodor commented May 31, 2023

I am making a small dataset with variants that have neighbours in multiple chains. I suppose that may be the only way to test consequence of 0s or non-0s in predictions (non-0s will be ideal for these cases)

I think this will only tell you something about your particular use case, but we want a solution that is generally applicable to anyone using the package.
So in a wider context, we need to make a decision whether we want users to provide pssm files for all chains included in the network (i think preferable) or whether we allow them to omit pssm files for some chains and then the feature gets filled with 0s for those chains.
Note that for the second case, it would put 0s for whatever chain/chains is/are omitted, without differentiating what the main chain and the secondary chain is, etc.
For the first option (required pssm files), users can always just generate pssm files with all 0s if needed.

@rgayatri
Copy link
Copy Markdown
Collaborator

My test was for the general applicability. By omitting pssm files and adding 0s we would treat such variants as though they are solvent-exposed variants- which is not quite right, although I am not sure how many such cases are there. I have never tested such a thing with deeprank-mut, but it is expected and I do note that solvent-exposed variants are treated differently in network.

I would prefer to include all chains within the radius, so may be user can definitely include all relevant chains. Rest can be 0

@DaniBodor
Copy link
Copy Markdown
Collaborator Author

DaniBodor commented May 31, 2023

so do I understand correctly that all residues from chains within the radius should be included in the network and that for those chains a pssm file should be required (with an error message telling you which files are missing if they are not provided)?

@DaniBodor DaniBodor requested review from gcroci2 and rgayatri June 2, 2023 17:10
@DaniBodor
Copy link
Copy Markdown
Collaborator Author

@rgayatri , can you please test on the current branch if it still works as intended for you?

Note that Query objects now have an optional argument suppress_pssm_errors

def _load_ppi_pssms(pssm_paths: Union[Dict[str, str], None],
chain_id1: str, chain_id2: str,
def _load_ppi_pssms(pssm_paths: Optional[Dict[str, str]],
chains: List[str],
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right now we don't support more than 2 chains for PPI, so I think it's better to have chain_id1 and chain_id2 to make it more explicit

Copy link
Copy Markdown
Collaborator Author

@DaniBodor DaniBodor Jun 9, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made the change mainly because the linter was complaining, but then realized it's nice because a few lines below we anyway loop through the two chains and turn them into a list there.
I think because it's an internal function, it doesn't matter too much how explicit it is now. And your comment makes me realize that if in the future we would want to allow more than 2 chains in a PPI, then this is one less issue to think about :)

Copy link
Copy Markdown
Collaborator

@gcroci2 gcroci2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice edits, and great job with the tests!
I have only a minor comment about an internal function in query.py, and I think we're missing a test that uses suppress_pssm_errors=True. Indeed I was wondering, what happens if the errors are suppressed and pssm files are missing or incorrect? What will conservation-related features contain?

For _check_pssm, I think that the docstring so far is enough.

About the verbosity, I do agree that it's okay if it's accessible internally only.

@DaniBodor
Copy link
Copy Markdown
Collaborator Author

DaniBodor commented Jun 9, 2023

I think we're missing a test that uses suppress_pssm_errors=True. Indeed I was wondering, what happens if the errors are suppressed and pssm files are missing or incorrect? What will conservation-related features contain?

lines 390 and 413 of test_query.py ;-)

Good question what the hdf5 file will then contain. I will look into it.

@gcroci2
Copy link
Copy Markdown
Collaborator

gcroci2 commented Jun 9, 2023

I think we're missing a test that uses suppress_pssm_errors=True. Indeed I was wondering, what happens if the errors are suppressed and pssm files are missing or incorrect? What will conservation-related features contain?

lines 390 and 413 of test_query.py ;-)

Good question what the hdf5 file will then contain. I will look into it.

Oh right! What if we explicitly set suppress_pssm_errors in the query initialization (maybe redefining the query in the tests)? This way is clearer how such a parameter can be used, instead of setting the internal _suppress. Since we don't have explicit examples about suppress_pssm_errors, can be good to have it in the tests

@DaniBodor
Copy link
Copy Markdown
Collaborator Author

I think we're missing a test that uses suppress_pssm_errors=True. Indeed I was wondering, what happens if the errors are suppressed and pssm files are missing or incorrect? What will conservation-related features contain?

lines 390 and 413 of test_query.py ;-)
Good question what the hdf5 file will then contain. I will look into it.

Oh right! What if we explicitly set suppress_pssm_errors in the query initialization (maybe redefining the query in the tests)? This way is clearer how such a parameter can be used, instead of setting the internal _suppress. Since we don't have explicit examples about suppress_pssm_errors, can be good to have it in the tests

I did it this way to make sure that the test uses the same instance of the object that failed if the error suppression was not turned on (a few lines above). This way, if someone modifies the Query object in the test, we are sure that both tests run the same instance.
Do you think I should add an additional test where it is set upon initialization?

@DaniBodor
Copy link
Copy Markdown
Collaborator Author

I think we're missing a test that uses suppress_pssm_errors=True. Indeed I was wondering, what happens if the errors are suppressed and pssm files are missing or incorrect? What will conservation-related features contain?

lines 390 and 413 of test_query.py ;-)

Good question what the hdf5 file will then contain. I will look into it.

OK, I looked into it:

  • for incorrect entries in the pssm: it just used the "wrong" data. So the pssm data that is associated to the position number of that residue, even though it is the incorrect residue (@rgayatri is this the behavior that you want?)
  • for missing entries in the pssm:
    • if the entry is not part of the graph, it doesn't complain
    • if the entry is a part of the graph, then it fails at a later step (so check_pssm passes, but then an error is thrown I think when it tries to write data into that key).

The latter point made me realize that it still checks all residues from the chain, irrespective of whether they are a part of the graph. I will rewrite so that it only checks those residues that are part of the graph.

@rgayatri , what behavior do you expect if a residue is missing from a pssm file? All 0s?

At the end of the day, all these tests make me feel that we just don't want people to use bad pssm files. I don't like the concept of being able to suppress the error, because now we need to hack ad hoc solutions to deal with bad data.

@gcroci2
Copy link
Copy Markdown
Collaborator

gcroci2 commented Jun 12, 2023

I think we're missing a test that uses suppress_pssm_errors=True. Indeed I was wondering, what happens if the errors are suppressed and pssm files are missing or incorrect? What will conservation-related features contain?

lines 390 and 413 of test_query.py ;-)
Good question what the hdf5 file will then contain. I will look into it.

OK, I looked into it:

  • for incorrect entries in the pssm: it just used the "wrong" data. So the pssm data that is associated to the position number of that residue, even though it is the incorrect residue (@rgayatri is this the behavior that you want?)

  • for missing entries in the pssm:

    • if the entry is not part of the graph, it doesn't complain
    • if the entry is a part of the graph, then it fails at a later step (so check_pssm passes, but then an error is thrown I think when it tries to write data into that key).

The latter point made me realize that it still checks all residues from the chain, irrespective of whether they are a part of the graph. I will rewrite so that it only checks those residues that are part of the graph.

@rgayatri , what behavior do you expect if a residue is missing from a pssm file? All 0s?

At the end of the day, all these tests make me feel that we just don't want people to use bad pssm files. I don't like the concept of being able to suppress the error, because now we need to hack ad hoc solutions to deal with bad data.

I agree, ideally, a user shouldn't be able to suppress the error indeed and should take care of providing the correct files. But if @rgayatri thinks that it makes sense to fill bad data with 0s, then we can do that.

Good that you noticed that all residues were checked instead of the relevant ones only @DaniBodor, and indeed better to make it check only those residues that are part of the graph.

@github-actions
Copy link
Copy Markdown

This PR is stale because it has been open for 14 days with no activity.

@github-actions github-actions bot added the stale issue not touched from too much time label Jun 28, 2023
@DaniBodor DaniBodor merged commit 921e8ff into main Jul 4, 2023
@DaniBodor DaniBodor deleted the 428_checkpssm_singlechain_dbodor branch July 4, 2023 10:12
@DaniBodor
Copy link
Copy Markdown
Collaborator Author

I believe that at this point the current PR is functional. If fed with accurate data, everything works as expected. Otherwise errors can be suppressed if absolutely necessary, but it can still fail if this touches data points that are part of the graph. This is okay behavior, as we do not want users to train with unexpected/unexplainable results.
As such, I will be merging and closing this PR. If there are any remaining problems to be solved about this, we can open new issues about them.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

stale issue not touched from too much time

Projects

None yet

Development

Successfully merging this pull request may close these issues.

problem with get_pssm when preprocessing variants problem with check_pssm

3 participants