fix: check only 1 pssm for variant queries by DaniBodor · Pull Request #430 · DeepRank/deeprank2

DaniBodor · 2023-05-22T17:35:14Z

problem with check_pssm #428 fixed by only checking for residues on chains for which a pssm file was provided.
problem with get_pssm when preprocessing variants #429 will not be fixed, because we want users to provide pssm files for all residues that are included in the graph.
Created an option to suppress the error thrown by _check_pssm and give a warning instead.
- does this require additional documentation, beyond the docstrings?
- created tests for error suppression
Create tests for variant query where there is a second chain for which no pssm file is provided:
- throw error if second chain is inside radius
- no error if second chain is NOT inside radius
Refactor existing tests relating to _check_pssm
Allow for _check_pssm to throw more verbose errors
- this is only accessible from source code, which I think is ok because it's more of a troubleshooting/debugging mode thing than something that we expect the user to want to touch, right?

rgayatri · 2023-05-25T00:48:40Z

I tried with running preprocessing with this branch. It seems it still tries to check for all chains in PDB?
Unfortunately some PDBs have errors in chain names, so there are inconsistencies between no. of PDB chains and no. of PSSMs.
I'll do the checks again

rgayatri · 2023-05-31T10:04:51Z

The changes work. This pull request can be closed.
However, there should be a provision to include other pssm chains in case the radius includes multiple chains. can be dealt in #429

DaniBodor · 2023-05-31T10:45:17Z

The changes work. This pull request can be closed. However, there should be a provision to include other pssm chains in case the radius includes multiple chains. can be dealt in #429

It works because it now fills the pssm feature with 0s. So this PR actually deals with both #428 and #429, as they are closely related. In retrospect, maybe that wasn't ideal, because now things are getting a bit confused.
I will await decision on #429 to know whether we actually want to fill it with 0s, or want to force pssm files to be present for all chains and then either create separate PRs or resolve both issues here.

This reverts commit f83eeb1.

This reverts commit 6e778dc.

This reverts commit 1f17572.

This reverts commit 188df2e.

rgayatri · 2023-05-31T11:30:26Z

I am making a small dataset with variants that have neighbours in multiple chains.
I suppose that may be the only way to test consequence of 0s or non-0s in predictions (non-0s will be ideal for these cases)

DaniBodor · 2023-05-31T12:27:56Z

I am making a small dataset with variants that have neighbours in multiple chains. I suppose that may be the only way to test consequence of 0s or non-0s in predictions (non-0s will be ideal for these cases)

I think this will only tell you something about your particular use case, but we want a solution that is generally applicable to anyone using the package.
So in a wider context, we need to make a decision whether we want users to provide pssm files for all chains included in the network (i think preferable) or whether we allow them to omit pssm files for some chains and then the feature gets filled with 0s for those chains.
Note that for the second case, it would put 0s for whatever chain/chains is/are omitted, without differentiating what the main chain and the secondary chain is, etc.
For the first option (required pssm files), users can always just generate pssm files with all 0s if needed.

rgayatri · 2023-05-31T13:00:26Z

My test was for the general applicability. By omitting pssm files and adding 0s we would treat such variants as though they are solvent-exposed variants- which is not quite right, although I am not sure how many such cases are there. I have never tested such a thing with deeprank-mut, but it is expected and I do note that solvent-exposed variants are treated differently in network.

I would prefer to include all chains within the radius, so may be user can definitely include all relevant chains. Rest can be 0

DaniBodor · 2023-05-31T14:28:14Z

so do I understand correctly that all residues from chains within the radius should be included in the network and that for those chains a pssm file should be required (with an error message telling you which files are missing if they are not provided)?

DaniBodor · 2023-06-02T17:13:46Z

@rgayatri , can you please test on the current branch if it still works as intended for you?

Note that Query objects now have an optional argument suppress_pssm_errors

gcroci2 · 2023-06-09T12:56:16Z

deeprankcore/query.py

-def _load_ppi_pssms(pssm_paths: Union[Dict[str, str], None],
-                    chain_id1: str, chain_id2: str,
+def _load_ppi_pssms(pssm_paths: Optional[Dict[str, str]],
+                    chains: List[str],


Right now we don't support more than 2 chains for PPI, so I think it's better to have chain_id1 and chain_id2 to make it more explicit

I made the change mainly because the linter was complaining, but then realized it's nice because a few lines below we anyway loop through the two chains and turn them into a list there.
I think because it's an internal function, it doesn't matter too much how explicit it is now. And your comment makes me realize that if in the future we would want to allow more than 2 chains in a PPI, then this is one less issue to think about :)

gcroci2

Very nice edits, and great job with the tests!
I have only a minor comment about an internal function in query.py, and I think we're missing a test that uses suppress_pssm_errors=True. Indeed I was wondering, what happens if the errors are suppressed and pssm files are missing or incorrect? What will conservation-related features contain?

For _check_pssm, I think that the docstring so far is enough.

About the verbosity, I do agree that it's okay if it's accessible internally only.

DaniBodor · 2023-06-09T13:15:44Z

I think we're missing a test that uses suppress_pssm_errors=True. Indeed I was wondering, what happens if the errors are suppressed and pssm files are missing or incorrect? What will conservation-related features contain?

lines 390 and 413 of test_query.py ;-)

Good question what the hdf5 file will then contain. I will look into it.

gcroci2 · 2023-06-09T13:24:17Z

I think we're missing a test that uses suppress_pssm_errors=True. Indeed I was wondering, what happens if the errors are suppressed and pssm files are missing or incorrect? What will conservation-related features contain?

lines 390 and 413 of test_query.py ;-)

Good question what the hdf5 file will then contain. I will look into it.

Oh right! What if we explicitly set suppress_pssm_errors in the query initialization (maybe redefining the query in the tests)? This way is clearer how such a parameter can be used, instead of setting the internal _suppress. Since we don't have explicit examples about suppress_pssm_errors, can be good to have it in the tests

DaniBodor · 2023-06-09T16:03:42Z

I think we're missing a test that uses suppress_pssm_errors=True. Indeed I was wondering, what happens if the errors are suppressed and pssm files are missing or incorrect? What will conservation-related features contain?

lines 390 and 413 of test_query.py ;-)
Good question what the hdf5 file will then contain. I will look into it.

Oh right! What if we explicitly set suppress_pssm_errors in the query initialization (maybe redefining the query in the tests)? This way is clearer how such a parameter can be used, instead of setting the internal _suppress. Since we don't have explicit examples about suppress_pssm_errors, can be good to have it in the tests

I did it this way to make sure that the test uses the same instance of the object that failed if the error suppression was not turned on (a few lines above). This way, if someone modifies the Query object in the test, we are sure that both tests run the same instance.
Do you think I should add an additional test where it is set upon initialization?

DaniBodor · 2023-06-09T16:10:32Z

I think we're missing a test that uses suppress_pssm_errors=True. Indeed I was wondering, what happens if the errors are suppressed and pssm files are missing or incorrect? What will conservation-related features contain?

lines 390 and 413 of test_query.py ;-)

Good question what the hdf5 file will then contain. I will look into it.

OK, I looked into it:

for incorrect entries in the pssm: it just used the "wrong" data. So the pssm data that is associated to the position number of that residue, even though it is the incorrect residue (@rgayatri is this the behavior that you want?)
for missing entries in the pssm:
- if the entry is not part of the graph, it doesn't complain
- if the entry is a part of the graph, then it fails at a later step (so check_pssm passes, but then an error is thrown I think when it tries to write data into that key).

The latter point made me realize that it still checks all residues from the chain, irrespective of whether they are a part of the graph. I will rewrite so that it only checks those residues that are part of the graph.

@rgayatri , what behavior do you expect if a residue is missing from a pssm file? All 0s?

At the end of the day, all these tests make me feel that we just don't want people to use bad pssm files. I don't like the concept of being able to suppress the error, because now we need to hack ad hoc solutions to deal with bad data.

gcroci2 · 2023-06-12T08:40:23Z

I think we're missing a test that uses suppress_pssm_errors=True. Indeed I was wondering, what happens if the errors are suppressed and pssm files are missing or incorrect? What will conservation-related features contain?

lines 390 and 413 of test_query.py ;-)
Good question what the hdf5 file will then contain. I will look into it.

OK, I looked into it:

for incorrect entries in the pssm: it just used the "wrong" data. So the pssm data that is associated to the position number of that residue, even though it is the incorrect residue (@rgayatri is this the behavior that you want?)

for missing entries in the pssm:

if the entry is not part of the graph, it doesn't complain

if the entry is a part of the graph, then it fails at a later step (so check_pssm passes, but then an error is thrown I think when it tries to write data into that key).

The latter point made me realize that it still checks all residues from the chain, irrespective of whether they are a part of the graph. I will rewrite so that it only checks those residues that are part of the graph.

@rgayatri , what behavior do you expect if a residue is missing from a pssm file? All 0s?

At the end of the day, all these tests make me feel that we just don't want people to use bad pssm files. I don't like the concept of being able to suppress the error, because now we need to hack ad hoc solutions to deal with bad data.

I agree, ideally, a user shouldn't be able to suppress the error indeed and should take care of providing the correct files. But if @rgayatri thinks that it makes sense to fill bad data with 0s, then we can do that.

Good that you noticed that all residues were checked instead of the relevant ones only @DaniBodor, and indeed better to make it check only those residues that are part of the graph.

github-actions · 2023-06-28T03:33:38Z

This PR is stale because it has been open for 14 days with no activity.

DaniBodor · 2023-07-04T10:13:24Z

I believe that at this point the current PR is functional. If fed with accurate data, everything works as expected. Otherwise errors can be suppressed if absolutely necessary, but it can still fail if this touches data points that are part of the graph. This is okay behavior, as we do not want users to train with unexpected/unexplainable results.
As such, I will be merging and closing this PR. If there are any remaining problems to be solved about this, we can open new issues about them.

DaniBodor added 4 commits May 22, 2023 13:59

improve type hinting for Optional

503fc8f

allow single ModuleType as feature_modules in query.build()

2b55a78

only run _check_pssm on relevant chain for variant query

baa8749

avoid double .hdf5 at filename

5e8aa52

DaniBodor mentioned this pull request May 22, 2023

problem with get_pssm when preprocessing variants #429

Closed

DaniBodor added 4 commits May 24, 2023 12:04

allow for missing pssm file

188df2e

pssm and infocontent -> 0s if no pssm file for chain

1f17572

zeros for no pssm

6e778dc

remove obsolete test

f83eeb1

DaniBodor linked an issue May 24, 2023 that may be closed by this pull request

problem with get_pssm when preprocessing variants #429

Closed

DaniBodor added 4 commits May 31, 2023 13:15

Revert "remove obsolete test"

9d944c5

This reverts commit f83eeb1.

Revert "zeros for no pssm"

f0cf5a2

This reverts commit 6e778dc.

Revert "pssm and infocontent -> 0s if no pssm file for chain"

498c1cc

This reverts commit 1f17572.

Revert "allow for missing pssm file"

e05ec40

This reverts commit 188df2e.

DaniBodor added 9 commits June 2, 2023 15:32

make error message more verbose again

e4e0af9

allow for different levels of verbosity

d8e9895

allow for suppression of error

de34e4a

update docstring for _check_pssm

3310596

allow for suppression upon Query initiation

311507d

linting

1c6c78c

error inside if block

2ae273e

refactor pssm tests

8fded69

add tests for error suppression

ce65999

DaniBodor added 5 commits June 2, 2023 17:24

remove trailing line

d2af507

add test data for multichain variant test

ccdcc33

tests for variant query with multiple chains

6c0e980

make Query.suppress internal attribute

f388af6

improve suppress_pssm_errors docstring

88dc16b

DaniBodor requested review from gcroci2 and rgayatri June 2, 2023 17:10

gcroci2 reviewed Jun 9, 2023

View reviewed changes

gcroci2 approved these changes Jun 9, 2023

View reviewed changes

github-actions bot added the stale issue not touched from too much time label Jun 28, 2023

Merge branch 'main' into 428_checkpssm_singlechain_dbodor

82e793c

DaniBodor merged commit 921e8ff into main Jul 4, 2023

DaniBodor deleted the 428_checkpssm_singlechain_dbodor branch July 4, 2023 10:12

Conversation

DaniBodor commented May 22, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

rgayatri commented May 25, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

rgayatri commented May 31, 2023

Uh oh!

DaniBodor commented May 31, 2023

Uh oh!

rgayatri commented May 31, 2023

Uh oh!

DaniBodor commented May 31, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

rgayatri commented May 31, 2023

Uh oh!

DaniBodor commented May 31, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

DaniBodor commented Jun 2, 2023

Uh oh!

gcroci2 Jun 9, 2023

Choose a reason for hiding this comment

Uh oh!

DaniBodor Jun 9, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

gcroci2 left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

DaniBodor commented Jun 9, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

gcroci2 commented Jun 9, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

DaniBodor commented Jun 9, 2023

Uh oh!

DaniBodor commented Jun 9, 2023

Uh oh!

gcroci2 commented Jun 12, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

github-actions bot commented Jun 28, 2023

Uh oh!

DaniBodor commented Jul 4, 2023

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

DaniBodor commented May 22, 2023 •

edited

Loading

rgayatri commented May 25, 2023 •

edited

Loading

DaniBodor commented May 31, 2023 •

edited

Loading

DaniBodor commented May 31, 2023 •

edited

Loading

DaniBodor Jun 9, 2023 •

edited

Loading

gcroci2 left a comment •

edited

Loading

DaniBodor commented Jun 9, 2023 •

edited

Loading

gcroci2 commented Jun 9, 2023 •

edited

Loading

gcroci2 commented Jun 12, 2023 •

edited

Loading