fix: pdb files with underscore in the filename gives unexpected query ids#447
fix: pdb files with underscore in the filename gives unexpected query ids#447
Conversation
|
@DaniBodor I am not sure if I'm understanding the issue correctly, can you give it a look? |
deeprankcore/query.py
Outdated
|
|
||
| query_id_base = query_id.split("_")[0] | ||
|
|
||
| warn_duplicate = False |
There was a problem hiding this comment.
warn_duplicate is a parameter for the user to determine if printing warnings in case of duplicates or not. As you modified it, warnings will be printed independently from what the user has set up.
There was a problem hiding this comment.
Thanks for picking this up :)
I think we can solve the issue in a much simpler way, like the following:
def add(self, query: Query, verbose: bool = False, warn_duplicate: bool = True):
"""
Adds a new query to the collection.
Args:
query(:class:`Query`): Must be a :class:`Query` object, either :class:`ProteinProteinInterfaceResidueQuery` or
:class:`SingleResidueVariantAtomicQuery`.
verbose(bool, optional): For logging query IDs added, defaults to False.
warn_duplicate (bool): Log a warning before renaming if a duplicate query is identified.
"""
query_id = query.get_query_id()
if verbose:
_log.info(f'Adding query with ID {query_id}.')
if query_id not in self.ids_count:
self.ids_count[query_id] = 1
else:
self.ids_count[query_id] += 1
new_id = query.model_id + "_" + str(self.ids_count[query_id])
query.model_id = new_id
if warn_duplicate:
_log.warning(f'Query with ID {query_id} has already been added to the collection. Renaming it as {query.get_query_id()}')
self._queries.append(query)Here I am just directly using query_id to keep track of the queries ids in the ids_count dictionary, without splitting anything on _ character.
I am not sure why we didn't implement this in the first place, and why it was necessary to split on _, but I think it should work. Tests pass on my local machine, but this is not a guarantee since we don't have relevant tests yet, but at least it means that it doesn't break anything.
Then as you mentioned, we need at least a unit test for testing relevant cases.
|
Thanks, Giulia! |
gcroci2
left a comment
There was a problem hiding this comment.
Almost there :) I gave some suggestions about the test
DaniBodor
left a comment
There was a problem hiding this comment.
Looks really good! Elegant solution.
Could you just add an empty line to the end of the test file. No super important reason, but it's recommended to always end on an empty line.
|
I think merging is still blocked because @gcroci2 once requested changes and hasn't formally accepted yet. Can you please do so now :) |
Before the query ids will be seen as duplicates and be assigned a new id when the pdb files have the same base name (file name before the underscore character).
For example:
will give warnings as followings and rename the pdb files:
because of the same base name
1ATN.Now, the duplicate pdb file name will be checked using the standard below:
self._querieslist.For example for
1ATN_1w.pdbcontains an underscore, it will be checked if there exists a complete same query id in the query list named1ATN_1winstead of only checking whether the base name1ATNexists.self.ids_countlist.For example for
1A6B.pdb, which contains no underscore, if a same pdb file named1A6B.pdbappears, it will be renamed to1A6B_2