fix: pdb files with underscore in the filename gives unexpected query ids by joyceljy · Pull Request #447 · DeepRank/deeprank2

joyceljy · 2023-06-14T13:41:21Z

Previously, query ids were treated as duplicates and assigned a new id whenever the pdb files shared the same base name (the part of the file name before the underscore character).
For example:

pdb_paths = [
    str(PATH_TEST / "data/pdb/1ATN/1ATN_1w.pdb"),
    str(PATH_TEST / "data/pdb/1ATN/1ATN_2w.pdb"),
    str(PATH_TEST / "data/pdb/1ATN/1ATN_3w.pdb"),
    str(PATH_TEST / "data/pdb/1ATN/1ATN_4w.pdb")]

will give the following warnings and rename the query ids:

Query with ID residue-ppi:A-B:1ATN has already been added to the collection. Renaming it as residue-ppi:A-B:1ATN_2
Query with ID residue-ppi:A-B:1ATN has already been added to the collection. Renaming it as residue-ppi:A-B:1ATN_3
Query with ID residue-ppi:A-B:1ATN has already been added to the collection. Renaming it as residue-ppi:A-B:1ATN_4

because of the same base name 1ATN.
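A minimal sketch of how base-name counting produces these renames. It assumes the only relevant step is the `query_id.split("_")[0]` line visible in the diff; the function name `assign_ids_by_base_name` and the choice to leave the first id unchanged are illustrative assumptions, not the actual deeprank2 code.

```python
def assign_ids_by_base_name(query_ids):
    """Hypothetical helper: count queries by the part of the id before the first "_"."""
    ids_count = {}
    assigned = []
    for query_id in query_ids:
        # Everything after the first underscore is ignored, so
        # "...:1ATN_1w" and "...:1ATN_2w" share the base "...:1ATN".
        query_id_base = query_id.split("_")[0]
        if query_id_base not in ids_count:
            ids_count[query_id_base] = 1
            assigned.append(query_id)  # assumption: first occurrence is kept as-is
        else:
            ids_count[query_id_base] += 1
            assigned.append(f"{query_id_base}_{ids_count[query_id_base]}")
    return assigned


query_ids = [f"residue-ppi:A-B:1ATN_{i}w" for i in range(1, 5)]
assigned_ids = assign_ids_by_base_name(query_ids)
print(assigned_ids)
```

In this sketch the second, third, and fourth ids are renamed to `..._2`, `..._3`, and `..._4`, matching the renames in the warnings above, even though the four files are distinct.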

Now, duplicate pdb file names are detected using the following rule:

If the pdb file name contains an underscore character, the check looks for the complete pdb file name (base name + the part after the underscore) in the self._queries list.
For example, since 1ATN_1w.pdb contains an underscore, the check looks for an identical query id named 1ATN_1w in the query list, instead of only checking whether the base name 1ATN exists.
If the pdb file name does not contain an underscore character, the check looks for the complete query name (the base name) in self.ids_count.
For example, 1A6B.pdb contains no underscore, so if a second pdb file named 1A6B.pdb is added, it will be renamed to 1A6B_2.
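The rule above could be sketched roughly as follows. This only illustrates the duplicate check itself; `is_duplicate`, `existing_model_ids`, and the contents of `ids_count` are hypothetical stand-ins, and the actual change in the PR may differ.

```python
def is_duplicate(model_id, existing_model_ids, ids_count):
    """Hypothetical check following the two cases described above."""
    if "_" in model_id:
        # Underscore present: compare the complete name, not just the base name.
        return model_id in existing_model_ids
    # No underscore: the complete name is the base name, so look it up in the counts.
    return model_id in ids_count


# Assumed state after adding 1ATN_1w.pdb and 1A6B.pdb once each.
existing_model_ids = ["1ATN_1w", "1A6B"]
ids_count = {"1ATN": 1, "1A6B": 1}

new_variant = is_duplicate("1ATN_2w", existing_model_ids, ids_count)   # different file, same base
same_variant = is_duplicate("1ATN_1w", existing_model_ids, ids_count)  # same file added again
same_plain = is_duplicate("1A6B", existing_model_ids, ids_count)       # same file, no underscore
print(new_variant, same_variant, same_plain)
```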

joyceljy · 2023-06-14T14:00:50Z

@DaniBodor I am not sure if I'm understanding the issue correctly, can you give it a look?
If everything is fine then I will add a unit test for this later on. Thanks!

gcroci2 · 2023-06-16T07:36:08Z

deeprankcore/query.py


        query_id_base = query_id.split("_")[0]
+
+        warn_duplicate = False    


warn_duplicate is a parameter that lets the user decide whether warnings are printed when duplicates are found. As you modified it, warnings will be printed regardless of what the user has set.
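To illustrate the concern, here is a small sketch of what happens when a function reassigns one of its own parameters. Both functions are hypothetical and are not taken from the PR diff (only the `warn_duplicate = False` line is visible there); they just show that a local reassignment discards the caller's value.

```python
import logging

_log = logging.getLogger("warn_duplicate_sketch")


def add_overriding_flag(seen_ids, query_id, warn_duplicate=True):
    """Hypothetical variant: the parameter is reassigned inside the function."""
    warn_duplicate = False  # the caller's choice is discarded here
    if query_id in seen_ids:
        warn_duplicate = True  # now driven only by internal state
    seen_ids.add(query_id)
    if warn_duplicate:
        _log.warning("Query with ID %s has already been added.", query_id)
    return warn_duplicate  # True when a warning was logged


def add_respecting_flag(seen_ids, query_id, warn_duplicate=True):
    """Hypothetical variant: the parameter is only read, never reassigned."""
    is_duplicate = query_id in seen_ids
    seen_ids.add(query_id)
    warned = is_duplicate and warn_duplicate
    if warned:
        _log.warning("Query with ID %s has already been added.", query_id)
    return warned


overriding_seen, respecting_seen = set(), set()
for _ in range(2):  # add the same id twice, asking for no warnings
    overriding_warned = add_overriding_flag(overriding_seen, "1ATN", warn_duplicate=False)
    respecting_warned = add_respecting_flag(respecting_seen, "1ATN", warn_duplicate=False)
```

On the second call, the overriding variant still logs a warning although the caller passed `warn_duplicate=False`, while the second variant stays silent.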

gcroci2

Thanks for picking this up :)
I think we can solve the issue in a much simpler way, like the following:

    def add(self, query: Query, verbose: bool = False, warn_duplicate: bool = True):
        """
        Adds a new query to the collection.

        Args:
            query (:class:`Query`): Must be a :class:`Query` object, either :class:`ProteinProteinInterfaceResidueQuery` or
                :class:`SingleResidueVariantAtomicQuery`.
            verbose (bool, optional): Whether to log the IDs of added queries. Defaults to False.
            warn_duplicate (bool, optional): Whether to log a warning before renaming when a duplicate query is identified. Defaults to True.

        """
        query_id = query.get_query_id()

        if verbose:
            _log.info(f'Adding query with ID {query_id}.')

        if query_id not in self.ids_count:
            self.ids_count[query_id] = 1
        else:
            self.ids_count[query_id] += 1
            new_id = query.model_id + "_" + str(self.ids_count[query_id])
            query.model_id = new_id
            
            if warn_duplicate:
                _log.warning(f'Query with ID {query_id} has already been added to the collection. Renaming it as {query.get_query_id()}')

        self._queries.append(query)

Here I am directly using query_id to keep track of the query ids in the ids_count dictionary, without splitting anything on the _ character.
I am not sure why we didn't implement it this way in the first place, or why it was necessary to split on _, but I think it should work. Tests pass on my local machine. That is not a guarantee, since we don't have relevant tests yet, but at least it means the change doesn't break anything.
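A self-contained sketch to try the suggested logic outside the package. `StubQuery` and `StubCollection` are hypothetical stand-ins for the real `Query` and `QueryCollection` classes, and the `residue-ppi:A-B` prefix is simply borrowed from the warnings quoted in the PR description.

```python
import logging
from dataclasses import dataclass

_log = logging.getLogger(__name__)


@dataclass
class StubQuery:
    """Hypothetical stand-in exposing only what add() needs."""
    model_id: str
    prefix: str = "residue-ppi:A-B"

    def get_query_id(self):
        return f"{self.prefix}:{self.model_id}"


class StubCollection:
    """Hypothetical stand-in holding the same two attributes used by add()."""

    def __init__(self):
        self._queries = []
        self.ids_count = {}

    def add(self, query, verbose=False, warn_duplicate=True):
        query_id = query.get_query_id()
        if verbose:
            _log.info(f"Adding query with ID {query_id}.")
        if query_id not in self.ids_count:
            self.ids_count[query_id] = 1
        else:
            # Only an exact id match counts as a duplicate and triggers a rename.
            self.ids_count[query_id] += 1
            query.model_id = query.model_id + "_" + str(self.ids_count[query_id])
            if warn_duplicate:
                _log.warning(
                    f"Query with ID {query_id} has already been added to the collection. "
                    f"Renaming it as {query.get_query_id()}"
                )
        self._queries.append(query)


collection = StubCollection()
for model_id in ["1ATN_1w", "1ATN_2w", "1A6B", "1A6B"]:
    collection.add(StubQuery(model_id))

resulting_ids = [query.get_query_id() for query in collection._queries]
print(resulting_ids)
```

With these inputs the two 1ATN variants keep their own ids, and only the second 1A6B is renamed to 1A6B_2.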

Then as you mentioned, we need at least a unit test for testing relevant cases.

joyceljy · 2023-06-16T11:25:05Z

Thanks, Giulia!
I was aware that ids_count is keyed on the base name of the pdb files, and I thought it might be used somewhere else. That's why I chose not to modify that part. But as you said, if there is no restriction on which names are stored in ids_count, your way is the most efficient :)

tests/test_querycollection.py

gcroci2

Almost there :) I gave some suggestions about the test

DaniBodor

Looks really good! Elegant solution.

Could you just add an empty line to the end of the test file? No super important reason, but it's recommended to always end files on an empty line.

DaniBodor · 2023-07-03T17:09:18Z

I think merging is still blocked because @gcroci2 once requested changes and hasn't formally accepted yet. Can you please do so now :)

Fix standard for handling duplicate query ids

3811d10

joyceljy self-assigned this Jun 14, 2023

joyceljy linked an issue Jun 14, 2023 that may be closed by this pull request

Creating queries from pdb files with underscore in the filename gives unexpected query ids #411

Closed

joyceljy requested a review from DaniBodor June 14, 2023 14:00

Fix wrong warning message shown

2894bba

joyceljy requested a review from gcroci2 June 15, 2023 09:14

gcroci2 reviewed Jun 16, 2023

gcroci2 requested changes Jun 16, 2023

Chia Yu Lin added 4 commits June 16, 2023 16:17

Modify ids_count column

39b67b0

Add unittest for add function

fc06133

Change unit test name

4311e07

Fix linting

8e19903

joyceljy requested a review from gcroci2 June 16, 2023 14:41