Skip to content

fix: pdb files with underscore in the filename gives unexpected query ids#447

Merged
joyceljy merged 8 commits intomainfrom
411_QueryID_UnderScore_joyceljy
Jul 4, 2023
Merged

fix: pdb files with underscore in the filename gives unexpected query ids#447
joyceljy merged 8 commits intomainfrom
411_QueryID_UnderScore_joyceljy

Conversation

@joyceljy
Copy link
Copy Markdown
Collaborator

@joyceljy joyceljy commented Jun 14, 2023

Before the query ids will be seen as duplicates and be assigned a new id when the pdb files have the same base name (file name before the underscore character).
For example:

pdb_paths = [
    str(PATH_TEST / "data/pdb/1ATN/1ATN_1w.pdb"),
    str(PATH_TEST / "data/pdb/1ATN/1ATN_2w.pdb"),
    str(PATH_TEST / "data/pdb/1ATN/1ATN_3w.pdb"),
    str(PATH_TEST / "data/pdb/1ATN/1ATN_4w.pdb")]

will give warnings as followings and rename the pdb files:

Query with ID residue-ppi:A-B:1ATN has already been added to the collection. Renaming it as residue-ppi:A-B:1ATN_2
Query with ID residue-ppi:A-B:1ATN has already been added to the collection. Renaming it as residue-ppi:A-B:1ATN_3
Query with ID residue-ppi:A-B:1ATN has already been added to the collection. Renaming it as residue-ppi:A-B:1ATN_4

because of the same base name 1ATN.

Now, the duplicate pdb file name will be checked using the standard below:

  1. If the pdb file contains an underscore character, then it will further check whether the complete pdb file name(base name + name after underscore) already exists in the self._queries list.
    For example for 1ATN_1w.pdb contains an underscore, it will be checked if there exists a complete same query id in the query list named 1ATN_1w instead of only checking whether the base name 1ATNexists.
  2. If the pdb file does not contain an underscore character, it will check whether the complete query name(base name) already exists in self.ids_count list.
    For example for 1A6B.pdb, which contains no underscore, if a same pdb file named 1A6B.pdb appears, it will be renamed to 1A6B_2

@joyceljy
Copy link
Copy Markdown
Collaborator Author

joyceljy commented Jun 14, 2023

@DaniBodor I am not sure if I'm understanding the issue correctly, can you give it a look?
If everything is fine then I will add a unit test for this later on. Thanks!

@joyceljy joyceljy requested a review from DaniBodor June 14, 2023 14:00
@joyceljy joyceljy requested a review from gcroci2 June 15, 2023 09:14

query_id_base = query_id.split("_")[0]

warn_duplicate = False
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

warn_duplicate is a parameter for the user to determine if printing warnings in case of duplicates or not. As you modified it, warnings will be printed independently from what the user has set up.

Copy link
Copy Markdown
Collaborator

@gcroci2 gcroci2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for picking this up :)
I think we can solve the issue in a much simpler way, like the following:

    def add(self, query: Query, verbose: bool = False, warn_duplicate: bool = True):
        """
        Adds a new query to the collection.

        Args:
            query(:class:`Query`): Must be a :class:`Query` object, either :class:`ProteinProteinInterfaceResidueQuery` or
                :class:`SingleResidueVariantAtomicQuery`.    
            verbose(bool, optional): For logging query IDs added, defaults to False.
            warn_duplicate (bool): Log a warning before renaming if a duplicate query is identified.

        """
        query_id = query.get_query_id()

        if verbose:
            _log.info(f'Adding query with ID {query_id}.')

        if query_id not in self.ids_count:
            self.ids_count[query_id] = 1
        else:
            self.ids_count[query_id] += 1
            new_id = query.model_id + "_" + str(self.ids_count[query_id])
            query.model_id = new_id
            
            if warn_duplicate:
                _log.warning(f'Query with ID {query_id} has already been added to the collection. Renaming it as {query.get_query_id()}')

        self._queries.append(query)

Here I am just directly using query_id to keep track of the queries ids in the ids_count dictionary, without splitting anything on _ character.
I am not sure why we didn't implement this in the first place, and why it was necessary to split on _, but I think it should work. Tests pass on my local machine, but this is not a guarantee since we don't have relevant tests yet, but at least it means that it doesn't break anything.

Then as you mentioned, we need at least a unit test for testing relevant cases.

@joyceljy
Copy link
Copy Markdown
Collaborator Author

joyceljy commented Jun 16, 2023

Thanks, Giulia!
I am also aware that the ids_count is counting on the base_name of the pdb files and I thought that it may be used somewhere else. That's the reason why I chose not to modify that part. But like you said if there is no limitation on the name storing to ids_count, your way will be the most efficient way:)

@joyceljy joyceljy requested a review from gcroci2 June 16, 2023 14:41
Copy link
Copy Markdown
Collaborator

@gcroci2 gcroci2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Almost there :) I gave some suggestions about the test

@joyceljy joyceljy requested a review from gcroci2 June 21, 2023 11:34
@DaniBodor DaniBodor removed their request for review June 26, 2023 13:52
Copy link
Copy Markdown
Collaborator

@DaniBodor DaniBodor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks really good! Elegant solution.

Could you just add an empty line to the end of the test file. No super important reason, but it's recommended to always end on an empty line.

@DaniBodor
Copy link
Copy Markdown
Collaborator

I think merging is still blocked because @gcroci2 once requested changes and hasn't formally accepted yet. Can you please do so now :)

@joyceljy joyceljy merged commit b97ecbf into main Jul 4, 2023
@joyceljy joyceljy deleted the 411_QueryID_UnderScore_joyceljy branch July 4, 2023 14:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Creating queries from pdb files with underscore in the filename gives unexpected query ids

3 participants