Description
I have been in contact with NCBI because many proteins were missing from the prot.accession2taxid file. As a result, they had no corresponding taxID, and Krona Tools could not find them in the all.accession2taxid.sorted file (since prot.accession2taxid was incomplete).
The missing accessions in the prot.accession2taxid file were caused by NCBI's switch to gi-less records. The missing proteins are those that have accession numbers but no gi identifiers. However, the internal processing that generates the prot.accession2taxid file depends on the gi identifiers, hence the missing entries.
(I am guessing this might be causing issue #143 too.)
The NCBI developers have been working on processing changes that include gi-less accessions, and last week they made a version available that they would like me to test (https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/prot.accession2taxid.20201013_0121.gz).
Unfortunately, I cannot seem to get the accessions database into the correct format. I downloaded all the necessary files, replaced the old prot.accession2taxid file with the new one (prot.accession2taxid.20201013_0121), and then ran updateAccessions.sh --only-build. Grepping for the protein accessions in the all.accession2taxid.sorted file works, but running ktClassifyBLAST, ktImportBLAST, ... assigns root to all entries and reports that I should update the database because accessions were not found. Would you be able to help me with this, especially since the new format will replace the old one soon?
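In case it helps with comparing the old and new files, here is a minimal sketch of how their column layout could be summarized. It assumes the usual tab-separated accession2taxid layout with a header row (accession, accession.version, taxid, gi). The function name `summarize_accession2taxid` is hypothetical, and treating an empty, non-numeric, or zero gi as "no gi" is my assumption, since I do not know how the new file encodes gi-less records.

```python
import gzip
from collections import Counter


def summarize_accession2taxid(path, limit=None):
    """Summarize field counts and gi-less rows in an accession2taxid file.

    Assumes a tab-separated file with one header row and the gi in the
    fourth column; both are assumptions about the file layout.
    """
    # Read .gz files transparently, plain text otherwise.
    opener = gzip.open if str(path).endswith(".gz") else open
    field_counts = Counter()
    rows_without_gi = 0
    rows = 0
    with opener(path, "rt") as handle:
        header = handle.readline().rstrip("\n").split("\t")
        for line in handle:
            if limit is not None and rows >= limit:
                break
            fields = line.rstrip("\n").split("\t")
            field_counts[len(fields)] += 1
            gi = fields[3] if len(fields) > 3 else ""
            # Assumption: a missing, non-numeric, or zero gi means "gi-less".
            if not gi.isdigit() or int(gi) == 0:
                rows_without_gi += 1
            rows += 1
    return {
        "header": header,
        "rows": rows,
        "field_counts": dict(field_counts),
        "rows_without_gi": rows_without_gi,
    }
```

Running this on both the old and the new prot.accession2taxid file (with a `limit` to keep it quick) should show whether the number of columns or the gi column differs between them, which might explain why the rebuilt database is not being read correctly.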
Thanks in advance!