Skip to content

Conversation

@xzy3
Copy link
Contributor

@xzy3 xzy3 commented Jul 7, 2020

…hat tempfile.NamedTemporaryFile can be used easily.

3.7.0 (default, Aug 15 2018, 09:03:42)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-16)]
CPython
Linux-3.10.0-1127.e17.x86_64-x86_64-with-centos-7.8.2003-Core
Bio.version 1.77

This line:

line = next(handle)

has broken reading from tempfile.NamedTemporaryFile unless you do thisSeqIO.parse(tmp_fd.file, 'fastq'). NamedTemporaryFile returns https://github.com/python/cpython/blob/1bf9cc509326bc42cd8cb1650eb9bf64550d817e/Lib/tempfile.py#L457
which is for some reason seems to not be its own iterator. I don't see any recent changes to the tempfile module to cause this.

  • I hereby agree to dual licence this and any previous contributions under both
    the Biopython License Agreement AND the BSD 3-Clause License.

  • I have read the CONTRIBUTING.rst file, have run flake8 locally, and
    understand that AppVeyor and TravisCI will be used to confirm the Biopython unit
    tests and style checks pass with these changes.

  • I have added my name to the alphabetical contributors listings in the files
    NEWS.rst and CONTRIB.rst as part of this pull request, am listed
    already, or do not wish to be listed. (This acknowledgement is optional.)

@xzy3 xzy3 requested a review from peterjc as a code owner July 7, 2020 03:44
@codecov
Copy link

codecov bot commented Jul 7, 2020

Codecov Report

Attention: Patch coverage is 91.66667% with 1 line in your changes missing coverage. Please review.

Project coverage is 86.30%. Comparing base (e838eb9) to head (1d18ede).
Report is 15 commits behind head on master.

Files with missing lines Patch % Lines
Bio/SeqIO/PirIO.py 80.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #3031      +/-   ##
==========================================
+ Coverage   85.37%   86.30%   +0.93%     
==========================================
  Files         282      282              
  Lines       59385    59467      +82     
==========================================
+ Hits        50697    51323     +626     
+ Misses       8688     8144     -544     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@peterjc
Copy link
Member

peterjc commented Jul 7, 2020

Have you looked at the Python issue tracker to see if this is a known issue with tempfile.NamedTemporaryFile itself?

Can you add a test using a tempfile.NamedTemporaryFile - this could affect other parsers in SeqIO too?

@xzy3
Copy link
Contributor Author

xzy3 commented Jul 7, 2020

There is some discussion on the Python mailing list from 2016, but nothing seemed to come of it: https://mail.python.org/pipermail/python-list/2016-July/862590.html

The git blame on tempfile.NamedTemporaryFile says it hasn't been touched in 5+ years. So it's at least a very old bug. There are no mentions of this specific quirk on the Python bugs list that I can find.

And sure i'll work on a unittest.

@mdehoon
Copy link
Contributor

mdehoon commented Jul 8, 2020

@xzy3 Can you provide a self-contained example showing the problem?

@xzy3
Copy link
Contributor Author

xzy3 commented Jul 8, 2020

@mdehoon here you go

from tempfile import NamedTemporaryFile
from Bio import SeqIO

test_seq = """\
@HWI-EAS209_0006_FC706VJ:5:58:5894:21141#ATCACG/1
TTAATTGGTAAATAAATCTCCTAATAGCTTAGATNTTACCTTNNNNNNNNNNTAGTTTCTTGAGATTTGTTGGGGGAGACATTTTTGTGATTGCCTTGAT
+HWI-EAS209_0006_FC706VJ:5:58:5894:21141#ATCACG/1
efcfffffcfeefffcffffffddf`feed]`]_Ba_^__[YBBBBBBBBBBRTT\]][]dddd`ddd^dddadd^BBBBBBBBBBBBBBBBBBBBBBBB
"""

with NamedTemporaryFile(mode='w+t') as fd:
  fd.write(test_seq)
  fd.seek(0)
  next(SeqIO.parse(fd, 'fastq'))
(stantz5) xzy3@biolinux> python ~/test-seqio.py
Traceback (most recent call last):
  File "/scicomp/home/xzy3/test-seqio.py", line 14, in <module>
    next(SeqIO.parse(fd, 'fastq'))
  File "/scicomp/home/xzy3/.local/virtualenvs/stantz5/lib/python3.7/site-packages/Bio/SeqIO/Interfaces.py", line 68, in __next__
    return next(self.records)
  File "/scicomp/home/xzy3/.local/virtualenvs/stantz5/lib/python3.7/site-packages/Bio/SeqIO/QualityIO.py", line 1081, in iterate
    for title_line, seq_string, quality_string in FastqGeneralIterator(handle):
  File "/scicomp/home/xzy3/.local/virtualenvs/stantz5/lib/python3.7/site-packages/Bio/SeqIO/QualityIO.py", line 923, in FastqGeneralIterator
    line = next(handle)
TypeError: '_TemporaryFileWrapper' object is not an iterator

@xzy3
Copy link
Contributor Author

xzy3 commented Jul 8, 2020

MafIO turned out to have the exact same problem. I did the same fix to that parser also.

@mdehoon
Copy link
Contributor

mdehoon commented Jul 8, 2020

@xzy3 Thank you.

It seems that this is a bug in Python. For example, we get the same problem here:

>>> fd = open("mytestfile", "w+t")
>>> fd.write('something')
9
>>> fd.seek(0)
0
>>> next(fd)
'something'
>>> fd.close()
>>> from tempfile import NamedTemporaryFile            
>>> fd = NamedTemporaryFile(mode='w+t')
>>> fd.write('something')
9
>>> fd.seek(0)
0
>>> next(fd)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: '_TemporaryFileWrapper' object is not an iterator
>>>  

if this is indeed a bug in Python, I don't think we should fix this in Biopython, in particular since there is a trivial workaround.

@xzy3
Copy link
Contributor Author

xzy3 commented Jul 8, 2020

@mdehoon This exact thing was brought up on the Python mailing list in 2016 and it was informally dropped as "not a bug". The argument against was that NamedTemporaryFile does not claim to be an iterator in the documentation, and iter should be called on it rather than changing the wrapper object. I can agree, it definitely violates my assumptions of how it should work. I couldn't find a bug report in the Python bug tracker open or closed. However I can try formally pushing this upstream and see if we get a different result.
The result of NamedTemporaryFile is a wrapper object that uses dunder method magic to forward method calls to the underlying file. The most recent change to the object was to change __iter__ to be a generator in response to a bug. Apparently the iter builtin does not use __getattr__ to find methods and that may be true of __next__ also.

@peterjc
Copy link
Member

peterjc commented Jul 9, 2020

I agree, it does seem to be a bug in Python itself, and would support reporting a bug to Python directly.

However, even if they do fix it, it won't be widely available for a while (and may never be backported as far at Python 3.6), so being pragmatic I am OK with adding the workaround (with tests) to Biopython.

@xzy3
Copy link
Contributor Author

xzy3 commented Jul 9, 2020

@peterjc Ok I'll report it to the Python bugtracker and reference this discussion.

@mdehoon
Copy link
Contributor

mdehoon commented Jul 9, 2020

@peterjc Even then I don't think this needs to be fixed in Biopython. As @xzy3 wrote, one workaround is to use the file property of the named temporary file as in SeqIO.parse(tmp_fd.file, 'fastq'). Another solution is to call iter on tmp_fd, as in SeqIO.parse(iter(tmp_fd), 'fastq'). Both are simple workarounds that a user can use. If we do fix this in Biopython, we would have to fix it everywhere (not just Bio.SeqIO).

@xzy3
Copy link
Contributor Author

xzy3 commented Jul 9, 2020

@mdehoon as best I can tell there are only two places in SeqIO that use next directly on a file. I've fixed them both in this patch.

@xzy3
Copy link
Contributor Author

xzy3 commented Jul 10, 2020

@peterjc ok tests added, fixed a spot in MafIO that had the same problem. I made an issue over at the Python bug tracker and submitted a pull request python/cpython#21434

@peterjc
Copy link
Member

peterjc commented Jul 14, 2020

Looks fairly positive so far on your Python pull request 👍

@peterjc
Copy link
Member

peterjc commented Jul 22, 2020

Looks like the Python pull request is awaiting review by the core team.

Even if it is merged, I am inclined to merge this fix (although we have some git conflicts to resolve first). [fixed typo]

@mdehoon would you prefer to wait and see what the Python team decides?

@mdehoon
Copy link
Contributor

mdehoon commented Jul 22, 2020

@peterjc Yes we should wait and see the Python team decides.

@xzy3
Copy link
Contributor Author

xzy3 commented Oct 19, 2020

Well I committed a brain-o and got the first pull request in a bad state. I had to restart it so it's now here python/cpython#22766

@peterjc
Copy link
Member

peterjc commented Oct 19, 2020

Let's hope the CPython team accept that - thank you.

super().__init__(source, fmt="Fasta")
try:
line = next(self.stream)
line = next(iter(self.stream))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to use line = self.stream.readline()?

raise StopIteration
stream = self.stream
# NamedTemporaryFiles don't support __next__ so create an iter as a workaround
stream = iter(self.stream)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you actually need this here? We are not calling next on stream, and for line in stream implicitly calls iter on stream.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

line 155 is next(stream).step() which breaks in the unit tests when run with NamedTempFiles.

# handle as a work around.
handle_iter = iter(handle)
try:
line = next(handle)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you use readline instead?

if line is None:
try:
line = next(self.stream)
line = next(stream)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here. The current fix required calling iter each time we enter __next__.

@xzy3 xzy3 force-pushed the named-tempfile-fix branch from 50f2e60 to 735c2e2 Compare February 5, 2025 13:33
@xzy3
Copy link
Contributor Author

xzy3 commented Feb 5, 2025

switched to fd.readline() everywhere. Doing so required some minor changes to parser logic.

line = next(self.stream)
except StopIteration:
line = self.stream.readline()
if line == "":
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think if not line: is considered better stylistically.

@xzy3 xzy3 force-pushed the named-tempfile-fix branch from 735c2e2 to 32d5038 Compare February 5, 2025 15:15

identifier = line[4:].strip()
description = next(stream).strip()
description = self.stream.readline()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you need to call .strip() here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's called 3 lines down after checking if the empty string was returned to indicate EOF.

self.line = line
if line is None:
line = self.stream.readline()
if line is None or line == "":
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think line cannot be None here.

# allows parsing of the last bundle without duplicating code
try:
line = next(handle)
line = next(handle_iter)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you use .readline() here instead of creating handle_iter?

tempfile.NamedTemporaryFiles do not support __next__.
@xzy3 xzy3 force-pushed the named-tempfile-fix branch from 32d5038 to 1d18ede Compare February 6, 2025 14:00
@xzy3
Copy link
Contributor Author

xzy3 commented Feb 21, 2025

Do I need to do anything else?

@mdehoon mdehoon requested review from mdehoon and removed request for mdehoon February 22, 2025 01:29
@peterjc peterjc merged commit d6fcdfa into biopython:master Feb 24, 2025
31 checks passed
@peterjc
Copy link
Member

peterjc commented Feb 24, 2025

Thank you both.

@xzy3 xzy3 deleted the named-tempfile-fix branch February 24, 2025 17:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants