Skip to content

[WIP] Adding a test that creates an arbitrarily large .tar file and read from it via HTTP#40

Closed
NivekT wants to merge 6 commits intogh/NivekT/7/basefrom
gh/NivekT/7/head
Closed

[WIP] Adding a test that creates an arbitrarily large .tar file and read from it via HTTP#40
NivekT wants to merge 6 commits intogh/NivekT/7/basefrom
gh/NivekT/7/head

Conversation

@NivekT
Copy link
Contributor

@NivekT NivekT commented Oct 4, 2021

Stack from ghstack:

The goal here is to create an arbitrarily large .tar file and read from it via HTTP.

Currently, it seems to work if we cache the file first and then read from it. However, an error is raised when we attempt to read directly from the HTTPReader because its stream does not support the operationseek:

Traceback (most recent call last):
  File "/Users/ktse/data/test/test_stream.py", line 66, in <module>
    for fname, stream in tar_dp:
  File "/Users/ktse/data/torchdata/datapipes/iter/util/tararchivereader.py", line 62, in __iter__
    raise e
  File "/Users/ktse/data/torchdata/datapipes/iter/util/tararchivereader.py", line 48, in __iter__
    tar = tarfile.open(fileobj=cast(Optional[IO[bytes]], data_stream), mode=self.mode)
  File "/Users/ktse/miniconda3/envs/pytorch/lib/python3.9/tarfile.py", line 1609, in open
    saved_pos = fileobj.tell()
io.UnsupportedOperation: seek

NivekT added a commit that referenced this pull request Oct 4, 2021
…om it via HTTP

ghstack-source-id: b300814
Pull Request resolved: #40
@NivekT NivekT marked this pull request as draft October 4, 2021 22:30
@NivekT NivekT changed the title Adding a test that creates an arbitrarily large .tar file and read from it via HTTP [WIP] Adding a test that creates an arbitrarily large .tar file and read from it via HTTP Oct 4, 2021
@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Oct 4, 2021
NivekT added a commit that referenced this pull request Oct 8, 2021
…ing, minor change to test case"


Fixes #42. I plan to add more test in #40 that will test online readers in connection to the various archive readers.

[ghstack-poisoned]
NivekT added a commit that referenced this pull request Oct 11, 2021
…ing, minor change to test case"


Fixes #42. I plan to add more test in #40 that will test online readers in connection to the various archive readers.

Differential Revision: [D31515974](https://our.internmc.facebook.com/intern/diff/D31515974)

[ghstack-poisoned]
NivekT added a commit that referenced this pull request Oct 11, 2021
…ing, minor change to test case"


Fixes #42. I plan to add more test in #40 that will test online readers in connection to the various archive readers.

Differential Revision: [D31515974](https://our.internmc.facebook.com/intern/diff/D31515974)

[ghstack-poisoned]
facebook-github-bot pushed a commit that referenced this pull request Oct 11, 2021
…change to test case (#51)

Summary:
Pull Request resolved: #51

Fixes #42. I plan to add more test in #40 that will test online readers in connection to the various archive readers.

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D31515974

Pulled By: NivekT

fbshipit-source-id: 065261aeac5863971d94c4949ed0e6b5df201fa7
… file and read from it via HTTP"


The goal here is to create an arbitrarily large .tar file and read from it via HTTP.

Currently, it seems to work if we cache the file first and then read from it. However, an error is raised when we attempt to read directly from the `HTTPReader` because its stream does not support the operation`seek`:
```
Traceback (most recent call last):
  File "/Users/ktse/data/test/test_stream.py", line 66, in <module>
    for fname, stream in tar_dp:
  File "/Users/ktse/data/torchdata/datapipes/iter/util/tararchivereader.py", line 62, in __iter__
    raise e
  File "/Users/ktse/data/torchdata/datapipes/iter/util/tararchivereader.py", line 48, in __iter__
    tar = tarfile.open(fileobj=cast(Optional[IO[bytes]], data_stream), mode=self.mode)
  File "/Users/ktse/miniconda3/envs/pytorch/lib/python3.9/tarfile.py", line 1609, in open
    saved_pos = fileobj.tell()
io.UnsupportedOperation: seek
```

[ghstack-poisoned]
NivekT added a commit that referenced this pull request Oct 18, 2021
…om it via HTTP

ghstack-source-id: 2917762
Pull Request resolved: #40
… file and read from it via HTTP"


The goal here is to create an arbitrarily large .tar file and read from it via HTTP.

Currently, it seems to work if we cache the file first and then read from it. However, an error is raised when we attempt to read directly from the `HTTPReader` because its stream does not support the operation`seek`:
```
Traceback (most recent call last):
  File "/Users/ktse/data/test/test_stream.py", line 66, in <module>
    for fname, stream in tar_dp:
  File "/Users/ktse/data/torchdata/datapipes/iter/util/tararchivereader.py", line 62, in __iter__
    raise e
  File "/Users/ktse/data/torchdata/datapipes/iter/util/tararchivereader.py", line 48, in __iter__
    tar = tarfile.open(fileobj=cast(Optional[IO[bytes]], data_stream), mode=self.mode)
  File "/Users/ktse/miniconda3/envs/pytorch/lib/python3.9/tarfile.py", line 1609, in open
    saved_pos = fileobj.tell()
io.UnsupportedOperation: seek
```

[ghstack-poisoned]
NivekT added a commit that referenced this pull request Oct 18, 2021
…om it via HTTP

ghstack-source-id: 0af378e
Pull Request resolved: #40
… file and read from it via HTTP"


The goal here is to create an arbitrarily large .tar file and read from it via HTTP.

Currently, it seems to work if we cache the file first and then read from it. However, an error is raised when we attempt to read directly from the `HTTPReader` because its stream does not support the operation`seek`:
```
Traceback (most recent call last):
  File "/Users/ktse/data/test/test_stream.py", line 66, in <module>
    for fname, stream in tar_dp:
  File "/Users/ktse/data/torchdata/datapipes/iter/util/tararchivereader.py", line 62, in __iter__
    raise e
  File "/Users/ktse/data/torchdata/datapipes/iter/util/tararchivereader.py", line 48, in __iter__
    tar = tarfile.open(fileobj=cast(Optional[IO[bytes]], data_stream), mode=self.mode)
  File "/Users/ktse/miniconda3/envs/pytorch/lib/python3.9/tarfile.py", line 1609, in open
    saved_pos = fileobj.tell()
io.UnsupportedOperation: seek
```

[ghstack-poisoned]
NivekT added a commit that referenced this pull request Oct 18, 2021
…om it via HTTP

ghstack-source-id: df27cf2
Pull Request resolved: #40
… file and read from it via HTTP"


The goal here is to create an arbitrarily large .tar file and read from it via HTTP.

Currently, it seems to work if we cache the file first and then read from it. However, an error is raised when we attempt to read directly from the `HTTPReader` because its stream does not support the operation`seek`:
```
Traceback (most recent call last):
  File "/Users/ktse/data/test/test_stream.py", line 66, in <module>
    for fname, stream in tar_dp:
  File "/Users/ktse/data/torchdata/datapipes/iter/util/tararchivereader.py", line 62, in __iter__
    raise e
  File "/Users/ktse/data/torchdata/datapipes/iter/util/tararchivereader.py", line 48, in __iter__
    tar = tarfile.open(fileobj=cast(Optional[IO[bytes]], data_stream), mode=self.mode)
  File "/Users/ktse/miniconda3/envs/pytorch/lib/python3.9/tarfile.py", line 1609, in open
    saved_pos = fileobj.tell()
io.UnsupportedOperation: seek
```

[ghstack-poisoned]
NivekT added a commit that referenced this pull request Oct 19, 2021
…om it via HTTP

ghstack-source-id: 7383f29
Pull Request resolved: #40
httpd.serve_forever()
while True:
if self.stop_server: # TODO: This is not closing
httpd.server_close()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are not leaving while loop after server_close

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think self.stop_server is being set to True either. Perhaps because it is on a different thread?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should use thread event to terminate the loop.

Copy link
Contributor

@ejguan ejguan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few comments below. And, please add underscore to all helper methods in the TestCase .

httpd.serve_forever()
while True:
if self.stop_server: # TODO: This is not closing
httpd.server_close()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should use thread event to terminate the loop.

Comment on lines +32 to +37
self.temp_dir_path = self.temp_dir.name
self.port = 8006
self.stop_server = False
self.server_thread = threading.Thread(
target=self.running_server
) # TestStream.start_test_server(self.temp_dir_path, self.port)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not going to work. You need to pass these variables to thread during construction.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems to work for now since running_server has access to self variable. But I will keep an eye out to see if there is any bug related to this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But I think it would be better to refactor it and take those as arguments as you suggested.


def tearDown(self) -> None:
print("Tear down is running...")
self.stop_server = True
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Trigger threading event here.



class TestStream(expecttest.TestCase):
def setUp(self) -> None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another thing I want to mention that if setUp and tearDown are shared for all test methods in the future. They should be converted to setUpClass and tearDownClass

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean if you don't want these setup methods invoked for every single test run.

… file and read from it via HTTP"


The goal here is to create an arbitrarily large .tar file and read from it via HTTP.

Currently, it seems to work if we cache the file first and then read from it. However, an error is raised when we attempt to read directly from the `HTTPReader` because its stream does not support the operation`seek`:
```
Traceback (most recent call last):
  File "/Users/ktse/data/test/test_stream.py", line 66, in <module>
    for fname, stream in tar_dp:
  File "/Users/ktse/data/torchdata/datapipes/iter/util/tararchivereader.py", line 62, in __iter__
    raise e
  File "/Users/ktse/data/torchdata/datapipes/iter/util/tararchivereader.py", line 48, in __iter__
    tar = tarfile.open(fileobj=cast(Optional[IO[bytes]], data_stream), mode=self.mode)
  File "/Users/ktse/miniconda3/envs/pytorch/lib/python3.9/tarfile.py", line 1609, in open
    saved_pos = fileobj.tell()
io.UnsupportedOperation: seek
```

[ghstack-poisoned]
NivekT added a commit that referenced this pull request Oct 20, 2021
…om it via HTTP

ghstack-source-id: f7466e5
Pull Request resolved: #40
@NivekT
Copy link
Contributor Author

NivekT commented Feb 8, 2023

No longer need it since we have other benchmark and remote testing

@NivekT NivekT closed this Feb 8, 2023
@facebook-github-bot facebook-github-bot deleted the gh/NivekT/7/head branch March 11, 2023 15:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants