[WIP] Adding a test that creates an arbitrarily large .tar file and read from it via HTTP by NivekT · Pull Request #40 · meta-pytorch/data

NivekT · 2021-10-04T22:30:16Z

Stack from ghstack:

-> [WIP] Adding a test that creates an arbitrarily large .tar file and read from it via HTTP #40

The goal here is to create an arbitrarily large .tar file and read from it via HTTP.

Currently, it seems to work if we cache the file first and then read from it. However, an error is raised when we attempt to read directly from the `HTTPReader`, because its stream does not support the `seek` operation:

Traceback (most recent call last):
  File "/Users/ktse/data/test/test_stream.py", line 66, in <module>
    for fname, stream in tar_dp:
  File "/Users/ktse/data/torchdata/datapipes/iter/util/tararchivereader.py", line 62, in __iter__
    raise e
  File "/Users/ktse/data/torchdata/datapipes/iter/util/tararchivereader.py", line 48, in __iter__
    tar = tarfile.open(fileobj=cast(Optional[IO[bytes]], data_stream), mode=self.mode)
  File "/Users/ktse/miniconda3/envs/pytorch/lib/python3.9/tarfile.py", line 1609, in open
    saved_pos = fileobj.tell()
io.UnsupportedOperation: seek
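
For context, `tarfile.open` in its default mode records the current position with `fileobj.tell()`, which a streaming HTTP body typically cannot provide. The standard library also has a stream mode (`mode="r|"`) that only calls `read()` and visits members strictly in order. The sketch below illustrates that behavior; `ReadOnlyStream`, `make_tar_bytes`, and `read_members_streaming` are hypothetical names that stand in for an HTTP response body and a test fixture. It is not the `HTTPReader` or `TarArchiveReader` implementation, and whether the datapipes should use this mode is not settled in this PR.

```python
import io
import tarfile


class ReadOnlyStream:
    """Hypothetical stand-in for a streaming HTTP body: exposes read() only."""

    def __init__(self, payload: bytes) -> None:
        self._buffer = io.BytesIO(payload)

    def read(self, size: int = -1) -> bytes:
        return self._buffer.read(size)

    def tell(self) -> int:
        # Mimic the failure shown in the traceback above.
        raise io.UnsupportedOperation("seek")


def make_tar_bytes(files: dict) -> bytes:
    """Build an uncompressed tar archive in memory from {name: content}."""
    out = io.BytesIO()
    with tarfile.open(fileobj=out, mode="w") as tar:
        for name, content in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(content)
            tar.addfile(info, io.BytesIO(content))
    return out.getvalue()


def read_members_streaming(stream) -> dict:
    """Read every regular file from a non-seekable tar stream, in order."""
    result = {}
    # "r|" treats the input as a stream of tar blocks; tarfile only calls read().
    with tarfile.open(fileobj=stream, mode="r|") as tar:
        for member in tar:
            if member.isfile():
                # In stream mode each member must be consumed before moving on.
                result[member.name] = tar.extractfile(member).read()
    return result
```

The trade-off is that stream mode gives up random access: members can only be read once, in archive order.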

…change to test case (#51) Summary: Pull Request resolved: #51 Fixes #42. I plan to add more test in #40 that will test online readers in connection to the various archive readers. Test Plan: Imported from OSS Reviewed By: ejguan Differential Revision: D31515974 Pulled By: NivekT fbshipit-source-id: 065261aeac5863971d94c4949ed0e6b5df201fa7

VitalyFedyunin · 2021-10-19T16:27:50Z

test/test_stream.py

+        httpd.serve_forever()
+        while True:
+            if self.stop_server:  # TODO: This is not closing
+                httpd.server_close()


You are not leaving the `while` loop after `server_close`.

I don't think `self.stop_server` is being set to True either. Perhaps because it is on a different thread?

You should use a threading event to terminate the loop.
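
One point worth noting for this thread: `serve_forever()` blocks until `shutdown()` is called from another thread, so the `while True` loop in the snippet is never reached. A minimal sketch of the event-based approach follows, assuming a standard-library `HTTPServer`; the function name `serve_until_stopped` and the 0.1 second poll interval are illustrative choices, not code from this PR.

```python
import threading
from http.server import HTTPServer


def serve_until_stopped(httpd: HTTPServer, stop_event: threading.Event) -> None:
    """Handle requests one at a time until stop_event is set, then close the socket."""
    # handle_request() returns after this many seconds when no request arrives,
    # which gives the loop a chance to re-check the event.
    httpd.timeout = 0.1
    try:
        while not stop_event.is_set():
            httpd.handle_request()
    finally:
        httpd.server_close()
```

The test would then call `stop_event.set()` followed by `thread.join()` to shut the server down. Calling `httpd.shutdown()` on a server running `serve_forever()` is another standard-library option.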

ejguan

A few comments below. Also, please add a leading underscore to all helper methods in the `TestCase`.

ejguan · 2021-10-19T21:10:32Z

test/test_stream.py

+        httpd.serve_forever()
+        while True:
+            if self.stop_server:  # TODO: This is not closing
+                httpd.server_close()


You should use a threading event to terminate the loop.

ejguan · 2021-10-19T21:12:23Z

test/test_stream.py

+        self.temp_dir_path = self.temp_dir.name
+        self.port = 8006
+        self.stop_server = False
+        self.server_thread = threading.Thread(
+            target=self.running_server
+        )  # TestStream.start_test_server(self.temp_dir_path, self.port)


This is not going to work. You need to pass these variables to the thread during construction.

It seems to work for now since `running_server` has access to `self`. But I will keep an eye out for any bug related to this.

But I think it would be better to refactor it and take those as arguments, as you suggested.
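
The refactor discussed here could look roughly like the following sketch, which hands the directory, port, and stop event to the worker through `threading.Thread(args=...)` instead of reading them from `self`. The worker body is a placeholder that only records what it was given, and `start_server_thread` and the `seen` dictionary are hypothetical names used for illustration.

```python
import threading


def running_server(directory: str, port: int,
                   stop_event: threading.Event, seen: dict) -> None:
    """Placeholder worker: records its configuration, then waits to be stopped."""
    seen["directory"] = directory
    seen["port"] = port
    stop_event.wait()


def start_server_thread(directory: str, port: int):
    """Start the worker with everything it needs passed in explicitly."""
    stop_event = threading.Event()
    seen: dict = {}
    thread = threading.Thread(
        target=running_server,
        args=(directory, port, stop_event, seen),
        daemon=True,  # do not keep the interpreter alive if a test forgets to stop it
    )
    thread.start()
    return thread, stop_event, seen
```

Passing the values explicitly makes the worker independent of the test instance, so it can also be reused from class-level fixtures.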

ejguan · 2021-10-19T21:12:35Z

test/test_stream.py

+
+    def tearDown(self) -> None:
+        print("Tear down is running...")
+        self.stop_server = True


Trigger threading event here.

ejguan · 2021-10-19T21:21:56Z

test/test_stream.py

+
+
+class TestStream(expecttest.TestCase):
+    def setUp(self) -> None:


Another thing I want to mention: if `setUp` and `tearDown` are shared by all test methods in the future, they should be converted to `setUpClass` and `tearDownClass`.

I mean, if you don't want these setup methods invoked for every single test run.

NivekT · 2023-02-08T18:06:30Z

This is no longer needed, since we have other benchmarks and remote testing.

Adding a test that creates an arbitrarily large .tar file and read fr…

46d4745

…om it via HTTP [ghstack-poisoned]

NivekT added a commit that referenced this pull request Oct 4, 2021

Adding a test that creates an arbitrarily large .tar file and read fr…

31f53d7

…om it via HTTP ghstack-source-id: b300814 Pull Request resolved: #40

NivekT marked this pull request as draft October 4, 2021 22:30

NivekT requested review from VitalyFedyunin and ejguan October 4, 2021 22:30

NivekT changed the title ~~Adding a test that creates an arbitrarily large .tar file and read from it via HTTP~~ [WIP] Adding a test that creates an arbitrarily large .tar file and read from it via HTTP Oct 4, 2021

facebook-github-bot added the CLA Signed label Oct 4, 2021

This was referenced Oct 5, 2021

TarArchiveReader is not functioning with HTTPReader or GDriveReader #42

Closed

Change HTTPReader to use requests module and enable streaming, minor change to test case #51

Closed

NivekT added a commit that referenced this pull request Oct 18, 2021

Adding a test that creates an arbitrarily large .tar file and read fr…

43f4e15

…om it via HTTP ghstack-source-id: 2917762 Pull Request resolved: #40

NivekT added a commit that referenced this pull request Oct 18, 2021

Adding a test that creates an arbitrarily large .tar file and read fr…

c1763b0

…om it via HTTP ghstack-source-id: 0af378e Pull Request resolved: #40

NivekT added a commit that referenced this pull request Oct 18, 2021

Adding a test that creates an arbitrarily large .tar file and read fr…

252fbf6

…om it via HTTP ghstack-source-id: df27cf2 Pull Request resolved: #40

NivekT added a commit that referenced this pull request Oct 19, 2021

Adding a test that creates an arbitrarily large .tar file and read fr…

4d93a85

…om it via HTTP ghstack-source-id: 7383f29 Pull Request resolved: #40

VitalyFedyunin reviewed Oct 19, 2021

ejguan reviewed Oct 19, 2021

NivekT added a commit that referenced this pull request Oct 20, 2021

Adding a test that creates an arbitrarily large .tar file and read fr…

d5964f1

…om it via HTTP ghstack-source-id: f7466e5 Pull Request resolved: #40

NivekT closed this Feb 8, 2023

facebook-github-bot deleted the gh/NivekT/7/head branch March 11, 2023 15:20

NivekT wants to merge 6 commits into gh/NivekT/7/base from gh/NivekT/7/head

4 participants


