
Add HTTPFileSystem #3160

Merged
martindurant merged 15 commits into dask:master from martindurant:httpfs
Feb 22, 2018

Conversation

@martindurant (Member) commented Feb 12, 2018

  • Tests added / passed
  • Passes flake8 dask
  • Fully documented, including docs/source/changelog.rst for all changes
    and one of the docs/source/*-api.rst files for new API

Fixes #3158

Posted for comments: is this a reasonable way to go about things?

I would need to add to the set of imports in dask.bytes.core.get_fs; this only requires requests.

Testing will need to set up an actual simple HTTP server.

@martindurant changed the title from "(WIP) Add HTTPFileSystem" to "Add HTTPFileSystem" on Feb 12, 2018
@martindurant (Member Author)

Note that the simple HTTP server used for testing here does not actually support Range, so the test coverage is not very complete. Should I write a little tornado server that actually hosts the sorts of files we might want to work with (CSV, ...)?

@mrocklin (Member)

Testing against tornado would be fine.

Alternatively, I think some of the dask.bag test suite still tests against the open internet. Those tests are marked with @pytest.mark.network for people who want to avoid network-based tests.
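For reference, such a test server can be spun up with just the standard library. This is an illustrative sketch only (the PR's actual fixture may differ); note that SimpleHTTPRequestHandler does not honour the Range header, which is exactly the limitation discussed above:

```python
import contextlib
import threading
from http.server import HTTPServer, SimpleHTTPRequestHandler


@contextlib.contextmanager
def local_http_server(port=8999):
    """Serve the current directory over HTTP in a background thread.

    SimpleHTTPRequestHandler ignores Range requests and always returns
    the whole file, so this exercises the "server without Range" path.
    """
    server = HTTPServer(("localhost", port), SimpleHTTPRequestHandler)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        yield "http://localhost:%d/" % port
    finally:
        server.shutdown()
```

A test would then chdir into a temporary directory containing known files and fetch them through the yielded root URL.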

@martindurant (Member Author)

Whoa, dask.bag.from_url

@martindurant (Member Author)

A couple of things to consider here:

  • some servers do not provide the content-length with a HEAD call, so we cannot know the size of those objects before downloading them
  • some servers do not respect the Range header, and send the whole object every time
  • some servers still do not provide the length even when sending the data; in cases where they do, streamed download mode could be used to bail out if the download looks too big

If the length is not known, or Range is ignored, we clearly cannot use intra-file partitioning. Is it then an error to do anything other than read() from position 0? For a smallish file (less than the block size), we would just download that block and be able to seek within it. So perhaps the fail condition is: we are doing something other than a read() from tell() == 0, we tried to download a block (say 5MB), and either the headers say the arriving data is bigger than we asked for, or we are streaming and have already seen more data than we asked for.

Thoughts on this?

@mrocklin (Member)

My hope would be that the servers that would serve large datasets would support these features. Is that likely to be true?

Is it possible to check if they support these and, if not, err?

@martindurant (Member Author) commented Feb 13, 2018

We can know up front whether the content-length is missing from the HEAD call, and we could error early on that, but it should not prevent the simple file-access pattern of one full file per call.

We cannot know whether Range is supported without trying with a Range header. This SO answer says that you can combine HEAD and Range to see if the server supports it without getting data.

That could mean three calls to get any data: one HEAD to get the size, one HEAD to check Range, and then a GET to actually fetch data.

  • EDIT: we could instead check for 'Accept-Ranges': 'bytes' in the first HEAD call.
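That single-HEAD probe could look something like the following. This is a sketch using only the standard library (the PR itself uses requests); the function name is illustrative:

```python
import urllib.request


def probe_server(url):
    """One HEAD request to learn what the server supports.

    Returns (size, accepts_ranges): size is None when the server sends
    no Content-Length, and accepts_ranges is True only when the server
    advertises 'Accept-Ranges: bytes'.
    """
    req = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(req) as resp:
        size = resp.headers.get("Content-Length")
        accepts_ranges = (
            resp.headers.get("Accept-Ranges", "").lower() == "bytes"
        )
    return (int(size) if size is not None else None), accepts_ranges
```

With this, a missing Content-Length or a missing Accept-Ranges advert can be detected in a single round trip, avoiding the three-call dance described above.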

@mrocklin (Member)

That sounds unpleasant. What do you think is best?

@martindurant (Member Author)

Here is my stab at putting in checks and expressive error messages. I don't know how to go about testing these bits, though.

@martindurant (Member Author)

I think this is a useful addition at this point. There may be more polish needed down the road, but I would defer that to future PRs. Any thoughts?

Simple File-System for fetching data via HTTP(S)

Unlike other file-systems, HTTP is limited in that it does not provide glob
or write capability. Furthermore, no read-ahead is presently
Member:

This sentence appears to be unfinished.

Member:

Also it appears that read-ahead does exist?

def ukey(self, url):
    """Unique identifier, so we can tell if a file changed"""
    # Could do HEAD here?
    return tokenize(url)
Member:

This seems dangerous if the underlying file changes.

Member Author:

I think it can be an assumption of the FS that we're dealing with static files. The server is not guaranteed to provide ETags or any other checksum, timestamp, or even size, so I don't think there's anything else we can do (except make this a uuid and reload every time).

Member:

Some people use Dask over an extended time. I think that they would find this behavior surprising. I'm inclined to use uuids.
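The uuid approach amounts to something like the following (an illustrative sketch, not the PR's exact code):

```python
import uuid


def ukey(url):
    # No ETag, checksum, or timestamp is guaranteed from an arbitrary
    # HTTP server, so return a fresh token every time: any cache keyed
    # on this value is never considered valid, and the file is always
    # re-fetched.
    return uuid.uuid4().hex
```

The trade-off is deliberate: correctness for long-running sessions over cache reuse, since there is no reliable way to detect that a remote file changed.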


class HTTPFile(object):
    """
    A file-like object pointing to a remote HTTP(S) resource.
Member:

Small style nit: reserve periods for sentences

    assert data == open(fn, 'rb').read()


@pytest.mark.parametrize('block_size', [None, 99999])
Member:

I thought that block size didn't work with the simple server? What happens in this case? We just download everything into a local cache?

Member Author:

It doesn't work, but the FS can cope with that.
You fetch the whole file if it is smaller than the given block size, or error if, while streaming, the size surpasses the given block.
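Sketched with the standard library (the real implementation streams via requests; the function name and chunk size are illustrative), that behaviour looks like:

```python
import urllib.request


def fetch_block(url, block_size):
    """Download at most block_size bytes from a server that ignores Range.

    Returns the whole file if it fits within the block; raises
    ValueError as soon as the streamed data exceeds the block, rather
    than buffering an arbitrarily large file in memory.
    """
    out = b""
    with urllib.request.urlopen(url) as resp:
        while True:
            chunk = resp.read(64 * 2 ** 10)  # stream 64 kB at a time
            if not chunk:
                return out  # whole file fit within the block
            out += chunk
            if len(out) > block_size:
                raise ValueError(
                    "File larger than block size %d" % block_size
                )
```

The key point is that the error fires during streaming, so a surprisingly large file costs at most one block of memory before the download is abandoned.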


def test_files(server):
    root = 'http://localhost:8999/'
    files = [f for f in os.listdir('.') if os.path.isfile(f)]
Member:

Might want to limit based on file size here? Or else maybe add a couple of explicit temporary files here?

Use UUID for ukey; file may have changed at any time

Use temp directory for test server
@martindurant mentioned this pull request on Feb 21, 2018
@martindurant (Member Author)

(The AppVeyor failure appears unrelated, possibly the numpy version problem that has been noticed elsewhere.)

@mrocklin (Member)

AppVeyor issues resolved.

@mrocklin (Member)

+1 from me on the code. Should this capability be mentioned in documentation somewhere?

@martindurant (Member Author)

Yes, it should be in the changelog, and the limitations mentioned on http://dask.pydata.org/en/latest/remote-data-services.html

@mrocklin (Member)

+1
