
Add HTTPFileSystem #3160

Merged
martindurant merged 15 commits into dask:master from martindurant:httpfs
Feb 22, 2018

Conversation

@martindurant (Member) commented Feb 12, 2018

  • Tests added / passed
  • Passes flake8 dask
  • Fully documented, including docs/source/changelog.rst for all changes
    and one of the docs/source/*-api.rst files for new API

Fixes #3158

Posted for comments: is this a reasonable way to go about things?

I would need to add to the set of imports in dask.bytes.core.get_fs; this only requires requests.

Testing will need to set up an actual simple HTTP server.

@martindurant changed the title from "(WIP) Add HTTPFileSystem" to "Add HTTPFileSystem" on Feb 12, 2018
@martindurant (Member Author)

Note that the simple HTTP server used for testing here does not actually support Range, so the test coverage is not very complete. Should I write a little tornado server that actually hosts the sorts of files we might want to work with (CSV, ...)?

@mrocklin (Member)

Testing against tornado would be fine.

Alternatively, I think some of the dask.bag test suite still tests against the open internet. Those tests are marked with @pytest.mark.network for people who want to avoid network-based tests.
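For reference, such a test server can be spun up with just the standard library. This is an illustrative sketch only (the PR's actual fixture may differ); note that SimpleHTTPRequestHandler does not honour the Range header, which is exactly the limitation discussed above:

```python
import contextlib
import threading
from http.server import HTTPServer, SimpleHTTPRequestHandler


@contextlib.contextmanager
def local_http_server(port=8999):
    """Serve the current directory over HTTP in a background thread.

    SimpleHTTPRequestHandler ignores Range requests and always returns
    the whole file, so this exercises the "server without Range" path.
    """
    server = HTTPServer(("localhost", port), SimpleHTTPRequestHandler)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        yield "http://localhost:%d/" % port
    finally:
        server.shutdown()
```

A test would then chdir into a temporary directory containing known files and fetch them through the yielded root URL.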

@martindurant (Member Author)

Whoa, dask.bag.from_url

@martindurant (Member Author)

A couple of things to consider here:

  • some servers do not provide the content-length with a HEAD call, so we cannot know the size of those objects before downloading them
  • some servers do not respect the Range header, and send the whole object every time
  • some servers still do not provide the length even when sending the data; in cases where they do, streamed download mode could be used to bail out if the download looks too big

If the length is not known, or Range is ignored, we clearly cannot use intra-file partitioning. Is it then an error to do anything other than read() from position 0? For a smallish file (less than the block size), we would just download that block and be able to seek within it. So perhaps the fail condition is: we are doing something other than a read() from tell() == 0, we tried to download a block (say 5MB), and either the headers say the arriving data is bigger than we asked for, or we are streaming and have already seen more data than we asked for.

Thoughts on this?

@mrocklin (Member)

My hope would be that the servers that would serve large datasets would support these features. Is that likely to be true?

Is it possible to check if they support these and, if not, err?

@martindurant (Member Author) commented Feb 13, 2018

We can know up front whether the content-length is missing from the HEAD call, and we could error early on that, but it should not prevent the simple file-access pattern of one full file per call.

We cannot know whether Range is supported without trying with a Range header. This SO answer says that you can combine HEAD and Range to see if the server supports it without getting data.

That could mean three calls to get any data: one HEAD to get the size, one HEAD to check Range, and then a GET to actually fetch data.

  • EDIT: we could instead check for 'Accept-Ranges': 'bytes' in the first HEAD call.
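That single-HEAD probe could look something like the following. This is a sketch using only the standard library (the PR itself uses requests); the function name is illustrative:

```python
import urllib.request


def probe_server(url):
    """One HEAD request to learn what the server supports.

    Returns (size, accepts_ranges): size is None when the server sends
    no Content-Length, and accepts_ranges is True only when the server
    advertises 'Accept-Ranges: bytes'.
    """
    req = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(req) as resp:
        size = resp.headers.get("Content-Length")
        accepts_ranges = (
            resp.headers.get("Accept-Ranges", "").lower() == "bytes"
        )
    return (int(size) if size is not None else None), accepts_ranges
```

With this, a missing Content-Length or a missing Accept-Ranges advert can be detected in a single round trip, avoiding the three-call dance described above.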

@mrocklin (Member)

That sounds unpleasant. What do you think is best?

@martindurant (Member Author)

Here is my stab at putting in checks and expressive error messages. I don't know how to go about testing these bits, though.

@martindurant (Member Author)

I think this is a useful addition at this point. There may be more polish needed down the road, but I would defer that to future PRs. Any thoughts?

Simple File-System for fetching data via HTTP(S)

Unlike other file-systems, HTTP is limited in that it does not provide glob
or write capability. Furthermore, no read-ahead is presently
Member:

This sentence appears to be unfinished.

Member:

Also it appears that read-ahead does exist?

def ukey(self, url):
    """Unique identifier, so we can tell if a file changed"""
    # Could do HEAD here?
    return tokenize(url)
Member:

This seems dangerous if the underlying file changes.

Member Author:

I think it can be an assumption of the FS that we're dealing with static files. The server is not guaranteed to provide ETags or any other checksum, timestamp, or even size, so I don't think there's anything else we can do (except make this a uuid and reload every time).

Member:

Some people use Dask over an extended time. I think that they would find this behavior surprising. I'm inclined to use uuids.
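The uuid approach amounts to something like the following (an illustrative sketch, not the PR's exact code):

```python
import uuid


def ukey(url):
    # No ETag, checksum, or timestamp is guaranteed from an arbitrary
    # HTTP server, so return a fresh token every time: any cache keyed
    # on this value is never considered valid, and the file is always
    # re-fetched.
    return uuid.uuid4().hex
```

The trade-off is deliberate: correctness for long-running sessions over cache reuse, since there is no reliable way to detect that a remote file changed.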


class HTTPFile(object):
    """
    A file-like object pointing to a remote HTTP(S) resource.
Member:

Small style nit: reserve periods for sentences

    assert data == open(fn, 'rb').read()


@pytest.mark.parametrize('block_size', [None, 99999])
Member:

I thought that block size didn't work with the simple server? What happens in this case? We just download everything into a local cache?

Member Author:

It doesn't work, but the FS can cope with that.
You fetch the whole file if it is smaller than the given block size, or error if, while streaming, the size surpasses the given block.
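Sketched with the standard library (the real implementation streams via requests; the function name and chunk size are illustrative), that behaviour looks like:

```python
import urllib.request


def fetch_block(url, block_size):
    """Download at most block_size bytes from a server that ignores Range.

    Returns the whole file if it fits within the block; raises
    ValueError as soon as the streamed data exceeds the block, rather
    than buffering an arbitrarily large file in memory.
    """
    out = b""
    with urllib.request.urlopen(url) as resp:
        while True:
            chunk = resp.read(64 * 2 ** 10)  # stream 64 kB at a time
            if not chunk:
                return out  # whole file fit within the block
            out += chunk
            if len(out) > block_size:
                raise ValueError(
                    "File larger than block size %d" % block_size
                )
```

The key point is that the error fires during streaming, so a surprisingly large file costs at most one block of memory before the download is abandoned.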


def test_files(server):
    root = 'http://localhost:8999/'
    files = [f for f in os.listdir('.') if os.path.isfile(f)]
Member:

Might want to limit based on file size here? Or else maybe add a couple of explicit temporary files here?

Use UUID for ukey; file may have changed at any time

Use temp directory for test server
@martindurant mentioned this pull request on Feb 21, 2018
@martindurant (Member Author)

(The AppVeyor failure appears unrelated, possibly the numpy version problem that has been noticed elsewhere.)

@mrocklin (Member)

AppVeyor issues resolved.

@mrocklin (Member)

+1 from me on the code. Should this capability be mentioned in documentation somewhere?

@martindurant (Member Author)

Yes, it should be in the changelog, and the limitations mentioned on http://dask.pydata.org/en/latest/remote-data-services.html

@mrocklin (Member)

+1
