Conversation
Plus auto-import the back-end
|
Note that the simple HTTP server used for testing here does not actually support Range, so the code is not fully exercised; should I write a little Tornado app that actually hosts the sorts of files we might want to work with (CSV, ...)? |
|
Testing against tornado would be fine. Alternatively I think that some of the dask.bag test suite still tests against the open internet. They mark with |
|
Whoa, |
|
A couple of things to consider here:
If the length is not known, or Range is ignored, we clearly cannot use intra-file partitioning. Is it then an error to do anything other than …? Thoughts on this? |
|
My hope would be that the servers that would serve large datasets would support these features. Is that likely to be true? Is it possible to check if they support these and, if not, err? |
|
We can know up front whether the content-length is missing from the HEAD call, and we could error early on that, but it should not prevent the simple file-access pattern of one full file per call. We cannot know whether Range is supported without trying with a Range header. This SO answer says that you can combine HEAD and Range to see whether the server supports it without fetching any data. That could mean three calls to get any data: one HEAD to get the size, one HEAD to check Range, and then a GET to actually fetch data.
|
|
That sounds unpleasant. What do you think is best? |
|
Here is my stab at adding checks and expressive error messages. I don't know how to go about testing these bits, though. |
|
I think this is a useful addition at this point. There may be more polish needed down the road, but I would defer that to future PRs. Any thoughts? |
dask/bytes/http.py
```
Simple File-System for fetching data via HTTP(S)

Unlike other file-systems, HTTP is limited in that it does not provide glob
or write capability. Furthermore, no read-ahead is presently
```
This sentence appears to be unfinished.
Also it appears that read-ahead does exist?
dask/bytes/http.py
```python
def ukey(self, url):
    """Unique identifier, so we can tell if a file changed"""
    # Could do HEAD here?
    return tokenize(url)
```
This seems dangerous if the underlying file changes.
I think it can be an assumption of the FS that we're dealing with static files. The server is not guaranteed to provide ETags or any other checksum, timestamp, or even size, so I don't think there's anything else we can do (except make this a UUID and reload every time). |
Some people use Dask over an extended time. I think that they would find this behavior surprising. I'm inclined to use UUIDs. |
dask/bytes/http.py
```python
class HTTPFile(object):
    """
    A file-like object pointing to a remote HTTP(S) resource.
```
Small style nit: reserve periods for sentences
```python
assert data == open(fn, 'rb').read()


@pytest.mark.parametrize('block_size', [None, 99999])
```
I thought that block size didn't work with the simple server? What happens in this case? We just download everything into a local cache?
It doesn't work, but the FS can cope with that.
You fetch the whole file if it is smaller than the given block size, or error if, while streaming, the size surpasses the given block.
dask/bytes/tests/test_http.py
```python
def test_files(server):
    root = 'http://localhost:8999/'
    files = [f for f in os.listdir('.') if os.path.isfile(f)]
```
Might want to limit based on file size here? Or else maybe add a couple of explicit temporary files here?
Use UUID for ukey; file may have changed at any time
Use temp directory for test server
|
(The AppVeyor failure appears unrelated, possibly the NumPy version problem that has been noticed elsewhere.) |
|
Appveyor issues resolved |
|
+1 from me on the code. Should this capability be mentioned in documentation somewhere? |
|
Yes, it should be in the changelog, and the limitations mentioned on http://dask.pydata.org/en/latest/remote-data-services.html |
|
+1 |
- Passes flake8 dask
- docs/source/changelog.rst for all changes and one of the docs/source/*-api.rst files for new API

Fixes #3158
Posted for comments: is this a reasonable way to go about things?
I would need to add to the set of imports in dask.bytes.core.get_fs; this only requires requests.
Testing will need to set up an actual simple HTTP server.