
Fix file adapter to handle large files #11

Merged
rlgomes merged 2 commits into master from fix-file-adapter-to-handle-large-files on Jan 4, 2016

Conversation

@rlgomes (Contributor) commented Dec 21, 2015

this PR reworks the file adapter so it's easier to add more features in
the future, by splitting the read and write logic into individual modules.

the fix for issue #6 is currently to set a write limit so that we
write out no more than 100K points in the currently supported JSON
output format.

future work could include streamable writing formats such as CSV
or JSONL (jsonlines: http://jsonlines.org/)
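The tradeoff between the current JSON format and a streamable format like JSONL can be sketched as follows (both helpers are illustrative, not the adapter's actual code):

```javascript
// A JSON array must be serialized (and thus held) in memory as a whole,
// while JSONL emits one self-contained line per point, so each line can
// be appended to a file incrementally.
function toJson(points) {
    // one big string: the entire array lives in memory at once
    return JSON.stringify(points);
}

function toJsonLines(points) {
    // one line per point: each line could be written/appended separately
    return points.map(function(p) { return JSON.stringify(p); }).join('\n');
}
```

With JSONL, a writer can append each point as it arrives instead of buffering everything until the end, which is what makes it attractive for large outputs.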

@rlgomes (Contributor, Author) commented Dec 21, 2015

@dmehra @demmer @dmajda @mstemm this fixes the situation identified in issue #6. I'll update the documentation here in this PR once the initial review of the change goes through.

I'm thinking that if we set the limit for the only existing write format to 100K right now, then in a follow-up PR we can add the ability for write file to specify an output format such as CSV or JSON Lines, which would allow incremental writing and never run out of memory.

@rlgomes force-pushed the fix-file-adapter-to-handle-large-files branch 3 times, most recently from 095a42e to 6bd1558 on December 22, 2015 04:11
@rlgomes (Contributor, Author) commented Dec 22, 2015

@demmer the initial approach of using the scheduler on a 1s interval seems a little "harsh", so I'm open to better suggestions; but if we use a larger interval, then the program would have to wait up to N additional seconds after finishing to actually complete and flush the data.

@dmajda (Contributor) commented Dec 22, 2015

@rlgomes General note: in adapter-related PRs, please tag @bkutil instead of (or in addition to) me. He is working more in that area, so he should be able to give better feedback. Thanks.

Review comment (Contributor):

It is also possible to use `this.limit = options.limit || 100000`, which is somewhat shorter.
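The suggested shorthand can be sketched like this (`makeWriter` is a hypothetical stand-in, not the adapter's code); note one caveat worth knowing:

```javascript
// Sketch of the `options.limit || 100000` default pattern. Caveat: `||`
// falls back to the default for ANY falsy value, so an explicit limit
// of 0 is silently replaced by 100000.
function makeWriter(options) {
    options = options || {};
    return { limit: options.limit || 100000 };
}
```

If a limit of 0 should be meaningful, an explicit `options.limit === undefined` check is safer than `||`.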

Review comment (Contributor):

We're missing a check that limit should be non-negative.
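A minimal sketch of such a check might look like this (`validateLimit` is a hypothetical helper, not the adapter's code):

```javascript
// Reject anything that is not a non-negative number before using it
// as a write limit.
function validateLimit(limit) {
    if (typeof limit !== 'number' || limit < 0) {
        throw new Error('limit must be a non-negative number, got: ' + limit);
    }
    return limit;
}
```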

@bkutil (Contributor) commented Dec 22, 2015

the fix for issue #6 is currently to set a write limit so that we
simply limit writing out no more than 100K

Nice! So no more OOMs? IIUC, we still read & parse the whole file...

future work could be to include streamable writing formats such as CSV,
or JSONL (jsonlines: http://jsonlines.org/)

+10k - I agree with you that this is the way to go.

@rlgomes (Contributor, Author) commented Dec 22, 2015

@bkutil taking care of the various suggestions; all good catches. As for OOMing when there's an existing 1GB JSON file, I'm not sure how we'd fix that at this point. Another simple limitation would be to fail to open files bigger than 200MB, i.e. impose some hard limit on the size of the file. @demmer @dmajda what do you think? I guess we'd also have to expose some way of overriding the limitation, but at least this way we could avoid blowing things up in a nasty way.

Review comment (Contributor):

It seems better to me to have eof trigger an immediate flush instead of just setting isdone and waiting.

This should be trivial to handle now since all the fs operations are sync, but as @bkutil pointed out, they should be changed to be async, so you'll need to make sure you don't race between the scheduler callback and the immediate flush from eof. I think it's easy to handle by just setting isdone immediately after calling flush, so that you know a subsequent callback won't race.
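The ordering being suggested can be sketched roughly like this (all names are illustrative, not the adapter's actual implementation):

```javascript
// eof() flushes immediately, then sets isdone so a later scheduler
// callback won't race and flush a second time.
function Writer() {
    this.queue = [];
    this.isdone = false;
    this.flushes = 0;
}

Writer.prototype.flush = function() {
    this.flushes++;
    this.queue = [];
};

// periodic scheduler callback
Writer.prototype.tick = function() {
    if (this.isdone || this.queue.length === 0) { return; }
    this.flush();
};

Writer.prototype.eof = function() {
    this.flush();       // immediate flush instead of waiting for a tick
    this.isdone = true; // guard any subsequent scheduler callbacks
};
```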

@demmer (Contributor) commented Dec 22, 2015

As for the limit on the size of the file, I guess you'll need two options: -limit, which is a number of points, and -maxSize, in bytes. Seems icky, but I don't see a better option.
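Enforcing the two caps could look roughly like this (`checkLimits` is a hypothetical helper; the option names the PR finally shipped with differ slightly):

```javascript
// The point cap bounds the number of points; the size cap bounds the
// serialized size in bytes. Illustrative only.
function checkLimits(points, opts) {
    if (points.length > opts.limit) {
        throw new Error('too many points: ' + points.length + ' > ' + opts.limit);
    }
    var bytes = Buffer.byteLength(JSON.stringify(points), 'utf8');
    if (bytes > opts.maxSize) {
        throw new Error('file too big: ' + bytes + ' > ' + opts.maxSize + ' bytes');
    }
    return bytes;
}
```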

@rlgomes (Contributor, Author) commented Dec 22, 2015

I'll document that -maxSize is the file size, while -limit is the number of points in said file.

@rlgomes force-pushed the fix-file-adapter-to-handle-large-files branch from 6bd1558 to 5372fef on December 22, 2015 21:00
@rlgomes (Contributor, Author) commented Dec 22, 2015

@demmer @bkutil have a looksie; I think this is much better at this point.

Review comment (Contributor):

Looks like this won't ever call done() if there aren't any points buffered in the queue when the EOF arrives.

You should fix this by always entering the promise chain even if the queue is empty, and add a test case for it.
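The fix being suggested can be sketched as: start the promise chain unconditionally, so done() fires even with an empty queue (names illustrative, not the adapter's code):

```javascript
// Always enter the chain; only perform the write when there are points.
function eof(queue, writeAll, done) {
    return Promise.resolve()
        .then(function() {
            return queue.length > 0 ? writeAll(queue) : undefined;
        })
        .then(function() { done(); });
}
```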

@rlgomes force-pushed the fix-file-adapter-to-handle-large-files branch from 5372fef to 6c4f0ed on December 23, 2015 05:05
@rlgomes (Contributor, Author) commented Dec 23, 2015

@demmer @bkutil take another look at the current solution. I opted not to have a -flushInterval and instead went with a -flushFrequency (open to suggestions on a better name), where we only flush once we've hit a target number of points, rather than periodically checking whether there is more to flush. The -flushInterval option was leading to some absurdly complicated code, due to the background callbacks having to synchronize their work with the final flush from the eof call.
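Count-based flushing as described can be sketched like so (`CountFlusher` is a hypothetical stand-in for the writer's buffering logic, not the adapter's code):

```javascript
// Buffer incoming points and flush only when the buffer reaches the
// configured flushFrequency; no timers, so there is nothing to
// synchronize with the final flush at eof.
function CountFlusher(flushFrequency, flushFn) {
    this.flushFrequency = flushFrequency;
    this.flushFn = flushFn;
    this.buffer = [];
}

CountFlusher.prototype.write = function(point) {
    this.buffer.push(point);
    if (this.buffer.length >= this.flushFrequency) {
        this.flushFn(this.buffer);
        this.buffer = [];
    }
};
```

Any leftover points in the buffer would still need one final flush at eof, which is exactly the case the count-based design keeps simple.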

@rlgomes (Contributor, Author) commented Dec 23, 2015

Things to finish up:

  • add tests for option validations that aren't covered and showing up as uncovered lines in the coverage report.

@rlgomes added the ready label Dec 23, 2015
@rlgomes self-assigned this Dec 23, 2015
@rlgomes added the bug label Dec 23, 2015
@rlgomes force-pushed the fix-file-adapter-to-handle-large-files branch from a9643e7 to b99cbbd on December 28, 2015 18:19
@rlgomes force-pushed the fix-file-adapter-to-handle-large-files branch from b99cbbd to 9f4ca85 on December 29, 2015 06:41
@demmer (Contributor) commented Jan 4, 2016

Overall this looks good to me now; the above comments are just a few minor suggestions.

easier to further develop features on the file adapter.
@rlgomes force-pushed the fix-file-adapter-to-handle-large-files branch from 9f4ca85 to 58dce8e on January 4, 2016 18:34
@rlgomes (Contributor, Author) commented Jan 4, 2016

Ok, merging on green as I've taken care of all of @demmer's suggestions and I think this is ready for prime time.

@rlgomes force-pushed the fix-file-adapter-to-handle-large-files branch from 58dce8e to 3815d6b on January 4, 2016 18:53
@dmehra (Contributor) commented Jan 4, 2016

Doc review comment: how about a rephrase so defaults go into the "Required?" column, and don't mention -file because per #44 it will be renamed to -path anyway. Maybe like this:

Option | Description | Required?
------ | ----------- | ---------
`-bufferLimit` | Maximum number of points to buffer in memory before each write to the specified file | No; defaults to 100000
`-maxFilesize` | Maximum size of the file being written to, limited since the entire JSON needs to fit in memory | No; defaults to 200MB
`-flushFrequency` | How often (in number of points) to flush the points in memory out to the specified file | No; defaults to every 100 points

the fix for issue #6 is currently to set a write limit so that we
write out no more than 100K points in the currently supported JSON
output format.

while fixing this issue I've also moved the writes to happen
asynchronously, using the scheduler to schedule the syncing of data to
and from disk.

future work could include streamable writing formats such as CSV
or JSONL (jsonlines: http://jsonlines.org/)

made all filesystem interactions async and cleaned up any outstanding
issues with limit

with this added a new parameter -limit to dictate the maximum number of
points that can be handled when writing to a JSON file, and also added
-maxFilesize, which dictates the maximum size in bytes of a JSON
file that we can handle when writing.

to avoid running out of memory, decided on a -flushFrequency
option to specify how many points to wait for before flushing, opting
not to do the flushing based on time since 1) the code was getting
way more complicated managing the constant async callback flushes
along with the final flush from eof(), and 2) it's odd to keep doing
async flushes every N seconds when nothing is there to flush, with all
the guards in place to avoid reading/writing when there is no data to
actually flush.
@rlgomes force-pushed the fix-file-adapter-to-handle-large-files branch from 3815d6b to 956e7be on January 4, 2016 19:05
@rlgomes (Contributor, Author) commented Jan 4, 2016

@dmehra fixed!

rlgomes added a commit that referenced this pull request Jan 4, 2016

Fix file adapter to handle large files
@rlgomes merged commit 4d2c324 into master Jan 4, 2016
@rlgomes deleted the fix-file-adapter-to-handle-large-files branch January 4, 2016 19:19
@rlgomes removed the ready label Jan 4, 2016
rlgomes pushed a commit that referenced this pull request Jan 4, 2016
fixes #100

In my hastiness to get pr #11 in I changed the wrong option to be
bufferLimit resulting in a messy set of options which should be cleared
up now with this PR.