
Fix file adapter to handle large files #11

Merged
rlgomes merged 2 commits into master from fix-file-adapter-to-handle-large-files on Jan 4, 2016

Conversation

@rlgomes (Contributor) commented Dec 21, 2015

this PR reworks the file adapter so it's easier to add more features in
the future, by splitting the read and write logic into individual modules.

the fix for issue #6 is currently to set a write limit so that we
write out no more than 100K points in the currently supported JSON
output format.

future work could include streamable writing formats such as CSV
or JSONL (jsonlines: http://jsonlines.org/)
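The tradeoff between the current JSON format and a streamable format like JSONL can be sketched as follows (both helpers are illustrative, not the adapter's actual code):

```javascript
// A JSON array must be serialized (and thus held) in memory as a whole,
// while JSONL emits one self-contained line per point, so each line can
// be appended to a file incrementally.
function toJson(points) {
    // one big string: the entire array lives in memory at once
    return JSON.stringify(points);
}

function toJsonLines(points) {
    // one line per point: each line could be written/appended separately
    return points.map(function(p) { return JSON.stringify(p); }).join('\n');
}
```

With JSONL, a writer can append each point as it arrives instead of buffering everything until the end, which is what makes it attractive for large outputs.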

@rlgomes (Contributor, Author) commented Dec 21, 2015

@dmehra @demmer @dmajda @mstemm this fixes the situation identified in issue #6. I'll update the documentation here in this PR once the initial review of the change goes through.

I'm thinking that if we set the limit for the only existing write format to 100K right now, then in a follow-up PR we can add the ability for write file to specify an output format such as CSV or JSON Lines, which would allow incremental writing and never run out of memory.

@rlgomes force-pushed the fix-file-adapter-to-handle-large-files branch 3 times, most recently from 095a42e to 6bd1558 on December 22, 2015 04:11
@rlgomes (Contributor, Author) commented Dec 22, 2015

@demmer the initial approach of using the scheduler on a 1s interval seems a little "harsh", so I'm open to better suggestions; but if we use a larger interval, then the program would have to wait up to N additional seconds after finishing to actually complete and flush the data.

@dmajda (Contributor) commented Dec 22, 2015

@rlgomes General note: in adapter-related PRs, please tag @bkutil instead of (or in addition to) me. He is working more in that area, so he should be able to give better feedback. Thanks.

Review comment (Contributor):

It is also possible to use `this.limit = options.limit || 100000`, which is somewhat shorter.
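The suggested shorthand can be sketched like this (`makeWriter` is a hypothetical stand-in, not the adapter's code); note one caveat worth knowing:

```javascript
// Sketch of the `options.limit || 100000` default pattern. Caveat: `||`
// falls back to the default for ANY falsy value, so an explicit limit
// of 0 is silently replaced by 100000.
function makeWriter(options) {
    options = options || {};
    return { limit: options.limit || 100000 };
}
```

If a limit of 0 should be meaningful, an explicit `options.limit === undefined` check is safer than `||`.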

Review comment (Contributor):

We're missing a check that limit should be non-negative.
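A minimal sketch of such a check might look like this (`validateLimit` is a hypothetical helper, not the adapter's code):

```javascript
// Reject anything that is not a non-negative number before using it
// as a write limit.
function validateLimit(limit) {
    if (typeof limit !== 'number' || limit < 0) {
        throw new Error('limit must be a non-negative number, got: ' + limit);
    }
    return limit;
}
```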

@bkutil (Contributor) commented Dec 22, 2015

the fix for issue #6 is currently to set a write limit so that we
simply limit writing out no more than 100K

Nice! So no more OOMs? IIUC, we still read & parse the whole file...

future work could be to include streamable writing formats such as CSV,
or JSONL (jsonlines: http://jsonlines.org/)

+10k - I agree with you that this is the way to go.

@rlgomes (Contributor, Author) commented Dec 22, 2015

@bkutil taking care of the various suggestions; all good catches. As for OOMing when there's an existing 1GB JSON file, I'm not sure how we'd fix that at this point. Another simple limitation would be to fail to open files bigger than 200MB, i.e. impose some hard limit on the size of the file. @demmer @dmajda what do you think? I guess we'd also have to expose some way of overriding the limitation, but at least this way we could avoid blowing things up in a nasty way.

Review comment (Contributor):

It seems better to me to have eof trigger an immediate flush instead of just setting isdone and waiting.

This should be trivial to handle now since all the fs operations are sync, but as @bkutil pointed out, they should be changed to be async, so you'll need to make sure you don't race between the scheduler callback and the immediate flush from eof. I think it's easy to handle by just setting isdone immediately after calling flush, so that you know a subsequent callback won't race.
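The ordering being suggested can be sketched roughly like this (all names are illustrative, not the adapter's actual implementation):

```javascript
// eof() flushes immediately, then sets isdone so a later scheduler
// callback won't race and flush a second time.
function Writer() {
    this.queue = [];
    this.isdone = false;
    this.flushes = 0;
}

Writer.prototype.flush = function() {
    this.flushes++;
    this.queue = [];
};

// periodic scheduler callback
Writer.prototype.tick = function() {
    if (this.isdone || this.queue.length === 0) { return; }
    this.flush();
};

Writer.prototype.eof = function() {
    this.flush();       // immediate flush instead of waiting for a tick
    this.isdone = true; // guard any subsequent scheduler callbacks
};
```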

@demmer (Contributor) commented Dec 22, 2015

As for the limit on the size of the file, I guess you'll need two options: -limit, which is a number of points, and -maxSize, in bytes. Seems icky, but I don't see a better option.
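Enforcing the two caps could look roughly like this (`checkLimits` is a hypothetical helper; the option names the PR finally shipped with differ slightly):

```javascript
// The point cap bounds the number of points; the size cap bounds the
// serialized size in bytes. Illustrative only.
function checkLimits(points, opts) {
    if (points.length > opts.limit) {
        throw new Error('too many points: ' + points.length + ' > ' + opts.limit);
    }
    var bytes = Buffer.byteLength(JSON.stringify(points), 'utf8');
    if (bytes > opts.maxSize) {
        throw new Error('file too big: ' + bytes + ' > ' + opts.maxSize + ' bytes');
    }
    return bytes;
}
```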

@rlgomes (Contributor, Author) commented Dec 22, 2015

I'll document that -maxSize is the file size, while -limit is the number of points in said file.

@rlgomes force-pushed the fix-file-adapter-to-handle-large-files branch from 6bd1558 to 5372fef on December 22, 2015 21:00
@rlgomes (Contributor, Author) commented Dec 22, 2015

@demmer @bkutil have a looksie; I think this is much better at this point.

Review comment (Contributor):

Looks like this won't ever call done() if there aren't any points buffered in the queue when the EOF arrives.

You should fix this by always entering the promise chain even if the queue is empty, and add a test case for it.
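The fix being suggested can be sketched as: start the promise chain unconditionally, so done() fires even with an empty queue (names illustrative, not the adapter's code):

```javascript
// Always enter the chain; only perform the write when there are points.
function eof(queue, writeAll, done) {
    return Promise.resolve()
        .then(function() {
            return queue.length > 0 ? writeAll(queue) : undefined;
        })
        .then(function() { done(); });
}
```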

@rlgomes force-pushed the fix-file-adapter-to-handle-large-files branch from 5372fef to 6c4f0ed on December 23, 2015 05:05
@rlgomes (Contributor, Author) commented Dec 23, 2015

@demmer @bkutil take another look at the current solution. I opted not to have a -flushInterval and instead went with a -flushFrequency (open to suggestions on a better name), where we only flush once we've hit a target number of points, rather than periodically checking whether there is more to flush. The -flushInterval option was leading to some absurdly complicated code, due to the background callbacks having to synchronize their work with the final flush from the eof call.
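Count-based flushing as described can be sketched like so (`CountFlusher` is a hypothetical stand-in for the writer's buffering logic, not the adapter's code):

```javascript
// Buffer incoming points and flush only when the buffer reaches the
// configured flushFrequency; no timers, so there is nothing to
// synchronize with the final flush at eof.
function CountFlusher(flushFrequency, flushFn) {
    this.flushFrequency = flushFrequency;
    this.flushFn = flushFn;
    this.buffer = [];
}

CountFlusher.prototype.write = function(point) {
    this.buffer.push(point);
    if (this.buffer.length >= this.flushFrequency) {
        this.flushFn(this.buffer);
        this.buffer = [];
    }
};
```

Any leftover points in the buffer would still need one final flush at eof, which is exactly the case the count-based design keeps simple.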

@rlgomes (Contributor, Author) commented Dec 23, 2015

Things to finish up:

  • add tests for option validations that aren't covered and showing up as uncovered lines in the coverage report.

@rlgomes added the ready label Dec 23, 2015
@rlgomes self-assigned this Dec 23, 2015
@rlgomes added the bug label Dec 23, 2015
@rlgomes force-pushed the fix-file-adapter-to-handle-large-files branch from a9643e7 to b99cbbd on December 28, 2015 18:19
@rlgomes force-pushed the fix-file-adapter-to-handle-large-files branch from b99cbbd to 9f4ca85 on December 29, 2015 06:41
@demmer (Contributor) commented Jan 4, 2016

Overall this looks good to me now; the above comments are just a few minor suggestions.

easier to further develop features on the file adapter.
@rlgomes force-pushed the fix-file-adapter-to-handle-large-files branch from 9f4ca85 to 58dce8e on January 4, 2016 18:34
@rlgomes (Contributor, Author) commented Jan 4, 2016

Ok, merging on green as I've taken care of all of @demmer's suggestions and I think this is ready for prime time.

@rlgomes force-pushed the fix-file-adapter-to-handle-large-files branch from 58dce8e to 3815d6b on January 4, 2016 18:53
@dmehra (Contributor) commented Jan 4, 2016

Doc review comment: how about a rephrase so defaults go into the "Required?" column, and don't mention -file because per #44 it will be renamed to -path anyway. Maybe like this:

Option | Description | Required?
------ | ----------- | ---------
`-bufferLimit` | Maximum number of points to buffer in memory before each write to the specified file | No; defaults to 100000
`-maxFilesize` | Maximum size of the file being written to, limited since the entire JSON needs to fit in memory | No; defaults to 200MB
`-flushFrequency` | How often (in number of points) to flush the points in memory out to the specified file | No; defaults to every 100 points

the fix for issue #6 is currently to set a write limit so that we
write out no more than 100K points in the currently supported JSON
output format.

while fixing this issue I've also moved the writes to happen
asynchronously, using the scheduler to schedule the syncing of data to
and from disk.

future work could include streamable writing formats such as CSV
or JSONL (jsonlines: http://jsonlines.org/)

made all filesystem interactions async and cleaned up any outstanding
issues with limit

with this added a new parameter -limit to dictate the maximum number of
points that can be handled when writing to a JSON file, and also added
-maxFilesize, which dictates the maximum size in bytes of a JSON
file that we can handle when writing.

to avoid running out of memory, decided on a -flushFrequency
option to specify how many points to wait for before flushing, opting
not to do the flushing based on time since 1) the code was getting
way more complicated managing the constant async callback flushes
along with the final flush from eof(), and 2) it's odd to keep doing
async flushes every N seconds when nothing is there to flush, with all
the guards in place to avoid reading/writing when there is no data to
actually flush.
@rlgomes force-pushed the fix-file-adapter-to-handle-large-files branch from 3815d6b to 956e7be on January 4, 2016 19:05
@rlgomes (Contributor, Author) commented Jan 4, 2016

@dmehra fixed!

rlgomes added a commit that referenced this pull request Jan 4, 2016

Fix file adapter to handle large files
@rlgomes merged commit 4d2c324 into master Jan 4, 2016
@rlgomes deleted the fix-file-adapter-to-handle-large-files branch January 4, 2016 19:19
@rlgomes removed the ready label Jan 4, 2016
rlgomes pushed a commit that referenced this pull request Jan 4, 2016
fixes #100

In my hastiness to get pr #11 in I changed the wrong option to be
bufferLimit resulting in a messy set of options which should be cleared
up now with this PR.