@jeremybmerrill
Member

jeremybmerrill commented Jul 18, 2017

this set of commits adds the UI and server-side ability to

  • save a template from the extraction view
  • load an already-saved template from the extraction view
  • upload/download/rename a template from disk in the Library (front page) view

this incorporates @cheapsteak's #669 PR. this will close the feature requested in #608, #483, #93, #505.

The "template" is the same as the JSON output you can already download from the export view. Template metadata is recorded in workspace.json, which now has a new format (with code to automatically convert existing workspace.json files).
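
As a rough illustration of what such an automatic conversion could look like, here is a minimal sketch. It is not the code from this PR: the helper name `upgrade_workspace`, the key names (`version`, `pdfs`, `templates`), and the idea that the old file was a bare list of PDF entries are all assumptions made for the example.

```ruby
require 'json'

# Hypothetical helper (not Tabula's actual code): upgrade parsed
# workspace.json data to a Hash-based layout. The key names 'version',
# 'pdfs' and 'templates' are assumptions made for this sketch.
def upgrade_workspace(data)
  defaults = { 'version' => 2, 'pdfs' => [], 'templates' => [] }
  case data
  when Array
    # Assumed old format: a bare list of uploaded-PDF entries.
    defaults.merge('pdfs' => data)
  when Hash
    # Already the new shape: only fill in keys that are missing.
    defaults.merge(data)
  else
    raise ArgumentError, "unrecognized workspace format: #{data.class}"
  end
end

old_format = JSON.parse('[{"id": "abc123", "original_filename": "report.pdf"}]')
puts JSON.pretty_generate(upgrade_workspace(old_format))
```

Running the conversion on every read keeps it idempotent: a file that is already in the new shape passes through unchanged, so there is no need for a separate one-off migration step.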

@jazzido @mtigas, anyone else, wanna take a look? (I already showed Mike IRL)

this is kind of just a first stab at the problem; there's a bunch of enhancements that could be done now:

  • refactor tabula_web.rb's template persistence stuff, which is a little repetitive and gross right now
  • reimplement Autodetected Selections as a template, since it's really the same thing
  • how should we handle repeated selections? (i.e. if the template is saved with one selection repeated on pages 2 thru 100 of a 100-page PDF, and that selection is loaded on a 102-page PDF, should we apply the selection to the following pages?)
  • you should be able to name templates from PDFView
  • you should be able to "overwrite" old templates from PDFView (i.e. loading a template, modifying it, and saving it as the same one, not a new one)
  • maybe you should be able to upload templates in PDFView (rather than just on the front page)
  • "batch" mode from front page, applying one template to multiple PDFs all at once.
  • how to handle selections that don't "match" the PDF they're applied to -- maybe a selection's coordinates are outside the page on the new PDF, or there are selections on pages that don't exist in the new PDF.
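
For the last item, one possible policy is to drop selections on pages the new PDF doesn't have and trim the rest to the page bounds. The sketch below assumes each selection is a hash with `page`, `x1`, `y1`, `x2`, `y2` keys and that page sizes are available as a hash keyed by page number; both shapes, and the helper name `fit_selections`, are hypothetical and not taken from this PR.

```ruby
# Hypothetical sketch: fit a template's selections to a different PDF.
# selections: array of hashes with 'page', 'x1', 'y1', 'x2', 'y2' (assumed keys).
# page_sizes: { page_number => { 'width' => w, 'height' => h } } (assumed shape).
def fit_selections(selections, page_sizes)
  fitted = selections.map do |sel|
    size = page_sizes[sel['page']]
    next nil if size.nil? # the page does not exist in the new PDF

    # Trim each coordinate so the rectangle stays on the page.
    x1 = sel['x1'].clamp(0, size['width'])
    x2 = sel['x2'].clamp(0, size['width'])
    y1 = sel['y1'].clamp(0, size['height'])
    y2 = sel['y2'].clamp(0, size['height'])
    next nil if x2 <= x1 || y2 <= y1 # nothing of the selection is left

    sel.merge('x1' => x1, 'x2' => x2, 'y1' => y1, 'y2' => y2)
  end
  fitted.compact
end

page_sizes = { 1 => { 'width' => 612, 'height' => 792 } }
template = [
  { 'page' => 1, 'x1' => 50, 'y1' => 100, 'x2' => 700, 'y2' => 300 },
  { 'page' => 2, 'x1' => 50, 'y1' => 100, 'x2' => 500, 'y2' => 300 }
]
p fit_selections(template, page_sizes)
```

Here the first selection is trimmed to the page width and the second is dropped because the one-page PDF has no page 2. Silently dropping selections may surprise users, so a real implementation would probably want to report what it changed.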

@jazzido
Contributor

jazzido commented Jul 18, 2017

That's awesome, @jeremybmerrill! Welcome back to Tabuland :)

I love the feature. My only comment is that I'm not sure about the save-template → download-template → upload-template workflow. The only use case that I can think of for uploading a template from a file is sharing them among users. Besides, the template list and upload form kind of clutter the home page.

> refactor tabula_web.rb's template persistence stuff, which is a little repetitive and gross right now

Agree. Take a look at what I implemented in the Java rewrite of Tabula that I PoC'd a few months ago: https://github.com/tabulapdf/tabula-web-java/blob/master/src/main/java/technology/tabula/tabula_web/workspace/WorkspaceDAO.java

@jeremybmerrill
Member Author

Yeah, we could totally move the template library stuff to a new page.

I do think sharing templates among users is a feature we'd like to be able to support; that's exactly what I had in mind. It's optional though: you don't have to download/upload if you want to use the templates within your install.

@jeremybmerrill
Member Author

> I would love to allow users to provide a template (e.g. a set of rectangular areas constrained by a rectangular container) as the extraction template. I prototyped a UI for that a few years ago [3], which I based off ABBYY’s FineReader table editor (also, see the attached screencast).

^^ via email.

What did you mean by this? Is the "set of rectangular areas" the cells? (Completely "manual" extraction, obviating the need for any extraction algorithm at all?)

Also: were I to move the template library stuff to a new page, what do you think about merging this for the next release? I totally agree that there's a lot more that could be done, but I think this first pass could benefit a lot of users.

@jazzido
Contributor

jazzido commented Aug 4, 2017

> What did you mean by this? Is the "set of rectangular areas" the cells? (Completely "manual" extraction, obviating the need for any extraction algorithm at all?)

Yes, pretty much. I guess the algorithm would propose a segmentation that the user would be able to edit with the UI.

> Also: were I to move the template library stuff to a new page, what do you think about merging this for the next release? I totally agree that there's a lot more that could be done, but I think this first pass could benefit a lot of users.

Would love to. Can you merge tabula-java-1.0.0 into your branch?

@jeremybmerrill
Member Author

with f36d3ba I've moved My Templates to its own page (off of the front page). I think this is a minimally viable feature improvement. There's obviously a lot to be done around making the templates even more powerful, but I think this is a start.

return [job_batch, file_id]
end

def list_templates
Contributor

Hey @jeremybmerrill

Can we move this to the new Tabula::Workspace class (in my PR)? I'd love to include this feature in our next release.

Contributor

You can just merge tabula-java-1.0.0 into this branch and take it from there.

Member Author

I'd be happy to, but I don't see Tabula::Workspace?

Contributor

PR #707

Contributor

I guess the integration would be easier if you make a new PR against tabula-java-1.0.0 instead of master.

@jazzido
Contributor

jazzido commented Aug 9, 2017

Deleted my workspace folder, ran Tabula from source, got an error when uploading a document:

java.util.concurrent.ExecutionException: org.jruby.exceptions.RaiseException: (Errno::ENOENT) /Users/manuel/Library/Application Support/Tabula/pdfs/workspace.json
	at java.util.concurrent.FutureTask.report(java/util/concurrent/FutureTask.java:122)
	at java.util.concurrent.FutureTask.get(java/util/concurrent/FutureTask.java:192)
	at java.lang.reflect.Method.invoke(java/lang/reflect/Method.java:497)
	at org.jruby.javasupport.JavaMethod.invokeDirectWithExceptionHandling(org/jruby/javasupport/JavaMethod.java:438)
	at org.jruby.javasupport.JavaMethod.invokeDirect(org/jruby/javasupport/JavaMethod.java:302)
	at RUBY.afterExecute(/Users/manuel/Work/code/tabula/lib/tabula_job_executor/executor.rb:42)
	at org.jruby.javasupport.proxy.JavaProxyConstructor$MethodInvocationHandler.invokeRuby(org/jruby/javasupport/proxy/JavaProxyConstructor.java:255)
	at org.jruby.javasupport.proxy.JavaProxyConstructor$MethodInvocationHandler.invoke(org/jruby/javasupport/proxy/JavaProxyConstructor.java:238)
	at org.jruby.proxy.java.util.concurrent.ThreadPoolExecutor$Proxy1.afterExecute(org/jruby/proxy/java/util/concurrent/ThreadPoolExecutor$Proxy1)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(java/util/concurrent/ThreadPoolExecutor.java:1150)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(java/util/concurrent/ThreadPoolExecutor.java:617)
0:0:0:0:0:0:0:1 - - [09/Aug/2017:12:43:50 -0400] "POST /upload.json HTTP/1.1" 200 180 0.9489
	at java.lang.Thread.run(java/lang/Thread.java:745)
Aug 09, 2017 12:43:50 PM org.apache.pdfbox.cos.COSDocument finalize
WARNING: Warning: You did not close a PDF Document
Caused by: org.jruby.exceptions.RaiseException: (Errno::ENOENT) /Users/manuel/Library/Application Support/Tabula/pdfs/workspace.json
	at org.jruby.RubyFile.initialize(org/jruby/RubyFile.java:366)
	at org.jruby.RubyIO.open(org/jruby/RubyIO.java:1154)
	at RUBY.read_workspace!(/Users/manuel/Work/code/tabula/lib/tabula_workspace.rb:154)
	at RUBY.add_document(/Users/manuel/Work/code/tabula/lib/tabula_workspace.rb:29)
	at RUBY.perform(/Users/manuel/Work/code/tabula/lib/tabula_job_executor/jobs/generate_document_data.rb:36)
	at RUBY.call(/Users/manuel/Work/code/tabula/lib/tabula_job_executor/executor.rb:104)
java.util.concurrent.ExecutionException: org.jruby.exceptions.RaiseException: (Errno::ENOENT) /Users/manuel/Library/Application Support/Tabula/pdfs/workspace.json
	at java.util.concurrent.FutureTask.report(java/util/concurrent/FutureTask.java:122)
	at java.util.concurrent.FutureTask.get(java/util/concurrent/FutureTask.java:192)
	at java.lang.reflect.Method.invoke(java/lang/reflect/Method.java:497)
	at org.jruby.javasupport.JavaMethod.invokeDirectWithExceptionHandling(org/jruby/javasupport/JavaMethod.java:438)
	at org.jruby.javasupport.JavaMethod.invokeDirect(org/jruby/javasupport/JavaMethod.java:302)
	at RUBY.afterExecute(/Users/manuel/Work/code/tabula/lib/tabula_job_executor/executor.rb:42)
	at org.jruby.javasupport.proxy.JavaProxyConstructor$MethodInvocationHandler.invokeRuby(org/jruby/javasupport/proxy/JavaProxyConstructor.java:255)
	at org.jruby.javasupport.proxy.JavaProxyConstructor$MethodInvocationHandler.invoke(org/jruby/javasupport/proxy/JavaProxyConstructor.java:238)
	at org.jruby.proxy.java.util.concurrent.ThreadPoolExecutor$Proxy1.afterExecute(org/jruby/proxy/java/util/concurrent/ThreadPoolExecutor$Proxy1)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(java/util/concurrent/ThreadPoolExecutor.java:1150)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(java/util/concurrent/ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(java/lang/Thread.java:745)
Caused by: org.jruby.exceptions.RaiseException: (Errno::ENOENT) /Users/manuel/Library/Application Support/Tabula/pdfs/workspace.json
	at org.jruby.RubyFile.initialize(org/jruby/RubyFile.java:366)
	at org.jruby.RubyIO.open(org/jruby/RubyIO.java:1154)
	at RUBY.read_workspace!(/Users/manuel/Work/code/tabula/lib/tabula_workspace.rb:154)
	at RUBY.add_document(/Users/manuel/Work/code/tabula/lib/tabula_workspace.rb:29)
	at RUBY.perform(/Users/manuel/Work/code/tabula/lib/tabula_job_executor/jobs/generate_document_data.rb:36)
	at RUBY.call(/Users/manuel/Work/code/tabula/lib/tabula_job_executor/executor.rb:104)
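
The trace shows `read_workspace!` raising `Errno::ENOENT` because workspace.json no longer exists once the workspace folder has been deleted. One possible guard, sketched here under assumptions, is to create an empty workspace file before reading it. The helper name `read_workspace_or_init` and the empty layout are hypothetical; this is not the fix that was applied in the PR.

```ruby
require 'json'
require 'fileutils'
require 'tmpdir'

# Assumed layout of an empty workspace; the real key names may differ.
EMPTY_WORKSPACE = { 'pdfs' => [], 'templates' => [] }.freeze

# Hypothetical helper: read workspace.json, creating it (and its parent
# directory) first if it is missing, instead of raising Errno::ENOENT.
def read_workspace_or_init(path)
  unless File.exist?(path)
    FileUtils.mkdir_p(File.dirname(path))
    File.write(path, JSON.generate(EMPTY_WORKSPACE))
  end
  JSON.parse(File.read(path))
end

# Demo against a throwaway directory, standing in for a freshly deleted
# workspace folder.
Dir.mktmpdir do |dir|
  path = File.join(dir, 'pdfs', 'workspace.json')
  p read_workspace_or_init(path)
end
```

Creating the file lazily on first read means a deleted or brand-new workspace folder behaves the same way, without a separate setup step at startup.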

@jeremybmerrill
Member Author

I'm not quite done poking at this yet, but will address.

@jeremybmerrill
Member Author

closing to propose merge into tabula-java-1.0.0
