I ran processFulltextDocument on 22103 arXiv PDFs. 22053 PDFs succeeded and 50 failed.
Running on macOS (Apple M2 chip)
Java version: 17.0.10
Server started with Gradle (./gradlew run)
An example error log:
ERROR [2024-05-09 13:13:55,538] org.grobid.service.process.GrobidRestProcessFiles: An unexpected exception occurs.
! java.lang.IndexOutOfBoundsException: Index 0 out of bounds for length 0
! at java.base/jdk.internal.util.Preconditions.outOfBounds(Preconditions.java:64)
! at java.base/jdk.internal.util.Preconditions.outOfBoundsCheckIndex(Preconditions.java:70)
! at java.base/jdk.internal.util.Preconditions.checkIndex(Preconditions.java:266)
! at java.base/java.util.Objects.checkIndex(Objects.java:359)
! at java.base/java.util.ArrayList.get(ArrayList.java:427)
! at org.grobid.core.data.Note.getPageNumber(Note.java:77)
! at org.grobid.core.document.TEIFormatter.lambda$toTEITextPiece$0(TEIFormatter.java:1460)
! at java.base/java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:178)
! at java.base/java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1625)
! at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:509)
! at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:499)
! at java.base/java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:921)
! at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
! at java.base/java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:682)
! at org.grobid.core.document.TEIFormatter.toTEITextPiece(TEIFormatter.java:1461)
! at org.grobid.core.document.TEIFormatter.toTEIBody(TEIFormatter.java:1015)
! at org.grobid.core.engines.FullTextParser.toTEI(FullTextParser.java:2648)
! ... 78 common frames omitted
! Causing: org.grobid.core.exceptions.GrobidException: [GENERAL] An exception occurred while running Grobid.
! at org.grobid.core.engines.FullTextParser.toTEI(FullTextParser.java:2708)
! at org.grobid.core.engines.FullTextParser.processing(FullTextParser.java:320)
! at org.grobid.core.engines.FullTextParser.processing(FullTextParser.java:119)
! at org.grobid.core.engines.Engine.fullTextToTEIDoc(Engine.java:587)
! at org.grobid.core.engines.Engine.fullTextToTEI(Engine.java:577)
! at org.grobid.service.process.GrobidRestProcessFiles.processFulltextDocument(GrobidRestProcessFiles.java:290)
! at org.grobid.service.GrobidRestService.processFulltext(GrobidRestService.java:291)
! at org.grobid.service.GrobidRestService.processFulltextDocument_post(GrobidRestService.java:240)
! at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
! at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
! at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
! at java.base/java.lang.reflect.Method.invoke(Method.java:568)
! at org.glassfish.jersey.server.model.internal.ResourceMethodInvocationHandlerFactory.lambda$static$0(ResourceMethodInvocationHandlerFactory.java:52)
! at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher$1.run(AbstractJavaResourceMethodDispatcher.java:134)
! at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.invoke(AbstractJavaResourceMethodDispatcher.java:177)
! at org.glassfish.jersey.server.model.internal.JavaResourceMethodDispatcherProvider$ResponseOutInvoker.doDispatch(JavaResourceMethodDispatcherProvider.java:176)
! at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.dispatch(AbstractJavaResourceMethodDispatcher.java:81)
! at org.glassfish.jersey.server.model.ResourceMethodInvoker.invoke(ResourceMethodInvoker.java:478)
! at org.glassfish.jersey.server.model.ResourceMethodInvoker.apply(ResourceMethodInvoker.java:400)
! at org.glassfish.jersey.server.model.ResourceMethodInvoker.apply(ResourceMethodInvoker.java:81)
! at org.glassfish.jersey.server.ServerRuntime$1.run(ServerRuntime.java:256)
! at org.glassfish.jersey.internal.Errors$1.call(Errors.java:248)
! at org.glassfish.jersey.internal.Errors$1.call(Errors.java:244)
! at org.glassfish.jersey.internal.Errors.process(Errors.java:292)
! at org.glassfish.jersey.internal.Errors.process(Errors.java:274)
! at org.glassfish.jersey.internal.Errors.process(Errors.java:244)
! at org.glassfish.jersey.process.internal.RequestScope.runInScope(RequestScope.java:265)
! at org.glassfish.jersey.server.ServerRuntime.process(ServerRuntime.java:235)
! at org.glassfish.jersey.server.ApplicationHandler.handle(ApplicationHandler.java:684)
! at org.glassfish.jersey.servlet.WebComponent.serviceImpl(WebComponent.java:394)
! at org.glassfish.jersey.servlet.WebComponent.service(WebComponent.java:346)
! at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:358)
! at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:311)
! at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:205)
! at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:764)
! at org.eclipse.jetty.servlet.ServletHandler$ChainEnd.doFilter(ServletHandler.java:1665)
! at io.dropwizard.servlets.ThreadNameFilter.doFilter(ThreadNameFilter.java:36)
! at org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:202)
! at org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1635)
! at io.dropwizard.jersey.filter.AllowedMethodsFilter.handle(AllowedMethodsFilter.java:46)
! at io.dropwizard.jersey.filter.AllowedMethodsFilter.doFilter(AllowedMethodsFilter.java:40)
! at org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:202)
! at org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1635)
! at org.eclipse.jetty.servlets.CrossOriginFilter.handle(CrossOriginFilter.java:313)
! at org.eclipse.jetty.servlets.CrossOriginFilter.doFilter(CrossOriginFilter.java:267)
! at org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:202)
! at org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1635)
! at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:89)
! at com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:121)
! at com.google.inject.servlet.GuiceFilter.doFilter(GuiceFilter.java:133)
! at org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:202)
! at org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1635)
! at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:527)
! at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:221)
! at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1382)
! at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:176)
! at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:484)
! at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:174)
! at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1304)
! at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:129)
! at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:122)
! at io.dropwizard.metrics.jetty11.InstrumentedHandler.handle(InstrumentedHandler.java:307)
! at io.dropwizard.jetty.RoutingHandler.handle(RoutingHandler.java:52)
! at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:122)
! at org.eclipse.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:822)
! at io.dropwizard.jetty.ZipExceptionHandlingGzipHandler.handle(ZipExceptionHandlingGzipHandler.java:26)
! at org.eclipse.jetty.server.handler.StatisticsHandler.handle(StatisticsHandler.java:173)
! at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:122)
! at org.eclipse.jetty.server.Server.handle(Server.java:563)
! at org.eclipse.jetty.server.HttpChannel.lambda$handle$0(HttpChannel.java:505)
! at org.eclipse.jetty.server.HttpChannel.dispatch(HttpChannel.java:762)
! at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:497)
! at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:282)
! at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:314)
! at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:100)
! at org.eclipse.jetty.io.SelectableChannelEndPoint$1.run(SelectableChannelEndPoint.java:53)
! at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:936)
! at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:1080)
! at java.base/java.lang.Thread.run(Thread.java:842)
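Reading the top frames, Note.getPageNumber (Note.java:77) appears to call ArrayList.get(0) on an empty list, which would explain the "Index 0 out of bounds for length 0" message. Below is a minimal, self-contained sketch of that failure pattern and one possible guard. It is not GROBID's actual code: the class PageNumberGuard, both method names, and the use of a plain list of page numbers are hypothetical stand-ins, since I have not checked what the list in Note really holds.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.OptionalInt;

// Hypothetical illustration only; none of these names exist in GROBID.
class PageNumberGuard {

    // Mirrors the pattern suggested by the trace: read the first element
    // without checking whether the list is empty.
    static int unsafePageNumber(List<Integer> pages) {
        return pages.get(0); // throws IndexOutOfBoundsException on an empty list
    }

    // One possible guard: report "no page number" instead of throwing.
    static OptionalInt safePageNumber(List<Integer> pages) {
        if (pages == null || pages.isEmpty()) {
            return OptionalInt.empty();
        }
        return OptionalInt.of(pages.get(0));
    }

    public static void main(String[] args) {
        List<Integer> empty = new ArrayList<>();

        try {
            unsafePageNumber(empty);
        } catch (IndexOutOfBoundsException e) {
            System.out.println("unsafe: " + e.getMessage());
        }

        System.out.println("safe: " + safePageNumber(empty).isPresent());
    }
}
```

Whether a note with no tokens should be skipped, given a default page, or treated as an upstream segmentation error is a design decision for the maintainers; the sketch only shows that the crash can be avoided at the point of access.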
The 50 PDFs that failed:
https://arxiv.org/pdf/2202.03169
https://arxiv.org/pdf/2007.10408
https://arxiv.org/pdf/2008.08076
https://arxiv.org/pdf/2203.00397
https://arxiv.org/pdf/2202.00145
https://arxiv.org/pdf/2110.13423
https://arxiv.org/pdf/2006.16218
https://arxiv.org/pdf/2305.01868
https://arxiv.org/pdf/2206.11939
https://arxiv.org/pdf/1711.05715
https://arxiv.org/pdf/2110.11222
https://arxiv.org/pdf/2006.13025
https://arxiv.org/pdf/1902.00450
https://arxiv.org/pdf/2109.04212
https://arxiv.org/pdf/2105.14849
https://arxiv.org/pdf/cs/9906002
https://arxiv.org/pdf/2101.09398
https://arxiv.org/pdf/1911.00536
https://arxiv.org/pdf/1912.02762
https://arxiv.org/pdf/2104.07857
https://arxiv.org/pdf/2106.15093
https://arxiv.org/pdf/1901.09401
https://arxiv.org/pdf/2201.10129
https://arxiv.org/pdf/2010.04879
https://arxiv.org/pdf/1206.5241
https://arxiv.org/pdf/2203.14101
https://arxiv.org/pdf/1905.06214
https://arxiv.org/pdf/2205.05789
https://arxiv.org/pdf/1810.00953
https://arxiv.org/pdf/1910.11856
https://arxiv.org/pdf/1501.02876
https://arxiv.org/pdf/2202.01987
https://arxiv.org/pdf/2303.02186
https://arxiv.org/pdf/2010.05761
https://arxiv.org/pdf/2204.11918
https://arxiv.org/pdf/2002.12361
https://arxiv.org/pdf/1810.07311
https://arxiv.org/pdf/1905.03817
https://arxiv.org/pdf/1901.07846
https://arxiv.org/pdf/2202.03798
https://arxiv.org/pdf/1711.01244
https://arxiv.org/pdf/2006.03040
https://arxiv.org/pdf/2004.10964
https://arxiv.org/pdf/1803.00590
https://arxiv.org/pdf/1612.06109
https://arxiv.org/pdf/1704.03651
https://arxiv.org/pdf/1610.09534
https://arxiv.org/pdf/2202.03555
https://arxiv.org/pdf/2008.04990