Skip to content

processFulltextDocument fails on 0.23% arXiv PDFs #1113

@MarksonChen

Description

@MarksonChen

I ran processFulltextDocument on 22103 arXiv PDFs. 22053 PDFs succeeded and 50 failed.

Running on MacOS M2 chip
Java version: 17.0.10
Server started with Gradle (./gradlew run)

An example error log:

ERROR [2024-05-09 13:13:55,538] org.grobid.service.process.GrobidRestProcessFiles: An unexpected exception occurs. 
! java.lang.IndexOutOfBoundsException: Index 0 out of bounds for length 0
! at java.base/jdk.internal.util.Preconditions.outOfBounds(Preconditions.java:64)
! at java.base/jdk.internal.util.Preconditions.outOfBoundsCheckIndex(Preconditions.java:70)
! at java.base/jdk.internal.util.Preconditions.checkIndex(Preconditions.java:266)
! at java.base/java.util.Objects.checkIndex(Objects.java:359)
! at java.base/java.util.ArrayList.get(ArrayList.java:427)
! at org.grobid.core.data.Note.getPageNumber(Note.java:77)
! at org.grobid.core.document.TEIFormatter.lambda$toTEITextPiece$0(TEIFormatter.java:1460)
! at java.base/java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:178)
! at java.base/java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1625)
! at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:509)
! at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:499)
! at java.base/java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:921)
! at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
! at java.base/java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:682)
! at org.grobid.core.document.TEIFormatter.toTEITextPiece(TEIFormatter.java:1461)
! at org.grobid.core.document.TEIFormatter.toTEIBody(TEIFormatter.java:1015)
! at org.grobid.core.engines.FullTextParser.toTEI(FullTextParser.java:2648)
! ... 78 common frames omitted
! Causing: org.grobid.core.exceptions.GrobidException: [GENERAL] An exception occurred while running Grobid.
! at org.grobid.core.engines.FullTextParser.toTEI(FullTextParser.java:2708)
! at org.grobid.core.engines.FullTextParser.processing(FullTextParser.java:320)
! at org.grobid.core.engines.FullTextParser.processing(FullTextParser.java:119)
! at org.grobid.core.engines.Engine.fullTextToTEIDoc(Engine.java:587)
! at org.grobid.core.engines.Engine.fullTextToTEI(Engine.java:577)
! at org.grobid.service.process.GrobidRestProcessFiles.processFulltextDocument(GrobidRestProcessFiles.java:290)
! at org.grobid.service.GrobidRestService.processFulltext(GrobidRestService.java:291)
! at org.grobid.service.GrobidRestService.processFulltextDocument_post(GrobidRestService.java:240)
! at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
! at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
! at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
! at java.base/java.lang.reflect.Method.invoke(Method.java:568)
! at org.glassfish.jersey.server.model.internal.ResourceMethodInvocationHandlerFactory.lambda$static$0(ResourceMethodInvocationHandlerFactory.java:52)
! at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher$1.run(AbstractJavaResourceMethodDispatcher.java:134)
! at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.invoke(AbstractJavaResourceMethodDispatcher.java:177)
! at org.glassfish.jersey.server.model.internal.JavaResourceMethodDispatcherProvider$ResponseOutInvoker.doDispatch(JavaResourceMethodDispatcherProvider.java:176)
! at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.dispatch(AbstractJavaResourceMethodDispatcher.java:81)
! at org.glassfish.jersey.server.model.ResourceMethodInvoker.invoke(ResourceMethodInvoker.java:478)
! at org.glassfish.jersey.server.model.ResourceMethodInvoker.apply(ResourceMethodInvoker.java:400)
! at org.glassfish.jersey.server.model.ResourceMethodInvoker.apply(ResourceMethodInvoker.java:81)
! at org.glassfish.jersey.server.ServerRuntime$1.run(ServerRuntime.java:256)
! at org.glassfish.jersey.internal.Errors$1.call(Errors.java:248)
! at org.glassfish.jersey.internal.Errors$1.call(Errors.java:244)
! at org.glassfish.jersey.internal.Errors.process(Errors.java:292)
! at org.glassfish.jersey.internal.Errors.process(Errors.java:274)
! at org.glassfish.jersey.internal.Errors.process(Errors.java:244)
! at org.glassfish.jersey.process.internal.RequestScope.runInScope(RequestScope.java:265)
! at org.glassfish.jersey.server.ServerRuntime.process(ServerRuntime.java:235)
! at org.glassfish.jersey.server.ApplicationHandler.handle(ApplicationHandler.java:684)
! at org.glassfish.jersey.servlet.WebComponent.serviceImpl(WebComponent.java:394)
! at org.glassfish.jersey.servlet.WebComponent.service(WebComponent.java:346)
! at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:358)
! at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:311)
! at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:205)
! at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:764)
! at org.eclipse.jetty.servlet.ServletHandler$ChainEnd.doFilter(ServletHandler.java:1665)
! at io.dropwizard.servlets.ThreadNameFilter.doFilter(ThreadNameFilter.java:36)
! at org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:202)
! at org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1635)
! at io.dropwizard.jersey.filter.AllowedMethodsFilter.handle(AllowedMethodsFilter.java:46)
! at io.dropwizard.jersey.filter.AllowedMethodsFilter.doFilter(AllowedMethodsFilter.java:40)
! at org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:202)
! at org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1635)
! at org.eclipse.jetty.servlets.CrossOriginFilter.handle(CrossOriginFilter.java:313)
! at org.eclipse.jetty.servlets.CrossOriginFilter.doFilter(CrossOriginFilter.java:267)
! at org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:202)
! at org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1635)
! at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:89)
! at com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:121)
! at com.google.inject.servlet.GuiceFilter.doFilter(GuiceFilter.java:133)
! at org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:202)
! at org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1635)
! at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:527)
! at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:221)
! at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1382)
! at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:176)
! at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:484)
! at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:174)
! at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1304)
! at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:129)
! at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:122)
! at io.dropwizard.metrics.jetty11.InstrumentedHandler.handle(InstrumentedHandler.java:307)
! at io.dropwizard.jetty.RoutingHandler.handle(RoutingHandler.java:52)
! at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:122)
! at org.eclipse.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:822)
! at io.dropwizard.jetty.ZipExceptionHandlingGzipHandler.handle(ZipExceptionHandlingGzipHandler.java:26)
! at org.eclipse.jetty.server.handler.StatisticsHandler.handle(StatisticsHandler.java:173)
! at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:122)
! at org.eclipse.jetty.server.Server.handle(Server.java:563)
! at org.eclipse.jetty.server.HttpChannel.lambda$handle$0(HttpChannel.java:505)
! at org.eclipse.jetty.server.HttpChannel.dispatch(HttpChannel.java:762)
! at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:497)
! at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:282)
! at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:314)
! at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:100)
! at org.eclipse.jetty.io.SelectableChannelEndPoint$1.run(SelectableChannelEndPoint.java:53)
! at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:936)
! at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:1080)
! at java.base/java.lang.Thread.run(Thread.java:842)

The 50 PDFs that failed:

https://arxiv.org/pdf/2202.03169
https://arxiv.org/pdf/2007.10408
https://arxiv.org/pdf/2008.08076
https://arxiv.org/pdf/2203.00397
https://arxiv.org/pdf/2202.00145
https://arxiv.org/pdf/2110.13423
https://arxiv.org/pdf/2006.16218
https://arxiv.org/pdf/2305.01868
https://arxiv.org/pdf/2206.11939
https://arxiv.org/pdf/1711.05715
https://arxiv.org/pdf/2110.11222
https://arxiv.org/pdf/2006.13025
https://arxiv.org/pdf/1902.00450
https://arxiv.org/pdf/2109.04212
https://arxiv.org/pdf/2105.14849
https://arxiv.org/pdf/cs/9906002
https://arxiv.org/pdf/2101.09398
https://arxiv.org/pdf/1911.00536
https://arxiv.org/pdf/1912.02762
https://arxiv.org/pdf/2104.07857
https://arxiv.org/pdf/2106.15093
https://arxiv.org/pdf/1901.09401
https://arxiv.org/pdf/2201.10129
https://arxiv.org/pdf/2010.04879
https://arxiv.org/pdf/1206.5241
https://arxiv.org/pdf/2203.14101
https://arxiv.org/pdf/1905.06214
https://arxiv.org/pdf/2205.05789
https://arxiv.org/pdf/1810.00953
https://arxiv.org/pdf/1910.11856
https://arxiv.org/pdf/1501.02876
https://arxiv.org/pdf/2202.01987
https://arxiv.org/pdf/2303.02186
https://arxiv.org/pdf/2010.05761
https://arxiv.org/pdf/2204.11918
https://arxiv.org/pdf/2002.12361
https://arxiv.org/pdf/1810.07311
https://arxiv.org/pdf/1905.03817
https://arxiv.org/pdf/1901.07846
https://arxiv.org/pdf/2202.03798
https://arxiv.org/pdf/1711.01244
https://arxiv.org/pdf/2006.03040
https://arxiv.org/pdf/2004.10964
https://arxiv.org/pdf/1803.00590
https://arxiv.org/pdf/1612.06109
https://arxiv.org/pdf/1704.03651
https://arxiv.org/pdf/1610.09534
https://arxiv.org/pdf/2202.03555
https://arxiv.org/pdf/2008.04990

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugFrom Hemiptera and especially its suborder HeteropteraimplementedThe issue has been implemented

    Type

    No type

    Projects

    No projects

    Milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions