Incorrectly extracted chemistry information. #1249

@haydn-jones

I have a large number of PDFs which have a series of sections/subsections that look like this:

[Image: screenshot of an example section]

After processing all of them, I've realized that these sections are often misidentified, either as figures or even as bibliographic references, which causes a lot of problems in my pipeline. I've also seen various other issues with documents like these (missing paragraph breaks and such), but they are far less problematic than these sections getting lost.

Here's an example of part of one of these sections being turned into a bibliographic reference:

<biblStruct xml:id="b0">
	<monogr>
		<title level="m" type="main">Orange solid, yield: 73%. 1 H NMR (400 MHz</title>
		<idno>1.7.1. 2-(3-Bromopropyl)-6-(methylamino)-1H-benzo[de]iso- quinoline-1</idno>
		<imprint>
			<biblScope unit="volume">3</biblScope>
		</imprint>
	</monogr>
	<note>H)-dione (7a) CDCl 3 ): d 8.62 (d, J = 6.8 Hz, 1H), 8.53 (d, J = 8.4 Hz, 1H), 8.13 (d, J = 9.2 Hz, 1H), 7.67 (t, J = 8.4 Hz, 1H</note>
</biblStruct>

And from the same PDF, several of them were turned into a figure:

<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_7"><head>5.
					1 .</head><label>1</label><figDesc>Synthesis 5.1.1. General
					procedure for the preparation of 1a-1d6-Bromobenzo[de]-isochromene-1,3-dione
					(1.94 g, 7.00 mmol) was dissolved in 20 ml ethanol. Then corresponding primary
					amine (7.70 mmol) was added, and the mixture was stirred at 60 °C for 5-6 h. The
					mixture was cooled to room temperature and evaporated in vacuum to obtain the
					residue. Then the residue was purified on silica gel chromatography (PE:EA =
					10:1, V/V) to provide 1a-1d.5.1.1.1.
					6-Bromo-2-methyl-1H-benzo[de]isoquinoline-1,3(2H)dione (1a).White solid, yield:
					90%.<ref type="bibr" target="#b11">1</ref> H NMR (400 MHz, CDCl 3 ): d 8.68 (d,
					J = 7.2 Hz, 1H), 8.59 (d, J = 8.4 Hz, 1H), 8.44 (d, J = 8.0 Hz, 1H), 8.06 (d, J
					= 7.6 Hz, 1H), 7.86 (t, J = 8.4 Hz, 1H), 3.57 (s, 3H); MS(ESI) calcd for C 13 H
					9 BrNO 2 [M+H] + 289.0, found: 289.0. 5.1.1.2.
					6-Bromo-2-butyl-1H-benzo[de]isoquinoline-1,3(2H)dione (1b). White solid, yield:
					80%. 1 H NMR (400 MHz, CDCl 3 ) d 8.66 (d, J = 7.2 Hz, 1H), 8.56 (d, J = 8.8 Hz,
					1H), 8.41 (d, J = 8.0 Hz, 1H), 8.04 (d, J = 8.0 Hz, 1H), 7.85 (t, J = 8.0 Hz,
					1H), 4.19 (t, J = 7.6 Hz, 1H), 1.77-1.69 (m, 2H), 1.51-1.42 (m, 2H), 1.00 (t, J
					= 7.2 Hz, 3H); MS(ESI) calcd for C 16 H 15 BrNO 2 [M+H] + 331.0, found: 331.0.
					5.1.1.3. 6-Bromo-2-octyl-1H-benzo[de]isoquinoline-1,3(2H)dione (1c). White
					solid, yield: 85%. 1 H NMR (400 MHz, CDCl 3 ): d 8.68 (d, J = 7.2 Hz, 1H), 8.59
					(d, J = 7.6 Hz, 1H), 8.44 (d, J = 8.0 Hz, 1H), 8.06 (d, J = 8.0 Hz, 1H), 7.87
					(t, J = 7.6 Hz, 1H), 4.18 (t, J = 8.0 Hz, 2H), 1.78-1.71 (m, 2H), 1.47-1.29 (m,
					10H), 0.89 (t, J = 7.2 Hz, 3H); MS(ESI) calcd for C 20 H 23 BrNO 2 [M+H] +
					387.1, found: 387.1. 5.1.1.4.
					6-Bromo-2-dodecyl-1H-benzo[de]isoquinoline-1,3(2H)dione (1d). White solid,
					yield: 85%. 1 H NMR (400 MHz, CDCl 3 ): d 8.68 (d, J = 7.2 Hz, 1H), 8.59 (d, J =
					8.4 Hz, 1H), 8.44 (d, J = 7.6 Hz, 1H), 8.06 (d, J = 8.0 Hz, 1H), 7.87 (t, J =
					7.6 Hz, 1H), 4.18 (t, J = 7.6 Hz, 2H), 1.78-1.71 (m, 2H), 1.47-1.27 (m, 18H),
					0.90 (t, J = 7.2 Hz, 3H); MS(ESI) calcd for C 24 H 31 BrNO 2 [M+H] + 443.1,
					found: 443.2.</figDesc></figure>

I'm not really sure what's causing this, but I thought it would be useful to report it. I've attached the PDF that produced the output above.

1-s2.0-S0968089615005787-main.pdf
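
Until the cause is identified, one possible stopgap is to flag suspicious elements in the TEI output after the fact. The following is only a minimal sketch under stated assumptions, not part of the extraction tool itself: the function name `find_suspect_elements`, the marker patterns, and the length threshold are all hypothetical and were chosen from the two samples above. It scans `<figure>` and `<biblStruct>` elements (with or without the TEI namespace) and reports those whose text looks like experimental data.

```python
import re
import xml.etree.ElementTree as ET

# Assumed markers of experimental-section text (NMR data, coupling
# constants, mass-spec results). Derived from the samples in this report.
CHEM_MARKERS = re.compile(r"\bNMR\b|\bJ\s*=\s*\d+(?:\.\d+)?\s*Hz|\bcalcd for\b")

# Arbitrary threshold: real figure captions are usually short.
MAX_FIGURE_CHARS = 600

# ElementTree exposes xml:id under the reserved XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def local_name(tag):
    """Strip any '{namespace}' prefix from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]


def text_of(elem):
    """Return all text inside elem with whitespace runs collapsed."""
    return " ".join("".join(elem.itertext()).split())


def find_suspect_elements(tei_xml):
    """Return (element name, xml:id, reasons) for elements that look
    like misclassified experimental text."""
    root = ET.fromstring(tei_xml)
    suspects = []
    for elem in root.iter():
        name = local_name(elem.tag)
        if name not in ("figure", "biblStruct"):
            continue
        text = text_of(elem)
        reasons = []
        if CHEM_MARKERS.search(text):
            reasons.append("chemistry-markers")
        if name == "figure" and len(text) > MAX_FIGURE_CHARS:
            reasons.append("long-figure-text")
        if reasons:
            suspects.append((name, elem.get(XML_ID), reasons))
    return suspects
```

Run against the output for the attached PDF, a check like this should flag both `b0` and `fig_7` from the excerpts above, since each contains NMR data. It only detects the problem; the flagged text would still have to be moved back into the body by hand or by a later pipeline step.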
