[ENH] - Update words collection & processing #49

TomDonoghue · 2021-01-31T23:20:44Z

This PR does some updates on words collection & processing:

- fixes how IDs are extracted, so that IDs listed in references are not accidentally collected
- updates the collection of year data, to cover more of the ways it may be encoded
- drops the nltk dependency, by adding stopwords and tokenizing functionality to the module
- updates the code to cleanly handle the situation in which no authors are found
- refactors and cleans up the code, including optimizing words processing
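
As a rough illustration of the nltk replacement described above, here is a minimal sketch using only the standard library. The names `STOPWORDS`, `tokenize`, `process_words`, and `count_words` are hypothetical and are not taken from this PR, the stopword list is a tiny stand-in for a packaged list, and the regex tokenizer is an assumption about what a simple in-module tokenizer might look like.

```python
import re
import string
from collections import Counter

# Hypothetical, abbreviated stopword list; a packaged list would be much longer
STOPWORDS = frozenset({'a', 'an', 'and', 'the', 'of', 'in', 'to', 'is'})
PUNCTUATION = frozenset(string.punctuation)


def tokenize(text):
    """Split text into tokens: runs of word characters, or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def process_words(text):
    """Lowercase and tokenize text, dropping stopwords and punctuation tokens."""
    tokens = tokenize(text.lower())
    return [tok for tok in tokens if tok not in STOPWORDS and tok not in PUNCTUATION]


def count_words(words):
    """Count word occurrences with a Counter (standing in for nltk's FreqDist)."""
    return Counter(words)


words = process_words("The brain, and the mind of the brain.")
counts = count_words(words)
print(counts.most_common(2))
```

`Counter` covers the common frequency operations (`most_common`, item lookup, addition of counts), which is likely why it can stand in for `FreqDist` in simple word counting.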

TomDonoghue added 13 commits January 31, 2021 16:00

refactor part of words extraction

f6f365f

rename extract -> get_info

496a3cf

add extract_tag

c59f675

make test tag object

f0ca904

update to extract ref list

60dd72f
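
To illustrate the kind of problem the ID extraction fix addresses: in PubMed XML, an `ArticleIdList` can appear both for the article itself and inside each entry of its `ReferenceList`, so a search over all descendants picks up the IDs of cited papers too. The sketch below uses `xml.etree.ElementTree` and a made-up record; the function names are hypothetical, and this is not necessarily the parser or approach used in this PR.

```python
import xml.etree.ElementTree as ET

# Made-up, minimal record for illustration only
RECORD = """
<PubmedArticle>
  <PubmedData>
    <ArticleIdList>
      <ArticleId IdType="pubmed">111</ArticleId>
      <ArticleId IdType="doi">10.1000/example</ArticleId>
    </ArticleIdList>
    <ReferenceList>
      <Reference>
        <ArticleIdList>
          <ArticleId IdType="pubmed">222</ArticleId>
        </ArticleIdList>
      </Reference>
    </ReferenceList>
  </PubmedData>
</PubmedArticle>
""".strip()


def get_article_ids(article):
    """Collect the article's own IDs, keyed by ID type.

    The path is anchored at the ArticleIdList directly under PubmedData,
    so IDs nested inside the reference list are not matched.
    """
    ids = article.findall("./PubmedData/ArticleIdList/ArticleId")
    return {el.get("IdType"): el.text for el in ids}


def get_reference_ids(article):
    """Collect the PubMed IDs of the article's listed references."""
    path = "./PubmedData/ReferenceList/Reference/ArticleIdList/ArticleId"
    return [el.text for el in article.findall(path) if el.get("IdType") == "pubmed"]


article = ET.fromstring(RECORD)
all_ids = [el.text for el in article.iter("ArticleId")]  # naive: includes reference IDs
own_ids = get_article_ids(article)
ref_ids = get_reference_ids(article)
```

Here the naive descendant search returns three IDs, while the anchored path returns only the two that belong to the article.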

update to also grab medline date info

fac2908
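
As a sketch of why more than one date encoding needs handling: PubMed records may give the publication date as a `Year` element, or as a free-text `MedlineDate` string such as "1998 Dec-1999 Jan". The hypothetical `get_year` helper below falls back to the first four-digit number in `MedlineDate`; it assumes `xml.etree.ElementTree` elements and is not the implementation from this PR.

```python
import re
import xml.etree.ElementTree as ET


def get_year(pub_date):
    """Return the publication year as an int from a PubDate element, or None."""
    year = pub_date.findtext("Year")
    if not year:
        # Fall back to the free-text MedlineDate field, taking the first 4-digit number
        medline = pub_date.findtext("MedlineDate", default="")
        match = re.search(r"\b(\d{4})\b", medline)
        year = match.group(1) if match else None
    return int(year) if year else None


with_year = ET.fromstring("<PubDate><Year>2012</Year><Month>Jan</Month></PubDate>")
with_medline = ET.fromstring("<PubDate><MedlineDate>1998 Dec-1999 Jan</MedlineDate></PubDate>")
without_date = ET.fromstring("<PubDate></PubDate>")

print(get_year(with_year), get_year(with_medline), get_year(without_date))
```

Returning `None` when no year can be found leaves it to the caller to decide how to treat records without a date.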

update to handle no authors

ce27056

speedup word processing

8ab7a59

add test for drop_none

fdc5404

add packaged stopwords list

2d6fb36

clean up update impl

b8456b5

update approach for dropping punctuation

34c1800

drop nltk FreqDist -> Counter

2012ebc

TomDonoghue changed the title ~~[MNT] - Fix up some words collection & processing~~ [ENH - Fix up some words collection & processing Feb 1, 2021

TomDonoghue mentioned this pull request Feb 1, 2021

NLP dependency #2

Closed

TomDonoghue added 3 commits February 1, 2021 01:32

add and use own tokenizer

db4bc51

drop downloads of nltk files - no longer needed

40e51d5

drop nltk dependency

6a4f0cb

TomDonoghue changed the title ~~[ENH - Fix up some words collection & processing~~ [ENH] - Update words collection & processing Feb 1, 2021

lisc-tools deleted a comment from codecov-io Feb 1, 2021

clean ups

5bc96ee

TomDonoghue merged commit a7bbe07 into main Feb 1, 2021

TomDonoghue deleted the words branch February 1, 2021 17:18
