newspaper Is this project still being maintained?

Mar 23 '21 02:03 lodenrogue

ref: https://github.com/codelucas/newspaper/issues/813

The owner of this project hasn't responded to any inquiries about its status since June 2020. The project likely needs to be forked and updated, because the last published update by @codelucas was on Jun 13, 2017.

Mar 23 '21 12:03 johnbumgarner

The owner did an interview on a podcast in September where he expressed interest in continuing to maintain the library, but said he was having trouble keeping up with it (if memory serves).

The interview: https://www.pythonpodcast.com/newspaper-data-extraction-episode-280/

Mar 31 '21 01:03 ghost

If this project were to be forked & updated, what suggestions do you have for updates needed? @johnbumgarner

Apr 09 '21 14:04 planktonrobo

Has anyone tried to reach out to the developer yet? I may reach out offering support. The biggest thing this project needs in my opinion is more transparent and direct access to the cached articles. If there are methods to access the cache, I have not found them yet.

Apr 09 '21 17:04 ghost

Yes. Reference: https://github.com/codelucas/newspaper/issues/813

Has anyone tried to reach out to the developer yet? I may reach out offering support. The biggest thing this project needs in my opinion is more transparent and direct access to the cached articles. If there are methods to access the cache, I have not found them yet.

Apr 09 '21 22:04 johnbumgarner

If this project were to be forked & updated, what suggestions do you have for updates needed? @johnbumgarner

Based on some of the past issues, the extraction piece of this module would require the most changes. After that, likely the NLP piece of the code.

Apr 10 '21 17:04 johnbumgarner

Shall we fork it once and for all and work on it? It seems a lot of time has passed since the last conversation about this.

Apr 23 '21 10:04 AlviseSembenico

Yes please

Apr 24 '21 14:04 lodenrogue

Shall we fork it once and for all and work on it? It seems a lot of time has passed since the last conversation about this.

@AlviseSembenico most likely, because the module's creator won't respond to emails about the status of the code base. The question is how much to keep and how much to redesign from scratch. The rule-based extraction is still useful, but it might be better to rebuild it around some type of machine learning technique that can "guess at a page's structure and tags." I have started doing research into that, but I'm not an expert on ML or modeling.

I have also been exploring all the issues with the current version by reading all the pull requests and open/closed issues.

Apr 24 '21 15:04 johnbumgarner

@johnbumgarner Would love to contribute to that

May 05 '21 05:05 RaedShabbir

@johnbumgarner That's a good point. Let's bear in mind that a "fast" version should remain available, since some of the use cases require speed and might run on less powerful computers. I have an ML background, so I can do research. Have you already checked whether there is a project going in that direction?

May 05 '21 10:05 AlviseSembenico

@AlviseSembenico
The best I've found is https://github.com/fhamborg/news-please

They recently released a paper with a non-transformer-based model: https://github.com/fhamborg/NewsMTSC

It would be great to see a version of that library powered by Hugging Face!

May 05 '21 15:05 RaedShabbir

@RaedShabbir I worked with news-please and it is a great project. However, it uses Newspaper and other heuristics under the hood, so it is not a radical change of paradigm.

May 07 '21 12:05 AlviseSembenico

Hello! I recently stumbled upon this repo. Despite not being maintained anymore, how reliable would you say this project is? And is news-please any more reliable? If not, has anyone made an updated fork?

May 22 '21 20:05 edvilme

And is news-please any more reliable?

news-please depends on newspaper3k, so it cannot be considered more reliable. news-please is, however, an active project. We are better off getting in touch with the news-please maintainers. newspaper3k could potentially be made an optional dependency and replaced by another extractor.
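
To illustrate the optional-dependency idea, here is a minimal sketch. It is not how news-please is actually structured. The names `get_extractor`, `fallback_extract`, and `newspaper_extract` are hypothetical, and the fallback is a deliberately naive `<p>`-tag collector standing in for "another extractor". The newspaper3k calls (`Article`, `download(input_html=...)`, `parse()`, `.text`) reflect my understanding of that library's API, so treat them as an assumption:

```python
from html.parser import HTMLParser
from typing import Callable


class _ParagraphParser(HTMLParser):
    """Collects the text found inside <p> tags (naive stand-in extractor)."""

    def __init__(self) -> None:
        super().__init__()
        self._in_paragraph = False
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_paragraph = True
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self._in_paragraph = False

    def handle_data(self, data):
        # Text inside inline children such as <b> still belongs to the <p>.
        if self._in_paragraph:
            self.paragraphs[-1] += data


def fallback_extract(html: str) -> str:
    """Hypothetical fallback: join the text of all non-empty <p> elements."""
    parser = _ParagraphParser()
    parser.feed(html)
    parser.close()
    return "\n".join(p.strip() for p in parser.paragraphs if p.strip())


def newspaper_extract(html: str) -> str:
    """Hypothetical wrapper around newspaper3k, imported only when called."""
    from newspaper import Article  # optional dependency

    article = Article(url="")
    article.download(input_html=html)  # assumed: uses the given HTML, no fetch
    article.parse()
    return article.text


def get_extractor(name: str = "auto") -> Callable[[str], str]:
    """Return newspaper3k's extractor if importable, else the fallback."""
    if name in ("auto", "newspaper"):
        try:
            import newspaper  # noqa: F401  (only checking availability)
            return newspaper_extract
        except ImportError:
            if name == "newspaper":
                raise  # the caller explicitly asked for newspaper3k
    return fallback_extract
```

With this shape, callers would only ever use `get_extractor()(html)`, so swapping newspaper3k for a different backend would not touch the calling code.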

Jun 06 '23 19:06 mxdev88