Is this project still being maintained?
ref: https://github.com/codelucas/newspaper/issues/813
The owner of this project hasn't responded to any inquiries about its status since June 2020. The project likely needs to be forked and updated, because the last published update by @codelucas was on Jun 13, 2017.
The owner did an interview on a podcast in September where he expressed interest in continuing to maintain the library but said he was having trouble keeping up with it (if my memory serves me).
The interview: https://www.pythonpodcast.com/newspaper-data-extraction-episode-280/
If this project were to be forked & updated, what suggestions do you have for updates needed? @johnbumgarner
Has anyone tried to reach out to the developer yet? I may reach out offering support. The biggest thing this project needs in my opinion is more transparent and direct access to the cached articles. If there are methods to access the cache, I have not found them yet.
Yes. Reference: https://github.com/codelucas/newspaper/issues/813
If this project were to be forked & updated, what suggestions do you have for updates needed? @johnbumgarner
Based on some of the past issues, the extraction piece of this module would require the most changes. After that, likely the NLP piece of the code.
Shall we fork it once and for all and work on it? It seems a lot of time has passed since the last conversation about this.
Yes please
@AlviseSembenico most likely, because the module's creator won't respond to emails about the status of the code base. The question is how much to keep and how much to redesign from scratch. The rule-based extraction is still useful, but it might be better to rebuild it using some type of machine learning technique that can "guess at a page's structure and tags." I have started doing research into that, but I'm not an expert on ML or modeling.
I have also been exploring all the issues with the current version by reading all the pull requests and open/closed issues.
@johnbumgarner Would love to contribute on that
@johnbumgarner That's a good point. Let's bear in mind that a "fast" version should remain available, since some use cases require speed and might run on less powerful machines. I have an ML background, so I can do research. Have you already checked whether there is a project going in that direction?
@AlviseSembenico
The best I've found is https://github.com/fhamborg/news-please
They recently released a paper with a non-transformer-based model: https://github.com/fhamborg/NewsMTSC
It would be great to see a version of that library powered by Hugging Face!
@RaedShabbir I worked with news-please and it is a great project. However, it uses Newspaper and other heuristics under the hood, so it is not a radical change in paradigm.
Hello! I recently stumbled upon this repo. Despite not being maintained anymore, how reliable would you say this project is? And is news-please any more reliable? If not, has anyone made an updated fork?
news-please depends on newspaper3k, so it cannot be considered more reliable. news-please, however, is an active project. We are better off getting in touch with the news-please maintainer. newspaper3k could potentially be made an optional dependency and replaced by another extractor.
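To make the "optional dependency" idea concrete, here is a rough sketch of how a downstream project could treat newspaper3k as one extractor among several. This is not how news-please is actually structured; the function names (`extract`, `extract_with_stdlib`, `extract_with_newspaper`, `newspaper_available`) are hypothetical, and the stdlib fallback is a deliberately naive stand-in for "another extractor". The only newspaper3k calls used are `Article(url)`, `download(input_html=...)`, `parse()`, and the `title` / `text` attributes, which exist in 0.2.8 as far as I know.

```python
import importlib.util
import re
from html.parser import HTMLParser


class _ParagraphParser(HTMLParser):
    """Collects the <title> text and the text of every <p> element."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.paragraphs = []
        self._in_title = False
        self._in_p = False
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "p":
            self._in_p = True
            self._buf = []

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "p" and self._in_p:
            self._in_p = False
            # Collapse runs of whitespace inside the paragraph.
            text = re.sub(r"\s+", " ", "".join(self._buf)).strip()
            if text:
                self.paragraphs.append(text)

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif self._in_p:
            self._buf.append(data)


def extract_with_stdlib(html, url=""):
    """Naive fallback: page title plus all <p> text, no heuristics."""
    parser = _ParagraphParser()
    parser.feed(html)
    parser.close()
    return {"title": parser.title.strip(), "text": "\n\n".join(parser.paragraphs)}


def extract_with_newspaper(html, url=""):
    """Uses newspaper3k on already-downloaded HTML (no network access)."""
    # Imported lazily so this module still imports without newspaper3k.
    from newspaper import Article

    article = Article(url)
    article.download(input_html=html)  # assumption: supported in 0.2.8
    article.parse()
    return {"title": article.title, "text": article.text}


def newspaper_available():
    """True if the optional newspaper3k package can be imported."""
    return importlib.util.find_spec("newspaper") is not None


def extract(html, url="", prefer_newspaper=True):
    """Pick newspaper3k when installed and preferred, else the fallback."""
    if prefer_newspaper and newspaper_available():
        return extract_with_newspaper(html, url)
    return extract_with_stdlib(html, url)
```

Calling `extract(html, prefer_newspaper=False)` forces the fallback, so swapping in a different extractor later only means replacing `extract_with_stdlib`. On the packaging side, newspaper3k would then move from a hard requirement to an optional extra.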