GNE is a general-purpose news content extractor built in Python. It is based on the algorithm described in the paper "Web Content Extraction Based on Text and Symbol Density".
The algorithm is clean, logical, and effective. Since the paper only describes the algorithm without a concrete implementation, this project implements it in Python. It has been tested on major Chinese news sites (Toutiao, NetEase News, Sina News, iFeng, Tencent News, ReadHub, etc.) with nearly 100% accuracy.
Beyond the body text extraction described in the paper, GNE also supports automatic detection and extraction of title, publish time, and author.
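To give a feel for the text-density idea, here is a minimal sketch. It is neither GNE's code nor the paper's full scoring, which also takes symbol (punctuation) density into account. It assumes a simplified score of non-link characters per non-link tag, and it uses the standard library's xml.etree.ElementTree on a small well-formed snippet. The function name text_density and the sample markup are illustrative assumptions.

```python
import xml.etree.ElementTree as ET


def text_density(node):
    """Simplified text density: non-link characters per non-link tag."""
    # Count visible characters, ignoring whitespace-only text nodes.
    text_len = sum(len(t.strip()) for t in node.itertext())
    # Characters that sit inside <a> elements are treated as navigation noise.
    link_text_len = sum(
        len(t.strip()) for a in node.iter("a") for t in a.itertext()
    )
    tag_count = sum(1 for _ in node.iter())          # includes the node itself
    link_tag_count = sum(1 for _ in node.iter("a"))
    denominator = max(tag_count - link_tag_count, 1)  # avoid division by zero
    return (text_len - link_text_len) / denominator


# Hypothetical page fragment: a link-heavy navigation bar and a text-heavy article.
SAMPLE = """
<body>
  <div id="nav"><a href="/">Home</a><a href="/news">News</a><a href="/about">About</a></div>
  <div id="article"><p>First paragraph of the story, with plenty of plain text.</p><p>Second paragraph, also mostly text.</p></div>
</body>
""".strip()

root = ET.fromstring(SAMPLE)
scores = {div.get("id"): text_density(div) for div in root.findall("div")}
best = max(scores, key=scores.get)
print(best, scores)
```

In this toy input the navigation block scores zero because all of its text sits inside links, so the article block is picked as the body candidate.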
This project is named "Extractor" rather than "Crawler" by design — the input is HTML, and the output is a dictionary. You are responsible for obtaining the HTML of target pages using your own methods.
This project does not and will not provide any functionality to actively request HTML from websites.
You can try GNE online at http://gne.kingname.info/. Simply paste the rendered HTML into the text area and click the extract button. For more precise extraction, additional parameters can be provided. See the API documentation for details.
# Install via pip
pip install --upgrade gne
# Or install via pipenv
pipenv install gne
from gne import GeneralNewsExtractor
html = '''Your rendered HTML code'''
extractor = GeneralNewsExtractor()
result = extractor.extract(html, noise_node_list=['//div[@class="comment-list"]'])
print(result)
# Output:
# {"title": "xxxx", "publish_time": "2019-09-10 11:12:13", "author": "yyy", "content": "zzzz", "images": ["/xxx.jpg", "/yyy.png"]}
For more details, see the GNE documentation.
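The sample output above lists images as site-relative paths. If you need absolute URLs, one option is a small post-processing step such as the sketch below. It is independent of GNE and relies only on urllib.parse.urljoin; the helper name absolutize_images and the page URL are assumptions made for illustration.

```python
from urllib.parse import urljoin


def absolutize_images(result, page_url):
    """Return a copy of an extraction result with image paths made absolute."""
    resolved = dict(result)
    resolved["images"] = [
        urljoin(page_url, src) for src in result.get("images", [])
    ]
    return resolved


# Same shape as the sample output shown above.
sample_result = {
    "title": "xxxx",
    "publish_time": "2019-09-10 11:12:13",
    "author": "yyy",
    "content": "zzzz",
    "images": ["/xxx.jpg", "/yyy.png"],
}
page_url = "https://news.example.com/2019/09/10/story.html"  # hypothetical URL

fixed = absolutize_images(sample_result, page_url)
print(fixed["images"])
```

The original dictionary is left untouched, so the raw extraction result can still be stored alongside the resolved one.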
from gne import ListPageExtractor
html = '''Your rendered HTML code'''
list_extractor = ListPageExtractor()
result = list_extractor.extract(html, feature='XPath of any element in the list')
print(result)
If automatic title extraction fails, you can specify a custom XPath:
from gne import GeneralNewsExtractor
extractor = GeneralNewsExtractor()
html = 'Your target page HTML'
result = extractor.extract(html, title_xpath='//h5/text()')
print(result)
Some news pages contain comments that may look more like body text than the actual article. Use the noise_node_list parameter to remove interfering elements before extraction:
result = extractor.extract(html, noise_node_list=['//div[@class="comment-list"]'])
noise_node_list accepts a list of XPath expressions, each targeting an element to be removed during preprocessing.
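Conceptually, removing noise nodes means deleting every matched element from the tree before any scoring happens. The sketch below illustrates that idea with the standard library only; it is not GNE's implementation. Note that xml.etree.ElementTree supports only a subset of XPath and requires a relative path (.//div[...]), whereas the GNE call above takes the //div[...] form. The helper name remove_nodes and the sample markup are assumptions.

```python
import xml.etree.ElementTree as ET


def remove_nodes(root, xpath_list):
    """Delete every element matched by any XPath in xpath_list (in place)."""
    for xpath in xpath_list:
        # ElementTree has no parent pointers, so rebuild a child -> parent
        # map before each pass.
        parent_of = {child: parent for parent in root.iter() for child in parent}
        for node in root.findall(xpath):
            parent = parent_of.get(node)
            if parent is not None:
                parent.remove(node)
    return root


# Hypothetical page: a short article followed by a long comment block.
SAMPLE = """<html><body>
<div class="article"><p>Article body text.</p></div>
<div class="comment-list"><p>A long reader comment that could be mistaken for the body.</p></div>
</body></html>"""

root = ET.fromstring(SAMPLE)
remove_nodes(root, ['.//div[@class="comment-list"]'])
remaining_text = " ".join(t.strip() for t in root.itertext() if t.strip())
print(remaining_text)
```

After the pass, only the article text remains in the tree, so a later density-based step can no longer mistake the comment block for the body.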
git clone https://github.com/kingname/GeneralNewsExtractor.git
cd GeneralNewsExtractor
pipenv install
pipenv shell
python3 example.py
- The example.py file provides basic usage examples.
- Test code is located in the tests directory.
- The input HTML must be JavaScript-rendered HTML, not raw page source. This means GNE works with both server-side rendered and Ajax-loaded content.
- To manually test a new page, open it in Chrome, go to Developer Tools, locate the <html> tag in the Elements tab, right-click it and select Copy > Copy outerHTML.
- You can also use Puppeteer/Pyppeteer, Selenium, or any other method to obtain the JavaScript-rendered source code.
- List page extraction is an experimental feature and should not be used in production. You can use Chrome DevTools' Copy XPath to copy the XPath of any item in the list. GNE will automatically find other items in the same list.
- GNE is designed for news article pages. It may not work well on non-news pages or photo gallery articles.
- The author field may be empty if the article does not specify an author or if the author pattern is not covered by the existing regular expressions.
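Since GNE expects rendered HTML, one way to obtain it is with Selenium, as mentioned in the notes above. The following is a minimal sketch, assuming Selenium 4 and a local Chrome installation. The function names fetch_rendered_html and extract_from_url, the fixed sleep, and the wait_seconds parameter are illustrative choices, not part of GNE.

```python
import time


def fetch_rendered_html(url, wait_seconds=3.0):
    """Load url in headless Chrome and return the DOM after scripts have run."""
    # Imported lazily so this module can be imported without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    options.add_argument("--headless=new")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        time.sleep(wait_seconds)  # crude wait for Ajax content; tune per site
        # Equivalent to "Copy outerHTML" on the <html> element in DevTools.
        return driver.execute_script("return document.documentElement.outerHTML")
    finally:
        driver.quit()


def extract_from_url(url):
    """Fetch the rendered page, then hand the HTML to GNE."""
    from gne import GeneralNewsExtractor

    html = fetch_rendered_html(url)
    return GeneralNewsExtractor().extract(html)
```

A fixed sleep is the simplest possible wait. For pages that load content late, an explicit wait on a known element is usually more reliable.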
- Use a configuration file for constants instead of hard-coding them.
- Allow custom patterns for time and author extraction.
- News article list page extraction.
- Support multi-page articles by accepting a list of HTMLs and concatenating the extracted content.
- Optimize extraction speed.
- Test on more news websites.
- ...
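Until multi-page support exists, the concatenation described in the list above can be approximated in user code. This is a hypothetical sketch, not a GNE feature: extract_multipage is an invented helper that takes any callable returning a result dictionary in the shape shown earlier, and it assumes the first page carries the title, author, and publish time.

```python
def extract_multipage(pages, extract):
    """Run `extract` on each page's HTML and merge the results into one dict."""
    results = [extract(html) for html in pages]
    if not results:
        return {}
    merged = dict(results[0])  # title, author, publish_time come from page 1
    merged["content"] = "\n".join(r.get("content", "") for r in results)
    merged["images"] = [img for r in results for img in r.get("images", [])]
    return merged


# Stand-in extractor so the sketch runs without GNE installed.
def fake_extract(html):
    return {"title": "t", "content": html.upper(), "images": [html + ".jpg"]}


merged = extract_multipage(["a", "b"], fake_extract)
print(merged)

# With GNE installed, the real extractor could be passed instead:
#   from gne import GeneralNewsExtractor
#   merged = extract_multipage(list_of_html, GeneralNewsExtractor().extract)
```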
- WeChat: Add the author mekingname and mention "GNE" to join the group.
- Telegram: https://t.me/joinchat/Bc5swww_XnVR7pEtDUl1vw
@bigbrother666sh








