This project scrapes the latest tech news articles from TechCrunch, categorizes them, and provides visualizations to analyze the data. It includes scraping headlines, categorizing them based on predefined categories (such as AI, Blockchain, 5G, etc.), and displaying various graphs like bar charts, pie charts, and word clouds.
Inspiration
The inspiration behind this project came from the need to understand the distribution of tech news articles across different categories such as AI, Blockchain, Cybersecurity, etc. We wanted to explore how data visualization can provide insights into tech trends and news, making it easier for tech enthusiasts and professionals to stay updated.
What it does
- Web Scraping: Scrapes the latest articles from TechCrunch using `requests` and `BeautifulSoup`.
- Article Categorization: Categorizes articles into predefined categories (e.g., AI, Blockchain, Cybersecurity).
- Data Visualization:
- Generates a bar chart showing the number of articles in each category.
- Creates a pie chart to visualize the proportion of articles in each category.
- Generates a word cloud to display the most frequent terms in article titles.
- Data Storage: Saves the scraped articles in a CSV file (`tech_news_articles.csv`) for future reference.
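The scraping step above can be sketched as follows. The CSS selectors and the HTML snippet are illustrative assumptions, not TechCrunch's actual markup (which changes over time); parsing a fixed snippet keeps the sketch self-contained, and a live run would fetch the page with `requests` first.

```python
# Minimal sketch of headline extraction with BeautifulSoup.
# The "post-title" class is a made-up placeholder, not TechCrunch's real markup.
from bs4 import BeautifulSoup

SAMPLE_HTML = """
<div class="river">
  <h2 class="post-title"><a href="/ai-article">New AI model released</a></h2>
  <h2 class="post-title"><a href="/chain-article">Blockchain startup funded</a></h2>
</div>
"""

def extract_articles(html: str) -> list[dict]:
    """Parse article titles and links from a listing page."""
    soup = BeautifulSoup(html, "html.parser")
    articles = []
    for link in soup.select("h2.post-title a"):
        articles.append({"title": link.get_text(strip=True),
                         "url": link["href"]})
    return articles

articles = extract_articles(SAMPLE_HTML)
```

In a live run, `html` would come from `requests.get(url).text` instead of the hardcoded sample.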
How we built it
- Web Scraping: We used the `requests` library to send HTTP requests to TechCrunch and `BeautifulSoup` to parse the HTML content and extract relevant article details.
- Data Handling: The scraped articles are stored in a `pandas` DataFrame, which makes it easy to manipulate and categorize the data based on predefined keywords.
- Data Visualization:
  - We used `matplotlib` to create bar and pie charts, which are saved as images.
  - A word cloud was generated using the `wordcloud` library to visualize the most common terms in the article titles.
- Saving Data: The data is saved to a CSV file using Python's built-in `csv` module.
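The categorization and saving steps can be sketched with plain keyword matching and the `csv` module. The keyword lists here are illustrative placeholders, not the project's actual ones, and the naive substring match is exactly the kind of logic the categorization step has to refine.

```python
import csv

# Illustrative keyword map; the project's real category lists may differ.
CATEGORY_KEYWORDS = {
    "AI": ["ai", "machine learning", "neural"],
    "Blockchain": ["blockchain", "crypto"],
    "Cybersecurity": ["security", "breach", "hack"],
}

def categorize(title: str) -> str:
    """Assign the first category whose keyword appears in the title.
    Naive substring matching; real use needs word boundaries ("ai" in "rain")."""
    lowered = title.lower()
    for category, keywords in CATEGORY_KEYWORDS.items():
        if any(kw in lowered for kw in keywords):
            return category
    return "Other"

def save_articles(articles, path="tech_news_articles.csv"):
    """Write the categorized articles out with the stdlib csv module."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["title", "category"])
        writer.writeheader()
        writer.writerows(articles)

titles = ["New AI model released", "Major data breach reported", "Quantum leap"]
articles = [{"title": t, "category": categorize(t)} for t in titles]
```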
Challenges we ran into
- Dealing with Dynamic Web Pages: TechCrunch uses some dynamic content loading, which was initially a challenge for scraping. We had to ensure the content we needed was being loaded before scraping.
- Categorizing Articles: Categorizing articles based on keywords required refining our search logic to ensure that articles were categorized accurately.
- Visualization Accuracy: Generating meaningful and readable visualizations was tricky at first, especially when it came to handling data with lots of variations in article counts.
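Two small tweaks that help with uneven article counts, sorting the bars and rotating the labels, can be sketched as below. The counts are made-up sample data, and the `Agg` backend lets the chart render straight to a file without a display.

```python
# Sketch of the bar-chart step with readability fixes for skewed counts.
# The category counts below are made-up sample data.
import matplotlib
matplotlib.use("Agg")  # render without a display, straight to a file
import matplotlib.pyplot as plt

counts = {"5G": 1, "AI": 14, "Blockchain": 3, "Cybersecurity": 6}
# Sort categories by count so the tallest bar comes first.
ordered = dict(sorted(counts.items(), key=lambda kv: kv[1], reverse=True))

fig, ax = plt.subplots(figsize=(6, 4))
ax.bar(ordered.keys(), ordered.values(), color="steelblue")
ax.set_ylabel("Number of articles")
ax.set_title("Articles per category")
plt.xticks(rotation=45, ha="right")  # keep long category names legible
plt.tight_layout()
fig.savefig("category_bar_chart.png")
plt.close(fig)
```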
Accomplishments that we're proud of
- Successfully scraped and categorized the latest tech news articles.
- Generated accurate and visually appealing data visualizations (bar charts, pie charts, and word clouds).
- Built a functional project that can be reused and extended to analyze articles from other websites or sources.
- Learned to work with multiple libraries (requests, BeautifulSoup, pandas, matplotlib, wordcloud) to create a cohesive project.
What we learned
- Web Scraping: Gained hands-on experience with `requests` and `BeautifulSoup` for scraping dynamic websites.
- Data Categorization: Learned how to categorize data based on keywords and refine the process to improve accuracy.
- Data Visualization: Gained insights into creating various types of charts and visualizations, which helped in presenting data in a clear and meaningful way.
- Data Handling: Learned how to handle and manipulate data efficiently using `pandas`.
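A sketch of that pandas step: once articles sit in a DataFrame, the per-category counts the charts need are one method call away. The rows below are sample data, not real scraped articles.

```python
import pandas as pd

# Sample rows standing in for the scraped, categorized articles.
df = pd.DataFrame({
    "title": ["New AI model released", "Crypto exchange audited",
              "Data breach at retailer", "Another AI benchmark"],
    "category": ["AI", "Blockchain", "Cybersecurity", "AI"],
})

# Series mapping category -> article count, ready to feed a bar or pie chart.
counts = df["category"].value_counts()
```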
What's next for Libraries
- Expand to Other News Sources: We plan to extend this project by scraping news from other tech websites like Wired, The Verge, and Ars Technica.
- More Visualization: We plan to create more visualizations, such as trends over time, word frequency analysis, and more advanced charts.
- Improve Categorization: Use Natural Language Processing (NLP) techniques to improve the categorization accuracy of the articles.
- Real-Time Scraping: Implement a real-time scraping system that collects and analyzes articles regularly.
Built With
- beautiful-soup
- csv
- matplotlib
- pandas
- requests
- techcrunch
- wordcloud