This project scrapes the latest tech news articles from TechCrunch, categorizes them, and provides visualizations to analyze the data. It includes scraping headlines, categorizing them into predefined categories (such as AI, Blockchain, and 5G), and displaying various graphs like bar charts, pie charts, and word clouds.

Inspiration

The inspiration behind this project came from the need to understand how tech news articles are distributed across categories such as AI, Blockchain, and Cybersecurity. We wanted to explore how data visualization can provide insights into tech trends and news, making it easier for tech enthusiasts and professionals to stay updated.

What it does

  • Web Scraping: Scrapes the latest articles from TechCrunch using requests and BeautifulSoup.
  • Article Categorization: Categorizes articles into predefined categories (e.g., AI, Blockchain, Cybersecurity).
  • Data Visualization:
    • Generates a bar chart showing the number of articles in each category.
    • Creates a pie chart to visualize the proportion of articles in each category.
    • Generates a word cloud to display the most frequent terms in article titles.
  • Data Storage: Saves the scraped articles in a CSV file (tech_news_articles.csv) for future reference.
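
The scraping step above can be sketched roughly as follows. The `"h2 a"` selector and the sample markup are assumptions for illustration; TechCrunch's actual HTML structure changes over time, so the real selectors in the project may differ.

```python
import requests
from bs4 import BeautifulSoup

def extract_headlines(html):
    """Pull article titles out of a TechCrunch-style listing page."""
    soup = BeautifulSoup(html, "html.parser")
    # NOTE: the "h2 a" selector is an assumption; adjust it to the site's real markup.
    return [a.get_text(strip=True) for a in soup.select("h2 a")]

def scrape_techcrunch(url="https://techcrunch.com/"):
    """Fetch the live front page and return its headlines."""
    resp = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=10)
    resp.raise_for_status()
    return extract_headlines(resp.text)

# Offline demo on sample markup (no network needed):
sample = '<h2><a href="/a">OpenAI ships a new model</a></h2><h2><a href="/b">5G rollout expands</a></h2>'
print(extract_headlines(sample))  # ['OpenAI ships a new model', '5G rollout expands']
```

Separating parsing (`extract_headlines`) from fetching (`scrape_techcrunch`) makes the parser testable on saved HTML without hitting the live site.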

How we built it

  1. Web Scraping: We used the requests library to send HTTP requests to TechCrunch and BeautifulSoup to parse the HTML content and extract relevant article details.
  2. Data Handling: The scraped articles are stored in a pandas DataFrame, which makes it easy to manipulate and categorize the data based on predefined keywords.
  3. Data Visualization:
    • We used matplotlib to create bar and pie charts, which are saved as images.
    • A word cloud was generated using the wordcloud library to visualize the most common terms in the article titles.
  4. Saving Data: The data is saved to a CSV file using Python's built-in csv module.
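
Steps 2 and 4 can be sketched like this. The keyword map and sample titles are illustrative, not the project's actual category list:

```python
import pandas as pd

# Illustrative keyword map; the project's real category list may differ.
CATEGORY_KEYWORDS = {
    "AI": ["ai", "machine learning", "openai"],
    "Blockchain": ["blockchain", "crypto", "bitcoin"],
    "Cybersecurity": ["security", "breach", "hack"],
}

def categorize(title):
    """Return the first category whose keywords appear in the title."""
    lowered = title.lower()
    for category, keywords in CATEGORY_KEYWORDS.items():
        if any(kw in lowered for kw in keywords):
            return category
    return "Other"

titles = ["Bitcoin hits a new high", "Startup patches security breach", "Cloud costs keep rising"]
df = pd.DataFrame({"title": titles})
df["category"] = df["title"].apply(categorize)
df.to_csv("tech_news_articles.csv", index=False)
print(df["category"].tolist())  # ['Blockchain', 'Cybersecurity', 'Other']
```

Once titles sit in a DataFrame, `df["category"].value_counts()` gives the per-category counts the bar and pie charts are built from.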

Challenges we ran into

  • Dealing with Dynamic Web Pages: TechCrunch uses some dynamic content loading, which was initially a challenge for scraping. We had to ensure the content we needed was being loaded before scraping.
  • Categorizing Articles: Categorizing articles based on keywords required refining our search logic to ensure that articles were categorized accurately.
  • Visualization Accuracy: Generating meaningful and readable visualizations was tricky at first, especially when article counts varied widely between categories.
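
One common refinement for the keyword-matching problem above is whole-word matching instead of plain substring search; a sketch of that idea (the project's actual logic may differ):

```python
import re

def keyword_match(title, keyword):
    # Whole-word matching avoids false positives such as "said" matching "ai".
    return re.search(rf"\b{re.escape(keyword)}\b", title, re.IGNORECASE) is not None

print(keyword_match("AI startup raises a new round", "ai"))  # True
print(keyword_match("The CEO said revenue doubled", "ai"))   # False
```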

Accomplishments that we're proud of

  • Successfully scraped and categorized the latest tech news articles.
  • Generated accurate and visually appealing data visualizations (bar charts, pie charts, and word clouds).
  • Built a functional project that can be reused and extended to analyze articles from other websites or sources.
  • Learned to work with multiple libraries (requests, BeautifulSoup, pandas, matplotlib, wordcloud) to create a cohesive project.

What we learned

  • Web Scraping: Gained hands-on experience with requests and BeautifulSoup for scraping dynamic websites.
  • Data Categorization: Learned how to categorize data based on keywords and refine the process to improve accuracy.
  • Data Visualization: Gained insights into creating various types of charts and visualizations, which helped in presenting data in a clear and meaningful way.
  • Data Handling: Learned how to handle and manipulate data efficiently using pandas.
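
The bar and pie charts described above can be produced with a few lines of matplotlib; the category counts below are made up for illustration:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend: render charts to files, no display needed
import matplotlib.pyplot as plt
import pandas as pd

# Illustrative counts; in the project these come from the categorized DataFrame.
counts = pd.Series({"AI": 12, "Blockchain": 5, "Cybersecurity": 8})

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
counts.plot.bar(ax=ax1, title="Articles per category")
counts.plot.pie(ax=ax2, autopct="%1.0f%%", title="Category share")
fig.tight_layout()
fig.savefig("category_charts.png")
```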

What's next for Libraries

  • Expand to Other News Sources: We plan to extend this project by scraping news from other tech websites like Wired, The Verge, and Ars Technica.
  • More Visualization: We plan to create more visualizations, such as trends over time, word frequency analysis, and more advanced charts.
  • Improve Categorization: Use Natural Language Processing (NLP) techniques to improve the categorization accuracy of the articles.
  • Real-Time Scraping: Implement a real-time scraping system that collects and analyzes articles regularly.

Built With

  • beautifulsoup4
  • matplotlib
  • pandas
  • python
  • requests
  • wordcloud
