A multi-stage pipeline built to support the Catholic Leadership Institute (CLI) in extracting and analyzing parish-related data from diocesan and parish websites across the United States. The project enables data-driven insights into parish operations, sacramental schedules, and engagement activity.
This project automates the scraping of parish names, website URLs, sacramental schedules, and contact information using web scraping and AI tools. Cleaned and structured outputs are integrated into a Power BI dashboard for further analysis. The workflow consists of three core stages:
- Web scraping diocesan directories and parish websites
- AI-based structured data extraction and cleaning
- Visualization in a Power BI dashboard
This scraper collects parish names and website URLs from diocesan directories using Playwright and BeautifulSoup. It supports multiple dioceses and handles dynamic content and pagination.
Key Features:
- Scrapes multiple diocesan directories
- Handles JavaScript-heavy pages
- Outputs: parish_results.csv
Setup:
pip install playwright beautifulsoup4
playwright install
Usage:
python parish_scraper.py
Output:
parish_results.csv with columns: Parish Name, Website URL
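To make the output format concrete, here is a minimal sketch of the extraction and CSV-writing step. It is not the code in parish_scraper.py: it uses Python's standard-library HTMLParser as a stand-in for BeautifulSoup so it runs without extra installs, and it skips the Playwright page-loading step entirely. The sample markup, the example.org URLs, and the names ParishLinkParser, extract_parishes, and write_results are all hypothetical.

```python
import csv
import io
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical directory markup; real diocesan pages will differ.
SAMPLE_HTML = """
<ul class="parishes">
  <li><a href="https://stmary.example.org">St. Mary Parish</a></li>
  <li><a href="/parishes/st-joseph">St. Joseph Parish</a></li>
  <li><a href="https://stmary.example.org">St. Mary Parish</a></li>
</ul>
"""


class ParishLinkParser(HTMLParser):
    """Collects (link text, absolute URL) pairs from anchor tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.rows = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside an anchor with an href.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            name = "".join(self._text).strip()
            if name:
                # Resolve relative links against the directory URL.
                self.rows.append((name, urljoin(self.base_url, self._href)))
            self._href = None


def extract_parishes(html, base_url):
    parser = ParishLinkParser(base_url)
    parser.feed(html)
    # Drop exact duplicates while keeping first-seen order.
    return list(dict.fromkeys(parser.rows))


def write_results(rows, handle):
    writer = csv.writer(handle)
    writer.writerow(["Parish Name", "Website URL"])
    writer.writerows(rows)


rows = extract_parishes(SAMPLE_HTML, "https://diocese.example.org/directory")
buffer = io.StringIO()
write_results(rows, buffer)
print(buffer.getvalue())
```

In a real run, the HTML would come from a page rendered by Playwright, and the selector logic would need to be adapted to each diocese's directory layout.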
The AI-enhanced pipeline processes scraped parish websites using ScrapeGraphAI and multiple scripts to produce structured data tables used for Power BI visualization.
Steps:
- parish_results.csv → parish_results_cleaner.py → cleaned_urls.csv
- cleaned_urls.csv → ParishScraperGivenCSV.py → RawScrapedParishData.csv
- RawScrapedParishData.csv → Combined_Cleaner.py → JSON-like_Final_File.csv and CouldNotScrape.csv
- JSON-like_Final_File.csv → ConvertCsvTo6CsvTables.py → Final structured tables
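The chain of steps above could be driven by a small runner script. The sketch below is one possible way to do that, not part of the project. It assumes each script takes no command-line arguments and reads and writes the fixed file names listed in the steps, which the source does not confirm. The names STAGES and run_pipeline are hypothetical. By default it only reports the commands it would run.

```python
import subprocess
import sys
from pathlib import Path

# (script, input files, output files), in the order listed in the steps.
STAGES = [
    ("parish_results_cleaner.py", ["parish_results.csv"], ["cleaned_urls.csv"]),
    ("ParishScraperGivenCSV.py", ["cleaned_urls.csv"], ["RawScrapedParishData.csv"]),
    ("Combined_Cleaner.py", ["RawScrapedParishData.csv"],
     ["JSON-like_Final_File.csv", "CouldNotScrape.csv"]),
    # The final table file names are not listed in the steps, so none are given here.
    ("ConvertCsvTo6CsvTables.py", ["JSON-like_Final_File.csv"], []),
]


def run_pipeline(workdir=".", dry_run=True):
    """Run each stage in order; return the commands that were (or would be) run."""
    workdir = Path(workdir)
    commands = []
    for script, inputs, _outputs in STAGES:
        command = [sys.executable, script]
        commands.append(command)
        if dry_run:
            continue
        # Fail early if a previous stage did not produce the expected file.
        missing = [name for name in inputs if not (workdir / name).exists()]
        if missing:
            raise FileNotFoundError(f"{script} needs missing input(s): {missing}")
        # Assumption: each script takes no arguments and uses fixed file names.
        subprocess.run(command, cwd=workdir, check=True)
    return commands


for command in run_pipeline(dry_run=True):
    print(" ".join(command))
```

Calling run_pipeline(dry_run=False) from the directory containing the scripts would execute the stages in sequence and stop at the first failure.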
The dashboard visualizes activity from the final cleaned tables, combining both VT-scraped data and CLI-provided scores.
Input Tables:
- ParishTable
- MassTimesTable
- ConfessionTable
- AdorationTable
- DaysAdoration, DaysConfession, DaysMass
- SacramentsByDay
- ActivityRankings
- CLI_Data
- Diocese Locations
- OverlappedParishes
Dashboard Pages:
- VT Overview
- VT Sacrament
- VT Face Cards
- CLI Overview
- CLI/VT Scoring State
- CLI/VT Face Cards
- CLI/VT Scoring Comparison
Notable Metrics:
- Sacramental Importance (e.g. Mass: 3, Confession: 2, Adoration: 1)
- Combined and Normalized Activity Scores
- ParishUnique field to differentiate similarly named parishes
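As an illustration of how the sacramental weights could feed a combined and normalized score, here is a minimal sketch. The weights come from the example above, but the formula is an assumption: the dashboard's actual calculation is not documented here. The sketch multiplies weekly offering counts by their weights, sums them, and applies min-max normalization. The parish labels and counts are made up.

```python
# Weights taken from the example above (Mass: 3, Confession: 2, Adoration: 1).
WEIGHTS = {"Mass": 3, "Confession": 2, "Adoration": 1}


def combined_score(counts):
    """Weighted sum of weekly offerings; missing sacraments count as zero."""
    return sum(weight * counts.get(name, 0) for name, weight in WEIGHTS.items())


def normalize(scores):
    """Min-max scale scores to the range 0..1 (assumed normalization method)."""
    low, high = min(scores.values()), max(scores.values())
    if high == low:
        return {key: 0.0 for key in scores}
    return {key: (value - low) / (high - low) for key, value in scores.items()}


# Hypothetical weekly counts per parish.
parishes = {
    "Parish A": {"Mass": 10, "Confession": 2, "Adoration": 1},
    "Parish B": {"Mass": 5, "Confession": 1},
    "Parish C": {"Mass": 7, "Confession": 3, "Adoration": 5},
}

raw = {name: combined_score(counts) for name, counts in parishes.items()}
scaled = normalize(raw)
for name in parishes:
    print(f"{name}: raw={raw[name]}, normalized={scaled[name]:.2f}")
```

Because Mass carries the largest weight and most parishes offer it several times a week, scores from a scheme like this tend to bunch together; adjusting the weights in WEIGHTS is the simplest place to experiment.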
Known Limitations:
- Sacramental weights may need refinement to reduce clustering in dashboard scores.
- Duplicate parishes may exist due to shared names and ZIP assumptions.
- Python scripts run within Power BI require enabling Python scripting and specific libraries (pyzipcode, ZipCodeDatabase).
- Power BI privacy settings may restrict sharing; consider using a dedicated workspace or publish-to-web option.
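The duplicate-parish issue noted above can be checked with a short script. This sketch assumes a ParishUnique-style key built from a normalized parish name plus a five-digit ZIP code. The real field's format is not specified here, so the function parish_unique, the key layout, and the sample records are all hypothetical.

```python
from collections import Counter


def parish_unique(name, zip_code):
    """Build a composite key; assumed format is 'normalized name|5-digit ZIP'."""
    normalized_name = " ".join(name.split()).casefold()
    zip5 = zip_code.strip()[:5]
    return f"{normalized_name}|{zip5}"


# Hypothetical records: two rows describe the same parish, one shares only the name.
records = [
    ("St. Mary Parish", "12345"),
    ("st. mary  parish", "12345-6789"),
    ("St. Mary Parish", "54321"),
]

keys = [parish_unique(name, zip_code) for name, zip_code in records]
duplicates = [key for key, count in Counter(keys).items() if count > 1]
print(duplicates)
```

A key like this separates same-named parishes in different ZIP codes, but it would still merge two distinct parishes that share both a name and a ZIP code, which is one way the duplicates described above could arise.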
Devanshu Khadka
Internal use only. Please contact before reuse or distribution.