Skip to content

πŸ—„οΈ Create SQL Script for Database Schema Validation and StatisticsΒ #7865

@pethers

Description

@pethers

🎯 Objective

Create a comprehensive SQL script that validates the database schema by counting all tables and views, extracting sample data from each, and generating statistics reports. This script will provide a baseline for documentation validation and schema health checks.

Recommended Agent: @stack-specialist

πŸ“‹ Background

Currently, we lack automated tools to:

  • Count and list all tables and views in the database
  • Extract representative sample data for validation
  • Generate statistics about schema composition
  • Validate data consistency and quality
  • Provide baseline metrics for documentation

This script will be the foundation for validating DATABASE_VIEW_INTELLIGENCE_CATALOG.md and other documentation.

πŸ“Š Current State

Schema Size:

  • full_schema.sql: 12,934 lines
  • refresh-all-views.sql: ~28 materialized views
  • PostgreSQL 16 database with extensive view hierarchy

Missing Capabilities:

  • No automated schema inventory
  • No sample data extraction tool
  • No statistics generation
  • No baseline for validation

βœ… Acceptance Criteria

  • SQL script created at service.data.impl/src/main/resources/schema-validation.sql
  • Script counts all tables (base tables, not views)
  • Script counts all views (regular and materialized separately)
  • For each table/view: name, type, row count, column count
  • Sample data extraction (top 5-10 rows) for each table/view
  • Statistics: total size, largest tables, empty tables/views
  • Output formatted as JSON or CSV for easy processing
  • Documentation added to README-SCHEMA-MAINTENANCE.md
  • Script tested against actual database

πŸ› οΈ Implementation Guidance

Script Structure

-- schema-validation.sql
-- Database Schema Validation and Statistics Report
-- Generated: [timestamp]

-- Part 1: Count all objects
\echo '=== DATABASE OBJECT COUNTS ==='

-- Count base tables
SELECT 'Base Tables' AS object_type, COUNT(*) AS count
FROM information_schema.tables
WHERE table_schema = 'public' AND table_type = 'BASE TABLE';

-- Count regular views
SELECT 'Regular Views' AS object_type, COUNT(*) AS count
FROM information_schema.views
WHERE table_schema = 'public';

-- Count materialized views
SELECT 'Materialized Views' AS object_type, COUNT(*) AS count
FROM pg_matviews
WHERE schemaname = 'public';

-- Part 2: List all tables with row counts
\echo '\n=== TABLE INVENTORY WITH ROW COUNTS ==='

DO $$
DECLARE
    r RECORD;
    row_count INTEGER;
BEGIN
    FOR r IN 
        SELECT schemaname, tablename 
        FROM pg_tables 
        WHERE schemaname = 'public'
        ORDER BY tablename
    LOOP
        EXECUTE format('SELECT COUNT(*) FROM %I.%I', r.schemaname, r.tablename) INTO row_count;
        RAISE NOTICE 'Table: %.% | Rows: %', r.schemaname, r.tablename, row_count;
    END LOOP;
END $$;

-- Part 3: List all views with row counts
\echo '\n=== VIEW INVENTORY WITH ROW COUNTS ==='

DO $$
DECLARE
    r RECORD;
    row_count INTEGER;
BEGIN
    -- Regular views
    FOR r IN 
        SELECT schemaname, viewname AS name
        FROM pg_views 
        WHERE schemaname = 'public'
        ORDER BY viewname
    LOOP
        BEGIN
            EXECUTE format('SELECT COUNT(*) FROM %I.%I', r.schemaname, r.name) INTO row_count;
            RAISE NOTICE 'View: %.% | Rows: %', r.schemaname, r.name, row_count;
        EXCEPTION WHEN OTHERS THEN
            RAISE NOTICE 'View: %.% | ERROR: %', r.schemaname, r.name, SQLERRM;
        END;
    END LOOP;
    
    -- Materialized views
    FOR r IN 
        SELECT schemaname, matviewname AS name
        FROM pg_matviews 
        WHERE schemaname = 'public'
        ORDER BY matviewname
    LOOP
        BEGIN
            EXECUTE format('SELECT COUNT(*) FROM %I.%I', r.schemaname, r.name) INTO row_count;
            RAISE NOTICE 'Materialized View: %.% | Rows: %', r.schemaname, r.name, row_count;
        EXCEPTION WHEN OTHERS THEN
            RAISE NOTICE 'Materialized View: %.% | ERROR: %', r.schemaname, r.name, SQLERRM;
        END;
    END LOOP;
END $$;

-- Part 4: Sample data extraction
\echo '\n=== SAMPLE DATA FROM KEY VIEWS ==='

-- Sample from each important view (top 5 rows)
-- Add specific queries for key views
SELECT * FROM view_riksdagen_politician LIMIT 5;
SELECT * FROM view_riksdagen_party LIMIT 5;
SELECT * FROM view_riksdagen_vote_data_ballot_politician_summary_daily 
WHERE vote_date >= CURRENT_DATE - INTERVAL '7 days' LIMIT 5;

-- Part 5: Schema statistics
\echo '\n=== SCHEMA STATISTICS ==='

-- Largest tables by row count
SELECT 
    schemaname,
    tablename,
    n_live_tup AS row_count
FROM pg_stat_user_tables
WHERE schemaname = 'public'
ORDER BY n_live_tup DESC
LIMIT 20;

-- Tables with no data
SELECT tablename
FROM pg_tables t
LEFT JOIN pg_stat_user_tables s ON t.tablename = s.relname
WHERE t.schemaname = 'public' 
  AND (s.n_live_tup = 0 OR s.n_live_tup IS NULL)
ORDER BY tablename;

-- View dependency count
SELECT 
    dependent_view.relname AS view_name,
    COUNT(DISTINCT source_table.relname) AS dependency_count
FROM pg_depend
JOIN pg_rewrite ON pg_depend.objid = pg_rewrite.oid
JOIN pg_class AS dependent_view ON pg_rewrite.ev_class = dependent_view.oid
JOIN pg_class AS source_table ON pg_depend.refobjid = source_table.oid
WHERE dependent_view.relkind IN ('v', 'm')
GROUP BY dependent_view.relname
ORDER BY dependency_count DESC
LIMIT 20;

Enhanced Features

JSON Output Option:

-- Generate JSON report
COPY (
  SELECT json_build_object(
    'tables', (SELECT json_agg(t) FROM (
      SELECT tablename, n_live_tup AS rows 
      FROM pg_stat_user_tables 
      WHERE schemaname = 'public'
    ) t),
    'views', (SELECT json_agg(v) FROM (
      SELECT viewname 
      FROM pg_views 
      WHERE schemaname = 'public'
    ) v),
    'materialized_views', (SELECT json_agg(mv) FROM (
      SELECT matviewname 
      FROM pg_matviews 
      WHERE schemaname = 'public'
    ) mv)
  )
) TO '/tmp/schema_report.json';

CSV Output Option:

-- Export to CSV
COPY (
  SELECT 'TABLE' AS type, tablename AS name, n_live_tup AS rows
  FROM pg_stat_user_tables
  WHERE schemaname = 'public'
  UNION ALL
  SELECT 'VIEW', viewname, NULL
  FROM pg_views
  WHERE schemaname = 'public'
  ORDER BY name
) TO '/tmp/schema_inventory.csv' WITH (FORMAT CSV, HEADER);

Usage Instructions

# Run the validation script
psql -U postgres -d cia_dev -f service.data.impl/src/main/resources/schema-validation.sql > schema_report.txt 2>&1

# Generate JSON report
psql -U postgres -d cia_dev -c "SELECT json_build_object(...)" > schema_report.json

# Generate CSV inventory
psql -U postgres -d cia_dev -c "COPY (...) TO STDOUT WITH CSV HEADER" > schema_inventory.csv

πŸ“Š Expected Output

  1. Text Report (schema_report.txt):

    • Object counts summary
    • Full table/view inventory with row counts
    • Sample data from key views
    • Statistics (largest tables, empty objects, dependencies)
  2. JSON Report (schema_report.json):

    • Structured data for programmatic processing
    • Complete schema metadata
    • Suitable for automated validation
  3. CSV Inventory (schema_inventory.csv):

    • Spreadsheet-compatible format
    • Easy to compare with documentation
    • Suitable for gap analysis

πŸ”— Related Files

πŸ“Š Success Metrics

  • Script successfully runs against database
  • Accurate counts of all objects
  • Sample data extracted without errors
  • Reports generated in multiple formats
  • Documentation updated with usage instructions
  • Baseline established for validation work

Estimated Effort: 4-6 hours
Priority: High
Domain: database, validation, automation

Metadata

Metadata

Assignees

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions