IVF Index Health Check Demo

Overview

This tutorial demonstrates IVF (Inverted File) index health monitoring and optimization in MatrixOne Python SDK. Regular health checks ensure optimal vector search performance and help identify when index rebuilding is needed.

Critical Issue: Index Creation Timing

IMPORTANT: A common problem in production is creating IVF indexes too early:

If you create an IVF index when the table is empty or has very little data, the index will have poor centroid distribution. This leads to degraded performance that persists even after adding more data.

Solution: Always insert a representative amount of data (1000+ vectors) before creating the IVF index. If you already created an index too early, use this monitoring tool to detect the issue and rebuild the index.

⚠️ Why Index Creation Timing Matters

❌ Problems with Early Index Creation

If you create an IVF index when:

  • 📉 There is no data in the table
  • 📉 There is very little data (< 100 vectors)
  • 📉 Data volume is much smaller than expected production size

You may experience:

  • 🔴 Insufficient centroids: Not enough data to properly initialize all centroids
  • 🔴 Poor distribution: Vectors clustered unevenly across centroids
  • 🔴 Empty centroids: Some centroids never get assigned any vectors
  • 🔴 Degraded performance: Queries become slower as data grows
  • 🔴 Imbalanced load: Some centroids overloaded while others are empty

Best Practice:

  1. ✅ Insert a representative amount of data first (at least 1000+ vectors)
  2. ✅ Then create the IVF index with appropriate lists parameter
  3. ✅ Use this health monitoring tool to evaluate index quality
  4. ✅ Rebuild the index if health metrics indicate issues

Example of Bad vs Good:

# ❌ BAD: Create index on empty table
client.create_table(VectorDocument)
client.vector_ops.create_ivf(table_name, index_name, column_name, lists=100)
# Then insert data later → Poor distribution!

# ✅ GOOD: Insert data first, then create index
client.create_table(VectorDocument)
client.batch_insert(VectorDocument, initial_data)  # Insert 10K+ vectors
client.vector_ops.create_ivf(table_name, index_name, column_name, lists=100)
# Index learns from actual data distribution → Good performance!

🔍 How This Tool Helps

This demo shows you how to:

  • Detect poor index health using get_ivf_stats()
  • Evaluate centroid distribution and balance ratio
  • Decide when index rebuild is necessary
  • Rebuild indexes with optimized parameters

Why Monitor IVF Index Health?

Even with proper initial creation, index health can degrade over time:

  • ⚠️ As you insert, update, or delete vector data
  • ⚠️ When data distribution changes
  • ⚠️ After bulk data imports

This Demo Covers:

  • ✅ Create IVF index with configurable parameters
  • ✅ Monitor centroid distribution and load balance
  • ✅ Analyze health metrics and detect issues
  • ✅ Rebuild indexes when needed
  • ✅ Best practices for production monitoring

For complete API reference, see MatrixOne Python SDK Documentation

Before You Start

Prerequisites

  • MatrixOne database installed and running
  • Python 3.7 or higher
  • MatrixOne Python SDK installed

Install SDK

pip3 install matrixone-python-sdk

Complete Working Example

Save this as ivf_health_demo.py and run with python3 ivf_health_demo.py:

from matrixone import Client
from matrixone.config import get_connection_params
from matrixone.sqlalchemy_ext import create_vector_column
from matrixone.orm import declarative_base
from sqlalchemy import BigInteger, Column, String
import numpy as np
import time

np.random.seed(42)

print("=" * 70)
print("MatrixOne IVF Index Health Check Demo")
print("=" * 70)

# Connect to database
host, port, user, password, database = get_connection_params(database='demo')
client = Client()
client.connect(host=host, port=port, user=user, password=password, database=database)
print("Connected to database")

# Define table with 128-dimensional vectors
Base = declarative_base()

class VectorDocument(Base):
    __tablename__ = "ivf_health_demo_docs"
    id = Column(BigInteger, primary_key=True)
    title = Column(String(200))
    category = Column(String(100))
    embedding = create_vector_column(128, "f32")

# Create table and insert data
client.drop_table(VectorDocument)
client.create_table(VectorDocument)

initial_docs = [
    {
        "id": i + 1,
        "title": f"Document {i + 1}",
        "category": f"Category_{i % 5}",
        "embedding": np.random.rand(128).astype(np.float32).tolist()
    }
    for i in range(100)
]

client.batch_insert(VectorDocument, initial_docs)
print(f"Inserted {len(initial_docs)} documents")

# Create IVF index
client.vector_ops.create_ivf(
    "ivf_health_demo_docs",
    "idx_embedding_ivf",
    "embedding",
    lists=10,
    op_type="vector_l2_ops"
)
print("IVF index created with 10 lists")

time.sleep(1)

# Get IVF statistics
stats = client.vector_ops.get_ivf_stats("ivf_health_demo_docs", "embedding")

# Analyze distribution
distribution = stats['distribution']
centroid_counts = distribution['centroid_count']
total_centroids = len(centroid_counts)
total_vectors = sum(centroid_counts)
min_count = min(centroid_counts) if centroid_counts else 0
max_count = max(centroid_counts) if centroid_counts else 0
balance_ratio = max_count / min_count if min_count > 0 else float('inf')

print("\nIndex Health:")
print(f"- Total centroids: {total_centroids}")
print(f"- Total vectors: {total_vectors}")
print(f"- Balance ratio: {balance_ratio:.2f}")

if balance_ratio > 2.0:
    print(f"WARNING: Balance ratio high ({balance_ratio:.2f})")
else:
    print("PASS: Good balance")

# Cleanup
client.disconnect()
print("\nDemo completed!")

Key Concepts

1. IVF Index Structure

IVF index organizes vectors into clusters (centroids):

IVF Index with lists=10
├── Centroid 0: [Vector 1, Vector 5, Vector 12, ...]  → 15 vectors
├── Centroid 1: [Vector 2, Vector 8, Vector 19, ...]  → 12 vectors
├── Centroid 2: [Vector 3, Vector 9, ...]             → 8 vectors
├── ...
└── Centroid 9: [Vector 4, Vector 7, ...]             → 11 vectors

Ideal: All centroids have similar number of vectors (balanced load)
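For the toy counts in the diagram, the balance check is plain Python (the omitted centroids are assumed to hold similar counts):

```python
# Per-centroid vector counts from the diagram above (centroids 3-8 omitted)
centroid_counts = [15, 12, 8, 11]

# Balance ratio: most-loaded centroid vs. least-loaded centroid
balance_ratio = max(centroid_counts) / min(centroid_counts)
print(f"balance ratio: {balance_ratio:.2f}")  # 15 / 8 = 1.88
```

A ratio this close to 1.0 falls in the "Good" range described below; a perfectly balanced index would report exactly 1.0.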

2. Get IVF Statistics

stats = client.vector_ops.get_ivf_stats(table_name, vector_column)

Returns:

{
    'database': 'demo',
    'table_name': 'ivf_health_demo_docs',
    'column_name': 'embedding',
    'index_tables': {
        'centroids': '__mo_ivf_centroids_xxx',
        'index': '__mo_ivf_index_xxx'
    },
    'distribution': {
        'centroid_id': [0, 1, 2, ...],
        'centroid_count': [15, 12, 8, ...],
        'centroid_version': [1, 1, 1, ...]
    }
}

3. Health Metrics

Load Balance Ratio

balance_ratio = max_vectors_per_centroid / min_vectors_per_centroid

Health Levels:

  • 🟢 Excellent (< 1.5): Optimal performance
  • 🟡 Good (1.5 - 2.0): Acceptable performance
  • 🟠 Fair (2.0 - 3.0): Monitor closely
  • 🔴 Poor (> 3.0): Rebuild recommended

Utilization

utilization = total_vectors / total_centroids

Guidelines:

  • 🔴 Too Low (< 5): Too many centroids, reduce lists
  • 🟢 Optimal (10 - 100): Good efficiency
  • 🟡 High (> 100): Consider more lists
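Several later snippets call a `calculate_balance_ratio` helper. It is not part of the SDK; a minimal sketch, assuming the stats dict shape returned by `get_ivf_stats()`, might look like this (the companion `calculate_utilization` is likewise an illustration):

```python
def calculate_balance_ratio(stats):
    """Max/min load across centroids, from a get_ivf_stats()-style dict.

    Returns float('inf') when any centroid is empty, since an empty
    centroid is itself a sign the index needs attention.
    """
    counts = stats['distribution']['centroid_count']
    if not counts or min(counts) == 0:
        return float('inf')
    return max(counts) / min(counts)


def calculate_utilization(stats):
    """Average number of vectors per centroid."""
    counts = stats['distribution']['centroid_count']
    return sum(counts) / len(counts) if counts else 0.0


# Example with a mocked stats dict (shape matches the docs above)
stats = {'distribution': {'centroid_count': [15, 12, 8, 11]}}
print(calculate_balance_ratio(stats))  # 1.875
print(calculate_utilization(stats))    # 11.5
```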

Health Check Examples

Basic Health Check

# Get statistics
stats = client.vector_ops.get_ivf_stats(table_name, vector_column)

# Extract metrics
distribution = stats['distribution']
centroid_counts = distribution['centroid_count']

# Calculate health metrics
total_vectors = sum(centroid_counts)
min_count = min(centroid_counts)
max_count = max(centroid_counts)
balance_ratio = max_count / min_count

print(f"Balance Ratio: {balance_ratio:.2f}")
if balance_ratio > 2.0:
    print("WARNING: Index needs optimization")
else:
    print("Index is healthy")

Comprehensive Health Check

def check_ivf_health(client, table_name, vector_column, expected_lists):
    """Comprehensive IVF index health check"""

    stats = client.vector_ops.get_ivf_stats(table_name, vector_column)
    distribution = stats['distribution']

    centroid_ids = distribution['centroid_id']
    centroid_counts = distribution['centroid_count']
    centroid_versions = distribution['centroid_version']

    # Metrics
    total_centroids = len(centroid_counts)
    total_vectors = sum(centroid_counts)
    min_count = min(centroid_counts) if centroid_counts else 0
    max_count = max(centroid_counts) if centroid_counts else 0
    avg_count = total_vectors / total_centroids if total_centroids > 0 else 0
    balance_ratio = max_count / min_count if min_count > 0 else float('inf')
    empty_centroids = sum(1 for c in centroid_counts if c == 0)

    # Health checks
    issues = []
    warnings = []

    # Check centroid count
    if total_centroids != expected_lists:
        issues.append(f"Centroid count mismatch: expected {expected_lists}, found {total_centroids}")

    # Check balance
    if balance_ratio > 3.0:
        issues.append(f"Critical imbalance: ratio {balance_ratio:.2f}")
    elif balance_ratio > 2.0:
        warnings.append(f"Moderate imbalance: ratio {balance_ratio:.2f}")

    # Check empty centroids
    if empty_centroids > 0:
        warnings.append(f"{empty_centroids} empty centroids found")

    # Check version consistency
    if len(set(centroid_versions)) > 1:
        warnings.append(f"Inconsistent versions: {set(centroid_versions)}")

    # Health status
    if issues:
        health = "CRITICAL"
    elif warnings:
        health = "NEEDS_ATTENTION"
    else:
        health = "HEALTHY"

    return {
        'health': health,
        'metrics': {
            'total_centroids': total_centroids,
            'total_vectors': total_vectors,
            'avg_count': avg_count,
            'balance_ratio': balance_ratio,
            'empty_centroids': empty_centroids
        },
        'issues': issues,
        'warnings': warnings
    }

# Usage
health_report = check_ivf_health(client, "my_table", "embedding", expected_lists=10)
print(f"Health Status: {health_report['health']}")
print(f"Balance Ratio: {health_report['metrics']['balance_ratio']:.2f}")

Monitor After Data Changes

# Before bulk insert
stats_before = client.vector_ops.get_ivf_stats(table_name, vector_column)
balance_before = calculate_balance_ratio(stats_before)

# Insert new data
client.batch_insert(Model, new_data)

# After bulk insert
time.sleep(1)  # Allow index to update
stats_after = client.vector_ops.get_ivf_stats(table_name, vector_column)
balance_after = calculate_balance_ratio(stats_after)

# Compare
if balance_after > balance_before * 1.2:
    print("Balance degraded by >20%, consider rebuild")

Index Rebuild Strategies

Strategy 1: Calculate Optimal Lists

# Rule of thumb: lists ≈ √(number of vectors)
total_vectors = 10000
optimal_lists = int(np.sqrt(total_vectors))  # = 100

client.vector_ops.create_ivf(
    table_name, index_name, column_name,
    lists=optimal_lists,
    op_type="vector_l2_ops"
)

Strategy 2: Zero-Downtime Rebuild

# Step 1: Create new index with different name
client.vector_ops.create_ivf(
    table_name,
    "idx_embedding_new",  # New name
    column_name,
    lists=new_optimal_lists,
    op_type="vector_l2_ops"
)

# Step 2: Verify new index works
pinecone_index_new = client.get_pinecone_index(table_name, column_name)
results = pinecone_index_new.query(vector=test_vector, top_k=10)

# Step 3: Drop old index only after verification
if len(results.matches) > 0:
    client.vector_ops.drop(table_name, "idx_embedding_old")
    print("Old index dropped, new index active")

Strategy 3: Scheduled Maintenance

import time

import schedule

def rebuild_if_needed():
    stats = client.vector_ops.get_ivf_stats(table_name, vector_column)
    balance_ratio = calculate_balance_ratio(stats)

    if balance_ratio > 2.5:
        print(f"Balance ratio {balance_ratio:.2f}, rebuilding...")
        rebuild_index(table_name, vector_column)
    else:
        print(f"Health OK (ratio: {balance_ratio:.2f})")

# Run daily at 2 AM; jobs only fire inside a loop that drives the scheduler
schedule.every().day.at("02:00").do(rebuild_if_needed)

while True:
    schedule.run_pending()
    time.sleep(60)

Production Monitoring

Set Up Alerts

def check_and_alert(client, table_name, vector_column, threshold=2.5):
    """Check index health and send alert if needed"""

    stats = client.vector_ops.get_ivf_stats(table_name, vector_column)
    distribution = stats['distribution']
    counts = distribution['centroid_count']

    balance_ratio = max(counts) / min(counts) if min(counts) > 0 else float('inf')

    if balance_ratio > threshold:
        # Send alert (email, Slack, etc.)
        send_alert(
            subject=f"IVF Index Health Alert: {table_name}",
            message=f"Balance ratio {balance_ratio:.2f} exceeds threshold {threshold}"
        )
        return False

    return True

Track Metrics Over Time

import datetime

def log_health_metrics(client, table_name, vector_column):
    """Log health metrics to monitoring system"""

    stats = client.vector_ops.get_ivf_stats(table_name, vector_column)
    distribution = stats['distribution']
    counts = distribution['centroid_count']

    metrics = {
        'timestamp': datetime.datetime.now(),
        'table': table_name,
        'total_vectors': sum(counts),
        'total_centroids': len(counts),
        'balance_ratio': max(counts) / min(counts) if min(counts) > 0 else float('inf'),
        'empty_centroids': sum(1 for c in counts if c == 0)
    }

    # Send to monitoring system (Prometheus, CloudWatch, etc.)
    send_to_monitoring(metrics)

    return metrics

Dashboard Metrics

Key metrics to display in your monitoring dashboard:

dashboard_metrics = {
    'balance_ratio': balance_ratio,           # Target: < 2.0
    'total_vectors': total_vectors,           # Trend over time
    'avg_vectors_per_centroid': avg_count,    # Should be stable
    'empty_centroids': empty_count,           # Target: 0
    'last_rebuild_date': last_rebuild,        # Track rebuild frequency
    'query_latency_p99': latency_p99         # Should stay low
}

Optimal Lists Parameter Guide

Formula-Based Selection

def calculate_optimal_lists(vector_count):
    """Calculate optimal lists parameter based on vector count"""

    if vector_count < 1000:
        return 10  # Small dataset
    elif vector_count < 10000:
        return 50  # Medium dataset
    elif vector_count < 100000:
        return 100  # Large dataset
    else:
        return int(np.sqrt(vector_count))  # Very large dataset

Adjustment Guidelines

Vector Count    Recommended Lists    Reasoning
< 1,000         10 - 20              Minimize overhead
1K - 10K        20 - 50              Balance speed and accuracy
10K - 100K      50 - 200             Optimize query performance
100K - 1M       200 - 1,000          Use √N formula
> 1M            1,000+               Large-scale optimization

Testing Different Lists Values

# Test multiple configurations
for lists_value in [10, 20, 50, 100]:
    # Create index
    index_name = f"idx_test_lists_{lists_value}"
    client.vector_ops.create_ivf(
        table_name, index_name, column_name,
        lists=lists_value,
        op_type="vector_l2_ops"
    )

    # Measure query performance
    start_time = time.time()
    results = perform_test_queries(100)  # Run 100 test queries
    elapsed = time.time() - start_time

    # Check balance
    stats = client.vector_ops.get_ivf_stats(table_name, column_name)
    balance = calculate_balance_ratio(stats)

    print(f"Lists={lists_value}: Time={elapsed:.2f}s, Balance={balance:.2f}")

    # Cleanup
    client.vector_ops.drop(table_name, index_name)

# Choose the configuration with the best trade-off between query speed and balance

Health Check Frequency

# After significant data changes
def after_bulk_operation():
    client.batch_insert(Model, large_dataset)
    check_ivf_health()  # Immediate check

# Daily monitoring
def daily_health_check():
    schedule.every().day.at("02:00").do(check_ivf_health)

# Weekly detailed analysis
def weekly_analysis():
    schedule.every().monday.at("03:00").do(detailed_health_analysis)

# After every N inserts
insert_count = 0
def track_inserts():
    global insert_count
    insert_count += 1
    if insert_count % 1000 == 0:  # Every 1000 inserts
        check_ivf_health()

Troubleshooting

Issue: "High balance ratio (> 3.0)"

Cause: Uneven vector distribution across centroids

Solution: Rebuild index with optimized lists parameter

# Drop and rebuild
client.vector_ops.drop(table_name, old_index_name)

# Calculate optimal lists
total_vectors = get_vector_count()
optimal_lists = int(np.sqrt(total_vectors))

client.vector_ops.create_ivf(
    table_name, new_index_name, column_name,
    lists=optimal_lists,
    op_type="vector_l2_ops"
)

Issue: "Empty centroids detected"

Cause: Too many lists for the dataset size

Solution: Reduce lists parameter

# Current: 100 vectors with 20 lists → avg 5 per centroid (too low)
# Better: 100 vectors with 10 lists → avg 10 per centroid

client.vector_ops.create_ivf(
    table_name, index_name, column_name,
    lists=10,  # Reduced from 20
    op_type="vector_l2_ops"
)

Issue: "Version inconsistency"

Cause: Index updates in progress or partial rebuild

Solution: Wait for index to stabilize or rebuild

# Check versions
stats = client.vector_ops.get_ivf_stats(table_name, vector_column)
versions = set(stats['distribution']['centroid_version'])

if len(versions) > 1:
    print(f"Multiple versions found: {versions}")
    print("Waiting for index to stabilize...")
    time.sleep(5)

    # Check again
    stats = client.vector_ops.get_ivf_stats(table_name, vector_column)
    versions = set(stats['distribution']['centroid_version'])

    if len(versions) > 1:
        print("Rebuilding index...")
        rebuild_index(table_name, vector_column)

Issue: "Performance degradation after bulk insert"

Cause: Index not rebalanced after data growth

Solution: Monitor data growth and rebuild proactively

# Track vector count
initial_count = 1000
current_count = get_vector_count()
growth_rate = (current_count - initial_count) / initial_count

if growth_rate > 0.2:  # 20% growth
    print(f"Vector count grew by {growth_rate*100:.1f}%")
    print("Rebuilding index for optimal performance...")
    rebuild_index(table_name, vector_column)

Best Practices

1. Index Creation Timing (Most Important!)

⚠️ CRITICAL: Always create IVF indexes AFTER inserting data

# ❌ WRONG: Create index first, then insert data
client.create_table(VectorDocument)
client.vector_ops.create_ivf(table_name, index_name, column_name, lists=100)
client.batch_insert(VectorDocument, data)  # Poor distribution!

# ✅ CORRECT: Insert data first, then create index
client.create_table(VectorDocument)
client.batch_insert(VectorDocument, data)  # Insert at least 1000+ vectors
client.vector_ops.create_ivf(table_name, index_name, column_name, lists=100)

# ✅ EVEN BETTER: Calculate optimal lists based on actual data
client.create_table(VectorDocument)
client.batch_insert(VectorDocument, data)
vector_count = len(data)
optimal_lists = int(np.sqrt(vector_count))  # e.g., √10000 = 100
client.vector_ops.create_ivf(table_name, index_name, column_name, lists=optimal_lists)

Why This Matters:

IVF index initialization uses the existing data to determine centroid positions. If you create the index on an empty or nearly-empty table:

  • Centroids may be positioned poorly
  • Future data won't be distributed evenly
  • Performance will suffer even with millions of vectors later
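The effect can be illustrated with a toy NumPy simulation (nearest-centroid assignment only, not MatrixOne's actual clustering): centroids frozen on a small, unrepresentative sample serve later, broader data badly.

```python
import numpy as np

rng = np.random.default_rng(42)

# "Early" index creation: centroids learned from 20 vectors near the origin
early_sample = rng.normal(0.0, 0.1, size=(20, 8))
centroids = early_sample[rng.choice(20, size=10, replace=False)]

# Production data arrives later and is bimodal: half near the origin,
# half clustered around a region the centroids never saw
near = rng.normal(0.0, 0.1, size=(5000, 8))
far = rng.normal(3.0, 0.1, size=(5000, 8))
production = np.vstack([near, far])

# Nearest-centroid assignment (what IVF does as vectors are inserted)
dists = np.linalg.norm(production[:, None, :] - centroids[None, :, :], axis=2)
counts = np.bincount(dists.argmin(axis=1), minlength=10)

balance_ratio = counts.max() / max(counts.min(), 1)
print("per-centroid counts:", counts.tolist())
print(f"balance ratio: {balance_ratio:.2f}")  # well above the 2.0 threshold here
```

The distant half of the data collapses onto whichever stale centroid happens to be closest, so the load stays badly skewed no matter how much data is added later.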

Recovery Steps if Index Created Too Early:

# Step 1: Check current health
stats = client.vector_ops.get_ivf_stats(table_name, vector_column)
counts = stats['distribution']['centroid_count']
balance_ratio = max(counts) / min(counts) if min(counts) > 0 else float('inf')

# Step 2: If ratio > 2.5, rebuild is needed
if balance_ratio > 2.5:
    print(f"Index created too early! Balance ratio: {balance_ratio:.2f}")

    # Step 3: Drop old index
    client.vector_ops.drop(table_name, old_index_name)

    # Step 4: Calculate optimal lists from current data
    current_vector_count = sum(stats['distribution']['centroid_count'])
    optimal_lists = int(np.sqrt(current_vector_count))

    # Step 5: Rebuild with proper parameters
    client.vector_ops.create_ivf(
        table_name, new_index_name, column_name,
        lists=optimal_lists,
        op_type="vector_l2_ops"
    )
    print(f"Index rebuilt with {optimal_lists} lists")

2. Regular Monitoring

# Production monitoring script
def production_health_monitor():
    # Check all vector tables
    vector_tables = [
        ("products", "embedding"),
        ("users", "profile_vector"),
        ("documents", "content_embedding")
    ]

    for table, column in vector_tables:
        health = check_ivf_health(client, table, column, expected_lists=100)

        if health['health'] == 'CRITICAL':
            # Immediate rebuild
            rebuild_index(table, column)
        elif health['health'] == 'NEEDS_ATTENTION':
            # Schedule rebuild during maintenance window
            schedule_rebuild(table, column)

# Run every hour
schedule.every().hour.do(production_health_monitor)

3. Rebuild Strategy

def should_rebuild_index(stats, last_rebuild_time):
    """Determine if index rebuild is needed"""

    distribution = stats['distribution']
    counts = distribution['centroid_count']
    balance_ratio = max(counts) / min(counts) if min(counts) > 0 else float('inf')

    # Rebuild conditions
    if balance_ratio > 2.5:
        return True, "High balance ratio"

    # Rebuild if it's been more than 7 days since last rebuild
    days_since_rebuild = (time.time() - last_rebuild_time) / 86400
    if days_since_rebuild > 7:
        return True, "Scheduled maintenance"

    return False, "Index healthy"

4. Maintenance Windows

def rebuild_during_maintenance(client, table_name, vector_column):
    """Rebuild index during scheduled maintenance window"""

    # Check if in maintenance window (e.g., 2-4 AM)
    current_hour = datetime.datetime.now().hour
    if not (2 <= current_hour < 4):
        print("Outside maintenance window, skipping rebuild")
        return

    print("🔧 Maintenance window: Rebuilding index...")

    # Get current stats
    stats = client.vector_ops.get_ivf_stats(table_name, vector_column)
    total_vectors = sum(stats['distribution']['centroid_count'])

    # Calculate optimal lists
    optimal_lists = int(np.sqrt(total_vectors))

    # Rebuild
    client.vector_ops.drop(table_name, "idx_old")
    client.vector_ops.create_ivf(
        table_name, "idx_new", vector_column,
        lists=optimal_lists,
        op_type="vector_l2_ops"
    )

    print(f"Index rebuilt with {optimal_lists} lists")

5. Performance Testing

def test_query_performance(client, table_name, vector_column):
    """Measure query performance"""

    query_vector = np.random.rand(128).tolist()

    # Run multiple queries and measure time
    times = []
    for _ in range(10):
        start = time.time()
        pinecone_index = client.get_pinecone_index(table_name, vector_column)
        results = pinecone_index.query(vector=query_vector, top_k=10)
        elapsed = time.time() - start
        times.append(elapsed)

    avg_time = np.mean(times)
    p95_time = np.percentile(times, 95)

    print(f"Query Performance:")
    print(f"Avg: {avg_time*1000:.2f}ms")
    print(f"P95: {p95_time*1000:.2f}ms")

    # Alert if too slow
    if p95_time > 0.1:  # 100ms threshold
        print("Queries too slow, check index health")

Summary

IVF index health monitoring is essential for production vector search systems:

  • ✅ Monitor Regularly: Check health after bulk operations and on schedule
  • ✅ Track Balance Ratio: Keep it < 2.0 for optimal performance
  • ✅ Rebuild Proactively: Don't wait for performance to degrade
  • ✅ Use Optimal Lists: Calculate based on √(vector_count)
  • ✅ Test Performance: Measure query latency to detect issues early
  • ✅ Automate: Set up scheduled health checks and alerts

Golden Rule: Monitor balance ratio and rebuild when it exceeds 2.5! 🚀