As a cornerstone relational database, PostgreSQL provides extensive functionality for updating stored data across single or multiple columns. However, not all updates are created equal: improperly structured bulk updates can cripple database performance through inefficient table scans, wasted I/O, and redundant index writes.

When executed correctly following PostgreSQL best practices, bulk updates enable efficient data transformations essential for data-driven applications. In this advanced guide, we will explore optimized techniques for updating multiple columns across thousands or millions of rows.

Why Updating Multiple Columns Matters

First, let's examine why the ability to update multiple columns in bulk is so important for real-world PostgreSQL deployments:

Dynamic Business Needs

From adjusting financial records to modifying user profiles, businesses constantly need to update existing records. For example, an e-commerce site may need to update costs and apply sales across thousands of product listings.

New Application Requirements

As developers enhance applications over time, new columns and data points are frequently added to existing tables through ALTER TABLE statements. After deployment, those columns eventually need population.

Changing Schemas

Evolving datasets often require changes like splitting a single column into multiple new columns (e.g. names into first/last). This requires propagating existing data into the new columns.

Fixing Errors

Despite best efforts, data errors can creep into large production datasets. Bulk updates provide the capability to systematically fix issues across many rows when manual adjustments are not practical.

Simply put, updating multiple columns enables databases to adapt to changing business requirements and keep data current. Ignoring bulk updates leads to stale, rigid datasets.

Updating All Rows in a Table

Earlier we looked at updating rows based on specific WHERE conditions. However, occasionally you may need to update every row in a table across multiple columns.

For example, let's consider a users table tracking customer names and privacy preferences:

users

id first_name last_name public_profile
1 John Smith TRUE
2 Sarah Davis TRUE
3 Mark Jenkins FALSE

Now let's say updated legal requirements mean all user profiles should default to private. We can update this by setting public_profile to FALSE for all rows, without any WHERE filter:

UPDATE users
SET public_profile = FALSE; 

Running a simple SELECT shows that worked as expected:

SELECT * FROM users;
id first_name last_name public_profile
1 John Smith FALSE
2 Sarah Davis FALSE
3 Mark Jenkins FALSE

This approach updates every row with minimal SQL. However, for large tables scanning every row can become expensive. More advanced options like processing in batches may be preferable depending on data volumes.

Ultimately though, the ability to update all rows in bulk saves tremendous time compared to individual updates when business logic changes.
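Since this guide centers on multi-column updates, note that the same statement form sets any number of columns in one pass. A minimal sketch, assuming a hypothetical display_name column on the same users table:

UPDATE users
SET public_profile = FALSE,
    display_name   = first_name;

Each row is visited once no matter how many columns the SET list touches, which is far cheaper than issuing one UPDATE per column.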

Updating in Batches for Large Tables

Updating millions or billions of rows across multiple columns presents write-scaling challenges. Attempting a single huge update transaction may work for smaller datasets, but risks contention issues or timeouts at scale.

PostgreSQL supports breaking giant updates into smaller, more manageable batches, most commonly with cursors or by repeatedly updating a key-bounded slice of rows. For example, with a cursor:

BEGIN;

DECLARE update_cursor CURSOR FOR 
    SELECT id, column1, column2 FROM huge_table
    FOR UPDATE;

-- Repeated in a loop until the cursor is exhausted:
FETCH 1 FROM update_cursor;

UPDATE huge_table SET  
  column1 = ... , 
  column2 = ...
WHERE CURRENT OF update_cursor;

CLOSE update_cursor;
COMMIT;

Here the cursor walks the locked result set: each FETCH advances the cursor, and WHERE CURRENT OF updates the row the cursor is currently positioned on, so the FETCH/UPDATE pair runs in a loop driven from application code or PL/pgSQL. The FOR UPDATE clause locks the fetched rows against concurrent changes during processing. Note that a FOR UPDATE cursor lives inside a single transaction; to commit between batches, update a key-bounded slice of rows per statement instead.

In practice, batch sizes in the hundreds to low thousands of rows typically strike the best balance: larger batches hold locks and transactions open longer, while smaller ones pay proportionally more per-transaction overhead.
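A common alternative to cursors is to batch by key, committing between slices. A sketch, assuming a hypothetical needs_update flag marking pending rows:

UPDATE huge_table
SET column1 = 'new value',
    column2 = 'other value',
    needs_update = FALSE
WHERE id IN (
  SELECT id FROM huge_table
  WHERE needs_update
  ORDER BY id
  LIMIT 1000
  FOR UPDATE SKIP LOCKED
);

Each execution claims up to 1000 unprocessed rows (SKIP LOCKED lets concurrent workers cooperate without blocking each other), so the statement simply runs in a loop until it reports zero rows updated.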

Partitioning for Segmented Updates

In addition to batching, table partitioning provides another scaling lever for large updates by splitting one logical table into smaller physical child tables. Updates that target individual partitions scan far less data, and separate batch jobs can work on different partitions concurrently instead of contending on one huge table.

Table partitioning in PostgreSQL remains an advanced technique, but it unlocks update throughput that a single monolithic table struggles to match, especially when combined with batching.
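As a concrete sketch of the idea using PostgreSQL's declarative partitioning (table and partition names are illustrative):

CREATE TABLE events (
  id         BIGINT NOT NULL,
  created_at DATE   NOT NULL,
  status     TEXT
) PARTITION BY RANGE (created_at);

CREATE TABLE events_2023 PARTITION OF events
  FOR VALUES FROM ('2023-01-01') TO ('2024-01-01');
CREATE TABLE events_2024 PARTITION OF events
  FOR VALUES FROM ('2024-01-01') TO ('2025-01-01');

-- This update's WHERE clause prunes to the events_2023 partition only:
UPDATE events SET status = 'archived'
WHERE created_at >= '2023-01-01' AND created_at < '2024-01-01';

Because the planner prunes untouched partitions, each batch job reads and locks only the slice of data it actually changes.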

Updates Requiring Index Modifications

We've primarily focused on updates aimed at column data. But special consideration is needed when updates touch indexed columns.

In PostgreSQL, every UPDATE writes a new row version. If an indexed column changes, a new index entry is added for the new value while the old entry lingers as dead space until VACUUM reclaims it, so heavy updates bloat indexes and can degrade queries until the index is vacuumed or rebuilt. (When no indexed column changes, heap-only tuple (HOT) updates can often skip index writes entirely.)

Let's illustrate with an accounts table using a balance index to optimize queries filtering by account value:

accounts

id name balance
1 John Doe 500
2 Jane Smith 200

balance_amount_idx index

balance id
200 2
500 1

Now let's run an update doubling all balances:

UPDATE accounts SET balance = balance * 2;

While our goal was updating the amounts themselves, the side effect is that the index now carries a fresh entry for each new value while the old entries remain behind as dead space:

accounts

id name balance
1 John Doe 1000
2 Jane Smith 400

balance_amount_idx index

balance id
200 2 (dead)
400 2
500 1 (dead)
1000 1

Until VACUUM reclaims the dead entries or the index is rebuilt with REINDEX, the index holds roughly twice as many entries as live rows, so index scans do more I/O than necessary and queries can perform poorly.

Planning Index Maintenance

Thankfully, index maintenance after a major update is typically a one-time cost rather than continuous overhead. But DBAs should plan index rebuild or vacuum windows following large batch changes to ensure optimal performance across queries.

Tools like pg_repack can rebuild indexes with minimal locks for near-zero downtime. So updates need not undermine query response even on active workloads.
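The maintenance itself is brief; a sketch against the accounts example above (REINDEX ... CONCURRENTLY requires PostgreSQL 12 or later):

-- Rebuild the bloated index without blocking reads or writes:
REINDEX INDEX CONCURRENTLY balance_amount_idx;

-- Or let routine maintenance reclaim dead entries in place:
VACUUM ANALYZE accounts;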

Using RETURNING to Capture Updated Results

Earlier we introduced the RETURNING clause to return updated rows following mutations. Capturing these results provides several benefits:

  • Review updates for quality assurance against expectations
  • Further propagate changes into other systems like search indexes
  • Build audit trails showing historical changes

For example, consider an inventory table tracking product stock data:

CREATE TABLE products (
  id BIGSERIAL PRIMARY KEY,
  name TEXT NOT NULL,
  quantity_on_hand INTEGER  
);

Now let's insert a few rows:

INSERT INTO products (name, quantity_on_hand) VALUES
    ('Widget', 100),
    ('Sprocket', 250);

We can update while returning the mutated rows:

UPDATE products
   SET quantity_on_hand = quantity_on_hand - 10
WHERE name LIKE '%Widget'
RETURNING *;
id name quantity_on_hand
1 Widget 90

Rather than assuming the update succeeded, actually retrieving the rows provides verification. If the expected quantity was not 90, we would catch the issue immediately rather than letting data drift out of sync over time.

Enabling Audit Trails

Going a step further, the returned rows could also populate an auditing table to build history of all changes over time:

CREATE TABLE audits (
  updated_at TIMESTAMPTZ NOT NULL,
  table_name TEXT NOT NULL, 
  changed_data JSONB NOT NULL  
);

WITH updated_rows AS (
  UPDATE products
     SET quantity_on_hand = quantity_on_hand - 10
   WHERE name LIKE '%Widget'
   RETURNING *
)
INSERT INTO audits (updated_at, table_name, changed_data)
SELECT CURRENT_TIMESTAMP, 'products', to_jsonb(updated_rows)
FROM updated_rows;

Now all product inventory changes become visible, enabling crucial insights like identifying trends or detecting anomalies. RETURNING enables tracking data provenance across analysis, troubleshooting, and regulatory use cases.

Performance Tradeoffs

However, some caveats around performance exist. Returning thousands of updated rows could introduce overhead for extremely high throughput applications. There are also alternatives like database triggers that can react to updates.

In most cases RETURNING strikes the right balance, but should be tested against throughput requirements. Application monitoring helps spot contention early, especially when initially enabling new auditing workflows.
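The trigger alternative mentioned above can be sketched as follows, reusing the audits table from earlier (function and trigger names are illustrative; EXECUTE FUNCTION requires PostgreSQL 11 or later):

CREATE OR REPLACE FUNCTION log_product_change() RETURNS trigger AS $$
BEGIN
  INSERT INTO audits (updated_at, table_name, changed_data)
  VALUES (CURRENT_TIMESTAMP, TG_TABLE_NAME, to_jsonb(NEW));
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER products_audit
AFTER UPDATE ON products
FOR EACH ROW EXECUTE FUNCTION log_product_change();

Unlike RETURNING, the trigger fires for every update path, including ones application code forgot to instrument, at the cost of per-row overhead on every write.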

Updating Joins with Care

As we explored previously, PostgreSQL permits an UPDATE to join against other tables via the FROM clause to propagate changes. For example:

UPDATE table1
  SET column1 = new_value
FROM table2 
WHERE 
  table1.id = table2.id
  AND table2.name LIKE '%target%';

This pattern proves powerful for atomic changes across normalized data. However, several pitfalls exist around performance and transactional semantics when updating joins:

Understanding Behavior

Notably, only the target table of an UPDATE ... FROM is modified; the joined tables merely supply data. And if the join matches a target row against multiple rows from the FROM tables, only one of those matches is used for the update, and which one is not deterministic. Application code should therefore ensure join conditions yield at most one match per target row.

Furthermore, updates and foreign keys can interact unexpectedly within multi-statement transactions: an update may fail if a constraint sees an inconsistent intermediate state between statements, unless the constraint is declared DEFERRABLE so checking waits until commit.
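When that intermediate-state problem bites, deferring the constraint is often the cleanest fix; a sketch assuming a foreign key declared DEFERRABLE (the constraint name is illustrative):

BEGIN;
SET CONSTRAINTS orders_user_id_fkey DEFERRED;
-- The statements below may briefly violate the FK; it is checked at COMMIT:
UPDATE users  SET id = id + 1000 WHERE id < 100;
UPDATE orders SET user_id = user_id + 1000 WHERE user_id < 100;
COMMIT;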

In other words – updates across joins trade simplicity for surprise. Knowledge of potential performance and transactional issues is vital.

Improving Performance

On the performance front, an UPDATE ... FROM with a loose join condition risks an expensive cross join, massively amplifying the rows to process. Explicit join conditions and selective WHERE filters keep the update bounded.

Furthermore, every updated row acquires a row-level lock held until commit, so huge joined updates can block concurrent writers across enormous numbers of rows. Staging new values into a table first via COPY, then updating from that staging table in batches, often improves throughput.

In summary, UPDATE ... FROM enables simple logic but can mask scalability limits. Real-world referential data often reaches proportions demanding batching, partitioning, or queue-based architectures.
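The staging approach mentioned above might look like this (the file path and table names are illustrative):

-- 1. Stage the new values with an efficient bulk load:
CREATE TEMP TABLE staged_updates (id BIGINT PRIMARY KEY, new_value TEXT);
COPY staged_updates FROM '/tmp/updates.csv' WITH (FORMAT csv);

-- 2. Apply them with a single indexed join, no cross-join risk:
UPDATE table1 t
SET column1 = u.new_value
FROM staged_updates u
WHERE t.id = u.id;

COPY transfers data far faster than row-by-row statements, and the final UPDATE joins on primary keys, keeping the work bounded and predictable.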

Conclusion

From altering individual columns to bulk actions across billion row tables, PostgreSQL provides extensive update functionality fundamental to managing dynamic data at scale.

However, proper indexing, batch sizes, join strategies, and transaction handling differentiate high performance updates from those bottlenecking production workloads. Planning around limitations like single node write caps or stale indexes after large modifications prevents subtle data drift or instability.

Ultimately, updating in PostgreSQL distills down to a balancing act: giving developers the expressive power to adapt applications without compromising the scalability or uptime that operations teams must protect. Used judiciously alongside monitoring and capacity planning, UPDATE commands unlock the full potential of PostgreSQL as a data platform able to grow and evolve business datasets according to real-world requirements.
