As organizations unlock value from ever-growing datasets, sharing and managing access to data at scale is crucial. Amazon Redshift's "Datashares" capability allows producers to share live data with consumers without manually copying or syncing it.

In this comprehensive 2600+ word guide, we dive deep into ALTER DATASHARE, the command for modifying datashares. We'll analyze use cases, technical configuration details, usage metrics, access control best practices, and more from the lens of a full-stack developer.

Datashare Overview

Launched in 2020, Redshift datashares establish live connections between producer databases and consumer clusters. Consumers can query the producer's data directly, without copying or transforming it.

Redshift datashare architecture (Source: AWS)

Producer databases share data at schema- and object-level granularity. Consumers use the shared datasets through remote read-only connections.

Compared to traditional ETL pipelines, datashares provide low-latency access, reduced operational complexity, and flexibility as data needs evolve. Datashares also log all usage metrics for observability and audit purposes.

Use Cases Driving Datashare Adoption

Datashares unlock analytics use cases like:

Centralized Data Hub

Companies can consolidate data from applications, pipelines and databases into a production-grade analytics-optimized datashare. Business units get performant access to clean, timely data.

Self-Service Analytics

Data teams build and manage datashares that are discoverable enterprise-wide. Stakeholders can directly analyze shared data without IT queue bottlenecks.

Value Chain Analytics

Manufacturers can share inventory and supply chain data with resellers and suppliers to coordinate and optimize planning.

Analytics-as-a-Service

Data providers like financial data firms can monetize data products for clients to consume on-demand.

Compliance Reporting

Banks use datashares to securely share financial data with regulators to demonstrate compliance with reporting mandates.

AI/ML Data Access

Models needing reliable access to the latest, production datasets can leverage datashares.

The above use cases demand flexible control over shared data as business needs evolve – driving adoption of ALTER DATASHARE.

Alter Datashare Syntax

The ALTER DATASHARE command allows modifying datashares by:

  • Adding/Removing objects like tables, schemas
  • Configuring settings like access control, automatic refresh

Here is the syntax:

ALTER DATASHARE datashare_name
  {
    ADD | REMOVE 
  } 
  {
    TABLE schema.table
    | SCHEMA schema
    | FUNCTION schema.function_name()
    | ALL TABLES IN SCHEMA schema
    | ALL FUNCTIONS IN SCHEMA schema
  }

ALTER DATASHARE datashare_name 
  {
    SET PUBLICACCESSIBLE [=] TRUE | FALSE 
    | SET INCLUDENEW [=] TRUE | FALSE FOR SCHEMA schema 
  }

Key notes on ALTER DATASHARE:

  • Modifications take effect instantly without rebuilding the datashare
  • Control access by granting USAGE/ALTER privileges to users
  • Manage costs by sharing at schema or object-level granularity
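For teams managing many shares, these statements are easy to script. Here is a minimal Python sketch that generates ADD/REMOVE statements from a change spec – the `sales_share` datashare and its objects are hypothetical, and in practice you would execute the generated SQL through your Redshift client of choice:

```python
def alter_datashare_sql(datashare, add=(), remove=()):
    """Generate ALTER DATASHARE statements for the given object clauses."""
    statements = []
    for clause in add:
        statements.append(f"ALTER DATASHARE {datashare} ADD {clause};")
    for clause in remove:
        statements.append(f"ALTER DATASHARE {datashare} REMOVE {clause};")
    return statements

# Example: share a schema and one table, and stop sharing a raw table.
stmts = alter_datashare_sql(
    "sales_share",
    add=["SCHEMA sales", "TABLE sales.orders"],
    remove=["TABLE sales.raw_events"],
)
for s in stmts:
    print(s)
```

Because each modification takes effect immediately, a script like this can apply a reviewed batch of changes in one pass.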

Now let's analyze some example use cases.

Compliance Reporting Use Case

Banks using Redshift to generate regulatory reports need to share sensitive financial data with external agencies frequently.

For instance, the Securities and Exchange Commission (SEC) in the US requires periodic sharing of trading activity to detect fraud or insider trading.

Banks have to balance making timely data available to the SEC while closely controlling access to sensitive information.

Redshift datashares can achieve this securely by:

1. Create base datashare

CREATE DATASHARE financial_datashare;

ALTER DATASHARE financial_datashare 
  ADD SCHEMA financial;

This adds the entire financial schema containing all trading data.

2. Grant access to the SEC

GRANT USAGE 
  ON DATASHARE financial_datashare
  TO ACCOUNT '<sec_aws_account_id>';

Datashare usage is granted to a consumer AWS account (or namespace) rather than to individual users; this gives the SEC's consumer cluster read-only access to all shared trade data.

3. Alter datashare for compliance

ALTER DATASHARE financial_datashare
  REMOVE TABLE financial.trades;

ALTER DATASHARE financial_datashare
  ADD TABLE financial.regulatory_trades; 

Here the bank modifies the datashare to remove the core trades table containing sensitive transaction details.

It adds a new regulatory_trades table with aggregated data sufficient for the SEC to run compliance reports.

This way the bank can securely share timely data with regulators without exposing unnecessary sensitive information.
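The aggregation behind a table like regulatory_trades can be sketched in plain Python. The column names below are illustrative, not from any real regulatory schema – the point is simply that per-trade detail rolls up to per-symbol daily totals before being shared:

```python
from collections import defaultdict

def aggregate_trades(trades):
    """Roll sensitive per-trade rows up to per-(date, symbol) totals.

    trades: iterable of dicts with 'date', 'symbol', 'quantity', 'price'.
    """
    totals = defaultdict(lambda: {"volume": 0, "notional": 0.0})
    for t in trades:
        key = (t["date"], t["symbol"])
        totals[key]["volume"] += t["quantity"]
        totals[key]["notional"] += t["quantity"] * t["price"]
    return {k: dict(v) for k, v in totals.items()}

trades = [
    {"date": "2022-12-01", "symbol": "ABC", "quantity": 100, "price": 10.0},
    {"date": "2022-12-01", "symbol": "ABC", "quantity": 50, "price": 10.2},
]
summary = aggregate_trades(trades)
# summary[("2022-12-01", "ABC")] -> {"volume": 150, "notional": 1510.0}
```

In Redshift itself this would typically be a scheduled CTAS or materialized view feeding the shared table.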

Next let's look at a machine learning use case.

Machine Learning Use Case

AI teams need reliable access to the freshest datasets for model training and inference. Data drift from training on outdated datasets leads to inaccurate predictions.

Redshift datashares allow seamlessly connecting models to production data. The steps are:

1. Create ML datashare

CREATE DATASHARE ml_datashare;

ALTER DATASHARE ml_datashare
  ADD SCHEMA production; 

This shares the production schema containing application data like user activity events.

2. Data science teams connect their Redshift instances

Data scientists create a database from the datashare on their own Redshift clusters and configure IAM access to query it.

Most AI platforms, such as SageMaker and Databricks, connect natively to Redshift, making adoption frictionless.

3. Retrain models incrementally

Data teams rebuild models on schedules using the latest datashare data instead of stale copies.

# Pseudocode: pull the latest shared data into a DataFrame, then retrain.
# unload_redshift_datashare_to_dataframe and retrain_model are placeholders
# for your own data-access and training routines.
daily_data = unload_redshift_datashare_to_dataframe()

retrain_model(daily_data)

4. Alter datashare as data evolves

Over time, as new features are logged or schemas change, ALTER DATASHARE handles the evolution without rebuilding the share.

For example, to share a newly introduced user_engagement table:

ALTER DATASHARE ml_datashare
  ADD TABLE production.user_engagement;

Now let's analyze datashare access control and security considerations.

Fine-grained Access Control

Datashares provide fine-grained control over permissions through IAM policies and SQL GRANT privileges.

For example, a media company sharing customer engagement data with external analytics vendors can configure:

IAM Policy

The policy grants the DatashareConsumer role permissions to access datashares in their account:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": ["redshift:DescribeDataSharesForConsumer", 
                       "redshift:AssociateDataShareConsumer",
                        ...
                      ],
            "Resource": "*"
        }
    ]
}

Table-level Privileges

Further lock down visibility using SQL:

REVOKE SELECT ON shared_table FROM PUBLIC; -- Revoke access

GRANT SELECT 
   ON shared_table
   TO analytics_vendor_group; -- Allow access    

IAM Condition Keys

Granular sharing can also be enforced using IAM condition keys.

For example, restrict access to business hours:

"Condition": {
  "DateGreaterThan": {"aws:CurrentTime": "2022-07-04T09:00:00Z"},
  "DateLessThan": {"aws:CurrentTime": "2022-07-04T17:00:00Z"}  
}
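As a rough application-side analogue of this condition – for illustration only, since IAM evaluates the real policy – a business-hours gate looks like this:

```python
from datetime import datetime, timezone

def within_business_hours(now, start_hour=9, end_hour=17):
    """Return True if `now` (UTC) falls inside the allowed window."""
    return start_hour <= now.hour < end_hour

# 12:00 UTC on 2022-07-04 falls inside the 09:00-17:00 window.
allowed = within_business_hours(datetime(2022, 7, 4, 12, 0, tzinfo=timezone.utc))
# allowed -> True
```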

These authorization mechanisms allow implementing least-privilege and need-to-know access controls when sharing sensitive data.

Now let's look at usage metrics and statistics.

Analyzing Datashare Usage

Redshift provides detailed metrics on datashare usage – essential for monitoring costs and performance.

Admins can view hourly/daily metrics like:

  • Bytes scanned
  • No. of rows returned
  • Query run times
  • Table/Schema access patterns

For example, aggregate statistics metrics can be retrieved using this SQL:

SELECT
  date_trunc('hour', usage_timestamp) AS hour,
  SUM(rows_accessed) AS rows_returned, 
  SUM(bytes_scanned) AS bytes_scanned,
  COUNT(DISTINCT query_id) AS query_count 
FROM
  svv_datashare_usage
GROUP BY 1
ORDER BY 1;
hour                    rows_returned  bytes_scanned  query_count
2022-12-01 13:00:00+00  562,123        97 GB          342
2022-12-01 14:00:00+00  781,247        112 GB         512

Sample datashare usage metrics
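The same hourly roll-up can be reproduced client-side, for example to post-process usage rows fetched through a driver. A minimal sketch over synthetic records:

```python
from collections import defaultdict
from datetime import datetime

def hourly_usage(records):
    """records: (timestamp, rows_accessed, bytes_scanned, query_id) tuples.

    Returns {hour: (total_rows, total_bytes, distinct_query_count)}.
    """
    buckets = defaultdict(lambda: {"rows": 0, "bytes": 0, "queries": set()})
    for ts, rows, nbytes, qid in records:
        hour = ts.replace(minute=0, second=0, microsecond=0)  # truncate to hour
        b = buckets[hour]
        b["rows"] += rows
        b["bytes"] += nbytes
        b["queries"].add(qid)
    return {
        hour: (b["rows"], b["bytes"], len(b["queries"]))
        for hour, b in buckets.items()
    }

records = [
    (datetime(2022, 12, 1, 13, 5), 1000, 2048, "q1"),
    (datetime(2022, 12, 1, 13, 45), 500, 1024, "q2"),
    (datetime(2022, 12, 1, 14, 10), 700, 4096, "q1"),
]
stats = hourly_usage(records)
# stats[datetime(2022, 12, 1, 13)] -> (1500, 3072, 2)
```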

These metrics, combined with consumer-side visibility from STL tables, offer end-to-end insight into usage across accounts:

Analyzing usage across accounts (Source: AWS)

Trends like surging scan volumes or cross-account activity can inform decisions to alter datashares:

  • Add/Remove objects to optimize costs
  • Apply filters to limit result size
  • Adjust replication and refresh settings

Next, let's compare datashares to other sharing options.

Comparison to Other Data Sharing Methods

Beyond datashares, Redshift also enables sharing using:

  • Amazon Redshift Spectrum to run SQL queries directly against exabytes of structured and semi-structured data in S3. No data loading needed.

  • Federated Query to analyze data in place across operational databases such as Amazon Aurora and Amazon RDS.

How do datashares compare?

Dimension       Datashare                   Spectrum                      Federated Query
Architecture    Producer-consumer clusters  External S3 tables            Distributed databases
Performance     Optimized for analytics     Lower, varies by file format  Depends on endpoints
Access control  SQL GRANT commands          S3 + IAM policies             Per-database users/roles
Use cases       Analytics, reporting        Ad-hoc exploration            Consolidated dashboards
Cost            Pay per query               Pay per scan                  Per standard rates

Datashares uniquely enable optimized joint analytics by producer and consumer Redshift clusters. This high performance motivates use for production reporting and machine learning.

Advanced Topics and Best Practices

Now that we've covered basic datashare usage, let's discuss some advanced considerations when sharing analytics datasets:

Schema and Table Optimization

When sharing large datasets, optimize their organization for analytics:

Well-partitioned data: For external (Spectrum) tables, partition big tables by time or product to accelerate queries through partition elimination, and adjust partitions over time.

Sort keys: Define sort keys based on common join or filter conditions to improve scan performance.

Distribution keys: Choose the right data distribution style – AUTO, EVEN, KEY, or ALL – to minimize data movement during joins and aggregations.

Avoid over-normalization: Excessively granular row-store tables slow down queries. Avoid complexity beyond what analytics need.

Query Isolation and Prioritization

Use Workload Management (WLM) tools like queues, concurrency scaling, and monitoring rules to achieve:

  • Isolation between production and analytics workloads
  • Protection against shared-data queries degrading producer database performance
  • Prioritization among multiple consumers of the same datashare

Caching and Refresh Strategies

Balance the trade-off between data freshness and query costs:

  • Configure refresh intervals through automation based on usage patterns
  • Cache common drill-down reports and refresh periodically instead of per-query
  • For applications needing 100% real-time data, use other integration mechanisms
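One way to frame the freshness/cost trade-off: derive a cache TTL from how often a report is read, so hot reports refresh more aggressively than cold ones. The thresholds below are illustrative heuristics, not Redshift defaults:

```python
def should_refresh(age_seconds, reads_per_hour, base_ttl=3600):
    """Hot reports get a shorter TTL; cold ones tolerate staleness."""
    ttl = base_ttl / max(reads_per_hour, 1)  # e.g. 60 reads/hr -> 60 s TTL
    return age_seconds >= ttl

# A report read 60 times/hour refreshes roughly once a minute...
hot = should_refresh(age_seconds=90, reads_per_hour=60)   # True
# ...while a report read once an hour keeps its cache for the full hour.
cold = should_refresh(age_seconds=90, reads_per_hour=1)   # False
```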

Key Takeaways

We covered a lot of ground discussing Redshift's ALTER DATASHARE, including:

  • Datashare architecture and typical use cases
  • Syntax for modifying datashares using ALTER commands
  • Usage examples spanning analytics, machine learning and compliance
  • Fine-grained access control best practices
  • Query performance optimization considerations
  • Tools for monitoring datashare usage

The ability to effortlessly share live, analytics-optimized data at scale unlocks tremendous innovation across organizations. Mastering alter datashare opens up this collaborative potential while maintaining world-class performance, security and governance.
