sqlinstance: Update instancereader with cache backed by rangefeed#69976

Merged
craig[bot] merged 1 commit into cockroachdb:master from rimadeodhar:instance_id_cache on Oct 27, 2021
Conversation

@rimadeodhar
Collaborator

@rimadeodhar rimadeodhar commented Sep 9, 2021

This PR adds a rangefeed-backed cache for the sqlinstance
table. It prevents the SQL instance lookup code from
making multiple round trips to the backing table. This should yield
meaningful performance gains once we support multi-region
deployments for Serverless.

Release note (performance improvement): The sqlinstance subsystem
no longer reads from the backing SQL table on every request for
SQL instance details. This will improve performance once
multi-region setups are supported for the multi-tenant
architecture.

@cockroach-teamcity
Member

This change is Reviewable

@rimadeodhar rimadeodhar added the do-not-merge label (bors won't merge a PR with this label) on Sep 9, 2021
Contributor

@ajwerner ajwerner left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @rimadeodhar)


pkg/sql/catalog/lease/lease_internal_test.go, line 1 at r1 (raw file):

// Copyright 2015 The Cockroach Authors.

?


pkg/sql/sqlinstance/instancestorage/instancereader.go, line 77 at r1 (raw file):

Quoted 10 lines of code…
	cacheConfig := cache.Config{
		Policy: cache.CacheLRU,
		ShouldEvict: func(size int, key, value interface{}) bool {
			// SQL instance data is expected to stay small enough
			// (order of a few KB), so we don't need an eviction strategy
			// and can retain cached data indefinitely.
			return false
		},
	}
	r.mu.instances = cache.NewUnorderedCache(cacheConfig)

Why use this data structure if you're never going to evict? Why not just a *btree.BTree?


pkg/sql/sqlinstance/instancestorage/instancereader.go, line 163 at r1 (raw file):

	}

	instancesTablePrefix := r.codec.TablePrefix(uint32(r.tableID))

Consider tightening this to just the primary index.


pkg/sql/sqlinstance/instancestorage/instancereader_test.go, line 52 at r1 (raw file):

	tDB := sqlutils.MakeSQLRunner(tc.ServerConn(0))
	dbName := t.Name()
	s := tc.Server(0)

The test is failing because rangefeeds aren't enabled on the table used for the test. They are enabled by default on all system table ranges and on all tenant ranges.

See this error:

W210914 16:49:20.321486 1529 kv/kvclient/rangefeed/rangefeed.go:228  [rangefeed=sql_instances] 62  rangefeed failed 0 times, restarting: kv/kvserver/replica_rangefeed.go:151: rangefeeds require the kv.rangefeed.enabled setting. See https://www.cockroachlabs.com/docs/v21.2/change-data-capture.html#enable-rangefeeds-to-reduce-latency

Contributor

@ajwerner ajwerner left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @rimadeodhar)


pkg/sql/sqlinstance/sqlinstance.go, line 33 at r1 (raw file):

Quoted 9 lines of code…
// AddressResolver exposes API for retrieving the instance address and all live instances for a tenant.
type AddressResolver interface {
	// GetInstance returns the InstanceInfo for a pod given the instance ID.
	// Returns a NonExistentInstanceError if an instance with the given ID
	// does not exist.
	GetInstance(context.Context, base.SQLInstanceID) (InstanceInfo, error)
	// GetAllInstances returns a list of all SQL instances for the tenant.
	GetAllInstances(context.Context) ([]InstanceInfo, error)
}

I've been thinking about these interfaces a bit. How would you feel about simplifying the interface to something like what I have below? I also took a hammer to the type names. Consider reading them as sqlinstance.Info, sqlinstance.InfoIterator, sqlinstance.Resolver, etc.

// Info exposes information on a SQL instance such as ID, network address and
// the associated sqlliveness.SessionID.
type Info struct {
	InstanceID   base.SQLInstanceID
	InstanceAddr string
	SessionID    sqlliveness.SessionID
}

// InfoIterator is used to iterate instance info stored in a resolver. 
type InfoIterator func(Info) error

// Resolver exposes API for retrieving the instance address and all live instances for a tenant.
type Resolver interface {
    // Iterate iterates the set of known, live instances stored in the resolver.
    // From indicates the first instance from which iteration should occur. Iteration
    // proceeds in ascending order, halting after the instance with the largest known
    // instance ID. Returning iterutil.StopIteration will stop the iteration and return
    // no error. 
    Iterate(ctx context.Context, from base.SQLInstanceID, f InfoIterator) error
}

// GetInfo returns the Info for an instance given the instance ID as reported
// by the Resolver. Returns a NonExistentInstanceError if an instance with the given ID
// does not exist.
func GetInfo(ctx context.Context, r Resolver, id base.SQLInstanceID) (Info, error) {
	var ret Info
	if err := r.Iterate(ctx, id, func(info Info) error {
		ret = info
		return iterutil.StopIteration()
	}); err != nil {
		return Info{}, err
	}
	if ret.InstanceID == id {
		return ret, nil
	}
	return Info{}, NonExistentInstanceError
}

You can then implement this interface on a slice like:

// Infos is a slice of Info. If it is sorted, it can be used as a Resolver. Calls to
// Iterate have undefined behavior if Infos is not sorted.
type Infos []Info

var _ Resolver = (Infos)(nil)

// Iterate implements Resolver, binary-searching for the first instance
// with an ID at or above from.
func (infos Infos) Iterate(ctx context.Context, from base.SQLInstanceID, f InfoIterator) error {
	idx := sort.Search(len(infos), func(i int) bool {
		return infos[i].InstanceID >= from
	})
	for i := idx; i < len(infos); i++ {
		if err := f(infos[i]); err != nil {
			if iterutil.Done(err) {
				err = nil
			}
			return err
		}
	}
	return nil
}

I think that such an interface will make the cache nicer to use.

Collaborator Author

@rimadeodhar rimadeodhar left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @rimadeodhar)


pkg/sql/sqlinstance/sqlinstance.go, line 33 at r1 (raw file):

Previously, ajwerner wrote…
I've been thinking about these interfaces a bit. How would you feel about simplifying the interface to something like what I have below? I also took a hammer to the type names. […] I think that such an interface will make the cache nicer to use.

Maybe it's just me, but I find this interface harder to use. Also, I'm averse to updating the external-facing API of the sqlinstance subsystem for a caching change, which should be an internal implementation detail of sqlinstance. I could be convinced otherwise if the reason were compelling enough, but in this case I feel the original interface is good enough. Thoughts?


pkg/sql/sqlinstance/instancestorage/instancereader.go, line 77 at r1 (raw file):

Previously, ajwerner wrote…
why use this data structure if you're never going to evict? Why not just a *btree.BTree?

I just (mistakenly) assumed the cache data structure is what we use everywhere in the code for caching. I'll update this.


pkg/sql/sqlinstance/instancestorage/instancereader.go, line 163 at r1 (raw file):

Previously, ajwerner wrote…

Consider tightening this to just the primary index.

Can you please elaborate on this? How can I reference just the primary index? All other rangefeeds seem to reference the entire table; is this the recommended approach going forward?


pkg/sql/sqlinstance/instancestorage/instancereader_test.go, line 52 at r1 (raw file):

Previously, ajwerner wrote…

The test is failing because rangefeeds aren't enabled on the table used for the test. They are enabled by default on all system table ranges and on all tenant ranges.


Thanks!! This fixed the test.

@rimadeodhar rimadeodhar removed the do-not-merge label (bors won't merge a PR with this label) on Sep 17, 2021
@rimadeodhar rimadeodhar requested a review from a team September 17, 2021 21:07
@rimadeodhar rimadeodhar marked this pull request as ready for review September 17, 2021 21:10
RaduBerinde added a commit to RaduBerinde/cockroach that referenced this pull request Sep 21, 2021
This change adds reporting of the next live instance ID to the tenant
cost controller.

For now we query the sqlinstance.Provider at most once a minute. This
is temporary until the Provider is changed to a range-feed-driven
cache (cockroachdb#69976).

Release note: None

Release justification: Necessary fix for the distributed rate limiting
functionality, which is vital for the upcoming Serverless MVP release.
It allows CRDB to throttle clusters that have run out of free or paid
request units (which measure CPU and I/O usage). This functionality is
only enabled in multi-tenant scenarios and should have no impact on
our dedicated customers.
RaduBerinde added a commit to RaduBerinde/cockroach that referenced this pull request Sep 21, 2021
RaduBerinde added a commit to RaduBerinde/cockroach that referenced this pull request Sep 21, 2021
RaduBerinde added a commit to RaduBerinde/cockroach that referenced this pull request Sep 22, 2021
Member

@RaduBerinde RaduBerinde left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @RaduBerinde, and @rimadeodhar)


pkg/server/server_sql.go, line 951 at r2 (raw file):

	// sqlInstanceProvider must always be started after sqlLivenessProvider
	// as sqlInstanceProvider relies on the session initialized and maintained by
	// sqlLivenessProvider

[nit] period missing


pkg/sql/sqlinstance/sqlinstance.go, line 33 at r1 (raw file):

Previously, rimadeodhar (Rima Deodhar) wrote…

Maybe its just me, but I find this interface harder to use. Also, I'm kind of averse to updating the external facing API of the sqlinstance subsystem for a caching change which should be an internal implementation detail for sqlinstance. I could be convinced otherwise if the reason was compelling enough. But in this case, I feel like the original interface is good enough. Thoughts?

FWIW, I also think the proposed interface feels harder to use.


pkg/sql/sqlinstance/sqlinstance.go, line 50 at r2 (raw file):

	Instance(context.Context) (base.SQLInstanceID, error)
	// Start starts the instanceprovider. This will block until
	// the instancereader has been started.

[nit] instancereader doesn't mean much in the context of documenting the interface


pkg/sql/sqlinstance/instancestorage/instancereader.go, line 101 at r2 (raw file):

			r.stopper.AddCloser(rf)
		}
		return r.getStartError()

[nit] I think it would be cleaner to send the error through the channel


pkg/sql/sqlinstance/instancestorage/instancereader.go, line 110 at r2 (raw file):

	}
}
func (r *Reader) maybeStartRf(ctx context.Context) *rangefeed.RangeFeed {

[nit] maybeStartRangeFeed.


pkg/sql/sqlinstance/instancestorage/instancereader.go, line 112 at r2 (raw file):

func (r *Reader) maybeStartRf(ctx context.Context) *rangefeed.RangeFeed {
	if r.started() {
		// Nothing to do, return

[nit] We're doing a bunch of stuff to tolerate invalid usage like calling Start twice or calling the other functions without Start; this feels unnecessary.


pkg/sql/sqlinstance/instancestorage/instancereader.go, line 133 at r2 (raw file):

		// This event corresponds to a deletion.
		if tombstone {
			r.updateInstanceMap(instance, true)

[nit] can just do r.updateInstanceMap(instance, tombstone)


pkg/sql/sqlinstance/instancestorage/instancereader.go, line 196 at r2 (raw file):

}

// GetAllInstances implements sqlinstance.AddressResolver interface.

[nit] It's worth documenting that this implementation does not block (on RPCs or the like). Well, at least when slReader is a cached reader but I assume that's the case outside of tests.


pkg/sql/sqlinstance/instancestorage/instancereader.go, line 254 at r2 (raw file):

// getAllInstanceRows returns all instancerow objects contained
// within the map.

[nit] add ", in arbitrary order."


pkg/sql/sqlinstance/instancestorage/instancereader.go, line 286 at r2 (raw file):

}

func (r *Reader) getStartError() error {

[nit] checkStarted might be a better name


pkg/sql/sqlinstance/instancestorage/instancestorage.go, line 166 at r2 (raw file):

		return instancerow{}, errors.Wrapf(err, "could not decode data for instance %d", instanceID)
	}
	if tombstone {

[nit] this seems redundant with the Value check above. I'd just ignore the new return.


pkg/sql/sqlinstance/instancestorage/instancestorage.go, line 212 at r2 (raw file):

		}
		if tombstone {
			// Corresponds to deleted data, continue.

[nit] A Scan does not return deleted data. I'd just ignore the new return.

Contributor

@ajwerner ajwerner left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @RaduBerinde, and @rimadeodhar)


pkg/sql/sqlinstance/sqlinstance.go, line 33 at r1 (raw file):

Previously, RaduBerinde wrote…

FWIW I also think the proposed interface feels harder to use..

I buy that :). My motivation was:

  1. The returned slice is mutable so either we're going to allocate a new one each time or we need the caller to not mutate it. I am almost definitely over-allergic to allocations.
  2. I was hoping to unify getting a single ID and getting multiple.

I'm willing to accept neither are worth it.

Let's note the slice's ordering properties if we keep the slice; ideally it'd be sorted. Also note the mutability properties.


pkg/sql/sqlinstance/instancestorage/instancereader.go, line 163 at r1 (raw file):

Previously, rimadeodhar (Rima Deodhar) wrote…

Please can you elaborate on this? How can I reference just the primary index? All other rangefeeds seems to reference the entire table, is this the recommended approach going forward?

I think in the other cases it's just being sort of lazy and knowing that we only have a single index, the primary index. That's the case here too. It's interesting to think about what it would mean to run a schema change on this table in subsequent releases. The method would be codec.IndexPrefix and in this case, the primary index is 1.

In light of schema changes, we'd need to consider taking a catalog.TableDescriptor or something like that. Definitely future work. For now the whole table and the index are synonymous.

RaduBerinde added a commit to RaduBerinde/cockroach that referenced this pull request Sep 22, 2021
craig bot pushed a commit that referenced this pull request Sep 23, 2021
69380: docs/RFCs: draft MVCC-ification of bulk operations r=tbg a=dt

Release note: none.
Release justification: none.

70520: tenantcostclient: pass next live instance ID r=RaduBerinde a=RaduBerinde

This set of changes adds tracking instances on the server side. The main motivator at this stage is to detect duplicate requests and avoid double-charging in some cases. A secondary motivator is that doing this later will cause headaches during upgrade.

Informs #68479.

#### tenantcostclient: plumb SQL instance and session ID

This commit plumbs the SQL Instance ID and corresponding
sqlliveness.SessionID to the tenant cost controller.

Release note: None

Release justification: Necessary fix for the distributed rate limiting
functionality, which is vital for the upcoming Serverless MVP release.
It allows CRDB to throttle clusters that have run out of free or paid
request units (which measure CPU and I/O usage). This functionality is
only enabled in multi-tenant scenarios and should have no impact on
our dedicated customers.

#### tenantcost: implement request sequence number

This commit adds a monotonic request sequence number for the
TokenBucket API. This is used to detect duplicate requests and avoid
double-charging.

Release note: None

Release justification: Necessary fix for the distributed rate limiting
functionality, which is vital for the upcoming Serverless MVP release.
It allows CRDB to throttle clusters that have run out of free or paid
request units (which measure CPU and I/O usage). This functionality is
only enabled in multi-tenant scenarios and should have no impact on
our dedicated customers.

#### tenantcostserver: cleanup stale instances

This commit implements the server-side logic for cleaning up stale
instances from the tenant_usage table, according to the following
scheme:

 - each tenant sends the ID of the next instance in circular order.
   The live instance set is maintained on the tenant side by a
   separate subsystem.

 - the server uses this information as a "hint" that some instances
   might be stale. When the next ID does not match the expected value,
   a cleanup for a specific instance ID range is triggered. The
   cleanup ultimately checks that the last update is stale, so that
   stale information from the tenant-side doesn't cause incorrect
   removals.

Instances are cleaned up one at a time, with a limit of 10 instance
removals per request. This solution avoids queries that scan ranges of
the table which may contain a lot of tombstones.

Release note: None

Release justification: Necessary fix for the distributed rate limiting
functionality, which is vital for the upcoming Serverless MVP release.
It allows CRDB to throttle clusters that have run out of free or paid
request units (which measure CPU and I/O usage). This functionality is
only enabled in multi-tenant scenarios and should have no impact on
our dedicated customers.

#### tenantcostclient: pass next live instance ID

This change adds reporting of the next live instance ID to the tenant
cost controller.

For now we query the sqlinstance.Provider at most once a minute. This
is temporary until the Provider is changed to a range-feed-driven
cache (#69976).

Release note: None

Release justification: Necessary fix for the distributed rate limiting
functionality, which is vital for the upcoming Serverless MVP release.
It allows CRDB to throttle clusters that have run out of free or paid
request units (which measure CPU and I/O usage). This functionality is
only enabled in multi-tenant scenarios and should have no impact on
our dedicated customers.

70531: sql: add version gate to regrole type r=RichardJCai a=RichardJCai

Without the version gate, in a mixed cluster setting, creating
a table on the new binary version with type REGROLE can cause
errors when inserting on the old verison.

Errors can also happen on schema changes where the old binary uses
the new type.

Release justification: GA blocker, without a version gate this can
cause panics.
Release note: None. Not in production.

Co-authored-by: David Taylor <tinystatemachine@gmail.com>
Co-authored-by: Radu Berinde <radu@cockroachlabs.com>
Co-authored-by: richardjcai <caioftherichard@gmail.com>
Collaborator Author

@rimadeodhar rimadeodhar left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @RaduBerinde, and @rimadeodhar)


pkg/sql/sqlinstance/instancestorage/instancereader.go, line 101 at r2 (raw file):

Previously, RaduBerinde wrote…

[nit] I think it would be cleaner to send the error through the channel

I have kept it separate because we reuse this method elsewhere to return an error if the Start method hasn't been called or has returned an error.


pkg/sql/sqlinstance/instancestorage/instancereader.go, line 112 at r2 (raw file):

Previously, RaduBerinde wrote…

[nit] We're doing a bunch of stuff to tolerate invalid usage like calling Start twice or calling the other functions without Start; this feels unnecessary.

I agree, but it feels necessary, as the functionality executed in Start() is critical for the Reader's other methods to work correctly.

Collaborator Author

@rimadeodhar rimadeodhar left a comment


Updated the PR based on the comments so far. PTAL.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @RaduBerinde, and @rimadeodhar)

RaduBerinde added a commit to RaduBerinde/cockroach that referenced this pull request Sep 24, 2021
Collaborator Author

@rimadeodhar rimadeodhar left a comment


@ajwerner: Gentle ping for a review.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @RaduBerinde, and @rimadeodhar)

Contributor

@ajwerner ajwerner left a comment


Sorry it took me so long to get back to this 😓

Reviewed 1 of 11 files at r1, 1 of 7 files at r2, 4 of 7 files at r3, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @RaduBerinde and @rimadeodhar)


pkg/sql/sqlinstance/instancestorage/instancereader.go, line 96 at r3 (raw file):

Quoted 7 lines of code…
	case <-r.initialScanDone:
		if rf != nil {
			// Add rangefeed to the stopper to ensure it
			// is shutdown correctly.
			r.stopper.AddCloser(rf)
		}
		return r.checkStarted()

FWIW: this means that starting the subsystem is a synchronous scan. I'm not sure we need that. Perhaps a TODO or something to make the waiting for an initial scan lazy until calls to read?


pkg/sql/sqlinstance/instancestorage/instancereader.go, line 124 at r3 (raw file):

		})
		if err != nil {
			log.Ops.Warningf(ctx, "failed to decode settings row %v: %v", keyVal.Key, err)

return?


pkg/sql/sqlinstance/instancestorage/instancereader.go, line 154 at r3 (raw file):

		EndKey: instancesTablePrefix.PrefixEnd(),
	}
	rf, err := r.f.RangeFeed(ctx, "sql_instances", instancesTableSpan, r.clock.Now(), updateCacheFn, rangefeed.WithInitialScan(initialScanDoneFn), rangefeed.WithOnInitialScanError(initialScanErrFn))

nit: wrap this

Collaborator Author

@rimadeodhar rimadeodhar left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @RaduBerinde)


pkg/sql/sqlinstance/instancestorage/instancereader.go, line 96 at r3 (raw file):

Previously, ajwerner wrote…
	case <-r.initialScanDone:
		if rf != nil {
			// Add rangefeed to the stopper to ensure it
			// is shutdown correctly.
			r.stopper.AddCloser(rf)
		}
		return r.checkStarted()

FWIW: this means that starting the subsystem is a synchronous scan. I'm not sure we need that. Perhaps a TODO or something to make the waiting for an initial scan lazy until calls to read?

Okay, I have added a TODO for now. Will make a followup PR with the change.

Contributor

@ajwerner ajwerner left a comment


:lgtm:

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner and @RaduBerinde)

Contributor

@ajwerner ajwerner left a comment


Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner, @RaduBerinde, and @rimadeodhar)


pkg/sql/sqlinstance/instancestorage/instancereader_test.go, line 272 at r4 (raw file):

Quoted 15 lines of code…
				if err != sqlinstance.NonExistentInstanceError {
					return errors.Newf("expected non existent instance error")
				}
				return nil
			})
		}
		// Verify request for instance with expired session results in an error.
		{
			err := slStorage.Delete(ctx, sessionIDs[1])
			if err != nil {
				t.Fatal(err)
			}
			testutils.SucceedsSoon(t, func() error {
				_, err = reader.GetInstance(ctx, instanceIDs[0])
				if err != sqlinstance.NonExistentInstanceError {

the linter wants errors.Is here

@rimadeodhar
Collaborator Author

TFTR!

bors r=ajwerner

@craig
Contributor

craig bot commented Oct 27, 2021

Build succeeded:

@RaduBerinde
Member

Is it now safe to update the code below to just run GetAllInstances every time?

const interval = 1 * time.Minute

@rimadeodhar
Collaborator Author

Yes, if the only blocker to running GetAllInstances every time was caching, then you can go ahead and make the change. The sqlinstance subsystem caches all instance data, and GetAllInstances simply retrieves the data from the cache without issuing any SQL queries.

RaduBerinde added a commit to RaduBerinde/cockroach that referenced this pull request Dec 6, 2021
The tenant's cost control code requires a function that retrieves the
next live SQL instance ID. The implementation of this function
currently caches the value and only refreshes it periodically.

This is now unnecessary: cockroachdb#69976 changed the instance reader to use a
rangefeed and serve values from memory.

Release note: None
craig bot pushed a commit that referenced this pull request Dec 7, 2021
73484: kvserver: use background context when applying snapshots r=nvanbenschoten,tbg a=erikgrinaker

Following a3fd4fb, context errors are now propagated back up the
stack when receiving snapshots. However, this triggered a
`maybeFatalOnRaftReadyErr` assertion which crashed the node.

`handleRaftReadyRaftMuLocked` (which is called directly when applying
snapshots) does not appear prepared to safely handle arbitrary context
cancellation, as it is typically called with the Raft scheduler's
background context. Furthermore, by the time we call it we have already
received the entire snapshot, so it does not seem useful to abort
snapshot application just because the caller goes away.

This patch therefore uses a new background context for applying
snapshots, disconnected from the client's context, once the entire
snapshot has been received.

Resolves #73371.
Resolves #73469.
Resolves #73482.

Release note: None

73523: server: remove ad-hoc next-instance-id cache (tenant only) r=RaduBerinde a=RaduBerinde

The tenant's cost control code requires a function that retrieves the
next live SQL instance ID. The implementation of this function
currently caches the value and only refreshes it periodically.

This is now unnecessary: #69976 changed the instance reader to use a
rangefeed and serve values from memory.

Release note: None

Co-authored-by: Erik Grinaker <grinaker@cockroachlabs.com>
Co-authored-by: Radu Berinde <radu@cockroachlabs.com>