Skip to content

feat: Add local backend for storing cursor state#569

Merged
kodiakhq[bot] merged 20 commits intomainfrom
incremental-syncing
Jan 9, 2023
Merged

feat: Add local backend for storing cursor state#569
kodiakhq[bot] merged 20 commits intomainfrom
incremental-syncing

Conversation

@hermanschaaf
Copy link
Copy Markdown
Contributor

@hermanschaaf hermanschaaf commented Jan 3, 2023

A state store allows us to store cursors for #413. This first iteration introduces the interface and implements it for a local JSON storage backend. After some back and forth I decided to call it backend, similar to Terraform. Alternatives considered were state (confusing because it's not a single state) and storage (confusing because it's similar to destination).

Since the interface is a generic key-value store, we lose the ability to know whether a table is incremental or not, and which columns are used as cursor keys if it is. I'm proposing that we add some fields to the Table definition to be able to at least document these aspects, but it's not enforced.

See cloudquery/cloudquery#6336 for an example implementation using this new functionality.

@github-actions github-actions bot added the feat label Jan 3, 2023
@github-actions
Copy link
Copy Markdown

github-actions bot commented Jan 4, 2023

⏱️ Benchmark results

  • DefaultConcurrencyDFS-2 resources/s: 11,498
  • DefaultConcurrencyRoundRobin-2 resources/s: 11,786
  • Glob-2 ns/op: 205
  • TablesWithChildrenDFS-2 resources/s: 26,451
  • TablesWithChildrenRoundRobin-2 resources/s: 25,701
  • TablesWithRateLimitingDFS-2 resources/s: 28.26
  • TablesWithRateLimitingRoundRobin-2 resources/s: 815.8
  • BufferedScanner-2 ns/op: 12.38
  • LogReader-2 ns/op: 38.51

@hermanschaaf hermanschaaf marked this pull request as ready for review January 4, 2023 13:46
@github-actions github-actions bot added feat and removed feat labels Jan 4, 2023
Copy link
Copy Markdown
Contributor

@yevgenypats yevgenypats left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks really good. Don't have a lot of comments apart from some nits.

I do worry a bit on the global lock and Im curious to run some benchmarks with global and non global locks. If there is no difference then a global lock will be simple for sure if there is maybe we can change the api only to lock per table per client.

{{- if .Options.PrimaryKey}}
PrimaryKey: true,
{{- end }}
{{- if .Options.IncrementalKey}}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need this in codegen given we are deprecating this?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we may as well keep it here until the codegen gets removed (and then this entire file will get removed), don't see any harm in it

)

type NewExecutionClientFunc func(context.Context, zerolog.Logger, specs.Source) (schema.ClientMeta, error)
type NewExecutionClientFunc func(context.Context, zerolog.Logger, specs.Source, backend.Backend) (schema.ClientMeta, error)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's introduce an optional struct here so we can have a way of both adding required parameters that will break the API and once that are not required like this one (i.e not all plugins have to implement/use cursor)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking into this now


func (l *Local) Set(_ context.Context, table, key, value string) error {
l.tablesLock.Lock()
defer l.tablesLock.Unlock()
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think if you pre-alloc the tables you can avoid locking the whole map and just locking on a per table.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe I'm misunderstanding, but I think we can't pre-alloc tables, because we don't know which tables will use this before Set is called. And if no tables use it, we don't want to write any file when the backend gets closed.

Copy link
Copy Markdown
Member

@erezrokah erezrokah left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, but maybe we should consider making the API more restrictive


import "context"

type Backend interface {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we make these table methods? So the API will be t.SetCursor(value string)?

That would make it more restrictive, and avoid potential collisions. Also what's the reason we need a key? Can a single table have more than 1 key?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, a single table can have more than 1 key. For example, if the table is multiplexed on account or project, we may need to pass in different IDs for each of those accounts/projects to resume from the last position. Generally I think only very simple cases will have a single key, most cases will have many.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the table method idea! It might be hard to do in practice though. Right now resolvers don't have access to their own table, and even if they did, the table object wouldn't have a backend attached to it. I think we'll just have to accept that this will be the first pass, and we can iterate on the interface as we see how it's used in practice.

Copy link
Copy Markdown
Member

@erezrokah erezrokah Jan 9, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, a single table can have more than 1 key. For example, if the table is multiplexed on account or project, we may need to pass in different IDs for each of those accounts/projects to resume from the last position. Generally I think only very simple cases will have a single key, most cases will have many.

Multiplexed clients have an ID() method (I think tables can access it). Would that make a good key? I think that would make it easier to trace which cursor belongs to which client/table.

I think we'll just have to accept that this will be the first pass

Good point. If we see the API is used in an inconsistent way we can change it (unless we want it that way 😄 )

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, I haven't changed it to be a method on Table (which might be hard), but I have updated the Backend interface to be like this:

Set(ctx context.Context, table, clientID, value string) error

So basically we're relying on the client id to give uniqueness (which should be true if you use multiplexers correctly)

@hermanschaaf
Copy link
Copy Markdown
Contributor Author

@yevgenypats Addressed all the comments and went through another iteration, this should be good for another look now.

l.tables[table][clientID] = value
if prev != value {
// only flush if the value changed
return l.flushTable(table, l.tables[table])
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For now we are flushing on every Set call (if the value changed), but if we can ensure that Close always gets called before shutdown then we won't have to do that.

Copy link
Copy Markdown
Contributor

@yevgenypats yevgenypats left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! I might have some suggestions on improving the implementation or performance but I think The external API looks good so it should easy to adjust as we go.

Prob for next feature / improvement around that area I'd like to work on it to get more hands-on experience with that feature to see if I've any additional ideas around that.

@kodiakhq kodiakhq bot merged commit 3b07885 into main Jan 9, 2023
@kodiakhq kodiakhq bot deleted the incremental-syncing branch January 9, 2023 13:32
kodiakhq bot pushed a commit that referenced this pull request Jan 9, 2023
🤖 I have created a release *beep* *boop*
---


## [1.24.0](v1.23.0...v1.24.0) (2023-01-09)


### Features

* Add local backend for storing cursor state ([#569](#569)) ([3b07885](3b07885))
* Remove codegen ([#589](#589)) ([1c5943a](1c5943a))


### Bug Fixes

* **destinations:** Log correct size of batch ([#588](#588)) ([9cebafe](9cebafe))

---
This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please).
kodiakhq bot pushed a commit to cloudquery/cloudquery that referenced this pull request Jan 9, 2023
This adds a Hacker News source plugin that is able to download all items from the Hacker News API. It can be useful, but is mainly meant to be for instructional purposes, as this is the first example of a plugin making use of the state backend for incremental syncing being introduced in cloudquery/plugin-sdk#569 (this PR depends on that one).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants