feat: Add local backend for storing cursor state#569
Conversation
⏱️ Benchmark results
|
yevgenypats
left a comment
There was a problem hiding this comment.
Looks really good. Don't have a lot of comments apart from some nits.
I do worry a bit on the global lock and Im curious to run some benchmarks with global and non global locks. If there is no difference then a global lock will be simple for sure if there is maybe we can change the api only to lock per table per client.
| {{- if .Options.PrimaryKey}} | ||
| PrimaryKey: true, | ||
| {{- end }} | ||
| {{- if .Options.IncrementalKey}} |
There was a problem hiding this comment.
Do we need this in codegen given we are deprecating this?
There was a problem hiding this comment.
I think we may as well keep it here until the codegen gets removed (and then this entire file will get removed), don't see any harm in it
plugins/source/plugin.go
Outdated
| ) | ||
|
|
||
| type NewExecutionClientFunc func(context.Context, zerolog.Logger, specs.Source) (schema.ClientMeta, error) | ||
| type NewExecutionClientFunc func(context.Context, zerolog.Logger, specs.Source, backend.Backend) (schema.ClientMeta, error) |
There was a problem hiding this comment.
Let's introduce an optional struct here so we can have a way of both adding required parameters that will break the API and once that are not required like this one (i.e not all plugins have to implement/use cursor)
There was a problem hiding this comment.
Looking into this now
|
|
||
| func (l *Local) Set(_ context.Context, table, key, value string) error { | ||
| l.tablesLock.Lock() | ||
| defer l.tablesLock.Unlock() |
There was a problem hiding this comment.
I think if you pre-alloc the tables you can avoid locking the whole map and just locking on a per table.
There was a problem hiding this comment.
Maybe I'm misunderstanding, but I think we can't pre-alloc tables, because we don't know which tables will use this before Set is called. And if no tables use it, we don't want to write any file when the backend gets closed.
erezrokah
left a comment
There was a problem hiding this comment.
Looks good, but maybe we should consider making the API more restrictive
|
|
||
| import "context" | ||
|
|
||
| type Backend interface { |
There was a problem hiding this comment.
Should we make these table methods? So the API will be t.SetCursor(value string)?
That would make it more restrictive, and avoid potential collisions. Also what's the reason we need a key? Can a single table have more than 1 key?
There was a problem hiding this comment.
Yeah, a single table can have more than 1 key. For example, if the table is multiplexed on account or project, we may need to pass in different IDs for each of those accounts/projects to resume from the last position. Generally I think only very simple cases will have a single key, most cases will have many.
There was a problem hiding this comment.
I like the table method idea! It might be hard to do in practice though. Right now resolvers don't have access to their own table, and even if they did, the table object wouldn't have a backend attached to it. I think we'll just have to accept that this will be the first pass, and we can iterate on the interface as we see how it's used in practice.
There was a problem hiding this comment.
Yeah, a single table can have more than 1 key. For example, if the table is multiplexed on account or project, we may need to pass in different IDs for each of those accounts/projects to resume from the last position. Generally I think only very simple cases will have a single key, most cases will have many.
Multiplexed clients have an ID() method (I think tables can access it). Would that make a good key? I think that would make it easier to trace which cursor belongs to which client/table.
I think we'll just have to accept that this will be the first pass
Good point. If we see the API is used in an inconsistent way we can change it (unless we want it that way 😄 )
There was a problem hiding this comment.
Yup, I haven't changed it to be a method on Table (which might be hard), but I have updated the Backend interface to be like this:
Set(ctx context.Context, table, clientID, value string) error
So basically we're relying on the client id to give uniqueness (which should be true if you use multiplexers correctly)
|
@yevgenypats Addressed all the comments and went through another iteration, this should be good for another look now. |
| l.tables[table][clientID] = value | ||
| if prev != value { | ||
| // only flush if the value changed | ||
| return l.flushTable(table, l.tables[table]) |
There was a problem hiding this comment.
For now we are flushing on every Set call (if the value changed), but if we can ensure that Close always gets called before shutdown then we won't have to do that.
yevgenypats
left a comment
There was a problem hiding this comment.
Looks great! I might have some suggestions on improving the implementation or performance but I think The external API looks good so it should easy to adjust as we go.
Prob for next feature / improvement around that area I'd like to work on it to get more hands-on experience with that feature to see if I've any additional ideas around that.
🤖 I have created a release *beep* *boop* --- ## [1.24.0](v1.23.0...v1.24.0) (2023-01-09) ### Features * Add local backend for storing cursor state ([#569](#569)) ([3b07885](3b07885)) * Remove codegen ([#589](#589)) ([1c5943a](1c5943a)) ### Bug Fixes * **destinations:** Log correct size of batch ([#588](#588)) ([9cebafe](9cebafe)) --- This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please).
This adds a Hacker News source plugin that is able to download all items from the Hacker News API. It can be useful, but is mainly meant to be for instructional purposes, as this is the first example of a plugin making use of the state backend for incremental syncing being introduced in cloudquery/plugin-sdk#569 (this PR depends on that one).
A state store allows us to store cursors for #413. This first iteration introduces the interface and implements it for a local JSON storage backend. After some back and forth I decided to call it
backend, similar to Terraform. Alternatives considered werestate(confusing because it's not a single state) andstorage(confusing because it's similar todestination).Since the interface is a generic key-value store, we lose the ability to know whether a table is incremental or not, and which columns are used as cursor keys if it is. I'm proposing that we add some fields to the
Tabledefinition to be able to at least document these aspects, but it's not enforced.See cloudquery/cloudquery#6336 for an example implementation using this new functionality.