Skip to content

Conversation

@hermanschaaf
Copy link
Member

@hermanschaaf hermanschaaf commented Dec 28, 2022

This adds new options to the Read API to:

  • limit the number of records returned
  • order the returned records by a set of columns
  • select only rows that match a given value
  • select only a subset of columns

I have chosen not to support filtering using operators (like greater than, less than) as an option for now, as it's a complicated and we don't have a strong need for it. We may want to add it later to make it feature-complete.

This update allows us to fix cloudquery/cloudquery#6069 (will be done in another PR to the CloudQuery repo) and should also allow for cursor-based syncing (#413). This is not implemented as part of this PR though.

@hermanschaaf
Copy link
Member Author

Here is a draft PR for the cloudquery repo to show what usage would look like: cloudquery/cloudquery#6107

@hermanschaaf
Copy link
Member Author

hermanschaaf commented Dec 28, 2022

I see the unit tests are failing: will mark this as a draft for now and look into it tomorrow Updated

@hermanschaaf hermanschaaf marked this pull request as draft December 28, 2022 17:19
@github-actions
Copy link

github-actions bot commented Dec 29, 2022

⏱️ Benchmark results

Comparing with dcd8834

  • DefaultConcurrency-2 resources/s: 11,633 ⬇️ 0.46% decrease vs. dcd8834
  • Glob-2 ns/op: 156.9 ⬆️ 6.37% increase vs. dcd8834
  • TablesWithChildrenDefaultConcurrency-2 resources/s: 32,679 ⬆️ 4.65% increase vs. dcd8834
  • BufferedScanner-2 ns/op: 10.1 ⬆️ 7.15% increase vs. dcd8834
  • LogReader-2 ns/op: 30.76 ⬆️ 0.78% increase vs. dcd8834

@hermanschaaf hermanschaaf marked this pull request as ready for review December 29, 2022 11:28
@yevgenypats
Copy link
Contributor

I like the interface but have a few questions/thought:

Do we really need the LessThan for all types? also for some types Im not sure there is such a think, like for json or for bytea. can we maybe have additional api for destinations called ReadOne which gets and ID? This way it will be consistent? and can also serve for the cursor based thing?

@hermanschaaf
Copy link
Member Author

@yevgenypats Indeed, the meaning of LessThan is unclear for some types, but I don't foresee us using those any time soon, because just like with a JSON database column, sorting on it would be ill-advised. I'm not sure I want to necessarily panic there though. In these cases any kind of ordering is okay, as long as its consistent.

A single ReadOne function, though simpler, will not be enough. I see your point and it could work for the testing case in isolation, because we already know the IDs we want to read (because we just wrote them). But then the test wouldn't be able to assert that the correct number of records were inserted. Maybe we can live with this.

However, when we get to cursor-based syncing, we don't know the ID we're looking for up-front. Let's take a few examples:

  • message history for the Slack plugin: here we need to fetch the latest ts (timestamp) that was inserted into the database in a previous sync. The only information we have before reading is the source name. There is no one ID that we can look up by, but we can get the latest timestamp by issuing an order by ts desc limit 1 read query.
  • service stats for the Fastly plugin: here we need to know the latest start_time for a given service that was inserted in a previous sync. So we'll need to issue an where service_id = $1 order by start_time limit 1 query (aside: this made me realize that we will need some filtering capabilities soon after all)
  • stories for a hypothetical Hacker News plugin: here we need to know the latest story ID that was inserted into the database. We'd want to issue a order by id desc limit 1 read query.

The important thing is that we have OrderBy. To implement OrderBy for the memory db and file types like CSV and JSON, we need LessThan on the types. For databases we can rely on the built-in ORDER BY functionality.

@github-actions github-actions bot added feat and removed feat labels Dec 29, 2022
Copy link
Member

@erezrokah erezrokah left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a few comments, but I think I'm missing something regarding the reason this is required for testing purposes.
Could we update the equality check in the tests to ignore the order of the results instead?

I get that we might need this for a future feature (curser based fetching), but since the read is only used in tests, maybe we could find a solution that doesn't require adding functionality to non test related code?

As for cursor based fetching, maybe we could add the order by functionality once we implement that feature, as we might discover more requirements once we try to implement cursor based fetching.

Approving the code change regardless of ⬆️

.PHONY: lint
lint:
golangci-lint run
golangci-lint run --config .golangci.yml
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[non blocking] I don't think we need to specify the configuration here as golangci-lint looks for it by default https://golangci-lint.run/usage/configuration/#config-file

}
sortedRes = append(sortedRes, row)
}
sort.Slice(sortedRes, func(i, j int) bool {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
sort.Slice(sortedRes, func(i, j int) bool {
sort.SliceStable(sortedRes, func(i, j int) bool {


resourcesRead, err := p.readAll(ctx, tables[0], sourceName)
cqIDIndex := table.Columns.Index(schema.CqIDColumn.Name)
sort.Slice(resources, func(i, j int) bool {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
sort.Slice(resources, func(i, j int) bool {
sort.SliceStable(resources, func(i, j int) bool {

@hermanschaaf
Copy link
Member Author

@erezrokah Yeah, I was working on this with the eye on it being the first step towards cursor-based syncing (rather than doing it all in one or two massive PRs). However after seeing how it works out, I think we'll probably not go with this Read API approach and use a different interface to achieve cursor-based syncing.

I'll simplify this PR to focus on fixing the tests and mostly try and leave the Read API untouched.

kodiakhq bot pushed a commit that referenced this pull request Jan 2, 2023
Extracted from #542 to allow us to fix cloudquery/cloudquery#6069

Before this change, running cockroachDB tests:

```
Jan  2 11:03:08.760000 INF Query args=["testAppendSource474bca19-69ba-4a5a-8af0-3d890198a243"] commandTag="SELECT 2" module=pgx pid=1570743 sql="SELECT * FROM \"cq_test_write_append\" WHERE _cq_source_name = $1 order by _cq_sync_time asc"
--- FAIL: TestPgPlugin (0.36s)
    --- FAIL: TestPgPlugin/TestWriteOverwrite (0.22s)
        client_test.go:20: expected first resource diff: value at 2 is(d1ebd2a374f14556852c4d5b23608201) TypeUUID while other is(3da5b2eafc814ecb973bcee770bb70e1) TypeUUID
            value at 3 is(d03dd7f8781a4dbb93aaebefb6aaba87) TypeUUID while other is(f16b7d3a0f85441daf30da5d2d8afb26) TypeUUID
            value at 7 is(758c419b5a1b4a46ba45499b402c6aaf) TypeUUID while other is(0f252e92137b4f55a856cbf97d43d098) TypeUUID

FAIL
FAIL	github.com/cloudquery/cloudquery/plugins/destination/postgresql/client	0.640s
```

After this change:

```
ok  	github.com/cloudquery/cloudquery/plugins/destination/postgresql/client	0.662s
```
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

CockroachDB destination tests are failing

4 participants