Skip to content

feat(spanner): add SelectAll method to decode from Spanner iterator.Rows to golang struct#9206

Merged
rahul2393 merged 11 commits intomainfrom
issue-9111
Jan 18, 2024
Merged

feat(spanner): add SelectAll method to decode from Spanner iterator.Rows to golang struct#9206
rahul2393 merged 11 commits intomainfrom
issue-9111

Conversation

@rahul2393
Copy link
Copy Markdown
Contributor

@rahul2393 rahul2393 commented Dec 29, 2023

Fixes: #9111

Steps to benchmark

- go test -bench=^BenchmarkScan100RowsUsingSelectAll$ -run==^BenchmarkScan100RowsUsingSelectAll$ -benchmem -count 10 > 10_runs_bench_using_select_all.txt && sed -i '' 's/UsingSelectAll//g' 10_runs_bench_using_select_all.txt
- go test -bench=^BenchmarkScan100RowsUsingToStruct$ -run==^BenchmarkScan100RowsUsingToStruct$ -benchmem -count 10 > 10_runs_bench_using_tostruct.txt && sed -i '' 's/UsingToStruct//g' 10_runs_bench_using_tostruct.txt
- go test -bench=^BenchmarkScan100RowsUsingColumns$ -run==^BenchmarkScan100RowsUsingColumns$ -benchmem -count 10 > 10_runs_bench_using_columns.txt && sed -i '' 's/UsingColumns//g' 10_runs_bench_using_columns.txt

Compare results

- go install golang.org/x/perf/cmd/benchstat@latest
- benchstat 10_runs_bench_using_columns.txt 10_runs_bench_using_select_all.txt
goos: darwin
goarch: arm64
pkg: cloud.google.com/go/spanner
               │ 10_runs_bench_using_columns.txt │  10_runs_bench_using_select_all.txt   │
               │             sec/op              │    sec/op     vs base                 │
Scan100Rows-10                       7.076µ ± 2%     14.434µ ± 0%  +102.92% (p=0.000 n=20)

               │ 10_runs_bench_using_columns.txt │  10_runs_bench_using_select_all.txt   │
               │              B/op               │     B/op       vs base                │
Scan100Rows-10                      8.320Ki ± 0%   10.945Ki ± 0%  +31.55% (p=0.000 n=20)

               │ 10_runs_bench_using_columns.txt │ 10_runs_bench_using_select_all.txt  │
               │            allocs/op            │ allocs/op   vs base                 │
Scan100Rows-10                        108.0 ± 0%    219.0 ± 0%  +102.78% (p=0.000 n=20)

Result of running the combined benchmark BenchmarkScan on a five field Go struct, here is the chart showing execution time vs number of rows scanned.
Screenshot 2024-01-15 at 20 20 21

@rahul2393 rahul2393 requested review from a team December 29, 2023 07:29
@product-auto-label product-auto-label bot added size: l Pull request size is large. api: spanner Issues related to the Spanner API. labels Dec 29, 2023
@rahul2393
Copy link
Copy Markdown
Contributor Author

rahul2393 commented Dec 29, 2023

@CAFxX Can you please help review this please
Thanks

spanner/row.go Outdated
reflect.ValueOf(dst).Elem().Field(i).Set(reflect.ValueOf(p).Elem())
}
}
sliceVal.Set(reflect.Append(sliceVal, sliceItem))
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would be great if we could preallocate the whole resultset, but IIRC this is not possible today because the spanner client does not know how many rows are there in the resultset until the end (not sure if this is a fundamental issue, or it can be improved)

@CAFxX
Copy link
Copy Markdown
Contributor

CAFxX commented Jan 2, 2024

this is not the final optimization

@rahul2393 to be able to know what to focus on in the review, can you give me a rough idea of what kind of optimizations you have in your mind?

@rahul2393
Copy link
Copy Markdown
Contributor Author

this is not the final optimization

@rahul2393 to be able to know what to focus on in the review, can you give me a rough idea of what kind of optimizations you have in your mind?

Thanks CAFxX for review, I am trying to reduce the allocation ops as much as possible by preallocating as you said, since Spanner Row does not have meta information on number of rows we need to see how we can get maximum performance given the situation.

@rahul2393
Copy link
Copy Markdown
Contributor Author

CAFxX
I am done with the final changes, please help check again
QueryWithStats returns the RowCount when iterating the rows, in that case SelectAll will do preallocation for others we have to dynamically grow size.
Below is the benchmark with QueryWithStats

goos: darwin
goarch: arm64
pkg: cloud.google.com/go/spanner
               │ 10_runs_bench_using_columns.txt │  10_runs_bench_using_select_all.txt   │
               │             sec/op              │    sec/op     vs base                 │
Scan100Rows-10                       7.076µ ± 2%    10.269µ ± 0%  +44.37% (p=0.000 n=20)

               │ 10_runs_bench_using_columns.txt │  10_runs_bench_using_select_all.txt   │
               │              B/op               │     B/op       vs base                │
Scan100Rows-10                      8.320Ki ± 0%   5.273Ki ± 0%  -36.62% (p=0.000 n=20)

               │ 10_runs_bench_using_columns.txt │ 10_runs_bench_using_select_all.txt  │
               │            allocs/op            │ allocs/op   vs base                 │
Scan100Rows-10                        108.0 ± 0%    113.0 ± 0%  +4.63% (p=0.000 n=20)

For others

goos: darwin
goarch: arm64
pkg: cloud.google.com/go/spanner
               │ 10_runs_bench_using_columns.txt │  10_runs_bench_using_select_all.txt   │
               │             sec/op              │    sec/op     vs base                 │
Scan100Rows-10                       7.076µ ± 2%     14.434µ ± 0%  +102.92% (p=0.000 n=20)

               │ 10_runs_bench_using_columns.txt │  10_runs_bench_using_select_all.txt   │
               │              B/op               │     B/op       vs base                │
Scan100Rows-10                      8.320Ki ± 0%   10.945Ki ± 0%  +31.55% (p=0.000 n=20)

               │ 10_runs_bench_using_columns.txt │ 10_runs_bench_using_select_all.txt  │
               │            allocs/op            │ allocs/op   vs base                 │
Scan100Rows-10                        108.0 ± 0%    219.0 ± 0%  +102.78% (p=0.000 n=20)

Copy link
Copy Markdown
Contributor

@CAFxX CAFxX left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Second round

spanner/row.go Outdated

isPrimitive := itemType.Kind() != reflect.Struct
var pointers []interface{}
isFistRow := true
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nit] isFirstRow

spanner/row.go Outdated
sliceItem.Field(i).Set(reflect.ValueOf(p).Elem())
}
if rowIndex >= 0 {
sliceVal.Index(rowIndex).Set(sliceItem)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Won't this panic if the number of rows returned by RowsReturned is not precise (e.g. it's a lower bound)?

Does it have a huge impact if instead of using reflect.MakeSlice(sliceType, int(nRows), int(nRows)) we use reflect.MakeSlice(sliceType, 0, int(nRows)) and always append? this would handle both the cases in which RowsReturned returns a number that is too low (instead of panicking) or too high (instead of returning a slice with extra zero elements at the end). It would also make it easier in case we want to guarantee that we always append to a user-provided slice.

Copy link
Copy Markdown
Contributor Author

@rahul2393 rahul2393 Jan 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, using reflect.MakeSlice(sliceType, 0, int(nRows)) was 25% slower compared to preallocated and setting the value via index, I have updated the code to not panic if RowsReturned is lower bound,
btw it will be lower bound only for DML and PDML so normal Select queries should work without needing append.

spanner/row.go Outdated
if isFistRow {
nRows := rows.RowsReturned()
if nRows != -1 {
sliceVal = reflect.MakeSlice(sliceType, int(nRows), int(nRows))
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if the user passes in v a slice that is preallocated, I think we should append to that slice instead of allocating a new one (and if so, this behavior should be documented as guaranteed)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have updated in comment that we will resets the destination slice

spanner/row.go Outdated
)
}

// SelectAll scans rows into a slice (v)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nit] we should probably document, just as a suggested best practice to avoid running out of memory, to only use SelectAll on resultsets that the user knows to be bounded in size

spanner/row.go Outdated
)
}

// SelectAll scans rows into a slice (v)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should document the accepted types of v (or enforce it with generics)

spanner/row.go Outdated
}
vType := reflect.TypeOf(v)
if k := vType.Kind(); k != reflect.Ptr {
return errToStructArgType(v)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this the right error message here? shouldn't it be errNotASlicePointer?

spanner/row.go Outdated
isFistRow := true
rowIndex := -1
return rows.Do(func(row *Row) error {
sliceItem := reflect.New(itemType).Elem()
Copy link
Copy Markdown
Contributor

@CAFxX CAFxX Jan 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if I'm following correctly, this implies the type of v must be something like *[]*struct {...}.

Would it be too hard to also support *[]struct { ... }? The benefit for users that pass a *[]struct { ... } would be less calls to the allocator, lower allocated memory, and better memory locality (as all rows are guaranteed to be contiguous in memory, and no pointer chasing is required per element).

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, my bad, misread the code

spanner/read.go Outdated
Comment on lines +138 to +151
switch r.QueryStats["rows_returned"].(type) {
case float64:
return r.QueryStats["rows_returned"].(int64)
case string:
v, err := strconv.Atoi(r.QueryStats["rows_returned"].(string))
if err != nil {
return -1
}
return int64(v)
default:
return -1
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
switch r.QueryStats["rows_returned"].(type) {
case float64:
return r.QueryStats["rows_returned"].(int64)
case string:
v, err := strconv.Atoi(r.QueryStats["rows_returned"].(string))
if err != nil {
return -1
}
return int64(v)
default:
return -1
}
switch rowsReturned := r.QueryStats["rows_returned"].(type) {
case float64:
return int64(rowsReturned)
case string:
v, err := strconv.ParseInt(rowsReturned, 10, 64)
if err != nil {
v = -1
}
return v
}

spanner/read.go Outdated
Comment on lines +133 to +135
// RowsReturned returns the number of rows returned by the query. If the query
// was a DML statement, the number of rows affected is returned. If the query
// was a PDML statement, the number of rows affected is a lower bound.
Copy link
Copy Markdown
Contributor

@CAFxX CAFxX Jan 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// RowsReturned returns the number of rows returned by the query. If the query
// was a DML statement, the number of rows affected is returned. If the query
// was a PDML statement, the number of rows affected is a lower bound.
// RowsReturned returns, a lower bound on the number of rows returned by the query.
// Currently this requires the query to be executed with query stats enabled.
//
// If the query was a DML statement, the number of rows affected is returned.
// If the query was a PDML statement, the number of rows affected is a lower bound.
// If the query was executed without query stats enabled, or if it is otherwise
// impossible to determine the number of rows in the resultset, -1 is returned.

@rahul2393
Copy link
Copy Markdown
Contributor Author

@CAFxX please help take a look again.

spanner/read.go Outdated
}

// RowsReturned returns, a lower bound on the number of rows returned by the query.
// Currently, this requires the query to be executed with query stats enabled.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should also document that the query stats are only included in the last response of the gRPC stream. That means that if you run a query with query stats enabled that returns a large number of rows, then this method is only guaranteed to return the number of rows once you have iterated through all the rows (which again makes it a bit less useful). So maybe add the following to this sentence: The query stats are only guaranteed to be available after iterating through all the rows.

And change the last point to something like this:

// If the query was executed without query stats enabled, if the query stats have not yet been
// returned by the server (the query stats are included in the last message of the query stream),
// or if it is otherwise impossible to determine the number of rows in the resultset, -1 is returned.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, it is especially true for SELECT queries. DML/PDML will only return one PartialResultSet, as they only return an update count and/or a small number of rows (in case the contain a THEN RETURN clause).

SELECT queries that are executed with AnalyzeMode=PROFILE will include the number of rows returned, but only in the ResultSetStats, and ResultSetStats are returned together with the last PartialResultSet of the stream. So for large queries, this value will only be available once you have iterated through enough of the rows for the client to have fetched the last PartialResultSet from the stream (and in the current API, there is no way to determine when that is).

See https://cloud.google.com/spanner/docs/reference/rpc/google.spanner.v1#google.spanner.v1.ResultSetStats and https://cloud.google.com/spanner/docs/reference/rpc/google.spanner.v1#google.spanner.v1.PartialResultSet (the section on Stats for the last one)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed the logic to rely on ReturnedRows stats since for large dataset row count is returned with last result set only hence it won't help in preallocation.
Thanks

spanner/read.go Outdated
}

// Iterator is an interface for iterating over Rows.
type Iterator interface {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need this interface? (Or put another way: What are the benefits of adding this interface?)

The name also seems very generic.

Copy link
Copy Markdown
Contributor Author

@rahul2393 rahul2393 Jan 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated the interface name and converted to package private so that can't be used by customers, added this to ease mock and unit tests

spanner/row.go Outdated
Comment on lines +409 to +410
// Before starting, SelectAll resets the destination slice,
// so if it's not empty it will overwrite all existing elements.
Copy link
Copy Markdown
Contributor

@CAFxX CAFxX Jan 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, as mentioned before, I think we should only append to the slice (without first resetting it) in case a non-nil slice is provided. WDYT?

The user may have pre-seeded the slice with some entries, or it may want to use the same slice in multiple successive calls. If we always reset, we basically preclude these options. Whereas with the alternative (just append) all these options are available (and the current behavior of resetting is trivial for the user to opt-in to: just pass s[:0] instead of just s)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed reset logic, added unit test to validate it will append on base slice.

@rahul2393 rahul2393 requested review from CAFxX and olavloite and removed request for codyoss January 16, 2024 09:37
//
// var singersByPtr []*Singer
// err := spanner.SelectAll(row, &singersByPtr, spanner.WithLenient())
func SelectAll(rows rowIterator, destination interface{}, options ...DecodeOptions) error {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TLDR: can you call this method from outside the spanner package?

Maybe this answers my earlier question around whether the interface is useful or not...:

  1. type rowIterator interface is not exported.
  2. func SelectAll is exported.
  3. I assume that our standard RowIterator that is returned for query results implements rowIterator.
  4. Can you call the SelectAll method from outside the spanner package when the interface rowIterator is unexported, as long as you have a value that implements the unexported interface?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes we can call this method from outside spanner package.

  1. Yes, its intentional at the moment so that no one can use it outside the package, the only reason its present there is to ease in test setup.
  2. func SelectAll is exported: Yes and it can be used with RowIterator that is returned for query results
  3. Same as above.
  4. Yes we can

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

api: spanner Issues related to the Spanner API. size: l Pull request size is large.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

spanner: (*Row).ToStruct is 3x slower than (*Row).Columns

3 participants