Skip to content

Conversation

@ivb-supercede
Copy link
Contributor

@ivb-supercede ivb-supercede commented Jul 15, 2021

This code is based on that contained in:
#726

This PR has two components:

  1. A new backend class, PersistQueryStream, for backends which may allow for constant-memory streaming of query results. This is intended to replace the use of selectSource from PersistQueryRead, which is "broken" in the sense that the ConduitT it returns rarely actually does any streaming.
  2. An implementation of PersistQueryStream for SqlBackend which is based on an internal field, connPrepareCursor, which allows for backends to provide a way to run queries using an internal cursor which iterates over rows as requested, instead of returning them all at once. This internal field is provided for the Postgres version of SqlBackend, using functions from postgresql-simple.

This code differs in an important way from the above PR in that it solves a memory issue in the way the ConduitT was constructed that meant that the implementation using cursors was also not memory-constant. This implementation is memory-constant, tested to large sets of results.

@timds

Before submitting your PR, check that you've:

  • Documented new APIs with Haddock markup
  • Added @since declarations to the Haddock
  • Ran stylish-haskell on any changed files.
  • Adhered to the code style (see the .editorconfig file for details)

After submitting your PR:

  • Update the Changelog.md file with a link to your PR
  • Bumped the version number if there isn't an (unreleased) on the Changelog
  • Check that CI passes (or if it fails, for reasons unrelated to your change, like CI timeouts)

timds and others added 5 commits July 15, 2021 15:17
This code is based on that contained in:
#726

This commit has two components:

  1. A new backend class, `PersistQueryStream`, for backends which *may*
allow for constant-memory streaming of query results. This is intended
to replace the use of `selectSource` from `PersistQueryRead`, which is
"broken" in the sense that the `ConduitT` it returns rarely actually
does any streaming.
  2. An implementation of `PersistQueryStream` for `SqlBackend` which is
based on an internal field, `connPrepareCursor`, which allows for
backends to provide a way to run queries using an internal cursor which
iterates over rows as requested, instead of returning them all at once.
This internal field is provided for the Postgres version of
`SqlBackend`, using functions from `postgresql-simple`.

This code differs in an important way from the above PR in that it
solves a memory issue in the way the `ConduitT` was constructed that
meant that the implementation using cursors was also not
memory-constant. This implementation is memory-constant, tested to large
sets of results.

Co-authored-by: Tim Stapels <tim@supercede.com>
@ivb-supercede
Copy link
Contributor Author

Interestingly, this code has some odd performance characteristics that I can't figure out the reason for. Calling selectStream seems to be slower than using persistent-pagination, when performing a like-for-like query - and the profiling information seems to suggest that it's because of row parsing being slower in the cursor code. I couldn't find the exact reason, however.

@jezen jezen requested a review from parsonsmatt July 15, 2021 19:28
@jezen
Copy link
Member

jezen commented Jul 15, 2021

@parsonsmatt Hope you don't mind me pinging you for review on this — just thought you'd be a likely candidate as it's closely related to persistent-pagination 🙂

Copy link
Collaborator

@parsonsmatt parsonsmatt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So -

Streaming rows from the database in a safe, composable, and fast way is apparently a hard problem.

This seems like a good solution! But you're noticing that the persistent-pagination approach is faster, anyway. I have no idea why that would be the case. And the tests don't cover a lot of the gnarly concurrency stuff that plague the other attempts I've seen.

So, I'm wondering - can this functionality be exposed from a 3rd party library, like persistent-postgresql-streaming? If not, what functionality do we need to export from persistent to support that library's creation?

Once persistent-postgresql-streaming is 'battle tested' and known to be safe, efficient, and awesome, then I'd be happy to merge it upstream. However, until it's known good, I'd prefer to err on the side of safety and not introduce features that may need to be deprecated or marked as unsafe, all due to the tricky vagaries of concurrency and foreign code.

I do appreciate the approach and the effort you spent, and I apologize for the delay in making this review.

Comment on lines +44 to +58
specs :: Spec
specs = describe "Streaming rows" $ do
describe "selectStream" $ before_ wipe $ do
itDb "streams rows" $ do
runConduit $ numbers .| CL.chunksOf 1000 .| CC.mapM_ (void . insertMany)

let resultStream = selectStream [] [ Asc StreamableNumberVal ]

let checkIncrementing expected (Entity _ (StreamableNumber actual)) = do
expected `shouldBe` actual
pure $ expected + 1

result <- runConduit $ resultStream .| CC.foldM checkIncrementing 1

result `shouldBe` numberOfRows
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One of the problems with other selectSource attempts is that we ran into problems with concurrency. See eg #981 for commentary on this, as well as a prior attempt to make selectSource stream #726

Can we check that this doesn't have the same downside?

Copy link
Contributor Author

@ivb-supercede ivb-supercede Aug 2, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll double check it. I believe the nature of cursor queries should automatically avoid some of these concurrency issues, since streaming really fetches batches of rows from the cursor by name.

It might be necessary to make cursor statements bypass the connStmtMap, since the randomly-generated cursor name isn't included in the prepared query.

Also note that this code is largely based on #726 - the main changes being a fix to a memory leak in that code, and the difference in the way cursor statements are used by SqlBackend.

Copy link
Contributor Author

@ivb-supercede ivb-supercede Aug 4, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm struggling to follow the exact control flow, but: I believe this implementation is actually immune to some of these concurrency issues.

Testing has shown that certainly the single-threaded version behaves as expected. Conduits which simultaneously run the same query in a single-threaded application (one of the problematic uses mentioned in #981) yield results in the expected way, and don't affect each other. I think this is for the reason I mentioned above - each call to withCursorStmt' will cause a new temporary cursor to be created, the name of which is then used to grab rows in chunks.

The multi-threaded version might well still have issues - but testing that can wait until persistent-postgresql-streaming is spun off.

Comment on lines +5 to +6
* Added a new class, `PersistQueryStream`, and a new function, `selectStream`,
for backends which *may* stream results in constant memory.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From an approach standpoint - I'd prefer not to add new type classes.

This PR does everything right in terms of following how the code base currently does stuff, but I think the current way is bad and probably not worth continuing. I'd like to explore an alternative that can satisfy the desire for polymorphism while being a bit easier to write and nicer to maintain.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wrote up #1302 to philosophize on this. I don't think it should hold things back though.

-> [SelectOpt record]
-> ConduitM () (Entity record) (ReaderT backend m) ()
selectSourceStream filts opts = do
srcRes <- lift $ selectSourceRes filts opts
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I'd prefer to deprecate selectSourceRes and not see it used as a default - if someone sees this instance available, they should know that it works in a way that selectSourceRes does not.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Eg - an instance PersistQueryStream MongoContext would compile just fine, but it would probably fail to work in a reasonable way. That seems bad to me.

Comment on lines 95 to 96
getStmtConn' :: Bool -> SqlBackend -> Text -> IO Statement
getStmtConn' useCursorIfPossible conn sql = do
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not really a fan of Bool parameters when we have some specific intent - it makes it hard to read at the use-sites. Introducing a separate data type that specifies the behavior would be preferable IMO.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This has been replaced with a sum type.

-- streaming the results in a memory-constant way.
--
-- @since 2.13.2.0
class (PersistQueryRead backend) => PersistQueryStream backend where
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think adding this class makes the expectations around selectSource really weird - that's a class method on PersistQuery, so why do we have a separate method here?

Comment on lines 103 to 104
Just prepareCursor | useCursorIfPossible -> prepareCursor sql
_ -> connPrepare conn sql
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so this is where we switch - if we have connPrepareCursor = Just ... and we're trying to stream, then we do this thing. otherwise, we do the old thing.

Comment on lines +542 to +562
withCursorStmt' :: MonadIO m
=> PG.Connection
-> PG.Query
-> [PersistValue]
-> Acquire (ConduitM () [PersistValue] m ())
withCursorStmt' conn query vals =
foldWithCursor `fmap` mkAcquire openC closeC
where
openC = do
rawquery <- liftIO $ PG.formatQuery conn query (map P vals)
PGC.declareCursor conn (PG.Query rawquery)
closeC = PGC.closeCursor
foldWithCursor cursor = go
where
go = do
-- 256 is the default chunk size used for fetching
rows <- liftIO $ PGC.foldForward cursor 256 processRow []
case rows of
Left final -> CL.sourceList final
Right nonfinal -> CL.sourceList nonfinal >> go
processRow s row = pure $ s <> [map unP row]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This implementation looks fine to me

This was part of the feedback on #1298.
@ivb-supercede
Copy link
Contributor Author

ivb-supercede commented Aug 4, 2021

So, I'm wondering - can this functionality be exposed from a 3rd party library, like persistent-postgresql-streaming? If not, what functionality do we need to export from persistent to support that library's creation?

I haven't tried this out yet, but I think the biggest obstacle is the inability to go from the SqlBackend built from a PostgreSQL Connection back to the Connection itself - so it's not possible to write a function of type

getPGConnection :: ReaderT SqlBackend m Connection

which would be required to call PGC.foldForward (or any of the lower-level LibPQ functions). The only way that's possible in this PR is because the cursor code hooks into the point where the SqlBackend is first created and therefore has access to the Connection at that point.

There is, as far as I can tell, no easy way to get around that with the current polymorphic SqlBackend. One ugly option would be to give SqlBackend a field which might contain a Connection:

data SqlBackend = SqlBackend
  { ...
  , connPostgreSQLConnection :: Maybe Connection
  }

Another is for persistent-postgresql-streaming to simply require the user to provide the Connection to use selectStream, which is also very ugly.

@parsonsmatt
Copy link
Collaborator

Makes sense - there was some work done for the SQLite backend to expose the 'raw' connection. If exposing the raw Postgres connection (in a similar way) works, then we can get that out the door for y'all relatively quickly and easily.

ivanbakel pushed a commit to ivanbakel/persistent that referenced this pull request Aug 18, 2021
As part of the discussion in yesodweb#1298, it was found that
external libraries which want to hook into Persistent's Postgres API are
limited by their inability to access the postgresql-simple `Connection`
which backs a Postgres `SqlBackend`.

A similar issue in the past with Sqlite was solved by the `RawSqlite`
wrapper, which is able to expose the raw connection as well as be
compatible with the wrapped `SqlBackend` that it stores.

This adds an equivalent wrapper to persistent-postgresql, which exposes
the raw connection, and therefore allows third-party libraries to access
it.
parsonsmatt added a commit that referenced this pull request Sep 6, 2021
* Add RawPostgresql backend which exposes connection

As part of the discussion in #1298, it was found that
external libraries which want to hook into Persistent's Postgres API are
limited by their inability to access the postgresql-simple `Connection`
which backs a Postgres `SqlBackend`.

A similar issue in the past with Sqlite was solved by the `RawSqlite`
wrapper, which is able to expose the raw connection as well as be
compatible with the wrapped `SqlBackend` that it stores.

This adds an equivalent wrapper to persistent-postgresql, which exposes
the raw connection, and therefore allows third-party libraries to access
it.

* Run stylish-haskell on Postgresql.hs

* Add ChangeLog entry for 2.13.1.0

* Remove microlens dependency, expose RawPostgresql fields

This was requested as part of the review on #1305.

Co-authored-by: Matt Parsons <parsonsmatt@gmail.com>

* Fix backwards-compatibility of Postgresql.hs

Co-authored-by: Matt Parsons <parsonsmatt@gmail.com>
parsonsmatt pushed a commit that referenced this pull request Oct 7, 2021
* Expose orderClause from PersistQuery.hs

This is part of the followup to #1298.

This exposes `orderClause` in a similar manner to `filterClause`, to
help third-party libraries which want to format their own SQL queries
using Persistent's internal helpers.

It rewrites `orderClause` a bit to bring it in line with the other
exposed functions, but there should be no functional changes as a
result.

* Version bump for PR #1317
parsonsmatt pushed a commit that referenced this pull request Oct 7, 2021
* Expose PSQL parser helpers in new 'Internal' module

This is part of the followup to #1298.

The code exposed in this new `Database.Persist.Postgresql.Internal`
module is very helpful to third-party libraries which want to use
functions from `postgresql-simple` to produce `PersistValue`s for
feeding to `persistent` functions.

Exposing them is necessary to allow for a third-party library which
provides streaming of Persistent entities using Postgres cursors.

* stylish-haskell pass for PR

* Version bump for PR #1316
@ivb-supercede
Copy link
Contributor Author

The code from this PR is now up in a separate library.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants