Add API for mem-constant DB streams, Postgres implementation #1298

ivb-supercede · 2021-07-15T15:07:39Z

This code is based on that contained in:
#726

This PR has two components:

A new backend class, PersistQueryStream, for backends which may allow for constant-memory streaming of query results. This is intended to replace the use of selectSource from PersistQueryRead, which is "broken" in the sense that the ConduitT it returns rarely actually does any streaming.
An implementation of PersistQueryStream for SqlBackend which is based on an internal field, connPrepareCursor, which allows for backends to provide a way to run queries using an internal cursor which iterates over rows as requested, instead of returning them all at once. This internal field is provided for the Postgres version of SqlBackend, using functions from postgresql-simple.

This code differs in an important way from the above PR in that it solves a memory issue in the way the ConduitT was constructed that meant that the implementation using cursors was also not memory-constant. This implementation is memory-constant, tested to large sets of results.

@timds

Before submitting your PR, check that you've:

Documented new APIs with Haddock markup
Added @since declarations to the Haddock
Ran stylish-haskell on any changed files.
Adhered to the code style (see the .editorconfig file for details)

After submitting your PR:

Update the Changelog.md file with a link to your PR
Bumped the version number if there isn't an (unreleased) on the Changelog
Check that CI passes (or if it fails, for reasons unrelated to your change, like CI timeouts)

This code is based on that contained in: #726 This commit has two components: 1. A new backend class, `PersistQueryStream`, for backends which *may* allow for constant-memory streaming of query results. This is intended to replace the use of `selectSource` from `PersistQueryRead`, which is "broken" in the sense that the `ConduitT` it returns rarely actually does any streaming. 2. An implementation of `PersistQueryStream` for `SqlBackend` which is based on an internal field, `connPrepareCursor`, which allows for backends to provide a way to run queries using an internal cursor which iterates over rows as requested, instead of returning them all at once. This internal field is provided for the Postgres version of `SqlBackend`, using functions from `postgresql-simple`. This code differs in an important way from the above PR in that it solves a memory issue in the way the `ConduitT` was constructed that meant that the implementation using cursors was also not memory-constant. This implementation is memory-constant, tested to large sets of results. Co-authored-by: Tim Stapels <tim@supercede.com>

ivb-supercede · 2021-07-15T16:05:13Z

Interestingly, this code has some odd performance characteristics that I can't figure out the reason for. Calling selectStream seems to be slower than using persistent-pagination, when performing a like-for-like query - and the profiling information seems to suggest that it's because of row parsing being slower in the cursor code. I couldn't find the exact reason, however.

jezen · 2021-07-15T19:29:41Z

@parsonsmatt Hope you don't mind me pinging you for review on this — just thought you'd be a likely candidate as it's closely related to persistent-pagination 🙂

parsonsmatt

So -

Streaming rows from the database in a safe, composable, and fast way is apparently a hard problem.

This seems like a good solution! But you're noticing that the persistent-pagination approach is faster, anyway. I have no idea why that would be the case. And the tests don't cover a lot of the gnarly concurrency stuff that plague the other attempts I've seen.

So, I'm wondering - can this functionality be exposed from a 3rd party library, like persistent-postgresql-streaming? If not, what functionality do we need to export from persistent to support that library's creation?

Once persistent-postgresql-streaming is 'battle tested' and known to be safe, efficient, and awesome, then I'd be happy to merge it upstream. However, until it's known good, I'd prefer to err on the side of safety and not introduce features that may need to be deprecated or marked as unsafe, all due to the tricky vagaries of concurrency and foreign code.

I do appreciate the approach and the effort you spent, and I apologize for the delay in making this review.

parsonsmatt · 2021-07-30T20:56:29Z

persistent-postgresql/test/Streaming.hs

+specs :: Spec
+specs = describe "Streaming rows" $ do
+    describe "selectStream" $ before_ wipe $ do
+        itDb "streams rows" $ do
+          runConduit $ numbers .| CL.chunksOf 1000 .| CC.mapM_ (void . insertMany)
+
+          let resultStream = selectStream [] [ Asc StreamableNumberVal ]
+
+          let checkIncrementing expected (Entity _ (StreamableNumber actual)) = do
+                expected `shouldBe` actual
+                pure $ expected + 1
+
+          result <- runConduit $ resultStream .| CC.foldM checkIncrementing 1
+
+          result `shouldBe` numberOfRows


One of the problems with other selectSource attempts is that we ran into problems with concurrency. See eg #981 for commentary on this, as well as a prior attempt to make selectSource stream #726

Can we check that this doesn't have the same downside?

I'll double check it. I believe the nature of cursor queries should automatically avoid some of these concurrency issues, since streaming really fetches batches of rows from the cursor by name.

It might be necessary to make cursor statements bypass the connStmtMap, since the randomly-generated cursor name isn't included in the prepared query.

Also note that this code is largely based on #726 - the main changes being a fix to a memory leak in that code, and the difference in the way cursor statements are used by SqlBackend.

I'm struggling to follow the exact control flow, but: I believe this implementation is actually immune to some of these concurrency issues.

Testing has shown that certainly the single-threaded version behaves as expected. Conduits which simultaneously run the same query in a single-threaded application (one of the problematic uses mentioned in #981) yield results in the expected way, and don't affect each other. I think this is for the reason I mentioned above - each call to withCursorStmt' will cause a new temporary cursor to be created, the name of which is then used to grab rows in chunks.

The multi-threaded version might well still have issues - but testing that can wait until persistent-postgresql-streaming is spun off.

parsonsmatt · 2021-07-30T21:20:59Z

persistent/ChangeLog.md

+    * Added a new class, `PersistQueryStream`, and a new function, `selectStream`,
+      for backends which *may* stream results in constant memory.


From an approach standpoint - I'd prefer not to add new type classes.

This PR does everything right in terms of following how the code base currently does stuff, but I think the current way is bad and probably not worth continuing. I'd like to explore an alternative that can satisfy the desire for polymorphism while being a bit easier to write and nicer to maintain.

I wrote up #1302 to philosophize on this. I don't think it should hold things back though.

parsonsmatt · 2021-07-30T22:36:23Z

persistent/Database/Persist/Class/PersistQuery.hs

+        -> [SelectOpt record]
+        -> ConduitM () (Entity record) (ReaderT backend m) ()
+    selectSourceStream filts opts = do
+      srcRes <- lift $ selectSourceRes filts opts


I think I'd prefer to deprecate selectSourceRes and not see it used as a default - if someone sees this instance available, they should know that it works in a way that selectSourceRes does not.

Eg - an instance PersistQueryStream MongoContext would compile just fine, but it would probably fail to work in a reasonable way. That seems bad to me.

parsonsmatt · 2021-07-30T23:31:46Z

persistent/Database/Persist/Sql/Raw.hs

+getStmtConn' :: Bool -> SqlBackend -> Text -> IO Statement
+getStmtConn' useCursorIfPossible conn sql = do


I'm not really a fan of Bool parameters when we have some specific intent - it makes it hard to read at the use-sites. Introducing a separate data type that specifies the behavior would be preferable IMO.

This has been replaced with a sum type.

parsonsmatt · 2021-07-30T23:33:26Z

persistent/Database/Persist/Class/PersistQuery.hs

+-- streaming the results in a memory-constant way.
+--
+-- @since 2.13.2.0
+class (PersistQueryRead backend) => PersistQueryStream backend where


I think adding this class makes the expectations around selectSource really weird - that's a class method on PersistQuery, so why do we have a separate method here?

parsonsmatt · 2021-07-30T23:35:45Z

persistent/Database/Persist/Sql/Raw.hs

+                    Just prepareCursor | useCursorIfPossible -> prepareCursor sql
+                    _ -> connPrepare conn sql


so this is where we switch - if we have connPrepareCursor = Just ... and we're trying to stream, then we do this thing. otherwise, we do the old thing.

parsonsmatt · 2021-07-30T23:38:20Z

persistent-postgresql/Database/Persist/Postgresql.hs

+withCursorStmt' :: MonadIO m
+          => PG.Connection
+          -> PG.Query
+          -> [PersistValue]
+          -> Acquire (ConduitM () [PersistValue] m ())
+withCursorStmt' conn query vals =
+    foldWithCursor `fmap` mkAcquire openC closeC
+  where
+    openC = do
+        rawquery <- liftIO $ PG.formatQuery conn query (map P vals)
+        PGC.declareCursor conn (PG.Query rawquery)
+    closeC = PGC.closeCursor
+    foldWithCursor cursor = go
+      where
+        go = do
+            -- 256 is the default chunk size used for fetching
+            rows <- liftIO $ PGC.foldForward cursor 256 processRow []
+            case rows of
+                Left final -> CL.sourceList final
+                Right nonfinal -> CL.sourceList nonfinal >> go
+    processRow s row = pure $ s <> [map unP row]


This implementation looks fine to me

This was part of the feedback on #1298.

ivb-supercede · 2021-08-04T16:16:14Z

So, I'm wondering - can this functionality be exposed from a 3rd party library, like persistent-postgresql-streaming? If not, what functionality do we need to export from persistent to support that library's creation?

I haven't tried this out yet, but I think the biggest obstacle is the inability to go from the SqlBackend built from a PostgreSQL Connection back to the Connection itself - so it's not possible to write a function of type

getPGConnection :: ReaderT SqlBackend m Connection

which would be required to call PGC.foldForward (or any of the lower-level LibPQ functions). The only way that's possible in this PR is because the cursor code hooks into the point where the SqlBackend is first created and therefore has access to the Connection at that point.

There is, as far as I can tell, no easy way to get around that with the current polymorphic SqlBackend. One ugly option would be to give SqlBackend a field which might contain a Connection:

data SqlBackend = SqlBackend
  { ...
  , connPostgreSQLConnection :: Maybe Connection
  }

Another is for persistent-postgresql-streaming to simply require the user to provide the Connection to use selectStream, which is also very ugly.

parsonsmatt · 2021-08-04T16:48:58Z

Makes sense - there was some work done for the SQLite backend to expose the 'raw' connection. If exposing the raw Postgres connection (in a similar way) works, then we can get that out the door for y'all relatively quickly and easily.

As part of the discussion in yesodweb#1298, it was found that external libraries which want to hook into Persistent's Postgres API are limited by their inability to access the postgresql-simple `Connection` which backs a Postgres `SqlBackend`. A similar issue in the past with Sqlite was solved by the `RawSqlite` wrapper, which is able to expose the raw connection as well as be compatible with the wrapped `SqlBackend` that it stores. This adds an equivalent wrapper to persistent-postgresql, which exposes the raw connection, and therefore allows third-party libraries to access it.

* Add RawPostgresql backend which exposes connection As part of the discussion in #1298, it was found that external libraries which want to hook into Persistent's Postgres API are limited by their inability to access the postgresql-simple `Connection` which backs a Postgres `SqlBackend`. A similar issue in the past with Sqlite was solved by the `RawSqlite` wrapper, which is able to expose the raw connection as well as be compatible with the wrapped `SqlBackend` that it stores. This adds an equivalent wrapper to persistent-postgresql, which exposes the raw connection, and therefore allows third-party libraries to access it. * Run stylish-haskell on Postgresql.hs * Add ChangeLog entry for 2.13.1.0 * Remove microlens dependency, expose RawPostgresql fields This was requested as part of the review on #1305. Co-authored-by: Matt Parsons <parsonsmatt@gmail.com> * Fix backwards-compatibility of Postgresql.hs Co-authored-by: Matt Parsons <parsonsmatt@gmail.com>

* Expose orderClause from PersistQuery.hs This is part of the followup to #1298. This exposes `orderClause` in a similar manner to `filterClause`, to help third-party libraries which want to format their own SQL queries using Persistent's internal helpers. It rewrites `orderClause` a bit to bring it in line with the other exposed functions, but there should be no functional changes as a result. * Version bump for PR #1317

* Expose PSQL parser helpers in new 'Internal' module This is part of the followup to #1298. The code exposed in this new `Database.Persist.Postgresql.Internal` module is very helpful to third-party libraries which want to use functions from `postgresql-simple` to produce `PersistValue`s for feeding to `persistent` functions. Exposing them is necessary to allow for a third-party library which provides streaming of Persistent entities using Postgres cursors. * stylish-haskell pass for PR * Version bump for PR #1316

ivb-supercede · 2021-11-02T12:41:30Z

The code from this PR is now up in a separate library.

timds and others added 5 commits July 15, 2021 15:17

Run stylish-haskell on PR #1298

333caab

Bump persistent version, add Changelog entry

d82c9ef

Bump persistent-postgresql version, add Changelog entry

f333bca

Remove use of NumericUnderscores

b318dc6

jezen requested a review from parsonsmatt July 15, 2021 19:28

parsonsmatt mentioned this pull request Jul 30, 2021

Unify approaches to polymorphism #1302

Open

parsonsmatt requested changes Jul 30, 2021

View reviewed changes

Replace boolean in getStmtConn' with datatype

7d1a2c4

This was part of the feedback on #1298.

ivb-supercede mentioned this pull request Aug 18, 2021

Add RawPostgresql based on RawSqlite #1305

Merged

7 tasks

This was referenced Sep 9, 2021

Expose PSQL parser helpers in new 'Internal' module #1316

Merged

Expose orderClause from PersistQuery.hs #1317

Merged

ivb-supercede closed this Nov 2, 2021

ivb-supercede mentioned this pull request Jan 10, 2022

How to migrate from selectSource to selectCursor? SupercedeTech/persistent-postgresql-streaming#1

Open

		* Added a new class, `PersistQueryStream`, and a new function, `selectStream`,
		for backends which may stream results in constant memory.

		getStmtConn' :: Bool -> SqlBackend -> Text -> IO Statement
		getStmtConn' useCursorIfPossible conn sql = do

		Just prepareCursor \| useCursorIfPossible -> prepareCursor sql
		_ -> connPrepare conn sql

Add API for mem-constant DB streams, Postgres implementation #1298

Add API for mem-constant DB streams, Postgres implementation #1298

Uh oh!

Conversation

ivb-supercede commented Jul 15, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ivb-supercede commented Jul 15, 2021

Uh oh!

jezen commented Jul 15, 2021

Uh oh!

parsonsmatt left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ivb-supercede Aug 2, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ivb-supercede Aug 4, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ivb-supercede commented Aug 4, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

parsonsmatt commented Aug 4, 2021

Uh oh!

ivb-supercede commented Nov 2, 2021

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

ivb-supercede commented Jul 15, 2021 •

edited

Loading

ivb-supercede Aug 2, 2021 •

edited

Loading

ivb-supercede Aug 4, 2021 •

edited

Loading

ivb-supercede commented Aug 4, 2021 •

edited

Loading