
Moving to SQLAlchemy >= 1.4 #8158

Merged
jsignell merged 14 commits into dask:main from McToel:main
Jan 25, 2022

Conversation

@McToel
Contributor

McToel commented Sep 19, 2021

Currently, dask does not work properly with SQLAlchemy 1.4, so I am working on making it compatible. This is the current state of my work: it mostly works, but it is not yet properly tested or documented. It also significantly changes the API around SQL IO, which will break existing code.

Right now it is possible to pass a SQL function as index_col or columns. If, for example, someone wants to cast the index to another type via a SQL function, that could also be done using read_sql_query(sql=the_sql_to_do_the_casting). This "feature" is very buggy right now, however, and supporting it with SQLAlchemy 1.4 makes a big mess, so I suggest removing it. Since my changes properly support SQL queries (the old support was very buggy), all of those casts can be done with SQL selects.

Since we are breaking code anyway, I suggest also making the dask SQL IO API more similar to the pandas SQL IO API.

What I want to change:

  • read_sql_table() will no longer support sql queries
  • read_sql_query() will handle SQLAlchemy queries
  • read_sql() will try to call read_sql_table() or read_sql_query() depending on the passed arguments (not yet implemented)
  • rename some arguments
  • it might be possible to support plain string SQL queries (update: no, this will not work)
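
The proposed dispatch in the third bullet could be sketched roughly as follows. This is hypothetical logic with stubbed readers, not the actual dask implementation:

```python
def read_sql(sql, con, index_col, **kwargs):
    # Hypothetical dispatch: a plain string is treated as a table name,
    # anything else is assumed to be a SQLAlchemy selectable.
    if isinstance(sql, str):
        return read_sql_table(sql, con, index_col, **kwargs)
    return read_sql_query(sql, con, index_col, **kwargs)

# Stub readers so the sketch is self-contained:
def read_sql_table(table_name, con, index_col, **kwargs):
    return f"table:{table_name}"

def read_sql_query(query, con, index_col, **kwargs):
    return f"query:{type(query).__name__}"

print(read_sql("test", "sqlite:///db", index_col="number"))  # table:test
```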

Before I implement, test and document all the changes, I would like to have some feedback on my suggestions.

@GPUtester
Collaborator

Can one of the admins verify this patch?

@jsignell
Member

cc @martindurant

@martindurant
Member

> Supporting this feature with SQLAlchemy 1.4 makes a big mess, so I suggest removing it.

Can you clarify this? We certainly wouldn't want to throw away functionality if we don't have to. We have no other way to support general SQL (as opposed to reading a table), because we need to be able to compose the partition selection logic, and that would be SQL dialect specific when using text.

@McToel
Contributor Author

McToel commented Sep 21, 2021

I meant that supporting SQL functions passed to index_col and columns is a mess; that might have been unclear.

In SQLAlchemy 1.3, select objects had a .c / .columns attribute, and column objects have these too. In 1.4, select objects should use .selected_columns instead. That makes it really hard to mix the two the way the current code does, especially as they have to be combined into a single SQL query in the end.
It might be possible to convert columns to select statements, but I don't know how to combine those back into a single query, and I suspect it would make the query less performant.
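
For concreteness, a minimal illustration of the attribute change (requires SQLAlchemy >= 1.4; names here are illustrative):

```python
from sqlalchemy import column, select, table

t = table("test", column("number"), column("name"))

# Table-like objects expose their columns via .c in both 1.3 and 1.4:
cols = list(t.c.keys())

# Select objects used .c / .columns in 1.3; since 1.4 the accessor is
# .selected_columns (.c on a select is deprecated there):
s = select(t.c.number, t.c.name)  # 1.4-style positional form
selected = list(s.selected_columns.keys())

print(cols, selected)  # ['number', 'name'] ['number', 'name']
```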

General SQL is supported through SQLAlchemy queries. Applying a SQL function to the index_col, for example, can be done like this:

import sqlalchemy as sa
from sqlalchemy import sql

from dask.dataframe import read_sql_query

s = sql.select(
    [
        sa.cast(sql.column("number"), sa.types.BigInteger).label("number"),
        sql.column("name"),
    ]
).where(sql.column("number") >= 5).select_from(sql.table("test"))

out = read_sql_query(s, db, npartitions=2, index_col="number")

@martindurant
Member

> General SQL is supported through SQLAlchemy queries

but can we modify these to produce the partitioned sub-queries needed to build a dask dataframe? If yes, great!
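
The general idea, sketched with illustrative names (this is not the exact dask code): wrap the user's select as a subquery and apply a per-partition range filter on the index column.

```python
from sqlalchemy import column, select, table

# A user-supplied query (illustrative):
user_q = select(column("number"), column("name")).select_from(table("test"))

def partition_query(q, index_col, lo, hi):
    # Wrap the original select as a subquery and restrict the index
    # column to one partition's half-open range [lo, hi).
    sub = q.subquery()
    idx = sub.columns[index_col]
    return select(sub).where(idx >= lo, idx < hi)

# One partitioned sub-query per dask partition:
p0 = partition_query(user_q, "number", 0, 5)
print(str(p0))  # SELECT ... FROM (SELECT ...) AS anon_1 WHERE ...
```

Because the wrapping happens at the SQLAlchemy expression level, it stays dialect-agnostic, which is exactly what textual SQL could not offer.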

@McToel
Contributor Author

McToel commented Sep 21, 2021

At least it works in that test. I will try to find a case that breaks it, but so far it works.

@jsignell linked an issue Oct 1, 2021 that may be closed by this pull request
@jsignell
Member

What is the status of this @McToel and @martindurant? Are there still open discussions?

@martindurant
Member

Thanks for the ping. There's quite a lot in this PR, and I am not an expert on SQLAlchemy, but we have a very good reason to move ahead now, even if we drop support for sqlalchemy < 1.4.

@McToel
Contributor Author

McToel commented Jan 19, 2022

I think it is fully implemented. I basically left the old code in place and added deprecation warnings to it, so anyone using SQLAlchemy 1.3 or earlier should still be able to run it, just with a warning. However, I wasn't sure how dask normally handles deprecations, and I may not have fully tested the deprecation paths.

And the tests are failing right now because, I think, the CI runner is using SQLAlchemy 1.3.

@jsignell
Member

Yeah, since pandas is moving to sqlalchemy >= 1.4 I think we can reasonably do that as well.

If it isn't too hard, it would definitely be better if we could go through a deprecation cycle. That would look like adding clear warnings with suggested actions and then altering the tests to expect those warnings on older versions.
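
That cycle could look roughly like this (a sketch; the warning class, message, and function bodies are illustrative, not what dask actually emits):

```python
import warnings

def read_sql_table(table_name, con, index_col, **kwargs):
    # Old code could pass a selectable here; warn instead of breaking.
    if not isinstance(table_name, str):
        warnings.warn(
            "Passing a SQLAlchemy selectable to read_sql_table is "
            "deprecated; use read_sql_query instead.",
            FutureWarning,
        )
        return read_sql_query(table_name, con, index_col, **kwargs)
    ...  # normal table-reading path

def read_sql_query(query, con, index_col, **kwargs):
    ...  # stub for the sketch

# The tests then expect old-style usage to warn, e.g.:
# with pytest.warns(FutureWarning, match="read_sql_query"):
#     read_sql_table(some_selectable, db, index_col="number")
```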

@martindurant
Member

I assume the win37 tests pass if we merge from main?

@jsignell
Member

The win37 tests are failing on main; they should be ignored.

@martindurant
Member

+1 from me, this is nicely done.

@jsignell
Member

@McToel do you mind updating the environment.yamls so that the tests run with the latest sqlalchemy?

@McToel
Contributor Author

McToel commented Jan 21, 2022

I have no idea how the yamls work, so if someone else has the time to do it, that would be great.


@jsignell
Member

Sorry I wasn't very clear @McToel. Do you mind if I push to your branch? There are a few other environment.yaml files to change for other python versions.

@McToel
Contributor Author

McToel commented Jan 21, 2022

I think I've done it myself, but if anything is missing, feel free to just contribute to my branch.

@jsignell
Member

Ok, @McToel I just removed a file that has been deleted on main since this PR was first opened.

@jsignell
Member

It looks like there is one issue with the int32 dtype:

================================== FAILURES ===================================
____________________________ test_query_with_meta _____________________________
[gw1] win32 -- Python 3.8.12 C:\Miniconda3\envs\test-environment\python.exe

db = 'sqlite:///C:\\Users\\RUNNER~1\\AppData\\Local\\Temp\\tmpwxx4978a'

    def test_query_with_meta(db):
        from sqlalchemy import sql
    
        data = {
            "name": pd.Series([], name="name", dtype="str"),
            "age": pd.Series([], name="age", dtype="int"),
        }
        index = pd.Index([], name="number", dtype="int")
        meta = pd.DataFrame(data, index=index)
    
        s1 = sql.select(
            [sql.column("number"), sql.column("name"), sql.column("age")]
        ).select_from(sql.table("test"))
        out = read_sql_query(s1, db, npartitions=2, index_col="number", meta=meta)
>       assert_eq(out, df[["name", "age"]])

dask\dataframe\io\tests\test_sql.py:433: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

a =            name  age
number              
0         Alice   33
1           Bob   40
2         Chris   22
3          Dora   16
4         Edith   53
5       Francis   30
6       Garreth   20
b =            name  age
number              
0         Alice   33
1           Bob   40
2         Chris   22
3          Dora   16
4         Edith   53
5       Francis   30
6       Garreth   20
check_names = True, check_dtype = True, check_divisions = True
check_index = True, scheduler = 'sync', kwargs = {}

    def assert_eq(
        a,
        b,
        check_names=True,
        check_dtype=True,
        check_divisions=True,
        check_index=True,
        scheduler="sync",
        **kwargs,
    ):
        if check_divisions:
            assert_divisions(a, scheduler=scheduler)
            assert_divisions(b, scheduler=scheduler)
            if hasattr(a, "divisions") and hasattr(b, "divisions"):
                at = type(np.asarray(a.divisions).tolist()[0])  # numpy to python
                bt = type(np.asarray(b.divisions).tolist()[0])  # scalar conversion
                assert at == bt, (at, bt)
        assert_sane_keynames(a)
        assert_sane_keynames(b)
        a = _check_dask(
            a, check_names=check_names, check_dtypes=check_dtype, scheduler=scheduler
        )
        b = _check_dask(
            b, check_names=check_names, check_dtypes=check_dtype, scheduler=scheduler
        )
        if hasattr(a, "to_pandas"):
            a = a.to_pandas()
        if hasattr(b, "to_pandas"):
            b = b.to_pandas()
        if isinstance(a, (pd.DataFrame, pd.Series)):
            a = _maybe_sort(a, check_index)
            b = _maybe_sort(b, check_index)
        if not check_index:
            a = a.reset_index(drop=True)
            b = b.reset_index(drop=True)
        if isinstance(a, pd.DataFrame):
>           tm.assert_frame_equal(
                a, b, check_names=check_names, check_dtype=check_dtype, **kwargs
E               AssertionError: Attributes of DataFrame.iloc[:, 1] (column name="age") are different
E               
E               Attribute "dtype" are different
E               [left]:  int32
E               [right]: int64

dask\dataframe\utils.py:560: AssertionError

ref: https://github.com/dask/dask/runs/4925139130?check_suite_focus=true
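
The left/right difference is almost certainly NumPy's platform-dependent default integer rather than anything in the SQL layer. A minimal sketch of the effect (illustrative, not the committed fix):

```python
import numpy as np
import pandas as pd

# "int" maps to np.int_ (the platform C long): 64-bit on Linux/macOS,
# but 32-bit on Windows runners with NumPy < 2 -- hence int32 vs int64.
platform_default = pd.Series([], dtype="int").dtype

# Pinning an explicit width in the test's meta is portable:
meta_age = pd.Series([], name="age", dtype="int64")
print(platform_default, meta_age.dtype)
```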

@McToel
Contributor Author

McToel commented Jan 24, 2022

That's interesting. On my platform (Ubuntu, Python 3.8) it works perfectly fine. I switched to Python 3.9 and removed df.append, which was causing a deprecation warning. But I don't have a Windows dev machine right now, so it is a little hard for me to debug these errors. @jsignell, do you have a Windows machine?

@jsignell
Member

I don't have a Windows machine, but I think we should merge this and fix the Windows issue as a follow-up. I am going to push a skip for the win32 case.

@jsignell
Member

I'm looking into how to filter all the warnings for python 3.7

@McToel
Contributor Author

McToel commented Jan 25, 2022

Thanks a lot! The only failing test is the docs build, and that seems to be unrelated to the changes in this branch. If @jsignell and @martindurant are happy with it, I think it should be merged.

@jsignell
Member

I just merged main into this branch to try to get CI working properly. Thanks for being so patient on this @McToel, it should be in soon.

@jsignell
Member

Ok I feel persuaded that this is an improvement. Thanks @McToel for taking on this work!

@jsignell merged commit 95e1cf3 into dask:main Jan 25, 2022
@jakirkham mentioned this pull request Mar 17, 2022

Successfully merging this pull request may close these issues:
Support for sqlalchemy >= 1.4.0

4 participants