[Python] Add `read_csv` method by Tishj · Pull Request #6015 · duckdb/duckdb

Tishj · 2023-01-27T14:19:20Z

This method is meant to mirror the read_csv method of Pandas, as far as our current options allow.

The supported options are:

header - can be given as 0, to be compatible with pandas, or as a boolean
dtype - can be given as a dict{str : str} or as a list(str)
sep|delimiter - given as string
na_values - define the null string
skiprows - skip the first n rows
compression - given as string
quotechar - given as string
escapechar - given as string
encoding - given as string (only utf-8 is supported)
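
To make the option handling above concrete, here is a minimal Python sketch of how pandas-style arguments could be normalized before they reach the CSV reader. It is not the code from this PR (that lives in C++ in `pyconnection.cpp`): the helper name `normalize_read_csv_options`, the target keys (`delim`, `nullstr`, `skip`, `dtypes`) and the choice to reject `sep` and `delimiter` together are all assumptions made for illustration.

```python
from typing import Any, Dict, List, Optional, Union


def normalize_read_csv_options(
    header: Optional[Union[bool, int]] = None,
    dtype: Optional[Union[Dict[str, str], List[str]]] = None,
    sep: Optional[str] = None,
    delimiter: Optional[str] = None,
    na_values: Optional[str] = None,
    skiprows: Optional[int] = None,
) -> Dict[str, Any]:
    """Hypothetical helper: map pandas-style arguments to reader options."""
    options: Dict[str, Any] = {}

    if header is not None:
        # bool is a subclass of int, so test for bool first.
        if isinstance(header, bool):
            options["header"] = header
        elif header == 0:
            # pandas uses header=0 to say "the first row is the header".
            options["header"] = True
        else:
            raise ValueError("header must be a boolean or 0")

    if dtype is not None:
        # Either a column-name -> type mapping or a positional list of types.
        if isinstance(dtype, dict):
            options["dtypes"] = dict(dtype)
        elif isinstance(dtype, list):
            options["dtypes"] = list(dtype)
        else:
            raise ValueError("dtype must be a dict or a list of strings")

    # 'sep' and 'delimiter' are two spellings of the same option.
    # Rejecting both at once is an assumption of this sketch.
    if sep is not None and delimiter is not None:
        raise ValueError("pass either 'sep' or 'delimiter', not both")
    chosen = sep if sep is not None else delimiter
    if chosen is not None:
        options["delim"] = chosen

    if na_values is not None:
        options["nullstr"] = na_values
    if skiprows is not None:
        options["skip"] = skiprows

    return options
```

Only the options that were actually supplied end up in the result, so the reader's own defaults apply to everything else.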

Future improvements:
Add names, then we can also support prefix by combining it into the names
Add usecols, this can be done by pushing a projection on top of the read_csv_relation


Mytherin

Thanks for the PR! LGTM.

encoding given as string (only utf-8 is supported)

Unrelated to this PR - but I wonder if we could easily support different encodings using #5829

pdet · 2023-01-30T16:22:57Z

tools/pythonpkg/tests/fast/api/test_read_csv.py

+	filename = os.path.join(os.path.dirname(os.path.realpath(__file__)),'..','data',name)
+	return filename
+
+class TestReadCSV(object):


Can we also add a test with the parallel_csv_reader?

Perhaps we can add a boolean flag parallel to the reader to trigger whether or not to enable the parallel read?

I am looking into this, but the parallel csv reader is still locked behind context.options.experimental_parallel_csv_reader

I was thinking I could temporarily switch that to true and reset it after binding, but I don't think I can reliably do that
So maybe we should just upgrade it to a csv reader option?

So maybe we should just upgrade it to a csv reader option?

I think that is a good idea

Removing the DBConfig setting as well, or not yet?

Probably not yet? Just letting the csv reader option override the behavior of the DBConfig setting if supplied?
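
The precedence proposed here (an explicitly supplied reader option wins, otherwise the config setting applies) can be sketched in a few lines of Python. The function name `resolve_parallel` and its parameters are hypothetical and only illustrate the rule, not the actual binding code.

```python
from typing import Optional


def resolve_parallel(reader_option: Optional[bool], config_setting: bool) -> bool:
    """Hypothetical helper: decide whether to use the parallel CSV reader.

    reader_option is None when the caller did not pass 'parallel' at all,
    in which case the database-wide config setting is used unchanged.
    """
    if reader_option is not None:
        return reader_option
    return config_setting
```

Using `None` as the "not supplied" marker keeps an explicit `parallel=False` distinguishable from an omitted argument, which is what lets the option override a config setting that is set to true.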

Tishj · 2023-01-31T09:10:36Z

I have a segfault that I really can't wrap my head around

Test:

	def test_encoding(self, duckdb_cursor):
		with pytest.raises(duckdb.BinderException, match="Copy is only supported for UTF-8 encoded files, ENCODING 'UTF-8'"):
			rel = duckdb_cursor.read_csv(TestFile('quote_escape.csv'), encoding=";")

Result:

  Fatal Python error: Segmentation fault
  
  Current thread 0x00007f84cd397740 (most recent call first):
    File "/project/tests/fast/api/test_read_csv.py", line 105 in test_encoding

The only thing I'm doing with encoding is checking it for "utf-8" and throwing an error if it's not; nothing else gets done with it

	if (!py::none().is(encoding)) {
		if (!py::isinstance<py::str>(encoding)) {
			throw InvalidInputException("read_csv only accepts 'encoding' as a string");
		}
		string encoding = StringUtil::Lower(py::str(encoding));
		if (encoding != "utf8" && encoding != "utf-8") {
			throw BinderException("Copy is only supported for UTF-8 encoded files, ENCODING 'UTF-8'");
		}
	}

Ah, maybe it's because I re-use the variable name; it could be that Linux doesn't like that
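
For context on the variable re-use: in C++, a declaration such as `string encoding = ...(encoding)` inside the block shadows the outer `encoding`, and the new name is already in scope within its own initializer, so the initializer reads the not yet constructed local rather than the argument. That is undefined behavior and a plausible cause of the crash. The same check is sketched below in Python with a separate name for the normalized value. The function name `check_encoding` is hypothetical, and `TypeError`/`ValueError` stand in for `InvalidInputException`/`BinderException`.

```python
from typing import Optional

UTF8_ERROR = "Copy is only supported for UTF-8 encoded files, ENCODING 'UTF-8'"


def check_encoding(encoding: Optional[object]) -> None:
    """Hypothetical Python rendering of the 'encoding' validation."""
    if encoding is None:
        # Option not supplied: nothing to validate.
        return
    if not isinstance(encoding, str):
        raise TypeError("read_csv only accepts 'encoding' as a string")
    # Use a distinct name for the lower-cased value instead of re-using 'encoding'.
    normalized = encoding.lower()
    if normalized not in ("utf8", "utf-8"):
        raise ValueError(UTF8_ERROR)
```

With this shape, `check_encoding(";")` raises the UTF-8 error that the test expects, while `"UTF-8"`, `"utf8"` and `None` pass silently.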


Tishj · 2023-02-02T12:00:56Z

I am kind of confused by this giant regression

pandas_load_lineitem
Old timing: 3.6648285388946533
New timing: 9.347676873207092

It seems consistent, but looking at regression_test_python.py::run_dataload I can't discern how these changes would affect it

Mytherin

Thanks for the PR! LGTM

Tishj · 2023-02-03T05:18:51Z

Thanks, I also have a branch that's nearly ready for a PR for the read_parquet, to_csv, to_parquet + write_parquet (alias) methods

And the cleanup we talked about to the read_csv_relation internals

Tishj added 6 commits January 26, 2023 18:00

'read_csv' creates a read csv relation, and transforms arguments to the correct format for read_csv_auto

8491128

adding tests for read_csv method in python api

c32a8f6

added tests for all the supported duckdb.read_csv options

8694fc6

Merge branch 'master' into python_read_csv

a28c78a

add it to the pyconnection wrapper

7a2d02f

also allow 'dtype' to be a list of strings

0bd09e9

Tishj requested review from Mause and Mytherin and removed request for Mause January 27, 2023 14:39

Mytherin reviewed Jan 27, 2023

Tishj requested a review from Mause January 27, 2023 16:25

Mause suggested changes Jan 29, 2023

Tishj added 3 commits January 30, 2023 10:22

make checking for True less verbose

06258e4

fix CI failure on empty quote, and fix typo

74bace5

added docstrings for 'read_csv'

a144a87

Mause reviewed Jan 30, 2023

Tishj added 2 commits January 30, 2023 12:56

got rid of kwargs

6ef3deb

fix breaking CI because the tests file(s) could not be located

871ec98

pdet reviewed Jan 30, 2023

fix stubs

0854fdf

pdet reviewed Jan 30, 2023

Tishj added 8 commits January 31, 2023 10:14

fix 'encoding' test

619d6ba

add 'parallel' option to 'read_csv', overrides the 'experimental_parallel_csv' config setting

82b61c7

reworked read_csv_relation to derive from table_function_relation instead

653f18a

updated stubs to be explicitly optional, and added 'parallel'

2ac6038

fix read_csv_relation throwing a binder error unexpectedly

1a93d72

remove debug code

56af93f

merge 'from_csv_auto' and 'read_csv' into one, add a helper method to define aliases for python module methods

57a521e

remove dead code

34d0b80

Tishj added 6 commits February 1, 2023 14:19

add missing duckdb-only options

bfa8a6c

add tests for additional options

06e7af4

updated stubs

0ce6635

Merge branch 'master' into python_read_csv

901eb10

Merge branch 'master' into python_read_csv

1690189

add missing files used by tests

b648129

Mytherin approved these changes Feb 2, 2023

View reviewed changes

Mytherin merged commit 67e2199 into duckdb:master Feb 2, 2023

Tishj deleted the python_read_csv branch November 7, 2025 16:16


4 participants