Skip to content

[Python] Add read_csv method#6015

Merged
Mytherin merged 26 commits intoduckdb:masterfrom
Tishj:python_read_csv
Feb 2, 2023
Merged

[Python] Add read_csv method#6015
Mytherin merged 26 commits intoduckdb:masterfrom
Tishj:python_read_csv

Conversation

@Tishj
Copy link
Contributor

@Tishj Tishj commented Jan 27, 2023

This method is made to be a copy of the read_csv method of Pandas, as far as our current options allow it to be.

The supported options are:

  • header - can be given as 0, to be compatible with pandas, or as a boolean
  • dtype - can be given as a dict{str : str} or as a list(str)
  • sep|delimiter - given as string
  • na_values - define the null string
  • skiprows - skip the first n rows
  • compression - given as string
  • quotechar - given as string
  • escapechar given as string
  • encoding given as string (only utf-8 is supported)

Future improvements:
Add names, then we can also support prefix by combining it into the names
Add usecols, this can be done by pushing a projection on top of the read_csv_relation

@Tishj Tishj requested review from Mause and Mytherin and removed request for Mause January 27, 2023 14:39
Copy link
Collaborator

@Mytherin Mytherin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR! LGTM.

encoding given as string (only utf-8 is supported)

Unrelated to this PR - but I wonder if we could easily support different encodings using #5829

@Tishj Tishj requested a review from Mause January 27, 2023 16:25
filename = os.path.join(os.path.dirname(os.path.realpath(__file__)),'..','data',name)
return filename

class TestReadCSV(object):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we also add a test with the parallel_csv_reader?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps we can add a boolean flag parallel to the reader to trigger whether or not to enable the parallel read?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am looking into this, but the parallel csv reader is still locked behind context.options.experimental_parallel_csv_reader

I was thinking I could temporarily switch that to true and reset it after binding, but I don't think I can reliably do that
So maybe we should just upgrade it to a csv reader option?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So maybe we should just upgrade it to a csv reader option?

I think that is a good idea

Copy link
Contributor Author

@Tishj Tishj Jan 31, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removing the DBConfig setting as well, or not yet?

Probably not yet? Just letting the csv reader option override the behavior of the DBConfig setting if supplied?

@Tishj
Copy link
Contributor Author

Tishj commented Jan 31, 2023

I have a segfault that I really can't wrap my head around

Test:

	def test_encoding(self, duckdb_cursor):
		with pytest.raises(duckdb.BinderException, match="Copy is only supported for UTF-8 encoded files, ENCODING 'UTF-8'"):
			rel = duckdb_cursor.read_csv(TestFile('quote_escape.csv'), encoding=";")

Result:

  Fatal Python error: Segmentation fault
  
  Current thread 0x00007f84cd397740 (most recent call first):
    File "/project/tests/fast/api/test_read_csv.py", line 105 in test_encoding

The only thing I'm doing with encoding is checking it for "utf-8" and throwing an error if it's not, nothing else get's done for it

	if (!py::none().is(encoding)) {
		if (!py::isinstance<py::str>(encoding)) {
			throw InvalidInputException("read_csv only accepts 'encoding' as a string");
		}
		string encoding = StringUtil::Lower(py::str(encoding));
		if (encoding != "utf8" && encoding != "utf-8") {
			throw BinderException("Copy is only supported for UTF-8 encoded files, ENCODING 'UTF-8'");
		}
	}

Ah maybe because I re-use the variable, could be that Linux doesn't like that

@Tishj
Copy link
Contributor Author

Tishj commented Feb 2, 2023

I am kind of confused by this giant regression

pandas_load_lineitem
Old timing: 3.6648285388946533
New timing: 9.347676873207092

It seems consistent, but looking at regression_test_python.py::run_dataload I can't discern how these changes would affect it

Copy link
Collaborator

@Mytherin Mytherin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR! LGTM

@Mytherin Mytherin merged commit 67e2199 into duckdb:master Feb 2, 2023
@Tishj
Copy link
Contributor Author

Tishj commented Feb 3, 2023

Thanks, I also have a branch thats nearly ready for a PR for the read_parquet, to_csv, to_parquet + write_parquet (alias) methods

And the cleanup we talked about to the read_csv_relation internals

@Tishj Tishj deleted the python_read_csv branch November 7, 2025 16:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants