Re-encode for my csv problem #52

Closed
scoates wants to merge 2 commits into simonw:main from scoates:reencode-csv

scoates commented Dec 31, 2021

Discussed in #50

I think I had two problems here, but they might be related:

  1. commit.tree.blobs was [] here, so I changed the lookup to use the tree['filename'] notation.
  2. My data was in Latin-1/ISO-8859-1 (I didn't know this at first), so I added a --re-encode option.

Tests pass.
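
As a rough illustration of the second point, here is a minimal Python sketch of why Latin-1 bytes fail a UTF-8 decode and how re-encoding resolves it. The helper name `reencode_to_utf8` and the sample row are hypothetical and are not taken from this PR's diff.

```python
def reencode_to_utf8(content: bytes, source_encoding: str) -> bytes:
    """Decode bytes from source_encoding and re-encode them as UTF-8."""
    return content.decode(source_encoding).encode("utf-8")


# Hypothetical sample row, stored the way a Latin-1 CSV would be on disk.
raw = "Hôpital de Montréal,12\n".encode("iso-8859-1")

# Decoding the raw bytes as UTF-8 fails, because 0xF4 ("ô" in Latin-1)
# is not followed by valid UTF-8 continuation bytes.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# After re-encoding, the same content decodes cleanly as UTF-8.
fixed = reencode_to_utf8(raw, "iso-8859-1")
text = fixed.decode("utf-8")
```

Python treats "latin-1" and "iso-8859-1" as aliases for the same codec, so either name should work as the source encoding.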

…e behaviour of a sometimes-empty commit.tree.blobs and avoids the filtered comprehension.
scoates changed the title from "Reencode for my csv problem" to "Re-encode for my csv problem" Dec 31, 2021
scoates (Author) commented Dec 31, 2021

FWIW:

❯ file -I scraped-data/emergency-rooms/quebec/Releve_horaire_urgences_7jours.csv

scraped-data/emergency-rooms/quebec/Releve_horaire_urgences_7jours.csv: application/csv; charset=iso-8859-1

simonw (Owner) commented Jul 27, 2022

Sorry for not looking at this sooner!

I'm not keen on --re-encode as the option here. I prefer --encoding X purely for consistency with my other tool sqlite-utils: https://sqlite-utils.datasette.io/en/stable/cli-reference.html#insert

scoates (Author) commented Jul 28, 2022

I honestly forget how this works. If you're happy with the other method, so am I. (-:

@lassebenni

Can we merge this? I also encountered this issue and didn't see this PR, so I ended up with a similar fix, but this would've saved me some time!

simonw pushed a commit that referenced this pull request Dec 21, 2025
This addresses the encoding issue reported in PR #52 where CSV files
encoded in Latin-1/ISO-8859-1 couldn't be processed because the default
encoding was hardcoded as UTF-8.

Changes:
- Add --encoding option to specify character encoding for CSV files
- Modify build_csv_convert_string() to accept and use the encoding parameter
- Add validation that --encoding requires --csv or --dialect
- Add comprehensive tests for Latin-1, ISO-8859-1, and UTF-16 encodings
simonw added a commit that referenced this pull request Dec 21, 2025
This addresses the encoding issue reported in PR #52 where CSV files
encoded in Latin-1/ISO-8859-1 couldn't be processed because the default
encoding was hardcoded as UTF-8.

Changes:
- Add --encoding option to specify character encoding for CSV files
- Modify build_csv_convert_string() to accept and use the encoding parameter
- Add validation that --encoding requires --csv or --dialect
- Add comprehensive tests for Latin-1, ISO-8859-1, and UTF-16 encodings
- Document --encoding option in README

Co-authored-by: Claude <noreply@anthropic.com>
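
The changes listed in the commit message could look roughly like the following Python sketch. It is not the code from the commit: `parse_csv_bytes` and `check_encoding_option` are hypothetical stand-ins for the behaviour described (decoding CSV content with a caller-supplied encoding, and rejecting --encoding when neither --csv nor --dialect is given), and the sample data is made up.

```python
import csv
import io
from typing import Dict, List, Optional


def parse_csv_bytes(
    content: bytes, encoding: str = "utf-8", dialect: str = "excel"
) -> List[Dict[str, str]]:
    """Decode CSV bytes with the given encoding and return rows as dicts."""
    decoded = content.decode(encoding)
    return list(csv.DictReader(io.StringIO(decoded), dialect=dialect))


def check_encoding_option(
    use_csv: bool, dialect: Optional[str], encoding: Optional[str]
) -> None:
    """Reject an encoding that was supplied without any CSV option."""
    if encoding is not None and not (use_csv or dialect):
        raise ValueError("--encoding requires --csv or --dialect")


# Hypothetical sample data in two non-UTF-8 encodings.
sample = "nom,attente\nHôpital Général,42\n"
rows_latin1 = parse_csv_bytes(sample.encode("latin-1"), encoding="latin-1")
rows_utf16 = parse_csv_bytes(sample.encode("utf-16"), encoding="utf-16")

# Supplying an encoding without a CSV option should be rejected.
try:
    check_encoding_option(use_csv=False, dialect=None, encoding="latin-1")
    rejected = False
except ValueError:
    rejected = True
```

Keeping UTF-8 as the default in this sketch means existing callers would see no change unless they pass an encoding explicitly.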

simonw (Owner) commented Dec 21, 2025

Now fixed.

simonw closed this Dec 21, 2025