added fwrite and fread vignette #7216
Codecov Report
✅ All modified and coverable lines are covered by tests.
Additional details and impacted files:
@@ Coverage Diff @@
## master #7216 +/- ##
=======================================
Coverage 98.79% 98.79%
=======================================
Files 81 81
Lines 15254 15254
=======================================
Hits 15070 15070
Misses 184 184
| #### 1.1.2 **Reading from R connections and URLs** | ||
| `fread()` is highly versatile and can accept R connection objects as input to its file (or input) argument. This allows you to read from various sources, including: |
Where was this information sourced?
AFAIK it does not support connections
I've updated the vignette to remove the incorrect statement about R connection support.
The section now only documents reading from file paths and URLs (as character strings)
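For readers following the thread, a minimal sketch of the two input forms the revised section is said to cover. The temporary file and the URL are illustrative assumptions, not taken from the vignette; the URL call is left commented out because the address is a placeholder.

```r
library(data.table)

# Write a small CSV to a temporary file so the example is self-contained
csv_path = tempfile(fileext = ".csv")
writeLines(c("id,value", "1,a", "2,b"), csv_path)

# A file path is passed to fread as a character string
dt_file = fread(csv_path)
print(dt_file)

# A URL is passed the same way, as a character string.
# The address below is a placeholder, so the call is not run here.
# dt_url = fread("https://example.com/data.csv")

file.remove(csv_path)
```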
I think we prefer "fread and fwrite" over "fread() and fwrite()".
| @@ -0,0 +1,339 @@ | |||
| --- | |||
| title: "fread() and fwrite()" | |||
fread and fwrite
Or
Fast read and Fast write
| Look at this [example](https://stackoverflow.com/questions/36256706/fread-together-with-grepl/36270543#36270543) for more detail. | ||
| On Windows we recommend [Cygwin](https://www.cygwin.com/) (run one .exe to install) which includes the command line tools such as grep. In March 2016, Microsoft [announced](https://www.hanselman.com/blog/developers-can-run-bash-shell-and-usermode-ubuntu-linux-binaries-on-windows-10) they will include these tools in Windows 10 natively. On Linux and macOS, these tools have always been included in the operating system. You can find many examples and tutorials about command line tools online. We recommend [Data Science at the Command Line](https://www.oreilly.com/library/view/data-science-at/9781491947845/). |
I'm not sure a vignette is a good place to recommend a book, especially one that is not freely available.
| ```{r} | ||
| # 1. Create a sample data.table and write it to a gzipped CSV | ||
| set.seed(123) | ||
| original_dt <- data.table(A = 1:5, B = runif(5)) |
Why not = as assignment operator?
I will update it.
| output: rmarkdown::html_vignette # <--- Changed | ||
| vignette: > | ||
| %\VignetteIndexEntry{Fast Read and Fast Write} | ||
| %\VignetteEngine{knitr::rmarkdown} # <--- Changed |
Could you please use the same vignette engine and output format as the rest of the vignettes? Whom are these comments for?
| Look at this [example](https://stackoverflow.com/questions/36256706/fread-together-with-grepl/36270543#36270543) for more detail. | ||
| On Windows we recommend [Cygwin](https://www.cygwin.com/) (run one .exe to install) which includes the command line tools such as grep. In March 2016, Microsoft [announced](https://www.hanselman.com/blog/developers-can-run-bash-shell-and-usermode-ubuntu-linux-binaries-on-windows-10) they will include these tools in Windows 10 natively. On Linux and macOS, these tools have always been included in the operating system. You can find many examples and tutorials about command line tools online. |
| all_lines = readLines("example_data.txt") | ||
| data_lines = grep("HEADER", all_lines, value = TRUE, invert = TRUE) | ||
| fread(text = data_lines) |
Did the previous revision of this vignette fail to build for you? Now this code is in conflict with the text below because grep -v is nowhere to be seen.
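As a point of reference for this thread, a minimal sketch showing the pure-R filter from the snippet next to a `grep -v` pipeline passed through fread's `cmd` argument, so that text and code can be kept in agreement. The file contents are invented for illustration, and the shell variant only runs when `grep` is found on the PATH.

```r
library(data.table)

# Invented sample file with "HEADER" lines interleaved with data
example_path = tempfile(fileext = ".txt")
writeLines(
  c("HEADER generated by tool", "a,b", "1,2", "HEADER page 2", "3,4"),
  example_path
)

# Pure-R filtering, as in the revised snippet
all_lines = readLines(example_path)
data_lines = grep("HEADER", all_lines, value = TRUE, invert = TRUE)
dt_r = fread(text = data_lines)

# Shell filtering via cmd=, attempted only if grep is available
if (nzchar(Sys.which("grep"))) {
  dt_cmd = fread(cmd = paste("grep -v HEADER", shQuote(example_path)))
  print(isTRUE(all.equal(dt_r, dt_cmd)))
}

file.remove(example_path)
```

The guard on `Sys.which("grep")` keeps the chunk buildable on machines without the command line tools, which may be one way to address the build concern raised here.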
| `fread()` doesn't directly support SQL `INSERT` scripts, but they can be processed via command-line tools. For example, given `insert_script.sql`: | ||
| ```{r, eval=FALSE} |
Why make this a commented-out R code block? If you can't get SQL syntax highlighting to work (shouldn't it be just ```SQL?), try a non-typed code block delimited by just ```.
| file.remove("insert_script.sql") | ||
| ``` | ||
| - The `awk` command transforms each INSERT line into a comma-separated list of its values. |
This line conflicts with the code block above.
| ### 1.6 **`integer64` Support** | ||
| By default, `fread` detects integers larger than 2³¹ and reads them as `bit64::integer64` to preserve full precision. This behavior can be overridden in three ways: |
Markdown is a superset of HTML. Does 2<sup>31</sup> work? Does $2^{31}$ work with our vignette engines? (They don't necessarily work with equations, but <sup> should work.)
Both of them work.
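The excerpt above does not show the three override mechanisms the vignette lists. As a hedged illustration only, the sketch below uses fread's `integer64` argument and the `datatable.integer64` option, both of which appear in fread's documented signature. The sample data is invented.

```r
library(data.table)

# Invented sample with values beyond the 32-bit integer range
big_csv = "id,big\n1,9007199254740993\n2,12345678901"

# Per-call override: read large integers as double (precision may be lost)
dt_dbl = fread(text = big_csv, integer64 = "double")

# Per-call override: read large integers as character (digits kept verbatim)
dt_chr = fread(text = big_csv, integer64 = "character")

# Session-wide override through the option consulted by fread's default
old_opt = options(datatable.integer64 = "character")
dt_opt = fread(text = big_csv)
options(old_opt)

# Default behaviour relies on the bit64 package, so it is guarded here
if (requireNamespace("bit64", quietly = TRUE)) {
  dt_i64 = fread(text = big_csv)
  print(class(dt_i64$big))
}
```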
|
welcome for further vignette translation. |
| Look at this [example](https://stackoverflow.com/questions/36256706/fread-together-with-grepl/36270543#36270543) for more detail. | ||
| On Windows we recommend [Cygwin](https://www.cygwin.com/) (run one .exe to install) which includes the command line tools such as grep. In March 2016, Microsoft [announced](https://www.hanselman.com/blog/developers-can-run-bash-shell-and-usermode-ubuntu-linux-binaries-on-windows-10) they will include these tools in Windows 10 natively. On Linux and macOS, these tools have always been included in the operating system. You can find many examples and tutorials about command line tools online. We recommend [Data Science at the Command Line](https://www.oreilly.com/library/view/data-science-at/9781491947845/). |
Also, I'm not sure we should recommend Cygwin anymore. I'm working on WSL here and it works like a charm.
This is for fread to run commands on Windows, right?
Do you need any special environment to give fread access to Unix shell commands from WSL or Rtools?
If so, it would be helpful to document.
Co-authored-by: Benjamin Schwendinger <52290390+ben-schwen@users.noreply.github.com>
| When data is written as strings (either inherently, like character columns, or by choice, like `dateTimeAs="ISO"`), `quote="auto"` (default) intelligently quotes fields: | ||
| **Contextual Quoting**: Only quotes fields if they contain the delimiter (`sep`), a double quote, or a newline. This ensures clean, RFC 4180-compliant CSVs. |
We also quote fields that contain a carriage return (`\r`), and empty strings (`""`) to distinguish them from `NA`, so we are not strictly RFC 4180-compliant.
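A small sketch of the behaviour described in this comment, using invented data and the default `quote = "auto"`. It writes a field containing the separator, a plain field, an empty string, and `NA`, then prints the raw lines so the quoting of each case can be inspected.

```r
library(data.table)

# Invented data covering the cases discussed: separator, plain text, "", NA
quote_dt = data.table(
  text_field = c("a,b", "plain", "", NA),
  numeric_field = 1:4
)

quote_path = tempfile(fileext = ".csv")
fwrite(quote_dt, quote_path)  # quote = "auto" is the default

# Inspect the raw file: the empty string is written quoted, NA as an empty field
written = readLines(quote_path)
print(written)

file.remove(quote_path)
```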
| text_field = c("Contains,a,comma", "Contains \"a quote\"", "Clean_text", "", NA), | ||
| numeric_field = 1:5 | ||
| ) | ||
| temp_quote_adv <- tempfile(fileext = ".csv") |
Why go back to using `<-` for assignment?
| .old.th = setDTthreads(1) | ||
| ``` | ||
| The `fread()` and `fwrite()` functions in the `data.table` R package are not only optimized for speed on large files, but also offer powerful and convenient features for working with small datasets. This vignette highlights their usability, flexibility, and performance for efficient data import and export. |
Do we really highlight performance?
Should I remove the word from the introduction, or would it be better to add a short new section at the end? The new section could include a simple system.time() comparison of fread/fwrite vs read.csv/write.csv on a 1-million-row table.
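For concreteness, a rough sketch of what such a `system.time()` comparison could look like. The table layout and the reduced row count are assumptions chosen to keep the chunk quick to run. Timings vary by machine, so no results are claimed here.

```r
library(data.table)

set.seed(1)
n_rows = 1e5  # assumption: reduced size; raise to 1e6 for the proposed comparison
bench_dt = data.table(
  id = seq_len(n_rows),
  x = runif(n_rows),
  g = sample(letters, n_rows, replace = TRUE)
)

f_dt = tempfile(fileext = ".csv")
f_base = tempfile(fileext = ".csv")

# Writing
t_fwrite = system.time(fwrite(bench_dt, f_dt))
t_write_csv = system.time(write.csv(bench_dt, f_base, row.names = FALSE))

# Reading; `<-` is required inside system.time() because `=` would be
# parsed as a named argument
t_fread = system.time(dt_back <- fread(f_dt))
t_read_csv = system.time(df_back <- read.csv(f_base))

# Elapsed seconds for each call
print(c(
  fwrite = t_fwrite[["elapsed"]],
  write.csv = t_write_csv[["elapsed"]],
  fread = t_fread[["elapsed"]],
  read.csv = t_read_csv[["elapsed"]]
))

file.remove(f_dt, f_base)
```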
ben-schwen left a comment:
From my perspective the only open thing is the recommendation of the command line book and cygwin as pointed out by Jan.
ben-schwen left a comment:
LGTM, though I'm not sure we want to do benchmarks.
You could link the atime benchmarks: https://tdhock.github.io/blog/2024/pandas-dt/ and https://tdhock.github.io/blog/2023/dt-atime-figures/
Great, thanks.
closes #2855
This PR adds a vignette on data I/O with fread() and fwrite(), addressing issue #2855.
Covers key options, performance tips, and optimizations (e.g., multi-threading, column selection).
The fwrite() section documents:
Includes runnable code snippets for common workflows.
Hi @MichaelChirico, @jangorecki, @tdhock, could you please have a look when you have time?
Thanks.