@venom1204
Contributor

closes #2855

This PR adds a vignette on data I/O with `fread()` and `fwrite()`, addressing issue #2855.

Covers key options, performance tips, and optimizations (e.g., multi-threading, column selection).
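
As a rough illustration of the column selection and multi-threading options mentioned above, here is a minimal sketch. It is not taken from the vignette; the file contents and column names are made up for the example.

```r
library(data.table)

# Hypothetical sample file; names and values are illustrative only
csv_path = tempfile(fileext = ".csv")
fwrite(data.table(id = 1:3, name = c("a", "b", "c"), score = c(0.5, 1.5, 2.5)), csv_path)

# Read only two of the three columns and cap the number of threads used
dt = fread(csv_path, select = c("id", "score"), nThread = 1L)
print(dt)

unlink(csv_path)
```

Selecting columns at read time avoids parsing data that would be dropped anyway, which is presumably the kind of optimization the description refers to.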

The fwrite() section documents:

  • Intelligent quoting (quote="auto").
  • Fine-grained date/time serialization (dateTimeAs).
  • Support for bit64::integer64.
  • Column order and subset control.
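
A minimal sketch of how some of the listed `fwrite()` options might be combined (an illustration only, not the vignette's own snippet; the table and its column names are hypothetical):

```r
library(data.table)

# Hypothetical table; column names are illustrative only
dt = data.table(
  id = 1:2,
  note = c("plain", "has,comma"),
  day = as.Date(c("2025-01-01", "2025-12-31"))
)
out_path = tempfile(fileext = ".csv")

# Reorder and subset columns before writing; squash dates to yyyymmdd;
# quote = "auto" (the default) quotes only the fields that need it
fwrite(dt[, .(day, note)], out_path, quote = "auto", dateTimeAs = "squash")
written = readLines(out_path)
print(written)

unlink(out_path)
```

Column order and subsetting are handled here by ordinary `data.table` subsetting before the call, which is one simple way to control what ends up in the file.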

Includes runnable code snippets for common workflows.

Hi @MichaelChirico, @jangorecki, @tdhock, could you please have a look when you have time?

Thanks.

@codecov

codecov bot commented Jul 27, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 98.79%. Comparing base (053d905) to head (d1cb56a).
⚠️ Report is 2 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #7216   +/-   ##
=======================================
  Coverage   98.79%   98.79%           
=======================================
  Files          81       81           
  Lines       15254    15254           
=======================================
  Hits        15070    15070           
  Misses        184      184           


#### 1.1.2 **Reading from R connections and URLs**

`fread()` is highly versatile and can accept R connection objects as input to its file (or input) argument. This allows you to read from various sources, including:
Member

Where was this information sourced?

Member

AFAIK it does not support connections

Contributor Author

I've updated the vignette to remove the incorrect statement about R connection support.
The section now documents only reading from file paths and URLs (given as character strings).
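
Assuming, as stated in this thread, that `fread` does not accept connection objects, one possible workaround is to read the lines with base R and pass them through the `text` argument. The sketch below is only an illustration; the gzipped file is created on the spot and its contents are made up.

```r
library(data.table)

# Hypothetical input: a gzipped CSV created just for this sketch
gz_path = tempfile(fileext = ".csv.gz")
con = gzfile(gz_path, "w")
writeLines(c("A,B", "1,x", "2,y"), con)
close(con)

# Open a connection, pull the lines into memory, then hand them to fread
con = gzfile(gz_path, "r")
lines = readLines(con)
close(con)
dt = fread(text = lines)
print(dt)

unlink(gz_path)
```

This loads the whole input into memory before parsing, so it is better suited to small inputs than to large files.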

@jangorecki
Member

I think we prefer "fread and fwrite" over "fread() and fwrite()".

@@ -0,0 +1,339 @@
---
title: "fread() and fwrite()"
Member

@jangorecki jangorecki Jul 27, 2025

fread and fwrite
Or
Fast read and Fast write


Look at this [example](https://stackoverflow.com/questions/36256706/fread-together-with-grepl/36270543#36270543) for more detail.

On Windows we recommend [Cygwin](https://www.cygwin.com/) (run one .exe to install) which includes the command line tools such as grep. In March 2016, Microsoft [announced](https://www.hanselman.com/blog/developers-can-run-bash-shell-and-usermode-ubuntu-linux-binaries-on-windows-10) they will include these tools in Windows 10 natively. On Linux and macOS, these tools have always been included in the operating system. You can find many examples and tutorials about command line tools online. We recommend [Data Science at the Command Line](https://www.oreilly.com/library/view/data-science-at/9781491947845/).
Member

Not sure a vignette is a good place to recommend a book, especially one that is not freely available.

```{r}
# 1. Create a sample data.table and write it to a gzipped CSV
set.seed(123)
original_dt <- data.table(A = 1:5, B = runif(5))
Member

Why not `=` as the assignment operator?

Contributor Author

I will update it.

Comment on lines 4 to 7
output: rmarkdown::html_vignette # <--- Changed
vignette: >
%\VignetteIndexEntry{Fast Read and Fast Write}
%\VignetteEngine{knitr::rmarkdown} # <--- Changed
Member

@aitap aitap Jul 27, 2025

Could you please use the same vignette engine and output format as the rest of the vignettes? Whom are these comments for?


Look at this [example](https://stackoverflow.com/questions/36256706/fread-together-with-grepl/36270543#36270543) for more detail.

On Windows we recommend [Cygwin](https://www.cygwin.com/) (run one .exe to install) which includes the command line tools such as grep. In March 2016, Microsoft [announced](https://www.hanselman.com/blog/developers-can-run-bash-shell-and-usermode-ubuntu-linux-binaries-on-windows-10) they will include these tools in Windows 10 natively. On Linux and macOS, these tools have always been included in the operating system. You can find many examples and tutorials about command line tools online.
Member

Modern versions of Rtools consist of Cygwin (well, MSYS2) plus pre-compiled third-party dependencies. In particular, they do contain grep and gawk, which could be enough to let this vignette build on Windows. (But make sure that the awk example works with a POSIX, non-GNU awk.)

Comment on lines 49 to 51
all_lines = readLines("example_data.txt")
data_lines = grep("HEADER", all_lines, value = TRUE, invert = TRUE)
fread(text = data_lines)
Member

Did the previous revision of this vignette fail to build for you? Now this code is in conflict with the text below because grep -v is nowhere to be seen.
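
For reference, the two routes being contrasted here can be sketched side by side. This is an illustration under assumptions: the file contents are made up, and the shell route is only attempted when a `grep` binary is found on the `PATH`.

```r
library(data.table)

# Hypothetical example file, mirroring the snippet under review
path = tempfile(fileext = ".txt")
writeLines(c("HEADER: generated file", "a,b", "1,2", "HEADER: page 2", "3,4"), path)

# Pure-R route: drop the lines matching "HEADER" before parsing
all_lines = readLines(path)
data_lines = grep("HEADER", all_lines, value = TRUE, invert = TRUE)
dt_r = fread(text = data_lines)
print(dt_r)

# Shell route: only attempted when a grep binary is found on the PATH
if (nzchar(Sys.which("grep"))) {
  dt_cmd = fread(cmd = paste("grep -v HEADER", shQuote(path)))
  print(all.equal(dt_r, dt_cmd))
}

unlink(path)
```

The pure-R route builds on any platform, while the `cmd` route matches text that talks about `grep -v`; whichever the vignette keeps, the prose and the code should describe the same one.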


`fread()` doesn't directly support SQL `INSERT` scripts, but they can be processed via command-line tools. For example, given `insert_script.sql`:

```{r, eval=FALSE}
Member

Why make this a commented-out R code block? If you can't get SQL syntax highlighting to work (shouldn't it be just ```SQL?), try a non-typed code block delimited by just ```.

file.remove("insert_script.sql")
```

- The `awk` command transforms each INSERT line into a comma-separated list of its values.
Member

This line conflicts with the code block above.
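
To make the transformation concrete, here is a minimal sketch of turning INSERT lines into comma-separated values. It uses base R regular expressions rather than the `awk` command the vignette text refers to, and it assumes one single-row INSERT per line; the table name and values are hypothetical.

```r
library(data.table)

# Hypothetical script contents; table and values are made up
sql_lines = c(
  "INSERT INTO people VALUES (1,'alice',2.5);",
  "INSERT INTO people VALUES (2,'bob',3.75);"
)

# Keep only what sits between "VALUES (" and the closing ");"
values = sub("^.*VALUES \\((.*)\\);$", "\\1", sql_lines)

# Single quotes delimit SQL strings, so tell fread to treat ' as the quote character
dt = fread(text = values, header = FALSE, quote = "'")
print(dt)
```

Multi-row INSERT statements or escaped quotes inside strings would need more careful parsing than this sketch attempts.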


### 1.6 **`integer64` Support**

By default, `fread` detects integers larger than 2³¹ and reads them as `bit64::integer64` to preserve full precision. This behavior can be overridden in three ways:
Member

Markdown is a superset of HTML. Does 2<sup>31</sup> work? Does $2^{31}$ work with our vignette engines? (They don't necessarily work with equations, but <sup> should work.)

Contributor Author

Both of them work.
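
As a side note on the section under review, one of the override routes (the per-call `integer64` argument) can be sketched as follows. The input is hypothetical, and the sketch sticks to the two settings that return base R types.

```r
library(data.table)

# Hypothetical input with ids beyond the 32-bit integer range
csv_text = c("id,value", "1234567890123,1", "9876543210987,2")

# Per-call override: keep the large ids as strings ...
as_chr = fread(text = csv_text, integer64 = "character")
# ... or as doubles (exact only up to 2^53)
as_dbl = fread(text = csv_text, integer64 = "double")

print(sapply(as_chr, class))
print(sapply(as_dbl, class))
```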

@venom1204 venom1204 requested a review from aitap July 27, 2025 19:05
@ChristianWia
Contributor

welcome for further vignette translation.

@venom1204 venom1204 requested a review from ben-schwen July 31, 2025 12:39

Look at this [example](https://stackoverflow.com/questions/36256706/fread-together-with-grepl/36270543#36270543) for more detail.

On Windows we recommend [Cygwin](https://www.cygwin.com/) (run one .exe to install) which includes the command line tools such as grep. In March 2016, Microsoft [announced](https://www.hanselman.com/blog/developers-can-run-bash-shell-and-usermode-ubuntu-linux-binaries-on-windows-10) they will include these tools in Windows 10 natively. On Linux and macOS, these tools have always been included in the operating system. You can find many examples and tutorials about command line tools online. We recommend [Data Science at the Command Line](https://www.oreilly.com/library/view/data-science-at/9781491947845/).
Member

Also not sure that we should recommend Cygwin anymore. I'm working on WSL here and it works like a charm.

Member

This is for `fread` to run commands on Windows, right?
Do you need any special environment to give `fread` access to Unix shell commands from WSL or Rtools?

Member

If so, it would be helpful to document.

venom1204 and others added 4 commits August 4, 2025 15:14
Co-authored-by: Benjamin Schwendinger <52290390+ben-schwen@users.noreply.github.com>
Co-authored-by: Benjamin Schwendinger <52290390+ben-schwen@users.noreply.github.com>

When data is written as strings (either inherently, like character columns, or by choice, like `dateTimeAs="ISO"`), `quote="auto"` (default) intelligently quotes fields:

**Contextual Quoting**: Only quotes fields if they contain the delimiter (`sep`), a double quote, or a newline. This ensures clean, RFC 4180-compliant CSVs.
Member

We also quote fields containing a carriage return (`\r`), and empty strings (`""`) to distinguish them from `NA`, so we are not strictly RFC 4180-compliant.
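
The behaviour described in this comment can be checked with a small sketch like the one below. It reuses the kinds of values from the snippet under review; the object names are illustrative, and the file is inspected rather than the output being asserted here.

```r
library(data.table)

# Same kinds of values as the snippet under review; names are illustrative
dt = data.table(
  text_field = c("Contains,a,comma", "Contains \"a quote\"", "Clean_text", "", NA),
  numeric_field = 1:5
)
out_path = tempfile(fileext = ".csv")
fwrite(dt, out_path)  # quote = "auto" is the default

written = readLines(out_path)
writeLines(written)   # inspect which fields were quoted

# Reading back should keep "" and NA distinct
back = fread(out_path)
print(back)

unlink(out_path)
```

Comparing the row holding the empty string with the row holding `NA` in the written file shows whether the two are serialized differently.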

text_field = c("Contains,a,comma", "Contains \"a quote\"", "Clean_text", "", NA),
numeric_field = 1:5
)
temp_quote_adv <- tempfile(fileext = ".csv")
Member

Why go back to using `<-` for assignment?

.old.th = setDTthreads(1)
```

The `fread()` and `fwrite()` functions in the `data.table` R package are not only optimized for speed on large files, but also offer powerful and convenient features for working with small datasets. This vignette highlights their usability, flexibility, and performance for efficient data import and export.
Member

Do we really highlight performance?

Contributor Author

Should I remove the word from the introduction, or would it be better to add a short new section at the end? The new section could include a simple system.time() comparison of fread/fwrite vs read.csv/write.csv on a 1-million-row table.
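
For discussion, the proposed comparison might look roughly like the sketch below. It uses a smaller table than the 1-million-row one suggested so that it runs quickly, the column names are made up, and no timing results are claimed since they depend on the machine.

```r
library(data.table)

# Smaller than the 1-million-row table proposed above so the sketch runs quickly
n = 1e5
set.seed(1)
dt = data.table(id = seq_len(n), x = runif(n), g = sample(letters, n, replace = TRUE))
f_dt = tempfile(fileext = ".csv")
f_base = tempfile(fileext = ".csv")

# Timings depend on machine, disk and thread count; no results are claimed here
t_fwrite = system.time(fwrite(dt, f_dt))
t_write_csv = system.time(write.csv(dt, f_base, row.names = FALSE))
# `<-` is required inside system.time(); `=` would be parsed as an argument name
t_fread = system.time(dt_back <- fread(f_dt))
t_read_csv = system.time(df_back <- read.csv(f_base))

# Elapsed seconds for each call
print(c(
  fwrite = t_fwrite[["elapsed"]],
  write.csv = t_write_csv[["elapsed"]],
  fread = t_fread[["elapsed"]],
  read.csv = t_read_csv[["elapsed"]]
))

unlink(c(f_dt, f_base))
```

A single `system.time()` call is noisy, so a section built on this would probably need repeated runs or a dedicated benchmarking tool to be convincing.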

Member

@ben-schwen ben-schwen left a comment

From my perspective the only open thing is the recommendation of the command line book and cygwin as pointed out by Jan.

Member

@ben-schwen ben-schwen left a comment

LGTM, though I'm not sure we want to do benchmarks.

@venom1204 venom1204 requested a review from tdhock August 6, 2025 08:19
@tdhock
Member

tdhock commented Aug 6, 2025

You could link the atime benchmarks: https://tdhock.github.io/blog/2024/pandas-dt/ and https://tdhock.github.io/blog/2023/dt-atime-figures/

@tdhock
Member

tdhock commented Aug 14, 2025

Great, thanks.

@tdhock tdhock merged commit 0e17e62 into master Aug 14, 2025
11 checks passed
@jangorecki jangorecki deleted the issue2855 branch September 27, 2025 09:16
Development

Successfully merging this pull request may close these issues.

fread (and fwrite) vignette

6 participants