Skip to content

Add readdir feature#207

Merged
talex5 merged 7 commits intoocaml-multicore:mainfrom
patricoferris:readdir
Jun 1, 2022
Merged

Add readdir feature#207
talex5 merged 7 commits intoocaml-multicore:mainfrom
patricoferris:readdir

Conversation

@patricoferris
Copy link
Copy Markdown
Collaborator

This PR starts the implementation of a readdir feature for Eio. I opened it mainly to have a conversation about the API.

  • What should be returned when you read the contents of a directory? This PR only returns the names of the contents, libuv does expose the kind of the entries which probably would be nice to add?
  • What should the name of the function be, here it is read_dir but I have seen readdir to follow the convention used by the C function, contents used by Bos etc.

At the moment I don't see a way to provide an asynchronous version with uring? I think with the 5.17 kernel there will be a getdents64 uring submission function which could then be used.

Will fix #206

@talex5
Copy link
Copy Markdown
Collaborator

talex5 commented Mar 16, 2022

What should be returned when you read the contents of a directory? This PR only returns the names of the contents, libuv does expose the kind of the entries which probably would be nice to add?

Yes, getdents on Linux gives you the kind, which saves having to stat each entry in some cases. Though it can also return DT_UNKNOWN if the FS doesn't support it.

At the moment I don't see a way to provide an asynchronous version with uring?

We can use systhreads for things that uring doesn't handle. We'll probably need to do that anyway for stuff like gethostbyname.

@hexagonrecursion
Copy link
Copy Markdown

  1. Why is it read_dir : Eio.Dir.t -> path -> string list? I expected read_dir : Eio.Dir.t -> string list - "given Eio.Dir.t give me its contents"
  2. You should consider how the API will be used. If you give me a list of strings, I will usually have to check each one to see whether it is a directory or a file because the API for working with directories and files is different. How about a list of the following?
    type dirent = [ 
    | `File of path
    | `Dir of Eio.Dir.t
    ]
  3. What if a directory contains 10**12 files (see) and I don't want a list that big in memory? There should be an API to read the contents of the directory asynchronously just like we are able to read a file asynchronously

@patricoferris
Copy link
Copy Markdown
Collaborator Author

Thanks both for the comments :))

Though it can also return DT_UNKNOWN if the FS doesn't support it.

Yep, this is also exposed in luv https://aantron.github.io/luv/luv/Luv/File/Dirent/Kind/index.html#type-t.UNKNOWN

We can use systhreads for things that uring doesn't handle. We'll probably need to do that anyway for stuff like gethostbyname.

Forgot this was an option, I'll have a go!

Why is it read_dir : Eio.Dir.t -> path -> string list? I expected read_dir : Eio.Dir.t -> string list - "given Eio.Dir.t give me its contents"

My understanding is that an Eio.Dir.t is not a directory, but instead a file-system access capability. Maybe Dir is a bit misleading, perhaps Eio.Fs is better, but maybe @talex5 has good reason for the naming scheme? In the case of cwd it ensures we can't access things like ../../some/dir whereas the full-system access capability fs would allow such a thing but note that it doesn't actually impact the path, it just checks what we're trying to access with the path.

There's also https://github.com/ocaml-multicore/eio#design-note-capabilities for some more context on this design.

How about a list of the following?

type dirent = [ 
| `File of path
| `Dir of Eio.Dir.t
]

Yep returning the kinds makes sense. Maybe following the layout of luv so:

type dirent = {
  kind : kind;
  name : string;
}
and kind = [ `Unknown | `Dir | `File ... ]

`Dir of Eio.Dir.t isn't quite right given the explanation of Eio.Dir.t above.

What if a directory contains 10**12 files (see) and I don't want a list that big in memory?

Great point, I'll see what's available in the backend APIs to make this possible :))

@hexagonrecursion
Copy link
Copy Markdown

hexagonrecursion commented Mar 17, 2022

Given:

let cwd = Eio.Stdenv.cwd env in
Eio.Dir.read_dir cwd "path/to/foo"

will the result be path/to/foo/bar.txt or bar.txt? I think path/to/foo/bar.txt would be more convenient.

@patricoferris
Copy link
Copy Markdown
Collaborator Author

patricoferris commented Mar 17, 2022

Currently it would bebar.txt which I think is in line with readdir(3), Sys.readdir and Luv.File.readdir so if we want paths to the directory entries, it probably shouldn't be called read_dir. Reconstructing the path (once we have a better path type) wouldn't be too hard and maybe we could provide a convenience function built on top of read_dir for this? Alternatively, I suppose, it won't be too hard to provide a Path.filename function too.

@talex5
Copy link
Copy Markdown
Collaborator

talex5 commented Mar 18, 2022

Why is it read_dir : Eio.Dir.t -> path -> string list? I expected read_dir : Eio.Dir.t -> string list - "given Eio.Dir.t give me its contents"

My understanding is that an Eio.Dir.t is not a directory, but instead a file-system access capability. Maybe Dir is a bit misleading, perhaps Eio.Fs is better, but maybe @talex5 has good reason for the naming scheme?

I just copied Rust's name for it (https://docs.rs/cap-std/0.22.1/cap_std/fs/struct.Dir.html). They have e.g.

impl Dir {
  fn read_dir<P: AsRef<Path>>(&self, path: P) -> Result<ReadDir>

The Eio API also matches the POSIX one (e.g. openat2, where you pass a base FD and a path relative to that). The reason for having both is mostly to do with symlinks: you can't access the parent of the base FD, but you can follow symlinks within the subtree.

So the idea is that a Dir.t is a directory, and doesn't allow access to its parent. However, fs is an exception to this rule because it represents the current directory but also allows access to everything (via .., symlinks, or an absolute path).

@patricoferris
Copy link
Copy Markdown
Collaborator Author

I pushed a commit which tries to use systhreads now they're fixed in OCaml trunk. It's meant to generate a discussion about how best to use them to provide non-blocking, asynchronous versions of unsupported functions in Uring.

Right now the implementation uses a new effect, Unblock which takes a function to "unblock". It runs this function in a newly created systhread and the result is then enqueued to the run queue. I don't know if this is the general idea that you had in mind @talex5 ? I had a look at Lwt and Mwt which are more complicated with pools of threads and workers, but I thought thanks to effects we might be able to simplify that a fair bit.

Also the readdir function reuses openat2 to do the necessary filesystem containment checking, but then opens a directory handle anyway which probably isn't particularly efficient.

Copy link
Copy Markdown
Collaborator

@talex5 talex5 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, it seems to work!

Also the readdir function reuses openat2 to do the necessary filesystem containment checking, but then opens a directory handle anyway which probably isn't particularly efficient.

Also, there's a possible race if a symlink appears between the check and the open. The libc and OCaml APIs for reading directories aren't very nice, and if we want to return types too then I guess we should just call getdents64 directly and avoid the whole mess.

What if a directory contains 10**12 files (see) and I don't want a list that big in memory? There should be an API to read the contents of the directory asynchronously just like we are able to read a file asynchronously

Yes, probably makes sense for the backend API to provide open_dir, etc, and for read_dir to be a wrapper around that.

@patricoferris
Copy link
Copy Markdown
Collaborator Author

Yes, probably makes sense for the backend API to provide open_dir, etc, and for read_dir to be a wrapper around that.

Should Eio.Dir.read_dir try to read all directory entries or just some amount of them and the user can call it multiple times ? Or should that be a completely separate function ?

Copy link
Copy Markdown
Collaborator

@talex5 talex5 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should Eio.Dir.read_dir try to read all directory entries or just some amount of them and the user can call it multiple times ? Or should that be a completely separate function ?

Ideally, I think there should be two functions. Maybe the backend would provide something like:

method with_dir_entries : string -> (dir_entry Seq.t -> 'a) -> 'a

But I'm happy to merge it with just this read-everything function for now. Almost everyone will just use that.

@patricoferris patricoferris marked this pull request as ready for review May 27, 2022 21:20
@talex5 talex5 merged commit 3bcf6f3 into ocaml-multicore:main Jun 1, 2022
talex5 added a commit to talex5/opam-repository that referenced this pull request Jun 28, 2022
CHANGES:

API changes:

- `Net.accept_sub` is deprecated in favour of `accept_fork` (@talex5 ocaml-multicore/eio#240).
  `Fiber.fork_on_accept`, which it used internally, has been removed.

- Allow short writes in `Read_source_buffer` (@talex5 ocaml-multicore/eio#239).
  The reader is no longer required to consume all the data in one go.
  Also, add `Linux_eio.Low_level.writev_single` to expose this behaviour directly.

- `Eio.Unix_perm` is now `Eio.Dir.Unix_perm`.

New features:

- Add `Eio.Mutex` (@TheLortex @talex5 ocaml-multicore/eio#223).

- Add `Eio.Buf_write` (@talex5 ocaml-multicore/eio#235).
  This is a buffered writer for Eio sinks, based on Faraday.

- Add `Eio_mock` library for testing (@talex5 ocaml-multicore/eio#228).
  At the moment it has mock flows and networks.

- Add `Eio_mock.Backend` (@talex5 ocaml-multicore/eio#237 ocaml-multicore/eio#238).
  Allows running tests without needing a dependency on eio_main.
  Also, as it is single-threaded, it can detect deadlocks in test code instead of just hanging.

- Add `Buf_read.{of_buffer, of_string, parse_string{,_exn}, return}` (@talex5 ocaml-multicore/eio#225).

- Add `<*>` combinator to `Buf_read.Syntax` (@talex5 ocaml-multicore/eio#227).

- Add `Eio.Dir.read_dir` (@patricoferris @talex5 ocaml-multicore/eio#207 ocaml-multicore/eio#218 ocaml-multicore/eio#219)

Performance:

- Add `Buf_read` benchmark and optimise it a bit (@talex5 ocaml-multicore/eio#230).

- Inline `Buf_read.consume` to improve performance (@talex5 ocaml-multicore/eio#232).

Bug fixes / minor changes:

- Allow IO to happen even if a fiber keeps yielding (@TheLortex @talex5 ocaml-multicore/eio#213).

- Fallback for `traceln` without an effect handler (@talex5 ocaml-multicore/eio#226).
  `traceln` now works outside of an event loop too.

- Check for cancellation when creating a non-protected child context (@talex5 ocaml-multicore/eio#222).

- eio_linux: handle EINTR when calling `getrandom` (@bikallem ocaml-multicore/eio#212).

- Update to cmdliner.1.1.0 (@talex5 ocaml-multicore/eio#190).
talex5 added a commit to talex5/opam-repository that referenced this pull request Jun 28, 2022
CHANGES:

API changes:

- `Net.accept_sub` is deprecated in favour of `accept_fork` (@talex5 ocaml-multicore/eio#240).
  `Fiber.fork_on_accept`, which it used internally, has been removed.

- Allow short writes in `Read_source_buffer` (@talex5 ocaml-multicore/eio#239).
  The reader is no longer required to consume all the data in one go.
  Also, add `Linux_eio.Low_level.writev_single` to expose this behaviour directly.

- `Eio.Unix_perm` is now `Eio.Dir.Unix_perm`.

New features:

- Add `Eio.Mutex` (@TheLortex @talex5 ocaml-multicore/eio#223).

- Add `Eio.Buf_write` (@talex5 ocaml-multicore/eio#235).
  This is a buffered writer for Eio sinks, based on Faraday.

- Add `Eio_mock` library for testing (@talex5 ocaml-multicore/eio#228).
  At the moment it has mock flows and networks.

- Add `Eio_mock.Backend` (@talex5 ocaml-multicore/eio#237 ocaml-multicore/eio#238).
  Allows running tests without needing a dependency on eio_main.
  Also, as it is single-threaded, it can detect deadlocks in test code instead of just hanging.

- Add `Buf_read.{of_buffer, of_string, parse_string{,_exn}, return}` (@talex5 ocaml-multicore/eio#225).

- Add `<*>` combinator to `Buf_read.Syntax` (@talex5 ocaml-multicore/eio#227).

- Add `Eio.Dir.read_dir` (@patricoferris @talex5 ocaml-multicore/eio#207 ocaml-multicore/eio#218 ocaml-multicore/eio#219)

Performance:

- Add `Buf_read` benchmark and optimise it a bit (@talex5 ocaml-multicore/eio#230).

- Inline `Buf_read.consume` to improve performance (@talex5 ocaml-multicore/eio#232).

Bug fixes / minor changes:

- Allow IO to happen even if a fiber keeps yielding (@TheLortex @talex5 ocaml-multicore/eio#213).

- Fallback for `traceln` without an effect handler (@talex5 ocaml-multicore/eio#226).
  `traceln` now works outside of an event loop too.

- Check for cancellation when creating a non-protected child context (@talex5 ocaml-multicore/eio#222).

- eio_linux: handle EINTR when calling `getrandom` (@bikallem ocaml-multicore/eio#212).

- Update to cmdliner.1.1.0 (@talex5 ocaml-multicore/eio#190).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

API request: readdir

3 participants