Improve ipfs add performance for large trees of files & directories #6523

@dirkmc

Calling ipfs add on a directory with a large tree of sub-directories and files is slow. This use case is particularly important for file-system-based package managers.

Background

IPFS deals with immutable blocks. Blocks are stored in the blockstore.
The UnixFS package breaks files up into chunks and converts them to IPLD objects.
The DAG Service stores IPLD objects in the blockstore.

The Mutable File System (MFS) is an abstraction that presents IPFS as a file system. For example, consider the following directory structure with associated hashes:

animals         <Qm1234...>
  land          <Qm5678...>
    dogs.txt    <Qm9012...>
    cats.txt    <Qm3456...>
  ocean         <Qm7890...>
    fish.txt    <Qm4321...>

If the contents of fish.txt change, the CID for fish.txt will also change. The link from ocean → fish.txt will change, so the CID for ocean will change. The link from animals → ocean will change, so the CID for animals will change. MFS manages those links and the propagation of changes up the tree.
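
The propagation described above can be illustrated with a small sketch. It is not IPFS code: the Node type, the file, dir and buildTree helpers, and the use of a plain SHA-256 hex digest in place of a real CID are all assumptions made for illustration.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
)

// Node is a hypothetical stand-in for a UnixFS file or directory.
type Node struct {
	Data     []byte           // file contents (empty for directories)
	Children map[string]*Node // named links (nil for files)
}

// Hash returns a hex SHA-256 digest over the node's data and the
// (name, hash) pairs of its children, visited in sorted name order.
// A real CID is computed differently; this only mimics the dependency.
func (n *Node) Hash() string {
	h := sha256.New()
	h.Write(n.Data)
	names := make([]string, 0, len(n.Children))
	for name := range n.Children {
		names = append(names, name)
	}
	sort.Strings(names)
	for _, name := range names {
		h.Write([]byte(name))
		h.Write([]byte(n.Children[name].Hash()))
	}
	return hex.EncodeToString(h.Sum(nil))
}

func file(s string) *Node          { return &Node{Data: []byte(s)} }
func dir(c map[string]*Node) *Node { return &Node{Children: c} }

// buildTree recreates the animals example from the text.
func buildTree() (animals, land, ocean, fish *Node) {
	fish = file("fish")
	land = dir(map[string]*Node{"dogs.txt": file("dogs"), "cats.txt": file("cats")})
	ocean = dir(map[string]*Node{"fish.txt": fish})
	animals = dir(map[string]*Node{"land": land, "ocean": ocean})
	return animals, land, ocean, fish
}

func main() {
	animals, land, ocean, fish := buildTree()
	rootBefore, landBefore, oceanBefore := animals.Hash(), land.Hash(), ocean.Hash()

	fish.Data = []byte("updated fish") // edit only fish.txt

	fmt.Println("animals changed:", animals.Hash() != rootBefore)
	fmt.Println("ocean changed:  ", ocean.Hash() != oceanBefore)
	fmt.Println("land changed:   ", land.Hash() != landBefore)
}
```

Only the nodes on the path from fish.txt to the root get a new hash; land is untouched because none of its links changed.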

Algorithm

ipfs add uses the MFS package to add files and directories to the IPFS blockstore. To add a directory with a large tree of sub-directories and files:

  • Create an MFS root for the root directory (animals in the example above)
  • Recurse through the directory structure depth-first.
    For each directory:
    • Create a corresponding empty directory in MFS, e.g. animals/ocean
      This adds the empty directory to the blockstore.
    • For each file in the directory, e.g. animals/ocean/fish.txt
      • Read the file contents
      • Convert the contents into a chunked IPLD node
      • Add the IPLD node and all its chunks to the blockstore
      • Create the directory in MFS if it doesn't exist (†), e.g. animals/ocean
      • Add the IPLD node representing the file to MFS at the correct path, e.g. animals/ocean/fish.txt
        Note: This adds the IPLD node root to the blockstore a second time
  • Recurse through the MFS representation of the directory structure
    • For each directory, call directory.GetNode()
      Note: At this stage the links to the files in each directory have been created, so the directory written here has a different CID than the empty directory created before the files were added. Calling directory.GetNode() (confusingly) writes the directory, with its links to files, to the blockstore.

(†) Although we've already created the directory, it's necessary to ensure it exists again before adding the file, because after every 256k files processed, the internal MFS directory cache is dereferenced so that the Go garbage collector can reclaim it.
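
As a rough model of the steps above, the following sketch replays the two recursions over the animals example and records each write requested from the blockstore. It is a toy, not go-ipfs code: Dir, Store, addPass, flushPass and exampleTree are hypothetical names, and the order in which the second pass visits directories is an assumption.

```go
package main

import "fmt"

// Dir is a hypothetical in-memory directory tree standing in for the input.
type Dir struct {
	Name  string
	Files []string
	Dirs  []*Dir
}

// Store is a toy blockstore that only records which writes were requested.
type Store struct{ Puts []string }

func (s *Store) Put(what string) { s.Puts = append(s.Puts, what) }

// addPass models the first recursion: each directory is written while still
// empty, and each file root is requested twice (once when converted to an
// IPLD node, once when linked into MFS).
func addPass(s *Store, d *Dir, parent string) {
	p := parent + d.Name
	s.Put("empty dir " + p)
	for _, f := range d.Files {
		s.Put("file " + p + "/" + f) // chunked IPLD node added
		s.Put("file " + p + "/" + f) // same root added again via MFS
	}
	for _, sub := range d.Dirs {
		addPass(s, sub, p+"/")
	}
}

// flushPass models the second recursion, where directory.GetNode() writes
// each directory again, this time with links to its entries.
// Visiting children before their parent is an assumption of this sketch.
func flushPass(s *Store, d *Dir, parent string) {
	p := parent + d.Name
	for _, sub := range d.Dirs {
		flushPass(s, sub, p+"/")
	}
	s.Put("linked dir " + p)
}

// exampleTree recreates the animals example from the text.
func exampleTree() *Dir {
	return &Dir{Name: "animals", Dirs: []*Dir{
		{Name: "land", Files: []string{"dogs.txt", "cats.txt"}},
		{Name: "ocean", Files: []string{"fish.txt"}},
	}}
}

func main() {
	s := &Store{}
	root := exampleTree()
	addPass(s, root, "")
	flushPass(s, root, "")
	for _, p := range s.Puts {
		fmt.Println(p)
	}
	fmt.Println(len(s.Puts), "writes requested") // 12 for this tree
}
```

For three directories and three files the model requests twelve writes, of which only six (three file roots and three linked directories) are part of the final tree.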

Areas for Improvement

  • The IPLD Node root for a file is added to the blockstore twice
    • When the file is converted to an IPLD node
    • When the file is added to MFS
    • Note: @Stebalien points out that this is mitigated by the fact that we check if we already have a block before writing it (in the blockservice itself)
  • The MFS directory structure is kept in memory
  • Recursion over the directory structure happens twice
    • While reading the structure from input
    • While writing out the directories to the blockstore
  • The progress indicator pauses for a long period when it is almost at 100% while the directories are being written to the blockstore
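
The mitigation @Stebalien mentions for the double add (checking for an existing block before writing) could look roughly like the sketch below. memStore and its Put method are hypothetical and keyed by a plain SHA-256 digest; this is not the blockservice API.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// memStore is a hypothetical content-addressed store keyed by SHA-256 digest.
type memStore struct {
	blocks map[[32]byte][]byte
	writes int // number of writes that actually happened
}

func newMemStore() *memStore {
	return &memStore{blocks: make(map[[32]byte][]byte)}
}

// Put stores data only if no block with the same digest is present.
// It reports whether a write actually happened.
func (s *memStore) Put(data []byte) bool {
	key := sha256.Sum256(data)
	if _, ok := s.blocks[key]; ok {
		return false // already have it: skip the write
	}
	s.blocks[key] = append([]byte(nil), data...) // store a private copy
	s.writes++
	return true
}

func main() {
	s := newMemStore()
	root := []byte("IPLD root for fish.txt")
	fmt.Println(s.Put(root)) // true: first add is written
	fmt.Println(s.Put(root)) // false: second add (via MFS) is skipped
	fmt.Println(s.writes)    // 1
}
```

The second add still costs a hash and a lookup, so the duplicate call is cheaper than a write but not free.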

Proposed Improvements

The above issues would be mitigated if we interacted directly with the UnixFS API instead of with the MFS API:

  • Recurse once over the directory structure
  • Add files as we go
  • Add directories once all their files have been added

Future Work

It has been noted that neither disk throughput nor CPU usage is close to its maximum while adding large numbers of files. Future work should focus on finding out why.
