Calling ipfs add on a directory with a large tree of sub-directories and files is slow. This use case is particularly important for file-system based package managers.
Background
IPFS deals with immutable blocks. Blocks are stored in the blockstore.
The UnixFS package breaks files up into chunks, and converts them to IPLD objects.
The DAG Service stores IPLD objects in the blockstore.
The Mutable File System (MFS) is an abstraction that presents IPFS as a file system. For example, consider the following directory structure with associated hashes:
animals <Qm1234...>
  land <Qm5678...>
    dogs.txt <Qm9012...>
    cats.txt <Qm3456...>
  ocean <Qm7890...>
    fish.txt <Qm4321...>
If the contents of fish.txt change, the CID for fish.txt will also change. The link from ocean → fish will change, so the CID for ocean will change. The link from animals → ocean will change, so the CID for animals will change. MFS manages those links and the propagation of changes up the tree.
Algorithm
ipfs add uses the MFS package to add files and directories to the IPFS blockstore. To add a directory with a large tree of sub-directories and files:
- Create an MFS root for the root directory (animals in the example above)
- Recurse through the directory structure in "depth first search" fashion.
  For each directory
  - Create a corresponding empty directory in MFS eg animals/ocean
    This adds the empty directory to the blockstore.
  - For each file in the directory eg animals/ocean/fish.txt
    - Read the file contents
    - Convert the contents into a chunked IPLD node
    - Add the IPLD Node and all its chunks to the blockstore
    - Create the directory in MFS, if it doesn't exist (†) eg animals/ocean
    - Add the IPLD Node representing the file to MFS at the correct path (eg animals/ocean/fish.txt)
      Note: This again adds the IPLD Node root to the blockstore
- Recurse through the MFS representation of the directory structure
  - For each directory, call directory.GetNode()
    Note that at this stage, the links to files in the directories have been created, so the directory created here will have a different CID than the empty directory created before the files were added. Calling directory.GetNode() (confusingly) writes the directory with links to files to the blockstore
(†) Although we've already created the directory, it's necessary to again ensure it exists before adding the file, because after processing every 256k files, the MFS internal directory cache structure is dereferenced to allow for golang garbage collection
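To make the write pattern of the steps above concrete, here is a rough model that only counts blockstore put calls. It does not use the real MFS or UnixFS packages: `entry`, `countingStore`, `addTwoPass`, and `exampleTree` are hypothetical names, chunk blocks are ignored, and the store does no deduplication, so the counts are calls rather than actual disk writes.

```go
package main

import "fmt"

// entry is a hypothetical in-memory input tree, not an MFS or UnixFS type.
type entry struct {
	name     string
	isDir    bool
	children []*entry
}

// countingStore is a toy stand-in for the blockstore.
// It counts put calls per label and performs no deduplication.
type countingStore struct {
	puts map[string]int
}

func (s *countingStore) put(label string) { s.puts[label]++ }

// total returns the number of put calls across all labels.
func (s *countingStore) total() int {
	n := 0
	for _, c := range s.puts {
		n += c
	}
	return n
}

// addTwoPass models the flow described above (an assumption-laden sketch):
// pass 1 writes an empty directory and then each file root twice,
// pass 2 writes every directory again, now with its links.
func addTwoPass(root *entry, s *countingStore) {
	var dirs []string
	var walk func(e *entry, path string)
	walk = func(e *entry, path string) {
		s.put("dir " + path) // empty directory is written first
		dirs = append(dirs, path)
		for _, c := range e.children {
			p := path + "/" + c.name
			if c.isDir {
				walk(c, p)
				continue
			}
			s.put("file " + p) // file converted to an IPLD node and stored
			s.put("file " + p) // root stored again when added to MFS
		}
	}
	walk(root, root.name)

	// Second recursion: each directory is written with its links.
	for _, d := range dirs {
		s.put("dir " + d)
	}
}

// exampleTree builds the animals tree from the Background section.
func exampleTree() *entry {
	return &entry{name: "animals", isDir: true, children: []*entry{
		{name: "land", isDir: true, children: []*entry{
			{name: "dogs.txt"},
			{name: "cats.txt"},
		}},
		{name: "ocean", isDir: true, children: []*entry{
			{name: "fish.txt"},
		}},
	}}
}

func main() {
	s := &countingStore{puts: map[string]int{}}
	addTwoPass(exampleTree(), s)
	fmt.Println("put calls:", s.total())
	fmt.Println("puts for animals/ocean:", s.puts["dir animals/ocean"])
}
```

For the three directories and three files of the example, the model makes twelve put calls: every directory and every file root is submitted twice.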
Areas for Improvement
- The IPLD Node root for a file is added to the blockstore twice
  - When the file is converted to an IPLD node
  - When the file is added to MFS
  - Note: @Stebalien points out that this is mitigated by the fact that we check if we already have a block before writing it (in the blockservice itself)
- The MFS directory structure is kept in memory
- Recursion over the directory structure happens twice
  - While reading the structure from input
  - While writing out the directories to the blockstore
- The progress indicator pauses for a long period when it is almost at 100% while the directories are being written to the blockstore
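The mitigation mentioned for the duplicate file-root write amounts to a "check before write" step. The sketch below shows that idea with a toy in-memory store. `memStore` and `putIfAbsent` are hypothetical names and do not reflect the actual blockservice API.

```go
package main

import "fmt"

// memStore is a toy in-memory block store keyed by an identifier string.
// It is an illustration only, not the go-ipfs blockstore or blockservice.
type memStore struct {
	blocks map[string][]byte
	writes int // number of times data was actually written
}

func newMemStore() *memStore {
	return &memStore{blocks: map[string][]byte{}}
}

// putIfAbsent writes the block only when the key is not already present,
// and reports whether a write happened.
func (s *memStore) putIfAbsent(key string, data []byte) bool {
	if _, ok := s.blocks[key]; ok {
		return false // already stored: skip the second write
	}
	s.blocks[key] = append([]byte(nil), data...) // store a private copy
	s.writes++
	return true
}

func main() {
	s := newMemStore()
	// The same file root is submitted twice, as in the flow described above.
	first := s.putIfAbsent("fish-root", []byte("fish"))
	second := s.putIfAbsent("fish-root", []byte("fish"))
	fmt.Println(first, second, s.writes) // prints "true false 1"
}
```

The second submission still costs a lookup, so the duplicate call is cheaper than a write but not free.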
Proposed Improvements
The above issues would be mitigated if we interact directly with the UnixFS API instead of with the MFS API:
- Recurse once over the directory structure
  - Add files as we go
  - Add directories once all their files have been added
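A single-pass version of this proposal could look roughly like the post-order walk below. It is a sketch under stated assumptions, not the UnixFS API: `entry`, `store`, `addOnce`, and `exampleTree` are hypothetical, identifiers are plain SHA-256 hex strings, and a directory's payload is just a list of name=identifier lines.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// entry is a hypothetical in-memory input tree, not a UnixFS type.
type entry struct {
	name     string
	isDir    bool
	data     []byte
	children []*entry
}

// store is a toy blockstore that records the order in which paths are written.
type store struct {
	order []string
}

// put records the write and returns a toy identifier (SHA-256 hex).
func (s *store) put(path string, payload []byte) string {
	sum := sha256.Sum256(payload)
	s.order = append(s.order, path)
	return hex.EncodeToString(sum[:])
}

// addOnce walks the tree a single time. Files are written as they are
// visited; a directory is written exactly once, after all of its entries
// have been written and their identifiers are known.
func addOnce(e *entry, path string, s *store) string {
	if !e.isDir {
		return s.put(path, e.data)
	}
	var links []byte
	for _, c := range e.children {
		id := addOnce(c, path+"/"+c.name, s)
		links = append(links, []byte(c.name+"="+id+"\n")...)
	}
	return s.put(path, links)
}

// exampleTree builds the animals tree from the Background section.
func exampleTree() *entry {
	return &entry{name: "animals", isDir: true, children: []*entry{
		{name: "land", isDir: true, children: []*entry{
			{name: "dogs.txt", data: []byte("dogs")},
			{name: "cats.txt", data: []byte("cats")},
		}},
		{name: "ocean", isDir: true, children: []*entry{
			{name: "fish.txt", data: []byte("fish")},
		}},
	}}
}

func main() {
	s := &store{}
	tree := exampleTree()
	addOnce(tree, tree.name, s)
	for _, p := range s.order {
		fmt.Println(p)
	}
}
```

In this model each file and directory is written once, the root is written last, and only the chain of directories currently being walked needs to be held in memory.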
Future Work
It has been noted that disk throughput and CPU usage are not close to maximum while adding large numbers of files. Future work should focus on analyzing these findings.