Tracker server library that previously powered LAION's distributed compute network for filtering Common Crawl with CLIP to produce the LAION-400M and LAION-5B datasets. The code has since been repurposed as a general-use multi-layer distributed compute tracker and job manager, with added support for a frontend web dashboard, user leaderboards, and up to 5 sequential stages of workers for each job.
- Client Repo: TheoCoombes/distcompute-client
- LAION-5B Paper: https://arxiv.org/abs/2210.08402
- LAION-5B Implementation (Client): TheoCoombes/crawlingathome
- LAION-5B Implementation (Server): TheoCoombes/crawlingathome-server
```shell
git clone https://github.com/TheoCoombes/Distributed-Compute-Tracker.git
cd Distributed-Compute-Tracker
pip install -r requirements.txt
```
- Redis Guide - follow steps 1-2.
- You may need to configure your Redis connection URL in `config.py` if you have changed any port bindings.
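For reference, a Redis connection URL in `config.py` typically takes the form below. This is a hypothetical excerpt — the variable name, port, and database index are assumptions; check `config.py` for the actual setting.

```python
# Hypothetical config.py excerpt — adjust the host/port to match any
# changed Redis port bindings; db 0 is the Redis default.
REDIS_CONN_URL = "redis://localhost:6379/0"
```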
- PostgreSQL Guide - follow steps 1-4, noting down the name you give your database.
- Install the required Python library for the database you are using (see link above).
- Configure your SQL connection in `config.py`, adding your database name to the `SQL_CONN_URL`.
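As a sketch, with a PostgreSQL database named `tracker_db`, the `SQL_CONN_URL` might look like the following. The credentials, host, and driver prefix are placeholders — the exact prefix depends on which Python SQL library you installed.

```python
# Hypothetical example value for config.py — replace the user, password,
# host, and database name with your own; the "postgresql://" prefix is an
# assumption based on the driver you installed.
SQL_CONN_URL = "postgresql://tracker_user:secret@localhost:5432/tracker_db"
```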
- Create a JSON file containing either a list of strings or a list of dicts, each holding job data (e.g. URLs / filenames to process), and run `init.py --json <file>` to set up the database.
- Alternatively, you can create a brace expansion for your initial job data, e.g. `init.py --brace "./data/file_{00..99}.tar"`.
- For more info, run `init.py --help`.
- WARNING: running `init.py` will reset your database, so ensure you make a backup of any previous data before running the script!
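For illustration, a small script like the one below could generate a valid jobs file in the list-of-dicts form described above. The `url` key and shard names are purely examples — any JSON-serialisable job data works, and a plain list of strings is also accepted.

```python
import json

# Build a hypothetical list of 100 jobs, one dict per job, each holding
# the data a stage-A worker needs (here, an example shard URL).
jobs = [
    {"url": f"https://example.com/shards/shard_{i:04d}.tar"}
    for i in range(100)
]

with open("jobs.json", "w") as f:
    json.dump(jobs, f)

# Then initialise the database with: init.py --json jobs.json
```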
- Open `config.py` and change `PROJECT_NAME` to a more suitable name for your project.
- Edit `STAGE_<N>` to add the names of each stage of your workflow. If the next stage is set to `None`, the job is marked as complete. Otherwise, workers operating at the next stage will receive the output of the current stage as an input.
- If you would like a linear `input -[worker]-> output` workflow, only enable `STAGE_A`.
- The default setting is the workflow previously used for the production of the LAION-5B dataset: CPU workers at stage A would download and store images + alt text from Common Crawl in tar files; GPU workers at stage B would then receive these tar files as input and filter the images using CLIP to create the final dataset (see paper).
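The stage settings described above might be sketched as follows. `PROJECT_NAME` and the `STAGE_<N>` names come from the steps above; the example values mirror the default LAION-5B-style workflow and are assumptions, not the repo's literal defaults.

```python
# Hypothetical config.py excerpt for a two-stage workflow:
PROJECT_NAME = "my-dataset"

STAGE_A = "CPU"   # stage A: download + store images/alt text in tar files
STAGE_B = "GPU"   # stage B: filter stage-A tar files with CLIP
STAGE_C = None    # None -> stage B is the final stage; jobs complete here
```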
- You can use either `gunicorn` or `uvicorn`. Previously, the LAION-5B production server used `uvicorn` with 12 worker processes, e.g.:
```shell
uvicorn main:app --host 0.0.0.0 --port 80 --workers 12
```
As stated in step 5 of the installation, you need to run the server directly using an ASGI server library of your choice:
```shell
uvicorn main:app --host 0.0.0.0 --port 80 --workers 12
```
- Runs the server through Uvicorn, using 12 worker processes.
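If you prefer `gunicorn`, ASGI apps are typically served through Uvicorn's worker class rather than gunicorn's default sync workers. A hedged equivalent of the command above (assuming the same `main:app` entry point) would be:

```shell
# Hypothetical gunicorn equivalent: -k selects uvicorn's ASGI worker
# class (requires the uvicorn package alongside gunicorn), -w sets the
# worker count, and -b binds the listen address.
gunicorn main:app -k uvicorn.workers.UvicornWorker -w 12 -b 0.0.0.0:80
```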