FediLive is a data collection tool designed to quickly fetch platform-wide public activities from Mastodon instances during a user-defined time period for downstream analysis. A dataset collected via FediLive over a period of approximately two weeks has been published on Zenodo.
It currently provides two running modes via two separate branches:
- A. Single version: a lightweight version without MongoDB, suitable for single-machine or simpler crawling tasks.
- B. Multi version: a distributed version with MongoDB support (See MongoDB Guide), suitable for multi-machine parallel crawling, task coordination, and large-scale snapshot collection.
FediLive is developed and maintained by the Big Data and Networking (DataNET) Group at Fudan University.
If you use FediLive or the example dataset in your research, please cite our paper:
```bibtex
@inproceedings{Min2025FediLive,
  author = {Min, Shaojie and Wang, Shaobin and Luo, Yaxiao and Gao, Min and Gong, Qingyuan and Xiao, Yu and Chen, Yang},
  title = {{FediLive: A Framework for Collecting and Preprocessing Snapshots of Decentralized Online Social Networks}},
  year = {2025},
  booktitle = {Companion Proceedings of the ACM on Web Conference 2025},
  series = {WWW '25},
  publisher = {Association for Computing Machinery},
  address = {New York, NY, USA},
  pages = {765--768},
  doi = {10.1145/3701716.3715298},
  url = {https://doi.org/10.1145/3701716.3715298}
}
```

FediLive supports collecting the following types of public Mastodon data:
- Posts and replies
- Reblogs and favourites
- Contexts of posts/conversations
- User interaction networks for downstream preprocessing and analysis
Choose the Single version if:
- you intend to run FediLive on a single machine,
- you do not intend to deploy MongoDB,
- your crawling task is relatively small or you prefer a lightweight setup.
Choose the Multi version if:
- you have the ability to run FediLive on multiple machines in parallel,
- you need distributed task coordination,
- you intend to manage large-scale crawling more efficiently,
- you need centralized storage of instance status and crawl progress.
FediLive has been tested on Ubuntu 20.04 LTS.
- OS: Ubuntu 20.04 LTS (64-bit)
- Memory: 8GB RAM or above
- Storage: 20GB available space or above
- Python: 3.8–3.13
- MongoDB: 5.0.30 recommended
FediLive currently has two branches:
- A. Single version: branch `Single`
- B. Multi version: branch `Multi`
Clone the branch that matches your use case.
For the Single version:

```bash
git clone -b Single git@github.com:FDUDataNET/FediLive.git
cd FediLive
```

For the Multi version:

```bash
git clone -b Multi git@github.com:FDUDataNET/FediLive.git
cd FediLive
```

The following installation steps are shared by both versions.
```bash
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```

The Single version does not require MongoDB.
Edit config/config.yaml as follows:
```yaml
api:
  instance_token: "your_instance_api_token"
  livefeeds_token: "your_mastodon_api_token"
  email: "your_email@example.com"
paths:
  instances_list: "instances_list.txt"
logging:
  level: "INFO"
  file: "logs/app.log"
```

- `instance_token`: token for retrieving the list of Mastodon instances from instances.social. Apply at: https://instances.social/api/token
- `livefeeds_token`: token used to collect posts from Mastodon instances. Tokens can be requested according to the Mastodon API documentation: https://docs.joinmastodon.org/
- `email`: your contact email
- `instances_list`: file path used to save the retrieved list of instances
- `level`: logging level, such as `DEBUG`, `INFO`, `WARNING`, `ERROR`, or `CRITICAL`
- `file`: log file path
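FediLive's own loaders read this file at startup; as a quick sanity check before a long crawl you can parse it yourself (a minimal sketch assuming PyYAML, which the project's requirements already pull in; the inline sample mirrors the config above):

```python
import yaml

SAMPLE = """
api:
  instance_token: "your_instance_api_token"
  livefeeds_token: "your_mastodon_api_token"
  email: "your_email@example.com"
paths:
  instances_list: "instances_list.txt"
logging:
  level: "INFO"
  file: "logs/app.log"
"""

cfg = yaml.safe_load(SAMPLE)

# Fail fast if a required key is missing before starting a crawl.
for section, key in [("api", "instance_token"), ("api", "livefeeds_token"),
                     ("api", "email"), ("paths", "instances_list"),
                     ("logging", "level"), ("logging", "file")]:
    assert key in (cfg.get(section) or {}), f"missing {section}.{key}"
```

In practice you would replace `SAMPLE` with `yaml.safe_load(open("config/config.yaml"))`.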
The Multi version is designed for distributed parallel crawling across multiple machines.
In the Multi version:
- one machine should be selected as the central node
- the central node stores instance information, crawling ranges, and coordinates crawling tasks
- the other machines serve as worker nodes to crawl data from instances
- each machine should have MongoDB installed
- in `config.yaml`, `mongodb_central` should point to the central node database and `mongodb_local` should point to the current machine's local database
Edit config/config.yaml like this:
```yaml
mongodb_central:
  username: "central_admin"
  password: "CentralPassword123!"
  host: "central.mongodb.server.com"
  port: 27017
mongodb_local:
  username: "local_admin"
  password: "LocalPassword456!"
  host: "local.mongodb.server.com"
  port: 27018
api:
  central_token: "your_central_api_token"
  email: "your_email@example.com"
paths:
  instances_list: "instances_list.txt"
  token_list: "tokens/token_list.txt"
logging:
  level: "INFO"
  file: "logs/app.log"
whitelist:
  - "mastodon.social"
  - "mstdn.social"
```

- `mongodb_central`: connection information for the central node database
- `mongodb_local`: connection information for the local machine database
- `central_token`: token for collecting the list of Mastodon instances from instances.social. Apply at: https://instances.social/api/token
- `email`: your contact email
- `instances_list`: file path used to save the retrieved list of instances
- `token_list`: file containing Mastodon API tokens, one token per line
- `level`: logging level
- `file`: log file path
If the livefeeds time range is large, some large instances that are normally crawlable, such as mastodon.social, may occasionally encounter connection errors due to heavy request volume and may be blacklisted by livefeeds_worker.py.
You can add known stable large instances to the whitelist so they will not be blacklisted automatically.
You need:
- one `instance_token`
- one `livefeeds_token`
Populate tokens/token_list.txt with Mastodon API tokens, one token per line.
Make sure the number of tokens is greater than the number of parallel processes you plan to run.
These tokens are used to collect posts from various Mastodon instances.
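Since each parallel process needs its own token, a short pre-flight check can catch an undersized token file before any worker starts (a sketch; the file path and helper name are illustrative):

```python
def enough_tokens(token_file: str, planned_processes: int) -> bool:
    """Return True when the file holds more tokens (one per line,
    blank lines ignored) than the number of planned processes."""
    with open(token_file) as f:
        tokens = [line.strip() for line in f if line.strip()]
    return len(tokens) > planned_processes
```

For example, `enough_tokens("tokens/token_list.txt", 2)` should be `True` before launching `--processnum 2`.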
The usage differs slightly between the two versions. Shared steps are grouped together below, and different commands are shown separately.
This step retrieves the list of Mastodon instances.
Single version:

```bash
python -m fetcher.masto_list_fetcher
```

Multi version, run this on the central node:

```bash
python ./fetcher/masto_list_fetcher.py
```

This step collects public posts during a specified time period.
You can run this on one or multiple machines in parallel.
```bash
python -m fetcher.livefeeds_worker --processnum 2 --start "2024-01-01 00:00:00" --end "2024-01-02 00:00:00"
```

Parameters:
- `--processnum`: number of parallel processes
- `--start`: start time, format `YYYY-MM-DD HH:MM:SS`
- `--end`: end time, format `YYYY-MM-DD HH:MM:SS`
Run this on multiple machines in parallel:
```bash
python ./fetcher/livefeeds_worker.py --id 0 --processnum 2 --start "2024-01-01 00:00:00" --end "2024-01-02 00:00:00"
```

Parameters:
- `--id`: worker ID, starting from 0, used to select different API tokens
- `--processnum`: number of parallel processes on each host
- `--start`: start time, format `YYYY-MM-DD HH:MM:SS` (UTC+0)
- `--end`: end time, format `YYYY-MM-DD HH:MM:SS` (UTC+0)
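Because every machine reads the same `token_list.txt`, each `--id` must map to a disjoint set of tokens. The actual mapping is defined inside `livefeeds_worker.py`; one plausible scheme (purely illustrative, not the project's code) is a contiguous block assignment:

```python
def tokens_for_worker(tokens, worker_id, processnum):
    """Assign each worker a disjoint, contiguous block of tokens.

    Illustrative only: worker 0 gets tokens[0:processnum],
    worker 1 gets tokens[processnum:2*processnum], and so on.
    """
    start = worker_id * processnum
    block = tokens[start:start + processnum]
    if len(block) < processnum:
        raise ValueError("token_list.txt has too few tokens for this worker")
    return block
```

This also makes concrete why the token file must hold at least `num_workers * processnum` tokens.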
This step collects users who reblogged or favourited posts.
Single version:

```bash
python -m fetcher.reblog_favourite --processnum 3
```

Parameters:
- `--processnum`: number of parallel processes

Multi version:

```bash
python ./fetcher/reblog_favourite.py --processnum 3 --id 0
```

Parameters:
- `--processnum`: number of parallel processes
- `--id`: worker ID used to select different API tokens
A context refers to the complete reply conversation of a post.
This feature is not documented in the original Single README.
Run this on multiple machines in parallel:
```bash
python ./fetcher/context.py --processnum 3 --id 0
```

Parameters:
- `--processnum`: number of parallel processes
- `--id`: worker ID used to select different API tokens
This operation is documented only for the Multi version.
Run this on all machines to remove existing livefeeds, reblogs, favourites, and related crawl state stored in MongoDB.
Make sure to back up your data before running this command.
```bash
python ./fetcher/reboot.py
```

If some large but normally crawlable instances were temporarily marked unavailable during crawling, you can reactivate them using the whitelist:

```bash
python ./fetcher/reactivate_whitelist.py
```

- FediLive uses the Mastodon REST API for data collection.
- Some errors may occur during crawling due to network instability, busy servers, heterogeneous instance behavior, or rate limits.
- These errors are generally handled within the code.
The instance list retrieved in the first step may include not only Mastodon instances but also other platforms connected in the Fediverse ecosystem. As a result:
- some instances may not behave fully like Mastodon
- some requests may fail when using Mastodon REST API endpoints
- in these cases, the corresponding instance's `processable` flag in MongoDB may be set to `false`
- if an instance is temporarily overloaded, its `processable` flag may be set to `server_busy`
You can inspect detailed crawl status using mongosh.
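The exact document schema lives in FediLive's MongoDB collections; assuming status documents with `name` and `processable` fields (field names here are illustrative), triaging instances by flag is a one-liner each:

```python
# Illustrative instance-status documents; the real schema is in MongoDB.
statuses = [
    {"name": "mastodon.social", "processable": "server_busy"},
    {"name": "example.tld", "processable": False},
    {"name": "mstdn.social", "processable": True},
]

# Instances marked unavailable outright vs. temporarily overloaded.
failed = [s["name"] for s in statuses if s["processable"] is False]
busy = [s["name"] for s in statuses if s["processable"] == "server_busy"]
```

The `server_busy` entries are the ones that whitelist reactivation is meant to recover.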
All operations and errors are logged to the file specified in config/config.yaml.
Example configuration:
```yaml
logging:
  level: "INFO"
  file: "logs/app.log"
```

- `DEBUG`: detailed information, mainly for diagnosing problems
- `INFO`: confirmation that things are working as expected
- `WARNING`: an indication that something unexpected happened, or may happen soon
- `ERROR`: a more serious problem that prevents part of the program from working correctly
- `CRITICAL`: a very serious error indicating that the program may not be able to continue
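These levels follow Python's standard `logging` hierarchy, so a configured level of `INFO` suppresses `DEBUG` records while letting everything from `INFO` upward through (a small standalone sketch, separate from FediLive's own logger setup):

```python
import io
import logging

# Route records into a string buffer so the filtering effect is visible.
buf = io.StringIO()
logger = logging.getLogger("demo")
logger.setLevel(logging.INFO)
handler = logging.StreamHandler(buf)
handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))
logger.addHandler(handler)

logger.debug("dropped: below the configured INFO level")
logger.info("kept")
logger.error("kept as well")
```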
This guidance is only for the Multi version.
Visit the following pages to download and install MongoDB and mongosh:
- MongoDB Community Server: https://www.mongodb.com/try/download/community
- MongoDB Shell: https://www.mongodb.com/try/download/shell
Install MongoDB on each server that participates in crawling.
In your MongoDB config file (for example, `mongod.conf`), modify:
- `net.bindIp` to `0.0.0.0`
- `net.port` to a non-default value if needed
This allows worker machines to access the central node database.
For security reasons, it is strongly recommended to:
- avoid exposing the default port directly
- configure username/password authentication
- restrict network access where possible
Run the following commands:
```bash
mongosh --port your_port_number
```

```javascript
use admin
db.createUser({
  user: "your_username",
  pwd: "your_password",
  roles: [{ role: "root", db: "admin" }]
})
```

After MongoDB is configured, fill the corresponding connection information into:
- `mongodb_central`
- `mongodb_local`
This section is mainly documented for the Multi version and can be used after data collection.
Place FediLive crawled JSON files in the data/ directory using the following naming conventions:
- Reply data: `reply*.json`
- Boost/Favourite data: `boostersfavourites*.json`
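Files that do not match these patterns are ignored by the loader, so it can help to classify a directory listing up front (a sketch using the stdlib `fnmatch` module; the filenames and helper name are made up):

```python
from fnmatch import fnmatch

# Naming conventions from the preprocessing guide.
PATTERNS = {
    "reply": "reply*.json",
    "boost_fav": "boostersfavourites*.json",
}

def classify(filename):
    """Return the data category for a crawled JSON file, or None."""
    for category, pattern in PATTERNS.items():
        if fnmatch(filename, pattern):
            return category
    return None
```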
```bash
python preprocess/load_network.py --data_dir ./data
```

Expected output:

```
Network loaded with 15420 nodes and 87364 edges
```

```python
from preprocess.measure import calculate_metrics, analyze_cross_instance_statistics

# Calculate global metrics
metrics = calculate_metrics(G)
"""
Graph Metrics:
  Nodes: 15420
  Edges: 87364
  Density: 0.000368
  Average Degree: 5.668
  Clustering Coefficient: 0.142
  Average Shortest Path Length: 4.21
"""

# Cross-instance statistics
cross_stats = analyze_cross_instance_statistics(G)
"""
{
  'Total Edges': 87364,
  'Cross-Instance Edges': 23658,
  'Cross-Instance Edge Ratio': 0.271,
  'Nodes Involved in Cross-Instance Interactions': 8421,
  'Node Interaction Percentage': 54.63%
}
"""

# Analyze by instance groups
instance_metrics = analyze_grouped_subgraphs(G, group_type='instance')

# Analyze by edge types
edge_metrics = analyze_grouped_subgraphs(G, group_type='edge_type')
```

For the Single version:

```bash
git clone -b Single git@github.com:FDUDataNET/FediLive.git
cd FediLive
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
# edit config/config.yaml
python -m fetcher.masto_list_fetcher
python -m fetcher.livefeeds_worker --processnum 2 --start "2024-01-01 00:00:00" --end "2024-01-02 00:00:00"
python -m fetcher.reblog_favourite --processnum 3
```

For the Multi version:

```bash
git clone -b Multi git@github.com:FDUDataNET/FediLive.git
cd FediLive
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
# install and configure MongoDB on all machines
# edit config/config.yaml
# fill tokens/token_list.txt
python ./fetcher/masto_list_fetcher.py
python ./fetcher/livefeeds_worker.py --id 0 --processnum 2 --start "2024-01-01 00:00:00" --end "2024-01-02 00:00:00"
python ./fetcher/reblog_favourite.py --processnum 3 --id 0
python ./fetcher/context.py --processnum 3 --id 0
```

| Feature | A. Single version | B. Multi version |
|---|---|---|
| MongoDB required | No | Yes |
| Single-machine crawling | Yes | Yes |
| Multi-machine distributed crawling | Limited / manual | Yes |
| Central task coordination | No | Yes |
| Token list file | No | Yes |
| Worker ID (`--id`) | No | Yes |
| Context fetching documented | No | Yes |
| Experiment reboot documented | No | Yes |
| Whitelist reactivation | No | Yes |
| Preprocessing guide included | Not documented | Yes |