FediLive

FediLive is a data collection tool designed to quickly fetch platform-wide public activities from Mastodon instances during a user-defined time period for downstream analysis. A dataset collected via FediLive over a period of approximately two weeks has been published on Zenodo.

It currently provides two running modes via two separate branches:

  • A. Single version: a lightweight version without MongoDB, suitable for single-machine or simpler crawling tasks.
  • B. Multi version: a distributed version with MongoDB support (see the MongoDB Setup Guide below), suitable for multi-machine parallel crawling, task coordination, and large-scale snapshot collection.


Citation

FediLive is developed and maintained by the Big Data and Networking (DataNET) Group at Fudan University.

If you use FediLive or the example dataset in your research, please cite our paper:

@inproceedings{Min2025FediLive,
  author    = {Min, Shaojie and Wang, Shaobin and Luo, Yaxiao and Gao, Min and Gong, Qingyuan and Xiao, Yu and Chen, Yang},
  title     = {{FediLive: A Framework for Collecting and Preprocessing Snapshots of Decentralized Online Social Networks}},
  year      = {2025},
  booktitle = {Companion Proceedings of the ACM on Web Conference 2025},
  series    = {WWW '25},
  publisher = {Association for Computing Machinery},
  address   = {New York, NY, USA},
  pages     = {765--768},
  doi       = {10.1145/3701716.3715298},
  url       = {https://doi.org/10.1145/3701716.3715298}
}

Overview

FediLive supports collecting the following types of public Mastodon data:

  • Posts and replies
  • Reblogs and favourites
  • Contexts of posts/conversations
  • User interaction networks for downstream preprocessing and analysis

Which version should you choose?

Use the Single version if:

  • you intend to run FediLive on a single machine,
  • you do not intend to deploy MongoDB,
  • your crawling task is relatively small or you prefer a lightweight setup.

Use the Multi version if:

  • you have the ability to run FediLive on multiple machines in parallel,
  • you need distributed task coordination,
  • you intend to manage large-scale crawling more efficiently,
  • you need centralized storage of instance status and crawl progress.

Development Environment

FediLive has been tested on Ubuntu 20.04 LTS.

1. Recommended environment

  • OS: Ubuntu 20.04 LTS (64-bit)
  • Memory: 8GB RAM or above
  • Storage: 20GB available space or above
  • Python: 3.8–3.13

Additional requirement for Multi version

  • MongoDB: 5.0.30 recommended

2. Repository Branches

FediLive currently has two branches:

  • A. Single version: the Single branch
  • B. Multi version: the Multi branch

Clone the branch that matches your use case.

3.A. Clone Single version

git clone -b Single git@github.com:FDUDataNET/FediLive.git
cd FediLive

3.B. Clone Multi version

git clone -b Multi git@github.com:FDUDataNET/FediLive.git
cd FediLive

Installation and Configuration

The following installation steps are shared by both versions.

1. Create and activate a virtual environment (optional)

python3 -m venv venv
source venv/bin/activate

2. Install dependencies

pip install -r requirements.txt

3.A. Single Version Configuration

The Single version does not require MongoDB.

Edit config/config.yaml as follows:

api:
  instance_token: "your_instance_api_token"
  livefeeds_token: "your_mastodon_api_token"
  email: "your_email@example.com"

paths:
  instances_list: "instances_list.txt"

logging:
  level: "INFO"
  file: "logs/app.log"

Configuration fields

API
  • instance_token: token for retrieving the list of Mastodon instances from instances.social
    Apply at: https://instances.social/api/token

  • livefeeds_token: token used to collect posts from Mastodon instances
    Tokens can be requested according to the Mastodon API documentation:
    https://docs.joinmastodon.org/

  • email: your contact email

Paths
  • instances_list: file path used to save the retrieved list of instances
Logging
  • level: logging level, such as DEBUG, INFO, WARNING, ERROR, or CRITICAL
  • file: log file path

3.B. Multi Version Configuration

The Multi version is designed for distributed parallel crawling across multiple machines.

Architecture

In the Multi version:

  • one machine should be selected as the central node
  • the central node stores instance information, crawling ranges, and coordinates crawling tasks
  • the other machines serve as worker nodes to crawl data from instances
  • each machine should have MongoDB installed
  • in config.yaml:
    • mongodb_central should point to the central node database
    • mongodb_local should point to the current machine’s local database

Edit config/config.yaml like this:

mongodb_central:
  username: "central_admin"
  password: "CentralPassword123!"
  host: "central.mongodb.server.com"
  port: 27017

mongodb_local:
  username: "local_admin"
  password: "LocalPassword456!"
  host: "local.mongodb.server.com"
  port: 27018

api:
  central_token: "your_central_api_token"
  email: "your_email@example.com"

paths:
  instances_list: "instances_list.txt"
  token_list: "tokens/token_list.txt"

logging:
  level: "INFO"
  file: "logs/app.log"

whitelist:
  - "mastodon.social"
  - "mstdn.social"

Configuration fields

MongoDB
  • mongodb_central: connection information for the central node database
  • mongodb_local: connection information for the local machine database
API
  • central_token: API token used by the central node; by analogy with the Single version's instance_token, it is used when retrieving the instance list
  • email: your contact email
Paths
  • instances_list: file path used to save the retrieved list of instances
  • token_list: file containing Mastodon API tokens, one token per line
Logging
  • level: logging level
  • file: log file path
Whitelist

If the livefeeds time range is large, some normally crawlable large instances (such as mastodon.social) may occasionally hit connection errors under heavy request volume and be blacklisted by livefeeds_worker.py.

You can add known stable large instances to the whitelist so they will not be blacklisted automatically.
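Conceptually, the whitelist acts as a guard in front of the blacklisting decision. The following is an illustrative sketch, not the actual livefeeds_worker.py logic; the error threshold is an assumption:

```python
# Instances from config.yaml's whitelist section
WHITELIST = {"mastodon.social", "mstdn.social"}

def should_blacklist(instance: str, consecutive_errors: int, threshold: int = 3) -> bool:
    """Blacklist a repeatedly failing instance unless it is whitelisted."""
    if instance in WHITELIST:
        return False  # known stable large instance: never auto-blacklist
    return consecutive_errors >= threshold
```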

4. API Tokens

4.A. Single Version

You need:

  • one instance_token
  • one livefeeds_token

4.B. Multi Version

Populate tokens/token_list.txt with Mastodon API tokens, one token per line.

Make sure the number of tokens is greater than the number of parallel processes you plan to run.

These tokens are used to collect posts from various Mastodon instances.
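A quick sanity check along the lines of the rule above can catch misconfiguration before a crawl starts. Both helper names are illustrative, not part of the repo:

```python
def load_tokens(path: str) -> list[str]:
    """Read one token per line from token_list.txt, skipping blank lines."""
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]

def check_tokens(tokens: list[str], processnum: int) -> None:
    """Fail fast unless there are strictly more tokens than parallel processes."""
    if len(tokens) <= processnum:
        raise ValueError(
            f"token_list.txt has {len(tokens)} tokens; "
            f"need more than processnum={processnum}"
        )

check_tokens(["tok_a", "tok_b", "tok_c"], processnum=2)  # ok: 3 > 2
```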

Usage

The usage differs slightly between the two versions. Shared steps are grouped together below, and different commands are shown separately.

1. Fetch Instance Information

This step retrieves the list of Mastodon instances.

1.A. Single Version

python -m fetcher.masto_list_fetcher

1.B. Multi Version

Run this on the central node:

python ./fetcher/masto_list_fetcher.py

2. Fetch Posts / Livefeeds

This step collects public posts during a specified time period.

2.A. Single Version

You can run this on one or multiple machines in parallel.

python -m fetcher.livefeeds_worker --processnum 2 --start "2024-01-01 00:00:00" --end "2024-01-02 00:00:00"

Parameters:

  • --processnum: number of parallel processes
  • --start: start time, format YYYY-MM-DD HH:MM:SS
  • --end: end time, format YYYY-MM-DD HH:MM:SS
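Both timestamps must parse in the YYYY-MM-DD HH:MM:SS format. A small validation sketch using the standard library (the helper name is illustrative):

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

def parse_range(start: str, end: str) -> tuple[datetime, datetime]:
    """Parse and sanity-check --start/--end style arguments."""
    s, e = datetime.strptime(start, FMT), datetime.strptime(end, FMT)
    if s >= e:
        raise ValueError("--start must be earlier than --end")
    return s, e

s, e = parse_range("2024-01-01 00:00:00", "2024-01-02 00:00:00")
```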

2.B. Multi Version

Run this on multiple machines in parallel:

python ./fetcher/livefeeds_worker.py --id 0 --processnum 2 --start "2024-01-01 00:00:00" --end "2024-01-02 00:00:00"

Parameters:

  • --id: worker ID, starting from 0, used to select different API tokens
  • --processnum: number of parallel processes on each host
  • --start: start time, format YYYY-MM-DD HH:MM:SS (UTC+0)
  • --end: end time, format YYYY-MM-DD HH:MM:SS (UTC+0)
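Since the worker ID selects tokens, one plausible scheme is an offset into token_list.txt so that no two (worker, process) pairs share a token. The indexing below is an assumption for illustration, not necessarily FediLive's exact mapping:

```python
def token_for(tokens: list[str], worker_id: int, processnum: int, proc_index: int) -> str:
    """Assign each (worker, process) pair its own token from token_list.txt."""
    idx = worker_id * processnum + proc_index
    if idx >= len(tokens):
        raise IndexError("not enough tokens for this worker/process combination")
    return tokens[idx]

tokens = ["tok0", "tok1", "tok2", "tok3"]
# with processnum=2, worker 0 uses tok0/tok1 and worker 1 uses tok2/tok3
```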

3. Fetch Reblogs and Favourites

This step collects users who reblogged or favourited posts.

3.A. Single Version

python -m fetcher.reblog_favourite --processnum 3

Parameters:

  • --processnum: number of parallel processes

3.B. Multi Version

python ./fetcher/reblog_favourite.py --processnum 3 --id 0

Parameters:

  • --processnum: number of parallel processes
  • --id: worker ID used to select different API tokens

4. Fetch Contexts

A context refers to the complete reply conversation of a post.
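Contexts correspond to Mastodon's public status context endpoint (GET /api/v1/statuses/:id/context), which returns a post's ancestors and descendants. A sketch of building such a request with the standard library; the instance, status ID, and token values are placeholders:

```python
from urllib.parse import urlunsplit
from urllib.request import Request

def context_request(instance: str, status_id: str, token: str) -> Request:
    """Build an authenticated GET for a status's reply conversation."""
    url = urlunsplit(("https", instance, f"/api/v1/statuses/{status_id}/context", "", ""))
    return Request(url, headers={"Authorization": f"Bearer {token}"})

req = context_request("mastodon.social", "109349980000000000", "your_token")
```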

4.A. Single Version

Context fetching is not documented for the Single version.

4.B. Multi Version

Run this on multiple machines in parallel:

python ./fetcher/context.py --processnum 3 --id 0

Parameters:

  • --processnum: number of parallel processes
  • --id: worker ID used to select different API tokens

5. Restart / Reset an Experiment

This operation is documented only for the Multi version.

5.B. Multi Version

Run this on all machines to remove existing livefeeds, reblogs, favourites, and related crawl state stored in MongoDB.

Make sure to back up your data before running this command.

python ./fetcher/reboot.py

6. Reactivate Whitelisted Instances

If some large but normally crawlable instances were temporarily marked unavailable during crawling, you can reactivate them using the whitelist.

6.B. Multi Version

python ./fetcher/reactivate_whitelist.py

Notes

General notes

  • FediLive uses the Mastodon REST API for data collection.
  • Some errors may occur during crawling due to network instability, busy servers, heterogeneous instance behavior, or rate limits.
  • These errors are generally handled within the code.

Multi version-specific notes

The instance list retrieved in the first step may include not only Mastodon instances but also other platforms connected in the Fediverse ecosystem. As a result:

  • some instances may not behave fully like Mastodon
  • some requests may fail when using Mastodon REST API endpoints
  • in these cases, the corresponding instance’s processable flag in MongoDB may be set to false
  • if an instance is temporarily overloaded, its processable flag may be set to server_busy

You can inspect detailed crawl status using mongosh.
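The processable flags described above make it easy to triage instances after a run. The snippet below does that triage over illustrative documents; the field layout mimics the flags described here but is an assumption, not FediLive's exact MongoDB schema:

```python
# Illustrative documents mirroring the processable flags described above.
instances = [
    {"name": "mastodon.social", "processable": True},
    {"name": "pixelfed.example", "processable": False},       # not Mastodon-like
    {"name": "busy.example", "processable": "server_busy"},   # temporarily overloaded
]

failed = [d["name"] for d in instances if d["processable"] is False]
busy = [d["name"] for d in instances if d["processable"] == "server_busy"]
```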

Logging

All operations and errors are logged to the file specified in config/config.yaml.

Example configuration:

logging:
  level: "INFO"
  file: "logs/app.log"

Logging levels

  • DEBUG: detailed information, mainly for diagnosing problems
  • INFO: confirmation that things are working as expected
  • WARNING: an indication that something unexpected happened, or may happen soon
  • ERROR: a more serious problem that prevents part of the program from working correctly
  • CRITICAL: a very serious error indicating that the program may not be able to continue

MongoDB Setup Guide

This guide applies only to the Multi version.

1. Install MongoDB

Download and install MongoDB (the database server) and mongosh (the MongoDB Shell) from the official MongoDB download pages.

Install MongoDB on each server that participates in crawling.

2. Modify the MongoDB configuration file

In your MongoDB config file (for example, mongod.conf), modify:

  • net.bindIp to 0.0.0.0
  • net.port to a non-default value if needed

This allows worker machines to access the central node database.

For security reasons, it is strongly recommended to:

  • avoid exposing the default port directly
  • configure username/password authentication
  • restrict network access where possible
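Putting these recommendations together, a mongod.conf fragment might look like the following; the port value is illustrative, and authorization: enabled enforces the username/password authentication recommended above:

```yaml
# /etc/mongod.conf (fragment) -- illustrative values
net:
  bindIp: 0.0.0.0            # accept connections from worker machines
  port: 27018                # non-default port
security:
  authorization: enabled     # require username/password authentication
```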

3. Add an access user

Run the following commands:

mongosh --port your_port_number
use admin
db.createUser({
  user: "your_username",
  pwd: "your_password",
  roles: [{ role: "root", db: "admin" }]
})

4. Fill in config/config.yaml

After MongoDB is configured, fill the corresponding connection information into:

  • mongodb_central
  • mongodb_local
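These four fields typically combine into a standard mongodb:// connection URI, with credentials percent-encoded as MongoDB requires. A standard-library sketch using the values from the example config above (the helper name is illustrative):

```python
from urllib.parse import quote_plus

def mongo_uri(username: str, password: str, host: str, port: int) -> str:
    """Build a mongodb:// URI; special characters in credentials must be escaped."""
    return f"mongodb://{quote_plus(username)}:{quote_plus(password)}@{host}:{port}/"

uri = mongo_uri("central_admin", "CentralPassword123!", "central.mongodb.server.com", 27017)
```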

Preprocessing Usage Guide

This section is mainly documented for the Multi version and can be used after data collection.

Data Preparation

Place FediLive crawled JSON files in the data/ directory using the following naming conventions:

  • Reply data: reply*.json
  • Boost/Favourite data: boostersfavourites*.json
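The naming convention can be matched with shell-style patterns; a small sketch using fnmatch (the filenames are illustrative):

```python
import fnmatch

files = [
    "reply_2024-01-01.json",
    "boostersfavourites_batch3.json",
    "notes.txt",
]

replies = [f for f in files if fnmatch.fnmatch(f, "reply*.json")]
boosts = [f for f in files if fnmatch.fnmatch(f, "boostersfavourites*.json")]
```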

Build Interaction Network

python preprocess/load_network.py --data_dir ./data

Sample Output

Network loaded with 15420 nodes and 87364 edges

Network Analysis

from preprocess.measure import calculate_metrics, analyze_cross_instance_statistics

# G is the interaction graph built by preprocess/load_network.py
metrics = calculate_metrics(G)
"""
Graph Metrics:
  Nodes: 15420
  Edges: 87364
  Density: 0.000368
  Average Degree: 5.668
  Clustering Coefficient: 0.142
  Average Shortest Path Length: 4.21
"""

# Cross-instance statistics
cross_stats = analyze_cross_instance_statistics(G)
"""
{
  'Total Edges': 87364,
  'Cross-Instance Edges': 23658,
  'Cross-Instance Edge Ratio': 0.271,
  'Nodes Involved in Cross-Instance Interactions': 8421,
  'Node Interaction Percentage': 54.63%
}
"""

Group Analysis

# assuming analyze_grouped_subgraphs also lives in preprocess.measure
from preprocess.measure import analyze_grouped_subgraphs

# Analyze by instance groups
instance_metrics = analyze_grouped_subgraphs(G, group_type='instance')

# Analyze by edge types
edge_metrics = analyze_grouped_subgraphs(G, group_type='edge_type')

Recommended Quick Start

A. Quick start for Single version

git clone -b Single git@github.com:FDUDataNET/FediLive.git
cd FediLive
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
# edit config/config.yaml
python -m fetcher.masto_list_fetcher
python -m fetcher.livefeeds_worker --processnum 2 --start "2024-01-01 00:00:00" --end "2024-01-02 00:00:00"
python -m fetcher.reblog_favourite --processnum 3

B. Quick start for Multi version

git clone -b Multi git@github.com:FDUDataNET/FediLive.git
cd FediLive
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
# install and configure MongoDB on all machines
# edit config/config.yaml
# fill tokens/token_list.txt
python ./fetcher/masto_list_fetcher.py
python ./fetcher/livefeeds_worker.py --id 0 --processnum 2 --start "2024-01-01 00:00:00" --end "2024-01-02 00:00:00"
python ./fetcher/reblog_favourite.py --processnum 3 --id 0
python ./fetcher/context.py --processnum 3 --id 0

Version Differences at a Glance

| Feature | A. Single version | B. Multi version |
| --- | --- | --- |
| MongoDB required | No | Yes |
| Single-machine crawling | Yes | Yes |
| Multi-machine distributed crawling | Limited / manual | Yes |
| Central task coordination | No | Yes |
| Token list file | No | Yes |
| Worker ID (--id) | No | Yes |
| Context fetching documented | No | Yes |
| Experiment reboot documented | No | Yes |
| Whitelist reactivation | No | Yes |
| Preprocessing guide included | Not documented | Yes |
