AI DataGraph

TLDR: AI DataGraph lets data providers (governments, enterprises, or individuals) share their data securely for AI model training and get paid for it. It uses the Constellation Metagraph so that providers can create profiles and send privacy-gated updates, AI researchers can query this data by sending a token payment, and data providers are rewarded each time their data is returned as part of a query.

Problem(s) we're solving:

As a machine learning engineer, I know how powerful AI can be in helping us solve our most complex problems, but its effectiveness depends entirely on getting access to high-quality data. For really important use cases like healthcare and disease prevention, these models need to be trained on nuanced data such as diet, age, gender, daily activity, and other health metrics.

Billions of these useful data points are generated each day from our phones and smart watches, but they’re spread across millions of people and are highly sensitive, so it’s no surprise that most people aren’t willing to share this data.

From talking to potential users in preparation for this hackathon, it’s clear they would be willing to share this data if (a) it's super simple to set up, (b) they have fine-grained control over what they share, and (c) they get compensated for it.

Overview of our solution: AI DataGraph

AI DataGraph works in 3 distinct "phases", described below. More information is in the "How it works" section:

Phase 1 (Data Ingestion): Data providers (in our example, these are consumers) are able to sign up through the web app and select different types of information they are willing to share. This is a combination of demographic information (not updated often), daily updates like sleep and average heart rate, and hourly updates like steps and calories consumed in the past hour. Once they've set this up, their relevant updates are sent each hour (or each day) from their device automatically into the metagraph with minimal involvement on their part.
Phase 2 (AI models request data access for training): Whenever an AI engineer wants to request access to this data for training their models, they send a payment (token) to the metagraph, and once this transaction is confirmed, they receive the data. During this process, our rewards logic records which data points were returned with which query (or multiple queries) for reward distribution in Phase 3.
Phase 3 (Rewards distribution): Periodically, our app runs a process to distribute rewards. It uses the tracking from Phase 2 to calculate how much to pay each data provider based on the number of times their data points were used in queries since the last rewards cycle. This logic handles the cases where a data point is returned by more than one query and where each query offers a different 'bounty'.
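
The Phase 3 calculation can be sketched as follows. This is a minimal illustration rather than the project's actual rewards code: it assumes each query's bounty is split equally across the data points that query returned, and the names `compute_payouts`, `bounty`, and `data_points` are hypothetical.

```python
from collections import defaultdict
from fractions import Fraction


def compute_payouts(queries):
    """Split each query's bounty equally across the data points it returned.

    `queries` is a list of dicts with two hypothetical keys:
      - "bounty": token amount paid for the query
      - "data_points": list of (provider_wallet, data_point_id) pairs returned
    Returns a dict mapping provider_wallet -> total payout for this cycle.
    """
    payouts = defaultdict(Fraction)
    for query in queries:
        points = query["data_points"]
        if not points:
            continue  # nothing was returned, so nobody is owed a share
        # Assumption: an equal split per returned data point.
        share = Fraction(query["bounty"], len(points))
        for provider_wallet, _data_point_id in points:
            payouts[provider_wallet] += share
    return dict(payouts)


# Data point "dp1" appears in both queries, and the two bounties differ.
queries_since_last_cycle = [
    {"bounty": 100, "data_points": [("alice", "dp1"), ("bob", "dp2")]},
    {"bounty": 30, "data_points": [("alice", "dp1"), ("alice", "dp3"), ("carol", "dp4")]},
]
payouts = compute_payouts(queries_since_last_cycle)
```

Here `alice` earns a share from both queries because her data points were returned in both. `Fraction` keeps the split exact, so the payouts always sum to the total bounty paid in; a real implementation would likely work in integer token units and decide how to handle remainders.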

Why Metagraphs and Constellation?

The Constellation Metagraph is an ideal framework for AI DataGraph because of its unique ability to handle privacy, scalability, and decentralization—all essential to our solution.

Privacy-first infrastructure: Metagraphs allow us to define custom privacy rules, enabling data providers to securely share sensitive information without worrying about their data being misused. By using custom consensus mechanisms, we ensure that private data is validated in a way that preserves confidentiality before it's processed on-chain.
Scalability and throughput: AI model training requires enormous amounts of data, and traditional blockchains are too slow or too expensive to handle this efficiently. The Hypergraph's extreme scalability allows us to ingest, validate, and share millions of data points across many users with high throughput and low fees. This is critical to making our solution practical and attractive to both data providers and AI researchers.
Decentralization and trust: Unlike centralized solutions, Metagraphs provide a decentralized network that eliminates the need for a trusted intermediary, reducing the risks of centralized control over sensitive data. This not only ensures data security but also encourages more people to participate as data providers because they can trust that their data remains in their control.

Ultimately, Constellation's Metagraph framework allows us to build an infrastructure where data sovereignty, privacy, and efficiency are at the forefront. It’s the backbone that makes AI DataGraph a scalable and secure solution for data sharing.

How it works (more detail)

Data providers (including consumers) have full control over what data they are willing to share and how often they're willing to share this data.

Component	How it works
Creating a data provider profile	Data providers (in our example, these are consumers) are able to sign up through the web app and select different types of information they are willing to share. This is a combination of demographic information (not updated often), daily updates like sleep and average heart rate, and hourly updates like steps and calories consumed in the past hour. Once they've set this up, their relevant updates are sent each hour (or each day) from their device automatically into the metagraph with minimal involvement on their part. This information, along with their wallet address, is saved in an external database (we're using postgres in our project).
Data updates are sent to metagraph	Data is automatically sent to the metagraph as an update based on its frequency. For example, a daily update like `sleep_hours` is sent once per day per device sharing it, and an hourly update like `calories_consumed` is sent once per hour per device sharing it. These updates go through the lifecycle methods on the metagraph, notably the L1 `validateUpdate`, which checks for data standardization, and the L0 `validateData`, which checks stateful conditions such as ensuring each update's timestamp is later than the most recent one for the same update type and data provider.
Metagraph hashes and stores information on- and off-chain	During the `combine` lifecycle function, the metagraph logic hashes the private data and sends both the hash and the data itself to an external database (we're using Postgres in our example). On-chain, it stores only the hash, not the private data. It also stores an array of the fields being shared in that update to make it easier to query for the relevant data updates downstream.
AI engineers request access to data for model training	An AI engineer sends a JSON request for the data types they want returned (e.g., "sleep_hours", "gender"), along with a token amount to pay for this dataset. Our server sends the token transaction to the metagraph, gets back a txn hash, and polls the latest snapshots until it sees that the txn has been confirmed. It then queries the metagraph endpoint for all matching hashes and uses those hashes to look up the underlying private data in our external database before returning it to the requester. During this process, it also records that each of these data points has been used in a specific query, which it uses downstream to calculate rewards.
Rewards are distributed periodically	In our application, the rewards distribution cycle runs every 10 minutes. The server handles most of this for now (though we plan on moving more of this on-chain in the future). It queries the external database to find all data points that are due for a payout, then uses the payment information for each individual query to determine what share of that query's bounty each data provider is due. Finally, it pays this out by sending bulk transactions to the metagraph.
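
The ordering check and the on-chain/off-chain split described above can be sketched in a few lines. This is an illustration under stated assumptions, not the metagraph code itself: it merges the stateful check (done in L0 `validateData`) and the hashing step (done in `combine`) into one Python function for brevity, it assumes SHA-256 over canonical JSON as the hash (the hash function is not specified above), and all function and field names are hypothetical. Plain Python containers stand in for the chain state and the Postgres database.

```python
import hashlib
import json


def is_after_previous(update, last_timestamps):
    """Stateful check: an update must be newer than the last accepted one
    for the same data provider and update type."""
    key = (update["provider"], update["update_type"])
    previous = last_timestamps.get(key)
    return previous is None or update["timestamp"] > previous


def hash_private_data(private_data):
    """Hash a canonical JSON encoding so equal data always gives the same hash.
    Assumption: SHA-256; the project's real hash function may differ."""
    canonical = json.dumps(private_data, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def combine_update(update, last_timestamps, on_chain, off_chain):
    """Store the private data off-chain and only its hash and field names on-chain."""
    if not is_after_previous(update, last_timestamps):
        raise ValueError("update is not newer than the previous one")
    digest = hash_private_data(update["data"])
    off_chain[digest] = update["data"]  # stand-in for the external database
    on_chain.append({"hash": digest, "fields": sorted(update["data"])})
    last_timestamps[(update["provider"], update["update_type"])] = update["timestamp"]
    return digest


on_chain, off_chain, last_timestamps = [], {}, {}
update = {
    "provider": "alice",
    "update_type": "daily",
    "timestamp": 1000,
    "data": {"sleep_hours": 7.5},
}
digest = combine_update(update, last_timestamps, on_chain, off_chain)
```

After this runs, `on_chain` holds only the hash and the list of shared field names, while the raw `sleep_hours` value lives in `off_chain` keyed by the same hash. That key is what lets the server resolve hashes returned by the metagraph into the underlying data at query time.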

What's next

Our current solution is built for consumer health data, but the long-term goal is to provide a flexible, privacy-controlled data-sharing framework that can be used for any type of data. This could include financial data, environmental data, or even government data—anything that benefits from secure, decentralized access.

With the infrastructure we’re building, developers and enterprises can easily implement similar data-sharing networks for their own use cases, leveraging Constellation’s metagraph technology to solve the privacy and scalability challenges inherent in data-sharing.

Hackathon Criteria

Criteria
Technical Difficulty and Implementation	Our project incorporates multiple technical components and features related to the Metagraph: integration with external databases, privacy hashing, complex rewards tracking, and distribution through leveraging the calculated state component. It also uses a combination of the Metagraph, a frontend web application (for data providers to sign up), and a backend server to handle sending transactions, polling, and confirmations.
Technical Readiness	Our project is ready to use today and is designed for scale. The most critical technical challenges (privacy hashing, token-gated querying from AI engineers, and rewards tracking and distribution) are addressed. Once Constellation's new feature for allowing Metagraphs to hold funds arrives, we will be able to move almost everything on-chain and let the Metagraph handle accepting tokens as funding sources, calculating rewards, and distributing them with minimal centralization.
Impact on the Ecosystem	AI DataGraph has immense potential to exponentially increase the usage of the Constellation network. Our framework is designed to scale quickly due to the ever-increasing number of data points that can be sent to the Metagraph (hundreds per day per person for healthcare data) and the ongoing need for AI models to train on more proprietary data. Our long-term vision is to abstract this into a platform for anyone (enterprises, governments, developers) to build their own privacy-gated data-sharing networks, which would bring significant growth to Constellation by showcasing its potential as the backbone of privacy-first, data-intensive decentralized applications.
Novelty and Innovation	AI DataGraph expands on a long-standing goal for digital networks (compensation for data generation), but our key insight is the privacy-gated aspect of how this works. As an AI engineer myself, I know that the more sensitive the data, the more valuable it is for training powerful AI models to help solve complex problems. Our infrastructure sets this up as a true marketplace: users have 100% control over what data they're willing to share and how much they can earn. This fine-grained data access permission is what makes AI DataGraph stand out.
Product Market Fit and User Experience	The goal of AI DataGraph is to help consumers and AI engineers trade health-related data in a privacy-focused way with minimal effort. The app is designed to be incredibly simple to use: consumers set up their profile once, connect their devices, and that's it; they start earning incentives on autopilot. AI researchers just hit our API endpoint like they would for any other data pull; our code takes care of the token payment, polling, hash-based retrieval, etc. Our longer-term goal is to make this a developer-focused framework for any engineer to build similar privacy-gated data-sharing networks for any use case: finance, entertainment, etc.