Let’s use this space to design the core data models for SourceCred. What follows is a brain dump of the current state of affairs as of 2018-02-22-ish, as developed in conversations between @dandelionmane and me.
First attempt: the project DAG
Originally, we considered the following scheme.
The project graph is a DAG whose nodes are primarily contributions: contributions include commits, pull requests, issues, comments on PRs and issues, design docs, etc. A weighted edge from a to b indicates that b provided value to a (in proportion to the edge weight). The leaf nodes of the graph correspond to users. There is only one instance of this graph, which serves as a reflection of the repository’s “official”/“objective” value function. Individuals can update the graph by standard pull requests (and maybe also via some lower-friction, continuous workflows; details to be fleshed out).
There is a root node in the graph representing the repository as a whole. The graph will likely also include “subprojects”, which may be functional areas of the application (“frontend”, “backend”, “plugins”), organizational divisions (“buildcop”, “triage”), or anything else that is perceived as important to the organization.
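To make the two paragraphs above concrete, here is a minimal sketch of what the project graph's data model might look like. All type and field names are illustrative, not a committed SourceCred schema:

```typescript
// Illustrative data model for the project graph (names are hypothetical,
// not a committed SourceCred schema).
type NodeType = "REPOSITORY" | "SUBPROJECT" | "CONTRIBUTION" | "USER";

interface GraphNode {
  id: string; // e.g. "issue#123" or "user:dandelionmane" (made-up ids)
  type: NodeType;
  description: string;
}

interface WeightedEdge {
  src: string; // the node that received value
  dst: string; // the node that provided the value
  weight: number; // relative share of cred flowing from src to dst
}

interface ProjectGraph {
  nodes: Map<string, GraphNode>;
  // Outgoing edges, keyed by source node id.
  edges: Map<string, WeightedEdge[]>;
}
```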
At allocation time, one unit of cred is minted to the root node. The cred “flows down” the graph, roughly according to the following algorithm. Let n be a node all of whose parents have been visited, and let c be the amount of cred that has flowed to n; then, to each child n′ of n with normalized edge-weight p, flow p · c cred to n′. Repeat until all nodes have been visited. Each user receives cred proportional to the amount of cred at their respective leaf node.
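A rough sketch of that flow procedure, building on the types above. The `topologicalOrder` helper and the DAG assumption are part of the sketch, not a settled design:

```typescript
// Topological order of nodes reachable from `rootId` (reverse DFS
// post-order). Assumes the graph is acyclic, per this first design.
function topologicalOrder(graph: ProjectGraph, rootId: string): string[] {
  const order: string[] = [];
  const visited = new Set<string>();
  const visit = (id: string): void => {
    if (visited.has(id)) return;
    visited.add(id);
    for (const e of graph.edges.get(id) ?? []) visit(e.dst);
    order.push(id);
  };
  visit(rootId);
  return order.reverse();
}

// Mint one unit of cred at `rootId` and flow it down the DAG. Each visited
// node passes its cred to its children in proportion to normalized edge
// weights; leaf (user) nodes keep whatever reaches them.
function flowCred(graph: ProjectGraph, rootId: string): Map<string, number> {
  const cred = new Map<string, number>();
  for (const id of graph.nodes.keys()) cred.set(id, 0);
  cred.set(rootId, 1);
  for (const id of topologicalOrder(graph, rootId)) {
    const outgoing = graph.edges.get(id) ?? [];
    if (outgoing.length === 0) continue; // leaf node: cred stays here
    const total = outgoing.reduce((sum, e) => sum + e.weight, 0);
    const c = cred.get(id)!;
    for (const e of outgoing) {
      cred.set(e.dst, cred.get(e.dst)! + c * (e.weight / total));
    }
  }
  return cred;
}
```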
One particularly desirable property of this mechanism is that it is transparent in the following sense. Suppose that we want to understand the distribution of impact among users for some subcomponent of the project: e.g., we want to see who has contributed the most value to the backend APIs. We can rerun the algorithm on the same project graph, but mint 1¤ to the “backend APIs” node instead of the root project node. The results will be useful because the process is additive. For instance, if a project has a root node with two children—“backend”, with weight 0.6; and “frontend”, with weight 0.4—then a user’s cred will be c_total = 0.6 · c_backend + 0.4 · c_frontend, where c_u is the cred assigned to the user when the unit of cred is minted at node u.
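In terms of the sketch above (assuming a `graph` value built from those types, with made-up node ids), this amounts to nothing more than changing where the unit of cred is minted:

```typescript
// Rerunning the same flow, but minting the unit of cred at a subcomponent,
// gives a per-component breakdown. Node ids are illustrative.
const overall = flowCred(graph, "repository");
const backendOnly = flowCred(graph, "backend-apis");

// Additivity, for the two-child example above: for every user node u,
//   overall.get(u) === 0.6 * flowCred(graph, "backend").get(u)
//                   + 0.4 * flowCred(graph, "frontend").get(u)
```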
Second attempt: the project graph
A major problem with this approach is the requirement that the graph be acyclic. We expect cyclic project graphs to be quite common in reality. (TODO: Provide examples.) In such cases, the process described above no longer directly applies. Instead, we would like to continue flowing cred along edges until we reach some kind of equilibrium.
Note that if we interpret a project DAG as a Markov chain, then the process described above can also be formulated as, “find the distribution of the chain after a large number of steps when starting from a distribution that is at the root node with probability 1”. Markov chains generalize to graphs with cycles, so we will attempt to formulate an improved version of the system along these lines. An immediate problem is that the chain is not likely to be ergodic—users will be leaf nodes, so will not communicate with other states—so we will have to be careful when dealing with limiting and stationary distributions.
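As a sketch of that interpretation, the weighted graph can be turned into a row-stochastic transition matrix. Users have no outgoing edges, so some convention is needed for them; treating them as absorbing states is one arbitrary choice made here, not a settled decision:

```typescript
// One way to interpret the weighted graph as a Markov chain: normalize each
// node's outgoing edge weights into transition probabilities. User (leaf)
// nodes have no outgoing edges, so they are treated as absorbing states
// (self-loop with probability 1); that is a modeling choice, not settled.
function transitionMatrix(graph: ProjectGraph): {ids: string[]; A: number[][]} {
  const ids = [...graph.nodes.keys()];
  const index = new Map<string, number>();
  ids.forEach((id, i) => index.set(id, i));
  const n = ids.length;
  const A: number[][] = ids.map(() => new Array<number>(n).fill(0));
  ids.forEach((id, i) => {
    const outgoing = graph.edges.get(id) ?? [];
    if (outgoing.length === 0) {
      A[i][i] = 1; // absorbing (user) node
      return;
    }
    const total = outgoing.reduce((sum, e) => sum + e.weight, 0);
    for (const e of outgoing) {
      A[i][index.get(e.dst)!] += e.weight / total;
    }
  });
  return {ids, A};
}
```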
A natural first idea is to simply take the limit distribution π₀ Aᵏ for large k, where A is the transition matrix of the chain and π₀ is the distribution that has 1 for the root node and 0 elsewhere. Then, we could restrict the support of the resulting distribution to the nodes corresponding to users, and after normalization we would have a valid allocation. While this seems reasonable so far, it is not clear how to achieve the transparency discussed above. The strategy of “assign 1¤ to an arbitrary project node and re-run as normal” does not obviously do the right thing: if the chain is ergodic, then changing the initial conditions does not affect the limiting distribution, and a densely connected graph may be “nearly ergodic” enough for the cred to become “all mixed up”. (As you might guess from reading this paragraph, the precise dynamics of what goes right and wrong, and how, are not clear to me. It might be helpful to run a simple version of these algorithms on some toy repos and some real-world repos to explore the behavior.)
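A toy version of that experiment, building on the transition matrix sketch above: take k steps of the chain from a point mass on a chosen start node, then restrict to user nodes and renormalize. Again, this is just a sketch for playing with small graphs, not a proposed implementation:

```typescript
// Toy experiment: start with all mass on `startId`, take `steps` steps of
// the chain (i.e. compute π₀ Aᵏ), then restrict the result to user nodes
// and renormalize into an allocation.
function allocateByChain(
  graph: ProjectGraph,
  startId: string,
  steps: number,
): Map<string, number> {
  const {ids, A} = transitionMatrix(graph);
  let pi = ids.map((id) => (id === startId ? 1 : 0));
  for (let s = 0; s < steps; s++) {
    const next = new Array<number>(ids.length).fill(0);
    for (let i = 0; i < ids.length; i++) {
      if (pi[i] === 0) continue;
      for (let j = 0; j < ids.length; j++) {
        next[j] += pi[i] * A[i][j];
      }
    }
    pi = next;
  }
  // Keep only the mass on user nodes, then renormalize.
  const userMass = new Map<string, number>();
  ids.forEach((id, i) => {
    if (graph.nodes.get(id)!.type === "USER") userMass.set(id, pi[i]);
  });
  const total = [...userMass.values()].reduce((a, b) => a + b, 0);
  for (const [id, m] of userMass) {
    userMass.set(id, total > 0 ? m / total : 0);
  }
  return userMass;
}
```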
More info to be posted in this thread.