Skip to content

Implement dependency graph construction #4

@dbarowy

Description

@dbarowy

The purpose of the dependence graph is to show the functional dependencies of all formulas in the spreadsheet. We use this information to do our analysis.

Building the graph is not as straightforward as it would seem, since you can't build the tree simply. This is because there is no way, in general, to scan the spreadsheet so that you get the nodes in the order that you would like to have them. The most convenient order would be to get the root node first, and build the tree upward, adding child nodes in a depth- or breadth-first manner. Instead, you have to build the tree "edgewise", by adding one node at a time, in whatever order your spreadsheet traversal dictates.

To be clear: this is not actually a tree. It is a DAG. A node may have multiple parents and multiple children. Edges are directed, meaning that information flows in the direction of the arrow.

Nodes represent formulas or data items. Edges represent functional dependencies. Here's an example, to make the idea concrete:

Suppose we have some data in a range, A1:A5, which looks like:

screen shot 2013-06-22 at 12 08 19 pm

and we have a formula:

=SUM(A1:A5)

This means that we should have a graph that looks like:

screen shot 2013-06-22 at 12 11 13 pm

If we were to add another input to the formula, say B1:B5, it would look like:

screen shot 2013-06-22 at 12 18 05 pm

Furthermore, the formula can itself be an input to other formulas:

screen shot 2013-06-22 at 12 19 32 pm

In fact, there can be almost any arbitrary kind of complexity

screen shot 2013-06-22 at 12 20 44 pm

e.g., the graph can also have multiple roots, and it doesn't even need to be "connected"

screen shot 2013-06-22 at 12 23 46 pm

but there's ONE caveat: there is no path in this graph which forms a loop. That's why it's called a directed acyclic graph (DAG).

Note that it is important to represent contiguous objects, like the ranges shown above, as a single node in the tree.

Also, in the Excel implementation of CheckCell, we refer to inputs as children and output formulas as roots. This was a mistake, because it runs counter to the way practically everybody talks about graphs. Edges point in the direction of leaves, from the direction of the root. Thus the inputs are the roots, and the outputs are the leaves.

As an example of not getting the objects you want in a convenient order, suppose that your traversal of the spreadsheet happened to give them to you in this order:

screen shot 2013-06-22 at 12 32 50 pm

Your algorithm will need to be able to deal with this. Fortunately it's not that hard. Here's the right way to do it.

  1. Use a dictionary to keep track of the following mapping:

    address -> node

  2. Now, traverse the spreadsheet, and for each cell, and IF IT IS A FORMULA

2a. Create a node for it, and add that node to the dictionary.

2b. Parse the formula

2c. Extract the functional dependencies; these are references to other cells. If the input is a data range, check the dictionary to see if it already exists. If it doesn't, create a node for it, and add it to the dictionary, if it does, get the object from the dictionary. If the object is a formula, skip it for now; we'll come back to it later in this loop.

2d. Add the input nodes as parents of the current formula cell.

2c. Add the current node as a child for all of its parents.

We also keep a few flags on these nodes, e.g., is the node a "data only" node or a formula? We will need to know these things later on.

There are a variety of optimizations that are useful (like keeping a list of all of the formulas), but let's not worry about those now. There is a fair amount of code in CheckCell that exists because we were trying out different ideas, and I don't remember clearly which ones are necessary. We'll add them as we need them.

Note that we use addresses as keys in the dictionary. This is the reason I implemented .Equals and .GetHashCode on the AST.Address class. If you parse a reference somewhere, you want to be sure that, even if that reference is in a slightly different form (e.g., "A1:A5" vs "Sheet1!A1:A5") that those references actually refer to the same node. JavaScript may have different requirements for its own dictionaries/hashes, so use whatever is convenient there.

You should start by designing a Node class for this tree. Minimally, it needs these fields:

Address: a reference to the AST.Address
Parents: a list of Nodes
Children: a list of Nodes
IsFormula: bool

Let me know if you have questions. You should probably post them here, and other people in our group can chime in if they have comments.

Metadata

Metadata

Assignees

Labels

No labels
No labels

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions