Skip to content

Cycle basis#1257

Closed
iosonofabio wants to merge 22 commits intoigraph:developfrom
iosonofabio:cyclebasis
Closed

Cycle basis#1257
iosonofabio wants to merge 22 commits intoigraph:developfrom
iosonofabio:cyclebasis

Conversation

@iosonofabio
Copy link
Copy Markdown
Member

@iosonofabio iosonofabio commented Sep 9, 2019

DO NOT MERGE YET!

Trying to implement functions for finding a cycle basis:

  • starting from undirected, unweighted graphs
  • undirected, weighted graphs in progress

Final TODO before merging:

  • finish implementation
  • rebase onto most recent develop branch
  • check for memory leaks
  • special checks for igraph_vector_ptr since current convention is shaky
  • convert tabs to spaces
  • check docs
  • check tests
  • add 1-2 examples

@szhorvat
Copy link
Copy Markdown
Member

szhorvat commented Sep 9, 2019

Just some very quick notes without looking through it fully:

  • This function should return a ptr_vector_t, each element of which is a basis element, i.e. a vector_t of edge indices.
  • It must work for multigraphs, e.g. 1-2, 1-2, 2-3 has one cycle.
  • It must also handle self-loops.
  • It should work on directed graphs and simply ignore edge directions. One may want to use the result to create an edge-cycle matrix where 0, 1 and -1 are used to indicate whether an edge is not in a cycle, in the cycle pointing in the cycle direction, or in the cycle pointing in the reverse direction, respectively. This is a common thing to do with cycle bases.

You seem to be using some raw char * arrays for boolean vectors. We also have igraph_vector_bool_t. I did notice that an igraph_bool_t is an int, i.e. typically 32-bit instead of 8-bit. Perhaps this is why you went with the raw char array? @ntamas Can you comment on why igraph_bool_t is int and not, say, and unsigned char?

@ntamas
Copy link
Copy Markdown
Member

ntamas commented Sep 9, 2019

I have no idea; this has always been this way before I joined the development of igraph. I think we could change this easily. You could try changing the typedef of igraph_bool_t to unsigned char in the source and see if anything breaks (it shouldn't - if it does, then the point where it breaks is probably an interesting place to look at as it could uncover places where we are (ab)using an igraph_bool_t to do something else).

int igraph_cycle_basis_unweighted_undirected(const igraph_t* graph,
igraph_vector_t* cycles) {

// FIXME: deal with unconnected graphs
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Check whether MSVC in C89 mode is okay with comments starting with double slashes. I have always tried to avoid this because I wasn't sure whether MSVC likes it or not and it's too complicated for me to get hold of a Windows machine :)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MSVC 2015 and later do support // comments as "a Microsoft extension", unless standards mode is triggered. https://docs.microsoft.com/en-us/cpp/c-language/c-comments?view=vs-2015 This can also be tested e.g. at godbolt.org

I think earlier versions support it too, but I do not have proof.

I always went with /* ... */ just in case. You never know what compilers people use on what weird platforms that we don't have access to. Maybe someone uses an IBM compiler on a Power architecture machine. I'd be in favour of sticking with the standard (even though there are already several // comments in the igraph code base—my IDE keeps flagging them actually).

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

// is C99 I believe. Older MSVC compilers do no support it. I don't know what Python 2.x uses now to compile its extensions, but R is C99, so if you want to use them I don't mind.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are two separate issues:

  1. Is it okay to use arbitrary C99 features? No, because even MSVC 2019 doesn't support them fully. We should not use e.g. variable-length arrays for this reason.

  2. Is it okay to use // comments? Compilers typically support // as an extension, even if they do not fully support C99. So it might not be a problem in practice. It is claimed that MSVC 2010 already supported // comments.

@gaborcsardi
Copy link
Copy Markdown
Contributor

I have no idea; this has always been this way before I joined the development of igraph. I think we could change this easily. You could try changing the typedef of igraph_bool_t to unsigned char in the source and see if anything breaks

This breaks the R interface because is there a logical is an int.

@szhorvat
Copy link
Copy Markdown
Member

szhorvat commented Sep 9, 2019

Some more comments:

I think the following would be reasonable names / prototypes:

int igraph_fundamental_cycle_basis(const igraph_t *graph, igraph_vector_ptr_t *basis);
int igraph_minimum_cycle_basis(const igraph_t *graph, igraph_vector_ptr_t *basis, const igraph_vector_t *weights);

Alternatively, minimum_weight_cycle_basis would be more precise, but longer.


I do not think we need to include "undirected" in the name, as this is understood. In many applications, people would run this on directed (i.e. oriented) graphs and also want to know if each edge points along the cycle direction or opposite to it. This information could even be optionally returned by the function in an additional vector (however, it is easy enough to compute, so maybe we don't need to include it).

I do not think we need to include "unweighted" in the name, as weights are not relevant for bases unless we look for the minimum weight one (which is a special application). However, it is useful to indicate that it returns a fundamental basis.

@vtraag
Copy link
Copy Markdown
Member

vtraag commented Sep 10, 2019

I believe the implementation can be made more efficient in the following way.

While doing a BFS, you maintain a vector father that denotes for each node the node through which it was first visited (i.e. this corresponds to the spanning tree). Let us initially set father to -1 for all nodes, so that if father[i] < 0 it means that node i was not yet visited. In addition, we keep a vector dist which is the distance from the root vertex. Whenever you encounter a node i that was already visited (father[i] > 0) when traversing the edge j -> i, you backtrack the fathers of i and j (i.e. ancestors) until dist[i] == dist[j] and then keep iterating until the ancestors are identical. In other words:

int ancestor_i = i;
int ancestor_j = j;
while (ancestor_i != ancestor_j)
{
  if (dist[ancestor_i] > dist[ancestor_j])
    ancestor_i = father[ancestor_i];
  else if (dist[ancestor_i] < dist[ancestor_j])
    ancestor_j = father[ancestor_j];
  else
  {
    ancestor_i = father[ancestor_i];
    ancestor_j = father[ancestor_j];
  }
}

By maintaining the traversed list of ancestors, you have the list of vertices in the cycle. If ancestor_i == j you know that the cycle was directed and otherwise it is an undirected cycle (i.e. a cycle ignoring the edge directions). If you want a vector of edges instead of vertices (which is probably more useful in a multigraph) you can use edge ids instead of vertex ids to identify the father (i.e. the edge that leads to the father). Finally, for multiple components you should include a root vector that identifies from which root the node was reached. If root[i] != root[j] no cycle can be identified.

@iosonofabio
Copy link
Copy Markdown
Member Author

iosonofabio commented Sep 10, 2019 via email

@vtraag
Copy link
Copy Markdown
Member

vtraag commented Sep 10, 2019

I believe that the initial BFS that you perform is sufficient for immediately identifying the cycles, while creating the spanning tree. In other words, you identify the cycles as you are building the spanning tree in one go. You now repeatedly perform a DFS on the resulting spanning tree, but you will then repeatedly backtrack the same paths I believe (i.e. the parent variable remains essentially identical across igraph_dfs runs).

@vtraag
Copy link
Copy Markdown
Member

vtraag commented Sep 10, 2019

And probably you should indeed call it parent instead of father :)

@iosonofabio
Copy link
Copy Markdown
Member Author

iosonofabio commented Sep 10, 2019 via email

@stale
Copy link
Copy Markdown

stale bot commented Jan 22, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale Issues that have been inactive for a long time; will be closed in the absence of further activity label Jan 22, 2020
@ntamas
Copy link
Copy Markdown
Member

ntamas commented Jan 22, 2020

@iosonofabio If you are still working on this, feel free to remove the wontfix label. Stalebot will also stop bothering this PR if you add the pinned label to it.

@stale stale bot removed the stale Issues that have been inactive for a long time; will be closed in the absence of further activity label Jan 22, 2020
@stale
Copy link
Copy Markdown

stale bot commented Mar 22, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale Issues that have been inactive for a long time; will be closed in the absence of further activity label Mar 22, 2020
@szhorvat szhorvat added the todo Triaged for implementation in some unspecified future version label Mar 22, 2020
@stale stale bot removed the stale Issues that have been inactive for a long time; will be closed in the absence of further activity label Mar 22, 2020
@iosonofabio
Copy link
Copy Markdown
Member Author

I restarted working on this one, don't worry about the massive merge commits for now, I can squash-rebase at the end.

@vtraag
Copy link
Copy Markdown
Member

vtraag commented Jun 4, 2020

Great! I noticed you merged in develop so perhaps we should discuss whether this should go into master or develop. If my understanding is correct, this is an entirely new function, right? That means it can go into master, because it is backwards compatible. But that means that we should not bring in all the changes from the develop branch. What do you think?

@iosonofabio
Copy link
Copy Markdown
Member Author

Good point. I'm working on the weighted graph function which is significantly more messy than the unweighted one, so I'm unclear on the time frame - not so familiar with Horton cycles I admit. There's also no urgency I believe. So maybe develop is fine? But I'm also fine rebasing against master

@ntamas
Copy link
Copy Markdown
Member

ntamas commented Jun 4, 2020

I'd say keep it on develop. As for me, I tend to do all development on develop now.

@vtraag vtraag changed the base branch from master to develop June 5, 2020 07:15
@iosonofabio
Copy link
Copy Markdown
Member Author

Ok, I'm sharpening the tools for this one. Some advice would be appreciated, but in terms of general approach and specific code. Two points in particular:

  1. It seems reasonable to me to implement this algorithm: https://dl.acm.org/doi/abs/10.1145/1644015.1644023, see fig 2. Anyone objects or has a better suggestion?

  2. I starting shaping up the code for that and it needs a bunch of auxiliary functions that might be useful beyond this particular PR, e.g. an approximate calculation of a feedback vertex set. If anyone could have a quick look at my igraph_minimum_cycle_basis (only a stub) and comment on what functions exist already and whether any of the private ones should instead be lifted to be part to the API, I'be be grateful.

Thanks!
Fabio

@vtraag
Copy link
Copy Markdown
Member

vtraag commented Jun 7, 2020

It seems reasonable to me to implement this algorithm: https://dl.acm.org/doi/abs/10.1145/1644015.1644023, see fig 2. Anyone objects or has a better suggestion?

I actually ran into the same publication recently. It sounded like an interesting option, although I have not yet explored the details.

@iosonofabio
Copy link
Copy Markdown
Member Author

iosonofabio commented Jun 10, 2020

I can read the paper, that's not the problem. How much difference do you think it will make in practice versus the increased complexity?

My understanding is the more recent approach does not use all vertices but just a feedback vertex set (which we have to implement separately - can default to all vertices of degree != 2 for now if I'm not wrong). Moreover it also only analyzes cycles in distinct subtrees from the root.

However this algorithm requires travelling across the trees from root to leaves, which is more difficult than the other way (just storing a parent vector). That's what I'm wondering right now, i.e. what an efficient data type would be for that purpose.

Wait, I should just look a little longer. Seems like the minimum spanning tree prim function uses a vector of edges to store the tree, we can do the same here. Then we can travel from root to children relatively easily.

@vtraag
Copy link
Copy Markdown
Member

vtraag commented Jun 10, 2020

2\. argsort?

If you do need this is practice, I think it would make a great contribution to the igraph_vector* types. It would probably be good to make this a public implementation then.

Comment on lines +125 to +132
igraph_integer_t from, to;
igraph_edge(graph, (igraph_integer_t) edge, &from, &to);

/* we only know that either one is actnode, swap if needed */
if (to==actnode) {
to = from;
from = to;
}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Noting in case you were not aware:

We have IGRAPH_TO and IGRAPH_FROM, which are both simpler to use and more efficient than igraph_edge (they avoid any function call).

There is also IGRAPH_OTHER. If an undirected edge is u-v then IGRAPH_OTHER(graph, edge, v) gives u. Did you need IGRAPH_OTHER(graph, edge, actnode) here?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to do this #1296

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is also a copycat from spanning_trees.c. Shall I make a PR to fix that?

Copy link
Copy Markdown
Member

@ntamas ntamas Jun 21, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it should be fixed in spanning_trees.c as well. That file is a very old part of the codebase, probably well before the time we've added IGRAPH_TO and IGRAPH_FROM. It probably does not make any difference performance-wise, but it helps people to pick up best practices if we have less places in the codebase where we don't use the shortcuts we've implemented to make our life easier in the long run.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It probably does not make any difference performance-wise

It depends. If this is the only function call in a loop, eliminating that call may very well make a noticeable difference. Function calls are expensive.

IGRAPH_CHECK(igraph_incident(graph, &inc_edges, (igraph_integer_t) actnode,
IGRAPH_ALL));

for (j=0; j<igraph_vector_size(&inc_edges); j++) {
Copy link
Copy Markdown
Member

@szhorvat szhorvat Jun 11, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please avoid using vector_size in for loops like this. Instead, get the size first, store it in a variable, then use that variable.

Here it would not matter that much because there are already several other function calls in the loop, but in general, it's good to get into the habit of storing the size in a variable. In a tight loop, that function call would bring a noticeable performance hit.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is copycat of the exact same pattern in spanning_trees.c, e.g. line 321. I think the compiler will put the igraph_vector_size(&inc_edges) into a variable anyway, so I'm not sure a human should deal with these kinds of optimizations.

So if this is a performance problem because the compiler does not do its job, I can open a separate PR to fix it everywhere else too. @ntamas @vtraag ?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have been surprised several times in the past by the complexity of optimizations compilers can do automatically, but I'm still unsure :)

On one hand, yes, it should be the compiler's job to optimize this and leave it to us mere mortals to deal with the more interesting parts of programming. On the other hand, in order to do the optimization here in this particular case, the compiler (let it be gcc, clang, MSVC or something else) must be able to infer that 1) igraph_vector_size() is side-effect-free (okay, that's actually quite easy), and that 2) the thing that igraph_vector_size() calculates is an invariant of the loop so it can be lifted to an external variable (which might even end up being in a register only). If the compiler can infer this, that's great.

As for me, I routinely lift the size of a vector into a variable on its own and then use that variable in the loop -- compilers were not that good in the past and old habits die hard. Also, it gives me peace of mind to do so because then I can say that it probably works even on a crappy compiler. However, if someone leaves the function call in the loop, that's also okay if I know it's a simple function call that does not do anything complex.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The compiler cannot infer that igraph_vector_size is side-effect free because it does not have the definition of that function. It is in a different translation unit.

Link-time optimization might be able to mitigate this to some extent, but it's not guaranteed, and LTO does not even work on all platforms. We should not rely on it.

Copy link
Copy Markdown
Member

@szhorvat szhorvat Jun 21, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@iosonofabio In C++ one can just do for (int i=0; i < vec.size(); ++i) because vec.size() will be an inline function that lives in a header. The compiler can analyse it and figure out that it does not need to call it for every loop iteration. In C, this is not possible because there is no general support for inline functions and the definition will simply not be visible to the compiler.

It is good practice to make a habit of avoiding this pattern. It's very easy to do so and then you don't have to keep thinking about how big the performance impact is anyway.

while (! igraph_dqueue_empty(&q)) {
/* current node, incoming edge, and distance from tree root */
long int actnode=(long int) igraph_dqueue_pop(&q);
long int actedge=(long int) igraph_dqueue_pop(&q);
Copy link
Copy Markdown
Member

@szhorvat szhorvat Jun 11, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's not use tab characters for indentation. It appears that there are lots of tabs in this file, mixed with spaces.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Speaking about tabs vs spaces: I think we have an .editorconfig at the root, but not all editors pick up the settings from .editorconfig by default, so we need another safety mechanism to prevent misformatted code from slipping into the codebase. Initially I wanted to propose using pre-commit for validating the codebase with pre-commit hooks, but then I realized that third-party contributors would probably not have pre-commit installed and then we would be back to square one. Do you guys know about any Github plugin that could be used to validate code style in the PRs, or do we have to create a CI test for that?

/* Clean */
igraph_vector_destroy(&fvs);
igraph_vector_ptr_destroy_all(trees);
IGRAPH_FINALLY_CLEAN(1 + n + 1); // FIXME: Check this number
Copy link
Copy Markdown
Member

@szhorvat szhorvat Jun 11, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This may be a useful read:

https://igraph.discourse.group/t/who-should-clean-up-vector-ptr-out-argument-in-case-of-interruption-or-error/153

Unfortunately, the handling of vector_ptrs in an error conditions is not completely standardized yet.

I am not aware of any existing case where all vector_ptr elements are put on the finally stack, and maybe it would be better not to start doing this. See the above thread for how this is usually handled.

Something to pay attention to:

Should the function exit due to an error condition or interruption, it should be safe for the caller to destroy/free any non-NULL elements of the ptr_vector. Thus, anything you destroy must be set to NULL. Alternatively, if everything was destroyed, the vector_ptr can be cleared.

As far as I can tell, existing functions do not leave any elements undestroyed or unfreed in an error condition.

Tip: igraph_Free(ptr) will set ptr to NULL, but igraph_free will not (cannot). The first is a macro, to be used in normal situations. The second is a function, to be used with IGRAPH_FINALLY, where one must take its address.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not aware of any existing case where all vector_ptr elements are put on the finally stack

The size of the finally stack is limited so probably not all elements would fit there anyway. So far I was following the principle that the vector_ptr takes "ownership" of any pointer that you put into it, so you only need to put a cleanup function on the finally stack that iterates over the elements of the vector_ptr and destroys and frees the non-null ones. This function can then be popped off the finally stack when the vector_ptr is returned to the user. The "item destructor" pattern outlined in the Discourse group is supposed to help with the implementation.

@iosonofabio
Copy link
Copy Markdown
Member Author

Thanks @szhorvat . At this stage I'm not fixing memory leaks and similar stuff, but thanks.

I haven't seen any comments on the general strategy, I'm just pushing ahead then. I'll come back to the nitty gritty details once the core of the algorithm is implemented 👍

@szhorvat
Copy link
Copy Markdown
Member

szhorvat commented Jun 11, 2020

@iosonofabio Is the fundamental cycles part considered to be ready now?

For the minimum cycle basis, I am familiar with Horton's algorithm, which is simple:

  • find fundamental cycles corresponding to a BFS tree starting from each vertex
  • order the entire group of cycles by weight
  • starting from the smallest, use Gaussian elimination to find a linearly independent set which is a basis

I am not familiar with the paper you linked. I'd need to look at it in detail to comment, which would take a few hours. I need to force myself off igraph for now, so I can make progress on other projects ...

@iosonofabio
Copy link
Copy Markdown
Member Author

Fantastic @szhorvat , thanks for the recap. The paper is an incremental improvement over that. Don't worry too much about this one, it'll take me a few more days ;-)

@iosonofabio iosonofabio marked this pull request as draft June 22, 2020 01:08
@vtraag
Copy link
Copy Markdown
Member

vtraag commented Jul 1, 2020

FYI, I just came across igraph_vector_order (and it siblings igraph_vector_order1 and igraph_vector_order1_int), which essentially provide an argsort of a vector. This really shows that we should make these functions public and provide decent documentation for them.

@ntamas ntamas mentioned this pull request Apr 8, 2021
@iosonofabio
Copy link
Copy Markdown
Member Author

This one is also not urgent, esp. since @szhorvat has implemented a similar algorithm. Closing for now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

todo Triaged for implementation in some unspecified future version

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants