Skip to content

Graph refactoring#553

Merged
maciejpirog merged 12 commits intomainfrom
ch/refactoring-graph
Jan 28, 2026
Merged

Graph refactoring#553
maciejpirog merged 12 commits intomainfrom
ch/refactoring-graph

Conversation

@corneliuhoffman
Copy link
Contributor

@corneliuhoffman corneliuhoffman commented Jan 22, 2026

Summary

This si a refactoring consolidating PR. The main conceptual change is that in the intrafile graph/signature we have no need for the type of nodes/keys to be a name list but just a name. This cleans up the graph and simplifies the procedures.

  • Introduce libs/graph_lib library with shared types for call graph construction abstracting the graphs a bit
  • Refactor intrafile taint analysis to use unified graph types (Call_graph_types.node, Call_graph_types.edge, Call_graph_types.G)
  • Consolidate language-specific configurations into Lang_config.ml (HOF patterns, constructor models)
  • Simplify Builtin_models.ml and Object_initialization.ml by extracting shared logic

Additional chages by @maciejpirog

  • Move graph-related thing to src/call_graph. Rationale: it depends on AST and IL, so should not be in libs. Move some helper functions from tainting/Function_call_graph to the call_graph/Call_graph (in future, I want to remove the Function_call_graph module). Make Reachable module a module (not a functor). Rename it to Graph_reachability.
  • Remove callee_fn_id from the type of edge labels. This is because it can be always recovered form the "src" of the edge.
  • Remove some unused types.
  • Introduce Function_id, which serves as an abstract type for identifying functions globally. It is used as the type of nodes in function call graph and as the type of keys in the signature db.
  • Remove fn_id (the list of names) from Shape_and_sig. The fn_id type will be an internal type used when building a call graph from AST. Maybe removed completely when function name resolution is moved to naming.
  • Move some things out of tainting/Function_call_graph and rename it to tainting/Graph_from_AST. Ideally, it would be moved to src/call_graph, but it depends on things from src/tainting, so for now it stays where it is.
  • Add .mli files to Graph_reachability and Graph_from_AST. These modules now export only 1 and 3 functions respectively, which should clarify the code.

@dimitris-m dimitris-m force-pushed the ch/refactoring-graph branch 3 times, most recently from 15e4243 to 27324fb Compare January 22, 2026 12:56
@dimitris-m dimitris-m removed the request for review from willem-delbare January 22, 2026 12:57
(* Call Graph Types - core types for call graph construction and analysis *)

(** Position in a file (line/column) *)
type position = { line : int; column : int [@key "character"] }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why a new type? Can't we simply use Pos.t?

[@@deriving yojson, eq, show, ord]

(** Range in a file (start/end positions) *)
type range = { start : position; end_ : position [@key "end"] }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As above, maybe reuse somethign we already have? (Tok_range.t maybe?)

[@@deriving yojson, eq, show, ord]

(** Function identifier - IL.name contains name string + token with file/position *)
type function_id = IL.name option
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IL.name seems to have some irrelevant information, like sid, which is not stable across runs. Maybe use Tok.location?

(* Helper to extract normalized key for node comparison.
Resolves file paths to canonical form to handle different representations
of the same file (e.g., /foo/bar vs /foo/baz/../bar). *)
let node_compare_key (v : node) =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code duplication: hash_function_id vs node_compare_key

Hashtbl.hash (fst n.IL.ident, filename, line, col)

(** Call graph node - just a function identifier (IL.name) *)
type node = IL.name
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

type node = function_id?
I guess most of the code is oblivious to what the type function_id actually is, so let's not break the abstraction


(** Call graph edge - represents a call from one function to another *)
type edge = {
callee_fn_id : IL.name;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

callee_fn_id : function_id; ?

module Dot = Graph.Graphviz.Dot (Display)

(** Node tracking for incremental updates *)
let removed_node_keys : (string, unit) Hashtbl.t = Hashtbl.create 1000
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hashtable on the toplevel. Not thread-safe

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also a starting size of 1000 is maybe not an optimal choice... start smaller?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideal solution: no top level hashtable, can this be embedded where it's used?
Second-best: DLS assuming of course that our concurrency architecture has not changed in any way.

close_out oc

(* Load graph from marshal file *)
let load_graph (path : string) : G.t =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't we need user-friendly error messages in case of read/write error?

type fun_info = {
fn_id : Shape_and_sig.fn_id;
opt_name : IL.name option;
name : IL.name;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't we prefer name : Call_grap_types.function_id?
As in the comments in Call_graph_types.ml: I think we should keep name an abstract type, as in most of the codebase we don't care about what it is exactly

(*****************************************************************************)

let check_fundef (taint_inst : Taint_rule_inst.t) name ctx ?glob_env ?class_name
let check_fundef (taint_inst : Taint_rule_inst.t) (name : IL.name) ctx ?glob_env ?class_name
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again: function_id?

type fun_info = {
fn_id : Shape_and_sig.fn_id;
opt_name : IL.name option;
name : IL.name;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe not relevant to this commit, but what is the exact purpose of class_name_str and method_properties in this type?

val check_fundef :
Taint_rule_inst.t ->
Shape_and_sig.fn_id (** entity being analyzed *) ->
IL.name (** entity being analyzed *) ->
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again: IL.name or func_id?

[@@deriving yojson, eq, show, ord]

(** Function identifier - IL.name contains name string + token with file/position *)
type function_id = IL.name option
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, if you want to move parts of the call graph to lib, I don't think it should depend on anything in IL

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, if we want to have a single type we use to identify functions (at the moment the use of IL.name is quite confusing for me), which serves as the type of keys in the signature db and can be obtained by querying the LSP, we need to keep it in a more profound and specific place. Especially that it is a cross-file thing, so I would stress it strongly in the code that it as a different kind of being that leave outside the AST/IL of the current target. I propose that we add a separate module for this

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By "this" I mean a type that identifies a function within the entire project (not a single target). This module would also be a good place to keep functions that query LSP to convert a given location in the current target to the global location, etc.

type range = { start : position; end_ : position [@key "end"] }
[@@deriving yojson, eq, show, ord]

(** Function identifier - IL.name contains name string + token with file/position *)
Copy link
Contributor

@maciejpirog maciejpirog Jan 22, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm pretty sure I saw a comment that something file names in tokens are not 100% trustworthy, but I can't find it now. I'm also not sure if it was a global thing, or only at that given place. We need to make sure token is enough to identify a file


(* Function key for the signature database - uses just the function name (last element of fn_id).
This matches the graph vertex type in Function_call_graph.ml. *)
type func_key = IL.name
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is another place where we would benefit from having a proper type for global function identifiers instead of IL.name

Copy link
Contributor

@maciejpirog maciejpirog left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is time to do some major clean up about types, identifiers etc we use in the codebase. I left some comments in particular files, but my major concern is the growing complexity and size of the code without introducing any abstraction over what we're doing. I propose that:

  • We are more precise about how functions are identified. At the moment we have types like Shape_and_sig.fn_id, Shape_and_sig.func_key, Call_grap_types.function_id. and on top of that in most places of the code, we use IL.name. I propose we make a separate module, say src/tainting/Signature_db, in which we:

    • Explicitly define the one and only type to identify functions, as they are used cross-file. If possible, I would like to keep this type abstract,

    • Move the signature db code here,

    • Add code to deal with LSPs here as well,

    • Reason for the two above: I guess these are the only two things that need to know the internal structure of these identifiers.

    • Problems with the above: Does naming want to know the structure of these (also: we cannot keep two separate mechanism for naming and we need to kill off one of them)

  • We don't put Call_graph_types.ml in lib, since it depends on things in src. Maybe we can make it a functor and parameterize with things from src inside src?

(** Call graph node - just a function identifier (IL.name) *)
type node = Function_id.t

(* Helper to extract normalized key for node comparison.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove this

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Put it in Function_id

close_out oc

(* Export graph to JSON format (Graphology format for Cytoscape.js viewer) *)
let graph_to_json (graph : G.t) : Yojson.Basic.t =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rename to graph_to_graphology_format

open Call_graph

(* Convert path to file:// URI *)
let fpath_to_uri path = "file://" ^ Fpath.to_string path
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sanitize this path

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think there is a need to sanitize this path: it is only used as a label in the exported json

let opt_name =
let* ent = opt_ent in
AST_to_IL.name_of_entity ent
| Some ent ->
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does this do?

Copy link
Collaborator

@dimitris-m dimitris-m left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, taking into account what we discussed offline.

once adapted we can merge it.

@maciejpirog maciejpirog merged commit 598aa8d into main Jan 28, 2026
6 checks passed
@maciejpirog maciejpirog deleted the ch/refactoring-graph branch January 28, 2026 11:12
@maciejpirog maciejpirog mentioned this pull request Feb 4, 2026
tmeijn pushed a commit to tmeijn/dotfiles that referenced this pull request Feb 15, 2026
This MR contains the following updates:

| Package | Update | Change |
|---|---|---|
| [opengrep/opengrep](https://github.com/opengrep/opengrep) | minor | `v1.15.1` → `v1.16.0` |

MR created with the help of [el-capitano/tools/renovate-bot](https://gitlab.com/el-capitano/tools/renovate-bot).

**Proposed changes to behavior should be submitted there as MRs.**

---

### Release Notes

<details>
<summary>opengrep/opengrep (opengrep/opengrep)</summary>

### [`v1.16.0`](https://github.com/opengrep/opengrep/releases/tag/v1.16.0): Opengrep 1.16.0

[Compare Source](opengrep/opengrep@v1.15.1...v1.16.0)

#### Improvements

- Dart: Add typed metavariabless by [@&#8203;maciejpirog](https://github.com/maciejpirog) in [#&#8203;551](opengrep/opengrep#551)
- Dart: Use case of identifier to guess call vs new by [@&#8203;maciejpirog](https://github.com/maciejpirog) in [#&#8203;555](opengrep/opengrep#555)
- Go: Enable goroutines in taint tracking by [@&#8203;maciejpirog](https://github.com/maciejpirog) in [#&#8203;559](opengrep/opengrep#559)
- Add taint propagation via "for" comprehensions by [@&#8203;maciejpirog](https://github.com/maciejpirog) in [#&#8203;564](opengrep/opengrep#564)

#### Bug Fixes

- Rust: Missing Rust type alias translation by [@&#8203;smith-xyz](https://github.com/smith-xyz) in [#&#8203;549](opengrep/opengrep#549)
- Fix: Ensure that linux binaries have 8mb stack size (musl) by [@&#8203;dimitris-m](https://github.com/dimitris-m) in [#&#8203;563](opengrep/opengrep#563)
- Fixed a perf regression by removing system calls and improving the reachability graph and the callee lookup by [@&#8203;corneliuhoffman](https://github.com/corneliuhoffman) in [#&#8203;556](opengrep/opengrep#556)
- Fixed intrafile bug introduced by a superfluous fallback by [@&#8203;corneliuhoffman](https://github.com/corneliuhoffman) in [#&#8203;567](opengrep/opengrep#567)
- Ruby: Always translate `or` and `and` to expression by [@&#8203;maciejpirog](https://github.com/maciejpirog) in [#&#8203;562](opengrep/opengrep#562)
- Bash: Allow redirects before command arguments by [@&#8203;maciejpirog](https://github.com/maciejpirog) in [#&#8203;548](opengrep/opengrep#548)

#### Internal Improvements

- Add `show dump-intrafile-graph` and `show dump-taint-signatures` commands by [@&#8203;corneliuhoffman](https://github.com/corneliuhoffman) in [#&#8203;552](opengrep/opengrep#552)
- Improve tainting code by [@&#8203;maciejpirog](https://github.com/maciejpirog) in [#&#8203;546](opengrep/opengrep#546)
- Graph refactoring by [@&#8203;corneliuhoffman](https://github.com/corneliuhoffman) in [#&#8203;553](opengrep/opengrep#553)

#### New Contributors

- [@&#8203;smith-xyz](https://github.com/smith-xyz) made their first contribution in [#&#8203;549](opengrep/opengrep#549)

**Full Changelog**: <opengrep/opengrep@v1.15.1...v1.16.0>

</details>

---

### Configuration

📅 **Schedule**: Branch creation - At any time (no schedule defined), Automerge - At any time (no schedule defined).

🚦 **Automerge**: Disabled by config. Please merge this manually once you are satisfied.

♻ **Rebasing**: Whenever MR becomes conflicted, or you tick the rebase/retry checkbox.

🔕 **Ignore**: Close this MR and you won't be reminded about this update again.

---

 - [ ] <!-- rebase-check -->If you want to rebase/retry this MR, check this box

---

This MR has been generated by [Renovate Bot](https://github.com/renovatebot/renovate).
<!--renovate-debug:eyJjcmVhdGVkSW5WZXIiOiI0Mi45Ni4yIiwidXBkYXRlZEluVmVyIjoiNDIuOTYuMiIsInRhcmdldEJyYW5jaCI6Im1haW4iLCJsYWJlbHMiOlsiUmVub3ZhdGUgQm90IiwiYXV0b21hdGlvbjpib3QtYXV0aG9yZWQiLCJkZXBlbmRlbmN5LXR5cGU6Om1pbm9yIl19-->
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants