Conversation
99a9592 to
9d1318b
Compare
5992bc2 to
1ebf5ab
Compare
f911e04 to
6a7d912
Compare
6ae650c to
0cd3d1c
Compare
src/engine/Match_tainting_mode.ml
Outdated
| List.iteri (fun i fn_id -> | ||
| let name = match Shape_and_sig.get_fn_name fn_id with | ||
| | Some n -> fst n.IL.ident | ||
| | None -> "<no-name>" | ||
| in | ||
| Log.debug (fun m -> m "SUBGRAPH: source_function[%d] = %s" i name)) | ||
| source_functions; | ||
| List.iteri (fun i fn_id -> | ||
| let name = match Shape_and_sig.get_fn_name fn_id with | ||
| | Some n -> fst n.IL.ident | ||
| | None -> "<no-name>" | ||
| in | ||
| Log.debug (fun m -> m "SUBGRAPH: sink_function[%d] = %s" i name)) | ||
| sink_functions; |
There was a problem hiding this comment.
For performance reasons, it's best to only make these calculations inside the fun m -> ... since they are not needed outside of the debug context.
src/engine/Match_tainting_mode.ml
Outdated
| | G.Param { G.ptype = Some ptype; _ } :: _ -> ( | ||
| match ptype.G.t with | ||
| | G.TyFun _ -> true | ||
| | _ -> false) | ||
| | _ -> false |
There was a problem hiding this comment.
Why not just:
| G.Param { G.ptype = Some {t = G.TyFun _; _}; _ } :: _ -> true
| _ -> false
src/engine/Match_tainting_mode.ml
Outdated
| * also extract signature with arity-1 to handle trailing lambda syntax: | ||
| * f(a, b) vs f(a) { b } *) | ||
| let updated_db = | ||
| if Lang.equal lang Lang.Kotlin && arity >= 2 then |
There was a problem hiding this comment.
Why not && arity >= 1 ?
Is it not the case that f(a) where a is a function is also f { a } ?
src/matching/Generic_vs_generic.ml
Outdated
| | G.FuncDef a1, B.FieldDefColon { B.vinit = Some { e = B.Lambda b1; _ }; _ } | ||
| | B.FieldDefColon { B.vinit = Some { e = B.Lambda b1; _ }; _ }, G.FuncDef a1 -> |
There was a problem hiding this comment.
I think this is wrong: we are inverting the order of arguments, left is Pattern (G) and right is Source code (B).
But from the cases we should not assume that they are interchangeable, ie m_expr a b is not necessarily the same as m_expr b a.
We need 2 separate cases or maybe (if possible) rename the variables on the rhs of | ... to have a1 pattern and b1 code in both cases.
src/matching/Generic_vs_generic.ml
Outdated
| | G.FuncDef a1, B.VarDef { B.vinit = Some { e = B.Lambda b1; _ }; _ } | ||
| | B.VarDef { B.vinit = Some { e = B.Lambda b1; _ }; _ }, G.FuncDef a1 -> |
There was a problem hiding this comment.
Same comment as below.
| (* iso: FuncDef pattern can match VarDef with Lambda (arrow function) *) | ||
| | G.FuncDef a1, B.VarDef { B.vinit = Some { e = B.Lambda b1; _ }; _ } | ||
| | B.VarDef { B.vinit = Some { e = B.Lambda a1; _ }; _ }, G.FuncDef b1 -> | ||
| if_config | ||
| (fun x -> x.arrow_is_function) | ||
| ~then_:(m_function_definition a1 b1) | ||
| ~else_:(fail ()) | ||
| (* iso: FuncDef pattern can match FieldDefColon with Lambda (arrow function in object literal) *) | ||
| | G.FuncDef a1, B.FieldDefColon { B.vinit = Some { e = B.Lambda b1; _ }; _ } | ||
| | B.FieldDefColon { B.vinit = Some { e = B.Lambda a1; _ }; _ }, G.FuncDef b1 -> | ||
| if_config | ||
| (fun x -> x.arrow_is_function) | ||
| ~then_:(m_function_definition a1 b1) | ||
| ~else_:(fail ()) |
There was a problem hiding this comment.
I am starting to doubt we need any of that.
What is the reason to have these identifications of different constructs? Won't it return weird results in search mode? Won't it enlarge the set of taint results in perhaps unintentional ways (if one has pattern-inside for example)?
There was a problem hiding this comment.
Leaving a TODO for Part II: ensure this is what we want to do.
Hopefully this only happens when we have already matched the entity and we should be comparing two DefStmt definition.
| | _ -> ()); | ||
| super#visit_expr env expr |
There was a problem hiding this comment.
I think we can do:
| _ -> super#visit_expr env expr);| Structure: OtherExpr("ShortLambda", [Params [...]; S body_stmt]) | ||
| This allows Naming_AST to create proper scope for the params. *) | ||
| let convert_short_lambda (tok : Tok.t) (body_expr : G.expr) : G.expr = | ||
| let max_placeholder = find_max_placeholder body_expr in |
There was a problem hiding this comment.
When you match a pattern such as &(foo(... &3 ...)) how can you depend on the max placeholder?
It will only match short lambdas with exactly 3 parameters.
How about &(foo(...))? No parameters at all, and no match in target code.
I think that if we are to be able to match using such patterns, we need to treat patterns differently.
What I did in my current lang is to add an ellipsis after the max index (extra ellipsis parameter) only when parsing a pattern, so you can simply match any number of parameters with such patterns.
For the empty parameters case (no &n appears at all in the pattern), one should still add that ellipsis and it will match short lambdas of any parameter count. Better than no matches at all.
For Elixir the mechanics are there in languages/elixir/generic/Elixir_to_generic.ml: type env = Program | Pattern.
I recommend to add some tests about this material under tests/patterns/elixir/.
Also, tainting should work for a rule with:
- pattern-inside: |
&($_(... $P ...))
- focus-metavariable: $P
and a sink like sink(...), for example:
&(sink &3)
should match I think.
There was a problem hiding this comment.
TODO for Part II: changes as discussed offline.
| | G.TyN name -> | ||
| (* For anonymous classes, accept the interface name even if not in class_names *) | ||
| Some name |
There was a problem hiding this comment.
In that we could just remove then when from the case above.
There was a problem hiding this comment.
@maciejpirog we have agreed to simplify this, leaving here as todo.
| let find_max_placeholder (e : G.expr) : int = | ||
| let max_found = ref 0 in | ||
| let visitor = | ||
| object | ||
| inherit [_] AST_generic.iter_no_id_info as super | ||
|
|
||
| method! visit_expr env expr = | ||
| (match expr.G.e with | ||
| | G.OtherExpr | ||
| (("PlaceHolder", _), [ G.E { e = G.L (G.Int (Some n, _)); _ } ]) -> | ||
| max_found := max !max_found (Int64.to_int n) | ||
| | _ -> super#visit_expr env expr) | ||
| end | ||
| in | ||
| visitor#visit_expr () e; | ||
| !max_found |
There was a problem hiding this comment.
TODO for Part II:
- make class;
- place max_found in env so the instance can be reused.
| object | ||
| inherit [_] AST_generic.map as super | ||
|
|
||
| method! visit_expr env expr = | ||
| match expr.G.e with | ||
| | G.OtherExpr | ||
| (("PlaceHolder", _), [ G.E { e = G.L (G.Int (Some n, tk)); _ } ]) -> | ||
| let param_name = Printf.sprintf "&%Ld" n in | ||
| let param_id = (param_name, tk) in | ||
| G.N (G.Id (param_id, G.empty_id_info ())) |> G.e | ||
| | _ -> super#visit_expr env expr | ||
| end |
There was a problem hiding this comment.
TODO for Part II: make class and reuse instance.
| val current_class : G.name option ref = ref None | ||
| val parent_path : IL.name option list ref = ref [] | ||
|
|
||
| method! visit_definition f ((ent, def_kind) as def) = | ||
| match def_kind with |
There was a problem hiding this comment.
TODO for Part II:
- make current_class and parent_path part of
env(together withf) - make stateless and reuse 1 instance of the class.
- removed retries - removed the need of fixpoint_TIMEOUT by delegating to the fixpoint graph - improved the graph by adding HOFs to it as well as top level stuff
034d431 to
45767bb
Compare
| match shared_call_graph with | ||
| | Some (graph, _shared_mappings) -> graph | ||
| | None -> | ||
| (* Compute call graph as before *) | ||
| Function_call_graph.build_call_graph ~lang | ||
| ~object_mappings:all_object_mappings ast | ||
| in |
There was a problem hiding this comment.
Can this happen? Where is it stored after it's calculated?
There was a problem hiding this comment.
TODO for Part II: Check if this can really happen, at least debug log if it does.
| (* Fallback: try qualified function name (Module.function for Elixir, etc.) *) | ||
| let qualified_name = | ||
| { | ||
| ident = (class_str, snd method_name.ident); | ||
| ident = (fst obj.ident ^ "." ^ fst method_name.ident, snd method_name.ident); |
There was a problem hiding this comment.
TODO for Part II: Can we use IdQualified here? We might have to extend things a bit.
| | IL.Unnamed ({ e = Fetch lval; _ } as lambda_exp) -> | ||
| (* Single Fetch argument - check if it's a lambda by looking at its shape *) | ||
| (match Lval_env.find_lval env.lval_env lval with | ||
| | Some (S.Cell (_, shape)) -> | ||
| (match shape with | ||
| | S.Fun _fun_sig -> | ||
| (* It's a function/lambda! *) | ||
| Some (e, lambda_exp) | ||
| | _ -> None) | ||
| | None -> None) | ||
| | _ -> None) | ||
| | _ -> None) |
There was a problem hiding this comment.
TODO for Part II: Simplify this with a more refined pattern (else None)
| let timeout = | ||
| if taint_inst.options.taint_intrafile then base_timeout *. 10.0 | ||
| if taint_inst.options.taint_intrafile then base_timeout *. 20.0 |
There was a problem hiding this comment.
TODO for Part II: timeout is not used any more (see Dataflow_core.ml). Remove.
| (* Create assumptions for lambda parameters *) | ||
| (* Create assumptions for lambda parameters using Fold_IL_params *) |
There was a problem hiding this comment.
TODO for Part II: Unify comments.
There was a problem hiding this comment.
TODO: Check for Part II
| (* Give the parameter an Arg shape so it can be used in HOF *) | ||
| let new_env = add_param_to_env il_lval taint_set taint_arg env in |
There was a problem hiding this comment.
TODO for Part II:
See below case of IL.PatternParm.
More cases needed probably.
dimitris-m
left a comment
There was a problem hiding this comment.
LGTM, let's merge this.
We have to make a follow-up, especially now that there is no taint_fixpoint_timeout, we need to remove the rule option and all usage of the parameter.
There are several TODO for part II and we should do these asap.
This MR contains the following updates: | Package | Update | Change | |---|---|---| | [opengrep/opengrep](https://github.com/opengrep/opengrep) | minor | `v1.13.2` → `v1.14.1` | MR created with the help of [el-capitano/tools/renovate-bot](https://gitlab.com/el-capitano/tools/renovate-bot). **Proposed changes to behavior should be submitted there as MRs.** --- ### Release Notes <details> <summary>opengrep/opengrep (opengrep/opengrep)</summary> ### [`v1.14.1`](https://github.com/opengrep/opengrep/releases/tag/v1.14.1): Opengrep 1.14.1 [Compare Source](opengrep/opengrep@v1.14.0...v1.14.1) #### Improvements - Clojure translation part II by [@​dimitris-m](https://github.com/dimitris-m) in [#​517](opengrep/opengrep#517) - C#: Allow implicit variables in properties to be taint sources by [@​maciejpirog](https://github.com/maciejpirog) in [#​516](opengrep/opengrep#516) - Add core flags `dump_rule` and `dump_patterns_of_rule` as options in the show command by [@​maciejpirog](https://github.com/maciejpirog) in [#​519](opengrep/opengrep#519) #### Bug fixes - Fix: pass signature databaseb to lambda analysis, handle method mutation tainting by [@​corneliuhoffman](https://github.com/corneliuhoffman) in [#​520](opengrep/opengrep#520) #### Tech debt - Fix CHANGELOG.md, OPENGREP.md, remove unused files by [@​dimitris-m](https://github.com/dimitris-m) in [#​523](opengrep/opengrep#523) **Full Changelog**: <opengrep/opengrep@v1.14.0...v1.14.1> ### [`v1.14.0`](https://github.com/opengrep/opengrep/releases/tag/v1.14.0): Opengrep 1.14.0 [Compare Source](opengrep/opengrep@v1.13.2...v1.14.0) #### Improvements - Support for higher-order functions in intrafile taint analysis by [@​corneliuhoffman](https://github.com/corneliuhoffman) in [#​469](opengrep/opengrep#469) and [#​513](opengrep/opengrep#513) - Clojure: Improved support for Clojure (incl. tainting) by [@​dimitris-m](https://github.com/dimitris-m) in [#​501](opengrep/opengrep#501) - Dart: Improved support for Dart by [@​maciejpirog](https://github.com/maciejpirog) in [#​508](opengrep/opengrep#508) - C#: Better handing of extension methods and extension blocks by [@​maciejpirog](https://github.com/maciejpirog) in [#​514](opengrep/opengrep#514) #### Fixes - Bump cygwin install action by [@​dimitris-m](https://github.com/dimitris-m) in [#​503](opengrep/opengrep#503) and [#​509](opengrep/opengrep#509) **Full Changelog**: <opengrep/opengrep@v1.13.2...v1.14.0> </details> --- ### Configuration 📅 **Schedule**: Branch creation - At any time (no schedule defined), Automerge - At any time (no schedule defined). 🚦 **Automerge**: Disabled by config. Please merge this manually once you are satisfied. ♻ **Rebasing**: Whenever MR becomes conflicted, or you tick the rebase/retry checkbox. 🔕 **Ignore**: Close this MR and you won't be reminded about this update again. --- - [ ] <!-- rebase-check -->If you want to rebase/retry this MR, check this box --- This MR has been generated by [Renovate Bot](https://github.com/renovatebot/renovate). <!--renovate-debug:eyJjcmVhdGVkSW5WZXIiOiI0Mi42OS4yIiwidXBkYXRlZEluVmVyIjoiNDIuNjkuMiIsInRhcmdldEJyYW5jaCI6Im1haW4iLCJsYWJlbHMiOlsiUmVub3ZhdGUgQm90IiwiYXV0b21hdGlvbjpib3QtYXV0aG9yZWQiLCJkZXBlbmRlbmN5LXR5cGU6Om1pbm9yIl19-->
Higher-Order Function (HOF) Taint Analysis - PR Documentation
Overview
This PR introduces comprehensive support for tracking taint through higher-order functions (HOFs) in opengrep's cross-function taint analysis.
Problem: Before this PR, taint analysis could not track data flow through HOF calls like:
Solution: We now model HOF behavior with
ToSinkInCalleffects that represent "call this callback with tainted data", combined with a call graph that tracks both direct calls and HOF callbacks for proper analysis ordering.Architecture Changes
1. Path-Based Function Identification (
fn_id)File:
src/tainting/Shape_and_sig.ml:695-710Before:
After:
Why: The old structure couldn't represent nested functions. The new path-based representation supports arbitrary nesting:
[None; Some "foo"]- Top-level functionfoo[Some "MyClass"; Some "method"]- Method in a class[None; Some "outer"; Some "inner"]- Nested functioninnerinsideouter[Some "Class"; Some "method"; Some "_tmp"]- Lambda inside a class methodHelper functions added:
2. Builtin Signature Database
New File:
src/tainting/Builtin_models.ml(471 lines)Standard library HOFs (like
Array.map,filter, Python'smap()) don't have user-defined signatures. This module provides built-in models.Key concept:
ToSinkInCalleffect represents "call a callback with tainted data":Three signature patterns:
MethodHOF -
arr.map(callback): Passesthis(the array) to the callbackFunctionHOF -
map(callback, data): Passes one parameter to anotherReturningFunctionHOF - Ruby's
arr.mapreturns a function that takes a blockSupported languages and functions (
get_hof_configs):map,flatMap,filter,forEach,find,findIndex,some,every,reduce,reduceRightmap,filtermap,each,select,filter,flat_map,collect,find,detect(returning-function pattern)map,filter,forEach,flatMapmap,filter,forEach,flatMap,find,any,all(arity 0 and 1)map,filter,forEach,flatMap,compactMap,first,containsmap,filter,foreach,flatMap,find,exists,forallSelect,Where,ForEach,SelectMany,First,Any,Allmap,for_each,filter,flat_map,find,any,allarray_map,array_filter,array_walkmap,foreach,filterfor_each(arity 3),transform(arity 4)Enum.map,Enum.each,Enum.filter,Enum.flat_map,Enum.find3. Labeled Call Graph with HOF Edges
File:
src/tainting/Function_call_graph.ml(major rewrite, +869 lines)Before: Simple graph
FuncGraph = Persistent.Digraph.Concrete(FuncVertex)After: Labeled imperative graph with edge metadata:
Why labeled edges:
Key functions:
identify_callee- Resolves a call expression to itsfn_id:this.method(),obj.method(), qualified namesextract_calls- Extracts all (fn_id, call_site_tok) pairs from a function bodyextract_hof_callbacks- Extracts HOF callback edges:Builtin_models.get_hof_configsto know which methods are HOFsdetect_user_hof<hof_callback:foo>detect_user_hof- Checks if a function calls any of its parameters:4. Graph Reachability Optimization
New File:
src/tainting/Graph_functor.ml(132 lines)For large codebases, analyzing every function is expensive. This module computes the minimal subgraph needed for a specific rule.
Key function:
nearest_common_descendant_subgraphAlgorithm:
R1) and sink functions (R2)R1 ∩ R2)Usage in Match_tainting_mode.ml:
5. Parent Path Visitor
File:
src/analyzing/Visit_function_defs.ml(+167 lines)New visitor:
visitor_with_parent_pathtracks the full path to each function during AST traversal.Why: The old
fold_with_class_contextonly tracked class names. For nested functions, we need the full path to build correctfn_ids.6. ToSinkInCall Effect
File:
src/tainting/Sig_inst.ml:38-48, 876-1107New effect type added to signature instantiation:
Instantiation logic (simplified):
ToSinkInCalleffect during signature instantiation:Key change:
instantiate_function_signaturenow takes optionallookup_siganddepthparameters:7. Signature Lookup via Call Graph
File:
src/tainting/Dataflow_tainting.ml:667-730New function queries the call graph to resolve function calls:
Lookup order:
callee_fn_idfrom the edge label8. Ruby Class Method Fix
File:
languages/ruby/generic/ruby_to_generic.ml:794-815Bug: Ruby class methods (
def self.method_name) were being parsed with empty entity names, causing all class methods to collapse into a single node.Before:
After:
Impact: Call graph now correctly shows
Service.default_integrationandService.closest_group_integrationas separate nodes instead of both beingService..9. Parameter Shape for HOF Detection
File:
src/tainting/Taint_signature_extractor.ml:123-214Parameters now get
Shape.Argshape instead of just taints:Why: When a parameter is used as a callback in a HOF call, we need to know it's an "argument" to create proper
ToSinkInCalleffects. TheArgshape carries the parameter's index.10. Match_tainting_mode Integration
File:
src/engine/Match_tainting_mode.ml(+343 lines of changes)Major refactoring to use the new infrastructure:
Shared call graph: Computed once per file, shared across rules
Relevant subgraph optimization:
Topological ordering for analysis:
Kotlin trailing lambda support: Functions with lambda as last parameter get signatures extracted at arity and arity-1:
Data Flow Example
Consider this TypeScript code:
Analysis flow:
Build call graph:
Compute relevant subgraph: Functions between source and sink
Topological order:
transform,process,<toplevel>(leaves first)Extract signatures:
transform:ToReturn { data_taints = {Arg(x, 0)} }(returns its input)process:mapbuiltin hasToSinkInCall { callee = callback, args_taints = [BThis] }x => transform(x)is analyzed, we look uptransform's signatureToReturn { data_taints = {Arg(items, 0)} }(returns transformed items)Instantiate at call sites:
process(data)wheredatais tainted fromsource()Arg(items, 0)→ taint fromdatasink(result[0])→ FINDING!Test Coverage
New test files (17 languages):
test_hof_comprehensive_*.{js,ts,py,rb,java,kt,scala,swift,cs,rs,lua,php,go,cpp,c,julia,ex}Each test covers:
map,filter,forEach)arr.map(...).filter(...))Configuration
No new configuration options. HOF support is automatically enabled when
taint_intrafileis true.Performance Considerations
taint_MAX_VISITS_PER_NODEBenchmark Script
A benchmark script is provided at
benchmark_taint.shfor comparing OpenGrep vs Semgrep taint analysis performance:Options:
--no-semgrep: Skip Semgrep benchmarks (useful when Semgrep is not installed or for quick OpenGrep-only tests)num_runs: Number of runs to average (default: 3)sha1, sha2, ...: Git commits/branches to compare (will checkout, compile, and benchmark each)Output: Summary table with findings count and timing (avg ± std) for each configuration