feat(heuristics): add whitespace check to detect excessive spacing and invisible characters for malware check#1086

Merged
behnazh-w merged 7 commits into oracle:main from AmineRaouane:white-spaces-heuristic on Sep 20, 2025

AmineRaouane (Member) commented May 19, 2025

Summary

This PR adds a new heuristic that analyzes code to detect suspicious use of excessive spaces and invisible characters. It checks whether the amount of spacing and invisible Unicode characters exceeds a defined threshold.

Description of changes

  • Implemented the WhiteSpaces heuristic in a new Python module.
  • Registered the new heuristic inside the main heuristics.py file.
  • Created unit tests to verify the behavior of the WhiteSpacesAnalyzer heuristic.
  • Updated detect_malicious_metadata_check.py to integrate and execute the new heuristic logic during analysis.
  • The heuristic scans the codebase for abnormal invisible characters and spaces in the code.
  • This heuristic is combined with ForceSetup to justify high confidence in detection, as the presence of extra spaces alone could be due to poor formatting rather than malicious intent.
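The kind of scan described in the list above can be sketched roughly as follows. This is a minimal illustration, not the code added in this PR: the function name `find_suspicious_lines`, the set of invisible code points, and the threshold of 50 are all assumptions made for the example.

```python
import re

# Assumption: an illustrative set of zero-width / invisible code points.
# The set used by the actual heuristic is not shown in this conversation.
INVISIBLE_CHARS = frozenset("\u200b\u200c\u200d\u2060\ufeff")

# Assumption: 50 mirrors the threshold discussed for the Semgrep rule.
SPACING_THRESHOLD = 50
_LONG_RUN = re.compile(r"[ \t]{%d,}\S" % SPACING_THRESHOLD)


def find_suspicious_lines(source: str) -> list[tuple[int, str]]:
    """Return (line_number, reason) pairs for lines that look suspicious."""
    findings = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        # Flag any line containing a character from the invisible set.
        if any(ch in INVISIBLE_CHARS for ch in line):
            findings.append((lineno, "invisible character"))
        # Flag a long run of spaces/tabs that is followed by more content.
        if _LONG_RUN.search(line):
            findings.append((lineno, "excessive spacing"))
    return findings
```

A finding from a scan like this is only a weak signal on its own, which is why the description pairs it with ForceSetup before reporting high confidence.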

Related issues

None

Checklist

  • I have reviewed the contribution guide.
  • My PR title and commits follow the Conventional Commits convention.
  • My commits include the "Signed-off-by" line.
  • I have signed my commits following the instructions provided by GitHub. Note that we run GitHub's commit verification tool to check the commit signatures. A green verified label should appear next to all of your commits on GitHub.
  • I have updated the relevant documentation, if applicable.
  • I have tested my changes and verified they work as expected.

@oracle-contributor-agreement oracle-contributor-agreement Bot added the "OCA Verified" label (All contributors have signed the Oracle Contributor Agreement.) May 19, 2025
@behnazh-w behnazh-w changed the title from "feat(heuristics): add Whitespace Check to detect excessive spacing and invisible characters." to "feat(malware-check): add whitespace Check to detect excessive spacing and invisible characters." May 20, 2025
@behnazh-w behnazh-w changed the title from "feat(malware-check): add whitespace Check to detect excessive spacing and invisible characters." to "feat(malware-check): add whitespace check to detect excessive spacing and invisible characters." May 20, 2025
@behnazh-w behnazh-w changed the title from "feat(malware-check): add whitespace check to detect excessive spacing and invisible characters." to "feat(malware-check): add whitespace check to detect excessive spacing and invisible characters" May 20, 2025
@behnazh-w behnazh-w requested a review from art1f1c3R May 26, 2025 05:15
Comment thread src/macaron/malware_analyzer/pypi_heuristics/sourcecode/white_spaces.py Outdated
@AmineRaouane AmineRaouane force-pushed the white-spaces-heuristic branch from db0e35c to 6978bd7 Compare May 26, 2025 10:58
Comment thread src/macaron/config/defaults.ini Outdated
Comment thread src/macaron/slsa_analyzer/checks/detect_malicious_metadata_check.py
Comment thread src/macaron/resources/pypi_malware_rules/obfuscation.yaml Outdated
Comment thread src/macaron/resources/pypi_malware_rules/obfuscation.yaml Outdated
art1f1c3R previously approved these changes Aug 8, 2025
@behnazh-w behnazh-w changed the title from "feat(malware-check): add whitespace check to detect excessive spacing and invisible characters" to "feat(heuristics): add whitespace check to detect excessive spacing and invisible characters for malware check" Aug 8, 2025

art1f1c3R commented Aug 8, 2025

The CI test seems to be failing due to detecting excessive whitespace in django@5.0.6 for django/utils/html.py:149 and tests/gis_tests/distapp/tests.py:384.

@art1f1c3R art1f1c3R dismissed their stale review September 3, 2025 23:07

Dismissing approval until integration test failure is resolved.

art1f1c3R commented Sep 5, 2025

I have investigated this problem. In django/utils/html.py:149-150, the following text appears in a docstring:

      format_html_join('\n', "<li>{} {}</li>", ((u.first_name, u.last_name)
                                                  for u in users))

In tests/gis_tests/distapp/tests.py:384-386 the following text also appears in a docstring:

================================

                                                | Projected Geometry | Lon/lat Geometry

Both of these examples trigger the excessive whitespace Semgrep rule, as there are over 50 spaces before some of the indented lines. Since both occur in docstrings, my proposed solution (which I have tested and which does not trigger on django@5.0.6) is to alter the rule to this:

- id: obfuscation_excessive-spacing
  metadata:
    description: Detects the use of excessive spacing in code, which may indicate obfuscation or hidden code.
  message: Hidden code after excessive spacing
  languages:
  - python
  severity: WARNING
  pattern-either:
  - pattern-regex: '[\s]{50,}(\S)+' # 50 is the threshold for excessive spacing; runs of 50 or more are treated as possible obfuscation
  - pattern-not-inside: |
        """ ... """

I have used \s for whitespace characters and \S for non-whitespace characters for clarity here. The pattern-not-inside makes the rule skip docstrings. Please check whether this resolves the failing CI test.

One thing to be wary of is benign code that is indented deeply enough to trigger this rule. Most projects will not encounter this, since indentation rarely reaches 50 spaces and code linters usually prevent it, so I don't expect many false positives, but it is a possibility.
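To make the matching behaviour concrete, here is a small sketch that applies the same regular expression with Python's `re` module. The helper name `has_excessive_spacing` is made up for the example, and Python's regex engine may differ from Semgrep's in edge cases, so treat this as an approximation of the rule rather than a reproduction of it.

```python
import re

# Same regular expression as the pattern-regex in the proposed rule.
EXCESSIVE_SPACING = re.compile(r"[\s]{50,}(\S)+")


def has_excessive_spacing(text: str) -> bool:
    """Return True if 50 or more whitespace characters precede other content."""
    return EXCESSIVE_SPACING.search(text) is not None


# Code pushed far to the right of an innocent-looking statement.
hidden_payload = "x = 1" + " " * 60 + "; import os"

# Benign continuation line aligned under a long call; \s also matches the
# newline, so the run here is 51 whitespace characters.
aligned_continuation = "result = call(first_argument,\n" + " " * 50 + "second_argument)"

normal_code = "def f():\n    return 1\n"
```

Here `has_excessive_spacing` returns True for both `hidden_payload` and `aligned_continuation`, and False for `normal_code`. The second case is the kind of benign, deeply indented code that could produce a false positive.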

art1f1c3R previously approved these changes Sep 12, 2025

@art1f1c3R art1f1c3R left a comment

CI error has been resolved by ignoring excessive spacing present in docstrings.
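For readers who want to see the idea of skipping docstrings outside Semgrep, the following is a rough Python sketch using the standard `ast` module. It is not how the rule or Macaron implements the exclusion. The function names are hypothetical, and it only skips true docstrings, whereas a Semgrep `""" ... """` pattern may cover other triple-quoted strings as well.

```python
import ast
import re

# Assumption: same threshold of 50 as the Semgrep rule, applied per line.
LONG_RUN = re.compile(r"[ \t]{50,}\S")


def docstring_lines(source: str) -> set[int]:
    """Collect the line numbers covered by module, class, and function docstrings."""
    lines: set[int] = set()
    scopes = (ast.Module, ast.ClassDef, ast.FunctionDef, ast.AsyncFunctionDef)
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, scopes) or not node.body:
            continue
        first = node.body[0]
        # A docstring is a bare string expression as the first statement.
        if (
            isinstance(first, ast.Expr)
            and isinstance(first.value, ast.Constant)
            and isinstance(first.value.value, str)
        ):
            lines.update(range(first.lineno, first.end_lineno + 1))
    return lines


def flagged_lines(source: str) -> list[int]:
    """Return line numbers with excessive spacing, ignoring docstring lines."""
    skip = docstring_lines(source)
    return [
        lineno
        for lineno, line in enumerate(source.splitlines(), start=1)
        if lineno not in skip and LONG_RUN.search(line)
    ]
```

With this approach, a deeply indented example inside a docstring (as in the django files mentioned earlier) is ignored, while the same amount of spacing in executable code is still reported.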

behnazh-w previously approved these changes Sep 12, 2025

art1f1c3R commented Sep 15, 2025

The pattern-not-inside was incorrectly formatted, which was fixed in this commit.

@AmineRaouane AmineRaouane dismissed stale reviews from behnazh-w and art1f1c3R via e759fd3 September 15, 2025 22:54
@AmineRaouane AmineRaouane force-pushed the white-spaces-heuristic branch 3 times, most recently from a62851e to c88ee73 Compare September 15, 2025 23:40
Amine added 3 commits September 19, 2025 16:59
…d invisible characters

Signed-off-by: Amine <amine.raouane@enim.ac.ma>
Signed-off-by: Amine <amine.raouane@enim.ac.ma>
…d invisible characters

Signed-off-by: Amine <amine.raouane@enim.ac.ma>
Amine and others added 4 commits September 19, 2025 16:59
Signed-off-by: Amine <amine.raouane@enim.ac.ma>
Signed-off-by: Amine <amine.raouane@enim.ac.ma>
…tion threshold

Signed-off-by: Amine <amine.raouane@enim.ac.ma>
Signed-off-by: Carl Flottmann <carl.flottmann@oracle.com>
@art1f1c3R art1f1c3R force-pushed the white-spaces-heuristic branch from c88ee73 to 667c4f0 Compare September 19, 2025 07:13
@behnazh-w behnazh-w merged commit 320d644 into oracle:main Sep 20, 2025
15 of 19 checks passed
Labels

OCA Verified All contributors have signed the Oracle Contributor Agreement.

3 participants