Skip to content

Conversation

@gfs
Copy link
Contributor

@gfs gfs commented Sep 8, 2025

This pull request introduces significant improvements to XML/XPath boundary detection in the TextContainer class and adds new test helpers for rule creation. The main focus is on enhancing the accuracy and robustness of XML XPath string extraction, especially with respect to namespace handling and precise location mapping, while also making it easier to write and manage rule-based tests.

Enhancements to XML/XPath processing:

  • Replaced the use of XPathDocument and XmlNavigator with XDocument and LINQ to XML for XPath evaluation, enabling better support for line information and more robust namespace handling. (AppInspector.RulesEngine/TextContainer.cs) [1] [2] [3]
  • Added new logic for precise boundary detection using line info from XML objects, including configurable search windows for attributes and element values. This improves the accuracy of mapping XPath results back to their positions in the original XML document. (AppInspector.RulesEngine/TextContainer.cs) [1] [2]
  • Improved handling of XML namespaces: now merges namespaces declared in the document with those provided by the caller, and logs detailed errors if namespace mapping fails. (AppInspector.RulesEngine/TextContainer.cs)

Testing improvements:

  • Added a new RuleTestHelpers class to simplify the creation of RuleProcessor instances and single-rule RuleSets for tests, supporting both string and dictionary-based namespace inputs for XPath patterns. (AppInspector.Tests/RuleProcessor/RuleTestHelpers.cs)

These changes should result in more accurate XML analysis and make it easier to write and maintain tests for rule processing.

gfs added 4 commits September 8, 2025 16:04
Refactors XML XPath extraction in TextContainer to use XDocument and track node positions, ensuring correct sample and boundary values for XPath matches. Adds a regression test for GitHub issue #621 to verify that the extracted sample and boundary are accurate for XPath queries in XML files.
Replaces the previous EvaluateXPath method with a unified approach using XPathEvaluate, improving support for native engine results and namespace handling. Adds robust logic to extract value boundaries for elements, attributes, and strings, including fallbacks for missing position data. Enhances error handling and logging for XPath evaluation failures.
Renamed the test method from 'XPathBugFix621_VersionExtraction' to 'XPathVersionElementSampleBoundary' in XmlAndJsonTests.cs for improved clarity and alignment with test intent.
Introduces a new test class 'XPathComprehensiveTests' with a variety of cases to expand coverage of XPath handling in the rules engine. Tests include complex predicates, axis selectors, attribute formats, namespace conflicts, text nodes, invalid expressions, empty/whitespace nodes, CDATA/comments, large document performance, and boundary edge cases.
@gfs gfs requested a review from Copilot September 8, 2025 23:40

This comment was marked as outdated.

gfs added 4 commits September 8, 2025 16:59
Simplifies XML XPath extraction by removing the custom node-to-position mapping logic and relying on line info and fallback string search to determine boundaries. This reduces complexity and improves maintainability while preserving the ability to map extracted values to their locations in the original document.
Enhanced the TextContainer to better handle XML namespaces by first extracting namespace declarations from the document root and then applying caller-supplied mappings, allowing for more robust and flexible XPath queries. Updated related test to use a more precise XPath and namespace mapping, ensuring correct matching of attributes in XML files.
Enhanced TextContainer to better disambiguate boundaries for duplicate XML element and attribute values, reducing collisions in boundary mapping. Added comprehensive unit tests in XPathBoundaryTests to validate correct boundary index/length mapping, especially for duplicate values and edge cases, and to document current limitations.
Removed documentation and handling of known boundary mapping limitations in XPathBoundaryTests. Now asserts that all boundaries are distinct for different element and attribute positions, reflecting improved or expected boundary disambiguation logic.
@gfs gfs requested a review from Copilot September 9, 2025 19:37

This comment was marked as outdated.

gfs added 3 commits September 9, 2025 12:48
Extracted line-info and fallback XML boundary detection logic into separate methods for improved readability and maintainability. The main processing loop now delegates to these helpers, reducing code duplication and clarifying the boundary detection flow for XML elements and attributes.
Eliminated the unnecessary 'using System.Collections.Generic;' directive from XPathComprehensiveTests.cs to clean up unused imports.
Updated the BuildRule method in XPathBoundaryTests to construct rules using the object model instead of JSON strings. Added support for parsing XPath namespaces from JSON and mapping pattern type strings to the PatternType enum for improved test clarity and maintainability.
@gfs gfs requested a review from Copilot September 9, 2025 19:51

This comment was marked as outdated.

Enhanced attribute value boundary detection to handle namespaced attributes by checking both full and local names. Removed the TryProcessFallbackMethods method and its usage, simplifying the code and relying on improved primary detection logic.
@gfs gfs requested a review from Copilot September 9, 2025 20:10

This comment was marked as outdated.

gfs added 2 commits September 9, 2025 13:22
Introduced RuleTestHelpers to centralize and simplify the creation of RuleProcessor and RuleSet instances in test code. Updated XPathBoundaryTests and XPathComprehensiveTests to use these helpers, reducing duplication and improving maintainability.
Replaces hardcoded numeric values for XML/XPath boundary search windows with named constants. This improves code readability and maintainability, and provides context for tuning performance and match accuracy.
@gfs gfs requested a review from Copilot September 9, 2025 20:27

This comment was marked as outdated.

Enhanced error handling in TextContainer by logging detailed trace messages when adding XML namespace mappings fails. Also, removed outdated comments from XPathBoundaryTests to clean up the test code.
@gfs gfs requested a review from Copilot September 9, 2025 20:34

This comment was marked as outdated.

gfs added 3 commits September 9, 2025 13:37
Introduced a local variable 'targetVal' to store the value as a Span for reuse in the attribute value comparison loop, improving code clarity and efficiency.
Improves robustness by validating line and column numbers before mapping XML line info to content indices. Prevents out-of-bounds errors and ensures invalid or out-of-range line/column values are handled gracefully.
Introduces unit tests to verify that the rule processor handles malformed or out-of-range XML line and column information gracefully, ensuring correct fallback logic and value extraction without exceptions.
@gfs gfs requested a review from Copilot September 9, 2025 20:48
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This pull request significantly refactors the XML/XPath handling in the TextContainer class to improve boundary calculation accuracy and fix issues with XPath query result mapping. The primary goal is to resolve GitHub issue #621 where sample and excerpt extraction wasn't working properly with XPath.

Key changes:

  • Migrated from XPathDocument to XDocument for better line info support and XML manipulation capabilities
  • Implemented sophisticated boundary calculation logic using line information and fallback strategies
  • Enhanced namespace handling for XPath queries with both document-declared and caller-supplied namespaces

Reviewed Changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
AppInspector.RulesEngine/TextContainer.cs Complete rewrite of XPath processing logic with new XDocument-based implementation and precise boundary calculation
AppInspector.Tests/RuleProcessor/XmlAndJsonTests.cs Added comprehensive test for XPath boundary accuracy and updated existing namespace tests
AppInspector.Tests/RuleProcessor/XPathBoundaryTests.cs New dedicated test file for validating boundary calculation edge cases
AppInspector.Tests/RuleProcessor/XPathComprehensiveTests.cs New comprehensive test suite covering various XPath scenarios
AppInspector.Tests/RuleProcessor/XPathMalformedLineInfoTests.cs New test file for defensive handling of malformed XML line info
AppInspector.Tests/RuleProcessor/RuleTestHelpers.cs New helper class for creating test rules and processors

Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

Comment on lines +33 to +41
// Window/offset constants used for XML/XPath boundary reconstruction.
// These constrain local searches to avoid scanning entire large documents while
// remaining generous enough for typical formatting (attributes spread across a tag line,
// element content near reported line info, and a broader fallback when line info is
// imperfect). Adjust cautiously as they directly impact performance and match accuracy.
private const int XmlAttributeSearchWindow = 400; // Max chars to scan forward for an attribute value on the same line
private const int XmlElementSearchLookBehind = 100; // Chars to look back from reported line position when locating element start
private const int XmlElementSearchLookAhead = 500; // Chars to look forward from reported line position for element content
private const int XmlElementFallbackWindow = 800; // Fallback scan window when precise line-info based match fails
Copy link

Copilot AI Sep 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These magic numbers should be made configurable or moved to a configuration class. Hard-coded constants make it difficult to tune performance for different XML document sizes and structures without recompiling.

Copilot uses AI. Check for mistakes.
var line = li.LineNumber; // 1-based per IXmlLineInfo
var col = li.LinePosition; // 1-based per IXmlLineInfo
// Defensive: ensure line index is within our sentinel-based one-indexed arrays
if (line <= 0 || line >= LineStarts.Count)
Copy link

Copilot AI Sep 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The condition line >= LineStarts.Count should be line > LineStarts.Count - 1 or use >= with the proper array bounds check. Since LineStarts is 1-indexed with a sentinel at position 0, valid line numbers should be 1 to LineStarts.Count-1.

Copilot uses AI. Check for mistakes.
Comment on lines +22 to +29
private static object CreateLineInfoAnnotation(int line, int col)
{
var annotationType = typeof(XObject).Assembly.GetType("System.Xml.Linq.LineInfoAnnotation")
?? throw new InvalidOperationException("LineInfoAnnotation type not found.");
var inst = Activator.CreateInstance(annotationType, line, col)
?? throw new InvalidOperationException("Failed to create LineInfoAnnotation instance.");
return inst;
}
Copy link

Copilot AI Sep 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method uses reflection to access internal types which creates brittle code that may break with .NET updates. Consider using public APIs or mocking frameworks instead of relying on internal implementation details.

Copilot uses AI. Check for mistakes.
Comment on lines +48 to +50
var removeMethod = typeof(System.Xml.Linq.XObject)
.GetMethods(BindingFlags.Instance | BindingFlags.Public)
.FirstOrDefault(m => m.Name == "RemoveAnnotations" && m.IsGenericMethod && m.GetParameters().Length == 0);
Copy link

Copilot AI Sep 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using reflection to find and invoke methods by name is fragile and may break with framework updates. Consider using a more stable approach or documenting the .NET version dependency.

Copilot uses AI. Check for mistakes.
Comment on lines 148 to 151
var attrAIndex = xml.IndexOf("a=\"DupVal\"") + "a=\"".Length; // Should be around 44
var attrBIndex = xml.IndexOf("b=\"DupVal\"") + "b=\"".Length; // Should be around 55
var attrCIndex = xml.IndexOf("c=\"DupVal\"") + "c=\"".Length; // Should be around 66
var elemIndex = xml.IndexOf(">DupVal<") + 1; // Should be around 73
Copy link

Copilot AI Sep 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These variables are calculated but never used in the test assertions. Either use these expected indices in the test validation or remove the unused variables to avoid confusion.

Copilot uses AI. Check for mistakes.
Eliminated unused index calculations and an unnecessary assertion in the test. This streamlines the test code and removes dead code that did not contribute to the test's outcome.
@gfs gfs marked this pull request as ready for review September 9, 2025 20:53
@gfs gfs merged commit 8d41b1f into main Sep 10, 2025
14 checks passed
@gfs gfs deleted the gfs/FixXpathBug branch September 10, 2025 20:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants