Conversation
Summary of Changes

Hello @mizhm, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request focuses on enhancing data integrity and streamlining data processing within the core API. It introduces deduplication mechanisms for assets, ports, and vulnerabilities, preventing redundant entries and improving database efficiency. Additionally, it refines the handling of job payload results by setting clearer default values and optimizing data transformation, contributing to more predictable and reliable data flows. A minor refactoring in the tools service also contributes to overall code cleanliness.
Code Review
This pull request introduces several valuable improvements. The data deduplication logic added to data-adapter.service.ts is a solid enhancement for data integrity, preventing issues from duplicate entries. The changes in jobs-registry.dto.ts make the data transfer object more robust and flexible. Additionally, the refactoring in tools.service.ts simplifies the code and resolves a bug related to fetching tools. Overall, these are high-quality changes. I have one suggestion to further improve code style.
Code Review
This pull request introduces data deduplication in data-adapter.service.ts before database insertion for subdomains, ports, and vulnerabilities. This is a good practice to prevent errors and ensure data consistency. The PR also refactors the getToolByNames method in tools.service.ts to fix a bug. While the refactoring is an improvement, I've found a logic issue in the new implementation when isInstalled is true, where it fails to include built-in tools. I've provided a suggestion to correct this behavior.
Pull request overview
This PR addresses issues in the job registry by refactoring tool lookup logic and adding deduplication to prevent duplicate data insertion.
Changes:
- Refactored the getToolByNames method in tools.service.ts to change how installed tools are queried
- Added deduplication logic in data-adapter.service.ts for assets, ports, and vulnerabilities to prevent duplicate entries
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| core-api/src/modules/tools/tools.service.ts | Rewrote the tool lookup logic when isInstalled=true, removing workspace repository queries and built-in tools concatenation in favor of a direct query with type filtering |
| core-api/src/modules/data-adapter/data-adapter.service.ts | Added deduplication using Map for assets (by value), ports (by Set), and vulnerabilities (by fingerprint) before database insertion |
```typescript
// Deduplicate based on fingerprint
const uniqueValues = Array.from(
  new Map(values.map((v) => [v.fingerprint, v])).values(),
);
```
The deduplication logic added here is not covered by tests. While the existing tests in data-adapter.service.spec.ts verify the basic vulnerability insertion flow, they don't test the scenario where duplicate vulnerabilities (with the same fingerprint) are provided in the input data.
Consider adding a test case that verifies when the input data contains multiple vulnerabilities with the same fingerprint (derived from the same name, asset ID, and tool ID), only one is inserted.
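That scenario can be sketched in isolation as follows (the row shape and fingerprint format are illustrative assumptions, not the service's real entities):

```typescript
// Illustrative vulnerability rows; a real fingerprint would be derived
// from the vulnerability name, asset ID, and tool ID.
interface VulnRow {
  fingerprint: string;
  name: string;
}

const values: VulnRow[] = [
  { fingerprint: "sqli:asset-1:tool-1", name: "SQL Injection" },
  { fingerprint: "sqli:asset-1:tool-1", name: "SQL Injection" }, // duplicate
  { fingerprint: "xss:asset-1:tool-1", name: "XSS" },
];

// The same Map-based deduplication the PR adds: one row per fingerprint.
const uniqueValues = Array.from(
  new Map(values.map((v) => [v.fingerprint, v])).values(),
);

console.log(uniqueValues.length); // 2
```

A unit test along these lines would feed such an input through the service and assert that only one row per fingerprint reaches the insert.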
```typescript
// Deduplicate data based on value
const uniqueData = Array.from(
  new Map(data.map((asset) => [asset.value, asset])).values(),
);
```
While this deduplication approach using Map is correct, there's a subtle behavior to note: when duplicate values exist in the input array, the Map will keep the last occurrence (since subsequent set operations with the same key overwrite previous values). This means if there are multiple assets with the same value but different properties (like dnsRecords), only the properties from the last occurrence will be retained.
Consider whether this is the intended behavior. If it is, this should be documented. If not, you may want to implement merge logic or keep the first occurrence instead by checking if the key already exists before adding it to the Map.
Suggested change:

```diff
- // Deduplicate data based on value
- const uniqueData = Array.from(
-   new Map(data.map((asset) => [asset.value, asset])).values(),
- );
+ // Deduplicate data based on value, keeping the first occurrence for each value
+ const assetByValue = new Map<string, Asset>();
+ for (const asset of data) {
+   if (!assetByValue.has(asset.value)) {
+     assetByValue.set(asset.value, asset);
+   }
+ }
+ const uniqueData = Array.from(assetByValue.values());
```
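The behavioral difference between the two approaches can be demonstrated in isolation (the asset shape here is an illustrative stand-in for the PR's entities):

```typescript
// Two records with the same value but different properties.
interface AssetRow {
  value: string;
  dnsRecords?: string[];
}

const data: AssetRow[] = [
  { value: "a.example.com", dnsRecords: ["A 1.2.3.4"] },
  { value: "a.example.com" }, // later duplicate without dnsRecords
];

// Map-based dedup: Map#set overwrites, so the LAST occurrence wins
// and the first record's dnsRecords are lost.
const lastWins = Array.from(new Map(data.map((a) => [a.value, a])).values());

// First-wins variant: only insert a key if it is not already present.
const firstWins = new Map<string, AssetRow>();
for (const a of data) {
  if (!firstWins.has(a.value)) firstWins.set(a.value, a);
}

console.log(lastWins[0].dnsRecords); // undefined
console.log(firstWins.get("a.example.com")?.dnsRecords); // [ 'A 1.2.3.4' ]
```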
```typescript
// Deduplicate data based on value
const uniqueData = Array.from(
  new Map(data.map((asset) => [asset.value, asset])).values(),
);
```
When the input data array is empty, the deduplication will produce an empty uniqueData array. This could lead to issues downstream:
- On line 64, uniqueData.find() will return undefined when the array is empty
- This causes primaryAssets?.dnsRecords on line 73 to be undefined, which is handled safely by the optional chaining
- On lines 89-93, mapping over an empty array will produce an empty values array for the insert
While the code handles the empty array case without crashing, the database insert with an empty values array might behave unexpectedly depending on the TypeORM version. Consider adding an early return or explicit check if data.length === 0 or uniqueData.length === 0 to make the intent clearer and avoid unnecessary database operations.
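A minimal sketch of the suggested guard (the helper name and return shape are hypothetical; the real method operates on richer asset entities):

```typescript
// Hypothetical helper illustrating the early-return guard.
function dedupeAssets<T extends { value: string }>(data: T[]): T[] {
  // Make the empty-input case explicit and skip any downstream insert.
  if (data.length === 0) return [];
  return Array.from(new Map(data.map((a) => [a.value, a])).values());
}

console.log(dedupeAssets([]).length); // 0
console.log(dedupeAssets([{ value: "x" }, { value: "x" }]).length); // 1
```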
```typescript
workspaceTools: {
  tool: {
    name: In(names),
  },
},
```
This query structure is incorrect. The workspaceTools relation on the Tool entity is a collection of WorkspaceTool records, where each WorkspaceTool has a tool property pointing back to the Tool. The nested workspaceTools.tool.name filter doesn't make sense in this context.
Additionally, this query is missing a critical parameter: it needs a workspaceId to determine which workspace to check for installed tools. Without the workspace context, this query will return any tool that has ANY workspace association, not just the specific workspace the caller is interested in.
The original implementation queried the workspaceToolRepository with workspace context and then mapped to get the tools, which was the correct approach. To fix this properly:

- Add a workspaceId parameter to the method signature
- Query the workspaceToolRepository to find tools installed in that specific workspace
- Combine those results with built-in tools (which should be available in all workspaces)
```diff
+ return await this.toolsRepository.find({
+   where: [
+     {
+       name: In(names),
+       type: WorkerType.BUILT_IN,
+     },
+     {
+       workspaceTools: {
+         tool: {
+           name: In(names),
+         },
+       },
-     relations: ['tool'],
-   })
-   .then((res) => res.map((r) => r.tool));
- }
-
- tools = await this.toolsRepository.find({
-   where: {
-     name: In(names),
-   },
+     },
+   ],
+ });
```
The removal of the built-in tools concatenation logic is a breaking change. The original code explicitly added tools from the builtInTools array that matched the requested names. This ensured that built-in tools were always available regardless of database state.
While the new code attempts to query for BUILT_IN type tools from the database, there's a fundamental difference: the original builtInTools array may contain tools with additional runtime properties (like parser functions) that aren't persisted to the database. Removing this concatenation could break functionality that depends on these runtime-only properties.
If the intent is to rely solely on database records, ensure that all necessary tool properties are properly persisted during the onModuleInit phase and that no runtime-only properties are required by consumers of this method.
```diff
+ return await this.toolsRepository.find({
+   where: [
+     {
+       name: In(names),
+       type: WorkerType.BUILT_IN,
+     },
+     {
+       workspaceTools: {
+         tool: {
+           name: In(names),
+         },
+       },
-     relations: ['tool'],
-   })
-   .then((res) => res.map((r) => r.tool));
- }
-
- tools = await this.toolsRepository.find({
-   where: {
-     name: In(names),
-   },
+     },
+   ],
+ });
```
The query is missing the relations configuration needed to load the workspaceTools relation. In TypeORM, when using nested relation paths in the where clause (like workspaceTools.tool.name), you must also explicitly specify those relations in the query configuration. Without this, TypeORM won't join the related tables and the query will fail or return no results.
Even if the query structure is corrected, you'll need to add something like relations: ['workspaceTools'] to the find options.
```typescript
// Deduplicate data based on value
const uniqueData = Array.from(
  new Map(data.map((asset) => [asset.value, asset])).values(),
);
```
The deduplication logic added here is not covered by tests. While the existing test in data-adapter.service.spec.ts verifies the basic subdomain insertion flow, it doesn't test the scenario where duplicate assets (with the same value) are provided in the input data.
Consider adding a test case that verifies:
- When the input data contains multiple assets with the same value, only one is inserted
- The deduplicated asset is the correct one (e.g., the last one in case of conflicts)
```typescript
// Deduplicate ports
const uniquePorts = [...new Set(data.filter((port) => !isNaN(port)))];
```
The deduplication logic added here is not covered by tests. While the existing tests in data-adapter.service.spec.ts verify the basic port scanner flow, they don't test the scenario where duplicate ports are provided in the input data.
Consider adding a test case that verifies when the input data contains duplicate port numbers, only unique ports are inserted and used to create asset services.
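The behavior under test can be exercised directly with the PR's own expression (the sample port values are illustrative):

```typescript
// The Set-based deduplication the PR adds, run on input containing
// duplicates and a NaN entry (e.g. from a failed parse).
const data: number[] = [80, 443, 80, Number.NaN, 8080, 443];
const uniquePorts = [...new Set(data.filter((port) => !isNaN(port)))];

console.log(uniquePorts); // [ 80, 443, 8080 ]
```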
No description provided.