Duo Context Exclusion
## Background
As we expand the context and content available to Duo, we need to provide customer controls for excluding sensitive files/content from Duo features and supporting models. Customers may have sensitive files that should not be processed by, or used as input to, LLMs and embeddings models.
**References**
* [Additional background and architecture proposal](https://handbook.gitlab.com/handbook/engineering/architecture/design-documents/ai_context_management/)
* [Additional MR discussion](https://gitlab.com/gitlab-org/editor-extensions/gitlab-lsp/-/merge_requests/837)
* [Related feature request](https://gitlab.com/gitlab-org/editor-extensions/gitlab-jetbrains-plugin/-/issues/839)
## Main goal
Allow customers to enforce their security/privacy policy by controlling the content that is used within Duo. This supports messaging that assures customers that excluded files/context are not processed by any Duo LLM or supporting model.
This provides a very strong data privacy and data retention position:
* Each customer can preclude content from ever being processed by Duo
* For content that is processed by Duo, we maintain zero-day data retention
## MVC Proposal
### **Functional summary**
* Files are available for AI context by default, unless otherwise specified.
* At the project level, an administrator can:
  * Configure paths to exclude from AI context
    * This could include a specific file, a directory, a file extension, etc.
  * Configure paths to include for AI context
    * e.g. Exclude a folder, but include 2 specific files in that folder.
    * This could include a specific file, a directory, a file extension, etc.
* All files are excluded when a project has [Duo turned off](https://docs.gitlab.com/ee/user/gitlab_duo/turn_on_off.html).
* The exclusion policy should also be enforced for GitLab Duo with Amazon Q, with the same behavior as GitLab Duo.
* The exclusion policy should be enforced at the customer level (i.e. instance or top-level namespace) rather than the user level.
  * As a potential example, content should be uniformly excluded even if a customer has a mix of Duo Pro/Enterprise users and Duo Core users.
* \[Nice to have but not a strict MVC requirement\] We also exclude any paths specified in `.gitignore`
* \[Nice to have but not a strict MVC requirement\] Updating files when the exclusion configuration is updated:
  * If files are embedded/stored, and Duo is turned off for the project, then we should remove the files from the Duo data store.
  * If files are embedded/stored, and those files are added to the exclude configuration, then we should remove the files from the Duo data store.
  * The removal doesn't need to be instantaneous, but we should aim for no more than 30 minutes to apply the change.
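
The include/exclude rules above could be evaluated roughly as follows. This is a minimal sketch, not a proposed implementation: `ContextPolicy`, its field names, and `is_available_for_ai_context` are hypothetical, the glob syntax (Python's `fnmatch`) is an assumption because the proposal does not specify a pattern language, and the precedence (an include rule overrides an exclude rule) is inferred from the "exclude a folder, but include 2 specific files" example.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase


@dataclass(frozen=True)
class ContextPolicy:
    """Hypothetical per-project policy; names are illustrative only."""
    duo_enabled: bool = True
    exclude: tuple[str, ...] = ()  # paths/patterns to exclude from AI context
    include: tuple[str, ...] = ()  # paths/patterns to re-include


def _matches_any(path: str, patterns: tuple[str, ...]) -> bool:
    # Note: fnmatchcase lets "*" match any characters, including "/".
    return any(fnmatchcase(path, pattern) for pattern in patterns)


def is_available_for_ai_context(path: str, policy: ContextPolicy) -> bool:
    if not policy.duo_enabled:
        return False  # Duo turned off: all files are excluded.
    if _matches_any(path, policy.include):
        return True  # Assumed precedence: include overrides exclude.
    # Default: files are available unless an exclude rule matches.
    return not _matches_any(path, policy.exclude)


# Exclude a folder and a file extension, but keep one file in that folder.
policy = ContextPolicy(
    exclude=("secrets/*", "*.pem"),
    include=("secrets/README.md",),
)
```

With this policy, `src/app.py` and `secrets/README.md` remain available, while `secrets/token.txt` and `certs/server.pem` are excluded. A real implementation would also need to decide how directory and extension rules are expressed in the settings UI.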
### **Excluded files behavior**
* Content from excluded files is not sent to an LLM or embeddings model.
* Duo Chat is not supported for excluded files.
* Code Suggestions are not supported within excluded files.
* Content in excluded files won't be used to inform code completion suggestions in other files.
  * This includes both open tabs context and imports context.
* Content from excluded files is not embedded and stored.
* Generally, no Duo feature should use content from excluded files.
* Edge case: Duo is enabled but all or most files are excluded - Duo will be ineffective. No specific requirement here but we could consider in-product messaging if this is common.
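
One way to picture the behavior above is a filter that runs before any request is assembled, so excluded content never reaches an LLM or embeddings model. The sketch below assumes a hypothetical `ContextItem` shape and a caller-supplied `is_excluded` predicate; neither is an existing Duo API.

```python
from typing import Callable, Iterable, NamedTuple


class ContextItem(NamedTuple):
    """Hypothetical candidate piece of context for a Duo request."""
    path: str
    source: str  # e.g. "open_tab" or "import"
    content: str


def filter_context(
    items: Iterable[ContextItem],
    is_excluded: Callable[[str], bool],
) -> tuple[list[ContextItem], list[str]]:
    """Split candidates into items that may be sent and paths that were dropped."""
    allowed: list[ContextItem] = []
    dropped: list[str] = []
    for item in items:
        if is_excluded(item.path):
            dropped.append(item.path)  # Content is discarded, only the path is kept.
        else:
            allowed.append(item)
    return allowed, dropped


candidates = [
    ContextItem("src/app.py", "open_tab", "print('hi')"),
    ContextItem("secrets/token.txt", "open_tab", "s3cr3t"),
    ContextItem("secrets/keys.py", "import", "KEY = '...'"),
]
# Illustrative predicate: everything under secrets/ is excluded.
allowed, dropped = filter_context(candidates, lambda p: p.startswith("secrets/"))
```

Here only `src/app.py` survives, and the two dropped paths can be used for user-facing messaging without exposing their content.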
**Full list of features that should enforce content exclusion policy**
<table>
<tr>
<th>Feature category</th>
<th>Feature</th>
<th>Unit primitive</th>
</tr>
<tr>
<td>Code Suggestions</td>
<td>
[Code generation](https://docs.gitlab.com/user/project/repository/code_suggestions/#code-completion-and-generation)
</td>
<td>generate_code</td>
</tr>
<tr>
<td>Code Suggestions</td>
<td>
[Code completion](https://docs.gitlab.com/user/project/repository/code_suggestions/#code-completion-and-generation)
</td>
<td>complete_code</td>
</tr>
<tr>
<td>Chat</td>
<td>
[/include file](https://docs.gitlab.com/user/gitlab_duo_chat/examples/#ask-about-specific-files-in-the-ide)
</td>
<td>include_file_context</td>
</tr>
<tr>
<td>Chat</td>
<td>/include merge request</td>
<td>include_merge_request_context</td>
</tr>
<tr>
<td>Chat</td>
<td>/include directory</td>
<td>include_directory_context</td>
</tr>
<tr>
<td>Chat</td>
<td>/include repository</td>
<td>include_repository_context</td>
</tr>
<tr>
<td>Chat</td>
<td>
[/fix](https://docs.gitlab.com/user/gitlab_duo_chat/examples/#fix-code-in-the-ide)
</td>
<td>fix_code</td>
</tr>
<tr>
<td>Chat</td>
<td>
[/refactor](https://docs.gitlab.com/user/gitlab_duo_chat/examples/#refactor-code-in-the-ide)
</td>
<td>refactor_code</td>
</tr>
<tr>
<td>Chat</td>
<td>
[/test](https://docs.gitlab.com/user/gitlab_duo_chat/examples/#write-tests-in-the-ide)
</td>
<td>write_tests</td>
</tr>
<tr>
<td>Chat</td>
<td>
[/explain](https://docs.gitlab.com/user/gitlab_duo_chat/examples/#explain-selected-code)
</td>
<td>
explain_code
include_terminal_context
</td>
</tr>
<tr>
<td>Chat</td>
<td>
[Ask about file](https://docs.gitlab.com/user/gitlab_duo_chat/#in-the-gitlab-ui)
</td>
<td>n/a</td>
</tr>
<tr>
<td>Chat</td>
<td>
[Ask about merge request](https://docs.gitlab.com/user/gitlab_duo_chat/examples/#ask-about-a-specific-merge-request)
</td>
<td>ask_merge_request</td>
</tr>
<tr>
<td>Duo Workflow</td>
<td>
[Duo Workflow](https://docs.gitlab.com/user/duo_workflow/)
</td>
<td>duo_workflow_execute_workflow</td>
</tr>
<tr>
<td>Duo Code Review</td>
<td>
[Duo Code Review](https://docs.gitlab.com/user/project/merge_requests/duo_in_merge_requests/#have-gitlab-duo-review-your-code)
</td>
<td>review_merge_request</td>
</tr>
<tr>
<td>Sec. Vulnerability</td>
<td>
[Vulnerability resolution](https://docs.gitlab.com/user/application_security/vulnerabilities/#vulnerability-resolution)
</td>
<td>resolve_vulnerability</td>
</tr>
<tr>
<td>Summarization</td>
<td>
[Generate merge commit message](https://docs.gitlab.com/user/project/merge_requests/duo_in_merge_requests/#generate-a-merge-commit-message)
</td>
<td>generate_commit_message</td>
</tr>
<tr>
<td>Summarization</td>
<td>
[Generate merge request description](https://docs.gitlab.com/user/project/merge_requests/duo_in_merge_requests/#generate-a-description-by-summarizing-code-changes)
</td>
<td>summarize_new_merge_request</td>
</tr>
<tr>
<td>Tools</td>
<td>Embeddings for codebase semantic search</td>
<td>generate_embeddings_codebase</td>
</tr>
<tr>
<td>Tools</td>
<td>Codebase semantic search</td>
<td>codebase_search</td>
</tr>
</table>
**Proposed UX treatments**
* IDE should display the disabled Tanuki icon when the open and active file is excluded.
* If the user submits a Chat prompt for an excluded file, Chat should respond: `Duo does not have access to this file due to an active content exclusion policy.`
  * This could include `/fix`, `/refactor`, `/explain`, and `/test`.
  * This could include a prompt such as "summarize this file".
* Excluded files are displayed but disabled within the `/include` selection menu, with an info icon to communicate the file status.
  * Info icon hover text: `Duo does not have access to this file due to an active content exclusion policy.`
* These features return an exclusion message within their response when one or more relevant files were excluded. The message should be `Duo could not access these files due to an active content exclusion policy: filename1.ext filename2.ext ...`
  * Duo Code Review
  * Duo Workflow
  * Vulnerability resolution
  * Ask about a merge request
  * Generate a merge commit message
  * Generate merge request description
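
The multi-file exclusion message above could be assembled along these lines. This is a sketch only: the function name `exclusion_message` is hypothetical, and de-duplicating paths and returning `None` when nothing was excluded are assumptions rather than stated requirements.

```python
from typing import Iterable, Optional

# Message text taken from the proposed UX treatment.
EXCLUSION_PREFIX = (
    "Duo could not access these files due to an active content exclusion policy:"
)


def exclusion_message(excluded_paths: Iterable[str]) -> Optional[str]:
    """Return the message to append to a response, or None if nothing was excluded."""
    # dict.fromkeys removes duplicates while preserving first-seen order.
    unique_paths = list(dict.fromkeys(excluded_paths))
    if not unique_paths:
        return None
    return f"{EXCLUSION_PREFIX} {' '.join(unique_paths)}"


message = exclusion_message(["config/prod.yml", "keys/id_rsa", "config/prod.yml"])
```

For the call above, the message lists `config/prod.yml` and `keys/id_rsa` once each, separated by a space as in the proposed wording.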
**Edge case**
* We can't reasonably stop a user from copying and pasting the entire contents of a restricted file into Chat.
  * e.g. Open file, copy all code, paste into Chat along with question/task
### Tier availability and deployment options
**Supported Duo add-ons**
* Duo Core :x:
* Duo Pro :white_check_mark:
* Duo Enterprise :white_check_mark:
**Supported deployment options**
* .com :white_check_mark:
* Dedicated :white_check_mark:
* Self-Managed :white_check_mark:
* Self-hosted models :white_check_mark:
### **Telemetry**
* We can measure the number of customers using a non-default AI context policy
* We can measure the number of projects using a non-default AI context policy
### **Potential future iterations**
* Automated validation of correct policy configuration
* Manage policy at group level
## For discussion
I propose that we use a UI-based settings affordance rather than an ignore file stored in each repository. This is more consistent with our current direction for [custom rules management](https://gitlab.com/groups/gitlab-org/-/epics/17685), and I prefer consistent interaction patterns where possible. We can discuss whether there are advantages to storing a context policy file in each repository instead of using a UI interaction.
A helpful comparison of file-based vs UI-based pros and cons with respect to rules: https://gitlab.com/groups/gitlab-org/-/epics/17685#note_2493440891
## Metrics
The metrics focus on adoption and on measuring a shift in projects moving from Duo-disabled to Duo-enabled with some files excluded. We believe that there will be fewer projects with Duo turned completely off and more projects where specific files or file extensions are excluded. As a prerequisite to rollout, we can baseline the number of projects where Duo is supported but disabled.
**Adoption**
* % of customers using AI context inclusion/exclusion
* % of projects using AI context inclusion/exclusion
**Behavior change**
* Reduced % of Duo-disabled projects
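
As a rough illustration of how these metrics could be computed from a baseline and a later measurement, consider the sketch below. The `Snapshot` fields are hypothetical, the denominator (projects where Duo is supported) is an assumption, and the numbers are made up purely for the arithmetic; they are not real data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    """Hypothetical point-in-time counts; field names are illustrative."""
    duo_supported_projects: int  # projects where Duo is supported
    duo_disabled_projects: int   # of those, projects with Duo turned off
    projects_with_policy: int    # of those, projects using inclusion/exclusion


def percentage(part: int, whole: int) -> float:
    # Guard against an empty population.
    return 0.0 if whole == 0 else 100.0 * part / whole


def adoption_pct(s: Snapshot) -> float:
    return percentage(s.projects_with_policy, s.duo_supported_projects)


def disabled_pct(s: Snapshot) -> float:
    return percentage(s.duo_disabled_projects, s.duo_supported_projects)


# Made-up example values, for illustration only.
baseline = Snapshot(duo_supported_projects=200, duo_disabled_projects=50, projects_with_policy=0)
later = Snapshot(duo_supported_projects=200, duo_disabled_projects=30, projects_with_policy=40)

# Negative value means fewer Duo-disabled projects (in percentage points).
disabled_change = disabled_pct(later) - disabled_pct(baseline)
```

With these illustrative counts, adoption is 20% of projects and the share of Duo-disabled projects falls from 25% to 15%, a change of 10 percentage points.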
## Appendix
There is a prior [AI Context management proposal](https://handbook.gitlab.com/handbook/engineering/architecture/design-documents/ai_context_management/#suggested-iterative-implementation-plan) that may be a useful reference to inform the implementation.