
feat: semantic search for large repos vector store toolkit #23

Closed
michaelneale wants to merge 27 commits into main from vector_store

Conversation

@michaelneale
Collaborator

@michaelneale michaelneale commented Aug 28, 2024

This uses sentence transformers and embeddings to create a simple vector database that allows semantic search of large codebases, to help goose navigate around them.
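As a rough illustration of the idea (not the code in this PR), the sketch below indexes files as vectors and ranks them by cosine similarity against a question. The `embed` function here is a toy word-count stand-in for a sentence-transformers model, and the file names and contents are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (a stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(files: dict[str, str]) -> dict[str, Counter]:
    """Embed every file once; this plays the role of the vector database."""
    return {path: embed(content) for path, content in files.items()}


def query(index: dict[str, Counter], question: str, top_k: int = 3) -> list[str]:
    """Return the paths whose embeddings are closest to the question."""
    q = embed(question)
    ranked = sorted(index, key=lambda path: cosine(index[path], q), reverse=True)
    return ranked[:top_k]


# Hypothetical file contents, for illustration only.
files = {
    "auth/login.py": "def login(user, password): check password and create session",
    "billing/invoice.py": "def create_invoice(customer, amount): total tax invoice",
    "README.md": "project overview and setup instructions",
}
index = build_index(files)
best = query(index, "where is the password check for login", top_k=1)
```

A real embedding model would also match questions that share no literal words with the file, which is the point of semantic search; the word-count version only shows the index-then-rank shape.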

model info:

To test:

uv run goose session start --profile vector

with a ~/.config/goose/profiles.yaml with:

vector:
  provider: openai
  processor: gpt-4o
  accelerator: gpt-4o-mini
  moderator: truncate
  toolkits:
  - name: developer
    requires: {}
  - name: vector
    requires: {}   

Then try some query to ask where to add a feature, or anything which you think needs a semantic match

image

@michaelneale michaelneale changed the title from "Vector store" to "semantic search for large repos: vector store" Aug 28, 2024
@michaelneale michaelneale changed the title from "semantic search for large repos: vector store" to "semantic search for large repos: vector store toolkit" Aug 29, 2024
@michaelneale michaelneale marked this pull request as ready for review August 29, 2024 21:36
@lifeizhou-ap
Collaborator

lifeizhou-ap commented Sep 2, 2024

I've tried a scenario with the toolkits with vector and without vector.

  • It seems the configuration with vector is more consistent and quicker to find the relevant files (although the first time
    it has to build the vector, the time is ok, not long). 👍

  • I saw a warning message below but I guess it should be fine? (since the vector is created by the code that the user provides)

goose/src/goose/toolkit/vector.py:115: FutureWarning: You are using `torch.load` with 
`weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct 
malicious pickle data which will execute arbitrary code during unpickling (See 
https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default 
value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. 
Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via 
`torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have
full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
  data = torch.load(db_path)

def vector_toolkit():
    return VectorToolkit(notifier=MagicMock())

def test_query_vector_db_creates_db(temp_dir, vector_toolkit):
Collaborator

can use tmp_path directly instead of temp_dir.

tmp_path is the built-in fixture in pytest. https://docs.pytest.org/en/latest/how-to/tmp_path.html#tmp-path
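A minimal sketch of what that suggestion could look like, assuming a small helper (`write_sample_repo`, a hypothetical name) that prepares files for the test. pytest passes `tmp_path` as a `pathlib.Path`, so no custom `temp_dir` fixture is needed.

```python
from pathlib import Path


def write_sample_repo(root: Path) -> Path:
    """Create a one-file 'repo' under root and return the file path."""
    source = root / "hello.py"
    source.write_text('print("Hello World")\n')
    return source


def test_sample_repo_is_created(tmp_path: Path) -> None:
    # pytest injects tmp_path: a fresh pathlib.Path directory for each test.
    source = write_sample_repo(tmp_path)
    assert source.exists()
    assert source.read_text().startswith("print")
```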

from pathlib import Path


GOOSE_GLOBAL_PATH = Path("~/.config/goose").expanduser()
Collaborator

can import GOOSE_GLOBAL_PATH from config.py

@michaelneale
Collaborator Author

@lifeizhou-ap thanks - yes good catch, it should only load weights so that warning should go away.
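For reference, a sketch of the fix being discussed, assuming the database is a dict of plain containers and tensors (the actual layout in `vector.py` is not shown in this thread, so the keys below are hypothetical):

```python
import os
import tempfile

import torch

# Hypothetical database layout: file paths plus one embedding row per file.
db = {
    "paths": ["src/a.py", "src/b.py"],
    "embeddings": torch.zeros(2, 4),
}

with tempfile.TemporaryDirectory() as tmp:
    db_path = os.path.join(tmp, "vector_db.pt")
    torch.save(db, db_path)
    # weights_only=True restricts unpickling to tensors, primitive types and
    # plain containers, so a tampered file cannot run arbitrary code on load.
    data = torch.load(db_path, weights_only=True)
```

If the saved object contained arbitrary Python classes, loading with `weights_only=True` would fail, so this only works when the database sticks to tensors and basic types.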

Comment thread pyproject.toml Outdated
Collaborator

@codefromthecrypt codefromthecrypt left a comment

Thanks for the description as it helps me understand how this works IRL

vector_toolkit.create_vector_db(temp_dir.as_posix())
query = 'print("Hello World")'
result = vector_toolkit.query_vector_db(temp_dir.as_posix(), query)
print("Query Result:", result)
Collaborator

excuse python noob.. do we want these prints? I guess they aren't visible by default, so it doesn't matter

Collaborator Author

yeah you have to run pytest in another mode to see them

Comment thread tests/toolkit/test_vector.py Outdated
temp_db_path = vector_toolkit.get_db_path(temp_dir.as_posix())
assert os.path.exists(temp_db_path)
assert os.path.getsize(temp_db_path) > 0
assert 'No embeddings available to query against' in result or '\n' in result
Collaborator

I suppose in the future, we could make an integration test with ollama for this one, or possibly an in-memory embeddings lib?

Collaborator Author

yeah - something scaled down and deterministic ideally
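One possible shape for a scaled-down, deterministic embedder in tests is sketched below. `fake_embed` is a hypothetical helper, not part of this PR: it derives a fixed vector from a hash of the text, so results are stable across runs but carry no semantic meaning.

```python
import hashlib


def fake_embed(text: str, dims: int = 8) -> list[float]:
    """Deterministic stand-in for an embedding model, for tests only.

    Hashes the text and maps the first `dims` digest bytes (at most 32)
    to floats in [0, 1]. The same text always yields the same vector.
    """
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [byte / 255 for byte in digest[:dims]]
```

A stub like this would let a test assert on exact query results without downloading a model, at the cost of not exercising real similarity behaviour.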

Collaborator

@codefromthecrypt codefromthecrypt left a comment

quick drive by

Comment thread tests/toolkit/test_vector.py Outdated
Comment thread tests/toolkit/test_vector.py Outdated
@michaelneale michaelneale changed the title from "semantic search for large repos: vector store toolkit" to "feat: semantic search for large repos vector store toolkit" Sep 12, 2024
@michaelneale
Collaborator Author

@baxen according to goose:

image

So that is not small - unfortunately an optional dependency isn't really viable for a CLI?

@michaelneale
Collaborator Author

Going to have a look at some lightweight options here and, failing that, I will make this optional and validate that (and likely merge it after that point).
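One way the "optional and validated" route could work is a guard that checks for the heavy package before the toolkit is used. This is only a sketch: `is_installed` and `require_optional` are hypothetical helpers, and the extra name passed in would be whatever the project defines.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be found, without importing it."""
    return importlib.util.find_spec(module_name) is not None


def require_optional(module_name: str, extra: str) -> None:
    """Raise a readable error when an optional dependency is missing."""
    if not is_installed(module_name):
        raise RuntimeError(
            f"The '{module_name}' package is required for this toolkit. "
            f"Install the optional '{extra}' dependencies to enable it."
        )
```

Calling `require_optional("sentence_transformers", "vector")` when the toolkit is loaded would give users a clear message instead of an import traceback.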

@michaelneale
Collaborator Author

hey @baxen how does this look with optional deps now?

@ahau-square
Contributor

A few thoughts

  • Code embedding search seems like a promising direction to pursue
  • We should consider and test different chunking strategies - embedding code snippets (e.g., classes/functions) rather than, or in addition to, whole code files to get more pinpointed search
  • Probably worth benchmarking and evaluating the embedding models against alternatives e.g., ones specifically for code (https://huggingface.co/Salesforce/codet5p-110m-embedding, https://huggingface.co/bigcode/starencoder)
  • Why limit to models that can be run locally vs. use hosted models like the OpenAI embeddings API or potentially others that Block hosts e.g., through the Databricks model gateway?
  • Is the future idea to eventually have a vector store of code embeddings for each repo and have them be updated on merge? Doing so might lend itself to a better experience of not having to wait for your embeddings to compute.
  • From a UX perspective - I don't know how useful identifying similar files on their own is - but similar files fed in as context to a ChatGPT/Claude for someone to then ask questions over or generate code based on could be very useful
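To make the chunking suggestion concrete, here is a minimal sketch for Python sources that splits a file into one chunk per top-level function or class using the standard `ast` module. It is an illustration of the strategy, not something in this PR; other languages would need their own parsers, and the sample source is made up.

```python
import ast


def chunk_source(source: str) -> list[tuple[str, str]]:
    """Split Python source into (name, code) chunks, one per top-level def or class."""
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # get_source_segment returns the exact source text of the node.
            chunks.append((node.name, ast.get_source_segment(source, node)))
    return chunks


# Hypothetical sample file, for illustration only.
sample = '''
import os

def load(path):
    return open(path).read()

class Index:
    def add(self, item):
        pass
'''
chunks = chunk_source(sample)
names = [name for name, _ in chunks]
```

Each chunk could then be embedded separately, so a query can point at a specific function rather than a whole file.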

@michaelneale
Collaborator Author

@ahau-square

From a UX perspective - I don't know how useful identifying similar files on their own is - but similar files fed in as context to a ChatGPT/Claude for someone to then ask questions over or generate code based on could be very useful

That is exactly what this aims to do in a simple way - that is all that is needed (the toolkit isn't for end users to see - but to help goose find where to look which is then used as context).

I think the future idea would be for the embeddings to change (but they aren't meant to be search, so for a relatively stable codebase it isn't a huge deal). We could certainly run it with other models and approaches, but the idea of a toolkit is that you can use it or not. I would also like to have something that is "batteries included" for goose, whether it is this approach or another, because as it stands I think goose needs help finding code to work on.

@michaelneale
Collaborator Author

This approach with local model(s) works quite well, but it is a hefty dependency addition to goose. Remote/server-based embeddings and search is one option (but very specific to the provider and probably more work to maintain across providers; I'm not sure of the exact benefit yet). Another approach is to use tools like rq but with fuzzy searching plus some pre-expansion of a question into related terms: for example, if you search for "intellisense", the accelerator model could expand that to "content assist... completion" etc. (as per the user's intent) and then do a more keyword-like search for those terms (Porter stemming would be the old way, but with accelerator models I think we can do better). It won't be as good for code-specific comprehension though, so I still like the idea of a local ephemeral embeddings/vector and indexing system or service.
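A small sketch of the expand-then-keyword-search idea, under the assumption that the expansion step is a lookup table. In the approach described above the accelerator model would produce the related terms; the `EXPANSIONS` table and file contents here are hypothetical stand-ins.

```python
# Hypothetical expansion table; in the idea above, the accelerator model
# would generate these related terms from the user's question.
EXPANSIONS = {
    "intellisense": ["content assist", "completion"],
}


def expand(term: str) -> list[str]:
    """Return the term plus any related terms, all lowercased."""
    term = term.lower()
    return [term] + EXPANSIONS.get(term, [])


def keyword_search(files: dict[str, str], term: str) -> list[str]:
    """Return paths whose content mentions the term or any expansion of it."""
    needles = expand(term)
    return sorted(
        path
        for path, content in files.items()
        if any(needle in content.lower() for needle in needles)
    )


files = {
    "editor/completion.py": "Provides code completion suggestions",
    "editor/theme.py": "Colour themes for the editor",
}
hits = keyword_search(files, "intellisense")
```

Without the expansion step the same search would return nothing, since neither file mentions "intellisense" literally.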

@michaelneale michaelneale added the enhancement and work-in-progress labels Sep 25, 2024
@michaelneale
Collaborator Author

@baxen I can't work out how optional deps work with UV (they used to work - but not there now).

@michaelneale
Collaborator Author

I am going to close this for now - but keep the branch around

@yingjiehe-xyz yingjiehe-xyz deleted the vector_store branch February 5, 2025 21:11
dianed-square added a commit to dianed-square/goose that referenced this pull request Nov 27, 2025
Issue aaif-goose#23: Second recipe still referenced ./output/validation-changes.md
and ./output/update-summary.md when workflow runs from output/ directory.

Changed to:
- ./validation-changes.md (was ./output/validation-changes.md)
- ./update-summary.md (was ./output/update-summary.md)

This matches the fix we made to the first recipe (synthesize-validation-changes.yaml)
in commit c4e6c0f.
jamadeo pushed a commit that referenced this pull request Apr 13, 2026
Update acp-client git dependency from 17e60981 to dbd5bc9c.

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
jamadeo pushed a commit that referenced this pull request Apr 13, 2026
* feat: migrate from boss-ui to Ghost UI foundation

Swap the design system foundation from boss-ui to Ghost UI, the
open-source successor design language from block/ghost.

Font swap:
- Remove Cash Sans (CDN) and Cash Sans Wide (local) font faces
- Add HK Grotesk (local woff2, 7 weights: 300–900)
- Update --font-sans, --font-display to "HK Grotesk"
- Update --font-mono to "Geist Mono"

CSS tokens (globals.css):
- Adopt Ghost's main.css as the token source of truth
- Add hero-blur-in, fadeToFull, fadeToSubtle keyframes from Ghost
- Add scrollbar hover styles from Ghost
- Preserve all goose2-specific tokens: --color-brand, --density-spacing,
  --text-subtle, --radius-overlay, compat color scales for ai-elements
- Preserve goose2 custom utilities: density spacing, page transitions,
  content-fade-in, prefers-reduced-motion, keyboard-nav focus, app shell lock

UI primitives (47 of 49 files):
- Copy Ghost UI components with import path adjustment
  (@/lib/utils → @/shared/lib/cn) and "use client" removal
- Preserve goose2's button.tsx (leftIcon/rightIcon, forwardRef,
  ghost-light/toolbar variants, xs/icon-lg sizes, compound variants)
- Preserve goose2's tabs.tsx (CVA variants: default, buttons)
- Restore dialog.tsx showCloseButton prop (used by ImageLightbox)
- Fix sonner.tsx useTheme import to @/shared/theme/ThemeProvider

Not included (PR 2):
- ai-elements migration (48 files, separate scope)

* fix: add type="button" to SidebarRail per AGENTS.md

Ghost upstream omits type="button" on the SidebarRail <button>.
Goose2 requires it on all button elements to prevent accidental
form submission (AGENTS.md coding conventions).

PR review feedback from Marge.

* fix: restore 20px radius on popup surfaces (dropdown → overlay parity)

* fix: remove HK Grotesk font — align with upstream Ghost 'no bundled fonts' model

Upstream block/ghost PRs #23 and #24 removed all HK Grotesk @font-face
declarations and switched to a 'consumers bring their own fonts' model.
Align our branch with this change:

- Replace --font-sans and --font-display with system font stack
- Remove all 7 @font-face declarations (300–900 weights)
- Delete 7 HKGrotesk-*.woff2 font files from src/assets/fonts/

The system font stack (system-ui, -apple-system, BlinkMacSystemFont,
Segoe UI, Roboto, sans-serif) matches upstream Ghost's new defaults.

318/318 tests passing. Biome clean.

* fix: restore rounded-overlay class on overlay surfaces

The migration commit incorrectly replaced rounded-overlay with
rounded-dropdown on all popup/overlay components. Both tokens resolve
to 20px currently, but rounded-overlay is the correct semantic token
for overlay surfaces (popovers, hover cards, context menus, selects,
dropdown menus, menubars).

Restores 9 occurrences across 6 shared UI components.

* fix: strip internal references from CSS comments

Remove migration-specific context from globals.css comments that
wouldn't make sense to future consumers of the design system:

- Remove '(upstream Ghost PRs #23/#24)' from font comment
- Replace redundant font narration with '@font-face — add your own here'
- Remove '(from dsgn-playground)' from gray scale comment

Comments should help the next person, not document our migration.

* feat: point components.json at ghost-ui registry + add ghost.config.ts

Step 2: Update components.json to consume from ghost-ui registry:
- style: new-york → ghost
- baseColor: zinc → neutral
- Add registryUrl pointing to block.github.io/ghost/registry.json
- Add iconLibrary: lucide

Future `npx shadcn add <component>` commands will now pull
ghost-styled components from the upstream registry.

Step 3: Add ghost.config.ts for drift detection:
- Points at ghost-ui registry for goose2's shared UI components
- Enables value + structure scanning
- Configures rules: hardcoded-color (error), token-override (warn),
  missing-token (warn), structural-divergence (error)

Run `ghost scan` to detect drift from the parent design system.

* fix: use raw GitHub URL for ghost-ui registry

The GitHub Pages deploy doesn't include registry.json in dist/ —
it only serves the Vite SPA. The raw.githubusercontent.com URL
points directly at the source file and resolves correctly.

Long-term: ghost repo should copy registry.json to public/ so
it's served at block.github.io/ghost/registry.json.

* chore: add TODO to switch registry URL once block/ghost#25 lands

The raw.githubusercontent.com URL is a workaround for registry.json
not being served on GitHub Pages. PR block/ghost#25 fixes the deploy
workflow. Once merged and deployed, both ghost.config.ts and
components.json should switch to:
  https://block.github.io/ghost/registry.json

* fix: use canonical Pages URL for ghost-ui registry

The registry was already being served at /ghost/r/registry.json via
shadcn build output in public/r/. Switch both components.json and
ghost.config.ts from the raw.githubusercontent.com workaround to the
proper GitHub Pages URL.

Removes the TODO tracking block/ghost#25 — that PR is still useful
for serving at the root path, but no longer blocking us.

* chore: revert formatting-only changes (import/export reordering)

Reverts Biome auto-sort import reordering and export alphabetizing
across 39 shared/ui component files. These were formatting-only changes
that added noise to the ghost-ui migration PR without any functional
impact.

9 files fully reverted (pure formatting noise):
- collapsible, hover-card, popover, tooltip, table, label,
  separator, progress, avatar

30 files surgically cleaned (import/export order restored,
real ghost-ui class changes preserved):
- accordion, alert, alert-dialog, badge, breadcrumb, button-group,
  calendar, carousel, chart, checkbox, command, context-menu, dialog,
  dropdown-menu, form, input-group, input-otp, menubar, navigation-menu,
  pagination, radio-group, resizable, scroll-area, select, sheet,
  sidebar, slider, switch, toggle, toggle-group

* chore: revert remaining export reordering (round 2)

Missed 13 more files with alphabetized exports in the first pass.
Restored original export order in: alert-dialog, card, carousel,
command, context-menu, drawer, dropdown-menu, form, input-group,
menubar, navigation-menu, pagination, sheet.

card.tsx and drawer.tsx are now fully clean (zero diff vs main).

* fix: restore rounded-full on InputGroup container

Reverts ghost-ui registry's rounded-md back to rounded-full per design
direction. The InputGroup container uses a fixed h-9 height, so pill
radius renders correctly.

Also restores rounded-[calc(var(--radius)-5px)] on the sm button variant
for proper nested radius calculation.

* chore: simplify ghost.config.ts — remove speculative scan/rules

The scan and rules blocks were added speculatively without explicit
design decisions. Strip down to just the registry pointer until we
decide on lint rules.

Labels

enhancement New feature or request


6 participants