
feat: semantic search for large repos vector store toolkit #23

Open · wants to merge 25 commits into main

Conversation

@michaelneale (Collaborator) commented Aug 28, 2024

This uses sentence-transformers and embeddings to create a simple vector database, allowing semantic search of large codebases to help goose navigate around.

model info:

To test:

uv run goose session start --profile vector

with a ~/.config/goose/profiles.yaml containing:

vector:
  provider: openai
  processor: gpt-4o
  accelerator: gpt-4o-mini
  moderator: truncate
  toolkits:
  - name: developer
    requires: {}
  - name: vector
    requires: {}   

Then try a query asking where to add a feature, or anything that you think needs a semantic match.

[image]

@michaelneale changed the title from "Vector store" to "semantic search for large repos: vector store" on Aug 28, 2024
@michaelneale changed the title from "semantic search for large repos: vector store" to "semantic search for large repos: vector store toolkit" on Aug 29, 2024
@michaelneale marked this pull request as ready for review on August 29, 2024 at 21:36
@lifeizhou-ap (Collaborator) commented Sep 2, 2024

I've tried a scenario with the toolkits both with and without vector.

  • The configuration with vector seems more consistent and quicker at finding the relevant files (and although the first run has to build the vector store, that doesn't take long). 👍

  • I saw the warning message below, but I guess it should be fine (since the vector store is created from code the user provides)?

goose/src/goose/toolkit/vector.py:115: FutureWarning: You are using `torch.load` with 
`weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct 
malicious pickle data which will execute arbitrary code during unpickling (See 
https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default 
value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. 
Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via 
`torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have
full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
  data = torch.load(db_path)

def vector_toolkit():
    return VectorToolkit(notifier=MagicMock())

def test_query_vector_db_creates_db(temp_dir, vector_toolkit):
Collaborator:

You can use tmp_path directly instead of temp_dir.

tmp_path is the built-in fixture in pytest: https://docs.pytest.org/en/latest/how-to/tmp_path.html#tmp-path
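
For example, a minimal sketch of the same test using the built-in fixture (a sketch only; it assumes the VectorToolkit API shown in this diff and that os is imported in the test module):

def test_query_vector_db_creates_db(tmp_path, vector_toolkit):
    # tmp_path is a pathlib.Path managed by pytest; no custom temp_dir fixture needed
    result = vector_toolkit.query_vector_db(tmp_path.as_posix(), 'print("Hello World")')
    temp_db_path = vector_toolkit.get_db_path(tmp_path.as_posix())
    assert os.path.exists(temp_db_path)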

from pathlib import Path


GOOSE_GLOBAL_PATH = Path("~/.config/goose").expanduser()
Collaborator:

You can import GOOSE_GLOBAL_PATH from config.py instead of redefining it here.
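
i.e. something like this (assuming the constant is exposed from goose's config module; the exact import path is a guess):

from goose.config import GOOSE_GLOBAL_PATH  # reuse instead of redefining the Path here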

@michaelneale (Collaborator, Author):

@lifeizhou-ap thanks - yes, good catch. It should only load weights, so that warning should go away.
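
A minimal sketch of that fix, following the warning's own suggestion (it assumes the file at db_path contains only tensors saved via torch.save):

# Restrict unpickling to tensors and other allowlisted types.
data = torch.load(db_path, weights_only=True)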

@codefromthecrypt (Contributor) left a comment:

Thanks for the description, as it helps me understand how this works IRL.

vector_toolkit.create_vector_db(temp_dir.as_posix())
query = 'print("Hello World")'
result = vector_toolkit.query_vector_db(temp_dir.as_posix(), query)
print("Query Result:", result)
Contributor:

Excuse the Python noob question: do we want these prints? I guess they aren't visible by default, so it doesn't matter.

Collaborator (Author):

Yeah, you have to run pytest in another mode (e.g. pytest -s) to see them.

temp_db_path = vector_toolkit.get_db_path(temp_dir.as_posix())
assert os.path.exists(temp_db_path)
assert os.path.getsize(temp_db_path) > 0
assert 'No embeddings available to query against' in result or '\n' in result
Contributor:

I suppose in the future, we could make an integration test with ollama for this one, or possibly an in-memory embeddings lib?

Collaborator (Author):

Yeah - something scaled down and deterministic, ideally.
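
For what it's worth, a deterministic stand-in could be as small as a hash-based embedder like this sketch (purely illustrative; fake_embed and how it would be injected into VectorToolkit are assumptions):

import hashlib

import numpy as np

def fake_embed(texts: list[str]) -> np.ndarray:
    # Deterministic pseudo-embeddings: hash each text into a fixed 32-dim vector.
    # No model download and identical output on every run - enough to test the
    # store/query plumbing, though not semantic quality.
    return np.stack([
        np.frombuffer(hashlib.sha256(t.encode()).digest(), dtype=np.uint8).astype(np.float32)
        for t in texts
    ])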

@michaelneale (Collaborator, Author):

@lifeizhou-ap do you mind giving this a try again and seeing if it is as good as before for you?

@baxen (Collaborator) commented Sep 3, 2024

Very excited to try this out!

To match the rest of how goose works, I think it makes sense if we delegate the embedding off to the provider. That's a bigger refactor, but it avoids installing heavy dependencies with goose out of the box (torch, a locally downloaded model). It might drive higher performance too, but we would need to test that. What do you think?
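
For illustration, delegating to a hosted provider could look roughly like this (a sketch against the OpenAI embeddings endpoint; how it would route through goose's provider/exchange abstraction is exactly the refactor in question):

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed(texts: list[str]) -> list[list[float]]:
    # One API call per batch; no torch install or local model download.
    response = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return [item.embedding for item in response.data]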

@lifeizhou-ap (Collaborator):

> @lifeizhou-ap do you mind giving this a try again and seeing if it is as good as before for you?

LGTM!

@michaelneale (Collaborator, Author):

@baxen do you mean each provider has its own embeddings impl local to it? Would that gain much over having just one (as it is all local, and not provider specific)? Or do you mean it lives in exchange alongside the providers (and they can offer their own if they want)? I'm just not sure what the benefit would be (I might be missing something), but I am sure it is doable. Wouldn't this also still bring in the dependencies, as the providers are bundled together (if in exchange)? I.e. there is no "lazy loading" of dependencies (I think?).

@codefromthecrypt (Contributor) left a comment:

quick drive by

@michaelneale changed the title from "semantic search for large repos: vector store toolkit" to "feat: semantic search for large repos vector store toolkit" on Sep 12, 2024
@michaelneale (Collaborator, Author):

@baxen according to goose:

[image]

So that is not small. Unfortunately, an optional dependency isn't really viable for a CLI?

@michaelneale (Collaborator, Author):

Going to have a look at some lightweight options here, and failing that, I will make this an optional dependency and validate that (and likely merge it after that point).
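
If it does end up optional, one way to keep the default install light is to defer the heavy import until the toolkit is actually used, along these lines (a sketch; the 'vector' extra name and the model name are illustrative):

def _load_model():
    # Imported lazily so plain `goose` works without the optional extra installed;
    # only users of the vector toolkit pay the torch/sentence-transformers cost.
    try:
        from sentence_transformers import SentenceTransformer
    except ImportError as e:
        raise RuntimeError(
            "The vector toolkit needs the optional extra: pip install goose[vector]"
        ) from e
    return SentenceTransformer("all-MiniLM-L6-v2")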

@michaelneale (Collaborator, Author):

hey @baxen how does this look with optional deps now?

@ahau-square:

A few thoughts:

  • Code embedding search seems like a promising direction to pursue.
  • We should consider and test different chunking strategies: embedding code snippets (e.g. classes/functions) rather than, or in addition to, whole code files, to get more pinpointed search (see the sketch after this list).
  • It is probably worth benchmarking the embedding model against alternatives, e.g. ones specifically trained for code (https://huggingface.co/Salesforce/codet5p-110m-embedding, https://huggingface.co/bigcode/starencoder).
  • Why limit ourselves to models that can run locally vs. hosted models like the OpenAI embeddings API, or potentially others that Block hosts, e.g. through the Databricks model gateway?
  • Is the future idea to eventually have a vector store of code embeddings for each repo and have it updated on merge? That might lend itself to a better experience of not having to wait for your embeddings to compute.
  • From a UX perspective, I don't know how useful identifying similar files on their own is, but similar files fed in as context to ChatGPT/Claude for someone to then ask questions over or generate code from could be very useful.
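
On the chunking point above, a function/class-level chunker for Python sources could start as small as this sketch using the standard library's ast module (illustrative only; real code would also need to handle other languages, nested definitions, and module-level statements):

import ast

def chunk_python_source(source: str) -> list[str]:
    # One chunk per top-level function or class, so an embedding points at a
    # specific definition rather than a whole file.
    tree = ast.parse(source)
    return [
        ast.get_source_segment(source, node)
        for node in tree.body
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef))
    ]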

@michaelneale (Collaborator, Author):

@ahau-square

> From a UX perspective, I don't know how useful identifying similar files on their own is, but similar files fed in as context to ChatGPT/Claude for someone to then ask questions over or generate code from could be very useful.

That is exactly what this aims to do, in a simple way; that is all that is needed (the toolkit isn't for end users to see, but to help goose find where to look, and what it finds is then used as context).

I think the future idea would be for embeddings to be updated as the code changes (but they aren't meant to be search indexes, so for a relatively stable codebase it isn't a huge deal). Could certainly run it with other models and approaches, but the idea of a toolkit is that you can use it or not. I would also like goose to have something "batteries included", whether it is this approach or another, as I think goose as it stands needs help finding the code to work on.
