-
Notifications
You must be signed in to change notification settings - Fork 11
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: semantic search for large repos vector store toolkit #23
base: main
Are you sure you want to change the base?
Conversation
I've tried a scenario with the toolkits with
|
def vector_toolkit(): | ||
return VectorToolkit(notifier=MagicMock()) | ||
|
||
def test_query_vector_db_creates_db(temp_dir, vector_toolkit): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can use tmp_path
directly instead of temp_dir
.
tmp_path
is the built-in fixture in pytest. https://docs.pytest.org/en/latest/how-to/tmp_path.html#tmp-path
from pathlib import Path | ||
|
||
|
||
GOOSE_GLOBAL_PATH = Path("~/.config/goose").expanduser() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can import GOOSE_GLOBAL_PATH
from config.py
@lifeizhou-ap thanks - yes good catch, it should only load weights so that warning should go away. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the description as it helps me understand how this works IRL
vector_toolkit.create_vector_db(temp_dir.as_posix()) | ||
query = 'print("Hello World")' | ||
result = vector_toolkit.query_vector_db(temp_dir.as_posix(), query) | ||
print("Query Result:", result) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
excuse python noob.. do we want these prints? I guess they aren't visible by default, so it doesn't matter
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah you have to run pytest in another mode to see them
tests/toolkit/test_vector.py
Outdated
temp_db_path = vector_toolkit.get_db_path(temp_dir.as_posix()) | ||
assert os.path.exists(temp_db_path) | ||
assert os.path.getsize(temp_db_path) > 0 | ||
assert 'No embeddings available to query against' in result or '\n' in result |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suppose in the future, we could make an integration test with ollama for this one, or possibly an in-memory embeddings lib?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah - something scaled down and deterministic ideally
@lifeizhou-ap do you mind giving this a try again and see if it is as good as before for you? |
Very excited to try this out! To match the rest of how goose works, I think it makes sense if we delegate the embedding off to the provider. That's a bigger refactor, but then it avoids installing heavy dependencies with goose out of the box(torch, locally downloading a model). It might drive higher performance too, but would need to test that. What do you think? |
LGTM! |
@baxen do you mean each provider has its own embeddings impl local to it? Would that gain much over having just one (as it is all local, and not provider specific) or do you mean lives in exchange alongside providers? (and they can offer their own if they want?). Just not sure what benefit would be? (I might be missing something) but I am sure is doable. Wouldn't this also still bring over the dependencies as the providers are bundled together (if in exchange?) - ie there is no "lazy loading" of dependencies (I think?) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
quick drive by
@baxen according to goose: So that is not small - unfortunately a optional dependency isn't really viable for a CLI? |
going to have a lot at some lightweight options here, and failing that, I will make this an optional and validate that (and likely merge it after that point). |
hey @baxen how does this look with optional deps now? |
A few thoughts
|
That is exactly what this aims to do in a simple way - that is all that is needed (the toolkit isn't for end users to see - but to help goose find where to look which is then used as context). I think future idea would be for embeddings to change (but they aren't meant to be search - so for relatively stable codebase isn't a huge deal). Could certainly run it with other models and approaches - but the idea of a toolkit is you can use it or not (but also would like to have something that is "batteries included" for goose - if it is this approach or another, as I think as it is it needs help to find code to work on). |
this is using sentence transformers and embeddings to create a simple vector database to allow semantic search of large codebases to help goose navigate around.
model info:
To test:
uv run goose session start --profile vector
with a ~/.config/goose/profiles.yaml with:
Then try some query to ask where to add a feature, or anything which you think needs a semantic match