Quadratic complexity bug in SQLite query makes @Codebase unusably slow on large repos #4255
Open
3 tasks done
Labels
area:indexing (Relates to embedding and indexing)
ide:vscode (Relates specifically to VS Code extension)
kind:bug (Indicates an unexpected problem or unintended behavior)
"needs-triage"
priority:high (Indicates high priority)
Before submitting your bug report
Relevant environment info
Description
If a large repo, such as huggingface/transformers, is indexed, then all @Codebase queries become extremely slow. For almost the full duration, there are no LLM calls, no embeddings being created, and no reranking being done. Only a single core runs at 100% on the client side.
This happens as long as any large repo has been indexed in the past, from any directory. Even if I open a small project, all @Codebase queries on the new project are still slow.
I've identified the problem as a nested loop join in the following SQL query used for FTS, in FullTextSearchCodebaseIndex.ts inside retrieve/buildRetrieveQuery:
The query plan appears as:
This means that for every row of fts, the database engine loops over all rows of chunk_tags. Both tables contain a row for every chunk across all indexed projects, so the time complexity of the query is quadratic in the number of chunks.
I was able to fix the problem by marking the chunkId column as UNIQUE in the chunk_tags table. However, I'm not sure whether the same chunkId can appear more than once in the table. Marking the tag and chunkId columns as UNIQUE together (a composite constraint) would also solve the quadratic time complexity issue, making @Codebase queries fast again.
To reproduce
You may want to set up a local embedding model, or disable embedding somehow, to avoid incurring a high API fee.
First, clone https://github.com/huggingface/transformers
Then, open it in VSCode and wait for the whole repository to be indexed (this took about 30 minutes for me).
Now, try any query that uses @Codebase in the chat mode. It will take about 20 minutes to finally get an answer. This long computation happens every single time @Codebase is used.
Log output
No logs appear relevant to this issue in particular.
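The effect of the composite UNIQUE constraint proposed in the description can be checked outside the extension with a small script. This is only a sketch under stated assumptions: the chunk_tags table below is a hypothetical, simplified stand-in (the real schema and the FTS query are not reproduced here), and the index name is made up for illustration. It compares the query plan for a single lookup into chunk_tags, which is the step a nested loop join repeats once per outer row, before and after adding the index.

```python
import sqlite3

# Hypothetical, simplified stand-in for chunk_tags; the extension's real
# schema may have different columns and constraints.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE chunk_tags ("
    "id INTEGER PRIMARY KEY, tag TEXT NOT NULL, chunkId INTEGER NOT NULL)"
)

# The same chunkId appears under two tags here. A composite UNIQUE index
# on (tag, chunkId) still permits this, unlike UNIQUE on chunkId alone.
conn.executemany(
    "INSERT INTO chunk_tags (tag, chunkId) VALUES (?, ?)",
    [("repo-a", i) for i in range(1000)] + [("repo-b", i) for i in range(1000)],
)

# One probe into chunk_tags, i.e. the inner step of a nested loop join.
PROBE = "SELECT 1 FROM chunk_tags WHERE chunkId = ? AND tag = ?"


def plan(connection, sql, params):
    """Return the 'detail' column of EXPLAIN QUERY PLAN as one string."""
    rows = connection.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[3] for row in rows)


before = plan(conn, PROBE, (42, "repo-a"))

# Illustrative index name; not taken from the extension's code.
conn.execute(
    "CREATE UNIQUE INDEX idx_chunk_tags_tag_chunk ON chunk_tags (tag, chunkId)"
)

after = plan(conn, PROBE, (42, "repo-a"))

print("before:", before)
print("after: ", after)
```

Without the index, each probe is a full table scan, so repeating it once per fts row gives the quadratic behavior. With the index, each probe becomes an index search. Whether the planner picks the same plan for the real FTS join would still need to be confirmed with EXPLAIN QUERY PLAN on the actual query.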