Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Quadratic complexity bug in SQLite query makes @Codebase unusably slow on large repos #4255

Open
3 tasks done
JubilantJerry opened this issue Feb 20, 2025 · 0 comments
Open
3 tasks done
Assignees
Labels
area:indexing Relates to embedding and indexing ide:vscode Relates specifically to VS Code extension kind:bug Indicates an unexpected problem or unintended behavior "needs-triage" priority:high Indicates high priority

Comments

@JubilantJerry
Copy link

JubilantJerry commented Feb 20, 2025

Before submitting your bug report

Relevant environment info

- OS: Ubuntu 24.04
- Continue version: 0.9.269
- IDE version: VSCode 1.97.2
- Model: Qwen 2.5 7b Instruct
- config:
  
{
  "models": [
    {
      "title": "Qwen 2.5 72b Instruct",
      "model": "qwen2.5:72b-instruct-q8_0",
      "provider": "ollama",
      "apiBase": "https://api.ollama.companyname.com",
      "contextLength": 16384
    },
    {
      "title": "Qwen 2.5 7b Instruct",
      "model": "qwen2.5:7b-instruct-q8_0",
      "provider": "ollama",
      "apiBase": "https://api.ollama.companyname.com",
      "contextLength": 16384
    }
  ],
  "tabAutocompleteModel": {
    "title": "Qwen2.5-Coder 1.5B",
    "provider": "ollama",
    "model": "qwen2.5-coder:1.5b-base",
    "apiBase": "https://api.ollama.companyname.com"
  },
  "embeddingsProvider": {
    "title": "Nomic Embed Text",
    "provider": "ollama",
    "model": "nomic-embed-text",
    "apiBase": "https://api.ollama.companyname.com"
  },
  "reranker": {
    "name": "huggingface-tei",
    "params": {
      "apiBase": "http://192.168.187.57:11435",
      "truncate": true,
      "truncation_direction": "Right"
    }
  },
  "contextProviders": [
    {
      "name": "code",
      "params": {}
    },
    {
      "name": "docs",
      "params": {}
    },
    {
      "name": "diff",
      "params": {}
    },
    {
      "name": "terminal",
      "params": {}
    },
    {
      "name": "problems",
      "params": {}
    },
    {
      "name": "folder",
      "params": {}
    },
    {
      "name": "codebase",
      "params": {}
    }
  ],
  "slashCommands": [
    {
      "name": "share",
      "description": "Export the current chat session to markdown"
    },
    {
      "name": "cmd",
      "description": "Generate a shell command"
    },
    {
      "name": "commit",
      "description": "Generate a git commit message"
    }
  ]
}

Description

If a large repo, such as huggingface/transformers, is indexed, then all @Codebase queries will be very, very slow. For almost the full duration, there are no LLM calls, no embeddings being created, and no reranking being done. There will only be a single core running at 100% at the client end.

This happens as long as any large repo has been indexed in the past from any directory. Even if I open a small project, all @Codebase queries on the new project will still be slow.

I've identified the problem as a nested loop join in the following SQL query used for FTS, used in FullTextSearchCodebaseIndex.ts inside retrieve/buildRetrieveQuery:

SELECT fts_metadata.chunkId, fts_metadata.path, fts.content, rank
FROM fts
JOIN fts_metadata ON fts.rowid = fts_metadata.id
JOIN chunk_tags ON fts_metadata.chunkId = chunk_tags.chunkId
WHERE fts MATCH ?
AND chunk_tags.tag IN (?)
ORDER BY bm25(fts, 10)
LIMIT ?;

The query plan appears as:

SCAN fts VIRTUAL TABLE INDEX 0:M2
SEARCH fts_metadata USING INTEGER PRIMARY KEY (rowid=?)
SCAN chunk_tags
USE TEMP B-TREE FOR ORDER BY

This means for every row of fts, the database engine will loop over all rows of chunk_tags. Both tables contain a row for every chunk across all projects indexed, so the time complexity of the query is quadratic in the number of chunks.

I was able to fix the problem by marking the chunkId column as UNIQUE in the chunk_tags table. However, I'm not sure if it's possible for the same chunkId to appear more than once in the table. Marking the tag and chunkId columns as UNIQUE when put together would also solve the quadratic time complexity issue, making the @Codebase queries fast again.

To reproduce

You may want to set up a local embedding model, or disable embedding somehow, to avoid incurring a high API fee.

First, clone https://github.com/huggingface/transformers
Then, open it in VSCode and wait for the whole repository to be indexed (this took about 30 minutes for me).

Now, try any query that uses @Codebase in the chat mode. It will take about 20 minutes to finally get an answer. This long computation happens every single time @Codebase is used.

Log output

No logs appear relevant to this issue in particular.
@dosubot dosubot bot added area:indexing Relates to embedding and indexing ide:vscode Relates specifically to VS Code extension kind:bug Indicates an unexpected problem or unintended behavior priority:high Indicates high priority labels Feb 20, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:indexing Relates to embedding and indexing ide:vscode Relates specifically to VS Code extension kind:bug Indicates an unexpected problem or unintended behavior "needs-triage" priority:high Indicates high priority
Projects
None yet
Development

No branches or pull requests

2 participants