Ingest knowledge from GitHub
What gets indexed
For each connected repository:
- README files at any path.
- Markdown docs in /docs, /documentation, or custom paths.
- Issue threads (titles and bodies; optionally comments).
- Discussion threads (with comments).
- Wiki pages (if enabled on the repo).
- PR descriptions (optionally; off by default).
What's not indexed:
- Code files (default; opt-in for code-aware retrieval).
- Binary assets.
- Issue labels (used for filtering, not as content).
- Draft PRs.
Setup walkthrough
About 15 minutes:
Step 1: connect GitHub
- Open Knowledge Hub > Add Source > GitHub.
- Click "Connect GitHub".
- Sign in with GitHub.
- Approve scopes: repo (read access to repositories) and read:discussion (read discussions).
- Pick which repositories to grant access. Per-repo selection is the most precise.
This step takes about 5 minutes.
Step 2: select content per repo
For each repo:
- Include paths. Default: /README.md, /docs/**, /documentation/**. Customize with glob patterns.
- Include issues. Toggle on/off. If on, include closed issues too?
- Include discussions. Toggle.
- Include PRs. Off by default.
- Branch. Default: main or master. Other branches selectable.
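Include-path globs such as /docs/** or /src/**/*.py can be evaluated roughly as follows. This is a minimal sketch, not AskVault's actual matcher: the helper names glob_to_regex and is_included are hypothetical, and it assumes that * stays within one path segment while ** crosses directories.

```python
import re

# Default include paths quoted from the setup step above.
DEFAULT_INCLUDE = ["/README.md", "/docs/**", "/documentation/**"]


def glob_to_regex(pattern: str) -> re.Pattern:
    """Translate a glob using `*` and `**` into an anchored regex (assumed semantics)."""
    parts = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**/", i):
            parts.append("(?:.*/)?")  # zero or more whole directories
            i += 3
        elif pattern.startswith("**", i):
            parts.append(".*")  # anything, including slashes
            i += 2
        elif pattern[i] == "*":
            parts.append("[^/]*")  # anything inside a single path segment
            i += 1
        else:
            parts.append(re.escape(pattern[i]))
            i += 1
    return re.compile("^" + "".join(parts) + "$")


def is_included(path: str, patterns=DEFAULT_INCLUDE) -> bool:
    """Return True if the repo-relative path matches any include pattern."""
    return any(glob_to_regex(p).match(path) for p in patterns)
```

With the defaults, /docs/guide/setup.md is included and /src/main.py is not. Passing ["/src/**/*.py"] as the pattern list would include /src/auth/handler.py.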
Step 3: configure sync
- Webhook sync (recommended). Updates within 30 seconds of a push.
- Scheduled sync. Daily catch-up sync.
Step 4: trigger initial sync
Click "Sync now". Typical repo (50 docs) indexes in about 5 minutes.
Sync behavior
How GitHub changes flow:
- The push webhook fires on commits to the watched branch.
- AskVault detects which files changed vs the last sync.
- Only changed files re-index. Avoids full re-crawl.
- Index updates within 30 seconds of push.
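The incremental step can be pictured with GitHub's push event payload, which carries a ref plus a list of commits, each with added, modified, and removed paths. The sketch below is an illustration only; how AskVault computes the diff internally is not documented here, and the function name changed_files is hypothetical.

```python
def changed_files(push_payload: dict, watched_branch: str = "main") -> dict:
    """Collect paths to re-index and paths to drop from a GitHub push payload."""
    # Ignore pushes to branches other than the watched one.
    if push_payload.get("ref") != f"refs/heads/{watched_branch}":
        return {"reindex": [], "remove": []}

    reindex, remove = set(), set()
    # Commits arrive oldest first, so later commits override earlier ones.
    for commit in push_payload.get("commits", []):
        for path in commit.get("added", []) + commit.get("modified", []):
            reindex.add(path)
            remove.discard(path)
        for path in commit.get("removed", []):
            remove.add(path)
            reindex.discard(path)
    return {"reindex": sorted(reindex), "remove": sorted(remove)}
```

A file added in one commit and deleted in a later commit of the same push ends up only in the remove list, so nothing is indexed needlessly.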
For issue and discussion content:
- issues.opened, issues.edited, issues.closed events trigger re-index.
- Same for discussions.
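GitHub delivers the event name (for example issues) separately from the action field in the payload (for example opened). A minimal filter over those two values might look like this. The issue actions come from the list above; the discussion actions are an assumption, since the text only says "same for discussions", and should_reindex is a hypothetical name.

```python
# Actions that trigger a re-index, keyed by GitHub event name.
REINDEX_ACTIONS = {
    "issues": {"opened", "edited", "closed"},
    # Assumed set: the doc does not list discussion actions explicitly.
    "discussion": {"created", "edited", "answered"},
}


def should_reindex(event: str, payload: dict) -> bool:
    """Return True when the event/action pair should refresh the index."""
    return payload.get("action") in REINDEX_ACTIONS.get(event, set())
```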
Code-aware retrieval (opt-in)
By default, code files aren't indexed. Enable selectively:
- Knowledge Hub > GitHub Source > Code Indexing.
- Enable.
- Pick file extensions (e.g., .py, .js, .go).
- Pick paths to index (e.g., /src/**/*.py).
When enabled:
- Functions and classes index as chunks.
- Doc comments preserved.
- The bot can answer "where is function X defined?" with file path and line number.
Useful for engineering helpdesk bots. Adds storage cost; budget for about 2 to 3x the size of your code files.
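To illustrate what "functions and classes index as chunks" could mean for a Python file, here is a rough sketch built on the standard ast module. It is not AskVault's chunker; the function name chunk_python_source and the chunk fields are hypothetical.

```python
import ast


def chunk_python_source(source: str, path: str) -> list[dict]:
    """Split a Python file into one chunk per function or class definition."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append({
                "name": node.name,
                "kind": "class" if isinstance(node, ast.ClassDef) else "function",
                "line": node.lineno,
                "location": f"{path}:{node.lineno}",  # file path plus line number
                "doc": ast.get_docstring(node),  # doc comment kept with the chunk
                "text": "\n".join(lines[node.lineno - 1:node.end_lineno]),
            })
    # Order chunks as they appear in the file.
    return sorted(chunks, key=lambda c: c["line"])
```

Each chunk carries a location string in the file:line form, which is the kind of citation a "where is function X defined?" answer would need.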
Sample questions the bot can answer
Once indexed:
- "How do I install this package?" → README answer.
- "What's the rate limit on the API?" → /docs/rate-limits.md.
- "Has anyone reported this error?" → matching closed issues.
- "What's the rollback procedure?" → /docs/runbooks/rollback.md.
- (Code-aware) "Where is authenticate_user defined?" → src/auth/handler.py:42.
Each answer cites the source file with line numbers where applicable.
Private repository handling
For private repos:
- OAuth grants AskVault read access scoped to the connected user.
- AskVault never exposes private content to anonymous bot visitors.
- Use audience tags to scope which authenticated users can query private-repo content.
Common pattern: tag private-repo content internal, deploy a Slack bot, require identity verification for engineers.
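Conceptually, audience tags act as a filter applied before retrieval results reach the visitor. The sketch below assumes a simple overlap rule and a "public" default tag; both are assumptions made for illustration, as are the Chunk type and visible_chunks function.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    source: str
    text: str
    # Assumed default: untagged content is visible to everyone.
    audience: set = field(default_factory=lambda: {"public"})


def visible_chunks(chunks, visitor_tags):
    """Keep chunks whose audience overlaps the visitor's tags.

    Anonymous visitors pass an empty tag list and only see public chunks.
    """
    allowed = set(visitor_tags) | {"public"}
    return [c for c in chunks if c.audience & allowed]
```

Under this rule, a chunk tagged internal is returned only to visitors whose verified identity carries the internal tag.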
Multi-repo workspaces
For organizations with many repos:
- Connect each separately under different sources.
- Or connect a whole org (if the OAuth grant is scoped that way) and pick repos to index.
- Up to 50 repos per AskVault workspace on Business.
Webhook configuration
AskVault auto-configures webhooks during setup. For debugging:
- GitHub > Repository Settings > Webhooks lists active webhooks.
- Find "askvault" entry.
- Recent deliveries shows event history.
If a webhook fails, AskVault retries 3 times then alerts.
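The retry-then-alert behavior can be sketched as one initial attempt followed by up to three retries. The backoff schedule, the alert mechanism, and the name deliver_with_retries are assumptions; the text above states only the retry count.

```python
import time


def deliver_with_retries(handler, event, retries=3, base_delay=1.0,
                         sleep=time.sleep, alert=print):
    """Run handler(event); retry up to `retries` times on failure, then alert."""
    for attempt in range(retries + 1):  # first try plus `retries` retries
        try:
            handler(event)
            return True
        except Exception as exc:
            if attempt == retries:
                alert(f"webhook processing failed after {retries} retries: {exc}")
                return False
            # Exponential backoff between attempts (assumed, not documented).
            sleep(base_delay * 2 ** attempt)
```

The sleep and alert parameters are injectable so the behavior can be exercised without real delays.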
Plan availability
- Free, Starter. No GitHub integration.
- Growth. Up to 5 repos, doc-only indexing.
- Business. Up to 50 repos, issues + discussions, code-aware retrieval.
- Enterprise. Unlimited repos.
Issue and discussion handling
Issues become a knowledge source:
- Title is the question prompt.
- Body is the context.
- Optionally, comments add detail.
- A closed (resolved) status marks the issue as authoritative.
For discussions:
- Threaded conversations index with full context.
- Marked answers ("answered" status) get extra weight in retrieval.
Useful for engineering helpdesk bots that surface "this was discussed before".
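One way to picture "extra weight" for answered discussions and closed issues is a multiplicative boost on the similarity score. The boost values and field names below are purely hypothetical; the actual weighting is not documented here.

```python
def rerank(hits, answered_boost=1.5, closed_boost=1.2):
    """Sort retrieval hits by similarity, boosting answered or closed threads."""
    def score(hit):
        s = hit["similarity"]
        if hit.get("answered"):          # discussion with a marked answer
            s *= answered_boost
        if hit.get("state") == "closed":  # issue closed as resolved
            s *= closed_boost
        return s

    return sorted(hits, key=score, reverse=True)
```

With these example boosts, an answered discussion at similarity 0.6 outranks an unanswered thread at 0.8.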
Branch and tag handling
- Default branch indexed. Configure under sync settings.
- Other branches. Add as separate sources if needed.
- Tags (releases). Not indexed. Release content is indexed via the README on the tagged commit.
For docs-as-code workflows, the bot follows the main branch and surfaces the canonical docs.
Code search vs RAG
For code-aware queries, two patterns:
- RAG-based. Bot semantic-searches indexed chunks; surfaces relevant code with explanation.
- GitHub-native search. Bot calls GitHub's code search API (faster for exact-name lookups).
AskVault uses RAG by default. For "find function name X" patterns, GitHub-native search is planned.
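A router between the two patterns could be as simple as spotting exact-name lookups and sending everything else to RAG. The sketch below is hypothetical: the regex, the route function, and the backend labels are illustrative and do not describe the planned implementation.

```python
import re

# Matches questions like "Where is authenticate_user defined?"
EXACT_NAME = re.compile(
    r"\bwhere is\s+`?([A-Za-z_]\w*)`?\s+defined\b", re.IGNORECASE
)


def route(question: str):
    """Pick a backend: exact-name lookups go to code search, the rest to RAG."""
    match = EXACT_NAME.search(question)
    if match:
        return ("github_search", match.group(1))
    return ("rag", question)
```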
Privacy and audit
What AskVault reads:
- Repository content within the OAuth scope.
- Issue and discussion metadata.
- No write access ever.
Every retrieval logs:
- Which repo, file, lines.
- Visitor ID.
- Timestamp.
Useful for proving "the bot didn't leak source code to unauthorized visitors".
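A retrieval log entry with the fields listed above might be serialized like this. The JSON layout, field names, and audit_record function are assumptions for illustration; the actual log format is not specified.

```python
import json
from datetime import datetime, timezone


def audit_record(repo, path, start_line, end_line, visitor_id, now=None):
    """Build one JSON audit line: which repo, file, lines, visitor, and when."""
    now = now or datetime.now(timezone.utc)
    return json.dumps({
        "repo": repo,
        "file": path,
        "lines": [start_line, end_line],
        "visitor_id": visitor_id,
        "timestamp": now.isoformat(),  # UTC, ISO 8601
    }, sort_keys=True)
```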
Planned features (on the roadmap)
Documented for accuracy:
- Pull-request body indexing. Today, off by default. PR descriptions optionally indexable.
- Commit message indexing. Today, no. Planned for engineering knowledge bots.
- Code-aware function-summary generation. Today, code chunks index as-is. Auto-generated function summaries for better retrieval planned.
- GitHub Copilot integration. Today, separate worlds. Planned: side-by-side AskVault plus Copilot context.
Limits
- Repos per workspace. 5 (Growth), 50 (Business), unlimited (Enterprise).
- Files per repo indexed. 5,000 max.
- File size per file. 1 MB max for code; 10 MB for docs.
- Webhook event throughput. About 100 events per minute per repo.
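The per-file and per-repo limits can be expressed as a pre-index check. This is a sketch under stated assumptions: 1 MB is taken as 1024 * 1024 bytes, each file is pre-classified as "code" or "docs", and select_indexable is a hypothetical helper.

```python
MB = 1024 * 1024  # assumed definition of 1 MB
MAX_FILES_PER_REPO = 5_000
MAX_BYTES = {"code": 1 * MB, "docs": 10 * MB}


def select_indexable(files):
    """Split (path, size_bytes, kind) tuples into accepted paths and skips."""
    accepted, skipped = [], []
    for path, size, kind in files:
        if size > MAX_BYTES[kind]:
            skipped.append((path, "file too large"))
        elif len(accepted) >= MAX_FILES_PER_REPO:
            skipped.append((path, "repo file limit reached"))
        else:
            accepted.append(path)
    return accepted, skipped
```

A 2 MB Markdown file passes the docs limit, while a 2 MB source file exceeds the code limit and is skipped.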
Common pitfalls
Repository not appearing in source list. OAuth didn't grant access to that repo. Re-authorize with broader scope.
Issues content overpowering docs. Too many noisy issues. Filter to specific labels under Source Settings.
Code chunks return without context. Function definitions without surrounding context. Enable "include neighboring chunks" under retrieval settings.
Webhook events not arriving. GitHub webhook misconfigured. Re-create via Knowledge Hub > GitHub Source > Re-create webhook.
FAQ
Does this work for GitHub Enterprise Server (self-hosted)?
Hosted GitHub Enterprise is supported. For on-prem GitHub Enterprise Server, contact support for the right authentication flow.
Can I index code from a private repo for a public-facing bot?
Risky. Audience tags keep private content away from anonymous visitors, but only if the content is tagged correctly, so verify the tagging carefully.
How fresh are bot answers?
Within 30 seconds of push (webhook sync). Without webhooks, within 24 hours (scheduled sync).
Can I index GitHub Wiki pages?
Yes if the repo's wiki is enabled. AskVault detects and includes them.
Does this work for monorepos?
Yes. Use include-path globs to scope which subdirectories index. Avoid indexing the whole tree if it's huge.