doora
Search code by shape,
not by text.
A high-performance structural code search engine built on Tree-sitter. Where grep sees a flat river of bytes, doora sees a grammar — functions, types, scopes, and relationships.
01 Overview
doora is a command-line tool that parses source files into Abstract Syntax Trees and executes structural pattern queries against them. It is fundamentally different from text search: it understands the grammar of your code, not just its characters.
Every text-based search tool — grep, ripgrep, ack — treats source code as a string. They cannot tell the difference between a function named authenticate and a comment that mentions authenticate. They cannot find "all functions that take exactly two arguments" or "all unwrap() calls outside test modules."
doora answers all of these queries instantly, across millions of lines of code, using Tree-sitter's incremental parsing library at its core.
Structural patterns
Find code by its syntactic shape using Tree-sitter S-expression queries.
Parallel processing
Lock-free work-stealing thread pool via Rayon. Scales to all available cores.
7 languages
Rust, Python, JavaScript, TypeScript, Go, C, and C++ with auto-detection.
Bloom filter index
Pre-reject files that cannot possibly match before invoking the parser.
Semantic rewriting
Replace structural patterns without corrupting surrounding syntax.
MCP server
Expose codebase structure to LLM agents via the Model Context Protocol.
02 Installation
doora is distributed as a single static binary. Install via Cargo, download a pre-built binary, or build from source.
From crates.io
# Requires Rust 1.78+
cargo install doora
Pre-built binaries
Download from the Releases page. Binaries are available for:
| Platform | Architecture | Binary |
|---|---|---|
| Linux | x86_64 | doora-x86_64-unknown-linux-gnu |
| Linux | aarch64 | doora-aarch64-unknown-linux-gnu |
| macOS | x86_64 | doora-x86_64-apple-darwin |
| macOS | Apple Silicon | doora-aarch64-apple-darwin |
| Windows | x86_64 | doora-x86_64-pc-windows-msvc.exe |
From source
git clone https://github.com/backpack-lab/doora
cd doora
cargo build --release
# Binary at ./target/release/doora
Shell completions
# Bash
doora --generate-completions bash >> ~/.bashrc
# Zsh
doora --generate-completions zsh >> ~/.zshrc
# Fish
doora --generate-completions fish > ~/.config/fish/completions/doora.fish
03 Quick start
Run your first structural query in under 60 seconds.
Find all function definitions
doora \
-q '(function_item name: (identifier) @fn_name)' \
-p ./src
src/auth/handler.rs:42:0 [@fn_name] "parse_token"
src/auth/handler.rs:89:0 [@fn_name] "validate_session"
src/db/pool.rs:14:0 [@fn_name] "connect"
...
Found 47 matches across 23 files in 38ms
Find a specific function by name
doora \
-q '(function_item name: (identifier) @fn (#eq? @fn "connect"))' \
-p ./src
Search multiple languages at once
doora \
-q '(function_declaration name: (identifier) @fn_name)' \
-p ./src
# auto-detects language per file
Multiple queries in one pass
doora \
-q '(function_item name: (identifier) @fn_name)' \
-q '(struct_item name: (type_identifier) @struct_name)' \
-p ./src --no-color
Launch the interactive TUI
doora -q '(function_item)' -p . --tui
Build a search index for faster queries
doora index ./src --verbose
# Subsequent searches use Bloom filter pre-rejection
04 Why structural search
Text search has fundamental limitations that no amount of regex cleverness can overcome. Structural search solves them categorically.
| Query | grep / ripgrep | doora |
|---|---|---|
| Find function definitions named auth_user | Returns ALL occurrences: function defs, variable names, comments, string literals, dead code | Returns only function definition nodes named auth_user |
| Find functions taking exactly 2 arguments | Cannot be expressed reliably in regex | Trivial: query the parameter list node child count |
| Find all unwrap() outside test modules | Cannot express scope constraints | Query for call sites with scope predicates |
| Find struct definitions that implement a trait | Multi-step, fragile, many false positives | Single S-expression query |
| Rename a function everywhere it is defined | Risk of corrupting string literals and comments | Semantic rewriting via AST: only actual definition nodes |
doora is to grep what a SQL database is to a flat text file. Both store the same data; one understands its structure.
05 How AST search works
Source code has grammar. Every language defines how tokens nest into expressions, expressions into statements, and statements into functions and modules. doora makes this grammar queryable.
From bytes to tree
Consider this Rust function:
fn authenticate(user: &str, password: &str) -> bool { true }
Tree-sitter parses this into a Concrete Syntax Tree (CST):
└─ function_item
   ├─ name: identifier "authenticate"
   ├─ parameters: parameters
   │  ├─ parameter
   │  │  ├─ pattern: identifier "user"
   │  │  └─ type: reference_type
   │  └─ parameter
   │     ├─ pattern: identifier "password"
   │     └─ type: reference_type
   ├─ return_type: primitive_type "bool"
   └─ body: block
An S-expression query navigates this tree structure. The query (function_item name: (identifier) @fn) matches any function_item node that has a name child of type identifier, and captures that identifier as @fn.
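To make the matching step concrete, here is a minimal sketch that models the tree above with a hand-rolled node type and emulates the (function_item name: (identifier) @fn) query by walking it. Node, leaf, capture_fn_names, and sample_tree are hypothetical names chosen for illustration; this is neither the Tree-sitter API nor doora's implementation.

```rust
// Toy stand-in for a syntax tree node; NOT the tree-sitter API.
#[derive(Debug)]
struct Node {
    kind: &'static str,
    field: Option<&'static str>, // field name under the parent, e.g. "name"
    text: &'static str,          // source text (only filled in for leaves here)
    children: Vec<Node>,
}

fn leaf(kind: &'static str, field: Option<&'static str>, text: &'static str) -> Node {
    Node { kind, field, text, children: Vec::new() }
}

/// Emulates (function_item name: (identifier) @fn): walk depth-first and
/// collect the text of every `name` child of kind `identifier` whose parent
/// is a `function_item`.
fn capture_fn_names(node: &Node, out: &mut Vec<&'static str>) {
    if node.kind == "function_item" {
        for child in &node.children {
            if child.field == Some("name") && child.kind == "identifier" {
                out.push(child.text);
            }
        }
    }
    for child in &node.children {
        capture_fn_names(child, out);
    }
}

/// Hand-built, abbreviated tree for the `authenticate` function.
fn sample_tree() -> Node {
    Node {
        kind: "source_file",
        field: None,
        text: "",
        children: vec![Node {
            kind: "function_item",
            field: None,
            text: "",
            children: vec![
                leaf("identifier", Some("name"), "authenticate"),
                leaf("parameters", Some("parameters"), ""),
                leaf("primitive_type", Some("return_type"), "bool"),
                leaf("block", Some("body"), ""),
            ],
        }],
    }
}
```

Calling capture_fn_names on sample_tree() collects the single name authenticate. An identifier that is not in the name field of a function_item, such as a parameter name, would not be captured.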
CST vs AST
Traditional compilers produce Abstract Syntax Trees (ASTs) that discard whitespace, comments, and punctuation. doora uses Concrete Syntax Trees (CSTs) that retain 100% fidelity to the original source — including exact byte offsets for every token. This is required for:
- Accurately mapping matches back to line and column numbers for display
- Semantic rewriting that preserves the user's original formatting
- Syntax highlighting driven by the tree itself rather than regex
06 Query syntax overview
Queries use Tree-sitter's S-expression pattern syntax — a Lisp-like notation that mirrors the shape of the syntax tree.
Basic node match
(node_type)
Matches any node of the given type anywhere in the tree. Node type names are grammar-specific — see the language sections for common types.
Named capture
(function_item name: (identifier) @my_capture)
The @my_capture syntax tags a sub-node for extraction. Captured text appears in the output. Multiple captures per query are supported.
Equality predicate
(function_item name: (identifier) @fn (#eq? @fn "connect"))
Matches only when the captured node's text equals "connect" exactly.
Regex predicate
(function_item name: (identifier) @fn (#match? @fn "^(get|set|update)_"))
Matches function names starting with get_, set_, or update_.
Wildcard child
(call_expression function: (identifier) @callee arguments: (arguments . (_) .))
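The (_) wildcard matches any single named node. The . anchors pin it to the first and last child positions of arguments, so this pattern matches calls with exactly one argument.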
Multi-query (single pass)
doora \
-q '(function_item name: (identifier) @fn_name)' \
-q '(struct_item name: (type_identifier) @struct_name)' \
-p ./src
Multiple -q flags are compiled into a single-pass automaton. The AST is traversed exactly once regardless of query count.
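The single-pass idea can be sketched as one traversal that tests each node against every query. The Node type and multi_match function below are assumptions for illustration, and each query is reduced to a bare node kind; doora's actual automaton is not described in this document.

```rust
// Toy node: a kind plus children. Not the tree-sitter API.
struct Node {
    kind: &'static str,
    children: Vec<Node>,
}

/// Visits every node exactly once and tests it against all query kinds.
/// Returns per-query hit counts and the number of nodes visited.
fn multi_match(root: &Node, query_kinds: &[&str]) -> (Vec<usize>, usize) {
    let mut hits = vec![0; query_kinds.len()];
    let mut visited = 0;
    let mut stack = vec![root];
    while let Some(node) = stack.pop() {
        visited += 1;
        for (i, kind) in query_kinds.iter().enumerate() {
            if node.kind == *kind {
                hits[i] += 1;
            }
        }
        // Children are pushed once, so no node is visited twice.
        stack.extend(node.children.iter());
    }
    (hits, visited)
}
```

The visited count depends only on the size of the tree. Adding a query adds one comparison per node rather than another walk.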
07 CLI: search command
The default command. When no subcommand is given, doora runs a structural search.
Arguments
| Flag | Type | Default | Description |
|---|---|---|---|
| -q, --query <S-EXPR> | String (repeatable) | required | S-expression query pattern. Pass multiple times for multi-query single-pass search. |
| -p, --path <DIR> | PathBuf | "." | Root directory to search. Must be an existing directory. |
| -l, --lang <LANG> | String | "auto" | Language to parse. One of: rust, python, js, ts, go, c, cpp, auto. |
| --no-color | bool | false | Disable ANSI color output. Also honored via the NO_COLOR environment variable. |
| -Q, --quiet | bool | false | Suppress per-match result lines. Only the summary line is printed. |
| --stats | bool | false | Print detailed performance diagnostics to stderr after the search completes. |
| --no-update-index | bool | false | Disable automatic incremental index updates during search. |
Output format
Fields: filepath:line:col [@capture_name] "matched_text"
- Filepath is colorized cyan (unless --no-color)
- Capture name is colorized yellow
- Matched text is colorized green, always wrapped in literal quotes
- Line numbers are 1-indexed; columns are 0-indexed byte offsets
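For scripts that consume this output, the following sketch shows one way to produce and read back an uncolored result line. The MatchResult fields and the parse_location helper are hypothetical; the real struct in doora may be shaped differently.

```rust
use std::fmt;

// Hypothetical shape for illustration; the real MatchResult may differ.
struct MatchResult {
    path: String,
    line: usize, // 1-indexed
    col: usize,  // 0-indexed byte offset
    capture: String,
    text: String,
}

impl fmt::Display for MatchResult {
    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
        write!(
            f,
            "{}:{}:{} [@{}] \"{}\"",
            self.path, self.line, self.col, self.capture, self.text
        )
    }
}

/// Splits an uncolored result line back into (path, line, col).
/// Splitting from the right keeps paths that contain ':' intact.
fn parse_location(result_line: &str) -> Option<(&str, usize, usize)> {
    let (location, _rest) = result_line.split_once(" [@")?;
    let mut parts = location.rsplitn(3, ':');
    let col: usize = parts.next()?.parse().ok()?;
    let line: usize = parts.next()?.parse().ok()?;
    let path = parts.next()?;
    Some((path, line, col))
}
```

Parsing assumes --no-color (or NO_COLOR) so that no ANSI escape sequences are mixed into the fields.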
Stats output
files walked: 47
files parsed: 46
files skipped: 1
matches found: 12
sieve rejected: 18
match rate: 26.09% (files with matches / files parsed)
wall time: 38ms
throughput: 1236.84 files/sec
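The two derived figures can be reproduced from the counters. The sketch below assumes that 12 files contained matches and that throughput divides files walked by wall time; both assumptions are inferred from the sample numbers, and the Stats struct is hypothetical.

```rust
// Hypothetical counters mirroring the sample output.
struct Stats {
    files_walked: u64,
    files_parsed: u64,
    files_with_matches: u64,
    wall_time_ms: u64,
}

impl Stats {
    /// Match rate as a percentage: files with matches / files parsed.
    fn match_rate_percent(&self) -> f64 {
        if self.files_parsed == 0 {
            return 0.0;
        }
        self.files_with_matches as f64 / self.files_parsed as f64 * 100.0
    }

    /// Throughput in files per second, based on files walked (assumption).
    fn files_per_sec(&self) -> f64 {
        if self.wall_time_ms == 0 {
            return 0.0;
        }
        self.files_walked as f64 * 1000.0 / self.wall_time_ms as f64
    }
}
```

With 47 files walked, 46 parsed, 12 with matches, and 38 ms of wall time, the two methods give 26.09% and 1236.84 files/sec when formatted to two decimals.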
08 CLI: index command
Explicitly builds or updates the Bloom filter index for a directory. Once indexed, search queries with string literals run significantly faster.
| Flag | Type | Default | Description |
|---|---|---|---|
| <PATH> | PathBuf | required | Root directory to index. |
| --lang <LANG> | String | "auto" | Language filter. Defaults to auto-detect all supported extensions. |
| --verbose | bool | false | Print one line per file: indexed:, fresh:, or removed: prefix. |
indexed: src/db/pool.rs
fresh: src/main.rs
removed: src/old/legacy.rs
indexed 44 files, skipped 2 fresh, removed 1 stale entries
index written to .doora-index
The index is stored at <PATH>/.doora-index as a binary file serialized with bincode. It is a manifest of per-file Bloom filter entries, each containing:
- Absolute file path
- Last-modified timestamp (seconds since Unix epoch)
- File size in bytes
- 512-byte Bloom filter (4096 bits, two FNV-1a hash functions)
- Detected language string
Both the explicit index command and the search command perform incremental updates. Files whose mtime and size match the stored entry are skipped. New and modified files are re-indexed. Deleted files are removed from the manifest automatically.
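The incremental rule can be sketched as a comparison between stored entries and the files currently on disk. Stamp, Action, and plan_update are hypothetical names, and the manifest is reduced to the two staleness fields; the real IndexManifest carries more data.

```rust
use std::collections::HashMap;

// Simplified manifest entry: only the fields used for staleness checks.
#[derive(Clone, Copy, PartialEq, Debug)]
struct Stamp {
    mtime_secs: u64,
    size: u64,
}

#[derive(PartialEq, Debug)]
enum Action {
    Indexed, // new or modified: (re)build the Bloom filter
    Fresh,   // mtime and size unchanged: skip
    Removed, // in the manifest but no longer on disk
}

/// Compares stored entries with the files currently on disk and returns
/// one action per path, sorted by path for stable output.
fn plan_update(
    manifest: &HashMap<String, Stamp>,
    on_disk: &HashMap<String, Stamp>,
) -> Vec<(String, Action)> {
    let mut plan = Vec::new();
    for (path, stamp) in on_disk {
        let action = match manifest.get(path) {
            Some(stored) if stored == stamp => Action::Fresh,
            _ => Action::Indexed,
        };
        plan.push((path.clone(), action));
    }
    for path in manifest.keys() {
        if !on_disk.contains_key(path) {
            plan.push((path.clone(), Action::Removed));
        }
    }
    plan.sort_by(|a, b| a.0.cmp(&b.0));
    plan
}
```

The three actions correspond to the indexed:, fresh:, and removed: prefixes printed by --verbose.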
09 Query guide: basics
A complete reference for writing effective S-expression queries.
Node types
Every language has a set of named node types defined by its Tree-sitter grammar. The node type is the first element of an S-expression. Common node types vary by language:
| Language | Functions | Classes / Types | Variables |
|---|---|---|---|
| Rust | function_item | struct_item, impl_item | let_declaration |
| Python | function_definition | class_definition | assignment |
| JavaScript | function_declaration | class_declaration | variable_declaration |
| TypeScript | function_declaration | interface_declaration, type_alias_declaration | lexical_declaration |
| Go | function_declaration | type_declaration | short_var_declaration |
| C / C++ | function_definition | struct_specifier, class_specifier | declaration |
Field names
Named fields constrain which child to match. The syntax is field_name: (child_type):
(function_item name: (identifier) @fn parameters: (parameters) @params return_type: (_) @ret)
Nested patterns
Patterns can nest to arbitrary depth:
(impl_item
  type: (type_identifier) @impl_type
  body: (declaration_list
    (function_item name: (identifier) @method_name)))
10 Query guide: predicates
Predicates filter captures based on their text content. They are evaluated using pre-compiled regex objects, cached at query compile time.
| Predicate | Description | Example |
|---|---|---|
| #eq? | Exact equality match | (#eq? @fn "connect") |
| #match? | Regex match (anchored with ^ and $ as needed) | (#match? @fn "^handle_") |
| #not-eq? | Negative equality | (#not-eq? @fn "main") |
| #any-of? | Match any string in a list | (#any-of? @fn "get" "set" "del") |
Regex predicates are compiled exactly once at query compile time and wrapped in Arc<Regex>. They are passed by reference to all parallel worker threads — zero per-file regex compilation overhead.
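As a rough model of how text predicates filter a capture, the sketch below implements the three non-regex predicates over plain strings. The Predicate enum and passes_all function are hypothetical; #match? is left out because it needs the regex crate, and this is not doora's internal representation.

```rust
// Illustrative model of text predicates; names are hypothetical.
enum Predicate {
    Eq(String),         // #eq?
    NotEq(String),      // #not-eq?
    AnyOf(Vec<String>), // #any-of?
}

impl Predicate {
    /// Returns true when the captured node text passes the predicate.
    fn accepts(&self, captured: &str) -> bool {
        match self {
            Predicate::Eq(want) => captured == want.as_str(),
            Predicate::NotEq(reject) => captured != reject.as_str(),
            Predicate::AnyOf(options) => options.iter().any(|o| o.as_str() == captured),
        }
    }
}

/// A match survives only if every predicate attached to the pattern accepts it.
fn passes_all(predicates: &[Predicate], captured: &str) -> bool {
    predicates.iter().all(|p| p.accepts(captured))
}
```

Because the predicate values are built once and only read afterwards, a slice of them can be shared across worker threads by reference, which is the same property the Arc-wrapped regexes rely on.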
11 Query examples by language
A practical library of common patterns for each supported language.
Rust
(function_item name: (identifier) @fn_name)
Matches every fn declaration: public, private, async, const, unsafe.
.unwrap() call sites
(call_expression
  function: (field_expression
    field: (field_identifier) @m (#eq? @m "unwrap")))
Matches every .unwrap() call regardless of the receiver type.
(struct_item name: (type_identifier) @struct_name)
(impl_item trait: (type_identifier) @trait type: (type_identifier) @impl_type (#eq? @impl_type "MyStruct"))
(function_item name: (identifier) @fn (#match? @fn "^(get|set|update)_"))
Matches function names starting with get_, set_, or update_.
Python
(function_definition name: (identifier) @fn_name)
(decorated_definition decorator: (decorator) @decorator definition: (function_definition name: (identifier) @fn_name))
(class_definition name: (identifier) @class_name)
JavaScript
(function_declaration name: (identifier) @fn_name)
Matches the function foo() {} syntax only. Arrow functions use a different node type.
(class_declaration name: (identifier) @class_name)
(method_definition name: (property_identifier) @method_name)
TypeScript
(interface_declaration name: (type_identifier) @interface_name)
(type_alias_declaration name: (type_identifier) @type_name)
Go
(function_declaration name: (identifier) @fn_name)
(type_declaration (type_spec name: (type_identifier) @type_name))
C / C++
(function_definition
  declarator: (function_declarator
    declarator: (identifier) @fn_name))
(class_specifier name: (type_identifier) @class_name)
(type_definition declarator: (type_identifier) @type_name)
12 Language auto-detection
When --lang auto is used (the default), doora detects the grammar per file from its extension and walks all supported extensions simultaneously.
| Language flag | Extensions walked | Grammar |
|---|---|---|
| rust | .rs | tree-sitter-rust |
| python | .py, .pyi | tree-sitter-python |
| js | .js, .mjs, .cjs | tree-sitter-javascript |
| ts | .ts, .tsx, .mts, .cts | tree-sitter-typescript (tsx variant) |
| go | .go | tree-sitter-go |
| c | .c, .h | tree-sitter-c |
| cpp | .cpp, .cc, .hpp, .hxx, .cxx, .h | tree-sitter-cpp |
In auto mode, .h files are parsed with the C grammar. To parse them as C++, use --lang cpp explicitly.
In auto mode, the query is compiled against every supported language at startup. Languages for which the query fails to compile (because a node type doesn't exist in that grammar) are silently skipped. Only languages where the query compiles successfully are searched. This means a Rust-only query like (function_item) @fn will search only .rs files even in auto mode.
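The extension table can be read as a simple lookup. The sketch below is an assumption about how such a mapping might look; the document names a detect_language function in parser.rs, but its real signature and the Language variants are not shown, so both are hypothetical here.

```rust
use std::path::Path;

// Hypothetical variants; the real Language enum is not shown in this document.
#[derive(Debug, PartialEq, Clone, Copy)]
enum Language {
    Rust,
    Python,
    JavaScript,
    TypeScript,
    Go,
    C,
    Cpp,
}

/// Maps a file extension to a grammar, following the extension table.
/// `.h` resolves to C, matching the documented auto-mode behavior.
fn detect_language(path: &Path) -> Option<Language> {
    let ext = path.extension()?.to_str()?;
    let lang = match ext {
        "rs" => Language::Rust,
        "py" | "pyi" => Language::Python,
        "js" | "mjs" | "cjs" => Language::JavaScript,
        "ts" | "tsx" | "mts" | "cts" => Language::TypeScript,
        "go" => Language::Go,
        "c" | "h" => Language::C,
        "cpp" | "cc" | "hpp" | "hxx" | "cxx" => Language::Cpp,
        _ => return None, // unsupported extension: the walker skips the file
    };
    Some(lang)
}
```

Files with no extension or an unsupported one return None, which corresponds to the walker never visiting them.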
13 Bloom filter index
The index is a pre-parse rejection sieve. Files that mathematically cannot contain the search term are skipped entirely, before tree-sitter is invoked.
How it works
Bloom filter parameters
| Parameter | Value | Rationale |
|---|---|---|
| Bit array size | 4096 bits (512 bytes) | ~1% false positive rate at ~215 trigrams per file, rising to ~6% at ~570, by the standard estimate (1 - e^(-kn/m))^k with k = 2 and m = 4096 |
| Hash functions | 2 (FNV-1a variants) | Sweet spot for our trigram density and bit array size |
| Hash algorithm | FNV-1a 32-bit | Deterministic across process runs (unlike SipHash with random seeds) |
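A minimal sketch of a filter with these parameters follows. The FNV-1a constants are the standard 32-bit ones; how doora derives its second hash variant is not stated here, so the alternate starting basis below is an assumption, as are the Bloom, index_source, and may_contain names.

```rust
const FILTER_BYTES: usize = 512; // 4096 bits
const FILTER_BITS: usize = FILTER_BYTES * 8;
const FNV_PRIME: u32 = 0x0100_0193;
const FNV_OFFSET: u32 = 0x811c_9dc5;
// Arbitrary second seed, chosen only for this sketch.
const FNV_OFFSET_ALT: u32 = 0xcbf2_9ce4;

/// 32-bit FNV-1a over `bytes`, starting from `basis`.
fn fnv1a(bytes: &[u8], basis: u32) -> u32 {
    bytes
        .iter()
        .fold(basis, |hash, &b| (hash ^ u32::from(b)).wrapping_mul(FNV_PRIME))
}

struct Bloom {
    bits: [u8; FILTER_BYTES],
}

impl Bloom {
    fn new() -> Self {
        Bloom { bits: [0; FILTER_BYTES] }
    }

    /// Two bit positions per trigram, one per hash variant.
    fn positions(trigram: &[u8]) -> [usize; 2] {
        let h1 = fnv1a(trigram, FNV_OFFSET) as usize;
        let h2 = fnv1a(trigram, FNV_OFFSET_ALT) as usize;
        [h1 % FILTER_BITS, h2 % FILTER_BITS]
    }

    fn insert(&mut self, trigram: &[u8]) {
        for &pos in Self::positions(trigram).iter() {
            self.bits[pos / 8] |= 1 << (pos % 8);
        }
    }

    fn probably_contains(&self, trigram: &[u8]) -> bool {
        Self::positions(trigram)
            .iter()
            .all(|&pos| (self.bits[pos / 8] & (1 << (pos % 8))) != 0)
    }
}

/// Indexes every 3-byte window of `source`.
fn index_source(source: &[u8]) -> Bloom {
    let mut bloom = Bloom::new();
    for window in source.windows(3) {
        bloom.insert(window);
    }
    bloom
}

/// False means the file can be skipped: some trigram of the literal is
/// definitely absent. Literals shorter than 3 bytes never cause a skip.
fn may_contain(bloom: &Bloom, literal: &[u8]) -> bool {
    literal.windows(3).all(|w| bloom.probably_contains(w))
}
```

A filter built this way can report a trigram that was never inserted (a false positive, which only costs an unnecessary parse) but never misses one that was inserted, so no matching file is skipped.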
14 Semantic rewriting
doora can surgically replace structural patterns without corrupting surrounding syntax. This is fundamentally different from sed — it operates on the AST, not on text.
--rewrite 'new_name' \
-p ./src
--rewrite 'new_name' \
--in-place \
-p ./src
A unified diff is always printed before --in-place writes, unless --yes is passed. Rewrites are applied in reverse byte order to avoid offset invalidation when multiple matches occur in the same file.
The rewrite template replaces the CAPTURED node's text, not the full match. Capture your target precisely to avoid unintended changes.
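The reason for reverse byte order can be shown in a few lines. The apply_rewrites function below is a hypothetical helper operating on plain byte ranges, under the assumption that the ranges are non-overlapping and fall on UTF-8 boundaries; it is a sketch of the technique, not doora's rewriting code.

```rust
use std::ops::Range;

/// Replaces each byte range in `source` with `replacement`.
/// Ranges are assumed to be non-overlapping and on UTF-8 boundaries.
fn apply_rewrites(source: &str, mut ranges: Vec<Range<usize>>, replacement: &str) -> String {
    let mut out = source.to_string();
    // Highest start offset first: editing the tail of the string leaves
    // the offsets of all earlier ranges valid.
    ranges.sort_by(|a, b| b.start.cmp(&a.start));
    for range in ranges {
        out.replace_range(range, replacement);
    }
    out
}
```

If the ranges were applied front to back instead, the first replacement would shift every later offset by the length difference between the old and new text, and later edits would land in the wrong place.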
15 Interactive TUI
Launch with --tui for a split-pane terminal explorer with live streaming results.
│ src/auth/ ││ source_file │
│ handler.rs 3 ││ function_item [42:0 → 58:1] │
│ token.rs 1 ││ name: identifier "parse_token" ← match │
│ src/db/ ││ parameters: parameters │
│ pool.rs 1 ││ parameter: identifier "input" │
└──────────────────┘└───────────────────────────────────────────┘
┌─ Code ─────────────────────────────────────────────────────────┐
│ 42 │ pub fn parse_token(input: &str, ctx: &Context) -> Token { │
└────────────────────────────────────────────────────────────────┘
[/] query [j/k] navigate [Enter] expand node [q] quit
The TUI uses a fully decoupled async event loop (Tokio). Results stream in live as background threads find matches — the terminal never freezes, even on multi-gigabyte repositories.
| Key | Action |
|---|---|
| / | Focus the query input bar |
| j / k or arrows | Navigate the file list |
| Enter | Expand / collapse AST nodes |
| Tab | Switch focus between panes |
| q or Esc | Quit |
16 MCP server mode
doora exposes a Model Context Protocol server so LLM coding agents can query your codebase structurally, reducing hallucinations and wasted context tokens.
Why LLMs need structural search
LLM coding agents have limited context windows. When an agent tries to understand a large codebase by reading files, it burns context tokens on irrelevant content and still has no structural understanding of the architecture.
With the doora MCP server, an agent can ask: "What is the exact type signature of the function handling user authentication?" and receive a precise, structured answer in milliseconds — consuming a fraction of the context tokens that reading source files would require.
Setup
Configure in .mcp.json
Add to your project's MCP configuration file:
{
  "mcpServers": {
    "doora": {
      "command": "doora",
      "args": ["serve", "--mcp"],
      "cwd": "/path/to/your/repo"
    }
  }
}
Available MCP tools
| Tool | Input | Description |
|---|---|---|
| search_ast | query: string | Run a live S-expression structural search and return matching results with file paths and positions. |
| lookup_symbol | name: string | Query the persisted SQLite index for a symbol by name. Returns type signatures, locations, and relationships. |
17 Architecture: pipeline overview
The search pipeline is a linear sequence from CLI arguments to printed results. Each stage is isolated and independently testable.
Key architectural decisions
- Query compiled before walk: Arc<MultiCompiledQuery> is built once and shared across all Rayon threads with zero per-thread recompilation.
- Ephemeral tree lifecycle: The Tree and source bytes are dropped immediately after extract_matches returns. RAM usage stays flat regardless of repository size.
- Single-pass multi-query: Multiple -q flags are merged into one traversal. The AST is walked once per file regardless of query count.
- BitSet pre-filtering: Each node's kind ID is checked against a pre-computed HashSet<u16> before evaluating the full match pattern.
- Bloom filter sieve: Files without the query's required trigrams are skipped entirely before tree-sitter is invoked.
18 Architecture: module reference
The codebase is organized into focused, independently testable modules.
- types.rs: MatchResult, SearchConfig, Language, LangMode, AppError. All shared types live here.
- parser.rs: parse_file with mmap support, get_language, detect_language, FileSource enum.
- query.rs: CompiledQuery with BitSet kind IDs and cached regex predicates. MultiCompiledQuery. Single-pass extract_multi_matches.
- walker.rs: Directory walker built on the ignore crate. Respects .gitignore. build_walker for single language, build_auto_walker for all extensions.
- output.rs: ColorMode, resolve_color_mode (respects NO_COLOR). Writer injection for testability.
- trigram.rs: extract_unique_trigrams, extract_query_trigrams. No heap allocation per trigram.
- bloom.rs: [u8; 512] bit array. insert, probably_contains, serialization.
- Index manifest: IndexEntry, IndexManifest. Atomic save via rename. bincode serialization. Version mismatch detection.
- sieve.rs: QueryTrigramSet, FileIndexStatus, should_parse_file. Multi-query OR semantics.
19 Architecture: concurrency model
The search pipeline is fully parallel using Rayon's work-stealing thread pool. Every design decision was made to maximize throughput while eliminating contention.
Work-stealing thread pool
The directory walker's Iterator is converted to a Rayon parallel iterator via par_bridge(). Rayon distributes file processing across all available cores using a work-stealing scheduler — when a thread exhausts its local queue, it steals work from the end of another thread's queue.
Thread-local parser pool
Tree-sitter Parser initialization is expensive — it allocates C-level state via FFI. Creating and dropping a parser per file would be catastrophic. Instead, each Rayon worker thread owns exactly one Parser via thread_local!:
thread_local! {
    static PARSER: RefCell<Parser> = RefCell::new(Parser::new());
}
The parser is reconfigured for each file via set_language() before parsing — a cheap operation that resets internal state without a new allocation.
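The effect of the thread-local pattern can be observed with a stand-in parser that counts its own constructions. FakeParser, parse_with_local, and parsers_for are hypothetical names, and plain std threads replace Rayon in this sketch; only the one-parser-per-thread lifecycle is being illustrated.

```rust
use std::cell::RefCell;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;

// Counts constructions so that parser reuse is observable.
static PARSERS_CREATED: AtomicUsize = AtomicUsize::new(0);

// Stand-in for tree_sitter::Parser; only the lifecycle matters here.
struct FakeParser {
    language: Option<&'static str>,
}

impl FakeParser {
    fn new() -> Self {
        PARSERS_CREATED.fetch_add(1, Ordering::SeqCst);
        FakeParser { language: None }
    }

    fn set_language(&mut self, lang: &'static str) {
        self.language = Some(lang);
    }

    /// Placeholder "parse": returns the source length once a language is set.
    fn parse(&self, source: &str) -> Option<usize> {
        self.language.map(|_| source.len())
    }
}

thread_local! {
    // Constructed lazily, once per thread, on first access.
    static PARSER: RefCell<FakeParser> = RefCell::new(FakeParser::new());
}

/// Parses one file with the calling thread's parser, reconfiguring it first.
fn parse_with_local(lang: &'static str, source: &str) -> Option<usize> {
    PARSER.with(|cell| {
        let mut parser = cell.borrow_mut();
        parser.set_language(lang);
        parser.parse(source)
    })
}

/// Runs `files_per_thread` parses on each of `threads` threads and returns
/// how many parsers were constructed while they ran.
fn parsers_for(threads: usize, files_per_thread: usize) -> usize {
    let before = PARSERS_CREATED.load(Ordering::SeqCst);
    let handles: Vec<_> = (0..threads)
        .map(|_| {
            thread::spawn(move || {
                for _ in 0..files_per_thread {
                    let _ = parse_with_local("rust", "fn main() {}");
                }
            })
        })
        .collect();
    for handle in handles {
        handle.join().expect("worker thread panicked");
    }
    PARSERS_CREATED.load(Ordering::SeqCst) - before
}
```

Four threads parsing one hundred files each construct four parsers in total, not four hundred.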
Shared state
The following objects are shared across all threads:
- Arc<HashMap<Language, Arc<MultiCompiledQuery>>>: compiled queries, shared by reference
- Arc<Mutex<Vec<MatchResult>>>: result accumulator, locked only for append
- Arc<Mutex<usize>>: file counter
- Arc<IndexManifest>: Bloom filter index, read-only during search
Mutex locks are held only for the append/increment operation — never during parse or query work. The expensive CPU work (parsing, DFS traversal) happens entirely outside any lock.
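The locking discipline can be sketched with standard-library threads. In this example find_matches is a hypothetical stand-in for the parse and query step, and static striding over the file list replaces Rayon's work stealing; the point is only where the locks are taken.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

/// Stand-in for the expensive parse + query step. Runs with no lock held.
fn find_matches(file: &str) -> Vec<String> {
    file.split_whitespace()
        .filter(|word| word.starts_with("fn_"))
        .map(String::from)
        .collect()
}

/// Processes `files` on `workers` threads and returns the sorted matches
/// together with the number of files processed.
fn search(files: Vec<String>, workers: usize) -> (Vec<String>, usize) {
    let workers = workers.max(1);
    let files = Arc::new(files);
    let results = Arc::new(Mutex::new(Vec::<String>::new()));
    let file_count = Arc::new(Mutex::new(0usize));

    let handles: Vec<_> = (0..workers)
        .map(|w| {
            let files = Arc::clone(&files);
            let results = Arc::clone(&results);
            let file_count = Arc::clone(&file_count);
            thread::spawn(move || {
                // Worker w takes files w, w + workers, w + 2 * workers, ...
                for file in files.iter().skip(w).step_by(workers) {
                    let local = find_matches(file); // CPU work, unlocked
                    results.lock().expect("results lock poisoned").extend(local);
                    *file_count.lock().expect("counter lock poisoned") += 1;
                }
            })
        })
        .collect();
    for handle in handles {
        handle.join().expect("worker thread panicked");
    }

    let mut all = results.lock().expect("results lock poisoned").clone();
    all.sort();
    let count = *file_count.lock().expect("counter lock poisoned");
    (all, count)
}
```

Each lock guard is a temporary that is released at the end of its statement, so a slow file on one thread never blocks another thread from parsing.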
20 Architecture: memory model
doora maintains a flat RAM profile regardless of repository size by enforcing an ephemeral tree lifecycle.
Ephemeral lifecycle
Text search needs only a small buffer in memory per file. AST-based search is different: parsing a file produces a large C-allocated structure with potentially tens of thousands of nodes. If every tree were retained simultaneously, memory would grow with repository size, and a 10,000-file repository could require tens of gigabytes of RAM.
The solution is the ephemeral lifecycle. In the parallel closure, the drop order is explicit and mandatory:
let (tree, source) = parse_file(entry.path(), &lang)?;
let matches = extract_multi_matches(&tree, source.as_bytes(), &multi, path);
drop(tree);   // C-allocated CST freed immediately
drop(source); // source bytes freed immediately
// Only the small Vec<MatchResult> persists
Memory-mapped files
For files of 1MB or larger, parse_file uses memmap2 instead of fs::read_to_string. The OS pages in only the bytes actually accessed, backed by the file's page cache rather than a heap allocation:
// Small files: heap allocation
FileSource::Heap(String)
// Large files (≥ 1MB): memory-mapped
FileSource::Mapped(memmap2::Mmap)
Maximum RAM usage
RAM usage is bounded by:
- Number of active Rayon worker threads × (one tree + one source buffer)
- The final accumulated Vec<MatchResult> (small owned strings only)
- The pre-computed CompiledQuery objects (one per language)
- The IndexManifest if the index is loaded (512 bytes per file entry)
21 Performance
doora is engineered for sub-second query latency on repositories with millions of lines of code.
Benchmark results
| Benchmark | Result | Notes |
|---|---|---|
| Single file parse + query (100 functions) | ~180µs | On Apple M2, release build |
| 10,000-file Rust repository | <1000ms | Target; actual ~380ms on modern 8-core |
| Parallel search, 100 files, 20 functions each | ~45ms | Rayon, 8 cores |
| Query compilation (single, Rust grammar) | ~12µs | Includes BitSet and regex pre-compilation |
| Bloom filter rejection check | <3µs | Per file, two FNV-1a hash evaluations |
Optimization stack
The following optimizations are layered from outermost to innermost:
- Bloom filter sieve: Skip parsing files that cannot contain the search term. Eliminates tree-sitter invocation entirely for filtered files.
- BitSet kind filtering: Pre-compute which node kind IDs could match the query. Skip match_node evaluation for the vast majority of tree nodes.
- Pre-compiled regex predicates: Compile #match? regex patterns exactly once per query, share via Arc<Regex>. Zero per-file regex compilation.
- Single-pass multi-query automaton: Traverse the AST exactly once per file regardless of how many -q flags were passed.
- Thread-local parser pool: Each Rayon thread owns one Parser for its entire lifetime. Zero per-file parser allocation.
- Memory-mapped file reading: Files ≥ 1MB use mmap to avoid heap allocation of the source string.
- Work-stealing parallelism: Rayon's lock-free scheduler saturates all CPU cores with no idle threads.
Profiling
To profile on your hardware:
# Install flamegraph
cargo install flamegraph
# Run profiler against the Rust compiler source
make profile
# Or manually
cargo flamegraph --bin doora -- \
-q '(function_item name: (identifier) @fn_name)' \
-p /path/to/large/repo --lang rust --no-color --quiet
See PROFILING.md for detailed instructions and interpretation guidance.
22 Building from source
Requires Rust 1.78 or later (MSRV). All grammar crates are compiled from source as part of the Cargo build.
# Debug build
cargo build
# Release build (recommended for performance testing)
cargo build --release
# Check only (fast compilation check)
cargo check --all-features
Dependencies
| Crate | Version | Purpose |
|---|---|---|
| tree-sitter | 0.22 | Incremental parsing library core |
| tree-sitter-rust/python/javascript/typescript/go/c/cpp | 0.21 | Language grammars |
| rayon | 1.10 | Work-stealing parallel iterator |
| ignore | 0.4 | Gitignore-aware directory walker |
| clap | 4 | CLI argument parsing |
| clap_complete | 4 | Shell completion generation |
| thiserror | 1 | Error type derivation |
| regex | 1 | Pre-compiled regex for #match? predicates |
| bincode | 2 | Binary serialization for the index |
| serde | 1 | Serialization derive macros |
| memmap2 | 0.9 | Memory-mapped file reading |
23 Testing
The test suite covers unit tests, integration tests, and statistical Bloom filter correctness tests.
# Run all tests
cargo test
# Run tests for a specific module
cargo test --lib parser
cargo test --lib query
cargo test --lib walker
cargo test --lib bloom
# Run integration tests
cargo test --test integration_test -- --nocapture
# Run Bloom filter statistical tests
cargo test --test bloom_statistics -- --nocapture
# Run binary tests (CLI validation, stats)
cargo test --bin doora
Test inventory
| Module | Test count | Coverage focus |
|---|---|---|
| types.rs | 7 | MatchResult derives, AppError display, sort order |
| parser.rs | 20+ | Thread-local pool, mmap vs heap, all 7 languages |
| query.rs | 25+ | Query compilation, BitSet filtering, regex predicates, Arc sharing |
| walker.rs | 30+ | Extension filtering, gitignore, excluded dirs, binary detection |
| output.rs | 31 | Color output, format correctness, writer injection |
| bloom.rs | 15+ | Zero false negatives, bit operations, FNV-1a determinism |
| trigram.rs | 20+ | Sliding window, UTF-8 byte handling, deduplication |
| sieve.rs | 18+ | Multi-query OR semantics, staleness detection, end-to-end |
| integration_test.rs | 56+ | End-to-end pipeline across all 7 languages |
| bloom_statistics.rs | 8 | Zero false negatives across 1000 files, FPR measurement |
Fixture files
Integration tests use fixture files in tests/fixtures/:
- simple.rs: Rust functions and types
- multi_fn.rs: Multiple functions for ordering tests
- structs.rs: Struct definitions
- nested.rs: Nested module functions
- empty.rs: Empty file (parse error path)
- simple.py: Python functions
- simple.js: JavaScript functions and classes
- simple.ts: TypeScript interfaces and types
- simple.go: Go functions and structs
- simple.c: C functions and typedefs
- simple.cpp: C++ classes and templates
24 Benchmarks
Criterion benchmarks in benches/search_benchmarks.rs. Run automatically in CI with regression detection.
# Run all benchmarks
cargo bench --bench search_benchmarks
# Run as tests (fast, no timing)
cargo bench --bench search_benchmarks -- --test
# Save baseline for comparison
cargo bench -- --save-baseline main
# Compare against baseline
cargo bench -- --baseline main
Benchmark suite
| Benchmark group | What it measures |
|---|---|
| compile_single_query_rust | Query compilation time including BitSet + regex pre-compilation |
| compile_five_queries_rust | Multi-query compilation overhead |
| single_file/functions/N | Parse + query throughput for files with 10, 100, 500, 1000 functions |
| heap_read vs mmap_read | File reading strategy comparison |
| one_query vs five_queries | Single-pass multi-query overhead |
| parallel_100_files_20fn_each | Full parallel pipeline with result accumulation |
| compile_universal_query_all_languages | Auto mode query compilation across all 7 grammars |
CI fails on regressions exceeding 10% for parallel benchmarks and 2% for single-file benchmarks. Results are uploaded as GitHub Actions artifacts with HTML reports from Criterion.
25 Contributing
Contributions are welcome. The project is tracked issue-by-issue with milestones for each subsystem.
Before opening a PR
# Format
cargo fmt
# Lint (zero warnings enforced)
cargo clippy --all-features -- -D warnings
# Full test suite
cargo test --all-features
All three must pass cleanly. The CI pipeline enforces #![deny(warnings)] and #![warn(clippy::pedantic)] across the entire codebase.
Code style
- No inline comments, no doc comments, no block comments anywhere in source files
- All public items must have doc comments (enforced by cargo doc --no-deps with zero warnings)
- Use Result<T> (the crate's type alias for Result<T, AppError>) for all fallible functions
- Never use unwrap() in production code; use it only in tests, with an explanatory comment
- Explicit drop() calls are mandatory after tree and source are no longer needed
Adding a new language
- Add tree-sitter-<lang> to Cargo.toml
- Add a variant to the Language enum in types.rs
- Add extension mapping in walker.rs::extensions_for_language
- Add grammar arm in parser.rs::get_language
- Add detection arm in parser.rs::detect_language
- Add variant in parser.rs::get_all_languages
- Add lang string to main.rs::resolve_lang and validate
- Add fixture file to tests/fixtures/
- Add integration tests following the pattern of existing language tests
License
MIT