doora — Documentation(Maps)

01 Overview

doora is a command-line tool that parses source files into Abstract Syntax Trees and executes structural pattern queries against them. It is fundamentally different from text search: it understands the grammar of your code, not just its characters.

Every text-based search tool — grep, ripgrep, ack — treats source code as a string. They cannot tell the difference between a function named authenticate and a comment that mentions authenticate. They cannot find "all functions that take exactly two arguments" or "all unwrap() calls outside test modules."

doora answers queries like these across millions of lines of code, using Tree-sitter's incremental parsing library at its core. The benchmark figures in the Performance section put a 10,000-file repository at well under one second.

Structural patterns

Find code by its syntactic shape using Tree-sitter S-expression queries.

Parallel processing

Lock-free work-stealing thread pool via Rayon. Scales to all available cores.

7 languages

Rust, Python, JavaScript, TypeScript, Go, C, and C++ with auto-detection.

Bloom filter index

Pre-reject files that cannot possibly match before invoking the parser.

Semantic rewriting

Replace structural patterns without corrupting surrounding syntax.

MCP server

Expose codebase structure to LLM agents via the Model Context Protocol.

02 Installation

doora is distributed as a single static binary. Install via Cargo, download a pre-built binary, or build from source.

From crates.io

# Requires Rust 1.78+
cargo install doora

Pre-built binaries

Download from the Releases page. Binaries are available for:

Platform	Architecture	Binary
Linux	x86_64	`doora-x86_64-unknown-linux-gnu`
Linux	aarch64	`doora-aarch64-unknown-linux-gnu`
macOS	x86_64	`doora-x86_64-apple-darwin`
macOS	Apple Silicon	`doora-aarch64-apple-darwin`
Windows	x86_64	`doora-x86_64-pc-windows-msvc.exe`

From source

git clone https://github.com/backpack-lab/doora
cd doora
cargo build --release
# Binary at ./target/release/doora

Shell completions

# Bash
doora --generate-completions bash >> ~/.bashrc

# Zsh
doora --generate-completions zsh >> ~/.zshrc

# Fish
doora --generate-completions fish > ~/.config/fish/completions/doora.fish

03 Quick start

Run your first structural query in under 60 seconds.

Find all function definitions

terminal

doora -q '(function_item name: (identifier) @fn_name)' -p ./src

src/auth/handler.rs:42:0 [@fn_name] "parse_token"
src/auth/handler.rs:89:0 [@fn_name] "validate_session"
src/db/pool.rs:14:0 [@fn_name] "connect"
...

Found 47 matches across 23 files in 38ms

Find a specific function by name

doora \
  -q '(function_item name: (identifier) @fn (#eq? @fn "connect"))' \
  -p ./src

Search multiple languages at once

doora \
  -q '(function_declaration name: (identifier) @fn_name)' \
  -p ./src
# auto-detects language per file

Multiple queries in one pass

doora \
  -q '(function_item name: (identifier) @fn_name)' \
  -q '(struct_item name: (type_identifier) @struct_name)' \
  -p ./src --no-color

Launch the interactive TUI

doora -q '(function_item)' -p . --tui

Build a search index for faster queries

doora index ./src --verbose
# Subsequent searches use Bloom filter pre-rejection

04 Why structural search

Text search has fundamental limitations that no amount of regex cleverness can overcome. Structural search solves them categorically.

Query	grep / ripgrep	doora
Find function definitions named `auth_user`	Returns ALL occurrences — function defs, variable names, comments, string literals, dead code	Returns only function definition nodes named `auth_user`
Find functions taking exactly 2 arguments	Cannot be expressed reliably in regex	Trivial: query the parameter list node child count
Find all `unwrap()` outside test modules	Cannot express scope constraints	Query for call sites with scope predicates
Find struct definitions that implement a trait	Multi-step, fragile, many false positives	Single S-expression query
Rename a function everywhere it is defined	Risk of corrupting string literals and comments	Semantic rewriting via AST: only actual definition nodes

The key insight

doora is to grep what a SQL database is to a flat text file. Both store the same data; one understands its structure.

05 How AST search works

Source code has grammar. Every language defines how tokens nest into expressions, expressions into statements, and statements into functions and modules. doora makes this grammar queryable.

From bytes to tree

Consider this Rust function:

fn authenticate(user: &str, password: &str) -> bool {
    true
}

Tree-sitter parses this into a Concrete Syntax Tree (CST):

Syntax tree structure

source_file
  └─ function_item
      ├─ name: identifier "authenticate"
      ├─ parameters: parameters
      │      ├─ parameter
      │      │      ├─ pattern: identifier "user"
      │      │      └─ type: reference_type
      │      └─ parameter
      │            ├─ pattern: identifier "password"
      │            └─ type: reference_type
      ├─ return_type: primitive_type "bool"
      └─ body: block

An S-expression query navigates this tree structure. The query (function_item name: (identifier) @fn) matches any function_item node that has a name child of type identifier, and captures that identifier as @fn.

CST vs AST

Traditional compilers produce Abstract Syntax Trees (ASTs) that discard whitespace, comments, and punctuation. doora uses Concrete Syntax Trees (CSTs) that retain 100% fidelity to the original source — including exact byte offsets for every token. This is required for:

Accurately mapping matches back to line and column numbers for display
Semantic rewriting that preserves the user's original formatting
Syntax highlighting driven by the tree itself rather than regex
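The line and column mapping mentioned in the list above can be derived from byte offsets alone. The following is a minimal sketch, not doora's implementation: the function name `line_col` is hypothetical, and it assumes 1-indexed lines and 0-indexed byte columns, which is the convention this documentation gives for search output.

```rust
/// Convert a byte offset into a (line, column) pair.
/// Lines are 1-indexed; columns are 0-indexed byte offsets within the line.
/// `line_col` is a hypothetical helper used only for illustration.
pub fn line_col(source: &[u8], offset: usize) -> (usize, usize) {
    // Clamp so an out-of-range offset maps to the end of the source.
    let offset = offset.min(source.len());
    let before = &source[..offset];
    // Every newline before the offset starts a new line.
    let line = 1 + before.iter().filter(|&&b| b == b'\n').count();
    // The column is the distance from the last newline, or from the start.
    let col = match before.iter().rposition(|&b| b == b'\n') {
        Some(newline) => offset - newline - 1,
        None => offset,
    };
    (line, col)
}
```

Because a syntax tree that keeps exact byte offsets never loses this information, the mapping needs nothing beyond the original source bytes.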

06 Query syntax overview

Queries use Tree-sitter's S-expression pattern syntax — a Lisp-like notation that mirrors the shape of the syntax tree.

Basic node match

(node_type)

Matches any node of the given type anywhere in the tree. Node type names are grammar-specific — see the language sections for common types.

Named capture

(function_item name: (identifier) @my_capture)

The @my_capture syntax tags a sub-node for extraction. Captured text appears in the output. Multiple captures per query are supported.

Equality predicate

(function_item
  name: (identifier) @fn
  (#eq? @fn "connect"))

Matches only when the captured node's text equals "connect" exactly.

Regex predicate

(function_item
  name: (identifier) @fn
  (#match? @fn "^(get|set|update)_"))

Matches function names starting with get_, set_, or update_.

Wildcard child

(call_expression
  function: (identifier) @callee
  arguments: (arguments . (_) .))

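The (_) wildcard matches any single named node. The . anchors pin that node to the first and last child positions, so this pattern matches calls with exactly one argument.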

Multi-query (single pass)

doora \
  -q '(function_item name: (identifier) @fn_name)' \
  -q '(struct_item name: (type_identifier) @struct_name)' \
  -p ./src

Multiple -q flags are compiled into a single-pass automaton. Each file's syntax tree is traversed exactly once regardless of query count.

07 CLI — search command

The default command. When no subcommand is given, doora runs a structural search.

usage

doora [search] [OPTIONS] --query <S-EXPR>

Arguments

Flag	Type	Default	Description
-q, --query <S-EXPR>	String (repeatable)	required	S-expression query pattern. Pass multiple times for multi-query single-pass search.
-p, --path <DIR>	PathBuf	"."	Root directory to search. Must be an existing directory.
-l, --lang <LANG>	String	"auto"	Language to parse. One of: `rust`, `python`, `js`, `ts`, `go`, `c`, `cpp`, `auto`.
--no-color	bool	false	Disable ANSI color output. Also honored via the `NO_COLOR` environment variable.
-Q, --quiet	bool	false	Suppress per-match result lines. Only the summary line is printed.
--stats	bool	false	Print detailed performance diagnostics to stderr after the search completes.
--no-update-index	bool	false	Disable automatic incremental index updates during search.

Output format

stdout — one line per match

src/auth/handler.rs:42:0 [@fn_name] "parse_token"

Fields: filepath:line:col [@capture_name] "matched_text"

Filepath is colorized cyan (unless --no-color)
Capture name is colorized yellow
Matched text is colorized green, always wrapped in literal quotes
Line numbers are 1-indexed; columns are 0-indexed byte offsets
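For downstream tooling, a result line in this format can be split back into its fields. The sketch below assumes exactly the layout shown above; `parse_match` is a hypothetical helper and not part of doora.

```rust
/// Split one result line of the form
/// `filepath:line:col [@capture_name] "matched_text"` into its fields.
/// Returns (path, line, column, capture name, matched text).
/// `parse_match` is a hypothetical helper for illustration only.
pub fn parse_match(line: &str) -> Option<(String, usize, usize, String, String)> {
    // The location ends where the capture tag begins.
    let (location, rest) = line.split_once(" [@")?;
    let (capture, quoted) = rest.split_once("] ")?;
    // Matched text is always wrapped in literal quotes.
    let text = quoted.strip_prefix('"')?.strip_suffix('"')?;
    // Split from the right so a path that contains ':' stays intact.
    let mut parts = location.rsplitn(3, ':');
    let col: usize = parts.next()?.parse().ok()?;
    let line_no: usize = parts.next()?.parse().ok()?;
    let path = parts.next()?;
    Some((
        path.to_string(),
        line_no,
        col,
        capture.to_string(),
        text.to_string(),
    ))
}
```

Run with --no-color (or NO_COLOR set) when piping into a parser like this, so that ANSI escape sequences do not end up inside the fields.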

stderr — summary line

Found 47 matches across 23 files in 38ms

Stats output

stderr — with --stats flag

--- search statistics ---
files walked: 47
files parsed: 46
files skipped: 1
matches found: 12
sieve rejected: 18
match rate: 26.09% (files with matches / files parsed)
wall time: 38ms
throughput: 1236.84 files/sec
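The two derived figures in the sample can be reproduced from the raw counters. This sketch assumes that match rate is files with matches divided by files parsed (as the label states) and that throughput is files walked divided by wall time; the function names are illustrative, not doora's.

```rust
/// Percentage of parsed files that produced at least one match.
pub fn match_rate_percent(files_with_matches: u64, files_parsed: u64) -> f64 {
    if files_parsed == 0 {
        return 0.0; // avoid dividing by zero on an empty search
    }
    files_with_matches as f64 / files_parsed as f64 * 100.0
}

/// Files walked per second of wall time.
pub fn throughput_files_per_sec(files_walked: u64, wall_time_ms: u64) -> f64 {
    if wall_time_ms == 0 {
        return 0.0; // a sub-millisecond run has no meaningful rate here
    }
    files_walked as f64 / (wall_time_ms as f64 / 1000.0)
}
```

With the sample's values, 12 of 46 parsed files gives 26.09%, and 47 files in 38 ms gives 1236.84 files/sec, matching the output above.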

08 CLI — index command

Explicitly builds or updates the Bloom filter index for a directory. Once indexed, search queries with string literals run significantly faster.

usage

doora index <PATH> [OPTIONS]

Flag	Type	Default	Description
<PATH>	PathBuf	required	Root directory to index.
--lang <LANG>	String	"auto"	Language filter. Defaults to auto-detect all supported extensions.
--verbose	bool	false	Print one line per file: `indexed:`, `fresh:`, or `removed:` prefix.

example output (verbose)

indexed: src/auth/handler.rs
indexed: src/db/pool.rs
fresh: src/main.rs
removed: src/old/legacy.rs

indexed 44 files, skipped 2 fresh, removed 1 stale entries
index written to .doora-index

The index is stored at <PATH>/.doora-index as a binary file serialized with bincode. It is a manifest of per-file Bloom filter entries, each containing:

Absolute file path
Last-modified timestamp (seconds since Unix epoch)
File size in bytes
512-byte Bloom filter (4096 bits, two FNV-1a hash functions)
Detected language string

Incremental updates

Both the explicit index command and the search command perform incremental updates. Files whose mtime and size match the stored entry are skipped. New and modified files are re-indexed. Deleted files are removed from the manifest automatically.
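The freshness rule can be sketched as a comparison of two maps, one holding the stored entries and one holding what is on disk now. This is an illustration under stated assumptions: `Stamp`, `Action`, and `plan_update` are hypothetical names, and the real index stores more per entry than the two fields compared here.

```rust
use std::collections::HashMap;

/// The two fields compared to decide whether an entry is still fresh.
#[derive(Debug, Clone, PartialEq)]
pub struct Stamp {
    pub mtime_secs: u64,
    pub size: u64,
}

/// Mirrors the `indexed:`, `fresh:`, and `removed:` prefixes of verbose output.
#[derive(Debug, PartialEq)]
pub enum Action {
    Indexed,
    Fresh,
    Removed,
}

/// Decide what to do with every path, given stored and current stamps.
pub fn plan_update(
    stored: &HashMap<String, Stamp>,
    on_disk: &HashMap<String, Stamp>,
) -> Vec<(String, Action)> {
    let mut plan = Vec::new();
    for (path, stamp) in on_disk {
        let action = match stored.get(path) {
            // Same mtime and size: skip the file.
            Some(old) if old == stamp => Action::Fresh,
            // New or modified: (re-)index it.
            _ => Action::Indexed,
        };
        plan.push((path.clone(), action));
    }
    // Anything stored but no longer on disk is dropped from the manifest.
    for path in stored.keys() {
        if !on_disk.contains_key(path) {
            plan.push((path.clone(), Action::Removed));
        }
    }
    plan.sort_by(|a, b| a.0.cmp(&b.0));
    plan
}
```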

09 Query guide — basics

A complete reference for writing effective S-expression queries.

Node types

Every language has a set of named node types defined by its Tree-sitter grammar. The node type is the first element of an S-expression. Common node types vary by language:

Language	Functions	Classes / Types	Variables
Rust	`function_item`	`struct_item`, `impl_item`	`let_declaration`
Python	`function_definition`	`class_definition`	`assignment`
JavaScript	`function_declaration`	`class_declaration`	`variable_declaration`
TypeScript	`function_declaration`	`interface_declaration`, `type_alias_declaration`	`lexical_declaration`
Go	`function_declaration`	`type_declaration`	`short_var_declaration`
C / C++	`function_definition`	`struct_specifier`, `class_specifier`	`declaration`

Field names

Named fields constrain which child to match. The syntax is field_name: (child_type):

(function_item
  name: (identifier) @fn
  parameters: (parameters) @params
  return_type: (_) @ret)

Nested patterns

Patterns can nest to arbitrary depth:

(impl_item
  type: (type_identifier) @impl_type
  body: (declaration_list
    (function_item name: (identifier) @method_name)))

10 Query guide — predicates

Predicates filter captures based on their text content. Regex patterns used by predicates are compiled once, at query compile time, and cached.

Predicate	Description	Example
`#eq?`	Exact equality match	`(#eq? @fn "connect")`
`#match?`	Regex match (unanchored unless the pattern uses ^ or $)	`(#match? @fn "^handle_")`
`#not-eq?`	Negative equality	`(#not-eq? @fn "main")`
`#any-of?`	Match any string in a list	`(#any-of? @fn "get" "set" "del")`

Performance note

Regex predicates are compiled exactly once at query compile time and wrapped in Arc<Regex>. They are passed by reference to all parallel worker threads — zero per-file regex compilation overhead.
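The non-regex predicates in the table reduce to plain string comparisons on the captured text. The sketch below is a simplified model, not doora's evaluator: the `Predicate` enum and `passes_all` are hypothetical, and `#match?` is left out because it would need the regex crate while this example stays within the standard library.

```rust
/// A text predicate attached to a capture.
/// Hypothetical model of `#eq?`, `#not-eq?`, and `#any-of?`.
pub enum Predicate {
    Eq(String),
    NotEq(String),
    AnyOf(Vec<String>),
}

impl Predicate {
    /// Check the predicate against the text of a captured node.
    pub fn accepts(&self, captured: &str) -> bool {
        match self {
            Predicate::Eq(expected) => captured == expected.as_str(),
            Predicate::NotEq(rejected) => captured != rejected.as_str(),
            Predicate::AnyOf(options) => options.iter().any(|o| o.as_str() == captured),
        }
    }
}

/// A match survives only if every predicate on the capture accepts it.
pub fn passes_all(predicates: &[Predicate], captured: &str) -> bool {
    predicates.iter().all(|p| p.accepts(captured))
}
```

Building the predicate values once and sharing them read-only across threads is the same idea the performance note describes for compiled regexes.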

11 Query examples by language

A practical library of common patterns for each supported language.

Rust

rust: All function definitions

(function_item name: (identifier) @fn_name)

Matches every fn declaration — public, private, async, const, unsafe.

rust: All unwrap() call sites

(call_expression
  function: (field_expression
    field: (field_identifier) @m (#eq? @m "unwrap")))

Finds every .unwrap() call regardless of the receiver type.

rust: Struct definitions

(struct_item name: (type_identifier) @struct_name)

rust: Trait implementations for a specific type

(impl_item
  trait: (type_identifier) @trait
  type: (type_identifier) @impl_type
  (#eq? @impl_type "MyStruct"))

rust: Functions matching a naming pattern

(function_item
  name: (identifier) @fn
  (#match? @fn "^(get|set|update)_"))

Finds all functions whose names start with get_, set_, or update_.

Python

python: Function definitions

(function_definition name: (identifier) @fn_name)

python: Decorated functions

(decorated_definition
  (decorator) @decorator
  definition: (function_definition name: (identifier) @fn_name))

Captures both the decorator and function name.

python: Class definitions

(class_definition name: (identifier) @class_name)

JavaScript

js: Function declarations

(function_declaration name: (identifier) @fn_name)

Matches function foo() {} syntax only. Arrow functions use a different node type.

js: Class declarations

(class_declaration name: (identifier) @class_name)

js: Method definitions

(method_definition name: (property_identifier) @method_name)

TypeScript

ts: Interface declarations

(interface_declaration name: (type_identifier) @interface_name)

ts: Type aliases

(type_alias_declaration name: (type_identifier) @type_name)

Go

go: Function declarations (not methods)

(function_declaration name: (identifier) @fn_name)

go: Struct type declarations

(type_declaration
  (type_spec
    name: (type_identifier) @type_name
    type: (struct_type)))

C / C++

c: Function definitions

(function_definition
  declarator: (function_declarator
    declarator: (identifier) @fn_name))

C and C++ function names are nested inside a declarator chain.

cpp: Class declarations

(class_specifier name: (type_identifier) @class_name)

c: Typedef names

(type_definition declarator: (type_identifier) @type_name)

12 Language auto-detection

When --lang auto is used (the default), doora detects the grammar per file from its extension and walks all supported extensions simultaneously.

Language flag	Extensions walked	Grammar
`rust`	`.rs`	tree-sitter-rust
`python`	`.py`, `.pyi`	tree-sitter-python
`js`	`.js`, `.mjs`, `.cjs`	tree-sitter-javascript
`ts`	`.ts`, `.tsx`, `.mts`, `.cts`	tree-sitter-typescript (tsx variant)
`go`	`.go`	tree-sitter-go
`c`	`.c`, `.h`	tree-sitter-c
`cpp`	`.cpp`, `.cc`, `.hpp`, `.hxx`, `.cxx`, `.h`	tree-sitter-cpp

.h file disambiguation

In auto mode, .h files are parsed with the C grammar. To parse them as C++, use --lang cpp explicitly.

In auto mode, the query is compiled against every supported language at startup. Languages for which the query fails to compile (because a node type doesn't exist in that grammar) are silently skipped. Only languages where the query compiles successfully are searched. This means a Rust-only query like (function_item) @fn will search only .rs files even in auto mode.
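The extension table and the .h rule above can be sketched as a single lookup. This is an illustration only: the `Lang` enum and `detect` function are hypothetical names chosen for the example, and the sketch covers auto mode, where .h resolves to C.

```rust
use std::path::Path;

/// The seven supported languages (hypothetical enum for this sketch).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Lang {
    Rust,
    Python,
    JavaScript,
    TypeScript,
    Go,
    C,
    Cpp,
}

/// Map a file path to a language by extension, following the table above.
/// Returns None for files that are not walked at all.
pub fn detect(path: &Path) -> Option<Lang> {
    let ext = path.extension()?.to_str()?;
    let lang = match ext {
        "rs" => Lang::Rust,
        "py" | "pyi" => Lang::Python,
        "js" | "mjs" | "cjs" => Lang::JavaScript,
        "ts" | "tsx" | "mts" | "cts" => Lang::TypeScript,
        "go" => Lang::Go,
        // In auto mode, headers are treated as C.
        "c" | "h" => Lang::C,
        "cpp" | "cc" | "hpp" | "hxx" | "cxx" => Lang::Cpp,
        _ => return None,
    };
    Some(lang)
}
```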

13 Bloom filter index

The index is a pre-parse rejection sieve. Files that mathematically cannot contain the search term are skipped entirely, before tree-sitter is invoked.

How it works

01

Trigram extraction

Every file's source bytes are broken into all consecutive 3-byte windows. "hello" → [hel, ell, llo]. Unique trigrams are inserted into a per-file Bloom filter.

02

Bloom filter construction

A 4096-bit (512-byte) bit array with two FNV-1a hash functions. Each trigram sets 2 bits. The filter is serialized to the index manifest.

03

Query trigram extraction

At search time, string literals in predicates (#eq?, #match?) are decomposed into trigrams. "authenticate" → [aut, uth, the, hen, ent, nti, tic, ica, cat, ate].

04

Pre-parse rejection

Before invoking tree-sitter on a file, check if all query trigrams are present in the file's filter. If any is absent, skip the file entirely. Under 0.003ms per rejection.

05

Zero false negatives guaranteed

A file containing the search term ALWAYS passes the filter. False positives (files that pass but don't match) result in unnecessary parsing — not missed results.
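The five steps above can be sketched end to end with the standard library. This is a minimal illustration, not doora's code: the second hash function is assumed to be FNV-1a started from a different basis (the constant `BASIS_B` is invented for the example), and the names `Bloom`, `index_file`, and `may_contain` are hypothetical.

```rust
const BITS: u32 = 4096;
const FNV_PRIME: u32 = 0x0100_0193;
const BASIS_A: u32 = 0x811c_9dc5; // standard FNV-1a 32-bit offset basis
const BASIS_B: u32 = 0x811c_9dc5 ^ 0x5bd1_e995; // assumed second seed, illustrative

/// FNV-1a 32-bit, parameterised by its starting basis.
fn fnv1a(bytes: &[u8], basis: u32) -> u32 {
    let mut hash = basis;
    for &b in bytes {
        hash ^= u32::from(b);
        hash = hash.wrapping_mul(FNV_PRIME);
    }
    hash
}

/// 4096-bit filter stored in a fixed 512-byte array.
pub struct Bloom {
    bits: [u8; 512],
}

impl Bloom {
    pub fn new() -> Self {
        Bloom { bits: [0; 512] }
    }

    /// The two bit positions a trigram maps to.
    fn positions(trigram: &[u8]) -> [usize; 2] {
        [
            (fnv1a(trigram, BASIS_A) % BITS) as usize,
            (fnv1a(trigram, BASIS_B) % BITS) as usize,
        ]
    }

    pub fn insert(&mut self, trigram: &[u8]) {
        for pos in Self::positions(trigram) {
            self.bits[pos / 8] |= 1 << (pos % 8);
        }
    }

    pub fn probably_contains(&self, trigram: &[u8]) -> bool {
        Self::positions(trigram)
            .iter()
            .all(|&pos| self.bits[pos / 8] & (1 << (pos % 8)) != 0)
    }
}

/// Steps 01 and 02: insert every 3-byte window of the file.
pub fn index_file(source: &[u8]) -> Bloom {
    let mut bloom = Bloom::new();
    for trigram in source.windows(3) {
        bloom.insert(trigram);
    }
    bloom
}

/// Steps 03 and 04: the file can be skipped unless every trigram of the
/// literal is present. Literals shorter than 3 bytes can never reject.
pub fn may_contain(bloom: &Bloom, literal: &[u8]) -> bool {
    literal.windows(3).all(|trigram| bloom.probably_contains(trigram))
}
```

Step 05 follows from the construction: every trigram of a literal that occurs in the file was inserted at index time, so `may_contain` cannot return false for that file.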

Bloom filter parameters

Parameter	Value	Rationale
Bit array size	4096 bits (512 bytes)	Per-trigram false positive rate of (1 − e^(−2n/4096))² for n distinct trigrams: ~1% at ~215 trigrams per file, ~6% at ~570
Hash functions	2 (FNV-1a variants)	Sweet spot for our trigram density and bit array size
Hash algorithm	FNV-1a 32-bit	Deterministic across process runs (unlike SipHash with random seeds)
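How full a 4096-bit filter gets depends on how many distinct trigrams a file contains. The standard approximation for a single lookup is (1 − e^(−kn/m))^k, with m bits, k hash functions, and n inserted items. The helper below is a small sketch for evaluating it; the function name is illustrative.

```rust
/// Approximate false positive rate of a Bloom filter for one lookup:
/// (1 - e^(-k*n/m))^k, with m bits, k hash functions, and n inserted items.
pub fn false_positive_rate(m_bits: f64, k_hashes: f64, n_items: f64) -> f64 {
    (1.0 - (-k_hashes * n_items / m_bits).exp()).powf(k_hashes)
}
```

A query literal usually contributes several trigrams and all of them must pass, so the chance that a file without the literal survives the sieve is lower than the single-lookup rate.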

14 Semantic rewriting

doora can surgically replace structural patterns without corrupting surrounding syntax. This is fundamentally different from sed — it operates on the AST, not on text.

preview a rename (dry run)

doora -q '(function_item name: (identifier) @fn (#eq? @fn "old_name"))' \
  --rewrite 'new_name' \
  -p ./src

apply rewrites in place

doora -q '(function_item name: (identifier) @fn (#eq? @fn "old_name"))' \
  --rewrite 'new_name' \
  --in-place \
  -p ./src

With --in-place, a unified diff is printed before any file is written, unless --yes is passed. Rewrites are applied in reverse byte order so that replacing one match does not shift the byte offsets of matches that precede it in the same file.

Important

The rewrite template replaces the CAPTURED node's text, not the full match. Capture your target precisely to avoid unintended changes.
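The reverse byte order rule can be illustrated with plain string edits. In this sketch each edit is the byte range of a captured node plus its replacement text; `apply_rewrites` is a hypothetical helper, and the ranges are assumed to be non-overlapping and aligned to UTF-8 character boundaries.

```rust
use std::ops::Range;

/// Replace each byte range in `source` with its replacement text.
/// Hypothetical helper; ranges must not overlap and must fall on
/// UTF-8 character boundaries, otherwise `replace_range` panics.
pub fn apply_rewrites(source: &str, mut edits: Vec<(Range<usize>, String)>) -> String {
    // Highest start offset first: edits later in the file are applied
    // before earlier ones, so the earlier ranges are still valid.
    edits.sort_by(|a, b| b.0.start.cmp(&a.0.start));
    let mut out = source.to_string();
    for (range, replacement) in edits {
        out.replace_range(range, &replacement);
    }
    out
}
```

Applying the same edits front to back would require adjusting every later range by the length difference of each replacement; going back to front avoids that bookkeeping.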

15 Interactive TUI

Launch with --tui for a split-pane terminal explorer with live streaming results.

launch the TUI

doora -q '(function_item)' -p . --tui

TUI layout

┌─ Files ──────────┐┌─ AST ─────────────────────────────────────┐
│ src/auth/        ││ source_file                                │
│   handler.rs   3 ││   function_item [42:0 → 58:1]              │
│   token.rs     1 ││     name: identifier "parse_token" ← match │
│ src/db/          ││     parameters: parameters                 │
│   pool.rs      1 ││       parameter: identifier "input"        │
└──────────────────┘└───────────────────────────────────────────┘
┌─ Code ─────────────────────────────────────────────────────────┐
│ 42 │ pub fn parse_token(input: &str, ctx: &Context) -> Token { │
└────────────────────────────────────────────────────────────────┘
[/] query [j/k] navigate [Enter] expand node [q] quit

The TUI uses a fully decoupled async event loop (Tokio). Results stream in live as background threads find matches, so the interface stays responsive while a search over a large repository is still running.

Key	Action
`/`	Focus the query input bar
`j` / `k` or arrows	Navigate the file list
`Enter`	Expand / collapse AST nodes
`Tab`	Switch focus between panes
`q` or `Esc`	Quit

16 MCP server mode

doora exposes a Model Context Protocol server so LLM coding agents can query your codebase structurally, reducing hallucinations and wasted context tokens.

Why LLMs need structural search

LLM coding agents have limited context windows. When an agent tries to understand a large codebase by reading files, it burns context tokens on irrelevant content and still has no structural understanding of the architecture.

With the doora MCP server, an agent can ask: "What is the exact type signature of the function handling user authentication?" and receive a precise, structured answer in milliseconds — consuming a fraction of the context tokens that reading source files would require.

Setup

build the index first

doora index --persist -p .

start the MCP server

doora serve --mcp

Configure in .mcp.json

Add to your project's MCP configuration file:

{
  "mcpServers": {
    "doora": {
      "command": "doora",
      "args": ["serve", "--mcp"],
      "cwd": "/path/to/your/repo"
    }
  }
}

Available MCP tools

Tool	Input	Description
`search_ast`	`query: string`	Run a live S-expression structural search and return matching results with file paths and positions.
`lookup_symbol`	`name: string`	Query the persisted SQLite index for a symbol by name. Returns type signatures, locations, and relationships.

17 Architecture — pipeline overview

The search pipeline is a linear sequence from CLI arguments to printed results. Each stage is isolated and independently testable.

CLI parse (Cli struct)
→ Validate (path, lang, query)
→ Compile queries (Arc<MultiCompiledQuery>)
→ Walk files (WalkBuilder + ignore)
→ par_bridge (Rayon parallel)
→ Sieve (Bloom filter check)
→ parse_file (Tree-sitter CST)
→ extract_matches (DFS traversal)
→ sort + dedup (MatchResult[])
→ print (stdout)

Key architectural decisions

Query compiled before walk: Arc<MultiCompiledQuery> is built once and shared across all Rayon threads with zero per-thread recompilation.
Ephemeral tree lifecycle: The Tree and source bytes are dropped immediately after extract_matches returns. RAM usage stays flat regardless of repository size.
Single-pass multi-query: Multiple -q flags are merged into one traversal. The AST is walked once per file regardless of query count.
BitSet pre-filtering: Each node's kind ID is checked against a pre-computed HashSet<u16> before evaluating the full match pattern.
Bloom filter sieve: Files without the query's required trigrams are skipped entirely before tree-sitter is invoked.
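The kind-ID pre-filter in the list above amounts to a cheap set membership test that runs before any pattern is evaluated. The sketch below shows one possible representation, a fixed bit set covering every u16 value; `KindSet` and `candidates` are hypothetical names, and doora's actual data structure may differ.

```rust
/// Fixed-size bit set over every possible u16 node kind ID.
/// Hypothetical type for illustration.
pub struct KindSet {
    words: [u64; 1024], // 1024 words * 64 bits = 65,536 possible IDs
}

impl KindSet {
    pub fn new() -> Self {
        KindSet { words: [0; 1024] }
    }

    /// Mark a kind ID as one that some query pattern could match.
    pub fn insert(&mut self, kind: u16) {
        self.words[usize::from(kind) / 64] |= 1u64 << (kind % 64);
    }

    /// Constant-time check with no hashing and no allocation.
    pub fn contains(&self, kind: u16) -> bool {
        self.words[usize::from(kind) / 64] & (1u64 << (kind % 64)) != 0
    }
}

/// Count how many visited nodes would go on to full pattern matching.
pub fn candidates(node_kinds: &[u16], wanted: &KindSet) -> usize {
    node_kinds.iter().filter(|&&kind| wanted.contains(kind)).count()
}
```

Most nodes in a syntax tree are of kinds no query mentions, so a test this cheap removes them before the more expensive structural comparison.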

18 Architecture — module reference

The codebase is organized into focused, independently testable modules.

types

src/types.rs

Core data types: MatchResult, SearchConfig, Language, LangMode, AppError. All shared types live here.

parser

src/parser.rs

Thread-local parser pool, parse_file with mmap support, get_language, detect_language, FileSource enum.

query

src/query.rs

CompiledQuery with BitSet kind IDs and cached regex predicates. MultiCompiledQuery. Single-pass extract_multi_matches.

walker

src/walker.rs

Directory traversal via the ignore crate. Respects .gitignore. build_walker for single language, build_auto_walker for all extensions.

output

src/output.rs

ANSI color output, ColorMode, resolve_color_mode (respects NO_COLOR). Writer injection for testability.

trigram

src/trigram.rs

Byte-level trigram extraction. extract_unique_trigrams, extract_query_trigrams. No heap allocation per trigram.

bloom

src/bloom.rs

4096-bit Bloom filter with two FNV-1a hash functions. Fixed-size [u8; 512] bit array. insert, probably_contains, serialization.

index

src/index.rs

IndexEntry, IndexManifest. Atomic save via rename. bincode serialization. Version mismatch detection.

indexer

src/indexer.rs

Parallel index build/update pipeline. Staleness detection via mtime + size. Incremental re-indexing of modified files only.

sieve

src/sieve.rs

Bloom filter pre-rejection sieve. QueryTrigramSet, FileIndexStatus, should_parse_file. Multi-query OR semantics.

19 Architecture — concurrency model

The search pipeline is fully parallel using Rayon's work-stealing thread pool. Every design decision was made to maximize throughput while eliminating contention.

Work-stealing thread pool

The directory walker's Iterator is converted to a Rayon parallel iterator via par_bridge(). Rayon distributes file processing across all available cores using a work-stealing scheduler — when a thread exhausts its local queue, it steals work from the end of another thread's queue.

Thread-local parser pool

Tree-sitter Parser initialization is expensive: it allocates C-level state via FFI. Creating and dropping a parser for every file would repeat that cost across the whole repository. Instead, each Rayon worker thread owns exactly one Parser via thread_local!:

thread_local! {
    static PARSER: RefCell<Parser> = RefCell::new(Parser::new());
}

The parser is reconfigured for each file via set_language() before parsing — a cheap operation that resets internal state without a new allocation.

Shared state

The following objects are shared across all threads:

Arc<HashMap<Language, Arc<MultiCompiledQuery>>> — compiled queries, shared by reference
Arc<Mutex<Vec<MatchResult>>> — result accumulator, locked only for append
Arc<Mutex<usize>> — file counter
Arc<IndexManifest> — Bloom filter index, read-only during search

Lock design

Mutex locks are held only for the append/increment operation — never during parse or query work. The expensive CPU work (parsing, DFS traversal) happens entirely outside any lock.
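The lock discipline described above can be demonstrated with the standard library alone. This sketch swaps Rayon's work-stealing pool for plain std::thread workers with static striping, purely to stay dependency-free; `find_matches` is a stand-in for parsing and traversal, and `search_all` is a hypothetical name.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

/// Stand-in for the expensive per-file work (parsing and traversal).
fn find_matches(file: &str) -> Vec<String> {
    file.split_whitespace()
        .filter(|word| word.starts_with("fn_"))
        .map(str::to_string)
        .collect()
}

/// Process all files on `workers` threads; returns sorted matches
/// and the number of files processed.
pub fn search_all(files: Vec<String>, workers: usize) -> (Vec<String>, usize) {
    let results = Arc::new(Mutex::new(Vec::<String>::new()));
    let file_count = Arc::new(Mutex::new(0usize));
    let files = Arc::new(files);
    let workers = workers.max(1);

    let handles: Vec<_> = (0..workers)
        .map(|w| {
            let results = Arc::clone(&results);
            let file_count = Arc::clone(&file_count);
            let files = Arc::clone(&files);
            thread::spawn(move || {
                // Static striping: worker w takes files w, w + workers, ...
                for file in files.iter().skip(w).step_by(workers) {
                    // The expensive work runs with no lock held.
                    let local = find_matches(file);
                    // Lock only for the append; the guard drops at the ';'.
                    results.lock().expect("results lock poisoned").extend(local);
                    // Lock only for the increment.
                    *file_count.lock().expect("counter lock poisoned") += 1;
                }
            })
        })
        .collect();

    for handle in handles {
        handle.join().expect("worker panicked");
    }
    let mut all = results.lock().expect("results lock poisoned").clone();
    all.sort();
    let count = *file_count.lock().expect("counter lock poisoned");
    (all, count)
}
```

Collecting matches into a local Vec first keeps each critical section down to a single extend call, which is what keeps contention low as the thread count grows.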

20 Architecture — memory model

doora maintains a flat RAM profile regardless of repository size by enforcing an ephemeral tree lifecycle.

Ephemeral lifecycle

With text search, only a small read buffer per file needs to be in memory. With AST-based search, parsing a file produces a large C-allocated structure with potentially tens of thousands of nodes. If all trees were retained simultaneously, memory use would grow with repository size, and a 10,000-file repository could require tens of gigabytes of RAM.

The solution is the ephemeral lifecycle. In the parallel closure, the drop order is explicit and mandatory:

let (tree, source) = parse_file(entry.path(), &lang)?;
let matches = extract_multi_matches(&tree, source.as_bytes(), &multi, path);
drop(tree);    // C-allocated CST freed immediately
drop(source);  // source bytes freed immediately
// Only the small Vec<MatchResult> persists

Memory-mapped files

For files of 1MB or larger, parse_file uses memmap2 instead of fs::read_to_string. The OS pages in only the bytes actually accessed, and the mapping is backed by the page cache rather than a heap allocation:

// Small files: heap allocation
FileSource::Heap(String)

// Large files (≥ 1MB): memory-mapped
FileSource::Mapped(memmap2::Mmap)

Maximum RAM usage

RAM usage is bounded by:

Number of active Rayon worker threads × (one tree + one source buffer)
The final accumulated Vec<MatchResult> (small owned strings only)
The pre-computed CompiledQuery objects (one per language)
The IndexManifest if the index is loaded (a 512-byte Bloom filter per file entry, plus its path and metadata)

21 Performance

doora is engineered for sub-second query latency on repositories with millions of lines of code.

Benchmark results

Benchmark	Result	Notes
Single file parse + query (100 functions)	~180µs	On Apple M2, release build
10,000-file Rust repository	<1000ms	Target; actual ~380ms on modern 8-core
Parallel search, 100 files, 20 functions each	~45ms	Rayon, 8 cores
Query compilation (single, Rust grammar)	~12µs	Includes BitSet and regex pre-compilation
Bloom filter rejection check	<0.003ms	Per file, two FNV-1a hash evaluations

Optimization stack

The following optimizations are layered from outermost to innermost:

Bloom filter sieve: Skip parsing files that cannot contain the search term. Eliminates tree-sitter invocation entirely for filtered files.
BitSet kind filtering: Pre-compute which node kind IDs could match the query. Skip match_node evaluation for the vast majority of tree nodes.
Pre-compiled regex predicates: Compile #match? regex patterns exactly once per query, share via Arc<Regex>. Zero per-file regex compilation.
Single-pass multi-query automaton: Traverse the AST exactly once per file regardless of how many -q flags were passed.
Thread-local parser pool: Each Rayon thread owns one Parser for its entire lifetime. Zero per-file parser allocation.
Memory-mapped file reading: Files ≥ 1MB use mmap to avoid heap allocation of the source string.
Work-stealing parallelism: Rayon's work-stealing scheduler keeps all CPU cores busy while files remain to be processed.

Profiling

To profile on your hardware:

# Install flamegraph
cargo install flamegraph

# Run profiler against the Rust compiler source
make profile

# Or manually
cargo flamegraph --bin doora -- \
  -q '(function_item name: (identifier) @fn_name)' \
  -p /path/to/large/repo --lang rust --no-color --quiet

See PROFILING.md for detailed instructions and interpretation guidance.

22 Building from source

Requires Rust 1.78 or later (MSRV). All grammar crates are compiled from source as part of the Cargo build.

# Debug build
cargo build

# Release build (recommended for performance testing)
cargo build --release

# Check only (fast compilation check)
cargo check --all-features

Dependencies

Crate	Version	Purpose
`tree-sitter`	0.22	Incremental parsing library core
`tree-sitter-rust/python/javascript/typescript/go/c/cpp`	0.21	Language grammars
`rayon`	1.10	Work-stealing parallel iterator
`ignore`	0.4	Gitignore-aware directory walker
`clap`	4	CLI argument parsing
`clap_complete`	4	Shell completion generation
`thiserror`	1	Error type derivation
`regex`	1	Pre-compiled regex for #match? predicates
`bincode`	2	Binary serialization for the index
`serde`	1	Serialization derive macros
`memmap2`	0.9	Memory-mapped file reading

23 Testing

The test suite covers unit tests, integration tests, and statistical Bloom filter correctness tests.

# Run all tests
cargo test

# Run tests for a specific module
cargo test --lib parser
cargo test --lib query
cargo test --lib walker
cargo test --lib bloom

# Run integration tests
cargo test --test integration_test -- --nocapture

# Run Bloom filter statistical tests
cargo test --test bloom_statistics -- --nocapture

# Run binary tests (CLI validation, stats)
cargo test --bin doora

Test inventory

Module	Test count	Coverage focus
`types.rs`	7	MatchResult derives, AppError display, sort order
`parser.rs`	20+	Thread-local pool, mmap vs heap, all 7 languages
`query.rs`	25+	Query compilation, BitSet filtering, regex predicates, Arc sharing
`walker.rs`	30+	Extension filtering, gitignore, excluded dirs, binary detection
`output.rs`	31	Color output, format correctness, writer injection
`bloom.rs`	15+	Zero false negatives, bit operations, FNV-1a determinism
`trigram.rs`	20+	Sliding window, UTF-8 byte handling, deduplication
`sieve.rs`	18+	Multi-query OR semantics, staleness detection, end-to-end
`integration_test.rs`	56+	End-to-end pipeline across all 7 languages
`bloom_statistics.rs`	8	Zero false negatives across 1000 files, FPR measurement

Fixture files

Integration tests use fixture files in tests/fixtures/:

simple.rs — Rust functions and types
multi_fn.rs — Multiple functions for ordering tests
structs.rs — Struct definitions
nested.rs — Nested module functions
empty.rs — Empty file (parse error path)
simple.py — Python functions
simple.js — JavaScript functions and classes
simple.ts — TypeScript interfaces and types
simple.go — Go functions and structs
simple.c — C functions and typedefs
simple.cpp — C++ classes and templates

24 Benchmarks

Criterion benchmarks in benches/search_benchmarks.rs. Run automatically in CI with regression detection.

# Run all benchmarks
cargo bench --bench search_benchmarks

# Run as tests (fast, no timing)
cargo bench --bench search_benchmarks -- --test

# Save baseline for comparison
cargo bench -- --save-baseline main

# Compare against baseline
cargo bench -- --baseline main

Benchmark suite

Benchmark group	What it measures
`compile_single_query_rust`	Query compilation time including BitSet + regex pre-compilation
`compile_five_queries_rust`	Multi-query compilation overhead
`single_file/functions/N`	Parse + query throughput for files with 10, 100, 500, 1000 functions
`heap_read` vs `mmap_read`	File reading strategy comparison
`one_query` vs `five_queries`	Single-pass multi-query overhead
`parallel_100_files_20fn_each`	Full parallel pipeline with result accumulation
`compile_universal_query_all_languages`	Auto mode query compilation across all 7 grammars

CI fails on regressions exceeding 10% for parallel benchmarks and 2% for single-file benchmarks. Results are uploaded as GitHub Actions artifacts with HTML reports from Criterion.

25 Contributing

Contributions are welcome. The project is tracked issue-by-issue with milestones for each subsystem.

Before opening a PR

# Format
cargo fmt

# Lint — zero warnings enforced
cargo clippy --all-features -- -D warnings

# Full test suite
cargo test --all-features

All three must pass cleanly. The CI pipeline enforces #![deny(warnings)] and #![warn(clippy::pedantic)] across the entire codebase.

Code style

No inline comments or block comments in source files; doc comments on public items are the exception
All public items must have doc comments (enforced by cargo doc --no-deps zero warnings)
Use Result<T> (the crate's type alias for Result<T, AppError>) for all fallible functions
Never use unwrap() in production code — only in tests with an explanatory comment
Explicit drop() calls are mandatory after tree and source are no longer needed

Adding a new language

Add tree-sitter-<lang> to Cargo.toml
Add a variant to Language enum in types.rs
Add extension mapping in walker.rs::extensions_for_language
Add grammar arm in parser.rs::get_language
Add detection arm in parser.rs::detect_language
Add variant in parser.rs::get_all_languages
Add lang string to main.rs::resolve_lang and validate
Add fixture file to tests/fixtures/
Add integration tests following the pattern of existing language tests

License

MIT

doora
Search code by shape,
not by text.