Companion to
standalone_matcher_design.md. That doc describes what to build. This doc describes how and in what order, including prep work, branch structure, and review milestones.
Date: 2026-05-23. Status: approved, not yet started.
The May 2026 performance investigation (see TODO.md “Performance” section)
shipped two optimisations to data_redactor’s C engine: a strstr literal
pre-filter (option B) and chunked input (option G). Net effect: 25-30%
faster across all sizes and linear scaling restored, but the C extension is
still 3-5× slower than pure-Ruby gsub at every size we measured.
The remaining gap is structural — glibc’s POSIX regexec lacks the
Boyer-Moore literal pre-filter that Ruby’s Onigmo engine has built in, and
allocates O(input-length) state per call. No amount of optimisation around
the engine’s existing API can close that gap.
The only options that can actually beat Ruby are options I (streaming position-by-position match) and E (a combined automaton walking all patterns simultaneously). I is the algorithmic skeleton of E — same outer loop, different inner step. This plan treats them as one project in two phases.
Confirmed from public sources (search 2026-05-23):
grep paper describes exactly this technique. The CS is 57 years old.RegexSet — battle-tested, well-documented API.We’re re-implementing a well-understood category. The risk is execution, not novelty.
Phase 1: streaming-match skeleton. Build the outer loop — walk input
once, at each position try patterns until one matches, emit [REDACTED],
advance past the match. Inner step initially uses 88 sequential regexec
calls (one per pattern). This is provably correct but slow.
redact/scan, overlap
resolution, scan offset translation, custom-pattern injection, the
235 specs still pass byte-identical to today.Phase 2: combined automaton. Replace the inner step with a single
NFA/DFA constructed from all 88 patterns merged via Thompson’s construction
(see standalone_matcher_design.md Part “Algorithm”). One state transition
per input byte instead of 88 regexec probes.
The phasing isolates correctness (Phase 1) from performance (Phase 2). Bugs in the combined automaton are caught by the spec suite that the Phase 1 skeleton already passes.
Per the 2026-05-23 decision: implementation lives in ext/data_redactor/
throughout both phases. Library extraction to a standalone repo (the
“spinoff” idea in standalone_matcher_design.md) is a separate decision,
deferred until after Phase 2 ships and the design is proven.
Rationale: less upfront friction, faster iteration, can validate the design on real specs without setting up cross-repo tooling. Extraction is harder than it looks but not impossible; if the Elixir port ever happens, that’s the natural forcing function.
Each prep PR is independent, small, and lands incremental value. They can land in any order (no inter-dependencies), but the recommended order below maximises information for the matcher design.
Branch: feat/audit-pattern-subset. Estimated size: ~1-2 hours.
Walk every entry in ext/data_redactor/patterns.c and verify it uses only
the regex features the combined matcher will support (per
standalone_matcher_design.md “Scope”). Catches surprises early — if any
pattern uses something we plan to exclude (backreferences, lookaround,
non-greedy), we either expand scope or rewrite the pattern.
Deliverables:
docs/pattern_subset_audit.md listing each
pattern, the features it uses, and pass/fail against the planned subset.Acceptance: all 88 patterns pass; 235 specs green.
Branch: feat/overlap-policy-specs. Estimated size: ~half day.
Today the gem’s overlap behaviour is implicit — patterns run in specific→generic order on a working buffer that’s been modified by earlier patterns, so by the time a generic pattern runs, the specific ones have already replaced their text. The combined matcher sees all matches simultaneously, so we need an explicit policy.
The design doc lists candidate policies:
Deliverables:
AKIA
(20-char alphanum prefix) vs the generic 20-char alphanum pattern;
email (contains a .com that the IPv4 dot-pattern almost matches);
credit_card (digits that overlap national-ID digit patterns).docs/standalone_matcher_design.md declaring
which policy reproduces that behaviour, with a worked example.Acceptance: new overlap specs pass against today’s engine and will serve as the oracle for the combined matcher’s overlap-resolution code.
Branch: feat/isolated-matcher-bench. Estimated size: ~1-2 hours.
benchmark/vs_pure_ruby.rb measures the whole redact pipeline. During
matcher development we want to measure just the matcher: same patterns,
identical inputs, no surrounding placeholder substitution or buffer
management.
Deliverables:
benchmark/matcher_isolated.rb — runs both the current 88-regexec loop
and (once it exists) the combined matcher on a corpus of strings;
reports per-input time. Initially compares against pure-Ruby gsub so
the file is useful immediately.benchmark/README.md explaining what it isolates.Acceptance: script runs cleanly and produces stable numbers across runs (variance < 10%).
Branch: feat/matcher-api-survey. Estimated size: ~half day.
Before designing our own C API in detail, read the public APIs and post-mortems of two production implementations. This is research, not code — but it should produce a written artifact so the survey isn’t re-done from scratch later.
Deliverables:
docs/matcher_api_survey.md — for each of RE2::Set and Rust’s
RegexSet: their compile API, match API, what they report (one match
vs all matches vs which patterns matched), how they handle overlaps,
what their known issues are.Acceptance: the survey doc exists; standalone_matcher_design.md is
updated where the survey changes any design decision.
Branch: feat/combined-matcher-phase1 (off main, after prep PRs).
Approximate scope: medium. Most of the work is refactoring redact.c
and scan.c around a new outer loop while preserving the 235-spec
behaviour and adding the new overlap specs from Prep 2.
Deliverables:
streaming_match.c / .h with the position-by-position loop.redact.c and scan.c refactored to use it (or replaced — TBD during
detailed design).regexec’s sequentially for now. Encapsulated behind a clear interface
that Phase 2 will swap out.benchmark/matcher_isolated.rb numbers committed in the PR description
(likely similar to today’s, possibly slightly worse — that’s fine).Not shipped to users. Phase 1 lands on main only because Phase 2 will
build on it; users see no behavioural change. If Phase 1 turns out to be
the wrong abstraction, we revert it before Phase 2 lands.
Branch: feat/combined-matcher-phase2 (off feat/combined-matcher-phase1
after it merges).
Approximate scope: large. Multi-week. Real C compiler/state-machine work.
Deliverables (per standalone_matcher_design.md):
benchmark/vs_pure_ruby.rb numbers in the PR description showing the
C extension finally beating pure-Ruby gsub.Shipped as a major version release (1.0.0 per semver — the engine
is replaced even though the public Ruby API is unchanged). CHANGELOG
documents the engine swap and the new performance characteristics.
main
├── feat/audit-pattern-subset ──┐
├── feat/overlap-policy-specs ──┤ Independent prep PRs.
├── feat/isolated-matcher-bench ──┤ Any order.
└── feat/matcher-api-survey ──┘
(after all 4 prep PRs merge)
main
└── feat/combined-matcher-phase1
└── feat/combined-matcher-phase2
└── (1.0.0 release)
| Risk | Mitigation |
|---|---|
| DFA state explosion at our pattern count | Stay on NFA simulation initially; build lazy DFA only if NFA proves too slow. RE2 hits 2GB at ~30 patterns, but our 88 patterns are short and prefix-heavy — likely much smaller. Validate during Phase 2 design with a state-count probe before committing to full DFA. |
| Overlap policy mismatch with today’s behaviour | Prep 2 produces an explicit spec suite locking in today’s behaviour before the matcher is built; matcher must pass it. |
| Multi-week effort stalls | Phase 1 is shippable scaffolding on its own — even if Phase 2 stalls, Phase 1’s refactor is reusable and the spec suite is better. |
| Bug in custom regex engine = security bug | (a) Spec-equivalence harness against the existing engine; (b) property tests against PCRE/RE2 as oracles; (c) AFL fuzzing on compile and match; (d) ASan/UBSan on every CI build. All three layers, none optional. |
| Performance fails to beat Ruby anyway | Concrete kill criterion before Phase 2 starts: a quick prototype with 10 patterns must beat the equivalent 10-pattern Ruby gsub by ≥3× on a 1 MB log. If it doesn’t, the architecture is wrong and we either pivot to option H (Onigmo) or stop. |
| Onigmo would have been fine | Document the trade-off in the Phase 2 PR description: Onigmo gets us to parity for free but is MRI-only. If H later turns out to be sufficient for the gem’s actual users, we can ship that as a stopgap during Phase 2 without invalidating Phase 2’s existence. |
These get answered by the prep PRs and inform the Phase 1/2 design:
TODO.md as a fallback if Phases 1+2 fail.redact_stream API is a separate PR after
Phase 2.