Status: design, not yet implemented. The data_redactor Ruby gem’s C engine
currently uses glibc POSIX regex.h, which is O(N²) for the gem’s access
pattern (many patterns × one input, called in a loop). This document specs the
standalone C library that would replace it.
Date: 2026-05-23. Author: working through it with Claude during a performance investigation.
The combined-matcher core is gem-agnostic — it compiles N regex patterns
into one merged automaton, walks input once, emits (pattern-id, start, end)
match events. Nothing about that is Ruby-specific.
Reasons to spin it off rather than nest it in data_redactor/ext/:
data_redactor (see TODO
“Possible Erlang/Elixir port”) would need the same matcher. Sharing a
single C core via a NIF avoids two implementations drifting.Given:
N regular expressions (typically dozens to low hundreds).Find: every match of any of the N patterns, with the pattern id of which pattern matched, in O(input_length) total time and O(automaton_size) space.
The existing landscape:
So: a portable, zero-dep, multi-pattern, regex-subset C library is a gap.
Drawn from the 88 patterns in data_redactor’s ext/data_redactor/patterns.c.
The library is a regex subset, not a full PCRE replacement.
Required:
[abc], [a-z], [A-Z0-9], [^abc] (negated)[[:alpha:]], [[:digit:]], [[:alnum:]], [[:space:]]a|b|cab(ab|cd)*, +, ?, {n}, {n,}, {n,m} (bounded — important for
the linear-time guarantee; see “Time complexity” below)^, $ (line / string)\., \\, \(, \[, etc.Explicitly NOT supported (and rejected at compile time with a clear error):
\1) — not regular, exponential worst case.(?=...), (?!...)) — not regular.*?, +?) — possible to support but rarely needed
for tokenized data; defer until asked.\p{L}) — defer; UTF-8 byte-level matching covers
the cases we care about (the gem is byte-oriented today).This is roughly the POSIX ERE subset minus a few features, which is what
data_redactor’s patterns already use.
Opaque handle, two-phase usage (compile once, match many):
typedef struct mm_matcher mm_matcher;
typedef struct mm_compile_error {
int pattern_index; /* which input pattern failed, -1 for general */
size_t offset; /* byte offset into that pattern's source */
char message[128]; /* human-readable error */
} mm_compile_error;
/*
* Compile a set of patterns into one matcher. Patterns are POSIX-ERE-subset
* source strings (see "Scope" in design doc). pattern_ids[i] is the integer
* tag the matcher will report when patterns[i] matches; the caller picks
* the namespace (commonly 0..n-1 indexed).
*
* Returns NULL on compile error; *err is filled in.
*/
mm_matcher *mm_compile(const char *const *patterns,
const int *pattern_ids,
size_t n_patterns,
mm_compile_error *err);
void mm_free(mm_matcher *m);
/*
* Match event delivered to the caller's callback. Offsets are byte positions
* in the input passed to mm_scan; length is byte length of the match.
*/
typedef struct mm_match {
int pattern_id;
size_t start;
size_t length;
} mm_match;
/*
* Walk `input` once, calling `cb` for each match. `userdata` is passed through.
* Returns the number of matches emitted, or (size_t)-1 on error.
*
* Match ordering: emitted in input-position order (sorted by start, ties broken
* by longer match first — see "Overlap resolution" below).
*
* Streaming variant (separate fn): mm_scan_chunk lets callers feed chunks; the
* matcher holds state across calls to handle matches straddling boundaries.
*/
typedef void (*mm_match_cb)(const mm_match *m, void *userdata);
size_t mm_scan(const mm_matcher *m,
const char *input, size_t input_len,
mm_match_cb cb, void *userdata);
The matcher is thread-safe for reading (mm_scan on the same compiled
matcher from multiple threads); compile and free are not concurrent with scans.
pattern_id.DFA size is bounded by 2^(total NFA states) in the worst case, but for the
character-class-heavy patterns in data_redactor (each pattern is small;
common prefixes share states) realistic size is in the low thousands of
states, fitting comfortably in L2 cache.
Single loop over input bytes:
state = dfa.start
for i in 0..input_len:
state = dfa.transition[state][input[i]]
if state is accept:
for pid in state.pattern_ids:
emit_potential_match(pid, start_of_current_run, i+1)
Each byte does one table lookup → O(input_length) total, period. No backtracking, no per-call setup, no allocation in the hot loop.
The interesting question is which match to emit when multiple are possible — see overlap resolution below.
Linear in input length, provided the regex doesn’t contain unbounded ambiguous repetition that requires NFA simulation (which is why the scope above bans backreferences and lookaround — they break the regular property).
Memory is O(DFA_states × alphabet_size) for the transition table. For 256-byte alphabet (raw bytes, UTF-8-safe by byte) and a few thousand states, that’s a few MB — acceptable for a one-time cost held in the compiled matcher.
Policy: longest match wins, tied lengths broken by pattern-id (lower index
wins). Locked in via Prep 2 (see
pattern_subset_audit.md sibling and
combined_matcher_plan.md). Spec coverage is in
spec/data_redactor_spec.rb under the “overlap resolution” describe blocks
— both today’s pattern-id-priority behaviour AND the post-1.0 longest-match
behaviour are written as specs, with the latter marked pending until the
matcher ships.
The previous draft of this doc recommended pattern-id priority (today’s implicit behaviour). Prep 2’s empirical investigation revealed a class of cases where pattern-id priority leaks secrets:
Input:
AKIAIOSFODNN7EXAMPLEAAAAAAAAAAAAAAAAAAAA(40 alphanum bytes).
- Pattern-id priority (today):
aws_access_key_id(idx 14) wins, redacts the leading 20 chars, leaves the trailing 20 unredacted.- Longest-match:
aws_secret_access_key(idx 15, 40 chars) wins, the whole span is redacted.
For a redaction gem the correct failure mode when uncertain is “redact more, not less.” The pattern-id semantic is also an accidental outcome of sequential pattern execution — no one designed it; it fell out of the implementation. Adopting longest-match aligns with Onigmo, PCRE, RE2, and Hyperscan’s standard semantics.
Input: AKIAIOSFODNN7EXAMPLEAAAAAAAAAAAAAAAAAAAA (40 chars).
| Match candidate | Start | Length | Pattern id |
|---|---|---|---|
AKIAIOSFODNN7EXAMPLE |
0 | 20 | 14 (aws_access_key_id) |
AKIAIOSFODNN7EXAMPLEAAAAAAAAAAAAAAAAAAAA |
0 | 40 | 15 (aws_secret_access_key) |
Longest match (40) wins. redact emits one [REDACTED] covering the
whole input.
Input: id 85121612345 end (11-digit span at offset 3).
| Match candidate | Start | Length | Pattern id |
|---|---|---|---|
85121612345 |
3 | 11 | 81 (polish_pesel) |
85121612345 |
3 | 11 | 82 (belgian_national_number) |
85121612345 |
3 | 11 | 83 (norwegian_fodselsnummer) |
85121612345 |
3 | 11 | 87 (polish_pesel_2) |
All four candidates are the same length. Tiebreak by pattern-id: index 81
wins, so polish_pesel is reported. Matches today’s behaviour.
This is a behaviour change from today. Concrete consequences:
1.0.0 (already
planned per combined_matcher_plan.md). The overlap-policy change is
called out prominently in the CHANGELOG.scan API unchanged. Still returns matches: [...]. The set of
matches reported may differ from today (one longer match instead of
multiple shorter overlapping ones).For scan-style introspection use cases, the underlying matcher API can
optionally report all overlapping matches in input-position order, with
the consumer (in our case the redact/scan Ruby layer) choosing what
to do with them. Today’s scan would just keep emitting the
longest-wins-with-id-tiebreak selection; a future scan_all could expose
the full set. Out of scope for the 1.0 matcher; design space left open.
Beyond the asymptotic “O(N) one pass instead of O(N×P) for P patterns,” combining patterns into one automaton shares work across patterns with common prefixes (or any common subgraph after DFA minimisation).
Example with three Github token patterns:
github_pat_fine_grained: github_pat_[0-9a-zA-Z_]{82}
github_classic_pat: ghp_[0-9a-zA-Z]{36}
github_oauth_token: gho_[0-9a-zA-Z]{36}
Three separate regexec calls each read g then h independently — 3
byte comparisons of g, 3 of h, total 6. The combined NFA goes:
start ──g──> Sx ──h──> Sy ──┬── i ──> github_pat_fine_grained branch
├── p ──> github_classic_pat branch
└── o ──> github_oauth_token branch
The g and h bytes are inspected once for all three patterns at this
position. With 88 patterns the savings compound:
[A-Z][A-Z]
recognition is shared across 18 patterns.This isn’t a separate optimisation we’d add — it falls out of NFA-merging + subset construction + Hopcroft minimisation, all standard Thompson-construction steps. The “we don’t check the same byte twice” win is the structural payoff for building the combined automaton in the first place, not an extra trick.
The streaming API (mm_scan_chunk) carries DFA state across calls. A match
that starts in chunk A and ends in chunk B is still detected correctly because
the DFA state at the end of A is the partial-match state going into B.
This solves the same problem as TODO item #8 (Streaming API) and as the chunking approach (option G) in TODO’s Performance section — they all need the same “partial-match state carries across boundaries” property, which is exactly what a DFA gives you for free.
What it cannot do: detect a match starting inside a region the caller already
flushed. Callers must hold the bytes of any in-progress match until it
completes — the API should report partial_match_active or similar so the
caller knows when it’s safe to discard buffered input. Simple ring-buffer
pattern.
data_redactor wraps generic patterns with (^|[^0-9A-Za-z])(...)([^0-9A-Za-z]|$)
to avoid matching inside other tokens. The new library should:
replace_all_matches today.A regex engine is a security-critical component. Testing must be aggressive:
data_redactor’s 231-spec suite with the
gem swapped to use the new matcher; output must be byte-identical to the
regex.h version.multi-matcher/ (or similar — naming TBD)
├── include/
│ └── multi_matcher.h public API
├── src/
│ ├── parser.c regex source → AST
│ ├── nfa.c AST → NFA (Thompson's)
│ ├── dfa.c NFA → DFA (subset construction)
│ ├── minimize.c DFA minimization (Hopcroft)
│ ├── scan.c match loop
│ └── streaming.c chunk-boundary state
├── tests/
│ ├── unit/ per-stage unit tests
│ ├── conformance/ PCRE/RE2-oracle comparison
│ └── fuzz/ AFL/libFuzzer harnesses
├── bench/
│ ├── compile.c compile-time benchmarks
│ └── scan.c scan-time benchmarks
├── examples/
│ └── data_redactor.c show integration
├── Makefile
├── README.md
└── LICENSE MIT
Pure C99, no external dependencies, builds with any modern gcc/clang.
Provide both static and shared library targets.
data_redactorOnce the library exists:
ext/multi_matcher/ and update extconf.rb
to compile both. (Or vendor as a tarball — submodules are friction for gem
builders.)redact.c’s per-pattern loop with one mm_scan call per redact
invocation. Build the compiled matcher once at Init_data_redactor time
from the 88 pattern sources (already exposed via BUILTIN_PATTERN_SOURCES,
added in this same investigation 2026-05-23).scan.c becomes trivial: mm_scan already reports (pattern_id, start,
length) events — exactly what scan needs to return.add_pattern/remove_pattern
is called. This is a small perf hit on registration but eliminates the
custom-pattern loop entirely from the hot path.Bumps a major version (1.0.0) because the engine is replaced, but the public
Ruby API stays identical.
data_redactor model)
or upgrade to codepoint-aware character classes? Byte-oriented is simpler
and matches today’s behaviour exactly.multi_matcher, multimatch, regset, something else?