Performance benchmarks for data_redactor. Repo-only — this directory is
not packaged into the published gem (the gemspec’s files list does not include
benchmark/).
From the repo root:

    bundle install                           # pulls benchmark-ips, benchmark-memory (dev deps)
    bundle exec rake compile                 # build the .so for the current Ruby
    bundle exec ruby benchmark/throughput.rb

Use `bundle exec ruby`: a bare `ruby` may resolve to a different Ruby than the
one the .so was compiled for and fail with `incompatible library version`.
| Script | Measures |
|---|---|
| `throughput.rb` | MB/s of `redact` on a log line, a JSON blob, and 1 MB / 10 MB log files, plus `redact_deep` and `scan`. The headline "how fast is it" numbers. |
| `vs_pure_ruby.rb` | The C extension vs a pure-Ruby `gsub` loop running the same 88 patterns. Prints the speedup factor, which is the C extension's value proposition. |
| `scaling.rb` | Runtime vs input size (1 KB → 50 MB). MB/s should stay roughly flat, confirming linear scaling. |
| `per_pattern.rb` | Per-pattern scan cost over a 1 MB payload, sorted slowest-first. Surfaces expensive patterns to target in optimization work. |
| `matcher_isolated.rb` | Matcher-only cost (just "find all matches", no placeholder substitution): `DataRedactor.scan` vs the pure-Ruby per-pattern scan. Used during combined-matcher development to isolate matcher cost from surrounding pipeline work. |
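As a rough illustration of what an MB/s figure means here, the sketch below times a redaction call over payloads of increasing size and reports throughput. It is not the code in `throughput.rb` or `scaling.rb`: `stand_in_redact` and `mb_per_sec` are hypothetical helpers, the single regex is a placeholder for the real pattern set, and 1 MB is assumed to mean 10^6 bytes.

```ruby
require "benchmark"

# Hypothetical stand-in for the gem's redaction call, so the sketch runs
# without the compiled extension. Swap in the real call inside the repo.
def stand_in_redact(text)
  text.gsub(/\b\d{3}-\d{2}-\d{4}\b/, "[REDACTED]")
end

# Throughput in MB/s (1 MB = 1_000_000 bytes, an assumption) for one
# invocation of the block over the payload.
def mb_per_sec(payload)
  seconds = Benchmark.realtime { yield payload }
  (payload.bytesize / 1_000_000.0) / seconds
end

LINE = "user=alice ssn=123-45-6789 action=login\n"

# Grow the input and print throughput at each size. If cost is linear in
# input size, the MB/s column should stay roughly flat.
[1_000, 100_000, 1_000_000].each do |target_bytes|
  payload = LINE * (target_bytes / LINE.bytesize + 1)
  rate = mb_per_sec(payload) { |p| stand_in_redact(p) }
  puts format("%9d bytes  %8.1f MB/s", payload.bytesize, rate)
end
```

A single timed call is noisy, especially at small sizes; repeating each measurement (as a tool like benchmark-ips does) gives steadier numbers.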
`vs_pure_ruby.rb` and `per_pattern.rb` do not hard-code pattern strings. They
read `DataRedactor::BUILTIN_PATTERN_SOURCES` / `BUILTIN_PATTERN_BOUNDARY` live
from the compiled extension, so the pure-Ruby baseline runs exactly the same
patterns (boundary wrapper included) as the C engine. No drift from
`ext/data_redactor/patterns.c`.
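One way such a pure-Ruby baseline could be assembled is sketched below. The shapes used here are assumptions, not the extension's documented interface: `PATTERN_SOURCES` stands in for `BUILTIN_PATTERN_SOURCES` as an array of regex source strings, and `PATTERN_BOUNDARY` stands in for `BUILTIN_PATTERN_BOUNDARY` as a template with a `%s` slot. The two sample patterns and the helper names are illustrative only.

```ruby
# Stand-ins so the sketch runs without the compiled extension. In the repo
# these values would be read from the DataRedactor constants instead.
# Assumed shapes: an array of regex sources and a "%s" wrapper template.
PATTERN_SOURCES = [
  '\d{3}-\d{2}-\d{4}',            # illustrative SSN-like pattern
  '[\w.+\-]+@[\w\-]+\.[\w.]+'     # illustrative email-like pattern
].freeze
PATTERN_BOUNDARY = '(?<![A-Za-z0-9])(?:%s)(?![A-Za-z0-9])'

# Wrap every source in the boundary template and compile it once.
def compile_patterns(sources, boundary)
  sources.map { |src| Regexp.new(format(boundary, src)) }
end

# The baseline: one gsub pass per pattern over the whole text.
def pure_ruby_redact(text, patterns, placeholder = "[REDACTED]")
  patterns.reduce(text) { |acc, re| acc.gsub(re, placeholder) }
end

patterns = compile_patterns(PATTERN_SOURCES, PATTERN_BOUNDARY)
puts pure_ruby_redact("id 123-45-6789 mail bob@example.com", patterns)
```

Deriving the regexes from shared sources rather than retyping them is what keeps the comparison fair: both engines see the same patterns and the same boundary rule.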
benchmark-memory counts Ruby object allocations only. The C path allocates
its working buffers in C, invisible to that counter — it understates the C
path’s allocation advantage.