data_redactor

Benchmarks

Performance benchmarks for data_redactor. Repo-only — this directory is not packaged into the published gem (the gemspec’s files list does not include benchmark/).

Running

From the repo root, install the dev dependencies, compile the C extension, then run a script:

    bundle install                            # pulls benchmark-ips, benchmark-memory (dev deps)
    bundle exec rake compile                  # build the .so for the current Ruby
    bundle exec ruby benchmark/throughput.rb  # run one benchmark script

Always run the scripts with bundle exec ruby. A bare ruby may resolve to a different Ruby than the one the .so was compiled for, in which case loading the extension fails with an "incompatible library version" error.

Scripts

| Script | Measures |
| --- | --- |
| throughput.rb | MB/s of redact on a log line, a JSON blob, and 1 MB/10 MB log files, plus redact_deep and scan. The headline “how fast is it” numbers. |
| vs_pure_ruby.rb | The C extension vs a pure-Ruby gsub loop running the same 88 patterns. Prints the speedup factor, which is the C extension's value proposition. |
| scaling.rb | Runtime vs input size (1 KB → 50 MB). MB/s should stay roughly flat, confirming linear scaling. |
| per_pattern.rb | Per-pattern scan cost over a 1 MB payload, sorted slowest-first. Surfaces expensive patterns to target in optimization work. |
| matcher_isolated.rb | Matcher-only cost (just “find all matches”, no placeholder substitution): DataRedactor.scan vs the pure-Ruby per-pattern scan. Used during combined-matcher development to isolate matcher cost from surrounding pipeline work. |
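The MB/s figures reported by throughput.rb and scaling.rb come down to bytes processed divided by elapsed time, measured at several input sizes. The following is a minimal sketch of that measurement using only Ruby's standard library. The names mb_per_sec and throughput_by_size are hypothetical, and the redact lambda is a pure-Ruby stand-in so the sketch runs without the compiled extension; the repo's scripts may measure this differently.

```ruby
require "benchmark"

# Convert a byte count and an elapsed time into MB/s (1 MB = 1024 * 1024 bytes).
def mb_per_sec(bytes, seconds)
  (bytes / (1024.0 * 1024.0)) / seconds
end

# Run the given block once per input size and return { size_in_bytes => MB/s }.
def throughput_by_size(sizes, line)
  sizes.to_h do |size|
    # Repeat the sample line until the input reaches the target size.
    input = (line * (size / line.bytesize + 1)).byteslice(0, size)
    seconds = Benchmark.realtime { yield input }
    [size, mb_per_sec(input.bytesize, seconds)]
  end
end

# Stand-in for the real redaction call (assumption: swap in DataRedactor.redact
# when running against the compiled extension).
redact = ->(text) { text.gsub(/\b\d{3}-\d{2}-\d{4}\b/, "[REDACTED]") }

line = "2024-01-01 INFO user ssn=123-45-6789 logged in\n"
sizes = [1024, 64 * 1024, 1024 * 1024]
results = throughput_by_size(sizes, line) { |input| redact.call(input) }

results.each do |size, rate|
  puts format("%8d bytes  %8.1f MB/s", size, rate)
end
```

If the work scales linearly, the MB/s column stays roughly constant as the size grows. A single timed run per size is noisy, so a real measurement would repeat each size several times.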

How the comparison stays honest

vs_pure_ruby.rb and per_pattern.rb do not hard-code pattern strings. They read DataRedactor::BUILTIN_PATTERN_SOURCES and DataRedactor::BUILTIN_PATTERN_BOUNDARY live from the compiled extension, so the pure-Ruby baseline runs exactly the same patterns as the C engine, boundary wrapper included. The baseline therefore cannot drift from ext/data_redactor/patterns.c.
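As an illustration of how a pure-Ruby baseline can be built from exported pattern sources, here is a hedged sketch. The shape of the two constants is an assumption: the sketch treats the sources as a list of regex strings and the boundary data as a parallel list of flags, and uses local stand-ins (PATTERN_SOURCES, PATTERN_BOUNDARY) with two made-up patterns rather than the extension's real values. The helper names compile_patterns and redact_pure_ruby are hypothetical as well.

```ruby
# Stand-ins for the constants read from the compiled extension. Their real
# shape is an assumption: a list of regex source strings plus a parallel list
# of flags saying whether each pattern gets a word-boundary wrapper.
PATTERN_SOURCES  = ['\d{3}-\d{2}-\d{4}', '[\w.+\-]+@[\w\-]+\.[\w.]+'].freeze
PATTERN_BOUNDARY = [true, false].freeze

# Compile each source once, applying the boundary wrapper where flagged.
def compile_patterns(sources, boundary)
  sources.each_with_index.map do |src, i|
    boundary[i] ? Regexp.new("\\b(?:#{src})\\b") : Regexp.new(src)
  end
end

# Pure-Ruby baseline: one gsub pass over the text per pattern.
def redact_pure_ruby(text, patterns, placeholder = "[REDACTED]")
  patterns.reduce(text) { |acc, re| acc.gsub(re, placeholder) }
end

patterns = compile_patterns(PATTERN_SOURCES, PATTERN_BOUNDARY)
puts redact_pure_ruby("ssn=123-45-6789 mail=a.user@example.com", patterns)
```

Because the patterns are compiled from data rather than typed into the benchmark, any change to the source list changes the baseline automatically.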

Notes