Engineering Evidence

We test wavebird the way we would want our infrastructure tested. Every claim on this page is backed by reproducible evidence from controlled benchmark runs and a comprehensive pre-pilot validation campaign with fault injection.

Last updated: 2026-04-07

What we measured

Speed

28.76 ms end-to-end (p99)

Measured runtime path from request entry to a sponsoring decision being ready, excluding the AI model’s own wait time. p99 means 99% of requests were this fast or faster.
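
To make the p99 definition concrete, here is a minimal sketch using the nearest-rank method. The function name and the sample latencies are hypothetical, and the benchmark harness may compute percentiles differently (for example with interpolation).

```python
import math


def percentile_nearest_rank(samples_ms, pct):
    """Return the smallest sample such that at least pct% of samples are <= it."""
    if not samples_ms:
        raise ValueError("need at least one sample")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples_ms)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Hypothetical latencies in milliseconds: 99 fast requests and one slow outlier.
latencies_ms = [20.0] * 99 + [500.0]
p99 = percentile_nearest_rank(latencies_ms, 99)    # the outlier is excluded
p100 = percentile_nearest_rank(latencies_ms, 100)  # the maximum
```

With these made-up samples, p99 is 20.0 ms while the maximum is 500.0 ms, which is why a p99 figure says nothing about the slowest 1% of requests.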

Reliability

0 missing proofs

Over 8 hours (263,534 total slots processed) with fault injection active, we observed 0 missing proofs across the 260,321 terminal slots expected to produce proof.

Resilience

7 SSP failure scenarios

We simulated seven exchange failure modes (plus three PostgreSQL failures). Result: 0 crashes, 0 unrecoverable states, correct circuit breaker activation and recovery.

Terms used on this page

Slot: One sponsoring opportunity in the runtime (one decision attempt).
Proof: A signed evidence record produced for filled slots and used for audit and settlement.
Beacon: A post-render signal from the wrapper/app confirming a creative was rendered.
Mock-SSP: A simulated ad exchange response used to measure the internal ad path without public network noise.
Fault injection: Deliberate, randomized failures introduced during the campaign (latency jitter, slow responses, HTTP errors, malformed responses, drops, no-bid spikes).

How we tested

In March and April 2026, we ran our pre-pilot validation campaign: a set of automated tests designed to find problems before the first real partner connects. The final run completed on April 7, 2026. We did not test under ideal conditions. We deliberately broke things.

Our Mock-SSP chaos mode randomly injected network delays, server errors, malformed responses, dropped connections, and traffic spikes into the test runs. The goal is simple: prove correct behavior under failure before we connect a live partner.

Mock-SSP

Mock-SSP simulates an ad exchange response inside the benchmark harness and inside the pre-pilot chaos campaign so we can measure the internal ad path without public network noise.

Proof integrity

We processed 10,000 sponsoring slots at 100 concurrent connections with fault injection active. Result: 0 missing proofs, 0 invalid signatures, 0 orphaned beacons.

Settlement accuracy

We ran 5,000 slots through 6 billing scenarios — including micro-unit price boundaries, duplicate detection, and multi-SSP fallback attribution. Result: exact reconciliation in every scenario (0 billing errors).

Resilience

We tested 7 SSP failure scenarios plus 3 PostgreSQL failure scenarios. Result: 0 crashes and correct circuit breaker activation and recovery in all scenarios.
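
For readers unfamiliar with the pattern, "circuit breaker activation and recovery" can be sketched as a three-state machine. This is a generic illustration, not the wavebird implementation: the class name, the failure threshold, and the cooldown are all assumptions.

```python
class CircuitBreaker:
    """Generic closed -> open -> half-open breaker (illustrative only)."""

    CLOSED, OPEN, HALF_OPEN = "closed", "open", "half_open"

    def __init__(self, failure_threshold=3, cooldown_s=5.0):
        self.failure_threshold = failure_threshold  # assumed value
        self.cooldown_s = cooldown_s                # assumed value
        self.state = self.CLOSED
        self.failures = 0
        self.opened_at = None

    def allow(self, now):
        # After the cooldown, move to half-open so probe requests can pass.
        if self.state == self.OPEN and now - self.opened_at >= self.cooldown_s:
            self.state = self.HALF_OPEN
        return self.state != self.OPEN

    def record_success(self):
        # Any success closes the breaker and resets the failure count.
        self.state = self.CLOSED
        self.failures = 0

    def record_failure(self, now):
        self.failures += 1
        # A failed probe, or too many failures, (re)opens the breaker.
        if self.state == self.HALF_OPEN or self.failures >= self.failure_threshold:
            self.state = self.OPEN
            self.opened_at = now


breaker = CircuitBreaker(failure_threshold=3, cooldown_s=5.0)
for t in (0.0, 1.0, 2.0):
    breaker.record_failure(now=t)      # three SSP timeouts in a row
state_after_failures = breaker.state   # activation
blocked = not breaker.allow(now=3.0)   # still inside the cooldown
probe_allowed = breaker.allow(now=8.0) # cooldown elapsed, probe passes
breaker.record_success()               # recovery
```

The clock is passed in explicitly so the activation and recovery steps can be checked without real waiting.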

Found and fixed during the campaign

Settlement attribution bug in multi-SSP fallback: slots were incorrectly attributed to the timed-out primary SSP.

Full campaign details

The final pre-pilot campaign run completed on April 7, 2026 with chaos fault injection active. It covered release verification, resilience, proof integrity, settlement accuracy, concurrency profiling, and the full 8-hour sustained load test.

Final pre-pilot campaign

April 7, 2026 final run: 9 validation steps, all passed.
Duration: 9 hours 1 minute.
Includes: release verification, SSP resilience (7 scenarios), PG resilience (3 scenarios), Redis resilience, proof integrity (10,000 slots at c25 and c100 with chaos), settlement accuracy (5,000 slots, 6 scenarios with chaos), concurrency profiling (c10-c200 with chaos), and 8-hour sustained load with chaos.
The sustained load step passed with memory stability verification active.

Proof Chain Integrity

10,000 slots processed at concurrency 100 with chaos faults active.
294 latency jitter faults, 42 slow responses, 18 HTTP errors, 7 malformed responses, and 7 connection drops injected.
Result: 0 missing proofs, 0 invalid signatures, 0 orphaned beacons.
Every filled slot has a correctly signed proof pack.
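
The three counters in this result can be illustrated with a small audit sketch. Everything here is assumed for illustration: the record layout, the HMAC-SHA256 signature scheme, the demo key, and the definition of an orphaned beacon as a beacon with no matching proof. The actual proof pack format is not described on this page.

```python
import hashlib
import hmac
import json

SECRET = b"demo-key-not-a-real-secret"  # hypothetical signing key


def sign(record):
    """Sign a canonical JSON encoding of the record (assumed scheme)."""
    payload = json.dumps(record, sort_keys=True).encode()
    return hmac.new(SECRET, payload, hashlib.sha256).hexdigest()


def audit(filled_slot_ids, proofs, beacon_slot_ids):
    """Count missing proofs, invalid signatures, and orphaned beacons."""
    missing = [s for s in filled_slot_ids if s not in proofs]
    invalid = [
        s for s, p in proofs.items()
        if not hmac.compare_digest(sign(p["record"]), p["signature"])
    ]
    orphaned = [b for b in beacon_slot_ids if b not in proofs]
    return {
        "missing_proofs": len(missing),
        "invalid_signatures": len(invalid),
        "orphaned_beacons": len(orphaned),
    }


# Two filled slots, each with a correctly signed proof.
filled = ["slot-1", "slot-2"]
proofs = {s: {"record": {"slot_id": s, "price_micros": 1500}} for s in filled}
for p in proofs.values():
    p["signature"] = sign(p["record"])
clean = audit(filled, proofs, beacon_slot_ids=["slot-1"])

# Introduce one defect of each kind to show that the audit detects them.
proofs["slot-2"]["record"]["price_micros"] = 9999  # tampered after signing
broken = audit(filled + ["slot-3"], proofs, beacon_slot_ids=["slot-1", "slot-9"])
```

A clean run reports zero in all three counters; the deliberately broken data set reports one of each.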

Settlement Accuracy

5,000 slots across 6 test scenarios.
Standard mixed-outcome run, micro-unit price boundaries, multi-SSP fallback attribution, duplicate detection, CS profile breakdown, and 30-minute duration stability.
Result: exact reconciliation in all scenarios, 0 billing errors.

Found and fixed: settlement attribution bug in multi-SSP fallback. Slots were incorrectly attributed to the timed-out primary SSP.
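
The attribution rule behind this fix can be sketched as follows: a slot is billed to the SSP that actually filled it, never to an SSP that timed out earlier in the fallback chain. The data layout, field names, and outcome labels are hypothetical, and prices are kept as integer micro-units (assumed here to be millionths of a currency unit) so that reconciliation is exact.

```python
from collections import defaultdict


def settle(slots):
    """Sum billable micro-units per SSP, one charge per unique slot."""
    totals = defaultdict(int)
    seen = set()
    for slot in slots:
        if slot["slot_id"] in seen:
            continue  # duplicate detection: never bill a slot twice
        seen.add(slot["slot_id"])
        # Attribute to the first attempt that filled, not the first attempt made.
        winner = next((a for a in slot["attempts"] if a["outcome"] == "filled"), None)
        if winner is not None:
            totals[winner["ssp"]] += winner["price_micros"]
    return dict(totals)


slot_b = {"slot_id": "b", "attempts": [
    {"ssp": "primary", "outcome": "filled", "price_micros": 1},  # 1 micro-unit boundary
]}
slots = [
    {"slot_id": "a", "attempts": [
        {"ssp": "primary", "outcome": "timeout", "price_micros": 0},
        {"ssp": "fallback", "outcome": "filled", "price_micros": 1_250_000},
    ]},
    slot_b,
    slot_b,  # duplicate delivery of the same slot
    {"slot_id": "c", "attempts": [
        {"ssp": "primary", "outcome": "no_bid", "price_micros": 0},
    ]},
]
totals = settle(slots)
reconciled = sum(totals.values()) == 1_250_000 + 1
```

In this example slot "a" is credited to the fallback SSP even though the primary was tried first, which is the behavior the fix restored.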

Resilience Under Failure

7 SSP failure scenarios tested: connection refused, timeout, HTTP 500, HTTP 429, partial failure with fallback, flapping, and slow response.
3 PostgreSQL failure scenarios: mid-runtime drop, never available, and slow queries.
Redis fail-policy: explicitly changed from implicit fail-open to configurable fail-closed (`CSL_RATE_LIMIT_REDIS_FAIL_POLICY`).
Result: 0 crashes, 0 unrecoverable states, correct circuit breaker activation and recovery in all scenarios.
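
The difference between fail-open and fail-closed can be sketched in a few lines. Only the variable name `CSL_RATE_LIMIT_REDIS_FAIL_POLICY` comes from this page; the accepted values ("open" and "closed"), the default, and the function shape are assumptions made for illustration.

```python
import os


def rate_limit_allows(check_redis, environ=os.environ):
    """Return True if the request may proceed.

    check_redis is any callable that returns True/False, or raises
    ConnectionError when the rate-limit store is unreachable.
    """
    # Assumed values and default; the real configuration may differ.
    policy = environ.get("CSL_RATE_LIMIT_REDIS_FAIL_POLICY", "closed")
    try:
        return check_redis()
    except ConnectionError:
        # fail-open lets the request through; fail-closed rejects it.
        return policy == "open"


def redis_down():
    raise ConnectionError("redis unavailable")


allowed_when_open = rate_limit_allows(
    redis_down, {"CSL_RATE_LIMIT_REDIS_FAIL_POLICY": "open"})
allowed_when_closed = rate_limit_allows(
    redis_down, {"CSL_RATE_LIMIT_REDIS_FAIL_POLICY": "closed"})
allowed_when_healthy = rate_limit_allows(lambda: True, {})
```

Making the policy explicit means an outage of the rate-limit store results in a deliberate choice rather than an accidental one.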

Sustained Load (8 Hours)

263,534 slots processed over 8 hours with chaos faults active.
0 missing proofs across 260,321 proofable terminal slots.
All 8 hourly quality gates passed.
0 handle leaks, metric cardinality stable (67 -> 81).
Chaos faults injected: 242 latency jitter, 38 slow responses, 12 HTTP errors, 8 malformed responses, 8 connection drops, 83 no-bid spikes.
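
The figures above can be cross-checked with simple arithmetic. The sketch below uses only the numbers listed in this section; treating "8 hours" as exactly 28,800 seconds is an assumption, so the average rate is approximate.

```python
total_slots = 263_534
proofable_slots = 260_321
duration_s = 8 * 3600  # assumes exactly 8 hours

faults = {
    "latency_jitter": 242,
    "slow_response": 38,
    "http_error": 12,
    "malformed_response": 8,
    "connection_drop": 8,
    "no_bid_spike": 83,
}

# Slots that were processed but not expected to produce a proof.
non_proofable = total_slots - proofable_slots
proofable_share = proofable_slots / total_slots
avg_slots_per_s = total_slots / duration_s
total_faults = sum(faults.values())

print(f"non-proofable slots: {non_proofable}")
print(f"proofable share:     {proofable_share:.2%}")
print(f"average rate:        {avg_slots_per_s:.2f} slots/s")
print(f"faults injected:     {total_faults}")
```

This gives 3,213 non-proofable slots, 391 injected faults in total, and an average rate of roughly 9.15 slots per second.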

Memory management: slot eviction, ledger compaction, streaming settlement, projection pruning, rate-limiter sweeps, and automatic settlement snapshots are all active. The final 8-hour run completed with stable memory under all pass criteria.

Under load

We pushed the system from 10 to 200 concurrent connections to find where it starts to struggle. It did not crash at any tested level: response times rise and throughput falls, but it keeps returning valid responses.

“c100” means 100 concurrent connections.

Concurrent connections    Response time (p99)    Throughput
10                        64 ms                  333 ops/s
25                        293 ms                 126 ops/s
50                        695 ms                 92 ops/s
75                        1,203 ms               73 ops/s
100                       1,764 ms               64 ops/s
150                       3,267 ms               52 ops/s
200                       3,590 ms               33 ops/s

At 200 concurrent connections, p99 response time increases to 3.6 seconds but every response is still valid (2xx). Under that extreme load we see decision poll timeouts; when load drops back to 25 connections, the system recovers within 30 seconds.
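
To put the degradation in relative terms, the sketch below expresses each row of the table as a multiple of the c10 baseline. It uses only the published table values; the variable names are ours.

```python
# (concurrent connections, p99 response time in ms, throughput in ops/s)
rows = [
    (10, 64, 333),
    (25, 293, 126),
    (50, 695, 92),
    (75, 1203, 73),
    (100, 1764, 64),
    (150, 3267, 52),
    (200, 3590, 33),
]

base_c, base_p99, base_tput = rows[0]
for c, p99_ms, tput in rows:
    print(f"c{c}: p99 x{p99_ms / base_p99:.1f}, "
          f"throughput {tput / base_tput:.0%} of c{base_c}")

p99_growth = rows[-1][1] / base_p99    # c200 p99 relative to c10
tput_ratio = rows[-1][2] / base_tput   # c200 throughput relative to c10

# Degradation is monotonic: latency only rises, throughput only falls.
latency_rises = all(a[1] < b[1] for a, b in zip(rows, rows[1:]))
throughput_falls = all(a[2] > b[2] for a, b in zip(rows, rows[1:]))
```

Between c10 and c200, p99 grows by a factor of about 56 while throughput drops to roughly a tenth of the baseline, with no level at which the trend reverses.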

How to read this table

The table has no HTTP error column because there were none to report: every response was 2xx at every concurrency level. Decision poll timeouts are a separate measure, and under extreme load we do observe them (2 at c100, 130 at c150, and 1,871 at c200). The system degrades gracefully rather than failing hard. Spike recovery from c200 to c25 completes within 30 seconds.

Sustained operation (8 hours)

We ran the system continuously for 8 hours with fault injection active. All 9 validation steps passed, including the full 8-hour sustained load test with memory stability verification.

Memory stability

The runtime now includes slot eviction, ledger compaction, streaming settlement, projection pruning, rate-limiter sweeps, and automatic settlement snapshots. The 8-hour soak test passed all memory stability criteria with these optimizations active.
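
As an illustration of why slot eviction keeps memory flat, here is a minimal sketch in which a slot is dropped from the in-memory store as soon as its proof has been handed off. The class, method names, and the list used as a stand-in for durable proof storage are all hypothetical.

```python
class SlotStore:
    """In-memory slot table that evicts a slot once its proof is fulfilled."""

    def __init__(self):
        self._slots = {}

    def open_slot(self, slot_id, payload):
        self._slots[slot_id] = payload

    def fulfill_proof(self, slot_id, proof_sink):
        # pop() removes the slot, so memory no longer grows with slot count.
        payload = self._slots.pop(slot_id)
        proof_sink.append({"slot_id": slot_id, "payload": payload})

    def __len__(self):
        return len(self._slots)


store = SlotStore()
proof_sink = []  # stand-in for durable proof storage
peak_live_slots = 0
for i in range(1000):
    store.open_slot(f"slot-{i}", {"seq": i})
    peak_live_slots = max(peak_live_slots, len(store))
    store.fulfill_proof(f"slot-{i}", proof_sink)
```

After 1,000 slots the store is empty and never held more than one live slot at a time, while all 1,000 proofs were handed off.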

Detailed methodology

The benchmark suite and the pre-pilot campaign were both run under controlled conditions. The goal was to measure the wavebird runtime itself, not the public internet or live model providers.

Evidence date: 2026-03-23 (benchmarks), 2026-04-07 (final pre-pilot campaign)
Execution mode: Local host benchmark harness
Exchange substitute: Mock-SSP
Runs: 7
Warmup requests: 1,000
Measured requests per run after warmup: Not yet published in the current sanitized evidence bundle
Selection method: Median per benchmark
Pre-pilot campaign: Chaos fault injection via configurable Mock-SSP chaos mode with latency jitter (30%), slow responses (3%), HTTP errors (2%), malformed responses (1%), connection drops (0.5%), and periodic no-bid spikes.
Final campaign run: April 7, 2026, with slot eviction, ledger compaction, streaming settlement, and V8 memory optimization active. Node.js started with --expose-gc --max-old-space-size=4096.
Runtime memory management: Automatic slot eviction after proof fulfillment, incremental ledger compaction, streaming settlement exports (no large temporary arrays), projection pruning, rate-limiter bucket sweeps, periodic settlement snapshots, and V8 garbage collection hints after settlement cycles.
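
The chaos rates listed above can be read as a per-request fault lottery. The sketch below is one possible interpretation, not the Mock-SSP implementation: it assumes the five rates are mutually exclusive per request, uses hypothetical fault names, and leaves out the periodic no-bid spikes because they are time-based rather than probabilistic.

```python
import random

# Rates taken from the methodology list; the names are illustrative.
FAULT_RATES = [
    ("latency_jitter", 0.30),
    ("slow_response", 0.03),
    ("http_error", 0.02),
    ("malformed_response", 0.01),
    ("connection_drop", 0.005),
]


def pick_fault(rng):
    """Return one fault name, or None for a clean response."""
    roll = rng.random()
    cumulative = 0.0
    for name, rate in FAULT_RATES:
        cumulative += rate
        if roll < cumulative:
            return name
    return None


rng = random.Random(42)  # fixed seed so a chaos run can be replayed
counts = {}
for _ in range(100_000):
    fault = pick_fault(rng)
    counts[fault] = counts.get(fault, 0) + 1

for name, n in sorted(counts.items(), key=lambda kv: -kv[1]):
    print(f"{name or 'no_fault'}: {n / 100_000:.2%}")
```

Seeding the generator is a design choice worth copying: it makes a failing chaos run reproducible.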

Per-run variation exists internally and will be published once the sanitized artifact bundle is ready. The original benchmark methodology remains unchanged and the March 23 results remain valid.
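
The "median per benchmark" selection across 7 runs can be sketched as follows. The per-run values below are invented purely for illustration and are not the unpublished wavebird measurements.

```python
import statistics

# Hypothetical p99 results (ms) from 7 runs of one benchmark.
runs_p99_ms = [29.6, 27.9, 35.2, 28.3, 29.1, 30.4, 28.8]

reported = statistics.median(runs_p99_ms)  # middle value of the sorted runs
mean = statistics.mean(runs_p99_ms)
spread = max(runs_p99_ms) - min(runs_p99_ms)

print(f"median: {reported} ms, mean: {mean:.2f} ms, spread: {spread:.1f} ms")
```

With an odd number of runs the median is always one of the actual measurements, and a single noisy run (35.2 ms here) pulls the mean upward but leaves the median unchanged.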

Full benchmark metrics

Benchmarks

March 23, 2026

Firewall p99 latency

0.22 ms

Filtering step before any ad request leaves the runtime.

Mock-SSP round-trip p99 latency

15.28 ms

Internal ad path against a controlled exchange substitute.

End-to-end p99 latency

28.76 ms

Measured runtime path with external model wait time excluded.

Settlement max runtime

887.58 ms

Longest measured settlement run in the current evidence pack.

Mock-SSP request throughput

1,364.82 ops/s

Controlled request throughput inside the benchmark harness.

Pre-pilot campaign

April 7, 2026

Proof integrity

10,000 slots

Processed at c100 with 0 missing proofs.

Settlement accuracy

5,000 slots

6 scenarios with exact reconciliation.

SSP resilience

7 failure modes

0 crashes across SSP failure scenarios.

Concurrency tested

c10-c200

Graceful degradation under spike load.

Sustained load

263,534 slots

Processed over 8 hours with all memory stability criteria passed.

What this does not claim

We are transparent about what this evidence does and does not prove:

These are internal measurements, not third-party audits.
Latency was measured locally, not across the public internet or live model providers.
The exchange partner was simulated (Mock-SSP), not a live partner.
These numbers are not a production SLA.
The first live partner integration is the next milestone.

What is still open

Two things are still open in the original benchmark suite: beacon p99 latency is above target at 50 and 100 concurrent connections, and jobs/sec remains below target. The memory stability finding from the earlier 8-hour sustained run is resolved in the final campaign.

Beacon p99 at concurrency 50 and 100 remains above target in the original benchmark suite.
Jobs/sec remains below target in the original benchmark suite.

Artifacts

Downloadable artifacts will be published once the sanitized bundle is ready for public release. Pre-pilot campaign reports are available internally as machine-readable JSON artifacts.

