forge-docs

Performance Testing Summary (ECS Native + k6)

Executive Summary

We executed a structured performance test plan against Forge deployed on AWS ECS (GraalVM native), with k6 generating load from inside the same VPC. The key result: Forge delivered repeatable throughput and predictable tail latency at scale, showed clear saturation signals, and recovered cleanly under horizontal scaling.

Across the baseline and tuned scaling envelopes:

Throughout these steps, aggregate HTTP medians remained in the single-digit millisecond range. This is the strongest evidence that the internal service tier is not the bottleneck for this workload at these concurrency levels and that the services recover predictably under horizontal scaling.

What we can confidently say:

For the detailed results and math, see the phase write-ups in this directory, especially Phase 4:

Phase 4: key graphs

These were captured during the Phase 4 runs and provide quick “shape of system” evidence alongside the envelope doc.

ECS CPUUtilization (Phase 4, 200 VUs dataset)

RDS PostgreSQL metrics (Phase 4, 200 VUs dataset)


Context (test environment and topology)

This testing was intentionally “small footprint” to validate that Forge scales predictably from minimal infrastructure.

This is not a “final production architecture”; it is a controlled baseline to validate repeatability and scaling behaviour.


Prerequisites (what you need to reproduce)


Tuning and feedback (what we changed and why it mattered)

Tuning across these phases focused on connection lifecycle and pool sizing, so that the system stays stable under higher concurrency.
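As a rough illustration of the pool-sizing side of this tuning, Little's Law relates request rate and connection hold time to the number of connections in use at once. The sketch below is not the sizing method used in these phases: the function names, the headroom factor, and the example figures are all hypothetical, and real pool limits also depend on the database's own connection ceiling.

```python
import math


def required_connections(requests_per_second: float,
                         avg_hold_seconds: float,
                         headroom: float = 1.25) -> int:
    """Estimate concurrent connections via Little's Law (L = lambda * W).

    headroom is an assumed safety multiplier, not a value from the test plan.
    """
    if requests_per_second < 0 or avg_hold_seconds < 0:
        raise ValueError("rate and hold time must be non-negative")
    if headroom < 1.0:
        raise ValueError("headroom must be at least 1.0")
    in_use = requests_per_second * avg_hold_seconds
    # Always keep at least one connection, and round up to a whole connection.
    return max(1, math.ceil(in_use * headroom))


def pool_size_per_task(total_connections: int, task_count: int) -> int:
    """Split the total evenly across tasks, rounding up."""
    if task_count < 1:
        raise ValueError("task_count must be at least 1")
    return math.ceil(total_connections / task_count)


# Hypothetical inputs: 400 requests/s, each holding a connection for 5 ms,
# spread over 2 tasks per service.
total = required_connections(400, 0.005)
per_task = pool_size_per_task(total, task_count=2)
print(total, per_task)
```

The point of the estimate is direction rather than precision: if hold time grows under load (for example, because connections are not released promptly), the required pool grows with it, which is why lifecycle and sizing tend to be tuned together.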


Overview (test plan and where to find results)

The test plan is the canonical reference for phases, success criteria, and methodology:

Phase write-ups (recommended reading order):

Raw artifacts (k6 outputs, CSV extracts, screenshots) are stored adjacent to their phase envelopes inside each phase folder.


Phase 4 focus (200 VUs, Runs 3–5)

Phase 4 at 200 VUs is the first dataset that materially resembles production scaling: it uses true concurrent multi-user behaviour and repeatable steady-state runs after tuning and horizontal scaling.

From the Phase 4 envelope:

The key scaling moment is the step from 1 task → 2 tasks per service:

Interpretation:


External dependencies (what limits the observed tails)

This test’s “baseline mix” scenario is intentionally auth-heavy and exercises external identity operations. As a result:

This is why “system-wide p95/p99” in the mixed scenario should be interpreted as end-to-end including auth, not purely internal service compute.
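To make the distinction concrete, the sketch below computes percentiles separately for auth and internal requests as well as for the combined set. It assumes the request timings have already been extracted into (group, duration in ms) pairs; the group names "auth" and "internal", the nearest-rank percentile method, and the sample values are illustrative assumptions, not the project's actual tagging or data.

```python
import math
from collections import defaultdict


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% at or below it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


def summarize_by_group(records):
    """Return p50/p95/p99 per group, plus an 'all' entry for the combined set.

    records is an iterable of (group, duration_ms) pairs. A group literally
    named 'all' would collide with the combined entry, so avoid that name.
    """
    groups = defaultdict(list)
    for group, duration_ms in records:
        groups[group].append(duration_ms)
        groups["all"].append(duration_ms)
    return {
        name: {f"p{p}": percentile(values, p) for p in (50, 95, 99)}
        for name, values in groups.items()
    }


# Hypothetical timings: fast internal calls mixed with slower auth calls.
records = [
    ("internal", 2), ("internal", 3), ("internal", 4), ("internal", 5),
    ("auth", 100), ("auth", 200),
]
for name, stats in summarize_by_group(records).items():
    print(name, stats)
```

With a split like this, a high combined p95 can be traced to the group that produces it instead of being attributed to the internal service tier.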


Future work and why we stopped at 200 VUs

Phase 4’s objective was not simply to maximize VU count. It was to identify whether saturation signals appeared and whether the platform recovered predictably under horizontal scaling. The 200 VU runs achieved both outcomes.
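One common way to put a number on "recovered predictably under horizontal scaling" is scaling efficiency: the observed throughput gain divided by the ideal gain from the added tasks. The sketch below is a generic calculation, not the analysis used in the Phase 4 envelope; the function name and the throughput figures are hypothetical placeholders.

```python
def scaling_efficiency(base_throughput: float, base_tasks: int,
                       scaled_throughput: float, scaled_tasks: int) -> float:
    """Observed speedup divided by ideal (linear) speedup.

    1.0 means perfectly linear scaling; values well below 1.0 suggest a
    shared bottleneck that the extra tasks did not relieve.
    """
    if min(base_throughput, scaled_throughput) <= 0:
        raise ValueError("throughput values must be positive")
    if min(base_tasks, scaled_tasks) < 1:
        raise ValueError("task counts must be at least 1")
    observed_speedup = scaled_throughput / base_throughput
    ideal_speedup = scaled_tasks / base_tasks
    return observed_speedup / ideal_speedup


# Hypothetical figures: 100 req/s at 1 task, 180 req/s at 2 tasks.
efficiency = scaling_efficiency(100, 1, 180, 2)
print(f"{efficiency:.2f}")
```

Because k6 virtual users form a closed loop, throughput at a fixed VU count also depends on response time, so a figure like this is best read alongside the latency percentiles and not on its own.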

We stopped at 200 VUs because we had already demonstrated the properties we needed for this stage:

The next steps depend on the audience and intended production shape:


Glossary (reading the metrics)