Performance Testing and Tuning Guidelines

Philosophy: Measure First, Optimize Never (Until Necessary)

"Premature optimization is the root of all evil." — Donald Knuth, The Art of Computer Programming

Rob Pike's rules of performance (Notes on Programming in C, 1989):

You can't tell where a program is going to spend its time. Bottlenecks occur in surprising places — do not guess; measure.
Measure. Don't tune for speed until you've measured, and even then don't unless one part of the code overwhelms the rest.
Fancy algorithms are slow when n is small, and n is usually small. Until n is frequently large, prefer simple algorithms.
Fancy algorithms are buggier and harder to implement. Simple algorithms with simple data structures are safer and often fast enough.
Data dominates. Right data structures make algorithms self-evident. Design data structures first.

Consequence: write correct, readable code first. Add performance work only when a measured bottleneck justifies it, with a concrete target (e.g., "p99 latency < 200 ms under 500 RPS").

Performance Test Types

Type	Purpose	Mandatory?
Benchmark	Throughput/latency of a function or component in isolation	Mandatory for libraries and hot-path service code (see environment note below)
Load test	Validates behavior under expected peak load	Mandatory for services
Stress test	Finds the breaking point beyond expected load	Mandatory before first production release
Soak / Endurance	Detects memory leaks and resource exhaustion under sustained load	Mandatory for long-running services
Spike test	Validates behavior under sudden, sharp load increases	Recommended for auto-scaling services
Profiling	Identifies where time and memory are actually spent	On-demand: run when a benchmark reveals a problem

CLI tools: time ./mytool is usually sufficient unless the tool processes large data volumes.

What to Measure

Services

Collect at a minimum:

Latency: p50, p95, p99 (never averages; they hide tail latency).
Throughput: requests per second at target latency.
Error rate: must remain 0 % (or within the SLO-defined threshold) under load.
Resource utilization: CPU, memory, open file descriptors, connection pool exhaustion.

Tools: k6, Gatling, wrk.

Libraries

Use language-native micro-benchmark frameworks:

Language	Framework
Go	`go test -bench` + `benchstat` for statistical comparison
Java	JMH (Java Microbenchmark Harness)
C++	Google Benchmark
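As a rough illustration of the Go row, the sketch below shows the shape of a benchmark function. `joinWords` is a hypothetical function under test, and the `main` function exists only so the example runs on its own; in a real project `BenchmarkJoinWords` would live in a `_test.go` file and be run by `go test -bench`.

```go
package main

import (
	"fmt"
	"strings"
	"testing"
)

// sink keeps the result alive so the compiler cannot discard the call.
var sink string

// joinWords is a hypothetical function under test.
func joinWords(words []string) string {
	var sb strings.Builder
	for _, w := range words {
		sb.WriteString(w)
	}
	return sb.String()
}

// BenchmarkJoinWords has the signature that `go test -bench` expects.
func BenchmarkJoinWords(b *testing.B) {
	words := []string{"measure", "first", "optimize", "later"}
	b.ReportAllocs() // include allocations per operation in the result
	b.ResetTimer()   // exclude the setup above from the measurement
	for i := 0; i < b.N; i++ {
		sink = joinWords(words)
	}
}

func main() {
	// testing.Benchmark runs a benchmark outside of `go test`,
	// which keeps this sketch self-contained.
	res := testing.Benchmark(BenchmarkJoinWords)
	fmt.Println(res.String(), res.MemString())
}
```

For a statistical comparison, one common workflow is to run `go test -run='^$' -bench=. -count=10` before and after a change, save each output to a file, and pass both files to `benchstat`.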

Benchmark Environments

CI runners (especially cloud-hosted ones) have variable CPU frequency, noisy neighbors, and non-reproducible hardware, so absolute numbers are not comparable across runs.

Goal	Environment
Regression detection	CI is sufficient: compare relative change against a baseline on the same runner type using statistical tools (e.g., `benchstat` for Go, or a comparison of saved JMH result files for Java). Fail the build on regressions above a threshold (e.g., 10 %).
Absolute targets ("p99 < 200 ms")	Dedicated environment: self-hosted runner or bare-metal machine with pinned CPU frequency (disable turbo boost and frequency scaling), no other load, reproducible across runs.
Release sign-off / profiling	Dedicated environment only.

A self-hosted CI runner with pinned hardware is a pragmatic middle ground: regression detection and meaningful absolute numbers in one place.
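The regression threshold described above can be sketched as a relative-change check. This is a simplified illustration only: the function name `regressed` and the numbers are hypothetical, and it compares two single values without any significance testing, which is the part a tool such as `benchstat` adds.

```go
package main

import "fmt"

// regressed reports whether current is slower than baseline by more than
// threshold (0.10 means 10 %), and returns the relative change.
// Both inputs are assumed to be in the same unit, e.g. ns/op.
func regressed(baseline, current, threshold float64) (bool, float64) {
	if baseline <= 0 {
		return false, 0 // no usable baseline: do not fail the build
	}
	change := (current - baseline) / baseline
	return change > threshold, change
}

func main() {
	// Hypothetical measurements in ns/op from the same runner type.
	fail, change := regressed(1000, 1150, 0.10)
	fmt.Printf("change=%+.1f%% fail=%v\n", change*100, fail)
}
```

Because the check is relative, it stays meaningful on noisy CI hardware as long as baseline and current run were taken on the same runner type.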

CLI Tools

time ./mytool input.dat is sufficient for most cases. For batch-processing tools, include a benchmark with realistic input sizes in the test suite.

Profiling vs. Performance Testing

	Performance Testing	Profiling
Question	"Does the system meet its targets?"	"Where does the time/memory go?"
Scope	End-to-end, external view	Internal, code-level view
When	Continuously in CI (benchmarks); before releases (load tests)	On-demand after a test reveals a problem
Output	Latency, throughput, error rates	Flame graphs, allocation profiles, call trees

Tools: pprof (Go), async-profiler / JFR (Java), perf / Valgrind / Heaptrack (C++).
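For the Go case, a minimal sketch of capturing a CPU profile with the standard `runtime/pprof` package might look like this. `busyWork` is a hypothetical stand-in for a hot path, and the profile is written to an in-memory buffer only to keep the example self-contained; in practice it would usually go to a file.

```go
package main

import (
	"bytes"
	"fmt"
	"runtime/pprof"
)

// busyWork is a hypothetical CPU-bound function standing in for a hot path.
func busyWork(n int) int {
	sum := 0
	for i := 0; i < n; i++ {
		sum += i % 7
	}
	return sum
}

// profileCPU runs fn under the CPU profiler and returns the raw profile.
func profileCPU(fn func()) ([]byte, error) {
	var buf bytes.Buffer
	if err := pprof.StartCPUProfile(&buf); err != nil {
		return nil, err
	}
	fn()
	pprof.StopCPUProfile() // stops sampling and flushes the profile to buf
	return buf.Bytes(), nil
}

func main() {
	data, err := profileCPU(func() { busyWork(50000000) })
	if err != nil {
		fmt.Println("profiling failed:", err)
		return
	}
	fmt.Printf("captured %d bytes of CPU profile\n", len(data))
}
```

When the profile is written to a file instead, it can be inspected with `go tool pprof`, which is where flame graphs and call trees come from.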

Performance Observability in Production

Prefer instrumentation over logging for performance data:

Distributed tracing (preferred): instrument with OpenTelemetry; export to Jaeger or a compatible backend. Traces reveal latency at span level without ad-hoc timing code.
Spring Boot: spring-boot-starter-actuator + Micrometer Tracing.
Go: go.opentelemetry.io/otel.
Metrics: expose RED metrics (Rate, Errors, Duration) via Prometheus. Instrument at the framework/middleware level — not inline in business logic.
Performance logging as last resort: log duration of individual operations only when tracing is unavailable and a specific bottleneck is under investigation. Remove or gate behind a debug flag afterward.

Do not scatter System.currentTimeMillis() / time.Now() throughout business logic. Use interceptors, middleware, or AOP.

Common Performance Bottlenecks

Address these before writing custom optimizations:

N+1 queries: fetch related data in bulk; use JOIN or batch loading.
Missing database indexes: profile slow queries (EXPLAIN ANALYZE); index frequent filter and sort columns.
Serialization overhead: benchmark JSON vs. binary formats (Protobuf, Avro) when throughput is critical.
Chatty APIs: replace multiple fine-grained calls with one coarse-grained or batch call.
Lock contention: prefer immutability, lock-free structures, or message passing over shared mutable state.
Synchronous blocking in async contexts: never block an event-loop or virtual-thread carrier thread.
Memory allocation / GC pressure: minimize short-lived allocations in hot paths; pool expensive objects.
Thread/connection pool exhaustion: size pools using Little's Law; monitor queue depth, not just pool size.
Missing caching: cache expensive, read-heavy, low-churn results at the right layer (in-process → distributed).

Directives

Measure before optimizing; never guess bottlenecks
Use OpenTelemetry and Prometheus for production performance data — do not scatter timing code in business logic
Collect p50/p95/p99 latency — never rely on averages
Load tests mandatory for services; benchmarks mandatory for libraries and hot-path service code
Address common bottlenecks first: N+1 queries, missing DB indexes, chatty APIs, lock contention, GC pressure
Use dedicated environments for absolute latency targets; CI is sufficient for regression detection