Performance Testing and Tuning Guidelines
Philosophy: Measure First, Optimize Never (Until Necessary)
"Premature optimization is the root of all evil." (Donald Knuth, "Structured Programming with go to Statements", 1974)
Rob Pike's rules of performance (Notes on Programming in C, 1989):
- You can't tell where a program is going to spend its time. Bottlenecks occur in surprising places — do not guess; measure.
- Measure. Don't tune for speed until you've measured, and even then don't unless one part of the code overwhelms the rest.
- Fancy algorithms are slow when n is small, and n is usually small. Until n is frequently large, prefer simple algorithms.
- Fancy algorithms are buggier and harder to implement. Simple algorithms with simple data structures are safer and often fast enough.
- Data dominates. Right data structures make algorithms self-evident. Design data structures first.
Consequence: write correct, readable code first. Add performance work only when a measured bottleneck justifies it, with a concrete target (e.g., "p99 latency < 200 ms under 500 RPS").
Performance Test Types
| Type | Purpose | Mandatory? |
|---|---|---|
| Benchmark | Throughput/latency of a function or component in isolation | Mandatory for libraries and hot-path service code (see environment note below) |
| Load test | Validates behavior under expected peak load | Mandatory for services |
| Stress test | Finds the breaking point beyond expected load | Mandatory before first production release |
| Soak / Endurance | Detects memory leaks and resource exhaustion under sustained load | Mandatory for long-running services |
| Spike test | Validates behavior under sudden, sharp load increases | Recommended for auto-scaling services |
| Profiling | Identifies where time and memory are actually spent | On-demand: run when a benchmark reveals a problem |
CLI tools: `time ./mytool` is usually sufficient unless the tool processes large data volumes.
What to Measure
Services
Collect at a minimum:
- Latency: p50, p95, p99. Never rely on averages; they hide tail latency.
- Throughput: requests per second at target latency.
- Error rate: must remain 0 % (or below the SLO-defined threshold) under load.
- Resource utilization: CPU, memory, open file descriptors, connection pool usage.
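To illustrate why averages hide tail latency, here is a minimal sketch that summarizes latency samples with nearest-rank percentiles. The sample values, the function names `percentile` and `summarize`, and the 200 ms target are assumptions made for illustration; a real load-testing tool or metrics backend would normally compute these figures.

```python
import math
import statistics


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def summarize(latencies_ms):
    """Return the mean next to the percentiles so they can be compared."""
    return {
        "mean": statistics.fmean(latencies_ms),
        "p50": percentile(latencies_ms, 50),
        "p95": percentile(latencies_ms, 95),
        "p99": percentile(latencies_ms, 99),
    }


# Hypothetical sample: 97 fast requests and 3 very slow ones.
latencies_ms = [20.0] * 97 + [900.0] * 3
summary = summarize(latencies_ms)

# Hypothetical target: p99 latency below 200 ms.
meets_target = summary["p99"] < 200.0
```

In this made-up sample the mean stays well below the target while p99 lands on one of the slow requests, so a check based on the mean would pass a system that violates its p99 target.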
Libraries
Use language-native micro-benchmark frameworks:
| Language | Framework |
|---|---|
| Go | go test -bench + benchstat for statistical comparison |
| Java | JMH (Java Microbenchmark Harness) |
| C++ | Google Benchmark |
Benchmark Environments
CI runners (especially cloud-hosted) have variable CPU frequency, noisy neighbors, and non-reproducible hardware, so absolute numbers are not comparable across runs or machines.
| Goal | Environment |
|---|---|
| Regression detection | CI is sufficient: compare relative change against a baseline on the same runner type using statistical tools (e.g., benchstat for Go; for JMH, compare the result files of two runs). Fail the build on regressions above a threshold (e.g., 10 %). |
| Absolute targets ("p99 < 200 ms") | Dedicated environment: self-hosted runner or bare-metal machine with pinned CPU frequency (disable turbo boost and frequency scaling), no other load, reproducible across runs. |
| Release sign-off / profiling | Dedicated environment only. |
A self-hosted CI runner with pinned hardware is a pragmatic middle ground: regression detection and meaningful absolute numbers in one place.
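The regression-detection row above can be sketched as a relative comparison of medians against a threshold. This is a simplified illustration, not how benchstat works: benchstat also applies a statistical significance test, which this sketch omits. The function names and the sample timings are hypothetical.

```python
import statistics


def relative_change(baseline, candidate):
    """Relative change of the candidate median versus the baseline median.
    Positive values mean the candidate is slower."""
    base = statistics.median(baseline)
    cand = statistics.median(candidate)
    if base <= 0:
        raise ValueError("baseline median must be positive")
    return (cand - base) / base


def is_regression(baseline, candidate, threshold=0.10):
    """True if the candidate is slower than the baseline by more than
    `threshold` (0.10 corresponds to the 10 % example above)."""
    return relative_change(baseline, candidate) > threshold


# Hypothetical timings in ns/op, collected on the same runner type.
baseline_ns = [100, 102, 98, 101, 99]
slower_ns = [115, 117, 113, 116, 114]
similar_ns = [103, 105, 101, 104, 102]

slower_fails = is_regression(baseline_ns, slower_ns)    # 15 % slower
similar_fails = is_regression(baseline_ns, similar_ns)  # 3 % slower
```

A CI job could exit with a non-zero status when `is_regression` returns true. Using the median of several runs reduces the influence of a single noisy run, but it does not replace a proper significance test.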
CLI Tools
`time ./mytool input.dat` is sufficient for most cases. For batch-processing tools, include a benchmark with realistic input sizes in the test suite.
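For a repeatable variant of `time`, a small harness can run the command several times and report the median. The following is a sketch under assumptions: `time_command` is a hypothetical helper, and the command shown is a placeholder that stands in for something like `["./mytool", "input.dat"]` with a realistically sized input.

```python
import statistics
import subprocess
import sys
import time


def time_command(argv, runs=5):
    """Run `argv` `runs` times and return the wall-clock durations in seconds."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        # check=True makes a failing command raise instead of being timed silently.
        subprocess.run(argv, check=True, stdout=subprocess.DEVNULL)
        durations.append(time.perf_counter() - start)
    return durations


if __name__ == "__main__":
    # Placeholder command; replace with the real tool and input file.
    placeholder = [sys.executable, "-c", "print(sum(range(1_000_000)))"]
    durations = time_command(placeholder, runs=5)
    print(f"median={statistics.median(durations):.3f}s min={min(durations):.3f}s")
```

Each measurement includes process start-up time, which is usually what matters for a CLI tool but can dominate for very short runs.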
Profiling vs. Performance Testing
| | Performance Testing | Profiling |
|---|---|---|
| Question | "Does the system meet its targets?" | "Where does the time/memory go?" |
| Scope | End-to-end, external view | Internal, code-level view |
| When | Continuously in CI (benchmarks); before releases (load tests) | On-demand after a test reveals a problem |
| Output | Latency, throughput, error rates | Flame graphs, allocation profiles, call trees |
Tools: pprof (Go), async-profiler / JFR (Java), perf / Valgrind / Heaptrack (C++).
Performance Observability in Production
Prefer instrumentation over logging for performance data:
- Distributed tracing (preferred): instrument with OpenTelemetry; export to Jaeger or a compatible backend. Traces reveal latency at span level without ad-hoc timing code.
    - Spring Boot: `spring-boot-starter-actuator` + Micrometer Tracing.
    - Go: `go.opentelemetry.io/otel`.
- Metrics: expose RED metrics (Rate, Errors, Duration) via Prometheus. Instrument at the framework/middleware level, not inline in business logic.
- Performance logging as last resort: log duration of individual operations only when tracing is unavailable and a specific bottleneck is under investigation. Remove or gate behind a debug flag afterward.
Do not scatter System.currentTimeMillis() / time.Now() throughout business logic.
Use interceptors, middleware, or AOP.
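As a sketch of the middleware approach, the following WSGI wrapper records request durations in one place so that handlers contain no timing code. `TimingMiddleware` and `hello_app` are hypothetical names, and the in-memory dictionary is only a stand-in for a real metrics or tracing backend such as a Prometheus histogram or an OpenTelemetry span.

```python
import time
from collections import defaultdict


class TimingMiddleware:
    """WSGI middleware that records how long the wrapped app takes per path."""

    def __init__(self, app, clock=time.perf_counter):
        self.app = app
        self.clock = clock
        self.durations = defaultdict(list)  # path -> list of durations in seconds

    def __call__(self, environ, start_response):
        start = self.clock()
        try:
            return self.app(environ, start_response)
        finally:
            # Covers the time until the app returns its response iterable;
            # a body that is streamed lazily afterwards is not included.
            path = environ.get("PATH_INFO", "")
            self.durations[path].append(self.clock() - start)


def hello_app(environ, start_response):
    """Business logic stays free of timing code."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"hello"]


app = TimingMiddleware(hello_app)
```

Calling `app` with a WSGI environ and a `start_response` callable returns the handler's response unchanged and appends one duration for the request path.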
Common Performance Bottlenecks
Address these before writing custom optimizations:
- N+1 queries: fetch related data in bulk; use `JOIN` or batch loading.
- Missing database indexes: profile slow queries (`EXPLAIN ANALYZE`); index frequent filter and sort columns.
- Serialization overhead: benchmark JSON vs. binary formats (Protobuf, Avro) when throughput is critical.
- Chatty APIs: replace multiple fine-grained calls with one coarse-grained or batch call.
- Lock contention: prefer immutability, lock-free structures, or message passing over shared mutable state.
- Synchronous blocking in async contexts: never block an event-loop or virtual-thread carrier thread.
- Memory allocation / GC pressure: minimize short-lived allocations in hot paths; pool expensive objects.
- Thread/connection pool exhaustion: size pools using Little's Law; monitor queue depth, not just pool size.
- Missing caching: cache expensive, read-heavy, low-churn results at the right layer (in-process → distributed).
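For the pool-sizing item, Little's Law states that the average number of requests in flight equals the arrival rate multiplied by the average time each request holds the resource (L = λ × W). The sketch below applies it to a connection pool; the function name, the headroom parameter, and the example numbers are assumptions for illustration.

```python
import math


def required_pool_size(arrival_rate_per_s, mean_hold_time_ms, headroom=0.0):
    """Estimate pool size with Little's Law (L = lambda * W).

    arrival_rate_per_s: requests per second that need a pooled resource
    mean_hold_time_ms:  average time one request holds the resource
    headroom:           extra fraction for bursts (0.25 means 25 % extra)
    """
    if arrival_rate_per_s < 0 or mean_hold_time_ms < 0 or headroom < 0:
        raise ValueError("inputs must be non-negative")
    in_flight = arrival_rate_per_s * mean_hold_time_ms / 1000
    return math.ceil(in_flight * (1 + headroom))


# Hypothetical service: 500 RPS, each request holds a connection for 40 ms.
steady_state = required_pool_size(500, 40)                 # 500 * 0.040 = 20
with_headroom = required_pool_size(500, 40, headroom=0.25)  # 20 * 1.25 = 25
```

The result is an average, not an upper bound: bursts and slow outliers raise the number of requests in flight above it, which is why queue depth should still be monitored.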
Directives
- Measure before optimizing; never guess bottlenecks
- Use OpenTelemetry and Prometheus for production performance data — do not scatter timing code in business logic
- Collect p50/p95/p99 latency — never rely on averages
- Load tests mandatory for services; benchmarks mandatory for libraries and hot-path service code
- Address common bottlenecks first: N+1 queries, missing DB indexes, chatty APIs, lock contention, GC pressure
- Use dedicated environments for absolute latency targets; CI is sufficient for regression detection