Performance Testing and Tuning Guidelines

Philosophy: Measure First, Optimize Never (Until Necessary)

"Premature optimization is the root of all evil." (Donald Knuth, "Structured Programming with go to Statements", 1974)

Rob Pike's rules of performance (Notes on Programming in C, 1989):

  1. You can't tell where a program is going to spend its time. Bottlenecks occur in surprising places — do not guess; measure.
  2. Measure. Don't tune for speed until you've measured, and even then don't unless one part of the code overwhelms the rest.
  3. Fancy algorithms are slow when n is small, and n is usually small. Until n is frequently large, prefer simple algorithms.
  4. Fancy algorithms are buggier and harder to implement. Simple algorithms with simple data structures are safer and often fast enough.
  5. Data dominates. Right data structures make algorithms self-evident. Design data structures first.

Consequence: write correct, readable code first. Add performance work only when a measured bottleneck justifies it, with a concrete target (e.g., "p99 latency < 200 ms under 500 RPS").

Performance Test Types

| Type | Purpose | Mandatory? |
|---|---|---|
| Benchmark | Throughput/latency of a function or component in isolation | Mandatory for libraries and hot-path service code (see environment note below) |
| Load test | Validates behavior under expected peak load | Mandatory for services |
| Stress test | Finds the breaking point beyond expected load | Mandatory before first production release |
| Soak / Endurance | Detects memory leaks and resource exhaustion under sustained load | Mandatory for long-running services |
| Spike test | Validates behavior under sudden, sharp load increases | Recommended for auto-scaling services |
| Profiling | Identifies where time and memory are actually spent | On-demand: run when a benchmark reveals a problem |

CLI tools: time ./mytool is usually sufficient unless the tool processes large data volumes.

What to Measure

Services

Collect at a minimum:

  • Latency: p50, p95, p99. Never use averages; they hide tail latency.
  • Throughput: requests per second at target latency.
  • Error rate: must remain 0 % (or SLO-defined threshold) under load.
  • Resource utilization: CPU, memory, open file descriptors, connection pool exhaustion.

Tools: k6, Gatling, wrk.

Libraries

Use language-native micro-benchmark frameworks:

| Language | Framework |
|---|---|
| Go | go test -bench + benchstat for statistical comparison |
| Java | JMH (Java Microbenchmark Harness) |
| C++ | Google Benchmark |
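As a minimal sketch of the Go row, a benchmark is a function that takes a `*testing.B` and runs the code under test `b.N` times. The function under test (`joinWithBuilder`) is a hypothetical placeholder. In a real project the benchmark would normally live in a `_test.go` file and be run with `go test -bench=. -count=10`, with the output of two revisions compared via benchstat; here `testing.Benchmark` is called from `main` only so the example is self-contained.

```go
package main

import (
	"fmt"
	"strings"
	"testing"
)

// sink keeps the result alive so the compiler cannot discard the call.
var sink string

// joinWithBuilder is a hypothetical function under test.
func joinWithBuilder(parts []string) string {
	var sb strings.Builder
	for _, p := range parts {
		sb.WriteString(p)
	}
	return sb.String()
}

// BenchmarkJoinWithBuilder has the shape expected by `go test -bench`.
func BenchmarkJoinWithBuilder(b *testing.B) {
	parts := []string{"alpha", "beta", "gamma", "delta"}
	b.ReportAllocs() // also report allocations per operation
	b.ResetTimer()   // exclude setup from the measurement
	for i := 0; i < b.N; i++ {
		sink = joinWithBuilder(parts)
	}
}

func main() {
	// testing.Benchmark runs a benchmark function outside of `go test`.
	res := testing.Benchmark(BenchmarkJoinWithBuilder)
	fmt.Println(res.String(), res.MemString())
}
```

The absolute numbers printed depend on the machine; as discussed under Benchmark Environments, only relative comparisons on the same hardware are meaningful.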

Benchmark Environments

CI runners (especially cloud-hosted) have variable CPU frequency, noisy neighbors, and non-reproducible hardware, so absolute numbers measured on them are not comparable across runs. Which environment is appropriate depends on the goal:

| Goal | Environment |
|---|---|
| Regression detection | CI is sufficient: compare relative change against baseline on the same runner type using statistical tools (benchstat, JMH comparison mode). Fail the build on regressions above a threshold (e.g., 10 %). |
| Absolute targets ("p99 < 200 ms") | Dedicated environment: self-hosted runner or bare-metal machine with pinned CPU frequency (disable turbo boost and frequency scaling), no other load, reproducible across runs. |
| Release sign-off / profiling | Dedicated environment only. |

A self-hosted CI runner with pinned hardware is a pragmatic middle ground: regression detection and meaningful absolute numbers in one place.
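The regression gate for CI can be sketched as a relative-change check against a stored baseline. This is a deliberately simplified illustration: the function name `checkRegression`, the ns/op values, and the 10 % threshold are assumptions, and the check compares two point estimates only. A statistical tool such as benchstat additionally accounts for variance across repeated runs and is preferable in practice.

```go
package main

import "fmt"

// checkRegression reports whether current is slower than baseline by more
// than threshold (0.10 means 10 %). It also returns the relative change,
// where a positive value means the current run is slower.
func checkRegression(baselineNs, currentNs, threshold float64) (bool, float64, error) {
	if baselineNs <= 0 {
		return false, 0, fmt.Errorf("baseline must be positive, got %v", baselineNs)
	}
	change := (currentNs - baselineNs) / baselineNs
	return change > threshold, change, nil
}

func main() {
	// Hypothetical ns/op values taken from a baseline run and the current run.
	regressed, change, err := checkRegression(1200, 1350, 0.10)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("change=%+.1f%% regression=%t\n", change*100, regressed)
	// A CI job would exit with a non-zero status here when regressed is true.
}
```

Both measurements should come from the same runner type; otherwise the relative change reflects hardware differences rather than code changes.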

CLI Tools

time ./mytool input.dat is sufficient for most cases. For batch-processing tools, include a benchmark with realistic input sizes in the test suite.

Profiling vs. Performance Testing

| Aspect | Performance Testing | Profiling |
|---|---|---|
| Question | "Does the system meet its targets?" | "Where does the time/memory go?" |
| Scope | End-to-end, external view | Internal, code-level view |
| When | Continuously in CI (benchmarks); before releases (load tests) | On-demand after a test reveals a problem |
| Output | Latency, throughput, error rates | Flame graphs, allocation profiles, call trees |

Tools: pprof (Go), async-profiler / JFR (Java), perf / Valgrind / Heaptrack (C++).

Performance Observability in Production

Prefer instrumentation over logging for performance data:

  1. Distributed tracing (preferred): instrument with OpenTelemetry; export to Jaeger or a compatible backend. Traces reveal latency at span level without ad-hoc timing code.
       • Spring Boot: spring-boot-starter-actuator + Micrometer Tracing.
       • Go: go.opentelemetry.io/otel.
  2. Metrics: expose RED metrics (Rate, Errors, Duration) via Prometheus. Instrument at the framework/middleware level, not inline in business logic.
  3. Performance logging as last resort: log duration of individual operations only when tracing is unavailable and a specific bottleneck is under investigation. Remove or gate behind a debug flag afterward.

Do not scatter System.currentTimeMillis() / time.Now() throughout business logic. Use interceptors, middleware, or AOP.

Common Performance Bottlenecks

Address these before writing custom optimizations:

  • N+1 queries: fetch related data in bulk; use JOIN or batch loading.
  • Missing database indexes: profile slow queries (EXPLAIN ANALYZE); index frequent filter and sort columns.
  • Serialization overhead: benchmark JSON vs. binary formats (Protobuf, Avro) when throughput is critical.
  • Chatty APIs: replace multiple fine-grained calls with one coarse-grained or batch call.
  • Lock contention: prefer immutability, lock-free structures, or message passing over shared mutable state.
  • Synchronous blocking in async contexts: never block an event-loop or virtual-thread carrier thread.
  • Memory allocation / GC pressure: minimize short-lived allocations in hot paths; pool expensive objects.
  • Thread/connection pool exhaustion: size pools using Little's Law; monitor queue depth, not just pool size.
  • Missing caching: cache expensive, read-heavy, low-churn results at the right layer (in-process → distributed).

Directives

  • Measure before optimizing; never guess bottlenecks
  • Use OpenTelemetry and Prometheus for production performance data — do not scatter timing code in business logic
  • Collect p50/p95/p99 latency — never rely on averages
  • Load tests mandatory for services; benchmarks mandatory for libraries and hot-path service code
  • Address common bottlenecks first: N+1 queries, missing DB indexes, chatty APIs, lock contention, GC pressure
  • Use dedicated environments for absolute latency targets; CI is sufficient for regression detection