DevOps & Deployment

WordPress Observability with OpenTelemetry: Distributed Tracing, Structured Logging, and Metrics for Production Sites

Tom Bradley
58 min read

WordPress powers more than 40% of the web, yet most WordPress deployments operate with minimal visibility into what actually happens during request processing. Developers rely on error logs, crude timing measurements, and guesswork when diagnosing production issues. Meanwhile, the broader software engineering world has embraced observability as a core operational practice built on three pillars: distributed tracing, structured logging, and metrics collection.

This article walks through a complete observability setup for WordPress using OpenTelemetry, Monolog, and Grafana’s open-source stack. Every code sample is production-tested. By the end, you will have instrumented your WordPress application with distributed traces that follow requests from the initial HTTP hit through template resolution, database queries, and hook execution, paired with structured JSON logs and custom metrics that feed into dashboards and alerting rules.

Why WordPress Needs Observability

A typical WordPress request touches dozens of subsystems. The request enters through index.php, loads the WordPress bootstrap, fires initialization hooks, resolves rewrite rules, runs the main query, selects a template, executes template hooks, renders output through a chain of nested template parts, and finally sends the response. Along the way, plugins inject behavior at nearly every stage. A single page load might execute 200+ database queries, fire 500+ action and filter hooks, and call external APIs for analytics, CDN purging, or payment processing.

When something goes wrong or slows down, the default WordPress debugging tools offer almost nothing. WP_DEBUG_LOG produces an unstructured text file with no context about which request generated each entry. Query Monitor is excellent for development but cannot run in production without significant overhead. New Relic and similar APM tools provide some visibility but treat WordPress as a black box, missing the hook-driven architecture that actually determines performance characteristics.

OpenTelemetry changes this equation. It provides a vendor-neutral instrumentation framework that can capture exactly the data you need, at exactly the granularity you choose, with configurable sampling to control overhead. Combined with structured logging through PSR-3 and Monolog, you get a complete picture of every request your WordPress site processes.

Instrumenting WordPress with the OpenTelemetry PHP SDK

OpenTelemetry provides both automatic and manual instrumentation for PHP applications. Automatic instrumentation hooks into common libraries and frameworks without code changes. Manual instrumentation gives you precise control over what gets traced. For WordPress, you will use both approaches.

Installing the SDK

Start by requiring the OpenTelemetry packages through Composer. You need the API, SDK, and at least one exporter. The OTLP exporter sends data to any OpenTelemetry-compatible backend.

composer require \
  open-telemetry/sdk \
  open-telemetry/exporter-otlp \
  open-telemetry/transport-grpc \
  php-http/guzzle7-adapter \
  google/protobuf

The protobuf extension dramatically improves serialization performance. If you can install PHP extensions in your environment, add the C extension instead of the pure PHP polyfill:

pecl install protobuf
echo "extension=protobuf.so" >> /usr/local/etc/php/conf.d/protobuf.ini

For automatic instrumentation of PDO, cURL, and other PHP internals, install the OpenTelemetry PHP extension:

pecl install opentelemetry
echo "extension=opentelemetry.so" >> /usr/local/etc/php/conf.d/opentelemetry.ini

Bootstrapping the Tracer

Create a dedicated file for OpenTelemetry initialization. This file must load before WordPress processes any request, so place it in your must-use plugins directory.

<?php
/**
 * Plugin Name: WP OpenTelemetry Instrumentation
 * Description: Distributed tracing and metrics for WordPress
 * Version: 1.0.0
 */

use OpenTelemetry\API\Globals;
use OpenTelemetry\API\Trace\Propagation\TraceContextPropagator;
use OpenTelemetry\SDK\Trace\TracerProviderBuilder;
use OpenTelemetry\SDK\Trace\SpanProcessor\BatchSpanProcessor;
use OpenTelemetry\SDK\Trace\Sampler\ParentBased;
use OpenTelemetry\SDK\Trace\Sampler\TraceIdRatioBasedSampler;
use OpenTelemetry\SDK\Resource\ResourceInfoFactory;
use OpenTelemetry\SDK\Resource\ResourceInfo;
use OpenTelemetry\SDK\Common\Attribute\Attributes;
use OpenTelemetry\SemConv\ResourceAttributes;
use OpenTelemetry\Contrib\Otlp\SpanExporter;
use OpenTelemetry\Contrib\Otlp\OtlpUtil;
use OpenTelemetry\API\Signals;

// Only initialize once
if (defined('WP_OTEL_INITIALIZED')) {
    return;
}
define('WP_OTEL_INITIALIZED', true);

// Load Composer autoloader
require_once ABSPATH . 'vendor/autoload.php';

// Define the service resource
$resource = ResourceInfoFactory::defaultResource()->merge(
    ResourceInfo::create(Attributes::create([
        ResourceAttributes::SERVICE_NAME => 'wordpress-production',
        ResourceAttributes::SERVICE_VERSION => get_bloginfo('version'),
        ResourceAttributes::DEPLOYMENT_ENVIRONMENT => wp_get_environment_type(),
        'wordpress.site_url' => get_site_url(),
    ]))
);

// Configure the OTLP exporter
$transport = (new \OpenTelemetry\Contrib\Otlp\OtlpHttpTransportFactory())
    ->create(
        // OTLP over HTTP expects the signal path on the endpoint itself;
        // OtlpUtil::method() applies only to the gRPC transport
        (getenv('OTEL_EXPORTER_OTLP_ENDPOINT') ?: 'http://otel-collector:4318') . '/v1/traces',
        'application/x-protobuf'
    );

$exporter = new SpanExporter($transport);

// Use a parent-based sampler: 10% of new traces, always follow parent decision
$sampler = new ParentBased(
    new TraceIdRatioBasedSampler(
        (float)(getenv('OTEL_TRACES_SAMPLER_ARG') ?: 0.1)
    )
);

// Build the tracer provider with batch processing
$tracerProvider = (new TracerProviderBuilder())
    ->setResource($resource)
    ->addSpanProcessor(
        new BatchSpanProcessor(
            $exporter,
            \OpenTelemetry\SDK\Common\Time\ClockFactory::getDefault(),
            2048,   // max queue size
            5000,   // schedule delay ms
            30000,  // export timeout ms
            512     // max batch size
        )
    )
    ->setSampler($sampler)
    ->build();

// Register globally so Globals::tracerProvider() returns this provider.
// The initializer receives a Configurator and must return it.
Globals::registerInitializer(function ($configurator) use ($tracerProvider) {
    return $configurator->withTracerProvider($tracerProvider);
});

// Ensure spans flush on shutdown
register_shutdown_function(function() use ($tracerProvider) {
    $tracerProvider->shutdown();
});

Notice the BatchSpanProcessor configuration. Unlike the SimpleSpanProcessor that exports each span immediately, the batch processor queues spans and exports them in bulk. This is critical for production use because it decouples the trace export from the request lifecycle, preventing the export latency from affecting response times.

The sampler deserves attention too. A 10% sampling rate means only one in ten new traces gets recorded. The ParentBased wrapper ensures that if an incoming request already carries a trace context header (from a load balancer or upstream service), the sampling decision from that parent trace is respected. This prevents partial traces where the frontend is sampled but the backend is not.
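The parent decision travels in the W3C traceparent header. TraceContextPropagator::getInstance()->extract() (imported in the bootstrap above) handles the parsing for you; the hand-rolled parser below is purely illustrative, to show what the header actually carries:

```php
<?php
// Illustration only: parse a W3C traceparent header by hand to show the
// fields that TraceContextPropagator extracts for the ParentBased sampler.
// Format: version "-" trace-id "-" parent-span-id "-" trace-flags
function parse_traceparent(string $header): ?array
{
    if (!preg_match(
        '/^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/',
        $header,
        $m
    )) {
        return null; // malformed: the sampler falls back to a new root decision
    }

    return [
        'version'  => $m[1],
        'trace_id' => $m[2],
        'span_id'  => $m[3],
        // Bit 0 of trace-flags is the "sampled" flag that ParentBased honors
        'sampled'  => (hexdec($m[4]) & 0x01) === 1,
    ];
}

$ctx = parse_traceparent(
    '00-a1b2c3d4e5f6a7b8c9d0e1f2a3b4c5d6-1a2b3c4d5e6f7a8b-01'
);
// 'sampled' is true here, so this request's spans are recorded regardless
// of the 10% ratio.
```

If an upstream proxy strips or rewrites this header, parent-based sampling silently degrades to ratio-based sampling, so verify the header survives your load balancer.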

Creating Manual Spans

With the tracer provider registered, you can create spans anywhere in your WordPress code. Each span represents a unit of work with a start time, end time, attributes, and optional events.

use OpenTelemetry\API\Globals;
use OpenTelemetry\API\Trace\StatusCode;

function wp_otel_trace_function(string $name, callable $fn, array $attributes = []) {
    $tracer = Globals::tracerProvider()->getTracer('wordpress-app');
    $span = $tracer->spanBuilder($name)
        ->setAttributes($attributes)
        ->startSpan();

    $scope = $span->activate();

    try {
        $result = $fn();
        $span->setStatus(StatusCode::STATUS_OK);
        return $result;
    } catch (\Throwable $e) {
        $span->setStatus(StatusCode::STATUS_ERROR, $e->getMessage());
        $span->recordException($e);
        throw $e;
    } finally {
        $scope->detach();
        $span->end();
    }
}

This utility function wraps any callable in a traced span. Exceptions are automatically recorded with full stack traces. The activate() call sets the span as the current active span, so any child spans created within the callable will automatically become children in the trace hierarchy.

Tracing the WordPress Request Lifecycle

WordPress processes requests through a well-defined sequence of hooks. By attaching trace instrumentation at each major phase, you build a complete picture of the request lifecycle. The following code creates spans for each phase, nested under a root span that represents the entire request.

<?php
/**
 * WordPress Request Lifecycle Tracing
 *
 * Attaches spans to major WordPress lifecycle hooks to produce
 * a hierarchical trace of each request.
 */

use OpenTelemetry\API\Globals;
use OpenTelemetry\API\Trace\SpanKind;
use OpenTelemetry\API\Trace\StatusCode;

class WP_Request_Tracer {

    private $tracer;
    private $rootSpan;
    private $rootScope;
    private $currentSpan;
    private $currentScope;
    private $requestStart;

    public function __construct() {
        $this->tracer = Globals::tracerProvider()->getTracer(
            'wordpress-request-lifecycle',
            '1.0.0'
        );
        $this->requestStart = $_SERVER['REQUEST_TIME_FLOAT'] ?? microtime(true);
    }

    public function register(): void {
        // Create the root span for the entire request
        $this->rootSpan = $this->tracer->spanBuilder('http.request')
            ->setSpanKind(SpanKind::KIND_SERVER)
            ->setStartTimestamp((int)($this->requestStart * 1_000_000_000))
            ->setAttributes([
                'http.method' => $_SERVER['REQUEST_METHOD'] ?? 'GET',
                'http.url' => home_url($_SERVER['REQUEST_URI'] ?? '/'),
                'http.user_agent' => $_SERVER['HTTP_USER_AGENT'] ?? '',
                'http.client_ip' => $_SERVER['REMOTE_ADDR'] ?? '',
            ])
            ->startSpan();

        $this->rootScope = $this->rootSpan->activate();

        // Hook into lifecycle phases (priority 1 to run early)
        add_action('plugins_loaded', [$this, 'onPluginsLoaded'], 1);
        add_action('init', [$this, 'onInit'], 1);
        add_action('parse_request', [$this, 'onParseRequest'], 1);
        add_action('template_redirect', [$this, 'onTemplateRedirect'], 1);
        add_action('wp_head', [$this, 'onWpHead'], 1);
        add_action('wp_footer', [$this, 'onWpFooter'], 1);
        add_action('shutdown', [$this, 'onShutdown'], 9999);
    }

    public function onPluginsLoaded(): void {
        $this->startPhaseSpan('wordpress.plugins_loaded');
    }

    public function onInit(): void {
        $this->endPhaseSpan();
        $this->startPhaseSpan('wordpress.init');
    }

    public function onParseRequest(): void {
        $this->endPhaseSpan();
        $this->startPhaseSpan('wordpress.parse_request');
    }

    public function onTemplateRedirect(): void {
        $this->endPhaseSpan();

        // Record which template was selected
        $template = get_page_template_slug() ?: 'default';
        $this->rootSpan->setAttribute('wordpress.template', $template);
        $this->rootSpan->setAttribute('wordpress.query_type', $this->getQueryType());

        $this->startPhaseSpan('wordpress.template_render', [
            'wordpress.template' => $template,
        ]);
    }

    public function onWpHead(): void {
        $this->endPhaseSpan();
        $this->startPhaseSpan('wordpress.wp_head');
    }

    public function onWpFooter(): void {
        $this->endPhaseSpan();
        $this->startPhaseSpan('wordpress.wp_footer');
    }

    public function onShutdown(): void {
        $this->endPhaseSpan();

        $statusCode = http_response_code() ?: 200;
        $this->rootSpan->setAttribute('http.status_code', $statusCode);

        if ($statusCode >= 500) {
            $this->rootSpan->setStatus(StatusCode::STATUS_ERROR);
        } else {
            $this->rootSpan->setStatus(StatusCode::STATUS_OK);
        }

        // Record total memory usage
        $this->rootSpan->setAttribute(
            'wordpress.peak_memory_mb',
            round(memory_get_peak_usage(true) / 1048576, 2)
        );

        $this->rootScope->detach();
        $this->rootSpan->end();
    }

    private function startPhaseSpan(string $name, array $attributes = []): void {
        $this->currentSpan = $this->tracer->spanBuilder($name)
            ->setAttributes($attributes)
            ->startSpan();
        $this->currentScope = $this->currentSpan->activate();
    }

    private function endPhaseSpan(): void {
        if ($this->currentScope) {
            $this->currentScope->detach();
            $this->currentSpan->end();
            $this->currentScope = null;
            $this->currentSpan = null;
        }
    }

    private function getQueryType(): string {
        if (is_singular()) return 'singular';
        if (is_archive()) return 'archive';
        if (is_search()) return 'search';
        if (is_front_page()) return 'front_page';
        if (is_404()) return '404';
        return 'unknown';
    }
}

// Initialize the request tracer
$requestTracer = new WP_Request_Tracer();
$requestTracer->register();

When you view the resulting trace in Grafana Tempo or Jaeger, you see a waterfall visualization showing exactly how time was distributed across the request. The root http.request span contains child spans for each lifecycle phase. If the wordpress.template_render span takes 400ms while wordpress.init takes only 20ms, you immediately know where to focus optimization efforts.

Tracing Database Queries

Database queries are typically the primary source of latency in WordPress. The wpdb class provides hooks for query interception that work well with tracing.

class WP_Database_Tracer {

    private $tracer;
    private $activeSpans = [];
    private $queryCount = 0;

    public function __construct() {
        $this->tracer = Globals::tracerProvider()->getTracer('wordpress-database');
    }

    public function register(): void {
        // Requires SAVEQUERIES or a custom wpdb wrapper
        add_filter('query', [$this, 'onQueryStart'], 1);
        add_filter('log_query_custom_data', [$this, 'onQueryEnd'], 1, 5);
    }

    public function onQueryStart(string $query): string {
        $this->queryCount++;

        $span = $this->tracer->spanBuilder('db.query')
            ->setAttributes([
                'db.system' => 'mysql',
                'db.statement' => $this->sanitizeQuery($query),
                'db.operation' => $this->extractOperation($query),
                'wordpress.query_number' => $this->queryCount,
            ])
            ->startSpan();

        $this->activeSpans[$this->queryCount] = [
            'span' => $span,
            'scope' => $span->activate(),
        ];

        return $query;
    }

    public function onQueryEnd($data, $query, $elapsed, $caller, $start) {
        $active = array_pop($this->activeSpans);
        if ($active) {
            $active['span']->setAttributes([
                'db.execution_time_ms' => round($elapsed * 1000, 2),
                'db.caller' => $caller,
            ]);

            // Flag slow queries (over 50ms)
            if ($elapsed > 0.05) {
                $active['span']->addEvent('slow_query', [
                    'db.execution_time_ms' => round($elapsed * 1000, 2),
                    'db.threshold_ms' => 50,
                ]);
            }

            $active['scope']->detach();
            $active['span']->end();
        }
        return $data;
    }

    private function sanitizeQuery(string $query): string {
        // Remove literal values to prevent sensitive data in traces
        $sanitized = preg_replace("/= '[^']*'/", "= '?'", $query);
        $sanitized = preg_replace("/= \"[^\"]*\"/", '= "?"', $sanitized);
        $sanitized = preg_replace('/= \d+/', '= ?', $sanitized);
        return $sanitized;
    }

    private function extractOperation(string $query): string {
        $query = ltrim($query);
        if (stripos($query, 'SELECT') === 0) return 'SELECT';
        if (stripos($query, 'INSERT') === 0) return 'INSERT';
        if (stripos($query, 'UPDATE') === 0) return 'UPDATE';
        if (stripos($query, 'DELETE') === 0) return 'DELETE';
        if (stripos($query, 'SHOW') === 0) return 'SHOW';
        return 'OTHER';
    }
}

$dbTracer = new WP_Database_Tracer();
$dbTracer->register();

The query sanitization step is important. You must strip literal values from SQL statements before recording them in traces, because those values may contain user data, passwords, or other sensitive information. The sanitization replaces string and numeric literals with placeholder markers while preserving the query structure so you can still identify which query pattern caused a slowdown.
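A quick sanity check of what those patterns produce (the standalone function below duplicates the sanitizeQuery logic so you can run it in isolation):

```php
<?php
// Standalone copy of the sanitizeQuery() regexes above, for inspection.
function sanitize_query_for_trace(string $query): string
{
    $sanitized = preg_replace("/= '[^']*'/", "= '?'", $query);
    $sanitized = preg_replace("/= \"[^\"]*\"/", '= "?"', $sanitized);
    return preg_replace('/= \d+/', '= ?', $sanitized);
}

$raw = "SELECT * FROM wp_users WHERE user_email = 'alice@example.com' AND ID = 42";
echo sanitize_query_for_trace($raw), "\n";
// SELECT * FROM wp_users WHERE user_email = '?' AND ID = ?
```

Be aware that these patterns only cover simple `=` comparisons: literals inside IN (...) lists, LIKE patterns, and INSERT ... VALUES clauses pass through untouched, so extend the rules before recording queries that may carry regulated data.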

Note the use of log_query_custom_data, which requires SAVEQUERIES to be enabled. In production, running with SAVEQUERIES adds overhead because WordPress stores every query in memory. An alternative approach wraps the $wpdb instance with a custom class that intercepts the query() method directly. This avoids the memory overhead of storing all queries while still capturing timing data.

Custom wpdb Wrapper for Production

class Traced_wpdb extends wpdb {

    private $tracer;

    public function __construct($dbuser, $dbpassword, $dbname, $dbhost) {
        parent::__construct($dbuser, $dbpassword, $dbname, $dbhost);
        $this->tracer = Globals::tracerProvider()->getTracer('wordpress-database');
    }

    public function query($query) {
        $span = $this->tracer->spanBuilder('db.query')
            ->setAttributes([
                'db.system' => 'mysql',
                'db.statement' => $this->sanitize_query_for_trace($query),
                'db.operation' => $this->extract_operation($query),
            ])
            ->startSpan();

        $scope = $span->activate();
        $start = microtime(true);

        try {
            $result = parent::query($query);
            $elapsed = microtime(true) - $start;

            $span->setAttribute('db.execution_time_ms', round($elapsed * 1000, 2));
            $span->setAttribute('db.rows_affected', $this->rows_affected);

            if ($elapsed > 0.05) {
                $span->addEvent('slow_query', [
                    'threshold_ms' => 50,
                    'actual_ms' => round($elapsed * 1000, 2),
                ]);
            }

            return $result;
        } catch (\Throwable $e) {
            $span->setStatus(StatusCode::STATUS_ERROR, $e->getMessage());
            $span->recordException($e);
            throw $e;
        } finally {
            $scope->detach();
            $span->end();
        }
    }

    private function sanitize_query_for_trace(string $query): string {
        $sanitized = preg_replace("/= '[^']*'/", "= '?'", $query);
        $sanitized = preg_replace("/= \"[^\"]*\"/", '= "?"', $sanitized);
        return preg_replace('/= \d+/', '= ?', $sanitized);
    }

    private function extract_operation(string $query): string {
        $trimmed = ltrim($query);
        $first_word = strtoupper((string) strtok($trimmed, " \t\n\r"));
        return in_array($first_word, ['SELECT','INSERT','UPDATE','DELETE','SHOW','ALTER','CREATE','DROP'])
            ? $first_word
            : 'OTHER';
    }
}

To use this wrapper, replace the global $wpdb object early in the WordPress bootstrap. A must-use plugin loaded before other plugins works (a wp-content/db.php drop-in is the other common mechanism). Note that wp-settings.php applies the table prefix to the original wpdb instance before mu-plugins load, so the replacement needs it re-applied:

// In mu-plugins/traced-db.php
$GLOBALS['wpdb'] = new Traced_wpdb(DB_USER, DB_PASSWORD, DB_NAME, DB_HOST);
$GLOBALS['wpdb']->set_prefix($GLOBALS['table_prefix']);

PSR-3 Structured Logging with Monolog and Wonolog

WordPress’s built-in logging is limited to error_log() calls and the debug.log file. This produces unstructured text with no machine-parseable format, no severity levels beyond what PHP itself provides, and no contextual metadata. Structured logging replaces this with JSON-formatted log entries that carry rich context, making logs searchable, filterable, and correlatable with traces.

Setting Up Monolog

composer require monolog/monolog inpsyde/wonolog:^2.0

Wonolog bridges WordPress and Monolog, automatically capturing WordPress errors, deprecated function notices, and doing_it_wrong calls as structured log entries. Configure it in a must-use plugin:

<?php
/**
 * Plugin Name: WP Structured Logging
 */

use Monolog\Logger;
use Monolog\Handler\StreamHandler;
use Monolog\Handler\RotatingFileHandler;
use Monolog\Formatter\JsonFormatter;
use Monolog\Processor\IntrospectionProcessor;
use Monolog\Processor\MemoryUsageProcessor;
use Monolog\Processor\WebProcessor;
use Inpsyde\Wonolog;

// Create a JSON-formatted handler for machine-readable logs
$jsonHandler = new RotatingFileHandler(
    WP_CONTENT_DIR . '/logs/wordpress.json.log',
    14,  // keep 14 days of logs
    Logger::DEBUG
);
$jsonHandler->setFormatter(new JsonFormatter());

// Create a human-readable handler for direct inspection
$readableHandler = new RotatingFileHandler(
    WP_CONTENT_DIR . '/logs/wordpress.log',
    7,
    Logger::WARNING
);

// Build the logger
$logger = new Logger('wordpress');
$logger->pushHandler($jsonHandler);
$logger->pushHandler($readableHandler);

// Add processors for automatic context enrichment
$logger->pushProcessor(new WebProcessor());
$logger->pushProcessor(new MemoryUsageProcessor());
$logger->pushProcessor(new IntrospectionProcessor(Logger::WARNING));

// Add a custom processor that injects the current trace ID
$logger->pushProcessor(function (array $record) {
    $spanContext = \OpenTelemetry\API\Trace\Span::getCurrent()->getContext();

    // Skip the all-zero IDs reported when no span is active
    if ($spanContext->isValid()) {
        $record['extra']['trace_id'] = $spanContext->getTraceId();
        $record['extra']['span_id'] = $spanContext->getSpanId();
        $record['extra']['trace_flags'] = $spanContext->getTraceFlags();
    }

    return $record;
});

// Initialize Wonolog with our configured logger. Note: bootstrap() is the
// Wonolog 1.x entry point; Wonolog 2.x configures itself through the
// wonolog.setup action and a Configurator instead, so match this call to
// the version you installed.
Wonolog\bootstrap($logger);

The custom processor that injects trace_id and span_id into every log entry is where logging and tracing converge. When you find a suspicious log entry in Grafana Loki, you can copy the trace_id and jump directly to the corresponding trace in Tempo. This correlation between logs and traces is one of the most powerful features of a unified observability stack.

Custom Log Context for WordPress Operations

Beyond the automatic WordPress error capture that Wonolog provides, add targeted logging for operations you care about. Create a thin wrapper that makes the Monolog logger accessible throughout your WordPress code:

class WP_Structured_Logger {

    private static ?Logger $instance = null;

    public static function get(): Logger {
        if (self::$instance === null) {
            throw new \RuntimeException('Logger not initialized');
        }
        return self::$instance;
    }

    public static function set(Logger $logger): void {
        self::$instance = $logger;
    }

    public static function info(string $message, array $context = []): void {
        self::get()->info($message, self::enrichContext($context));
    }

    public static function warning(string $message, array $context = []): void {
        self::get()->warning($message, self::enrichContext($context));
    }

    public static function error(string $message, array $context = []): void {
        self::get()->error($message, self::enrichContext($context));
    }

    private static function enrichContext(array $context): array {
        // Add WordPress-specific context automatically
        $context['wp_user_id'] = get_current_user_id();
        $context['wp_request_uri'] = $_SERVER['REQUEST_URI'] ?? '';
        $context['wp_is_admin'] = is_admin();
        $context['wp_doing_ajax'] = wp_doing_ajax();
        $context['wp_doing_cron'] = wp_doing_cron();
        return $context;
    }
}

// Set the instance after logger creation
WP_Structured_Logger::set($logger);

Now you can add structured logging throughout your WordPress application:

// Log slow WP_Query executions
add_filter('posts_results', function($posts, $query) {
    if (isset($query->query_vars['_query_start'])) {
        $elapsed = microtime(true) - $query->query_vars['_query_start'];
        if ($elapsed > 0.5) {
            WP_Structured_Logger::warning('Slow WP_Query detected', [
                'query_vars' => $query->query_vars,
                'execution_time_s' => round($elapsed, 4),
                'post_count' => count($posts),
                'sql' => $query->request,
            ]);
        }
    }
    return $posts;
}, 10, 2);

add_action('pre_get_posts', function($query) {
    $query->set('_query_start', microtime(true));
});

JSON Log Output Format

With the JsonFormatter, each log entry produces a single JSON line that looks like this:

{
  "message": "Slow WP_Query detected",
  "context": {
    "query_vars": {"post_type": "product", "posts_per_page": 100, "meta_key": "price"},
    "execution_time_s": 1.2341,
    "post_count": 87,
    "sql": "SELECT wp_posts.* FROM wp_posts INNER JOIN wp_postmeta ...",
    "wp_user_id": 0,
    "wp_request_uri": "/shop/",
    "wp_is_admin": false,
    "wp_doing_ajax": false,
    "wp_doing_cron": false
  },
  "level": 300,
  "level_name": "WARNING",
  "channel": "wordpress",
  "datetime": "2022-04-18T14:23:17.445612+00:00",
  "extra": {
    "url": "/shop/",
    "ip": "192.168.1.45",
    "http_method": "GET",
    "server": "example.com",
    "referrer": "https://example.com/",
    "memory_usage": "48 MB",
    "trace_id": "a1b2c3d4e5f6a7b8c9d0e1f2a3b4c5d6",
    "span_id": "1a2b3c4d5e6f7a8b",
    "trace_flags": 1
  }
}

This format is directly ingestible by Loki, Elasticsearch, Datadog, and virtually any modern log aggregation system. The trace_id field links this log entry to the distributed trace, enabling you to jump from a log line to the full request waterfall.
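Because each entry is a single JSON object per line, any consumer can recover the correlation fields with one decode step. A minimal PHP sketch ($line stands in for a shortened sample entry):

```php
<?php
// One log entry per line means correlation data is one json_decode away.
// $line stands in for a line read from wordpress.json.log.
$line = '{"message":"Slow WP_Query detected","level_name":"WARNING",'
      . '"extra":{"trace_id":"a1b2c3d4e5f6a7b8c9d0e1f2a3b4c5d6",'
      . '"span_id":"1a2b3c4d5e6f7a8b"}}';

$entry   = json_decode($line, true);
$traceId = $entry['extra']['trace_id'] ?? null;

echo "{$entry['level_name']}: {$entry['message']} (trace {$traceId})\n";
// WARNING: Slow WP_Query detected (trace a1b2c3d4e5f6a7b8c9d0e1f2a3b4c5d6)
```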

Exporting Traces and Logs to Grafana and Datadog

The OpenTelemetry Collector acts as a central hub that receives telemetry data and routes it to one or more backends. This decouples your application from the specific observability vendor, allowing you to switch backends or send data to multiple systems simultaneously.

OpenTelemetry Collector Configuration

# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    send_batch_size: 1024
    timeout: 5s
    send_batch_max_size: 2048

  memory_limiter:
    check_interval: 1s
    limit_mib: 512
    spike_limit_mib: 128

  attributes:
    actions:
      - key: environment
        value: production
        action: upsert
      - key: service.team
        value: wordpress
        action: upsert

  # http.url carries the full URL, so match the cron path with a regexp
  # (a strict match against the bare path would never fire)
  filter:
    spans:
      exclude:
        match_type: regexp
        attributes:
          - key: http.url
            value: ".*/wp-cron\\.php$"

exporters:
  otlp/tempo:
    endpoint: tempo:4317
    tls:
      insecure: true

  loki:
    endpoint: http://loki:3100/loki/api/v1/push
    labels:
      resource:
        service.name: "service_name"
        deployment.environment: "environment"
      attributes:
        level: ""
        wordpress.query_type: ""

  datadog:
    api:
      key: ${DD_API_KEY}
      site: datadoghq.com

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, attributes, filter, batch]
      exporters: [otlp/tempo]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [loki]

The filter processor excludes traces for wp-cron.php requests, which are typically high-volume and low-value for debugging purposes. The memory_limiter processor prevents the collector from consuming excessive memory during traffic spikes. The batch processor groups spans for efficient network transmission.

If you use Datadog instead of Grafana, add the datadog exporter defined above to the exporters list of each pipeline. The OpenTelemetry Collector supports sending the same data to multiple backends simultaneously, which is useful during migration periods.

Sending Logs to Loki via Promtail

While the OpenTelemetry Collector can forward logs to Loki, many teams prefer using Promtail to tail JSON log files directly. This approach works well with the RotatingFileHandler we configured earlier.

# promtail-config.yaml
server:
  http_listen_port: 9080

positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  - job_name: wordpress
    static_configs:
      - targets:
          - localhost
        labels:
          job: wordpress
          environment: production
          __path__: /var/www/html/wp-content/logs/wordpress.json.log*

    pipeline_stages:
      - json:
          expressions:
            level: level_name
            message: message
            trace_id: extra.trace_id
            channel: channel

      - labels:
          level:
          channel:

      - structured_metadata:
          trace_id:

The pipeline stages extract the log level and trace ID from the JSON structure, adding them as labels and structured metadata. In Grafana, you can filter logs by level="ERROR" and then click through to the associated trace.

Custom Metrics: WP_Query, Hook Duration, and Cache Hit Ratios

Metrics provide aggregate views of system behavior over time. While traces show you individual requests, metrics reveal trends, patterns, and anomalies across thousands of requests. OpenTelemetry's metrics API offers several instrument types; the ones used here are counters, gauges, and histograms.

Setting Up the Meter Provider

use OpenTelemetry\SDK\Metrics\MeterProviderBuilder;
use OpenTelemetry\SDK\Metrics\MetricReader\ExportingReader;
use OpenTelemetry\Contrib\Otlp\MetricExporter;

$metricTransport = (new \OpenTelemetry\Contrib\Otlp\OtlpHttpTransportFactory())
    ->create(
        // As with traces, OTLP/HTTP wants the signal path on the endpoint
        (getenv('OTEL_EXPORTER_OTLP_ENDPOINT') ?: 'http://otel-collector:4318') . '/v1/metrics',
        'application/x-protobuf'
    );

$metricExporter = new MetricExporter($metricTransport);

$meterProvider = (new MeterProviderBuilder())
    ->setResource($resource)  // reuse the resource from tracer setup
    ->addReader(new ExportingReader($metricExporter))
    ->build();

// Flush accumulated metrics when the request ends
register_shutdown_function(function () use ($meterProvider) {
    $meterProvider->shutdown();
});

// Register for global access
$meter = $meterProvider->getMeter('wordpress-metrics', '1.0.0');

WP_Query Execution Time Histogram

// Create a histogram for query execution times
$queryHistogram = $meter->createHistogram(
    'wordpress.wp_query.duration',
    'ms',
    'Time spent executing WP_Query instances'
);

add_action('pre_get_posts', function($query) {
    $query->set('_otel_start', microtime(true));
});

add_filter('posts_results', function($posts, $query) use ($queryHistogram) {
    $start = $query->get('_otel_start');
    if ($start) {
        $duration = (microtime(true) - $start) * 1000;

        $queryHistogram->record($duration, [
            'post_type' => $query->get('post_type') ?: 'post',
            'is_main_query' => $query->is_main_query() ? 'true' : 'false',
            'posts_per_page' => (string)$query->get('posts_per_page'),
            'has_meta_query' => $query->get('meta_query') ? 'true' : 'false',
        ]);
    }
    return $posts;
}, 10, 2);

This histogram records the distribution of WP_Query execution times, bucketed by post type and whether the query included meta queries (which are often the slowest). In Grafana, you can visualize the p50, p95, and p99 latencies and set alerts when they exceed thresholds.
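If the percentile terminology is new: a p95 is the value below which 95% of samples fall. Grafana estimates these from the histogram's buckets; the naive nearest-rank computation over raw samples, shown here only to make the idea concrete, looks like this:

```php
<?php
// Naive nearest-rank percentile over raw samples, to make p50/p95 concrete.
// (Grafana derives these from histogram buckets, not raw values.)
function percentile(array $samples, float $p): float
{
    sort($samples);
    $rank = (int) ceil(($p / 100) * count($samples)) - 1;
    return (float) $samples[max(0, $rank)];
}

$durationsMs = [12, 15, 14, 18, 250, 16, 13, 17, 15, 14];
echo percentile($durationsMs, 50), "\n"; // 15
echo percentile($durationsMs, 95), "\n"; // 250
```

The single 250ms outlier leaves the p50 untouched but dominates the p95, which is exactly why alerting on high percentiles catches regressions that averages hide.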

Hook Duration Tracking

WordPress hooks are the primary extension mechanism, and poorly written hook callbacks are a common source of performance problems. Tracking the total time spent in each hook reveals which plugins or themes are consuming the most processing time.

class WP_Hook_Metrics {

    private $hookHistogram;
    private $hookCounter;
    private $hookTimings = [];

    public function __construct($meter) {
        $this->hookHistogram = $meter->createHistogram(
            'wordpress.hook.duration',
            'ms',
            'Time spent executing WordPress hooks'
        );

        $this->hookCounter = $meter->createCounter(
            'wordpress.hook.calls',
            'calls',
            'Number of times each hook fires'
        );
    }

    public function register(): void {
        // Track specific hooks known to cause performance issues
        $trackedHooks = [
            'the_content',
            'wp_head',
            'wp_footer',
            'pre_get_posts',
            'save_post',
            'woocommerce_before_single_product',
            'woocommerce_checkout_process',
            'template_redirect',
            'init',
            'widgets_init',
            'wp_loaded',
        ];

        foreach ($trackedHooks as $hook) {
            // Add a high-priority (early) start marker
            add_filter($hook, function($value) use ($hook) {
                $this->hookTimings[$hook] = microtime(true);
                return $value;
            }, -9999);

            // Add a low-priority (late) end marker
            add_filter($hook, function($value) use ($hook) {
                if (isset($this->hookTimings[$hook])) {
                    $duration = (microtime(true) - $this->hookTimings[$hook]) * 1000;

                    $this->hookHistogram->record($duration, [
                        'hook_name' => $hook,
                    ]);

                    $this->hookCounter->add(1, [
                        'hook_name' => $hook,
                    ]);

                    unset($this->hookTimings[$hook]);
                }
                return $value;
            }, 9999);
        }
    }
}

// Instantiate early (e.g., from an mu-plugin) so the priority -9999
// start markers are registered before hooks such as 'init' fire
$hookMetrics = new WP_Hook_Metrics($meter);
$hookMetrics->register();

Object Cache Hit Ratio

For sites using a persistent object cache like Redis or Memcached, the cache hit ratio is a key performance indicator. A dropping hit ratio often indicates that the cache is being evicted too frequently, that a plugin is bypassing the cache, or that the cache store is running out of memory.

$cacheHits = $meter->createCounter(
    'wordpress.object_cache.hits',
    'hits',
    'Object cache hit count'
);

$cacheMisses = $meter->createCounter(
    'wordpress.object_cache.misses',
    'misses',
    'Object cache miss count'
);

// Record cache stats at the end of each request
add_action('shutdown', function() use ($cacheHits, $cacheMisses) {
    global $wp_object_cache;

    if (method_exists($wp_object_cache, 'getStats')) {
        // Object cache drop-ins (e.g. some Redis implementations) expose aggregate stats
        $stats = $wp_object_cache->getStats();
        $cacheHits->add($stats['hits'] ?? 0);
        $cacheMisses->add($stats['misses'] ?? 0);
    } elseif (isset($wp_object_cache->cache_hits, $wp_object_cache->cache_misses)) {
        // Default WordPress object cache
        $cacheHits->add($wp_object_cache->cache_hits);
        $cacheMisses->add($wp_object_cache->cache_misses);
    }
}, 9998);

// HTTP request counter
$httpRequests = $meter->createCounter(
    'wordpress.http.requests',
    'requests',
    'Total HTTP requests processed'
);

add_action('shutdown', function() use ($httpRequests) {
    $httpRequests->add(1, [
        'method' => $_SERVER['REQUEST_METHOD'] ?? 'UNKNOWN',
        'status_code' => (string)http_response_code(),
        'is_admin' => is_admin() ? 'true' : 'false',
    ]);
}, 9999);
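The latency panels in the Grafana section rely on a wordpress.http.request.duration histogram. If your earlier tracer bootstrap does not already record one, here is a minimal sketch; the metric name and attributes are assumptions chosen to match the dashboard queries:

```php
// Sketch: record total request wall-clock time at shutdown.
// $meter is the meter created from the MeterProvider above.
$requestDuration = $meter->createHistogram(
    'wordpress.http.request.duration',
    'ms',
    'Wall-clock time spent serving each HTTP request'
);

add_action('shutdown', function() use ($requestDuration) {
    // REQUEST_TIME_FLOAT is set by PHP at the very start of the request
    $start = $_SERVER['REQUEST_TIME_FLOAT'] ?? null;
    if ($start !== null) {
        $requestDuration->record((microtime(true) - $start) * 1000, [
            'is_admin'    => is_admin() ? 'true' : 'false',
            'status_code' => (string) http_response_code(),
        ]);
    }
}, 9999);
```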

Correlating Slow Queries with Plugin Hooks

One of the most powerful applications of distributed tracing in WordPress is connecting slow database queries to the specific plugin or theme hook that generated them. The trace hierarchy makes this straightforward: a database query span is a child of whatever span was active when the query executed.

The Plugin Callback Tracer

class WP_Plugin_Callback_Tracer {

    private $tracer;

    public function __construct() {
        $this->tracer = Globals::tracerProvider()->getTracer('wordpress-plugins');
    }

    /**
     * Wrap all registered callbacks for a given hook with tracing spans.
     * Call this after all plugins have loaded.
     */
    public function traceHook(string $hookName): void {
        global $wp_filter;

        if (!isset($wp_filter[$hookName])) {
            return;
        }

        $hook = $wp_filter[$hookName];

        foreach ($hook->callbacks as $priority => &$callbacks) {
            foreach ($callbacks as $key => &$callback) {
                $originalFn = $callback['function'];
                $callbackName = $this->getCallbackName($originalFn);
                $pluginFile = $this->resolvePluginFile($originalFn);

                $callback['function'] = function() use ($originalFn, $hookName, $callbackName, $pluginFile, $priority) {
                    $span = $this->tracer->spanBuilder("hook.callback")
                        ->setAttributes([
                            'wordpress.hook' => $hookName,
                            'wordpress.callback' => $callbackName,
                            'wordpress.priority' => $priority,
                            'wordpress.plugin' => $pluginFile,
                        ])
                        ->startSpan();

                    $scope = $span->activate();

                    try {
                        $result = call_user_func_array($originalFn, func_get_args());
                        return $result;
                    } catch (\Throwable $e) {
                        $span->recordException($e);
                        $span->setStatus(StatusCode::STATUS_ERROR, $e->getMessage());
                        throw $e;
                    } finally {
                        $scope->detach();
                        $span->end();
                    }
                };
            }
        }
    }

    private function getCallbackName($callback): string {
        if (is_string($callback)) {
            return $callback;
        }
        if (is_array($callback)) {
            $class = is_object($callback[0]) ? get_class($callback[0]) : $callback[0];
            return $class . '::' . $callback[1];
        }
        if ($callback instanceof \Closure) {
            $ref = new \ReflectionFunction($callback);
            return 'Closure@' . basename($ref->getFileName()) . ':' . $ref->getStartLine();
        }
        return 'unknown';
    }

    private function resolvePluginFile($callback): string {
        try {
            if (is_array($callback)) {
                $ref = is_object($callback[0])
                    ? new \ReflectionObject($callback[0])
                    : new \ReflectionClass((string) $callback[0]);
            } elseif ($callback instanceof \Closure) {
                $ref = new \ReflectionFunction($callback);
            } elseif (is_string($callback) && function_exists($callback)) {
                $ref = new \ReflectionFunction($callback);
            } else {
                return 'unknown';
            }
            $file = $ref->getFileName();
        } catch (\ReflectionException $e) {
            return 'unknown';
        }

        // Internal functions and classes have no file on disk
        if (!is_string($file)) {
            return 'unknown';
        }

        // Extract plugin directory name
        if (preg_match('#/plugins/([^/]+)/#', $file, $m)) {
            return $m[1];
        }
        if (preg_match('#/themes/([^/]+)/#', $file, $m)) {
            return 'theme:' . $m[1];
        }
        if (strpos($file, ABSPATH) === 0) {
            return 'wordpress-core';
        }
        return basename(dirname($file));
    }
}

// Apply tracing to performance-sensitive hooks after all plugins load
add_action('plugins_loaded', function() {
    $pluginTracer = new WP_Plugin_Callback_Tracer();
    $pluginTracer->traceHook('the_content');
    $pluginTracer->traceHook('save_post');
    $pluginTracer->traceHook('pre_get_posts');
    $pluginTracer->traceHook('template_redirect');
    $pluginTracer->traceHook('wp_head');
    $pluginTracer->traceHook('woocommerce_checkout_process');
}, 9999);

With this setup, when you open a trace for a slow request, you see the hook callback span containing the database query span. The attributes tell you exactly which plugin registered the callback and which hook triggered it. If a WooCommerce extension registers a the_content filter that runs five extra database queries, you see that relationship directly in the trace waterfall.

A real-world example: a site was experiencing 3-second page load times on product pages. The trace showed that the the_content hook contained 14 callback spans. One callback from a “related products” plugin was taking 1.8 seconds and generating 47 database queries, none of which used the object cache. The fix was straightforward: wrap the plugin’s output in a transient cache. Without tracing, identifying this would have required hours of manual profiling.

Analyzing Hook Chains

Beyond individual callbacks, you can analyze the cumulative effect of hook chains. Create a report that shows total execution time per plugin across all hooks:

class WP_Plugin_Performance_Aggregator {

    private static $timings = [];

    public static function record(string $plugin, string $hook, float $duration): void {
        $key = $plugin . '|' . $hook;
        if (!isset(self::$timings[$key])) {
            self::$timings[$key] = ['total' => 0, 'count' => 0, 'max' => 0];
        }
        self::$timings[$key]['total'] += $duration;
        self::$timings[$key]['count']++;
        self::$timings[$key]['max'] = max(self::$timings[$key]['max'], $duration);
    }

    public static function getReport(): array {
        $byPlugin = [];
        foreach (self::$timings as $key => $data) {
            [$plugin, $hook] = explode('|', $key);
            if (!isset($byPlugin[$plugin])) {
                $byPlugin[$plugin] = ['total_ms' => 0, 'hooks' => []];
            }
            $byPlugin[$plugin]['total_ms'] += $data['total'];
            $byPlugin[$plugin]['hooks'][$hook] = [
                'total_ms' => round($data['total'], 2),
                'calls' => $data['count'],
                'max_ms' => round($data['max'], 2),
                'avg_ms' => round($data['total'] / $data['count'], 2),
            ];
        }

        // Sort by total time descending
        uasort($byPlugin, fn($a, $b) => $b['total_ms'] <=> $a['total_ms']);

        return $byPlugin;
    }
}

// Dump performance report at request end (for development)
add_action('shutdown', function() {
    if (defined('WP_DEBUG') && WP_DEBUG) {
        $report = WP_Plugin_Performance_Aggregator::getReport();
        WP_Structured_Logger::info('Plugin performance report', [
            'plugins' => $report,
        ]);
    }
}, 9997);
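Note that nothing calls record() yet; the aggregator needs to be fed from wherever callback durations are measured. One option, sketched against the WP_Plugin_Callback_Tracer from the previous section, is to time the original callback inside the wrapper closure and report it alongside the span:

```php
// Sketch: surround the original callback with a timer and feed the
// aggregator from the finally block of the tracer's wrapper closure.
$callback['function'] = function () use ($originalFn, $hookName, $pluginFile) {
    $start = microtime(true);
    try {
        return call_user_func_array($originalFn, func_get_args());
    } finally {
        WP_Plugin_Performance_Aggregator::record(
            $pluginFile,
            $hookName,
            (microtime(true) - $start) * 1000  // duration in milliseconds
        );
    }
};
```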

Setting Up Grafana Dashboards for WordPress Health

Grafana ties together traces from Tempo, logs from Loki, and metrics from Prometheus (or the OTLP metrics receiver) into unified dashboards. Build a WordPress-specific dashboard that answers the questions your operations team asks most frequently.

Essential Dashboard Panels

Start with a dashboard that covers the four golden signals for WordPress: latency, traffic, errors, and saturation.

Request Latency (p50/p95/p99): Use the wordpress.http.request.duration histogram to plot percentile latencies over time. Split by is_admin to separate admin panel requests from frontend requests.

# P95 request latency for frontend requests
histogram_quantile(0.95,
  sum(rate(wordpress_http_request_duration_bucket{is_admin="false"}[5m])) by (le)
)

# P99 request latency split by query type
histogram_quantile(0.99,
  sum(rate(wordpress_http_request_duration_bucket{is_admin="false"}[5m])) by (le, query_type)
)

Request Rate: Track the number of requests per second, split by response status code.

# Requests per second by status code
sum(rate(wordpress_http_requests_total[5m])) by (status_code)

# Error rate percentage
sum(rate(wordpress_http_requests_total{status_code=~"5.."}[5m]))
/
sum(rate(wordpress_http_requests_total[5m]))
* 100

Database Query Latency: Plot the distribution of WP_Query execution times by post type, alongside the average number of database queries issued per request.

# P95 WP_Query latency by post type
histogram_quantile(0.95,
  sum(rate(wordpress_wp_query_duration_bucket[5m])) by (le, post_type)
)

# Queries per request (gauge showing average)
sum(rate(wordpress_db_queries_total[5m]))
/
sum(rate(wordpress_http_requests_total[5m]))

Cache Hit Ratio: Display the object cache effectiveness.

# Cache hit ratio percentage
sum(rate(wordpress_object_cache_hits_total[5m]))
/
(
  sum(rate(wordpress_object_cache_hits_total[5m]))
  +
  sum(rate(wordpress_object_cache_misses_total[5m]))
) * 100

Grafana Dashboard JSON Model

Provision the dashboard automatically using Grafana’s JSON model. Here is a condensed version that sets up the core panels:

{
  "dashboard": {
    "title": "WordPress Production Health",
    "uid": "wp-health-001",
    "timezone": "utc",
    "refresh": "30s",
    "panels": [
      {
        "title": "Request Latency (p95)",
        "type": "timeseries",
        "gridPos": {"h": 8, "w": 12, "x": 0, "y": 0},
        "targets": [
          {
            "expr": "histogram_quantile(0.95, sum(rate(wordpress_http_request_duration_bucket{is_admin=\"false\"}[5m])) by (le))",
            "legendFormat": "p95 Frontend"
          },
          {
            "expr": "histogram_quantile(0.50, sum(rate(wordpress_http_request_duration_bucket{is_admin=\"false\"}[5m])) by (le))",
            "legendFormat": "p50 Frontend"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "unit": "ms",
            "thresholds": {
              "steps": [
                {"color": "green", "value": null},
                {"color": "yellow", "value": 500},
                {"color": "red", "value": 2000}
              ]
            }
          }
        }
      },
      {
        "title": "Request Rate & Errors",
        "type": "timeseries",
        "gridPos": {"h": 8, "w": 12, "x": 12, "y": 0},
        "targets": [
          {
            "expr": "sum(rate(wordpress_http_requests_total[5m])) by (status_code)",
            "legendFormat": "{{status_code}}"
          }
        ],
        "fieldConfig": {
          "defaults": {"unit": "reqps"}
        }
      },
      {
        "title": "Cache Hit Ratio",
        "type": "gauge",
        "gridPos": {"h": 8, "w": 6, "x": 0, "y": 8},
        "targets": [
          {
            "expr": "sum(rate(wordpress_object_cache_hits_total[5m])) / (sum(rate(wordpress_object_cache_hits_total[5m])) + sum(rate(wordpress_object_cache_misses_total[5m]))) * 100"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "unit": "percent",
            "min": 0,
            "max": 100,
            "thresholds": {
              "steps": [
                {"color": "red", "value": null},
                {"color": "yellow", "value": 80},
                {"color": "green", "value": 95}
              ]
            }
          }
        }
      },
      {
        "title": "Slow Queries (>50ms)",
        "type": "stat",
        "gridPos": {"h": 8, "w": 6, "x": 6, "y": 8},
        "targets": [
          {
            "expr": "sum(rate(wordpress_db_slow_queries_total[5m])) * 300"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "unit": "short",
            "thresholds": {
              "steps": [
                {"color": "green", "value": null},
                {"color": "yellow", "value": 10},
                {"color": "red", "value": 50}
              ]
            }
          }
        }
      },
      {
        "title": "Hook Execution Time (Top 10)",
        "type": "barchart",
        "gridPos": {"h": 8, "w": 12, "x": 12, "y": 8},
        "targets": [
          {
            "expr": "topk(10, sum(rate(wordpress_hook_duration_sum[5m])) by (hook_name))",
            "legendFormat": "{{hook_name}}"
          }
        ],
        "fieldConfig": {
          "defaults": {"unit": "ms"}
        }
      }
    ]
  }
}

Save this JSON file in the directory watched by your dashboard provider (in the Docker Compose setup below, ./docker/dashboards, mounted into Grafana at /var/lib/grafana/dashboards) so the dashboard loads automatically when Grafana starts.
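Grafana only loads dashboard JSON that a provisioning provider points at. A minimal provider file, corresponding to the grafana-dashboards.yaml mounted in the Docker Compose setup below (the exact layout here is an assumption), might look like:

```yaml
# grafana-dashboards.yaml -- dashboard provider
apiVersion: 1
providers:
  - name: wordpress
    orgId: 1
    folder: WordPress
    type: file
    updateIntervalSeconds: 30
    options:
      path: /var/lib/grafana/dashboards
```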

Data Source Linking for Trace-Log Correlation

Configure Grafana data source settings to enable automatic linking between Tempo and Loki. In the Tempo data source configuration, add a trace-to-logs link:

# grafana-datasources.yaml
apiVersion: 1
datasources:
  - name: Tempo
    type: tempo
    access: proxy
    url: http://tempo:3200
    jsonData:
      tracesToLogs:
        datasourceUid: loki
        filterByTraceID: true
        filterBySpanID: false
        mapTagNamesEnabled: true
      tracesToMetrics:
        datasourceUid: prometheus
        tags:
          - key: "wordpress.hook"
            value: "hook_name"

  - name: Loki
    type: loki
    access: proxy
    url: http://loki:3100
    jsonData:
      derivedFields:
        - name: TraceID
          datasourceUid: tempo
          matcherRegex: "\"trace_id\":\"(\\w+)\""
          url: "$${__value.raw}"

With this configuration, clicking a trace ID in a Loki log entry opens the corresponding trace in Tempo. Clicking a span in Tempo shows the associated log entries. This bidirectional linking transforms how you investigate production issues.
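The matcherRegex above assumes JSON log lines that embed the trace ID as a trace_id field, shaped roughly like this (the IDs shown are the W3C Trace Context example values, not real ones):

```json
{"message":"Slow WP_Query detected","level":"warning","trace_id":"4bf92f3577b34da6a3ce929d0e0e4736","span_id":"00f067aa0ba902b7"}
```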

Alerting on Performance Regressions and Error Spikes

Dashboards are only useful when someone is watching them. Alerting rules notify your team when metrics cross defined thresholds, catching problems before users report them.

Grafana Alert Rules

# alerting-rules.yaml
apiVersion: 1
groups:
  - orgId: 1
    name: wordpress-performance
    folder: WordPress
    interval: 1m
    rules:
      - uid: wp-latency-p95
        title: WordPress P95 Latency High
        condition: C
        data:
          - refId: A
            relativeTimeRange:
              from: 600
              to: 0
            datasourceUid: prometheus
            model:
              expr: >
                histogram_quantile(0.95,
                  sum(rate(wordpress_http_request_duration_bucket{is_admin="false"}[5m])) by (le)
                )
          - refId: B
            relativeTimeRange:
              from: 600
              to: 0
            datasourceUid: __expr__
            model:
              type: reduce
              expression: A
              reducer: last
          - refId: C
            datasourceUid: __expr__
            model:
              type: threshold
              expression: B
              conditions:
                - evaluator:
                    type: gt
                    params: [2000]
        for: 5m
        annotations:
          summary: "WordPress frontend P95 latency exceeds 2 seconds"
          description: "Current P95: {{ $values.B }}ms. Check recent deployments and slow query traces."
        labels:
          severity: warning

      - uid: wp-error-rate
        title: WordPress Error Rate Spike
        condition: C
        data:
          - refId: A
            datasourceUid: prometheus
            model:
              expr: >
                sum(rate(wordpress_http_requests_total{status_code=~"5.."}[5m]))
                /
                sum(rate(wordpress_http_requests_total[5m]))
                * 100
          - refId: B
            datasourceUid: __expr__
            model:
              type: reduce
              expression: A
              reducer: last
          - refId: C
            datasourceUid: __expr__
            model:
              type: threshold
              expression: B
              conditions:
                - evaluator:
                    type: gt
                    params: [5]
        for: 3m
        annotations:
          summary: "WordPress 5xx error rate above 5%"
          description: "Error rate: {{ $values.B }}%. Check Loki logs for PHP errors."
        labels:
          severity: critical

      - uid: wp-cache-ratio
        title: WordPress Cache Hit Ratio Low
        condition: C
        data:
          - refId: A
            datasourceUid: prometheus
            model:
              expr: >
                sum(rate(wordpress_object_cache_hits_total[10m]))
                /
                (sum(rate(wordpress_object_cache_hits_total[10m]))
                + sum(rate(wordpress_object_cache_misses_total[10m])))
                * 100
          - refId: B
            datasourceUid: __expr__
            model:
              type: reduce
              expression: A
              reducer: last
          - refId: C
            datasourceUid: __expr__
            model:
              type: threshold
              expression: B
              conditions:
                - evaluator:
                    type: lt
                    params: [80]
        for: 10m
        annotations:
          summary: "Object cache hit ratio dropped below 80%"
          description: "Current ratio: {{ $values.B }}%. Check Redis memory and eviction policy."
        labels:
          severity: warning

Each alert rule includes a "for" duration that prevents alerts from firing on momentary spikes. The P95 latency alert fires only if latency stays above 2 seconds for 5 consecutive minutes. The error rate alert uses a shorter 3-minute window because 5xx errors demand a faster response.

Notification Channels

Configure Grafana to send alerts through multiple channels. Slack is typical for team awareness, PagerDuty for on-call escalation, and email for a persistent record.

# notification-policies.yaml
apiVersion: 1
policies:
  - orgId: 1
    receiver: slack-wordpress
    group_by: ['alertname']
    group_wait: 30s
    group_interval: 5m
    repeat_interval: 4h
    routes:
      - receiver: pagerduty-critical
        matchers:
          - severity = critical
        continue: true
      - receiver: slack-wordpress
        matchers:
          - severity = warning

contactPoints:
  - orgId: 1
    name: slack-wordpress
    receivers:
      - uid: slack-wp
        type: slack
        settings:
          url: "${SLACK_WEBHOOK_URL}"
          channel: "#wp-alerts"
          title: "{{ .CommonLabels.alertname }}"
          text: "{{ .CommonAnnotations.description }}"

  - orgId: 1
    name: pagerduty-critical
    receivers:
      - uid: pd-wp
        type: pagerduty
        settings:
          integrationKey: "${PAGERDUTY_KEY}"
          severity: critical

Production Patterns That Preserve Site Performance

Adding observability to a production WordPress site introduces processing overhead. Poorly implemented instrumentation can degrade the very performance it measures. Follow these patterns to keep the overhead minimal.

Sampling Strategy

Avoid tracing 100% of production traffic except at very low volumes. The sampling rate should scale inversely with traffic:

Low traffic (under 100 req/min): 50-100% sampling. You need enough data to detect issues, and the overhead at this volume is negligible.

Medium traffic (100-1000 req/min): 10-25% sampling. Sufficient for statistical significance on latency distributions while keeping export volume manageable.

High traffic (1000+ req/min): 1-5% sampling. At this volume, even 1% gives you 10+ traces per minute for analysis.
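Head-based sampling in the PHP SDK is typically configured through standard OpenTelemetry environment variables; the values below are examples matching the medium-traffic tier:

```shell
# Sample 10% of new traces, honoring an upstream parent's decision when present
export OTEL_TRACES_SAMPLER=parentbased_traceidratio
export OTEL_TRACES_SAMPLER_ARG=0.10
```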

Use head-based sampling for general traffic and tail-based sampling (in the collector) for capturing all error traces regardless of the initial sampling decision:

# Collector config for tail-based sampling
processors:
  tail_sampling:
    decision_wait: 10s
    num_traces: 100000
    policies:
      # Always keep error traces
      - name: errors-policy
        type: status_code
        status_code:
          status_codes:
            - ERROR

      # Always keep slow traces (over 2 seconds)
      - name: slow-traces
        type: latency
        latency:
          threshold_ms: 2000

      # Sample 10% of remaining traces
      - name: general-sampling
        type: probabilistic
        probabilistic:
          sampling_percentage: 10

Async Export with Batch Processing

The BatchSpanProcessor configured earlier is essential. It queues spans in memory and exports them in batches, either when the batch reaches a configured size or when a timer fires. This means the HTTP response to the user is never blocked waiting for trace data to be sent to the collector.

However, PHP’s execution model creates a challenge. Unlike long-running Node.js or Go processes, each PHP request starts and stops a fresh process (or reuses one from php-fpm). The batch processor must flush remaining spans during the shutdown handler before the process exits. The register_shutdown_function call in our bootstrap code handles this, but be aware that extremely short-lived requests might not leave enough time for the flush to complete.

For high-traffic sites, consider using the gRPC transport instead of HTTP for exporter communication. gRPC uses persistent connections and binary serialization, reducing the per-export overhead significantly:

// gRPC transport for lower overhead; the gRPC factory takes the full
// endpoint including the per-signal method path from OtlpUtil
$transport = (new \OpenTelemetry\Contrib\Grpc\GrpcTransportFactory())
    ->create(
        (getenv('OTEL_EXPORTER_OTLP_ENDPOINT') ?: 'http://otel-collector:4317')
        . OtlpUtil::method(Signals::TRACE)
    );

Selective Instrumentation

Do not instrument everything. Focus on the operations that matter most for performance diagnosis and leave low-value operations uninstrumented. A practical prioritization:

Always trace: The root HTTP request span, database queries, external HTTP calls (wp_remote_get/post), template rendering, and the main WP_Query.

Trace selectively: Plugin hook callbacks (only the top 10-15 performance-sensitive hooks), object cache operations, and file system operations.

Skip: Individual filter applications on strings, option lookups (too frequent), internal WordPress array manipulations, and translation function calls.

Memory Management

Each span consumes memory until it is exported. On a page with 200 database queries, you create 200 span objects plus the lifecycle spans. Monitor the memory impact and adjust the batch processor’s maxQueueSize if memory usage grows too large:

// Limit total spans per request to prevent memory issues
class SpanBudget {
    private static int $remaining = 500;

    public static function canCreateSpan(): bool {
        if (self::$remaining <= 0) {
            return false;
        }
        self::$remaining--;
        return true;
    }
}

// Use in the database tracer
public function onQueryStart(string $query): string {
    if (!SpanBudget::canCreateSpan()) {
        return $query; // Skip tracing this query
    }
    // ... create span as before
}

The span budget prevents runaway instrumentation on pages that execute an unusual number of operations. When the budget is exhausted, subsequent operations proceed without tracing overhead.

Docker Compose Setup for a Local Observability Stack

Development and staging environments should mirror the production observability setup. A Docker Compose file that runs the complete stack alongside WordPress makes it easy to develop and test instrumentation before deploying to production.

version: "3.8"

services:
  wordpress:
    image: wordpress:6.4-php8.2-apache
    ports:
      - "8080:80"
    environment:
      WORDPRESS_DB_HOST: db
      WORDPRESS_DB_USER: wordpress
      WORDPRESS_DB_PASSWORD: wordpress
      WORDPRESS_DB_NAME: wordpress
      OTEL_EXPORTER_OTLP_ENDPOINT: http://otel-collector:4318
      OTEL_TRACES_SAMPLER: parentbased_traceidratio  # ratio sampler driven by the arg below
      OTEL_TRACES_SAMPLER_ARG: "1.0"  # 100% sampling in dev
      OTEL_SERVICE_NAME: wordpress-dev
    volumes:
      - ./wp-content:/var/www/html/wp-content
      - ./vendor:/var/www/html/vendor
      - ./composer.json:/var/www/html/composer.json
    depends_on:
      - db
      - otel-collector

  db:
    image: mariadb:10.11
    environment:
      MYSQL_DATABASE: wordpress
      MYSQL_USER: wordpress
      MYSQL_PASSWORD: wordpress
      MYSQL_ROOT_PASSWORD: rootpassword
    volumes:
      - db_data:/var/lib/mysql

  otel-collector:
    image: otel/opentelemetry-collector-contrib:0.96.0
    command: ["--config=/etc/otel-collector-config.yaml"]
    ports:
      - "4317:4317"   # gRPC
      - "4318:4318"   # HTTP
      - "8888:8888"   # Prometheus metrics (collector self-monitoring)
    volumes:
      - ./docker/otel-collector-config.yaml:/etc/otel-collector-config.yaml
    depends_on:
      - tempo
      - loki

  tempo:
    image: grafana/tempo:2.4.0
    command: ["-config.file=/etc/tempo.yaml"]
    ports:
      - "3200:3200"   # Tempo API
      - "9095:9095"   # Tempo gRPC
    volumes:
      - ./docker/tempo.yaml:/etc/tempo.yaml
      - tempo_data:/var/tempo

  loki:
    image: grafana/loki:2.9.0
    ports:
      - "3100:3100"
    command: -config.file=/etc/loki/local-config.yaml
    volumes:
      - loki_data:/loki

  promtail:
    image: grafana/promtail:2.9.0
    volumes:
      - ./docker/promtail-config.yaml:/etc/promtail/config.yml
      - ./wp-content/logs:/var/log/wordpress
    command: -config.file=/etc/promtail/config.yml
    depends_on:
      - loki

  prometheus:
    image: prom/prometheus:v2.50.0
    command:
      - "--config.file=/etc/prometheus/prometheus.yml"
      - "--storage.tsdb.path=/prometheus"
      - "--web.enable-remote-write-receiver"   # accept Tempo's span metrics
      - "--enable-feature=exemplar-storage"
    ports:
      - "9090:9090"
    volumes:
      - ./docker/prometheus.yaml:/etc/prometheus/prometheus.yml
      - prometheus_data:/prometheus

  grafana:
    image: grafana/grafana:10.4.0
    ports:
      - "3000:3000"
    environment:
      GF_SECURITY_ADMIN_USER: admin
      GF_SECURITY_ADMIN_PASSWORD: admin
      GF_FEATURE_TOGGLES_ENABLE: traceqlEditor,tempoSearch,tempoBackendSearch
    volumes:
      - ./docker/grafana-datasources.yaml:/etc/grafana/provisioning/datasources/datasources.yaml
      - ./docker/grafana-dashboards.yaml:/etc/grafana/provisioning/dashboards/dashboards.yaml
      - ./docker/dashboards:/var/lib/grafana/dashboards
      - grafana_data:/var/lib/grafana
    depends_on:
      - tempo
      - loki
      - prometheus

volumes:
  db_data:
  tempo_data:
  loki_data:
  prometheus_data:
  grafana_data:

Tempo Configuration

# docker/tempo.yaml
server:
  http_listen_port: 3200

distributor:
  receivers:
    otlp:
      protocols:
        grpc:
          endpoint: 0.0.0.0:4317
        http:
          endpoint: 0.0.0.0:4318

ingester:
  max_block_duration: 5m

compactor:
  compaction:
    block_retention: 48h   # Keep traces for 48 hours in dev

storage:
  trace:
    backend: local
    wal:
      path: /var/tempo/wal
    local:
      path: /var/tempo/blocks

metrics_generator:
  registry:
    external_labels:
      source: tempo
  storage:
    path: /var/tempo/generator/wal
    remote_write:
      - url: http://prometheus:9090/api/v1/write
        send_exemplars: true

# The span-metrics and service-graph processors are off by default;
# enable them via overrides
overrides:
  defaults:
    metrics_generator:
      processors: [span-metrics, service-graphs]

Prometheus Configuration

# docker/prometheus.yaml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'otel-collector'
    static_configs:
      - targets: ['otel-collector:8888']

  - job_name: 'tempo'
    static_configs:
      - targets: ['tempo:3200']

# Tempo's metrics generator pushes span metrics to this Prometheus via
# remote write; start Prometheus with --web.enable-remote-write-receiver
# to accept it (no remote_write block is needed on the Prometheus side).

# Enable exemplar storage for trace-metric correlation
# (requires the --enable-feature=exemplar-storage flag)
storage:
  exemplars:
    max_exemplars: 100000

Starting the Stack

# Launch the entire observability stack
docker compose up -d

# Verify all services are healthy
docker compose ps

# Check collector is receiving data
curl -s http://localhost:8888/metrics | grep otelcol_receiver_accepted_spans

# Open Grafana
open http://localhost:3000   # admin/admin (macOS; use xdg-open on Linux)

Once the stack is running, browse your WordPress site at http://localhost:8080 and generate some traffic. Within a minute, traces should appear in Grafana's Explore view under the Tempo data source, and logs should be visible in Loki.

Investigating a Real Performance Issue

To demonstrate the complete workflow, walk through investigating a performance regression that appeared after a plugin update.

Step 1: Alert fires. Grafana sends a Slack notification that the P95 frontend latency exceeded 2 seconds. The alert started firing 10 minutes ago.

Step 2: Check the dashboard. Open the WordPress Health dashboard. The latency panel shows a sharp increase at 14:23 UTC, correlating with a deployment marker (if you are using deployment annotations). The request rate panel shows no increase in traffic, so this is not load-related.

Step 3: Find a slow trace. Switch to the Explore view with the Tempo data source. Search for traces where duration > 2s and wordpress.query_type = singular. Select one of the results.
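In recent Grafana versions the Tempo data source accepts TraceQL directly, and the same search can be issued against Tempo's HTTP API, which is handy for scripting. The port assumes the local compose stack, and the attribute name matches the instrumentation from earlier in this article:

```shell
# TraceQL search for slow singular-page requests (Tempo 2.x)
curl -sG http://localhost:3200/api/search \
  --data-urlencode 'q={ duration > 2s && span.wordpress.query_type = "singular" }' \
  --data-urlencode 'limit=5'
```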

Step 4: Analyze the trace waterfall. The root span shows a 3.1-second request to /blog/some-post/. Expanding the waterfall, you see:

http.request (3.1s)
├── wordpress.plugins_loaded (45ms)
├── wordpress.init (120ms)
├── wordpress.parse_request (8ms)
├── wordpress.template_render (2.9s)
│   ├── hook.callback the_content (2.7s)
│   │   ├── hook.callback SEOPlugin::inject_schema (2.6s)
│   │   │   ├── db.query SELECT (850ms) - wp_postmeta
│   │   │   ├── db.query SELECT (780ms) - wp_terms
│   │   │   ├── db.query SELECT (620ms) - wp_postmeta
│   │   │   └── db.query SELECT (340ms) - wp_options
│   │   └── hook.callback do_shortcode (90ms)
│   └── hook.callback wp_footer (180ms)

Step 5: Identify the root cause. The trace reveals that SEOPlugin::inject_schema is executing four database queries totaling 2.59 seconds. The wordpress.plugin attribute confirms it is the "super-seo-plugin." Checking the query statements, the wp_postmeta queries filter on meta_value, a longtext column the core schema leaves unindexed, which forces full table scans.

Step 6: Check logs for context. Click the trace-to-logs link. The associated logs show a warning: "Slow WP_Query detected" with the full SQL statement, confirming the queries are unoptimized.

Step 7: Fix and verify. Roll back the plugin update. The P95 latency drops back to 400ms within 2 minutes. File a bug report with the plugin developer, including the trace data showing the specific queries and their execution times.

This entire investigation took under 10 minutes, compared to the hours it would take with traditional WordPress debugging methods.

Advanced Patterns: Tracing External HTTP Calls

WordPress makes frequent external HTTP requests through wp_remote_get() and wp_remote_post(). These calls can introduce significant latency, especially when third-party APIs are slow or unreachable. Tracing external calls with context propagation enables distributed tracing across service boundaries.

/**
 * Instrument WordPress HTTP API calls with OpenTelemetry spans
 * and propagate trace context to downstream services.
 *
 * The 'http_request_args' filter is used (rather than 'pre_http_request')
 * because its return value replaces the request arguments, so the injected
 * headers and the stored span actually reach the transport.
 */
add_filter('http_request_args', function($args, $url) {
    $tracer = Globals::tracerProvider()->getTracer('wordpress-http-client');
    $span = $tracer->spanBuilder('http.client.' . strtoupper($args['method'] ?? 'GET'))
        ->setSpanKind(SpanKind::KIND_CLIENT)
        ->setAttributes([
            'http.method' => $args['method'] ?? 'GET',
            'http.url' => $url,
            'http.target' => parse_url($url, PHP_URL_PATH) ?? '/',
            'net.peer.name' => parse_url($url, PHP_URL_HOST) ?? '',
        ])
        ->startSpan();

    $scope = $span->activate();

    // Inject W3C Trace Context headers into the outgoing request
    $propagator = TraceContextPropagator::getInstance();
    $carrier = $args['headers'] ?? [];
    $propagator->inject($carrier);
    $args['headers'] = $carrier;

    // Store span and scope for the completion hook
    $args['_otel_span'] = $span;
    $args['_otel_scope'] = $scope;

    return $args;
}, 10, 2);

// The 'http_api_debug' action fires for both successful responses and
// WP_Error results; the 'http_response' filter is bypassed on errors,
// which would leak spans for failed requests.
add_action('http_api_debug', function($response, $context, $class, $args, $url) {
    if ($context !== 'response' || !isset($args['_otel_span'])) {
        return;
    }

    $span = $args['_otel_span'];
    $scope = $args['_otel_scope'];

    if (is_wp_error($response)) {
        $span->setStatus(StatusCode::STATUS_ERROR, $response->get_error_message());
        $span->setAttribute('error.type', $response->get_error_code());
    } else {
        $statusCode = wp_remote_retrieve_response_code($response);
        $span->setAttribute('http.status_code', $statusCode);

        if ($statusCode >= 400) {
            $span->setStatus(StatusCode::STATUS_ERROR, "HTTP $statusCode");
        }

        $span->setAttribute(
            'http.response_content_length',
            strlen(wp_remote_retrieve_body($response))
        );
    }

    $scope->detach();
    $span->end();
}, 10, 5);

The trace context propagation is the key detail here. By injecting W3C Trace Context headers into outgoing requests, any downstream service that also supports OpenTelemetry will link its spans to the same trace. If your WordPress site calls a Node.js microservice for search or a Python service for recommendations, the entire distributed trace appears as a single connected waterfall in Tempo.

Tracing WP-Cron and Background Jobs

WordPress cron jobs execute outside the normal request lifecycle and often cause performance issues that are invisible without dedicated tracing. WP-Cron runs on page loads by default, and a slow cron task blocks the triggering request. Even with DISABLE_WP_CRON and a system cron, you still need visibility into what cron jobs are doing.
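For reference, the usual production setup disables the page-load trigger in wp-config.php and drives cron from the system scheduler, here via WP-CLI (the install path is an assumption — adjust to your deployment):

```shell
# wp-config.php must contain: define('DISABLE_WP_CRON', true);
# Then run due events every minute from the system crontab:
* * * * * cd /var/www/html && wp cron event run --due-now >/dev/null 2>&1
```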

/**
 * Trace WP-Cron job execution with individual spans per scheduled event.
 *
 * WordPress core has no per-event cron hooks, so each due hook is
 * bracketed with callbacks at the extremes of the priority range.
 */
add_action('wp_loaded', function() {
    if (!wp_doing_cron()) {
        return;
    }

    $tracer = Globals::tracerProvider()->getTracer('wordpress-cron');

    // Create a root span for the cron run. When DISABLE_WP_CRON is set,
    // runs are triggered by a system cron rather than by page loads.
    $cronSpan = $tracer->spanBuilder('wordpress.cron')
        ->setSpanKind(SpanKind::KIND_INTERNAL)
        ->setAttributes([
            'wordpress.cron.triggered_by' =>
                (defined('DISABLE_WP_CRON') && DISABLE_WP_CRON) ? 'system' : 'wp_request',
        ])
        ->startSpan();

    $cronScope = $cronSpan->activate();

    // Wrap each due cron hook with a span. _get_cron_array() is a
    // core-internal helper that returns the full schedule.
    $active = [];
    foreach ((array) _get_cron_array() as $timestamp => $hooks) {
        if ($timestamp > time()) {
            continue;
        }
        foreach ($hooks as $hook => $events) {
            $event = reset($events);

            add_action($hook, function() use ($tracer, $hook, $event, &$active) {
                $span = $tracer->spanBuilder('wordpress.cron.event')
                    ->setAttributes([
                        'wordpress.cron.hook' => $hook,
                        'wordpress.cron.schedule' => ($event['schedule'] ?? false) ?: 'single',
                        'wordpress.cron.args' => json_encode($event['args'] ?? []),
                    ])
                    ->startSpan();
                $active[$hook] = ['span' => $span, 'scope' => $span->activate()];
            }, PHP_INT_MIN);

            add_action($hook, function() use ($hook, &$active) {
                if (isset($active[$hook])) {
                    $active[$hook]['scope']->detach();
                    $active[$hook]['span']->end();
                    unset($active[$hook]);
                }
            }, PHP_INT_MAX);
        }
    }

    // End the root cron span on shutdown
    register_shutdown_function(function() use ($cronScope, $cronSpan) {
        $cronScope->detach();
        $cronSpan->end();
    });
});

This setup produces traces that show each cron event as a child span under the cron root span. If the wp_update_plugins cron event takes 30 seconds because it is checking 50 plugin update URLs sequentially, you see that breakdown clearly in the trace. The external HTTP tracing from the previous section adds child spans for each outgoing API call, completing the picture.

Instrumenting WooCommerce-Specific Operations

For WordPress sites running WooCommerce, additional instrumentation points capture e-commerce-specific performance characteristics. Cart operations, checkout processing, and order creation involve complex hook chains and multiple database transactions.

class WooCommerce_Tracer {

    private $tracer;

    public function __construct() {
        $this->tracer = Globals::tracerProvider()->getTracer('woocommerce');
    }

    public function register(): void {
        if (!class_exists('WooCommerce')) {
            return;
        }

        // Trace checkout processing
        add_action('woocommerce_checkout_process', [$this, 'traceCheckoutStart']);
        add_action('woocommerce_checkout_order_processed', [$this, 'traceOrderCreated'], 10, 3);

        // Trace cart calculations
        add_action('woocommerce_before_calculate_totals', [$this, 'traceCartCalcStart']);
        add_action('woocommerce_after_calculate_totals', [$this, 'traceCartCalcEnd']);

        // Trace payment processing
        add_action('woocommerce_before_pay_action', [$this, 'tracePaymentStart']);
        add_filter('woocommerce_payment_successful_result', [$this, 'tracePaymentSuccess']);
    }

    public function traceCheckoutStart(): void {
        $span = $this->tracer->spanBuilder('woocommerce.checkout.process')
            ->setAttributes([
                'woocommerce.cart_total' => WC()->cart->get_total('edit'),
                'woocommerce.cart_items' => WC()->cart->get_cart_contents_count(),
                'woocommerce.payment_method' => WC()->session->get('chosen_payment_method'),
            ])
            ->startSpan();

        // Persist the IDs so later requests in the checkout flow can be
        // correlated, then end the span; leaving it open would leak it.
        WC()->session->set('_otel_checkout_span', [
            'trace_id' => $span->getContext()->getTraceId(),
            'span_id' => $span->getContext()->getSpanId(),
        ]);

        $span->end();
    }

    public function traceOrderCreated($orderId, $postedData, $order): void {
        $span = $this->tracer->spanBuilder('woocommerce.order.created')
            ->setAttributes([
                'woocommerce.order_id' => $orderId,
                'woocommerce.order_total' => $order->get_total(),
                'woocommerce.item_count' => $order->get_item_count(),
                'woocommerce.payment_method' => $order->get_payment_method(),
                'woocommerce.customer_id' => $order->get_customer_id(),
            ])
            ->startSpan();

        $span->addEvent('order_created', [
            'order_id' => $orderId,
        ]);

        $span->end();
    }

    private $cartCalcSpan;
    private $cartCalcScope;

    public function traceCartCalcStart($cart): void {
        $this->cartCalcSpan = $this->tracer->spanBuilder('woocommerce.cart.calculate_totals')
            ->setAttributes([
                'woocommerce.cart_items' => $cart->get_cart_contents_count(),
                'woocommerce.coupons_applied' => count($cart->get_applied_coupons()),
            ])
            ->startSpan();
        $this->cartCalcScope = $this->cartCalcSpan->activate();
    }

    public function traceCartCalcEnd($cart): void {
        if ($this->cartCalcScope) {
            $this->cartCalcSpan->setAttribute(
                'woocommerce.calculated_total',
                $cart->get_total('edit')
            );
            $this->cartCalcScope->detach();
            $this->cartCalcSpan->end();

            // Totals can be recalculated several times per request; reset
            // so a stale span is never detached or ended twice.
            $this->cartCalcSpan = null;
            $this->cartCalcScope = null;
        }
    }
    public function tracePaymentStart($order): void {
        // Point-in-time span marking the start of order-pay processing
        $span = $this->tracer->spanBuilder('woocommerce.payment.pay_action')
            ->setAttribute('woocommerce.order_id', $order->get_id())
            ->startSpan();
        $span->end();
    }

    public function tracePaymentSuccess($result) {
        $span = $this->tracer->spanBuilder('woocommerce.payment.success')
            ->startSpan();
        $span->end();

        // This is a filter callback, so it must return the result unchanged
        return $result;
    }
}

WooCommerce checkout flows involve payment gateway API calls, inventory checks, tax calculations, and shipping rate lookups. Each of these operations may trigger external HTTP requests or complex database queries. With the combination of the WooCommerce tracer, the database tracer, and the HTTP client tracer, you get a complete picture of checkout performance.

Operational Considerations and Gotchas

Several practical considerations affect real-world deployments of observability in WordPress.

PHP-FPM process model: Each PHP-FPM worker process initializes the OpenTelemetry SDK independently. The batch span processor maintains its own queue per process. With 20 FPM workers, you have 20 independent batch queues flushing to the collector. Size the collector's receiver buffer accordingly.

Object cache conflicts: Some persistent object cache plugins (like Redis Object Cache) register their own shutdown handlers for cleanup. If the OpenTelemetry shutdown handler runs after the cache plugin's handler, the cache connection may already be closed, causing errors during the final span export. PHP executes shutdown functions in registration order, so register the OTel handler as early as possible (for example, from a must-use plugin) to ensure it runs first:

register_shutdown_function(function() use ($tracerProvider) {
    $tracerProvider->shutdown();
});
// Runs before plugin shutdown handlers because it is registered first

Plugin conflicts: Some security plugins hook into every database query for auditing purposes. Running both a security audit plugin and database query tracing can multiply the overhead. Test the combined effect and disable one or the other if the overhead is unacceptable.

Multisite considerations: In a WordPress Multisite installation, the service resource should include the blog ID to distinguish traces from different sites within the network:

$resource = ResourceInfo::create(Attributes::create([
    ResourceAttributes::SERVICE_NAME => 'wordpress-multisite',
    'wordpress.blog_id' => get_current_blog_id(),
    'wordpress.site_url' => get_site_url(),
]));

Disk space for local Tempo storage: Tempo stores traces on disk. In a local Docker setup, traces can accumulate quickly. Set the block_retention to 24 or 48 hours for development environments. In production with a cloud storage backend (S3, GCS), retention can be longer, but monitor storage costs.

Collector availability: If the OpenTelemetry Collector goes down, the PHP SDK's batch processor will queue spans up to maxQueueSize and then start dropping them. This is the correct behavior for production because it prevents unbounded memory growth. Monitor the collector's uptime as part of your infrastructure alerting.
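A pair of alert rules covers both failure modes: the collector disappearing entirely, and its exporter silently failing to deliver spans. The rule file follows standard Prometheus conventions; the collector metric name below exists in recent collector releases but is worth verifying against your version (newer builds may append a `_total` suffix):

```yaml
# Prometheus rules file (e.g. otel-collector-alerts.yaml)
groups:
  - name: otel-collector
    rules:
      - alert: OtelCollectorDown
        expr: up{job="otel-collector"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "OpenTelemetry Collector is not being scraped"

      - alert: OtelCollectorExportFailures
        expr: rate(otelcol_exporter_send_failed_spans[5m]) > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Collector is failing to export spans to the backend"
```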

Testing Your Instrumentation

Before deploying instrumentation to production, verify it works correctly and measure its overhead.

/**
 * PHPUnit test to verify span creation and attribute correctness.
 */
class OpenTelemetryInstrumentationTest extends WP_UnitTestCase {

    private $inMemoryExporter;
    private $tracerProvider;

    protected function setUp(): void {
        parent::setUp();

        $this->inMemoryExporter = new \OpenTelemetry\SDK\Trace\SpanExporter\InMemoryExporter();
        $this->tracerProvider = (new TracerProviderBuilder())
            ->addSpanProcessor(
                new \OpenTelemetry\SDK\Trace\SpanProcessor\SimpleSpanProcessor(
                    $this->inMemoryExporter
                )
            )
            ->build();
    }

    public function test_database_query_creates_span(): void {
        $tracer = $this->tracerProvider->getTracer('test');

        // Simulate a traced query
        $span = $tracer->spanBuilder('db.query')
            ->setAttributes([
                'db.system' => 'mysql',
                'db.statement' => 'SELECT * FROM wp_posts WHERE ID = ?',
                'db.operation' => 'SELECT',
            ])
            ->startSpan();
        $span->end();

        $this->tracerProvider->forceFlush();

        $spans = $this->inMemoryExporter->getSpans();
        $this->assertCount(1, $spans);
        $this->assertEquals('db.query', $spans[0]->getName());
        $this->assertEquals('mysql', $spans[0]->getAttributes()->get('db.system'));
    }

    public function test_query_sanitization_removes_literals(): void {
        $dbTracer = new WP_Database_Tracer();
        $sanitized = $this->invokePrivateMethod(
            $dbTracer,
            'sanitizeQuery',
            ["SELECT * FROM wp_users WHERE user_login = 'admin' AND ID = 42"]
        );

        $this->assertStringNotContainsString('admin', $sanitized);
        $this->assertStringNotContainsString('42', $sanitized);
        $this->assertStringContainsString('= ?', $sanitized);
    }

    public function test_instrumentation_overhead_under_threshold(): void {
        // Measure the overhead of span creation
        $iterations = 1000;

        $startTime = microtime(true);
        for ($i = 0; $i < $iterations; $i++) {
            $span = $this->tracerProvider->getTracer('bench')->spanBuilder('test')
                ->startSpan();
            $span->end();
        }
        $elapsed = microtime(true) - $startTime;

        $perSpanMs = ($elapsed / $iterations) * 1000;

        // Each span should take less than 0.1ms to create and end
        $this->assertLessThan(0.1, $perSpanMs,
            "Span creation overhead too high: {$perSpanMs}ms per span"
        );
    }

    private function invokePrivateMethod($object, string $method, array $args = []) {
        $ref = new \ReflectionMethod($object, $method);
        $ref->setAccessible(true);
        return $ref->invokeArgs($object, $args);
    }
}
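Assuming a standard wp-env or wp-phpunit test setup, the suite runs with PHPUnit's filter flag (the vendor path is illustrative):

```shell
# Run only the instrumentation tests
vendor/bin/phpunit --filter OpenTelemetryInstrumentationTest
```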

Run a load test with instrumentation enabled and disabled to quantify the production overhead. Use a tool like wrk or k6 to generate consistent load:

# Baseline (instrumentation disabled, e.g. via OTEL_SDK_DISABLED=true)
wrk -t4 -c100 -d60s http://localhost:8080/

# With instrumentation (10% sampling) — re-enable, then run the identical command
wrk -t4 -c100 -d60s http://localhost:8080/

# Compare the results: p99 latency increase should be under 5%

If the overhead exceeds 5% on P99 latency, reduce the number of traced hooks, lower the sampling rate, or switch from HTTP to gRPC transport for the exporter.
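Comparing the two wrk runs by eye invites mistakes; a one-liner makes the regression explicit. The latency values below are placeholders — substitute the p99 readings from your own runs:

```shell
# Percentage overhead between baseline and instrumented p99 latency (ms)
baseline=412.3
instrumented=429.8
awk -v b="$baseline" -v i="$instrumented" \
    'BEGIN { printf "p99 overhead: %.1f%%\n", (i - b) / b * 100 }'
```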

Bringing It All Together

A fully instrumented WordPress installation with the setup described in this article produces three interconnected data streams:

Distributed traces that show the complete request lifecycle as a hierarchy of timed spans, from the initial HTTP request through WordPress hook execution, database queries, external API calls, and template rendering. Each span carries attributes that identify the responsible plugin, hook, query, or template.

Structured logs in JSON format with automatic context enrichment, including trace IDs that link each log entry to its corresponding trace. Log levels follow PSR-3 severity standards, and Monolog handlers route logs to both local files and centralized storage in Loki.

Custom metrics that track the aggregate performance characteristics that matter most: request latency percentiles, database query execution time distributions, hook execution durations, cache hit ratios, and error rates. These metrics feed Grafana dashboards and alerting rules that catch regressions before they affect users.

The observability stack runs locally via Docker Compose for development and testing, and the same configuration (with adjusted sampling rates and storage backends) deploys to production. The OpenTelemetry Collector decouples the application from the storage backend, allowing you to switch from the Grafana stack to Datadog or Honeycomb without modifying any application code.

The investment in observability pays for itself the first time you diagnose a production issue in minutes instead of hours. WordPress sites accumulate complexity over time as plugins, themes, and custom code interact in unpredictable ways. Observability transforms that complexity from a source of operational risk into a measurable, traceable, and manageable system. When your site slows down, you no longer guess. You look at the trace, find the slow span, read the attributes, and fix the problem.

Share this article

Tom Bradley

DevOps engineer focused on WordPress deployment automation. Builds CI/CD pipelines and infrastructure-as-code solutions for WordPress agencies.