Observability at TinyFish

Logs · Traces · Metrics — all in one place

We had zero distributed tracing and fragmented logging — each service managing its own ES client. Now we have a unified pipeline with centralized enrichment, PII redaction, and full trace correlation.

~5M spans/day
7 services
3 signal types
1 Grafana instance

Session Agenda

01 · Logs (10 min)
Structured logging, Vector, VictoriaLogs

02 · Traces (15 min)
OpenTelemetry SDK, Tempo, trace correlation

03 · Metrics (5 min)
Host metrics, app metrics, span metrics

04 · Full Picture (5 min)
How all three signals connect

05 · Live Demos (20 min)
Grafana Explore, querying real data

06 · Q&A (5 min)
Questions, discussion, next steps

Logs

Structured Logging

EC2 Host / ECS Task → stdout (JSON) → Vector Agent (daemon · 1/host) → Vector Aggregator (enrich · redact · route · PII redaction) → VictoriaLogs (+ S3 archive) → Grafana (LogsQL queries)

Structured logging is optional

Vector Agent automatically picks up all stdout/stderr from your containers — plain text, JSON, anything — so you get logs in VictoriaLogs with zero code changes. Structured JSON logging just makes them far easier to query: you can filter by service, level, trace_id, or any custom field instead of grepping through raw text.
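To see why the structured form queries better, here is a minimal sketch of JSON-per-line logging using only the Python stdlib. This is illustrative, not the real helper — in practice `tinyfish-common-structlog` does this for you (plus trace_id injection):

```python
import json
import logging
import sys

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""
    def format(self, record):
        entry = {
            "@timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%S"),
            "level": record.levelname.lower(),
            "event": record.getMessage(),
            "service": "my-service",  # in the real helper this is auto-injected
        }
        # Custom fields ride along in record.fields and become queryable keys.
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry)

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
log = logging.getLogger("demo")
log.addHandler(handler)
log.setLevel(logging.INFO)

log.info("query resolved", extra={"fields": {"engine": "google", "latency_ms": 42}})
```

Each emitted line is a self-describing JSON object, so Vector can index `engine` and `latency_ms` as fields instead of leaving them buried in message text.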

Before

Sonar → Direct ES

  • Each service manages ES client
  • Dual Python/TS SDK (Sonar)
  • No centralized pipeline
  • No PII redaction

After

structlog/Pino → stdout → Vector

  • Zero ES deps in services
  • Centralized enrichment & redaction
  • Automatic trace correlation
  • S3 archival for free

Setup

Python (tinyfish-common-structlog)
from tinyfish.common.structlog import configure_structlog
configure_structlog(service_name="my-service")

import structlog
log = structlog.get_logger()
log.info("query resolved", engine="google", latency_ms=42)
TypeScript (Pino)
import pino from "pino";
const logger = pino({
  timestamp: pino.stdTimeFunctions.isoTime,
  base: { service: "my-service" },
});

const reqLogger = logger.child({ request_id: id, trace_id: traceId });
reqLogger.info({ engine: "google" }, "query resolved");

Annotated Log Output

Structured log entry
{
  "@timestamp": "2026-03-15T10:30:00.000Z",
  "level": "info",
  "event": "query resolved",
  "service": "agentql",            // <-- auto-injected by configure_structlog
  "trace_id": "abc123def456...",   // <-- auto-injected by OTel context
  "span_id": "789xyz...",          // <-- auto-injected by OTel context
  "request_id": "req-42",
  "engine": "google",
  "latency_ms": 42
}

LogsQL Cheatsheet

service:agentql AND level:error AND _time:1h
  Errors from agentql in the last hour

trace_id:"abc123def456"
  All logs for a specific trace

service:eva AND request_id:"req-789"
  Logs for a specific request

"timeout" AND "browser" AND _time:30m
  Full-text search across all services

Traces

Distributed Tracing

ECS Task (your app) → OTel SDK (auto: FastAPI, httpx, DB) → OTLP gRPC → OTel Collector (batch · redact) → Tempo (S3-backed · 14d) → Grafana (TraceQL queries)

What is a trace?

A trace follows a single request across services. It contains spans: timed operations nested parent-to-child, all sharing one trace_id.
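The parent-child structure can be sketched in a few lines of plain Python — a toy model, not the OTel SDK, just to show how one trace_id is minted at the edge and propagated to every downstream span:

```python
import uuid
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Span:
    name: str
    trace_id: str                      # shared by every span in the trace
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex[:16])
    parent_id: Optional[str] = None    # links child to parent, forming the tree
    children: List["Span"] = field(default_factory=list)

    def child(self, name: str) -> "Span":
        s = Span(name, self.trace_id, parent_id=self.span_id)
        self.children.append(s)
        return s

# trace_id created once, at the entry point, then inherited downstream
root = Span("frontend", trace_id=uuid.uuid4().hex)
worker = root.child("task-worker")
eva = worker.child("eva.run")

assert root.trace_id == worker.trace_id == eva.trace_id
```

In real services the OTel SDK does this bookkeeping, carrying the trace context across process boundaries in HTTP headers and queue message attributes.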

frontend → task-worker → eva (same trace_id)

Trace Waterfall

trace_id: a1b2c3d4e5f6...

frontend                   50ms
  SQS enqueue               5ms
task-worker                 3.2s
  queue.wait              120ms
  eva.run                   2.8s
    browser.create        400ms
    automation.execute      2.2s
    result.extract        200ms
  result.publish           80ms
OpenTelemetry

Powered by OpenTelemetry (open-source)

Our tracing & metrics pipeline uses the OpenTelemetry open standard — vendor-neutral, CNCF-hosted, supported by every major cloud and APM vendor. The same SDK works with Datadog, New Relic, Jaeger, or any OTLP-compatible backend.

Setup

Python (tinyfish-common-telemetry)
from tinyfish.common.telemetry import init_telemetry, shutdown_telemetry
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.instrumentation.httpx import HTTPXClientInstrumentor

# Startup
init_telemetry("my-service")
FastAPIInstrumentor.instrument_app(app)
HTTPXClientInstrumentor().instrument()

# Shutdown
shutdown_telemetry()

JS/TS — with Sentry v10+

sentry.server.config.ts (Sentry owns TracerProvider)
// sentry.server.config.ts
import * as Sentry from "@sentry/nextjs";
import { BatchSpanProcessor } from "@opentelemetry/sdk-trace-base";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-grpc";

Sentry.init({
  dsn: process.env.SENTRY_DSN,
  tracesSampleRate: 0.15,
  openTelemetrySpanProcessors: [
    new BatchSpanProcessor(
      new OTLPTraceExporter({
        url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT,
      })
    ),
  ],
});

JS/TS — without Sentry

instrumentation.ts (standalone OTel)
import { NodeSDK } from "@opentelemetry/sdk-node";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-grpc";
import { getNodeAutoInstrumentations } from "@opentelemetry/auto-instrumentations-node";

const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter({
    url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT,
  }),
  instrumentations: [getNodeAutoInstrumentations()],
  serviceName: process.env.OTEL_SERVICE_NAME,
});
sdk.start();

TraceQL Cheatsheet

{ resource.service.name = "agentql" }
  All traces from agentql

{ status = error }
  Failed traces

{ resource.service.name = "eva" && duration > 5s }
  Slow eva traces

{ name = "browser.create" && duration > 10s }
  Slow browser-creation spans

Metrics

Metrics Collection

ECS Task → OTel SDK (counters · histograms · gauges) → OTel Collector → prometheusremotewrite → VictoriaMetrics (PromQL · 90d)
EC2 Host → Vector Agent (CPU · memory · disk · network) → Vector Aggregator → VictoriaMetrics
Tempo → span metrics → VictoriaMetrics
VictoriaMetrics → Grafana (PromQL queries)

Automatic (Host)

Vector Agent collects CPU, memory, disk, network. Zero setup required.

Application (OTel SDK)

Counters, histograms, gauges via OTel SDK. Same OTLP endpoint as traces.

Span Metrics (auto)

Auto-generated from traces by Tempo. RED dashboards for free.

Application Metrics (Python)

Python (OpenTelemetry)
from opentelemetry import metrics
meter = metrics.get_meter("my-service")

counter = meter.create_counter("myservice.requests.total")
histogram = meter.create_histogram("myservice.request.duration", unit="ms")

counter.add(1, {"endpoint": "/api/query", "status": "success"})
histogram.record(elapsed_ms, {"endpoint": "/api/query"})

PromQL Cheatsheet

rate(myservice_requests_total[5m])
  Request rate over 5 min

histogram_quantile(0.95, rate(myservice_request_duration_bucket[5m]))
  p95 latency

eva_active_sessions
  Current active sessions gauge

host_memory_total_bytes - host_memory_available_bytes
  Memory usage
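To demystify what `histogram_quantile` actually computes, here is an illustrative Python reimplementation of its bucket interpolation, with made-up bucket counts (not real TinyFish data):

```python
def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.
    buckets: sorted list of (upper_bound_le, cumulative_count),
    ending with (float('inf'), total) -- the '+Inf' bucket."""
    total = buckets[-1][1]
    rank = q * total                  # target cumulative count
    lower, prev_count = 0.0, 0
    for le, count in buckets:
        if count >= rank:
            if le == float("inf"):
                return lower          # cap at the last finite bound
            # linear interpolation inside the containing bucket
            return lower + (le - lower) * (rank - prev_count) / (count - prev_count)
        lower, prev_count = le, count

# 100 requests: 60 took <50ms, 90 took <100ms, all 100 took <250ms
buckets = [(50, 60), (100, 90), (250, 100), (float("inf"), 100)]
p95 = histogram_quantile(0.95, buckets)  # rank 95 lands in the 100-250ms bucket
```

Because the answer is interpolated within a bucket, p95 accuracy depends on how fine your histogram buckets are around the latency range you care about.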
Full Picture

Putting It All Together

Three signals, one pipeline, one destination. Here's how logs, traces, and metrics flow from your service to Grafana.

Logs:    ECS Task → stdout → Vector Agent → Vector Aggregator → VictoriaLogs (+ S3 archive)
Traces:  ECS Task → OTel SDK → OTel Collector → Tempo
Metrics: OTel SDK → OTel Collector → VictoriaMetrics; Vector Agent → Vector Aggregator → VictoriaMetrics
All three → Grafana (unified view)

Key Insights

Zero code changes for logs — Vector Agent picks up stdout automatically

Traces require OTel SDK — one-time setup per service

Pipeline failure never impacts your app — all fire-and-forget

Live Demos

Hands-on Walkthrough

1

Find logs for a service

service:agentql AND _time:1h

Open VictoriaLogs in Grafana Explore, paste the LogsQL query.

2

Find error logs

service:eva AND level:error AND _time:24h

Filter by level to see only errors from a specific service.

3

Trace a request

copy trace_id from log -> paste in Tempo

Click a log entry, copy trace_id, switch to Tempo datasource, paste it.

4

Cross-service trace

frontend -> task-worker -> eva

Full request lifecycle across multiple services in one waterfall.

5

Log <-> Trace correlation

click from trace to logs and back

Use the 'Logs for this span' link in Tempo, or the trace link in logs.

6

Query metrics

eva_active_sessions, host CPU/memory

Switch to VictoriaMetrics datasource and run PromQL queries.

Quick Reference

Cheatsheet

ECS Environment Variables (new services)

ECS task definition env vars
OTEL_EXPORTER_OTLP_ENDPOINT = https://otel-collector.${env}.tinyfish.io:4317
OTEL_SERVICE_NAME = my-service
OTEL_TRACES_SAMPLE_RATE = 1.0  # sandbox; 0.15 for production

AI Agent Access (Claude Code / Cursor)

Copy-paste this into your agent prompt for observability API access (VPN required):

Agent prompt block
You have access to these observability APIs (VPN required):

Grafana:
  Production: https://grafana.production.tinyfish.io
  Sandbox:    https://grafana.sandbox.tinyfish.io

VictoriaLogs (LogsQL):
  POST https://victorialogs.{env}.tinyfish.io/select/logsql/query
  Body: { "query": "service:myapp AND _time:1h" }

VictoriaMetrics (PromQL):
  GET https://victoriametrics.{env}.tinyfish.io/api/v1/query
  Params: query=rate(myapp_requests_total[5m])

Tempo (TraceQL):
  GET https://tempo.{env}.tinyfish.io/api/search
  Params: q={ resource.service.name = "myapp" }
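Humans aren't the only clients — a script can hit the same APIs. A minimal stdlib-only sketch of building the VictoriaMetrics instant-query URL from the reference above (the actual request is commented out because it needs the VPN; the base URL and metric name are examples):

```python
import json
import urllib.parse
import urllib.request

BASE = "https://victoriametrics.sandbox.tinyfish.io"  # pick your env

def instant_query_url(promql: str) -> str:
    """Build the /api/v1/query URL for an instant PromQL query,
    URL-encoding the query string."""
    params = urllib.parse.urlencode({"query": promql})
    return f"{BASE}/api/v1/query?{params}"

url = instant_query_url("rate(myapp_requests_total[5m])")

# On the VPN, uncomment to actually run the query:
# with urllib.request.urlopen(url) as resp:
#     print(json.loads(resp.read())["data"]["result"])
```

The VictoriaLogs and Tempo endpoints work the same way, with a POST body and a `q` parameter respectively, as shown in the reference block above.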