Observability at TinyFish
Logs · Traces · Metrics — all in one place
We had zero distributed tracing and fragmented logging — each service managing its own ES client. Now we have a unified pipeline with centralized enrichment, PII redaction, and full trace correlation.
Session Agenda
Logs
Structured logging, Vector, VictoriaLogs
Traces
OpenTelemetry SDK, Tempo, trace correlation
Metrics
Host metrics, app metrics, span metrics
Full Picture
How all three signals connect
Live Demos
Grafana Explore, querying real data
Q&A
Questions, discussion, next steps
Structured Logging
Structured logging is optional
Vector Agent picks up all stdout/stderr from your containers automatically — plain text, JSON, anything. You get logs in VictoriaLogs with zero code changes. Structured JSON logging just makes them much easier to query: you can filter by service, level, trace_id, and any custom field instead of grepping raw text.
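To make the benefit concrete, here is a stdlib-only sketch (no structlog required; "my-service" is a placeholder) contrasting a plain-text log line with a structured one:

```python
import json
import sys
from datetime import datetime, timezone

def log_plain(message: str) -> None:
    # Plain text: fine for humans, painful to filter in VictoriaLogs.
    print(f"{datetime.now(timezone.utc).isoformat()} INFO {message}", file=sys.stdout)

def log_json(event: str, **fields) -> None:
    # Structured JSON: every field becomes queryable (service:..., level:...).
    record = {
        "@timestamp": datetime.now(timezone.utc).isoformat(),
        "level": "info",
        "event": event,
        "service": "my-service",  # placeholder; structlog injects this for real
        **fields,
    }
    print(json.dumps(record), file=sys.stdout)

log_plain("query resolved engine=google latency_ms=42")
log_json("query resolved", engine="google", latency_ms=42)
```

Both lines reach VictoriaLogs either way; only the second one lets you write `service:my-service AND engine:google`.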
Before
Sonar → Direct ES
- Each service manages ES client
- Dual Python/TS SDK (Sonar)
- No centralized pipeline
- No PII redaction
After
structlog/Pino → stdout → Vector
- Zero ES deps in services
- Centralized enrichment & redaction
- Automatic trace correlation
- S3 archival for free
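The centralized redaction step can be pictured as a Vector remap transform. This is an illustrative sketch only — the source, transform names, email regex, and console sink are assumptions for the example, not our actual pipeline config:

```toml
[sources.app_logs]
type = "demo_logs"       # stand-in source; the real agent tails container stdout
format = "json"

[transforms.redact_pii]
type   = "remap"
inputs = ["app_logs"]
source = '''
# Replace anything matching an email-like pattern with [REDACTED]
.message = redact(string!(.message), filters: [r'[\w.+-]+@[\w-]+\.[\w.]+'])
'''

[sinks.out]
type   = "console"       # the real pipeline ships to VictoriaLogs instead
inputs = ["redact_pii"]
encoding.codec = "json"
```

Because this runs once in the pipeline, no service ever has to implement its own redaction.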
Setup
from tinyfish.common.structlog import configure_structlog
configure_structlog(service_name="my-service")
import structlog
log = structlog.get_logger()
log.info("query resolved", engine="google", latency_ms=42)

import pino from "pino";
const logger = pino({
timestamp: pino.stdTimeFunctions.isoTime,
base: { service: "my-service" },
});
const reqLogger = logger.child({ request_id: id, trace_id: traceId });
reqLogger.info({ engine: "google" }, "query resolved");

Annotated Log Output
{
"@timestamp": "2026-03-15T10:30:00.000Z",
"level": "info",
"event": "query resolved",
"service": "agentql", // <-- auto-injected by configure_structlog
"trace_id": "abc123def456...", // <-- auto-injected by OTel context
"span_id": "789xyz...", // <-- auto-injected by OTel context
"request_id": "req-42",
"engine": "google",
"latency_ms": 42
}

LogsQL Cheatsheet
service:agentql AND level:error AND _time:1h
  Errors from agentql in the last hour
trace_id:"abc123def456"
  All logs for a specific trace
service:eva AND request_id:"req-789"
  Logs for a specific request
"timeout" AND "browser" AND _time:30m
  Full-text search across all services

Distributed Tracing
What is a trace?
A trace tracks a single request across services. It contains spans — timed operations that nest parent-child, all sharing one trace_id.
Trace Waterfall
trace_id: a1b2c3d4e5f6...
Powered by OpenTelemetry (open-source)
Our tracing & metrics pipeline uses the OpenTelemetry open standard — vendor-neutral, CNCF-hosted, supported by every major cloud and APM vendor. The same SDK works with Datadog, New Relic, Jaeger, or any OTLP-compatible backend.
Setup
from tinyfish.common.telemetry import init_telemetry, shutdown_telemetry
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.instrumentation.httpx import HTTPXClientInstrumentor
# Startup
init_telemetry("my-service")
FastAPIInstrumentor.instrument_app(app)
HTTPXClientInstrumentor().instrument()
# Shutdown
shutdown_telemetry()

JS/TS — with Sentry v10+
// sentry.server.config.ts
import * as Sentry from "@sentry/nextjs";
import { BatchSpanProcessor } from "@opentelemetry/sdk-trace-base";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-grpc";
Sentry.init({
dsn: process.env.SENTRY_DSN,
tracesSampleRate: 0.15,
openTelemetrySpanProcessors: [
new BatchSpanProcessor(
new OTLPTraceExporter({
url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT,
})
),
],
});

JS/TS — without Sentry
import { NodeSDK } from "@opentelemetry/sdk-node";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-grpc";
import { getNodeAutoInstrumentations } from "@opentelemetry/auto-instrumentations-node";
const sdk = new NodeSDK({
traceExporter: new OTLPTraceExporter({
url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT,
}),
instrumentations: [getNodeAutoInstrumentations()],
serviceName: process.env.OTEL_SERVICE_NAME,
});
sdk.start();

TraceQL Cheatsheet
{ resource.service.name = "agentql" }
  All traces from agentql
{ status = error }
  Failed traces
{ resource.service.name = "eva" && duration > 5s }
  Slow eva traces
{ name = "browser.create" && duration > 10s }
  Slow browser-creation spans

Metrics Collection
Automatic (Host)
Vector Agent collects CPU, memory, disk, network. Zero setup required.
Application (OTel SDK)
Counters, histograms, gauges via OTel SDK. Same OTLP endpoint as traces.
Span Metrics (auto)
Auto-generated from traces by Tempo. RED dashboards for free.
Application Metrics (Python)
from opentelemetry import metrics
meter = metrics.get_meter("my-service")
counter = meter.create_counter("myservice.requests.total")
histogram = meter.create_histogram("myservice.request.duration", unit="ms")
counter.add(1, {"endpoint": "/api/query", "status": "success"})
histogram.record(elapsed_ms, {"endpoint": "/api/query"})

PromQL Cheatsheet
rate(myservice_requests_total[5m])
  Request rate over 5 min
histogram_quantile(0.95, rate(myservice_request_duration_bucket[5m]))
  p95 latency
eva_active_sessions
  Current active-sessions gauge
host_memory_total_bytes - host_memory_available_bytes
  Memory usage

Putting It All Together
Three signals, one pipeline, one destination. Here's how logs, traces, and metrics flow from your service to Grafana.
Key Insights
Zero code changes for logs — Vector Agent picks up stdout automatically
Traces require OTel SDK — one-time setup per service
Pipeline failure never impacts your app — all fire-and-forget
Find logs for a service
service:agentql AND _time:1h
Open VictoriaLogs in Grafana Explore and paste the LogsQL query.
Find error logs
service:eva AND level:error AND _time:24h
Filter by level to see only errors from a specific service.
Trace a request
copy trace_id from log -> paste in Tempo
Click a log entry, copy trace_id, switch to the Tempo datasource, paste it.
Cross-service trace
frontend -> task-worker -> eva
Full request lifecycle across multiple services in one waterfall.
Log <-> Trace correlation
click from trace to logs and back
Use the 'Logs for this span' link in Tempo, or the trace link in logs.
Query metrics
eva_active_sessions, host CPU/memory
Switch to the VictoriaMetrics datasource and run PromQL queries.
Cheatsheet
Grafana URLs
ECS Environment Variables (new services)
OTEL_EXPORTER_OTLP_ENDPOINT = https://otel-collector.${env}.tinyfish.io:4317
OTEL_SERVICE_NAME = my-service
OTEL_TRACES_SAMPLE_RATE = 1.0   # sandbox; 0.15 for production

AI Agent Access (Claude Code / Cursor)
Copy-paste this into your agent prompt for observability API access (VPN required):
You have access to these observability APIs (VPN required):
Grafana:
Production: https://grafana.production.tinyfish.io
Sandbox: https://grafana.sandbox.tinyfish.io
VictoriaLogs (LogsQL):
POST https://victorialogs.{env}.tinyfish.io/select/logsql/query
Body: { "query": "service:myapp AND _time:1h" }
VictoriaMetrics (PromQL):
GET https://victoriametrics.{env}.tinyfish.io/api/v1/query
Params: query=rate(myapp_requests_total[5m])
Tempo (TraceQL):
GET https://tempo.{env}.tinyfish.io/api/search
Params: q={ resource.service.name = "myapp" }
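As a sketch of how an agent (or a script) might call these endpoints from Python — stdlib only; `myapp` and the env are placeholders, and actually sending the requests requires the VPN:

```python
import json
import urllib.parse
import urllib.request

ENV = "sandbox"  # or "production"

# Build a VictoriaMetrics instant query (PromQL), matching the cheatsheet above.
vm_url = (
    f"https://victoriametrics.{ENV}.tinyfish.io/api/v1/query?"
    + urllib.parse.urlencode({"query": "rate(myapp_requests_total[5m])"})
)

# Build a VictoriaLogs query (LogsQL) as a POST with a JSON body.
vl_req = urllib.request.Request(
    f"https://victorialogs.{ENV}.tinyfish.io/select/logsql/query",
    data=json.dumps({"query": "service:myapp AND _time:1h"}).encode(),
    headers={"Content-Type": "application/json"},
    method="POST",
)

# On the VPN, either request can then be sent with urllib.request.urlopen(...)
print(vm_url)
```

The same pattern extends to Tempo's /api/search endpoint with a `q` parameter holding the TraceQL expression.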