Observability at TinyFish

Logs · Traces · Metrics — all in one place

We had zero distributed tracing and fragmented logging — each service managing its own ES client. Now we have a unified pipeline with centralized enrichment, PII redaction, and full trace correlation.

~5M spans/day
7 services
3 signal types
1 Grafana instance

Session Agenda

01 · Logs (10 min)
Structured logging, Vector, VictoriaLogs

02 · Traces (15 min)
OpenTelemetry SDK, Tempo, trace correlation

03 · Metrics (5 min)
Host metrics, app metrics, span metrics

04 · Full Picture (5 min)
How all three signals connect

05 · Live Demos (20 min)
Grafana Explore, querying real data

06 · Q&A (5 min)
Questions, discussion, next steps

Logs

Structured Logging

EC2 Host / ECS Task → stdout (JSON) → Vector Agent (daemon · 1/host) → Vector Aggregator (enrich · redact · route · PII redaction) → VictoriaLogs (+ S3 archive) → Grafana (LogsQL queries)

Structured logging is optional

Vector Agent automatically picks up all stdout/stderr from your containers — plain text, JSON, anything — so you get logs in VictoriaLogs with zero code changes. Structured JSON logging just makes them far easier to query: you can filter by service, level, trace_id, or any custom field instead of grepping through raw text.
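To see why the structured form queries better, here is a minimal sketch of JSON-per-line logging using only the Python stdlib. This is illustrative, not the real helper — in practice `tinyfish-common-structlog` does this for you (plus trace_id injection):

```python
import json
import logging
import sys

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""
    def format(self, record):
        entry = {
            "@timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%S"),
            "level": record.levelname.lower(),
            "event": record.getMessage(),
            "service": "my-service",  # in the real helper this is auto-injected
        }
        # Custom fields ride along in record.fields and become queryable keys.
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry)

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
log = logging.getLogger("demo")
log.addHandler(handler)
log.setLevel(logging.INFO)

log.info("query resolved", extra={"fields": {"engine": "google", "latency_ms": 42}})
```

Each emitted line is a self-describing JSON object, so Vector can index `engine` and `latency_ms` as fields instead of leaving them buried in message text.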

Before

Sonar → Direct ES

  • Each service manages ES client
  • Dual Python/TS SDK (Sonar)
  • No centralized pipeline
  • No PII redaction

After

structlog/Pino → stdout → Vector

  • Zero ES deps in services
  • Centralized enrichment & redaction
  • Automatic trace correlation
  • S3 archival for free

Setup

Python (tinyfish-common-structlog)
from tinyfish.common.structlog import configure_structlog
configure_structlog(service_name="my-service")

import structlog
log = structlog.get_logger()
log.info("query resolved", engine="google", latency_ms=42)
TypeScript (Pino)
import pino from "pino";
const logger = pino({
  timestamp: pino.stdTimeFunctions.isoTime,
  base: { service: "my-service" },
});

const reqLogger = logger.child({ request_id: id, trace_id: traceId });
reqLogger.info({ engine: "google" }, "query resolved");

Annotated Log Output

Structured log entry
{
  "@timestamp": "2026-03-15T10:30:00.000Z",
  "level": "info",
  "event": "query resolved",
  "service": "agentql",            // <-- auto-injected by configure_structlog
  "trace_id": "abc123def456...",   // <-- auto-injected by OTel context
  "span_id": "789xyz...",          // <-- auto-injected by OTel context
  "request_id": "req-42",
  "engine": "google",
  "latency_ms": 42
}

LogsQL Cheatsheet

service:agentql AND level:error AND _time:1h
  Errors from agentql in the last hour

trace_id:"abc123def456"
  All logs for a specific trace

service:eva AND request_id:"req-789"
  Logs for a specific request

"timeout" AND "browser" AND _time:30m
  Full-text search across all services

Traces

Distributed Tracing

ECS Task (your app) → OTel SDK (auto: FastAPI, httpx, DB) → OTLP gRPC → OTel Collector (batch · redact) → Tempo (S3-backed · 14d) → Grafana (TraceQL queries)

What is a trace?

A trace follows a single request across services. It contains spans: timed operations nested parent-to-child, all sharing one trace_id.
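The parent-child structure can be sketched in a few lines of plain Python — a toy model, not the OTel SDK, just to show how one trace_id is minted at the edge and propagated to every downstream span:

```python
import uuid
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Span:
    name: str
    trace_id: str                      # shared by every span in the trace
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex[:16])
    parent_id: Optional[str] = None    # links child to parent, forming the tree
    children: List["Span"] = field(default_factory=list)

    def child(self, name: str) -> "Span":
        s = Span(name, self.trace_id, parent_id=self.span_id)
        self.children.append(s)
        return s

# trace_id created once, at the entry point, then inherited downstream
root = Span("frontend", trace_id=uuid.uuid4().hex)
worker = root.child("task-worker")
eva = worker.child("eva.run")

assert root.trace_id == worker.trace_id == eva.trace_id
```

In real services the OTel SDK does this bookkeeping, carrying the trace context across process boundaries in HTTP headers and queue message attributes.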

frontend → task-worker → eva (same trace_id)

Trace Waterfall

trace_id: a1b2c3d4e5f6...

frontend                   50ms
  SQS enqueue               5ms
task-worker                 3.2s
  queue.wait              120ms
  eva.run                   2.8s
    browser.create        400ms
    automation.execute      2.2s
    result.extract        200ms
  result.publish           80ms
OpenTelemetry

Powered by OpenTelemetry (open-source)

Our tracing & metrics pipeline uses the OpenTelemetry open standard — vendor-neutral, CNCF-hosted, supported by every major cloud and APM vendor. The same SDK works with Datadog, New Relic, Jaeger, or any OTLP-compatible backend.

Setup

Python (tinyfish-common-telemetry)
from tinyfish.common.telemetry import init_telemetry, shutdown_telemetry
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.instrumentation.httpx import HTTPXClientInstrumentor

# Startup
init_telemetry("my-service")
FastAPIInstrumentor.instrument_app(app)
HTTPXClientInstrumentor().instrument()

# Shutdown
shutdown_telemetry()

JS/TS — with Sentry v10+

sentry.server.config.ts (Sentry owns TracerProvider)
// sentry.server.config.ts
import * as Sentry from "@sentry/nextjs";
import { BatchSpanProcessor } from "@opentelemetry/sdk-trace-base";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-grpc";

Sentry.init({
  dsn: process.env.SENTRY_DSN,
  tracesSampleRate: 0.15,
  openTelemetrySpanProcessors: [
    new BatchSpanProcessor(
      new OTLPTraceExporter({
        url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT,
      })
    ),
  ],
});

JS/TS — without Sentry

instrumentation.ts (standalone OTel)
import { NodeSDK } from "@opentelemetry/sdk-node";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-grpc";
import { getNodeAutoInstrumentations } from "@opentelemetry/auto-instrumentations-node";

const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter({
    url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT,
  }),
  instrumentations: [getNodeAutoInstrumentations()],
  serviceName: process.env.OTEL_SERVICE_NAME,
});
sdk.start();

TraceQL Cheatsheet

{ resource.service.name = "agentql" }
  All traces from agentql

{ status = error }
  Failed traces

{ resource.service.name = "eva" && duration > 5s }
  Slow eva traces

{ name = "browser.create" && duration > 10s }
  Slow browser-creation spans

Metrics

Metrics Collection

ECS Task → OTel SDK (counters · histograms · gauges) → OTel Collector → prometheusremotewrite → VictoriaMetrics (PromQL · 90d)
EC2 Host → Vector Agent (CPU · memory · disk · network) → Vector Aggregator → VictoriaMetrics
Tempo → span metrics → VictoriaMetrics
VictoriaMetrics → Grafana (PromQL queries)

Automatic (Host)

Vector Agent collects CPU, memory, disk, network. Zero setup required.

Application (OTel SDK)

Counters, histograms, gauges via OTel SDK. Same OTLP endpoint as traces.

Span Metrics (auto)

Auto-generated from traces by Tempo. RED dashboards for free.

Application Metrics (Python)

Python (OpenTelemetry)
from opentelemetry import metrics
meter = metrics.get_meter("my-service")

counter = meter.create_counter("myservice.requests.total")
histogram = meter.create_histogram("myservice.request.duration", unit="ms")

counter.add(1, {"endpoint": "/api/query", "status": "success"})
histogram.record(elapsed_ms, {"endpoint": "/api/query"})

PromQL Cheatsheet

rate(myservice_requests_total[5m])
  Request rate over 5 min

histogram_quantile(0.95, rate(myservice_request_duration_bucket[5m]))
  p95 latency

eva_active_sessions
  Current active sessions gauge

host_memory_total_bytes - host_memory_available_bytes
  Memory usage
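To demystify what `histogram_quantile` actually computes, here is an illustrative Python reimplementation of its bucket interpolation, with made-up bucket counts (not real TinyFish data):

```python
def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.
    buckets: sorted list of (upper_bound_le, cumulative_count),
    ending with (float('inf'), total) -- the '+Inf' bucket."""
    total = buckets[-1][1]
    rank = q * total                  # target cumulative count
    lower, prev_count = 0.0, 0
    for le, count in buckets:
        if count >= rank:
            if le == float("inf"):
                return lower          # cap at the last finite bound
            # linear interpolation inside the containing bucket
            return lower + (le - lower) * (rank - prev_count) / (count - prev_count)
        lower, prev_count = le, count

# 100 requests: 60 took <50ms, 90 took <100ms, all 100 took <250ms
buckets = [(50, 60), (100, 90), (250, 100), (float("inf"), 100)]
p95 = histogram_quantile(0.95, buckets)  # rank 95 lands in the 100-250ms bucket
```

Because the answer is interpolated within a bucket, p95 accuracy depends on how fine your histogram buckets are around the latency range you care about.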
Full Picture

Putting It All Together

Three signals, one pipeline, one destination. Here's how logs, traces, and metrics flow from your service to Grafana.

Logs:    ECS Task → stdout → Vector Agent → Vector Aggregator → VictoriaLogs (+ S3 archive)
Traces:  ECS Task → OTel SDK → OTel Collector → Tempo
Metrics: OTel SDK → OTel Collector → VictoriaMetrics; Vector Agent → Vector Aggregator → VictoriaMetrics
All three → Grafana (unified view)

Key Insights

Zero code changes for logs — Vector Agent picks up stdout automatically

Traces require OTel SDK — one-time setup per service

Pipeline failure never impacts your app — all fire-and-forget

Live Demos

Hands-on Walkthrough

1

Find logs for a service

service:agentql AND _time:1h

Open VictoriaLogs in Grafana Explore, paste the LogsQL query.

2

Find error logs

service:eva AND level:error AND _time:24h

Filter by level to see only errors from a specific service.

3

Trace a request

copy trace_id from log -> paste in Tempo

Click a log entry, copy trace_id, switch to Tempo datasource, paste it.

4

Cross-service trace

frontend -> task-worker -> eva

Full request lifecycle across multiple services in one waterfall.

5

Log <-> Trace correlation

click from trace to logs and back

Use the 'Logs for this span' link in Tempo, or the trace link in logs.

6

Query metrics

eva_active_sessions, host CPU/memory

Switch to VictoriaMetrics datasource and run PromQL queries.

Quick Reference

Cheatsheet

ECS Environment Variables (new services)

ECS task definition env vars
OTEL_EXPORTER_OTLP_ENDPOINT = https://otel-collector.${env}.tinyfish.io:4317
OTEL_SERVICE_NAME = my-service
OTEL_TRACES_SAMPLE_RATE = 1.0  # sandbox; 0.15 for production

AI Agent Access (Claude Code / Cursor)

Copy-paste this into your agent prompt for observability API access (VPN required):

Agent prompt block
You have access to these observability APIs (VPN required):

Grafana:
  Production: https://grafana.production.tinyfish.io
  Sandbox:    https://grafana.sandbox.tinyfish.io

VictoriaLogs (LogsQL):
  POST https://victorialogs.{env}.tinyfish.io/select/logsql/query
  Body: { "query": "service:myapp AND _time:1h" }

VictoriaMetrics (PromQL):
  GET https://victoriametrics.{env}.tinyfish.io/api/v1/query
  Params: query=rate(myapp_requests_total[5m])

Tempo (TraceQL):
  GET https://tempo.{env}.tinyfish.io/api/search
  Params: q={ resource.service.name = "myapp" }
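Humans aren't the only clients — a script can hit the same APIs. A minimal stdlib-only sketch of building the VictoriaMetrics instant-query URL from the reference above (the actual request is commented out because it needs the VPN; the base URL and metric name are examples):

```python
import json
import urllib.parse
import urllib.request

BASE = "https://victoriametrics.sandbox.tinyfish.io"  # pick your env

def instant_query_url(promql: str) -> str:
    """Build the /api/v1/query URL for an instant PromQL query,
    URL-encoding the query string."""
    params = urllib.parse.urlencode({"query": promql})
    return f"{BASE}/api/v1/query?{params}"

url = instant_query_url("rate(myapp_requests_total[5m])")

# On the VPN, uncomment to actually run the query:
# with urllib.request.urlopen(url) as resp:
#     print(json.loads(resp.read())["data"]["result"])
```

The VictoriaLogs and Tempo endpoints work the same way, with a POST body and a `q` parameter respectively, as shown in the reference block above.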