Deploying & Operating
This guide covers how to deploy and operate a conforming OpenTabletop server in production. It assumes you have already built a server following the Implementer’s Guide and are ready to run it.
The guide is practical – every section includes concrete examples you can adapt. For the rationale behind these patterns, see the Infrastructure & Implementation Guidance ADRs.
Local Development Stack
Start with a local stack that mirrors production. This docker-compose.yml runs PostgreSQL with schema auto-loaded on first startup:
# docker-compose.yml
services:
postgres:
image: postgres:17-alpine
environment:
POSTGRES_DB: opentabletop
POSTGRES_USER: ot
POSTGRES_PASSWORD: localdev
ports:
- "5432:5432"
volumes:
- pgdata:/var/lib/postgresql/data
- ./data/schema.sql:/docker-entrypoint-initdb.d/01-schema.sql
- ./data/seed.sql:/docker-entrypoint-initdb.d/02-seed.sql
volumes:
pgdata:
For production observability, add an OpenTelemetry collector (ADR-0023):
otel-collector:
image: otel/opentelemetry-collector-contrib:latest
ports:
- "4317:4317" # gRPC OTLP receiver
- "4318:4318" # HTTP OTLP receiver
- "8889:8889" # Prometheus exporter
volumes:
- ./otel-config.yaml:/etc/otelcol-contrib/config.yaml
Start it and load sample data:
docker compose up -d
node scripts/load-samples.js --connection "postgresql://ot:localdev@localhost/opentabletop"
Verify:
curl http://localhost:8080/v1/games/spirit-island
curl http://localhost:8080/healthz
Container Image
Build your server as a multi-stage container image. The pattern is language-agnostic – replace the build stage with your stack:
# Stage 1: Build
FROM node:22-alpine AS build
WORKDIR /app
COPY package*.json ./
RUN npm ci --production
COPY . .
RUN npm run build
# Stage 2: Runtime (distroless -- see ADR-0021)
FROM gcr.io/distroless/nodejs22-debian12
WORKDIR /app
COPY --from=build /app/dist ./dist
COPY --from=build /app/node_modules ./node_modules
EXPOSE 8080
CMD ["dist/server.js"]
Replace the Node.js images above with the latest stable version for your language (rust:latest, golang:latest, python:3-slim, etc.). Pin to a major version tag rather than a specific patch to keep builds reproducible without going stale.
Key requirements:
- Distroless base (ADR-0021) – No shell, no package manager, minimal attack surface. Target < 50MB final image.
- Health endpoints – Expose
/healthz(liveness: “process is alive”) and/readyz(readiness: “can serve traffic, database connected”). - Tag by git SHA, never
latest(ADR-0024):
docker build -t ghcr.io/your-org/opentabletop:$(git rev-parse --short HEAD) .
docker push ghcr.io/your-org/opentabletop:$(git rev-parse --short HEAD)
Configuration (12-Factor)
All configuration comes from environment variables (ADR-0020). Never bake secrets into images.
| Variable | Required | Default | Description |
|---|---|---|---|
DATABASE_URL | Yes | – | PostgreSQL connection string |
PORT | No | 8080 | HTTP listen port |
REDIS_URL | No | – | Redis connection string (cache layer) |
LOG_LEVEL | No | info | debug, info, warn, error |
LOG_FORMAT | No | json | json for production, pretty for local dev |
OTEL_EXPORTER_OTLP_ENDPOINT | No | – | OpenTelemetry collector endpoint |
OTEL_SERVICE_NAME | No | opentabletop | Service name in traces/metrics |
API_KEY_SALT | Yes | – | HMAC salt for API key hashing |
RATE_LIMIT_ANONYMOUS | No | 60 | Requests/minute for unauthenticated clients |
RATE_LIMIT_AUTHENTICATED | No | 600 | Requests/minute for API key holders |
CORS_ORIGINS | No | * | Comma-separated allowed origins |
For Kubernetes, use a ConfigMap for non-sensitive values and Secrets for credentials:
apiVersion: v1
kind: ConfigMap
metadata:
name: opentabletop-config
data:
PORT: "8080"
LOG_LEVEL: "info"
LOG_FORMAT: "json"
OTEL_SERVICE_NAME: "opentabletop"
RATE_LIMIT_ANONYMOUS: "60"
RATE_LIMIT_AUTHENTICATED: "600"
Kubernetes Deployment
Complete manifests for a production deployment. Adjust resource requests and replica counts for your scale.
Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
name: opentabletop
spec:
replicas: 3
strategy:
type: RollingUpdate
rollingUpdate:
maxUnavailable: 1
maxSurge: 1
selector:
matchLabels:
app: opentabletop
template:
metadata:
labels:
app: opentabletop
spec:
terminationGracePeriodSeconds: 30
containers:
- name: server
image: ghcr.io/your-org/opentabletop:abc1234
ports:
- containerPort: 8080
envFrom:
- configMapRef:
name: opentabletop-config
- secretRef:
name: opentabletop-secrets
resources:
requests:
cpu: 250m
memory: 256Mi
limits:
cpu: "1"
memory: 512Mi
livenessProbe:
httpGet:
path: /healthz
port: 8080
initialDelaySeconds: 5
periodSeconds: 10
readinessProbe:
httpGet:
path: /readyz
port: 8080
initialDelaySeconds: 5
periodSeconds: 5
Service, Ingress, HPA
apiVersion: v1
kind: Service
metadata:
name: opentabletop
spec:
selector:
app: opentabletop
ports:
- port: 80
targetPort: 8080
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: opentabletop
annotations:
cert-manager.io/cluster-issuer: letsencrypt-prod
spec:
tls:
- hosts: [api.yourdomain.com]
secretName: opentabletop-tls
rules:
- host: api.yourdomain.com
http:
paths:
- path: /
pathType: Prefix
backend:
service:
name: opentabletop
port:
number: 80
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: opentabletop
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: opentabletop
minReplicas: 2
maxReplicas: 10
metrics:
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 70
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
name: opentabletop
spec:
minAvailable: 1
selector:
matchLabels:
app: opentabletop
Graceful Shutdown
Your server must handle SIGTERM by:
- Stop accepting new connections
- Finish in-flight requests (within
terminationGracePeriodSeconds) - Close database connections
- Exit cleanly
This ensures zero-downtime rolling updates. The readinessProbe failure during shutdown prevents new traffic from routing to a terminating pod.
Database Operations
PostgreSQL Setup
Recommended version: PostgreSQL 16+ (improved JSONB performance, parallel query improvements, and MERGE support).
The recommended schema creates all tables, indexes, and the full-text search tsvector column. For production, also consider:
- Connection pooling – Use PgBouncer or your framework’s built-in pool. Target 10-20 connections per pod, sized to your PostgreSQL
max_connections. - Read replicas – Route
/export/gamesand trend endpoints to replicas. These are read-heavy, long-running queries that should not compete with interactive API requests.
Migrations
Follow the ADR-0029 naming convention:
migrations/
0001_initial_schema.sql
0002_add_experience_playtime.sql
0003_add_game_snapshots.sql
Use any SQL-native migration runner (golang-migrate, Flyway, dbmate, or plain psql). Never use ORM-generated migrations – the schema is the spec’s recommended design and should be maintained as explicit SQL.
Index Tuning
The compound filtering workload (Pillar 2) drives these critical indexes:
-- Mechanic filtering (AND/OR/NOT queries join through this table)
CREATE INDEX idx_game_mechanics_mechanic ON game_mechanics(mechanic_id);
-- Compound sort (most common: rating desc with minimum vote count)
CREATE INDEX idx_games_rating ON games(rating DESC, rating_votes DESC);
-- Full-text search (weighted: name > short desc > full desc)
CREATE INDEX idx_games_search ON games USING GIN(search_vector);
-- Year-based trend queries
CREATE INDEX idx_games_year ON games(year_published);
Monitor slow queries with pg_stat_statements and run EXPLAIN ANALYZE on your compound filter queries to verify index usage.
Backup
- Automated daily backups with point-in-time recovery (PITR) using WAL archiving
- Test restores monthly – a backup you have not restored is not a backup
- Managed services (AWS RDS, Cloud SQL, Supabase) handle this automatically
Materialization Jobs
The Game entity’s aggregate fields (rating, rating_confidence, rank_overall, etc.) are materialized from raw input data – not computed on every API request. See Materialization for the full architectural rationale. In production, a scheduled job performs this materialization.
Execution Order
The materialization must run in dependency order:
- Per-game aggregates first –
rating,rating_votes,rating_distribution,rating_stddev,weight,weight_votes,community_playtime_*,community_suggested_age,owner_count,wishlist_count,total_plays, experience multipliers - Global parameters second – Compute the global mean rating from the freshly-updated per-game averages
- Rating confidence third –
rating_confidenceis the spec-level trust signal (rating-model.md Layer 3), computed from sample size, distribution shape, and deviation from global mean - Rankings fourth –
rank_overallcomputed using an implementation-chosen ranking method (Layer 4 recommends Dirichlet-prior Bayesian) - Derived fields fifth –
top_player_counts,recommended_player_countsfrom per-count ratings - Snapshots last – Write
GameSnapshotrows as a side effect of materialization (ADR-0036)
Steps 1-3 can be parallelized per-game. Step 4 is a single global sort.
Idempotency
The materialization job must be idempotent – safe to re-run at any time without producing incorrect results. Each run reads the current Tier 1 data and overwrites Tier 2 fields. Running the job twice with no new votes produces identical output. This means you can safely retry on failure, run manually for debugging, or trigger an extra run after a bulk data import.
Kubernetes CronJob
Schedule the materialization as a Kubernetes CronJob that runs daily at a low-traffic hour:
apiVersion: batch/v1
kind: CronJob
metadata:
name: opentabletop-materialize
spec:
schedule: "0 4 * * *" # Daily at 04:00 UTC
concurrencyPolicy: Forbid # Never run two at once
successfulJobsHistoryLimit: 3
failedJobsHistoryLimit: 5
jobTemplate:
spec:
backoffLimit: 2
activeDeadlineSeconds: 3600 # Kill if stuck for 1 hour
template:
spec:
restartPolicy: OnFailure
containers:
- name: materialize
image: ghcr.io/your-org/opentabletop:abc1234
command: ["node", "dist/jobs/materialize.js"]
envFrom:
- configMapRef:
name: opentabletop-config
- secretRef:
name: opentabletop-secrets
resources:
requests:
cpu: 500m
memory: 512Mi
limits:
cpu: "2"
memory: 1Gi
Key settings:
concurrencyPolicy: Forbidprevents overlapping runs if a previous job is still in progress.activeDeadlineSecondskills a stuck job before the next scheduled run.backoffLimit: 2retries on transient failures (safe because the job is idempotent).
Replace the command with your implementation’s materialization entrypoint (e.g., cargo run --bin materialize for Rust, python -m opentabletop.materialize for Python).
Manual Trigger
The demo API exposes POST /v1/admin/materialize for manual triggering – useful after bulk data imports or during development. This endpoint runs the same logic as the cron job. In production, protect it with admin-only authentication.
Observability
Instrument your server with OpenTelemetry (ADR-0023). The three signals:
Metrics (Prometheus)
Expose a /metrics endpoint with at minimum:
| Metric | Type | Description |
|---|---|---|
http_requests_total | Counter | Total requests by method, path, status |
http_request_duration_seconds | Histogram | Request latency by endpoint |
db_query_duration_seconds | Histogram | Database query latency |
db_connections_active | Gauge | Current active DB connections |
cache_hits_total / cache_misses_total | Counter | Redis cache effectiveness |
export_requests_active | Gauge | In-flight bulk export streams |
Logs (Structured JSON)
Every log line should include:
{"level":"info","msg":"request completed","method":"GET","path":"/v1/games","status":200,"duration_ms":42,"trace_id":"abc123"}
The trace_id correlates logs to distributed traces. Use a log aggregator (Loki, CloudWatch, Datadog) to search and alert.
Traces (OTLP)
Configure the OTEL SDK to export traces to your collector. Instrument at minimum:
- HTTP handler spans (automatic with most frameworks)
- Database query spans (manual or via instrumented client)
- Cache lookup spans
For production, sample at 10-25% to control costs while maintaining visibility into tail latency.
Suggested Alerts
| Alert | Condition | Severity |
|---|---|---|
| High error rate | 5xx rate > 1% for 5 minutes | Critical |
| Slow responses | p95 latency > 2s for 5 minutes | Warning |
| DB connections saturated | Active > 80% of max for 5 minutes | Warning |
| Export queue backing up | Active exports > 5 for 10 minutes | Warning |
| Pod restarts | > 3 restarts in 15 minutes | Critical |
Sizing & Capacity Planning
Reference sizing for a dataset of ~100,000 games with full taxonomy, polls, and snapshots:
| Tier | Pods | CPU/Pod | Memory/Pod | PostgreSQL | Redis | Expected QPS |
|---|---|---|---|---|---|---|
| Small (hobby) | 1 | 0.5 vCPU | 512 MB | Micro (2 vCPU, 1 GB) | None | < 10 |
| Medium (community) | 2-3 | 1 vCPU | 1 GB | Small (2 vCPU, 2 GB) | Micro | 10-100 |
| Large (platform) | 3-10 | 2 vCPU | 2 GB | Large (2 vCPU, 16 GB) + replica | Large | 100-1000 |
Storage Growth
| Data Type | Per Game | 100k Games | Growth Rate |
|---|---|---|---|
| Game records | ~2 KB | ~200 MB | Slow (new publications) |
| Player count ratings | ~200 B | ~20 MB | Moderate (new votes) |
| Game snapshots (monthly) | ~500 B x 12/yr | ~600 MB/year | Linear |
| Full-text index | ~1 KB | ~100 MB | Tracks game records |
| Total (year 1) | ~1 GB |
PostgreSQL handles this comfortably. You will hit CPU limits on compound filtering queries long before you hit storage limits.
Security Checklist
- TLS everywhere – Terminate at ingress or load balancer. No plaintext HTTP in production.
- API keys hashed – Store HMAC hashes, not plaintext keys (ADR-0016).
- Rate limiting enforced – 60/min anonymous, 600/min authenticated.
-
CORS restricted – Set
CORS_ORIGINSto your known frontends. Do not leave*in production. - Database not publicly accessible – Private subnets, security groups, or network policies.
-
Secrets in Secret store –
DATABASE_URLandAPI_KEY_SALTin Kubernetes Secrets (or Vault/SOPS), never in ConfigMap or image. - Container image scanned – Run Trivy or Grype in CI. Block deployment on critical CVEs.
- No shell in runtime image – Distroless images have no shell to exploit.
Production Checklist
Before going live:
- Schema migrations applied and verified
- Sample data or real data loaded and queryable
-
Health endpoints (
/healthz,/readyz) responding -
Metrics endpoint (
/metrics) scraped by Prometheus - Structured JSON logs flowing to aggregator
- Traces sampled and visible in backend
- HPA configured and tested under load
-
PodDisruptionBudget set (
minAvailable: 1) - Database backups configured and restore tested
- TLS certificate provisioned and auto-renewing
- Rate limiting verified (test both anonymous and authenticated tiers)
- Rolling update tested (deploy a new image, verify zero downtime)
- Rollback tested (revert to previous SHA, verify recovery)
- Alert rules configured and notification channel verified