
Multi-source data

🚧 Roadmap. The federation + sync experience lands in DB-M1 (Postgres), with MySQL and MongoDB following in DB-M2 / DB-M3.

This page is the operational counterpart to the conceptual Multi-source data. Read that one first for the why; come here for the curl commands, the SQL syntax, and the status surface.

The model in one sentence: every external source registers as a DataFusion catalog (federation), replays into kyma extents via CDC (sync), or both. Either way it gets the same connection, the same schema introspection, and the same status row, on the same connector framework as Prometheus.

Register a source

bash
curl -sS -X POST http://localhost:8080/v1/connectors \
  -H "Content-Type: application/json" \
  --data-binary @- <<'JSON'
{
  "name": "pg_prod",
  "type": "postgres",
  "mode": "both",
  "connection": {
    "url":         "postgres://app@prod-rds.example.com:5432/app",
    "secret_ref":  "$env:PG_PROD_PASSWORD",
    "tls":         "required",
    "pool_size":   10
  },
  "scope": {
    "include_schemas": ["public", "billing"]
  },
  "sync": {
    "tables": ["public.users", "billing.invoices"]
  }
}
JSON

Response: {"id": "<uuid>"}. After this:

  • The pg_prod.public.* and pg_prod.billing.* namespaces are valid in any KQL or SQL query; federated reads run live against the source.
  • public.users and billing.invoices have started snapshotting into kyma extents. Once the initial snapshot is done the connector advances to phase=streaming and stays there.

Per-engine config and the type-mapping tables live on each engine page: Postgres, MySQL, MongoDB.

The admin API runs validate_secrets_resolvable before persisting, so a typo'd secret_ref fails at create time, not silently in the runner.
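A minimal sketch of what that check implies for the `$env:` scheme shown above. The function name and error shapes are hypothetical; the point is that an unresolvable reference fails loudly before the connector is persisted:

```python
import os

def resolve_secret_ref(ref: str) -> str:
    """Resolve a secret_ref of the form "$env:VAR" to its value.

    Illustrative stand-in for validate_secrets_resolvable: a typo'd
    variable name raises at create time instead of surfacing later
    as a connection failure in the runner.
    """
    prefix = "$env:"
    if not ref.startswith(prefix):
        raise ValueError(f"unsupported secret_ref scheme: {ref!r}")
    var = ref[len(prefix):]
    value = os.environ.get(var)
    if value is None:
        raise ValueError(f"secret_ref {ref!r}: environment variable {var!r} is not set")
    return value
```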

Query it

For sources registered with mode: "federation", references resolve to the live source. For mode: "sync", they resolve to the kyma extents the CDC pipeline lands rows in. For mode: "both", bare references resolve to the synced extent (predictable, fast); live(table) opts into the federated path:

sql
-- Synced read: sub-second, no source load.
SELECT * FROM pg_prod.public.users WHERE id = 42;

-- Federated read: fresh-as-of-now, costs the source a query.
SELECT * FROM live(pg_prod.public.users) WHERE id = 42;

-- Cross-source join: federated small side, kyma big side.
SELECT u.email, COUNT(*) AS errors
  FROM pg_prod.public.users u
  JOIN otel_logs l ON l.user_id = u.id
 WHERE l.severity_text = 'ERROR'
   AND l._timestamp > now() - INTERVAL '1 hour'
 GROUP BY u.email
 ORDER BY errors DESC
 LIMIT 5;

Use live(...) when you need read-after-write freshness, e.g. right after a transaction the agent triggered. Default to the synced path for everything else: sync lag is bounded (typically seconds), and you avoid hitting the source for routine queries.

The pushdown_summary

Every response that touched a federated source carries a pushdown_summary array, one entry per FederatedScan:

json
{
  "source": "pg_prod",
  "table": "public.users",
  "filters_pushed": ["region = $1"],
  "filters_residual": [],
  "projection_pushed": true,
  "limit_pushed": 50,
  "sort_pushed": null,
  "agg_pushed": null,
  "agg_residual_reason": "cross-source group-by",
  "join_pushed": false,
  "scan_duration_ms": 14,
  "rows_returned": 50,
  "bytes_received": 4128
}

This is the trust mechanism for federation. If a query is slow, the summary tells you exactly which filter went residual and why, so you can rewrite the query, change the source schema (e.g., a MySQL collation; see MySQL), or register a different mode.
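A small triage helper, sketched against the field names in the example above. The thresholds and the exact message strings are made up; the idea is that residual filters and residual aggregations are the first things to scan for:

```python
def triage_pushdown(summary: list[dict]) -> list[str]:
    """Flag FederatedScans worth a second look.

    Illustrative only: field names follow the pushdown_summary shape
    documented above; the 1s slow-scan threshold is arbitrary.
    """
    findings = []
    for scan in summary:
        where = f"{scan['source']}.{scan['table']}"
        if scan.get("filters_residual"):
            findings.append(f"{where}: residual filters {scan['filters_residual']}")
        if scan.get("agg_residual_reason"):
            findings.append(f"{where}: aggregation stayed local ({scan['agg_residual_reason']})")
        if scan.get("scan_duration_ms", 0) > 1000:
            findings.append(f"{where}: slow scan ({scan['scan_duration_ms']} ms)")
    return findings
```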

The CI suite asserts pushdown_summary is non-degrading on a curated set of queries, so accidental planner regressions get caught at PR time, not in production.

What gets pushed

The exact list lives on each engine page; the shared rules are:

  • Always pushed: projection, =/!=/</<=/>/>=/IN/IS NULL/LIKE, AND/OR/NOT trees, LIMIT, ORDER BY, single-source aggregations.
  • Pushed when safe: same-source same-connection joins; verified scalar functions (LOWER, UPPER, COALESCE, date_trunc).
  • Never pushed: kyma UDFs (cosine_distance, dynamic accessors), cross-source joins, window functions, anything where source semantics diverge from DataFusion's (most notably MySQL case-insensitive collations).

If the planner is uncertain, it leaves the expression in the residual. Slow-but-right beats fast-and-wrong, and every pushdown rule has a property test that runs the same query both ways and asserts identical Arrow output.
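The both-ways property test can be sketched as a harness-agnostic skeleton. `run_query` is a hypothetical stand-in for whatever executes the plan with pushdown forced on or off; the real suite compares Arrow batches, while this sketch compares row sets order-insensitively:

```python
def assert_pushdown_equivalent(run_query, sql: str) -> None:
    """Run the same query with and without pushdown and require
    identical results.

    run_query(sql, pushdown=...) is an assumed test-harness hook,
    not a kyma API. Sorting makes the comparison order-insensitive,
    since pushdown may change row order when no ORDER BY is present.
    """
    pushed = run_query(sql, pushdown=True)
    residual = run_query(sql, pushdown=False)
    assert sorted(map(tuple, pushed)) == sorted(map(tuple, residual)), (
        f"pushdown changed results for: {sql}"
    )
```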

Status, health, and observability

GET /v1/connectors/:id/status returns a structured doc per source covering federation pool stats, sync phase, lag in seconds, the last error, and any schema drift:

json
{
  "id": "...",
  "type": "postgres",
  "mode": "both",
  "source":     { "reachable": true, "version": "PG 15.4", ... },
  "federation": { "status": "healthy", "pool_in_use": 2, "pool_max": 10,
                  "p50_query_ms": 14, "p99_query_ms": 230, ... },
  "sync":       { "status": "streaming", "lag_seconds": 4,
                  "events_per_sec": 1200, "rows_synced": 5240000, ... }
}
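If you want to inspect one of these docs in a script, the interesting conditions reduce to a few field checks. This is a sketch against the field names in the example above; the thresholds, and the idea of evaluating it client-side at all (rather than querying kyma_connector_health, below), are illustrative:

```python
def connector_alerts(status: dict, max_lag_seconds: int = 60) -> list[str]:
    """Derive noteworthy conditions from a /v1/connectors/:id/status doc.

    Field names mirror the example response; the thresholds are
    arbitrary and not part of the kyma API.
    """
    alerts = []
    if not status.get("source", {}).get("reachable", False):
        alerts.append("source unreachable")
    fed = status.get("federation", {})
    if fed and fed.get("pool_in_use", 0) >= fed.get("pool_max", 1):
        alerts.append("federation pool exhausted")
    sync = status.get("sync", {})
    if sync and sync.get("lag_seconds", 0) > max_lag_seconds:
        alerts.append(f"sync lag {sync['lag_seconds']}s exceeds {max_lag_seconds}s")
    return alerts
```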

The same data is exposed as a queryable kyma table:

sql
SELECT connector_name, mode, phase, lag_seconds, events_per_sec
  FROM kyma_connector_health
 WHERE lag_seconds > 60;

Schema:

| Column         | Type      | Description                                       |
| -------------- | --------- | ------------------------------------------------- |
| connector_id   | string    |                                                   |
| connector_name | string    |                                                   |
| type           | string    | postgres / mysql / mongo / prometheus / …         |
| mode           | string    | federation / sync / both                          |
| phase          | string    | per source-table                                  |
| lag_seconds    | real      | sync-mode lag                                     |
| events_per_sec | real      | sync throughput                                   |
| pool_in_use    | int       | federation pool snapshot                          |
| pool_max       | int       |                                                   |
| last_error     | string    |                                                   |
| last_event_at  | timestamp |                                                   |
| timestamp      | timestamp | row emit time (matches kyma's standard column name) |

Agents query this with KQL/SQL; dashboards chart it; alerts fire on it. This is what closes the "kyma observing kyma observing your databases" loop without dedicated alerting config.

GET /v1/connectors/:id/events returns the last 100 state transitions (phase changes, error onsets, recoveries) for postmortems.

System columns on synced tables

Synced rows always have four columns kyma adds automatically: _kyma_pk, _kyma_op, _kyma_lsn, _kyma_event_at. Deletes are tombstone rows with _kyma_op = 'delete'; default reads at the federation/agent layer hide them via predicate. See Multi-source data (concepts) for the row-semantics rules and Retention and compaction for how tombstones get garbage-collected.
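The default-read predicate can be sketched in a few lines. Caveat: the keep-newest-version-per-primary-key step below is an assumption about CDC replay semantics for illustration; the authoritative row-semantics rules live on the concepts page, and the dict-of-rows shape is made up:

```python
def current_rows(rows: list[dict]) -> list[dict]:
    """Sketch of a default read over a synced table: keep the newest
    version per _kyma_pk (highest _kyma_lsn), then hide delete
    tombstones via the _kyma_op predicate described above.
    """
    latest: dict = {}
    for row in rows:
        pk = row["_kyma_pk"]
        if pk not in latest or row["_kyma_lsn"] > latest[pk]["_kyma_lsn"]:
            latest[pk] = row
    # Tombstones survive until compaction, but default reads never show them.
    return [r for r in latest.values() if r["_kyma_op"] != "delete"]
```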

Mode-isolated pause

mode: "both" connectors can have federation healthy while sync is errored, and vice versa. Pause one without re-snapshotting the other:

bash
curl -X POST 'http://localhost:8080/v1/connectors/<id>/pause?scope=sync'
curl -X POST 'http://localhost:8080/v1/connectors/<id>/resume?scope=sync'

Default scope is all. Useful when an upstream source is doing maintenance and you want to keep federation answering while sync catches up later.

Where to go next