<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[Jatinder Aujla]]></title><description><![CDATA[Jatinder Aujla]]></description><link>https://jatinderaujla.com</link><generator>RSS for Node</generator><lastBuildDate>Fri, 17 Apr 2026 10:18:02 GMT</lastBuildDate><atom:link href="https://jatinderaujla.com/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[What is Debezium? Architecture, Terminology, and Connectors]]></title><description><![CDATA[In the first article of this CDC series, we learned how Change Data Capture (CDC) works and why it matters.
Now let’s dive deeper into Debezium — one of the most popular open-source CDC tools — to see what it is, how it works, and what kind of data i...]]></description><link>https://jatinderaujla.com/debezium-cdc-architecture-and-connectors</link><guid isPermaLink="true">https://jatinderaujla.com/debezium-cdc-architecture-and-connectors</guid><category><![CDATA[cdc]]></category><category><![CDATA[debezium]]></category><category><![CDATA[kafka]]></category><category><![CDATA[real-time data]]></category><category><![CDATA[event streaming]]></category><dc:creator><![CDATA[Jatinder]]></dc:creator><pubDate>Sun, 17 Aug 2025 12:51:21 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1755434973260/3e33a6eb-1114-4463-8e5b-628265915383.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In the <a target="_blank" href="https://jatinderaujla.com/what-is-cdc-change-data-capture">first article of this CDC series</a>, we learned how <strong>Change Data Capture (CDC)</strong> works and why it matters.</p>
<p>Now let’s dive deeper into <strong>Debezium</strong> — one of the most popular open-source CDC tools — to see what it is, how it works, and what kind of data it produces.</p>
<h3 id="heading-what-is-debezium"><strong>What is Debezium?</strong></h3>
<p>Debezium is an <strong>open-source CDC platform</strong> built on top of <strong>Kafka Connect</strong>. It continuously monitors database transaction logs and streams every change (insert, update, delete) into Kafka topics.</p>
<p>Supported databases include:</p>
<ul>
<li><p><strong>Relational</strong>: MySQL, PostgreSQL, SQL Server, Oracle, Db2</p>
</li>
<li><p><strong>NoSQL</strong>: MongoDB, Cassandra</p>
</li>
<li><p><strong>Others</strong>: Vitess, Spanner, etc.</p>
</li>
</ul>
<p>Instead of relying on periodic batch jobs, Debezium enables <strong>real-time, event-driven pipelines</strong> — a perfect fit for analytics, search, and microservices.</p>
<h3 id="heading-how-debezium-works"><strong>How Debezium Works?</strong></h3>
<p>At a high level:</p>
<ol>
<li><p>Debezium connects to a <strong>database’s transaction log</strong> (e.g., MySQL binlog, Postgres WAL).</p>
</li>
<li><p>It captures changes row by row.</p>
</li>
<li><p>It converts these into <strong>structured change events</strong>.</p>
</li>
<li><p>Events are published to <strong>Kafka topics</strong>.</p>
</li>
<li><p>Other systems (apps, warehouses, sinks) consume these events.</p>
</li>
</ol>
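<p>Concretely, each source connector is registered with Kafka Connect through a small JSON configuration. The snippet below is an illustrative sketch for the MySQL connector (hostnames, credentials, and names are placeholder values; property names follow Debezium 2.x, where <code>topic.prefix</code> replaced the older <code>database.server.name</code>):</p>
<pre><code class="lang-json">{
  "name": "inventory-connector",
  "config": {
    "connector.class": "io.debezium.connector.mysql.MySqlConnector",
    "database.hostname": "mysql",
    "database.port": "3306",
    "database.user": "debezium",
    "database.password": "dbz",
    "database.server.id": "184054",
    "topic.prefix": "dbserver1",
    "database.include.list": "inventory",
    "schema.history.internal.kafka.bootstrap.servers": "kafka:9092",
    "schema.history.internal.kafka.topic": "schema-changes.inventory"
  }
}
</code></pre>
<p>Posting this document to the Kafka Connect REST API (<code>POST /connectors</code>) starts the connector: it snapshots the existing tables once, then tails the binlog.</p>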
<h3 id="heading-debezium-architecture"><strong>Debezium Architecture?</strong></h3>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1755430284988/05434f45-9c58-4498-bef7-eae6148d9415.png" alt="A diagram illustrating data flow from MySQL and PostgreSQL databases through Debezium source connectors to Apache Kafka. The data is then processed by JDBC and Elasticsearch sink connectors to be stored in a database and sent to an analytics viewer." class="image--center mx-auto" /></p>
<p><strong>Database → Debezium Connector (Kafka Connect) → Kafka Topics → Consumers/Sink Connectors</strong></p>
<ul>
<li><p><strong>Source Database</strong> → e.g., MySQL, PostgreSQL</p>
</li>
<li><p><strong>Debezium Connector</strong> → reads changes from transaction logs (MySQL binlog, PostgreSQL WAL).</p>
</li>
<li><p><strong>Kafka Cluster</strong> → stores events in topics</p>
</li>
<li><p><strong>Consumers/Sinks</strong> → JDBC sink connectors for relational databases, Elasticsearch, data warehouses, or microservices</p>
</li>
</ul>
<p>This design makes Debezium <strong>scalable and fault-tolerant</strong>.</p>
<h3 id="heading-terminology"><strong>Terminology</strong></h3>
<ul>
<li><p><strong>Connector</strong> → plugin that knows how to read changes from a specific DB</p>
</li>
<li><p><strong>Source Connector</strong> → captures changes (Debezium provides these)</p>
</li>
<li><p><strong>Sink Connector</strong> → delivers changes to targets (from Kafka Connect ecosystem)</p>
</li>
<li><p><strong>Change Event</strong> → structured JSON/Avro message containing before/after values</p>
</li>
<li><p><strong>Offsets</strong> → checkpoints for connector progress in logs</p>
</li>
<li><p><strong>Snapshotting</strong> → initial dump of existing data before streaming begins</p>
</li>
<li><p><strong>Schema History Topic</strong> → Kafka topic where Debezium records schema/DDL changes</p>
</li>
<li><p><strong>Tombstone Event</strong> → a message with a null value that follows a delete, so log compaction can remove the key from the topic</p>
</li>
</ul>
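<p>The last two terms are easiest to see with a tiny sketch. The following Python snippet (purely illustrative, not Debezium code) mimics how log compaction treats a delete: the <code>d</code> change event still carries a value, and the tombstone that follows it (the same key with a null value) is what lets a compacted topic drop the key entirely:</p>
<pre><code class="lang-python">def compact(messages):
    """Mimic Kafka log compaction: keep only the latest value per key,
    then drop keys whose latest value is a tombstone (None)."""
    latest = {}
    for key, value in messages:
        latest[key] = value
    return {k: v for k, v in latest.items() if v is not None}

messages = [
    (101, {"op": "c", "after": {"id": 101, "product": "Laptop"}}),
    (101, {"op": "u", "after": {"id": 101, "product": "Laptop Pro"}}),
    (101, {"op": "d", "after": None}),  # delete event still has a value
    (101, None),                        # tombstone: same key, null value
]

print(compact(messages))  # key 101 is gone after compaction: {}
</code></pre>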
<h3 id="heading-benefits-of-using-debezium"><strong>Benefits of Using Debezium</strong></h3>
<ul>
<li><p>Near <strong>real-time CDC</strong> with low latency</p>
</li>
<li><p>Works with many databases</p>
</li>
<li><p>No changes needed in application code</p>
</li>
<li><p>Reliable (offset tracking, schema history, at-least-once delivery with Kafka)</p>
</li>
<li><p>Scales easily with Kafka</p>
</li>
</ul>
<h3 id="heading-features-of-debezium"><strong>Features of Debezium</strong></h3>
<p>Debezium provides a rich set of features that make it one of the most widely used CDC platforms.</p>
<ol>
<li><p><strong>Captures All Data Changes</strong></p>
<ul>
<li>Inserts, updates, and deletes are all captured reliably</li>
</ul>
</li>
<li><p><strong>Low Latency, High Efficiency</strong></p>
<ul>
<li>Produces change events with <strong>very low delay</strong> while avoiding heavy CPU usage (no expensive polling).</li>
</ul>
</li>
<li><p><strong>No Data Model Changes Required</strong></p>
<ul>
<li>Works by reading the database’s <strong>transaction log</strong>, so you don’t need to modify existing tables or schemas.</li>
</ul>
</li>
<li><p><strong>Captures Deletes</strong></p>
<ul>
<li>Supports “tombstone” events to reflect deleted records downstream.</li>
</ul>
</li>
<li><p><strong>Captures Old State + Metadata</strong></p>
<ul>
<li><p>Provides both <strong>before</strong> and <strong>after</strong> row states.</p>
</li>
<li><p>Can include extra metadata such as <strong>transaction IDs, user queries, and timestamps</strong> (depending on DB).</p>
</li>
</ul>
</li>
<li><p><strong>Advanced Filtering and Transformations</strong></p>
<ul>
<li><p>Built-in <strong>Single Message Transformations (SMTs)</strong> allow:</p>
<ul>
<li><p>Filtering certain records</p>
</li>
<li><p>Masking sensitive fields (e.g., PII)</p>
</li>
<li><p>Routing records to different Kafka topics</p>
</li>
<li><p>Custom message transformations</p>
</li>
</ul>
</li>
</ul>
</li>
<li><p><strong>Fault-Tolerance &amp; Recovery</strong></p>
<ul>
<li>Uses <strong>Kafka offsets</strong> to resume from the exact point of failure, ensuring no events are lost; delivery is at-least-once, so consumers should tolerate occasional duplicates after a restart.</li>
</ul>
</li>
</ol>
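<p>The filtering and transformation features above are configured declaratively on the connector. As an illustrative example (field and topic names are placeholders), the following fragment chains three SMTs: Debezium's <code>ExtractNewRecordState</code> to flatten events, Kafka Connect's <code>MaskField</code> to blank out a PII column, and <code>RegexRouter</code> to rename topics:</p>
<pre><code class="lang-json">{
  "transforms": "unwrap,mask,route",
  "transforms.unwrap.type": "io.debezium.transforms.ExtractNewRecordState",
  "transforms.mask.type": "org.apache.kafka.connect.transforms.MaskField$Value",
  "transforms.mask.fields": "email",
  "transforms.route.type": "org.apache.kafka.connect.transforms.RegexRouter",
  "transforms.route.regex": "dbserver1\\.inventory\\.(.*)",
  "transforms.route.replacement": "cdc.$1"
}
</code></pre>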
<h3 id="heading-why-log-based-cdc-is-better"><strong>Why Log-Based CDC is Better</strong></h3>
<p>There are different CDC techniques (polling, triggers, log-based), but <strong>log-based CDC</strong> — which Debezium uses — is generally the most robust approach:</p>
<ul>
<li><p><strong>Low Overhead →</strong> No extra load on the database since it reads from the <strong>transaction log</strong> instead of querying live tables.</p>
</li>
<li><p><strong>Complete Change History →</strong> Captures all changes, including deletes and before/after values.</p>
</li>
<li><p><strong>Reliable &amp; Consistent →</strong> Transaction ordering is preserved exactly as it happened in the database.</p>
</li>
<li><p><strong>Non-Intrusive →</strong> No need to modify application code or database schema.</p>
</li>
</ul>
<p>By comparison:</p>
<ul>
<li><p><strong>Polling</strong> adds query overhead, can miss changes, and introduces latency.</p>
</li>
<li><p><strong>Triggers</strong> increase write latency and are harder to maintain at scale.</p>
</li>
</ul>
<p>That’s why <strong>Debezium’s log-based CDC approach</strong> is widely used for real-time data pipelines.</p>
<h3 id="heading-debezium-topics"><strong>Debezium Topics</strong></h3>
<p>When Debezium runs, it creates <strong>multiple Kafka topics</strong>:</p>
<ol>
<li><p><strong>Table-specific topics</strong></p>
<ul>
<li><p>Each database table gets its own separate topic.</p>
</li>
<li><p>Example:</p>
<ul>
<li><p><code>dbserver1.inventory.orders</code> → streams changes from <code>orders</code> table</p>
</li>
<li><p><code>dbserver1.inventory.customers</code> → streams changes from <code>customers</code> table</p>
</li>
</ul>
</li>
</ul>
</li>
<li><p><strong>Schema/History topic</strong></p>
<ul>
<li><p>Example: <code>schema-changes.inventory</code></p>
</li>
<li><p>Stores schema (DDL) changes like <code>ALTER TABLE</code>.</p>
</li>
<li><p>Ensures consumers can interpret events even if the table structure evolves.</p>
</li>
</ul>
</li>
</ol>
<h3 id="heading-sample-change-events-records"><strong>Sample Change Events (Records)</strong></h3>
<p>Let’s say we have a MySQL <code>orders</code> table.</p>
<p><strong>Insert record sample data</strong></p>
<pre><code class="lang-json">{
  <span class="hljs-attr">"before"</span>: <span class="hljs-literal">null</span>,
  <span class="hljs-attr">"after"</span>: {
    <span class="hljs-attr">"id"</span>: <span class="hljs-number">101</span>,
    <span class="hljs-attr">"product"</span>: <span class="hljs-string">"Laptop"</span>,
    <span class="hljs-attr">"amount"</span>: <span class="hljs-number">1200</span>
  },
  <span class="hljs-attr">"source"</span>: {
    <span class="hljs-attr">"db"</span>: <span class="hljs-string">"ecommerce"</span>,
    <span class="hljs-attr">"table"</span>: <span class="hljs-string">"orders"</span>,
    <span class="hljs-attr">"ts_ms"</span>: <span class="hljs-number">1692956540000</span>
  },
  <span class="hljs-attr">"op"</span>: <span class="hljs-string">"c"</span>,   <span class="hljs-comment">// c = create</span>
  <span class="hljs-attr">"ts_ms"</span>: <span class="hljs-number">1692956540500</span>
}
</code></pre>
<p><strong>Update record sample data</strong></p>
<pre><code class="lang-json">{
  <span class="hljs-attr">"before"</span>: {
    <span class="hljs-attr">"id"</span>: <span class="hljs-number">101</span>,
    <span class="hljs-attr">"product"</span>: <span class="hljs-string">"Laptop"</span>,
    <span class="hljs-attr">"amount"</span>: <span class="hljs-number">1200</span> <span class="hljs-comment">// old price</span>
  },
  <span class="hljs-attr">"after"</span>: {
    <span class="hljs-attr">"id"</span>: <span class="hljs-number">101</span>,
    <span class="hljs-attr">"product"</span>: <span class="hljs-string">"Laptop"</span>,
    <span class="hljs-attr">"amount"</span>: <span class="hljs-number">1100</span> <span class="hljs-comment">//new price</span>
  },
  <span class="hljs-attr">"source"</span>: {
    <span class="hljs-attr">"db"</span>: <span class="hljs-string">"ecommerce"</span>,
    <span class="hljs-attr">"table"</span>: <span class="hljs-string">"orders"</span>
  },
  <span class="hljs-attr">"op"</span>: <span class="hljs-string">"u"</span>,   <span class="hljs-comment">// u = update</span>
  <span class="hljs-attr">"ts_ms"</span>: <span class="hljs-number">1692956540700</span>
}
</code></pre>
<p><strong>Delete record sample data</strong></p>
<pre><code class="lang-json">{
  <span class="hljs-attr">"before"</span>: {
    <span class="hljs-attr">"id"</span>: <span class="hljs-number">101</span>,
    <span class="hljs-attr">"product"</span>: <span class="hljs-string">"Laptop"</span>,
    <span class="hljs-attr">"amount"</span>: <span class="hljs-number">1100</span>
  },
  <span class="hljs-attr">"after"</span>: <span class="hljs-literal">null</span>,
  <span class="hljs-attr">"source"</span>: {
    <span class="hljs-attr">"db"</span>: <span class="hljs-string">"ecommerce"</span>,
    <span class="hljs-attr">"table"</span>: <span class="hljs-string">"orders"</span>
  },
  <span class="hljs-attr">"op"</span>: <span class="hljs-string">"d"</span>,   <span class="hljs-comment">// d = delete</span>
  <span class="hljs-attr">"ts_ms"</span>: <span class="hljs-number">1692956540900</span>
}
</code></pre>
<p>Notice how each event has:</p>
<ul>
<li><p><strong>before</strong> → row state before change</p>
</li>
<li><p><strong>after</strong> → row state after change</p>
</li>
<li><p><strong>op</strong> → operation type (<code>c</code>, <code>u</code>, <code>d</code>)</p>
</li>
<li><p><strong>source</strong> → metadata (db, table, timestamp)</p>
</li>
</ul>
<p>This is the heart of how Debezium streams data.</p>
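<p>A downstream consumer usually dispatches on the <code>op</code> field. Here is a hypothetical Python sketch of sink logic that applies the three sample events above to an in-memory copy of the table (real sinks write to a database or index, but the dispatch logic is the same):</p>
<pre><code class="lang-python">def apply_event(table, event):
    """Apply one Debezium change event to a dict keyed by primary key.
    op codes: c = create, r = snapshot read, u = update, d = delete."""
    op = event["op"]
    if op in ("c", "r", "u"):
        row = event["after"]          # the new row state
        table[row["id"]] = row
    elif op == "d":
        table.pop(event["before"]["id"], None)  # only "before" is set
    return table

insert = {"op": "c", "before": None,
          "after": {"id": 101, "product": "Laptop", "amount": 1200}}
update = {"op": "u", "before": {"id": 101, "amount": 1200},
          "after": {"id": 101, "product": "Laptop", "amount": 1100}}
delete = {"op": "d", "after": None,
          "before": {"id": 101, "product": "Laptop", "amount": 1100}}

table = {}
for event in (insert, update, delete):
    apply_event(table, event)

print(table)  # prints {}: the row was created, updated, then deleted
</code></pre>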
<h3 id="heading-connectors-in-debezium"><strong>Connectors in Debezium</strong></h3>
<ul>
<li><p><strong>Source Connectors</strong> (provided by Debezium):</p>
<ul>
<li><p>MySQL, PostgreSQL, MongoDB, SQL Server, Oracle, Db2, Cassandra</p>
</li>
<li><p>Reads transaction logs</p>
</li>
</ul>
</li>
<li><p><strong>Sink Connectors</strong> (provided by Kafka Connect ecosystem):</p>
<ul>
<li><p>JDBC Sink → push to another DB</p>
</li>
<li><p>Elasticsearch Sink → for search indexing</p>
</li>
<li><p>S3 Sink → for archiving raw events</p>
</li>
<li><p>Others → Snowflake, BigQuery, etc.</p>
</li>
</ul>
</li>
</ul>
<p>Together, these connectors make Debezium a <strong>complete pipeline for database sync and streaming</strong>.</p>
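<p>On the sink side, configuration is just as declarative. As an illustrative sketch (connection details and names are placeholders), a JDBC sink connector from the Kafka Connect ecosystem (here Confluent's) can upsert the <code>orders</code> change stream into PostgreSQL:</p>
<pre><code class="lang-json">{
  "name": "orders-jdbc-sink",
  "config": {
    "connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
    "topics": "dbserver1.inventory.orders",
    "connection.url": "jdbc:postgresql://postgres:5432/analytics",
    "connection.user": "analytics",
    "connection.password": "secret",
    "insert.mode": "upsert",
    "pk.mode": "record_key",
    "auto.create": "true"
  }
}
</code></pre>
<p>JDBC sinks generally expect flat rows rather than nested before/after envelopes, so Debezium's <code>ExtractNewRecordState</code> transform is typically applied before events reach the sink.</p>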
<h3 id="heading-real-world-use-cases"><strong>Real-World Use Cases</strong></h3>
<ul>
<li><p><strong>E-commerce</strong> → sync orders from MySQL to analytics DB</p>
</li>
<li><p><strong>Search</strong> → stream product updates into Elasticsearch</p>
</li>
<li><p><strong>Microservices</strong> → event-driven communication</p>
</li>
<li><p><strong>Data migration</strong> → low-downtime DB migration</p>
</li>
<li><p><strong>Audit Trails</strong> → stream all data changes with before and after payload into <strong>data lakes</strong> or <strong>history tables</strong>.</p>
</li>
</ul>
<h3 id="heading-limitations-amp-considerations"><strong>Limitations &amp; Considerations</strong></h3>
<ul>
<li><p>Requires a Kafka cluster (extra infra)</p>
</li>
<li><p>Snapshotting large tables can be expensive</p>
</li>
<li><p>Schema evolution needs planning</p>
</li>
<li><p>Sensitive data may need masking/transformations</p>
</li>
<li><p>Kafka topic retention must be tuned</p>
</li>
</ul>
<h3 id="heading-conclusion-amp-next-steps"><strong>Conclusion &amp; Next Steps</strong></h3>
<p>Debezium makes CDC <strong>practical, reliable, and production-ready</strong>.<br />It turns every row change into a <strong>real-time event</strong> that downstream systems can consume.</p>
<p>In the <strong>next article</strong>, we’ll walk through a <strong>hands-on example</strong>: syncing changes from <strong>MySQL → Kafka → PostgreSQL</strong> using Debezium and Kafka Connect.</p>
]]></content:encoded></item><item><title><![CDATA[What is CDC (Change Data Capture)?]]></title><description><![CDATA[Understanding Change Data Capture (CDC)
If you’ve ever tried to keep two different systems in sync — say, a database and an analytics dashboard — you know it’s tricky. Data changes fast, and if you’re constantly copying entire tables just to keep thi...]]></description><link>https://jatinderaujla.com/what-is-cdc-change-data-capture</link><guid isPermaLink="true">https://jatinderaujla.com/what-is-cdc-change-data-capture</guid><category><![CDATA[cdc]]></category><category><![CDATA[debezium]]></category><category><![CDATA[Databases]]></category><dc:creator><![CDATA[Jatinder]]></dc:creator><pubDate>Sun, 10 Aug 2025 16:52:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1754844406114/d8c91e66-5274-4091-abe1-8fde2c96f7eb.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2 id="heading-understanding-change-data-capture-cdc"><strong>Understanding Change Data Capture (CDC)</strong></h2>
<p>If you’ve ever tried to keep two different systems in sync — say, a database and an analytics dashboard — you know it’s tricky. Data changes fast, and if you’re constantly copying entire tables just to keep things updated, it gets slow, expensive, and messy.</p>
<p>That’s where <strong>Change Data Capture (CDC)</strong> comes in.</p>
<h3 id="heading-what-is-cdc"><strong>What is CDC?</strong></h3>
<p><strong>CDC is a way to detect and capture only the changes that happen in your database</strong> — new records, updates, and deletions — and send them somewhere else in real time (or near real time).</p>
<p><strong>Let’s understand it with an example</strong></p>
<p>Imagine your e-commerce database as a giant store. Every minute or second, someone:</p>
<ul>
<li><p>Places a new order</p>
</li>
<li><p>Updates their shipping address</p>
</li>
<li><p>Cancels an order</p>
</li>
<li><p>Adds stock to a product</p>
</li>
</ul>
<p>CDC lets you learn about these changes in near real time, efficiently and without wasting resources: it captures only what changed, instead of copying the entire order, address, and product data.</p>
<p><strong>Real-time use of CDC in an e-commerce platform</strong></p>
<ul>
<li><p>Every new order triggers an update in <strong>real-time sales dashboards</strong>.</p>
</li>
<li><p>Inventory changes are sent directly to the <strong>search and product pages</strong>.</p>
</li>
<li><p>Customer profile edits update <strong>CRM and marketing tools</strong> instantly.</p>
</li>
</ul>
<h3 id="heading-why-use-cdc"><strong>Why Use CDC?</strong></h3>
<p>Think of CDC like a live news feed for your database:</p>
<ul>
<li><p><strong>Faster</strong> — You’re not pulling the entire dataset each time.</p>
</li>
<li><p><strong>Lighter</strong> — Less network and storage overhead.</p>
</li>
<li><p><strong>Timely</strong> — Downstream systems get updates almost in real-time.</p>
</li>
<li><p><strong>Reliable</strong> — Fewer chances to miss changes between syncs.</p>
</li>
</ul>
<h3 id="heading-how-cdc-works-the-big-picture"><strong>How CDC Works — The Big Picture</strong></h3>
<p>Most CDC implementations follow the same pattern:</p>
<ol>
<li><p><strong>Watch for Changes</strong> — This can be done by reading database logs, tracking timestamps, or using built-in database features.</p>
</li>
<li><p><strong>Capture the Event</strong> — Record details about what changed (insert, update, delete, and the affected rows/columns).</p>
</li>
<li><p><strong>Deliver the Change</strong> — Send this change to another system (a message queue, data warehouse, or microservice).</p>
</li>
</ol>
<h3 id="heading-common-ways-to-implement-cdc"><strong>Common Ways to Implement CDC</strong></h3>
<p>There are several techniques to detect changes:</p>
<ul>
<li><p><strong>Log-based CDC</strong> — Reads the database’s transaction log (very efficient, minimal performance impact).</p>
</li>
<li><p><strong>Trigger-based CDC</strong> — Uses database triggers to log changes into another table (easier to set up, but can slow down writes).</p>
</li>
<li><p><strong>Timestamp-based CDC</strong> — Compares timestamps to find new or updated rows (simpler, but can miss deletes or backdated changes).</p>
</li>
</ul>
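<p>To make the trade-offs concrete, here is a minimal timestamp-based polling sketch in Python (illustrative only, assuming every row carries an <code>updated_at</code> column). Notice it can never observe a delete (the row is simply gone by the next poll), which is exactly the gap log-based CDC closes:</p>
<pre><code class="lang-python">def poll_changes(rows, last_seen):
    """Timestamp-based CDC: return rows touched since the last poll
    plus the new high-water mark. Deleted rows never appear here."""
    changed = [r for r in rows if r["updated_at"] &gt; last_seen]
    new_mark = max((r["updated_at"] for r in changed), default=last_seen)
    return changed, new_mark

rows = [
    {"id": 1, "status": "shipped",  "updated_at": 100},
    {"id": 2, "status": "pending",  "updated_at": 205},
    {"id": 3, "status": "canceled", "updated_at": 210},
]

changed, mark = poll_changes(rows, last_seen=200)
print([r["id"] for r in changed], mark)  # [2, 3] 210
</code></pre>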
<h3 id="heading-cdc-tools-in-the-market"><strong>CDC Tools in the market</strong></h3>
<ul>
<li><p>Debezium</p>
</li>
<li><p>Airbyte</p>
</li>
<li><p>Hevo, etc.</p>
</li>
</ul>
<h3 id="heading-where-debezium-fits-in"><strong>Where Debezium fits in?</strong></h3>
<p>Debezium is an <strong>open-source</strong> change data capture tool. It connects to your source database logs (MySQL binlog, Postgres WAL, etc.), captures all the row-level changes from the logs, and publishes those events to <strong>Apache Kafka</strong> so that other systems can listen to those changes and react to them.</p>
<p><strong>Benefits of using Debezium (DBZ)</strong></p>
<ul>
<li><p>Log-based CDC connects to the <strong>transaction log</strong> of the database.</p>
</li>
<li><p>Captures real-time data changes</p>
</li>
<li><p>Supports multiple data sources such as <strong>MySQL, Postgres, Oracle, MariaDB, SQL Server</strong>, etc.</p>
</li>
<li><p>Ready to use in production.</p>
</li>
<li><p>Fault tolerance — It maintains the position in the transaction log, allowing it to resume from where it left off in case of failure.</p>
</li>
<li><p>Captures all changes, including deleted records.</p>
</li>
</ul>
<h3 id="heading-how-debezium-works-high-level"><strong>How Debezium works — High-level</strong></h3>
<ul>
<li><p>Read the transaction logs from the database.</p>
</li>
<li><p>Parse the logs and capture the data changes.</p>
</li>
<li><p>Serialize each change into a configured format such as JSON or Avro.</p>
</li>
<li><p>Publish the events to a message broker such as Apache Kafka.</p>
</li>
</ul>
<h3 id="heading-when-you-should-consider-cdc"><strong>When You Should Consider CDC?</strong></h3>
<p>CDC is a good fit when you need:</p>
<ul>
<li><p><strong>Real-time reporting or analytics</strong></p>
</li>
<li><p><strong>Search index updates</strong> (e.g., Elasticsearch)</p>
</li>
<li><p><strong>Sync between microservices</strong></p>
</li>
<li><p><strong>Data migration with minimal downtime</strong></p>
</li>
</ul>
<p>But it might be overkill for:</p>
<ul>
<li><p>Very small datasets</p>
</li>
<li><p>One-off migrations</p>
</li>
<li><p>Systems where “eventual consistency” isn’t critical</p>
</li>
</ul>
<h3 id="heading-summary"><strong>Summary</strong></h3>
<p>In short, CDC is like putting your database on <strong>“live broadcast mode”.</strong> Instead of asking <strong>“What’s changed?”</strong> over and over, you get updates the moment something happens — ready to power analytics, sync systems, or trigger workflows.</p>
<p>With tools like Debezium, implementing CDC is easier than ever. The result? Data that’s always fresh, and systems that are always in sync.</p>
]]></content:encoded></item><item><title><![CDATA[What is Java?]]></title><description><![CDATA[Java is a statically typed, object-oriented programming language.]]></description><link>https://jatinderaujla.com/what-is-java</link><guid isPermaLink="true">https://jatinderaujla.com/what-is-java</guid><category><![CDATA[Java]]></category><category><![CDATA[coding]]></category><category><![CDATA[Programming Blogs]]></category><dc:creator><![CDATA[Jatinder]]></dc:creator><pubDate>Thu, 16 Jan 2025 18:07:13 GMT</pubDate><content:encoded><![CDATA[<p>Java is a statically typed, object-oriented programming language.</p>
]]></content:encoded></item></channel></rss>