<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[Jatinder Aujla]]></title><description><![CDATA[Jatinder Aujla]]></description><link>https://jatinderaujla.com</link><generator>RSS for Node</generator><lastBuildDate>Fri, 17 Apr 2026 10:18:02 GMT</lastBuildDate><atom:link href="https://jatinderaujla.com/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[What is Debezium? Architecture, Terminology, and Connectors]]></title><description><![CDATA[In the first article of this CDC series, we learned how Change Data Capture (CDC) works and why it matters.
Now let’s dive deeper into Debezium — one of the most popular open-source CDC tools — to see what it is, how it works, and what kind of data i...]]></description><link>https://jatinderaujla.com/debezium-cdc-architecture-and-connectors</link><guid isPermaLink="true">https://jatinderaujla.com/debezium-cdc-architecture-and-connectors</guid><category><![CDATA[cdc]]></category><category><![CDATA[debezium]]></category><category><![CDATA[kafka]]></category><category><![CDATA[real-time data]]></category><category><![CDATA[event streaming]]></category><dc:creator><![CDATA[Jatinder]]></dc:creator><pubDate>Sun, 17 Aug 2025 12:51:21 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1755434973260/3e33a6eb-1114-4463-8e5b-628265915383.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In the <a target="_blank" href="https://jatinderaujla.com/what-is-cdc-change-data-capture">first article of this CDC series</a>, we learned how <strong>Change Data Capture (CDC)</strong> works and why it matters.</p>
<p>Now let’s dive deeper into <strong>Debezium</strong> — one of the most popular open-source CDC tools — to see what it is, how it works, and what kind of data it produces.</p>
<h3 id="heading-what-is-debezium"><strong>What is Debezium?</strong></h3>
<p>Debezium is an <strong>open-source CDC platform</strong> built on top of <strong>Kafka Connect</strong>. It continuously monitors database transaction logs and streams every change (insert, update, delete) into Kafka topics.</p>
<p>Supported databases include:</p>
<ul>
<li><p><strong>Relational</strong>: MySQL, PostgreSQL, SQL Server, Oracle, Db2</p>
</li>
<li><p><strong>NoSQL</strong>: MongoDB, Cassandra</p>
</li>
<li><p><strong>Others</strong>: Vitess, Spanner, etc.</p>
</li>
</ul>
<p>Instead of relying on periodic batch jobs, Debezium enables <strong>real-time, event-driven pipelines</strong> — a perfect fit for analytics, search, and microservices.</p>
<h3 id="heading-how-debezium-works"><strong>How Debezium Works?</strong></h3>
<p>At a high level:</p>
<ol>
<li><p>Debezium connects to a <strong>database’s transaction log</strong> (e.g., MySQL binlog, Postgres WAL).</p>
</li>
<li><p>It captures changes row by row.</p>
</li>
<li><p>It converts these into <strong>structured change events</strong>.</p>
</li>
<li><p>Events are published to <strong>Kafka topics</strong>.</p>
</li>
<li><p>Other systems (apps, warehouses, sinks) consume these events.</p>
</li>
</ol>
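<p>Concretely, each source connector is registered with Kafka Connect through a small JSON configuration. The snippet below is an illustrative sketch for the MySQL connector (hostnames, credentials, and names are placeholder values; property names follow Debezium 2.x, where <code>topic.prefix</code> replaced the older <code>database.server.name</code>):</p>
<pre><code class="lang-json">{
  "name": "inventory-connector",
  "config": {
    "connector.class": "io.debezium.connector.mysql.MySqlConnector",
    "database.hostname": "mysql",
    "database.port": "3306",
    "database.user": "debezium",
    "database.password": "dbz",
    "database.server.id": "184054",
    "topic.prefix": "dbserver1",
    "database.include.list": "inventory",
    "schema.history.internal.kafka.bootstrap.servers": "kafka:9092",
    "schema.history.internal.kafka.topic": "schema-changes.inventory"
  }
}
</code></pre>
<p>Posting this document to the Kafka Connect REST API (<code>POST /connectors</code>) starts the connector: it snapshots the existing tables once, then tails the binlog.</p>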
<h3 id="heading-debezium-architecture"><strong>Debezium Architecture?</strong></h3>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1755430284988/05434f45-9c58-4498-bef7-eae6148d9415.png" alt="A diagram illustrating data flow from MySQL and PostgreSQL databases through Debezium source connectors to Apache Kafka. The data is then processed by JDBC and Elasticsearch sink connectors to be stored in a database and sent to an analytics viewer." class="image--center mx-auto" /></p>
<p><strong>Database → Debezium Connector (Kafka Connect) → Kafka Topics → Consumers/Sink Connectors</strong></p>
<ul>
<li><p><strong>Source Database</strong> → e.g., MySQL, PostgreSQL</p>
</li>
<li><p><strong>Debezium Connector</strong> → reads changes from transaction logs (MySQL binlog, PostgreSQL WAL).</p>
</li>
<li><p><strong>Kafka Cluster</strong> → stores events in topics</p>
</li>
<li><p><strong>Consumers/Sinks</strong> → JDBC sink connectors for relational databases, Elasticsearch, data warehouses, or microservices</p>
</li>
</ul>
<p>This design makes Debezium <strong>scalable and fault-tolerant</strong>.</p>
<h3 id="heading-terminology"><strong>Terminology</strong></h3>
<ul>
<li><p><strong>Connector</strong> → plugin that knows how to read changes from a specific DB</p>
</li>
<li><p><strong>Source Connector</strong> → captures changes (Debezium provides these)</p>
</li>
<li><p><strong>Sink Connector</strong> → delivers changes to targets (from Kafka Connect ecosystem)</p>
</li>
<li><p><strong>Change Event</strong> → structured JSON/Avro message containing before/after values</p>
</li>
<li><p><strong>Offsets</strong> → checkpoints for connector progress in logs</p>
</li>
<li><p><strong>Snapshotting</strong> → initial dump of existing data before streaming begins</p>
</li>
<li><p><strong>Schema History Topic</strong> → Kafka topic where Debezium records schema/DDL changes</p>
</li>
<li><p><strong>Tombstone Event</strong> → a message with a null value that follows a delete, so log compaction can remove the key from the topic</p>
</li>
</ul>
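<p>The last two terms are easiest to see with a tiny sketch. The following Python snippet (purely illustrative, not Debezium code) mimics how log compaction treats a delete: the <code>d</code> change event still carries a value, and the tombstone that follows it (the same key with a null value) is what lets a compacted topic drop the key entirely:</p>
<pre><code class="lang-python">def compact(messages):
    """Mimic Kafka log compaction: keep only the latest value per key,
    then drop keys whose latest value is a tombstone (None)."""
    latest = {}
    for key, value in messages:
        latest[key] = value
    return {k: v for k, v in latest.items() if v is not None}

messages = [
    (101, {"op": "c", "after": {"id": 101, "product": "Laptop"}}),
    (101, {"op": "u", "after": {"id": 101, "product": "Laptop Pro"}}),
    (101, {"op": "d", "after": None}),  # delete event still has a value
    (101, None),                        # tombstone: same key, null value
]

print(compact(messages))  # key 101 is gone after compaction: {}
</code></pre>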
<h3 id="heading-benefits-of-using-debezium"><strong>Benefits of Using Debezium</strong></h3>
<ul>
<li><p>Near <strong>real-time CDC</strong> with low latency</p>
</li>
<li><p>Works with many databases</p>
</li>
<li><p>No changes needed in application code</p>
</li>
<li><p>Reliable (offset tracking, schema history, at-least-once delivery with Kafka)</p>
</li>
<li><p>Scales easily with Kafka</p>
</li>
</ul>
<h3 id="heading-features-of-debezium"><strong>Features of Debezium</strong></h3>
<p>Debezium provides a rich set of features that make it one of the most widely used CDC platforms.</p>
<ol>
<li><p><strong>Captures All Data Changes</strong></p>
<ul>
<li>Inserts, updates, and deletes are all captured reliably</li>
</ul>
</li>
<li><p><strong>Low Latency, High Efficiency</strong></p>
<ul>
<li>Produces change events with <strong>very low delay</strong> while avoiding heavy CPU usage (no expensive polling).</li>
</ul>
</li>
<li><p><strong>No Data Model Changes Required</strong></p>
<ul>
<li>Works by reading the database’s <strong>transaction log</strong>, so you don’t need to modify existing tables or schemas.</li>
</ul>
</li>
<li><p><strong>Captures Deletes</strong></p>
<ul>
<li>Supports “tombstone” events to reflect deleted records downstream.</li>
</ul>
</li>
<li><p><strong>Captures Old State + Metadata</strong></p>
<ul>
<li><p>Provides both <strong>before</strong> and <strong>after</strong> row states.</p>
</li>
<li><p>Can include extra metadata such as <strong>transaction IDs, user queries, and timestamps</strong> (depending on DB).</p>
</li>
</ul>
</li>
<li><p><strong>Advanced Filtering and Transformations</strong></p>
<ul>
<li><p>Built-in <strong>Single Message Transformations (SMTs)</strong> allow:</p>
<ul>
<li><p>Filtering certain records</p>
</li>
<li><p>Masking sensitive fields (e.g., PII)</p>
</li>
<li><p>Routing records to different Kafka topics</p>
</li>
<li><p>Custom message transformations</p>
</li>
</ul>
</li>
</ul>
</li>
<li><p><strong>Fault-Tolerance &amp; Recovery</strong></p>
<ul>
<li>Uses <strong>Kafka offsets</strong> to resume from the exact point of failure, ensuring no events are lost; delivery is at-least-once, so consumers should tolerate occasional duplicates after a restart.</li>
</ul>
</li>
</ol>
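<p>The filtering and transformation features above are configured declaratively on the connector. As an illustrative example (field and topic names are placeholders), the following fragment chains three SMTs: Debezium's <code>ExtractNewRecordState</code> to flatten events, Kafka Connect's <code>MaskField</code> to blank out a PII column, and <code>RegexRouter</code> to rename topics:</p>
<pre><code class="lang-json">{
  "transforms": "unwrap,mask,route",
  "transforms.unwrap.type": "io.debezium.transforms.ExtractNewRecordState",
  "transforms.mask.type": "org.apache.kafka.connect.transforms.MaskField$Value",
  "transforms.mask.fields": "email",
  "transforms.route.type": "org.apache.kafka.connect.transforms.RegexRouter",
  "transforms.route.regex": "dbserver1\\.inventory\\.(.*)",
  "transforms.route.replacement": "cdc.$1"
}
</code></pre>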
<h3 id="heading-why-log-based-cdc-is-better"><strong>Why Log-Based CDC is Better</strong></h3>
<p>There are different CDC techniques (polling, triggers, log-based), but <strong>log-based CDC</strong> — which Debezium uses — is generally the most robust approach:</p>
<ul>
<li><p><strong>Low Overhead →</strong> No extra load on the database since it reads from the <strong>transaction log</strong> instead of querying live tables.</p>
</li>
<li><p><strong>Complete Change History →</strong> Captures all changes, including deletes and before/after values.</p>
</li>
<li><p><strong>Reliable &amp; Consistent →</strong> Transaction ordering is preserved exactly as it happened in the database.</p>
</li>
<li><p><strong>Non-Intrusive →</strong> No need to modify application code or database schema.</p>
</li>
</ul>
<p>By comparison:</p>
<ul>
<li><p><strong>Polling</strong> adds query overhead, can miss changes, and introduces latency.</p>
</li>
<li><p><strong>Triggers</strong> increase write latency and are harder to maintain at scale.</p>
</li>
</ul>
<p>That’s why <strong>Debezium’s log-based CDC approach</strong> is widely used for real-time data pipelines.</p>
<h3 id="heading-debezium-topics"><strong>Debezium Topics</strong></h3>
<p>When Debezium runs, it creates <strong>multiple Kafka topics</strong>:</p>
<ol>
<li><p><strong>Table-specific topics</strong></p>
<ul>
<li><p>Each database table gets its own separate topic.</p>
</li>
<li><p>Example:</p>
<ul>
<li><p><code>dbserver1.inventory.orders</code> → streams changes from <code>orders</code> table</p>
</li>
<li><p><code>dbserver1.inventory.customers</code> → streams changes from <code>customers</code> table</p>
</li>
</ul>
</li>
</ul>
</li>
<li><p><strong>Schema/History topic</strong></p>
<ul>
<li><p>Example: <code>schema-changes.inventory</code></p>
</li>
<li><p>Stores schema (DDL) changes like <code>ALTER TABLE</code>.</p>
</li>
<li><p>Ensures consumers can interpret events even if the table structure evolves.</p>
</li>
</ul>
</li>
</ol>
<h3 id="heading-sample-change-events-records"><strong>Sample Change Events (Records)</strong></h3>
<p>Let’s say we have a MySQL <code>orders</code> table.</p>
<p><strong>Insert record sample data</strong></p>
<pre><code class="lang-json">{
  <span class="hljs-attr">"before"</span>: <span class="hljs-literal">null</span>,
  <span class="hljs-attr">"after"</span>: {
    <span class="hljs-attr">"id"</span>: <span class="hljs-number">101</span>,
    <span class="hljs-attr">"product"</span>: <span class="hljs-string">"Laptop"</span>,
    <span class="hljs-attr">"amount"</span>: <span class="hljs-number">1200</span>
  },
  <span class="hljs-attr">"source"</span>: {
    <span class="hljs-attr">"db"</span>: <span class="hljs-string">"ecommerce"</span>,
    <span class="hljs-attr">"table"</span>: <span class="hljs-string">"orders"</span>,
    <span class="hljs-attr">"ts_ms"</span>: <span class="hljs-number">1692956540000</span>
  },
  <span class="hljs-attr">"op"</span>: <span class="hljs-string">"c"</span>,   <span class="hljs-comment">// c = create</span>
  <span class="hljs-attr">"ts_ms"</span>: <span class="hljs-number">1692956540500</span>
}
</code></pre>
<p><strong>Update record sample data</strong></p>
<pre><code class="lang-json">{
  <span class="hljs-attr">"before"</span>: {
    <span class="hljs-attr">"id"</span>: <span class="hljs-number">101</span>,
    <span class="hljs-attr">"product"</span>: <span class="hljs-string">"Laptop"</span>,
    <span class="hljs-attr">"amount"</span>: <span class="hljs-number">1200</span> <span class="hljs-comment">// old price</span>
  },
  <span class="hljs-attr">"after"</span>: {
    <span class="hljs-attr">"id"</span>: <span class="hljs-number">101</span>,
    <span class="hljs-attr">"product"</span>: <span class="hljs-string">"Laptop"</span>,
    <span class="hljs-attr">"amount"</span>: <span class="hljs-number">1100</span> <span class="hljs-comment">//new price</span>
  },
  <span class="hljs-attr">"source"</span>: {
    <span class="hljs-attr">"db"</span>: <span class="hljs-string">"ecommerce"</span>,
    <span class="hljs-attr">"table"</span>: <span class="hljs-string">"orders"</span>
  },
  <span class="hljs-attr">"op"</span>: <span class="hljs-string">"u"</span>,   <span class="hljs-comment">// u = update</span>
  <span class="hljs-attr">"ts_ms"</span>: <span class="hljs-number">1692956540700</span>
}
</code></pre>
<p><strong>Delete record sample data</strong></p>
<pre><code class="lang-json">{
  <span class="hljs-attr">"before"</span>: {
    <span class="hljs-attr">"id"</span>: <span class="hljs-number">101</span>,
    <span class="hljs-attr">"product"</span>: <span class="hljs-string">"Laptop"</span>,
    <span class="hljs-attr">"amount"</span>: <span class="hljs-number">1100</span>
  },
  <span class="hljs-attr">"after"</span>: <span class="hljs-literal">null</span>,
  <span class="hljs-attr">"source"</span>: {
    <span class="hljs-attr">"db"</span>: <span class="hljs-string">"ecommerce"</span>,
    <span class="hljs-attr">"table"</span>: <span class="hljs-string">"orders"</span>
  },
  <span class="hljs-attr">"op"</span>: <span class="hljs-string">"d"</span>,   <span class="hljs-comment">// d = delete</span>
  <span class="hljs-attr">"ts_ms"</span>: <span class="hljs-number">1692956540900</span>
}
</code></pre>
<p>Notice how each event has:</p>
<ul>
<li><p><strong>before</strong> → row state before change</p>
</li>
<li><p><strong>after</strong> → row state after change</p>
</li>
<li><p><strong>op</strong> → operation type (<code>c</code>, <code>u</code>, <code>d</code>)</p>
</li>
<li><p><strong>source</strong> → metadata (db, table, timestamp)</p>
</li>
</ul>
<p>This is the heart of how Debezium streams data.</p>
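<p>A downstream consumer usually dispatches on the <code>op</code> field. Here is a hypothetical Python sketch of sink logic that applies the three sample events above to an in-memory copy of the table (real sinks write to a database or index, but the dispatch logic is the same):</p>
<pre><code class="lang-python">def apply_event(table, event):
    """Apply one Debezium change event to a dict keyed by primary key.
    op codes: c = create, r = snapshot read, u = update, d = delete."""
    op = event["op"]
    if op in ("c", "r", "u"):
        row = event["after"]          # the new row state
        table[row["id"]] = row
    elif op == "d":
        table.pop(event["before"]["id"], None)  # only "before" is set
    return table

insert = {"op": "c", "before": None,
          "after": {"id": 101, "product": "Laptop", "amount": 1200}}
update = {"op": "u", "before": {"id": 101, "amount": 1200},
          "after": {"id": 101, "product": "Laptop", "amount": 1100}}
delete = {"op": "d", "after": None,
          "before": {"id": 101, "product": "Laptop", "amount": 1100}}

table = {}
for event in (insert, update, delete):
    apply_event(table, event)

print(table)  # prints {}: the row was created, updated, then deleted
</code></pre>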
<h3 id="heading-connectors-in-debezium"><strong>Connectors in Debezium</strong></h3>
<ul>
<li><p><strong>Source Connectors</strong> (provided by Debezium):</p>
<ul>
<li><p>MySQL, PostgreSQL, MongoDB, SQL Server, Oracle, Db2, Cassandra</p>
</li>
<li><p>Reads transaction logs</p>
</li>
</ul>
</li>
<li><p><strong>Sink Connectors</strong> (provided by Kafka Connect ecosystem):</p>
<ul>
<li><p>JDBC Sink → push to another DB</p>
</li>
<li><p>Elasticsearch Sink → for search indexing</p>
</li>
<li><p>S3 Sink → for archiving raw events</p>
</li>
<li><p>Others → Snowflake, BigQuery, etc.</p>
</li>
</ul>
</li>
</ul>
<p>Together, these connectors make Debezium a <strong>complete pipeline for database sync and streaming</strong>.</p>
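<p>On the sink side, configuration is just as declarative. As an illustrative sketch (connection details and names are placeholders), a JDBC sink connector from the Kafka Connect ecosystem (here Confluent's) can upsert the <code>orders</code> change stream into PostgreSQL:</p>
<pre><code class="lang-json">{
  "name": "orders-jdbc-sink",
  "config": {
    "connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
    "topics": "dbserver1.inventory.orders",
    "connection.url": "jdbc:postgresql://postgres:5432/analytics",
    "connection.user": "analytics",
    "connection.password": "secret",
    "insert.mode": "upsert",
    "pk.mode": "record_key",
    "auto.create": "true"
  }
}
</code></pre>
<p>JDBC sinks generally expect flat rows rather than nested before/after envelopes, so Debezium's <code>ExtractNewRecordState</code> transform is typically applied before events reach the sink.</p>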
<h3 id="heading-real-world-use-cases"><strong>Real-World Use Cases</strong></h3>
<ul>
<li><p><strong>E-commerce</strong> → sync orders from MySQL to analytics DB</p>
</li>
<li><p><strong>Search</strong> → stream product updates into Elasticsearch</p>
</li>
<li><p><strong>Microservices</strong> → event-driven communication</p>
</li>
<li><p><strong>Data migration</strong> → low-downtime DB migration</p>
</li>
<li><p><strong>Audit Trails</strong> → stream all data changes with before and after payload into <strong>data lakes</strong> or <strong>history tables</strong>.</p>
</li>
</ul>
<h3 id="heading-limitations-amp-considerations"><strong>Limitations &amp; Considerations</strong></h3>
<ul>
<li><p>Requires a Kafka cluster (extra infra)</p>
</li>
<li><p>Snapshotting large tables can be expensive</p>
</li>
<li><p>Schema evolution needs planning</p>
</li>
<li><p>Sensitive data may need masking/transformations</p>
</li>
<li><p>Kafka topic retention must be tuned</p>
</li>
</ul>
<h3 id="heading-conclusion-amp-next-steps"><strong>Conclusion &amp; Next Steps</strong></h3>
<p>Debezium makes CDC <strong>practical, reliable, and production-ready</strong>.<br />It turns every row change into a <strong>real-time event</strong> that downstream systems can consume.</p>
<p>In the <strong>next article</strong>, we’ll walk through a <strong>hands-on example</strong>: syncing changes from <strong>MySQL → Kafka → PostgreSQL</strong> using Debezium and Kafka Connect.</p>
]]></content:encoded></item><item><title><![CDATA[What is CDC (Change Data Capture)?]]></title><description><![CDATA[Understanding Change Data Capture (CDC)
If you’ve ever tried to keep two different systems in sync — say, a database and an analytics dashboard — you know it’s tricky. Data changes fast, and if you’re constantly copying entire tables just to keep thi...]]></description><link>https://jatinderaujla.com/what-is-cdc-change-data-capture</link><guid isPermaLink="true">https://jatinderaujla.com/what-is-cdc-change-data-capture</guid><category><![CDATA[cdc]]></category><category><![CDATA[debezium]]></category><category><![CDATA[Databases]]></category><dc:creator><![CDATA[Jatinder]]></dc:creator><pubDate>Sun, 10 Aug 2025 16:52:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1754844406114/d8c91e66-5274-4091-abe1-8fde2c96f7eb.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2 id="heading-understanding-change-data-capture-cdc"><strong>Understanding Change Data Capture (CDC)</strong></h2>
<p>If you’ve ever tried to keep two different systems in sync — say, a database and an analytics dashboard — you know it’s tricky. Data changes fast, and if you’re constantly copying entire tables just to keep things updated, it gets slow, expensive, and messy.</p>
<p>That’s where <strong>Change Data Capture (CDC)</strong> comes in.</p>
<h3 id="heading-what-is-cdc"><strong>What is CDC?</strong></h3>
<p><strong>CDC is a way to detect and capture only the changes that happen in your database</strong> — new records, updates, and deletions — and send them somewhere else in real time (or near real time).</p>
<p><strong>Let’s understand it with an example</strong></p>
<p>Imagine your e-commerce database as a giant store. Every minute or second, someone:</p>
<ul>
<li><p>Places a new order</p>
</li>
<li><p>Updates their shipping address</p>
</li>
<li><p>Cancels an order</p>
</li>
<li><p>Adds stock to a product</p>
</li>
</ul>
<p>CDC lets you learn about these changes in near real time, efficiently and without wasting resources: it captures only what changed, instead of copying the entire order, address, and product data.</p>
<p><strong>Real-time use of CDC in an e-commerce platform</strong></p>
<ul>
<li><p>Every new order triggers an update in <strong>real-time sales dashboards</strong>.</p>
</li>
<li><p>Inventory changes are sent directly to the <strong>search and product pages</strong>.</p>
</li>
<li><p>Customer profile edits update <strong>CRM and marketing tools</strong> instantly.</p>
</li>
</ul>
<h3 id="heading-why-use-cdc"><strong>Why Use CDC?</strong></h3>
<p>Think of CDC like a live news feed for your database:</p>
<ul>
<li><p><strong>Faster</strong> — You’re not pulling the entire dataset each time.</p>
</li>
<li><p><strong>Lighter</strong> — Less network and storage overhead.</p>
</li>
<li><p><strong>Timely</strong> — Downstream systems get updates almost in real-time.</p>
</li>
<li><p><strong>Reliable</strong> — Fewer chances to miss changes between syncs.</p>
</li>
</ul>
<h3 id="heading-how-cdc-works-the-big-picture"><strong>How CDC Works — The Big Picture</strong></h3>
<p>Most CDC implementations follow the same pattern:</p>
<ol>
<li><p><strong>Watch for Changes</strong> — This can be done by reading database logs, tracking timestamps, or using built-in database features.</p>
</li>
<li><p><strong>Capture the Event</strong> — Record details about what changed (insert, update, delete, and the affected rows/columns).</p>
</li>
<li><p><strong>Deliver the Change</strong> — Send this change to another system (a message queue, data warehouse, or microservice).</p>
</li>
</ol>
<h3 id="heading-common-ways-to-implement-cdc"><strong>Common Ways to Implement CDC</strong></h3>
<p>There are several techniques to detect changes:</p>
<ul>
<li><p><strong>Log-based CDC</strong> — Reads the database’s transaction log (very efficient, minimal performance impact).</p>
</li>
<li><p><strong>Trigger-based CDC</strong> — Uses database triggers to log changes into another table (easier to set up, but can slow down writes).</p>
</li>
<li><p><strong>Timestamp-based CDC</strong> — Compares timestamps to find new or updated rows (simpler, but can miss deletes or backdated changes).</p>
</li>
</ul>
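<p>To make the trade-offs concrete, here is a minimal timestamp-based polling sketch in Python (illustrative only, assuming every row carries an <code>updated_at</code> column). Notice it can never observe a delete (the row is simply gone by the next poll), which is exactly the gap log-based CDC closes:</p>
<pre><code class="lang-python">def poll_changes(rows, last_seen):
    """Timestamp-based CDC: return rows touched since the last poll
    plus the new high-water mark. Deleted rows never appear here."""
    changed = [r for r in rows if r["updated_at"] &gt; last_seen]
    new_mark = max((r["updated_at"] for r in changed), default=last_seen)
    return changed, new_mark

rows = [
    {"id": 1, "status": "shipped",  "updated_at": 100},
    {"id": 2, "status": "pending",  "updated_at": 205},
    {"id": 3, "status": "canceled", "updated_at": 210},
]

changed, mark = poll_changes(rows, last_seen=200)
print([r["id"] for r in changed], mark)  # [2, 3] 210
</code></pre>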
<h3 id="heading-cdc-tools-in-the-market"><strong>CDC Tools in the market</strong></h3>
<ul>
<li><p>Debezium</p>
</li>
<li><p>Airbyte</p>
</li>
<li><p>Hevo, etc.</p>
</li>
</ul>
<h3 id="heading-where-debezium-fits-in"><strong>Where Debezium fits in?</strong></h3>
<p>Debezium is an <strong>open-source</strong> change data capture tool. It connects to your source database logs (MySQL binlog, Postgres WAL, etc.), captures all the row-level changes from the logs, and publishes those events to <strong>Apache Kafka</strong> so that other systems can listen to those changes and react to them.</p>
<p><strong>Benefits of using Debezium (DBZ)</strong></p>
<ul>
<li><p>Log-based CDC connects to the <strong>transaction log</strong> of the database.</p>
</li>
<li><p>Captures real-time data changes</p>
</li>
<li><p>Supports multiple data sources such as <strong>MySQL, Postgres, Oracle, MariaDB, SQL Server</strong>, etc.</p>
</li>
<li><p>Ready to use in production.</p>
</li>
<li><p>Fault tolerance — It maintains the position in the transaction log, allowing it to resume from where it left off in case of failure.</p>
</li>
<li><p>Captures all changes, including deleted records.</p>
</li>
</ul>
<h3 id="heading-how-debezium-works-high-level"><strong>How Debezium works — High-level</strong></h3>
<ul>
<li><p>Read the transaction logs from the database.</p>
</li>
<li><p>Parse the logs and capture the data changes.</p>
</li>
<li><p>Serialize each change into a configured format such as JSON or Avro.</p>
</li>
<li><p>Publish the events to a message broker such as Apache Kafka.</p>
</li>
</ul>
<h3 id="heading-when-you-should-consider-cdc"><strong>When You Should Consider CDC?</strong></h3>
<p>CDC is a good fit when you need:</p>
<ul>
<li><p><strong>Real-time reporting or analytics</strong></p>
</li>
<li><p><strong>Search index updates</strong> (e.g., Elasticsearch)</p>
</li>
<li><p><strong>Sync between microservices</strong></p>
</li>
<li><p><strong>Data migration with minimal downtime</strong></p>
</li>
</ul>
<p>But it might be overkill for:</p>
<ul>
<li><p>Very small datasets</p>
</li>
<li><p>One-off migrations</p>
</li>
<li><p>Systems where “eventual consistency” isn’t critical</p>
</li>
</ul>
<h3 id="heading-summary"><strong>Summary</strong></h3>
<p>In short, CDC is like putting your database on <strong>“live broadcast mode”.</strong> Instead of asking <strong>“What’s changed?”</strong> over and over, you get updates the moment something happens — ready to power analytics, sync systems, or trigger workflows.</p>
<p>With tools like Debezium, implementing CDC is easier than ever. The result? Data that’s always fresh, and systems that are always in sync.</p>
]]></content:encoded></item><item><title><![CDATA[What is Java?]]></title><description><![CDATA[Java is a statically typed, object-oriented programming language.]]></description><link>https://jatinderaujla.com/what-is-java</link><guid isPermaLink="true">https://jatinderaujla.com/what-is-java</guid><category><![CDATA[Java]]></category><category><![CDATA[coding]]></category><category><![CDATA[Programming Blogs]]></category><dc:creator><![CDATA[Jatinder]]></dc:creator><pubDate>Thu, 16 Jan 2025 18:07:13 GMT</pubDate><content:encoded><![CDATA[<p>Java is a statically typed, object-oriented programming language.</p>
]]></content:encoded></item></channel></rss>