CDC (Change Data Capture) Adapter

Introduction

CDC (Change Data Capture) is a process of identifying and tracking data change events in databases. CDC is an efficient mechanism to achieve reliable and scalable data replication across different systems.

In the 6.7 release of Diffusion™, we introduced a brand-new CDC adapter, which enables users to replicate data from databases into a Diffusion server (or server cluster). This adapter uses the Debezium engine to connect to databases. Debezium uses a log-based change data capture mechanism: it reads the database's transaction logs to identify row-level change events. The adapter processes these events and publishes them to a specific Diffusion topic. Debezium provides connectors for many different databases. The Diffusion CDC adapter initially supports fully tested MySQL and PostgreSQL connectors out of the box. Connections to other databases are not yet formally supported, but they may work if the relevant connector .jar is added to the classpath when running the adapter.

Each row-level data change event (insert/update/delete) captured by Debezium will be processed and published to a JSON Diffusion topic. More details about mapping these events to Diffusion topics can be found below. Every event contains the details of the data change as well as its schema. The schema of the data (that is, of a table) can optionally be published to a separate Diffusion topic specific to that table of the database. This is configurable and is disabled by default.
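As a rough illustration of what such an event carries, the sketch below models a trimmed Debezium change event as a Python dictionary and extracts the operation and the row from it. The schema/payload envelope and the before, after and op fields follow Debezium's event format; the sample values and the helper name summarise are invented for illustration, and the adapter's own processing is not shown.

```python
# Hypothetical, trimmed-down Debezium change event for an update to one row.
# Real events carry more fields (for example 'source' metadata and a full
# schema description); all values below are invented.
event = {
    "schema": {"type": "struct", "name": "dbserver1.inventory.customers.Envelope"},
    "payload": {
        "before": {"id": 1004, "first_name": "Anne"},
        "after": {"id": 1004, "first_name": "Anne Marie"},
        "op": "u",  # c = create, u = update, d = delete, r = snapshot read
        "ts_ms": 1620000000000,
    },
}

OPERATIONS = {"c": "insert", "u": "update", "d": "delete", "r": "snapshot read"}


def summarise(change_event):
    """Return (operation, row). Deletes have no 'after' image, so use 'before'."""
    payload = change_event["payload"]
    op = payload["op"]
    row = payload["before"] if op == "d" else payload["after"]
    return OPERATIONS[op], row


operation, row = summarise(event)
print(operation, row)
```

For a delete event the after image is empty, which is why the helper falls back to the before image to identify the affected row.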

Because Debezium is at its core, the adapter supports any configuration option that Debezium supports. Users can set these Debezium options (e.g. those for a MySQL database) in the adapter according to their requirements. Debezium's configuration parameters can include or exclude lists of databases, tables, and columns to track, so a single adapter can be configured to track different tables in a database with different sets of configurations. Similarly, any restrictions and requirements for using Debezium also apply to this adapter.
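To make the include/exclude options more concrete, here is a minimal sketch that holds a few Debezium MySQL connector properties in a Python dictionary and checks them for a common mistake. The property names are Debezium's; the connection values, table names and the helper conflicting_filters are assumptions for illustration. The fragment is not a complete connector configuration, and how the adapter's configuration file embeds these properties is not shown here (see the sample configuration shipped with the adapter).

```python
# A fragment of Debezium MySQL connector properties (not a complete set).
# Host, credentials and table names are illustrative assumptions.
debezium_properties = {
    "connector.class": "io.debezium.connector.mysql.MySqlConnector",
    "database.hostname": "localhost",
    "database.port": "3306",
    "database.user": "mysqluser",
    "database.password": "mysqlpw",
    "database.include.list": "inventory",
    "table.include.list": "inventory.customers,inventory.orders",
    "column.exclude.list": "inventory.customers.email",
}


def conflicting_filters(properties):
    """Return the filter kinds that set both an include and an exclude list.

    Debezium expects only one of the two lists for each kind of filter.
    """
    return [
        kind
        for kind in ("database", "table", "column")
        if f"{kind}.include.list" in properties
        and f"{kind}.exclude.list" in properties
    ]


print(conflicting_filters(debezium_properties))
```

Running a check like this before starting the adapter is optional; it simply surfaces a configuration that sets both lists for the same kind of filter.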

Mapping to Diffusion Topics

Each row-specific event captured by Debezium is published to a Diffusion topic. Each row of the table is identified by its primary key. Hence, if a table does not have a primary key defined, the updates for this table will be ignored. The adapter supports four different ways to map these events to a Diffusion topic. The mapping to use is selected in the adapter's configuration.

  1. Object: A table is mapped to a JSON topic, with each row being a JSON object keyed by the table’s primary key.
  2. Array: A table is mapped to a JSON topic, with all rows as entries in an array. For a ‘Create’ event, new entries are added at the end of the JSON topic’s value. For a ‘Delete’ event, the corresponding item in the value of the JSON Diffusion topic is updated to null.
  3. Row: Each table row is mapped to an individual JSON topic. The topic path is built from the database name, table name, and primary key combination. For example: database/table/pk1,pk2.
  4. None: Same as Row, but the topic contents are exactly as received from Debezium. This will include schema information, as well as the table row data before and after the change.

NB: If a table has a composite primary key, the values of those keys are escaped and joined with ‘,’ to form the complete primary key combination, which is used in the Object and Row topic mappings, as defined above.
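The Object, Array and Row mappings described above can be sketched in a few lines of Python. This is an illustration of the resulting topic shapes only, not the adapter's implementation: the function names are hypothetical, escaping of key values is omitted, and the handling of deletes in the Object mapping and of updates in the Array mapping are assumptions that the text above does not specify.

```python
def primary_key(row, key_columns):
    """Join key values with ','. Escaping of key values is omitted here."""
    return ",".join(str(row[column]) for column in key_columns)


def apply_object(value, op, row, key_columns):
    """Object mapping: one JSON object per table, keyed by primary key."""
    key = primary_key(row, key_columns)
    if op == "delete":
        value.pop(key, None)  # assumption: a deleted row is removed
    else:
        value[key] = row
    return value


def apply_array(value, op, row, key_columns):
    """Array mapping: new rows are appended, deleted rows become null."""
    key = primary_key(row, key_columns)
    for index, entry in enumerate(value):
        if entry is not None and primary_key(entry, key_columns) == key:
            # assumption: an update replaces the existing entry in place
            value[index] = None if op == "delete" else row
            return value
    if op != "delete":
        value.append(row)
    return value


def row_topic_path(database, table, row, key_columns):
    """Row mapping: one topic per row, at database/table/primary-key."""
    return f"{database}/{table}/{primary_key(row, key_columns)}"


line = {"order_id": 10, "line_no": 2, "quantity": 5}
print(row_topic_path("inventory", "order_lines", line, ["order_id", "line_no"]))
```

With a composite key of order_id and line_no, the Row mapping in this sketch yields a path of the form database/table/pk1,pk2, matching the example given in the list.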

There are several other configuration options to configure the adapter. All of these can be viewed here.

Details about how to run the adapter can be found here.

Details about managing and monitoring the CDC adapter via the Diffusion console can be found here.

Quick start

  1. Ensure that the Diffusion server is running locally.
  2. Start a MySQL database server. For a quick start, we will use the MySQL Docker image provided by Debezium. This image contains a pre-populated database called ‘inventory’, which will be used for this example.
    1. Run the MySQL docker image:

      docker run -it --rm --name mysql -p 3306:3306 -e MYSQL_ROOT_PASSWORD=debezium -e MYSQL_USER=mysqluser -e MYSQL_PASSWORD=mysqlpw debezium/example-mysql:1.5

    2. Start up a MySQL client:

      docker run -it --rm --name mysqlterm --link mysql mysql:5.7 sh -c 'exec mysql -h"$MYSQL_PORT_3306_TCP_ADDR" -P"$MYSQL_PORT_3306_TCP_PORT" -uroot -p"$MYSQL_ENV_MYSQL_ROOT_PASSWORD"'

    3. Grant all privileges on the ‘inventory’ database to ‘mysqluser’:

      GRANT ALL PRIVILEGES ON inventory.* TO 'mysqluser'@'%';

  3. Set up the necessary configuration to pass to the adapter. A sample configuration file and the schema for the configuration can be found alongside the CDC adapter jar, in the adapters/cdc directory of the Diffusion installation.
  4. Run the adapter jar:

    java -jar cdc-adapter-6.7.jar ./configuration.json

    If you want to run the adapter with only the bootstrap configuration, you can pass it as system properties:

    java -Dgateway.client.id=testCdcAdapter -Ddiffusion.gateway.server.url=ws://localhost:8080 -Ddiffusion.gateway.principal=admin -Ddiffusion.gateway.password=password -jar cdc-adapter-6.7.jar

Once the adapter is up and running, navigate to the Diffusion console’s ‘Topic browser’ view to see that the data contained in the database is replicated to the JSON Diffusion topic, according to the provided configuration.

NB: The existing data will be visible only if snapshotting (fetching a snapshot of data from the database) is enabled in the configuration; it is disabled by default.

Rows inserted or updated in the database will then be reflected in the Diffusion topics.

