Technical and business leaders are increasingly looking for real-time insights to inform strategy and decision-making. And the ability to capture changes from operational systems is essential to enabling analytics on data in motion.
As previously discussed, Equalum’s CDC technology makes use of the transactional logs of relational databases – the most powerful solution for capturing changes in real-time, providing a low-overhead way to capture changes asynchronously, without any change to the application code.
But for leaders evaluating how CDC technology can help their organizations, it’s worthwhile to dig a layer deeper to explore five critical technical challenges on the path to realizing the economic value of a CDC solution. These range from upfront integration to maintenance and performance.
How does Equalum navigate the challenge of CDC implementation (including correctness, restart ability, low-latency, high-throughput, and more)? This post highlights five critical challenges along the path to a high-performing CDC technology.
The path to the economic value of a CDC solution
Challenge #1: Parsing the transaction log
Why It Matters: The first step to providing CDC (Change Data Capture) from a relational database is to parse its transaction log. Luckily, most databases actually provide some low-level API for parsing their own transaction logs, which can be used as a starting point for a CDC solution. Using a database (SQL) API also means that capturing changes can be done remotely, without installing a local agent on the production database server.
Why It’s Difficult: A database transaction log is typically stored in a vastly complex and ever-changing internal format, which is prohibitively hard to reverse-engineer and maintain over time. And most databases’ APIs have their own limitations, quirks, and bugs that need to be tamed before a reliable solution can be built.
The Equalum Solution: For each API, creative solutions need to be found to guarantee both correctness and performance. Let’s take, for example, the PostgreSQL CDC feature called Logical Decoding. It generally involves a database-side object called a replication slot, which is a pointer to a specific position in the transaction logs, and an API to parse the transaction log from a specific slot position, which optionally moves the slot forward. While it generally works, this mechanism tends to send duplicate data under load (due to a known bug), does not expose schema changes events (like ALTER TABLE … ADD COLUMN), and also does not provide any way to move the replication slot (pointer) backward, making any application-level recovery mechanism challenging. Equalum has engineered solutions that address these kinds of limitations and idiosyncrasies for PostgresSQL, along with many other APIs – ensuring accurate and performant transaction log processing.
Challenge #2: Synchronizing the capture starting point
Why It Matters: Accurate synchronization ensures that no changes are missed or duplicated when a new CDC configuration is set up.
Why It’s Difficult: Setting up a new CDC configuration involves two separate steps: 1) capturing the current values in a given table by running a regular SQL query, for example by running “SELECT * FROM table” and 2) capturing changes for this table moving forward. The challenge is that those two independent steps must actually be synchronized, even though each may start at a slightly different time. If they are not fully synchronized, we could miss a few changes or duplicate a few changes that happened as we start our capture.
The Equalum Solution: Equalum automatically synchronizes these startup steps using database-specific protocols, which generally involve operating on the internal database “clock” (typically called SCN – System Change Number) by either setting it at the query level or filtering out changes based on it.
Challenge #3: Handling tables schema
Why It Matters: Any CDC solution’s success depends on its ability to accurately interpret the data stored in a database table’s schema (a set of typed columns) and adapt to changes in table schema over time.
Why It’s Difficult: Each major database vendor has its own type system, with many subtle differences between them. These can range from non-standard names to non-standard behaviors. For example, Oracle’s “date ”type is actually a timestamp with a second-level precision, MySQL optionally allows storage of invalid dates and timestamps (like ‘0000-00-00’), etc. In addition, the schemas of the source tables may evolve over time as columns are added or removed.
The Equalum Solution: Equalum normalizes all its input to a standard type system, so processing flows can safely correlate data from multiple sources, and easily map the output of the flow into different types of targets. Also, in many cases, the type systems of the initial data capture (JDBC types) is not entirely consistent with the type system used in the database CDC API, so Equalum makes sure that data is extracted consistently. Additionally, Equalum automatically captures schema changes and responds based on a user-defined policy.
Challenge #4: Recovery and exactly-once
Why It Matters: A CDC solution must be able to reliably function even with network or database issues in order to ensure that end-users have uninterrupted access to accurate data.
Why It’s Difficult: Even in a well-maintained production environment, things will occasionally go wrong. This could range from network issues to the source databases going down on maintenance or unexpectedly to the host of the CDC solution failing or being killed.
The Equalum Solution: Equalum Connect agent is designed to handle all these cases correctly, and to continue streaming changes once the source system is reachable without ever losing an event or duplicating an event. How does it work? The are multiple aspects to it, but generally, it first involves tagging each change with its database-specific tracking metadata and periodically checkpointing a safe starting point (one that would not lose messages). Then, on recovery (after failure), it involves making sure the database source restarts streaming changes at or before the checkpoint position and filtering out new events that have already been fully processed, to avoid duplicate messages.
Challenge #5: Performance and tuning
Why It Matters: A viable CDC solution – especially for a large enterprise – must perform well on multiple dimensions including latency, throughput, and overhead on the source system.
Why It’s Difficult: Creating a high-performing CDC solution is a complex subject that demands the ability to knowledgeably navigate the specific APIs that the sources expose. Some sources, like Oracle databases, exposes multiple APIs with multiple options to choose from. In addition, there are considerations like how to reduce the number of roundtrips to the database, while still keeping low latency. Another consideration is whether it is possible to offload some part of the processing to the database itself or not.
The Equalum Solution: Equalum Connect makes most of the tuning decisions itself, but in some cases let the user provide input on his streaming requirements. For example, in some sources, the user can define its tolerated lag between when the databases changes happen and the time it will take Equalum to detect them, and Equalum will automatically use it to decide the proper tradeoff between latency and throughput.
Achieving a high-performing CDC solution involves taming the complexity of multiple systems, accounting for database changes and recovery scenarios, and ensuring robust performance and low overhead within a multi-flow organization. Equalum’s experience engineering a best-in-class CDC technology can help highlight critical challenges on the path to implementing a solution for data in motion.