BIK Information Services, Inc.

Oracle's Change Data Capture
By Boris Knizhnik (borisk@bikinfo.com)

To the guys who dared to grant me DBA_ROLE -
Dave LaPoint, Jim Kirsch, and Ran Pan


This article in slightly different form appeared in Pinnacle - in May issue of 2003 and in SELECT journal - in second quarter, 2004.


We appreciate any links to this site as they show your appreciation and work as free ads.
If you decide to include a link to our site on your site, please drop us a note.

TABLE OF CONTENTS

Setting up Subscriptions
Setting up a Subscriber
Creating and activating a subscription
Extending the window
Extend window and create a view
Processing change records
Source_colmap$
Dropping views and purging the window
Review
Clean-up
Conclusion
References

Introduction

Oracle Change Data Capture, or CDC, is one of the latest features of Oracle 9i to support Data Warehousing.
To set the record straight - CDC is NOT a Data Warehousing solution - it is just a tool to help manage data changes. The most typical need for Data Warehousing is the need to know what data has been changed since the last load.
There are other uses of this feature outside of Data Warehousing - daily activity reports against the transactional database, for instance.
Unless the online database of record contains timestamps, it is virtually impossible to find out which data has changed since last time you ran the refresh job. The usual alternatives are to either load the entire database every time you need to refresh, or use scripts using various business rules to determine which data has changed. Ad-hoc methods have their shortfalls and can't compete with a more direct way of just knowing what has changed. CDC was designed to solve this problem. Unfortunately, Oracle documentation is inadequate in explaining the overall process. Individual parts are explained, but it is hard to piece the entire process together. So, in this document we are trying to explain how various pieces of CDC fit together. Installation problems aren't addressed here.

Basic Concept

The basic concept behind CDC is simple. For each table that we wish to track changes, we create an INSERT, DELETE and UPDATE trigger that stores the updated information into a specified table. The key here is that CDC is transparent and automatic.
Using CDC, the source system DBA doesn't have to do much to set it up and it works on a known technology of triggers.

In CDC, for EACH table involved we have to create a table to keep track of the changes to the original table. This results in a group of tables with changes only. The developer or DBA then can decide what to do with these changes. This is very critical to understand. CDC doesn't provide developers with any tools to handle changes, but it provides them with a clean cut (point in time, committed transactions) set of tables containing changes to each corresponding source table. What to do with these changes - is up to developers. Third part tools maybe used here to go against the change tables or custom scripts written to go through changes and make adjustments to the datawarehouse fact and dimension tables.

General comments

Even though strictly speaking CDC process may all be done within one instance, it is easier to explain it assuming that the data source is in one instance and the "subscriber" is in a different instance. CDC is using publish - subscribe pattern.

Preparations

There are a few tedious processes involved in setting up the CDC process. They include the creation of change tables that most likely will be very similar to the original ones. If you din't want to spend most of your time working on typos resulting in infamous Oracle message "invalid column", we recommend using the file etl_util.sql - it contains a little package we have designed to help automate this process. There are two functions in this package that we need at the moment: get_cols_definition and get_col_names. The first one (get_cols_definition) creates a string that is usually used to define a table:
ABC 	VARCHAR2(10),BCD  NUMBER(20,2), CDE DATE	. . .
It creates this list based on the table passed to the function call as a source. By default, it creates the entire list of columns but there is a parameter that allows you to skip several columns in the table definition. The other function (get_col_names) creates a list of column names, again, based on the source table:
ABC, BCD, CDE,	. . .
There is also a possibility to skip a few columns.
Both functions list columns in alphabetical order.
Compile this package as 'system' and give it a public synonym - to make it callable from any schema.
Armed with these two functions, let's see how we can create one coherent example of setting up CDC:

Actors

In my opinion, part of the confusion about CDC is that no one makes clear who does which part of the setup. Let's settle it now. We will be working with the notorious tables emp and dept owned by SCOTT. We need three more actors in this setup - dba (SYSTEM), a publisher (BORIS_PUBLISHER) and a subscriber (BORIS_SUBSCRIBER). For each step, we will try to specify exactly who is running it.

Setting up publications

Setting up a publisher

In addition to the regular set of privileges and grants, BORIS_PUBLISHER has to be granted EXECUTE_CATALOG_ROLE and SELECT_CATALOG_ROLE privileges.
So our dba SYSTEM connects to the source instance and grants these privileges to BORIS_PUBLISHER:
connect system/manager@whatever
GRANT EXECUTE_CATALOG_ROLE to boris_publisher;
GRANT SELECT_CATALOG_ROLE to boris_publisher;
Now, a publisher has to have read access to all the tables. So, now as SCOTT (the table owner) we have to grant the publisher permissions to the source tables:
connect scott/tiger@whatever
GRANT SELECT on emp to boris_publisher;
GRANT SELECT on DEPT to boris_publisher;


Creating Change Tables

To publish the tables a publisher basically has to choose which columns of the source tables he/she wants to publish and create a table that would contain just these columns.
I am not sure why simply listing the column names isn't enough, but Oracle requires specifying the column names along with data type and length. I guess one can run into a situation where the original table would contain the column VARCHAR2(10) while in the change table one would want to see it as VARCHAR2(12), but if you just need the exact replica of each column you can use our little helper package.
Consider the following script (you can also find it at etl_publish.sql)
Create Change Tables
connect boris_publisher/boris_publisher@whatever
set serveroutput on size 1000000
set linesize 120
/* Login as a publisher to run this */
BEGIN
DECLARE work_sql VARCHAR2(32767);
   v_publisher_id VARCHAR2(20)  := 'boris_publisher';
   v_source_schema VARCHAR2(20) := 'scott';
   v_source_table VARCHAR2(100);
   v_cdc_table VARCHAR2(100);
   CURSOR c_tables IS
        SELECT 'emp' table_name FROM dual
	UNION
        SELECT 'dept' table_name FROM dual
        ;
BEGIN
   FOR tables IN c_tables LOOP
	v_source_table := tables.table_name;
	v_cdc_table := 'CDC_'||v_source_table;
	work_sql := etl_util.get_cols_definition(
			v_source_schema,v_source_table, '','');
	DBMS_LOGMNR_CDC_PUBLISH.CREATE_CHANGE_TABLE (
		OWNER => v_publisher_id, SOURCE_SCHEMA => v_source_schema
		,CHANGE_SET_NAME => 'SYNC_SET', CAPTURE_VALUES => 'both'
		,RS_ID => 'y', ROW_ID => 'n', USER_ID => 'y'
		, TIMESTAMP => 'y'
		,OBJECT_ID => 'n' -- leave it as 'N' or you will have 
					-- "table has no columns" error
		,SOURCE_COLMAP => 'y', TARGET_COLMAP => 'y'
		,OPTIONS_STRING => null
		,SOURCE_TABLE => v_source_table
		,CHANGE_TABLE_NAME => v_cdc_table
		,COLUMN_TYPE_LIST => work_sql);
	DBMS_output.put_line('Change table '||v_cdc_table 
			||' was created successfully');
  END LOOP;
  EXCEPTION
	WHEN OTHERS THEN
	DBMS_output.put_line(
		'Error start ********************************************');
	DBMS_output.put_line('Error during change table ' || v_cdc_table 
		|| ' creation:');
	DBMS_output.put_line(to_char(sqlcode)|| ' ' || sqlerrm);
	DBMS_output.put_line(
		'Error end   ********************************************');
END;
END;
/
Let's talk about this publishing script. It consists of a loop that creates a change table for each of the two tables we are setting up this feature for.
Of course, instead of
CURSOR c_tables IS
	SELECT 'emp' table_name FROM dual
	UNION
	SELECT 'dept' table_name FROM dual
you could have defined a view that contains all the tables for which you need to set up your CDC, or as the other extreme manually cut and paste similar code for each table.
We tried to set it up to minimize manual labor.
Now, let's look at the line:
work_sql := etl_util.get_cols_definition(v_source_schema,v_source_table, '','');
What we get in variable work_sql is something like this:
comm number(7,2), deptno number(2), empno number(4), ename varchar2(10)
, hiredate date, job varchar2(9), mgr number(4), sal number(7,2)
The package creates this string for you to save you some aggravation.
Now, if you didn't want to include columns hiredate and job in a list of columns to publish you would call the function like this:
work_sql := etl_util.get_cols_definition(v_source_schema,v_source_table, 'hiredate,job','');
Here we chose to name change tables as CDC_{original_table_name}. You are free to choose your own naming convention.
Parameter CAPTURE_VALUES => 'both' means that you want to capture both values - before and after change (you will have two records in your change tables).
Parameters SOURCE_COLMAP => 'y', TARGET_COLMAP => 'y' mean that in your change tables you want to include special columns that would allow you to figure out which of the columns have actually been changed.

If you look at the change table definitions, you will see that in addition to the columns requested, our tables have several additional columns.
describe cdc_emp describe cdc_dept
Name               Type 
----------------- ------------ 
OPERATION$        CHAR(2) 
CSCN$             NUMBER 
COMMIT_TIMESTAMP$ DATE 
RSID$             NUMBER 
USERNAME$         VARCHAR2(30)
TIMESTAMP$        DATE 
SOURCE_COLMAP$    RAW(128) 
TARGET_COLMAP$    RAW(128) 
COMM              NUMBER(7,2)
DEPTNO            NUMBER(2) 
EMPNO             NUMBER(4) 
ENAME             VARCHAR2(10)
HIREDATE          DATE 
JOB               VARCHAR2(9)
MGR               NUMBER(4) 
SAL               NUMBER(7,2)
Name              Type 
----------------- ------------ 
OPERATION$        CHAR(2) 
CSCN$             NUMBER 
COMMIT_TIMESTAMP$ DATE 
RSID$             NUMBER 
USERNAME$         VARCHAR2(30) 
TIMESTAMP$        DATE 
SOURCE_COLMAP$    RAW(128) 
TARGET_COLMAP$    RAW(128) 
DEPTNO            NUMBER(2) 
DNAME             VARCHAR2(14) 
LOC               VARCHAR2(13)
What exactly will these change table track? Whenever the source table is changed one or two records are inserted in its corresponding change table.
For Delete operation operation$ column will contain 'D' and the columns published by a publisher will contain the values of the deleted row.
For Insert operation$ column will contain 'I' and published columns will contain new values.
Updated rows will generate a record with operation$='UU' (if you chose to capture both - old and new values) and all the old values, and another record with operation$='UN' or 'UL' with new values.
Note, that these records for update will be generated even if a publisher didn't choose to publish columns that were changed (for instance, if publisher didn't publish column job, and the program updated only job column the record would still be inserted in the change table). So it would appear that old and new records are the same.
Technically the target_colmap and source_colmap should help in figuring out what exactly has been changed, but there is no official documentation on how these columns work.
Also, another shortcoming in CDC logic is that even if the program changes values of columns to the same ones it still considers them "changes" and creates a record in change tables.
Our guess is that internal processing logic relies on a list of columns in UPDATE statements rather than really on change.

Basically, we are done with what a publisher does to set everything up. There are other more complicated situations, which I will not get into now - for instance, a publisher may want to set up several publishing scenarios for the same table - publish all the columns in one change table and only few columns in another change tables for the same source table. This may be a part of elaborate security implementation. Again, once the basics are understood one can easily find the details in documentation - see References.

Let's look what we have accomplished: We as publishers have just set up the change tables for all (well, all two) our tables - promising to publish changes to the original tables.
The change tables belong to a publisher's schema (which doesn't have to be the same as the owner's schema) and reside within the same instance as the source. Assuming that the subscribers belong to a different instance this setup isolates change tables from connectivity problems. Whatever happens to the database links the changes keep being collected. It also gives publisher's better control over data.

NOTE: As soon as the change table is created it starts collecting the changes.

Setting up Subscriptions

Setting up a subscriber

The subscriber BORIS_SUBSCRIBER has to be granted select permissions to source and change tables:
CONNECT scott/tiger@whatever
GRANT SELECT ON emp TO boris_subscriber;
GRANT SELECT ON dept TO boris_subscriber;
CONNECT boris_publisher/boris_publisher
GRANT SELECT ON cdc_emp TO boris_subscriber;
GRANT SELECT ON cdc_dept TO boris_subscriber;
Note, that one would think that subscriber doesn't need permissions to the source tables, but Oracle security is such that unless the subscriber is allowed access to the source tables he/she won't be able to access the change tables. Note also, that the subscriber is the user on the publisher's machine. If the real subscriber is on a different instance of Oracle boris_subscriber would be the one used for a dblink.

Creating and activating a subscription

The fact that a publisher published a few tables doesn't automatically mean that the subscriber will need them all. Moreover, there could be several subscribers potentially subscribing to different sets of tables from the published set. And to add more complexity to the situation even when the subscriber chooses a specific table he/she may not want to see all the columns that a publisher defined. So, the subscriber's goal is to indicate which tables and which columns in them he/she is interested in. Since potentially there could be many tables involved and various lists of columns Oracle decided to give the subscriber a way to collectively manage all the table/column combinations he/she is working with. This is done through something called HANDLE. The idea is to request and receive a handle to your subscription first, and then tie all your table/column combinations to this handle so you have something tangible to work with. When you create a handle, you can specify a description. We find description to be much more useful for purposes of working with CDC than using a handle simply because it is a text rather then a number and when you need to drop subscription and start over the handle will change, but the description won't (you are the one assigning it a value). We will demonstrate this a bit later. But let's see what we need to do to subscribe to all the columns of both tables: EMP and DEPT. Consider the script below (you can also find it online at etl_subscribe.sql) Please note that now you have to be a subscriber to run this.
Subscribe all the tables, activate subscription, extend CDC window
connect boris_subscriber/boris_subscriber@whatever
set serveroutput on size 1000000
set linesize 120

/* Login as a subscriber to run this */
BEGIN	
	DECLARE vSubhandle NUMBER;		
		v_subscription_description VARCHAR2(40):='scott -> Datawarehouse';
		v_publisher_id VARCHAR2(20) := 'BORIS_PUBLISHER';
		v_source_schema VARCHAR2(20) := 'SCOTT';
		v_subscriber_id VARCHAR2(20) := 'BORIS_SUBSCRIBER';
		v_source_table VARCHAR2(100);
		v_cdc_table VARCHAR2(100);
		col_names VARCHAR2(32767);
		v_err VARCHAR2(10):='OK';
		CURSOR c_tables IS
			SELECT 'emp' table_name FROM dual
			UNION
			SELECT 'dept' table_name FROM dual
		;
BEGIN
-- find/create a subscription
	BEGIN	  
		vSubhandle :=0;	
		-- get the handle
		SELECT handle INTO vSubhandle FROM all_subscriptions 
		WHERE description = v_subscription_description;	
	EXCEPTION		
		WHEN OTHERS THEN
			vSubhandle := 0;
			dbms_output.put_line('Subscription '
				||v_subscription_description
 				|| ' does not exist in the database');
	END;	
	BEGIN
		IF (vSubhandle = 0) THEN
			-- create handle
			DBMS_CDC_SUBSCRIBE.GET_SUBSCRIPTION_HANDLE(
				CHANGE_SET => 'SYNC_SET',
 				DESCRIPTION => v_subscription_description,
				SUBSCRIPTION_HANDLE => vSubhandle
			); 			
			dbms_output.put_line('Obtained handle '
				||TO_CHAR(vSubhandle)
 				|| ' for subscription '||v_subscription_description
 			);
		END IF;	  
	EXCEPTION	
		WHEN OTHERS THEN
			DBMS_output.put_line(
				'Error start ***********************************');
			DBMS_output.put_line(
				'Error getting a handle '	
				|| ' for a CDC subscription ' 
				|| v_subscription_description|| ':');
			DBMS_output.put_line(to_char(sqlcode)|| ' ' || sqlerrm
				|| '  Handle = ' ||TO_CHAR(vSubhandle));
			DBMS_output.put_line(
				'Error end   ***********************************');
			v_err :='ERROR';
	END;
	-- Subscribe for all the tables	
	FOR tables IN c_tables LOOP
		v_source_table := UPPER(TRIM(tables.table_name));
		BEGIN	     
			IF (v_err = 'ERROR') THEN RETURN; END IF;
		        v_cdc_table := 'CDC_'||v_source_table;
	 	        col_names := etl_util.get_col_names(
				v_source_schema,v_source_table, '','');
			DBMS_LOGMNR_CDC_SUBSCRIBE.SUBSCRIBE (vSubhandle
				, v_source_schema,
                       		v_source_table, col_names);
			DBMS_output.put_line('Added table '|| v_source_schema ||'.'
				|| v_source_table
 				|| ' to the CDC subscription ''' 
				|| v_subscription_description 
				|| ''' successfully. Handle = '||TO_CHAR(vSubhandle));
	   	EXCEPTION
 		   WHEN OTHERS THEN
			DBMS_output.put_line(
			  'Error start ***********************************');
			DBMS_output.put_line('Error adding table '
			  || v_source_schema||'.'||v_source_table
			  || ' to the CDC subscription ''' 
			  || v_subscription_description
	   		  || ''':');
			DBMS_output.put_line(to_char(sqlcode)|| ' ' || sqlerrm 
			  || '  Handle = ' ||TO_CHAR(vSubhandle));
			DBMS_output.put_line(
			  'Error end   ***********************************');
			v_err :='ERROR';
		END;
	END LOOP;
	BEGIN 	
-- Activate the subscription
		IF (v_err = 'ERROR') THEN RETURN; END IF;
	        DBMS_CDC_SUBSCRIBE.ACTIVATE_SUBSCRIPTION(vSubhandle);
	   	DBMS_output.put_line('CDC subscription '''
			|| v_subscription_description
			||''' was successfully activated. Handle = '
			||TO_CHAR(vSubhandle));
	EXCEPTION
		WHEN OTHERS THEN
			DBMS_output.put_line(
			   'Error start ***********************************');
			DBMS_output.put_line(
			   'Error activating the CDC subscription '''
			   || v_subscription_description || ''':');
			DBMS_output.put_line(to_char(sqlcode)|| ' ' 
			   || sqlerrm || '  Handle = '
			   ||TO_CHAR(vSubhandle));
			DBMS_output.put_line(
			   'Error end   ***********************************');
			v_err :='ERROR';
	END;
-- Extend the window immediately
	BEGIN	
-- Extend the window for subscription
		IF (v_err = 'ERROR') THEN RETURN; END IF;
	        DBMS_CDC_SUBSCRIBE.EXTEND_WINDOW(SUBSCRIPTION_HANDLE=>vSubhandle);
	        DBMS_output.put_line('Window to the CDC subscription '''
			|| v_subscription_description
			|| ''' was successfully extended');
    	EXCEPTION
	WHEN OTHERS THEN
		DBMS_output.put_line(
			'Error start ***********************************');
		DBMS_output.put_line(
			'Error during window extention to CDC subscription '''
			||v_subscription_description || ''':');
		DBMS_output.put_line('Handle # '||TO_CHAR(vSubhandle));
		DBMS_output.put_line(to_char(sqlcode)|| ' ' || sqlerrm);
		DBMS_output.put_line(
			'Error end   ***********************************');
		v_err :='ERROR';
   	END;
END;
END;
/
Here is the output:
Added table SCOTT.DEPT to the CDC subscription 'scott -> Datawarehouse' successfully. 
Handle = 87
Added table SCOTT.EMP to the CDC subscription 'scott -> Datawarehouse' successfully. 
Handle = 87
CDC subscription 'scott -> Datawarehouse' was successfully activated. 
Handle = 87
Window to the CDC subscription 'scott -> Datawarehouse' was successfully extended
Again, let's talk about the script above. The first step gets a new subscription handle and associates the description with it (more on that later). The next two steps just add tables to the subscription. The fact that we added tables doesn't automatically mean that everything starts working. You need to activate the subscription. This is done in the third step here. You could do this at a later time, but then you would need to use the same handle you created for the subscription. In the next piece of code we will show you the way we propose to do that, though it is certainly not the only one. In our demonstration case we just don't see any reason not to activate the subscription right away.

If you no longer want to process data you need to drop the subscription.

Now, there is a little logistical problem here - let's say we processed data in a change table. It may take some time. In the meantime new records could have been stored in this change table. You have to make sure not to see any new records that may come in the middle of your processing. Plus, there is another side to it - let's say everything went smoothly and you have processed the records. Next time your processing program kicks in you may have a few more records in those tables. How are you going to tell the old processed records from the new ones?
In a minute we are going to explain the way to handle it, but the last step with an mysterious call to Extend_window procedure belongs to this logic. We don't see why we shouldn't run the "Extend window" step right away, which is why it is included in this script. You can organize the sequence in a different way, but remember that you always have to mention the same handle value you got when created the subscription.
Important: Oracle documentation leaves the impression that you have to get a new handle every time you want to do something in CDC. But the problem is that if you call get_handle again you will get a different value, the one to which you never tied your subscription (get_handle method is apparently just a call to a brainless sequence generator). So whatever else you want to do - activate, extend window, create view, etc. all these steps will fail, because you don't refer to the same subscription. There is also another caveat - even if you use the same value, but have logged in as somebody other then the one who initially got that handle (subscriber), the system will behave exactly in the same manner - as if you supplied a wrong handle.

Extending the window

This step seems to be the most confusing part of the entire CDC process, because the name of this procedure is somewhat misleading. You hear "extending" and you want to know the time period through which you are extending the window and can't find it. The reason for that is that this procedure doesn't do that at all. By contrast, it does just the opposite - it marks the end of a data collection cycle for the view that you will need to create in the next step, and while it still allows the new data to be collected in the change table, they are "invisible" to your code. Here, the basic idea is this: you will be accessing the change data only through the view. This view is created based on System Change Numbers (SCN) collected in change tables. When you request a view creation, the system creates a view for you, which includes all the change records collected between the last two extend_vindow calls. That is why we are calling the first extend_window call now.

More on this further.

Extend window and create a view

What we were doing so far is supposed to be done only once. Now we are getting into a cyclical part. You may remember we mentioned in the beginning that CDC process is NOT a datawarehouse solution, but rather lays a nice ground work for it. So, at this point on one end Publisher created change tables and is constantly collecting change records. On another end Subscriber specified which of these tables he/she is interested in. It is time now to somehow start processing change records. Reading change tables directly is not recommended by Oracle, because the tables are not stable - number of records keeps growing while your datawarehouse process reads these records. The solution is to create views that give you a fixed set of records for each underlying change table. After your datawarehouse script finishes processing records you may drop this view and when you are ready for a next cycle the process begins again.
To create the view you have to first issue an extend_window call and then a create_view call. That way the create_view call would include only records that were posted from your previous call to extend_window until your last call to extend_window. Remember, we made a call to extend_window right after creating the subscription? So, if first time you run your cycle you call extend_window and then create_view the first capture would be from the time you activated your subscription to this first run. Next time, it would be from the first run to the second run, etc.
Take a look at the script below (and also etl_extend_create_views.sql)
Extend substription window and create CDC views
connect boris_subscriber/boris_subscriber@whatever
set serveroutput on size 1000000
set linesize 200
/* Login as a subscriber to run this */
BEGIN	
DECLARE vSubhandle NUMBER;
	v_subscription_description VARCHAR2(40):='scott -> Datawarehouse';
	v_publisher_id VARCHAR2(20) := 'BORIS_PUBLISHER';
	v_source_schema VARCHAR2(20) := 'SCOTT';
	v_subscriber_id VARCHAR2(20) := 'BORIS_SUBSCRIBER';
	v_source_table VARCHAR2(100);
	v_cdc_table VARCHAR2(100);
	v_cdc_view_name VARCHAR2(40);
	our_view_name VARCHAR2(200);
	vSQL VARCHAR2(1000);
	CURSOR c_tables IS
		SELECT 'emp' table_name FROM dual
		UNION
		SELECT 'dept' table_name FROM dual
	;
BEGIN
   -- extend the window   
   BEGIN
	-- get the handle
	SELECT handle INTO vSubhandle FROM all_subscriptions
 		     WHERE description = v_subscription_description;
	-- Extend the window for subscription
	DBMS_CDC_SUBSCRIBE.EXTEND_WINDOW(SUBSCRIPTION_HANDLE=>vSubhandle);
	DBMS_output.put_line('Window to the CDC subscription '''
		|| v_subscription_description
		|| ''' was successfully extended');
     EXCEPTION	WHEN OTHERS THEN
	  DBMS_output.put_line(
		'Error start *************************************');
	  DBMS_output.put_line(
		'Error during window extention to CDC subscription '''
		||v_subscription_description || ''':');
	  DBMS_output.put_line('Handle # '||TO_CHAR(vSubhandle));
	  DBMS_output.put_line(to_char(sqlcode)|| ' ' || sqlerrm);
	  DBMS_output.put_line(
		'Error end   *************************************');
   END;
   -- create subscriber views/synonyms
   BEGIN
	FOR tables IN c_tables LOOP
	   -- create a view
	   v_source_table := UPPER(TRIM(tables.table_name));
	   BEGIN
	      v_cdc_table := 'CDC_'||v_source_table; 
	      v_cdc_view_name:=v_cdc_table|| '_vw';
	      DBMS_CDC_SUBSCRIBE.PREPARE_SUBSCRIBER_VIEW(
			SUBSCRIPTION_HANDLE=> vSubhandle
			,SOURCE_SCHEMA => v_source_schema
			,SOURCE_TABLE => v_source_table
			,VIEW_NAME => our_view_name);
	      DBMS_output.put_line('Subscriber view '''
		||our_view_name|| ''' was successfully created for table '
		|| v_source_schema||'.'||v_source_table);
  	   EXCEPTION
		WHEN OTHERS THEN
		  DBMS_output.put_line(
			'Error start **********************************');
		  DBMS_output.put_line(
		     'Error during creation of subscriber view'
			|| for source table '
			||v_source_table||': ');
		  DBMS_output.put_line(to_char(sqlcode)|| ' ' || sqlerrm 
			|| ' Handle   #'||TO_CHAR(vSubhandle));
		  DBMS_output.put_line(
			'Error end   **********************************');
	   END;
	   -- drop the previous synonym
	   BEGIN
		vSQL := 'DROP SYNONYM ' || v_cdc_view_name;
		EXECUTE IMMEDIATE vSQL;
	        EXCEPTION WHEN OTHERS THEN NULL;
	   END;	   -- create a private synonym to point to this view:
	   BEGIN
		vSQL := 'CREATE SYNONYM ' || v_cdc_view_name 
		||' FOR '|| our_view_name;
		EXECUTE IMMEDIATE vSQL;
		DBMS_output.put_line('Private synonym ''' || v_cdc_view_name
 			|| ''' for view ''' || our_view_name 
			|| ''' was successfully created.');
	        EXCEPTION
		  WHEN OTHERS THEN
			DBMS_output.put_line(
				'Error start ******************************');
			DBMS_output.put_line(
				'Error during creation of the synonym '''
				|| v_cdc_view_name || ''' for view '''
				|| our_view_name || ''': ');
			DBMS_output.put_line(to_char(sqlcode)|| ' ' || sqlerrm);
			DBMS_output.put_line(
				'Error end   ******************************');
	   END;
	END LOOP;
   END;
END;
END;
/
When you run this you script you will get output similar to this:
Window to the CDC subscription 'scott -> Datawarehouse' was successfully extended
Subscriber view 'CDC#CV$8757846' was successfully created for table SCOTT.DEPT
Private synonym 'CDC_DEPT_vw' for view 'CDC#CV$8757846' was successfully created.
Subscriber view 'CDC#CV$8757848' was successfully created for table SCOTT.EMP
Private synonym 'CDC_EMP_vw' for view 'CDC#CV$8757848' was successfully created.
A couple of words about this script: As you can see, the view names that CDC system assigns to you aren't good to be used in your PL/SQL or Java program. So we propose to assign private synonyms to the views assigned by CDC. That way when you run this script on subsequent occasions you will still have the same view name to work with. If you happen to look at the views you just created you will see something like this:
CREATE OR REPLACE VIEW CDC#CV$8757846 ( 
	OPERATION$, CSCN$, COMMIT_TIMESTAMP$, TIMESTAMP$, USERNAME$, 
	TARGET_COLMAP$, SOURCE_COLMAP$, RSID$, DEPTNO, 
	DNAME, LOC ) AS 
SELECT 
	OPERATION$, CSCN$, COMMIT_TIMESTAMP$, TIMESTAMP$, USERNAME$, 
	TARGET_COLMAP$, SOURCE_COLMAP$, RSID$, "DEPTNO", 
	"DNAME", "LOC" FROM "BORIS_PUBLISHER"."CDC_DEPT" 
WHERE CSCN$ >= 40802127 AND CSCN$ <= 41013754 WITH READ ONLY
Condition WHERE CSCN$ >= 40802127 AND CSCN$ <= 41013754 is really the whole point of this view - these two CSCN$ values are collected as you issue an EXTEND_WINDOW call and this is how CDC keeps your transactions separate: The ones you processed previously have CSCN$ < 40802127 (or whatever it will be in your case) and your new ones, that are posted to CDC_dept change table will have CSCN$ > 41013754. It is really that simple.

Also, this is the first time we really need to call a handle that we created in previous runs. Oracle leaves it up to you to somehow know the handle you need. We think selecting it by trying to match the description is a good way of doing it (assigning the subscription ID would probably be a better way, but this is not an option Oracle gives you).

Processing change records

This is really where you write your own code to do whatever you want to do with the change transactions. There are several comments to make here. First of all - remember that if you are using CDC feature you get only table by table changes. If you were thinking of joins of your CDC views to update your resulting tables - you better think again. For example if your target database table contains emlpoyee information and in addition to dept_no you decided to keep dept_name in it, you cannot do the join of cdc_emp_vw and cdc_dept_vw to get the name, because when they update an employee, they don't have to update department records and therefore you won't have a matching dept record in your cdc_dept_vw. You may need to go to the real table to do that. Another comment is that most likely your target database will be different from the source so you may need to establish a dblink, and a set of synonyms on your target environment to access the CDC views on your source machine. And the last comment is that using CDC gives you an interesting possibility to do the initial load using the same set of scripts that you will have for going against CDC views. Just consider the following:
CREATE VIEW CDC_EMP_VW AS
SELECT'I' operation$, 1 cscn$, SYSDATE commit_timestamp$, 1 rsid$
, 'initial_load' username$, SYSDATE timestamp$, HEXTORAW('FEFFFFFF)' SOURCE_COLMAP$
, HEXTORAW('FEFFFFFF') TARGET_COLMAP$ , t.* 
FROM emp t;
This view looks exactly as the matching view that we built using the CDC feature. Now, granted, it requires some more work to set up (for instance, the code above is really a view, whereas CDC_EMP_VW was really a synonym to a view with a weird name), but potential savings on not doing double work are well worth considering.

The data in CDC views (real ones) has different operation$ codes - UU (old values), UN or UO (new values), 'I' (inserted values), and 'D' (deleted values). You can build your logic accordingly.
Important: Don't forget when processing records from the views order them by CSCN$. This is the only way to ensure that your script will read "UN" after "UU" for the same changed record.

Source_colmap$

Both SOURCE_COLMAP$ and TARGET_COLMAP$ columns are left completely unexplained by Oracle. Since it is undocumented and therefore is not supported by Oracle please take the rest of this paragraph for what it's worth.

These columns are supposed to have information about which columns have been changed in update statement.
We couldn't figure out the TARGET_COLMAP$ column, but received some encouraging results about SOURCE_COLMAP$ column. In all_tab_columns view for the source table there is a column column_id. It represents the column number in a table definition. It seems Oracle uses a bit map to present the columns starting not with bit 0 but with bit 1.
Column id Binary Value Hex Value
1 10 2
2 100 4
3 1000 8
4 10000 10
5 100000 20
6 1000000 40
7 10000000 80
8 00000000 00000001 00 01
9 00000000 00000010 00 02
. . . . . . . . .
16 00000001 00000000 00000000 01 00 00
The complications come when you display the columns in hexadecimal format.
Apparently Oracle's inner presentation of the values is as set of binary words (two bytes) and they are usually for historical reasons reversed in memory presentation - the least significant byte comes first and the most significan byte follows.
So in our case should the columns 2 and 3 were changed the SOURCE_COLMAP$ would be displayed as 0C (1002 + 10002 = 11002=0Chex). Change in columns 2, 3 and 5 would be displayed as 2C (1002+10002+1000002=1011002=2Chex). But when it gets to column 8 we get something that looks weird even though it makes sense: change in column 2, 3, and 8 will result in code: 0C01hex - the number spils to the byte to the right, rather than what one would expect - towards more significant digits. Anyway, we are including the third function in package etl_utils, get_modified_names, which for a given table and SOURCE_COLMAP$ will build a list of column names that were changed. The functions is included for illustration purposes only (The idea of using the all_tab_columns may be not a bad one, but doing it dynamically is not exactly a performance enhancer).

Another comment is that while learning what columns have been changed maybe important to your script, doing it using SOURCE_COLMAP$ may not give you correct results anyway, since Oracle doesn't check whether values really changed. It grabs columns that were mentioned in the UPDATE statement even if this statement is assigning the same values back.

Dropping views and purging the window

Now that you have run your update scripts, it's time to finish the loop. Remember, we started the loop by extending the window and creating cdc views. Now we need to drop these views, so nothing will run against them, necessitating us to purge the data in the change tables that correspond to those views. Here is the script (and also at etl_drop_views.sql):
Drop CDC Views and Purge CDC Window
connect boris_subscriber/boris_subscriber@whatever
set serveroutput on size 1000000
set linesize 200
/* Login as a subscriber to run this */
BEGIN
	DECLARE vSubhandle NUMBER;
		v_subscription_description VARCHAR2(40):=
			'scott -> Datawarehouse';
		v_publisher_id VARCHAR2(20) := 'BORIS_PUBLISHER';
		v_source_schema VARCHAR2(20) := 'SCOTT';
		v_subscriber_id VARCHAR2(20) := 'BORIS_SUBSCRIBER';
		v_source_table VARCHAR2(100);
		v_cdc_table VARCHAR2(100);
		v_cdc_view_name VARCHAR2(40);
		our_view_name VARCHAR2(200);
		vSQL VARCHAR2(1000);
		v_err VARCHAR2(10):='OK';
		CURSOR c_tables IS
			SELECT 'emp' table_name FROM dual
				UNION
			SELECT 'dept' table_name FROM dual
		;
BEGIN
	-- find a subscription
   	BEGIN
	     vSubhandle :=0;
	     -- get the handle
	     SELECT handle INTO vSubhandle FROM all_subscriptions
 	     WHERE description = v_subscription_description;
	EXCEPTION
		WHEN OTHERS THEN
		   vSubhandle := 0;
		   dbms_output.put_line('Subscription '
			||v_subscription_description
 			|| ' does not exist in the database'
		   );
	END;
	FOR tables IN c_tables LOOP
	   v_source_table := UPPER(TRIM(tables.table_name));
	   BEGIN
		v_cdc_table := 'CDC_'||v_source_table;
		v_cdc_view_name:=v_cdc_table || '_vw';
		-- drop the synonym
		BEGIN
		   vSQL := 'DROP SYNONYM ' || v_cdc_view_name;
		   EXECUTE IMMEDIATE vSQL;
		  EXCEPTION WHEN OTHERS THEN NULL;
		END;
		-- drop the subscriber view
	        DBMS_CDC_SUBSCRIBE.DROP_SUBSCRIBER_VIEW(
			SUBSCRIPTION_HANDLE=> vSubhandle
			,SOURCE_SCHEMA => v_source_schema
			, SOURCE_TABLE => v_source_table);
		DBMS_output.put_line('Subscriber View for CDC table ''' 
		|| v_cdc_table 
		|| ''' was successfully dropped.  Handle # '
		||TO_CHAR(vSubhandle));
	   EXCEPTION	
		WHEN OTHERS THEN
		  DBMS_output.put_line(
		    'Error start ********************************');
		  DBMS_output.put_line(
		    'Error during dropping the subscriber view for '''
 		    || v_cdc_table || '''  '
		    || ' to the CDC subscription ''' 
		    || v_subscription_description || ''':');
		  DBMS_output.put_line(to_char(sqlcode)|| ' ' 
		    || sqlerrm 
		    || '  Handle = '||TO_CHAR(vSubhandle));
		  DBMS_output.put_line(
		    'Error end   ********************************');
	   END;
	END LOOP;
	BEGIN
	-- purge window
		DBMS_CDC_SUBSCRIBE.PURGE_WINDOW(
		    SUBSCRIPTION_HANDLE=> vSubhandle);
		DBMS_output.put_line(
		    'Subscriber Window for subscription ''' 
		    || v_subscription_description 
		    || ''' was successfully purged ');
	EXCEPTION	WHEN OTHERS THEN
	  	DBMS_output.put_line(
			'Error during subscriber view drop: ');
		DBMS_output.put_line(to_char(sqlcode)|| ' ' 
		   || sqlerrm
		   ||'  Handle # '||TO_CHAR(vSubhandle));
	END;
END;
END;
/
Here is the output:
Subscriber View for CDC table 'CDC_DEPT' was successfully dropped.  Handle # 86
Subscriber View for CDC table 'CDC_EMP' was successfully dropped.  Handle # 86
Subscriber Window for subscription 'scott -> Datawarehouse' was successfully purged
Purging the views is an interesting step. This is another case of misleading names.
Remember, our subscription may not necessarily be the only subscription in the system. We may be done with our processing of the data but someone else's application may not. Purging data in this case may not be a good idea.
What call to purge_window really means is that we report being done with the data up to the last CSCN$ we used for our views.
When all the subscribers report that they are done with theirs, then publisher will be able to really purge the data to the least CSCN$ of all the subscribers.

Review

So, you are going to design some kind of a process that would kick in every so often and do the following: The frequency of this cycle obviously depends on how much time your update process takes.
Note 1:
If your source database is really on another instance your update process will be the one with a lot of @db_link tables.
Note 2:
It would be wise for you to design the update process in such a way that it could be applied again without causing problems. For instance, you may want to treat inserts as updates if the key already exists in a target database or that deletes don't really delete anything (this will happen if you are running your update script the second time) - that allows better debugging the scripts.
Note 3:
We recommend slightly different sequence of steps for production environmen
This sequence leaves your CDC views intact between runs and you can do the research what went wrong between runs.
Our view of the correctly designed CDC process is this:
Note 4:
Your update script may run quickly or take a long time, depending on the intensity of updates in the system. You better design your scripts in such way that they won't run into each other.

Clean-up

From our own experience, we can guarantee that you are going to make a lot of mistakes before setting everything up "just so", so we are providing a script that you may use to undo the changes and start over (See etl_undo_cdc.sql)

Conclusion

Oracle CDC is a new and exciting feature that nicely complements the Oracle suite of tools for data warehousing and generic replication. Mastering it and making it work flawlessly requires a fair degree of effort, but it can help your organization handle data movement in a more efficient and systematic manner.

References

Something strange is happening to Oracle's CDC documentation - it disappears from servers that used to have it. I updated these links at least 3 times and every time at some point the links become broken. Sorry, folks, but I tried. Here is where I found them last time:
  1. DBMS_LOGMNR_CDC_PUBLISH - http://hoth.stsci.edu/public/920doc/appdev.920/a96612/d_logpub.htm.
  2. DBMS_LOGMNR_CDC_SUBSCRIBE - http://hoth.stsci.edu/public/920doc/appdev.920/a96612/d_logsub.htm.
  3. Change Data Capture - http://os.cc.biu.ac.il/documentation/oracle/9.0.2_doc/server.920/a96520/cdc.htm.

We hope you found this article useful.
If you have any comments about this article please send them to Boris Knizhnik  borisk@bikinfo.com.
We appreciate any links to this site as they show your appreciation and work as free ads. If you decide to include a link to our site on your site, please drop us a note.

 
Last modified on Saturday, 09-Apr-2005 09:52:57 EDT


Back to BIK Information Services Home page Back to BIK Information Services home page  Go to Top Go to top