Friday, November 2, 2012

Capture data as a service (LinkedIn Databus)

1.ARCHITECTURE


 The Databus architecture is split into four logical components:
  • a fetcher which extracts changes from the data source or another Databus component,
  • a log store which caches this change stream,
  • a snapshot store which stores a moving snapshot of the stream, and
  • a subscription client which pulls change events seamlessly across the various components and surfaces them up to the application.
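
To make the data flow concrete, the four components could be expressed as the Java interfaces sketched below. This is only an illustrative sketch: the names (ChangeFetcher, LogStore, SnapshotStore, SubscriptionClient) and the method signatures are assumptions, not the actual Databus API.

  import java.util.List;
  import java.util.function.Consumer;

  // Illustrative sketch only; names and signatures are assumptions,
  // not the real Databus interfaces.
  interface ChangeEvent {
      long scn();        // commit sequence number of the change
      byte[] payload();  // serialized row image
  }

  // Fetcher: extracts changes from the data source or another Databus component.
  interface ChangeFetcher {
      List<ChangeEvent> fetchSince(long lastScn);
  }

  // Log store: caches a bounded window of the change stream, ordered by SCN.
  interface LogStore {
      void append(List<ChangeEvent> events);
      List<ChangeEvent> readSince(long lastScn);
      long oldestAvailableScn();  // consumers older than this must use the snapshot
  }

  // Snapshot store: keeps a moving snapshot so lagging consumers can catch up.
  interface SnapshotStore {
      List<ChangeEvent> scanFrom(long lastScn);
  }

  // Subscription client: pulls from the log store when possible, falls back to
  // the snapshot store, and surfaces events up to the application callback.
  interface SubscriptionClient {
      void poll(long lastScn, Consumer<ChangeEvent> callback);
  }

The key property is that the application only ever talks to the subscription client; whether an event was served from the log store or the snapshot store is hidden behind the same pull interface.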

2.Oracle Adapter


  A simple approach to get the change log from Oracle is to have a timestamp column with every row. A trigger on the table updates the timestamp column with the current time on an insert or update to the row. The adapter then issues a query to the database to get all the changed rows. This, however, has a problem: timestamps in this mechanism are set at the time of the change to the row, not at transaction commit. Long-running transactions might have rows that changed long before the transaction finally commits, so this query will miss changes from the database. For example, suppose tx1 modifies a row at time t1 and tx2 modifies a row at time t2, where tx2 commits before tx1 but t2 > t1. If the query happens between the two commits, lastTimeStamp becomes t2 and tx1's changes are missed. We can try some padding and re-read rows that changed since lastTimeStamp - n seconds, but this is very error prone. A sketch of this naive poller follows below.
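
As a concrete illustration of the flaw, such a timestamp-based poller might look like the JDBC sketch below. The table T and its last_modified column are hypothetical; a trigger is assumed to stamp last_modified with the current time on every insert or update.

  import java.sql.*;
  import java.util.function.Consumer;

  // Naive timestamp-based change poller (illustrative; table T and column
  // last_modified are hypothetical).
  public class TimestampPoller {
      // Returns the new watermark to use as lastTimeStamp on the next poll.
      static Timestamp pollOnce(Connection conn, Timestamp lastTimeStamp,
                                Consumer<ResultSet> handler) throws SQLException {
          Timestamp watermark = lastTimeStamp;
          String sql = "select * from T where last_modified > ?";
          try (PreparedStatement ps = conn.prepareStatement(sql)) {
              ps.setTimestamp(1, lastTimeStamp);
              try (ResultSet rs = ps.executeQuery()) {
                  while (rs.next()) {
                      handler.accept(rs);
                      Timestamp ts = rs.getTimestamp("last_modified");
                      if (ts.after(watermark)) watermark = ts;
                  }
              }
          }
          // Flaw: a long-running transaction stamps its rows before it commits.
          // If it commits after this query ran, its timestamps can be older than
          // the watermark we just advanced past, so a later poll never sees them.
          return watermark;
      }
  }
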
   Oracle 10g and later versions provide the ora_rowscn pseudo column, which contains the internal Oracle clock (SCN - system change number) at transaction commit time. By default, ora_rowscn is available at the block granularity, but tables can be created with an option to provide ora_rowscn at the row granularity. We can now query the database to fetch all rows that changed since the last SCN, but unfortunately ora_rowscn is not an indexable column.
   To get around this problem, we add a regular column scn to the table and create an index on it. The default value of scn is set to infinity. After commit, the ora_rowscn for the affected rows is set. Every so often, we run a statement to update the scn column:
  update T set scn = ora_rowscn
  where scn = infinity;

The query to select the changed rows since lastScn now becomes:
  select * from T
  where scn > lastScn
  AND ora_rowscn > lastScn; 
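
Putting the two statements together, one fetch cycle of such an Oracle adapter might look roughly like the JDBC sketch below. The table name T, the numeric sentinel standing in for "infinity", and the surrounding class are assumptions for illustration; error handling and retries are omitted.

  import java.sql.*;
  import java.util.function.Consumer;

  // Illustrative fetch cycle using the scn / ora_rowscn scheme described above.
  // Table T and the sentinel value standing in for "infinity" are assumptions.
  public class OracleScnFetcher {
      static final long INFINITY = Long.MAX_VALUE;  // assumed default for the scn column

      // Returns the highest commit SCN seen, to be used as lastScn next time.
      static long fetchOnce(Connection conn, long lastScn,
                            Consumer<ResultSet> handler) throws SQLException {
          long maxScn = lastScn;

          // Step 1: copy the commit-time ora_rowscn into the indexed scn column
          // for rows whose scn is still at the default.
          try (Statement st = conn.createStatement()) {
              st.executeUpdate("update T set scn = ora_rowscn where scn = " + INFINITY);
              conn.commit();  // assumes auto-commit is disabled
          }

          // Step 2: the indexed scn column narrows the scan; ora_rowscn remains
          // the authoritative commit clock for the watermark.
          String sql = "select t.*, t.ora_rowscn as row_scn from T t "
                     + "where t.scn > ? and t.ora_rowscn > ?";
          try (PreparedStatement ps = conn.prepareStatement(sql)) {
              ps.setLong(1, lastScn);
              ps.setLong(2, lastScn);
              try (ResultSet rs = ps.executeQuery()) {
                  while (rs.next()) {
                      handler.accept(rs);
                      maxScn = Math.max(maxScn, rs.getLong("row_scn"));
                  }
              }
          }
          return maxScn;
      }
  }

Because ora_rowscn is assigned at commit time, the watermark here does not suffer from the long-running-transaction race of the timestamp approach.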

3.MySQL Adapter


4.The Bad
1. Databus sources in the form of complex join views are expensive because they have to be evaluated at fetch time.
2. Large BLOBs and CLOBs in rows can incur additional disk seeks.
3. The TxLog table grows over time, which affects performance. Without time-based partitioning of the table, truncating older rows requires maintenance that involves a period of write unavailability for the primary database.
4. A very high update rate increases the load on the SCN update job; this affects the effectiveness of the indexes on the TxLog table.
5. Very high write traffic can also lead to increased read and write contention on the TxLog table.


