
DATABASE SCHEMA CHANGES WITH DBT


Paul Batzdorf on 13.09.2021


When a source system is replaced, the underlying data changes and the existing analysis pipeline no longer works. How do we deal with this?

When working with data warehouses (DWH), we regularly have to change their structure for various reasons. This usually means incorporating additional information from a source into our models, answering more in-depth questions with our data, translating changed business logic so that it is reflected in the data flow, or building completely new analyses and models that provide new insights. Recently, we were confronted with a rarer problem at one of our customers: schema changes caused by a change of source system.

Regardless of which method we use to model our DWH (Data Vault 2.0, Kimball, Inmon, etc.), we always take the original schema and the available information as a guide when adding data sources. So how do we proceed if we suddenly have to work with a new source schema for the same data? And how can we make the transition between systems as smooth as possible? The different modeling methods take different approaches here and are suited to such changes to different degrees.

In this case, none of these modeling methods was strictly implemented in the DWH. Snowflake serves as the DWH, dbt (data build tool) handles the transformations, and the source system is a SaaS ERP system that is being replaced by a proprietary ERP system. Our procedure for loading the tables: we load the source completely and without prior transformation into the DWH and only add transformations afterwards. This applies to both the old and the new ERP system.
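As a minimal sketch of this convention: the raw table lands in the DWH untouched, and a staging model on top of it initially does nothing more than pass the data through. The source name erp_oldsystem and the table name are placeholders for illustration, and the source would have to be declared in a sources yml file.

-- models/staging/stg__orders_[oldsystem].sql
-- the raw table was loaded 1:1 from the ERP system; transformations come later in the DAG
select *
from {{ source('erp_oldsystem', 'orders') }}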

First of all, we need to take stock: in which models (models in dbt are representations of objects in the DWH: tables, views, materialized views, etc.) do we use information from our source system? In our case: quite a lot, so for the moment we concentrate on the models that are directly connected to our source tables. The Docs page of dbt helps us with this. Every dbt project can generate a Docs page on request, where we find information about our DWH objects. The DAG (Directed Acyclic Graph) shown there is particularly valuable for us, as it makes the dependencies between source tables and downstream models visible.
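For reference, the Docs page and the affected models can be reached with the standard dbt commands; the source name erp_oldsystem is again just a placeholder:

dbt docs generate    # compiles the project and builds the documentation artifacts
dbt docs serve       # serves the Docs page, including the DAG, locally

# list every model downstream of the old ERP source
dbt ls --select source:erp_oldsystem+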

All further steps depend on how the DWH is structured and on how much the schema of the new system differs from that of the system being replaced. It also depends on whether we want to keep all the information from the old system for analyses, even where it is not available in the new system.

In our example, we decided to keep only the information from the historical system that is actually used in the warehouse. We are also dealing with an organically grown DWH, so this is a good opportunity to rethink its structure at this point.

Once we have identified the models in which we process our source data, we create a model that is fed from the source data and contains the relevant columns used by the other models. It will often make sense to group the relevant columns into several such models according to where they belong. In these models, we also include transformations that are repeated in several downstream steps and are therefore better placed earlier in the DAG, such as currency conversions or other recurring business logic. If this structure already exists in the DWH, this step is not necessary.
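The pure passthrough sketched above might then grow into something like the following. The column names and the fx_rates model are invented for illustration; the point is that the conversion lives in one place instead of being repeated in every downstream model.

-- models/staging/stg__orders_[oldsystem].sql
select
    o.order_no                 as order_id,
    o.cust_no                  as customer_id,
    o.order_date::date         as order_date,
    -- currency conversion happens once here, not in every downstream model
    o.amount * fx.rate_to_eur  as amount_eur
from {{ source('erp_oldsystem', 'orders') }} as o
left join {{ ref('fx_rates') }} as fx
    on  o.currency         = fx.currency
    and o.order_date::date = fx.valid_on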

Next, we connect the new system to our DWH and try to recreate the tables created above with columns from the new system. It helped in this step that both systems ran in parallel for a while, so that the contents of the two systems could be compared with each other (a minimal parity check is sketched after the union query below). This was useful not only for the implementation in the DWH but also for validating the new system itself, as the output of both systems could easily be compared.

If the new data model is not compatible with the old modeling, the modeling can now be adapted accordingly on the basis of the old data. The aim should be to end up with a replica of the historical schema built on the new source. The historical schema must contain at least all the information that is required in further transformation steps, and all of this information should be mirrored in the new schema, in the same tables and with the same column names. In the schema for the new system, we can also add further fields that the system change makes available to us, but this can just as well happen in later iterations. Two structurally identical tables, one from each system's schema, are then unioned using the dbt_utils package:

select *
from {{ dbt_utils.union_relations(relations=[
    ref('stg__orders_[oldsystem]'),
    ref('stg__orders_[newsystem]')
]) }}
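As for the parity check mentioned above: while both systems ran in parallel, a comparison could be written as plain SQL over the two mirrored staging models, for example placed in dbt's analyses folder, where it is compiled but not materialized. This sketch assumes both models expose exactly the same columns in the same order:

-- rows present in the old system's staging model but missing from the new one
select * from {{ ref('stg__orders_[oldsystem]') }}
except
select * from {{ ref('stg__orders_[newsystem]') }}
-- swap the two relations to check the opposite direction

dbt_utils also ships an equality test that performs essentially this comparison as a schema test.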

dbt_utils is a dbt package that contains various helper functions that come in handy in many situations. These functions are implemented as macros and range from cross-database implementations of datediff to macros that generate a date spine; the package is installed via the project's packages.yml and dbt deps. The union_relations macro is particularly helpful in this case, because the two tables do not have to contain exactly the same columns: columns with the same name are merged, and columns that exist in only one table are filled with NULL values in the rows coming from the other table.
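Two further details of union_relations were useful here, to the best of our knowledge of current dbt_utils versions: the macro adds a column (by default _dbt_source_relation) recording which relation each row came from, and individual columns can be excluded. A sketch, where the excluded audit column is hypothetical:

select *
from {{ dbt_utils.union_relations(
    relations=[
        ref('stg__orders_[oldsystem]'),
        ref('stg__orders_[newsystem]')
    ],
    exclude=['_loaded_at'],               -- hypothetical audit column we drop
    source_column_name='source_system'    -- renames the added origin column
) }}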

This gives us a complete history of the required columns across both systems in one table. If new information from one of the systems is needed later, it can be added in a subsequent step. We have thus arrived at a solution that lets us react flexibly to future adjustments.