Benoit Perigaud
Benoit Perigaud

How dbt can supercharge your Data Vault 2.0 modelling

Data Vault modelling is a good option to consider when the goal of the data warehouse is to keep a historical record of changes and when data from multiple source systems needs to be combined. You can read more about this method in this article dedicated to Data Vault 2.0.

While the data modelling methodology focuses on the “what” should be part of the model, there are many options about “how” to technically translate the model into physical tables and views in the data warehouse, and how to orchestrate the different data loads. And this is where the dbt™ platform comes into place, making you focus on the model of your DataVault rather than its implementation.

If you are not already aware, we are huge fans of dbt at Aginic. We are proud to be one of the few preferred dbt consulting providers and have deployed the platform for quite a number of clients, based on Data Vault modelling.

 

A very quick intro to Data Vault 2.0

At a very high level, the Data Vault modelling revolves around a few core objects (While this is a simplification, please do your own research about Data Vault 2.0.) Please see a few definitions for these objects bewow, including hubs, links and satellites.

Hubs are used to store lists of business keys and which system those keys come from. They are very useful when the same keys are used across multiple systems. (e.g. a Customer ID in Hubspot is the same as the Customer ID in Xero)

Links are used to define relations between business keys. If you have a Hub for your Clients and a Hub for your Projects, the Link tables will allow you to mention which projects relate to each client. They are usually one-to-many or many-to-many relationships.

Satellites are entities that hold additional information about specific keys in your Data Vault. They can be describing Hubs or Links. In the case of our Customer Hub, we might have different Satellites storing information about the Customer Contact Details, the Customer Address, the Customer ABN number etc…

 

Where does dbt fit in the midst of Data Vault?

As mentioned before, the Data Vault methodology focuses on what entities (Hubs/Links/Satellites) should be part of the data model. What dbt can help with is how you can translate those tables and views in your Data Warehouse.

For people unaware of what dbt is and does, dbt labs explains its core value proposition and how it defines itself here – “Dbt is a development framework that combines modular SQL with software engineering best practices to make data transformation reliable, fast and fun.”

In other words, dbt allows you to write your data transformation using SQL and code-reusable macros to run your data pipelines in a clean and efficient way.

However, the most important part for the Data Vault use case is the ability to define and use those macros. As a result, I’d like to discuss the dbtvault package: https://dbtvault.readthedocs.io/

dbtvault is a dbt package that provides macros (e.g. pre-defined SQL code logic) to translate a Data Vault model “on paper”, to actual tables and views.

 

Jumping back to our intro about Data Vault, dbtvault provides tools to make the ingestion of data into the different Data Vault components very easy. For example, dbtvault provides out of the box macros for creating your Hubs, Links and Satellites, once you have identified them and how they should be defined. (it also supports other entities of the Data Vault methodology, but this article is still a brief introduction).

Drilling into the example of the Hub, the only piece of code that needs to be written for creating the Hub is the following:

{%- 

hub(src_pk, src_nk, src_ldts, src_source, source_model) 

-%}

with src_pk, src_nk, src_ldts and src_source being the columns of your model storing respectively your primary key, natural key, load datetime and the source system of the data.

This macro can then be automatically compiled by dbt into SQL that can be run by the Data Warehouse like below:


By just providing metadata about your Hub (its Primary Key, Natural Key etc….), dbt will be able to handle the ingestion of the data, both for the initial load as well as for the future incremental loads of your data. And for that, dbtvault leverages the ability of dbt to easily define “incremental models”, which are tables that are created in a way that only changed and new records are updated, without modifying the records that have been kept unchanged.

Once all your dbt models are created, running a simple “dbt run” will populate all your Hubs, Links and Satellites, incrementally, and without the need to define yourself complex SQL code and transformations.

For people who tried or managed to implement a Data Vault with “vanilla” SQL, you will realise that this is a real game-changer. The team can now focus entirely on the Data Vault design itself. Once the metadata is identified, dbt along with dbtvault can take care of the entire logic.

If you want to have a chat with us about dbt and Data Vault, feel free to drop me a note at [email protected] and if you’d like us to share more insights and details about how dbt and Data Vault 2.0 can be used together, give us a shout, and who knows, we might create a Part 2 of this article!

 

Useful resources:

 

dbt™ is a trademark of dbt Labs.

Get in touch with Benoit Perigaud

Frenchy calling Australia home since 2015. Experienced IT and management consultant turned analytics devotee, always keen to play around with new tools and technologies and understand how they can help our clients and their customers. Can usually be found chasing a frisbee or wearing a wetsuit and diving mask.

Get in touch