Skip to main content

High Watermarks For Incremental Models in dbt

The last few months it’s all been dbt. Dbt is a transform and load tool which is provided by fishtown analytics. For those that have created incremental models in dbt would have found the simplicity and easiness of how it drives the workload. Depending on the target datastore, the incremental model workload implementation changes. But all that said, the question is, should the incremental model use high-watermark as part of the implementation.

How incremental models work behind the scenes is the best place to start this investigation. And when it’s not obvious, the next best place is to investigate the log after an test incremental model execution and find the implementation.

Following are the internal steps followed for a datastore that does not support the merge statements. This was observed in the dbt log.

- As the first step, It will copy all the data to a temp table generated from the incremental execution.
- It will then delete all the data from the base table that's unique in the temp table ( The temp table mentioned in the previous step).
- Finally, it will insert all the data from the temp table to the base table.

The highlights
- As the best practice, always use a high watermark when modeling an incremental load. This will optimize the workflow by limiting the amount data it has to merge or delete and insert. 
- This also allows us to run an incremental load by backdating the models by N days. I think this is a great flexibility (in disguise ) to have for systems with upstream data delays.
- Always, have a uniquekey in the model. I won’t go to the length of describing the importance of following best practices for a model. But from a optimisation perspective, I would always add index on the uniquekey which would optimize the internal implementation of a incremental model.

Comments

Post a Comment

Popular posts from this blog

Create a dacpac To Compare Schema Differences

It's been some time since i added anything to the blog and a lot has happened in the last few months. I have run into many number of challenging stuff at Xero and spread my self to learn new things. As a start i want to share a situation where I used a dacpac to compare the differences of a database schema's. - This involves of creating the two dacpacs for the different databases - Comparing the two dacpacs and generating a report to know the exact differences - Generate a script that would have all the changes How to generate a dacbpac The easiest way to create a dacpac for a database is through management studio ( right click on the databae --> task --> Extract data-tier-application). This will work under most cases but will error out when the database has difffrent settings. ie. if CDC is enabled To work around this blocker, you need to use command line to send the extra parameters. Bellow is the command used to generate the dacpac. "%ProgramFiles...

How to Backup postgres globals without sysadmin permission in RDS

For those of my Friends who are on postgres RDS and require to backup the globas for a rds instance you would soon find it can't perform a backup with pg_dumpall --globals-only.  The default user that gets created along with the RDS postgres instance does not have adequate permission to get through the backup processes. Bellow is the message that you would receive. pg_dumpall.exe : pg_dumpall: query failed: ERROR:  permission denied for relation pg_authid At line:1 char:1 + & 'C:\setup_msi\postgres\bin\pg_dumpall.exe'  -h XXXXXX.cXXXX... + ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~    + CategoryInfo          : NotSpecified: (pg_dumpall: que...ation pg_authid:String) [], RemoteException    + FullyQualifiedErrorId : NativeCommandError pg_dumpall: query was: SELECT oid, rolname, rolsuper, rolinherit, rolcreaterole, rolcreatedb, rolcanlogin,...