
Revisiting the Poor Man’s Data Lake with MotherDuck

See how much easier collaboration becomes when you use MotherDuck, DuckDB’s high-powered cloud version, to build a one-system data lake.


Last year we wrote a blog post called Building a Poor Man’s Data Lake with DuckDB. In it, we showed that you could combine two excellent open-source technologies, DuckDB and Dagster, to build a powerful, cheap, and useful data platform.

Since then, MotherDuck has emerged from stealth to build a high-powered cloud version of DuckDB that competes with other cloud data warehouses like Snowflake.

In this blog post, we will migrate the Poor Man’s Data Lake away from S3 and Parquet files into a single system: MotherDuck. As you’ll see, the migration is very straightforward thanks to the elegant design of DuckDB, MotherDuck, and Dagster, and we can realize all of the benefits of MotherDuck without even touching our business logic.

In this blog post:

  • What is MotherDuck?
  • Step 1: Connecting to MotherDuck
  • Step 2: Replacing the IOManager
  • Step 3: There is no Step 3!

⚡ As always, the code is on GitHub

Prefer to listen rather than read? Pete Hunt walks us through porting our DuckDB project to MotherDuck.

What is MotherDuck?

MotherDuck is a cloud service that hosts your DuckDB tables. Rather than pointing DuckDB at a single database file on disk, you can point it at a cloud-hosted database that can be accessed anywhere with high performance, just like other cloud data warehouses such as Snowflake or BigQuery.

Let’s get started with it. But before we do, let’s remind ourselves of how DuckDB works. Install a recent version and create an in-memory database via the terminal:
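A minimal session looks something like this (the quacks table is just an example, not the exact demo data):

```
$ duckdb
D CREATE TABLE quacks (id INTEGER, sound VARCHAR);
D INSERT INTO quacks VALUES (1, 'quack'), (2, 'honk');
D SELECT * FROM quacks;   -- returns the two rows we just inserted
```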

As you can see, we’ve used the CLI to create a new table and can read and write data to it. However, this is stored in a transient, in-memory database, and we’ll lose our data if we quit the DuckDB CLI:
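Continuing the example session, quitting and restarting loses the table:

```
D .quit
$ duckdb
D SELECT * FROM quacks;
-- error: Table with name quacks does not exist
```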

We can instead save our database to a file on disk by passing it as a command-line arg:
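Something like this, with a filename of our choosing:

```
$ duckdb my_database.duckdb
D CREATE TABLE quacks (id INTEGER, sound VARCHAR);
D INSERT INTO quacks VALUES (1, 'quack'), (2, 'honk');
D .quit
```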

…and if we quit and reopen the CLI, the data is still there!
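Reopening the same file:

```
$ duckdb my_database.duckdb
D SELECT * FROM quacks;   -- the two rows survived the restart
```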

This is useful for a single engineer developing on a single machine, but what if you want to share this data with others? You need to find some way to distribute this database file. Additionally, if multiple stakeholders want to add data to the same DB, you need to figure out how to ensure that all stakeholders have up-to-date copies of the data and that their changes are merged into a single consistent database.

That’s where MotherDuck comes in. Head on over to app.motherduck.com and get an access token. Then, simply connect to MotherDuck right from the DuckDB CLI!
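For example, connecting to a database called helloworld (substitute your own token):

```
$ duckdb 'md:helloworld?token=<your-token>'
D CREATE TABLE quacks (id INTEGER, sound VARCHAR);
D INSERT INTO quacks VALUES (1, 'quack'), (2, 'honk');
```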

We can use the same argument to duckdb on a different machine and still see the same data:
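```
# on a second machine, same md: URL
$ duckdb 'md:helloworld?token=<your-token>'
D SELECT * FROM quacks;   -- the rows we inserted on the first machine
```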

So how did this work?

We passed a funny-looking argument to duckdb. This is a URL to our MotherDuck instance. md: is the MotherDuck protocol, helloworld is the name of our database (which will be automatically created if it doesn’t exist), and ?token=... provides our access token.

This works identically in the DuckDB Python bindings too. Let’s start hacking on the Poor Man’s Data Lake and get it integrated with MotherDuck.
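Here is a quick sketch of what that looks like with the Python package (same placeholder token as above):

```python
import duckdb

# the same md: URL works with duckdb.connect()
con = duckdb.connect("md:helloworld?token=<your-token>")
print(con.execute("SELECT * FROM quacks").fetchall())
```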

Step 1: Connecting to MotherDuck

⚡ Be sure to read the original blog post so you can follow along!

If you take a look at the original project, we created a class that wrapped our DuckDB connection:
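Simplified a bit (the real class also handles templated SQL objects and dataframe bindings), it looks roughly like this:

```python
import duckdb


class DuckDB:
    """Thin wrapper that runs each query against a fresh DuckDB connection."""

    def __init__(self, options: str = ""):
        self.options = options  # setup SQL, e.g. httpfs / S3 credentials

    def query(self, select_statement: str):
        # every query gets a transient, in-memory database
        db = duckdb.connect(":memory:")
        if self.options:
            db.execute(self.options)
        result = db.sql(select_statement)
        return result.df() if result is not None else None
```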

We are going to make a slight change to this class to allow users to pass in their MotherDuck URL if they so desire. Update the class to read as follows:
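Something like the following, where db_location is a parameter name of our choosing that defaults to the old in-memory behavior:

```python
import duckdb


class DuckDB:
    """Same wrapper, now able to target ':memory:' or an 'md:...' URL."""

    def __init__(self, options: str = "", db_location: str = ":memory:"):
        self.options = options
        self.db_location = db_location

    def query(self, select_statement: str):
        # connects to MotherDuck whenever db_location is an md: URL
        db = duckdb.connect(self.db_location)
        if self.options:
            db.execute(self.options)
        result = db.sql(select_statement)
        return result.df() if result is not None else None
```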

Now we can pass in a MotherDuck URL, and we will automatically run our project on MotherDuck. Recall the Definitions in the original project’s __init__.py file:
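It wires the assets and the IO manager together, roughly like this (the bucket name is illustrative):

```python
from dagster import Definitions, load_assets_from_modules

from . import assets
from .duckpond import DuckDB, DuckpondIOManager

defs = Definitions(
    assets=load_assets_from_modules([assets]),
    resources={"io_manager": DuckpondIOManager("my-s3-bucket", DuckDB())},
)
```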

We can simply pass in our MotherDuck URL as an environment variable, and our project should still run identically to before, except this time it’s running in The Cloud™.
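For example, using a hypothetical MOTHERDUCK_PATH environment variable that holds the md: URL:

```python
import os

from dagster import Definitions, load_assets_from_modules

from . import assets
from .duckpond import DuckDB, DuckpondIOManager

defs = Definitions(
    assets=load_assets_from_modules([assets]),
    resources={
        "io_manager": DuckpondIOManager(
            "my-s3-bucket",
            DuckDB(db_location=os.getenv("MOTHERDUCK_PATH", ":memory:")),
        )
    },
)
```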

Pretty easy! But we aren’t taking full advantage of MotherDuck quite yet.

Step 2: Replacing the IOManager

Recall that Dagster uses IO Managers to abstract away the underlying storage system from business logic. This makes it very straightforward for us to swap out our existing in-memory DuckDB storage system with one powered by MotherDuck, without touching any of our business logic!

Let’s start by copy-pasting the existing DuckpondIOManager as a new IOManager called MotherduckIOManager (inside of duckpond.py):
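As a starting point, the copy still writes Parquet files to S3. Simplified to plain SQL strings, it looks roughly like this:

```python
from dagster import IOManager

# DuckDB is the connection wrapper defined earlier in duckpond.py


class MotherduckIOManager(IOManager):
    def __init__(self, bucket_name: str, duckdb: DuckDB, prefix: str = ""):
        self.bucket_name = bucket_name
        self.duckdb = duckdb
        self.prefix = prefix

    def _get_s3_url(self, context) -> str:
        # one Parquet file per asset, keyed by the asset identifier
        id = context.get_asset_identifier()
        return f"s3://{self.bucket_name}/{self.prefix}{'/'.join(id)}.parquet"

    def handle_output(self, context, select_statement: str):
        if select_statement is None:
            return
        self.duckdb.query(
            f"COPY ({select_statement}) TO '{self._get_s3_url(context)}' (FORMAT PARQUET)"
        )

    def load_input(self, context) -> str:
        return f"SELECT * FROM read_parquet('{self._get_s3_url(context.upstream_output)}')"
```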

We need to make some changes to this:

  • We need to remove S3 from handle_output(), and replace it with a MotherDuck table
  • We need to read from that MotherDuck table in load_input()

Let’s refactor this class step-by-step. First, let’s modify the constructor to remove any reference to S3 buckets:
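Inside MotherduckIOManager, the constructor now only needs the DuckDB wrapper and an optional table-name prefix:

```python
    def __init__(self, duckdb: DuckDB, prefix: str = ""):
        # no bucket_name: MotherDuck is now the only storage system
        self.duckdb = duckdb
        self.prefix = prefix
```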

Next, let’s replace our helper method that creates S3 URLs with one that creates MotherDuck table names:
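A sketch of that helper, turning the asset identifier into a table name:

```python
    def _get_table_name(self, context) -> str:
        # e.g. an asset key of ["my_asset"] becomes the table name "my_asset"
        if context.has_asset_key:
            id = context.get_asset_identifier()
        else:
            id = context.get_identifier()
        return self.prefix + "_".join(id)
```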

Next, we’ll modify handle_output() to CREATE OR REPLACE a table by that name, rather than write a Parquet file to S3:
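The rest of the method stays the same; only the SQL changes:

```python
    def handle_output(self, context, select_statement: str):
        if select_statement is None:
            return
        # materialize the asset as a MotherDuck table instead of a Parquet file
        self.duckdb.query(
            f"CREATE OR REPLACE TABLE {self._get_table_name(context)} AS {select_statement}"
        )
```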

Notice the only thing we changed here was replacing the COPY SQL statement with a CREATE OR REPLACE TABLE statement towards the end of the function.

Finally, let’s read from that table:
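load_input() becomes a one-liner:

```python
    def load_input(self, context) -> str:
        # downstream assets simply SELECT from the upstream asset's table
        return f"SELECT * FROM {self._get_table_name(context.upstream_output)}"
```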

Not too hard! We just SELECT from the table instead of reading a Parquet file as we did before.

One final step to wire it up. Let’s swap out the IOManager in our Definitions inside of __init__.py:
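Roughly like this, reusing the hypothetical MOTHERDUCK_PATH variable from Step 1:

```python
import os

from dagster import Definitions, load_assets_from_modules

from . import assets
from .duckpond import DuckDB, MotherduckIOManager

defs = Definitions(
    assets=load_assets_from_modules([assets]),
    resources={
        "io_manager": MotherduckIOManager(
            DuckDB(db_location=os.environ["MOTHERDUCK_PATH"])
        )
    },
)
```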

The only change here is that we’ve changed the io_manager resource to point to MotherduckIOManager instead of DuckpondIOManager.

Step 3: There is no Step 3!

Now simply run the project and, after a short moment, you will see beautiful green boxes in the Dagster UI.
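For example, launching the Dagster UI with the hypothetical MOTHERDUCK_PATH variable set:

```
$ export MOTHERDUCK_PATH='md:helloworld?token=<your-token>'
$ dagster dev   # then materialize the assets from the UI at localhost:3000
```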

And even better, we can open up the MotherDuck UI and explore our tables, just like any other data warehouse.

This is a huge usability improvement on top of S3 and Parquet, and it’s much easier to collaborate using MotherDuck rather than vanilla DuckDB.

It is worth noting that this project has over 100 lines of business logic that we didn't even need to touch (in a real app, this could be tens of thousands, if not more!). Due to MotherDuck’s deep integration with DuckDB and Dagster’s IOManager abstraction, we were easily able to scale up from our “poor man’s data lake” to something much more robust, easy to use, and team-friendly.

That’s all we have for today. Thanks for following along, and let us know if you have success with this stack by joining our community and providing feedback!

We're always happy to hear your feedback, so please reach out to us! If you have any questions, ask them in the Dagster community Slack (join here!) or start a GitHub discussion. If you run into any bugs, let us know with a GitHub issue. And if you're interested in working with us, check out our open roles!
