DuckDB Bundle on Kubernetes

Exploring DuckDB using Apache Superset on K8s

Nov 04, 2022

What is DuckDB to me?

It’s an extremely fast database that, in my experience, outperforms many distributed databases. It is designed to run on a single node, which makes it a perfect candidate for simplifying the data science workflow.

DuckDB interacts very well with Apache Arrow and Parquet.

ABC @Ubunta

I demoed @duckdb to 10 data scientists last Friday & they used it regularly for the week. Feedback:
* Speed is the most impressive thing about DuckDB.
* It's not exactly a Pandas replacement (not yet).
* Most of them found out the real power of @ApacheArrow for the 1st time.

Why Kubernetes?

Because that’s my infrastructure for the ML Platform.

ABC @Ubunta

Deploying @duckdb on Kubernetes & interact with the duckdb over @apachesuperset is impressive. It's by far the cheapest option (*for me) to explore data quickly for building data science applications. Native integration with #apachearrow made life much easier

A bit of Theory…

I am using a Kubernetes Deployment for this example but will publish a Helm chart as well.

Created a Dockerfile
Integrated Apache Superset 1.5.2, DuckDB 0.5.1, and duckdb-engine 0.6.4
Created a default Superset admin user
Created a default DuckDB database in Superset named DuckDB 😁
Added connect_args to the database config so DuckDB can interact with AWS S3

The Deployment file is nothing special and can be found in this GitHub repository.

Deployment

Check out deployment_duckdb_bundle.yaml
Make the following changes, if required:
tolerations, serviceAccountName, affinity, resources

Post Launch Pre-configuration

After a successful deployment, port-forward port 8088 to reach the Superset UI.

The first time after deployment, a few manual pre-configuration steps are needed:

Superset 1.5.2 has a bug where the database configuration does not pick up the Allow DML option, so it has to be enabled by hand
The `httpfs` extension must be installed explicitly from Superset. Even after the pip install, DuckDB did not recognize it and needed an explicit installation
Finally, connect_args must be applied to the database if you need to access an AWS S3 bucket or MinIO

Walkthrough

Log in with the world’s most secure credentials: admin/admin

Go to Data > Databases and edit the DuckDB database

Go to Advanced, enable Allow DML, and save
Go to SQL Lab > SQL Editor and run INSTALL 'httpfs';
You are now ready. Technically, you can already run queries on local files or on data you copy into your pod.
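The install step can also be scripted. The sketch below is an illustration under stated assumptions, not part of the original setup: `ensure_httpfs` is a hypothetical helper that runs the same INSTALL statement, followed by LOAD, against any connection object exposing an `execute(sql)` method, such as the one returned by `duckdb.connect()` in the DuckDB Python package.

```python
# Statements mirroring the manual SQL Lab step. LOAD makes the extension
# usable in the current session once it has been installed.
HTTPFS_STATEMENTS = ("INSTALL httpfs;", "LOAD httpfs;")


def ensure_httpfs(con):
    """Run the httpfs setup statements on a DuckDB-style connection.

    `con` is any object with an `execute(sql)` method. This helper is
    hypothetical and not part of Superset or DuckDB.
    """
    for statement in HTTPFS_STATEMENTS:
        con.execute(statement)
    # Return the statements that were run, which is handy for logging.
    return list(HTTPFS_STATEMENTS)
```

With the duckdb package available, calling `ensure_httpfs(duckdb.connect())` from a Python shell inside the pod should perform the same installation. INSTALL downloads the extension, so the pod needs outbound network access.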

For blob storage, go to Data > Databases > Edit > Advanced and add connect_args:

{
  "connect_args": {
    "preload_extensions": [
      "httpfs"
    ],
    "config": {
      "s3_endpoint": "s3.XXX.amazonaws.com",
      "s3_region": "us-east-2",
      "s3_access_key_id": "XXXXXX",
      "s3_secret_access_key": "XXXXXXXYYYYYYY",
      "s3_url_style": "path",
      "s3_use_ssl": "false"
    }
  }
}
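Typing keys into the form by hand is fine for a demo, but the same structure can be generated from environment variables and pasted into the Advanced field. This is a minimal sketch, not something the original setup includes: the function name `build_connect_args` and the variable names `S3_ENDPOINT`, `S3_REGION`, `S3_ACCESS_KEY_ID`, and `S3_SECRET_ACCESS_KEY` are assumptions chosen for illustration. Neither Superset nor DuckDB reads them on its own.

```python
import json
import os


def build_connect_args(env=None):
    """Build the connect_args structure shown above from a mapping.

    `env` defaults to os.environ. The key names are illustrative
    assumptions; a missing required key raises KeyError.
    """
    env = os.environ if env is None else env
    return {
        "connect_args": {
            # Load httpfs on each new connection so s3:// paths resolve.
            "preload_extensions": ["httpfs"],
            "config": {
                "s3_endpoint": env["S3_ENDPOINT"],
                "s3_region": env["S3_REGION"],
                "s3_access_key_id": env["S3_ACCESS_KEY_ID"],
                "s3_secret_access_key": env["S3_SECRET_ACCESS_KEY"],
                # Same values as the hand-written config above.
                "s3_url_style": "path",
                "s3_use_ssl": "false",
            },
        }
    }


def render_connect_args(env=None):
    """Return the config as indented JSON, ready to paste into Superset."""
    return json.dumps(build_connect_args(env), indent=2)
```

Calling `print(render_connect_args())` in a shell where those variables are set prints JSON with the same shape as the block above, which keeps the secret out of notes and screenshots.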

Explore Data

I am following this DuckDB blog post: https://duckdb.org/2021/06/25/querying-parquet.html
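As a starting point for SQL Lab, here is a small sketch that generates a few exploratory queries built on DuckDB's `read_parquet` function, which the linked post covers. The helper names and the bucket path are hypothetical and only there for illustration; swap in the location of your own Parquet file.

```python
def sql_literal(value):
    """Quote a string as a SQL literal, doubling embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"


def parquet_queries(path):
    """Return a few exploratory queries for one Parquet file.

    `path` can be a local file, or an s3:// URL once httpfs and the
    S3 settings are configured.
    """
    source = f"read_parquet({sql_literal(path)})"
    return {
        "preview": f"SELECT * FROM {source} LIMIT 10;",
        "schema": f"DESCRIBE SELECT * FROM {source};",
        "row_count": f"SELECT count(*) FROM {source};",
    }


if __name__ == "__main__":
    # Hypothetical bucket and key, used only for illustration.
    demo_path = "s3://my-bucket/data/sample.parquet"
    for name, query in parquet_queries(demo_path).items():
        print(f"-- {name}\n{query}")
```

Each generated statement can be pasted into SQL Lab as-is. Starting with the preview and schema queries is a cheap way to confirm that the S3 settings work before running anything heavier.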

Observation

DuckDB is surprisingly fast on blob storage as well; it only falls short when millisecond response times are strictly required.
CREATE TABLE AS on S3 Parquet data is very slow
Processing a few million records with 20+ columns (S3 Parquet) runs smoothly on DuckDB

My Resource Requirements

CPU: 2

RAM: 30 GB

Why am I trying this?

ABC @Ubunta

People are debating on Snowflake vs Databricks and I am rebuilding my Data/ML stack on @duckdb, Apache Arrow, @IbisData and @flyteorg

ABC @Ubunta

Most of my machine learning work is now more on optimizing for a single node.
From large distributed databases to @duckdb
From Spark to polars / pandas
Reducing TBs of data to a few GB for the actual ML model
