Switch from AWS RDS to Neon for simplified environment management, seamless scaling, and reduced maintenance — Compare both tools

The pg_mooncake extension

new

Fast analytics in Postgres with columnstore tables and DuckDB execution

The pg_mooncake extension enables fast analytic workloads in Postgres by adding native columnstore tables and vectorized execution (DuckDB).

Columnstore tables improve analytical queries by storing data vertically, enabling compression and efficient column-specific retrieval with vectorized execution.

pg_mooncake columnstore tables are designed so that only metadata is stored in Postgres, while data is stored in an object store as Parquet files with Icebergor Delta Lake metadata.

Queries on pg_mooncake columnstore tables are executed by DuckDB.

Try it on Neon!

Neon is Serverless Postgres built for the cloud. Explore Postgres features and functions in our user-friendly SQL editor. Sign up for a free account to get started.

Sign Up

You can create and use pg_mooncake columnstore tables like regular Postgres heap tables to run:

  • Transactional INSERT, SELECT, UPDATE, DELETE, and COPY operations
  • Joins with regular Postgres tables

In addition, you can:

  • Load Parquet, CSV, and JSON files into columnstore tables
  • Load Hugging Face datasets
  • Run DuckDB specific aggregate functions like approx_count_distinct
  • Read existing Iceberg and Delta Lake tables
  • Write Delta Lake tables from Postgres tables

Version availability:

pg_mooncake is supported on the following versions of Postgres.

  • Postgres 14 - pg_mooncake 0.1.0
  • Postgres 15 - pg_mooncake 0.1.0
  • Postgres 16 - pg_mooncake 0.1.0
  • Postgres 17 -pg_mooncake 0.1.0

note

pg_mooncake is an open-source extension for Postgres that can be installed on any Neon Project using the instructions below.

Use cases for pg_mooncake

pg_mooncake supports several use cases, including:

  1. Analytics on Postgres data
  2. Time Series & Log Analytics
  3. Exporting Postgres tables to your Lake or Lakehouse
  4. Querying and updating existing Lakehouse tables and Parquet files directly in Postgres

This guide provides a quickstart to the pg_mooncake extension.

  1. Enable the extension

    note

    The pg_mooncake extension is currently in beta. A separate, dedicated Neon project is recommended when using an extension that is still in Beta.

    While the pg_mooncake extension is in Beta, you need to explicitly allow it to be used on Neon before you can install it. To do so, connect to your Neon database via an SQL client like psql or the Neon SQL Editor and run the SET command shown below.

    SET neon.allow_unstable_extensions='true';

    Install the extension:

    CREATE EXTENSION pg_mooncake;
  2. Set up your object store

    Run the commands outlined in the following steps on your Neon database to setup your object store.

    tip

    If you don't have an object storage bucket, you can get a free S3 express bucket here. When using the free s3 bucket, the SELECT and SET statements defined below are generated for you, which you can copy and run.

    Add your object storage credentials. In this case, S3:

    SELECT mooncake.create_secret('<name>', 'S3', '<key_id>',
            '<secret>', '{"REGION": "<s3-region>"}');

    Set your default bucket:

    SET mooncake.default_bucket = 's3://<bucket>';

    note

    In the future, you will not have to bring your own bucket to use pg_mooncake with Neon.

  3. Create a columnstore table with USING columnstore

    Run the following SQL statement on your Neon database to create a columnstore table:

    CREATE TABLE reddit_comments(
        author TEXT,
        body TEXT,
        controversiality BIGINT,
        created_utc BIGINT,
        link_id TEXT,
        score BIGINT,
        subreddit TEXT,
        subreddit_id TEXT,
        id TEXT
    ) using columnstore;
  4. Load data

    You can find a list of data sources here. This dataset has 13 million rows and may take a few minutes to load.

    INSERT INTO reddit_comments
    (SELECT author, body, controversiality, created_utc, link_id, score, subreddit, subreddit_id, id
    FROM mooncake.read_parquet('hf://datasets/fddemarco/pushshift-reddit-comments/data/RC_2012-01.parquet')
    AS (author TEXT, body TEXT, controversiality BIGINT, created_utc BIGINT, link_id TEXT, score BIGINT, subreddit TEXT, subreddit_id TEXT, id TEXT));
  5. Query the table

    Queries on columnstore tables are executed by DuckDB. For example, this aggregate query runs in ~200 milliseconds on 13 million rows:

    -- Top commenters (excluding [deleted] users)
    SELECT
        author,
        COUNT(*) as comment_count,
        AVG(score) as avg_score,
        SUM(score) as total_score
    FROM reddit_comments
    WHERE author != '[deleted]'
    GROUP BY author
    ORDER BY comment_count DESC
    LIMIT 10;

References

Last updated on

Was this page helpful?