# DuckDB Sparse Array & List Functions

This repo provides a set of useful functions for operating on sparsely stored arrays. Sparse storage significantly reduces storage size and modestly improves runtime when you have high-dimensional sparse (or nearly sparse) arrays. Following DuckDB's convention, the stored array is expected to be 1-indexed.
These functions support index/value-style sparse arrays. For example:

[7, 0, 0, 0, 23] => (5, [1, 5], [7, 23])

This is especially useful for 100-, 500-, or 1000+-dimensional arrays that contain only a few meaningful values.
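To make the conversion above concrete, here is a minimal Python sketch (the helper `to_sparse` is hypothetical and not part of this extension; it just illustrates the (size, inds, vals) representation):

```python
def to_sparse(dense):
    """Convert a dense list into (size, inds, vals), 1-indexed per DuckDB convention."""
    inds = [i + 1 for i, v in enumerate(dense) if v != 0]
    vals = [v for v in dense if v != 0]
    return (len(dense), inds, vals)

print(to_sparse([7, 0, 0, 0, 23]))  # (5, [1, 5], [7, 23])
```

Only the non-zero entries are stored, so the storage cost scales with the number of meaningful values rather than the array's dimensionality.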
## Functions
| Name | Description |
|---|---|
| `sparse_list_extract(position, inds, vals)` | Returns the value of a sparse array at the requested position (1-indexed) |
| `sparse_list_select(positions, inds, vals)` | Returns the values of a sparse array at the requested positions (1-indexed) |
| `dense_x_sparse_dot_product(dense_arr, inds, vals)` | Returns the dot product between a dense array and a sparse array |
| `sparse_to_dense(K, inds, vals)` | Constructs a dense array of size K from sparse inds + vals |
| `agg_array_sum(arrs)` | Sums a list of arrays of size K. Returns a single array of size K |
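The expected behavior of the core functions can be sketched as pure-Python reference implementations (these illustrate the semantics under the 1-indexing convention above; they are not the extension's actual code):

```python
def sparse_list_extract(position, inds, vals):
    """Value at a 1-indexed position; 0 (the implicit fill value) if absent."""
    return vals[inds.index(position)] if position in inds else 0

def sparse_to_dense(k, inds, vals):
    """Materialize a dense list of size k from sparse inds + vals."""
    dense = [0] * k
    for i, v in zip(inds, vals):
        dense[i - 1] = v  # shift 1-indexed positions into 0-indexed slots
    return dense

def dense_x_sparse_dot_product(dense_arr, inds, vals):
    """Dot product only touches the stored (non-zero) entries."""
    return sum(dense_arr[i - 1] * v for i, v in zip(inds, vals))

# [7, 0, 0, 0, 23] stored as inds=[1, 5], vals=[7, 23]
print(sparse_list_extract(5, [1, 5], [7, 23]))                        # 23
print(sparse_to_dense(5, [1, 5], [7, 23]))                            # [7, 0, 0, 0, 23]
print(dense_x_sparse_dot_product([1, 2, 3, 4, 5], [1, 5], [7, 23]))  # 7*1 + 23*5 = 122
```

Note that the dot product loops over the stored entries only, which is where the runtime benefit for highly sparse arrays comes from.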
Visit our docs for extended documentation and examples.
## Benchmarks

We migrated our DuckDB arrays to sparse storage in Dec 2024. These are the results of the benchmarks we ran on our internal DuckDB files, with confidence intervals.
Visit our blog post for more context on the benchmarks.
## Considerations

### Consideration 1: Why isn't the reduction 90%+?
DuckDB, Parquet, and other popular data lake file formats apply compression, so general-purpose compression algorithms already capture some of the advantage of the sparse format. However, you still get a meaningful storage reduction by explicitly converting sparse arrays into an INDS, VALS format.
### Consideration 2: Our atypical benchmark data
It is on our todo list to implement a public benchmark on more standard vector data. For now, we want to mention that our lists are generated by sparsity-inducing Dirichlet process models, so your results may differ from our internal benchmarks.
### Consideration 3: The only reliable benchmark is on your own data

In building this, we leveraged random-noise arrays, all-ones arrays, etc. to estimate how various strategies would affect our data storage. However, due to the highly variable nature of compression, none of these estimates reflected our real-world results. Your use case may likewise benchmark better or worse than ours, so we strongly suggest you first estimate on a subset of your actual data.
## Why do the functions explicitly accept inds & vals?

DuckDB's columnar engine lets you read only a subset of columns from your data. To take full advantage of this, I prefer to store the inds and vals separately so that I can operate on them independently. E.g., I can use the inds column alone to quickly filter documents on whether or not a specific feature is even present.
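That filtering pattern can be sketched in Python (the document rows and feature id here are hypothetical; the point is that only the inds column is touched, never vals):

```python
# Hypothetical document rows: (doc_id, inds, vals).
docs = [
    ("a", [1, 42], [0.5, 0.9]),
    ("b", [3, 7],  [1.0, 2.0]),
    ("c", [42],    [4.0]),
]

# Filter on the inds column alone -- vals is never read.
with_feature_42 = [doc_id for doc_id, inds, _vals in docs if 42 in inds]
print(with_feature_42)  # ['a', 'c']
```

In DuckDB itself, the columnar layout means the equivalent query can skip reading the vals column entirely, which is the payoff of storing inds and vals as separate columns.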