Google's Dremel

December 17, 2024 · 2 min read

High Performance Developer

Dremel is the query system behind the BigQuery API. Notes from the 2010 paper from Google.


Dremel was designed for interactive, ad-hoc queries on read-only nested data stored in a columnar layout. One of its main contributions at the time was a novel way of representing nested records in columns.

Storage

Dremel operates on nested data in situ: it queries the data directly where it lives in the storage layer. It uses GFS as the storage layer, which allows fast access to the data without a loading phase.

In contrast to layers such as Pig [18] and Hive [16], 
it executes queries natively without translating them into MR jobs.

The shared storage format supports nested data, and we effectively have a column (or file) per field. This gets more complicated with sparse records, where many fields are missing. Data is kept in columns so that values of the same field can be read sequentially, which improves retrieval efficiency for the common analytics pattern of scanning only a few fields.
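To make the column-per-field idea concrete, here is a minimal sketch in Python. It assumes flat records and uses `None` as a placeholder for missing fields; the field names and the `to_columns` helper are hypothetical, and nested data needs the repetition and definition levels described under Nested Columns.

```python
# Three row-oriented records; the second and third are sparse.
records = [
    {"doc_id": 1, "url": "http://A", "lang": "en"},
    {"doc_id": 2, "url": "http://B"},   # no "lang"
    {"doc_id": 3, "lang": "de"},        # no "url"
]

def to_columns(rows):
    """Pivot row-oriented records into one list of values per field."""
    fields = sorted({name for row in rows for name in row})
    # A missing field becomes None so every column keeps one slot per record.
    return {name: [row.get(name) for row in rows] for name in fields}

columns = to_columns(records)

# A query that only needs "lang" touches a single column, not whole records.
langs = [value for value in columns["lang"] if value is not None]
```

A scan over `columns["lang"]` never reads `url` or `doc_id`, which is the access pattern the columnar layout is meant to speed up.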

Multi level execution trees

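Dremel breaks a complex query down into sub-queries that are executed across the cluster. A root server receives the query and rewrites it for the levels below, the leaf servers scan the data, and partial results are aggregated on the way back up the tree.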
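The idea can be sketched for a simple aggregation. This is an illustration only, assuming a query that computes a count and a sum over values matching a predicate; the function names, the in-memory "tablets", and the `fanout` parameter are all hypothetical and not taken from the paper.

```python
def leaf_scan(tablet, predicate):
    """Leaf level: scan one partition and return a partial (count, sum)."""
    matching = [value for value in tablet if predicate(value)]
    return len(matching), sum(matching)

def combine(partials):
    """Intermediate or root level: merge partial (count, sum) pairs."""
    return (sum(count for count, _ in partials),
            sum(total for _, total in partials))

def execute(tablets, predicate, fanout=2):
    """Run the query bottom-up through a tree with the given fanout."""
    if fanout < 2:
        raise ValueError("fanout must be at least 2")
    if not tablets:
        return 0, 0
    # One sub-query per tablet at the leaves.
    partials = [leaf_scan(tablet, predicate) for tablet in tablets]
    # Each pass of the loop is one level of the tree merging its children.
    while len(partials) > 1:
        partials = [combine(partials[i:i + fanout])
                    for i in range(0, len(partials), fanout)]
    return partials[0]

tablets = [[1, 2, 3], [4, 5], [6], [7, 8, 9, 10]]
count, total = execute(tablets, lambda value: value % 2 == 0)
```

Count and sum work here because partial results can be merged in any grouping; an aggregate without that property would need a different rewrite.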

Data Model

The data model allows for repeating and optional fields.

Nested Columns

Repetition Level - records at which repeated field in the value's path the value has repeated. A value of 0 marks the start of a new record; higher values mean the repetition happened at a deeper repeated field in the path.
Definition Level - counts how many of the optional or repeated fields in the value's path are actually present. It matters mostly for NULLs: a level below the column's maximum shows how far down the path the record was defined before a field went missing.

Repetition and Definition Levels

For a good overview of how to calculate repetition and definition levels, check out this blog post from Akshay Kumar.
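As a rough illustration of the calculation, the sketch below stripes one record into columns of `(value, repetition level, definition level)` triples. The schema and record mirror the Document example from the Dremel paper; the dict-based schema encoding and the `stripe` function are assumptions made for this sketch, not the paper's algorithm verbatim.

```python
REQUIRED, OPTIONAL, REPEATED = "required", "optional", "repeated"

# Field name -> (label, children); children is None for leaf fields.
SCHEMA = {
    "DocId": (REQUIRED, None),
    "Links": (OPTIONAL, {
        "Backward": (REPEATED, None),
        "Forward": (REPEATED, None),
    }),
    "Name": (REPEATED, {
        "Language": (REPEATED, {
            "Code": (REQUIRED, None),
            "Country": (OPTIONAL, None),
        }),
        "Url": (OPTIONAL, None),
    }),
}

def stripe(record, schema=SCHEMA):
    """Return {column path: [(value, repetition level, definition level)]}."""
    columns = {}

    def emit(path, value, r, d):
        columns.setdefault(".".join(path), []).append((value, r, d))

    def emit_nulls(children, path, r, d):
        # A missing field puts a NULL in every leaf column beneath it.
        if children is None:
            emit(path, None, r, d)
        else:
            for name, (_, sub) in children.items():
                emit_nulls(sub, path + (name,), r, d)

    def visit(node, fields, path, r, d, depth):
        for name, (label, children) in fields.items():
            p = path + (name,)
            raw = node.get(name)
            if label == REPEATED:
                values = raw or []
            else:
                values = [] if raw is None else [raw]
            if not values:
                emit_nulls(children, p, r, d)
                continue
            child_depth = depth + int(label == REPEATED)  # repeated fields so far
            child_d = d + int(label != REQUIRED)          # present optional/repeated
            cur_r = r
            for value in values:
                if children is None:
                    emit(p, value, cur_r, child_d)
                else:
                    visit(value, children, p, cur_r, child_d, child_depth)
                cur_r = child_depth  # later values repeat at this field

    visit(record, schema, (), 0, 0, 0)
    return columns

R1 = {
    "DocId": 10,
    "Links": {"Forward": [20, 40, 60]},
    "Name": [
        {"Language": [{"Code": "en-us", "Country": "us"}, {"Code": "en"}],
         "Url": "http://A"},
        {"Url": "http://B"},
        {"Language": [{"Code": "en-gb", "Country": "gb"}]},
    ],
}

columns = stripe(R1)
# columns["Name.Language.Code"] ==
#   [("en-us", 0, 2), ("en", 2, 2), (None, 1, 1), ("en-gb", 1, 2)]
```

The `(None, 1, 1)` entry comes from the second `Name`, which has no `Language`: the value repeats at `Name` (level 1), and only one field on the path, `Name` itself, is defined.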

Sparse Data and Field Writers

Many datasets used at Google are sparse; it is not uncommon to have 
a schema with thousands of fields, only a hundred of which are used 
in a given record. Hence, we try to process missing fields as cheaply 
as possible. 

To produce column stripes, we create a tree of field writers, 
whose structure matches the field hierarchy in the schema.
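A small sketch of what "structure matches the field hierarchy" could look like. It only builds the tree and lists the column each leaf writer would own; it does not model how the paper's writers handle missing fields cheaply. The `FieldWriter` class and `build_writers` helper are hypothetical, and the schema (leaf fields marked with `None`) follows the Document example from the paper.

```python
from dataclasses import dataclass, field

@dataclass
class FieldWriter:
    name: str
    children: list = field(default_factory=list)
    values: list = field(default_factory=list)  # used by leaf writers only

    def leaf_paths(self, prefix=""):
        """Return the dotted column path of every leaf writer below this one."""
        path = f"{prefix}.{self.name}" if prefix else self.name
        if not self.children:
            return [path]
        return [p for child in self.children for p in child.leaf_paths(path)]

def build_writers(schema):
    """Create one writer per field, nested the same way as the schema."""
    return [FieldWriter(name, build_writers(sub) if sub else [])
            for name, sub in schema.items()]

schema = {
    "DocId": None,
    "Links": {"Backward": None, "Forward": None},
    "Name": {"Language": {"Code": None, "Country": None}, "Url": None},
}

writers = build_writers(schema)
paths = [p for writer in writers for p in writer.leaf_paths()]
```

Each entry in `paths` corresponds to one column stripe, so a schema with thousands of fields yields thousands of leaf writers, most of which stay idle for any given sparse record.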

Links and References

Dremel: Interactive Analysis of Web-Scale Datasets, 2010 - Sergey Melnik, Andrey Gubarev, Jing Jing Long, Geoffrey Romer, Shiva Shivakumar, Matt Tolton, Theo Vassilakis, Google, Inc.
