Apache Arrow TLDR;

· 2 min read
Sanjeev Sarda
High Performance Developer

What is Apache Arrow?

Arrow is a column-based (columnar) in-memory table format. Its specification relies heavily on aligned data access.
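To make the column-based idea concrete, here is a minimal sketch using only Python's standard library. It is not Arrow's API: the names `rows`, `ids` and `scores` are made up for illustration, and `array` simply stands in for a contiguous buffer of fixed-width values.

```python
from array import array

# Row-oriented: one tuple per record, with the fields interleaved.
rows = [(1, 9.5), (2, 7.25), (3, 8.0)]

# Column-oriented: one contiguous, fixed-width buffer per field.
ids = array("q", [r[0] for r in rows])     # 64-bit signed integers
scores = array("d", [r[1] for r in rows])  # 64-bit floats

# Scanning one column only touches that column's buffer.
total_score = sum(scores)

# Every value sits at a predictable byte offset: index * itemsize.
offset_of_third_score = 2 * scores.itemsize
```

Because every value in a column has the same width, finding the n-th value is plain arithmetic rather than a walk over records, which is the property that aligned access builds on.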

It has an extensive type system that includes nested and user defined types.

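Its contiguous columnar layout is designed to work well with SIMD instructions, which apply one operation to several values of the same type packed into a vector register (for example the 128-bit XMM registers of SSE; AVX and AVX-512 provide wider ones).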

It is designed for analytics pipelines and for in-memory processing. Arrow record batches can also be written to and read back from files or streams.

Zero-copy reads are done via buffer abstractions that are available across the language implementations.
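As a rough illustration of what a zero-copy read through a buffer abstraction means, the sketch below uses Python's built-in `memoryview` rather than Arrow's own buffer classes. The variable names are hypothetical, and this only demonstrates the general idea within one process.

```python
from array import array

values = array("d", [1.0, 2.0, 3.0, 4.0])

# memoryview exposes the array's underlying buffer without copying it.
view = memoryview(values)

# Slicing a memoryview yields another view over the same memory.
tail = view[2:]

# A write through the original is visible through the view,
# which shows that no copy was made.
values[3] = 40.0
tail_values = tail.tolist()

# tobytes(), by contrast, produces an independent copy.
copied = view.tobytes()
```

The slice costs the same regardless of how much data sits behind it, since only the view's offset and length change.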

To me there seems to be some overlap (or at least a similarity in design) with SBE (Simple Binary Encoding) and the Agrona DirectBuffer.

Arrow seems to have a richer type system, and you don't need to write ugly XML config. I may write a more substantial comparison in the future.

Serialization and Deserialization

Serialization is the process of converting an object's state into a format that can be stored, transmitted, or reconstructed later.

Deserialization is the process of converting a data structure or object from a stored format into a usable object in memory.

Thanks for that, Gemini!

So, for a system to be free of serialization and deserialization, it must represent object state in the same way when:

  • It's used by the application.
  • It's stored.
  • It's transmitted.
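The three conditions above can be sketched with the standard library alone. This is an assumption-laden toy, not how Arrow's file or stream formats work (those also carry schema metadata and fix the byte order): it writes an array's native bytes to a temporary file and then maps them back in, with no encode or parse step on either side. The file name `prices.bin` and the variable names are made up.

```python
import mmap
import os
import tempfile
from array import array

prices = array("d", [101.5, 102.25, 99.75])

# Stored: write the in-memory bytes as they are, with no encoding step.
path = os.path.join(tempfile.mkdtemp(), "prices.bin")
with open(path, "wb") as f:
    f.write(memoryview(prices))

# Used: map the file and reinterpret the bytes as doubles, with no parsing step.
with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
    with memoryview(mapped) as raw, raw.cast("d") as loaded:
        loaded_values = loaded.tolist()
```

The same bytes could be sent over a socket unchanged, which would cover the "transmitted" case. The sketch relies on the writer and the reader sharing the machine's native byte order and float width.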

Benefits of Standardisation

One of the main benefits touted is standardisation.

Traditionally, data processing engine developers have created custom 
data structures to represent datasets in-memory while they are being
processed. Given the “custom” nature of these data structures, they
must also develop serialization interfaces to convert between these
data structures and different file formats, network wire protocols,
database clients, and other data transport interface.

Arrow Flight

Arrow Flight is an RPC framework for building services that exchange Arrow data over gRPC.

Flight is organized around streams of Arrow record batches, being 
either downloaded from or uploaded to another service. A set of
metadata methods offers discovery and introspection of streams,
as well as the ability to implement application-specific methods.