Introduction to Arrow Flight SQL
SQL
Apache Arrow
2023-04-20

This article aims to provide a brief introduction to the background and reasons behind Flight SQL, without delving into the specific implementation details.

The Era of Columnar Storage #

In the era of big data, particularly in analytical workloads, queries rarely need an entire entity; they usually touch only a few of its attributes. If data is stored on disk in a row-based format, we have to read every row in full even though we need only the values of one column, which wastes I/O. If data is stored in a columnar format, we read only the bytes that belong to the required columns, significantly reducing disk I/O.
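The difference can be illustrated with a minimal sketch. The table, its field names (`user_id`, `age`, `balance`) and the byte layouts below are assumptions made up for this example, and real engines read in pages or blocks rather than exact byte ranges, so treat the numbers as an illustration of the idea only.

```python
import struct

# Hypothetical records: (user_id: int32, age: int32, balance: float64).
records = [(i, 20 + i % 50, 100.0 * i) for i in range(1000)]
ROW = struct.Struct("<iid")  # one packed row = 4 + 4 + 8 = 16 bytes

# Row-based layout: all fields of one record sit next to each other.
row_store = b"".join(ROW.pack(*r) for r in records)

# Columnar layout: all values of one attribute sit next to each other.
n = len(records)
col_store = {
    "user_id": struct.pack(f"<{n}i", *(r[0] for r in records)),
    "age": struct.pack(f"<{n}i", *(r[1] for r in records)),
    "balance": struct.pack(f"<{n}d", *(r[2] for r in records)),
}


def avg_age_row_store(buf):
    """Average age from the row layout: every full row has to be scanned."""
    total, count, bytes_read = 0, 0, 0
    for offset in range(0, len(buf), ROW.size):
        _, age, _ = ROW.unpack_from(buf, offset)
        bytes_read += ROW.size
        total += age
        count += 1
    return total / count, bytes_read


def avg_age_col_store(cols):
    """Average age from the columnar layout: only the 'age' buffer is read."""
    buf = cols["age"]
    ages = struct.unpack(f"<{len(buf) // 4}i", buf)
    return sum(ages) / len(ages), len(buf)


row_avg, row_bytes = avg_age_row_store(row_store)
col_avg, col_bytes = avg_age_col_store(col_store)
print(f"row layout:      avg={row_avg}, bytes scanned={row_bytes}")
print(f"columnar layout: avg={col_avg}, bytes scanned={col_bytes}")
```

Both functions compute the same average, but the columnar version touches only the buffer of the one attribute it needs.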

Furthermore, because columnar storage places the values of one column, which share a type and often resemble each other, next to each other, it compresses better than row-based storage, where values of different types are interleaved.
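A small experiment can make the compression argument concrete. The sketch below uses a synthetic table (two low-cardinality attributes and one noisy measurement, all invented for this example) and `zlib` as a stand-in codec; actual ratios depend heavily on the data and on the compression scheme a real system uses.

```python
import random
import zlib

N = 20_000
rng = random.Random(42)  # fixed seed so repeated runs see the same data

# Synthetic table: two repetitive attributes plus one noisy measurement.
country = [b"US" if i < N // 2 else b"DE" for i in range(N)]
active = [b"\x01" if i % 1000 < 900 else b"\x00" for i in range(N)]
reading = [rng.getrandbits(32).to_bytes(4, "little") for _ in range(N)]

# Row layout: the fields of each record are interleaved.
row_layout = b"".join(c + a + r for c, a, r in zip(country, active, reading))

# Columnar layout: each attribute is stored as one contiguous run.
col_layout = b"".join(country) + b"".join(active) + b"".join(reading)

row_size = len(zlib.compress(row_layout))
col_size = len(zlib.compress(col_layout))

print(f"uncompressed size (both layouts): {len(row_layout)} bytes")
print(f"row layout compressed:      {row_size} bytes")
print(f"columnar layout compressed: {col_size} bytes")
```

Both layouts hold exactly the same bytes in a different order. In the columnar layout the repetitive attributes form long runs that the compressor can collapse, whereas in the row layout they are broken up by the noisy field in every record.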

Columnar Storage Era with Row-based Infrastructure #

Although we are in the era of columnar storage, much of the infrastructure that databases rely on for analytical workloads is still row-based. For example, JDBC and ODBC, the usual interfaces between database clients and servers, were designed for the row-based storage that was prevalent in the past.

However, now that both clients and servers actively embrace columnar storage, this approach is no longer ideal. To go through JDBC/ODBC, the server must convert its columnar data to a row-based format before sending it, and the client must convert the rows back to a columnar format for further processing. This round trip involves serialization and deserialization and incurs additional overhead.

We can entirely avoid this unnecessary conversion if our communication protocol is also based on columnar storage.
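The round trip described above can be sketched in a few lines. The column names and the helper functions `columns_to_rows` and `rows_to_columns` are hypothetical and only model the pivoting step; a real JDBC/ODBC driver additionally encodes and decodes every value for the wire.

```python
# Columnar data as it might exist inside a columnar server (made-up sample).
server_columns = {
    "id": [1, 2, 3],
    "city": ["Oslo", "Lima", "Pune"],
    "temp": [4.5, 19.0, 31.2],
}


def columns_to_rows(columns):
    """Server side: pivot columns into rows for a row-oriented protocol."""
    names = list(columns)
    rows = list(zip(*(columns[name] for name in names)))
    return names, rows


def rows_to_columns(names, rows):
    """Client side: pivot the received rows back into columns."""
    columns = {name: [] for name in names}
    for row in rows:
        for name, value in zip(names, row):
            columns[name].append(value)
    return columns


names, rows = columns_to_rows(server_columns)    # conversion 1, on the server
client_columns = rows_to_columns(names, rows)    # conversion 2, on the client

print(rows)
print(client_columns == server_columns)
```

The client ends up with exactly the columns the server started with, so both conversions are pure overhead that a columnar protocol would not need.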

Data Transmission Using Columnar Storage Format #

Apache Arrow is a widely adopted in-memory columnar format, which makes it a natural foundation for a transmission protocol. The Apache Arrow community recognized this potential and created Arrow Flight, an RPC framework built on gRPC that transfers data as streams of Arrow record batches.

So, What is Flight SQL? #

Now that we understand what Flight is, why do we need Flight SQL? There are two reasons:

  • Flight clients and servers exchange commands as opaque sequences of bytes; Flight itself does not define what those bytes mean.
  • Flight supports any tabular data and is not specifically tied to databases.

Flight is therefore a fairly generic protocol and cannot serve as a standard for database operations on its own. Flight SQL builds on Flight to provide an open communication protocol dedicated to SQL databases, with standard messages for operations such as executing queries and retrieving metadata.

Besides removing unnecessary data conversions, a standardized communication protocol is independent of any particular database. A client can talk to any database that supports Flight SQL without installing a separate driver for each one, as JDBC/ODBC requires. One client thus supports n databases, a 1:n relationship.
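The layering and the 1:n relationship described above can be modeled with a toy sketch. Everything in it is hypothetical: `GenericTransport`, `encode_statement_query`, `make_sql_server` and `SqlClient` are names invented for this example, and JSON stands in for the encoding. Real Flight SQL defines its commands as Protocol Buffers messages (for example `CommandStatementQuery`) and returns Arrow record batches over gRPC.

```python
import json
from dataclasses import dataclass
from typing import Callable


@dataclass
class GenericTransport:
    """Stand-in for a Flight-like service: it returns columnar data and
    treats the request as opaque bytes whose meaning it does not define."""
    name: str
    handler: Callable[[bytes], dict]

    def do_get(self, command: bytes) -> dict:
        return self.handler(command)


def encode_statement_query(sql: str) -> bytes:
    """Hypothetical shared command encoding (JSON here, for illustration).
    Agreeing on this encoding is what the SQL layer adds on top."""
    return json.dumps({"type": "StatementQuery", "query": sql}).encode()


def make_sql_server(name: str, tables: dict) -> GenericTransport:
    """Build a toy server that understands the shared command encoding."""
    def handler(command: bytes) -> dict:
        message = json.loads(command)
        if message["type"] != "StatementQuery":
            raise ValueError("unsupported command")
        # Toy "engine": only handles `SELECT * FROM <table>`.
        table = message["query"].rsplit(" ", 1)[-1]
        return tables[table]
    return GenericTransport(name, handler)


class SqlClient:
    """One client implementation that works with every conforming server."""
    def execute(self, server: GenericTransport, sql: str) -> dict:
        return server.do_get(encode_statement_query(sql))


# Two different "databases", one client, no per-database driver.
db_a = make_sql_server("db_a", {"t": {"x": [1, 2]}})
db_b = make_sql_server("db_b", {"t": {"x": [3]}})
client = SqlClient()
results = {s.name: client.execute(s, "SELECT * FROM t") for s in (db_a, db_b)}
print(results)
```

The transport alone would let each server invent its own byte format, which is why every database would still need its own client. Once the command encoding is shared, the single `SqlClient` works against both servers unchanged.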

Support for Flight SQL by Various Databases #

  • Databend (https://github.com/datafuselabs/databend/issues/10745): Databend has already started implementing server-side support for Flight SQL.
  • TiDB (https://github.com/pingcap/tidb/issues/21056): The TiDB community has not actively pursued this, perhaps because:
    1. Flight SQL was still immature, especially when the issue was created in 2020.
    2. TiDB internally implements its own columnar data structure, called Chunk, which is similar to RecordBatch.
    3. Communication between TiDB and TiKV already works in a way similar to Flight, so the potential benefit of switching might not be significant.
