Querying Parquet Files

Overview

Apache Parquet is a columnar file format for storing semi-structured data (like JSON). Apache AsterixDB supports running queries against Parquet files stored in Amazon S3 or Microsoft Azure Blob Storage by exposing them as External Datasets.

DDL

To start, an end-user needs to create a type as follows:

-- The type should not contain any declared fields
CREATE TYPE ParquetType AS {
};

Note that the created type does not have any declared fields. The reason is that Parquet files embed the schema within each file. Thus, no fields need to be declared, and it is up to AsterixDB to read each file’s schema. If the created type contains any declared field, AsterixDB will throw an error:

Type 'ParquetType' contains declared fields, which is not supported for 'parquet' format

Next, the user can create an external dataset using the declared type as follows:

Amazon S3

CREATE EXTERNAL DATASET ParquetDataset(ParquetType) USING S3
(
    -- Replace <ACCESS-KEY> with your access key
    ("accessKeyId"="<ACCESS-KEY>"),

    -- Replace <SECRET-ACCESS-KEY> with your secret access key
    ("secretAccessKey" = "<SECRET-ACCESS-KEY>"),

    -- S3 bucket
    ("container"="parquetBucket"),

    -- Path to the parquet files within the bucket
    ("definition"="path/to/parquet/files"),

    -- Specifying the format as parquet
    ("format" = "parquet")
);

Microsoft Azure Blob Storage

CREATE EXTERNAL DATASET ParquetDataset(ParquetType) USING AZUREBLOB
(
    -- Replace <ACCOUNT-NAME> with your account name
    ("accountName"="<ACCOUNT-NAME>"),

    -- Replace <ACCOUNT-KEY> with your account key
    ("accountKey"="<ACCOUNT-KEY>"),

    -- Azure Blob container
    ("container"="parquetContainer"),

    -- Path to the parquet files within the container
    ("definition"="path/to/parquet/files"),

    -- Specifying the format as parquet
    ("format" = "parquet")
);

Additional settings/properties can be set as detailed later in Parquet Type Flags.

Query Parquet Files

To query the data stored in Parquet files, one can simply write a query against the created External Dataset. For example:

SELECT COUNT(*)
FROM ParquetDataset;

Another example:

SELECT pd.age, COUNT(*) cnt
FROM ParquetDataset pd
GROUP BY pd.age;

Type Compatibility

AsterixDB supports Parquet’s generic types such as STRING, INT and DOUBLE. However, Parquet files can also contain additional types, such as DATE and DATETIME-like types. The following table shows the type mapping between Apache Parquet and AsterixDB:

| Parquet | AsterixDB | Value Examples | Comment |
|---|---|---|---|
| BOOLEAN | BOOLEAN | true / false | - |
| INT_8, INT_16, INT_32, INT_64, UINT_8, UINT_16, UINT_32 | BIGINT | AsterixDB BIGINT Range: Min: -9,223,372,036,854,775,808; Max: 9,223,372,036,854,775,807 | - |
| UINT_64 | BIGINT | AsterixDB BIGINT Range (as above) | There is a possibility that a value overflows. A warning will be issued in case of an overflow and MISSING will be returned. |
| FLOAT, DOUBLE | DOUBLE | AsterixDB DOUBLE Range: Min Positive Value: 2^-1074; Max Positive Value: 2^1023 | - |
| FIXED_LEN_BYTE_ARRAY (DECIMAL), BINARY (DECIMAL) | DOUBLE | AsterixDB DOUBLE Range (as above) | Parquet DECIMAL values are converted to doubles, with the possibility of precision loss. The flag decimal-to-double must be set upon creating the dataset. |
| BINARY (ENUM) | STRING | "Fruit" | Parquet Enum values are parsed as Strings. |
| BINARY (UTF8) | STRING | "Hello World" | - |
| FIXED_LEN_BYTE_ARRAY (UUID) | UUID | uuid("123e4567-e89b-12d3-a456-426614174000") | - |
| INT_32 (DATE) | DATE | date("2021-11-01") | - |
| INT_32 (TIME) | TIME | time("00:00:00.000") | Time in milliseconds. |
| INT_64 (TIME) | TIME | time("00:00:00.000") | Time in micro/nano seconds. |
| INT_64 (TIMESTAMP) | DATETIME | datetime("2021-11-01T21:37:13.738") | Timestamp in milli/micro/nano seconds. Parquet can also store timestamp values with the option isAdjustedToUTC = true. To get the local DATETIME value, the user can set the time zone ID using the option timezone. |
| INT96 | DATETIME | datetime("2021-11-01T21:37:13.738") | A timestamp value that stores days and time separately to form a timestamp. INT96 is always in local time. |
| BINARY (JSON) | any type | {"name": "John"} or [1, 2, 3] | The JSON string is parsed into an internal AsterixDB value. The flag parse-json-string is set by default. To get the string value (i.e., not parsed as an AsterixDB value), unset the flag parse-json-string. |
| BINARY | BINARY | hex("0101FF") | - |
| BSON | BINARY | N/A | BSON values will be returned as BINARY. |
| LIST | ARRAY | [1, 2, 3] | Parquet's LIST type is converted into ARRAY. |
| MAP | ARRAY of OBJECT | [{"key":1, "value":1}, {"key":2, "value":2}] | Parquet's MAP types are converted into an ARRAY of OBJECT. Each OBJECT value consists of two fields: key and value. |
| FIXED_LEN_BYTE_ARRAY (INTERVAL) | - | N/A | INTERVAL is not supported. A warning will be issued and a MISSING value will be returned. |
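
Since a Parquet MAP arrives as an ARRAY of OBJECT, its entries can be flattened like any other array. The following is a minimal sketch, assuming a hypothetical MAP column named attributes (it is not part of the examples in this document). The key and value fields are written as delimited identifiers in case they clash with SQL++ keywords:

```sql
-- attributes is a hypothetical Parquet MAP column, exposed as an ARRAY of OBJECT
SELECT a.`key` AS k, a.`value` AS v
FROM ParquetDataset pd
UNNEST pd.attributes AS a;
```

Each result record then carries one key/value pair of the original map.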

Parquet Type Flags

The table in Type Compatibility shows the type mapping between Parquet and AsterixDB. Some of the Parquet types are not parsed by default, as those types are not natively supported in AsterixDB. However, the user can set a flag to convert some of those types into a supported AsterixDB type.

DECIMAL TYPE

The user can enable parsing of Parquet DECIMAL values by setting the flag decimal-to-double, as in the following example:

CREATE EXTERNAL DATASET ParquetDataset(ParquetType) USING S3
(
    -- Credentials and path to Parquet files
    ...

    -- Enable converting decimal values to double
    ("decimal-to-double" = "true")
);

This flag enables converting DECIMAL values into DOUBLE. For example, if the flag decimal-to-double is not set and a Parquet file contains a DECIMAL value, the following error will be thrown when running a query that requests a DECIMAL value:

Parquet type "optional fixed_len_byte_array(16) decimalType (DECIMAL(38,18))" is not supported by default. To enable type conversion, recreate the external dataset with the option "decimal-to-double" enabled

and the returned value will be MISSING. If the flag decimal-to-double is set, the converted DOUBLE value will be returned.

TEMPORAL TYPES

For the temporal types (namely DATETIME), values can be stored in Parquet with the option isAdjustedToUTC = true. In that case, the user has to provide a timezone ID by setting the flag timezone so that the values are adjusted to local time. For example, a user can set the timezone ID to “PST” upon creating a dataset:

CREATE EXTERNAL DATASET ParquetDataset(ParquetType) USING S3
(
    -- Credentials and path to Parquet files
    ...

    -- Converting UTC time to PST time
    ("timezone" = "PST")
);

If the flag timezone is not set, a warning will appear when running a query:

Parquet file(s) contain "datetime" values that are adjusted to UTC. Recreate the external dataset and set "timezone" to get the local "datetime" value.

and the UTC DATETIME will be returned.
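
Whether the returned DATETIME values are in UTC or in local time matters for range predicates such as a single-day filter. The following is a sketch only, assuming a hypothetical INT_64 (TIMESTAMP) column named createdAt:

```sql
-- createdAt is a hypothetical INT_64 (TIMESTAMP) column
SELECT COUNT(*) AS cnt
FROM ParquetDataset pd
WHERE pd.createdAt >= datetime("2021-11-01T00:00:00.000")
  AND pd.createdAt <  datetime("2021-11-02T00:00:00.000");
```

With timezone set, the day boundaries are compared against the adjusted local values. Without it, they are compared against the UTC values, so records near midnight may fall on a different day.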

JSON TYPE

By default, JSON values are parsed into AsterixDB values, which a user can process using SQL++ queries. However, one can disable the parsing of JSON string values (which are stored as STRING) by unsetting the flag parse-json-string, as in the following example:

CREATE EXTERNAL DATASET ParquetDataset(ParquetType) USING S3
(
    -- Credentials and path to Parquet files
    ...

    -- Stop parsing JSON string values
    ("parse-json-string" = "false")
);

With the flag unset, the returned value will be of type STRING.
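
One way to see which mode a dataset is in is to inspect the runtime type of a JSON column. The following is a sketch, assuming a hypothetical BINARY (JSON) column named payload that holds objects such as {"name": "John"}:

```sql
-- payload is a hypothetical BINARY (JSON) column
SELECT is_string(pd.payload) AS unparsed,  -- expected to be true when parse-json-string is "false"
       pd.payload.name AS name             -- expected to be MISSING when payload is a plain STRING
FROM ParquetDataset pd
LIMIT 5;
```

With parsing enabled (the default), payload is an OBJECT, so the path pd.payload.name can be resolved directly.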

INTERVAL TYPE

Currently, AsterixDB does not support Parquet’s INTERVAL type. When a query requests (or projects) an INTERVAL value, a warning will be issued and a MISSING value will be returned instead.
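
Since unsupported values surface as MISSING, a query can count how many records are affected. The following is a sketch, assuming a hypothetical INTERVAL column named duration:

```sql
-- duration is a hypothetical FIXED_LEN_BYTE_ARRAY (INTERVAL) column
SELECT COUNT(*) AS missing_cnt
FROM ParquetDataset pd
WHERE pd.duration IS MISSING;
```

Note that IS MISSING also matches records in which the field is absent altogether, so the count is an upper bound on the number of unsupported values. The same check can be applied to a UINT_64 column to look for values that overflowed.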