Using Parquet format best practices

Overview

Parquet is a column-oriented storage format widely used in the Hadoop ecosystem. It enables efficient scanning of a column or a set of columns in large amounts of data, unlike row-based file formats such as CSV. For more information on Parquet, see the Apache Parquet documentation page.
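To see the difference in practice, here is a minimal PySpark sketch, not taken from this article, that reads only two columns from a Parquet dataset; the HDFS path and column names are placeholders. With Parquet, only the selected columns are scanned, whereas a CSV reader would have to parse every row in full.

```python
# Minimal sketch (illustrative path and column names): reading a column
# subset from Parquet only scans those columns on disk.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet_column_scan").getOrCreate()

# Hypothetical dataset path, for illustration only.
df = spark.read.parquet("hdfs:///user/talend/vp_customers_parquet")
df.select("firstname", "emailID").show(10)

spark.stop()
```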

 

This article explains the best practices that Talend suggests you follow when working with Parquet. It is intended for beginners or anyone new to using Parquet when creating Jobs in Talend Studio.

 

Environment

  • Talend Studio 7.1.1

  • Cloudera 5.14 cluster

  • This article uses Hue to view the content of Hive tables. However, there are a number of different tools that you can use to query your Hive tables.

 

Introduction

In a Talend Standard Job, Hive components allow you to use Parquet with Hive version 0.10 or later for storage. In a Big Data Batch Job, the tFileInputParquet and tFileOutputParquet components allow you to read Parquet data from HDFS and write Parquet data to HDFS, respectively.

 

Creating a Big Data Batch Job to write in Parquet format

  1. Create a Big Data Batch Job, using the Spark framework, to store data in Parquet format. In this example, the Job uses the following components.

    • tHDFSConfiguration – connects to HDFS on the Cloudera cluster
    • tRowGenerator – generates random rows of data
    • tMap – concatenates firstname and lastname to generate the emailID column data
    • tFileOutputParquet – stores data in HDFS in Parquet format

    Note: you can also create a Big Data Batch Job using MapReduce as the Job framework.

    write_job.png

     

  2. Configure the tFileOutputParquet component, using the following options.

    1. Select the Define a storage configuration component check box, then select a storage configuration component, such as tHDFSConfiguration, from the pull-down menu.
    2. From the Property Type drop-down list, select Repository metadata for the connection to HDFS.
    3. From the Schema drop-down list, select Built-In.
    4. In the Folder/File field, enter the target directory in which to store the Parquet files.
    5. From the Action drop-down list, select Create.
    6. From the Compression drop-down list, choose the compression technique (Uncompressed, GZip, or Snappy) to use for the Parquet files. This example uses Uncompressed.

    write_parquetComponent.png

     

  3. Successful execution of the Job stores the data in Parquet format on HDFS. Use Hue to view the stored data, as shown below. For reference, a minimal PySpark sketch of an equivalent write follows this step.

    write_hueView.png
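The following is a minimal PySpark sketch of what this Job does, assuming a small hand-written data set and an illustrative HDFS path; the actual schema produced by tRowGenerator and the exact tMap expression may differ.

```python
# Minimal sketch of the write Job above (sample rows, the emailID expression,
# and the HDFS path are illustrative assumptions, not the Job's exact output).
from pyspark.sql import SparkSession
from pyspark.sql.functions import concat_ws, lower

spark = SparkSession.builder.appName("write_parquet_sketch").getOrCreate()

# Stand-in for tRowGenerator: a few generated rows.
rows = [("John", "Doe"), ("Jane", "Smith")]
df = spark.createDataFrame(rows, ["firstname", "lastname"])

# Stand-in for tMap: concatenate firstname and lastname into emailID.
df = df.withColumn("emailID", lower(concat_ws(".", "firstname", "lastname")))

# Stand-in for tFileOutputParquet with Compression = Uncompressed.
# The default save mode fails if the target exists, similar to Action = Create.
df.write.option("compression", "none").parquet(
    "hdfs:///user/talend/vp_customers_parquet"
)

spark.stop()
```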

     

Creating a Big Data Batch Job to read Parquet files in HDFS

  1. Create a Big Data Batch Job to read data stored in Parquet format on HDFS. In this example, the Job uses the following components.

    • tHDFSConfiguration – connects to HDFS
    • tFileInputParquet – reads Parquet data from HDFS
    • tLogRow – prints the data to the console

    read_parquet.png

     

  2. Configure the tFileInputParquet component, as shown below.

    The properties of the tFileInputParquet component are similar to those of the tFileOutputParquet component; however, you don't need to configure the Compression and Action options. In this example, the same files that were written to HDFS in the previous section are read and printed to the console. A PySpark sketch of the equivalent read follows these steps.

    read_parquetComponent.png

     

  3. Run the Job, and review the output.

    read_console.png
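For reference, here is a minimal PySpark sketch of the equivalent read, assuming the same illustrative HDFS path as in the previous sketch.

```python
# Minimal sketch of the read Job above: read Parquet from HDFS and print the
# rows, much like tFileInputParquet followed by tLogRow. Path is illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("read_parquet_sketch").getOrCreate()

df = spark.read.parquet("hdfs:///user/talend/vp_customers_parquet")
df.show(truncate=False)  # stand-in for tLogRow: print rows to the console

spark.stop()
```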

     

Creating a Talend Standard Job using Hive components

  1. Create a Talend Standard Job using Hive components that support the Parquet file format. In this example, the Job uses the following components.

    • tHiveConnection – uses metadata from the repository to connect to Hive on the Cloudera cluster
    • tHiveCreateTable – creates a Hive table
    • tHiveLoad – loads data into the Hive table in Parquet format
    • tHiveClose – closes the connection to the Hive server

    hive_job.png

     

  2. Configure the tHiveCreateTable component, using the following options.

    1. Select the Use an existing connection check box.
    2. From the Component List drop-down list, select the Hive connection configured from the repository.
    3. From the Schema drop-down list, select Built-In.
    4. In the Table Name field, enter the name of your Hive table. In this example, the table name is "vp_customers_parquet".
    5. From the Action on table drop-down list, select Create table.

      Use Create table if the Job is intended to run one time as part of a flow. Use Create table if not exists to run the Job multiple times.

    6. From the Format drop-down list, select PARQUET. This property is important to define when the Hive table being created needs to support the Parquet format.

    hive_createTable.png

     

  3. Configure the tHiveLoad component, using the following options.

    1. Select the Use an existing connection check box, to use an existing Hive connection configured through a tHiveConnection component.
    2. Select Insert from the Load action pull-down menu.
    3. Choose TABLE from the Target type pull-down menu.
    4. Enter the name of your table in the Table Name field.
    5. Select The target table uses the Parquet format check box, then choose Uncompressed from the Compression drop-down menu.
    6. Enter "SELECT * FROM 'csa'.'vp_demo_customerdata' LIMIT 100" in the Query text box.
    7. Choose OVERWRITE from the Action on file pull-down menu.

    hive_load.png

     

  4. Run the Job to create the Hive table, load the data from the other Hive table, and store it in Parquet format. From Hue, review the data stored in the Hive table. A sketch of roughly equivalent HiveQL follows this step.

    hive_huetable.png
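For reference, the HiveQL this Job issues is roughly the following, shown here through Spark's Hive support as a hedged sketch. The column list and the exact statements Talend generates are assumptions.

```python
# Hedged sketch of the HiveQL behind tHiveCreateTable and tHiveLoad, run
# through PySpark's Hive support. The column list is an illustrative
# assumption; the statements Talend generates may differ.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hive_parquet_sketch")
         .enableHiveSupport()
         .getOrCreate())

# tHiveCreateTable: Format = PARQUET (IF NOT EXISTS allows repeated runs).
spark.sql("""
    CREATE TABLE IF NOT EXISTS vp_customers_parquet (
        firstname STRING,
        lastname  STRING,
        emailID   STRING
    )
    STORED AS PARQUET
""")

# tHiveLoad: Load action = Insert, Action on file = OVERWRITE.
spark.sql("""
    INSERT OVERWRITE TABLE vp_customers_parquet
    SELECT * FROM csa.vp_demo_customerdata LIMIT 100
""")

spark.stop()
```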

     

Summary of Parquet best practices in Talend Jobs

It’s important that you follow some best practices when using the Parquet format in Talend Jobs.

  • Select Define a storage configuration component; the advantage of using this option is that the connection details can be part of the repository metadata, making them reusable in other Jobs. The Big Data Batch Jobs in this article use the HDFS connection defined in the tHDFSConfiguration component for storage.

    bestpractice_1.png

     

  • Select the Compression technique best suited to your use case. GZip achieves a higher compression rate than Snappy and creates smaller files; however, Snappy usually offers better performance. A short sketch of setting the codec in a Spark write appears after this list.

    bestpractice_2.png

     

  • In Talend Standard Jobs, Hive components support storing table data in Parquet format.

    1. The first two best practices above also apply to Talend Standard Jobs, except that Jobs using Hive components require a connection component (tHiveConnection) instead of a configuration component.

    2. Similarly, the tHiveLoad component in a Talend Standard Job allows you to define the compression technique to be used for Parquet storage. Talend recommends making a selection based on your use case, as in Big Data Batch Jobs. The image below uses tHiveLoad as an example; other Hive components also allow you to select or define the Parquet format.

      bestpractice_3.png
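As a small illustration of the compression choice in a Spark-based Job, the hedged snippet below writes the same data once with Snappy and once with GZip. The paths are placeholders, and "snappy" and "gzip" are the standard Spark codec names for Parquet.

```python
# Hedged sketch: choosing the Parquet compression codec in a Spark write.
# Snappy is usually faster; GZip usually produces smaller files.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("compression_choice").getOrCreate()
df = spark.createDataFrame([("a", 1), ("b", 2)], ["key", "value"])

df.write.mode("overwrite").option("compression", "snappy").parquet("hdfs:///tmp/out_snappy")
df.write.mode("overwrite").option("compression", "gzip").parquet("hdfs:///tmp/out_gzip")

spark.stop()
```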

       

Conclusion

The tFileInputParquet and tFileOutputParquet components in Talend Studio make it easy to read and write data in Parquet format. Several other components, such as the Hive components in Talend Standard Jobs, also support reading and writing data in Parquet format.
