Parquet is a column-oriented storage format widely used in the Hadoop ecosystem. Unlike row-based formats such as CSV, it enables efficient scanning of a single column or a subset of columns across large amounts of data. For more information on Parquet, see the Apache Parquet documentation page.
This article explains the best practices that Talend suggests you follow when working with Parquet. It is intended for beginners or anyone new to using Parquet when creating Jobs in Talend Studio.
Talend Studio 7.1.1
Cloudera 5.14 cluster
In a Talend Standard Job, Hive components allow you to use Parquet with Hive, version 0.10 or later, for storage. In a Big Data Batch Job, the tFileInputParquet and tFileOutputParquet components allow you to read Parquet data from, and write Parquet data to, HDFS respectively.
Create a Big Data Batch Job, using the Spark framework, to store data in Parquet format. In this example, the Job uses the following components.
Note: You can also create a Big Data Batch Job using the MapReduce framework.
Configure the tFileOutputParquet component, using the following options.
Successful execution of the Job stores the data in Parquet format on HDFS. Use Hue to view the stored data, as shown below:
Create a Big Data Batch Job to read data stored in Parquet format on HDFS, using the following components.
Configure the tFileInputParquet component, as shown below.
The properties for the tFileInputParquet component are similar to those of the tFileOutputParquet component. However, it isn't necessary to configure the Compression and Action options. In this example, the same files that were written to HDFS in the previous section are read into the console.
Run the Job, and review the output.
Create a Talend Standard Job using the Hive components that support the Parquet file format. In this example, the Job uses the following components.
Configure the tHiveCreateTable component, using the following options.
Use Create table if the Job is intended to run only once as part of a flow. Use Create table if not exists if the Job runs multiple times, so that subsequent runs do not fail because the table already exists.
Configure the tHiveLoad component, using the following options.
Run the Job to create a Hive table, load the data from another Hive table, and store it in Parquet format. From Hue, review the data stored in the Hive table.
It’s important that you follow some best practices when using the Parquet format in Talend Jobs.
Select Define a storage configuration component; the advantage of using this option is that the configuration details can be stored as repository metadata, making them reusable in other Jobs. The Big Data Batch Jobs in this article use the HDFS connection defined in the tHDFSConfiguration component for storage.
Select the Compression technique best suited to your use case. GZip achieves a higher compression ratio than Snappy and creates smaller files; Snappy, however, usually compresses and decompresses faster.
In Talend Standard Jobs, Hive components support storing table data in Parquet format.
The first two best practices above apply to Talend Standard Jobs as well, with the exception of Jobs using a tHiveConnection component. These Jobs require you to use a connection component instead of a configuration component.
Similarly, the tHiveLoad component in a Talend Standard Job allows you to define the Compression technique to be used for Parquet storage. Talend recommends making a selection based on your use case, as in Big Data Batch Jobs. The image below uses tHiveLoad as an example; other Hive components also allow you to select or define the Parquet format.
The dedicated Talend Studio components tFileInputParquet and tFileOutputParquet make it easy to read and write data in Parquet format. Several other components, such as the Hive components in Talend Standard Jobs, also support reading and writing Parquet data.