Split source based on byte size

Overview

You may need to split a large source into multiple target files, each limited to a specific size in bytes. Talend target file components don't provide an option to create files based on a specific KB/MB size.

 

Procedure

Use the tJavaFlex component with FileOutputStream to achieve this.

  1. Create a Standard Job in Talend Studio, and define the source to be split.

  2. Configure the Basic settings of the source with column fields. Here, file input is used as the source, and the component is renamed Read Source.

    readsource.png

     

  3. Add a tJavaFlex component to the workspace, and connect the source to it with a Main row. Use the Sync columns option to make sure all the fields from the source are captured, and rename the component to something descriptive, such as Split_Based_On_Size.

    Article1.png

     

  4. Define two context variables: TgtFilePath, the target path where the files will be created, and SplitByte, the size in bytes at which a new target file is started.

    split.png

     

    Here the goal is to split the source into 1 MB target files, so the SplitByte value is derived as follows:

             1 MB = 1024 KB = 1024 * 1024 bytes = 1048576

     

  5. A tJavaFlex component has three code sections: Start code, Main code, and End code. Start code initializes variables and resources once, Main code runs the required logic for each incoming row, and End code runs once after the last row to clean up.

    1. In Start code, define two integer variables: iterator to keep the count of files generated, and ByteCount to count the number of bytes read. Define a FileOutputStream that will be used to write target files:

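      // start part of your Java code
      Integer iterator = 1;  // Number of the target file currently being written
      Integer ByteCount = 0; // Bytes counted for the current target file
      FileOutputStream fos = new FileOutputStream(context.TgtFilePath + "TargetFile_" + iterator + ".txt");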
    2. In Main Code, build each output line from the source fields, convert it to bytes, and add its length to the running byte count. When the count exceeds SplitByte, close the current file and open the next one:

      String tmpReadLine = row1.FirstName + "," +
          row1.LastName + "," +
          row1.Age + "," +
          row1.City + "," +
          row1.State + "\n"; // Build the output line from the input fields
      
      byte[] contentInBytes = tmpReadLine.getBytes(); // Convert the line to bytes (platform default charset)
      
      ByteCount = ByteCount + contentInBytes.length; // Running total of bytes for the current file
      
      if (ByteCount > context.SplitByte) {
          // Threshold crossed: close the current file and start a new one
          fos.flush();
          fos.close();
          iterator = iterator + 1;
          fos = new FileOutputStream(context.TgtFilePath + "TargetFile_" + iterator + ".txt");
      
          // The current line becomes the first record of the new file
          ByteCount = contentInBytes.length;
      }
      
      // Always write the line; skipping this in the threshold branch would drop the record
      fos.write(contentInBytes);

      Note: When you are initializing the tmpReadLine variable, choose the delimiter and row separator you want before writing to the file. If you're not sure, configure them through context variables and use them here.

       

      The tmpReadLine variable holds the entire line built from the source fields: FirstName, LastName, Age, City, and State. Change the code to match your source fields.

       

    3. In the End Code, use fos.close() to close the FileOutputStream.

      To use FileOutputStream, you need to import the java.io.FileOutputStream class. Add the import statement (import java.io.FileOutputStream;) in the Import field of the Advanced settings of the tJavaFlex component.

      Article3.png

       

    4. Execute the Job, and check the Target file location as configured in the Job:

      Article4.png
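The splitting logic from the steps above can also be tried outside Talend. The following is a minimal sketch, not the Job's generated code: the class name SizeSplitter, the split method, and its parameters are hypothetical names introduced for illustration, and UTF-8 is an assumed encoding. It checks the size before writing, so a line that would push the current file past the limit becomes the first line of the next file.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class SizeSplitter {

    // Hypothetical helper, not part of Talend. Writes whole lines into
    // TargetFile_1.txt, TargetFile_2.txt, ... under dir and starts a new file
    // whenever the next line would push the current file past splitBytes.
    // Returns the number of files created.
    public static int split(List<String> lines, Path dir, long splitBytes) throws IOException {
        int fileIndex = 1;
        long byteCount = 0; // bytes already written to the current file
        OutputStream out = Files.newOutputStream(dir.resolve("TargetFile_" + fileIndex + ".txt"));
        try {
            for (String line : lines) {
                // Assumption: UTF-8 output with "\n" as the row separator
                byte[] content = (line + "\n").getBytes(StandardCharsets.UTF_8);

                // Rotate before writing, but only if the current file already has data
                if (byteCount > 0 && byteCount + content.length > splitBytes) {
                    out.close();
                    fileIndex++;
                    out = Files.newOutputStream(dir.resolve("TargetFile_" + fileIndex + ".txt"));
                    byteCount = 0;
                }

                // The line is always written in full to exactly one file
                out.write(content);
                byteCount += content.length;
            }
        } finally {
            out.close();
        }
        return fileIndex;
    }
}
```

Calling SizeSplitter.split(lines, targetDir, 1048576) would produce files of at most 1 MB each, except when a single line is itself larger than the limit, in which case that line gets a file of its own.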

       

Summary

In this design, a record is never split across two target files, because the size check is made on whole lines before anything is written: a line either goes to the current target file in full, or a new target file is started.

Version history
Revision #: 6 of 6
Last update: 10-23-2017 06:53 PM