I am new to Talend and am trying to find the best component for reading a file of variable-length records. Let's say I have an input dataset that looks like this, with no end-of-line delimiter:

1234ABCD10XYDDDRRERE5678EFGH08ABABABAB9101IJKL12YYYYYYYYYYYY

There are 3 records in the above file. Each record is laid out as follows:

The first 4 bytes are FIELD1.
The next 4 bytes are FIELD2.
The next 2 bytes are LENGTH.
The rest of the record is a text field whose length is given by the value of LENGTH.

I want to break the file into the following:

1234 ABCD 10 XYDDDRRERE
5678 EFGH 08 ABABABAB
9101 IJKL 12 YYYYYYYYYYYY

What is the best component to use to process variable-length records from a file with no delimiter or NEWLINE indicator?
Thanks for your response. I know that I can build a process to pre-process the data. I wanted to see whether reading variable-length records with no end-of-line delimiter is possible in Talend itself; if not, I can go the pre-processing route. I am just looking for an answer on whether it is possible through Talend.
OK, there is a process that should do what you want, but it requires using some seldom-used components in an unusual way. Essentially, we can do the pre-processing as the first few steps of a single Talend job. I have not implemented this exact set of components in exactly this way for exactly this purpose, but I have used each of the components for similar things in separate jobs. Buckle up for an interesting ride.

If there is only one "row" in the file, then the first challenge you may face is reading the entire row into a single input component. Try tFileInputFullRow and see if that will allow you to read in the row of data. Send the row of data to a context variable so you can work with it. If the row of data is too large for a context variable, a global variable, or a variable defined in a tJava component, then we are back to splitting the row some other way. If you can read the maximum-length row in successfully, then keep reading.

Next, use a tLoop to iterate over the records in the row. Use the "while" condition of the tLoop, and set the condition to something like context.work_row.length() >= 10. The 10 is because you know each record is at least 4+4+2 bytes long, so anything shorter cannot hold a complete record.

Inside the loop, consider using a tExtractPositionalFields, or just use a tMap. You do have a pattern that you are working with: 4,4,2,some_really_big_number. Using tMap, split the first ten bytes into three distinct columns. Use the value in the third column to pull n bytes from the remaining string. This is the fourth column of your record. Send the 4 columns to an output flow, and write the output flow to a tHashOutput.

In the same tMap, remove the first record from context.work_row. Do this with context.work_row.substring(4+4+2+Integer.valueOf(textLength), context.work_row.length()), where textLength is the value contained in bytes 9 and 10. Send the shorter row out to a tContextLoad and store the new value in context.work_row.
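To make the per-iteration step concrete, here is a minimal plain-Java sketch of the expressions described above, evaluated in the order they depend on each other. The class and method names (FirstRecordSplit, splitFirstRecord) are made up for illustration and are not Talend APIs. In a real job each line would be a tMap expression reading context.work_row, and I have not run this inside Talend.

```java
public class FirstRecordSplit {

    // Fixed-width header: FIELD1 (4) + FIELD2 (4) + LENGTH (2).
    static final int HEADER_LENGTH = 4 + 4 + 2;

    /**
     * Hypothetical helper: splits the first record off workRow.
     * Returns {FIELD1, FIELD2, LENGTH, text, remainder}.
     * Assumes workRow holds at least one complete record.
     */
    static String[] splitFirstRecord(String workRow) {
        // Header columns come first because the text column depends on LENGTH.
        String field1 = workRow.substring(0, 4);
        String field2 = workRow.substring(4, 8);
        String textLength = workRow.substring(8, 10); // bytes 9 and 10

        // LENGTH is text such as "08"; parseInt accepts the leading zero.
        int recordEnd = HEADER_LENGTH + Integer.parseInt(textLength);

        String text = workRow.substring(HEADER_LENGTH, recordEnd);
        // Everything after the first record becomes the new working row.
        String remainder = workRow.substring(recordEnd);

        return new String[] {field1, field2, textLength, text, remainder};
    }

    public static void main(String[] args) {
        String workRow = "1234ABCD10XYDDDRRERE5678EFGH08ABABABAB";
        String[] parts = splitFirstRecord(workRow);
        System.out.println(String.join(" | ", parts));
    }
}
```

The last element of the returned array is what would be written back to context.work_row before the next pass of the loop.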
Because you are in a tLoop, and context.work_row still has data, the process will continue stepping through the string, pulling records one by one and sending them to the tHashOutput. When the tLoop completes, you should have a tHashOutput populated with one row for each of the records. Use a tHashInput and do whatever you need to do with each record.

The diagram so far, minus any support stuff like tLogRow, is something like this:

tFileInputFullRow --> tContextLoad_1
   |
   |---> tLoop --> tIterateToFlow --> tMap --> tContextLoad_2
                                        |
                                        |---> tHashOutput
So... what are the problem points?

1) Successfully reading a potentially very large row into a working variable so you can step through it. You may need to create a custom global variable using tJava in order to hold the working string.

2) Setting and resetting the same context variable inside a tMap inside a tLoop seems a little odd, but it works.

3) Parsing and reducing the working variable at each step pushes the Talend job back into a procedural programming model rather than a functional programming model. It works, but Talend may not be the most efficient tool for this specific purpose.

4) Using the value in the third column to calculate the length of the fourth column, and then taking a substring of the text, requires that you use Var.* in the tMap in the proper order. That is, you will need to split out the first three columns as individual Var variables, then use the third Var variable to substring the original string. Not difficult, but it may take a while to debug and get the lengths correct.

Hope this helps.
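If the tLoop approach runs into points 1 or 3, one alternative worth considering is to do the whole split in ordinary Java, for example in a tJava component or a custom routine. The sketch below is only that: it assumes the file contents already fit in a single String, it walks the string with an offset instead of shrinking it on every pass, and the names (VariableLengthParser, parse) are hypothetical rather than anything Talend provides.

```java
import java.util.ArrayList;
import java.util.List;

public class VariableLengthParser {

    // Fixed-width header: FIELD1 (4) + FIELD2 (4) + LENGTH (2).
    static final int HEADER_LENGTH = 10;

    /**
     * Hypothetical helper: returns one {FIELD1, FIELD2, LENGTH, text}
     * array per record found in data.
     */
    static List<String[]> parse(String data) {
        List<String[]> records = new ArrayList<>();
        int pos = 0;

        // Same idea as the tLoop condition: stop when fewer than 10
        // characters remain. Any such leftover is silently ignored here.
        while (data.length() - pos >= HEADER_LENGTH) {
            String field1 = data.substring(pos, pos + 4);
            String field2 = data.substring(pos + 4, pos + 8);
            String textLength = data.substring(pos + 8, pos + 10);

            int end = pos + HEADER_LENGTH + Integer.parseInt(textLength);
            if (end > data.length()) {
                // LENGTH claims more text than the input contains.
                throw new IllegalArgumentException(
                        "Truncated record at offset " + pos);
            }

            String text = data.substring(pos + HEADER_LENGTH, end);
            records.add(new String[] {field1, field2, textLength, text});

            // Advance the offset; the working string is never copied.
            pos = end;
        }
        return records;
    }

    public static void main(String[] args) {
        String data = "1234ABCD10XYDDDRRERE" + "5678EFGH08ABABABAB";
        for (String[] r : parse(data)) {
            System.out.println(String.join(" ", r));
        }
    }
}
```

Moving an offset avoids rebuilding the remaining string on every record, which starts to matter when the single row is very large. The resulting list could then be turned into a flow inside the job, though I have not tried that part.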