Slow tFilelist?

Hello,
I have a job which needs the latest file from a directory with a lot of files (over 56,000, about 700 of which match the file mask I am searching for).
The file I need is searchable and contains a date/time stamp (but not always from today or yesterday).
On a local disk the job runs adequately (it finds the file in about 2 seconds), but on a Windows share holding the files it takes over 40 minutes. What am I doing wrong?
The file mask I'm searching for is: "test." + context.customernumber + "*.txt"
The tFileList is set to sort by date descending, and each iteration goes to a tJava that sets a global variable if it is unset and otherwise does nothing (this way I get the latest file).
I have also tried sorting by date ascending and keeping the last iteration, but the time remains the same.
The setup:
Client (which runs Talend): Windows 7
Server (which holds the files on a Samba share): Windows Server 2003
I am almost desperate enough to create a subjob which gets the complete file listing unsorted and then sorts it in the subjob. But I don't think this is the correct way to go.
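For what it's worth, the unsorted-list-then-pick-latest idea can be sketched in plain Java (the language Talend generates). This is only a sketch of the approach, not Talend's own code; the share path and customer number below are hypothetical placeholders:

```java
import java.io.File;
import java.io.FilenameFilter;

// List the directory ONCE without sorting, then pick the newest match locally.
public class LatestFileFinder {
    public static File findLatest(File dir, final String prefix, final String suffix) {
        File[] matches = dir.listFiles(new FilenameFilter() {
            public boolean accept(File d, String name) {
                return name.startsWith(prefix) && name.endsWith(suffix);
            }
        });
        if (matches == null || matches.length == 0) return null;
        File latest = matches[0];
        for (File f : matches) {
            // lastModified() is read once per file during this single pass.
            if (f.lastModified() > latest.lastModified()) latest = f;
        }
        return latest;
    }

    public static void main(String[] args) {
        String customerNumber = "12345"; // hypothetical context value
        File latest = findLatest(new File("\\\\server\\share"), // placeholder UNC path
                "test." + customerNumber, ".txt");
        System.out.println(latest); // null if the directory or match is missing
    }
}
```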

Re: Slow tFilelist?

Hi
Welcome to the Talend Community!
Could you explain your job logic in more detail?
I need to know what the job does when it finds the latest file. Does it copy the file or move it?
Sometimes when a Talend job tries to handle a file (e.g. an Excel file) that is opened by another user, the job waits until the file is no longer in use.
Or am I missing some detail?
Regards,
Pedro

Re: Slow tFilelist?

Job logic is pretty simple (in the test job I created for isolating the problem):
Some values come from a context.
tJava_1:
if the context property is null, the file mask is different from when the context property is filled.
tFileList_1:
searches for all files matching the file mask set in tJava_1 (this is the step that takes 30 minutes in this example)
tJava_2:
prints the last record found
tFileExist_1:
the start of the job, if there is a last file.
In this example I was searching without a context property, so the file mask was: class.*
The file specs are:
Total files: 16655
class.* files: 486
I don't see where the 30 minutes go in this job; there is no opening/closing of files involved, and all the files on the server are closed.

Re: Slow tFilelist?

OK, some further investigation revealed that tFileList with sort options set is terribly slow.
It's about 100x faster to build a tFileList (without sorting) -> tFileInfo -> tSortRow than to use the sorting options in the tFileList settings.
nc

Re: Slow tFilelist?

tFileList seems to have some problem when working with network paths... I have a directory containing about 2,000 files, and tFileList freezes despite the very good latency of the connection... I suppose it is a bug?

Re: Slow tFilelist?

@nc: What kind of network connection are you using? I was using Windows UNC paths (so I guess it uses the SMB components).
If you are using FTP or some other network connection, the problem may be somewhere else...
nc

Re: Slow tFilelist?

I'm using a standard Windows UNC path such as "\\serverName.domainName.local\sharedDirectory". When I open the UNC path in Windows Explorer, I see the list of files in a flash and can walk into each directory without any delay... Despite that, when I try to print the directory listing with a simple job such as "tFileList -> tLogRow", I have to wait many minutes...
nc

Re: Slow tFilelist?

Just one note: the "Order by" and "Order action" settings are left at their default values.
I didn't describe the test job correctly before: it is actually "tFileList -> tIterateToFlow -> tLogRow".
Thanks,
N.
dgm

Re: Slow tFilelist?

It's faster to list files using a system command than using tFileList.
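This suggestion can be sketched with ProcessBuilder. On Windows, "dir /b /o-d" prints bare file names newest first; the non-Windows branch uses "ls -t" as a rough equivalent. This is my own sketch of the idea, not an exact command the poster gave:

```java
import java.io.*;
import java.util.*;

// Let the operating system list and date-sort the directory, instead of
// statting every file from Java over the network.
public class DirListing {
    public static List<String> listNewestFirst(String directory)
            throws IOException, InterruptedException {
        String os = System.getProperty("os.name").toLowerCase();
        ProcessBuilder pb = os.contains("win")
                ? new ProcessBuilder("cmd", "/c", "dir", "/b", "/o-d", directory)
                : new ProcessBuilder("ls", "-t", directory); // Unix equivalent
        pb.redirectErrorStream(true);
        Process p = pb.start();
        List<String> names = new ArrayList<String>();
        BufferedReader r = new BufferedReader(new InputStreamReader(p.getInputStream()));
        String line;
        while ((line = r.readLine()) != null) names.add(line);
        p.waitFor();
        return names;
    }

    public static void main(String[] args) throws Exception {
        // Example: list the current directory, newest first.
        for (String name : listNewestFirst(".")) System.out.println(name);
    }
}
```

In a Talend job this would typically sit in a tJava or tSystem step, with the first returned name being the latest file.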

 
