Slow down while processing 100M+ row file?

Nine Stars

Slow down while processing 100M+ row file?

I have a file with over 100 million rows of data.

The job processes around 2,780 rows per second when it starts, but after about 5 million rows the speed starts to slow down and eventually drops to about 2 rows per second.

 

The job is:

tFileInputDelimited > tMap > tContextLoad
                                  ↓
                              tJava > tFileOutputDelimited

 

In the tMap component's Advanced settings, I have Store on disk enabled with Max buffer size: 1,000,000.

 

In the job's Run tab Advanced settings I have -Xms6256M and -Xmx7024M.

The virtual server I am running the job on has 8 processors, 8 sockets and 32GB of RAM

 

What can I do to keep the job running at 2,780 rows per second?

 

 

 


Accepted Solutions
Community Manager

Re: Slow down while processing 100M+ row file?

Your routine needs to look something like this (you will need to handle the imports, etc).....

public class GPSConvert {

    /**
     * Converts a longitude/latitude pair to its MGRS designation.
     * Assumes the CoordinateConversion class is available as a routine (handle any imports as needed).
     */
    public static String ConvertCoords(Double long_, Double lat_) {

        String myResult = "";
        CoordinateConversion cs = new CoordinateConversion();

        //GET THE MGRS VALUE:
        myResult = String.valueOf(cs.latLon2MGRUTM(lat_, long_));
        return myResult;
    }

}

You can use this in your tMap by simply placing the code below in the column you want to output this data to....

 

routines.GPSConvert.ConvertCoords(row1.long, row1.lat)

There may be a bit of tidying up to do, but this will make your job run a lot faster.
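
If the file can contain rows with empty coordinate columns, one possible bit of tidying is a null guard in the tMap expression (the row1.lat / row1.long names below are just the placeholders from the example above; adjust them to your schema):

// Hypothetical null guard around the routine call
(row1.lat == null || row1.long == null)
    ? ""
    : routines.GPSConvert.ConvertCoords(row1.long, row1.lat)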


All Replies
Community Manager

Re: Slow down while processing 100M+ row file?

Can you show us a screenshot of your job? Your job description doesn't make any sense, I'm afraid.

Also, I have just run a job where I generated 100,000,000 rows of data and wrote them to a file. It was writing at 1.3 million rows a second and I was using just 4GB of RAM. I sense you are doing a little more than just reading and writing. A screenshot might help fill in the blanks.

Fifteen Stars TRF

Re: Slow down while processing 100M+ row file?

Also, having an idea of what happens in the tJava would be useful.

TRF
Seven Stars JGM

Re: Slow down while processing 100M+ row file?

It sounds like you're running out of memory. You can test this theory by increasing the Xmx setting by a few GB and seeing whether the slowdown occurs later in the process (maybe you get to ~6 million rows instead of 5 million).

For us to be able to give better advice, it would be very helpful if you could share a screenshot of your job and some detail of what you are doing in both the tMap and your tJava.
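
For example (the figures here are only illustrative, given the 32GB on the server), the Run tab JVM arguments could be raised to something like:

-Xms8192M -Xmx16384M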
Nine Stars

Re: Slow down while processing 100M+ row file?

In the tJava I am passing the Latitude and Longitude values to convert each point to its Military Grid Reference System value. Here is the code:

 

String myResult = "";
CoordinateConversion cs = new CoordinateConversion();

Double lat_  = Double.parseDouble(context.myLat);
Double long_ = Double.parseDouble(context.myLong);

//GET THE MGRS VALUE:
context.myResult = String.valueOf( cs.latLon2MGRUTM(lat_,long_));

//WRITE TO OUTPUT FILE:
row2.myLat    = context.myLat;
row2.myLong   = context.myLong;
row2.myResult = context.myResult;

Nine Stars

Re: Slow down while processing 100M+ row file?

There isn't much to the job: the Lat/Long values are loaded into context variables, then passed to the tJava for conversion to MGRS, and the MGRS value is then written to the results file:

 

[Screenshot attached: Talend_MGRS.png]

Nine Stars

Re: Slow down while processing 100M+ row file?

Instead of continuing to throw more memory at it, is there some way to clear the job's buffer/cache every 1 million rows processed?

Community Manager

Re: Slow down while processing 100M+ row file?

I am still a little confused by the layout. Why are you assigning context variables millions of times? Why are you iterating to a tJava? What is the tJava sending to the tFileOutputDelimited? By the way, the tJava is not really best suited to working with row connectors. Can you give a description of what you are trying to achieve? This does not look like it will be terribly efficient at all.

Nine Stars

Re: Slow down while processing 100M+ row file?

I have a file with over 100M unique Latitude, Longitude points.

I need to find out what the corresponding Military Grid Reference System (MGRS) designation is for each point.

 

For example, input:

LATITUDE    LONGITUDE
33.172      -97.069

Output:

MGRS           LATITUDE    LONGITUDE
14SPB800720    33.172      -97.069

 

I pull the latitude and longitude from each row of the file and pass the lat/long values to context variables so that I can use them in the tJavaRow when I call the function that gets the MGRS value.

Community Manager

Re: Slow down while processing 100M+ row file?

OK, that is not necessary and is probably causing horrendous memory and time issues. Here is the layout you will need....

 

Input File -----> tMap -------> Output File

 

The function can be used in a tMap against your column values while they are part of the row. If your function is several lines of code, add it to a Routine. If you are not sure how to do that, post your function here and I can help convert it for you.

 

If you convert it to use the above configuration, it will run significantly faster.


Nine Stars

Re: Slow down while processing 100M+ row file?

As you suggested, I created a routine to call the MGRS CoordinateConversion routine.

 

It runs fine for over 2K rows then dies when the input is:    
Latitude:    21.32889
Longitude: -158.12221

 

That doesn't make sense, because the point is within range and should return: 04QEJ9102958801

 

It died with this error:

 

Exception in component tMap_1 (mgrs)
java.lang.IllegalArgumentException: Legal ranges: latitude [-90,90], longitude [-180,180).
    at routines.CoordinateConversion.validate(CoordinateConversion.java:30)
    at routines.CoordinateConversion.access$1(CoordinateConversion.java:25)
    at routines.CoordinateConversion$LatLon2MGRUTM.convertLatLonToMGRUTM(CoordinateConversion.java:257)
    at routines.CoordinateConversion.latLon2MGRUTM(CoordinateConversion.java:39)
    at routines.MGRS_Convert.CoordinateConversion(MGRS_Convert.java:11)
    at demo.mgrs.mgrs.tFileInputDelimited_1Process(mgrs.java:1287)
    at demo.mgrs.mgrs.runJobInTOS(mgrs.java:2508)
    at demo.mgrs.mgrs.main(mgrs.java:2339)

Community Manager

Re: Slow down while processing 100M+ row file?

I suspect that you reversed the longitude and latitude, given the error message.
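
In other words, 21.32889 is a valid latitude on its own, but if the arguments are swapped then -158.12221 gets validated as a latitude, falls outside [-90,90], and triggers exactly the exception above. A quick sketch of the call order, assuming the ConvertCoords routine from the accepted answer and hypothetical column names:

// ConvertCoords(Double long_, Double lat_) expects longitude first, latitude second
routines.GPSConvert.ConvertCoords(row1.longitude, row1.latitude)    // OK
// routines.GPSConvert.ConvertCoords(row1.latitude, row1.longitude) // swapped: fails the range check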

Nine Stars

Re: Slow down while processing 100M+ row file?

You are good!

Community Manager

Re: Slow down while processing 100M+ row file?

I've worked with Long and Lat before. I made the same mistake countless times :-)
