Slow SQL Server updates

Seven Stars

Slow SQL Server updates

Is Talend making any effort to improve performance with MS SQL Server?

 

Having searched this community and old talendforge forums entries, it appears there are performance issues with Talend using SQL Server.  I found this article (Data loading from MS SQL Server DB to other MS SQL Server DB is very low compared to SSIS) which confirms my own experience in comparing Talend to Datastage (2005 Server Edition -- an old, no longer supported version).

 

I built a Talend job to mirror what the old Datastage job is doing -- syncing content between two tables: a source table on one SQL Server and a target table on another SQL Server.  The data is about 220k rows.  The Datastage job does this (with an MSOLEDB stage) in < 2 minutes at 2238 rows/sec.  Talend is poking along (with a tMSSqlOutput component) and barely reaches 30 rows/sec!  Really??  Almost 100x slower?  What the heck is it doing?

 

I tried changing the batch size, changing to a component connection, and changing the Commit Every setting; I also tried using SELECT TOP to limit rows (just to see the impact).  None of these made a significant difference in performance.  It still achieves < 30 rows/sec.


Accepted Solutions
Seven Stars

Re: Slow SQL Server updates

By adding "sendStringParametersAsUnicode=false" to the Advanced Settings of the update component, it now produces an sp_prepare using varchar(8000) instead of nvarchar(4000), and the performance difference was dramatic -- almost 4k rows/sec instead of a couple hundred.  The problem is that the clustered key index in SQL Server is varchar(), so it doesn't match the nvarchar() parameter value; the implicit conversion forces a table scan.
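For anyone who wants to see concretely what that property changes, here is a minimal JDBC sketch (host, database, and credentials are placeholders; in Talend the property simply goes into the component's Additional JDBC Parameters):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class UnicodeParamSketch {
    public static void main(String[] args) throws Exception {
        // Placeholder host/db/credentials. With sendStringParametersAsUnicode left
        // at its default (true), the driver declares every string parameter as
        // nvarchar(4000); comparing that against a varchar key column forces an
        // implicit conversion, which defeats the clustered index and causes a scan.
        String url = "jdbc:sqlserver://dbhost:1433;databaseName=mydb;"
                + "sendStringParametersAsUnicode=false";
        try (Connection conn = DriverManager.getConnection(url, "user", "pass");
             PreparedStatement ps = conn.prepareStatement(
                     "UPDATE customer SET crc = ? WHERE cmf = ?")) {
            ps.setLong(1, 12345L);
            ps.setString(2, "00000021"); // now sent as varchar, so the index seek works
            ps.executeUpdate();
        }
    }
}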

 

The engineer assigned to the defect I opened (TDI-40281) was able to confirm the performance issue.

 

On a side note:  It isn't clear to me why the component wouldn't use the size of the schema columns for the parameters, but that probably doesn't matter much.

 


All Replies
Sixteen Stars

Re: Slow SQL Server updates

All I can say is that your job or database is poorly configured if those numbers are true. I'm currently working with MS SQL Server and am able to get around 2500 rows per second without even looking into optimisation of my job.
Seven Stars

Re: Slow SQL Server updates

I appreciate you confirming you are able to see decent results.  That gives me hope that it is something misconfigured in the job.  It is clearly not the database as Datastage can achieve respectable rates similar to what you stated.

 

The input side of the job all performs very well.  The job uses a tMSSqlInput to select all rows in the source table, another tMSSqlInput to select all rows from the target table into a tHashOutput, and then compares the two streams in a tMap, sending the output to either an insert path or an update path depending upon whether the record exists in the target with the same CRC (calculated in a tAddCRCRow to detect changes to the columns).
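For clarity, this is roughly the routing decision the tMap makes for each source row -- a plain-Java sketch with made-up types, not the actual generated code:

import java.util.HashMap;
import java.util.Map;

public class CrcRouteSketch {
    // Hypothetical row shape: the business key plus a CRC over the data columns.
    record Row(String cmf, long crc) {}

    public static void main(String[] args) {
        // tHashOutput side: target rows keyed by the business key.
        Map<String, Long> targetCrcByKey = new HashMap<>();
        targetCrcByKey.put("00000021", 111L);

        // tMap side: route each incoming source row.
        Row source = new Row("00000021", 999L);
        Long existing = targetCrcByKey.get(source.cmf());
        if (existing == null) {
            System.out.println("insert path");      // key not present in target
        } else if (existing != source.crc()) {
            System.out.println("update path");      // key present but data changed
        } else {
            System.out.println("unchanged - skip"); // same CRC, nothing to do
        }
    }
}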

 

If I direct the insert and update paths to simple delimited files, the output is also performant, reaching 51k rows/sec.  (Of course, I realize a flat file is going to be far faster than a DB update.  I'm just using it to isolate where the issue lies.)

 

When I enable the tMSSqlOutput components on the insert and update paths, the performance drops to < 30 rows/sec.  That is why I've been playing around with the various settings for that component.

Seven Stars

Re: Slow SQL Server updates

I built a new job that does nothing but select rows from one table and insert them into a new table in the same DB on the same server.  I see the same results with this new, simpler job even though it is on a different DB and different server than the previous job was running against.

 

Based upon the displayed statistics, it cranks along fine at about 1k-2k rows/sec until it hits 10k rows and then just apparently hangs.  If I wait, it will eventually finish and I'll then see the 30 rows/sec rate.

 

Given the 10k limit, I tried changing the "Batch Size" and "Commit Every" values (which appear to be equivalent to the "Array Size" and "Transaction Size" in Datastage) to something other than 10k.  As expected, when I do this it then hangs at whatever the larger of those two values is.
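For anyone following along, this is how I understand "Batch Size" and "Commit Every" map onto JDBC calls -- a sketch, not Talend's actual code, with hypothetical bind columns:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

public class BatchCommitSketch {
    // Hypothetical row: just the values to bind into the statement.
    record Row(String name, String key) {}

    // "Batch Size" buffers rows with addBatch() until the batch is flushed with
    // executeBatch(); "Commit Every" controls how often the transaction commits.
    // Nothing hits the server until the first executeBatch(), which is why the
    // console counter races to the batch boundary and then appears to hang.
    static void write(Connection conn, PreparedStatement ps, List<Row> rows,
                      int batchSize, int commitEvery) throws SQLException {
        int inBatch = 0, sinceCommit = 0;
        for (Row r : rows) {
            ps.setString(1, r.name());
            ps.setString(2, r.key());
            ps.addBatch();
            if (++inBatch >= batchSize) { ps.executeBatch(); inBatch = 0; }
            if (++sinceCommit >= commitEvery) { conn.commit(); sinceCommit = 0; }
        }
        if (inBatch > 0) ps.executeBatch(); // flush the tail
        conn.commit();                      // final commit
    }
}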

 

Screen Shot 2018-03-21 at 1.34.54 PM.png

Sixteen Stars

Re: Slow SQL Server updates

Right, I have just set up a job based on a table I have with 37668 rows. First, I essentially used Talend to read everything into a tHash component and write it to a new version of the table (but with no indexes or primary key). I used Talend to "create table if does not exist" and it built the table and loaded all 37668 rows in 1.74 seconds. This was with standard "out of the palette" settings.

 

Screen Shot 2018-03-21 at 20.07.14.png

 

However, inserting into an empty table and updating are two different beasts, so I then changed the job to update every single one of the 37668 rows. The only change I made was to the "Action on data" option and I set the update key using the Advanced settings' "Use field options". I then ran the job again....and I managed to get 47.66 rows per second....

Screen Shot 2018-03-21 at 20.50.47.png

Pretty terrible, but I barely changed anything and the database table has no keys or indexes.

 

I next played around with the "Use batch size" and "Commit every" settings and got marginal gains, but nothing substantial.

 

Now remember I said I have no keys or indexes on the table. Well I decided to add a clustered index on the update key field. This is an entirely reasonable thing to do, in fact it could be argued that you should index your tables appropriately to how they are used. If you update by a combination of columns as a composite key, you should create an index based on that combination of columns. Anyway, I added the index and ran exactly the same bog standard update with default settings (other than the "field options" settings). This was the result....

Screen Shot 2018-03-21 at 21.05.07.png

19598.34 rows per second, up from 47.66, just by adding an index on the update key.
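For reference, the index is nothing exotic. Sketched via JDBC with made-up table and column names, it amounts to this:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class AddUpdateKeyIndex {
    public static void main(String[] args) throws Exception {
        // Placeholder connection details; the DDL is the interesting part. With a
        // clustered index on the update key, each UPDATE ... WHERE [id] = ? becomes
        // an index seek instead of a full table scan.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:jtds:sqlserver://dbhost:1433/mydb", "user", "pass");
             Statement st = conn.createStatement()) {
            st.executeUpdate("CREATE CLUSTERED INDEX ix_mytable_id ON mytable (id)");
        }
    }
}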

 

Now I understand you may not be able to add any new indexes on your table, but I assume you have some already. Are you able to set the field options of the db component to match any existing indexes? Can you rebuild the index before you run? Essentially you have to performance tune the Talend job like you would performance tune a SQL query (going back to ETL with just database links :-) ).

Seven Stars

Re: Slow SQL Server updates

UPDATE: I just realized that this simple job is doing a straight insert, so the update key won't have any effect. DOH!  I built the same simple job in Datastage for comparison and will do further testing.  I'll try the update key in the job doing the updates and see if that is an improvement.

 

Well, that is interesting.  (Note that this simple job is just doing an insert, not update.)  Refer to images, below:

  • The key in my target table is already a primary clustered key.
  • I have that column marked as the key in my Talend schemas.
  • I enabled "Field Options" and checked the box under "Update Key" beside that column.
  • I set "Commit Every" and "Batch Size" back to the default 10k.
  • Performance hasn't changed.  It reaches 10k fine, and then hangs.

Screen Shot 2018-03-21 at 2.22.10 PM.png

 

Screen Shot 2018-03-21 at 2.23.20 PM.png

 

Screen Shot 2018-03-21 at 2.26.22 PM.png

 

Screen Shot 2018-03-21 at 2.21.16 PM.png

 

Screen Shot 2018-03-21 at 2.20.31 PM.png

 

Seven Stars

Re: Slow SQL Server updates

Figured out one thing:  I had been using the Microsoft-provided JDBC driver.  Apparently, that was a bad idea.  In the simple insert job, I switched to the open-source jTDS driver and it now provides very respectable numbers.  However, making that switch in my other job (the one doing the update) has not had that same effect.  The performance drops dramatically once it hits the first commit/batch limit.  Dang...

Sixteen Stars

Re: Slow SQL Server updates

One more thing to check (I do this automatically and forgot to mention), did you untick the "updatable" field for the update key column in the field options? This can make a big difference since updating a key field (even with the same value) will incur an update performance hit since the index is technically having to be recalculated. Since it is a key value, it won't change....so you may as well untick the updatable field.

 

If that doesn't make a difference then I can only assume there is something funky going on with your database config. My database is not on the same machine as my Studio. In fact, it is in a different country. My Studio machine is a virtual machine with 8GB RAM (not ideal). However I am processing more data in a fraction of the time with what appear to be the same (or very similar) Talend settings. 

 


In your last post you say "Note that this simple job is just doing an insert, not update" and show the status of the job hung. But you also show the field options with an update key specified. You only want to use the update key option if you are updating....not that it will make much of a difference in an insert.

 

By the way, I don't believe your job is hanging at 10000 records. That is just a user interface "feature": it displays the results pre-emptively according to the "Batch Size" setting. Switch that off and you will see the count increment in smaller, more frequent steps. Alternatively, if you left your job running it would eventually get through the data (probably in a similar sort of time to if you left the job running with the "Batch Size" switched off).

 

Another slight difference between yours and mine is that you are using a primary key. When inserting a value into a primary key field the database essentially has to ensure that the value being inserted is unique, otherwise the constraint is broken. This will cause a little latency when inserting a large amount of data with existing PK data. Having said that, it shouldn't cause the issues you are seeing.

 

I really do not believe that Talend is doing anything weird here. Firstly, because I have never seen any issues on countless customer sites. Secondly, because Talend is simply using third-party JARs to provide the database functionality. These are the same JARs used by many other Java applications with database interactions. All Talend does is prepare insert/update/select statements which are sent to the DB using these JARs. Even if the statements were inefficient, most RDBMSs ultimately rewrite the statements internally to make them more efficient. Any latency caused by Talend is more likely to be caused by complex in-memory transformations requiring a lot of memory. This is the main cause of any performance tuning I need to carry out, no matter what database I am using.

Seven Stars

Re: Slow SQL Server updates

NOTE: I added the topic Performance impact of tMSSqlOutput Commit Every and Batch Size which shows the performance of inserts and how Commit and Batch affect the outcome.  I did this testing with regard to the experience I was describing in this thread.

 

First, thank you very much for all the info you have shared and provided.  It is really appreciated.  I've got years of Datastage experience and now we're moving to Talend.  Much of what I know still applies but the details are often different.

 


@rhall_2_0 wrote:

One more thing to check (I do this automatically and forgot to mention), did you untick the "updatable" field for the update key column in the field options?

 


Yes, I saw that list and made that same change at the time.  I unchecked everything except the primary key in the "Update key" column and checked everything except the primary key in the "Updatable" column.  I even unchecked everything in the "Insertable" column, just in case.

 


@rhall_2_0 wrote:

If that doesn't make a difference then I can only assume there is something funky going on with your database config. My database is not on the same machine as my Studio. In fact, it is in a different country. My Studio machine is a virtual machine with 8GB RAM (not ideal). However I am processing more data in a fraction of the time with what appear to be the same (or very similar) Talend settings.

 


For this "update" job, the source and target databases are remote from me, both in Chicago.  My Studio machine is a Macbook Pro 16G Ram, 2.8 GHz i7.  As noted in the article from the link at top, my inserts appear to be performing fine.

 


@rhall_2_0 wrote:

In your last post you say "Note that this simple job is just doing an insert, not update" and show the status of the job hung. But you also show the field options with an update key specified. You only want to use the update key option if you are updating....not that it will make much of a difference in an insert.

 


Yeah, I caught that just after I posted and then made an update to that post.  Oversight on my part, getting confused between the two jobs.

 


@rhall_2_0 wrote:

By the way, I don't believe your job is hanging at 10000 records. That is just a user interface "feature": it displays the results pre-emptively according to the "Batch Size" setting. Switch that off and you will see the count increment in smaller, more frequent steps. Alternatively, if you left your job running it would eventually get through the data (probably in a similar sort of time to if you left the job running with the "Batch Size" switched off).


I only meant "hanging" in the sense of the visual behavior I saw.  Of course, it is that the first "Batch" rows get processed very quickly (so the count goes quickly to that number), and then it hangs as it waits for that batch to be committed.   As noted in the other article, "Batch" clearly has an impact on performance.  Disabling it resulted in a huge drop in performance for the inserts.

 


@rhall_2_0 wrote:

Another slight difference between yours and mine is that you are using a primary key. When inserting a value into a primary key field the database essentially has to ensure that the value being inserted is unique, otherwise the constraint is broken. This will cause a little latency when inserting a large amount of data with existing PK data. Having said that, it shouldn't cause the issues you are seeing.

 


I think you are referring to the fact that one of the columns in the stream has "Key" checked?  Assuming I understand that correctly, I had wondered why setting that column as the key still required me to do something else to tell the update to use it.  Seems odd.  In DS, setting a column as the key automatically causes it to be used in the update statement.

 


@rhall_2_0 wrote:

I really do not believe that Talend is doing anything weird here. Firstly, because I have never seen any issues on countless customer sites. Secondly, because Talend is simply using third-party JARs to provide the database functionality. These are the same JARs used by many other Java applications with database interactions. All Talend does is prepare insert/update/select statements which are sent to the DB using these JARs. Even if the statements were inefficient, most RDBMSs ultimately rewrite the statements internally to make them more efficient. Any latency caused by Talend is more likely to be caused by complex in-memory transformations requiring a lot of memory. This is the main cause of any performance tuning I need to carry out, no matter what database I am using.


Your point about using 3rd party components is an important one and certainly validates the thought that Talend isn't really at fault here, wherever else the fault might lie.

 

I'm still stumped, though still investigating.  It doesn't make sense that this Talend job would perform so poorly compared to a Datastage job doing the same thing to the same tables in the same database.

 

I will report back when and if I find out anything conclusive on the update performance issue.

 

Again, thanks.

Seven Stars

Re: Slow SQL Server updates

I'm still seeing 31 rows/sec in Talend vs. 3898 rows/sec in Datastage -- over 131 minutes vs. 1 minute. 

 

In Datastage, I can look at the actual statement the stage is going to use:

UPDATE customer SET customer_name=?,ship_address=?,ship_postalcode=?,ship_city=?,ship_county=?,ship_state=?,ship_country=?,mktsegcode=?,crc=? WHERE cmf=?;

I'd really like to see if the SQL used by the Talend component actually includes the WHERE clause to employ the update key.

 

The docs indicate there is a global variable QUERY, but I've not been able to access it successfully.  Probably have a syntax issue...
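For the record, the pattern I've been attempting is the usual globalMap lookup in a tJava after the component -- though I'm not certain the QUERY variable is even populated for tMSSqlOutput:

// Inside a tJava wired after the output component (e.g. OnComponentOk).
// globalMap is provided by the Talend job context.
String q = (String) globalMap.get("tMSSqlOutput_2_QUERY"); // may be null if unsupported
System.out.println("Statement used by tMSSqlOutput_2: " + q);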

Sixteen Stars

Re: Slow SQL Server updates

I can help you here. On your workspace you will see a tab called "Code" in the bottom left corner. Click on that and you will see the generated code. It is a bit messy, but you can learn a lot from it. Now your update component will have a name something like ....

 

tMSSqlOutput_

...after the _ will be a number. You can see the name when you click on the component and then click on the Component tab. Let's assume your component is numbered 1. What you want to search for in the Code tab is code like this....

 

String update_tMSSqlOutput_1 =

This will take you to where your update query is built as a String. 

Seven Stars

Re: Slow SQL Server updates

I selected the component and then went to the "Code Viewer" window.  I didn't see any way to search there so I selected all the content and pasted it into notepad++.  I see references for "whetherReject_tMSSqlOutput_2", "pstmt_tMSSqlOutput_2", "updatedCount_tMSSqlOutput_2", and several others but no "update_tMSSqlOutput_2".  So I checked my "insert" path (tMSSqlOutput_1) and had the same result -- lots of references but no "insert_tMSSqlOutput_1".  (Yes, I was searching case-insensitive.)  I am clearly missing something...

 

I can see where it uses "pstmt_tMSSqlOutput_2" as an arg of type java.sql.PreparedStatement, and can see where it builds that variable, but it only shows the column values being added -- I assume as parameters to the statement.

 

Screen Shot 2018-03-22 at 3.11.29 PM.png

 

	
	/**
	 * [tMSSqlOutput_2 main ] start
	 */

	

	
	
	currentComponent="tMSSqlOutput_2";

	
    			if(log.isTraceEnabled()){
    				log.trace("row17 - " + (row17==null? "": row17.toLogString()));
    			}
    		



        whetherReject_tMSSqlOutput_2 = false;
                    if(row17.customer_name == null) {
pstmt_tMSSqlOutput_2.setNull(1, java.sql.Types.VARCHAR);
} else {pstmt_tMSSqlOutput_2.setString(1, row17.customer_name);
}

                    if(row17.ship_address == null) {
pstmt_tMSSqlOutput_2.setNull(2, java.sql.Types.VARCHAR);
} else {pstmt_tMSSqlOutput_2.setString(2, row17.ship_address);
}

                    if(row17.ship_postalcode == null) {
pstmt_tMSSqlOutput_2.setNull(3, java.sql.Types.VARCHAR);
} else {pstmt_tMSSqlOutput_2.setString(3, row17.ship_postalcode);
}

                    if(row17.ship_city == null) {
pstmt_tMSSqlOutput_2.setNull(4, java.sql.Types.VARCHAR);
} else {pstmt_tMSSqlOutput_2.setString(4, row17.ship_city);
}

                    if(row17.ship_county == null) {
pstmt_tMSSqlOutput_2.setNull(5, java.sql.Types.VARCHAR);
} else {pstmt_tMSSqlOutput_2.setString(5, row17.ship_county);
}

                    if(row17.ship_state == null) {
pstmt_tMSSqlOutput_2.setNull(6, java.sql.Types.VARCHAR);
} else {pstmt_tMSSqlOutput_2.setString(6, row17.ship_state);
}

                    if(row17.ship_country == null) {
pstmt_tMSSqlOutput_2.setNull(7, java.sql.Types.VARCHAR);
} else {pstmt_tMSSqlOutput_2.setString(7, row17.ship_country);
}

                    if(row17.mktsegcode == null) {
pstmt_tMSSqlOutput_2.setNull(8, java.sql.Types.VARCHAR);
} else {pstmt_tMSSqlOutput_2.setString(8, row17.mktsegcode);
}

                    if(row17.crc == null) {
pstmt_tMSSqlOutput_2.setNull(9, java.sql.Types.INTEGER);
} else {pstmt_tMSSqlOutput_2.setLong(9, row17.crc);
}


                    if(row17.cmf == null) {
pstmt_tMSSqlOutput_2.setNull(10 + count_tMSSqlOutput_2, java.sql.Types.VARCHAR);
} else {pstmt_tMSSqlOutput_2.setString(10 + count_tMSSqlOutput_2, row17.cmf);
}


    		pstmt_tMSSqlOutput_2.addBatch();
    		nb_line_tMSSqlOutput_2++;
    		
                if(log.isDebugEnabled())
            log.debug("tMSSqlOutput_2 - "  + ("Adding the record ")  + (nb_line_tMSSqlOutput_2)  + (" to the ")  + ("UPDATE")  + (" batch.") );
    		batchSizeCounter_tMSSqlOutput_2++;
    		
            	//////////batch execute by batch size///////
            	class LimitBytesHelper_tMSSqlOutput_2{
            		public int limitBytePart1(int counter,java.sql.PreparedStatement pstmt_tMSSqlOutput_2) throws Exception {
                try {
						
                if(log.isDebugEnabled())
            log.debug("tMSSqlOutput_2 - "  + ("Executing the ")  + ("UPDATE")  + (" batch.") );
						for(int countEach_tMSSqlOutput_2: pstmt_tMSSqlOutput_2.executeBatch()) {
							if(countEach_tMSSqlOutput_2 == -2 || countEach_tMSSqlOutput_2 == -3) {
								break;
							}
							counter += countEach_tMSSqlOutput_2;
						}
						
                if(log.isDebugEnabled())
            log.debug("tMSSqlOutput_2 - "  + ("The ")  + ("UPDATE")  + (" batch execution has succeeded.") );
                }catch (java.sql.BatchUpdateException e){
                	
                	int countSum_tMSSqlOutput_2 = 0;
					for(int countEach_tMSSqlOutput_2: e.getUpdateCounts()) {
						counter += (countEach_tMSSqlOutput_2 < 0 ? 0 : countEach_tMSSqlOutput_2);
					}
				
            	    	
            log.error("tMSSqlOutput_2 - "  + (e.getMessage()) );
                		System.err.println(e.getMessage());
                	
               			 }
    				return counter;
            	}
            	
            	public int limitBytePart2(int counter,java.sql.PreparedStatement pstmt_tMSSqlOutput_2) throws Exception {
                try {
                		
                if(log.isDebugEnabled())
            log.debug("tMSSqlOutput_2 - "  + ("Executing the ")  + ("UPDATE")  + (" batch.") );
						for(int countEach_tMSSqlOutput_2: pstmt_tMSSqlOutput_2.executeBatch()) {
							if(countEach_tMSSqlOutput_2 == -2 || countEach_tMSSqlOutput_2 == -3) {
								break;
							}
							counter += countEach_tMSSqlOutput_2;
						}
						
                if(log.isDebugEnabled())
            log.debug("tMSSqlOutput_2 - "  + ("The ")  + ("UPDATE")  + (" batch execution has succeeded.") );
                }catch (java.sql.BatchUpdateException e){
                	
                	
					for(int countEach_tMSSqlOutput_2: e.getUpdateCounts()) {
						counter += (countEach_tMSSqlOutput_2 < 0 ? 0 : countEach_tMSSqlOutput_2);
					}
					
            	    	
            log.error("tMSSqlOutput_2 - "  + (e.getMessage()) );
                        System.err.println(e.getMessage());
                	
                		}	
                	return counter;	
            	}
            }
    		if ((batchSize_tMSSqlOutput_2 > 0) && (batchSize_tMSSqlOutput_2 <= batchSizeCounter_tMSSqlOutput_2)) {
    		
    		            
            	    		updatedCount_tMSSqlOutput_2 = new LimitBytesHelper_tMSSqlOutput_2().limitBytePart1(updatedCount_tMSSqlOutput_2,pstmt_tMSSqlOutput_2);
            	    	
    			
			    batchSizeCounter_tMSSqlOutput_2 = 0;
			}
    		

    	////////////commit every////////////
    			
    		    commitCounter_tMSSqlOutput_2++;
                if(commitEvery_tMSSqlOutput_2 <= commitCounter_tMSSqlOutput_2) {
                if ((batchSize_tMSSqlOutput_2 > 0) && (batchSizeCounter_tMSSqlOutput_2 > 0)) {
    		            
            	    		updatedCount_tMSSqlOutput_2 = new LimitBytesHelper_tMSSqlOutput_2().limitBytePart1(updatedCount_tMSSqlOutput_2,pstmt_tMSSqlOutput_2);
            	    	
            	batchSizeCounter_tMSSqlOutput_2 = 0;
            	}
                	
                if(log.isDebugEnabled())
            log.debug("tMSSqlOutput_2 - "  + ("Connection starting to commit ")  + (commitCounter_tMSSqlOutput_2)  + (" record(s).") );
                	conn_tMSSqlOutput_2.commit();
            		
                if(log.isDebugEnabled())
            log.debug("tMSSqlOutput_2 - "  + ("Connection commit has succeeded.") );
                	commitCounter_tMSSqlOutput_2=0;
                }

 


	tos_count_tMSSqlOutput_2++;

/**
 * [tMSSqlOutput_2 main ] stop
 */

 

Sixteen Stars

Re: Slow SQL Server updates

You should be able to switch to the code screen and hit Ctrl+F to bring up a find box. You can't edit the code, but you can search it. The code you copied and pasted isn't all of it for the tMSSqlOutput. There is a lot more. Here is an example I quickly knocked up. This isn't configured for use, just dumped onto a job and configured to carry out an update. You will notice there is a section of code like this....

 

String update_tMSSqlOutput_1 = "UPDATE ["
						+ tableName_tMSSqlOutput_1
						+ "] SET [newColumn] = ?,[newColumn1] = ?,[newColumn2] = ?,[newColumn3] = ?,[newColumn4] = ? WHERE [newColumn3] = ? AND [newColumn4] = ?";

I've highlighted in bold what you should search for in your job. I assume you are using Talend 6.something? If so this code will be there if you have your component configured for an update.

 

public void tMSSqlOutput_1Process(
			final java.util.Map<String, Object> globalMap)
			throws TalendException {
		globalMap.put("tMSSqlOutput_1_SUBPROCESS_STATE", 0);

		final boolean execStat = this.execStat;

		String iterateId = "";

		String currentComponent = "";
		java.util.Map<String, Object> resourceMap = new java.util.HashMap<String, Object>();

		try {

			String currentMethodName = new java.lang.Exception()
					.getStackTrace()[0].getMethodName();
			boolean resumeIt = currentMethodName.equals(resumeEntryMethodName);
			if (resumeEntryMethodName == null || resumeIt || globalResumeTicket) {// start
																					// the
																					// resume
				globalResumeTicket = true;

				/**
				 * [tMSSqlOutput_1 begin ] start
				 */

				ok_Hash.put("tMSSqlOutput_1", false);
				start_Hash.put("tMSSqlOutput_1", System.currentTimeMillis());

				currentComponent = "tMSSqlOutput_1";

				int tos_count_tMSSqlOutput_1 = 0;

				if (log.isDebugEnabled())
					log.debug("tMSSqlOutput_1 - " + ("Start to work."));
				class BytesLimit65535_tMSSqlOutput_1 {
					public void limitLog4jByte() throws Exception {

						StringBuilder log4jParamters_tMSSqlOutput_1 = new StringBuilder();
						log4jParamters_tMSSqlOutput_1.append("Parameters:");
						log4jParamters_tMSSqlOutput_1
								.append("USE_EXISTING_CONNECTION" + " = "
										+ "false");
						log4jParamters_tMSSqlOutput_1.append(" | ");
						log4jParamters_tMSSqlOutput_1.append("DRIVER" + " = "
								+ "JTDS");
						log4jParamters_tMSSqlOutput_1.append(" | ");
						log4jParamters_tMSSqlOutput_1.append("HOST" + " = "
								+ "\"\"");
						log4jParamters_tMSSqlOutput_1.append(" | ");
						log4jParamters_tMSSqlOutput_1.append("PORT" + " = "
								+ "\"1433\"");
						log4jParamters_tMSSqlOutput_1.append(" | ");
						log4jParamters_tMSSqlOutput_1.append("DB_SCHEMA"
								+ " = " + "\"\"");
						log4jParamters_tMSSqlOutput_1.append(" | ");
						log4jParamters_tMSSqlOutput_1.append("DBNAME" + " = "
								+ "\"\"");
						log4jParamters_tMSSqlOutput_1.append(" | ");
						log4jParamters_tMSSqlOutput_1.append("USER" + " = "
								+ "\"\"");
						log4jParamters_tMSSqlOutput_1.append(" | ");
						log4jParamters_tMSSqlOutput_1.append("PASS"
								+ " = "
								+ String.valueOf("f4f7aba1746784ea").substring(
										0, 4) + "...");
						log4jParamters_tMSSqlOutput_1.append(" | ");
						log4jParamters_tMSSqlOutput_1.append("TABLE" + " = "
								+ "\"\"");
						log4jParamters_tMSSqlOutput_1.append(" | ");
						log4jParamters_tMSSqlOutput_1.append("TABLE_ACTION"
								+ " = " + "NONE");
						log4jParamters_tMSSqlOutput_1.append(" | ");
						log4jParamters_tMSSqlOutput_1.append("IDENTITY_INSERT"
								+ " = " + "false");
						log4jParamters_tMSSqlOutput_1.append(" | ");
						log4jParamters_tMSSqlOutput_1.append("DATA_ACTION"
								+ " = " + "UPDATE");
						log4jParamters_tMSSqlOutput_1.append(" | ");
						log4jParamters_tMSSqlOutput_1
								.append("SPECIFY_DATASOURCE_ALIAS" + " = "
										+ "false");
						log4jParamters_tMSSqlOutput_1.append(" | ");
						log4jParamters_tMSSqlOutput_1.append("DIE_ON_ERROR"
								+ " = " + "false");
						log4jParamters_tMSSqlOutput_1.append(" | ");
						log4jParamters_tMSSqlOutput_1.append("PROPERTIES"
								+ " = " + "\"\"");
						log4jParamters_tMSSqlOutput_1.append(" | ");
						log4jParamters_tMSSqlOutput_1.append("COMMIT_EVERY"
								+ " = " + "10000");
						log4jParamters_tMSSqlOutput_1.append(" | ");
						log4jParamters_tMSSqlOutput_1.append("ADD_COLS" + " = "
								+ "[]");
						log4jParamters_tMSSqlOutput_1.append(" | ");
						log4jParamters_tMSSqlOutput_1
								.append("USE_FIELD_OPTIONS" + " = " + "true");
						log4jParamters_tMSSqlOutput_1.append(" | ");
						log4jParamters_tMSSqlOutput_1.append("FIELD_OPTIONS"
								+ " = " + "[{UPDATE_KEY=" + ("false")
								+ ", DELETE_KEY=" + ("false") + ", UPDATABLE="
								+ ("true") + ", INSERTABLE=" + ("true")
								+ ", SCHEMA_COLUMN=" + ("newColumn")
								+ "}, {UPDATE_KEY=" + ("false")
								+ ", DELETE_KEY=" + ("false") + ", UPDATABLE="
								+ ("true") + ", INSERTABLE=" + ("true")
								+ ", SCHEMA_COLUMN=" + ("newColumn1")
								+ "}, {UPDATE_KEY=" + ("false")
								+ ", DELETE_KEY=" + ("false") + ", UPDATABLE="
								+ ("true") + ", INSERTABLE=" + ("true")
								+ ", SCHEMA_COLUMN=" + ("newColumn2")
								+ "}, {UPDATE_KEY=" + ("true")
								+ ", DELETE_KEY=" + ("false") + ", UPDATABLE="
								+ ("true") + ", INSERTABLE=" + ("true")
								+ ", SCHEMA_COLUMN=" + ("newColumn3")
								+ "}, {UPDATE_KEY=" + ("true")
								+ ", DELETE_KEY=" + ("false") + ", UPDATABLE="
								+ ("true") + ", INSERTABLE=" + ("true")
								+ ", SCHEMA_COLUMN=" + ("newColumn4") + "}]");
						log4jParamters_tMSSqlOutput_1.append(" | ");
						log4jParamters_tMSSqlOutput_1
								.append("IGNORE_DATE_OUTOF_RANGE" + " = "
										+ "false");
						log4jParamters_tMSSqlOutput_1.append(" | ");
						log4jParamters_tMSSqlOutput_1
								.append("ENABLE_DEBUG_MODE" + " = " + "false");
						log4jParamters_tMSSqlOutput_1.append(" | ");
						log4jParamters_tMSSqlOutput_1
								.append("SUPPORT_NULL_WHERE" + " = " + "false");
						log4jParamters_tMSSqlOutput_1.append(" | ");
						log4jParamters_tMSSqlOutput_1.append("USE_BATCH_SIZE"
								+ " = " + "true");
						log4jParamters_tMSSqlOutput_1.append(" | ");
						log4jParamters_tMSSqlOutput_1.append("BATCH_SIZE"
								+ " = " + "10000");
						log4jParamters_tMSSqlOutput_1.append(" | ");
						if (log.isDebugEnabled())
							log.debug("tMSSqlOutput_1 - "
									+ (log4jParamters_tMSSqlOutput_1));
					}
				}

				new BytesLimit65535_tMSSqlOutput_1().limitLog4jByte();

				int nb_line_tMSSqlOutput_1 = 0;
				int nb_line_update_tMSSqlOutput_1 = 0;
				int nb_line_inserted_tMSSqlOutput_1 = 0;
				int nb_line_deleted_tMSSqlOutput_1 = 0;
				int nb_line_rejected_tMSSqlOutput_1 = 0;

				int deletedCount_tMSSqlOutput_1 = 0;
				int updatedCount_tMSSqlOutput_1 = 0;
				int insertedCount_tMSSqlOutput_1 = 0;
				int rejectedCount_tMSSqlOutput_1 = 0;
				String dbschema_tMSSqlOutput_1 = null;
				String tableName_tMSSqlOutput_1 = null;
				boolean whetherReject_tMSSqlOutput_1 = false;

				java.util.Calendar calendar_tMSSqlOutput_1 = java.util.Calendar
						.getInstance();
				long year1_tMSSqlOutput_1 = TalendDate.parseDate("yyyy-MM-dd",
						"0001-01-01").getTime();
				long year2_tMSSqlOutput_1 = TalendDate.parseDate("yyyy-MM-dd",
						"1753-01-01").getTime();
				long year10000_tMSSqlOutput_1 = TalendDate.parseDate(
						"yyyy-MM-dd HH:mm:ss", "9999-12-31 24:00:00").getTime();
				long date_tMSSqlOutput_1;

				java.util.Calendar calendar_datetimeoffset_tMSSqlOutput_1 = java.util.Calendar
						.getInstance(java.util.TimeZone.getTimeZone("UTC"));

				int updateKeyCount_tMSSqlOutput_1 = 2;
				if (updateKeyCount_tMSSqlOutput_1 < 1) {
					throw new RuntimeException(
							"For update, Schema must have a key");
				}

				java.sql.Connection conn_tMSSqlOutput_1 = null;
				String dbUser_tMSSqlOutput_1 = null;
				dbschema_tMSSqlOutput_1 = "";
				String driverClass_tMSSqlOutput_1 = "net.sourceforge.jtds.jdbc.Driver";

				if (log.isDebugEnabled())
					log.debug("tMSSqlOutput_1 - " + ("Driver ClassName: ")
							+ (driverClass_tMSSqlOutput_1) + ("."));
				java.lang.Class.forName(driverClass_tMSSqlOutput_1);
				String port_tMSSqlOutput_1 = "1433";
				String dbname_tMSSqlOutput_1 = "";
				String url_tMSSqlOutput_1 = "jdbc:jtds:sqlserver://" + "";
				if (!"".equals(port_tMSSqlOutput_1)) {
					url_tMSSqlOutput_1 += ":" + "1433";
				}
				if (!"".equals(dbname_tMSSqlOutput_1)) {
					url_tMSSqlOutput_1 += "//" + "";

				}
				url_tMSSqlOutput_1 += ";appName=" + projectName + ";" + "";
				dbUser_tMSSqlOutput_1 = "";

				final String decryptedPassword_tMSSqlOutput_1 = routines.system.PasswordEncryptUtil
						.decryptPassword("f4f7aba1746784ea");

				String dbPwd_tMSSqlOutput_1 = decryptedPassword_tMSSqlOutput_1;
				if (log.isDebugEnabled())
					log.debug("tMSSqlOutput_1 - "
							+ ("Connection attempts to '")
							+ (url_tMSSqlOutput_1) + ("' with the username '")
							+ (dbUser_tMSSqlOutput_1) + ("'."));
				conn_tMSSqlOutput_1 = java.sql.DriverManager.getConnection(
						url_tMSSqlOutput_1, dbUser_tMSSqlOutput_1,
						dbPwd_tMSSqlOutput_1);
				if (log.isDebugEnabled())
					log.debug("tMSSqlOutput_1 - " + ("Connection to '")
							+ (url_tMSSqlOutput_1) + ("' has succeeded."));

				resourceMap.put("conn_tMSSqlOutput_1", conn_tMSSqlOutput_1);

				conn_tMSSqlOutput_1.setAutoCommit(false);
				int commitEvery_tMSSqlOutput_1 = 10000;
				int commitCounter_tMSSqlOutput_1 = 0;

				if (log.isDebugEnabled())
					log.debug("tMSSqlOutput_1 - "
							+ ("Connection is set auto commit to '")
							+ (conn_tMSSqlOutput_1.getAutoCommit()) + ("'."));
				int batchSize_tMSSqlOutput_1 = 10000;
				int batchSizeCounter_tMSSqlOutput_1 = 0;

				if (dbschema_tMSSqlOutput_1 == null
						|| dbschema_tMSSqlOutput_1.trim().length() == 0) {
					tableName_tMSSqlOutput_1 = "";
				} else {
					tableName_tMSSqlOutput_1 = dbschema_tMSSqlOutput_1 + "].["
							+ "";
				}
				int count_tMSSqlOutput_1 = 0;

				String update_tMSSqlOutput_1 = "UPDATE ["
						+ tableName_tMSSqlOutput_1
						+ "] SET [newColumn] = ?,[newColumn1] = ?,[newColumn2] = ?,[newColumn3] = ?,[newColumn4] = ? WHERE [newColumn3] = ? AND [newColumn4] = ?";
				java.sql.PreparedStatement pstmt_tMSSqlOutput_1 = conn_tMSSqlOutput_1
						.prepareStatement(update_tMSSqlOutput_1);

				/**
				 * [tMSSqlOutput_1 begin ] stop
				 */

				/**
				 * [tMSSqlOutput_1 main ] start
				 */

				currentComponent = "tMSSqlOutput_1";

				tos_count_tMSSqlOutput_1++;

				/**
				 * [tMSSqlOutput_1 main ] stop
				 */

				/**
				 * [tMSSqlOutput_1 end ] start
				 */

				currentComponent = "tMSSqlOutput_1";

				try {
					int countSum_tMSSqlOutput_1 = 0;
					if (pstmt_tMSSqlOutput_1 != null
							&& batchSizeCounter_tMSSqlOutput_1 > 0) {

						if (log.isDebugEnabled())
							log.debug("tMSSqlOutput_1 - " + ("Executing the ")
									+ ("UPDATE") + (" batch."));
						for (int countEach_tMSSqlOutput_1 : pstmt_tMSSqlOutput_1
								.executeBatch()) {
							if (countEach_tMSSqlOutput_1 == -2
									|| countEach_tMSSqlOutput_1 == -3) {
								break;
							}
							countSum_tMSSqlOutput_1 += countEach_tMSSqlOutput_1;
						}

						if (log.isDebugEnabled())
							log.debug("tMSSqlOutput_1 - " + ("The ")
									+ ("UPDATE")
									+ (" batch execution has succeeded."));
					}

					updatedCount_tMSSqlOutput_1 += countSum_tMSSqlOutput_1;

				} catch (java.sql.BatchUpdateException e) {

					int countSum_tMSSqlOutput_1 = 0;
					for (int countEach_tMSSqlOutput_1 : e.getUpdateCounts()) {
						countSum_tMSSqlOutput_1 += (countEach_tMSSqlOutput_1 < 0 ? 0
								: countEach_tMSSqlOutput_1);
					}

					updatedCount_tMSSqlOutput_1 += countSum_tMSSqlOutput_1;

					log.error("tMSSqlOutput_1 - " + (e.getMessage()));
					System.err.println(e.getMessage());

				}
				if (pstmt_tMSSqlOutput_1 != null) {

					pstmt_tMSSqlOutput_1.close();

				}

				if (log.isDebugEnabled())
					log.debug("tMSSqlOutput_1 - "
							+ ("Connection starting to commit ")
							+ (commitCounter_tMSSqlOutput_1) + (" record(s)."));
				conn_tMSSqlOutput_1.commit();

				if (log.isDebugEnabled())
					log.debug("tMSSqlOutput_1 - "
							+ ("Connection commit has succeeded."));
				if (log.isDebugEnabled())
					log.debug("tMSSqlOutput_1 - "
							+ ("Closing the connection to the database."));
				conn_tMSSqlOutput_1.close();
				if (log.isDebugEnabled())
					log.debug("tMSSqlOutput_1 - "
							+ ("Connection to the database has closed."));
				resourceMap.put("finish_tMSSqlOutput_1", true);

				nb_line_deleted_tMSSqlOutput_1 = nb_line_deleted_tMSSqlOutput_1
						+ deletedCount_tMSSqlOutput_1;
				nb_line_update_tMSSqlOutput_1 = nb_line_update_tMSSqlOutput_1
						+ updatedCount_tMSSqlOutput_1;
				nb_line_inserted_tMSSqlOutput_1 = nb_line_inserted_tMSSqlOutput_1
						+ insertedCount_tMSSqlOutput_1;
				nb_line_rejected_tMSSqlOutput_1 = nb_line_rejected_tMSSqlOutput_1
						+ rejectedCount_tMSSqlOutput_1;

				globalMap.put("tMSSqlOutput_1_NB_LINE", nb_line_tMSSqlOutput_1);
				globalMap.put("tMSSqlOutput_1_NB_LINE_UPDATED",
						nb_line_update_tMSSqlOutput_1);
				globalMap.put("tMSSqlOutput_1_NB_LINE_INSERTED",
						nb_line_inserted_tMSSqlOutput_1);
				globalMap.put("tMSSqlOutput_1_NB_LINE_DELETED",
						nb_line_deleted_tMSSqlOutput_1);
				globalMap.put("tMSSqlOutput_1_NB_LINE_REJECTED",
						nb_line_rejected_tMSSqlOutput_1);

				if (log.isDebugEnabled())
					log.debug("tMSSqlOutput_1 - " + ("Has ") + ("updated")
							+ (" ") + (nb_line_update_tMSSqlOutput_1)
							+ (" record(s)."));

				if (log.isDebugEnabled())
					log.debug("tMSSqlOutput_1 - " + ("Done."));

				ok_Hash.put("tMSSqlOutput_1", true);
				end_Hash.put("tMSSqlOutput_1", System.currentTimeMillis());

				/**
				 * [tMSSqlOutput_1 end ] stop
				 */
			}// end the resume

		} catch (java.lang.Exception e) {

			if (!(e instanceof TalendException)) {
				log.fatal(currentComponent + " " + e.getMessage(), e);
			}

			TalendException te = new TalendException(e, currentComponent,
					globalMap);

			throw te;
		} catch (java.lang.Error error) {

			runStat.stopThreadStat();

			throw error;
		} finally {

			try {

				/**
				 * [tMSSqlOutput_1 finally ] start
				 */

				currentComponent = "tMSSqlOutput_1";

				if (resourceMap.get("finish_tMSSqlOutput_1") == null) {
					if (resourceMap.get("conn_tMSSqlOutput_1") != null) {
						try {

							if (log.isDebugEnabled())
								log.debug("tMSSqlOutput_1 - "
										+ ("Closing the connection to the database."));

							java.sql.Connection ctn_tMSSqlOutput_1 = (java.sql.Connection) resourceMap
									.get("conn_tMSSqlOutput_1");

							ctn_tMSSqlOutput_1.close();

							if (log.isDebugEnabled())
								log.debug("tMSSqlOutput_1 - "
										+ ("Connection to the database has closed."));
						} catch (java.sql.SQLException sqlEx_tMSSqlOutput_1) {
							String errorMessage_tMSSqlOutput_1 = "failed to close the connection in tMSSqlOutput_1 :"
									+ sqlEx_tMSSqlOutput_1.getMessage();

							log.error("tMSSqlOutput_1 - "
									+ (errorMessage_tMSSqlOutput_1));
							System.err.println(errorMessage_tMSSqlOutput_1);
						}
					}
				}

				/**
				 * [tMSSqlOutput_1 finally ] stop
				 */
			} catch (java.lang.Exception e) {
				// ignore
			} catch (java.lang.Error error) {
				// ignore
			}
			resourceMap = null;
		}

		globalMap.put("tMSSqlOutput_1_SUBPROCESS_STATE", 1);
	}
Seven Stars

Re: Slow SQL Server updates


@rhall_2_0 wrote:

You should be able to switch to the code screen and hit ctrl+F to bring up a find box.


Must be a difference between Windows and Mac.  On Mac, it would be command-F to bring up search -- which works in most places but nothing happens in Talend.  I tried control-F, also, but no response.

 


@rhall_2_0 wrote:

The code you copied and pasted isn't all of it for the tMSSqlOutput.


The code I pasted is all that appears in the Code Viewer window when I have the tMSSqlOutput_2 component selected.  You can see the start and stop comments from the top and bottom of what displays in the viewer -- pasted again, below.  Maybe another difference between Windows and Mac?  Hmmm...  I'm going to use SQL Server Profiler to try to look at the SQL coming in.  But I'd really like to know why I don't see what you see.

 

/**
* [tMSSqlOutput_2 main ] start
*/

...
/**
* [tMSSqlOutput_2 main ] stop
*/

 

Sixteen Stars

Re: Slow SQL Server updates

I was using a Mac at the time but was assuming you were using Windows when I suggested that.

 

I've just realised what you are doing wrong. I was not talking about the Code Viewer (it's useless in my eyes). I was talking about the "Code" tab at the bottom left of the design window....

 

CodeTab.png

Click that and you will see the whole code for the job. It is also useful when looking for compilation errors or trying to identify errors in error stacks. In fact, you should be spending a lot of time looking at this to quickly debug issues.

 

 

Seven Stars

Re: Slow SQL Server updates

Ah, yes, that is more helpful. :)  I agree about the "Code Viewer" window -- it seems pointless given the ability to view the "Code" tab of the designer.

 

Clearly the key column is being used in the WHERE clause -- it appears in the Talend statement:

String update_tMSSqlOutput_2 = "UPDATE ["
	+ tableName_tMSSqlOutput_2
	+ "] SET [customer_name] = ?,[ship_address] = ?,[ship_postalcode] = ?,[ship_city] = ?,[ship_county] = ?,[ship_state] = ?,[ship_country] = ?,[mktsegcode] = ?,[crc] = ? WHERE [cmf] = ?";

And in the resulting line shown in SQL Profiler:

exec sp_prepare @p1 output,N'@P0 nvarchar(4000),@P1 nvarchar(4000),@P2 nvarchar(4000),@P3 nvarchar(4000),@P4 nvarchar(4000),@P5 nvarchar(4000),@P6 nvarchar(4000),@P7 nvarchar(4000),@P8 bigint,@P9 nvarchar(4000)',N'UPDATE [dbo].[customer] SET [customer_name] =  @P0 ,[ship_address] =  @P1 ,[ship_postalcode] =  @P2 ,[ship_city] =  @P3 ,[ship_county] =  @P4 ,[ship_state] =  @P5 ,[ship_country] =  @P6 ,[mktsegcode] =  @P7 ,[crc] =  @P8  WHERE [cmf] =  @P9 ',1

Which is followed by (for each row):

exec sp_execute 1,...

NOTE: It is interesting that the sp_prepare shows nvarchar(4000) for every string parameter despite each column being defined explicitly as varchar(n) with an appropriate length.  But this is probably a generic thing the driver does for all parameters, since Java strings are Unicode and the driver sends them as NVARCHAR.

 

I did find the article Watch out those prepared SQL statements.  It indicates that use of sp_prepare/sp_execute is typical of ODBC, OLEDB and, apparently, JDBC.  But it also points out a potential issue with sp_unprepare not always being triggered.  I'm not sure of the impact of that, but I wouldn't suspect it is related to what I am seeing.

 

I'm completely stumped at this point.  I have no explanation for the poor performance compared to Datastage.

 

Sixteen Stars

Re: Slow SQL Server updates

I wish I could offer more help, but without actually seeing the issue and being able to hack around, it is very hard. As my screenshots showed you, Talend is capable of much better performance than you are seeing for both inserts and updates. I'd actually be interested to try out Datastage to get some comparisons of my own (I came from a pure SQL with DB links ETL to Informatica background). If you have a Talend license, I would advise contacting Talend support and asking them for a WebEx session so that they can see exactly what you have and maybe have a bit of a look around.

Seven Stars

Re: Slow SQL Server updates

I appreciate all you have contributed to this conversation.  I'll end this thread for now until I (hopefully) come up with a resolution.  Meanwhile, I'll close with this information:

 

Using SQL Profiler I traced each of the Talend and Datastage jobs and saved the trace which I include below.  (I have included only a single sp_execute as they are all the same for each row except for the content being passed in.)

 

The only differences in their output was this:

  • Talend (using jTDS/JDBC):
    • Forces everything to unicode in the sp_prepare and defaults parameters to maximum size.
      • From researching this on the 'net, this seems to be a standard practice with JDBC due to Java being unicode by default.
      • I don't understand why it doesn't respect the column definitions and just maxes the parameter size.  Not sure that has an impact on the issue covered in this thread but it seems sloppy.
      • Note that Datastage (using OLEDB) does use the column definitions.
    • Does not issue a closing sp_unprepare or reset IMPLICIT_TRANSACTIONS.
      • Personally, I think this is not a good thing though I don't know that it has any impact on the issue I am covering in this thread.  Again, it just seems sloppy.
      • Note that Datastage does reset IMPLICIT_TRANSACTIONS to off and closes with sp_unprepare.

Talend

-- network protocol: TCP/IP
set quoted_identifier on
set arithabort off
set numeric_roundabort off
set ansi_warnings on
set ansi_padding on
set ansi_nulls on
set concat_null_yields_null on
set cursor_close_on_commit off
set implicit_transactions off
set language us_english
set dateformat mdy
set datefirst 7
set transaction isolation level read committed

SELECT @@MAX_PRECISION
SET TRANSACTION ISOLATION LEVEL READ COMMITTED
SET IMPLICIT_TRANSACTIONS OFF
SET QUOTED_IDENTIFIER ON
SET TEXTSIZE 2147483647

SET IMPLICIT_TRANSACTIONS ON

declare @p1 int
set @p1=1
exec sp_prepare @p1 output,N'@P0 nvarchar(4000),@P1 nvarchar(4000),@P2 nvarchar(4000),@P3 nvarchar(4000),@P4 nvarchar(4000),@P5 nvarchar(4000),@P6 nvarchar(4000),@P7 nvarchar(4000),@P8 bigint,@P9 nvarchar(4000)',N'UPDATE [dbo].[customer] SET [customer_name] =  @P0 ,[ship_address] =  @P1 ,[ship_postalcode] =  @P2 ,[ship_city] =  @P3 ,[ship_county] =  @P4 ,[ship_state] =  @P5 ,[ship_country] =  @P6 ,[mktsegcode] =  @P7 ,[crc] =  @P8  WHERE [cmf] =  @P9 ',1
select @p1

exec sp_execute 1,N'...

IF @@TRANCOUNT > 0 COMMIT TRAN

 

Datastage

-- network protocol: TCP/IP
set quoted_identifier on
set arithabort off
set numeric_roundabort off
set ansi_warnings on
set ansi_padding on
set ansi_nulls on
set concat_null_yields_null on
set cursor_close_on_commit off
set implicit_transactions off
set language us_english
set dateformat mdy
set datefirst 7
set transaction isolation level read committed

set implicit_transactions on 

declare @p1 int
set @p1=1
exec sp_prepare @p1 output,N'@P1 varchar(50),@P2 varchar(240),@P3 varchar(10),@P4 varchar(60),@P5 varchar(60),@P6 char(2),@P7 char(4),@P8 varchar(2),@P9 bigint,@P10 char(8)',N'UPDATE customer SET customer_name=@P1,ship_address=@P2,ship_postalcode=@P3,ship_city=@P4,ship_county=@P5,ship_state=@P6,ship_country=@P7,mktsegcode=@P8,crc=@P9 WHERE cmf=@P10',1
select @p1

exec sp_execute 1,'...

IF @@TRANCOUNT > 0 COMMIT TRAN

set implicit_transactions off 

exec sp_unprepare 1

 

 

 

 

Sixteen Stars

Re: Slow SQL Server updates

Not wishing to steal the last word, but I just had a thought which might help with your investigation. At the moment you are using the components in a way which lets them handle commits for you. You can take control of this and manually create a connection, perform your updates/inserts/deletes and then commit or rollback at the end. I mainly work in this way if I have to ensure a successful atomic transaction or rollback. But it *might* help with your problem. Take a look at this page which shows how to do this.....

 

https://help.talend.com/reader/iYcvdknuprDzYycT3WRU8w/tR5sAkr~80fPFhrYZNlNWQ

Seven Stars

Re: Slow SQL Server updates

I originally had this job using a shared connection but had moved back from that to connections in the components in an effort to simplify the job and determine why it was slow.  I'll keep in mind the option you have suggested here.

Seven Stars

Re: Slow SQL Server updates

I started from scratch and built a completely new job and new tables in a new database -- see image at bottom. I am seeing essentially the same results.  Most importantly, I still see the same issue that it is clearly something to do with pushing the data to the database as, once it hits the first "commit" limit (when it begins to send data to the database), the job hangs visually and I can watch the row count in SQL Profiler, just crawling along...

 

Interesting note #1:  The open-source jTDS JDBC driver provides marginally better performance (100 rows/sec) than the Microsoft JDBC driver (30 rows/sec), though still nothing like an acceptable level.

 

Interesting note #2:  From reading the docs, I understand that I should be able to use a single tMSSqlConnection with multiple tMSSqlInput or tMSSqlOutput components.  So, in this job, I initially tried that.  The first run was with zero rows in the target table: the job ran and completed quickly, indicating that all 20k rows had been inserted.  Yet nothing appeared in the target table; it was empty.  After playing around with this, I then added a second tMSSqlConnection (thinking I must have misunderstood the docs) and set the tMSSqlOutput components to use it.  I ran the test again: it indicated it had inserted 20k rows, but the target table was still empty.  (NOTE: Using SQL Profiler during this, I confirmed that nothing was being sent to the database during the tests.)  So, I changed the tMSSqlOutput components to NOT use a shared connection but to use their own connections.  This time, when I ran the test, it showed 20k rows inserted, and those rows were actually in the table.  Similarly, though slow, the update worked as well only when configured to not use an existing connection.   However, the tMSSqlConnection component works fine with multiple tMSSqlInput components (there are two in this job).  It is just the tMSSqlOutput components which will not work with it.

 

Should anyone care to try and replicate this, I've also attached a zip of the exported job (sans context) and a SQL script that will facilitate building the tables and managing the data between tests.  The general test is to:

  1. Create the tables and populate the source table
  2. Run the job -- confirm that all 20k rows take the insert path and that the target table is populated.
  3. Run the job again -- confirm that no inserts or updates occur.
  4. Run the section of the script to reset the crc values in the target table.
  5. Run the job again -- confirm that all 20k rows take the update path and that the target table crc values are all non-zero.

 

 

Screen Shot 2018-03-26 at 11.42.15 AM.png

 

 

 

Seven Stars

Re: Slow SQL Server updates

@rhall_2_0 wrote:

One more thing to check (I do this automatically and forgot to mention), did you untick the "updatable" field for the update key column in the field options? This can make a big difference since updating a key field (even with the same value) will incur an update performance hit since the index is technically having to be recalculated. Since it is a key value, it won't change....so you may as well untick the updatable field.

 

In fact, it appears the tMSSqlOutput component is, indeed, using the key(s) defined in the schema as the column(s) in the WHERE clause, and the Field Options have no effect.

 

If I uncheck the key column in the schema and attempt to run, it replies with this:

 

[FATAL]: workspace.testupdate_0_1.TestUpdate - tMSSqlOutput_2 For update, Schema must have a key

 

If I then check the key column and add a second key column, when I run the job, I get this in the sp_prepare -- note that both columns are listed:

 

declare @p1 int set @p1=1 exec sp_prepare @p1 output,N'@P0 nvarchar(4000),@P1 nvarchar(4000),@P2 bigint,@P3 nvarchar(4000),@P4 nvarchar(4000)',N'UPDATE [dbo].[test_target] SET [strval2] = @P0 ,[strval3] = @P1 ,[crc] = @P2 WHERE [keyval] = @P3 AND [strval1] = @P4 ',1 select @p1

 

I then unchecked the second column from above (leaving only the first column which is the real key) and played around with the Field Options.  It has no effect on the sp_prepare statement:

  

Field Options checked, Updatable checked:

declare @p1 int set @p1=1 exec sp_prepare @p1 output,N'@P0 nvarchar(4000),@P1 nvarchar(4000),@P2 nvarchar(4000),@P3 bigint,@P4 nvarchar(4000)',N'UPDATE [dbo].[test_target] SET [strval1] = @P0 ,[strval2] = @P1 ,[strval3] = @P2 ,[crc] = @P3 WHERE [keyval] = @P4 ',1 select @p1

 

Field Options unchecked (leaving Updatable checked):

declare @p1 int set @p1=1 exec sp_prepare @p1 output,N'@P0 nvarchar(4000),@P1 nvarchar(4000),@P2 nvarchar(4000),@P3 bigint,@P4 nvarchar(4000)',N'UPDATE [dbo].[test_target] SET [strval1] = @P0 ,[strval2] = @P1 ,[strval3] = @P2 ,[crc] = @P3 WHERE [keyval] = @P4 ',1 select @p1

 

Field Options unchecked (leaving Updatable unchecked):

declare @p1 int set @p1=1 exec sp_prepare @p1 output,N'@P0 nvarchar(4000),@P1 nvarchar(4000),@P2 nvarchar(4000),@P3 bigint,@P4 nvarchar(4000)',N'UPDATE [dbo].[test_target] SET [strval1] = @P0 ,[strval2] = @P1 ,[strval3] = @P2 ,[crc] = @P3 WHERE [keyval] = @P4 ',1 select @p1

 

 

Sixteen Stars

Re: Slow SQL Server updates

I'm afraid your last post has some clear errors in it. I shall explain. The code (SQL) that Talend fires is not hardcoded and hidden; it changes with the component configuration and can be seen to change if you look at the generated code. There is absolutely no way that the same query is being fired at the database with different field option configurations. The field options DO make a difference....unless we are going to believe that the generated code is just for show and Talend does something mysterious behind the scenes. The code below shows the changes with different settings. I replicated the sort of changes you made and am showing the code that is generated....

 

Field options checked, no update field checked, updatable field catalogue_id checked (amongst others)...

String update_tMSSqlOutput_1 = "UPDATE ["
+ tableName_tMSSqlOutput_1
+ "] SET [catalogue_id] = ?,[datasource] = ?,[start_time] = ?,[end_time] = ?,[status_id] = ?,[root_talend_process_id] = ?,[rerun] = ?,[batch_repaired] = ?,[comment] = ? WHERE ";

You will see there is an empty WHERE CLAUSE which is what you noticed. This will not run due to the empty WHERE CLAUSE.

Field options checked, update field id checked, updatable field catalogue_id checked (amongst others)....

String update_tMSSqlOutput_1 = "UPDATE ["
+ tableName_tMSSqlOutput_1
+ "] SET [catalogue_id] = ?,[datasource] = ?,[start_time] = ?,[end_time] = ?,[status_id] = ?,[root_talend_process_id] = ?,[rerun] = ?,[batch_repaired] = ?,[comment] = ? WHERE [id] = ?";
				

You will notice that the WHERE CLAUSE now has the id field specified.

Field options checked, update field id checked and updatable field catalogue_id unchecked....

String update_tMSSqlOutput_1 = "UPDATE ["
+ tableName_tMSSqlOutput_1
+ "] SET [datasource] = ?,[start_time] = ?,[end_time] = ?,[status_id] = ?,[root_talend_process_id] = ?,[rerun] = ?,[batch_repaired] = ?,[comment] = ? WHERE [id] = ?";
				

In the code above you will notice that the catalogue_id field is removed from the list of fields to be updated. That is what the updatable field does: it specifies whether the field can be updated. It has nothing to do with the update key. I mentioned it because there is no point updating your key field, as the value will not change. If your key is id and your data row contains 21 for the key, the row to update will have a key of 21, and if you update the key field as well you are updating 21 with 21. That was the point I was making.

Field options unchecked, but everything else left the same as above...

String update_tMSSqlOutput_1 = "UPDATE ["
+ tableName_tMSSqlOutput_1
+ "] SET [id] = ?,[catalogue_id] = ?,[datasource] = ?,[start_time] = ?,[end_time] = ?,[status_id] = ?,[root_talend_process_id] = ?,[rerun] = ?,[batch_repaired] = ?,[comment] = ? WHERE ";

Since we have switched off the Field Options, the settings are now ignored. We are back to having no WHERE CLAUSE again. This will not run. You will get an error similar to the following....

Exception in component tMSSqlOutput_1
java.lang.RuntimeException: For update, Schema must have a key
	at allport_dq.testjob_0_1.TestJob.tHashInput_1Process(TestJob.java:1769)
	at allport_dq.testjob_0_1.TestJob.tMSSqlInput_1Process(TestJob.java:3073)
	at allport_dq.testjob_0_1.TestJob.runJobInTOS(TestJob.java:3425)
	at allport_dq.testjob_0_1.TestJob.main(TestJob.java:3166)
[FATAL]: allport_dq.testjob_0_1.TestJob - tMSSqlOutput_1 For update, Schema must have a key

If you have Field Options unchecked and are trying to carry out an update, there is no way your job would have run. So the SQL Profiler data you believe shows what was fired at the database with Field Options switched off cannot actually be showing that. It might be showing a cached query, or simply re-showing the last successful run, but it cannot be showing a query that was never fired at the database.

 

I'm sensing either aggravation at being forced to use a tool other than Datastage, or some eagerness to "prove" that Datastage is better. The problem is that you cannot do that until you know how to use Talend. You clearly do not know Talend and are not trying to learn it....just trying to find fault. The problem with doing that is that you look very silly when you make a schoolboy error because you have omitted to RTFM. An example of this can be seen below....

"Interesting note #2:  From reading the docs, I understand that I should be able to use a single tMSSqlConnection with multiple tMSSqlInput or tMSSqlOutput components.  So, in this job, I initially tried that.  The first run was with zero rows in the target table: the job ran, and completed quickly, indicating that all 20k rows had been inserted.  Yet nothing appeared in the target table; it was empty.  After playing around with this, I then added a second tMSSqlConnection (thinking I must have misunderstood the docs) and set the tMSSqlOutput components to use it.  I ran the test again:  It indicated it had inserted 20k rows but the target table was still empty.  (NOTE: Using SQL Profiler during this, I confirmed that nothing was being sent to the database during the tests.)  So, I changed the tMSSqlOutput components to NOT use a connection but to use their a connection from themselves.  This time, when I ran the test, it showed 20k rows inserted, and those rows were actually in the table.  Similarly, though slow, the update worked as well only when configured to not use an existing connection.   However, the tMSSqlConnection component works fine for multiple tMSSqlInput connections (there are two in this job).  It is just the tMSSqlOutput components which will not work with it."

 

Now someone who had read the manual would know that with the tMSSQLConnection component you need to either set Auto Commit on the Advanced Settings tab OR use a tMSSQLCommit component. Had you done this, your data would have been committed. Here is the section of the manual you missed (and I believe I pointed you to): https://help.talend.com/reader/iYcvdknuprDzYycT3WRU8w/tR5sAkr~80fPFhrYZNlNWQ
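To make the mechanics concrete, here is a minimal plain-JDBC sketch (hypothetical server, database and credentials; this is not the code Talend generates) of why the inserts appear to succeed yet never show up in the table:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class CommitSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical connection details -- substitute your own.
        Connection conn = DriverManager.getConnection(
                "jdbc:sqlserver://myserver:1433;databaseName=mydb", "user", "pass");

        // A shared tMSSqlConnection without Auto Commit behaves like this:
        // an open transaction that nothing ever commits.
        conn.setAutoCommit(false);

        PreparedStatement ps = conn.prepareStatement(
                "INSERT INTO test_target (keyval, strval1) VALUES (?, ?)");
        ps.setString(1, "k1");
        ps.setString(2, "v1");
        ps.executeUpdate(); // reported as "1 row inserted"...

        // ...but without this line (the tMSSqlCommit equivalent), the work
        // is rolled back (or simply lost) when the connection closes and
        // the table stays empty -- exactly the symptom described above.
        conn.commit();

        ps.close();
        conn.close();
    }
}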

 

 

I may have come across as a bit harsh here, but I'm afraid I am a little cheesed off that I tried to help you only for you to play passive-aggressive developer one-upmanship with me, while trying to pull apart a tool you clearly have little understanding of (and haven't really bothered to try to learn). Do you REALLY think that Talend would have reached the adoption levels it has if it could only update at 30 rows a second?!? I mean, seriously? I have shown you screenshots of updates running against a database in a different country from my Studio, orders of magnitude faster than the rate you claim to be stuck at. But you insist it is a flaw in Talend and have spent your time trying to prove that Talend is a smoke-and-mirrors application. Maybe you should stick with Datastage......

 

Seven Stars

Re: Slow SQL Server updates

@rhall_2_0: I intended no offense at all.  That you are offended saddens me.   I had no intent to make you look bad or whatever you felt.  I was only reporting what I experienced.  I said many times that I appreciated your help and I still do.  My apologies for whatever you mistakenly thought I intended.

 

My sole purpose has been, and continues to be, to understand why Talend performs so poorly compared to my own experience with Datastage.  I would be very happy for someone to identify what it is in the job (which I attached in the prior post) that is being done incorrectly, to account for the experience I am having.  The goal is to leave Datastage behind, but I'm having trouble doing that when I cannot justify a 100x difference in performance.

 

I came here seeking help.  That has not changed.  I will repeat my tests from the last post on the chance I did, indeed, make some errors -- and this time I will include the generated code for comparison, which I neglected to include before.

 

You needn't feel obligated to respond, but I hope that you will, as your prior information has been beneficial.

Seven Stars

Re: Slow SQL Server updates

My previous post turns out to have suffered from a misunderstanding of how the key checkbox and Field Options work, and from some copy/paste errors.  Below are the details of what I've found, but I'm still left trying to understand why updates are so slow compared to my experience in Datastage -- this after creating multiple jobs from scratch and testing against different databases on different servers (trying to isolate potential contributors).  I would still appreciate someone attempting to duplicate these tests using the job and script I attached.  I'd really like to understand what I may be doing wrong so I can finally move forward with Talend.

 

After confirming, again, that the key checkbox is used by an Update action to supply the key, I found this section in the documentation of tMSSqlOutput:

 

"It is necessary to specify at least one column as a primary key on which the Update and Delete operations are based. You can do that by clicking Edit Schema and selecting the check box(es) next to the column(s) you want to set as primary key(s). For an advanced use, click the Advanced settings view where you can simultaneously define primary keys for the Update and Delete operations. To do that: Select the Use field options check box and then in the Key in update column, select the check boxes next to the column names you want to use as a base for the Update operation. Do the same in the Key in delete column for the Delete operation."

 

I was able to confirm that the key checkboxes are used unless the Field Options checkbox is checked, in which case whatever is set in the Field Options columns is used instead.  So, while the key checkbox is sufficient for most cases, there may be times when checking the Field Options checkbox is necessary to achieve a greater level of control.

 

I was able to confirm this both in the generated Java code and in the sp_prepare statements (which suffered from copy/paste errors in my previous post).  Details are listed below.

 

Field Options -- one updatable key:

String update_tMSSqlOutput_2 = "UPDATE ["
+ tableName_tMSSqlOutput_2
+ "] SET [keyval] = ?,[strval1] = ?,[strval2] = ?,[strval3] = ?,[crc] = ? WHERE [keyval] = ?";

declare @p1 int
set @p1=1
exec sp_prepare @p1 output,N'@P0 nvarchar(4000),@P1 nvarchar(4000),@P2 nvarchar(4000),@P3 nvarchar(4000),@P4 bigint,@P5 nvarchar(4000)',N'UPDATE [dbo].[test_target] SET [keyval] = @P0 ,[strval1] = @P1 ,[strval2] = @P2 ,[strval3] = @P3 ,[crc] = @P4 WHERE [keyval] = @P5 ',1
select @p1

 

Field Options -- no updatable key:

String update_tMSSqlOutput_2 = "UPDATE ["
+ tableName_tMSSqlOutput_2
+ "] SET [keyval] = ?,[strval1] = ?,[strval2] = ?,[strval3] = ?,[crc] = ? WHERE ";

Job errors out -- no sp_prepare sent.

 

Field Options -- two updatable keys:

String update_tMSSqlOutput_2 = "UPDATE ["
+ tableName_tMSSqlOutput_2
+ "] SET [keyval] = ?,[strval1] = ?,[strval2] = ?,[strval3] = ?,[crc] = ? WHERE [keyval] = ? AND [strval1] = ?";

declare @p1 int
set @p1=1
exec sp_prepare @p1 output,N'@P0 nvarchar(4000),@P1 nvarchar(4000),@P2 nvarchar(4000),@P3 nvarchar(4000),@P4 bigint,@P5 nvarchar(4000),@P6 nvarchar(4000)',N'UPDATE [dbo].[test_target] SET [keyval] = @P0 ,[strval1] = @P1 ,[strval2] = @P2 ,[strval3] = @P3 ,[crc] = @P4 WHERE [keyval] = @P5 AND [strval1] = @P6 ',1
select @p1

 

Key checkbox -- one key:

String update_tMSSqlOutput_2 = "UPDATE ["
+ tableName_tMSSqlOutput_2
+ "] SET [strval1] = ?,[strval2] = ?,[strval3] = ?,[crc] = ? WHERE [keyval] = ?";

declare @p1 int
set @p1=1
exec sp_prepare @p1 output,N'@P0 nvarchar(4000),@P1 nvarchar(4000),@P2 nvarchar(4000),@P3 bigint,@P4 nvarchar(4000)',N'UPDATE [dbo].[test_target] SET [strval1] = @P0 ,[strval2] = @P1 ,[strval3] = @P2 ,[crc] = @P3 WHERE [keyval] = @P4 ',1
select @p1

 

Key checkbox -- no key:

String update_tMSSqlOutput_2 = "UPDATE ["
+ tableName_tMSSqlOutput_2
+ "] SET [keyval] = ?,[strval1] = ?,[strval2] = ?,[strval3] = ?,[crc] = ? WHERE ";

Job errors out -- no sp_prepare sent.

 

Key checkbox -- two keys:

String update_tMSSqlOutput_2 = "UPDATE ["
+ tableName_tMSSqlOutput_2
+ "] SET [strval2] = ?,[strval3] = ?,[crc] = ? WHERE [keyval] = ? AND [strval1] = ?";

declare @p1 int
set @p1=1
exec sp_prepare @p1 output,N'@P0 nvarchar(4000),@P1 nvarchar(4000),@P2 bigint,@P3 nvarchar(4000),@P4 nvarchar(4000)',N'UPDATE [dbo].[test_target] SET [strval2] = @P0 ,[strval3] = @P1 ,[crc] = @P2 WHERE [keyval] = @P3 AND [strval1] = @P4 ',1
select @p1

 

 

Sixteen Stars

Re: Slow SQL Server updates

So it seems I got the wrong impression from the last post I replied to. I guess it is very easy to give and get the wrong impression when communicating in just text, so I apologise for misunderstanding your motives.

 

Regarding the job you uploaded to test, I have had a look at it and tested it on my machine. I downloaded a new version of Talend 6.5.1 (since I am using 6.2.1 at the moment and your job was written in 6.4.1). I am running Microsoft SQL Server 2014 (SP1-GDR). Here is what I found.

 

1) The reason your data wasn't inserting or updating is that no commit was specified when using the tMSSQLConnection component. This can be set either by using a tMSSQLCommit component or by setting the Auto Commit option on the Advanced Settings of the tMSSQLConnection.

2) You only need one tMSSQLConnection for this. Adding a second does nothing for the performance in this case. However, I suspect you added it in an effort to solve the problem with inserts and updates not occurring.

3) When I sorted the Auto Commit on the tMSSQLConnection component, the update ran successfully at a rate of 1320.9 rows per second. This seemed a little slow, but apparently is faster than you were getting. How busy is your database, out of interest?

4) I then left everything in the job the same but changed the update DB component to manage its own connection. I set the connection details for that component and left everything else the same, so the shared connection was responsible for the reads and the update had its own connection. I managed to get a performance of 3545 rows per second. This was with no Field Options (so the schema keys were automatically used).

5) I then switched the Field Options on and ticked the update key (keyval), leaving everything else the same. The performance was worse doing this, at 2471 rows per second.

6) My suspicion was that updating the keyval was causing the issue (since it was left as updatable), so I unticked it and ran again. I got a row rate of 3563 rows per second.

7) Since the point of this update is just to update one field, I unticked all updatable fields except crc. So the only fields ticked were the key (keyval) and the field to update (crc). This increased the rate to 3926 rows per second (the generated statement for this configuration is sketched below).
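Based on the generated-code patterns quoted earlier in the thread, the update statement for that last configuration should collapse to something like this (a sketch of what I'd expect, not copied from the actual job):

// Only crc is updatable and keyval is the only update key, so the
// generated statement should reduce to a single SET and a keyed WHERE:
String update_tMSSqlOutput_2 = "UPDATE ["
+ tableName_tMSSqlOutput_2
+ "] SET [crc] = ? WHERE [keyval] = ?";

The shorter SET list means less work per row, which is consistent with the rate increase.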

Screen Shot 2018-03-28 at 01.32.31.png

The screenshot above shows the slightly changed layout of the job (I removed the extra connection components) and also shows the row rate I got on the last test.

Screen Shot 2018-03-28 at 01.32.54.png

The screenshot above shows the configuration of the update component's field options for the fastest performance. 

 

What sort of performance do you get on your system with the same configuration? Even if it is slower, do you see the same sort of performance improvements when you make the changes I have made?

 

 

Seven Stars

Re: Slow SQL Server updates


@rhall_2_0 wrote:

The reason your data wasn't inserting or updating is that no commit was specified when using the tMSSQLConnection component. This can be set either by using a tMSSQLCommit component or by setting the Auto Commit option on the Advanced Settings of the tMSSQLConnection.


Dang!  I actually knew this but had forgotten it in the midst of focusing on why the updates weren't performing well.  Sometimes you can't see the forest for the trees.

 


@rhall_2_0 wrote:

 

When I sorted the Auto Commit on the tMSSQLConnection component, the update ran successfully at a rate of 1320.9 rows per second. This seemed a little slow, but apparently is faster than you were getting. How busy is your database out of interest?

 


The database is completely idle except for my testing.  I am the only one on it.

 


@rhall_2_0 wrote:

 

I then left everything in the job the same but changed the update DB component to manage its own connection. I set the connection details for that component and left everything else the same, so the shared connection was responsible for the reads and the update had its own connection. I managed to get a performance of 3545 rows per second. This was with no Field Options (so the schema keys were automatically used).


Well, I'm not surprised.  You had indicated that you've had no issues with updates.  And, certainly, those rates are very reasonable.  There is clearly something that I am missing.

 


@rhall_2_0 wrote:

What sort of performance do you get on your system with the same configuration? Even if it is slower, do you see the same sort of performance improvements when you make the changes I have made?


The rates I'm seeing are so low that, whether using the key checkbox or the Field Options, I don't see enough difference to determine whether either is better.  Both are < 100 rows/sec.

 

Another associate has begun working with Talend, and I asked him to run this job.  He got numbers similar to mine, which would lead me to believe it has to be something about the database.  But he also changed the job to point at an Oracle database, and though performance was a little better, he still saw < 500 rows/sec.  And when I run other statements in SQL Server Management Studio, or run the Datastage job I originally started with, there are no obvious performance issues.  I just don't get it.

 

Thanks for your willingness to try to help me resolve this.

Sixteen Stars

Re: Slow SQL Server updates

With Oracle you will get very different tuning requirements. I've found that reads are often the bottleneck with Oracle. Using a cursor for reads can make a big difference....but you aren't really concerned about that right now.
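(For what it's worth, my understanding is that the cursor option on the input components corresponds to the JDBC fetch size -- how many rows the driver pulls per network round trip. A minimal sketch of the equivalent in plain JDBC, using the test_target table from earlier as a stand-in:)

import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;

public class CursorReadSketch {
    // Sketch only: a larger fetch size means fewer round trips per read,
    // which is often where the read bottleneck lies.
    static void readWithCursor(Connection conn) throws Exception {
        Statement stmt = conn.createStatement();
        stmt.setFetchSize(1000); // pull 1000 rows per network round trip
        ResultSet rs = stmt.executeQuery("SELECT keyval, crc FROM test_target");
        while (rs.next()) {
            // process each row...
        }
        rs.close();
        stmt.close();
    }
}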

 

One more thing I've thought of: have you tried compiling your job to a Jar (Build Job) and running it from the command line? You won't be able to see the rows per second as easily, but the run time will tell you whether it improves things. On some systems the Studio can be a bit slow. This can be because of virus software (scanning the Studio filesystem for changes), slow disk reads/writes (best to use an SSD), low memory (I insist upon at least 8GB, preferably 16GB) or non-i7 processors.

 

You can also try switching out the Jars for newer ones. I was using 6.5.1, so I may very well have newer MS DB Jars, although this will be fiddly.

 

If you are assessing Talend for a potential purchase, I would advise raising this with their presales team. I have to admit that I have never seen the sort of performance you describe without having been able to improve it with a bit of tinkering. I used to work for Talend in professional services, so I have seen A LOT of examples of most of the supported databases. If you get a presales guy/girl on site, they should be able to find your issue with a little time tinkering.

 

One more performance thing: your Studio will have the bog-standard JVM settings, I imagine, and you will probably want to improve those. Look for an ini file in your Studio folder; it will be named the same as the executable you use to start Studio. Change the contents of the ini file to something like this....

 

-vmargs
-Xms512m
-Xmx8192m
-Dfile.encoding=UTF-8
-Djava.net.preferIPv4Stack=true

You will see that the standard memory settings are pretty low.

 

 

Seven Stars

Re: Slow SQL Server updates

By adding "sendStringParametersAsUnicode=false" to the Advanced Settings of the update component, it is now producing a sp_prepare using varchar(8000) instead of nvarchar(4000) and the performance difference was dramatic -- almost 4k rows/sec instead of a couple hundred.  The problem is that the clustered key index in SQL Server is varchar() so it doesn't match the nvarchar() parameter value and forces a table scan.

 

The engineer assigned to the defect I opened (TDI-40281) was able to confirm the performance issue.

 

On a side note:  It isn't clear to me why the component wouldn't use the size of the schema columns for the parameters, but that probably doesn't matter much.

 

Four Stars

Re: Slow SQL Server updates

I have a MUCH SIMPLER job and I am having the same performance issue with tMSSqlOutput doing a straight insert into a new table -- it writes the first 10,000 rows fairly fast, then slows to a crawl. I initially saw this problem in a more complex job, so I built two very simple jobs to try to isolate the cause/bottleneck.

 

Simple Job Properties:

  • Talend Studio v6.5.1 on a local Windows 10 PC
  • SQL Server 2016 running on an AWS EC2 instance (TCP connection to the IP address with SQL authentication)
  • Table of account data: 69,000 rows and about 15 fields (address fields and a few string attributes)
  • SQL connection stored in the Repository (both SQL components below use this Repository connection)
  • Action 1: tMSSqlInput - read all fields, all rows from the above table, using "Microsoft JDBC"
  • Action 2: tMSSqlOutput - create a new table (drop if exists and create) and insert; no keys defined, default settings, commit every 10K, using "Microsoft JDBC"

Comparison:

  • The input read operation by itself takes 7 seconds to read 69,000 rows == good (10K rows/sec)
  • Trial 1: wire the tMSSqlInput data straight into tMSSqlOutput (or through a tMap) -- it writes 10,000 rows in a few seconds, then slows to a CRAWL and finishes in 45 minutes (avg 25 rows/sec)
  • Trial 2: change the "commit every" setting (to 1,000, etc.) -- it still writes only about 24~28 rows/sec
  • Trial 3: split the tMSSqlInput and tMSSqlOutput components into 2 subjobs -- the 1st subjob reads via tMSSqlInput and writes to a CSV file (tFileOutputDelimited), then triggers On Component Ok --> the 2nd subjob reads the CSV file (tFileInputDelimited) and writes to tMSSqlOutput -- this completes the read in 7 sec (10K rows/s = good) and the write/insert step takes only 1 minute (1000 rows/s = decent)

As you can see, I am doing nothing special with the data: read from one table, write to a new, different table, no keys, no transformation -- just read, then write the data as-is. SOMETHING IS NOT RIGHT with the tMSSqlOutput component; reading and writing directly to/from SQL Server should not take longer than writing to a CSV file in the middle and re-reading it.
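One way to rule the driver (and network) in or out would be a bare JDBC batch insert against the same instance, outside of Talend entirely. A sketch with hypothetical connection details and a stand-in table, mirroring the component's commit-every-10K default:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class RawInsertTest {
    public static void main(String[] args) throws Exception {
        // Hypothetical details -- point at the same AWS SQL Server instance.
        Connection conn = DriverManager.getConnection(
                "jdbc:sqlserver://myserver:1433;databaseName=mydb", "user", "pass");
        conn.setAutoCommit(false);

        // Hypothetical stand-in for the real account table and columns.
        PreparedStatement ps = conn.prepareStatement(
                "INSERT INTO account_copy (id, city) VALUES (?, ?)");
        long start = System.currentTimeMillis();
        for (int i = 1; i <= 69000; i++) {
            ps.setInt(1, i);
            ps.setString(2, "city" + i);
            ps.addBatch();
            if (i % 10000 == 0) { // mirror the component's commit interval
                ps.executeBatch();
                conn.commit();
            }
        }
        ps.executeBatch(); // flush the remainder
        conn.commit();
        System.out.println("Elapsed ms: " + (System.currentTimeMillis() - start));

        ps.close();
        conn.close();
    }
}

If raw JDBC sustains thousands of rows per second where the component crawls, the problem is in the component's configuration rather than the driver or the network.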