Hi, I'm using tOracleTableTransfer to load ~90 big tables and noticed different speeds for 2 tables with similar row counts; it turned out the one that took longer had a much bigger table size! I tried playing with the "Log interval in seconds", "Source select fetch size" and "Insert batch size" parameters and hoped for the best, but it still takes a very long time to load my tables (5 hours for about 90 tables with an average of 50K rows/table). Does anyone have more in-depth knowledge of how this excellent custom-made component works? I even divided my table list into two groups of relatively large and small tables and set "Source select fetch size" and "Insert batch size" to 50,000 for the small ones and 500,000 for the large ones, but it didn't help with the speed. Any other ideas on how to speed up the load and improve performance? Any help is appreciated!
I have now updated the tOracleTableTransfer component. Because Talend Exchange is extremely slow in approving updates, I have provided this component on my personal web space: http://jan-lolling.de/talend/components/tOracleTableTransfer.zip Could you please check this version? One word about large tables: the transfer can slow down as a table grows. This is an effect caused by the database itself. Check whether you have any indexes; they will be updated too and can cause serious performance issues while writing the data.
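Jan's point about indexes can be checked quickly from JDBC. Below is a hedged sketch (not part of the component) that queries Oracle's USER_INDEXES dictionary view to list the indexes that will be maintained on every insert into a given table:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class IndexCheck {

    // Query against Oracle's USER_INDEXES dictionary view; shows every
    // index that has to be updated for each row written to the table.
    static String buildIndexQuery() {
        return "SELECT index_name, index_type, status "
             + "FROM user_indexes WHERE table_name = ?";
    }

    // Lists indexes on the given table; pass any open Oracle connection.
    static void listIndexes(Connection conn, String tableName) throws Exception {
        try (PreparedStatement ps = conn.prepareStatement(buildIndexQuery())) {
            ps.setString(1, tableName.toUpperCase()); // Oracle stores names upper-case
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    System.out.println(rs.getString("index_name")
                        + " (" + rs.getString("index_type") + ", "
                        + rs.getString("status") + ")");
                }
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(buildIndexQuery());
    }
}
```

If the target table has several indexes besides the primary key, each batch insert pays the maintenance cost for all of them, which matches the slowdown described above.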
Thank you so much for the quick response! I'm running my job right now with the updated version of the component. However, I do have some questions about the component: 1. I now see the following logs being generated by the component; is there a way to suppress them for some jobs?
2. This component commits the data even if I don't put a commit at the end. So I can't have "all or none" behavior if I'm loading multiple tables within one job? 3. Is it correct to assume that the larger the table, the larger the fetch size should be for higher performance, and the smaller the table, the smaller the fetch size? 4. Does "Log interval in seconds" have any impact on performance? I've tried 0, 1 and 5, and 0 gave me better performance, but I'm not sure if there is any correlation! 5. I used the new version of this component and ran the same amount of data and the same number of tables: with the old version it took 10 minutes, with the new one 19 minutes. Is there something I'm not setting up correctly that explains this difference? My job logic is to have one tOracleTableTransfer component, loop through my table list, and plug each table into the component to load the data from source to target. Is there something wrong with doing that in terms of speed? Of course my tables are not all the same size, but all of them have over 40K records. Also, the data in my source is growing, but I have a where clause that only grabs the data that is not changing (not getting any updates), and my target is not changing, so basically I'm just loading data into a repository-like database once a day. My source has primary keys, foreign keys and indexes, but my target has primary keys with no foreign keys. Thank you again!
OK, the speed actually should not become slower; I am surprised this happens in your job. 1. With the setting log interval in seconds = 0 you can switch off the logging. I will change the logging to log4j logging; in the past Talend had no support for log4j, but now it is the default. 2. Yes, the component commits the data (after every batch). I can add an option to switch off the commit and let you do this yourself. 3. The fetch size and the batch size depend highly on the database and the table structure. The fetch size works the same way as the cursor size in tOracleInput (which is actually the fetch size), and the batch size is exactly the same as the batch size in the normal tOracleOutput components. You should play with the parameters and find the best values. If you set the batch size too high you will fill too much space in the Oracle SGA, and this can reduce performance. I can tell you, much more than 10,000 is useful only in rare situations. 4. Writing to the standard output can cause performance issues. Switching this off by setting the log interval to 0 is a good choice. 5. In my projects the update had no negative impact on performance. Take care: every run of a table transfer uses a new database connection (never share a connection between parallel runs!). You can usually drastically increase performance by disabling the primary key beforehand: ALTER TABLE table_name DISABLE CONSTRAINT constraint_name;
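For reference, the fetch-size/batch-size mechanics described above look roughly like this in plain JDBC. This is a minimal sketch under assumed table and column names, not the component's actual code:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class TransferSketch {

    // How many executeBatch() round trips a transfer of rowCount rows needs.
    static int batchFlushes(int rowCount, int batchSize) {
        return (rowCount + batchSize - 1) / batchSize; // ceiling division
    }

    // Minimal read-batch-write loop; src/target are assumed open connections,
    // and source_table/target_table are hypothetical names.
    static void transfer(Connection src, Connection target) throws SQLException {
        Statement sel = src.createStatement();
        sel.setFetchSize(10000);                  // "Source select fetch size"
        ResultSet rs = sel.executeQuery("SELECT c1, c2 FROM source_table");
        PreparedStatement ins = target.prepareStatement(
            "INSERT INTO target_table (c1, c2) VALUES (?, ?)");
        int batchSize = 10000;                    // "Insert batch size"
        int pending = 0;
        while (rs.next()) {
            ins.setObject(1, rs.getObject(1));
            ins.setObject(2, rs.getObject(2));
            ins.addBatch();
            if (++pending == batchSize) {
                ins.executeBatch();               // one network round trip per batch
                pending = 0;
            }
        }
        if (pending > 0) ins.executeBatch();      // flush the tail
        target.commit();
    }

    public static void main(String[] args) {
        // 46,000 rows with a 10,000-row batch need 5 flushes
        System.out.println(batchFlushes(46000, 10000));
    }
}
```

The sketch shows why a huge batch size quickly stops paying off: the number of round trips saved beyond ~10,000 rows per batch is marginal, while each pending batch consumes server-side memory until it is flushed.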
One hint: this new release can use not only tOracleConnection as the source. You can also use PostgreSQL or any other database as the source to feed your Oracle database this way.
Again, thank you so much jlolling! I have set the log interval in seconds to 0 but I'm still getting log output in my Talend output console! Adding the optional commit would be an awesome option to have! Regarding the batch size: you mentioned that more than 10,000 is rarely useful, but the default there is 50,000. Is that correct? So is it better to change it to less than 10K for better performance in my job? Not being limited to Oracle as the data source is a big breakthrough for our projects; we will incorporate it soon. Thank you sooooo much!
Regarding the log output: I will change all of it to the DEBUG level, so the INFO level stays free for any business information. Regarding the batch size: please keep in mind it is about increasing network throughput through more efficient TCP packets, and it is a matter of the database using a kind of bulk-load facility. If you want to find the maximum (too large a batch size can decrease performance), try it for yourself: let it run over time and check the overall performance. Using other database types can sometimes be a bit difficult because of different data types. For that reason I have added the option to map database types to Java types. The Java type will be used to get the data from the source and to send the data to the target. But please keep in mind a transaction should not be too big; there are usually limits on the size of a transaction. Check that with your database administrator. It will take a bit - I guess until Monday - to change the component. Stay tuned.
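The type-mapping option described above can be pictured as a simple lookup from source column type names to Java carrier classes. The mapping table below is an illustrative assumption, not the component's real table:

```java
import java.util.HashMap;
import java.util.Map;

public class TypeMapping {

    // Hypothetical mapping from source column type names to the Java
    // classes used to carry values from source to target.
    static final Map<String, Class<?>> MAP = new HashMap<>();
    static {
        MAP.put("CHAR", String.class);
        MAP.put("VARCHAR", String.class);
        MAP.put("NVARCHAR", String.class);
        MAP.put("DATE", java.sql.Timestamp.class);
        MAP.put("NUMBER", java.math.BigDecimal.class);
        MAP.put("BIT", Boolean.class);
    }

    // Unmapped types fall back to String, which the target database
    // then has to parse back into its own type.
    static Class<?> javaTypeFor(String dbType) {
        return MAP.getOrDefault(dbType.toUpperCase(), String.class);
    }

    public static void main(String[] args) {
        System.out.println(javaTypeFor("nvarchar").getSimpleName());
    }
}
```

A fallback like this is one plausible reason why a date column coming from another database type can end up being transported as text, assuming no explicit mapping is configured.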
I have finished my work on the tOracleTableTransfer component. If you want, you can try it: http://jan-lolling.de/talend/components/tOracleTableTransfer.zip In the advanced settings you will find the option "Use a separate tOracleCommit component..." and this will prevent the component from committing. Please use this option carefully, because transactions usually have a maximum size. Cheers Jan
Thank you SOOOOOO MUCH jlolling again! The log is VERY clean and neat, and having the option for committing is a big plus for this project of ours. I will be testing the component more and monitoring the performance with/without the commit, and will update you on which works better for us. BTW, I changed my "Source select fetch size" to 25K and "Insert batch size" to 10K and I see better performance; I will keep playing with those numbers to see what works best! One other good thing I noticed about this component is that it does not truncate the 6 fractional-second digits in timestamps (even though Java is limited and truncates the last 3 of those digits). Thanks!
jlolling, I'm testing the latest component version to check the performance. What I have been doing is giving the component different values for "Source select fetch size" and "Insert batch size". One thing I noticed is that if I run the component multiple times with the same values for "Source select fetch size" and "Insert batch size" (using the same table and the same records), I get different performance. My table has 21 columns and 46,000 records; one time it took 3,510 milliseconds, and the next time 6,208 milliseconds. On average the best speed I got for this table was with "Source select fetch size" = 50,000 and "Insert batch size" = 50,000, but it varies each time I try. Is there something I'm doing wrong, or a step I'm missing? I run the garbage collector between each of my runs!
The time differences are not caused by the component but by caching or load situations in the database. The code of the component does the work in exactly the same way every time, so it cannot be the root cause of the different durations. The garbage collection is actually not necessary, because there is a huge amount of time (from the perspective of the JVM) between the runs, and the GC has already done whatever is necessary. I suggest you try the backup-in-file mode (options: "Backup in file" + "Only backup in file, no inserts into target table"), in which you only write the file and not the table, and check whether this read-only run also varies in duration and how long it takes. You are causing a huge transaction size, and this requires a lot of space in the SGA.
I applied your suggestion to use the backup-in-file mode without inserting into the target table, and ran the job a few times for the same table with the same settings. These are the durations in milliseconds: 25693, 2574, 2558, 2480, 3120, 2808, 14118, 3417, 14758, 3307, 2479, 2583. If it only depends on the DB and not on the component, then I think we are good. It was just an observation I had, and I wanted to make sure I'm using this component correctly! Thanks!
Also, we tested the component with MSSQL as the source and were not able to run the job. This is the error we got:

: sqlrunner.datamodel.SQLDataModel - loadSchemas (schemas) for catalog: Infra failed: The server principal "username" is not able to access the database "Infra" under the current security context.
ERROR: Write failed in line number 35 message: Invalid column type
java.sql.SQLException: Invalid column type
.....
ERROR: Read has been failed. Stop write in table.
tOracleTableTransfer_1 read:23641 inserted:35
disconnected

Nothing was written into the DB, even though the log says 35 rows were inserted. We tested the connection separately using tMSSQLInput, and it connects and returns results just fine, but when we use tOracleTableTransfer we get the above error. We tried using the "Self defined source query" and separately the "source table where clause"; both failed with that error.

SELECT MRN, col1, col2, col3, col4 FROM SchemaName.TableNameMSSQL where SchemaName.TableNameMSSQL.col2 is not null
Any update on the source connection issue? Also, I was wondering if, instead of a DB connection for the source, we could use a flat file? I believe the process does that already, but there is no option to select a file as your source data. Thanks!
jlolling, this is the error we get. One note: LAST_SERVICE_DATE is of type Date, and there is no way to force tOracleTableTransfer to recognize it as a date; it does not read the schema. We believe the Date type is being treated/processed as a varchar type. Starting job tOracleTabletransfer at 11:32 10/11/2016.
connecting to socket on port 3817
connected
: sqlrunner.datamodel.SQLDataModel - loadSchemas (schemas) for catalog: Infra failed: The server principal "sqlaaccusr.prompt.pr" is not able to access the database "Infra" under the current security context.
DEBUG: createSourceSelectStatement SQL: SELECT mrn as MEMBER_ID, Carepath_Code, Carepath_Display_Name as Carepath_Name, Last_Service_Date, active FROM AACC.vwGetCarepaths where AACC.vwGetCarepaths.Status_Message_Patient is not null and MRN is not null
DEBUG: No MySQL class loaded.
DEBUG: createTargetInsertStatement SQL: insert into CD_ETLSB1.T_PHP_IMMUNIZATION (MEMBER_ID,CAREPATH_CODE,CAREPATH_NAME,LAST_SERVICE_DATE,LAST_SERVICE_DATE_STRING,NEXT_DUE_DATE,NEXT_DUE_DATE_STRING,STATUS_MESSAGE_1,STATUS_MESSAGE_2,STATUS_MESSAGE_BADGE,ACTIVE,STATUS,UPDATE_TS,CREATE_TS) values(?,?,?,?,?,?,?,?,?,?,?,?,?,?)
Starting....
DEBUG: Start writing data into target table CD_ETLSB1.T_PHP_IMMUNIZATION
DEBUG: Start fetch data from source query SELECT mrn as MEMBER_ID, Carepath_Code, Carepath_Display_Name as Carepath_Name, Last_Service_Date, active FROM AACC.vwGetCarepaths where AACC.vwGetCarepaths.Status_Message_Patient is not null and MRN is not null
DEBUG: Analyse result set...
DEBUG: Name: member_id, Type: CHAR
DEBUG: Name: carepath_code, Type: VARCHAR
DEBUG: Name: carepath_name, Type: VARCHAR
DEBUG: Name: last_service_date, Type: NVARCHAR
DEBUG: Name: active, Type: BIT
DEBUG: Start fetching data...
DEBUG: Output class mapping: #1 (MEMBER_ID) use: String
DEBUG: Output class mapping: #2 (CAREPATH_CODE) use: String
DEBUG: Output class mapping: #3 (CAREPATH_NAME) use: String
DEBUG: Output class mapping: #4 (LAST_SERVICE_DATE) use: String
DEBUG: Output class mapping: #11 (ACTIVE) use: Boolean
DEBUG: Write execute insert batch
ERROR: Write failed in line number 10000 message: ORA-01861: literal does not match format string
java.sql.BatchUpdateException: ORA-01861: literal does not match format string
    at oracle.jdbc.driver.OraclePreparedStatement.executeBatch(OraclePreparedStatement.java:11190)
    at oracle.jdbc.driver.OracleStatementWrapper.executeBatch(OracleStatementWrapper.java:244)
    at de.jlo.talendcomp.tabletransfer.TableTransfer.writeTable(TableTransfer.java:482)
    at de.jlo.talendcomp.tabletransfer.TableTransfer.access$100(TableTransfer.java:58)
    at de.jlo.talendcomp.tabletransfer.TableTransfer$2.run(TableTransfer.java:223)
ERROR: Next exception: ORA-01861: literal does not match format string
java.sql.SQLDataException: ORA-01861: literal does not match format string
    at oracle.jdbc.driver.T4CTTIoer.processError(T4CTTIoer.java:450)
    at oracle.jdbc.driver.T4CTTIoer.processError(T4CTTIoer.java:399)
    at oracle.jdbc.driver.T4C8Oall.processError(T4C8Oall.java:1017)
    at oracle.jdbc.driver.T4CTTIfun.receive(T4CTTIfun.java:655)
    at oracle.jdbc.driver.T4CTTIfun.doRPC(T4CTTIfun.java:249)
    at oracle.jdbc.driver.T4C8Oall.doOALL(T4C8Oall.java:566)
    at oracle.jdbc.driver.T4CPreparedStatement.doOall8(T4CPreparedStatement.java:215)
    at oracle.jdbc.driver.T4CPreparedStatement.doOall8(T4CPreparedStatement.java:58)
    at oracle.jdbc.driver.T4CPreparedStatement.executeForRows(T4CPreparedStatement.java:943)
    at oracle.jdbc.driver.OraclePreparedStatement.executeForRowsWithTimeout(OraclePreparedStatement.java:10932)
    at oracle.jdbc.driver.OraclePreparedStatement.executeBatch(OraclePreparedStatement.java:11100)
    at oracle.jdbc.driver.OracleStatementWrapper.executeBatch(OracleStatementWrapper.java:244)
    at de.jlo.talendcomp.tabletransfer.TableTransfer.writeTable(TableTransfer.java:482)
    at de.jlo.talendcomp.tabletransfer.TableTransfer.access$100(TableTransfer.java:58)
    at de.jlo.talendcomp.tabletransfer.TableTransfer$2.run(TableTransfer.java:223)
ERROR: Read has been failed. Stop write in table.
DEBUG: Finished write data into target table schemaName.tableName, count inserts:10000
DEBUG: Write table finished.
tOracleTableTransfer_2 read:20315 inserted:10000
DEBUG: Close source connection...
Job tOracleTabletransfer ended at 11:34 10/11/2016.
I think the problem is that you are using String as the transport data type for the column LAST_SERVICE_DATE. A String must match specific patterns to be accepted as a date value in Oracle. Why are you not able to use Date as the type? What happens if you set the data type for this column to Date? What is the original data type of this column in the source database?
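ORA-01861 is Oracle rejecting a bound string that does not match the date format it expects. One way around it (outside the component, purely as an illustration) is to parse the string on the Java side and bind it as a Timestamp; the format pattern here is an assumption about what the source actually returns:

```java
import java.sql.PreparedStatement;
import java.sql.Timestamp;
import java.text.ParseException;
import java.text.SimpleDateFormat;

public class DateBinding {

    // Assumed source format; adjust to whatever the source view really returns.
    static final SimpleDateFormat FMT = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");

    static Timestamp toTimestamp(String value) throws ParseException {
        if (value == null) return null;
        return new Timestamp(FMT.parse(value).getTime());
    }

    // Binding a Timestamp sidesteps Oracle's string-to-date parsing
    // (and therefore ORA-01861) entirely.
    static void bindDate(PreparedStatement ps, int index, String value)
            throws Exception {
        Timestamp ts = toTimestamp(value);
        if (ts == null) ps.setNull(index, java.sql.Types.TIMESTAMP);
        else ps.setTimestamp(index, ts);
    }

    public static void main(String[] args) throws Exception {
        // Prints 2016-10-11 11:32:00.0
        System.out.println(toTimestamp("2016-10-11 11:32:00"));
    }
}
```

With setString, the value is sent as text and Oracle must parse it against the session's NLS date format, which is exactly where the "literal does not match format string" error comes from; setTimestamp sends a typed value instead.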
We did 2 rounds of tests. 1. We didn't define any data type in the component and used the default setting, where no data type is defined. (Can we define the column type explicitly in this component?) 2. In the advanced settings we mapped "Date" to "timestamp" under custom type mapping. In both cases we got the same error. The data type of this column in both source and target is Date.