CRC duplicate value when exporting Date-Host-URL from Google Analytics

One Star

I'm exporting data from Google Analytics into a MySQL database. I have data from 5 different websites to import.
I have 3 columns:
date - string (yyyyMMdd)
host - string - www.website1.com or www.website2.com etc. (constant value in tMap -> different for each job, constant under the job)
URL - string
Basic logic:
1) 2014-06-10 - www.website1.com - "/" is different from 2014-06-11 - www.website1.com - "/"
2) 2014-06-10 - www.website1.com - "/" is different from 2014-06-10 - www.website1.com - "/1"
3) 2014-06-10 - www.website1.com - "/" is different from 2014-06-10 - www.website2.com - "/"
So I assume there is no way that two rows can be duplicates...
The length of the CRC column is set to 255 (it generates 8-10 digit integers); in the database the column is set to BIGINT.
1) What might cause this problem?
2) If I'm wrong about uniqueness of my rows, how can I save duplicate values to check them after the job is complete? (although I can't understand how it is possible since "HOST" variable is set as constant inside tMap and is 100% different from whatever I had in previous job runs)
Thanks,
Ivan
P.S. Any idea why, on a long date range, my job ends at 1000000 rows?
P.P.S. The same issue was reported here: http://www.talendforge.org/forum/viewtopic.php?id=28670

UPDATE: I've manually checked a couple of codes from the statistics window, and I see both:
1) Duplicate CRCs within 1 "host" job
2) Duplicate CRCs for different jobs with different "host" constants.
Seventeen Stars

Re: CRC duplicate value when exporting Date-Host-URL from Google Analytics

At the moment I have no idea what your problem is. Could you please explain in a bit more detail what exactly does not work?
Regarding your postscript comments:
There is actually no technical reason for the component tGoogleAnalyticsInput to stop at a particular number of rows (e.g. 100,000). I guess Google sets some limit here.
Here are all the documented limits, but I have not read anything about row count limits:
https://developers.google.com/analytics/devguides/reporting/core/v3/limits-quotas
To be honest, I never run into this trouble because I design the requests so that they return only the data for one day, one hour, or one profile. Trying to gather all data at once is a bad design and prevents restart capability.
I suggest you run multiple queries, each with a reasonably small amount of data. You can always check your queries in the API console.
Please keep in mind that the component receives a huge JSON file as the result, and I guess there is a natural limit on how large an answer should be.
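To illustrate the advice about smaller queries, here is a minimal sketch that splits a long date range into one-day windows. It is only an illustration: the class and method names are hypothetical, and the assumption is that each day is then passed as both start date and end date of a separate query (how that is wired into the job, for example via context variables, is not shown).

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of any Talend component.
final class DailyWindows {

    // Returns every day from start to endInclusive as a yyyy-MM-dd string.
    // An empty list is returned when start is after endInclusive.
    static List<String> daysBetween(LocalDate start, LocalDate endInclusive) {
        List<String> days = new ArrayList<>();
        for (LocalDate d = start; !d.isAfter(endInclusive); d = d.plusDays(1)) {
            days.add(d.format(DateTimeFormatter.ISO_LOCAL_DATE));
        }
        return days;
    }

    public static void main(String[] args) {
        // Example range only; each line stands for one small query.
        for (String day : daysBetween(LocalDate.of(2014, 6, 1), LocalDate.of(2014, 6, 10))) {
            System.out.println("query window: start=" + day + " end=" + day);
        }
    }
}
```

Because each window is independent, a failed day can be rerun on its own without repeating the whole export.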
One Star

Re: CRC duplicate value when exporting Date-Host-URL from Google Analytics

Thanks for your reply, Jlolling.
My issue:
I have 5 websites:
TUT.BY
SPORT.TUT.BY
NEWS.TUT.BY
AUTO.TUT.BY
LADY.TUT.BY
They have different URIs, but I need to have them all in the SAME DB.
What I do:
1) Query Google Analytics:
"ga:date,ga:sourceMedium" + "ga:sessions,ga:pageviews,ga:bounces,ga:sessionDuration,ga:users,ga:newUsers"
2) Add column "Host": for "TUT.BY" job = "TUT_BY", for "NEWS.TUT.BY" job = "NEWS_TUT_BY", etc. So I have 5 different jobs for each profile. (I use tMap component for that.)
3) I pass this table into tAddCRCRow and generate a 32-bit code based on the ga:date + ga:sourceMedium + Host columns.
(The CRC column is set as "key" to enable future updates.)
4) Upload this data into the MySQL database.
BUT when I run it I see in statistics "Duplicate value error" for CRC column.
I.e. CRC1 for "TUT.BY" job sometimes = CRC2 for "NEWS.TUT.BY" job.
How is that possible? Or how can I fix that?
Best,
Ivan
Seventeen Stars

Re: CRC duplicate value when exporting Date-Host-URL from Google Analytics

From the way you describe building the CRC checksum, I also have no idea what's wrong with this approach.
But why do you build a checksum at all?
I guess the duplicate error is caused by a unique constraint in the database?
Next: I would not build 5 very similar jobs. I would build one job which gets the URL (or host, as well as the profile ID) as a context parameter, and I would start 5 instances of this job with different values for the URL. This way you avoid copy-and-paste errors.
To find out how the wrong CRC value arises, you could use Trace Mode and inspect the values of all flows.
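One possible explanation, offered as an estimate rather than a diagnosis: a 32-bit code has only 2^32 possible values, so distinct rows can share a code purely by chance once the row count is large. The sketch below uses the standard birthday approximation and assumes the codes behave like uniformly random 32-bit values, which a CRC only approximates. The class and method names are hypothetical.

```java
// Hypothetical helper: birthday-bound estimate for a hash with 2^bits values.
final class CrcBirthdayEstimate {

    // Expected number of colliding pairs among n distinct keys,
    // assuming each key gets a uniformly random value out of 2^bits.
    static double expectedCollidingPairs(long n, int bits) {
        return (double) n * (n - 1) / (2.0 * Math.pow(2.0, bits));
    }

    // Approximate probability that at least one pair collides:
    // 1 - exp(-n(n-1) / (2 * 2^bits)).
    static double collisionProbability(long n, int bits) {
        return -Math.expm1(-expectedCollidingPairs(n, bits));
    }

    public static void main(String[] args) {
        for (long n : new long[] {10_000L, 100_000L, 1_000_000L}) {
            System.out.printf("n=%d  P(at least one collision)=%.4f  expected pairs=%.1f%n",
                    n, collisionProbability(n, 32), expectedCollidingPairs(n, 32));
        }
    }
}
```

Under this approximation, about 77,000 distinct rows already give roughly a 50% chance of at least one shared code, and 1,000,000 rows give on the order of a hundred colliding pairs. If that is what is happening here, a unique constraint on the CRC column alone would reject rows that are in fact different.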
The very last method could be Java Debugging.
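As a complement to Trace Mode, the rows behind each duplicate CRC can also be collected after the fact. The sketch below is a minimal, standalone illustration: it assumes the rows (date, host, URL) and the CRC value the job already computed for each are available in memory, for example read back from a file or a staging table without a unique constraint. The `Row` class and `findCollisions` method are hypothetical names, not part of Talend.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper for inspecting duplicate CRC values offline.
final class CrcDuplicateReport {

    // One exported row plus the CRC value the job produced for it.
    static final class Row {
        final String date;
        final String host;
        final String url;
        final long crc;

        Row(String date, String host, String url, long crc) {
            this.date = date;
            this.host = host;
            this.url = url;
            this.crc = crc;
        }

        String key() {
            return date + " | " + host + " | " + url;
        }
    }

    // Groups rows by CRC and keeps only CRC values shared by two or more
    // DISTINCT date/host/url keys. Exact repeats of the same key are merged.
    static Map<Long, Set<String>> findCollisions(List<Row> rows) {
        Map<Long, Set<String>> byCrc = new LinkedHashMap<>();
        for (Row r : rows) {
            byCrc.computeIfAbsent(r.crc, k -> new LinkedHashSet<>()).add(r.key());
        }
        byCrc.values().removeIf(keys -> keys.size() < 2);
        return byCrc;
    }

    public static void main(String[] args) {
        // Made-up sample values, for illustration only.
        List<Row> rows = new ArrayList<>();
        rows.add(new Row("20140610", "TUT_BY", "/", 111L));
        rows.add(new Row("20140610", "NEWS_TUT_BY", "/", 111L));
        rows.add(new Row("20140611", "TUT_BY", "/", 222L));
        findCollisions(rows).forEach((crc, keys) ->
                System.out.println("CRC " + crc + " is shared by: " + keys));
    }
}
```

If a CRC value is reported with several different keys, the rows really are distinct and the CRC is colliding; if a CRC never shows more than one key, the duplicate errors come from the same row being loaded more than once.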