A possible Data Streams bug....

Sixteen Stars

A possible Data Streams bug....

Hi guys, I'm not sure if this is a bug, but I cannot find a way around this. I have created a pipeline that works brilliantly. When I created a new one, things that were possible in the first one just result in an error...

 

Warning java.lang.IllegalArgumentException: Cannot create PyString with non-byte value

I have a Python Map component with the following code....

 

output['bob'] = "bob"

I just want it to return a record called "bob" with the data "bob". In the previous pipeline that works, I have done similar without error. But this just won't work. I have attached a screenshot of the error, showing my pipeline. It is without a data target at the moment, but that shouldn't cause an issue.

Screen Shot 2018-08-02 at 23.04.05.png

 

So, is this a bug or am I doing something stupid?.....I'm happy to be told I am doing something stupid if you tell me what :-)

 



All Replies
Employee

Re: A possible Data Streams bug....

Hello @rhall_2_0

 

You are not doing something stupid at all. This is actually a known issue with the Python Component. 

 

What happens is that the data coming from Twitter can contain special characters (Chinese characters, for example), and unfortunately the Python component raises this error when handling them.

 

I believe it worked in your first pipeline because the data you got in your preview (the 50 records) had no special characters, so the Python component was working well. In your second pipeline you happened to get different data for the preview, which maybe contained special characters, and that causes this issue.

 

A quick fix would be to filter out the non-ASCII characters in your data, along the lines of the sketch below.
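For illustration only, here is a tiny sketch of that kind of filtering, to apply to each string field before the record reaches the Python component (the helper name and the field handling are my own, not part of the product):

def strip_non_ascii(text):
    # Keep only characters in the 7-bit ASCII range (0-127); drop the rest.
    return ''.join(c for c in text if ord(c) < 128)

# e.g. strip_non_ascii(u"caf\u00e9 #\u4f60\u597d world") returns "caf # world"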

 

Best regards,

Sixteen Stars

Re: A possible Data Streams bug....

Ah, that makes sense. Thanks for pointing this out. Is this an issue with all multi-byte characters? Is this going to be fixed? Temporarily I can probably make use of your suggested workaround, but if it is all multi-byte characters, it could lead to losing some useful data.

 

 

Sixteen Stars

Re: A possible Data Streams bug....

I've double-checked this and the problem cannot be what you have suggested. I've written some code to remove non-ASCII characters and loaded that data into Kinesis. I have refreshed the data source (which takes a lot of clicking on refresh before it actually loads new data) and have tested the same data in my first (working) pipeline and in the subsequent ones (not working). The same data works in the first pipeline, but fails with the error I specified in the others. It also seems odd that the error would occur when I am not actually consuming data from the data source, but am simply hardcoding an output. I suspect there is something different causing this issue.
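For reference, the sanitising step I mean is roughly the following -- an illustrative sketch rather than my exact code, with the stream name, field name and partition key all made up:

import json

import boto3

kinesis = boto3.client('kinesis')

def put_clean_tweet(tweet):
    # Strip anything outside the 7-bit ASCII range from the text field before
    # sending the record to Kinesis ('text', 'my-stream' and the partition key
    # are placeholders, not my real names).
    clean = dict(tweet)
    clean['text'] = ''.join(c for c in tweet['text'] if ord(c) < 128)
    kinesis.put_record(
        StreamName='my-stream',
        Data=json.dumps(clean).encode('ascii'),
        PartitionKey=clean.get('id_str', 'na'),
    )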

Employee

Re: A possible Data Streams bug....

Hello!  Thanks for trying out the Data Streams AMI!

 

We discovered the UTF-8 bug for data passing through the PythonRow, and it has already been fixed.  I'm uncertain when the fix will be available to AMI users, so your current technique (sanitizing the data) is the best approach for the short term.  I couldn't see any problems with your job, and it should work -- are you sure that you are removing all non-ASCII characters?  Even single-byte extended characters like é (outside the 7-bit ASCII range) can cause this problem.
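If it helps to double-check, here's a purely illustrative helper for verifying that a string really is pure 7-bit ASCII (the name is mine, not anything built into the product):

def is_pure_ascii(text):
    # True only if every character is in the 7-bit ASCII range (0-127);
    # an e-acute or an emoji anywhere will make this return False.
    return all(ord(c) < 128 for c in text)

Running that over the string fields of a few sample records should confirm whether the sanitising step caught everything.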

 

We've been hard at work improving this experience, and the responsiveness of the application!  In particular, with Kinesis, I've noticed that the latency between different Amazon regions is pretty noticeable -- the best practice is to ensure that your AMI and Kinesis resource are in the same region (likewise, if you write data to S3).

 

 

Once again, your feedback on the AMI is really valuable and thanks for contributing -- don't hesitate if you have other questions or problems!  Ryan

Sixteen Stars

Re: A possible Data Streams bug....

Hi @rskraba,

 

I actually have 3 pipelines using the same Twitter data source. The first one I created seems OK with the data. It has never failed like this while I have been building and tweaking. The other two both fail with the same sample data. To prove this I refreshed the data source so that the first Tweet I could read was the same across each pipeline. I then stepped through each of the components in each of the pipelines. The first one I built is absolutely fine. The other two fail as soon as I try to step through the python component. The really funny thing is that it fails even when I simply want to output a hardcoded output value. 

 

Could this be something to do with using the same data source? 

Employee

Re: A possible Data Streams bug....

This might be a silly question -- is your first pipeline actually sending data through the PythonRow?  The known issue with UTF-8 occurs when the incoming data is being "prepared" as Python dicts, lists and primitives. Any incoming record with a string field containing non-ASCII characters triggers the bug, even if the field is otherwise unused!  Even if you completely ignore the incoming record (as when you do output['bob'] = "bob")!
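To make that concrete, here's a made-up pair of records (the field name is invented) -- only the second one would trip the bug:

# Hypothetical records: only the second one trips the PyString error on entry
# to the PythonRow, even if the script is nothing more than: output['bob'] = "bob"
safe_record = {"text": u"plain ascii only #hello"}
failing_record = {"text": u"caf\u00e9 #\u4f60\u597d"}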

 

I can think of a couple of scenarios in which you wouldn't see the bug: for example, if your first pipeline filtered these records out before they reached the PythonRow, or if the non-ASCII fields were removed by an upstream component like Aggregate or FieldSelector.

 

You can check this by looking through the previews in the successful and failing pipelines too -- are there non-ASCII characters coming into the PythonRow in the successful pipeline?  Are you guaranteed that the records going into the PythonRow are "sanitized" in the failing pipelines?

 

Keep in mind that a pipeline that succeeds on a sample might still fail on a real run, if "unsanitized" data gets to the PythonRow.

 

Given the error message, I strongly suspect this is the known issue!  But I'm very curious whether this might be an unrelated issue -- would it be possible to retrieve and share some sample data that fails in the second pipeline?  I frequently use the "Custom Data" dataset and component to test different input data in a pipeline, so this might be a path to explore!

 

(By the way, please be assured that we took this bug seriously -- it was a priority to fix as soon as it was discovered!  Python is a pretty mature technology, and yet it is still pretty fragile with respect to Unicode!  See http://bugs.jython.org/issue2632 for a similar example in their CSV processing module.  Even with the fix, it's easy to make strings inside the Python code that end up as "undefined" characters!)

 

 

 

Sixteen Stars

Re: A possible Data Streams bug....

You may have hit the nail on the head there. I am filtering in the first pipeline, and by the end of it I am left with an aggregation of Twitter hashtags in a large dict with counts per word. I suspect I am less likely to see multi-byte characters in the hashtags. I will take another look.

 

It is good to know that the data entering the Python component can still cause this regardless of whether you actually use it IN the Python component. I was convinced I had removed this possibility with my "bob" code. I will test this again, filter out everything but hashtags, and let you know.

 

Thanks :-)

 

Sixteen Stars

Re: A possible Data Streams bug....

You were right @rskraba. I have filtered the data to just hashtags and I do not get the error anymore. 

 

Just out of interest, when the fix you guys are working on is ready, will it be automatically distributed to the AMIs or will a new AMI have to be created? If it is the latter, I will need to retrieve my pipelines and data sources, etc., from my current AMI. There is no mechanism to let us download these at the moment (it would be MASSIVELY useful so we do not need to keep an AMI), so is there a "hack" to do this by SSH-ing in and picking up a certain folder? I suspect there is, but I'd rather not spend hours hunting it out if it is something you can share :-)

Employee

Re: A possible Data Streams bug....

@rhall_2_0

 

The fix won't be automatically distributed to your AMI. You will need to create a new instance with the revised version of the AMI.

 

The export/import is not available in the AMI but will be in the cloud-managed version. However, if you check the volumes attached to your AMI, you'll see that you have two of them: one is the root volume with Talend Data Streams, and the second one holds your data (connections, datasets and pipelines).

 

When a newer version of the AMI becomes available, you will be able to attach your data volume to the new AMI and retrieve all your pipelines. Please look at the documentation here to see how to achieve this.
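If you prefer to script the volume move, it could look roughly like this with boto3 -- purely a sketch, with the volume ID, instance ID and device name as placeholders; the documentation remains the reference:

import boto3

ec2 = boto3.client('ec2')

DATA_VOLUME_ID = 'vol-0123456789abcdef0'   # placeholder: your data volume ID
NEW_INSTANCE_ID = 'i-0123456789abcdef0'    # placeholder: instance running the new AMI

# Detach the data volume from the old instance (stop it first), then attach it
# to the new one. The volume has to be in the same Availability Zone as the
# target instance, and the device name here is only an example.
ec2.detach_volume(VolumeId=DATA_VOLUME_ID)
ec2.attach_volume(VolumeId=DATA_VOLUME_ID,
                  InstanceId=NEW_INSTANCE_ID,
                  Device='/dev/sdf')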

Employee

Re: A possible Data Streams bug....

I'm glad that it worked for you!  I don't have any information about future AMI release plans yet, but we've definitely planned to support migrating your pipelines when it happens.  There's some existing documentation at https://help.talend.com/reader/y7aZk2qgXXZFHGyOHvynTQ/jG1FbsqIu9JuCY4qu75z9g about saving the state of your existing pipelines and other business objects and re-importing them into a new AMI.

 

We're all looking forward to a future release where this bug no longer exists!