Weird AVRO error when processing data

Community Manager

I am getting a weird error when running my Pipeline which says....

 

org.apache.avro.AvroTypeException: Non-null default value for null type: ""

 

I am writing my data to AVRO in a Python script and sending it to AWS Kinesis. This is being consumed by the Pipeline. It sometimes works for up to several thousand rows of data, then it fails with the message above. The data that succeeds is sent to Elasticsearch, and I am seeing a lot of data there, exactly as I want it.
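
For context, the producer side is essentially the following (a cut-down sketch, not my exact script, assuming the avro and boto3 packages; the stream name, schema, and record are placeholders):

# Cut-down sketch of the producer: serialize one record with the avro
# package, then put it onto Kinesis with boto3. The stream name, schema,
# and record are placeholders, not the real ones.
import io
import json

import avro.io
import avro.schema
import boto3

SCHEMA = avro.schema.parse(json.dumps({
    "type": "record",
    "name": "Tweet",
    "fields": [
        # null first in every union, and the only default I set is null
        {"name": "id_str", "type": ["null", "string"], "default": None},
        {"name": "created_at", "type": ["null", "string"], "default": None},
    ],
}))

def send(record, stream_name="my-tweet-stream"):
    buf = io.BytesIO()
    avro.io.DatumWriter(SCHEMA).write(record, avro.io.BinaryEncoder(buf))
    boto3.client("kinesis").put_record(
        StreamName=stream_name,
        Data=buf.getvalue(),
        PartitionKey=record["id_str"] or "none",
    )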

 

The problem I am having is that I cannot see why I would get this error. It is telling me that a non-null default value is being set for a null type. The ONLY default values I am setting in my AVRO schema are null, and I am setting all of my types to ["null", "{the required type}"] (as in the sketch above), so this must mean it is occurring inside the Pipeline somewhere. Unfortunately, the error message does not point to any data or a particular processor. Any ideas on how to debug this?

 

Employee

Re: Weird AVRO error when processing data

Hello!  It certainly sounds like you're doing the right thing in your Avro schema... (i.e. null first and only using the null default for a field).  It's unlikely that it's a configuration issue with the Kinesis input!

 

This is an error that can only occur when building a new Avro record, so it must be related to one of the processors.  Can you help us narrow it down?  Which processors are you using?  I'm assuming that there's no useful information in the stack trace when you are running the job!  Ideally, we'd be able to find out who is specifically calling build() on the generated record...

Community Manager

Re: Weird AVRO error when processing data

Hi @rskraba,

 

Please see a screenshot of my pipeline below.....

 

pipeline1.png

 

The FieldSelector is simply selecting a Twitter string Id, the created_date and a selection of geo points. Not all of them will be populated. In the PythonRow I am selecting the best possible geo points to send to Elasticsearch. The code I am using is below....

 

# Build the output record with empty/None geo defaults.
output = {}
output['location'] = ''
output['latitude'] = None
output['longitude'] = None
output['id'] = input['id_str']
output['created_at'] = input['created_at']

# Prefer geo_coords, then coords, then fall back to the place bounding box.
if input['geo_coords'] is None:
    if input['coords'] is None:
        # First corner of the place bounding box, in [lon, lat] order.
        output['latitude'] = input['place_coords'][0][0][1]
        output['longitude'] = input['place_coords'][0][0][0]
    else:
        # coords follows GeoJSON [lon, lat] order.
        output['latitude'] = input['coords']['coordinates'][1]
        output['longitude'] = input['coords']['coordinates'][0]
else:
    # geo_coords is in [lat, lon] order.
    output['latitude'] = input['geo_coords']['coordinates'][0]
    output['longitude'] = input['geo_coords']['coordinates'][1]

output['location'] = str(output['latitude']) + ',' + str(output['longitude'])

While this is arguably not the safest way of doing this, I could not see how I was ever adding a non-null value to a null field. However, I am not as experienced with Python as I am with Java :-)
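
For comparison, a more defensive version of the same selection might look something like this (just a sketch, assuming input behaves like a dict; it is not what is running in my job):

# Defensive variant of the geo selection (a sketch): try each source in
# order of preference and fall back to None rather than assuming that
# place_coords is always populated when the other two are absent.
def pick_coords(row):
    if row.get('geo_coords'):
        # geo_coords is in [lat, lon] order
        lat, lon = row['geo_coords']['coordinates'][:2]
    elif row.get('coords'):
        # coords follows GeoJSON [lon, lat] order
        lon, lat = row['coords']['coordinates'][:2]
    elif row.get('place_coords'):
        # first corner of the place bounding box, [lon, lat] order
        lon, lat = row['place_coords'][0][0][:2]
    else:
        return None, None
    return lat, lon

# Usage inside the PythonRow:
# output['latitude'], output['longitude'] = pick_coords(input)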

 

I have restricted my Tweets in my streaming query (prior to the data arriving at the pipeline) to ensure that I only get Tweets with geo data.

 

Unfortunately there is not any useful data in the error stack....well, not useful to me anyway :-)

 

Regards

 

Richard 

Employee

Re: Weird AVRO error when processing data

Thanks for the screenshot and the Python!  Everything looks good in your job, so I suspect that you've found a bug in the product.

 

It is simpler to identify the bug now that I can focus on the two processors: FieldSelector and PythonRow.  The specific error message says that "someone" (i.e. Data Streams) is trying to create and use a Schema where the default value is the empty String, but the type (or union type) is NULL.  I checked through some of the code and found two possibilities:

 

1. Some of our code positions NULL at the end of an Avro schema UNION (in Avro, the default must match the first branch of the union, so with NULL last the default *should* be non-null).  This corresponds to the https://avro.apache.org/docs/1.8.1/api/java/org/apache/avro/SchemaBuilder.html#nullable() convention, but isn't mandatory.  This might cause errors with a default value, but not the error you're seeing!

 

2. Some of our code sets the empty string as the default in generated schemas, which is normally OK if the NULL is last.  This would definitely cause that error *if* your original NULL-first schema is being retained but we're overriding the default!  If this is the case, it would probably be a Field Selector error.  (Both shapes are sketched below.)
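
To make that concrete, here is a sketch of the shapes in question (illustrative field name, written as Python dicts mirroring the Avro JSON):

# Sketch of the schema shapes discussed above (illustrative field name).

# Possibility 1's convention: NULL last in the union, so a non-null
# default such as "" is legal (the default must match the first branch).
null_last_ok = {"name": "location", "type": ["string", "null"], "default": ""}

# Possibility 2's failing combination: the original NULL-first union is
# retained, but the default has been overridden with "". Building a
# schema like this is what produces:
#   org.apache.avro.AvroTypeException: Non-null default value for null type: ""
null_first_bad = {"name": "location", "type": ["null", "string"], "default": ""}

# Your original, valid shape: NULL first with a null default.
null_first_ok = {"name": "location", "type": ["null", "string"], "default": None}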

 

I *don't* suggest modifying your Avro schema -- we should be able to work with it.  Would it be possible to try removing the Field Selector and seeing if that fixes your problem?  This should be transparent if the field selector is just trimming unused fields (and much more complicated if it's doing more than that!).

 

The Field Selector has been completely rewritten since the release of Talend Data Streams for AWS, so I will see if I can reproduce this and/or report it as a bug.

Community Manager

Re: Weird AVRO error when processing data

Thanks for looking into this @rskraba. I was involved in the testing at the beginning of this week and can see that this has changed quite a lot. I wasn't able to recreate the issue with the data I was using, so hopefully it has been resolved.
