Remote Engine for Pipelines keeps running out of memory

Four Stars

I'm having some stability issues with the Remote Engine for Pipelines. It runs inside Docker containers with resource limits, because otherwise it would chew through my 32 GB of RAM in no time and I have other services running on the same machine. The limits are fairly generous, though: livy gets 8 GB and previewrunner gets 6 GB of RAM, which should be more than enough since both can run on an 8 GB machine without any problems. I have also given my environment extra space. I have increased and decreased the memory limits in the .env file and in the run profiles in the Management Console, but nothing seems to make a difference. It might work for a couple of minutes and then crash, or it might crash immediately, or it might run flawlessly five times in a row and then fail to start up again.
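
For reference, this is roughly how I have been checking that the limits I set are actually applied ("livy" and "previewrunner" are just placeholders for whatever the engine's compose file names the containers on your side):

# Confirm the memory limit each container is actually running with
docker stats --no-stream livy previewrunner
docker inspect --format '{{.HostConfig.Memory}}' livy

# Check the limit as seen from inside the container (cgroup v1 path)
docker exec livy cat /sys/fs/cgroup/memory/memory.limit_in_bytes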

 

Stack trace:

Exception in thread "main" java.lang.OutOfMemoryError: unable to create new native thread
	at java.lang.Thread.start0(Native Method)
	at java.lang.Thread.start(Thread.java:717)
	at org.apache.spark.ContextCleaner.start(ContextCleaner.scala:126)
	at org.apache.spark.SparkContext$$anonfun$23.apply(SparkContext.scala:560)
	at org.apache.spark.SparkContext$$anonfun$23.apply(SparkContext.scala:560)
	at scala.Option.foreach(Option.scala:257)
	at org.apache.spark.SparkContext.<init>(SparkContext.scala:560)
	at org.apache.spark.SparkContext.<init>(SparkContext.scala:117)
	at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2540)
	at org.talend.datastreams.streamsjob.utils.SparkExecHelper.finalize(SparkExecHelper.scala:89)
	at org.talend.datastreams.streamsjob.FullRunJob$.runJob(FullRunJob.scala:118)
	at org.talend.datastreams.streamsjob.FullRunJob$.main(FullRunJob.scala:91)
	at org.talend.datastreams.streamsjob.FullRunJob.main(FullRunJob.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:775)
	at org.apache.spark.deploy.SparkSubmit$$anon$1.run(SparkSubmit.scala:162)
	at org.apache.spark.deploy.SparkSubmit$$anon$1.run(SparkSubmit.scala:160)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
	at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:160)
	at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
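
From what I understand, this particular flavour of OutOfMemoryError means the JVM failed to create an OS thread rather than running out of heap, so I have also been looking at thread and process limits along these lines (the container name is a placeholder again):

# Process/thread limits visible inside the container (look at the "processes" line)
docker exec livy sh -c 'ulimit -a'

# Host-wide thread ceiling
cat /proc/sys/kernel/threads-max

# Number of threads the container's main process is currently using
docker exec livy sh -c 'ls /proc/1/task | wc -l'

# Whether Docker is capping the number of PIDs in the container
docker inspect --format '{{.HostConfig.PidsLimit}}' livy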

 

This is the most common OOM error I get. Sometimes other components run out of memory (if I try to write to S3 it runs out almost immediately, same for Avro), but most of the time it's because it can't create a new native thread in its main method. That leads me to believe it most likely has to do with the Spark driver, but I gave it a couple of gigabytes to work with and changed the .env variables to make sure it has enough space, and it still runs out of memory.
Maybe it is the component memory that is running out? Does each new component you add fill the entire memory reservation, or is one reservation enough for all of them? How does Spark spread its memory usage? Does it create only one executor per pipeline, or are there multiple?
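
For context, these are the standard Spark settings I assume the run profile eventually maps onto; the property names are Spark's own, but how the Remote Engine actually wires them up is a guess on my part:

# Standard Spark memory/parallelism settings (Spark's own property names;
# the jar name is a placeholder for whatever job the engine submits)
spark-submit \
  --conf spark.driver.memory=2g \
  --conf spark.executor.memory=4g \
  --conf spark.executor.instances=1 \
  --conf spark.executor.cores=2 \
  my-pipeline.jar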

Is there something I'm missing here?

Employee

Re: Remote Engine for Pipelines keeps running out of memory

Hi,

Can you please provide more details:

- what Docker version do you have on your machine?

- what kind of connections/datasets are used when the issue happens?

- is the issue occurring only when you run pipelines, or even when you just add/edit datasets?
