I'm having some stability issues with the Remote Engine for Pipelines. It runs inside Docker containers with resource limits, because otherwise it would chew up my 32 GB of RAM in a flash and I have other services running. The limits are fairly generous, though: livy gets 8 GB and previewrunner gets 6 GB of RAM, which should be more than enough since both can run on an 8 GB machine without any problems. I have also given my environment extra space. I have increased and decreased the memory limits in the .env file and in the run profiles in the Management Console, but nothing seems to stick. A pipeline might work for a couple of minutes and then crash, it might crash immediately, or it might run flawlessly five times in a row and then fail to start up again.
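To check whether the limits I configure actually reach the JVMs, I put together a small diagnostic I can run inside the livy and previewrunner containers. This is just a sketch on my side; the /proc and cgroup v1 paths are assumptions about how the containers are set up:

    import java.nio.file.Files;
    import java.nio.file.Paths;

    public class ContainerLimits {
        public static void main(String[] args) throws Exception {
            // Heap ceiling the JVM thinks it has - should reflect whatever the .env limit resolves to
            System.out.println("JVM max heap MB: " + Runtime.getRuntime().maxMemory() / (1024 * 1024));
            // Kernel-wide thread ceiling, relevant for "unable to create new native thread"
            System.out.println("threads-max: " + read("/proc/sys/kernel/threads-max"));
            // Per-container process/thread cap (cgroup v1 path; assumption on my side)
            System.out.println("pids.max: " + read("/sys/fs/cgroup/pids/pids.max"));
        }

        private static String read(String path) throws Exception {
            return new String(Files.readAllBytes(Paths.get(path))).trim();
        }
    }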
Stack trace:
Exception in thread "main" java.lang.OutOfMemoryError: unable to create new native thread
    at java.lang.Thread.start0(Native Method)
    at java.lang.Thread.start(Thread.java:717)
    at org.apache.spark.ContextCleaner.start(ContextCleaner.scala:126)
    at org.apache.spark.SparkContext$$anonfun$23.apply(SparkContext.scala:560)
    at org.apache.spark.SparkContext$$anonfun$23.apply(SparkContext.scala:560)
    at scala.Option.foreach(Option.scala:257)
    at org.apache.spark.SparkContext.<init>(SparkContext.scala:560)
    at org.apache.spark.SparkContext.<init>(SparkContext.scala:117)
    at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2540)
    at org.talend.datastreams.streamsjob.utils.SparkExecHelper.finalize(SparkExecHelper.scala:89)
    at org.talend.datastreams.streamsjob.FullRunJob$.runJob(FullRunJob.scala:118)
    at org.talend.datastreams.streamsjob.FullRunJob$.main(FullRunJob.scala:91)
    at org.talend.datastreams.streamsjob.FullRunJob.main(FullRunJob.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:775)
    at org.apache.spark.deploy.SparkSubmit$$anon$1.run(SparkSubmit.scala:162)
    at org.apache.spark.deploy.SparkSubmit$$anon$1.run(SparkSubmit.scala:160)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:160)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
This is the most common OOM error I get. Sometimes other components run out of memory (if I try to write to S3 it goes out of memory almost immediately, and the same goes for Avro), but most of the time it's the main method failing because it can't create a new native thread. That leads me to believe it has something to do with the Spark driver, but I gave it a couple of gigabytes to work with and changed the .env variables to make sure it has enough space, and it still runs out of memory.
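For what it's worth, this particular flavour of OOM doesn't seem to be about heap at all. A tiny standalone repro (my own illustration, not the engine's code) dies with the exact same message while the heap is almost empty, because the error is thrown when the OS or cgroup refuses to create another thread. That would explain why raising the heap sizes in .env never helps:

    public class NativeThreadOom {
        public static void main(String[] args) {
            int count = 0;
            try {
                while (true) {
                    // Spawn idle threads until the OS/cgroup refuses to create another one
                    Thread t = new Thread(() -> {
                        try {
                            Thread.sleep(Long.MAX_VALUE); // park the thread so it stays alive
                        } catch (InterruptedException ignored) {
                        }
                    });
                    t.setDaemon(true);
                    t.start();
                    count++;
                }
            } catch (OutOfMemoryError e) {
                long usedHeapMb = (Runtime.getRuntime().totalMemory() - Runtime.getRuntime().freeMemory()) / (1024 * 1024);
                // Fails with "unable to create new native thread" even though the heap is barely used
                System.out.println("Died after " + count + " threads with only " + usedHeapMb + " MB heap used: " + e.getMessage());
            }
        }
    }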
Maybe it's the component memory that is running out? Does each new component you add claim the full memory reservation, or is one reservation shared by all of them? How does Spark spread its memory usage? Does it create only one executor per pipeline, or are there multiple?
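For reference, these are the plain Spark properties behind the driver/executor questions above. I don't know which of them the Remote Engine's .env or run profiles actually map to, so this is just the vanilla Spark view of the knobs, not anything engine-specific:

    import org.apache.spark.SparkConf;

    public class SparkMemoryKnobs {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf()
                    .setAppName("memory-knobs-example")
                    .set("spark.driver.memory", "2g")       // heap for the driver JVM, where my stack trace originates
                    .set("spark.executor.memory", "4g")     // heap per executor JVM
                    .set("spark.executor.instances", "2")   // how many executors the application gets
                    .set("spark.executor.cores", "2");      // tasks run as threads inside each executor
            System.out.println(conf.toDebugString());
        }
    }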
Is there something I'm missing here?