
Troubleshooting a spark.driver.host parameter error

2020-03-25 22:58:29



Error log:
20/03/25 10:28:07 WARN UserGroupInformation: PriviledgedActionException as:root (auth:SIMPLE) cause:org.apache.spark.SparkException: Exception thrown in awaitResult
Exception in thread "main" java.lang.reflect.UndeclaredThrowableException
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1930)
    at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:66)
    at org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:188)
    at org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:284)
    at org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
Caused by: org.apache.spark.SparkException: Exception thrown in awaitResult
    at org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:77)
    at org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:75)
    at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
    at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
    at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
    at scala.PartialFunction$OrElse.apply(PartialFunction.scala:167)
    at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83)
    at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:100)
    at org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$run$1.apply$mcV$sp(CoarseGrainedExecutorBackend.scala:202)
    at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:67)
    at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:66)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1917)
    ... 4 more
Caused by: java.io.IOException: Failed to connect to localhost/127.0.0.1:41640
    at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:228)
    at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:179)
    at org.apache.spark.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala:197)
    at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:191)
    at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:187)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: localhost/127.0.0.1:41640
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
    at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:257)
    at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:291)
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:640)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:575)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:489)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:451)
    at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:140)
    at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
    ... 1 more
 
LogType:stderr
Log Upload Time:Wed Mar 25 10:31:22 +0800 2020
LogLength:63452
Log Contents:
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/yarn/nm/usercache/root/filecache/4996/__spark_libs__5125379819107399169.zip/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-5.14.0-1.cdh5.14.0.p0.24/jars/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
20/03/25 10:29:18 INFO SignalUtils: Registered signal handler for TERM
20/03/25 10:29:18 INFO SignalUtils: Registered signal handler for HUP
20/03/25 10:29:18 INFO SignalUtils: Registered signal handler for INT
20/03/25 10:29:19 INFO ApplicationMaster: Preparing Local resources
20/03/25 10:29:19 INFO ApplicationMaster: ApplicationAttemptId: appattempt_1585020115190_0150_000002
20/03/25 10:29:19 INFO SecurityManager: Changing view acls to: yarn,root
20/03/25 10:29:19 INFO SecurityManager: Changing modify acls to: yarn,root
20/03/25 10:29:19 INFO SecurityManager: Changing view acls groups to:
20/03/25 10:29:19 INFO SecurityManager: Changing modify acls groups to:
20/03/25 10:29:19 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(yarn, root); groups with view permissions: Set(); users  with modify permissions: Set(yarn, root); groups with modify permissions: Set()
20/03/25 10:29:19 INFO ApplicationMaster: Starting the user application in a separate Thread
20/03/25 10:29:19 INFO ApplicationMaster: Waiting for spark context initialization...
20/03/25 10:29:21 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Attempted to request executors before the AM has registered!
20/03/25 10:29:21 INFO RMProxy: Connecting to ResourceManager at df1/172.16.252.11:8030
20/03/25 10:29:28 WARN YarnAllocator: Container marked as failed: container_1585020115190_0150_02_000003 on host: df3. Exit status: 1. Diagnostics: Exception from container-launch.
Container id: container_1585020115190_0150_02_000003
Exit code: 1
Stack trace: ExitCodeException exitCode=1:
    at org.apache.hadoop.util.Shell.runCommand(Shell.java:604)
    at org.apache.hadoop.util.Shell.run(Shell.java:507)
    at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:789)
    at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:213)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
 
Container exited with a non-zero exit code 1

 

Problem recap:
The program was written and first tested from local IDEA, accessing the test environment remotely: everything worked. It was then submitted to the test environment and run in Spark local mode: still fine. Running it in cluster mode, however, failed with the errors above.

Troubleshooting:
Since local mode ran fine in the test environment, the first suspect was the environment itself, but other programs ran normally there, so the environment was ruled out. The next suspect was the code, yet nothing stood out on review, so I fell back on the brute-force approach and wrote the simplest possible program: no configuration parameters, no logic, just reading a file from HDFS and printing it to the console. It ran without problems. That pointed to a Spark configuration parameter, and going over the program's configuration turned up spark.driver.host.

Parameter meaning: the hostname or IP address the driver listens on, used to communicate with the executors and with the standalone master.

After deploying Spark, I had run Spark's built-in pi-calculation example in both yarn-cluster and yarn-client mode. The Spark configuration files contained little beyond the basic properties, yet the two modes behaved differently: yarn-cluster mode ran fine, while yarn-client mode always failed. The ResourceManager and NodeManager logs showed the program could never find the ApplicationMaster, which was strange; moreover, the port opened by the client-side Driver was refused when accessed from the NodeManager, while non-Spark MR jobs executed normally. Checking the client's configuration files revealed that in the client's /etc/hosts one IP was mapped to several hostnames. The Driver by default picks up the last mapping, say hostB, while the NodeManager side is configured with a different hostname, hostA, so the connection fails. To avoid disturbing other programs that depend on the client's host list, the property spark.driver.host was set in spark-defaults.conf to specify the Driver host used for communicating with YARN in yarn-client mode, and yarn-client mode then ran normally.

After adding that setting, however, yarn-cluster mode stopped working. The culprit had to be that very parameter: once it was commented out, yarn-cluster mode ran again. The reason is that in yarn-cluster mode the Spark entry point runs on the client, but the rest of the Driver runs inside the ApplicationMaster; setting spark.driver.host effectively pins the ApplicationMaster's address, whereas in yarn-cluster mode the ApplicationMaster is actually placed on a node chosen by the ResourceManager.

So the fix is simply to comment the parameter out. Patience pays off when hunting bugs.
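Below is a minimal sketch of the kind of sanity-check job described above, assuming Spark on YARN and the RDD API; the object name and the HDFS path are placeholders, not taken from the original program.

import org.apache.spark.{SparkConf, SparkContext}

// Minimal sanity check: no tuning parameters, no business logic,
// just read a text file from HDFS and print a few lines to the console.
object DriverHostSanityCheck {
  def main(args: Array[String]): Unit = {
    // Deliberately do NOT set spark.driver.host here. In yarn-cluster mode the
    // driver runs inside the ApplicationMaster on a node picked by the
    // ResourceManager, so pinning the host leads to RPC failures like the
    // "Failed to connect to localhost/127.0.0.1:41640" error above.
    val conf = new SparkConf().setAppName("driver-host-sanity-check")
    val sc   = new SparkContext(conf)

    sc.textFile("hdfs:///tmp/sample.txt")  // hypothetical input path
      .take(20)
      .foreach(println)

    sc.stop()
  }
}

If yarn-client mode does need an explicit driver address because of an ambiguous /etc/hosts, the workaround described in this post is to set spark.driver.host in spark-defaults.conf (or pass it per job, e.g. --conf spark.driver.host=<client-host> on spark-submit) for client mode only, and to leave it unset when submitting with --deploy-mode cluster.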

Source: https://www.cnblogs.com/yzqyxq/p/12571207.html
