
A Minefield Diary: Collecting Data into HDFS with Flume

2021-03-11 09:32:49 · Views: 598 · Source: Internet

Tags: Flume, HDFS, troubleshooting, java, flume, hadoop, FileSystem, apache, org




Foreword

       This article should help anyone who wants to know what to watch out for when collecting data into HDFS with Flume. To simulate a realistic environment, I set up a temporary virtual machine: the data first lands under Tomcat on that machine, and we then ship it to HDFS on a second virtual machine.

Versions used in this environment:

  • apache-tomcat-8.5.63
  • flume-ng-1.6.0-cdh5.14.2
  • hadoop-2.6.0-cdh5.14.2

1. Flume Agent Configuration

Without further ado, here is the agent configuration, with a brief comment explaining each line.
(If anything is still unclear, see the official Flume user guide.)

# Name the source, channel, and sink
a1.channels = c1
a1.sources = s1
a1.sinks = k1

# Use a Spooling Directory Source (a source dedicated to ingesting files)
a1.sources.s1.type = spooldir
a1.sources.s1.channels = c1
# Directory to spool files from
a1.sources.s1.spoolDir = /opt/software/tomcat8563/webapps/mycurd/log
# Input character encoding (Flume defaults to UTF-8; my logs are GBK)
a1.sources.s1.inputCharset = GBK

# Use a File Channel
a1.channels.c1.type = file
# Checkpoint directory
a1.channels.c1.checkpointDir = /opt/flume/checkpoint
# Data directory
a1.channels.c1.dataDirs = /opt/flume/data

# Use an HDFS Sink
a1.sinks.k1.type = hdfs
# HDFS target path (with time-based escape sequences appended)
a1.sinks.k1.hdfs.path = hdfs://192.168.237.130:9000/upload/%Y%m%d
# File name prefix
a1.sinks.k1.hdfs.filePrefix = upload-
# Use the local timestamp to resolve the escape sequences
a1.sinks.k1.hdfs.useLocalTimeStamp = true
# Number of events flushed to HDFS per batch
a1.sinks.k1.hdfs.batchSize = 100
# File format (DataStream writes events as-is, without a SequenceFile wrapper)
a1.sinks.k1.hdfs.fileType = DataStream
# Seconds to wait before rolling to the next file
a1.sinks.k1.hdfs.rollInterval = 600
# Maximum size in bytes of the current file before rolling (just under 128 MB)
a1.sinks.k1.hdfs.rollSize = 134217700
# Number of events before rolling a file (0 disables count-based rolling)
a1.sinks.k1.hdfs.rollCount = 0
# Minimum number of HDFS block replicas (1 avoids premature rolls triggered by replication)
a1.sinks.k1.hdfs.minBlockReplicas = 1
# Bind the sink to the channel
a1.sinks.k1.channel = c1

TIP: the channel's checkpointDir and dataDirs directories must be created on the VM ahead of time!
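A quick way to create both directories up front (the paths are taken from the channel configuration above; adjust them if yours differ):

```shell
# Create the File Channel's checkpoint and data directories before starting
# the agent; paths mirror the configuration shown above.
mkdir -p /opt/flume/checkpoint /opt/flume/data
```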


2. Clearing the Error Minefield

With the configuration above in place, I was (as you no doubt would be) eager to start the agent and try it out:

flume-ng agent --name a1 --conf /opt/software/flume160/conf/ -f /opt/flumeconf/file-hdfs.conf -Dflume.root.logger=DEBUG,console

Then one bucket of cold water arrived after another. Let's see which ones spoiled the fun:


org/apache/hadoop/io/SequenceFile$CompressionType

2021-03-10 23:58:20,087 (conf-file-poller-0) [ERROR - org.apache.flume.node.PollingPropertiesFileConfigurationProvider$FileWatcherRunnable.run(PollingPropertiesFileConfigurationProvider.java:146)] Failed to start agent because dependencies were not found in classpath. Error follows.
java.lang.NoClassDefFoundError: org/apache/hadoop/io/SequenceFile$CompressionType
        at org.apache.flume.sink.hdfs.HDFSEventSink.configure(HDFSEventSink.java:235)
        at org.apache.flume.conf.Configurables.configure(Configurables.java:41)
        at org.apache.flume.node.AbstractConfigurationProvider.loadSinks(AbstractConfigurationProvider.java:411)
        at org.apache.flume.node.AbstractConfigurationProvider.getConfiguration(AbstractConfigurationProvider.java:102)
        at org.apache.flume.node.PollingPropertiesFileConfigurationProvider$FileWatcherRunnable.run(PollingPropertiesFileConfigurationProvider.java:141)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.io.SequenceFile$CompressionType
        at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
        ... 12 more

Solution: copy the following Hadoop jar into Flume's lib directory.

Jar: ${HADOOP_HOME}/share/hadoop/common/hadoop-common-2.6.0-cdh5.14.2.jar


org/apache/commons/configuration/Configuration

2021-03-11 08:45:13,867 (SinkRunner-PollingRunner-DefaultSinkProcessor) [ERROR - org.apache.flume.sink.hdfs.HDFSEventSink.process(HDFSEventSink.java:447)] process failed
java.lang.NoClassDefFoundError: org/apache/commons/configuration/Configuration
        at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.<init>(DefaultMetricsSystem.java:38)
        at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.<clinit>(DefaultMetricsSystem.java:36)
        at org.apache.hadoop.security.UserGroupInformation$UgiMetrics.create(UserGroupInformation.java:139)
        at org.apache.hadoop.security.UserGroupInformation.<clinit>(UserGroupInformation.java:259)
        at org.apache.hadoop.fs.FileSystem$Cache$Key.<init>(FileSystem.java:2979)
        at org.apache.hadoop.fs.FileSystem$Cache$Key.<init>(FileSystem.java:2971)
        at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2834)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:387)
        at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
        at org.apache.flume.sink.hdfs.BucketWriter$1.call(BucketWriter.java:260)
        at org.apache.flume.sink.hdfs.BucketWriter$1.call(BucketWriter.java:252)
        at org.apache.flume.sink.hdfs.BucketWriter$9$1.run(BucketWriter.java:701)
        at org.apache.flume.auth.SimpleAuthenticator.execute(SimpleAuthenticator.java:50)
        at org.apache.flume.sink.hdfs.BucketWriter$9.call(BucketWriter.java:698)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassNotFoundException: org.apache.commons.configuration.Configuration
        at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
        ... 18 more

Solution: copy the following Hadoop jar into Flume's lib directory.

Jar: ${HADOOP_HOME}/share/hadoop/common/lib/commons-configuration-1.6.jar


org/apache/hadoop/util/PlatformName

Exception in thread "SinkRunner-PollingRunner-DefaultSinkProcessor" java.lang.NoClassDefFoundError: org/apache/hadoop/util/PlatformName
        at org.apache.hadoop.security.UserGroupInformation.getOSLoginModuleName(UserGroupInformation.java:442)
        at org.apache.hadoop.security.UserGroupInformation.<clinit>(UserGroupInformation.java:487)
        at org.apache.hadoop.fs.FileSystem$Cache$Key.<init>(FileSystem.java:2979)
        at org.apache.hadoop.fs.FileSystem$Cache$Key.<init>(FileSystem.java:2971)
        at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2834)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:387)
        at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
        at org.apache.flume.sink.hdfs.BucketWriter$1.call(BucketWriter.java:260)
        at org.apache.flume.sink.hdfs.BucketWriter$1.call(BucketWriter.java:252)
        at org.apache.flume.sink.hdfs.BucketWriter$9$1.run(BucketWriter.java:701)
        at org.apache.flume.auth.SimpleAuthenticator.execute(SimpleAuthenticator.java:50)
        at org.apache.flume.sink.hdfs.BucketWriter$9.call(BucketWriter.java:698)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.util.PlatformName
        at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
        ... 16 more

Solution: copy the following Hadoop jar into Flume's lib directory.

Jar: ${HADOOP_HOME}/share/hadoop/common/lib/hadoop-auth-2.6.0-cdh5.14.2.jar


org/apache/htrace/core/Tracer$Builder

2021-03-11 09:07:27,157 (SinkRunner-PollingRunner-DefaultSinkProcessor) [ERROR - org.apache.flume.sink.hdfs.HDFSEventSink.process(HDFSEventSink.java:447)] process failed
java.lang.NoClassDefFoundError: org/apache/htrace/core/Tracer$Builder
        at org.apache.hadoop.fs.FsTracer.get(FsTracer.java:42)
        at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2803)
        at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:98)
        at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2853)
        at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2835)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:387)
        at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
        at org.apache.flume.sink.hdfs.BucketWriter$1.call(BucketWriter.java:260)
        at org.apache.flume.sink.hdfs.BucketWriter$1.call(BucketWriter.java:252)
        at org.apache.flume.sink.hdfs.BucketWriter$9$1.run(BucketWriter.java:701)
        at org.apache.flume.auth.SimpleAuthenticator.execute(SimpleAuthenticator.java:50)
        at org.apache.flume.sink.hdfs.BucketWriter$9.call(BucketWriter.java:698)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

Solution: copy the following Hadoop jar into Flume's lib directory.

Jar: ${HADOOP_HOME}/share/hadoop/common/lib/htrace-core4-4.0.1-incubating.jar


No FileSystem for scheme: hdfs

2021-03-11 09:14:59,911 (SinkRunner-PollingRunner-DefaultSinkProcessor) [WARN - org.apache.flume.sink.hdfs.HDFSEventSink.process(HDFSEventSink.java:443)] HDFS IO error
java.io.IOException: No FileSystem for scheme: hdfs
        at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2796)
        at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2810)
        at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:98)
        at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2853)
        at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2835)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:387)
        at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
        at org.apache.flume.sink.hdfs.BucketWriter$1.call(BucketWriter.java:260)
        at org.apache.flume.sink.hdfs.BucketWriter$1.call(BucketWriter.java:252)
        at org.apache.flume.sink.hdfs.BucketWriter$9$1.run(BucketWriter.java:701)
        at org.apache.flume.auth.SimpleAuthenticator.execute(SimpleAuthenticator.java:50)
        at org.apache.flume.sink.hdfs.BucketWriter$9.call(BucketWriter.java:698)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

Solution: copy the following Hadoop jar into Flume's lib directory.

Jar: ${HADOOP_HOME}/share/hadoop/hdfs/hadoop-hdfs-2.6.0-cdh5.14.2.jar
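All five missing-class errors above share one root cause: Flume's classpath lacks the Hadoop client jars. A sketch that copies them in one pass — the jar names match this CDH 5.14.2 install, the FLUME_HOME default comes from the startup command earlier, and the HADOOP_HOME default is an assumption you should replace with your own path:

```shell
# Copy the Hadoop jars required by the HDFS sink into Flume's lib directory.
# Versions match hadoop-2.6.0-cdh5.14.2; the default paths below are
# assumptions -- override HADOOP_HOME / FLUME_HOME for your install.
HADOOP_HOME="${HADOOP_HOME:-/opt/software/hadoop260}"
FLUME_HOME="${FLUME_HOME:-/opt/software/flume160}"

copied=0
for jar in \
    share/hadoop/common/hadoop-common-2.6.0-cdh5.14.2.jar \
    share/hadoop/common/lib/commons-configuration-1.6.jar \
    share/hadoop/common/lib/hadoop-auth-2.6.0-cdh5.14.2.jar \
    share/hadoop/common/lib/htrace-core4-4.0.1-incubating.jar \
    share/hadoop/hdfs/hadoop-hdfs-2.6.0-cdh5.14.2.jar
do
    if [ -f "$HADOOP_HOME/$jar" ]; then
        cp "$HADOOP_HOME/$jar" "$FLUME_HOME/lib/" && copied=$((copied+1))
    else
        echo "missing: $HADOOP_HOME/$jar (check HADOOP_HOME)" >&2
    fi
done
echo "copied $copied jar(s) into $FLUME_HOME/lib"
```

Copying them all up front saves you from restarting the agent once per missing class, which is exactly the loop I went through above.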


java.nio.charset.MalformedInputException

2021-03-10 22:07:14,385 (pool-5-thread-1) [ERROR - org.apache.flume.source.SpoolDirectorySource$SpoolDirectoryRunnable.run(SpoolDirectorySource.java:280)] FATAL: Spool Directory source s1: { spoolDir: /opt/software/tomcat8563/webapps/mycurd/log }: Uncaught exception in SpoolDirectorySource thread. Restart or reconfigure Flume to continue processing.
java.nio.charset.MalformedInputException: Input length = 1
        at java.nio.charset.CoderResult.throwException(CoderResult.java:281)
        at org.apache.flume.serialization.ResettableFileInputStream.readChar(ResettableFileInputStream.java:283)
        at org.apache.flume.serialization.LineDeserializer.readLine(LineDeserializer.java:132)
        at org.apache.flume.serialization.LineDeserializer.readEvent(LineDeserializer.java:70)
        at org.apache.flume.serialization.LineDeserializer.readEvents(LineDeserializer.java:89)
        at org.apache.flume.client.avro.ReliableSpoolingFileEventReader.readDeserializerEvents(ReliableSpoolingFileEventReader.java:343)
        at org.apache.flume.client.avro.ReliableSpoolingFileEventReader.readEvents(ReliableSpoolingFileEventReader.java:318)
        at org.apache.flume.source.SpoolDirectorySource$SpoolDirectoryRunnable.run(SpoolDirectorySource.java:250)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

Solution: add a character-set setting for the source in the agent configuration (Flume's default is UTF-8):

a1.sources.s1.inputCharset = GBK
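Alternatively, if you would rather keep everything UTF-8 downstream, the GBK logs can be converted before they are dropped into the spool directory (Flume expects spooled files to be immutable, so convert first, then move in). A sketch with iconv; the file names here are illustrative, not from the original setup:

```shell
# Convert a GBK-encoded log to UTF-8 before placing it in the spool directory.
# 'access.log.gbk' is an illustrative file name; stand-in content for the demo:
printf 'hello\n' > access.log.gbk
iconv -f GBK -t UTF-8 access.log.gbk -o access.log.utf8
cat access.log.utf8
```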

java.lang.OutOfMemoryError: GC overhead limit exceeded

Cause: JVM heap exhaustion (out of memory).

Solution: go into ${FLUME_HOME}/bin and edit the flume-ng script:

# set default params
FLUME_CLASSPATH=""
FLUME_JAVA_LIBRARY_PATH=""
JAVA_OPTS="-Xmx1024m"  # raise the JVM heap limit
LD_LIBRARY_PATH=""
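Patching the launcher script works, but the same setting can also live in conf/flume-env.sh, which flume-ng sources from the --conf directory on startup, so it survives upgrades of the script itself. A sketch (the file normally sits in $FLUME_HOME/conf; here it is created in the current directory for illustration, and the 1 GiB heap mirrors the fix above — size it to your own load):

```shell
# Put the heap setting in flume-env.sh instead of editing flume-ng directly.
# In a real install this file lives in $FLUME_HOME/conf/flume-env.sh.
echo 'export JAVA_OPTS="-Xmx1024m"' >> flume-env.sh
cat flume-env.sh
```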

PS: I am a newcomer to programming, so if anything here is wrong or poorly written, please leave your valuable comments or suggestions. And if this post helped you, I would appreciate a like!


Original author: wsjslient

Author's homepage: https://blog.csdn.net/wsjslient


Source: https://blog.csdn.net/wsjslient/article/details/114648813
