【发布时间】:2015-08-27 20:56:03
【问题描述】:
我想我已经尝试了所有更改配置文件的组合。我还在某处看到这可能是由于我的复制因子为 3,所以我将其更改为 1。我在 AWS 上使用 cloudera manager。下面是我的配置文件,有什么想法吗?
在 HDFS 中,文件大小都在 20kb 以下,试图获得至少 40-50mb。有趣的是,相同的配置文件正在我正在练习的虚拟机上写入约 60mb 的文件(预安装的 hadoop + 工具)。请参阅下面的配置文件,有什么想法吗?
# The configuration file needs to define the sources,
# the channels and the sinks.
# Sources, channels and sinks are defined per agent,
# in this case called 'TwitterAgent'
TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = HDFS
TwitterAgent.sources.Twitter.type = com.cloudera.flume.source.TwitterSource
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sources.Twitter.consumerKey = xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
TwitterAgent.sources.Twitter.consumerSecret = xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
TwitterAgent.sources.Twitter.accessToken = xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
TwitterAgent.sources.Twitter.accessTokenSecret = xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
TwitterAgent.sources.Twitter.keywords = apple, grapes, fruits, strawberry, mango, pear
TwitterAgent.sinks.HDFS.channel = MemChannel
TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.hdfs.path = hdfs://123.456.789.us-west-2.compute.amazonaws.com:8020/user/flume/tweets
TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
TwitterAgent.sinks.HDFS.hdfs.writeFormat = Text
TwitterAgent.sinks.HDFS.hdfs.rollInterval = 0
TwitterAgent.sinks.HDFS.hdfs.batchSize = 100000
TwitterAgent.sinks.HDFS.hdfs.rollSize = 0
TwitterAgent.sinks.HDFS.hdfs.rollCount = 0
TwitterAgent.channels.MemChannel.type = memory
TwitterAgent.channels.MemChannel.capacity = 10000
TwitterAgent.channels.MemChannel.transactionCapacity = 1000
【问题讨论】:
-
你检查过这个解决方案了吗? stackoverflow.com/questions/22145899/…
标签: hadoop hdfs twitter4j flume