Tracking keywords in a live stream of tweets (answers)

【Question title】: Tracking keywords in a live stream of tweets
【Posted】: 2012-08-08 10:55:57
【Question】:

I installed tweepy and tried it out. I am currently using the following function:

From the API Reference:

API.public_timeline()

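Returns the 20 most recent statuses from non-protected users who have set a custom user icon. The public timeline is cached for 60 seconds, so requesting it more often than that is a waste of resources.

However, I want to extract every tweet from the complete live stream that matches a certain regular expression. I could put public_timeline() inside a while True loop, but that would probably run into rate-limiting problems. Either way, I don't really think it can cover all current tweets.

How can this be done? If not all tweets, then I would like to extract as many tweets as possible that match a certain keyword.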
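Whichever source ends up delivering the tweets, the matching step itself does not depend on the API. A minimal sketch, assuming each tweet arrives as a dict with a `text` key; the helper name `matching_tweets` and the sample data are hypothetical, not part of tweepy:

```python
import re


def matching_tweets(tweets, pattern):
    """Yield the tweets whose text matches a compiled regular expression."""
    for tweet in tweets:
        text = tweet.get("text")
        # Some stream messages (deletions, notices) carry no text at all.
        if text and pattern.search(text):
            yield tweet


# Hypothetical sample data standing in for decoded stream messages.
sample = [
    {"id": 1, "text": "Learning Python with tweepy"},
    {"id": 2, "text": "Nothing to see here"},
    {"delete": {"status": {"id": 3}}},
    {"id": 4, "text": "python streaming is fun"},
]

pattern = re.compile(r"\bpython\b", re.IGNORECASE)
matched_ids = [t["id"] for t in matching_tweets(sample, pattern)]
```

Because the helper is a generator, it can be fed a list like the one here or an open-ended iterator without any change.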

【Question comments】:

Tags: python twitter tweepy

【Solution 1】:

The streaming API is what you want. I use a library called tweetstream. Here is my basic listening function:

import time

import tweetstream

# username, password and db_initpop() are assumed to be defined elsewhere in this module.

def retrieve_tweets(numtweets=10, *args):
    """
    This function optionally takes one or more arguments as keywords to filter tweets.
    It iterates through tweets from the stream that meet the given criteria and sends them
    to the database population function on a per-instance basis, so as to avoid disaster
    if the stream is disconnected.

    Both SampleStream and FilterStream methods access Twitter's stream of status elements.
    For status element documentation, (including proper arguments for tweet['arg'] as seen
    below) see https://dev.twitter.com/docs/api/1/get/statuses/show/%3Aid.
    """
    filters = [str(key) for key in args]
    if not filters:
        # no keywords given: read the random sample stream
        stream = tweetstream.SampleStream(username, password)
    else:
        stream = tweetstream.FilterStream(username, password, track=filters)
    try:
        count = 0
        while count < numtweets:
            for tweet in stream:
                # a check is needed on text as some "tweets" are actually just API operations
                # the language selection doesn't really work but it's better than nothing(?)
                if tweet.get('text') and tweet['user']['lang'] == 'en':
                    if tweet['retweet_count'] == 0:
                        # bundle up the features I want and send them to the db population function
                        bundle = (tweet['id'], tweet['user']['screen_name'],
                                  tweet['retweet_count'], tweet['text'])
                        db_initpop(bundle)
                        break
                    else:
                        # a RT has a different structure.  This bundles the original tweet.  Getting the
                        # retweets comes later, after the stream is de-accessed.
                        bundle = (tweet['retweeted_status']['id'],
                                  tweet['retweeted_status']['user']['screen_name'],
                                  tweet['retweet_count'], tweet['retweeted_status']['text'])
                        db_initpop(bundle)
                        break
            count += 1
    except tweetstream.ConnectionError as e:
        print('Disconnected from Twitter at %s.  Reason: %s'
              % (time.strftime("%d %b %Y %H:%M:%S", time.localtime()), e.reason))

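It has been a while since I looked at this, but I am fairly sure the library only accesses the sample stream (not the firehose). HTH.

Edited to add: you said you want the "complete live stream", i.e. the firehose. That is expensive both financially and technically, and only very large companies are allowed to have it. Look at the documentation and you will see that the sample is basically representative.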

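【Comments】:

【Solution 2】:

Take a look at the streaming API. You can even subscribe to a list of words that you define, and only tweets matching those words are returned.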

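Rate limiting works differently for the streaming API: each IP gets 1 connection and a maximum number of events per second. If more events occur than that, you only get the maximum anyway, and you are notified of how many events you missed because of rate limiting.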
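The "you are notified" part can be handled in the client by telling ordinary tweets apart from limit notices. A rough sketch, assuming the stream delivers one JSON object per line and that a limit notice has the shape `{"limit": {"track": N}}`, with N being the running total of undelivered tweets since the connection was opened. The function name `split_stream_messages` and the sample lines are hypothetical:

```python
import json


def split_stream_messages(lines):
    """Separate tweets from limit notices in an iterable of raw JSON lines."""
    tweets = []
    missed = 0
    for line in lines:
        line = line.strip()
        if not line:
            # Blank lines are treated as keep-alives and skipped.
            continue
        message = json.loads(line)
        if "limit" in message:
            # Assumed to be a cumulative count, so keep the largest value seen.
            missed = max(missed, message["limit"].get("track", 0))
        elif "text" in message:
            tweets.append(message)
    return tweets, missed


# Hypothetical raw lines standing in for what the connection would deliver.
raw_lines = [
    '{"id": 1, "text": "python rocks"}',
    "",
    '{"limit": {"track": 12}}',
    '{"id": 2, "text": "more python"}',
    '{"limit": {"track": 30}}',
]
tweets, missed = split_stream_messages(raw_lines)
```

Tracking `missed` alongside the stored tweets gives a rough idea of how much of the matching volume the filter actually captured.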

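My understanding is that the streaming API is best suited to servers that redistribute the content to your users as needed, rather than being accessed directly by your users: standing connections are expensive, and Twitter starts blacklisting IPs after too many failed connections and reconnections, and possibly your API key after that.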
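One common way to avoid hammering the endpoint with reconnections is to wait longer after each failure. The sketch below is only an illustration of that idea: the 5-second start, the 320-second cap, the doubling factor and the names `backoff_delays` and `connect_with_backoff` are all assumptions, so check Twitter's current reconnection guidance for the real values.

```python
import time


def backoff_delays(initial=5.0, maximum=320.0, factor=2.0):
    """Yield reconnect delays in seconds, growing geometrically up to a cap."""
    delay = initial
    while True:
        yield min(delay, maximum)
        delay = min(delay * factor, maximum)


def connect_with_backoff(connect, max_attempts=5, sleep=time.sleep):
    """Call connect() until it succeeds, sleeping between failed attempts."""
    delays = backoff_delays()
    for _ in range(max_attempts):
        try:
            return connect()
        except IOError:
            # Wait before the next attempt; the wait grows on every failure.
            sleep(next(delays))
    raise RuntimeError("giving up after %d failed attempts" % max_attempts)
```

Here `connect` stands for any callable that opens the stream and raises `IOError` on failure; passing `sleep` as a parameter makes the helper easy to exercise without real waiting.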

【Comments】: