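【Posted】: 2020-11-25 18:08:04
【Question】:
I am using a library called twitterscraper, which scrapes tweets from any given URL. I gave it the URL of a tweet's reply page, and it successfully scraped the tweets shown on that page (except the tweet at the URL itself, but I already have that one). The problem is that, while debugging, I cannot find any of the elements it scrapes in the response HTML itself. When I search for the tweet text, I cannot find that either. The tweets simply are not there.
This is where it gets the response: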
response = requests.get(url, headers=HEADER, proxies={"http": proxy}, timeout=timeout)
### some code
html = response.text
The call to from_html:
tweets = list(Tweet.from_html(html))
The bs4 find_all call that locates and parses the tweets:
def from_html(cls, html):
    soup = BeautifulSoup(html, "lxml")  # no li element with js-stream-item class found when I looked through the html
    tweets = soup.find_all('li', 'js-stream-item')  # but it still finds the li elements with tweets in them?
    if tweets:
        for tweet in tweets:
            try:
                yield cls.from_soup(tweet)
            except AttributeError:
                pass
            except TypeError:
                pass
What is going on here?
While debugging, I copied the value of the html variable in VS Code and searched through it. Link to bs4's find_all method: https://beautiful-soup-4.readthedocs.io/en/latest/#find-all. Link to the reply URL: https://twitter.com/renderwonk/status/1290793272353239040
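For reference, here is a minimal sketch of how find_all('li', 'js-stream-item') matches elements. A string passed as the second positional argument is treated as a CSS class, and it matches any element whose class attribute contains that token, even when other classes are listed alongside it. The sample markup is made up for illustration (it is not Twitter's actual response), find_stream_items is a hypothetical helper, and html.parser is used instead of lxml only to avoid an extra dependency. If the real markup carries several classes per element, a literal editor search for class="js-stream-item" could come up empty even though the elements are present.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only; not Twitter's real response.
SAMPLE_HTML = """
<ol id="stream-items-id">
  <li class="js-stream-item stream-item" data-item-id="1">first</li>
  <li class="stream-item js-stream-item" data-item-id="2">second</li>
  <li class="stream-item" data-item-id="3">no match</li>
</ol>
"""


def find_stream_items(html):
    """Return the <li> elements whose class list contains 'js-stream-item'."""
    soup = BeautifulSoup(html, "html.parser")
    # Same as soup.find_all("li", class_="js-stream-item").
    return soup.find_all("li", "js-stream-item")


items = find_stream_items(SAMPLE_HTML)
matched_ids = [li["data-item-id"] for li in items]

# A naive text search for the exact attribute string misses both matches,
# because neither class attribute consists of that single token.
literal_hit = 'class="js-stream-item"' in SAMPLE_HTML
```

Searching the raw text for the bare token js-stream-item (without the surrounding class="...") is a closer equivalent of what find_all does here.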
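The function the library provides for scraping a URL (I made one change near the top, where the original line is commented out: instead of giving it a query, I pass the URL itself):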
def query_single_page(query, lang, pos, retry=50, from_user=False, timeout=60, use_proxy=True):
    """
    Returns tweets from the given URL.
    :param query: The query parameter of the query url
    :param lang: The language parameter of the query url
    :param pos: The query url parameter that determines where to start looking
    :param retry: Number of retries if something goes wrong.
    :return: The list of tweets, the pos argument for getting the next page.
    """
    #url = get_query_url(query, lang, pos, from_user)
    url = query
    logger.info('Scraping tweets from {}'.format(url))
    try:
        if use_proxy:
            proxy = next(proxy_pool)
            logger.info('Using proxy {}'.format(proxy))
            response = requests.get(url, headers=HEADER, proxies={"http": proxy}, timeout=timeout)
        else:
            print('not using proxy')
            response = requests.get(url, headers=HEADER, timeout=timeout)
        if pos is None:  # html response
            html = response.text or ''
            json_resp = None
        else:
            html = ''
            try:
                json_resp = response.json()
                html = json_resp['items_html'] or ''
            except (ValueError, KeyError) as e:
                logger.exception('Failed to parse JSON while requesting "{}"'.format(url))
        tweets = list(Tweet.from_html(html))
        if not tweets:
            try:
                if json_resp:
                    pos = json_resp['min_position']
                    has_more_items = json_resp['has_more_items']
                    if not has_more_items:
                        logger.info("Twitter returned : 'has_more_items' ")
                        return [], None
                else:
                    pos = None
            except:
                pass
            if retry > 0:
                logger.info('Retrying... (Attempts left: {})'.format(retry))
                return query_single_page(query, lang, pos, retry - 1, from_user, use_proxy=use_proxy)
            else:
                return [], pos
        if json_resp:
            return tweets, urllib.parse.quote(json_resp['min_position'])
        if from_user:
            return tweets, tweets[-1].tweet_id
        return tweets, "TWEET-{}-{}".format(tweets[-1].tweet_id, tweets[0].tweet_id)
    except requests.exceptions.HTTPError as e:
        logger.exception('HTTPError {} while requesting "{}"'.format(
            e, url))
    except requests.exceptions.ConnectionError as e:
        logger.exception('ConnectionError {} while requesting "{}"'.format(
            e, url))
    except requests.exceptions.Timeout as e:
        logger.exception('TimeOut {} while requesting "{}"'.format(
            e, url))
    except json.decoder.JSONDecodeError as e:
        logger.exception('Failed to parse JSON "{}" while requesting "{}".'.format(
            e, url))
    if retry > 0:
        logger.info('Retrying... (Attempts left: {})'.format(retry))
        return query_single_page(query, lang, pos, retry - 1, use_proxy=use_proxy)
    logger.error('Giving up.')
    return [], None
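The result of the find_all call inside the from_html method:
The HTML in which bs4 finds the elements above. I copied it while debugging: https://codeshare.io/ad8qNe (copy it into an editor and turn on word wrap)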
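One way to rule out copy-and-paste problems is to write the string to a file and search that file, rather than the value shown in the debugger, which may be shortened for display (that is an assumption about the debugger view, not something confirmed here). The sketch below is only illustrative: inspect_html, the output file name, and the stand-in string are all hypothetical, and html.parser replaces lxml to keep the snippet dependency-free.

```python
import os
import tempfile

from bs4 import BeautifulSoup


def inspect_html(html, path, css_class="js-stream-item"):
    """Save html to path and report how often css_class shows up in it."""
    with open(path, "w", encoding="utf-8") as fh:
        fh.write(html)
    soup = BeautifulSoup(html, "html.parser")
    return {
        "length": len(html),                        # size of the full string
        "raw_occurrences": html.count(css_class),   # plain substring search
        "li_matches": len(soup.find_all("li", css_class)),
    }


# Stand-in string; while debugging this would be response.text.
sample = '<ol><li class="js-stream-item stream-item">hello</li></ol>'
out_path = os.path.join(tempfile.gettempdir(), "twitter_response.html")
report = inspect_html(sample, out_path)
```

If raw_occurrences is above zero for the real response while the editor search finds nothing, the copied text is probably incomplete rather than the parser doing anything unusual.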
Does this have something to do with JavaScript?
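One way to reason about the JavaScript angle: requests returns the markup as the server sent it and does not run scripts, and BeautifulSoup only parses that static text. The made-up page below (not Twitter's markup) sketches the consequence: an element that a script would write into the page is invisible to find_all, so anything find_all does return must already be present in the response text. html.parser is used here for simplicity.

```python
from bs4 import BeautifulSoup

# Hypothetical page: one <li> is static, the other would only exist
# after a browser executed the script.
PAGE = """
<ol>
  <li class="js-stream-item">static tweet</li>
</ol>
<script>
  document.write('<li class="js-stream-item">injected tweet</li>');
</script>
"""

soup = BeautifulSoup(PAGE, "html.parser")
# The script body is kept as plain text, so only the static <li> is an element.
found = [li.get_text(strip=True) for li in soup.find_all("li", "js-stream-item")]
```

Here found contains only the static entry, which suggests that tweets returned by from_html come from the server's HTML and not from script execution.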
【Comments】:
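- Please post the link to the tweet reply.
- Can you post the twitterscraper code and the result you are getting?
- OK, I have posted the function used for scraping, the HTML response, and the result of bs4's find_all.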
Tags: python twitter beautifulsoup