【Question title】: re.findall not finding all, only some. How could this be? [closed]
【Posted】: 2020-05-02 21:14:10
【Question】:

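I have a text file containing five websites. Each of these sites has multiple Amazon links, and my goal is to collect all of them. However, one of the five sites uses "amzn.to" rather than "amazon.com" for its Amazon links, which I initially thought I could handle simply with this: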

any(re.findall(r'(amazon.com|amzn.to)', str, re.IGNORECASE))

My overall list of Amazon links should contain ten amzn.to links, but only two are found.

Here is my full code:

import requests
import re
from bs4 import BeautifulSoup
from collections import OrderedDict

file_name = raw_input("Enter file name: ")
filepath = "%s"%(file_name)

with open(filepath) as f:
    listoflinks = [line.rstrip('\n') for line in f]

raw_links = []
for i in listoflinks:
    html = requests.get(i).text
    bs = BeautifulSoup(html)
    possible_links = bs.find_all('a')
    for link in possible_links:
        if link.has_attr('href'):
            raw_links.append(link.attrs['href'])

amazon_links = []
for str in raw_links:
    if (any(re.findall(r'(amazon.com|amzn.to)', str, re.IGNORECASE))) and (str not in amazon_links):
        amazon_links.append(str)

for i in amazon_links:
    print i
print len(amazon_links)

I know it works, but not as well as I would like. Please help me pinpoint the problem.
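
For readers on Python 3, the filtering step of the code above could be sketched as follows. This is only an illustration and not a diagnosis of the missing links: the helper name `filter_amazon_links` and the sample URLs are made up for the example, the dots in the pattern are escaped so they match a literal period, and the loop variable no longer shadows the built-in `str`.

```python
import re

# Dots are escaped so they match a literal period, not any character.
AMAZON_PATTERN = re.compile(r'amazon\.com|amzn\.to', re.IGNORECASE)

def filter_amazon_links(raw_links):
    """Return links that mention an Amazon domain, deduplicated, in original order."""
    matching = (link for link in raw_links if AMAZON_PATTERN.search(link))
    # dict.fromkeys keeps the first occurrence of each key and
    # preserves insertion order (guaranteed since Python 3.7).
    return list(dict.fromkeys(matching))

# Hypothetical sample data standing in for the scraped hrefs.
sample = [
    'https://amzn.to/2N9WWyn',
    'https://example.com/about',
    'https://www.amazon.com/dp/B006J23OGU',
    'https://amzn.to/2N9WWyn',  # duplicate, should be dropped
]
amazon_links = filter_amazon_links(sample)
for link in amazon_links:
    print(link)
print(len(amazon_links))
```

`re.search` is enough here because only a yes/no answer is needed; `re.findall` builds a list of every match, which `any(...)` then reduces to the same boolean.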

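【Comments】:

  • Could you add some sample data?
  • Without the data file (or a shorter example that also shows the problem), it is hard to tell you exactly what is wrong. What immediately stands out is that your regular expression contains ., where you want \., because you want to match a literal period, not any character. Also note that with the given expression you will match more than you intend, for example 'http://mymalware.haha/amazon.com/ransomware.exe'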
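
One way to address the over-matching concern raised in the comment is to look at the host part of each URL instead of searching the whole string. The sketch below uses the standard library's `urllib.parse.urlparse`; the function name `is_amazon_link` and the `AMAZON_HOSTS` tuple are illustrative, and only the two domains mentioned in the question are covered.

```python
from urllib.parse import urlparse

# Only the two domains named in the question; extend as needed.
AMAZON_HOSTS = ('amazon.com', 'amzn.to')

def is_amazon_link(url):
    """True if the URL's host is one of AMAZON_HOSTS or a subdomain of one."""
    # hostname is None for relative links such as '/about', hence the fallback.
    host = (urlparse(url).hostname or '').lower()
    return any(host == domain or host.endswith('.' + domain)
               for domain in AMAZON_HOSTS)

print(is_amazon_link('https://amzn.to/2N9WWyn'))
print(is_amazon_link('https://www.amazon.com/dp/B006J23OGU'))
print(is_amazon_link('http://mymalware.haha/amazon.com/ransomware.exe'))
```

Under this check the first two URLs are accepted and the third is rejected, because `amazon.com` appears only in its path. A link written without a scheme (for example `amzn.to/abc`) has no parsed host and would also be rejected, so such links would need to be normalized first.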

Tags: python regex web-scraping beautifulsoup


【Solution 1】:

A solution using SimplifiedDoc.

from simplified_scrapy.request import req
from simplified_scrapy.simplified_doc import SimplifiedDoc
html = req.get('https://www.shifu.com/best-shower-curtain-rods/')
doc = SimplifiedDoc(html)
amazon_links = doc.getElements('a')
amazon_links = amazon_links.containsOr(['amazon.com','amzn.to'],attr='href')
print ([a.href for a in amazon_links])

Result:

['https://www.amazon.com/InterDesign-Constant-Tension-Shower-Curtain/dp/B006J23OGU/ref=as_li_ss_il?ie=UTF8&qid=1531507667&sr=8-1-spons&keywords=InterDesign+Cameo+Constant+Tension+Shower+Curtain+Rod&th=1&linkCode=li2&tag=shifu02-20&linkId=9cb3c83107c687168b9c74469d907a6a', 'https://amzn.to/2N9WWyn', 'https://www.amazon.com/Bath-Bliss-Expandable-72-inch-Curtain/dp/B00VMTKHBU/ref=as_li_ss_il?s=aps&ie=UTF8&qid=1531508002&sr=1-1-catcorr&keywords=Bath+Bliss+Expandable+42+to+72-inch+Curved+Shower+Curtain+Rod&linkCode=li2&tag=shifu02-20&linkId=3aabac324a48f7eeeae9a1b329d92f6f', 'https://amzn.to/2zC1rQh', 'https://www.amazon.com/Zenna-Home-35633SSP-NeverRust-Aluminum/dp/B00JVG5NMY/ref=as_li_ss_il?s=hi&ie=UTF8&qid=1531508744&sr=1-1&keywords=Zenna+Home+35633SSP,+NeverRust+Aluminum+Tension+Curved+Shower+Curtain+Rod&dpID=31nFuJj2r4L&preST=_SY300_QL70_&dpSrc=srch&linkCode=li2&tag=shifu02-20&linkId=d9c411023c0c33b4eef978b69932a649', 'https://amzn.to/2NeG4qg',
... and so on.

You can find SimplifiedDoc examples here
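
If installing an extra package is not an option, a similar href filter can be sketched with the standard library's `html.parser`. This is a swapped-in technique, not the SimplifiedDoc API: the class `HrefCollector`, the function `amazon_hrefs`, and the sample HTML are invented for the example, and the plain substring test is an assumption about what the `containsOr` call above does.

```python
from html.parser import HTMLParser

class HrefCollector(HTMLParser):
    """Collects the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        for name, value in attrs:
            # Skip anchors without an href, or with an empty one.
            if name == 'href' and value:
                self.hrefs.append(value)

def amazon_hrefs(html, needles=('amazon.com', 'amzn.to')):
    """Return hrefs containing any of the given substrings, in document order."""
    collector = HrefCollector()
    collector.feed(html)
    return [h for h in collector.hrefs
            if any(n in h.lower() for n in needles)]

# Hypothetical page fragment used in place of a live request.
sample_html = '''
<p><a href="https://amzn.to/2N9WWyn">Rod A</a>
<a href="/about">About</a>
<a name="top">no href</a>
<a href="https://www.amazon.com/dp/B006J23OGU">Rod B</a></p>
'''
print(amazon_hrefs(sample_html))
```

Like the substring approach in the question, this matches `amazon.com` anywhere in the href, so it shares the over-matching caveat from the comments.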

【Comments】:

  • Thanks, I can use this!