【Question title】: Find an element using text contained in link title/subtitle
【Posted】: 2020-07-23 22:23:55
【Question description】:

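I am trying to select the link that contains the medicine name ALMOGRAN, the number 12.5, and the word "leaflet", as shown in the snippet below. The code I am using misses this element, although it works fine for other medicine names. What is the best way to filter for and select this element on the page (https://products.mhra.gov.uk/search/?query=almogran&page=1)?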

PIL

ALMOGRAN 12.5 MG FILM-COATED TABLETS leaflet MAH BRAND_PLPI 20774-1629.pdf
Code I am using:
elem7 = driver.find_element_by_xpath("//a[contains(., 'leaflet') and contains(., '" + g + "')]")
link2 = elem7.get_attribute('href')
time.sleep(15)

where g is the number 12.5

Please help me understand where I am going wrong; I am new to this. Thanks.

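【Comments】:

  • Try adding text() to them: contains(text(),'leaflet')
  • Do you want to click it?
  • @0m3r No, I want the URL.
  • Do you want only this specific link, or do you want to search for text and then get the link?
  • The XPath looks fine to me. As @Mace explained, because you are using find_element you only get one result (the first of 4). What do you get from: driver.find_element_by_xpath("//a[contains(., 'leaflet')and contains(.,'%s')]" % (g))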

Tags: python selenium xpath selenium-chromedriver


【Solution 1】:

Your code does find the desired link:

g = '12.5'
elem7 = driver.find_element_by_xpath(("//a[contains(., 'leaflet')and contains(.,'" + g + "')]"))
print(elem7.text)
link2 = elem7.get_attribute('href')
print(link2)

Result

ALMOGRAN 12.5 MG FILM-COATED TABLETS
leaflet MAH BRAND_PLPI 20774-1629.pdf
https://mhraproductsprod.blob.core.windows.net/docs-20200406/bf115fe972a98836c2af4072a77e2aaa04bcfa24

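But it will always be the first matching link on the page, because you used "find_element". If you test your search criteria with "find_elements", you will see that they actually match all 4 links.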

g = '12.5'
elements = driver.find_elements_by_xpath(("//a[contains(., 'leaflet')and contains(.,'" + g + "')]"))
for element in elements:
    print(element.text)
    link2 = element.get_attribute('href')
    print(link2)

Result

ALMOGRAN 12.5 MG FILM-COATED TABLETS
leaflet MAH BRAND_PLPI 20774-1629.pdf
https://mhraproductsprod.blob.core.windows.net/docs-20200406/bf115fe972a98836c2af4072a77e2aaa04bcfa24
ALMOTRIPTAN 12.5MG TABLETS,ALMOGRAN 12.5MG TABLETS
leaflet MAH BRAND_PLPI 20636-1099.pdf
https://mhraproductsprod.blob.core.windows.net/docs-20200406/3ef1e68659b17b5cbc6bed7988dd7d32d8ff5258
ALMOTRIPTAN 12.5MG TABLETS,ALMOGRAN 12.5MG TABLETS
leaflet MAH BRAND_PLPI 18799-1153.pdf
https://mhraproductsprod.blob.core.windows.net/docs-20200406/91cb21381c4aea438fe87a49cfc4abd3557f7614
ALMOGRAN 12.5 MG FILM-COATED TABLETS,ALMOTRIPTAN 12.5 MG FILM-COATED TABLETS
leaflet MAH BRAND_PLPI 20636-2661.pdf
https://mhraproductsprod.blob.core.windows.net/docs-20200406/f30be6ab9c3e85abd455da4262ffaf5932813014

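So if your medicine is not the first one on the page, your code will pick up a different medicine. You can make your search criteria more specific. Another approach is to use find_elements and add something like 'if "extra criteria" in element.text: ...' inside the for loop.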
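The "extra criteria" filter can be sketched as a small helper. This is only an illustration: `first_matching_href` and `required_texts` are hypothetical names that do not appear in the original answer, and the helper assumes each element exposes `.text` and `.get_attribute('href')` the way Selenium WebElements do.

```python
def first_matching_href(elements, required_texts):
    """Return the href of the first element whose visible text contains
    every string in required_texts, or None if nothing matches.

    `elements` is assumed to be the list returned by find_elements.
    """
    for element in elements:
        text = element.text
        # Keep the element only if all extra criteria appear in its text.
        if all(required in text for required in required_texts):
            return element.get_attribute('href')
    return None
```

Called as `first_matching_href(elements, ['ALMOGRAN 12.5 MG FILM-COATED TABLETS', 'leaflet'])`, it would skip links for other products even when they appear earlier on the page.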

This searches all the medicines on the page:

# ------------------------------------------------------------------
def find_medicine_leaflet(page_meds, med_name):
    print(f'\n------- searching for {med_name} -------')
    nr_found = 0
    for page_med in page_meds:
        if med_name in page_med.text:
            nr_found += 1
            text = page_med.text.replace('\n', ' - ')
            print(f"{nr_found} {text}\n  leaflet url: {page_med.get_attribute('href')}")
    print(f'------- {nr_found} found -------\n')


# ------------------------------------------------------------------
medicine_names = [
    'ALMOGRAN 12.5 MG FILM-COATED TABLETS',
    'ALMOTRIPTAN 12.5MG TABLETS,ALMOGRAN 12.5MG TABLETS',
    'ALMOTRIPTAN 12.5MG TABLETS,ALMOGRAN 12.5MG TABLETS',
    'ALMOGRAN 12.5 MG FILM-COATED TABLETS,ALMOTRIPTAN 12.5 MG FILM-COATED TABLETS',
]

g = '12.5'

page_medicines = driver.find_elements_by_xpath(("//a[contains(., 'leaflet')and contains(.,'" + g + "')]"))

for medicine_name in medicine_names:
    find_medicine_leaflet(page_medicines, medicine_name)

Result

------- searching for ALMOGRAN 12.5 MG FILM-COATED TABLETS -------
1 ALMOGRAN 12.5 MG FILM-COATED TABLETS - leaflet MAH BRAND_PLPI 20774-1629.pdf
  leaflet url: https://mhraproductsprod.blob.core.windows.net/docs-20200406/bf115fe972a98836c2af4072a77e2aaa04bcfa24
2 ALMOGRAN 12.5 MG FILM-COATED TABLETS,ALMOTRIPTAN 12.5 MG FILM-COATED TABLETS - leaflet MAH BRAND_PLPI 20636-2661.pdf
  leaflet url: https://mhraproductsprod.blob.core.windows.net/docs-20200406/f30be6ab9c3e85abd455da4262ffaf5932813014
------- 2 found -------


------- searching for ALMOTRIPTAN 12.5MG TABLETS,ALMOGRAN 12.5MG TABLETS -------
1 ALMOTRIPTAN 12.5MG TABLETS,ALMOGRAN 12.5MG TABLETS - leaflet MAH BRAND_PLPI 20636-1099.pdf
  leaflet url: https://mhraproductsprod.blob.core.windows.net/docs-20200406/3ef1e68659b17b5cbc6bed7988dd7d32d8ff5258
2 ALMOTRIPTAN 12.5MG TABLETS,ALMOGRAN 12.5MG TABLETS - leaflet MAH BRAND_PLPI 18799-1153.pdf
  leaflet url: https://mhraproductsprod.blob.core.windows.net/docs-20200406/91cb21381c4aea438fe87a49cfc4abd3557f7614
------- 2 found -------


------- searching for ALMOTRIPTAN 12.5MG TABLETS,ALMOGRAN 12.5MG TABLETS -------
1 ALMOTRIPTAN 12.5MG TABLETS,ALMOGRAN 12.5MG TABLETS - leaflet MAH BRAND_PLPI 20636-1099.pdf
  leaflet url: https://mhraproductsprod.blob.core.windows.net/docs-20200406/3ef1e68659b17b5cbc6bed7988dd7d32d8ff5258
2 ALMOTRIPTAN 12.5MG TABLETS,ALMOGRAN 12.5MG TABLETS - leaflet MAH BRAND_PLPI 18799-1153.pdf
  leaflet url: https://mhraproductsprod.blob.core.windows.net/docs-20200406/91cb21381c4aea438fe87a49cfc4abd3557f7614
------- 2 found -------


------- searching for ALMOGRAN 12.5 MG FILM-COATED TABLETS,ALMOTRIPTAN 12.5 MG FILM-COATED TABLETS -------
1 ALMOGRAN 12.5 MG FILM-COATED TABLETS,ALMOTRIPTAN 12.5 MG FILM-COATED TABLETS - leaflet MAH BRAND_PLPI 20636-2661.pdf
  leaflet url: https://mhraproductsprod.blob.core.windows.net/docs-20200406/f30be6ab9c3e85abd455da4262ffaf5932813014
------- 1 found -------

As far as I can tell, it is not missing anything. Because the medicine names overlap, a search can return 2 leaflets, but everything is found.
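If the overlap is a problem, one possible refinement is to compare the product title exactly instead of using a substring test. The sketch below assumes the link text has the product title on its first line and the "leaflet ..." file name on the second, as in the printed results above; `link_title` and `find_exact_leaflets` are hypothetical helper names, not part of the original code.

```python
def link_title(link_text):
    """Return the first line of a link's visible text (assumed to be the title)."""
    return link_text.split('\n', 1)[0].strip()


def find_exact_leaflets(page_meds, med_name):
    """Return the hrefs of all links whose title equals med_name exactly."""
    return [
        page_med.get_attribute('href')
        for page_med in page_meds
        if link_title(page_med.text) == med_name
    ]
```

With an exact comparison, a search for 'ALMOGRAN 12.5 MG FILM-COATED TABLETS' would no longer match the combined 'ALMOGRAN ...,ALMOTRIPTAN ...' entry.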

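【Comments】:

  • Sorry, I changed it back to your code's variable names and made a mistake. The code now gives the results you see and should be fine.
  • Hi @Mace, this looks similar to what I have been trying, but it misses the element when I run it in a loop over several different medicines.
  • The product page in the link lists 4 medicines, and so does the last result in my answer. I am not clear what you mean by "misses the element in a loop over several different medicines". Can you explain? Or better still, show a short working example with the loop?
  • I have a list of medicines; for each item in the list I run a search on this page and extract the first leaflet link.
  • See my answer.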
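For the per-medicine loop described in the comments, the search URL for each name could be built along these lines. This is a sketch only: it assumes the site keeps the `?query=...&page=...` pattern seen in the URL from the question, and `build_search_url` is a hypothetical helper name.

```python
from urllib.parse import urlencode

# Base URL taken from the link in the question.
SEARCH_URL = 'https://products.mhra.gov.uk/search/'


def build_search_url(medicine, page=1):
    """Build the search URL for one medicine name (assumed URL pattern)."""
    return f"{SEARCH_URL}?{urlencode({'query': medicine, 'page': page})}"
```

Each URL would then be passed to `driver.get(...)` before running the XPath search on the loaded page.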
【Solution 2】:

Can you try the XPaths below?

g = '12.5'
"//p[contains(text(),'leaflet') or contains(text(),'" + g + "')]//parent::a"

g = '12.5'
"//a[contains(., 'leaflet') and contains(.,'" + g + "')]"

【Comments】:
