【Question title】: Python: how to extract "data-bind" html elements?
【Posted】: 2017-07-19 16:51:13
【Question description】:

I am trying to extract data from a website. The element's content is hidden: when I use "View Source", the heading text is not shown.

<h4 data-bind="Text: Name"></h4>

But when I use "Inspect", the text is visible.

<h4 data-bind="Text: Name">STM1F-1S-HC</h4>

The code I used is:

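import urllib.request
from bs4 import BeautifulSoup

def getlink(link):
    # Fetch one product-line page and print the text of every matching <h4>
    try:
        f = urllib.request.urlopen(link)
        soup0 = BeautifulSoup(f)
    except Exception as e:
        print(e)
        return  # nothing to parse if the request failed
    for row2 in soup0.findAll("h4", {"data-bind": "text: Name"}):
        Name = row2.text
        print(Name)

# Code to find all links to the products for further processing.
# r1 (presumably the parsed category page) and productcount are defined elsewhere (not shown).
for row in r1.findAll('a', {"class": "col-xs-12 col-sm-6"}):
    link = 'https://www.truemfg.com/USA-Foodservice/' + row['href']
    print(link)
    getlink(link)
print(productcount)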

The output is:

https://www.truemfg.com/USA-Foodservice/Products/Traditional-Reach-Ins
C:\Users\Santosh\Anaconda3\lib\site-packages\bs4\__init__.py:181: UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("lxml"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.

The code that caused this warning is on line 193 of the file C:\Users\Santosh\Anaconda3\lib\runpy.py. To get rid of this warning, change code that looks like this:

 BeautifulSoup([your markup])

to this:

 BeautifulSoup([your markup], "lxml")

  markup_type=markup_type))

https://www.truemfg.com/USA-Foodservice/Products/Specification-Series

https://www.truemfg.com/USA-Foodservice/Products/Food-Prep-Tables

https://www.truemfg.com/USA-Foodservice/Products/Undercounters

https://www.truemfg.com/USA-Foodservice/Products/Worktops

https://www.truemfg.com/USA-Foodservice/Products/Chef-Bases

https://www.truemfg.com/USA-Foodservice/Products/Milk-Coolers

https://www.truemfg.com/USA-Foodservice/Products/Glass-Door-Merchandisers

https://www.truemfg.com/USA-Foodservice/Products/Air-Curtains

https://www.truemfg.com/USA-Foodservice/Products/Display-Cases

https://www.truemfg.com/USA-Foodservice/Products/Underbar-Refrigeration

As you can see, no names are printed.

Can anyone suggest a solution that prints the names?

Thanks, Santosh

【Question discussion】:

    Tags: python html data-binding web-scraping data-extraction


    【Solution 1】:

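    The content you need is generated dynamically via XHR, so it is not in the HTML that the server returns. You can try the code below to request the data directly, which avoids parsing the HTML.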

    import requests

    url = 'https://prodtrueservices.azurewebsites.net/api/products/productline/403/1?skip=0&take=200&unit=Imperial'
    r = requests.get(url)

    # Decode the JSON once and print the name of every product in the response
    for product in r.json()['Products']:
        print(product['Name'])
    

    This should let you get all the names.
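To see why parsing the page itself prints nothing, it may help to compare the two `<h4>` snippets from the question. The following is only a minimal sketch: it uses the standard library's `html.parser` as a stand-in for BeautifulSoup, and the `H4Text` class and `h4_texts` helper are hypothetical names introduced for illustration. The `served` string is what a plain HTTP request receives; the `rendered` string only exists after the browser has run the page's JavaScript.

```python
from html.parser import HTMLParser


class H4Text(HTMLParser):
    """Collect the text found inside every <h4> element."""

    def __init__(self):
        super().__init__()
        self.in_h4 = False
        self.texts = []

    def handle_starttag(self, tag, attrs):
        if tag == 'h4':
            self.in_h4 = True
            self.texts.append('')  # start a new, still empty, heading

    def handle_endtag(self, tag):
        if tag == 'h4':
            self.in_h4 = False

    def handle_data(self, data):
        if self.in_h4:
            self.texts[-1] += data


def h4_texts(markup):
    """Return the text of each <h4> in the given markup."""
    parser = H4Text()
    parser.feed(markup)
    parser.close()
    return parser.texts


# Markup copied from the question
served = '<h4 data-bind="Text: Name"></h4>'               # "View Source"
rendered = '<h4 data-bind="Text: Name">STM1F-1S-HC</h4>'  # "Inspect"

print(h4_texts(served))    # [''] : the element is there, but its text is empty
print(h4_texts(rendered))  # ['STM1F-1S-HC']
```

Any parser that only sees the served markup finds an empty heading, which is why requesting the JSON data directly is the simpler route.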

    【Discussion】:

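    • Thanks for the answer. Please tell me how to get the URL using XHR.
    • Can you clarify what exactly you want to get?
    • My requirement is: for all the links in the output above, I want to extract all product information into an Excel sheet, and save all images and files into folders, with the file path as a column in the sheet. How did you arrive at the URL "prodtrueservices.azurewebsites.net/api/products/productline/403/…"?
    • I'm not sure how to get the list of product lines with a specific request... It's not the best idea, but you could hard-code the IDs as pL = [403, 558, 560, 445, 436, 555, 561, 564, 557, 559, 562] and loop over this list, e.g.: for line in pL: r = requests.get('https://prodtrueservices.azurewebsites.net/api/products/productline/%s/1?skip=0&take=200&unit=Imperial' % line) ... Note that you can get the link to each image file with print(r.json()['Products'][0]['ProductImageUrl'])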
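The loop suggested in the last comment could be sketched roughly as follows. This is an illustration under stated assumptions, not a tested scraper: the ID list, the URL pattern, and the `Products`, `Name` and `ProductImageUrl` keys come from the thread, while the helper names (`extract_rows`, `fetch_rows`, `save_all`) and the choice of CSV output (which Excel can open) are assumptions. It uses only the standard library; `requests.get(url).json()` would serve equally well for the download step.

```python
import csv
import json
import urllib.request

# Product-line IDs and URL pattern copied from the comment above
PRODUCT_LINES = [403, 558, 560, 445, 436, 555, 561, 564, 557, 559, 562]
API = ('https://prodtrueservices.azurewebsites.net/api/products/'
       'productline/%s/1?skip=0&take=200&unit=Imperial')


def extract_rows(payload):
    """Return (Name, ProductImageUrl) pairs from one decoded API response."""
    return [(p.get('Name'), p.get('ProductImageUrl'))
            for p in payload.get('Products', [])]


def fetch_rows(line_id):
    """Download one product line and extract its rows (needs network access)."""
    with urllib.request.urlopen(API % line_id) as response:
        return extract_rows(json.load(response))


def save_all(path, line_ids=PRODUCT_LINES):
    """Hypothetical helper: write every product line to one CSV file."""
    with open(path, 'w', newline='', encoding='utf-8') as handle:
        writer = csv.writer(handle)
        writer.writerow(['ProductLine', 'Name', 'ProductImageUrl'])
        for line_id in line_ids:
            for name, image_url in fetch_rows(line_id):
                writer.writerow([line_id, name, image_url])
```

Calling `save_all('products.csv')` would then produce one row per product. Downloading the image files themselves and recording their local paths would be a further step that is not shown here.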