Writing a loop: BeautifulSoup and lxml for getting page content in a page-to-page skip setting

【Question title】: Writing a loop: BeautifulSoup and lxml for getting page content in a page-to-page skip setting
【Posted】: 2020-03-31 15:40:39
【Question description】:

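Update: an image of one of the 6,600+ target pages is now included: https://europa.eu/youth/volunteering/organisation/48592 (see below for the image, an explanation of the targets, and the data I want).

I am new to data work in the volunteering sector. Any help is appreciated. Over the past few days I have learned a lot from coding heroes such as αԋɱҽԃ αмєяιcαη and KunduK.

Basically, the aim is to get a quick overview of a set of opportunities for free volunteering in Europe. I have a list of URLs that I want to fetch the data from. I can already do this for a single URL, and I am currently working on a hands-on way to dig deeper into Python programming: several parser parts already work, see the overview of a few pages below. By the way, I think we should gather the information with pandas and store it in a CSV...

...and so on... [Note: not every URL and id is backed by a content page, so we need an incremental n+1 setting.] That way we can go through the pages one by one, counting up in n+1 increments.

See the examples:

Approach: I used CSS selectors. XPath and CSS selectors do the same job, and with BS or lxml we can use them or mix them with find() and findall().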
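A small sketch of how a CSS selector and chained find()/find_all() calls line up in BeautifulSoup. The HTML fragment is made up for illustration and is not the real europa.eu markup. Note that BeautifulSoup itself offers CSS selectors through select() but no XPath; XPath queries need lxml directly.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, loosely modelled on the selector used in this question.
SNIPPET = """
<div class="col-md-12">
  <h1>Organisation</h1>
  <p>Some address</p>
  <p><i>Norwegian Judo Federation</i></p>
</div>
"""

soup = BeautifulSoup(SNIPPET, "html.parser")

# Style 1: a single CSS selector
via_css = soup.select_one(".col-md-12 > p:nth-child(3) > i:nth-child(1)")

# Style 2: chained find()/find_all() calls reaching the same element
container = soup.find("div", class_="col-md-12")
via_find = container.find_all("p")[1].find("i")

print(via_css.text)
print(via_find.text)
```

Both styles reach the same <i> element here; which one reads better depends mostly on how stable the page structure is.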

So I am running this mini approach here:

from bs4 import BeautifulSoup
import requests

url = 'https://europa.eu/youth/volunteering/organisation/50160'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'lxml')
# Select the first-child <i> inside the <p> that is the third child of .col-md-12
tag_info = soup.select('.col-md-12 > p:nth-child(3) > i:nth-child(1)')
print(tag_info[0].text)

Output: Norwegian Judo Federation

Mini approach 2:

from lxml import html
import requests

url = 'https://europa.eu/youth/volunteering/organisation/50160'
response = requests.get(url)
tree = html.fromstring(response.content)
# Match <p> elements whose first text node contains 'Norwegian'
tag_info = tree.xpath("//p[contains(text(),'Norwegian')]")
print(tag_info[0].text)

Output: Norwegian Judo Federation (NJF) is a center organisation for Norwegian Judo clubs. NJF has 65 member clubs, which have about 4500 active members. 73 % of the members are between ages of 3 and 19. NJF is organized in The Norwegian Olympic and Paralympic Committee and Confederation of Sports (NIF). We are a member organisation in European Judo Union (EJU) and International Judo Federation (IJF). NJF offers and organizes a wide range of educational opportunities to our member clubs.

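And so on and so forth. What I want to achieve: the aim is to gather all the interesting information from all 6,800 pages, that is, information such as:

the URL of the page and all parts of the page that are marked red
organisation name
address
description of the organisation
role
expiry date
scope
last updated
organisation topics (not noted on every page: only occasionally)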
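The field list above could be captured in one record type, so that every page yields the same columns even when a field is missing. A minimal sketch; the class name and attribute names are assumptions chosen to mirror the list, not anything defined by the site.

```python
from dataclasses import dataclass, field, asdict
from typing import List


@dataclass
class OrganisationRecord:
    """One scraped page; every field except url and name is optional."""
    url: str
    name: str
    address: str = ""
    description: str = ""
    role: str = ""
    expiry_date: str = ""
    scope: str = ""
    last_updated: str = ""
    topics: List[str] = field(default_factory=list)  # absent on many pages


record = OrganisationRecord(
    url="https://europa.eu/youth/volunteering/organisation/50160",
    name="Norwegian Judo Federation",
)
row = asdict(record)  # plain dict, ready for csv.DictWriter or pandas
print(row["name"], row["topics"])
```

Defaults keep the loop simple: a page without topics still produces a complete row with an empty topics list.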

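...and then iterate to the next page, getting all the information, and so on. So I am trying a next step to gain more experience: gathering information from all of the pages. Note: we have 6926 pages.

The question, regarding the URLs, is how to find out which is the first and which is the last URL. Idea: what if we iterate from zero to 10,000!?

Using the numbers in the URLs!?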
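One way to picture the "count up and skip the gaps" idea is a generator that walks a range of ids and yields only those that have a page. This is a sketch under assumptions: the helper names are hypothetical, and it assumes a missing page shows up as a non-200 response, which has not been verified for this site.

```python
from typing import Callable, Iterable, Iterator, Optional, Tuple

URL_TEMPLATE = "https://europa.eu/youth/volunteering/organisation/{}"


def iter_existing_pages(
    ids: Iterable[int],
    fetch: Callable[[str], Optional[str]],
) -> Iterator[Tuple[int, str]]:
    """Yield (id, html) for every id whose page exists; skip the rest."""
    for org_id in ids:
        html = fetch(URL_TEMPLATE.format(org_id))
        if html is None:  # no content page behind this id
            continue
        yield org_id, html


def fetch_with_requests(url: str) -> Optional[str]:
    """Hypothetical fetcher: the HTML, or None on a non-200 response (assumption)."""
    import requests  # imported here so the generator above has no hard dependency

    response = requests.get(url, timeout=10)
    return response.text if response.status_code == 200 else None


# Offline demonstration with a fake fetcher: only ids 2 and 4 "exist".
fake_site = {
    URL_TEMPLATE.format(2): "<p>two</p>",
    URL_TEMPLATE.format(4): "<p>four</p>",
}
found = list(iter_existing_pages(range(1, 6), fake_site.get))
print(found)
```

Against the real site one would pass fetch_with_requests and a larger range. Probing 10,000 ids means many wasted requests, so a short delay between calls would be sensible.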

import requests
from bs4 import BeautifulSoup
import pandas as pd

numbers = [48592, 50160]


def Main(url):
    with requests.Session() as req:
        for num in numbers:
            response = req.get(url.format(num))
            soup = BeautifulSoup(response.content, 'lxml')
            tag_info = soup.select('.col-md-12 > p:nth-child(3) > i:nth-child(1)')
            # Not every id has a content page, so skip pages where nothing matches
            if not tag_info:
                continue
            print(tag_info[0].text)


Main("https://europa.eu/youth/volunteering/organisation/{}/")

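But here I run into problems. I guess I have overlooked something while combining the ideas from the parts above. Again, I think we should gather the information with pandas and store it in a CSV...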
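The pandas idea could look roughly like this: collect one dict per page, then build a DataFrame and write it out. A sketch only; the second row is a made-up entry, and treating the first page as having no topics is an assumption for the sake of showing a missing field.

```python
import pandas as pd

# Hypothetical records, as a scraping loop might collect them (one dict per page).
rows = [
    {
        "url": "https://europa.eu/youth/volunteering/organisation/50160",
        "name": "Norwegian Judo Federation",
        # pretend this page lists no topics, so the key is simply absent
    },
    {
        "url": "https://example.org/organisation/1",  # made-up entry
        "name": "Example Organisation",
        "topics": "Sport",
    },
]

# Fixing the column order keeps the CSV stable; missing keys become NaN.
df = pd.DataFrame(rows, columns=["url", "name", "topics"])

# With no path argument to_csv returns the CSV text; pass a filename to write a file.
csv_text = df.to_csv(index=False)
print(csv_text)
```

Missing values are written as empty cells, which suits the occasional page without organisation topics.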

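【Discussion】:

I smell my coding style here :P You have to share a screenshot of the desired output from the website.
Hello dear αԋɱҽԃ αмєяιcαη ... you are right: I am a fan of your coding approach; I like your ideas and how you solve problems. Hang on, I will create a screenshot of the desired output. I only need about 60 minutes and will then add it to the thread here. Meanwhile, many thanks for the reply and for being here ;) Great to see.
@αԋɱҽԃ αмєяιcαη: I have now added an image of one of the pages. The pages are all laid out the same way; only the organisation topics are not on every page... I guess we can use some of your great approaches. Over the past few days I have seen you loop over lots of pages, e.g. on the DAAD pages, the German pages for German university courses... I guess you sampled and gathered into pandas... Looking forward to hearing from you... regards, zero ;)
I am trying to understand your goal here. Where does your loop start? From which id to which?

Tags: python loops web-scraping beautifulsoup

【Solution 1】: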

import requests
from bs4 import BeautifulSoup
import re
import csv
from tqdm import tqdm


first = "https://europa.eu/youth/volunteering/organisations_en?page={}"
second = "https://europa.eu/youth/volunteering/organisation/{}_en"


def catch(url):
    with requests.Session() as req:
        pages = []
        print("Loading All IDS\n")
        for item in tqdm(range(0, 347)):
            r = req.get(url.format(item))
            soup = BeautifulSoup(r.content, 'html.parser')
            numbers = [a.get("href").split("/")[-1].split("_")[0] for a in soup.find_all(
                "a", href=re.compile("^/youth/volunteering/organisation/"), class_="btn btn-default")]
            # Collect the ids from every listing page, not just the last one
            pages.extend(numbers)
        return pages


def parse(url):
    links = catch(first)
    with requests.Session() as req:
        with open("Data.csv", 'w', newline="", encoding="UTF-8") as f:
            writer = csv.writer(f)
            writer.writerow(["Name", "Address", "Site", "Phone",
                             "Description", "Scope", "Rec", "Send", "PIC", "OID", "Topic"])
            print("\nParsing Now... \n")
            for link in tqdm(links):
                r = req.get(url.format(link))
                soup = BeautifulSoup(r.content, 'html.parser')
                task = soup.find("section", class_="col-sm-12").contents
                name = task[1].text
                add = task[3].find(
                    "i", class_="fa fa-location-arrow fa-lg").parent.text.strip()
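                try:
                    site = task[3].find("a", class_="link-default").get("href")
                except (AttributeError, TypeError):
                    # find() returned None: the page lists no website
                    site = "N/A"
                try:
                    phone = task[3].find(
                        "i", class_="fa fa-phone").next_element.strip()
                except (AttributeError, TypeError):
                    # no phone entry on this page
                    phone = "N/A"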
                desc = task[3].find(
                    "h3", class_="eyp-project-heading underline").find_next("p").text
                scope = task[3].findAll("span", class_="pull-right")[1].text
                rec = task[3].select("tbody td")[1].text
                send = task[3].select("tbody td")[-1].text
                pic = task[3].select(
                    "span.vertical-space")[0].text.split(" ")[1]
                oid = task[3].select(
                    "span.vertical-space")[-1].text.split(" ")[1]
                topic = [item.next_element.strip() for item in task[3].select(
                    "i.fa.fa-check.fa-lg")]
                # Separate multiple topics so they stay readable in one CSV cell
                writer.writerow([name, add, site, phone, desc,
                                 scope, rec, send, pic, oid, ", ".join(topic)])


parse(second)

Note: I have tested this on the first 10 pages. If you want more speed, I suggest using concurrent.futures. If any errors come up, use try/except.
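A rough sketch of what the concurrent.futures suggestion might look like, combined with the try/except advice. The function names are hypothetical, and the per-page parser is passed in as a callable so the pattern can be tried offline with a stand-in.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List, Tuple


def parse_all(
    links: Iterable[str],
    parse_one: Callable[[str], list],
    max_workers: int = 8,
) -> List[Tuple[str, list]]:
    """Run parse_one over links in a thread pool; failed links get an empty row."""

    def safe(link: str) -> Tuple[str, list]:
        try:
            return link, parse_one(link)
        except Exception:  # broad on purpose: one bad page must not stop the run
            return link, []

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() returns results in the same order as the input links
        return list(pool.map(safe, links))


# Offline demonstration with a stand-in parser that fails on one id.
def fake_parse(link: str) -> list:
    if link == "2":
        raise ValueError("simulated broken page")
    return ["org-" + link]


results = parse_all(["1", "2", "3"], fake_parse)
print(results)
```

Threads suit this job because the work is dominated by waiting on HTTP responses. Writing the CSV rows after the pool has finished keeps the file access in a single thread.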

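【Discussion】:

Wow, I am impressed: this is awesome! I just got back to the office and saw your solution. I have not tested it yet, but I will run it this afternoon. A note on the number of pages: we have 6926 pages, see europa.eu/youth/volunteering/organisations_en#open (I added a small image to the thread above). The question, regarding the URLs, is how to find out the first and the last URL. Idea: what if we iterate from zero to 10,000, using the numbers in the URLs!? What do you think? Looking forward to hearing from you! Many thanks for everything!
@zero! You do not need to iterate from 0 to 10,000. The first function, catch, loops over 347 listing pages (page=0 to page=346), and each page returns 20 ids, so 20 * 347 = 6940. You do have 6926, because the last page contains only 6 ids, which means 6940 - 14 = 6926.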
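The arithmetic in this comment can be checked quickly. A sketch that takes the two figures given in the thread as inputs: 20 ids per listing page and 6926 ids in total.

```python
import math

IDS_PER_PAGE = 20   # ids returned by each listing page (from the thread)
TOTAL_IDS = 6926    # total organisations (from the thread)

pages = math.ceil(TOTAL_IDS / IDS_PER_PAGE)
ids_on_last_page = TOTAL_IDS - (pages - 1) * IDS_PER_PAGE
shortfall = pages * IDS_PER_PAGE - TOTAL_IDS

print(pages, ids_on_last_page, shortfall)  # 347 6 14
```

So range(0, 347) covers exactly the listing pages 0 through 346, and the last of them is 14 ids short of a full page.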
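@zero For the record, here is the complete list of ids in sorted form.
Good day. This went beyond expectations: thank you very much. You helped me a lot ;)
Many thanks again. Your ideas and your Python genius are needed here. I need to fetch a bunch of URLs... stackoverflow.com/questions/61106309/… This one is beyond me: a. how to get all existing URLs, b. how to get the metadata of each plugin... I guess you have a solution... ;)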