Scraping html data from a web site with <li> tags: answers

【Question title】: Scraping html data from a web site with <li> tags
【Posted】: 2019-10-25 15:06:30
【Question description】:

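I am trying to get data from this lottery website: https://www.lotterycorner.com/tx/lotto-texas/2019

The data I want to scrape is the dates and the winning numbers for 2017 to 2019. I then want to convert the data into a list and save it to a csv file or an excel file.

I apologize if my question is hard to understand; I am new to python. Here is the code I have tried, but I don't know what to do after this: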

import requests
from bs4 import BeautifulSoup

page = requests.get('https://www.lotterycorner.com/tx/lotto-texas/2017')
soup = BeautifulSoup(page.content, 'html.parser')
week = soup.find(class_='win-number-table row no-brd-reduis')
dates = week.find_all(class_='win-nbr-date col-sm-3 col-xs-4')
wn = week.find_all(class_='nbr-grp')

I would like my result to look like this:

【Question discussion】:

Tags: excel python-3.x csv web-scraping beautifulsoup

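【Solution 1】:

If there are table tags, don't use BeautifulSoup directly. It is much easier to let Pandas do the work for you (it uses BeautifulSoup to parse the tables under the hood).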

import pandas as pd

years = [2017, 2018, 2019]

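frames = []
for year in years:
    url = 'https://www.lotterycorner.com/tx/lotto-texas/%s' % year
    # read_html returns every table on the page; take the first and drop its header row
    table = pd.read_html(url)[0][1:]
    # split the space-separated winning numbers into one column per number
    win_nums = table.loc[:, 1].str.split(" ", expand=True).reset_index(drop=True)
    dates = pd.DataFrame(list(table.loc[:, 0]), columns=['date'])

    frames.append(dates.merge(win_nums, left_index=True, right_index=True))

# DataFrame.append was removed in pandas 2.0, so the yearly frames are combined with pd.concat
df = pd.concat(frames, ignore_index=True)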

df['date']= pd.to_datetime(df['date']) 
df = df.sort_values('date').reset_index(drop=True)

df.to_csv('file.csv', index=False, header=False)

Output:

print (df)
          date   0   1   2   3   4   5
0   2017-01-04   5   7  36  39  40  44
1   2017-01-07   2   5  14  18  26  27
2   2017-01-11   4  13  16  19  43  51
3   2017-01-14   7   8  10  18  47  48
4   2017-01-18   6  11  17  37  40  49
5   2017-01-21   2  13  17  39  41  46
6   2017-01-25   1  14  19  32  37  46
7   2017-01-28   5   7  30  48  51  52
8   2017-02-01  12  19  26  29  37  54
9   2017-02-04   8  13  19  25  26  29
10  2017-02-08  10  15  47  49  51  52
11  2017-02-11  24  25  26  29  41  53
12  2017-02-15   1   4   5  43  53  54
13  2017-02-18   5  11  14  21  38  44
14  2017-02-22   4   8  21  27  52  53
15  2017-02-25  16  37  42  46  49  54
16  2017-03-01   3  24  33  34  45  51
17  2017-03-04   2   4   5  17  48  50
18  2017-03-08  15  19  24  33  34  47
19  2017-03-11   5   6  24  28  29  37
20  2017-03-15   4  11  19  27  32  46
21  2017-03-18  12  15  16  23  38  43
22  2017-03-22   3   5  15  27  36  52
23  2017-03-25  21  25  27  30  36  48
24  2017-03-29   7   9  11  18  23  43
25  2017-04-01   3  21  28  33  38  52
26  2017-04-05   8  20  21  26  51  52
27  2017-04-08  10  11  12  47  48  52
28  2017-04-12   5  26  30  31  46  54
29  2017-04-15   2  11  36  40  42  53
..         ...  ..  ..  ..  ..  ..  ..
265 2019-07-20   3  35  38  45  50  51
266 2019-07-24   2   9  16  22  46  49
267 2019-07-27   1   2   6   8  20  53
268 2019-07-31  20  24  34  36  41  44
269 2019-08-03   6  17  18  20  26  34
270 2019-08-07   1   3  16  22  31  35
271 2019-08-10  18  19  27  36  48  52
272 2019-08-14  22  23  29  36  39  49
273 2019-08-17  14  18  21  23  40  44
274 2019-08-21  18  28  29  36  48  52
275 2019-08-24  11  31  42  48  50  52
276 2019-08-28   9  21  40  42  49  53
277 2019-08-31   5   7  30  41  44  54
278 2019-09-04   4  26  36  37  45  50
279 2019-09-07  22  23  31  33  40  42
280 2019-09-11   8  11  12  30  31  49
281 2019-09-14   1   3  24  28  31  41
282 2019-09-18   3  24  26  29  45  50
283 2019-09-21   2  20  31  43  45  54
284 2019-09-25   5   9  26  38  41  44
285 2019-09-28  16  18  39  45  49  54
286 2019-10-02   9  26  39  42  47  49
287 2019-10-05   6  10  18  24  32  37
288 2019-10-09  14  18  19  27  33  41
289 2019-10-12   3  11  15  29  44  49
290 2019-10-16  12  15  25  39  46  49
291 2019-10-19  19  29  41  46  50  51
292 2019-10-23   4   5  11  35  44  50
293 2019-10-26   1   2  26  41  42  54
294 2019-10-30  10  11  28  31  40  53

[295 rows x 7 columns]
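The split-and-merge step can be tried without hitting the site. The following is only a sketch: the small `raw` frame stands in for what `pd.read_html(url)[0]` might return (a header row, then a date column and a column of space-separated numbers), which is an assumption about the page layout. The two sample rows are taken from the output above.

```python
import pandas as pd

# Stand-in for pd.read_html(url)[0]; the layout is assumed, not verified against the site.
raw = pd.DataFrame([
    ["Date", "Winning Numbers"],
    ["2017-01-07", "2 5 14 18 26 27"],
    ["2017-01-04", "5 7 36 39 40 44"],
])

table = raw[1:]  # drop the header row, as in the answer
# one column per number; the values stay strings
win_nums = table.loc[:, 1].str.split(" ", expand=True).reset_index(drop=True)
dates = pd.DataFrame(list(table.loc[:, 0]), columns=["date"])
merged = dates.merge(win_nums, left_index=True, right_index=True)

# sort so that the most recent draw ends up at the bottom
merged["date"] = pd.to_datetime(merged["date"])
merged = merged.sort_values("date").reset_index(drop=True)
print(merged)
```

Note that the number columns hold strings, not integers; convert them with `astype(int)` if you need to do arithmetic on them.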

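【Discussion】:

Is there a way to put the resent draws at the bottom?
@outkast, what do you mean?
If you look at the output, the most recent draws are at the top. Is there a way to reverse the order?
Aha. Well, "recent" is not "resent". I was confused by that typo. So you basically want them in date order, with the most recent date at the bottom?
Sorry for the typo. But yes, the most recent date at the bottom.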

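【Solution 2】:

The code below creates one csv file per year, containing all headers and values. In the example below that is 3 files: data_2017.csv, data_2018.csv and data_2019.csv.
You can add another year to years = ['2017', '2018', '2019'] if needed.
The winning numbers are formatted as 1-2-3-4-5.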

from bs4 import BeautifulSoup
import requests
import pandas as pd

base_url = 'https://www.lotterycorner.com/tx/lotto-texas/'
years = ['2017', '2018', '2019']

with requests.session() as s:
    for year in years:
        data = []

        # use the session so the connection is reused across years
        page = s.get(f'{base_url}{year}')
        soup = BeautifulSoup(page.content, 'html.parser')
        rows = soup.select(".win-number-table tr")

        headers = [td.text.strip() for td in rows[0].find_all("td")]
        # remove header line
        del rows[0]
        for row in rows:
            td = [td.text.strip() for td in row.select("td")]
            # replace whitespaces in Winning Numbers with -
            td[headers.index("Winning Numbers")] = '-'.join(td[headers.index("Winning Numbers")].split())
            data.append(td)

        df = pd.DataFrame(data, columns=headers)
        df.to_csv(f'data_{year}.csv')

To save only the winning numbers, replace df.to_csv(f'data_{year}.csv') with:

df.to_csv(f'data_{year}.csv', columns=["Winning Numbers"], index=False, header=False)

Sample output for 2017, winning numbers only, without a header:

9-14-16-27-45-51
2-4-15-38-48-53
8-22-23-29-34-36
6-10-11-22-30-45
5-10-16-22-26-46
12-14-19-34-39-47
4-5-10-21-34-40
1-25-35-42-48-51
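For reference, here is a small offline sketch of the two steps this answer relies on: joining the whitespace-separated numbers with "-", and choosing whether `to_csv` writes a header. The raw cell strings and their spacing are made up for illustration, and `io.StringIO` stands in for the output file.

```python
import io
import pandas as pd

# Raw cell text as it might come out of td.text.strip(); the spacing is hypothetical.
raw_numbers = ["9 14 16  27 45 51", "2 4 15\n38 48 53"]
# split() with no argument collapses any run of whitespace
numbers = ["-".join(cell.split()) for cell in raw_numbers]

df = pd.DataFrame({"Winning Numbers": numbers})

without_header = io.StringIO()
df.to_csv(without_header, columns=["Winning Numbers"], index=False, header=False)

with_header = io.StringIO()
df.to_csv(with_header, columns=["Winning Numbers"], index=False)

print(without_header.getvalue())
print(with_header.getvalue())
```

Dropping `header=False` is the only change needed to get the column name as the first line of the file.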

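【Discussion】:

Is there a way for me to get only the winning numbers?
Yes, you can. Check the update to the answer for code that saves only the winning numbers. If you want the header in the file, remove header=False, i.e. use df.to_csv(f'data_{year}.csv', columns=["Winning Numbers"], index=False)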

【Solution 3】:

This should export the data you need to a csv file:

from bs4 import BeautifulSoup
from csv import writer
import requests


page = requests.get('https://www.lotterycorner.com/tx/lotto-texas/2019')

soup = BeautifulSoup(page.content,'html.parser')

header = {
    'date': 'win-nbr-date col-sm-3 col-xs-4',
    'winning numbers': 'nbr-grp',
    'jackpot': 'win-nbr-jackpot col-sm-3 col-xs-3',
}

table = []

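for header_key, header_value in header.items():
    items = soup.find_all(class_=header_value)
    if header_key == 'winning numbers':
        # separate the individual numbers with commas
        column = [','.join(item.get_text().split()) for item in items]
    elif header_key == 'jackpot':
        # remove whitespace and hidden characters inside the amount
        column = [''.join(item.get_text().split()) for item in items]
    else:
        column = [item.get_text() for item in items]
    table.append(column)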

rows = list(zip(*table))

with open("winning numbers.csv", "w", newline="") as f:
    csv_writer = writer(f)
    csv_writer.writerow(header)
    for row in rows:
        csv_writer.writerow(row)

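header is a dictionary that maps your csv headers to their html class values.

In the for loop we build the data for each column. "winning numbers" and "jackpot" need some special handling: I replace any whitespace or hidden characters with a comma for the numbers and with an empty string for the jackpot.

Each column is added to a list called table. We write everything to a csv file, but since csv writes one row at a time, we need the zip function (rows = list(zip(*table))) to prepare our rows.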
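The column-to-row transposition can be seen in isolation with a few hard-coded values. This is a minimal sketch: the two sample draws are placeholders rather than scraped data, and `io.StringIO` stands in for the output file.

```python
import csv
import io

header = ["date", "winning numbers"]
# One list per column, in the same shape the loop in the answer builds them.
table = [
    ["2017-01-04", "2017-01-07"],
    ["5,7,36,39,40,44", "2,5,14,18,26,27"],
]

# zip(*table) pairs up the i-th entry of every column, giving one tuple per row
rows = list(zip(*table))

buffer = io.StringIO()  # stands in for open("winning numbers.csv", "w", newline="")
csv_writer = csv.writer(buffer)
csv_writer.writerow(header)
csv_writer.writerows(rows)
print(buffer.getvalue())
```

Because the numbers are joined with commas, the csv writer quotes that field so it stays a single column when the file is read back.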

【Discussion】:

【Solution 4】:

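Here is a concise way for bs4 4.7.1+ that uses :not to exclude the header and zip to combine the columns for output. The results are as shown on the page. Session is used for the efficiency of TCP connection re-use.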

import requests, re, csv
from bs4 import BeautifulSoup as bs

dates = []; winning_numbers = []

with requests.Session() as s:
    for year in range(2017, 2020):
        r = s.get(f'https://www.lotterycorner.com/tx/lotto-texas/{year}')
        soup = bs(r.content, 'html.parser')  # name the parser explicitly to avoid the bs4 warning
        dates.extend([i.text for i in soup.select('.win-nbr-date:not(.blue-bg)')])
        winning_numbers.extend([re.sub(r'\s+', '-', i.text.strip()) for i in soup.select('.nbr-list')])

with open("lottery.csv", "w", encoding="utf-8-sig", newline='') as csv_file:
    w = csv.writer(csv_file, delimiter = ",", quoting=csv.QUOTE_MINIMAL)
    w.writerow(['date','numbers'])
    for row in zip(dates, winning_numbers):
        w.writerow(row)
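To see what the :not selector and the re.sub call do, here is a sketch that runs on an inline HTML snippet instead of the live page. The class names come from the code above, but the markup itself, including the date format, is a guess at the page structure and not a copy of it.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup: only the class names are taken from the answer.
html = """
<table class="win-number-table">
  <tr><td class="win-nbr-date blue-bg">Date</td><td class="blue-bg">Winning Numbers</td></tr>
  <tr>
    <td class="win-nbr-date">01/04/2017</td>
    <td><ul class="nbr-list"><li>5</li> <li>7</li> <li>36</li></ul></td>
  </tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# :not(.blue-bg) skips the header cell, which shares the win-nbr-date class
dates = [i.text for i in soup.select(".win-nbr-date:not(.blue-bg)")]
# collapse the whitespace between the <li> texts into single dashes
numbers = [re.sub(r"\s+", "-", i.text.strip()) for i in soup.select(".nbr-list")]

print(list(zip(dates, numbers)))
```

If the real page has no whitespace between the <li> elements, the texts would run together, so check the output for one year before relying on this.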

【Discussion】:

【Solution 5】:

This works:

import requests
from bs4 import BeautifulSoup
import re

def main():
    page = requests.get('https://www.lotterycorner.com/tx/lotto-texas/2018')
    soup = BeautifulSoup(page.content,'html.parser')
    week = soup.find(class_='win-number-table row no-brd-reduis')
    wn = (week.find_all(class_='nbr-grp'))
    file = open ("vit.txt","w+")
    for winning_number in wn:
        line = remove_html_tags(str(winning_number.contents).strip('[]'))
        line = line.replace(" ", "")
        file.write(line + "\n")
    file.close()

def remove_html_tags(text):
    # drop anything that looks like an HTML tag
    clean = re.compile('<.*?>')
    return re.sub(clean, '', text)

if __name__ == '__main__':
    main()

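This part of the code loops through the wn variable and writes each line to the "vit.txt" file:

for winning_number in wn:
    line = remove_html_tags(str(winning_number.contents).strip('[]'))
    line = line.replace(" ", "")
    file.write(line + "\n")
file.close()

The "stripping" of the <li> tags could probably be done better; for example, there should be an elegant way to save winning_number to a list and print the list in one line.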
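One possible way to do that, sketched on an inline snippet: let BeautifulSoup extract the text of each <li> instead of stripping tags with a regular expression. The nbr-grp class is taken from the code above; the surrounding markup is assumed for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: each draw is assumed to be an element with class
# "nbr-grp" that contains one <li> per number.
html = """
<div class="nbr-grp"><ul><li>5</li><li>7</li><li>36</li></ul></div>
<div class="nbr-grp"><ul><li>2</li><li>5</li><li>14</li></ul></div>
"""
soup = BeautifulSoup(html, "html.parser")

# one list of number strings per draw, with no manual tag stripping
draws = [[li.get_text(strip=True) for li in group.find_all("li")]
         for group in soup.find_all(class_="nbr-grp")]

print(draws)
```

Each inner list can then be written out with something like file.write(" ".join(numbers) + "\n").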

【Discussion】:

How do I print the output?
@outkast20 Put print(line) on the line before file.write(line + "\n").