【Question title】: Issues with CSV after scraping from HTML page
【Posted】: 2020-05-17 14:53:44
【Question】:

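My code runs fine, but when I write the output to a CSV file, the file has twice as many rows as I get when I use `print`.

I can't work out where the extra rows are coming from.

This is my code for saving to CSV: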

import bs4 as bs
import urllib.request
import csv

url1 = 'https://yugioh.fandom.com/wiki/Set_Card_Lists:Deck_Build_Pack:_Mystic_Fighters_(OCG-JP)'
output_file1_2 = "DBMF - CardList - tr2.csv" #change this to your own file output

def OutputHTMLFileSummary2(url,html_tag,output_file):
    array = []
    source = urllib.request.urlopen(url).read()
    soup = bs.BeautifulSoup(source, 'html.parser')
    f = csv.writer(open(output_file, "w", encoding="utf-8"))
    links = soup.find_all(html_tag)

    counter = 0.0
    for link in links:
        counter += 1
        if (counter/2) != 0.0:
            array.append([f.text.strip().replace("\xa0\n\t", "") for f in link.find_all("td")])
            print(counter)
        else:
            pass
    print(array)
    for i in range(len(array)):
        f.writerow([array[i]])
OutputHTMLFileSummary2(url1,"tr",output_file1_2)

file = open(output_file1_2, encoding="utf-8")
reader = csv.reader(file)
lines= len(list(reader))
print(lines)

The output in csv

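【Comments】:

  • What are you trying to do with counter and counter/2? You shouldn't test floats for equality. Are you trying to skip the even rows, or something else?
  • Ignore the counter for now. Originally I wanted to use it to count rows and skip a row when it was empty, but I then noticed I didn't need to. I just never removed that code (it can be removed).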
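
To illustrate the point raised in the first comment, here is a small sketch. The counter values 1 to 6 are made up for illustration and are not taken from the page. `counter/2` is only `0.0` when `counter` is `0`, so the test in the question never skips a row. If the goal had been to skip every second row, a modulo test on an integer counter would be one way to do it.

```python
# Hypothetical row counters 1..6, standing in for the position of each <tr>.
counters = [float(n) for n in range(1, 7)]

# The test from the question: counter / 2 is never 0.0 once counter >= 1,
# so every row passes and nothing is skipped.
passed_original = [c for c in counters if (c / 2) != 0.0]

# An integer counter with the modulo operator keeps only odd-numbered rows.
passed_modulo = [n for n in range(1, 7) if n % 2 != 0]

print(passed_original)  # [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
print(passed_modulo)    # [1, 3, 5]
```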

Tags: python csv beautifulsoup


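【Solution 1】:

It seems to work fine for me. I made a few tweaks to how the output is printed, and the printed list and the CSV file come out the same size.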

import bs4 as bs
import urllib.request
import csv

url1 = 'https://yugioh.fandom.com/wiki/Set_Card_Lists:Deck_Build_Pack:_Mystic_Fighters_(OCG-JP)'
output_file = "DBMF - CardList - tr2.csv" #change this to your own file output

def OutputHTMLFileSummary2(url,html_tag,output_file):
    array = []
    source = urllib.request.urlopen(url).read()
    soup = bs.BeautifulSoup(source, 'html.parser')
    with open(output_file, "w", encoding="utf-8") as src:
        f = csv.writer(src)
        links = soup.find_all(html_tag)

        counter = 0.0
        for link in links:
            counter += 1
            if (counter/2) != 0.0:   # <---- do NOT do this: always true once counter >= 1
                array.append([f.text.strip().replace("\xa0\n\t", "") for f in link.find_all("td")])
                print(counter)
            else:
                pass
        for idx, item in enumerate(array):
            print(f'{idx}: {item}')
        #print(array)
        for i in range(len(array)):
            f.writerow([array[i]])

OutputHTMLFileSummary2(url1,"tr",output_file)

file = open(output_file, encoding="utf-8")
reader = csv.reader(file)
#lines= len(list(reader))
print('\nFrom the file...\n')
for idx, item in enumerate(reader):
    print(f'{idx}: {item}')
file.close()
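
A possible explanation for the doubled row count, assuming the question's code was run on Windows (the post does not say): the file is opened without `newline=""`, so the csv module's `\r\n` row terminator gets written as `\r\r\n`, and reading the file back yields an empty row after every real one. The sketch below uses made-up rows and a temporary file, both hypothetical stand-ins for the scraped data. It also passes each row list directly to `writerow`, so every cell gets its own column instead of the whole list landing in a single cell.

```python
import csv
import os
import tempfile

# Hypothetical sample rows standing in for the text of the scraped <td> cells.
rows = [
    ["ID-001", "Example Card A", "Common"],
    ["ID-002", "Example Card B", "Rare"],
]

def write_rows(path, rows):
    # newline="" leaves line endings to the csv module, which avoids the
    # extra "\r" (and the resulting blank rows) on Windows.
    with open(path, "w", encoding="utf-8", newline="") as handle:
        writer = csv.writer(handle)
        for row in rows:
            writer.writerow(row)  # pass the list itself, not [row]

def count_rows(path):
    # Read back with newline="" as well, as the csv documentation recommends.
    with open(path, encoding="utf-8", newline="") as handle:
        return sum(1 for _ in csv.reader(handle))

path = os.path.join(tempfile.mkdtemp(), "demo.csv")
write_rows(path, rows)
row_count = count_rows(path)
print(row_count)  # 2
```

On platforms that already use `\n` line endings the original code would not show the doubling, which may be why the answer above could not reproduce it.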

【Comments】:
