【Question title】: Put array data on a csv file
【Posted】: 2016-11-12 14:52:23
【Question description】:

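How can I save array output to a csv file? I tried using the csv module, but it did not give me the correct output. I want output like the one shown below.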

output1.html

<div class="side-article txt-article">
    <p><strong></strong> <a href="http://batam.tribunnews.com/tag/polres/" title="Polres"></a> <a href="http://batam.tribunnews.com/tag/bintan/" title="Bintan"></a></p>
    <p><br></p>
    <p><a href="http://batam.tribunnews.com/tag/polres/" title="Polres"></a></p>
    <p><a href="http://batam.tribunnews.com/tag" title="Polres"></a> <a href="http://batam.tribunnews.com/tag/bintan/" title="Bintan"></a></p>
    <br>

Here is my code:

import csv
from bs4 import BeautifulSoup
from HTMLParser import HTMLParser

with open('output1.html', 'r') as f:
    html = f.read()
soup = BeautifulSoup(html.strip(), 'html.parser')

for line in html.strip().split('\n'):
    link_words = 0

    line_soup = BeautifulSoup(line.strip(), 'html.parser')
    for link in line_soup.findAll('a'):
        link_words += len(link.text.split())

    # naive way to get words count
    words_count = len(line_soup.text.split())- link_words

    number_tag_p = len(line_soup.find_all('p'))
    number_tag_br = len(line_soup.find_all('br'))
    number_tag_break = number_tag_br + number_tag_p

    #for line in html.strip().split('\n'):
    number_of_starttags = 0
    number_of_endtags = 0


        # create a subclass and override the handler methods
    class MyHTMLParser(HTMLParser):
        def handle_starttag(self, tag, attrs):
            global number_of_starttags
            number_of_starttags += 1

        def handle_endtag(self, tag):
            global number_of_endtags
            number_of_endtags += 1

                # instantiate the parser and fed it some HTML


    parser = MyHTMLParser()
    parser.feed(line.lstrip())
    number_tag = number_of_starttags + number_of_endtags
    #print(number_of_starttags + number_of_endtags)
    CTTD = words_count + link_words + number_tag_break


    if (words_count + link_words) == 0:
        CTTD == 0
    else:
        CTTD

    print ('TC : {0} LTC : {1} TG : {2} P : {3} CTTD : {4}'
           .format(words_count, link_words, number_tag, number_tag_break, CTTD))



res = ('TC : {0} LTC : {1} TG : {2} P : {3} CTTD : {4}'
           .format(words_count, link_words, number_tag, number_tag_break, CTTD))
csvfile = "./output1.csv"

#Assuming res is a flat list
with open(csvfile, "wb") as output:
    writer = csv.writer(output, lineterminator='\n')
    for val in res:
        writer.writerow([val])

#Assuming res is a list of lists
with open(csvfile, "wb") as output:
    writer = csv.writer(output, lineterminator='\n')
    writer.writerows(res)

Output of the algorithm:

TC : 0 LTC : 0 TG : 0 P : 0 CTTD : 0
TC : 0 LTC : 0 TG : 0 P : 0 CTTD : 0
TC : 0 LTC : 0 TG : 1 P : 0 CTTD : 0
TC : 0 LTC : 0 TG : 1 P : 0 CTTD : 0
TC : 15 LTC : 0 TG : 2 P : 0 CTTD : 15

CSV output:

How can I save what is printed to a csv? Is there a Python library that can do this?

I expect the output to be

Thanks.

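【Comments】:

  • In what way does the csv module fail? It is the right tool for the job. Is the problem whitespace or some other cosmetic decoration in the output? No tool I know of will output the exact table you show, because that table is rendered by a GUI and is not in the file at all. If you simply save a plain comma-separated csv and then import it into a spreadsheet, you will get what you have shown.
  • @tdelaney I updated my output. Thanks
  • Where did you get HTMLParser from?
  • For anyone using Python 3.x, you can get HTMLParser with: from html.parser import HTMLParser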
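
The import difference mentioned in the last comment can be handled in one place. The following is a minimal sketch, assuming the same script may run under either Python 2 or Python 3; the class name `StartTagCollector` and the sample HTML string are made up for illustration and do not come from the question.

```python
# Import HTMLParser from whichever module the running interpreter provides.
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2


class StartTagCollector(HTMLParser):
    """Hypothetical helper: records the name of every start tag it sees."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


collector = StartTagCollector()
collector.feed('<p><a href="#">x</a></p>')
print(collector.tags)
```

Storing the result on the parser instance, as done here, also avoids the `global` statements used in the question.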

Tags: python html arrays csv parsing


【Solution 1】:

Maybe this is what you want:

import csv
from bs4 import BeautifulSoup
from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        global number_of_starttags
        number_of_starttags += 1

    def handle_endtag(self, tag):
        global number_of_endtags
        number_of_endtags += 1

with open('output1.html', 'r') as f:
    html = f.read()
soup = BeautifulSoup(html.strip(), 'html.parser')

ress = []
for line in html.strip().split('\n'):
    link_words = 0

    line_soup = BeautifulSoup(line.strip(), 'html.parser')
    for link in line_soup.findAll('a'):
        link_words += len(link.text.split())

    words_count = len(line_soup.text.split())- link_words
    number_tag_p = len(line_soup.find_all('p'))
    number_tag_br = len(line_soup.find_all('br'))
    number_tag_break = number_tag_br + number_tag_p

    number_of_starttags = 0
    number_of_endtags = 0

    parser = MyHTMLParser()
    parser.feed(line.lstrip())
    number_tag = number_of_starttags + number_of_endtags
    CTTD = words_count + link_words + number_tag_break


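    # NOTE: `CTTD == 0` is a comparison, not an assignment, so this branch
    # has no effect. Use `CTTD = 0` if a zero was intended for empty lines.
    if (words_count + link_words) == 0:
        CTTD == 0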
    res = [words_count, link_words, number_tag, number_tag_break, CTTD]
    ress.append(res)

csvfile = "./output.csv"
firstline = ["TC", "LTC", "TG", "P", "CTTD"]
with open(csvfile, "w") as output:
    writer = csv.writer(output, lineterminator='\n')
    writer.writerow(firstline)
    for val in ress:
        writer.writerow(val)

In any case, my output differs from yours... I got this csv:

TC,LTC,TG,P,CTTD
0,0,1,0,0
0,0,8,1,1
0,0,3,2,2
0,0,4,1,1
0,0,6,1,1
0,0,1,1,1

The difference arises because your .format call is outside the scope of the for loop, so res only holds the values from the last line.
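
To illustrate the point about building each row inside the loop, here is a small sketch that counts tags per line and appends one row per iteration. It is not the answer's code: the names `TagCounter` and `count_tags` and the two sample lines are assumptions made for this example, and the counters live on the parser instance rather than in globals.

```python
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Hypothetical parser that keeps its counters on the instance."""

    def __init__(self):
        super().__init__()
        self.start_tags = 0
        self.end_tags = 0

    def handle_starttag(self, tag, attrs):
        self.start_tags += 1

    def handle_endtag(self, tag):
        self.end_tags += 1


def count_tags(line):
    """Return the number of start tags plus end tags in one line of HTML."""
    parser = TagCounter()
    parser.feed(line)
    parser.close()
    return parser.start_tags + parser.end_tags


lines = ["<p><br></p>", "<br>"]  # sample input, made up for this sketch
rows = []
for line in lines:
    # The row is created and appended inside the loop, so every line
    # contributes its own entry instead of only the last one surviving.
    rows.append([line, count_tags(line)])

print(rows)
```

A fresh parser is created for every line, so counts from one line never leak into the next.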

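【Comments】:

    【Solution 2】:

    writerow takes a list of elements, which become the cell values of one row.

    So when writing a csv, it is advisable to build the header as a list and all the values as a list of lists: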

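    import csv

    csvfile = "./output.csv"
    header = ["TC", "LTC", "TG", "P", "CTTD"]
    # Each inner list is one row and should have one value per header column.
    val = [[1, 2, 3, 4, 5], [2, 3, 4, 5, 6]]
    with open(csvfile, "w") as output:
        writer = csv.writer(output, lineterminator='\n')
        writer.writerow(header)
        for v in val:
            writer.writerow(v)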
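
As a variation on the same idea, `csv.writer` also offers `writerows`, which writes a whole list of lists in one call. The sketch below writes to an in-memory `io.StringIO` buffer instead of a file so the result can be inspected directly; the row values are sample data chosen for illustration.

```python
import csv
import io

header = ["TC", "LTC", "TG", "P", "CTTD"]
rows = [[0, 0, 1, 0, 0], [0, 0, 8, 1, 1]]  # sample data, one list per row

buffer = io.StringIO()  # stands in for an open file in this sketch
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(header)  # one list -> one row
writer.writerows(rows)   # list of lists -> one row per inner list
csv_text = buffer.getvalue()

print(csv_text)
```

When writing to a real file in Python 3, the csv module documentation recommends opening it with `newline=''` so the writer controls line endings.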
    

    【Comments】:
