【Question Title】: Scrape web page and write to CSV using Python
【Posted】: 2019-11-12 13:14:23
【Question Description】:

I am trying to scrape my company's web page and write the results to a CSV file.

I can get the data I want with this code:

import requests
from bs4 import BeautifulSoup

page = requests.get('https://wiki.us.cworld.company.com/display/6TO/AWS+Accounts', auth=('tdunphy', 'secret!'))
soup = BeautifulSoup(page.text, 'html.parser')
html = list(soup.children)[1]
all_rows = soup.find_all('tr')
row_count = 0
for row in all_rows:
    row_count += 1
    if row_count == 1:
        continue
    print(row.get_text())

But the resulting data all runs together and is almost impossible to decipher:

company-govcloud-ab-mc-stage-adminkpmg-us-aws-adv-ab-mc-govcloud-admin-stageCommercial AccountAdvisory12345678901NoIslandhttps://company-govcloud-ab-mc-stage-admin.signin.aws.amazon.com/consoleKarel Somebody23452126676371Console, Access Key
company-govcloud-ab-mc-stagekpmg-us-aws-adv-ab-mc-govcloud-stageGov AccountAdvisory12324546562NoIslandhttps://company-govcloud-ab-mc-stage.signin.amazonaws-us-gov.com/consoleKarel Somebody123213123131Console, Access Key
company-cob(Decommissioned 03/28/2019)company-COB COB, Client OnboardingAdvisory21234546789812NoIslandhttps://company-cob.signin.aws.amazon.com/console/Laurence LorcaPending DecommissionConsole, Access Key

I want the resulting CSV to have the following headers:

['Company Account Name', 'AWS Account Name', 'Description', 'LOB', 'AWS Account Number', 'CIDR Block', 'Connected to Montvale', 'Peninsula or Island', 'URL', 'Owner', 'Engagement Code', 'CloudOps Access Type']

On the original web page the data is in an HTML table, and it is clean and readable:

company-govcloud-ab-mc-stage-admin  company-us-aws-adv-ab-mc-govcloud-admin-stage   Commercial Account  Advisory    12345667890101  No  Island  https://company-govcloud-ab-mc-stage-admin.signin.aws.amazon.com/console    Karel Somebody  123456789101    Console, Access Key

Here is some sample HTML from the data I am extracting:

<tr><td class="confluenceTd">company-master</td><td class="confluenceTd">us-ktawsmasacct</td><td class="confluenceTd">Master Account</td><td class="confluenceTd">BPG</td><td class="confluenceTd"><span style="text-decoration: none;">123456789101</span></td><td colspan="1" class="confluenceTd"><br/></td><td class="confluenceTd">No</td><td class="confluenceTd">N/A - no cloud resources</td><td class="confluenceTd"><a href="https://us-ktech-aws-master-acct.signin.aws.amazon.com/console" class="external-link" rel="nofollow">https://us-ktech-aws-master-acct.signin.aws.amazon.com/console</a></td><td class="confluenceTd">Rahul Arya</td><td class="confluenceTd"> 245612345678</td><td class="confluenceTd">Console,   Access Key</td></tr><tr><td class="confluenceTd">company-transit-hub1</td><td class="confluenceTd">us-ktawsth1acct</td><td class="confluenceTd">Transit Hub</td><td class="confluenceTd">BPG</td><td class="confluenceTd"><span style="text-decoration: none;">303779310401</span></td><td colspan="1" class="confluenceTd"><span style="color: rgb(0,0,0);">10.47.0.0/24</span></td><td class="confluenceTd">No</td><td class="confluenceTd">Peninsula</td><td class="confluenceTd"><a href="https://company-transit-hub1.signin.aws.amazon.com/console" class="external-link" rel="nofollow">https://company-transit-hub1.signin.aws.amazon.com/console</a></td><td class="confluenceTd">Rahul Arya</td><td class="confluenceTd"> 245612345678</td><td class="confluenceTd">Console,   Access Key</td></tr>

<tr><td colspan="1" class="confluenceTd">company-transit-hub3 (lab)</td><td colspan="1" class="confluenceTd"><span style="color: rgb(68,68,68);text-decoration: none;">us-dbawsth3acct</span></td><td colspan="1" class="confluenceTd">Transit Hub</td><td colspan="1" class="confluenceTd">BPG</td><td colspan="1" class="confluenceTd"><span style="color: rgb(68,68,68);text-decoration: none;">1098765432101</span> </td><td colspan="1" class="confluenceTd"><span style="color: rgb(0,0,0);">10.0.0.0/24</span></td><td colspan="1" class="confluenceTd">No</td><td colspan="1" class="confluenceTd">Island</td><td colspan="1" class="confluenceTd"> <a href="https://company-transithub3-lab.signin.aws.amazon.com/console" class="external-link" rel="nofollow">https://company-transithub3-lab.signin.aws.amazon.com/console</a></td><td colspan="1" class="confluenceTd">Rahul Arya </td><td colspan="1" class="confluenceTd"> </td><td colspan="1" class="confluenceTd">Console, Access Key</td></tr>

<tr><td class="confluenceTd">company-security</td><td class="confluenceTd"><span style="color: rgb(68,68,68);text-decoration: none;">us-ktawssecacct</span></td><td class="confluenceTd">Security</td><td class="confluenceTd">BPG</td><td class="confluenceTd">254312345691</td><td colspan="1" class="confluenceTd"><br/></td><td class="confluenceTd">No</td><td class="confluenceTd"><span>connected through hub1</span></td><td class="confluenceTd"><a href="https://us-ktawssecacct.signin.aws.amazon.com/console" class="external-link" rel="nofollow">https://us-ktawssecacct.signin.aws.amazon.com/console</a></td><td class="confluenceTd">Rahul Arya</td><td class="confluenceTd"> 245612345678</td><td class="confluenceTd">Console,   Access Key</td></tr><tr><td class="confluenceTd">company-shared-services</td><td class="confluenceTd">us-ktawsssacct</td><td class="confluenceTd">Shared Services</td><td class="confluenceTd">BPG</td><td class="confluenceTd">300944922012</td><td colspan="1" class="confluenceTd"><br/></td><td class="confluenceTd">No</td><td class="confluenceTd"><span>connected through hub1</span></td><td class="confluenceTd"><a href="https://company-shared-services.signin.aws.amazon.com/console" class="external-link" rel="nofollow">https://company-shared-services.signin.aws.amazon.com/console</a></td><td class="confluenceTd">Rahul Arya</td><td class="confluenceTd">245612345678</td><td class="confluenceTd">Console,   Access Key</td></tr><tr>

<tr><td class="confluenceTd">company-logging</td><td class="confluenceTd">us-ktawslogmonacct</td><td class="confluenceTd">Logging</td><td class="confluenceTd">BPG</td><td class="confluenceTd">542348765123</td><td colspan="1" class="confluenceTd"><br/></td><td class="confluenceTd">No</td><td class="confluenceTd"><span>connected through hub1</span></td><td class="confluenceTd"><a href="https://company-logging.signin.aws.amazon.com/console" class="external-link" rel="nofollow">https://company-logging.signin.aws.amazon.com/console</a></td><td class="confluenceTd">Rahul Arya</td><td class="confluenceTd">800000039768</td><td class="confluenceTd">Console,   Access Key</td></tr><tr><td class="confluenceTd">company-spoke-acct1</td><td class="confluenceTd">us-ktawsspk1acct</td><td class="confluenceTd">Spoke Account</td><td class="confluenceTd">BPG</td><td class="confluenceTd"><span style="text-decoration: none;">103440952267</span></td><td colspan="1" class="confluenceTd"><span style="color: rgb(0,0,0);text-decoration: none;">10.47.8.0/24</span></td><td class="confluenceTd">No</td><td class="confluenceTd"><span>connected through hub1</span></td><td class="confluenceTd"><a href="https://block-chain.signin.aws.amazon.com/console" class="external-link" rel="nofollow">https://block-chain.signin.aws.amazon.com/console</a></td><td class="confluenceTd">Rahul Arya</td><td class="confluenceTd"><p>123456757897</p></td><td class="confluenceTd">Console,   Access Key</td></tr>

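The problem is that when I scrape the data from the page, the fields run together. I need to separate them and insert commas.

How can I insert a comma between each field of the table data so that I can write it to a CSV file?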

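【Comments】:

  • It would be best to parse each row (tr), extract each element, e.g. td, and build a list of lists that can then be read with, for example, pandas. Maybe this helps... If not, please post a sample of your data (page) so that your output can be reproduced.
  • Thanks. I have updated the OP with some sample HTML from the data I am trying to extract. I have also put it in a paste.
  • Have a look at Andrej Kesely's answer below; I think it is what you need. :-)

Tags: python web-scraping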
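
The suggestion in the first comment (parse each tr, extract each td, build a list of lists) could be sketched roughly as follows. This is only an illustration: the HTML is abbreviated from the sample in the question, and the header row built from th cells is an assumption about how the real page marks its header. Whitespace inside each cell is collapsed so that values such as "Console,   Access Key" come out clean.

```python
from bs4 import BeautifulSoup

# Abbreviated from the sample HTML in the question. The <th> header row is an
# assumption; the question only shows that the first row is skipped.
html = (
    '<table>'
    '<tr><th>Company Account Name</th><th>CIDR Block</th><th>CloudOps Access Type</th></tr>'
    '<tr><td class="confluenceTd">company-master</td>'
    '<td colspan="1" class="confluenceTd"><br/></td>'
    '<td class="confluenceTd">Console,   Access Key</td></tr>'
    '<tr><td class="confluenceTd">company-transit-hub1</td>'
    '<td colspan="1" class="confluenceTd"><span style="color: rgb(0,0,0);">10.47.0.0/24</span></td>'
    '<td class="confluenceTd">Console,   Access Key</td></tr>'
    '</table>'
)

soup = BeautifulSoup(html, 'html.parser')

rows = []
for tr in soup.find_all('tr'):
    cells = tr.find_all('td')
    if not cells:
        continue  # skip rows that contain no <td> (here: the header row)
    # One list entry per cell; split()/join() collapses runs of whitespace.
    rows.append([' '.join(td.get_text().split()) for td in cells])

for row in rows:
    print(row)
```

Empty cells (such as the one holding only a br tag) become empty strings, so every row keeps the same number of columns.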


【Solution 1】:

To write a CSV file, use the built-in csv module:

data = '''
<table>
<tr>
<td>company-govcloud-ab-mc-stage-admin</td>
<td>company-us-aws-adv-ab-mc-govcloud-admin-stage</td>
<td>Commercial Account</td>
<td>Advisory</td>
<td>12345667890101</td>
<td>No</td>
<td>Island</td>
<td>https://company-govcloud-ab-mc-stage-admin.signin.aws.amazon.com/console</td>
<td>Karel Somebody</td>
<td>123456789101</td>
<td>Console, Access Key</td>
</tr>
</table>'''

headers = ['Company Account Name', 'AWS Account Name', 'Description', 'LOB', 'AWS Account Number', 'Connected to Montvale', 'Peninsula or Island', 'URL', 'Owner', 'Engagement Code', 'CloudOps Access Type']

from bs4 import BeautifulSoup
import csv

soup = BeautifulSoup(data, 'lxml')

rows = []
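for tr in soup.select('tr'):
    # Select cells from the current row (tr), not from the whole document (soup);
    # otherwise every row would contain all cells of the page.
    rows.append([td.text for td in tr.select('td')])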


with open('out.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile, delimiter=';',
                            quotechar='"', quoting=csv.QUOTE_MINIMAL)
    writer.writerow(headers)

    for row in rows:
        writer.writerow(row)

The file out.csv contains:

Company Account Name;AWS Account Name;Description;LOB;AWS Account Number;Connected to Montvale;Peninsula or Island;URL;Owner;Engagement Code;CloudOps Access Type
company-govcloud-ab-mc-stage-admin;company-us-aws-adv-ab-mc-govcloud-admin-stage;Commercial Account;Advisory;12345667890101;No;Island;https://company-govcloud-ab-mc-stage-admin.signin.aws.amazon.com/console;Karel Somebody;123456789101;Console, Access Key
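
The output above uses ';' as the delimiter. If a comma is used instead, a value such as "Console, Access Key" contains the delimiter itself, and csv.QUOTE_MINIMAL wraps that field in quotes so it still reads back as one column. A small sketch of this, writing to an in-memory buffer rather than a file (the shortened three-column row is made up for illustration):

```python
import csv
import io

# Shortened, illustrative row; the last field contains a comma.
row = ['company-master', '123456789101', 'Console, Access Key']

buffer = io.StringIO()
writer = csv.writer(buffer, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
writer.writerow(row)

line = buffer.getvalue()
print(line)

# Reading the line back yields the original three fields.
parsed = next(csv.reader(io.StringIO(line)))
print(parsed)
```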

[Screenshot from LibreOffice Calc]

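【Comments】:

  • Thanks. When I use your code, it produces this result, almost exactly the way it is presented. The only difference between the code I am using and yours is that I am pulling the information from the web page, and I think you are using information embedded in the script. This is the code I'm using. How can I get the result you show in your post?
  • This is the code I'm using. Sorry, the link I posted earlier was broken.
  • @bluethundr The CSV seems fine, but the program you are opening it with does not support ; as a delimiter. Try changing it to a comma, for example. I am on holiday right now, so I cannot look at it closely.
  • Thanks, I used a comma as the delimiter: writer = csv.writer(csvfile, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL) but I got the same result. I am opening the CSV in Excel. The row data repeats 68 times going down, and the unique data spreads across the page instead of down.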
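
The symptom in the last comment (the same data repeated on every line, with all values spread across a single row) is consistent with looking up cells on the whole document (soup.select('td')) inside the row loop instead of on the current row (tr.select('td')). This is a guess based on the description, since the commenter's linked code is not shown here. A minimal sketch on a made-up two-row table shows the difference:

```python
from bs4 import BeautifulSoup

# Made-up two-row table, just large enough to show the effect.
html = '<table><tr><td>a1</td><td>a2</td></tr><tr><td>b1</td><td>b2</td></tr></table>'
soup = BeautifulSoup(html, 'html.parser')

# Cells taken from the whole document: every row repeats all cells.
whole_document = [[td.text for td in soup.select('td')] for tr in soup.select('tr')]

# Cells taken from the current row only: one list per table row.
per_row = [[td.text for td in tr.select('td')] for tr in soup.select('tr')]

print(whole_document)
print(per_row)
```

On a page with 68 rows, the first variant would write 68 identical lines, each containing every cell of the table.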