【Question title】: Python: how to add column headers to an Excel xlsx file when crawling data with BeautifulSoup
【Posted】: 2020-01-14 08:08:41
【Question】:

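Hi, I'm new to coding and I'm trying to get news headlines from cnn.com, laid out like the Excel file in the image attached below.

The problem is that I don't know how to add one column per section, such as World/Politics/Health, and my code only keeps data from the LAST element of the list (in this code, 'politics').

Here is my code. Thanks in advance!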

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup
import requests
import time
import pandas as pd
import os
from bs4 import BeautifulSoup as soup
from bs4 import NavigableString
import re
import xlsxwriter
from openpyxl import Workbook


path = "C:/Users/Desktop/chromedriver.exe"
driver = webdriver.Chrome(path)

# per section

a =['world','health','politics']
wb = Workbook()
ws = wb.active

for i in a:
    nl = []
    driver.get("https://edition.cnn.com/"+str(i))
    driver.implicitly_wait(3)
    html = driver.page_source
    soup = BeautifulSoup(html, "lxml")
    find_ingre = soup.select("span.cd__headline-text")

    for i in find_ingre:
        nl.append(i.get_text())

# make dataframe --> save xlsx

import pandas as pd
from pandas import Series, DataFrame

df = pd.DataFrame(nl)
df.to_excel("cnn_recent_topics.xlsx",index=False)

Current result --->

The result I want --->

【Question comments】:

    Tags: python excel pandas dataframe web-crawler


    【Solution 1】:

    You can try this; please comment if you need an explanation:

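    # Relies on `driver`, `BeautifulSoup` and `pd` from the question's code.
    def custom_scrape(topic):
        # Load one CNN section page and return its headline texts as a list.
        nl = []
        driver.get("https://edition.cnn.com/" + str(topic))
        driver.implicitly_wait(3)
        html = driver.page_source
        soup = BeautifulSoup(html, "lxml")
        find_ingre = soup.select("span.cd__headline-text")

        for i in find_ingre:
            nl.append(i.get_text())

        return nl

    topics = ['world', 'health', 'politics']
    result = pd.DataFrame()
    for topic in topics:
        # Scrape this topic and name its single column after it.
        temp_df = pd.DataFrame(custom_scrape(topic), columns=[topic])
        # axis=1 places topics side by side; ignore_index is left off
        # so the topic names survive as column headers.
        result = pd.concat([result, temp_df], axis=1)

    result.to_excel("cnn_recent_topics.xlsx", index=False)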
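
To see the column-per-topic layout without opening a browser, here is a minimal sketch that swaps the scraping step for made-up headline lists. The helper name `headlines_to_frame` and the sample data are assumptions for illustration only, not part of the original answer. Each list is wrapped in a `pd.Series` so that topics with different numbers of headlines still line up, with the shorter columns padded with NaN.

```python
import pandas as pd


def headlines_to_frame(headlines_by_topic):
    """Build one column per topic from a dict of {topic: list of headlines}.

    Hypothetical helper: wrapping each list in a Series lets pandas align
    columns of different lengths by row position and fill the gaps with NaN.
    """
    return pd.DataFrame(
        {
            topic: pd.Series(titles, dtype="object")
            for topic, titles in headlines_by_topic.items()
        }
    )


# Made-up data standing in for the scraped headlines.
sample = {
    "world": ["World headline A", "World headline B", "World headline C"],
    "health": ["Health headline A", "Health headline B"],
    "politics": ["Politics headline A"],
}

frame = headlines_to_frame(sample)
print(frame)
```

Calling `frame.to_excel("cnn_recent_topics.xlsx", index=False)` on the result should then write the topic names as the header row, provided an Excel writer engine such as openpyxl is installed.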
    

    【Comments】:

    • Hi, the code actually gives me several 'NameError's and I'm not sure why. Could you add some explanation?
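
A NameError means Python met a name that is not defined in the scope where it is used. One likely cause here (an assumption, since no traceback was posted) is running the answer's snippet on its own: it depends on `driver`, `BeautifulSoup` and `pd` from the question's code, and a list created inside a function, like `nl` in `custom_scrape`, is not visible outside it, so the loop has to work with the function's return value. The sketch below shows that scoping rule in isolation; `collect_headlines` and `read_missing_name` are hypothetical names and the data is made up.

```python
def collect_headlines(topic):
    # `nl` is local: it exists only while this function is running.
    nl = [f"{topic} headline {n}" for n in (1, 2)]  # made-up data, no scraping
    return nl


def read_missing_name():
    # No `nl` exists at module level, so looking it up here raises NameError.
    try:
        return nl
    except NameError as exc:
        return f"NameError: {exc}"


error_message = read_missing_name()

# The list is reachable only through the function's return value.
headlines = collect_headlines("world")

print(error_message)
print(headlines)
```

The same reasoning applies to `driver`, `pd` and `BeautifulSoup`: the imports and the `webdriver.Chrome(...)` line from the question need to run before the answer's snippet.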