【Question title】:Creating a dataframe out of an array
【Posted】:2021-11-08 11:16:03
【Question description】:

I have a data-mining script that returns my data in arrays like this:

price_per_m2 = [742.0, 1210.0, 954.0, 1078.0, 910.0, 1553.0, 0, 1.0, 417.0, 553.0, 41.0, 550.0, 367.0, 11.0, 533.0, 2.0, 1139.0, 1466.0, 1042.0, 800.0, 906.0, 60.0, 91.0, 812.0, 412.0, 1000.0, 64.0, 778.0, 63.0, 1043.0, 899.0, 951.0]

type_of_property = ['Магазин', 'Двустаен апартамент', 'Тристаен апартамент', 'Тристаен апартамент', 'Тристаен апартамент', 'Тристаен апартамент', 'Парцел', 'Парцел', 'Гараж', 'Офис', 'Заведение', 'Офис', 'Гараж', 'Парцел', 'Офис', 'Парцел', 'Офис', 'Офис', 'Магазин', 'Магазин', 'Гараж', 'Земеделски имот', 'Парцел', 'Магазин', 'Офис', 'Двустаен апартамент', 'Парцел', 'Магазин', 'Парцел', 'Двустаен апартамент', 'Едностаен апартамент', 'Двустаен апартамент', 'Офис', 'Едностаен апартамент', 'Земеделски имот', 'Офис', 'Едностаен апартамент', 'Едностаен апартамент', 'Магазин', 'Двустаен апартамент', 'Офис', 'Двустаен апартамент', 'Едностаен апартамент', 'Двустаен апартамент']
  • Please note that the two arrays may not be of equal length, because I did not paste the full arrays; they are too long.

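The final goal is to create one Excel file per day from all the arrays (the data is extracted daily).

For now, though, the goal is to:

  • Create a pandas array out of one of the above arrays
  • Save that array to an Excel file.

What I have done so far: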

df_price_per_m2 = pd.DataFrame(data=price_per_m2)
df_type_of_property = pd.DataFrame(type_of_property)

df_price_per_m2.to_excel('sqm.xlsx')
df_type_of_property.to_excel('sqm.xlsx')

As you can see, I have tried both with and without the "data=" keyword. My program returns an error on the first line of this code.

The full program:

from requests_html import HTMLSession
from bs4 import BeautifulSoup
import pandas as pd
import re
import os

s = HTMLSession()
url = 'https://www.imoti.net/bg/obiavi/r/prodava/sofia/?page=1&sid=fSNNpb'

r = s.get(url)
soup_for_last_page = BeautifulSoup(r.text, 'html.parser')


# Get all the data from the page
def getdata(url):
    r = s.get(url)
    soup = BeautifulSoup(r.text, 'html.parser')
    # print(soup)
    return soup


def getnextpage(soup):
    page = soup.find('nav', {'class': 'paginator'})
    if page.find('a', {'class': 'next-page-btn'}):
        url = str(page.find('a', {'class': 'next-page-btn'})['href'])
        return url
    else:
        return


last_page = soup_for_last_page.find('a', {'class': 'last-page'})
last_page_number = int(last_page.get_text())

urls = []
for page in range(1, last_page_number + 1):
    url = f'https://www.imoti.net/bg/obiavi/r/prodava/sofia/?page={page}&sid=fSNNpb'
    urls.append(url)


# while True:
#     soup = getdata(url)
#     url = getnextpage(soup)
#     if not url:
#         break
#     urls.append(url)
#     #print(url)


prices = []
type_of_property = []
sqm_area = []
locations = []
publisher = []
price_per_m2 = []


def price_per_m2_0(x):
    if x.get_text().strip().find('/:') == -1:
        return 0
    else:
        return float(x.get_text().strip().split('/:')[1].strip().replace('EUR', '').strip().replace(' ', ''))


def get_sqm(links):
    for i in links:
        soup = getdata(i)
        for sqm in soup.find('ul', {'class': 'list-view real-estates'}).find_all('div', {'class': 'inline-group'}):
            sqm_value = sqm.get_text().split(',')[1].split()[0]
            sqm_area.append(sqm_value)
    return sqm_area


def get_location(links):
    for i in links:
        soup = getdata(i)
        for location in soup.find('ul', {'class': 'list-view real-estates'}).find_all('div', {'class': 'inline-group'}):
            location_value = location.get_text().split(',')[-1].strip()
            locations.append(location_value)
    return locations


def get_type(links):
    for i in links:
        soup = getdata(i)
        for property_type in soup.find('ul', {'class': 'list-view real-estates'}).find_all('div', {'class': 'inline-group'}):
            property_type_value = ' '.join(
                property_type.get_text().split(',')[0].split()[1:3])
            type_of_property.append(property_type_value)
    return type_of_property


def get_publisher(links):
    for i in links:
        soup = getdata(i)
        for publish in soup.find('ul', {'class': 'list-view real-estates'}).find_all('span', {'class': 're-offer-type'})[1::2]:
            publish_value = publish.get_text().strip()
            publisher.append(publish_value)
    return publisher


def get_price_per_m2(links):
    for i in links:
        soup = getdata(i)
        for price_per_m2_ in soup.find('ul', {'class': 'list-view real-estates'}).find_all('ul', {'class': 'parameters'}):
            price_per_m2_value = price_per_m2_0(price_per_m2_)
            price_per_m2.append(price_per_m2_value)
    return price_per_m2


def total_price(links):
    for i in links:
        soup = getdata(i)
        for price in soup.find('ul', {'class': 'list-view real-estates'}).find_all('strong', {'class': 'price'}):
            price_text = price.get_text()
            price_arr = re.findall('[0-9]+', price_text)
            final_price = ''
            for each_sub_price in price_arr:
                final_price += each_sub_price
            prices.append(final_price)
    return prices


print(get_sqm(urls))
print(get_location(urls))
print(get_type(urls))
print(get_publisher(urls))
print(get_price_per_m2(urls))
print(total_price(urls))

df_get_sqm = pd.DataFrame(data=get_sqm)
df_get_location = pd.DataFrame(get_location)
df_get_type = pd.DataFrame(get_type)
df_get_publisher = pd.DataFrame(get_publisher)
df_get_price_per_m2 = pd.DataFrame(get_price_per_m2)
df_total_price = pd.DataFrame(total_price)

df_get_sqm.to_excel('sqm.xlsx')

Edit: the error message I get:

Traceback (most recent call last):
  File "/Users/tdonov/Desktop/Python/Realestate Scraper/real_estate_test.py", line 130, in <module>
    df_get_sqm = pd.DataFrame(data=get_sqm)
  File "/opt/anaconda3/lib/python3.8/site-packages/pandas/core/frame.py", line 590, in __init__
    raise ValueError("DataFrame constructor not properly called!")
ValueError: DataFrame constructor not properly called!
[Finished in 70.198s]
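
The traceback points at `pd.DataFrame(data=get_sqm)`, where `get_sqm` is the function object itself rather than the list it returns. A minimal sketch of the difference, using a hypothetical stand-in `get_sqm_stub` that returns fixed values instead of scraping:

```python
import pandas as pd


def get_sqm_stub():
    """Hypothetical stand-in for get_sqm(urls); returns fixed values, no network."""
    return ['55', '72', '110']


# Passing the function object itself reproduces the ValueError from the traceback.
try:
    pd.DataFrame(data=get_sqm_stub)
    constructor_failed = False
except ValueError:
    constructor_failed = True

# Calling the function first hands pandas a plain list, which it accepts.
df_sqm = pd.DataFrame(data=get_sqm_stub(), columns=['sqm'])
```

In the full program the module-level list `sqm_area` has already been filled by the earlier `print(get_sqm(urls))` call, so passing `sqm_area` to the constructor should work without scraping the pages a second time.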

【Comments】:

  • What is the error?
  • Paste the complete error message
  • Error message: Traceback (most recent call last): File "/Users/tdonov/Desktop/Python/Realestate Scraper/real_estate_test.py", line 130, in <module> df_get_sqm = pd.DataFrame(data=get_sqm) File "/opt/anaconda3/lib/python3.8/site-packages/pandas/core/frame.py", line 590, in __init__ raise ValueError("DataFrame constructor not properly called!") ValueError: DataFrame constructor not properly called! [Finished in 70.198s]

Tags: python arrays pandas export-to-excel


【Solution 1】:

Try:

df_price_per_m2 = pd.DataFrame(data={'price':price_per_m2})

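【Comments】:

  • Unfortunately that is not the solution. I get this error: Traceback (most recent call last): File "/opt/anaconda3/lib/python3.8/site-packages/pandas/core/internals/construction.py", line 80, in arrays_to_mgr index = extract_index(arrays) File "/opt/anaconda3/lib/python3.8/site-packages/pandas/core/internals/construction.py", line 391, in extract_index raise ValueError("If using all scalar values, you must pass an index") ValueError: If using all scalar values, you must pass an index
  • At least you no longer have the constructor problem. Check whether your arrays are really arrays.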
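
One way to act on the last comment is to check each value before building the DataFrame. The sketch below uses `pandas.api.types.is_list_like`; the helper name `frame_from_lists` is made up for illustration and is not part of pandas:

```python
import pandas as pd
from pandas.api.types import is_list_like


def frame_from_lists(columns):
    """Build a DataFrame from a dict of lists, rejecting non-list values early.

    `columns` maps a column name to its values. A scalar or a function object
    as a value is the kind of input that leads to
    "If using all scalar values, you must pass an index".
    """
    bad = [name for name, values in columns.items() if not is_list_like(values)]
    if bad:
        raise TypeError(f"expected list-like values for: {', '.join(bad)}")
    return pd.DataFrame(columns)


# A real list passes the check and becomes a single 'price' column.
df_ok = frame_from_lists({'price': [742.0, 1210.0, 954.0]})
```

Passing a function such as `get_price_per_m2` (without calling it) or a single number as the value would raise the `TypeError` here, which names the offending column instead of surfacing pandas' more general message.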
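【Solution 2】:

Create a pandas array out of one of the above arrays

Note that something like [1,2,3] is usually called a list in Python, not an array. If you have a single flat list (such as your price_per_m2), a pandas.Series is enough. Try the following: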

import pandas as pd
price_per_m2 = [742.0, 1210.0, 954.0, 1078.0, 910.0, 1553.0, 0, 1.0, 417.0, 553.0, 41.0, 550.0, 367.0, 11.0, 533.0, 2.0, 1139.0, 1466.0, 1042.0, 800.0, 906.0, 60.0, 91.0, 812.0, 412.0, 1000.0, 64.0, 778.0, 63.0, 1043.0, 899.0, 951.0]
s = pd.Series(price_per_m2)
s.to_excel('sqm.xlsx')

If you want to know more about writing a pandas.Series to an Excel file, read the pandas.Series.to_excel docs.
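
For the question's end goal of one Excel file per day built from all the lists, the Series idea extends to several columns. The following is only a sketch: the `listings_` file prefix and the helper names are assumptions, and wrapping each list in `pd.Series` is a choice made here so that lists of unequal length are padded with NaN instead of raising.

```python
from datetime import date

import pandas as pd


def combine_lists(columns):
    """Return one DataFrame with a column per list; shorter lists are padded with NaN."""
    return pd.DataFrame({name: pd.Series(values) for name, values in columns.items()})


def daily_filename(day, prefix='listings_'):
    """Build a date-stamped file name (hypothetical naming scheme)."""
    return f'{prefix}{day.isoformat()}.xlsx'


def save_daily(df, day=None):
    """Write the DataFrame to a date-stamped Excel file and return the file name.

    Writing .xlsx requires an Excel engine such as openpyxl to be installed.
    """
    name = daily_filename(day or date.today())
    df.to_excel(name, index=False)
    return name


# Shortened sample data from the question; the second list is deliberately shorter.
df_all = combine_lists({
    'price_per_m2': [742.0, 1210.0, 954.0],
    'type_of_property': ['Магазин', 'Двустаен апартамент'],
})
```

Calling `save_daily(df_all)` would then write that day's file. Padding hides a mismatch between columns, so if the lists are meant to describe the same listings it may be better to check that their lengths agree first.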

【Comments】:
