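[Posted]: 2017-07-08 10:16:52
[Question]:
Below is my code for scraping a website. I need to build a DataFrame from arrays of unequal length. For example, Property_Type varies in length: some listings have one property type, some have two, and some have three. Likewise, Agency_Name varies in length.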
import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
urls = []
for i in range(1,3):
    pages = "http://www.realcommercial.com.au/for-sale/property-offices-retail-in-vic/list-{0}?includePropertiesWithin=includesurrounding&activeSort=list-date&autoSuggest=true".format(i)
    urls.append(pages)
Data = []
for info in urls:
    page = requests.get(info)
    soup = BeautifulSoup(page.content, 'html.parser')
    links = soup.find_all('a', attrs ={'class' :'details-panel'})
    hrefs = [link['href'] for link in links]
    for href in hrefs:
        pages = requests.get(href)
        soup_2 =BeautifulSoup(pages.content, 'html.parser')
        Address_1 = soup_2.find_all('p', attrs={'class' :'full-address'})
        Address = [Address.text.strip() for Address in Address_1]
        Prop_Type = soup_2.find_all('div', attrs={'class' :'propType ellipsis'})
        Property_Type = [Property_Type.text.strip() for Property_Type in Prop_Type]
        Agency_1=soup_2.find_all('div', attrs={'class' :'agencyName ellipsis'})
        Agency_Name=[Agency_Name.text.strip() for Agency_Name in Agency_1]
        Agent_1=soup_2.find_all('div', attrs={'class' :'agentName ellipsis'})
        Agent_Name=[Agent_Name.text.strip() for Agent_Name in Agent_1]
        raw_data = dict(A=np.array(Address),B=np.array(Property_Type),C=np.array(Agency_Name),D=np.array(Agent_Name))
        raw_df = pd.DataFrame(dict([ k,series(v) for k,v in raw_data.iteritems() ]))
The error I get is:
File "<ipython-input-8-3a7c5fc4fb93>", line 32
raw_df = pd.DataFrame(dict([ k,series(v) for k,v in raw_data.iteritems() ]))
^
SyntaxError: invalid syntax
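The caret in the traceback points into the list comprehension: a bare `k, series(v)` is not valid as a comprehension element, so each pair needs parentheses. Separately, `series` is never defined (the pandas class is `pd.Series`), and `dict.iteritems()` exists only on Python 2. A minimal sketch of a corrected construction follows; the sample values are made up for illustration and are not scraped data.

```python
import pandas as pd

# Hypothetical values standing in for one scraped listing;
# the lists deliberately have different lengths.
raw_data = {
    "A": ["1 Example Street, Melbourne"],
    "B": ["Offices", "Retail"],
    "C": ["Agency One", "Agency Two", "Agency Three"],
    "D": ["Agent X"],
}

# A dict comprehension avoids the unparenthesized tuple altogether.
# pd.Series replaces the undefined name `series`, and .items() works on Python 3.
raw_df = pd.DataFrame({k: pd.Series(v) for k, v in raw_data.items()})
print(raw_df)
```

Because each column is a Series with its own index, pandas aligns them and pads the shorter ones with NaN, so the frame has as many rows as the longest list.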
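What should I do to get a DataFrame in which each value ends up in its own column, so that property types appear in the property type column rather than in the agency name column?
Any help would be appreciated. Thanks!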
[Discussion]:
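One way to keep each value under its own column is to build one row per listing and join multiple values into a single string, instead of padding columns with NaN. Note that the loop in the question reassigns `raw_data` on every iteration, so only the last listing would survive; the otherwise unused `Data` list can collect the rows. The sketch below assumes that layout; `listing_to_row`, the column names, and the sample values are hypothetical.

```python
import pandas as pd

COLUMNS = ["Address", "Property_Type", "Agency_Name", "Agent_Name"]

def listing_to_row(address, property_type, agency_name, agent_name):
    """Collapse the per-listing lists into one flat record (hypothetical helper)."""
    fields = [address, property_type, agency_name, agent_name]
    # Join each list so a listing with two or three values still fills one cell.
    return {name: "; ".join(values) for name, values in zip(COLUMNS, fields)}

# Made-up stand-ins for the lists produced inside the `for href in hrefs` loop.
scraped = [
    (["1 Example Street"], ["Offices", "Retail"], ["Agency One"], ["Agent X", "Agent Y"]),
    (["2 Sample Road"], ["Offices"], ["Agency One", "Agency Two"], ["Agent Z"]),
]

Data = []
for address, property_type, agency_name, agent_name in scraped:
    Data.append(listing_to_row(address, property_type, agency_name, agent_name))

# One row per listing; every value stays under the column it was scraped for.
df = pd.DataFrame(Data, columns=COLUMNS)
print(df)
```

In the scraper itself, the `Data.append(...)` call would sit at the end of the inner loop and the DataFrame would be built once, after both loops finish.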
Tags: python arrays pandas numpy