如何将 BeautifulSoup 结果中的 URL 存储到列表中，然后存储到表中答案

【问题标题】：How to store URL from BeautifulSoup results to a list and then to a table如何将 BeautifulSoup 结果中的 URL 存储到列表中，然后存储到表中
【发布时间】：2019-09-30 21:20:02
【问题描述】：

我正在抓取一个房地产网页，试图获取一些 URL，然后创建一个表格。 https://www.zonaprop.com.ar/locales-comerciales-alquiler-palermo-hollywood-0-ambientes-publicado-hace-menos-de-1-mes.html 我有几天在努力

将结果存储到列表或字典中，然后
创建表但我真的卡住了

from bs4 import BeautifulSoup
import requests
import re
source=requests.get('https://www.zonaprop.com.ar/locales-comerciales-alquiler-palermo-hollywood-0-ambientes-publicado-hace-menos-de-1-mes.html').text
soup=BeautifulSoup(source,'lxml')


#Extract URL 
link_text = ''
URL=[]
PlacesDf = pd.DataFrame(columns=['Address', 'Location.lat', 'Location.lon'])
for a in soup.find_all('a', attrs={'href': re.compile("/propiedades/")}):
  link_text = a['href']
  URL='https://www.zonaprop.com.ar'+link_text
  print(URL)

好的，输出对我来说没问题：

https://www.zonaprop.com.ar/propiedades/local-en-alquiler-soler-6000-palermo-hollywood-a-44227001.html#map
https://www.zonaprop.com.ar/propiedades/local-en-alquiler-soler-6000-palermo-hollywood-a-44227001.html
https://www.zonaprop.com.ar/propiedades/local-en-alquiler-soler-6000-palermo-hollywood-a-44227001.html
https://www.zonaprop.com.ar/propiedades/excelente-esquina-en-alquiler-s-lote-propio-con-43776599.html
https://www.zonaprop.com.ar/propiedades/excelente-esquina-en-alquiler-s-lote-propio-con-43776599.html
https://www.zonaprop.com.ar/propiedades/excelente-esquina-en-alquiler-s-lote-propio-con-43776599.html
https://www.zonaprop.com.ar/propiedades/excelente-local-en-alquiler-palermo-hollywood-fitz-44505027.html#map
https://www.zonaprop.com.ar/propiedades/excelente-local-en-alquiler-palermo-hollywood-fitz-44505027.html
https://www.zonaprop.com.ar/propiedades/excelente-local-en-alquiler-palermo-hollywood-fitz-44505027.html
https://www.zonaprop.com.ar/propiedades/local-palermo-hollywood-44550855.html#map
https://www.zonaprop.com.ar/propiedades/local-palermo-hollywood-44550855.html
https://www.zonaprop.com.ar/propiedades/local-palermo-hollywood-44550855.html
https://www.zonaprop.com.ar/propiedades/local-comercial-o-edificio-corporativo-oficinas-500-43164952.html
https://www.zonaprop.com.ar/propiedades/local-comercial-o-edificio-corporativo-oficinas-500-43164952.html
https://www.zonaprop.com.ar/propiedades/local-comercial-o-edificio-corporativo-oficinas-500-43164952.html
https://www.zonaprop.com.ar/propiedades/local-palermo-viejo-44622843.html#map
https://www.zonaprop.com.ar/propiedades/local-palermo-viejo-44622843.html
https://www.zonaprop.com.ar/propiedades/local-palermo-viejo-44622843.html
https://www.zonaprop.com.ar/propiedades/alquiler-de-local-comercial-en-palermo-hollywood-44571635.html#map
https://www.zonaprop.com.ar/propiedades/alquiler-de-local-comercial-en-palermo-hollywood-44571635.html
https://www.zonaprop.com.ar/propiedades/alquiler-de-local-comercial-en-palermo-hollywood-44571635.html

问题是输出是真实的链接（您可以点击它们并转到页面）

但是当我尝试将它存储在一个新变量中时（列表或字典，列名为“地址”以加入“PlacesDf”（相同的列名“地址”））/转换为表/或任何我不能的技巧找到解决方案。事实上，当我尝试转换为 pandas 时：

Address = pd.dataframe(URL)

它只创建一个单行表。

我希望看到类似的东西

Adresses=['https://www.zonaprop.com.ar/propiedades/local-en-alquiler-soler-6000-palermo-hollywood-a-44227001.html#map','
https://www.zonaprop.com.ar/propiedades/local-en-alquiler-soler-6000-palermo-hollywood-a-44227001.html',...]

或字典或任何我可以用 pandas 转到表的东西

【问题讨论】：

标签： python pandas url beautifulsoup

【解决方案1】：

您应该执行以下操作：

from bs4 import BeautifulSoup
import requests
import re
import pandas as pd


source=requests.get('https://www.zonaprop.com.ar/locales-comerciales-alquiler-palermo-hollywood-0-ambientes-publicado-hace-menos-de-1-mes.html').text
soup=BeautifulSoup(source,'lxml')

#Extract URL
all_url = [] 
link_text = ''
PlacesDf = pd.DataFrame(columns=['Address', 'Location.lat', 'Location.lon'])
for a in soup.find_all('a', attrs={'href': re.compile("/propiedades/")}):
  link_text = a['href']
  URL='https://www.zonaprop.com.ar'+link_text
  print(URL)
  all_url.append(URL)

df = pd.DataFrame({"URLs":all_url}) #replace "URLs" with your desired column name

希望对你有帮助

【讨论】：

谢谢，非常有用，我有一个问题，有行大小限制吗？因为 URL 看起来不完整？
@johnGuarenas 是的，您可以使用以下行增加列宽pd.set_option('max_colwidth',-1000)

【解决方案2】：

我不知道您从哪里得到 lat 和 lon，我正在对地址进行假设。我可以看到您当前的网址返回中有很多重复项。我建议以下 css 选择器仅针对列表链接。这些类选择器比您当前的方法快得多。

使用返回的链接列表的 len 来定义行维度，并且您已经有了列。

from bs4 import BeautifulSoup as bs
import requests
import pandas as pd
import re

r = requests.get('https://www.zonaprop.com.ar/locales-comerciales-alquiler-palermo-hollywood-0-ambientes-publicado-hace-menos-de-1-mes.html')
soup = bs(r.content, 'lxml') #'html.parser'
links = ['https://www.zonaprop.com.ar' + item['href'] for item in soup.select('.aviso-data-title a')]
locations = [re.sub('\n|\t','',item.text).strip() for item in soup.select('.aviso-data-location')]
df = pd.DataFrame(index=range(len(links)),columns= ['Address', 'Lat', 'Lon', 'Link'])
df.Link = links
df.Address = locations
print(df)

【讨论】：