【Question title】:How to store URL from BeautifulSoup results to a list and then to a table
【Posted】:2019-09-30 21:20:02
【Question】:

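I'm scraping a real-estate web page, trying to get some URLs and then build a table. https://www.zonaprop.com.ar/locales-comerciales-alquiler-palermo-hollywood-0-ambientes-publicado-hace-menos-de-1-mes.html I have been struggling for several days to

  1. store the results in a list or dictionary, and then
  2. create the table.

But I'm really stuck:
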
from bs4 import BeautifulSoup
import requests
import re
import pandas as pd
source=requests.get('https://www.zonaprop.com.ar/locales-comerciales-alquiler-palermo-hollywood-0-ambientes-publicado-hace-menos-de-1-mes.html').text
soup=BeautifulSoup(source,'lxml')


#Extract URL 
link_text = ''
URL=[]
PlacesDf = pd.DataFrame(columns=['Address', 'Location.lat', 'Location.lon'])
for a in soup.find_all('a', attrs={'href': re.compile("/propiedades/")}):
  link_text = a['href']
  URL='https://www.zonaprop.com.ar'+link_text
  print(URL)

The output looks fine to me:

https://www.zonaprop.com.ar/propiedades/local-en-alquiler-soler-6000-palermo-hollywood-a-44227001.html#map
https://www.zonaprop.com.ar/propiedades/local-en-alquiler-soler-6000-palermo-hollywood-a-44227001.html
https://www.zonaprop.com.ar/propiedades/local-en-alquiler-soler-6000-palermo-hollywood-a-44227001.html
https://www.zonaprop.com.ar/propiedades/excelente-esquina-en-alquiler-s-lote-propio-con-43776599.html
https://www.zonaprop.com.ar/propiedades/excelente-esquina-en-alquiler-s-lote-propio-con-43776599.html
https://www.zonaprop.com.ar/propiedades/excelente-esquina-en-alquiler-s-lote-propio-con-43776599.html
https://www.zonaprop.com.ar/propiedades/excelente-local-en-alquiler-palermo-hollywood-fitz-44505027.html#map
https://www.zonaprop.com.ar/propiedades/excelente-local-en-alquiler-palermo-hollywood-fitz-44505027.html
https://www.zonaprop.com.ar/propiedades/excelente-local-en-alquiler-palermo-hollywood-fitz-44505027.html
https://www.zonaprop.com.ar/propiedades/local-palermo-hollywood-44550855.html#map
https://www.zonaprop.com.ar/propiedades/local-palermo-hollywood-44550855.html
https://www.zonaprop.com.ar/propiedades/local-palermo-hollywood-44550855.html
https://www.zonaprop.com.ar/propiedades/local-comercial-o-edificio-corporativo-oficinas-500-43164952.html
https://www.zonaprop.com.ar/propiedades/local-comercial-o-edificio-corporativo-oficinas-500-43164952.html
https://www.zonaprop.com.ar/propiedades/local-comercial-o-edificio-corporativo-oficinas-500-43164952.html
https://www.zonaprop.com.ar/propiedades/local-palermo-viejo-44622843.html#map
https://www.zonaprop.com.ar/propiedades/local-palermo-viejo-44622843.html
https://www.zonaprop.com.ar/propiedades/local-palermo-viejo-44622843.html
https://www.zonaprop.com.ar/propiedades/alquiler-de-local-comercial-en-palermo-hollywood-44571635.html#map
https://www.zonaprop.com.ar/propiedades/alquiler-de-local-comercial-en-palermo-hollywood-44571635.html
https://www.zonaprop.com.ar/propiedades/alquiler-de-local-comercial-en-palermo-hollywood-44571635.html

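The thing is, the output consists of real links (you can click them and go to the page).

But when I try to store it in a new variable (a list or a dictionary with the column name "Address", to join with "PlacesDf", which has the same column name "Address"), convert it to a table, or try any other trick, I can't find a solution. In fact, when I try to convert it with pandas: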

Address = pd.dataframe(URL) 

it only creates a single-row table.

I would like to see something like

Adresses=['https://www.zonaprop.com.ar/propiedades/local-en-alquiler-soler-6000-palermo-hollywood-a-44227001.html#map',
'https://www.zonaprop.com.ar/propiedades/local-en-alquiler-soler-6000-palermo-hollywood-a-44227001.html',...]

or a dictionary, or anything I can turn into a table with pandas.

【Comments】:

    Tags: python pandas url beautifulsoup


    【Solution 1】:

    You should do the following:

    from bs4 import BeautifulSoup
    import requests
    import re
    import pandas as pd
    
    
    source=requests.get('https://www.zonaprop.com.ar/locales-comerciales-alquiler-palermo-hollywood-0-ambientes-publicado-hace-menos-de-1-mes.html').text
    soup=BeautifulSoup(source,'lxml')
    
    #Extract URL
    all_url = [] 
    link_text = ''
    PlacesDf = pd.DataFrame(columns=['Address', 'Location.lat', 'Location.lon'])
    for a in soup.find_all('a', attrs={'href': re.compile("/propiedades/")}):
      link_text = a['href']
      URL='https://www.zonaprop.com.ar'+link_text
      print(URL)
      all_url.append(URL)
    
    df = pd.DataFrame({"URLs":all_url}) #replace "URLs" with your desired column name
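
The original loop ends up with a single value because `URL = ...` rebinds the name to a new string on every pass, so only the last link survives, while `append` keeps them all. A minimal sketch of the difference, using a hard-coded `hrefs` list as a stand-in for the `a['href']` values so that it runs without a network request (`BASE`, `hrefs`, and `rows` are illustrative names, not part of the original code):

```python
BASE = 'https://www.zonaprop.com.ar'

# Stand-in for the href values found by soup.find_all(...);
# the paths are taken from the output shown in the question.
hrefs = [
    '/propiedades/local-palermo-hollywood-44550855.html#map',
    '/propiedades/local-palermo-hollywood-44550855.html',
    '/propiedades/local-palermo-viejo-44622843.html',
]

# Reassignment: the name is rebound each time, so only the last URL is left.
URL = []
for href in hrefs:
    URL = BASE + href

# Appending: every URL is kept.
all_url = []
for href in hrefs:
    all_url.append(BASE + href)

# One dict per row, keyed by the column name used in PlacesDf.
rows = [{'Address': url} for url in all_url]

print(type(URL).__name__, len(all_url), len(rows))
```

Passing `rows` (or `{'Address': all_url}`) to `pd.DataFrame` should then give one row per link.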
    

    Hope this helps.

    【Comments】:

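    • Thanks, very useful. One question: is there a row size limit? The URLs look incomplete.
    • @johnGuarenas Yes, but the limit only affects how the table is printed, not the stored values. You can increase the column width with pd.set_option('max_colwidth',-1000). On pandas 1.0 and later, negative values are deprecated; use pd.set_option('display.max_colwidth', None) instead.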
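
The truncation is only in how pandas prints the frame; the stored strings are complete. A short sketch, assuming pandas 1.0 or later, where `None` for `display.max_colwidth` means no limit (the one-row `df` is a made-up example built from a URL in the question):

```python
import pandas as pd

url = ('https://www.zonaprop.com.ar/propiedades/'
       'local-en-alquiler-soler-6000-palermo-hollywood-a-44227001.html')
df = pd.DataFrame({'URLs': [url]})

# The cell holds the whole string regardless of display settings.
stored = df.loc[0, 'URLs']

# With a 50-character limit, long cells are cut and end in "..." when printed.
with pd.option_context('display.max_colwidth', 50):
    short_text = repr(df)

# Lift the limit for printing only; the setting reverts after the block.
with pd.option_context('display.max_colwidth', None):
    full_text = repr(df)

print(short_text)
print(full_text)
```

Using `pd.option_context` keeps the change local; `pd.set_option('display.max_colwidth', None)` applies it for the rest of the session.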
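    【Solution 2】:

    I don't know where you are getting lat and lon from, and I'm making an assumption about the address. I can see a lot of duplicates in your current URL results. I'd suggest the following CSS selectors, which target only the listing links. These class selectors are also much faster than your current approach.

    Use the len of the returned list of links to define the row dimension; you already have the columns.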

    from bs4 import BeautifulSoup as bs
    import requests
    import pandas as pd
    import re
    
    r = requests.get('https://www.zonaprop.com.ar/locales-comerciales-alquiler-palermo-hollywood-0-ambientes-publicado-hace-menos-de-1-mes.html')
    soup = bs(r.content, 'lxml') #'html.parser'
    links = ['https://www.zonaprop.com.ar' + item['href'] for item in soup.select('.aviso-data-title a')]
    locations = [re.sub('\n|\t','',item.text).strip() for item in soup.select('.aviso-data-location')]
    df = pd.DataFrame(index=range(len(links)),columns= ['Address', 'Lat', 'Lon', 'Link'])
    df.Link = links
    df.Address = locations
    print(df)
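
The duplicates in the original output come from each listing being linked several times, in some cases with a `#map` fragment. If you stay with the regex approach, a small post-processing step can collapse them. A sketch using only the standard library; `unique_listing_urls` is a hypothetical helper name, and the `found` list is copied from the question's output:

```python
from urllib.parse import urldefrag


def unique_listing_urls(urls):
    """Drop '#fragment' suffixes and repeated URLs, keeping first-seen order."""
    stripped = (urldefrag(u).url for u in urls)
    # dict keys are unique and preserve insertion order (Python 3.7+).
    return list(dict.fromkeys(stripped))


found = [
    'https://www.zonaprop.com.ar/propiedades/local-palermo-hollywood-44550855.html#map',
    'https://www.zonaprop.com.ar/propiedades/local-palermo-hollywood-44550855.html',
    'https://www.zonaprop.com.ar/propiedades/local-palermo-hollywood-44550855.html',
    'https://www.zonaprop.com.ar/propiedades/local-palermo-viejo-44622843.html#map',
    'https://www.zonaprop.com.ar/propiedades/local-palermo-viejo-44622843.html',
]

links = unique_listing_urls(found)
print(links)
```

The resulting `links` list can be assigned to a DataFrame column the same way as above.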
    

    【Comments】:
