【Question title】: How would I go about getting info from a list of links and then dumping it into a JSON object?
【Posted】: 2017-12-11 06:32:59
【Question】:

I'm new to Python and BeautifulSoup. Any help is much appreciated.

I know how to build a list of one company's information, but only after clicking a single link.

import requests
from bs4 import BeautifulSoup


url = "http://data-interview.enigmalabs.org/companies/"
r = requests.get(url)

# Name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(r.content, "html.parser")

links = soup.find_all("a")

link_list = []

for link in links:
    print(link.get("href"), link.text)

g_data = soup.find_all("div", {"class": "table-responsive"})

for link in links:
    link_list.append(link)

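Can anyone tell me how to scrape the links first and then build a JSON of all the company listing data for the site?

I have also attached a sample image for better visualization.

How can I scrape the site and build a JSON like the example below without clicking each individual link?

Example of expected output: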

all_listing = [{"Dickens-Tillman": {'Company Detail':
 {'Company Name': 'Dickens-Tillman',
  'Address Line 1   ': '7147 Guilford Turnpike Suit816',
  'Address Line 2   ': 'Suite 708',
  'City': 'Connfurt',
  'State': 'Iowa',
  'Zipcode  ': '22598',
  'Phone': '00866539483',
  'Company Website  ': 'lockman.com',
  'Company Description': 'enable robust paradigms'}}},
 {"Klein-Powlowski": {'Company Detail':
 {'Company Name': 'Klein-Powlowski',
  'Address Line 1   ': '32746 Gaylord Harbors',
  'Address Line 2   ': 'Suite 866',
  'City': 'Lake Mario',
  'State': 'Kentucky',
  'Zipcode  ': '45517',
  'Phone': '1-299-479-5649',
  'Company Website  ': 'marquardt.biz',
  'Company Description': 'monetize scalable paradigms'}}}]

print(all_listing)

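【Comments on the question】:

  • Hmm... can you give us the actual URL?
  • @cᴏʟᴅsᴘᴇᴇᴅ Yes, no problem, the actual URL is link
  • Ah, this looks like a job for selenium + bs4.
  • Is the company information shown on the page of links, or only on the separate pages (one per company)?
  • @jlaur You're right. That's why I'm confused. It would be easier if it were all on one page, but I don't know how to get all the information as the site currently stands.

Tags: python json beautifulsoup web-mining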


【Solution 1】:

This is my final solution to the question I asked.

import json
from os.path import basename as bn
from urllib.parse import urljoin

import bs4
import requests

links = []
data = {}
base = 'http://data-interview.enigmalabs.org/'

# Approach
# 1. Visit each listing page and collect the company links.
# 2. Iterate over each link in the list.
# 3. Before moving on, check that the list of links is correct;
#    if it is not, review step 2, then step 1.
# 4. Write the collected data to a JSON file.


def bs(r):
    # Fetch a path relative to base and return the first <table> on that page
    html = requests.get(urljoin(base, r)).content
    return bs4.BeautifulSoup(html, 'html.parser').find('table')


# Collect the href of every <a> in the table on listing pages 1 to 10
for i in range(1, 11):
    print('Collecting page %d' % i)
    links += [a['href'] for a in bs('companies?page=%d' % i).find_all('a')]

# Now that all links are in one list, iterate over each link.
# All the info is within an HTML table, so collect each row into data.
for link in links:
    print('Processing %s' % link)
    name = bn(link)
    data[name] = {}
    for row in bs(link).find_all('tr'):
        # Each row is expected to hold exactly two cells: label and value
        desc, cont = row.find_all('td')
        data[name][desc.text] = cont.text

print(json.dumps(data))

# Final step: write the formatted data to a file
with open('solution.json', 'w') as f:
    f.write(json.dumps(data, indent=4))
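
The data dictionary built above is keyed by company with the fields directly underneath, while the question asks for a list of single-key objects wrapped in 'Company Detail'. A minimal sketch of that reshaping follows, assuming data has the shape the loop above produces. The names to_all_listing and sample are hypothetical and introduced here only for illustration; the sample values are copied from the expected output in the question, and stripping whitespace from labels is an assumption about what the asker wants.

```python
import json


def to_all_listing(data):
    """Reshape {name: {field: value}} into [{name: {'Company Detail': {...}}}]."""
    listing = []
    for name, fields in data.items():
        # Strip stray whitespace around scraped field labels and values
        detail = {key.strip(): value.strip() for key, value in fields.items()}
        listing.append({name: {"Company Detail": detail}})
    return listing


# Hypothetical sample standing in for the scraped `data` dictionary
sample = {
    "Dickens-Tillman": {
        "Company Name": "Dickens-Tillman",
        "Zipcode  ": "22598",
        "Phone": "00866539483",
    },
}

all_listing = to_all_listing(sample)
print(json.dumps(all_listing, indent=4))
```

In the solution, the dictionary key comes from the last segment of each link, so it may differ from the 'Company Name' field; if the display name is wanted as the key, it could be read from the detail fields instead.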

【Comments】:
