【Question title】:Scraping information from library catalog
【Posted】:2018-10-30 02:17:08
【Question description】:

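I'm working on a project to scrape catalog information for books from specific libraries. So far, my script can scrape all the cells from the table. However, I'm stuck on how to return only the cells for the New Britain library.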

import requests
from bs4 import BeautifulSoup

mypage = 'http://lci-mt.iii.com/iii/encore/record/C__Rb1872125__S%28*%29%20f%3Aa%20c%3A47__P0%2C3__Orightresult__U__X6?lang=eng&suite=cobalt'
response = requests.get(mypage)

soup = BeautifulSoup(response.text, 'html.parser')

data = []
table = soup.find('table', attrs={'class':'itemTable'})


rows = table.find_all('tr')
for row in rows:
    cols = row.find_all('td')
    cols = [ele.text.strip() for ele in cols]
    data.append([ele for ele in cols if ele]) # Get rid of empty values

for index, libraryinfo in enumerate(data):
    print(index, libraryinfo)

Here is sample output from the script for the New Britain library:

["New Britain, Main Library - Children's Department", 'J FIC PALACIO', 'Check Shelf']

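Instead of returning all the cells, how would I return only the cells that relate to the New Britain library? I also only want the library name and the checkout status.

The desired output is: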

["New Britain, Main Library - Children's Department", 'Check Shelf']

There can be multiple matching rows, since a book can have multiple copies at the same library.

【Comments】:

    Tags: python beautifulsoup screen-scraping


    【Solution 1】:

    To simply filter the data by a specific field (the first one in your example), you can build a list comprehension:

    [element for element in data if 'New Britain' in element[0]]
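One thing to watch with that comprehension: any row without `<td>` cells (a header row, for instance) lands in `data` as an empty list, and `element[0]` would then raise `IndexError`. Below is a minimal sketch of a guarded version. The first non-empty row is taken from the question's sample output; the other row is made up for illustration.

```python
# Sample rows shaped like the question's `data` list.
data = [
    [],  # a row with no <td> cells (such as a header row) ends up empty
    ["New Britain, Main Library - Children's Department", 'J FIC PALACIO', 'Check Shelf'],
    ['Example Town, Main Library', 'J FIC PALACIO', 'Checked Out'],  # made-up row
]

# `element and ...` skips empty rows before indexing, which avoids IndexError.
matches = [element for element in data if element and 'New Britain' in element[0]]

for element in matches:
    print(element)
```

This only filters the rows; it does not yet pick out the name and status columns.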
    

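    The code you posted removes empty values, which leaves the data elements with different sizes. That makes it harder to know which field each value belongs to. Using dicts makes the data easier to understand and work with.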
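To make the alignment problem concrete, here is a small sketch that needs no network access. The row is hypothetical (a middle cell that happens to be empty), and `'Call No.'` is an assumed header name; only `'Location'` and `'Status'` appear in this answer's code.

```python
headers = ['Location', 'Call No.', 'Status']  # 'Call No.' is an assumed header name

# A hypothetical row whose middle cell is empty.
cols = ["New Britain, Main Library - Children's Department", '', 'Check Shelf']

# Dropping empty values shifts the status from index 2 to index 1.
compacted = [value for value in cols if value]
print(compacted)

# Pairing each header with its cell keeps every field addressable by name.
record = dict(zip(headers, cols))
print(record['Location'], '|', record['Status'])
```

With the dict, the status is always `record['Status']`, no matter which other cells are empty.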

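    Some fields seem to contain runs of blank characters in the middle of the text (only whitespace-like characters: '\n', '\r', '\t', ' '). strip only trims the ends, so it won't remove those. Combining it with a simple regular expression helps. I wrote a small function to do this: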

    import re

    def squish(s):
        return re.sub(r'\s+', ' ', s)
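A quick way to see the difference between `strip` and `squish` is to run both on a string with whitespace in the middle. The sample string below is invented for the demonstration; it is not copied from the catalog page.

```python
import re


def squish(s):
    # Replace every run of whitespace characters with a single space.
    return re.sub(r'\s+', ' ', s)


# Invented sample with line breaks and a tab between the two parts.
raw = "\n  New Britain,\r\n\t Main Library  \n"

stripped = raw.strip()
print(repr(stripped))          # inner whitespace is still there
print(repr(squish(stripped)))  # inner whitespace collapsed to one space
```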
    

    Putting it all together, I believe this will help you:

    import re
    
    import requests
    from bs4 import BeautifulSoup
    
    
    def squish(s):
        return re.sub(r'\s+', ' ', s)
    
    
    def filter_by_location(data, location_name):
        return [x for x in data if location_name.lower() in x['Location'].lower()]
    
    
    mypage = 'http://lci-mt.iii.com/iii/encore/record/C__Rb1872125__S%28*%29%20f%3Aa%20c%3A47__P0%2C3__Orightresult__U__X6?lang=eng&suite=cobalt'
    response = requests.get(mypage)
    
    soup = BeautifulSoup(response.text, 'html.parser')
    
    data = []
    table = soup.find('table', attrs={'class':'itemTable'})
    
    headers = [squish(element.text.strip()) for element in table.find('tr').find_all('th')]
    
    for row in table.find_all('tr')[1:]:
        cols = [squish(element.text.strip()) for element in row.find_all('td')]
        data.append({k:v for k, v in zip(headers, cols)})
    
    filtered_data = filter_by_location(data, 'New Britain')
    for x in filtered_data:
        print('Location: {}'.format(x['Location']))
        print('Status: {}'.format(x['Status']))
        print()
    

    Running it, I got the following result:

    Location: New Britain, Jefferson Branch - Children's Department
    Status: Check Shelf
    
    Location: New Britain, Main Library - Children's Department
    Status: Check Shelf
    
    Location: New Britain, Main Library - Children's Department
    Status: Check Shelf
    

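    【Comments】:

      【Solution 2】:

      To filter out the rows that don't involve New Britain, you only need to check whether the first element of cols (i.e. cols[0]) contains the library's name.

      Getting only the library name and the checkout status is simple. You just need the first and third elements of cols (i.e. [cols[0], cols[2]]), since they hold the library name and the checkout status respectively.

      You can try replacing data.append([ele for ele in cols if ele]) with the following.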

      # We gotta do this to skip empty rows.
      if len(cols) == 0:
          continue
      
      if 'New Britain' in cols[0]:
          data.append([cols[0], cols[2]])
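If you want to try this logic without hitting the site, the same checks can be wrapped in a function and run on plain lists. This is only a sketch: `location_and_status` is a name I made up, the `len(cols) < 3` guard is slightly stricter than the original `len(cols) == 0` check (it also skips rows too short to have a third cell), and the second sample row is invented.

```python
def location_and_status(rows, library_name):
    """Return [location, status] pairs for rows whose first cell names the library."""
    selected = []
    for cols in rows:
        # Skip empty rows and rows too short to have a status cell.
        if len(cols) < 3:
            continue
        if library_name in cols[0]:
            selected.append([cols[0], cols[2]])
    return selected


rows = [
    [],  # e.g. a header row, which has no <td> cells
    ["New Britain, Main Library - Children's Department", 'J FIC PALACIO', 'Check Shelf'],
    ['Example Town, Main Library', 'J FIC PALACIO', 'Check Shelf'],  # invented row
]

print(location_and_status(rows, 'New Britain'))
```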
      

      Your code will then look like this:

      import requests
      from bs4 import BeautifulSoup
      
      mypage = 'http://lci-mt.iii.com/iii/encore/record/C__Rb1872125__S%28*%29%20f%3Aa%20c%3A47__P0%2C3__Orightresult__U__X6?lang=eng&suite=cobalt'
      response = requests.get(mypage)
      
      soup = BeautifulSoup(response.text, 'html.parser')
      
      data = []
      table = soup.find('table', attrs={'class':'itemTable'})
      
      rows = table.find_all('tr')
      for row in rows:
          cols = row.find_all('td')
          cols = [ele.text.strip() for ele in cols]
      
          if len(cols) == 0:
              continue
      
          if 'New Britain' in cols[0]:
              data.append([cols[0], cols[2]])
      
      for index, libraryinfo in enumerate(data):
          print(index, libraryinfo)
      

      Output:

      0 ["New Britain, Jefferson Branch - Children's Department", 'Check Shelf']
      1 ["New Britain, Main Library - Children's Department", 'Check Shelf']
      2 ["New Britain, Main Library - Children's Department", 'Check Shelf']
      

      【Comments】:

        【Solution 3】:

        Try this to get the desired content:

        import requests
        from bs4 import BeautifulSoup
        
        URL = "http://lci-mt.iii.com/iii/encore/record/C__Rb1872125__S%28*%29%20f%3Aa%20c%3A47__P0%2C3__Orightresult__U__X6?lang=eng&suite=cobalt"
        
        res = requests.get(URL)
        soup = BeautifulSoup(res.text,"lxml")
        for items in soup.find("table",class_="itemTable").find_all("tr"):
            if "New Britain" in items.text:
                data = items.find_all("td")
                name = data[0].a.get_text(strip=True)
                status = data[2].get_text(strip=True)
                print(name,status)
        

        Output:

        New Britain, Jefferson Branch - Children's Department Check Shelf
        New Britain, Main Library - Children's Department Check Shelf
        New Britain, Main Library - Children's Department Check Shelf
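Note that `"New Britain" in items.text` looks at the text of the whole row, not just the location cell. I don't know whether this catalog ever mentions a library name in another column, so treat the following as a hypothetical illustration of the difference; the first row is invented.

```python
# Each tuple stands for the cell texts of one table row.
rows = [
    # Invented row: another library, but the last cell mentions New Britain.
    ("Example Town, Main Library", "J FIC PALACIO", "In transit to New Britain"),
    ("New Britain, Main Library - Children's Department", "J FIC PALACIO", "Check Shelf"),
]

# Matching against the whole row text also picks up the invented row.
whole_row = [r for r in rows if "New Britain" in " ".join(r)]

# Matching against the first cell only keeps rows located at New Britain.
first_cell = [r for r in rows if "New Britain" in r[0]]

print(len(whole_row), len(first_cell))
```

If that kind of row can occur, checking only the first cell's text is the safer test.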
        

        【Comments】:
