【Question title】: Is there a way to look for a specific line of code in web-scraping with bs4
【Posted】: 2022-01-23 04:08:18
【Question】:

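I am trying to scrape a page built from HTML lists and unordered lists

(nested inside other lists and unordered lists),

but I cannot scrape them because the tags carry no distinguishing attributes.

Each <ul> tag under a day heading contains that day's data. I know how to scrape nested <ul><li> tags, but I cannot target these because they lack identifying attributes. I would like to know whether I can take the parsed page and look for the tags under the line that contains the day, so that I can scrape them all in one pass. Any help would be greatly appreciated.

Here is some of the code: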

<div class="show-content user_content clearfix enhanced" data-uw-styling-context="true">
  <h1 class="page-title" data-uw-styling-context="true">Unit 3 I Week 3</h1>
  
  
    <div style="background-color: #184366; color: white; padding: 15px;" data-uw-styling-context="true">
<h2 data-uw-styling-context="true"><span style="font-size: 30pt;" data-uw-styling-context="true">Unit 3 | Week 3: January 18th-21st</span></h2>
</div>
<h2 data-uw-styling-context="true">Essential Questions</h2>
<ul data-uw-styling-context="true">
<li aria-level="1" data-uw-styling-context="true"><span data-uw-styling-context="true">How does voice relate to the audience and purpose?</span></li>
<li aria-level="1" data-uw-styling-context="true"><span data-uw-styling-context="true">What techniques does the author use to get his/her point across and communicate?</span></li>
<li aria-level="1" data-uw-styling-context="true"><span data-uw-styling-context="true">How can technology be beneficial and/or detrimental to society?</span></li>
</ul>
<h2 data-uw-styling-context="true">Objectives</h2>
<ul data-uw-styling-context="true">
<li aria-level="1" data-uw-styling-context="true"><span data-uw-styling-context="true">Analyze the concept of utopia/dystopia as presented in the novel</span></li>
<li aria-level="1" data-uw-styling-context="true"><span data-uw-styling-context="true">Create a utopia to represent the ideas of the group and backed up with research</span></li>
<li aria-level="1" data-uw-styling-context="true"><span data-uw-styling-context="true">Analyze expository/informational text&nbsp;</span></li>
<li aria-level="1" data-uw-styling-context="true"><span data-uw-styling-context="true">Understand rhetorical devices and logical fallacies</span></li>
<li aria-level="1" data-uw-styling-context="true"><span data-uw-styling-context="true">Interpret elements of media including television and digital graphics</span></li>
<li aria-level="1" data-uw-styling-context="true"><span data-uw-styling-context="true">Create a TV newscast that organizes and presents research with certain purposes and audiences in mind</span></li>
<li aria-level="1" data-uw-styling-context="true"><span data-uw-styling-context="true">Collaborate to create a professional product</span></li>
<li aria-level="1" data-uw-styling-context="true"><span data-uw-styling-context="true">Explain author’s purpose and message within a text</span></li>
</ul>
<p data-uw-styling-context="true"><img src="https://fisd.instructure.com/courses/56950/files/4791824/download" alt="tear drop line 3.png" data-api-endpoint="https://fisd.instructure.com/api/v1/courses/56950/files/4791824" data-api-returntype="File" style="max-width: 676px;" data-uw-styling-context="true"></p>
<h2 data-uw-styling-context="true">???? Monday</h2>
<ul data-uw-styling-context="true">
<li style="list-style-type: none;" data-uw-styling-context="true">
<ul data-uw-styling-context="true">
<li data-uw-styling-context="true">No School</li>
</ul>
</li>
</ul>
<hr data-uw-styling-context="true">
<h2 data-uw-styling-context="true">???? Tuesday</h2>
<ul data-uw-styling-context="true">
<li style="list-style-type: none;" data-uw-styling-context="true">
<ul data-uw-styling-context="true">
<li data-uw-styling-context="true">????In Class Today:
<ul data-uw-styling-context="true">
<li data-uw-styling-context="true">Read Chapter 4</li>
<li data-uw-styling-context="true">Annotations&nbsp;</li>
<li data-uw-styling-context="true">Book Study</li>
</ul>
</li>
<li data-uw-styling-context="true">????Due Today:</li>
<li data-uw-styling-context="true">????Homework for Next Class:
<ul data-uw-styling-context="true">
<li data-uw-styling-context="true">Study Stems</li>
<li data-uw-styling-context="true">Annotations and Book Study 1-4 due BOC Wed</li>
</ul>
</li>
</ul>
</li>
</ul>
<hr data-uw-styling-context="true">
<h2 data-uw-styling-context="true">???? Wednesday</h2>
<ul data-uw-styling-context="true">
<li style="list-style-type: none;" data-uw-styling-context="true">
<ul data-uw-styling-context="true">
<li data-uw-styling-context="true">????In Class Today:
<ul data-uw-styling-context="true">
<li data-uw-styling-context="true">Subject Complement Notes&nbsp;</li>
<li data-uw-styling-context="true">"There Will Come Soft Rains"&nbsp;</li>
</ul>
</li>
<li data-uw-styling-context="true">????Due Today:
<ul data-uw-styling-context="true">
<li data-uw-styling-context="true">Annotations and Book Study Ch. 1-4</li>
</ul>
</li>
<li data-uw-styling-context="true">????Homework for Next Class:
<ul data-uw-styling-context="true">
<li data-uw-styling-context="true">Study Stems&nbsp;</li>
</ul>
</li>
</ul>
</li>
</ul>
<hr data-uw-styling-context="true">
<h2 data-uw-styling-context="true">???? Thursday</h2>
<ul data-uw-styling-context="true">
<li style="list-style-type: none;" data-uw-styling-context="true">
<ul data-uw-styling-context="true">
<li data-uw-styling-context="true">????In Class Today:
<ul data-uw-styling-context="true">
<li data-uw-styling-context="true">Subject Complement Practice</li>
<li data-uw-styling-context="true">TWCSR</li>
</ul>
</li>
<li data-uw-styling-context="true">????Due Today:</li>
<li data-uw-styling-context="true">????Homework for Next Class:
<ul data-uw-styling-context="true">
<li data-uw-styling-context="true">Study Stems&nbsp;</li>
</ul>
</li>
</ul>
</li>
</ul>
<hr data-uw-styling-context="true">
<h2 data-uw-styling-context="true">???? Friday</h2>
<ul data-uw-styling-context="true">
<li style="list-style-type: none;" data-uw-styling-context="true">
<ul data-uw-styling-context="true">
<li data-uw-styling-context="true">????In Class Today:
<ul data-uw-styling-context="true">
<li data-uw-styling-context="true">Stems Quiz 5 Major Grade</li>
<li data-uw-styling-context="true">TWCSR (Due Monday BOC)</li>
</ul>
</li>
<li data-uw-styling-context="true">????Due Today:</li>
<li data-uw-styling-context="true">????Homework for Next Class:</li>
</ul>
</li>
</ul>
<p data-uw-styling-context="true"><img src="https://fisd.instructure.com/courses/56950/files/4791824/download" alt="tear drop line 3.png" data-api-endpoint="https://fisd.instructure.com/api/v1/courses/56950/files/4791824" data-api-returntype="File" style="max-width: 676px;" data-uw-styling-context="true"></p>
<p data-uw-styling-context="true"><img style="float: left; max-width: 72px;" src="https://fisd.instructure.com/courses/56950/files/4791827/download" alt="Left Arrow (1).png" data-api-endpoint="https://fisd.instructure.com/api/v1/courses/56950/files/4791827" data-api-returntype="File" data-uw-styling-context="true"></p>
<p data-uw-styling-context="true"><br data-uw-styling-context="true">&nbsp;<a title="Unit 3 Overview" href="https://fisd.instructure.com/courses/111538/pages/unit-3-overview" data-api-endpoint="https://fisd.instructure.com/api/v1/courses/111538/pages/unit-3-overview" data-api-returntype="Page" data-uw-styling-context="true">Unit 3 Homepage</a></p>
<p data-uw-styling-context="true">&nbsp;</p>
<p data-uw-styling-context="true"><a title="Home" href="https://fisd.instructure.com/courses/111538/pages/home" data-api-endpoint="https://fisd.instructure.com/api/v1/courses/111538/pages/home" data-api-returntype="Page" data-uw-styling-context="true"><img style="float: left; max-width: 72px;" src="https://fisd.instructure.com/courses/56950/files/4791834/download?wrap=1" alt="Home Black.png" data-api-endpoint="https://fisd.instructure.com/api/v1/courses/56950/files/4791834" data-api-returntype="File" data-uw-styling-context="true"> <br data-uw-styling-context="true">Course Homepage</a></p>
<p data-uw-styling-context="true">&nbsp;</p>
  
</div>

(A screenshot of the page was attached to the original post.)

【Comments】:

  • Is this from a public web page?
  • No, it is from a school page behind a login.

Tags: python html web-scraping beautifulsoup python-requests


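【Solution 1】:

Note: Because the question lacks detail, this answer can only point toward how to scrape the information in context. It does not account for how the site is reached or for exact data preparation.

The approach is to find every <h2> whose text contains "day", take the next <li> after it, and then select all of that element's descendant <li> elements that themselves contain an <li>.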

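# Assumes `soup` is an already-parsed BeautifulSoup document.
for day in soup.select('h2:-soup-contains("day")'):
    # Starting at the first <li> after the heading, keep nested <li> that contain further <li>.
    for item in day.find_next('li').select('li:has(li)'):
        print(item.text)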

Example

html = '''<div class="show-content user_content clearfix enhanced" data-uw-styling-context="true"> <h1 class="page-title" data-uw-styling-context="true">Unit 3 I Week 3</h1>   <div style="background-color: #184366; color: white; padding: 15px;" data-uw-styling-context="true"> <h2 data-uw-styling-context="true"><span style="font-size: 30pt;" data-uw-styling-context="true">Unit 3 | Week 3: January 18th-21st</span></h2> </div> <h2 data-uw-styling-context="true">Essential Questions</h2> <ul data-uw-styling-context="true"> <li aria-level="1" data-uw-styling-context="true"><span data-uw-styling-context="true">How does voice relate to the audience and purpose?</span></li> <li aria-level="1" data-uw-styling-context="true"><span data-uw-styling-context="true">What techniques does the author use to get his/her point across and communicate?</span></li> <li aria-level="1" data-uw-styling-context="true"><span data-uw-styling-context="true">How can technology be beneficial and/or detrimental to society?</span></li> </ul> <h2 data-uw-styling-context="true">Objectives</h2> <ul data-uw-styling-context="true"> <li aria-level="1" data-uw-styling-context="true"><span data-uw-styling-context="true">Analyze the concept of utopia/dystopia as presented in the novel</span></li> <li aria-level="1" data-uw-styling-context="true"><span data-uw-styling-context="true">Create a utopia to represent the ideas of the group and backed up with research</span></li> <li aria-level="1" data-uw-styling-context="true"><span data-uw-styling-context="true">Analyze expository/informational text&nbsp;</span></li> <li aria-level="1" data-uw-styling-context="true"><span data-uw-styling-context="true">Understand rhetorical devices and logical fallacies</span></li> <li aria-level="1" data-uw-styling-context="true"><span data-uw-styling-context="true">Interpret elements of media including television and digital graphics</span></li> <li aria-level="1" data-uw-styling-context="true"><span 
data-uw-styling-context="true">Create a TV newscast that organizes and presents research with certain purposes and audiences in mind</span></li> <li aria-level="1" data-uw-styling-context="true"><span data-uw-styling-context="true">Collaborate to create a professional product</span></li> <li aria-level="1" data-uw-styling-context="true"><span data-uw-styling-context="true">Explain author’s purpose and message within a text</span></li> </ul> <p data-uw-styling-context="true"><img src="https://fisd.instructure.com/courses/56950/files/4791824/download" alt="tear drop line 3.png" data-api-endpoint="https://fisd.instructure.com/api/v1/courses/56950/files/4791824" data-api-returntype="File" style="max-width: 676px;" data-uw-styling-context="true"></p> <h2 data-uw-styling-context="true">? Monday</h2> <ul data-uw-styling-context="true"> <li style="list-style-type: none;" data-uw-styling-context="true"> <ul data-uw-styling-context="true"> <li data-uw-styling-context="true">No School</li> </ul> </li> </ul> <hr data-uw-styling-context="true"> <h2 data-uw-styling-context="true">? Tuesday</h2> <ul data-uw-styling-context="true"> <li style="list-style-type: none;" data-uw-styling-context="true"> <ul data-uw-styling-context="true"> <li data-uw-styling-context="true">?In Class Today: <ul data-uw-styling-context="true"> <li data-uw-styling-context="true">Read Chapter 4</li> <li data-uw-styling-context="true">Annotations&nbsp;</li> <li data-uw-styling-context="true">Book Study</li> </ul> </li> <li data-uw-styling-context="true">?Due Today:</li> <li data-uw-styling-context="true">?Homework for Next Class: <ul data-uw-styling-context="true"> <li data-uw-styling-context="true">Study Stems</li> <li data-uw-styling-context="true">Annotations and Book Study 1-4 due BOC Wed</li> </ul> </li> </ul> </li> </ul> <hr data-uw-styling-context="true"> <h2 data-uw-styling-context="true">? 
Wednesday</h2> <ul data-uw-styling-context="true"> <li style="list-style-type: none;" data-uw-styling-context="true"> <ul data-uw-styling-context="true"> <li data-uw-styling-context="true">?In Class Today: <ul data-uw-styling-context="true"> <li data-uw-styling-context="true">Subject Complement Notes&nbsp;</li> <li data-uw-styling-context="true">"There Will Come Soft Rains"&nbsp;</li> </ul> </li> <li data-uw-styling-context="true">?Due Today: <ul data-uw-styling-context="true"> <li data-uw-styling-context="true">Annotations and Book Study Ch. 1-4</li> </ul> </li> <li data-uw-styling-context="true">?Homework for Next Class: <ul data-uw-styling-context="true"> <li data-uw-styling-context="true">Study Stems&nbsp;</li> </ul> </li> </ul> </li> </ul> <hr data-uw-styling-context="true"> <h2 data-uw-styling-context="true">? Thursday</h2> <ul data-uw-styling-context="true"> <li style="list-style-type: none;" data-uw-styling-context="true"> <ul data-uw-styling-context="true"> <li data-uw-styling-context="true">?In Class Today: <ul data-uw-styling-context="true"> <li data-uw-styling-context="true">Subject Complement Practice</li> <li data-uw-styling-context="true">TWCSR</li> </ul> </li> <li data-uw-styling-context="true">?Due Today:</li> <li data-uw-styling-context="true">?Homework for Next Class: <ul data-uw-styling-context="true"> <li data-uw-styling-context="true">Study Stems&nbsp;</li> </ul> </li> </ul> </li> </ul> <hr data-uw-styling-context="true"> <h2 data-uw-styling-context="true">? 
Friday</h2> <ul data-uw-styling-context="true"> <li style="list-style-type: none;" data-uw-styling-context="true"> <ul data-uw-styling-context="true"> <li data-uw-styling-context="true">?In Class Today: <ul data-uw-styling-context="true"> <li data-uw-styling-context="true">Stems Quiz 5 Major Grade</li> <li data-uw-styling-context="true">TWCSR (Due Monday BOC)</li> </ul> </li> <li data-uw-styling-context="true">?Due Today:</li> <li data-uw-styling-context="true">?Homework for Next Class:</li> </ul> </li> </ul> <p data-uw-styling-context="true"><img src="https://fisd.instructure.com/courses/56950/files/4791824/download" alt="tear drop line 3.png" data-api-endpoint="https://fisd.instructure.com/api/v1/courses/56950/files/4791824" data-api-returntype="File" style="max-width: 676px;" data-uw-styling-context="true"></p> <p data-uw-styling-context="true"><img style="float: left; max-width: 72px;" src="https://fisd.instructure.com/courses/56950/files/4791827/download" alt="Left Arrow (1).png" data-api-endpoint="https://fisd.instructure.com/api/v1/courses/56950/files/4791827" data-api-returntype="File" data-uw-styling-context="true"></p> <p data-uw-styling-context="true"><br data-uw-styling-context="true">&nbsp;<a title="Unit 3 Overview" href="https://fisd.instructure.com/courses/111538/pages/unit-3-overview" data-api-endpoint="https://fisd.instructure.com/api/v1/courses/111538/pages/unit-3-overview" data-api-returntype="Page" data-uw-styling-context="true">Unit 3 Homepage</a></p> <p data-uw-styling-context="true">&nbsp;</p> <p data-uw-styling-context="true"><a title="Home" href="https://fisd.instructure.com/courses/111538/pages/home" data-api-endpoint="https://fisd.instructure.com/api/v1/courses/111538/pages/home" data-api-returntype="Page" data-uw-styling-context="true"><img style="float: left; max-width: 72px;" src="https://fisd.instructure.com/courses/56950/files/4791834/download?wrap=1" alt="Home Black.png" 
data-api-endpoint="https://fisd.instructure.com/api/v1/courses/56950/files/4791834" data-api-returntype="File" data-uw-styling-context="true"> <br data-uw-styling-context="true">Course Homepage</a></p> <p data-uw-styling-context="true">&nbsp;</p>  </div> '''

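from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')  # requires the lxml parser to be installed
data = []
for day in soup.select('h2:-soup-contains("day")'):
    d = {'day': day.text, 'items': []}
    # Only <li> elements that contain nested <li> are kept, so empty categories are skipped.
    for item in day.find_next('li').select('li:has(li)'):
        d['items'].append({'item': item.text})
    data.append(d)
data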

Output

[{'day': '? Monday', 'items': []},
 {'day': '? Tuesday',
  'items': [{'item': '?In Class Today:  Read Chapter 4 Annotations\xa0 Book Study  '},
   {'item': '?Homework for Next Class:  Study Stems Annotations and Book Study 1-4 due BOC Wed  '}]},
 {'day': '? Wednesday',
  'items': [{'item': '?In Class Today:  Subject Complement Notes\xa0 "There Will Come Soft Rains"\xa0  '},
   {'item': '?Due Today:  Annotations and Book Study Ch. 1-4  '},
   {'item': '?Homework for Next Class:  Study Stems\xa0  '}]},
 {'day': '? Thursday',
  'items': [{'item': '?In Class Today:  Subject Complement Practice TWCSR  '},
   {'item': '?Homework for Next Class:  Study Stems\xa0  '}]},
 {'day': '? Friday',
  'items': [{'item': '?In Class Today:  Stems Quiz 5 Major Grade TWCSR (Due Monday BOC)  '}]}]
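
The items in the output above are flattened strings that still carry \xa0 characters and mix each category label with its tasks. As a possible refinement (a sketch, not part of the original answer), the same headings can be walked with the :scope selector so that each day maps to its categories and each category to a clean list of tasks. The trimmed HTML below is a hypothetical stand-in for the page in the question, and the names clean and schedule are illustrative only. It assumes a Soup Sieve version that supports :-soup-contains and uses the built-in html.parser so no extra parser is needed.

```python
from bs4 import BeautifulSoup

# Hypothetical, trimmed-down stand-in for the page shown in the question.
html = """
<h2>Objectives</h2>
<ul><li><span>Analyze expository text&nbsp;</span></li></ul>
<h2>Tuesday</h2>
<ul>
<li style="list-style-type: none;">
<ul>
<li>In Class Today:
<ul>
<li>Read Chapter 4</li>
<li>Annotations&nbsp;</li>
</ul>
</li>
<li>Due Today:</li>
<li>Homework for Next Class:
<ul>
<li>Study Stems</li>
</ul>
</li>
</ul>
</li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")


def clean(text):
    # Collapse runs of whitespace; str.split() also treats \xa0 as whitespace.
    return " ".join(text.split())


schedule = {}
for day in soup.select('h2:-soup-contains("day")'):
    wrapper = day.find_next_sibling("ul")
    if wrapper is None:
        continue
    categories = {}
    # Outer <ul> > wrapper <li> > inner <ul> > category <li>.
    for cat in wrapper.select(":scope > li > ul > li"):
        # The first text inside a category <li> is its label.
        label = clean(next(cat.stripped_strings, ""))
        # Tasks are the <li> children of the category's own <ul>, if it has one.
        tasks = [clean(li.get_text()) for li in cat.select(":scope > ul > li")]
        categories[label] = tasks
    schedule[clean(day.get_text())] = categories

print(schedule)
```

Unlike li:has(li), this keeps categories with no tasks (such as "Due Today:") as empty lists, which may or may not be what is wanted.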

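【Comments】:

  • Thanks, this helps a lot. However, I would like to know whether I can look specifically for the text of those h2 tags, because the code contains other h2 tags that are not part of the day information. I tried soup.find('h2', text="? Monday") and soup.find_all, but neither worked well. Thanks again!
  • Without additional detail it is hard to give a correct answer, so it would be best to improve the question. As a hint, though, you can use soup.select('h2:-soup-contains("day")') instead of soup.find_all('h2') to make the selection more specific. Is there any special container around the days?
  • That doesn't quite work either: no response and no error. There is no special container. I have added more information to the question.
  • Is your library up to date? As an alternative, use a regular expression.
  • Oh my gosh, I feel so dumb. It wasn't logging in to the page, because the site needs JavaScript to be accessed. I'm so sorry.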
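
The regular-expression alternative suggested in the comments could look roughly like the sketch below. It matches only h2 tags whose text contains a weekday name, which is stricter than checking for the substring "day". An exact match such as soup.find('h2', text="? Monday") needs the whole string to be identical, so it would presumably miss a heading that starts with an emoji rather than a literal question mark. The sample HTML, the calendar emoji, and the "Holiday Reading" heading are hypothetical and serve only to show a heading that the substring test would wrongly pick up.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical headings: the emoji and "Holiday Reading" are made up for illustration.
html = """
<h2><span>Unit 3 | Week 3: January 18th-21st</span></h2>
<h2>Essential Questions</h2>
<h2>\U0001F4C5 Monday</h2>
<h2>\U0001F4C5 Tuesday</h2>
<h2>Holiday Reading</h2>
"""

soup = BeautifulSoup(html, "html.parser")

# Whole-word weekday names, so "Holiday" does not match.
WEEKDAY = re.compile(r"\b(Monday|Tuesday|Wednesday|Thursday|Friday)\b")


def is_day_heading(tag):
    # find_all accepts a function that receives each tag and returns True or False.
    return tag.name == "h2" and WEEKDAY.search(tag.get_text()) is not None


day_headers = soup.find_all(is_day_heading)
names = [WEEKDAY.search(h.get_text()).group(1) for h in day_headers]
print(names)  # ['Monday', 'Tuesday']
```

Each element of day_headers can then be used in place of the h2 tags selected with :-soup-contains in the answer.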