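You don't need to remove duplicates of any kind.
You only need to update your code.
Please read on; I have described the problem in detail below. Also, don't forget to check this gist, which I wrote while debugging your code: https://gist.github.com/hygull/44cfdc1d4e703b70eb14f16fec14bf2c
» Where is the problem?
I understand why you asked for this: you are getting duplicated dictionaries.
That happens because you selected the h4 elements as containers, and for each book's details the specified page links https://open.bccampus.ca/find-open-textbooks/
and https://open.bccampus.ca/find-open-textbooks/?start=10
contain 2 h4 elements.
That is why, instead of a containers list of 20 items (10 per page), you get double that: a list of 40 items, each of which is an h4 element.
Each of these 40 items is a different element, but the problem arises when you select the parent.
Both h4 elements of a book share the same parent, so the parent yields the same element and therefore the same text.
Let's clarify the problem with the following dummy markup.
Note: the gist linked above contains the Python code I wrote to debug and solve this problem; it may give you some ideas.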
<li> <!-- 1st book -->
<h4>
<a> Text 1 </a>
</h4>
<h4>
<a> Text 2 </a>
</h4>
</li>
<li> <!-- 2nd book -->
<h4>
<a> Text 3 </a>
</h4>
<h4>
<a> Text 4 </a>
</h4>
</li>
...
...
<li> <!-- 20th book -->
<h4>
<a> Text 39 </a>
</h4>
<h4>
<a> Text 40 </a>
</h4>
</li>
»» containers = page_soup.find_all("h4") gives the following list of h4 elements.
[
<h4>
<a> Text 1 </a>
</h4>,
<h4>
<a> Text 2 </a>
</h4>,
<h4>
<a> Text 3 </a>
</h4>,
<h4>
<a> Text 4 </a>
</h4>,
...
...
...
<h4>
<a> Text 39 </a>
</h4>,
<h4>
<a> Text 40 </a>
</h4>
]
»» In your code, the first iteration of the inner for loop binds the following element to the container variable.
<h4>
<a> Text 1 </a>
</h4>
»» The second iteration binds the element below to the container variable.
<h4>
<a> Text 2 </a>
</h4>
»» In both of the above iterations (1st and 2nd) of the inner for loop, container.parent gives the following element.
<li> <!-- 1st book -->
<h4>
<a> Text 1 </a>
</h4>
<h4>
<a> Text 2 </a>
</h4>
</li>
»» And container.parent.a gives the following element.
<a> Text 1 </a>
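»» Finally, container.parent.a.text gives the following text as the title in both iterations, that is, for what the code treats as our first 2 books.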
Text 1
That is why we get duplicated dictionaries: the dynamic title and authors values come out the same as well.
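The walkthrough above can be reproduced on a small scale. The sketch below is only an illustration: the markup is a shortened, made-up stand-in for the real page (2 books instead of 20), and the name titles_buggy is mine, not from your code.

```python
from bs4 import BeautifulSoup

# Dummy markup mirroring the structure described above:
# one <li> per book, two <h4> elements per <li>.
html = """
<ul>
  <li><h4><a>Text 1</a></h4><h4><a>Text 2</a></h4></li>
  <li><h4><a>Text 3</a></h4><h4><a>Text 4</a></h4></li>
</ul>
"""

page_soup = BeautifulSoup(html, "html.parser")
containers = page_soup.find_all("h4")

# Every <h4> is treated as a book, so each <li> is visited twice, and
# parent.a always resolves to the first <a> inside that <li>.
titles_buggy = [container.parent.a.text for container in containers]

print(len(containers))  # 4
print(titles_buggy)     # ['Text 1', 'Text 1', 'Text 3', 'Text 3']
```

Each title shows up twice, which is exactly the duplication you are seeing in your dictionaries.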
Let's solve this problem step by step.
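» Web page details:
- We have links to 2 web pages.
- Each web page has the details of 10 textbooks.
- Each book's details contain 2 h4 elements.
- So in total there are 2 x 10 x 2 = 40 h4 elements.
» Our goal:
Our goal is to get an array/list of only 20 dictionaries, not 40.
So we need to iterate over the containers list in steps of 2, that is,
by skipping 1 item in each iteration.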
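As a rough sketch of that stepping idea, here are two equivalent ways to walk a list two items at a time. Plain strings stand in for the 40 h4 elements, and the names first_of_each_book and pairs are only illustrative.

```python
# Stand-ins for the 40 <h4> elements: two consecutive entries per book.
containers = ["h4 #{}".format(n) for n in range(1, 41)]

# Option 1: step through the indices by 2, visiting the first h4 of each book.
first_of_each_book = [containers[i] for i in range(0, len(containers), 2)]

# Option 2: pair the first h4 of each book with the second one.
# zip() stops at the shorter slice, so a stray unpaired item at the end
# is dropped instead of raising an IndexError.
pairs = list(zip(containers[::2], containers[1::2]))

print(len(first_of_each_book))  # 20
print(pairs[0])                 # ('h4 #1', 'h4 #2')
```

Option 2 is handy when you need both h4 elements of a book at once, since it avoids indexing with index + 1.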
» Modified working code:
from urllib.request import urlopen
from bs4 import BeautifulSoup as soup
import json

urls = [
    'https://open.bccampus.ca/find-open-textbooks/',
    'https://open.bccampus.ca/find-open-textbooks/?start=10'
]

data = []

# Open a connection and grab each page
for url in urls:
    uClient = urlopen(url)
    page_html = uClient.read()
    uClient.close()

    # Parse the HTML
    page_soup = soup(page_html, "html.parser")

    # Each book contributes 2 h4 elements, so step by 2 to visit each book once
    containers = page_soup.find_all("h4")

    for index in range(0, len(containers), 2):
        item = {}
        item['type'] = "Textbook"
        item['link'] = "https://open.bccampus.ca/find-open-textbooks/" + containers[index].parent.a["href"]
        item['source'] = "BC Campus"
        item['title'] = containers[index].parent.a.text
        item['authors'] = containers[index].nextSibling.findNextSibling(text=True)
        data.append(item)  # add the item to the list

# Note: the ./json directory must already exist
with open("./json/bc-modified-final.json", "w") as writeJSON:
    json.dump(data, writeJSON, ensure_ascii=False)
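One practical caveat: open() does not create missing directories, so the write above fails with FileNotFoundError if ./json does not exist yet. A minimal sketch of one way to guard against that follows; write_json is a hypothetical helper of mine, not part of your code, and the demo writes to a temporary directory so it leaves nothing behind.

```python
import json
import tempfile
from pathlib import Path


def write_json(data, path):
    """Write data as JSON, creating the parent directory if it is missing."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    # An explicit encoding keeps non-ASCII characters intact on any platform,
    # which matters because ensure_ascii=False writes them unescaped.
    with path.open("w", encoding="utf-8") as handle:
        json.dump(data, handle, ensure_ascii=False)
    return path


# Demo: the "json" subdirectory does not exist before the call.
with tempfile.TemporaryDirectory() as tmp:
    target = write_json([{"title": "Project Management"}], Path(tmp) / "json" / "out.json")
    loaded = json.loads(target.read_text(encoding="utf-8"))

print(loaded)  # [{'title': 'Project Management'}]
```

In the script itself, the call would look like write_json(data, "./json/bc-modified-final.json").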
» Output:
[
{
"type": "Textbook",
"title": "Vital Sign Measurement Across the Lifespan - 1st Canadian edition",
"authors": " Jennifer L. Lapum, Margaret Verkuyl, Wendy Garcia, Oona St-Amant, Andy Tan, Ryerson University",
"link": "https://open.bccampus.ca/find-open-textbooks/?uuid=feacda80-4fc1-40a5-b713-d6be6a73abe4&contributor=&keyword=&subject=",
"source": "BC Campus"
},
{
"type": "Textbook",
"title": "Exploring Movie Construction and Production",
"authors": " John Reich, SUNY Genesee Community College",
"link": "https://open.bccampus.ca/find-open-textbooks/?uuid=19892992-ae43-48c4-a832-59faa1d7108b&contributor=&keyword=&subject=",
"source": "BC Campus"
},
{
"type": "Textbook",
"title": "Project Management",
"authors": " Adrienne Watt",
"link": "https://open.bccampus.ca/find-open-textbooks/?uuid=8678fbae-6724-454c-a796-3c6667d826be&contributor=&keyword=&subject=",
"source": "BC Campus"
},
...
...
...
{
"type": "Textbook",
"title": "Naming the Unnamable: An Approach to Poetry for New Generations",
"authors": " Michelle Bonczek Evory. Western Michigan University",
"link": "https://open.bccampus.ca/find-open-textbooks/?uuid=8880b4d1-7f62-42fc-a912-3015f216f195&contributor=&keyword=&subject=",
"source": "BC Campus"
}
]
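Finally, I modified your code further and added more details to the dictionary objects: description, date, and categories.
Python version: 3.6
Dependency: pip install beautifulsoup4
» Modified working code (enhanced version):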
from urllib.request import urlopen
from bs4 import BeautifulSoup as soup
import json

urls = [
    'https://open.bccampus.ca/find-open-textbooks/',
    'https://open.bccampus.ca/find-open-textbooks/?start=10'
]

data = []

# Open a connection and grab each page
for url in urls:
    uClient = urlopen(url)
    page_html = uClient.read()
    uClient.close()

    # Parse the HTML
    page_soup = soup(page_html, "html.parser")

    # Each book contributes 2 h4 elements, so step by 2 to visit each book once
    containers = page_soup.find_all("h4")

    for index in range(0, len(containers), 2):
        item = {}

        # Store the book's information as shown on the web page (all 5 fields are dynamic)
        item['title'] = containers[index].parent.a.text
        item["catagories"] = [a_tag.text for a_tag in containers[index + 1].find_all('a')]
        item['authors'] = containers[index].nextSibling.findNextSibling(text=True).strip()
        item['date'] = containers[index].parent.find_all("strong")[1].findNextSibling(text=True).strip()
        item["description"] = containers[index].parent.p.text.strip()

        # Store extra information (the 1st is dynamic, the last 2 are static)
        item['link'] = "https://open.bccampus.ca/find-open-textbooks/" + containers[index].parent.a["href"]
        item['source'] = "BC Campus"
        item['type'] = "Textbook"

        data.append(item)  # add the item to the list

# Note: the ./json directory must already exist
with open("./json/bc-modified-final-my-own-version.json", "w") as writeJSON:
    json.dump(data, writeJSON, ensure_ascii=False)
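A side note on how the link is built: the code concatenates the base URL with the href, which works as long as every href is a bare query string such as "?uuid=...". I am assuming that is the case, judging from the links in the output. If you want something more forgiving, urljoin from the standard library is one option. The sketch below is only a suggestion; BASE_URL and absolute_link are names I made up.

```python
from urllib.parse import urljoin

BASE_URL = "https://open.bccampus.ca/find-open-textbooks/"


def absolute_link(href):
    """Resolve an href against the listing page URL."""
    return urljoin(BASE_URL, href)


# A query-only href keeps the base path, like the concatenation does.
link = absolute_link("?uuid=feacda80-4fc1-40a5-b713-d6be6a73abe4&contributor=&keyword=&subject=")
print(link)

# An href that is already absolute is returned unchanged, where plain
# concatenation would produce a broken URL.
print(absolute_link("https://example.org/book"))
```

For the hrefs shown in the output, both approaches give the same link, so this only matters if the page ever mixes in other kinds of href.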
» Output (enhanced version):
[
{
"title": "Vital Sign Measurement Across the Lifespan - 1st Canadian edition",
"catagories": [
"Ancillary Resources"
],
"authors": "Jennifer L. Lapum, Margaret Verkuyl, Wendy Garcia, Oona St-Amant, Andy Tan, Ryerson University",
"date": "May 3, 2018",
"description": "Description: The purpose of this textbook is to help learners develop best practices in vital sign measurement. Using a multi-media approach, it will provide opportunities to read about, observe, practice, and test vital sign measurement.",
"link": "https://open.bccampus.ca/find-open-textbooks/?uuid=feacda80-4fc1-40a5-b713-d6be6a73abe4&contributor=&keyword=&subject=",
"source": "BC Campus",
"type": "Textbook"
},
{
"title": "Exploring Movie Construction and Production",
"catagories": [
"Adopted"
],
"authors": "John Reich, SUNY Genesee Community College",
"date": "May 2, 2018",
"description": "Description: Exploring Movie Construction and Production contains eight chapters of the major areas of film construction and production. The discussion covers theme, genre, narrative structure, character portrayal, story, plot, directing style, cinematography, and editing. Important terminology is defined and types of analysis are discussed and demonstrated. An extended example of how a movie description reflects the setting, narrative structure, or directing style is used throughout the book to illustrate ...[more]",
"link": "https://open.bccampus.ca/find-open-textbooks/?uuid=19892992-ae43-48c4-a832-59faa1d7108b&contributor=&keyword=&subject=",
"source": "BC Campus",
"type": "Textbook"
},
...
...
...
{
"title": "Naming the Unnamable: An Approach to Poetry for New Generations",
"catagories": [],
"authors": "Michelle Bonczek Evory. Western Michigan University",
"date": "Apr 27, 2018",
"description": "Description: Informed by a writing philosophy that values both spontaneity and discipline, Michelle Bonczek Evory’s Naming the Unnameable: An Approach to Poetry for New Generations offers practical advice and strategies for developing a writing process that is centered on play and supported by an understanding of America’s rich literary traditions. With consideration to the psychology of invention, Bonczek Evory provides students with exercises aimed to make writing in its early stages a form of play that ...[more]",
"link": "https://open.bccampus.ca/find-open-textbooks/?uuid=8880b4d1-7f62-42fc-a912-3015f216f195&contributor=&keyword=&subject=",
"source": "BC Campus",
"type": "Textbook"
}
]
That's it. Thanks.