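Get a more general category from the categories of a Wikipedia page

【Question title】: Get more general Category from the Category of a Wikipedia page
【Posted】: 2021-04-23 18:29:53
【Question description】:

I am using the Python wikipedia library to get the list of categories of a page. I see that it is a wrapper around the MediaWiki API.

Anyway, I would like to know how to generalize the categories into macro categories, such as these Main topic classifications.

For example, if I look up the page Hamburger, there is a category named German-American cuisine, but I would like to get its super category, such as Food and drink. How can I do that?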

import wikipedia
page = wikipedia.page("Hamburger")
print(page.categories)
# how to filter only imortant categories?

>>>['All articles with specifically marked weasel-worded phrases', 'All articles with unsourced statements', 'American sandwiches', 'Articles with hAudio microformats', 'Articles with short description', 'Articles with specifically marked weasel-worded phrases from May 2015', 'Articles with unsourced statements from May 2017', 'CS1: Julian–Gregorian uncertainty', 'Commons category link is on Wikidata', 'Culture in Hamburg', 'Fast food', 'German-American cuisine', 'German cuisine', 'German sandwiches', 'Hamburgers (food)', 'Hot sandwiches', 'National dishes', 'Short description is different from Wikidata', 'Spoken articles', 'Use mdy dates from October 2020', 'Webarchive template wayback links', 'Wikipedia articles with BNF identifiers', 'Wikipedia articles with GND identifiers', 'Wikipedia articles with LCCN identifiers', 'Wikipedia articles with NARA identifiers', 'Wikipedia indefinitely move-protected pages', 'Wikipedia pages semi-protected against vandalism']

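I have not found an API to traverse the hierarchy tree of Wikipedia categories.

I accept both Python and API-request solutions. Thanks.

EDIT: I found the categorytree API, which seems to do something similar to what I need.

However, I cannot find a way to pass the options parameter described in the documentation. I think the options could be those listed in this link, such as mode=parents, but I cannot find a way to put this parameter in the HTTP URL, because according to the documentation it must be a JSON object. I am trying this: https://en.wikipedia.org/w/api.php?action=categorytree&category=Category:Biscuits&format=json. How do I add the options field?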
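
A JSON object can generally be sent in a query string by serializing it to text and URL-encoding the result. The sketch below only builds such a URL and sends nothing. The `mode` key and its `parents` value are the guess made in the question, not a confirmed categorytree option, and `build_categorytree_url` is a hypothetical helper name.

```python
import json
from urllib.parse import urlencode, urlsplit, parse_qs

API_URL = "https://en.wikipedia.org/w/api.php"

def build_categorytree_url(category, options):
    """Build a categorytree request URL with `options` sent as a JSON string."""
    params = {
        "action": "categorytree",
        "category": category,
        "format": "json",
        # The JSON object is serialized to text; urlencode percent-escapes it.
        "options": json.dumps(options, separators=(",", ":")),
    }
    return API_URL + "?" + urlencode(params)

# Assumption: {"mode": "parents"} is the option guessed in the question,
# not one verified against the API.
url = build_categorytree_url("Category:Biscuits", {"mode": "parents"})
print(url)
```

Decoding the query string again with `parse_qs` and `json.loads` is a quick way to check that the object survives the round trip.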

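【Comments】:

categorytree is an old and ugly API built for the specific purpose of rendering the category tree in the UI. You are probably better off using categories or the categorylinks dump.
Can you define "important"?
@horcrux I cannot see where I wrote "important". If you mean the search for more general categories, my goal is to find the highest parent category, so as to generalize the categories of each Wikipedia page. An example of the classification I want is en.wikipedia.org/wiki/Category:Main_topic_classifications
You wrote "how to filter only imortant categories?" (with a typo in "imortant"). So, to define your problem better: given a category X, you want to obtain a category Y among those in "Category:Main topic classifications" such that X is contained in Y. Am I right?
You are right, @horcrux. That is exactly my goal :)

Tags: python mediawiki wikipedia wikipedia-api mediawiki-api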

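【Solution 1】:

This is a really hard task, because Wikipedia's category graph is a mess (technically speaking :-)). In a tree you would expect to reach the root node in logarithmic time. But this is not a tree, because any node can have multiple parents!

Moreover, I think it cannot be done using categories alone because, as the example shows, you are very likely to get unexpected results. Anyway, I tried to reproduce something similar to what you asked for.

Explanation of the code below:

start from a source page (hard-coded as "Hamburger");
go back recursively, visiting all the parent categories;
cache every category encountered, so that no category is visited twice (this also solves the problem of cycles);
cut the current branch when a target category is found;
stop when the backlog is empty.

Starting from a given page you will likely reach more than one target category, so I organized the result as a dictionary that tells you how many times each target category was encountered.

As you can imagine, the response is not immediate, so this algorithm should be run offline. It can also be improved in many ways (see below).

Code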

import requests
import time
import wikipedia

def get_categories(title):
    """Return the set of categories that the page `title` belongs to."""
    while True:
        try:
            return set(wikipedia.page(title, auto_suggest=False).categories)
        except requests.exceptions.ConnectionError:
            time.sleep(10)      # network error: wait, then retry

start_page = "Hamburger"
target_categories = {"Academic disciplines", "Business", "Concepts", "Culture", "Economy", "Education", "Energy", "Engineering", "Entertainment", "Entities", "Ethics", "Events", "Food and drink", "Geography", "Government", "Health", "History", "Human nature", "Humanities", "Knowledge", "Language", "Law", "Life", "Mass media", "Mathematics", "Military", "Music", "Nature", "Objects", "Organizations", "People", "Philosophy", "Policy", "Politics", "Religion", "Science and technology", "Society", "Sports", "Universe", "World"}
result_categories = {c: 0 for c in target_categories}   # target category -> number of paths
cached_categories = set()       # every category seen so far; only ever grows
backlog = get_categories(start_page)
cached_categories.update(backlog)
while backlog:
    print("\nBacklog size: %d" % len(backlog))
    cat = backlog.pop()         # pick a category, removing it from the backlog
    print("Visiting category: " + cat)
    try:
        for parent in get_categories("Category:" + cat):
            if parent in target_categories:
                # target reached: count it and do not climb any further on this branch
                print("Found target category: " + parent)
                result_categories[parent] += 1
            elif parent not in cached_categories:
                backlog.add(parent)
                cached_categories.add(parent)
    except KeyError:
        pass                    # the current category page may have no "categories" attribute
result_categories = {k: v for k, v in result_categories.items() if v > 0}   # drop targets never reached
print("\nVisited categories: %d" % len(cached_categories))
print("Result: " + str(result_categories))

Result for your example

For your example, the script visits 12176 categories (!) and returns the following result:

{'Education': 21, 'Society': 40, 'Knowledge': 17, 'Entities': 4, 'People': 21, 'Health': 25, 'Mass media': 25, 'Philosophy': 17, 'Events': 17, 'Music': 18, 'History': 21, 'Sports': 6, 'Geography': 18, 'Life': 13, 'Government': 36, 'Food and drink': 12, 'Organizations': 16, 'Religion': 23, 'Language': 15, 'Engineering': 7, 'Law': 25, 'World': 13, 'Military': 18, 'Science and technology': 8, 'Politics': 24, 'Business': 15, 'Objects': 3, 'Entertainment': 15, 'Nature': 12, 'Ethics': 12, 'Culture': 29, 'Human nature': 3, 'Energy': 13, 'Concepts': 7, 'Universe': 2, 'Academic disciplines': 23, 'Humanities': 25, 'Policy': 14, 'Economy': 17, 'Mathematics': 10}

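As you may notice, the category "Food and drink" is reached only 12 times, whereas "Society", for example, is reached 40 times. This tells us a lot about how weird Wikipedia's category graph is.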

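Possible improvements

There are many ways to optimize or approximate this algorithm. The first that come to mind:

Consider keeping track of the path length, and assume that the target category with the shortest path is the most relevant one.
Reduce the execution time:
- You can reduce the number of steps by stopping the script after the first occurrence of a target category (or after the N-th occurrence).
- If you run this algorithm starting from multiple articles, you can keep in memory the information that associates each category you encounter with the target categories it eventually leads to. For example, after your "Hamburger" run you will know that starting from "Category:Fast food" you reach "Category:Economy", and this can be valuable information. It is expensive in terms of space, but it will eventually help you reduce the execution time.
Use as labels only the most frequent target categories. E.g. if your result is {"Food and drink" : 37, "Economy" : 4}, you may want to keep only "Food and drink" as a label. To do so you can:
- take the N most frequent target categories;
- take the most relevant fraction (e.g. the top half, third, or fourth);
- take the categories that occur at least N% as often as the most frequent one;
- use more sophisticated statistical tests to analyze the statistical significance of the frequencies.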
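
Two of these improvements can be sketched without any network access. The code below is a minimal illustration, not the answer's method: `nearest_targets` swaps the set-based backlog for a breadth-first search so that the first time a target is reached is also its shortest distance, and `keep_frequent` keeps only the labels that occur at least a given share as often as the most frequent one. Both function names and the toy graph are invented for this example.

```python
from collections import deque

def nearest_targets(parents_of, start, targets):
    """Breadth-first search: map each reachable target to its shortest distance."""
    distance = {}
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        for parent in parents_of.get(node, ()):
            if parent in targets:
                # Nodes leave the queue in order of depth, so the first hit is the shortest.
                distance.setdefault(parent, depth + 1)
            elif parent not in seen:
                seen.add(parent)
                queue.append((parent, depth + 1))
    return distance

def keep_frequent(counts, min_share):
    """Keep labels whose count is at least `min_share` of the highest count."""
    if not counts:
        return {}
    top = max(counts.values())
    return {label: n for label, n in counts.items() if n >= min_share * top}

# Toy graph invented for illustration; it is not real Wikipedia data.
toy_graph = {
    "Hamburger": ["Fast food", "German cuisine"],
    "Fast food": ["Food and drink", "Restaurants"],
    "German cuisine": ["Cuisine by country"],
    "Cuisine by country": ["Culture"],
    "Restaurants": ["Economy", "Fast food"],   # cycle back to "Fast food"
}
targets = {"Food and drink", "Culture", "Economy"}

print(nearest_targets(toy_graph, "Hamburger", targets))
print(keep_frequent({"Food and drink": 37, "Economy": 4}, 0.5))
```

On the toy graph, "Food and drink" is two steps from the start page while "Culture" and "Economy" are three steps away, so the shortest-path heuristic would pick "Food and drink". With a threshold of 0.5, the frequency filter keeps "Food and drink" (37) and drops "Economy" (4).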

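【Comments】:

This is great! Thanks for this complete example! Just one question: when you do for parent in get_categories("Category:" + cat) :, you are looking up the categories of another category page. Does that retrieve the parent categories, or only related categories? I wanted to do something like this, but I thought the categories of a category page were not parent categories, only related ones.
What do you mean by "related categories"? If a page A (article or category) is contained in a category B (i.e. it declares [[Category:B]] in its source), then B is a parent category of A.
Ah, OK! So you mean that if I look up a category page and check its categories, those are its parent categories. I thought they were categories related to that category, not necessarily parents!
Try running some tests ;-) Write [[Category:B]] in category page A and see whether A shows up as a subcategory of B.
Conversely, if you write [[:Category:B]] (note the leading colon), you are only linking to it. See also: Categorization and How to link to a category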
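
The parent relation described in these comments can also be read straight from the MediaWiki API: `action=query` with `prop=categories` on a category page lists the categories that page is declared in. The sketch below is one possible way to do that with the standard library only. It adds `clshow=!hidden` to skip hidden maintenance categories, which the code in the answer does not do. The helper names, the User-Agent string, and the sample response are made up for illustration.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

API_URL = "https://en.wikipedia.org/w/api.php"

def build_params(title, continuation=None):
    """Query parameters asking for the non-hidden categories of `title`."""
    params = {
        "action": "query",
        "prop": "categories",
        "titles": title,
        "clshow": "!hidden",     # skip hidden maintenance categories
        "cllimit": "max",
        "format": "json",
        "formatversion": "2",
    }
    params.update(continuation or {})
    return params

def extract_parents(response):
    """Collect category names (without the "Category:" prefix) from a response."""
    parents = set()
    for page in response.get("query", {}).get("pages", []):
        for cat in page.get("categories", []):
            parents.add(cat["title"].split(":", 1)[1])
    return parents

def get_parent_categories(title):
    """Fetch all parents of `title`, following the API's continuation tokens."""
    parents, continuation = set(), None
    while True:
        url = API_URL + "?" + urlencode(build_params(title, continuation))
        # Hypothetical User-Agent; replace it with one that identifies your tool.
        request = Request(url, headers={"User-Agent": "category-sketch/0.1 (example)"})
        with urlopen(request) as reply:
            response = json.load(reply)
        parents |= extract_parents(response)
        continuation = response.get("continue")
        if not continuation:
            return parents

# Hand-written sample shaped like a formatversion=2 response (not real data).
sample = {"query": {"pages": [{"title": "Category:A", "categories": [
    {"ns": 14, "title": "Category:B"}]}]}}
print(extract_parents(sample))
```

Calling `get_parent_categories("Category:Fast food")` would issue real requests, so it is left out of the example run; only the parsing step is exercised on the sample.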

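【Solution 2】:

Something different you could do is get the machine-predicted article topic, with a query such as https://ores.wikimedia.org/v3/scores/enwiki/?models=articletopic&revids=1000459607

【Comments】:

This is really interesting. Do you think my goal is impossible to achieve with the existing API?
Not impossible, but harder than you might expect. Categories in MediaWiki are not a tree, not even a DAG, and on a large wiki there can be a huge number of them, so you either have to do some kind of heuristic graph traversal, or download the whole category graph and preprocess it locally.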