【Question title】: Wikipedia JSON API retrieve page content without links
【Posted】: 2012-05-21 05:11:19
【Question】:

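I am using the Wikipedia JSON API to retrieve page content, and I want that content without link markup. For example, this request:

https://en.wikipedia.org/w/api.php?action=query&format=json&titles=May_21&prop=revisions&rvprop=content&rvsection=1

returns text such as: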

[[293]] &ndash; Roman Emperors [[Diocletian]] and [[Maximian]] appoint [[Galerius]] as [[Caesar (title)|''Caesar'']] to Diocletian, beginning the period of four rulers known as the [[Tetrarchy]].

&ndash; should be replaced with -

[[Caesar (title)|''Caesar'']] should become Caesar

I am using Objective-C.

How can I retrieve the same page content, but without the link characters?

Thanks!

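【Comments】:

  • You should clarify what you mean by link characters. Perhaps show an example in your question of what the result should look like.
  • You should strip them with a regular expression. What language are you using?
  • Thanks, I am using Objective-C. Please take a look at my second example; I cannot handle that kind of text because it can vary.
  • What do you want to do with templates?
  • To be clear: is there any way to retrieve the page content as plain text, without any link or heading characters?

Tags: iphone objective-c xml json wikipedia-api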


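【Solution 1】:

Use an HTML-to-text converter (for example links, or a browser emulator such as PhantomJS) on the rendered page. That is less painful than converting wikitext to text, because with wikitext you would also have to deal with templates.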
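
As a rough illustration of the HTML-to-text route, the sketch below strips tags and decodes a handful of entities from an already rendered HTML fragment. The `htmlToText` name and the small entity table are assumptions made for this example; a real converter such as links or a headless browser handles far more cases (nested markup, scripts, the full entity set).

```javascript
// Hypothetical helper: naive HTML-to-text conversion for simple fragments.
// It is a sketch, not a replacement for a real HTML parser.
function htmlToText(html) {
  // Only a few entities are decoded; &ndash; maps to "-" as the question asks.
  const entities = {
    '&amp;': '&', '&lt;': '<', '&gt;': '>', '&quot;': '"',
    '&#39;': "'", '&nbsp;': ' ', '&ndash;': '-',
  };
  return html
    // Drop footnote markers such as <sup class="reference">[1]</sup>
    .replace(/<sup[^>]*class="reference"[^>]*>[\s\S]*?<\/sup>/g, '')
    // Drop every remaining tag but keep its text content
    .replace(/<[^>]+>/g, '')
    // Decode known entities in a single pass; unknown ones are left untouched
    .replace(/&[a-z#0-9]+;/gi, (entity) => entities[entity] || entity)
    // Collapse runs of whitespace
    .replace(/\s+/g, ' ')
    .trim();
}

const fragment =
  '<p><a href="/wiki/293">293</a> &ndash; Roman Emperors ' +
  '<a href="/wiki/Diocletian">Diocletian</a> and ' +
  '<a href="/wiki/Maximian">Maximian</a>.' +
  '<sup id="cite_ref-1" class="reference"><a href="#cite_note-1">[1]</a></sup></p>';

console.log(htmlToText(fragment));
```

The input here is a hand-written fragment, not real API output; with real pages the markup is considerably messier, which is the reason the answer recommends a dedicated converter.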

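【Comments】:

  • ...and not only templates: wiki markup as a whole is hard to "parse".
  • Maybe so, but that does not solve the problem; sometimes you have to work with what you have...

【Solution 2】: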

It should look something like this :-)

NSString *stringToParse = @"{\"query\":{\"normalized\":[{\"from\":\"May_21\",\"to\":\"May 21\"}],\"pages\":{\"19684\":{\"pageid\":19684,\"ns\":0,\"title\":\"May 21\",\"revisions\":[{\"*\":\"==Events==\\n* [[293]] &ndash; Roman Emperors [[Diocletian]] and [[Maximian]] appoint [[Galerius]] as [[Caesar (title)|''Caesar'']] to Diocletian, beginning the period of four rulers known as the [[Tetrarchy]].\\n* [[878]] &ndash; [[Syracuse, Italy]], is [[Muslim conquest of Sicily|captured]] by the ...";

// Replace the &ndash; entity (including its trailing semicolon) with -
stringToParse = [stringToParse stringByReplacingOccurrencesOfString:@"&ndash;" withString:@"-"];

// [[Caesar (title)|''Caesar'']] should become Caesar,
// [[Maximian]] should become Maximian,
// and likewise [[1972]] -> 1972
NSString *regexToReplaceWikiLinks = @"\\[\\[([A-Za-z0-9_ ()]+?\\|)?(\\'\\')?(.+?)(\\'\\')?\\]\\]";

NSError *error = nil;
NSRegularExpression *regex = [NSRegularExpression regularExpressionWithPattern:regexToReplaceWikiLinks
                                                                       options:NSRegularExpressionCaseInsensitive
                                                                         error:&error];

// Note: each match is replaced with the contents of the third capture group
NSString *modifiedString = [regex stringByReplacingMatchesInString:stringToParse
                                                           options:0
                                                             range:NSMakeRange(0, [stringToParse length])
                                                      withTemplate:@"$3"];

NSLog(@"%@", modifiedString);

Result:

{"query":{"normalized":[{"from":"May_21","to":"May 21"}],"pages":{"19684":{"pageid":19684,"ns":0,"title":"May 21","revisions":[{"*":"==Events==\n* 293 - Roman Emperors Diocletian and Maximian appoint Galerius as Caesar to Diocletian, beginning the period of four rulers known as the Tetrarchy.\n* 878 - Syracuse, Italy, is captured by the ...
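
The answer above runs its replacements over the raw JSON response. A possibly more robust variant is to parse the JSON first and clean only the wikitext field. The sketch below uses JavaScript rather than Objective-C, and `extractWikitext` is a hypothetical helper name; it assumes the response shape shown above, where the revision content sits under the "*" key.

```javascript
// Hypothetical helper: return the wikitext of every page found in a
// prop=revisions&rvprop=content response (content stored under "*").
function extractWikitext(responseJson) {
  const data = JSON.parse(responseJson);
  const pages = (data.query && data.query.pages) || {};
  return Object.values(pages)
    // Missing pages carry no "revisions" array, so skip them
    .filter((page) => Array.isArray(page.revisions) && page.revisions.length > 0)
    .map((page) => page.revisions[0]['*']);
}

// Shortened sample modelled on the response in the answer above
const sampleResponse = JSON.stringify({
  query: {
    pages: {
      19684: {
        pageid: 19684,
        ns: 0,
        title: 'May 21',
        revisions: [{ '*': '==Events==\n* [[293]] &ndash; Roman Emperors [[Diocletian]] ...' }],
      },
    },
  },
});

console.log(extractWikitext(sampleResponse));
```

Working on the extracted string keeps the replacements away from JSON keys and escape sequences, so a pattern such as the link regex cannot accidentally alter the structure of the response.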

【Comments】:

【Solution 3】:

Regular expressions are the way to solve this. Here is an example using JavaScript (the same approach works in any language with regular expressions):

    <dl>
        <script type="text/javascript">
    
            var source = "[[293]] &ndash; Roman Emperors [[Diocletian]] and [[Maximian]] appoint [[Galerius]] as [[Caesar (title)|''Caesar'']] to Diocletian, beginning the period of four rulers known as the [[Tetrarchy]].";
    
            document.writeln('<dt> Original </dt>');
            document.writeln('<dd>' + source + '</dd>');
    
            // Replace links with any found titles
            var matchTitles = /\[\[([^\]]+?)\|\'\'(.+?)\'\']\]/ig; /* <- Answer */
            source = source.replace(matchTitles, '$2');
    
            document.writeln('<dt> First Pass </dt>');
            document.writeln('<dd style="color: green;">' + source + '</dd>');
    
            // Replace links with contents
            var matchLinks = /\[\[(.+?)\]\]/ig;
            source = source.replace(matchLinks, '$1');
    
            document.writeln('<dt> Second Pass </dt>');
            document.writeln('<dd>' + source + '</dd>');
        </script>
    </dl>
    

You can also see this working here: http://jsfiddle.net/NujmB/
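
The first pass in the script above only matches piped links whose label is wrapped in '' italics, so a link such as [[Muslim conquest of Sicily|captured]] would come out of the second pass as "Muslim conquest of Sicily|captured". A more general variant might look like the sketch below. `stripWikiLinks` is a hypothetical name, and the function deliberately ignores templates, file links with nested markup, and other wikitext constructs.

```javascript
// Hypothetical helper, not part of the original answer.
function stripWikiLinks(wikitext) {
  return wikitext
    // [[target|label]] -> label, [[target]] -> target
    .replace(/\[\[(?:[^\]|]*\|)?([^\]]+?)\]\]/g, '$1')
    // Remove ''italic'' and '''bold''' quote runs
    .replace(/'{2,}/g, '')
    // Replace the entity mentioned in the question
    .replace(/&ndash;/g, '-');
}

const source =
  "[[293]] &ndash; Roman Emperors [[Diocletian]] and [[Maximian]] appoint " +
  "[[Galerius]] as [[Caesar (title)|''Caesar'']] to Diocletian, beginning " +
  "the period of four rulers known as the [[Tetrarchy]].";

console.log(stripWikiLinks(source));
// 293 - Roman Emperors Diocletian and Maximian appoint Galerius as Caesar
// to Diocletian, beginning the period of four rulers known as the Tetrarchy.
```

Removing every run of two or more apostrophes is a simplification: it also strips such runs when they appear in ordinary text, which is rare but possible.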

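【Comments】:

【Solution 4】:

I don't know Objective-C, but this is the JavaScript code I use for the same purpose
(it can serve as pseudocode for you and may help other JavaScript users):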

    // section=0 selects the first section of the wiki page, i.e. the introduction only
    var url = 'http://en.wikipedia.org/w/api.php?callback=?&action=parse&page=facebook&prop=text&format=json&section=0';
    $.getJSON(url, function (response) {
        // Take only the first paragraph of the introduction
        var intro = $(response.parse.text['*']).filter('p:eq(0)').html();
        var wikiBox = $('#wikipediaBox .wikipedia div.overview');
        wikiBox.empty().html(intro);
        // Turn relative links into absolute ones and open all links in a new window
        wikiBox.find("a:not(.references a)").attr("href", function () {
            return "http://www.wikipedia.org" + $(this).attr("href");
        });
        wikiBox.find("a").attr("target", "_blank");
        // Remove the footnote reference markers
        wikiBox.find('sup.reference').remove();
    });
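
The request URL in the snippet above is hard-coded for one page. A small sketch of building the same kind of `action=parse` URL for any page and section is shown below; `buildParseUrl` is a hypothetical helper, the parameters are the ones used in the answer (without the JSONP `callback`), and the https endpoint is taken from the URL in the question.

```javascript
// Hypothetical helper: build an action=parse request URL for a page section.
function buildParseUrl(page, section) {
  const params = new URLSearchParams({
    action: 'parse',
    page: page,
    prop: 'text',
    format: 'json',
    section: String(section), // 0 is the introduction
  });
  return 'https://en.wikipedia.org/w/api.php?' + params.toString();
}

console.log(buildParseUrl('facebook', 0));
// https://en.wikipedia.org/w/api.php?action=parse&page=facebook&prop=text&format=json&section=0
```

`URLSearchParams` takes care of escaping, so page titles containing spaces or other special characters do not need manual encoding.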
      

【Comments】:
