【Title】: XPath: Select Current and Next Node's text by Current Node Attributes
【Posted】: 2011-03-06 05:11:12
【Question】:

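First, this is a spin-off of my previous question. I am posting it again because the person whose answer I accepted in the original post suggested I do so, since he felt the question had not been properly defined before. Attempt 2:

I am trying to scrape information from this webpage. For clarity, here is a selected block of the page source: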

<p class="titlestyle">ANT101H5 Introduction to Biological Anthropology and Archaeology 
                    <span class='distribution'>(SCI)</span></p> 
<span class='normaltext'> 
Anthropology is the global and holistic study of human biology and behaviour, and includes four subfields: biological anthropology, archaeology, sociocultural anthropology and linguistics. The material covered is  directed  to answering the question: What makes us human? This course is a survey of  biological  anthropology and  archaeology.  [<span class='Helpcourse'
        onMouseover="showtip(this,event,'24 Lectures')"
        onMouseout="hidetip()">24L</span>, <span class='Helpcourse'
        onMouseover="showtip(this,event,'12 Tutorials')"
        onMouseout="hidetip()">12T</span>]<br> 
<span class='title2'>Exclusion: </span><a href='javascript:OpenCourse("WEBCOURSENOTFOUND.html")'>ANT100Y5</a><br> 
<span class='title2'>Prerequisite: </span><a href='javascript:OpenCourse("WEBCOURSEANT102H5.pl?fv=1")'>ANT102H5</a><br> 


From the sample block above, I want to extract the following information:

  1. ANT101H5 Introduction to Biological Anthropology and Archaeology
  2. Exclusion: ANT100Y5
  3. Prerequisite: ANT102H5

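I want to get all such information from the webpage (keep in mind that some courses may additionally list "Corequisites", and some may list no prerequisites, corequisites, or exclusions at all).

I have been trying to write a suitable XPath expression for this task, but I can't seem to get it quite right.

So far, with the help of Dimitre Novatchev, I have been able to use the following expression: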

sites = hxs.select("(//p[@class='titlestyle'])[2]/text()[1] | (//span[@class='title2'])[2]/text() | \
                    (//span[@class='title2'])[2]/following-sibling::a[1]/text() | (//span[@class='title2'])[3]/text() | \
                    (//span[@class='title2'])[3]/following-sibling::a[1]/text()")

However, it produces the following output, which seems to contain information for only the first course on the page:

[{"desc": "ANT101H5 Introduction to Biological Anthropology and Archaeology \n                        "},
 {"desc": "Exclusion: "},
 {"desc": "ANT100Y5"},
 {"desc": "Prerequisite: "},
 {"desc": "ANT102H5"}]

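To be absolutely clear, this output is correct only insofar as it gets the right information for the first course. I need the correct information for all of the courses listed on that webpage.

I am very close, but I can't seem to figure out the last step.

I'd appreciate any help. Thanks in advance.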

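【Comments】:

  • Sorry, the provided text is not well-formed XML. Please correct it. I could try to correct it myself, but how could I be sure that I had corrected it "the right way"?
  • @inspectorG4dget: I posted an answer with a complete and simple XSLT solution. If you still need it, you can derive your single XPath expression from this XSLT code. :)
  • @inspectorG4dget: It looks like you are grouping. This is not possible in XPath 1.0, because a node-set is an unordered set of unique nodes. You have to select the first of the group (the p element, in this case) and then select the rest of the group with this node as context.
  • @Dimitre: The posted XML is a direct excerpt from the target webpage; I don't know what you mean by "correcting it". If you can be more specific, I can try to be more helpful.
  • @Alejandro: That is what I want to do, because I know that p exists. I don't know that the rest exist. If that is not what my code does, please advise how to change it.

Tags: python xslt xpath scrapy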


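【Solution 1】:

The single XPath expression needed to select the relevant data for all courses is quite messy, so I take another approach here, which can be used (if really necessary) to produce the single XPath expression:

This simple XSLT transformation: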

<xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output omit-xml-declaration="yes" indent="yes"/>
 <xsl:strip-space elements="*"/>

 <xsl:template match="p[@class='titlestyle']">
  <xsl:text>&#xA;===================&#xA;</xsl:text>
  <xsl:value-of select="text()[1]"/>
 </xsl:template>

 <xsl:template match=
  "span/span[@class='title2'][not(position() >1)]">
   <xsl:text>&#xA;</xsl:text>
   <xsl:value-of select="."/>
   <xsl:value-of select="following-sibling::a[1]"/>

   <xsl:if test="not(following-sibling::a)">
    <xsl:value-of select="following-sibling::text()[1]"/>
   </xsl:if>
   <xsl:text>&#xA;</xsl:text>
 </xsl:template>
 <xsl:template match="text()"/>
</xsl:stylesheet>

when applied to the page http://www.utm.utoronto.ca/regcal/WEBLISTCOURSES1.html (tidied up to become a well-formed XML document), produces the wanted result:

===================
Anthropology
===================
ANT101H5 Introduction to Biological Anthropology and Archaeology

Exclusion: ANT100Y5

===================
ANT102H5 Introduction to Sociocultural and Linguistic Anthropology

Exclusion: ANT100Y5

===================
ANT200Y5 World Archaeology and Prehistory

Prerequisite: 101H5

===================
ANT203Y5 Biological Anthropology

Prerequisite: 101H5

===================
ANT204Y5 Sociocultural Anthropology

Prerequisite: 101H5

===================
ANT205H5 Introduction to Forensic Anthropology

Prerequisite: 101H5

===================
ANT206Y5 Culture and Communication: Introduction to Linguistic Anthropology

Exclusion: ANT206H5

===================
ANT241Y5 Aboriginal Peoples of North America

===================
ANT299Y5 Research Opportunity Program

===================
ANT304H5 Anthropology and Aboriginal Peoples

Exclusion: ANT304Y5

===================
ANT306H5 Forensic Anthropology Field School

Prerequisite: ANT205H5

===================
ANT308H5 Case Studies in Archaeological Botany and Zoology

Prerequisite: ANT200Y5

===================
ANT309H5 Southeast Asian Archaeology

Prerequisite: ANT200Y5

===================
ANT310H5 Complex Societies

Prerequisite: ANT200Y5

===================
ANT312H5 Archaeological Analysis

Prerequisite: ANT200Y5

===================
ANT313H5 China, Korea and Japan in Prehistory

Prerequisite: ANT200Y5

===================
ANT314H5 Archaeological Theory

Exclusion: ANT411H5

===================
ANT316H5 South Asian Archaeology

Prerequisite: ANT200Y5

===================
ANT317H5 Archaeology of Eastern North America

Prerequisite: ANT200Y5

===================
ANT318H5 Archaeological Fieldwork

Prerequisite: ANT200Y5

===================
ANT320H5 Archaeological Approaches to Technology

Prerequisite: ANT200Y5

===================
ANT322H5 Anthropology of Youth Culture

Exclusion: ANT204Y5

===================
ANT327H5 Agricultural Origins:  The Second Revolution

Prerequisite: ANT200Y5

===================
ANT331H5 The Biology of Human Sexuality

Exclusion: ANT330H5

===================
ANT332H5 Human Origins

Exclusion: ANT332Y5

===================
ANT333H5 Human Origins II

Exclusion: ANT332Y5

===================
ANT334H5 Human Osteology

Exclusion: ANT334Y5

===================
ANT335H5 Anthropology of Gender

Exclusion: ANT331Y5

===================
ANT336H5 Molecular Anthropology

Prerequisite: ANT203Y5

===================
ANT338H5 Laboratory Methods in Biological Anthropology

Prerequisite: ANT203Y5

===================
ANT339Y5 Human Adaptation through Biological and Cultural Means

Prerequisite: ANT203Y5

===================
ANT340H5 Osteological Theory

Exclusion: ANT334Y5

===================
ANT350H5 Globalization and the Changing World of Work

Prerequisite: ANT204Y5

===================
ANT351H5 Money, Markets, Gifts: Topics in Economic Anthropology

Prerequisite: ANT204Y5

===================
ANT352H5 Power, Authority, and Legitimacy: Topics in Political Anthropology

Prerequisite: ANT204Y5

===================
ANT358H5 Ethnographic Methods

Prerequisite: ANT204Y5

===================
ANT360H5 Anthropology of Religion

Exclusion: ANT209Y5

===================
ANT361H5 Anthropology of Sub-Saharan Africa

Exclusion: ANT212Y5

===================
ANT362H5 Language in Culture and Society

Prerequisite: ANT204Y5

===================
ANT363H5 Magic, Witchcraft and Science

Prerequisite: ANT360H5

===================
ANT364H5 Lab in Social Interaction

Prerequisite: ANT206H5

===================
ANT365H5 Semiotic Anthropology

Prerequisite: ANT204Y5

===================
ANT368H5 World Religions and Ecology

Exclusion: RLG311H5

===================
ANT369H5 Religious Violence and Nonviolence

Exclusion: RLG317H5

===================
ANT397H5 Independent Study

Prerequisite: Permission of Faculty Advisor


===================
ANT398Y5 Independent Reading

Prerequisite: Permission of Faculty Advisor


===================
ANT399Y5 Research Opportunity Program

Prerequisite: P.I.


===================
ANT401H5 Vocal and Visual Communication

Prerequisite: ANT102H5

===================
ANT414H5 People and Plants in Prehistory

Prerequisite: ANT200Y5

===================
ANT415H5 Faunal Archaeo-Osteology

Exclusion: ANT415Y5

===================
ANT416H5 Advanced Archaeological Analysis

Prerequisite: ANT312H5

===================
ANT418H5 Advanced Archaeological Fieldwork

Prerequisite: ANT318H5

===================
ANT430H5 Special Problems in Biological Anthropology and Archaeology

Prerequisite: P.I


===================
ANT430Y5 Special Problems in Biological Anthropology and Archaeology

Prerequisite: P.I. 


===================
ANT431Y5 Special Problems in Sociocultural or Linguistic Anthropology

Prerequisite: P.I.


===================
ANT431H5 Special Problems in Sociocultural or Linguistic Anthropology

Prerequisite: P.I.


===================
ANT432H5 Special Seminar in Anthropology

Prerequisite: P.I.


===================
ANT433H5 Genes, Language, Artifact and Mind

Prerequisite: ANT200Y5

===================
ANT434H5 Palaeopathology

Prerequisite: ANT334Y5

===================
ANT438H5 The Development of Thought in Biological Anthropology

Prerequisite: ANT203Y5

===================
ANT439Y5 Advanced Forensic Anthropology

Prerequisite: ANT205H5

===================
ANT441H5 Advanced Bioarchaeology

Prerequisite: ANT334H5

===================
ANT457H5 Anthropology and the Environment

Prerequisite: ANT102H5

===================
ANT458H5 Anthropology of Crime, Law and Order

Exclusion: ANT204Y5

===================
ANT459H5 The Ethnography of Speaking

Prerequisite: ANT206Y5

===================
ANT460H5 Theory in Sociocultural Anthropology

Prerequisite: ANT204Y5

===================
ANT461H5 Emergent Topics in Socio-Cultural &amp;  Linguistic Anthropology

Prerequisite: ANT204Y5

===================
ANT498H5 Advanced Independent Study

Prerequisite: P.I.


===================
ANT499Y5 Advanced Independent Research

Prerequisite: P.I.
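
For readers working in Python rather than XSLT, the grouping idea behind this stylesheet (start a record at each p with class titlestyle, then attach the title2 labels that follow it) can be sketched with the standard library alone. This is not XPath and not the answerer's method: it swaps in html.parser, and the names CourseParser, extract_courses, and the abridged SAMPLE markup are hypothetical. Unlike the stylesheet, which appears to keep only the first title2 label per course, this sketch keeps every label it sees.

```python
from html.parser import HTMLParser

# Abridged, hypothetical markup modeled on the excerpt in the question.
SAMPLE = """<p class="titlestyle">ANT101H5 Introduction to Biological Anthropology and Archaeology
<span class='distribution'>(SCI)</span></p>
<span class='normaltext'>
Course description omitted. [<span class='Helpcourse'>24L</span>]<br>
<span class='title2'>Exclusion: </span><a href='#'>ANT100Y5</a><br>
<span class='title2'>Prerequisite: </span><a href='#'>ANT102H5</a><br>
</span>
<p class="titlestyle">ANT241Y5 Aboriginal Peoples of North America</p>
<span class='normaltext'>Course description omitted.<br>
</span>
"""


class CourseParser(HTMLParser):
    """Group title2 labels under the most recent p.titlestyle element."""

    def __init__(self):
        super().__init__()
        self.courses = []        # one dict per course, in document order
        self.stack = []          # (tag, class) pairs of the open elements
        self.want_value = False  # True right after a title2 span closes

    def handle_starttag(self, tag, attrs):
        if tag == "br":          # void element: never pushed on the stack
            self.want_value = False
            return
        cls = dict(attrs).get("class")
        self.stack.append((tag, cls))
        if (tag, cls) == ("p", "titlestyle"):
            self.courses.append({"title": "", "requirements": []})

    def handle_endtag(self, tag):
        # Pop back to the nearest open element with this tag name.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                closed = self.stack[i]
                del self.stack[i:]
                if closed == ("span", "title2"):
                    self.want_value = True
                break

    def handle_data(self, data):
        text = " ".join(data.split())
        if not text or not self.stack or not self.courses:
            return
        top, course = self.stack[-1], self.courses[-1]
        if top == ("p", "titlestyle"):
            if not course["title"]:      # first direct text node only
                course["title"] = text
        elif top == ("span", "title2"):
            course["requirements"].append([text, ""])
        elif self.want_value and course["requirements"]:
            # First text after the label: a link's text or plain text.
            course["requirements"][-1][1] = text
            self.want_value = False


def extract_courses(html):
    parser = CourseParser()
    parser.feed(html)
    parser.close()
    return parser.courses


courses = extract_courses(SAMPLE)
for course in courses:
    print(course["title"])
    for label, value in course["requirements"]:
        print("   ", label, value)
```

Because each label is attached to the most recently opened course, a course with no Exclusion, Prerequisite, or Corequisite entries simply ends up with an empty list instead of shifting the labels of later courses.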

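【Discussion】:

  • How would I apply this XSLT transformation? I am very new to this and need some guidance. Thanks.
  • @inspectorG4dget: How to perform a transformation depends on the XSLT processor being used. Almost every XSLT processor has its own way of applying a transformation, which is described in its documentation. In addition, all XSLT processors offer two ways to run a transformation: from the command line or from within a programming language. I am using 9 different XSLT 1.0 processors and 3 different XSLT 2.0 processors, all from the command line. For Msxml6: %xml% %xsl% -o %out% -u '6.0' -t %param[ name="value"]%, for .NET: %xml% %xsl% -t -o %out%%param[ name="value"]%
  • @inspectorG4dget: Continued: %xml% stands for the XML file, %xsl% for the XSL file, and %out% for the output file. ` %param[ name="value"]%` is for supplying external parameters as name-value pairs, which I usually don't use. Saxon9.1.5: -Xms512M -Xmx512M -jar C:\xml\Parsers\Saxon\Ver.9.1.0.5\J\saxon9.jar -t -repeat:1 -o %out% %xml% %xsl% %param[ name=\"value\"]%, XQSharp: -s %xml% -o %out% -r 1 -t %xsl% %param[ name="value"]%
  • Thank you very much. I'll try this and get back to you as soon as I can. However, I have to study for some midterms right now, so I'll try it a bit later. +1 for the effort. Once I test it and see that it works, I'll accept it.
【Solution 2】:

Try using something like [position() mod <offset> = <base>] instead of [<int>].

The offset is the distance between each of the nodes you are interested in. It may be different for @class='titlestyle' and @class='title2'.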

ites = hxs.select("(//p[@class='titlestyle'])[position() mod <offset to next to match> = 2]/text()[1] | (//span[@class='title2'])[position() mod <offset to next to match> = 2]/text() | \
                    (//span[@class='title2'])[position() mod <offset to next to match> = 2]/following-sibling::a[1]/text() | (//span[@class='title2'])[position() mod <offset to next to match> = 3]/text() | \
                    (//span[@class='title2'])[position() mod <offset to next to match> = 3]/following-sibling::a[1]/text()")

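Edit: as requested.

Execute each individual XPath one at a time, without restricting its position. This is a manual fact-finding exercise to determine the final values to use in the XPath.

Return all nodes matching the following XPath (this is the first one):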

ites = hxs.select("(//p[@class='titlestyle'])/text()[1]")

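ites will contain some courses you want and some you don't.

You have already determined that the second node is the first one you want. Now count the distance to the next one in ites that you want this rule to match. This is what we can refer to as <offset to next to match>.

Now repeat the above for each of the remaining XPath searches.

Think of hxs.select("") as a filter that, as it walks the XML, returns everything that matches your XPath.

Here is an example: http://zvon.org/xxl/XPathTutorial/Output/example22.html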
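
To make the "distance" concrete: it is the difference between the 1-based positions of two wanted nodes in the unfiltered result list. The following is a minimal sketch in plain Python of what a [position() mod <offset> = <base>] predicate keeps. The helper name every_nth and the placeholder node names are hypothetical, and the list stands in for whatever hxs.select returns without a positional filter.

```python
def every_nth(nodes, offset, base):
    """Mimic (...)[position() mod offset = base] on a Python list.

    XPath positions are 1-based, so enumeration starts at 1.
    """
    return [node
            for position, node in enumerate(nodes, start=1)
            if position % offset == base]


# Placeholder names standing in for nine matched nodes.
nodes = ["n1", "n2", "n3", "n4", "n5", "n6", "n7", "n8", "n9"]

# [2] alone keeps exactly one node: the second match.
second_only = nodes[2 - 1:2]

# [position() mod 3 = 2] keeps the 2nd match and every 3rd one after it.
repeated = every_nth(nodes, offset=3, base=2)

print(second_only)
print(repeated)
```

This only lines up with the courses if the wanted nodes really are evenly spaced. The question notes that some courses list corequisites and others list nothing at all, so the spacing of the title2 matches is likely to vary from course to course, and a fixed offset may then pick the wrong nodes.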

【Discussion】:

  • I am a noob at this (obviously). Could you clarify how the "distance between each node" is measured?