【Question title】: Using Beautiful Soup to obtain class contents with a conditional
【Posted】: 2017-11-08 05:31:05
【Question description】:

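I want to use Beautiful Soup to find the tags whose child tag (gains or losses) is greater than 0. Then I want to print the contents of the inner tags "gains", "losses", and "band.textualrepresentation". This is basically the script I want (although this one does not work).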

import sys
from BeautifulSoup import BeautifulSoup as Soup

def parseLog(file):
        file = sys.argv[1]
        handler = open(file).read()
        soup = Soup(handler)
        for anytype in soup('anytype', 'gains'.string>0 || 'losses'.string>0):
                gain = anytype.gains.string
                loss = anytype.losses.string
                band = anytype.band.textualrepresentation.string
                print gain loss band

parseLog(sys.argv[1])

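I am running into trouble right at the start: I cannot even print the contents of gains, let alone print only the contents that meet a condition. My current script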

def parseLog(file):
        file = sys.argv[1]
        handler = open(file).read()
        soup = Soup(handler)
        for anytype in soup.findall('anytype'):
                gain = anytype.fetch('gains')
                print gain

parseLog(sys.argv[1])

returns

Traceback (most recent call last):
  File "./soup.py", line 13, in <module>
    parseLog(sys.argv[1])
  File "./soup.py", line 9, in parseLog
    for anytype in soup.findall('anytype'):
TypeError: 'NoneType' object is not callable


Sample input

      <anytype xsi:type="GainLossStruct">
         <band>
          <textualrepresentation>
           22q11.1
          </textualrepresentation>
         </band>
         <gains>
          2
         </gains>
         <losses>
          1
         </losses>
         <structs>
          0
         </structs>
        </anytype>
        <anytype xsi:type="GainLossStruct">
         <band>
          <textualrepresentation>
           22q11.2
          </textualrepresentation>
         </band>
         <gains>
          0
         </gains>
         <losses>
          1
         </losses>
         <structs>
          0
         </structs>
        </anytype>
        <anytype xsi:type="GainLossStruct">
         <band>
          <textualrepresentation>
           22q12
          </textualrepresentation>
         </band>
         <gains>
          0
         </gains>
         <losses>
          0
         </losses>
         <structs>
          0
         </structs>
        </anytype>

Sample output

2  1  22q11.1
0  1  22q11.2


Update: current solution

import sys
from BeautifulSoup import BeautifulSoup as Soup

def parseLog(file):
        file = sys.argv[1]
        handler = open(file).read()
        soup = Soup(handler)
        for anytype in soup(lambda x: x.name=='anytype' and (hasattr(x, 'gains') and int(x.gains.string) > 0 or hasattr(x, 'losses') and int(x.losses.string) > 0)):
                gain = anytype.gains.string
                loss = anytype.losses.string
                band = anytype.band.textualrepresentation.string
                print gain, loss, band

parseLog(sys.argv[1])

It still returns an error:

Traceback (most recent call last):
  File "./soup.py", line 15, in <module>
    parseLog(sys.argv[1])
  File "./soup.py", line 9, in parseLog
    for anytype in soup(lambda x: x.name=='anytype' and (hasattr(x, 'gains') and int(x.gains.string) > 0 or hasattr(x, 'losses') and int(x.losses.string) > 0)):
  File "/Users/jacob/homebrew/lib/python2.7/site-packages/BeautifulSoup.py", line 659, in __call__
    return apply(self.findAll, args, kwargs)
  File "/Users/jacob/homebrew/lib/python2.7/site-packages/BeautifulSoup.py", line 849, in findAll
    return self._findAll(name, attrs, text, limit, generator, **kwargs)
  File "/Users/jacob/homebrew/lib/python2.7/site-packages/BeautifulSoup.py", line 377, in _findAll
    found = strainer.search(i)
  File "/Users/jacob/homebrew/lib/python2.7/site-packages/BeautifulSoup.py", line 966, in search
    found = self.searchTag(markup)
  File "/Users/jacob/homebrew/lib/python2.7/site-packages/BeautifulSoup.py", line 924, in searchTag
    or (markup and self._matches(markup, self.name)) \
  File "/Users/jacob/homebrew/lib/python2.7/site-packages/BeautifulSoup.py", line 983, in _matches
    result = matchAgainst(markup)
  File "./soup.py", line 9, in <lambda>
    for anytype in soup(lambda x: x.name=='anytype' and (hasattr(x, 'gains') and int(x.gains.string) > 0 or hasattr(x, 'losses') and int(x.losses.string) > 0)):
AttributeError: 'NoneType' object has no attribute 'string'

Even if I reduce the for loop to

for anytype in soup(lambda x: x.name=='anytype' and (hasattr(x, 'gains'))):
        gain = anytype.gains.string
        print gain

I still get

Traceback (most recent call last):
  File "./soup.py", line 13, in <module>
    parseLog(sys.argv[1])
  File "./soup.py", line 10, in parseLog
    gain = anytype.gains.string
AttributeError: 'NoneType' object has no attribute 'string'

【Question discussion】:

    Tags: python xml beautifulsoup


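    【Solution 1】:

    I would parse the whole document into a pandas DataFrame and then do any manipulation there; this may make the data-cleaning process more transparent and easier to follow.

    I will use xmltojson here because I am not familiar with Beautiful Soup (although I had to wrap the whole content in a "document" tag to make it valid XML):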

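    import json
    import xmltojson
    import pandas as pd
    
    # file: path to the input XML (not defined in the original answer)
    with open(file) as f:
        # Wrap the fragment in a single root element so it parses as XML,
        # then decode the resulting JSON text with json.loads instead of eval.
        j = json.loads(xmltojson.parse("<document> "+ f.read() + "</document>"))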
    
    df = pd.io.json.json_normalize(j['document']['anytype'])
    df.columns = ['type', 'band', 'gain', 'loss', 'struct']
    df[(df.gain > '0') | (df.loss > '0')][['band', 'gain', 'loss']]
    
          band gain loss
    0  22q11.1    2    1
    1  22q11.2    0    1
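
For readers who would rather not install pandas or xmltojson, the same "parse everything into records, then filter" idea can be sketched with the standard library. This is only an illustration written for Python 3, not the answerer's method: it swaps in xml.etree.ElementTree, and the names SAMPLE, XSI and parse_records are made up for the example. It assumes the xsi prefix refers to the usual XML Schema instance namespace, and it treats a missing gains or losses element as 0.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the sample input from the question.
SAMPLE = """
<anytype xsi:type="GainLossStruct">
 <band><textualrepresentation> 22q11.1 </textualrepresentation></band>
 <gains> 2 </gains><losses> 1 </losses><structs> 0 </structs>
</anytype>
<anytype xsi:type="GainLossStruct">
 <band><textualrepresentation> 22q11.2 </textualrepresentation></band>
 <gains> 0 </gains><losses> 1 </losses><structs> 0 </structs>
</anytype>
<anytype xsi:type="GainLossStruct">
 <band><textualrepresentation> 22q12 </textualrepresentation></band>
 <gains> 0 </gains><losses> 0 </losses><structs> 0 </structs>
</anytype>
"""

# Assumption: xsi is the standard XML Schema instance namespace.
XSI = "http://www.w3.org/2001/XMLSchema-instance"


def parse_records(fragment):
    """Wrap the fragment in one root element and return a dict per <anytype>."""
    # The root both makes the fragment a single document and declares the
    # xsi prefix, which the XML parser would otherwise reject as unbound.
    wrapped = '<document xmlns:xsi="%s">%s</document>' % (XSI, fragment)
    root = ET.fromstring(wrapped)
    records = []
    for node in root.iter("anytype"):
        records.append({
            "band": node.findtext("band/textualrepresentation", default="").strip(),
            # int() ignores surrounding whitespace; a missing element counts as 0.
            "gain": int(node.findtext("gains", default="0")),
            "loss": int(node.findtext("losses", default="0")),
        })
    return records


records = parse_records(SAMPLE)
hits = [r for r in records if r["gain"] > 0 or r["loss"] > 0]
for r in hits:
    print(r["gain"], r["loss"], r["band"])
```

Converting to int while building the records means the later filter compares numbers, and an element with non-numeric text raises ValueError at parse time instead of being filtered silently.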
    

    【Discussion】:

    • I get this error when using this code: Traceback (most recent call last): File "./script.py", line 5, in <module> with open(file) as f: TypeError: coercing to Unicode: need string or buffer, type found
    【Solution 2】:

    The code should be:

    for anytype in soup(lambda x: x.name=='anytype' and (int(x.gains.string) > 0 or int(x.losses.string) > 0)):
        gain = anytype.gains.string
        loss = anytype.losses.string
        band = anytype.band.textualrepresentation.string
        print gain, loss, band
    

    In Python, || is written as or, and we need to convert the strings to numbers before doing an integer comparison, e.g. int(x.gains.string). Hope that helps.
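
A small sketch of why the conversion matters, using plain strings rather than Beautiful Soup objects (written for Python 3). The helper names string_test and numeric_test are invented for this illustration. The sample input in the question has whitespace around each number, which int() tolerates but a string comparison does not.

```python
def string_test(text):
    """Compare as text: characters are compared one by one by code point."""
    return text > "0"


def numeric_test(text):
    """Compare as a number: int() strips whitespace and rejects non-numbers."""
    return int(text) > 0


for text in ["2", " 2 ", "", "foo"]:
    try:
        numeric = numeric_test(text)
    except ValueError:
        numeric = "ValueError"
    print(repr(text), string_test(text), numeric)
```

For " 2 " the string comparison is False, because a space sorts before "0", while the numeric comparison is True. For "foo" the string comparison is True even though the value is not a number at all, whereas int() raises ValueError.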

    【Discussion】:

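    • Actually, you can leave it as a string and do x.gains.string > '0'
    • That makes sense, but only if the data is well-formed. If the string gives '' > '0' or 'foo' > '0', the error will be hard to find. We want an error rather than silently producing wrong results.
    • I still get an error: File "./soup.py", line 13 print gain loss band ^ SyntaxError: invalid syntax
    • This means some of your elements do not contain gains or losses child elements. You can check for that with soup(lambda x: x.name=='anytype' and (hasattr(x, 'gains') and int(x.gains.string) > 0 or hasattr(x, 'losses') and int(x.losses.string) > 0))
    • @Jacob OK, in that case x does have gains, but it is a None object. So you may need more protection: (hasattr(x, 'gains') and x.gains is not None)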
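
The last comment can be illustrated without the library. Beautiful Soup tags answer an unknown attribute name by searching for a child tag of that name and returning None when there is none, so hasattr is true either way. That is also why soup.findall (instead of findAll) ends in "'NoneType' object is not callable". The classes below are hypothetical stand-ins that only mimic that lookup behaviour; FakeTag, FakeChild and is_positive are not part of any library.

```python
class FakeChild(object):
    """Stand-in for a child tag that carries a .string value."""

    def __init__(self, string):
        self.string = string


class FakeTag(object):
    """Stand-in that mimics tag-style attribute lookup (not the real class)."""

    def __init__(self, **children):
        self._children = children

    def __getattr__(self, name):
        # Only called for names not found normally. Unknown names yield
        # None instead of raising AttributeError, so hasattr() is True.
        if name.startswith("_"):
            raise AttributeError(name)
        return self._children.get(name)


def is_positive(tag, name):
    """Guarded test: the child must exist and hold text before int() runs."""
    child = getattr(tag, name)
    if child is None or child.string is None:
        return False
    return int(child.string) > 0


with_gains = FakeTag(gains=FakeChild(" 2 "))
without_gains = FakeTag()

print(hasattr(without_gains, "gains"))        # True, although there is no child
print(without_gains.gains is None)            # True
print(is_positive(with_gains, "gains"))       # True
print(is_positive(without_gains, "gains"))    # False, and no AttributeError
```

With a real tag, the same guard is the test for None from the comment above, applied before .string is read.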