【Question title】: BeautifulSoup - joining two strings, putting them on the same line
【Posted】: 2022-07-05 23:49:50
【Question】:

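I want to extract word definitions from an online dictionary. The site's structure is a bit odd: the definitions have no tags or attributes of their own, so I use the .find_next_sibling method. I get all the text I want, but I can't find a way to join the pieces and put them on the same line. Here is my code: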

import requests
from bs4 import BeautifulSoup as bs

word = 'ក'
url = "http://dictionary.tovnah.com/?word=" + word + "&dic=headley&criteria=word"
headers = {"User-Agent" : "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.66 Safari/537.36 Edg/103.0.1264.44"}
response = requests.get(url, headers=headers)
soup = bs(response.text, "lxml")

main = soup.find('ol', attrs={'start':'1'})
entries = main.find_all('li')
for entry in entries:
    pos = entry.find('a').find_next_sibling(text=True)
    meaning = entry.find('a').find_next_siblings(text=True)[4]
    result = pos + meaning
    
    print(result)

#            first letter of the Cambodian alphabet ​ ​​​​​​​​​​​​​​​​​​​​​​​

             ( n ) 
              
            
            
             neck; collar; connecting link ​​​​​​​​​​​​​​​​​​​​​​​

             ( v ) 
              
            
            
             to build, construct, create, found; to base on; to commence, start up; to come into being ​​​​​​​​​​​​​​​​​​​​​

Expected result:

first letter of the Cambodian alphabet ​ ​​​​​​​​​​​​​​​​​​​​​​​

( n ) neck; collar; connecting link ​​​​​​​​​​​​​​​​​​​​​​​

( v ) to build, construct, create, found; to base on; to commence, start up; to come into being ​​​​​​​​​​​​​​​​​​​​​​​​

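I want to get rid of the indentation and put the part of speech (pos) in front of the definition (meaning). I think the way my output prints is caused by invisible HTML elements. When I put each result in a list, it shows: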

['\n\n\t\t    \n\t\t    \n\t\t     first letter of the Cambodian alphabet \u200b \u200b\u200b\u200b\u200b\u200b\u200b\u200b\u200b\u200b\u200b\u200b\u200b\u200b\u200b\u200b\u200b\u200b\u200b\u200b\u200b\u200b\u200b\u200b']
['\n\t\t     ( n ) \n\t\t      \n\t\t    \n\t\t    \n\t\t     neck; collar; connecting link \u200b\u200b\u200b\u200b\u200b\u200b\u200b\u200b\u200b\u200b\u200b\u200b\u200b\u200b\u200b\u200b\u200b\u200b\u200b\u200b\u200b\u200b\u200b']
['\n\t\t     ( v ) \n\t\t      \n\t\t    \n\t\t    \n\t\t     to build, construct, create, found; to base on; to commence, start up; to come into being \u200b\u200b\u200b\u200b\u200b\u200b\u200b\u200b\u200b\u200b\u200b\u200b\u200b\u200b\u200b\u200b\u200b\u200b\u200b\u200b\u200b\u200b\u200b']

Even as a list, I still can't find a way to get rid of all those unwanted characters. Please advise.

screenshot of the page structure

【Question comments】:

    Tags: python arrays join beautifulsoup


    【Solution 1】:

    Use .strip() to remove leading and trailing whitespace and newlines from each string:

    import requests
    from bs4 import BeautifulSoup as bs
    
    word = 'ក'
    url = "http://dictionary.tovnah.com/?word=" + word + "&dic=headley&criteria=word"
    headers = {"User-Agent" : "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.66 Safari/537.36 Edg/103.0.1264.44"}
    response = requests.get(url, headers=headers)
    soup = bs(response.text, "lxml")
    
    main = soup.find('ol', attrs={'start':'1'})
    entries = main.find_all('li')
    for entry in entries:
        pos = entry.find('a').find_next_sibling(text=True).strip()
        meaning = entry.find('a').find_next_siblings(text=True)[4].strip()
        result = pos + meaning
        print(result)
    

    Output:

    first letter of the Cambodian alphabet ​ ​​​​​​​​​​​​​​​​​​​​​​​
    ( n )neck; collar; connecting link ​​​​​​​​​​​​​​​​​​​​​​​
    ( v )to build, construct, create, found; to base on; to commence, start up; to come into being ​​​​​​​​​​​​​​​​​​​​​​​
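
The output above still differs from the expected result in two ways: there is no space between the part of speech and the definition, and the trailing zero-width spaces (\u200b) survive, because Python does not classify U+200B as whitespace and so .strip() leaves it alone. A minimal sketch of one way to handle both follows. It runs offline on the raw strings shown in the question (with the zero-width runs shortened), and clean_text is a hypothetical helper name, not part of BeautifulSoup.

```python
ZERO_WIDTH_SPACE = "\u200b"


def clean_text(raw: str) -> str:
    """Drop zero-width spaces, then collapse each whitespace run to one space."""
    without_zw = raw.replace(ZERO_WIDTH_SPACE, "")
    # str.split() with no arguments splits on any run of spaces, tabs or
    # newlines and discards empty pieces, so join() leaves single spaces only.
    return " ".join(without_zw.split())


# Raw strings copied from the question's list output; the \u200b runs are
# shortened here for readability (an assumption that does not change the result).
raw_entries = [
    "\n\n\t\t    \n\t\t    \n\t\t     first letter of the Cambodian alphabet \u200b \u200b\u200b",
    "\n\t\t     ( n ) \n\t\t      \n\t\t    \n\t\t    \n\t\t     neck; collar; connecting link \u200b\u200b",
    "\n\t\t     ( v ) \n\t\t      \n\t\t    \n\t\t    \n\t\t     to build, construct, create, found \u200b\u200b",
]

cleaned = [clean_text(raw) for raw in raw_entries]
for line in cleaned:
    print(line)
# first letter of the Cambodian alphabet
# ( n ) neck; collar; connecting link
# ( v ) to build, construct, create, found
```

In the scraping loop this would presumably become result = clean_text(pos + " " + meaning), which makes the separate .strip() calls unnecessary; that has not been checked against the live site.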
    

    【Comments】:
