【Question title】: Extract certain text lines from webpage with python
【Posted】: 2018-08-04 02:52:26
【Question description】:

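I am trying to extract the first ISS TLE (two-line element set) from this website.

I need the first three lines that follow:

 TWO LINE MEAN ELEMENT SET

That is, the text: (the ISS line, line 1, line 2).

I used Beautiful Soup to get the text I want, but I don't know how to extract just these lines. I can't use split(), because I need to preserve the whitespace within these three lines exactly. How can this be done?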

import urllib2
from bs4 import BeautifulSoup
import ephem
import datetime

nasaissurl = 'http://spaceflight.nasa.gov/realdata/sightings/SSapplications/Post/JavaSSOP/orbit/ISS/SVPOST.html'
soup = BeautifulSoup(urllib2.urlopen(nasaissurl), 'html.parser')
body = soup.find_all("pre")
index = 0
firstTLE = False
for tag in body:
    if "ISS" in tag.text:
        print tag.text

【Question discussion】:

    Tags: python string python-2.7 beautifulsoup


    【Solution 1】:

    If you split the text into lines and process one line at a time, you can rejoin the lines once you have found the three you need:

    Code:

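    def process_tag_text(tag_text):
        """Return every three-line TLE (name, line 1, line 2) found in tag_text."""
        marker = 'TWO LINE MEAN ELEMENT SET'
        results = []  # local list, so the function does not depend on a global
        text = iter(tag_text.split('\n'))
        for line in text:
            if marker in line:
                next(text)  # skip the line directly after the marker
                # Rejoin the next three lines unchanged, so whitespace is kept.
                results.append('\n'.join(
                    (next(text), next(text), next(text))))
        return results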
    

    Test code:

    import urllib2
    from bs4 import BeautifulSoup
    
    nasaissurl = 'http://spaceflight.nasa.gov/realdata/sightings/' \
                 'SSapplications/Post/JavaSSOP/orbit/ISS/SVPOST.html'
    soup = BeautifulSoup(urllib2.urlopen(nasaissurl), 'html.parser')
    body = soup.find_all("pre")
    results = []
    for tag in body:
        if "ISS" in tag.text:
            results.extend(process_tag_text(tag.text))
    
    print('\n'.join(results))
    

    Results:

    ISS
    1 25544U 98067A   18054.51611082  .00016717  00000-0  10270-3 0  9009
    2 25544  51.6368 225.3935 0003190 125.8429 234.3021 15.54140528 20837
    ISS
    1 25544U 98067A   18055.54493747  .00016717  00000-0  10270-3 0  9010
    2 25544  51.6354 220.2641 0003197 130.5210 229.6221 15.54104949 20991
    ISS
    1 25544U 98067A   18056.50945749  .00016717  00000-0  10270-3 0  9022
    2 25544  51.6372 215.4558 0003149 134.4837 225.6573 15.54146916 21143
    ISS
    1 25544U 98067A   18057.34537198  .00016717  00000-0  10270-3 0  9031
    2 25544  51.6399 211.2932 0002593 130.2258 229.9121 15.54133048 21277
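
The marker-based approach can also be tried offline against a hard-coded string, which makes it easier to check that whitespace survives. The following is a minimal sketch, not the answer's exact code: `first_tle` and `SAMPLE` are hypothetical names, the leading indentation and header line in `SAMPLE` are assumed for illustration, and only the TLE values are taken from the results above. It returns just the first element set, as the question asks, and is written for Python 3.

```python
from itertools import islice

MARKER = 'TWO LINE MEAN ELEMENT SET'

# Assumed page layout for illustration; only the TLE values come from the results.
SAMPLE = "\n".join([
    "some header text",
    "    TWO LINE MEAN ELEMENT SET",
    "",
    "    ISS",
    "    1 25544U 98067A   18054.51611082  .00016717  00000-0  10270-3 0  9009",
    "    2 25544  51.6368 225.3935 0003190 125.8429 234.3021 15.54140528 20837",
    "",
])


def first_tle(text, marker=MARKER):
    """Return the first TLE after the marker as a list of three lines, or None."""
    lines = iter(text.split('\n'))
    for line in lines:
        if marker in line:
            next(lines, None)  # skip the line directly after the marker
            tle = list(islice(lines, 3))  # take up to three lines, untouched
            if len(tle) == 3:
                return tle
    return None


if __name__ == "__main__":
    tle = first_tle(SAMPLE)
    if tle is not None:
        print('\n'.join(tle))
```

Using `islice` with a default-valued `next` avoids an uncaught `StopIteration` when the marker sits too close to the end of the text; in that case the function simply returns `None`.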
    

    【Discussion】:

      【Solution 2】:

      There are several ways to achieve the same result. Here is another approach:

      from bs4 import BeautifulSoup
      import requests
      
      URL = "https://spaceflight.nasa.gov/realdata/sightings/SSapplications/Post/JavaSSOP/orbit/ISS/SVPOST.html"
      soup = BeautifulSoup(requests.get(URL).text,"lxml")
      
      for item in soup.select("pre"):
          lines = item.text.splitlines()  # split once instead of on every access
          for i, line in enumerate(lines):
              # Line 1 of an ISS TLE contains the catalog number "25544U".
              # The bounds check guards against a match on the first or last line.
              if "25544U" in line and 0 < i < len(lines) - 1:
                  doc = lines[i - 1].strip()
                  doc1 = line.strip()
                  doc2 = lines[i + 1].strip()
                  print("{}\n{}\n{}\n".format(doc, doc1, doc2))
      

      Partial output:

      ISS
      1 25544U 98067A   18054.51611082  .00016717  00000-0  10270-3 0  9009
      2 25544  51.6368 225.3935 0003190 125.8429 234.3021 15.54140528 20837
      
      ISS
      1 25544U 98067A   18055.54493747  .00016717  00000-0  10270-3 0  9010
      2 25544  51.6354 220.2641 0003197 130.5210 229.6221 15.54104949 20991
      
      ISS
      1 25544U 98067A   18056.50945749  .00016717  00000-0  10270-3 0  9022
      2 25544  51.6372 215.4558 0003149 134.4837 225.6573 15.54146916 21143
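
On the whitespace concern raised in the question: `split()` with no argument splits on any run of whitespace and therefore loses the column layout, whereas `splitlines()` (or `split('\n')`) breaks only at line boundaries and leaves each line's internal spacing alone. The `strip()` calls in the code above remove only leading and trailing whitespace. A small sketch illustrating the difference; the two leading spaces before `ISS` in `block` are an assumption for illustration, and the TLE values are copied from the output above.

```python
line1 = "1 25544U 98067A   18054.51611082  .00016717  00000-0  10270-3 0  9009"
line2 = "2 25544  51.6368 225.3935 0003190 125.8429 234.3021 15.54140528 20837"
block = "  ISS\n" + line1 + "\n" + line2 + "\n"

# split() with no argument: every whitespace run is a separator,
# so the fixed-width columns collapse into individual tokens.
tokens = block.split()
print(tokens[:3])  # ['ISS', '1', '25544U']

# splitlines(): only line breaks are separators, so each line is intact.
lines = block.splitlines()
print(lines[1] == line1)  # True

# strip() trims the ends of a line but keeps the spacing inside it.
print(lines[0].strip())  # ISS
print(lines[1].strip() == line1)  # True
```

If the leading indentation of each line also has to be preserved, omit the `strip()` calls and use the lines returned by `splitlines()` directly.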
      

      【Discussion】:
