【Question title】: Extract a URL from a row with multiple URLs in it
【Posted】: 2019-11-03 22:49:32
【Question】:

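I am trying to extract a URL from rows that each list multiple URLs.

Specifically, I want to select the first instance of twitter.com/dog_rates/xxxxxxx in each row and drop the rest of the data.

Examples of the text to extract from

Input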

1. twitter.com/dog_rates/status/892420643555336193/photo/1 (desired version)

2. www.gofundme.com/3yd6y1c,twitter.com/dog_rates/status/878281511006478336/photo/1

3. m.facebook.com/story.php?story_fbid=1888712391349242&id=1506300642923754&refsrc=ht.co%2FURVffYPPjY&_rdr,twitter.com/dog_rates/status/812503143955202048/photo/1,twitter.com/dog_rates/status/812503143955202048/photo/1

4. www.gofundme.com/sams-smile,twitter.com/dog_rates/status/810984652412424192/photo/1,twitter.com/dog_rates/status/709901256215666688/photo/1,twitter.com/dog_rates/status/709901256215666688/photo/1,twitter.com/dog_rates/status/709901256215666688/photo/1,twitter.com/dog_rates/status/709901256215666688/photo/1

5. twitter.com/dog_rates/status/888804989199671297/photo/1,twitter.com/dog_rates/status/888804989199671297/photo/1

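I tried extracting the URL with slicing, but ran into problems because the URLs vary in length and the delimiters sit at different positions in each row.

Expected result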

  1. twitter.com/dog_rates/status/892420643555336193/photo/1

  2. twitter.com/dog_rates/status/878281511006478336/photo/1

  3. twitter.com/dog_rates/status/812503143955202048/photo/1

  4. twitter.com/dog_rates/status/810984652412424192/photo/1

  5. twitter.com/dog_rates/status/888804989199671297/photo/1

【Question comments】:

  • How do you decide where the desired URL ends? At a comma?

Tags: python string pandas extract


【Solution 1】:

Try this, using the re module:

import re

# Sample rows copied from the question, one numbered row per line.
# Named "text" rather than "input" to avoid shadowing the built-in input().
text = '''1. twitter.com/dog_rates/status/892420643555336193/photo/1 (desired version)

2. www.gofundme.com/3yd6y1c,twitter.com/dog_rates/status/878281511006478336/photo/1

3. m.facebook.com/story.php?story_fbid=1888712391349242&id=1506300642923754&refsrc=ht.co%2FURVffYPPjY&_rdr,twitter.com/dog_rates/status/812503143955202048/photo/1,twitter.com/dog_rates/status/812503143955202048/photo/1

4. www.gofundme.com/sams-smile,twitter.com/dog_rates/status/810984652412424192/photo/1,twitter.com/dog_rates/status/709901256215666688/photo/1,twitter.com/dog_rates/status/709901256215666688/photo/1,twitter.com/dog_rates/status/709901256215666688/photo/1,twitter.com/dog_rates/status/709901256215666688/photo/1

5. twitter.com/dog_rates/status/888804989199671297/photo/1,twitter.com/dog_rates/status/888804989199671297/photo/1'''

# Raw string with escaped dots so "." matches a literal dot.
# The trailing ".*" consumes the rest of the line (it does not cross newlines),
# so findall() returns only the first matching URL of each row.
regex = r'(twitter\.com/dog_rates/status/\d+/photo/1).*'

twitter_list = re.findall(regex, text)
for i, item in enumerate(twitter_list, start=1):
    print(f'{i}. {item}')
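If the rows are already separate strings (for example, values from a DataFrame column), the same regex idea can be applied one row at a time with re.search, which stops at the first match. The following is only a sketch: the helper name first_dog_rates_url, the constant DOG_RATES_URL, and the third sample row (which has no match) are made up for illustration, and the pattern assumes a URL ends at the next comma or whitespace.

```python
import re

# Assumption: a URL runs until the next comma or whitespace character.
DOG_RATES_URL = re.compile(r'twitter\.com/dog_rates/[^,\s]+')


def first_dog_rates_url(row):
    """Return the first twitter.com/dog_rates URL in row, or None if there is none."""
    match = DOG_RATES_URL.search(row)
    return match.group(0) if match else None


rows = [
    'twitter.com/dog_rates/status/892420643555336193/photo/1',
    'www.gofundme.com/3yd6y1c,twitter.com/dog_rates/status/878281511006478336/photo/1',
    'www.gofundme.com/sams-smile',  # illustrative row without a dog_rates URL
]

results = [first_dog_rates_url(row) for row in rows]
print(results)
```

Returning None for rows without a match keeps the result aligned with the input rows, so a missing URL is visible instead of silently dropped.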


    【Solution 2】:

    You can do this easily: just load each row as a string.

    data= [ "twitter.com/dog_rates/status/892420643555336193/photo/1",
    "www.gofundme.com/3yd6y1c,twitter.com/dog_rates/status/878281511006478336/photo/1",
    "m.facebook.com/story.php?story_fbid=1888712391349242&id=1506300642923754&refsrc=ht.co%2FURVffYPPjY&_rdr,twitter.com/dog_rates/status/812503143955202048/photo/1,twitter.com/dog_rates/status/812503143955202048/photo/1",
    "www.gofundme.com/sams-smile, twitter.com/dog_rates/status/810984652412424192/photo/1, twitter.com/dog_rates/status/709901256215666688/photo/1, twitter.com/dog_rates/status/709901256215666688/photo/1, twitter.com/dog_rates/status/709901256215666688/photo/1, twitter.com/dog_rates/status/709901256215666688/photo/1",
    "twitter.com/dog_rates/status/888804989199671297/photo/1, twitter.com/dog_rates/status/888804989199671297/photo/1"
    ]
    

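    Now split each row on "," to get its individual URLs, and keep the first one that starts with the Twitter prefix.

    results = []
    for row in data:
        urls = row.split(",")
        for url in urls:
            # strip() removes the space that follows some of the commas
            if url.strip().startswith("twitter.com/dog_rates/"):
                results.append(url.strip())
                break  # only the first match per row is wanted
    

    The extracted URLs end up in the results variable.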
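The same split-and-scan idea can also be wrapped in a small function that reports rows with no matching URL instead of skipping them. This is a sketch under assumptions: the name first_with_prefix and its parameters are hypothetical, and the last sample row is invented to show the no-match case.

```python
def first_with_prefix(row, prefix='twitter.com/dog_rates/', sep=','):
    """Return the first sep-separated field of row that starts with prefix, else None."""
    # Strip each field so "a, b" and "a,b" are handled the same way.
    fields = (field.strip() for field in row.split(sep))
    # next() stops at the first field that qualifies; None is the fallback.
    return next((field for field in fields if field.startswith(prefix)), None)


rows = [
    'www.gofundme.com/3yd6y1c,twitter.com/dog_rates/status/878281511006478336/photo/1',
    'twitter.com/dog_rates/status/888804989199671297/photo/1, twitter.com/dog_rates/status/888804989199671297/photo/1',
    'www.gofundme.com/3yd6y1c',  # illustrative row without a dog_rates URL
]

results = [first_with_prefix(row) for row in rows]
print(results)
```

Because the result list has one entry per input row, it can be assigned back to a column or zipped with the original rows without misalignment.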


      【Solution 3】:

      Try this:

      import pandas as pd
      
      data = [
          'twitter.com/dog_rates/status/892420643555336193/photo/1',         
          'www.gofundme.com/3yd6y1c,twitter.com/dog_rates/status/878281511006478336/photo/1',
          'm.facebook.com/story.php?story_fbid=1888712391349242&id=1506300642923754&refsrc=ht.co%2FURVffYPPjY&_rdr,twitter.com/dog_rates/status/812503143955202048/photo/1,twitter.com/dog_rates/status/812503143955202048/photo/1',
          'www.gofundme.com/sams-smile,twitter.com/dog_rates/status/810984652412424192/photo/1,twitter.com/dog_rates/status/709901256215666688/photo/1,twitter.com/dog_rates/status/709901256215666688/photo/1,twitter.com/dog_rates/status/709901256215666688/photo/1,twitter.com/dog_rates/status/709901256215666688/photo/1',
          'twitter.com/dog_rates/status/888804989199671297/photo/1,twitter.com/dog_rates/status/888804989199671297/photo/1'
      ]
      
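      df = pd.DataFrame({'url': data})
      # Capture the first twitter.com/dog_rates URL in each row (up to the next comma or whitespace).
      df['res'] = df['url'].str.extract(r'(twitter\.com/dog_rates/[^,\s]+)', expand=False)
      

      This extracts the first twitter.com/dog_rates URL from each row. Taking the last comma-separated value instead, with df['url'].str.split(',').str[-1], gives the wrong result for the fourth row, where the last URL (status 709901256215666688) differs from the first Twitter URL (status 810984652412424192).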


        【Solution 4】:

        Try this:

        my_data = [
            'twitter.com/dog_rates/status/892420643555336193/photo/1',         
            'www.gofundme.com/3yd6y1c,twitter.com/dog_rates/status/878281511006478336/photo/1',
            'm.facebook.com/story.php?story_fbid=1888712391349242&id=1506300642923754&refsrc=ht.co%2FURVffYPPjY&_rdr,twitter.com/dog_rates/status/812503143955202048/photo/1,twitter.com/dog_rates/status/812503143955202048/photo/1',
            'www.gofundme.com/sams-smile,twitter.com/dog_rates/status/810984652412424192/photo/1,twitter.com/dog_rates/status/709901256215666688/photo/1,twitter.com/dog_rates/status/709901256215666688/photo/1,twitter.com/dog_rates/status/709901256215666688/photo/1,twitter.com/dog_rates/status/709901256215666688/photo/1',
            'twitter.com/dog_rates/status/888804989199671297/photo/1,twitter.com/dog_rates/status/888804989199671297/photo/1'
        ]
        
        
        
        final_results = []
        pattern = 'twitter.com/dog_rates/'

        for row in my_data:
            split_row = row.split(',')
            for record in split_row:
                if record.startswith(pattern):
                    final_results.append(record)
                    break  # keep only the first match in each row

        final_results then contains:

        ['twitter.com/dog_rates/status/892420643555336193/photo/1',
         'twitter.com/dog_rates/status/878281511006478336/photo/1',
         'twitter.com/dog_rates/status/812503143955202048/photo/1',
         'twitter.com/dog_rates/status/810984652412424192/photo/1',
         'twitter.com/dog_rates/status/888804989199671297/photo/1']
        
