【Question title】:Urlparse applied to a column for extracting length and TLD info
【Posted】:2021-08-15 15:36:27
【Question】:

I am trying to extract the length and the suffix (TLD) from a list of websites stored in a pandas DataFrame.

Website.             Label
18egh.com            1
fish.co.uk           0
www.description.com  1
http://world.com     1

The output I want is:

Website              Label  Length  Tld
18egh.com            1      5       com
fish.co.uk           0      4       co.uk
www.description.com  1      11      com
http://world.com     1      5       com

I first tried to get the length as follows:

from urllib.parse import urlparse

def get_domain(df):
    my_list = []
    for x in df['Website'].tolist():
        domain = urlparse(x).netloc
        my_list.append(domain)
        df['Domain'] = my_list
        df['Length'] = df['Domain'].str.len()
    return df

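However, when I check the list, it is empty. I understand that urlparse may be enough to extract the domain and TLD information, but if I am wrong, I would appreciate it if you could point me in the right direction.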

【Comments】:

    Tags: python pandas urlparse


    【Solution 1】:

    Update:

    To extract the domain and related parts, try tldextract, which is built for this job.

    Example:

    import pandas as pd
    import tldextract # pip install tldextract | # conda install -c conda-forge tldextract
    
    df = pd.DataFrame({'Website.': {0: '18egh.com',
      1: 'fish.co.uk',
      2: 'www.description.com',
      3: 'http://world.com',
      4: 'http://forums.news.cnn.com/'},
     'Label': {0: 1, 1: 0, 2: 1, 3: 1, 4: 0}})
    
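    # Read the three parts by attribute name so the code does not depend on
    # the extract result being iterable.
    def split_host(url):
        ext = tldextract.extract(url)
        return pd.Series([ext.subdomain, ext.domain, ext.suffix])
    
    df[['subdomain', 'domain', 'suffix']] = df['Website.'].apply(split_host)
    
    print(df)
    
                              Website.  Label    subdomain       domain suffix
        0                    18egh.com      1                     18egh    com
        1                   fish.co.uk      0                      fish  co.uk
        2          www.description.com      1          www  description    com
        3             http://world.com      1                     world    com
        4  http://forums.news.cnn.com/      0  forums.news          cnn    com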
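
A note on why the `urlparse` attempt in the question comes back empty: `urlparse` only fills `netloc` when the string contains `//` (usually as part of a scheme such as `http://`). For a bare value like `18egh.com`, the whole string ends up in `path` instead. The sketch below shows this with the standard library only; `host_of` is a hypothetical helper name, not part of `urllib`, and it assumes that any value without `://` is a bare host.

```python
from urllib.parse import urlparse

def host_of(url: str) -> str:
    """Return the host part of url, tolerating a missing scheme (hypothetical helper)."""
    if '://' not in url:
        # A leading '//' makes urlparse treat what follows as the netloc.
        url = '//' + url
    return urlparse(url).netloc

print(repr(urlparse('18egh.com').netloc))  # '' (the text lands in .path instead)
print(host_of('18egh.com'))                # 18egh.com
print(host_of('http://world.com'))         # world.com
```

On a DataFrame this could be applied with something like `df['Website.'].map(host_of)`. It still returns the full host (`www.description.com`, `fish.co.uk`), so splitting off the suffix needs tldextract or a similar approach.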
    

    Original answer below


    Try:

    import pandas as pd
    
    df = pd.DataFrame({'Website.': {0: '18egh.com',
      1: 'fish.co.uk',
      2: 'www.description.com',
      3: 'http://world.com'},
     'Label': {0: 1, 1: 0, 2: 1, 3: 1}})
    
    pattern = r'(?:https?:\/\/|www\.|https?:\/\/www\.)?(.*?)\.'
    
    df['Domain'] = df['Website.'].str.extract(pattern)
    df['Domain_Len'] = df['Domain'].str.len()
    
    print(df)
    
        Website.             Label  Domain          Domain_Len
    0   18egh.com            1      18egh           5
    1   fish.co.uk           0      fish            4
    2   www.description.com  1      description     11
    3   http://world.com     1      world           5
    

    Or:

    pattern = r'(?:https?:\/\/|www\.|https?:\/\/www\.)?(.*?)\.(.*?)$'
    
    df[['Domain', 'TLD']] = df['Website.'].str.extract(pattern, expand=True)
    df['TLD_Len'] = df['TLD'].str.len()
    df['Domain_Len'] = df['Domain'].str.len()
    
    print(df)
    
        Website.             Label  TLD     TLD_Len     Domain       Domain_Len
    0   18egh.com            1      com     3           18egh        5
    1   fish.co.uk           0      co.uk   5           fish         4
    2   www.description.com  1      com     3           description  11
    3   http://world.com     1      com     3           world        5
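
One caveat about the regex patterns above, shown here with the standard `re` module: because `https?:\/\/` is tried first in the alternation, a URL that has both a scheme and `www.` yields `www` as the domain. The `variant` pattern below is only an illustrative alternative (not from the original answer) that treats the scheme and `www.` as two separate optional prefixes. Neither pattern understands other subdomains or multi-part suffixes, which is the case tldextract handles.

```python
import re

# Second pattern from the answer above, unchanged.
original = re.compile(r'(?:https?:\/\/|www\.|https?:\/\/www\.)?(.*?)\.(.*?)$')
# Hypothetical variant: scheme and "www." are independent optional prefixes.
variant = re.compile(r'(?:https?://)?(?:www\.)?(.*?)\.(.*?)$')

url = 'http://www.description.com'
print(original.search(url).groups())  # ('www', 'description.com')
print(variant.search(url).groups())   # ('description', 'com')

# Deeper subdomains still defeat both patterns.
print(variant.search('http://forums.news.cnn.com/').groups())  # ('forums', 'news.cnn.com/')
```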
    
