Update:
To extract the domain and its related parts, try tldextract, which is designed for this job.
Example:
import pandas as pd
import tldextract  # pip install tldextract  |  conda install -c conda-forge tldextract

df = pd.DataFrame({'Website.': {0: '18egh.com',
                                1: 'fish.co.uk',
                                2: 'www.description.com',
                                3: 'http://world.com',
                                4: 'http://forums.news.cnn.com/'},
                   'Label': {0: 1, 1: 0, 2: 1, 3: 1, 4: 0}})

def split_url(url):
    # Read the fields by name so this does not depend on ExtractResult being unpackable.
    ext = tldextract.extract(url)
    return pd.Series([ext.subdomain, ext.domain, ext.suffix])

df[['subdomain', 'domain', 'suffix']] = df['Website.'].apply(split_url)
print(df)

                      Website.  Label    subdomain       domain  suffix
0                    18egh.com      1                     18egh     com
1                   fish.co.uk      0                      fish   co.uk
2          www.description.com      1          www  description     com
3             http://world.com      1                     world     com
4  http://forums.news.cnn.com/      0  forums.news          cnn     com
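Original answer below:
Try:
import pandas as pd

df = pd.DataFrame({'Website.': {0: '18egh.com',
                                1: 'fish.co.uk',
                                2: 'www.description.com',
                                3: 'http://world.com'},
                   'Label': {0: 1, 1: 0, 2: 1, 3: 1}})

# Optional scheme, then optional "www.", then capture everything up to the first dot.
# Keeping the two prefixes separate also handles "http://www.example.com".
pattern = r'^(?:https?://)?(?:www\.)?(.*?)\.'
df['Domain'] = df['Website.'].str.extract(pattern, expand=False)
df['Domain_Len'] = df['Domain'].str.len()
print(df)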

              Website.  Label       Domain  Domain_Len
0            18egh.com      1        18egh           5
1           fish.co.uk      0         fish           4
2  www.description.com      1  description          11
3     http://world.com      1        world           5
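
To see what a pattern of this kind captures without building a DataFrame, here is a minimal sketch using only the standard `re` module. The helper name `first_label` and the extra sample URLs are made up for illustration; the pattern in the sketch treats the scheme and `www.` as two separate optional prefixes.

```python
import re

# Scheme and "www." are separate optional prefixes; the capture stops at the first dot.
PATTERN = re.compile(r'^(?:https?://)?(?:www\.)?(.*?)\.')

def first_label(url):
    """Return the text captured before the first dot, or None if nothing matches."""
    match = PATTERN.search(url)
    return match.group(1) if match else None

# Hypothetical sample inputs, including two that are not in the DataFrame above.
samples = ['18egh.com', 'fish.co.uk', 'www.description.com',
           'http://world.com', 'http://www.world.com',
           'http://forums.news.cnn.com/']
for url in samples:
    print(url, '->', first_label(url))
```

Note that 'http://forums.news.cnn.com/' yields 'forums' rather than 'cnn': a regex has no way of knowing which label is the registered domain, which is the case tldextract is meant to handle.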
Or:
# Same prefixes as before; the second group captures everything after the first dot.
pattern = r'^(?:https?://)?(?:www\.)?(.*?)\.(.*?)$'
df[['Domain', 'TLD']] = df['Website.'].str.extract(pattern, expand=True)
df['TLD_Len'] = df['TLD'].str.len()
df['Domain_Len'] = df['Domain'].str.len()
print(df)

              Website.  Label       Domain  Domain_Len    TLD  TLD_Len
0            18egh.com      1        18egh           5    com        3
1           fish.co.uk      0         fish           4  co.uk        5
2  www.description.com      1  description          11    com        3
3     http://world.com      1        world           5    com        3
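
The regex above keeps anything after the host (a trailing slash, a path) inside the second group. One possible workaround, sketched here as a different technique from the answer's regex, is to isolate the hostname with the standard library's `urllib.parse.urlsplit` first and only then split at the first dot. The function name `naive_split` is hypothetical, and the split is still naive: it does not know about public suffixes.

```python
from urllib.parse import urlsplit

def naive_split(url):
    """Return (name, rest), splitting the hostname at its first dot."""
    # urlsplit only recognises a host after '//', so add it when the scheme is missing.
    parts = urlsplit(url if '://' in url else '//' + url)
    host = parts.hostname or ''
    # Drop a leading "www." so it is not mistaken for the name.
    if host.startswith('www.'):
        host = host[len('www.'):]
    name, _, rest = host.partition('.')
    return name, rest

for url in ['18egh.com', 'fish.co.uk', 'www.description.com',
            'http://world.com', 'http://forums.news.cnn.com/']:
    print(url, '->', naive_split(url))
```

With pandas, the same function could be applied through `df['Website.'].apply(naive_split)`. For 'http://forums.news.cnn.com/' it returns ('forums', 'news.cnn.com'), so it removes the trailing slash but still cannot separate subdomains from the domain.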