【Question title】: Pandas: improve algorithm with find substring in column
【Posted】: 2017-02-12 08:44:30
【Question】:

I have a dataframe, and I am trying to keep only the rows where a certain column contains one of several substrings.

I use:

df_res = pd.DataFrame()
for i in substr:
    res = df[df['event_address'].str.contains(i)]
    # accumulate the matches; without this step df_res stays empty
    df_res = pd.concat([df_res, res])

df looks like:

member_id,event_address,event_time,event_duration
g1497o1ofm5a1963,fotki.yandex.ru/users/atanusha/albums,2015-05-01 00:00:05,8
g1497o1ofm5a1963,9829192.ru/apple-iphone.html,2015-05-01 00:00:15,2
g1497o1ofm5a1963,fotki.yandex.ru/users/atanusha/album/165150?&p=3,2015-05-01 00:00:17,2
g1497o1ofm5a1963,fotki.yandex.ru/tags/%D0%B1%D0%BE%D1%81%D0%B8%D0%BA%D0%BE%D0%BC?text=%D0%B1%D0%BE%D1%81%D0%B8%D0%BA%D0%BE%D0%BC&search_author=utpaladev&&p=2,2015-05-01 00:01:31,10
g1497o1ofm5a1963,3gmaster.net,2015-05-01 00:01:41,6
g1497o1ofm5a1963,fotki.yandex.ru/search.xml?text=%D0%B1%D0%BE%D1%81%D0%B8%D0%BA%D0%BE%D0%BC&&p=2,2015-05-01 00:02:01,6
g1497o1ofm5a1963,fotki.yandex.ru/search.xml?text=%D0%B1%D0%BE%D1%81%D0%B8%D0%BA%D0%BE%D0%BC&search_author=Sunny-Fanny&,2015-05-01 00:02:31,2
g1497o1ofm5a1963,fotki.9829192.ru/apple-iphone.html,2015-05-01 00:03:25,6

substr is:

123.ru/gadgets/communicators
320-8080.ru/mobilephones
3gmaster.net
3-q.ru/products/smartfony/s
9829192.ru/apple-iphone.html
9829192.ru/index.php?cat=1
acer.com/ac/ru/ru/content/group/smartphones
aj.ru

This code gives me the desired result, but it takes too long. I also tried passing a column directly (substr is built with substr = urls.url.values.tolist()). I tried

res = df[df['event_address'].str.contains(urls.url)]

but it returns:

TypeError: 'Series' objects are mutable, thus they cannot be hashed

Is there any way to make this faster, or am I doing something wrong?

【Comments】:

  • What type is substr? Is it a list of strings?

Tags: python string pandas indexing substring


【Solution 1】:

Do this:

def check_exists(x):
    for i in substr:
        if i in x:
            return True
    return False

# .ix has been removed from pandas; .loc accepts the same boolean mask
df2 = df.loc[df.event_address.map(check_exists)]

Or, if you prefer a one-liner:

df.loc[df.event_address.map(lambda x: any(i in x for i in substr))]
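For reference, here is a minimal self-contained sketch of the same idea, using plain boolean indexing. The small df and substr below are hypothetical stand-ins built from a few of the values in the question, not the asker's real data.

```python
import pandas as pd

# Hypothetical miniature of the question's data (assumption, for illustration only)
df = pd.DataFrame({
    "member_id": ["g1497o1ofm5a1963"] * 4,
    "event_address": [
        "fotki.yandex.ru/users/atanusha/albums",
        "9829192.ru/apple-iphone.html",
        "3gmaster.net",
        "fotki.9829192.ru/apple-iphone.html",
    ],
})
substr = ["3gmaster.net", "9829192.ru/apple-iphone.html", "aj.ru"]


def check_exists(address):
    """Return True if any entry of substr occurs literally in address."""
    # A generator lets any() stop at the first hit instead of building a list.
    return any(s in address for s in substr)


# Boolean mask with one entry per row, then ordinary boolean indexing
mask = df["event_address"].map(check_exists)
res = df[mask]
print(res)
```

Because this uses Python's `in` operator, the entries of substr are matched literally, so characters such as `.` or `?` in a URL have no special meaning here.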

【Comments】:

    【Solution 2】:

    If you need a faster solution, I think you need to join the substrings with | and pass the result to str.contains:

    res = df[df['event_address'].str.contains('|'.join(urls.url))]
    print (res)
              member_id                       event_address           event_time  \
    1  g1497o1ofm5a1963        9829192.ru/apple-iphone.html  2015-05-01 00:00:15   
    4  g1497o1ofm5a1963                        3gmaster.net  2015-05-01 00:01:41   
    7  g1497o1ofm5a1963  fotki.9829192.ru/apple-iphone.html  2015-05-01 00:03:25   
    
       event_duration  
    1               2  
    4               6  
    7               6  
    

    Another solution, using a list comprehension:

    res = df[df['event_address'].apply(lambda x: any([n in x for n in urls.url.tolist()]))]
    print (res)
              member_id                       event_address           event_time  \
    1  g1497o1ofm5a1963        9829192.ru/apple-iphone.html  2015-05-01 00:00:15   
    4  g1497o1ofm5a1963                        3gmaster.net  2015-05-01 00:01:41   
    7  g1497o1ofm5a1963  fotki.9829192.ru/apple-iphone.html  2015-05-01 00:03:25   
    
       event_duration  
    1               2  
    4               6  
    7               6  
    

    Timings:

    #[8000 rows x 4 columns]
    df = pd.concat([df]*1000).reset_index(drop=True)
    
    In [68]: %timeit (df[df['event_address'].str.contains('|'.join(urls.url))])
    100 loops, best of 3: 12 ms per loop
    
    In [69]: %timeit (df.ix[df.event_address.map(check_exists)])
    10 loops, best of 3: 155 ms per loop
    
    In [70]: %timeit (df.ix[df.event_address.map(lambda x: any([True for i in urls.url.tolist() if i in x]))])
    10 loops, best of 3: 163 ms per loop
    
    In [71]: %timeit (df[df['event_address'].apply(lambda x: any([n in x for n in urls.url.tolist()] ))])
    10 loops, best of 3: 174 ms per loop
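    Outside IPython, a comparison like this can be roughly reproduced with the standard timeit module. The sketch below is only an approximation under stated assumptions: the four-row base frame and the urls frame are hypothetical stand-ins, the enlarged frame has 4000 rows rather than 8000, and the absolute numbers will differ by machine and pandas version.

```python
import timeit

import pandas as pd

# Hypothetical stand-in data (assumption): 4 base rows repeated 1000 times
base = pd.DataFrame({"event_address": [
    "fotki.yandex.ru/users/atanusha/albums",
    "9829192.ru/apple-iphone.html",
    "3gmaster.net",
    "fotki.9829192.ru/apple-iphone.html",
]})
df = pd.concat([base] * 1000).reset_index(drop=True)
urls = pd.DataFrame({"url": ["3gmaster.net", "9829192.ru/apple-iphone.html", "aj.ru"]})


def with_regex():
    # One vectorised regex search over the whole column
    return df[df["event_address"].str.contains("|".join(urls.url))]


def with_map():
    # Python-level loop over the substrings for every row
    substr = urls.url.tolist()
    return df[df["event_address"].map(lambda x: any(s in x for s in substr))]


# Both approaches should select the same rows on this sample
same_rows = with_regex().equals(with_map())

for fn in (with_regex, with_map):
    # Best of 3 repeats, 10 calls each, reported per call in milliseconds
    best = min(timeit.repeat(fn, number=10, repeat=3)) / 10
    print(f"{fn.__name__}: {best * 1000:.2f} ms per call")
```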
    

    【Comments】:

    • I tried df['event_address'].str.contains('|'.join(urls.url)), since I need to add regex=True, but it returns sre_constants.error: multiple repeat
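    A "multiple repeat" error is raised by Python's re module when a pattern stacks quantifiers, for example `?*`. One plausible cause here (an assumption, since the full URL list is not shown) is that some URL contains regex metacharacters such as `?`, `*` or `+`. A hedged sketch of a workaround is to escape each URL with re.escape before joining, so every URL is matched literally. The sample frames below are hypothetical, and example.com/search?*=1 is an invented URL chosen only to trigger the error.

```python
import re

import pandas as pd

# Hypothetical sample data (assumption, not the asker's real frames)
df = pd.DataFrame({"event_address": [
    "9829192.ru/index.php?cat=1",
    "example.com/search?*=1",
    "3gmaster.net",
    "fotki.yandex.ru/users/atanusha/albums",
]})
urls = pd.DataFrame({"url": [
    "9829192.ru/index.php?cat=1",
    "example.com/search?*=1",
]})

# The unescaped pattern contains "h?*", which re rejects as a stacked quantifier
try:
    re.compile("|".join(urls.url))
    raw_pattern_ok = True
except re.error:
    raw_pattern_ok = False

# Escaping turns every URL into a literal before building the alternation
pattern = "|".join(re.escape(u) for u in urls.url)
res = df[df["event_address"].str.contains(pattern, regex=True)]
print(res)
```

    Escaping also stops `.` from matching an arbitrary character, so the regex version selects the same rows as a literal substring test.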