【Question title】: Extract keywords from a title, relevant, and final column math
【Posted】: 2016-12-18 22:30:11
【Question】:

I have a DataFrame structured as follows:

 Title;         Total Visits;    Rank
 The dog;       8           ;    4
 The cat;       9           ;    4
 The dog cat;   10          ;    3

A second DataFrame contains:

Keyword;     Rank
snail ;      5
dog   ;      1
cat   ;      2
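
To make the examples in this thread reproducible, the two frames could be built as in the following minimal sketch. The names `df1` and `df2` are the ones the answer below uses; the values are copied from the two listings above.

```python
import pandas as pd

# Sample data copied from the two listings in the question.
df1 = pd.DataFrame({
    'Title': ['The dog', 'The cat', 'The dog cat'],
    'Total Visits': [8, 9, 10],
    'Rank': [4, 4, 3],
})
df2 = pd.DataFrame({
    'Keyword': ['snail', 'dog', 'cat'],
    'Rank': [5, 1, 2],
})

print(df1)
print(df2)
```

If the real data is read from a `;`-delimited file padded like the listings above, the padding would likely end up inside the string values (for example `'dog   '`); calling `.str.strip()` on the text columns is one way to remove it before matching.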

What I want to end up with is:

 Title;         Total Visits;    Rank  ; Keywords    ; Score
 The dog;       8           ;    4     ; dog         ; 1
 The cat;       9           ;    4     ; cat         ; 2
 The dog cat;   10          ;    3     ; dog,cat     ; 1.5

I have used the following reference, but for some rows

df['Tweet'].map(lambda x: tuple(re.findall(r'({})'.format('|'.join(w.values)), x)))

returns empty values. Any help would be greatly appreciated.

【Comments】:

    Tags: string python-2.7 pandas dataframe text-extraction


    【Solution 1】:

    You can use:

    #create list of all words
    wants = df2.Keyword.tolist()
    #dict for mapping each keyword to its rank
    d = df2.set_index('Keyword')['Rank'].to_dict()
    #split all values by whitespaces, create series
    s = df1.Title.str.split(expand=True).stack()
    #filter by list wants
    s = s[s.isin(wants)]
    print (s)
    0  1    dog
    1  1    cat
    2  1    dog
       2    cat
    dtype: object
    
    #create new columns
    df1['Keywords'] = s.groupby(level=0).apply(','.join)
    df1['Score'] = s.map(d).groupby(level=0).mean()
    
    print (df1)
             Title  Total Visits  Rank Keywords  Score
    0      The dog             8     4      dog    1.0
    1      The cat             9     4      cat    2.0
    2  The dog cat            10     3  dog,cat    1.5
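
One point worth noting about this approach: a title that contains none of the keywords has no group in `s`, so index alignment leaves `NaN` in both new columns for that row. The sketch below reproduces this with a made-up title (`'The bird'`, which is not in the question's data) and shows one possible way to make the fill explicit with `reindex`; treating a missing keyword list as an empty string is an assumption about the desired output.

```python
import pandas as pd

# 'The bird' is a made-up title that matches no keyword.
df1 = pd.DataFrame({'Title': ['The dog', 'The bird', 'The dog cat']})
df2 = pd.DataFrame({'Keyword': ['snail', 'dog', 'cat'], 'Rank': [5, 1, 2]})

wants = df2.Keyword.tolist()
d = df2.set_index('Keyword')['Rank'].to_dict()

# Same steps as in the answer: split, stack, keep only known keywords.
s = df1.Title.str.split(expand=True).stack()
s = s[s.isin(wants)]

keywords = s.groupby(level=0).apply(','.join)
score = s.map(d).groupby(level=0).mean()

# Rows without any match are absent from both results; reindex restores them.
df1['Keywords'] = keywords.reindex(df1.index).fillna('')
df1['Score'] = score.reindex(df1.index)  # stays NaN where nothing matched
print(df1)
```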
    

    Another solution, using list operations:

    wants = df2.Keyword.tolist()
    d = df2.set_index('Keyword')['Rank'].to_dict()
    #create list from each value
    df1['Keywords'] = df1.Title.str.split()
    #remove unnecessary words
    df1['Keywords'] = df1.Keywords.apply(lambda x: [item for item in x if item in wants])
    #map each word to its rank
    df1['Score'] = df1.Keywords.apply(lambda x: [d[item] for item in x])
    
    #create new columns
    df1['Keywords'] = df1.Keywords.apply(','.join)
    #mean
    df1['Score'] = df1.Score.apply(lambda l: sum(l) / float(len(l)))
    
    print (df1)
             Title  Total Visits  Rank Keywords  Score
    0      The dog             8     4      dog    1.0
    1      The cat             9     4      cat    2.0
    2  The dog cat            10     3  dog,cat    1.5
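
The last step divides by `len(l)`, so a title with no matching keyword produces an empty list and raises `ZeroDivisionError`. A minimal sketch of one possible guard follows; `mean_or_nan` is a hypothetical helper name, `'The bird'` is a made-up title, and returning `NaN` for the empty case is an assumption about what the score should be.

```python
import pandas as pd

# 'The bird' is a made-up title that matches no keyword.
df1 = pd.DataFrame({'Title': ['The dog', 'The bird', 'The dog cat']})
df2 = pd.DataFrame({'Keyword': ['snail', 'dog', 'cat'], 'Rank': [5, 1, 2]})

wants = set(df2.Keyword)  # a set gives fast membership tests
d = df2.set_index('Keyword')['Rank'].to_dict()

def mean_or_nan(values):
    """Return the arithmetic mean, or NaN for an empty list (hypothetical helper)."""
    return sum(values) / float(len(values)) if values else float('nan')

# Keep only the words of each title that are known keywords.
matches = df1.Title.str.split().apply(lambda words: [w for w in words if w in wants])
df1['Keywords'] = matches.apply(','.join)
df1['Score'] = matches.apply(lambda ws: mean_or_nan([d[w] for w in ws]))
print(df1)
```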
    

    Timings

    In [96]: %timeit (a(df11, df22))
    100 loops, best of 3: 3.71 ms per loop
    
    In [97]: %timeit (b(df1, df2))
    100 loops, best of 3: 2.55 ms per loop
    

    Code used for the timings:

    df11 = df1.copy()    
    df22 = df2.copy() 
    
    def a(df1, df2):
        wants = df2.Keyword.tolist()
        d = df2.set_index('Keyword')['Rank'].to_dict()
        s = df1.Title.str.split(expand=True).stack()
        s = s[s.isin(wants)]
        df1['Keywords'] = s.groupby(level=0).apply(','.join)
        df1['Score'] = s.map(d).groupby(level=0).mean()
        return (df1)
    
    def b(df1,df2):   
        wants = df2.Keyword.tolist()
        d = df2.set_index('Keyword')['Rank'].to_dict()
        df1['Keywords'] = df1.Title.str.split()
        df1['Keywords'] = df1.Keywords.apply(lambda x: [item for item in x if item in wants])
        df1['Score'] = df1.Keywords.apply(lambda x: [d[item] for item in x])
        df1['Keywords'] = df1.Keywords.apply(','.join)
        df1['Score'] = df1.Score.apply(lambda l: sum(l) / float(len(l)))
        return (df1)
    
    print (a(df11, df22))    
    print (b(df1, df2))
    

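    Edit based on the comments:

    If a keyword consists of more than one word, you can use a list comprehension that tests each keyword against the whole title: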

    print (df1)
             Title  Total Visits  Rank
    0      The dog             8     4
    1      The cat             9     4
    2  The dog cat            10     3
    
    print (df2)
       Keyword  Rank
    0    snail     5
    1      dog     1
    2      cat     2
    3  The dog     8
    4  the Dog     1
    5  The Dog     3
    
    wants = df2.Keyword.tolist()
    print (wants)
    ['snail', 'dog', 'cat', 'The dog', 'the Dog', 'The Dog']
    
    d = df2.set_index('Keyword')['Rank'].to_dict()
    df1['Keywords'] = df1.Title.apply(lambda x: [item for item in wants if item in x])
    df1['Score'] = df1.Keywords.apply(lambda x: [d[item] for item in x])
    df1['Keywords'] = df1.Keywords.apply(','.join)
    df1['Score'] = df1.Score.apply(lambda l: sum(l) / float(len(l)))
    print (df1)
             Title  Total Visits  Rank         Keywords     Score
    0      The dog             8     4      dog,The dog  4.500000
    1      The cat             9     4              cat  2.000000
    2  The dog cat            10     3  dog,cat,The dog  3.666667
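
Because `item in x` is a plain substring test, it would also match `cat` inside a word such as `category`. If whole-word matching is wanted, one possible variation is to test each keyword with a regular expression anchored at word boundaries. The sketch below is an assumption about the desired behaviour, not part of the original answer; `find_keywords` is a hypothetical helper, and the title `'The dog category'` and the rank for `'Star Wars'` are made-up sample data.

```python
import re
import pandas as pd

# Made-up sample data; 'Star Wars: Rogue One' is the title mentioned in the comments.
df1 = pd.DataFrame({'Title': ['Star Wars: Rogue One', 'The dog category', 'The dog cat']})
df2 = pd.DataFrame({'Keyword': ['Star Wars', 'dog', 'cat', 'The dog'],
                    'Rank': [7, 1, 2, 8]})

d = df2.set_index('Keyword')['Rank'].to_dict()
# One compiled pattern per keyword; \b anchors the match at word boundaries,
# and re.escape keeps any special characters in a keyword literal.
patterns = {kw: re.compile(r'\b' + re.escape(kw) + r'\b') for kw in d}

def find_keywords(title):
    """Return the keywords that occur in title as whole words (hypothetical helper)."""
    return [kw for kw, pat in patterns.items() if pat.search(title)]

matches = df1.Title.apply(find_keywords)
df1['Keywords'] = matches.apply(','.join)
# Mean rank of the matched keywords; NaN when nothing matched.
df1['Score'] = matches.apply(
    lambda kws: sum(d[k] for k in kws) / float(len(kws)) if kws else float('nan'))
print(df1)
```

Matching stays case-sensitive here, as in the answer's version; compiling the patterns with `re.IGNORECASE` would be one way to relax that.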
    

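    【Discussion】:

    • Thanks for your reply. With the first option, Keywords and Score come out as NaN except for one row, which shows a single keyword (although it should have two), and the string-manipulation option ends with ZeroDivisionError: float division by zero.
    • Where I run into trouble is this: if the string contains, for example, "Star Wars: Rogue One" and the keyword is "Star Wars", the string is stored as ["Star", "Wars:", "Rogue", "One"] and nothing matches.
    • If keywords consist of two or more words, the solution is more complicated. The main problem is splitting the Title column in df1 when one-word keywords are mixed with keywords of two or more words: splitting on whitespace only isolates the one-word keywords. Is it possible to work around this?
    • Unfortunately, some of the keywords are compound words, and I haven't found a way to handle titles that contain compound words. If there is a way to replicate the following with pandas: checkResult=[] mList=["dog","cat","apple","The dog", "the Dog", "The Dog"] mString = "The dog is runnng after the cat" for item in mList: if item in mString: checkResult.append(item) then I think that would solve the problem.
    • Thanks. But I'm away visiting all weekend, so please either post a new question or wait until Monday, sorry.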