【Question title】: Run a function on each element in a dataframe column of lists Pt. 2
【Posted】: 2020-08-28 17:06:25
【Question description】:

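This question follows on from "Run a function on each element in a dataframe column of lists", which answered a question where I had several functions run on each element of a list column in a pandas df, producing scores like this (func_results):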

    col1    col2                          func_results
0   MAX     ['MAX', 'amx', 'akd']         [('MAX',1.0),('amx',0.89),('akd',0.56)]
1   Sam     ['Sam','sammy','samsam']      [('Sam',1.0),('sammy',0.91), ('samsam',0.88)]
2   Larry   ['lar','lair','larrylamo']    [('lar',0.91),('larrylamo',0.91), ('lair',0.83)]

Executable code for the df above (^); you need to run all of the functions below first:

from ast import literal_eval

import pandas as pd

data = {'col1':  ['MAX', 'Sam', 'Larry'],
        'col2': ["['MAX', 'amx', 'akd']", "['Sam','sammy','samsam']", "['lar','lair','larrylamo']"],
#         'func_results': ["[('MAX',1.0),('amx',0.89),('akd',0.56)]", "[('Sam',1.0),('sammy',0.91), ('samsam',0.88)]", "[('lar',0.91),('larrylamo',0.91), ('lair',0.83)]"]
        }

# df1 = pd.DataFrame (data, columns = ['col1','col2','func_results'])
df1 = pd.DataFrame (data, columns = ['col1','col2'])

df1['col2'] = df1.col2.apply(literal_eval)
df1['func_results'] = df1.agg(lambda x: get_top_matches(*x), axis=1)
df1
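
As a quick illustration of the literal_eval step above, here is a minimal sketch showing that a col2 cell starts out as a string and only becomes a real Python list after parsing. The names raw_cell and parsed are illustrative and do not appear in the original code.

```python
from ast import literal_eval

# One col2 cell as stored in the dict above: a string that merely looks like a list.
raw_cell = "['MAX', 'amx', 'akd']"

# literal_eval parses Python literals only; it does not execute arbitrary code.
parsed = literal_eval(raw_cell)

print(type(raw_cell).__name__, "->", type(parsed).__name__)
print(parsed)
```

Without this step, iterating over the cell would walk through the string one character at a time instead of one candidate at a time.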

Now I just need to run the same set of functions when col2 does not contain lists but just a single string per row, like this df:

    col1              col2
0   abc co            AAP akj
1   kdj               fuj ddd
2   bac               ADO asd

Executable code for this df (^):

data = {'col1':  ['abc co', 'kdj', 'bac'],
        'col2': ['AAP akj', 'fuj ddd', 'ADO asd']
        }
df3 = pd.DataFrame (data, columns = ['col1','col2'])
df3

Functions:

import math
import re

# Jaro version
def sort_token_alphabetically(word):
    token = re.split('[,. ]', word)
    sorted_token = sorted(token)
    return ' '.join(sorted_token)

def get_jaro_distance(first, second, winkler=True, winkler_ajustment=True,
                      scaling=0.1, sort_tokens=True):
    """
    :param first: word to calculate distance for
    :param second: word to calculate distance with
    :param winkler: same as winkler_ajustment
    :param winkler_ajustment: add an adjustment factor to the Jaro of the distance
    :param scaling: scaling factor for the Winkler adjustment
    :return: Jaro distance adjusted (or not)
    """
    if sort_tokens:
        first = sort_token_alphabetically(first)
        second = sort_token_alphabetically(second)

    if not first or not second:
        raise JaroDistanceException(
            "Cannot calculate distance from NoneType ({0}, {1})".format(
                first.__class__.__name__,
                second.__class__.__name__))

    jaro = _score(first, second)
    cl = min(len(_get_prefix(first, second)), 4)

    if all([winkler, winkler_ajustment]):  # 0.1 as scaling factor
        return round((jaro + (scaling * cl * (1.0 - jaro))) * 100.0) / 100.0

    return jaro

def _score(first, second):
    shorter, longer = first.lower(), second.lower()

    if len(first) > len(second):
        longer, shorter = shorter, longer

    m1 = _get_matching_characters(shorter, longer)
    m2 = _get_matching_characters(longer, shorter)

    if len(m1) == 0 or len(m2) == 0:
        return 0.0

    return (float(len(m1)) / len(shorter) +
            float(len(m2)) / len(longer) +
            float(len(m1) - _transpositions(m1, m2)) / len(m1)) / 3.0

def _get_diff_index(first, second):
    if first == second:
        return -1  # identical strings; _get_prefix treats -1 as "the whole string is the prefix"

    if not first or not second:
        return 0

    max_len = min(len(first), len(second))
    for i in range(0, max_len):
        if not first[i] == second[i]:
            return i

    return max_len

def _get_prefix(first, second):
    if not first or not second:
        return ""

    index = _get_diff_index(first, second)
    if index == -1:
        return first

    elif index == 0:
        return ""

    else:
        return first[0:index]

def _get_matching_characters(first, second):
    common = []
    limit = math.floor(min(len(first), len(second)) / 2)

    for i, l in enumerate(first):
        left, right = int(max(0, i - limit)), int(
            min(i + limit + 1, len(second)))
        if l in second[left:right]:
            common.append(l)
            second = second[0:second.index(l)] + '*' + second[
                                                       second.index(l) + 1:]

    return ''.join(common)

def _transpositions(first, second):
    return math.floor(
        len([(f, s) for f, s in zip(first, second) if not f == s]) / 2.0)

def get_top_matches(reference, value_list, max_results=None):
    scores = []
    if not max_results:
        max_results = len(value_list)
    for val in value_list:
        score_sorted = get_jaro_distance(reference, val)
        score_unsorted = get_jaro_distance(reference, val, sort_tokens=False)
        scores.append((val, max(score_sorted, score_unsorted)))
    scores.sort(key=lambda x: x[1], reverse=True)

    return scores[:max_results]

class JaroDistanceException(Exception):
    def __init__(self, message):
        super().__init__(message)
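
For reference, the Winkler adjustment applied at the end of get_jaro_distance can be isolated as a small sketch. winkler_adjust is a hypothetical helper name used only here; the formula, the prefix cap of 4, and the rounding to two decimals are copied from the function above.

```python
def winkler_adjust(jaro, prefix_len, scaling=0.1):
    """Boost a plain Jaro score by the length of the shared prefix (capped at 4)."""
    cl = min(prefix_len, 4)
    # Same expression as in get_jaro_distance: adjust, then round to two decimals.
    return round((jaro + (scaling * cl * (1.0 - jaro))) * 100.0) / 100.0


# A Jaro score of 0.8 with a 2-character shared prefix:
# 0.8 + 0.1 * 2 * (1 - 0.8) = 0.84
print(winkler_adjust(0.8, 2))
```

The boost shrinks as the Jaro score approaches 1.0, so a perfect match stays at 1.0 regardless of prefix length.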

I just want this to run when col2 is not a list but a single string per row, and to produce a func_results column in the df.

Any ideas?

【Comments】:

    Tags: python pandas


    【Solution 1】:

    If you need col2 as a list containing one string, wrap each cell of col2 in a list and then call get_top_matches, as follows:

    df3['col2'] = df3.col2.map(lambda x: [x])
    df3['func_results'] = df3.agg(lambda x: get_top_matches(*x), axis=1)
    
    Out[360]:
         col1       col2       func_results
    0  abc co  [AAP akj]  [(AAP akj, 0.54)]
    1     kdj  [fuj ddd]  [(fuj ddd, 0.49)]
    2     bac  [ADO asd]  [(ADO asd, 0.49)]
    
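
If some frames hold lists in col2 and others hold single strings, one option is to normalize each cell before scoring so the same call works for both. The following is a minimal sketch under that assumption; as_candidate_list is a hypothetical helper name and not part of the answer above.

```python
def as_candidate_list(cell):
    """Return list-like cells as a list; wrap any other value (e.g. a lone string) in a list."""
    if isinstance(cell, (list, tuple)):
        return list(cell)
    return [cell]


print(as_candidate_list("AAP akj"))
print(as_candidate_list(["lar", "lair", "larrylamo"]))
```

With the frames from the question, this could presumably be used as df3['col2'] = df3.col2.map(as_candidate_list) in place of the lambda, leaving cells that are already lists untouched.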

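    【Comments】:

    • Hey Andy - yes, this does work on the dataframe I provided. The df I'm working with must be messed up, b/c I keep getting: TypeError: expected string or bytes-like object. Do you know what that means?
    • My df has some weird characters, like accents..
    • @max: Most likely some values in your dataframe are not strings. Your df may also contain NaN. Try running df.applymap(type) and look for any cells in the output that are not <class 'str'>. Those cells cause the error.
    • Sometimes I hate this stuff. There was a null value in the middle of the df. Thanks again.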
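
The error discussed in these comments can be reproduced without pandas: re.split, which sort_token_alphabetically calls, raises TypeError when handed a NaN or None instead of a string. The sketch below shows that, plus one possible guard; is_scorable is a hypothetical helper name, and the sample values are made up for illustration.

```python
import re


def sort_token_alphabetically(word):
    # Same logic as in the question: split on comma, period, or space, then sort.
    token = re.split('[,. ]', word)
    return ' '.join(sorted(token))


def is_scorable(value):
    """Only real strings can be passed to re.split safely."""
    return isinstance(value, str)


values = ["AAP akj", float("nan"), None, "ADO asd"]

for v in values:
    try:
        sort_token_alphabetically(v)
    except TypeError as exc:
        # NaN (a float) and None both end up here.
        print(repr(v), "->", exc)

scorable = [v for v in values if is_scorable(v)]
print(scorable)
```

In a dataframe, the same predicate could serve as a boolean mask, for example df['col2'].map(is_scorable), to drop or inspect the offending rows before calling get_top_matches. Note that in pandas 2.1 and later, DataFrame.applymap is deprecated in favor of DataFrame.map.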