Answers: feature engineering for a string

【Question title】: Features engineering for a String
【Posted】: 2020-04-30 12:59:41
【Question description】:

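Suppose we have a dataset of 6-character strings (all lowercase letters), e.g. "olmido", each with a corresponding binary value. For example, "olmido" has the value 1 and "lgoad" has the value 0. For new 6-character strings (all lowercase letters), I want to predict the value (i.e. 1 or 0).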

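My question is: what is a good way to convert the strings into numeric features so that a machine learning model can be trained on them? So far I have simply split each string into its letters and converted each letter to a number, which gives me 6 features. With this variant, however, my models still do not produce satisfactory results.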

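With my variant, the order of the letters is not taken into account (so "olmido" is treated the same as, say, "loimod"), but the order of the letters should play an important role. How can I best take this into account?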

【Question discussion】:

Tags: string machine-learning encoding feature-selection supervised-learning

【Solution 1】:

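It seems to me that your problem can be addressed with character n-grams. You say you only have 6 features because you only consider character unigrams. Since you say the order of the characters plays an important role in your classifier, you should use character bigrams or even trigrams as features.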

【Discussion】:

A bit more explanation of character n-grams: "olmido" can be broken into the unigrams "o", "l", "m", "i", "d", "o" and the bigrams "ol", "lm", "mi", "id", "do".
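
As a rough illustration of the n-gram idea above, here is a minimal sketch in plain Python. The helper names `char_ngrams` and `ngram_features` are made up for this example and are not from the thread; in practice a library vectorizer, such as scikit-learn's `CountVectorizer` with `analyzer='char'`, would likely do the same job.

```python
from collections import Counter


def char_ngrams(word, n):
    """Return all contiguous character n-grams of `word`, in order."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def ngram_features(word, max_n=3):
    """Count every 1-gram up to max_n-gram; keys are prefixed with n."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for gram in char_ngrams(word, n):
            counts[f"{n}:{gram}"] += 1
    return dict(counts)


print(char_ngrams("olmido", 2))  # ['ol', 'lm', 'mi', 'id', 'do']

# Same letters in a different order: the unigrams match, the bigrams do not.
print(sorted(char_ngrams("olmido", 1)) == sorted(char_ngrams("loimod", 1)))  # True
print(set(char_ngrams("olmido", 2)) == set(char_ngrams("loimod", 2)))  # False
```

The resulting count dictionaries can be turned into a feature matrix (one column per n-gram) before training a classifier. Unlike per-letter features alone, the bigram and trigram columns differ between "olmido" and "loimod".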

【Solution 2】:

I am not entirely sure about the use case here, but I assume you want to predict based on sub-sequences of letters.

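If it is a full-string match and you have no memory constraints, a dictionary is enough. If it is a partial-string match, look at the Aho-Corasick algorithm, which supports substring matching.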
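
To make the two lookup options above concrete, here is a minimal sketch: a plain dict for exact matches and a small hand-written Aho-Corasick automaton for substring matches. The function names `build_automaton` and `find_matches`, the toy label dict, and the pattern list are assumptions made up for illustration; a maintained library would be the more sensible choice outside of a demo.

```python
from collections import deque


def build_automaton(patterns):
    """Build trie transitions, failure links and outputs for the patterns."""
    goto = [{}]   # goto[state][char] -> next state
    fail = [0]    # failure link per state
    out = [[]]    # patterns recognized when a state is reached
    for pattern in patterns:
        state = 0
        for ch in pattern:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(pattern)

    # Breadth-first pass: depth-1 states keep fail = 0, deeper states
    # inherit the longest proper suffix that is also a trie prefix.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def find_matches(text, automaton):
    """Return (start_index, pattern) for every pattern occurrence in text."""
    goto, fail, out = automaton
    state = 0
    matches = []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for pattern in out[state]:
            matches.append((i - len(pattern) + 1, pattern))
    return matches


# Exact match: a dict from known strings to labels (toy data).
known_labels = {"olmido": 1}
print(known_labels.get("olmido"))  # 1

# Substring match: find every occurrence of the (made-up) patterns.
automaton = build_automaton(["mi", "ido", "do"])
print(find_matches("olmido", automaton))  # [(2, 'mi'), (3, 'ido'), (4, 'do')]
```

The automaton scans the text once regardless of how many patterns are registered, which is the main reason to prefer it over checking each substring separately.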

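A more probabilistic approach is to use a sequence learning algorithm such as Conditional Random Fields (CRF). Treating this as a sequence learning problem, the snippet below learns left-neighbor and right-neighbor letter features for every letter in a word. I have added a DEPENDENCY_CHAIN_LENGTH parameter that controls how many neighboring letters are used as dependencies for each letter. So if you want the model to learn only the immediate left and right letter dependencies, set it to 1. I have set it to 3 in the snippet below.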

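During prediction, a label is predicted for each (encoded) letter together with its left and right dependencies. I average the per-letter predictions and aggregate them into a single output per word.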

If crfsuite is not installed yet, run pip install sklearn_crfsuite to install it.

import sklearn_crfsuite
import statistics

DEPENDENCY_CHAIN_LENGTH = 3


def translate_to_features(word, i):
    alphabet = word[i]
    features = {
        'bias': 1.0,
        'alphabet.lower()': alphabet.lower(),
        'alphabet.isupper()': alphabet.isupper(),
        'alphabet.isdigit()': alphabet.isdigit(),
    }
    j = 1
    # Builds dependency towards the left side characters up to
    # DEPENDENCY_CHAIN_LENGTH characters
    while i - j >= 0 and j <= DEPENDENCY_CHAIN_LENGTH:
        position = (i - j)
        alphabet1 = word[position]
        features.update({
            '-' + str(position) + ':alphabet.lower()': alphabet1.lower(),
            '-' + str(position) + ':alphabet.isupper()': alphabet1.isupper(),
            '-' + str(position) + ':alphabet.isdigit()': alphabet1.isdigit(),
        })
        j = j + 1
    if i == 0:
        # Flag the first character as the beginning of the word
        features['BOW'] = True

    j = 1
    # Builds dependency towards the right side characters up to
    # DEPENDENCY_CHAIN_LENGTH characters
    while i + j < len(word) and j <= DEPENDENCY_CHAIN_LENGTH:
        position = (i + j)
        alphabet1 = word[position]
        features.update({
            '+' + str(position) + ':alphabet.lower()': alphabet1.lower(),
            '+' + str(position) + ':alphabet.isupper()': alphabet1.isupper(),
            '+' + str(position) + ':alphabet.isdigit()': alphabet1.isdigit(),
        })
        j = j + 1

    if i == len(word) - 1:
        # Flag the last character as the end of the word
        features['EOW'] = True

    return features


raw_training_data = {"Titles": "1",
                     "itTels": "0",
                     }

print("Learning dataset with labels : {}".format(raw_training_data))
raw_testing_data = ["titles", "ittsle"]

X_train = []
Y_train = []

print("Feature encoding in progress ... ")
# Prepare encoded features from words
for word in raw_training_data.keys():
    word_tr = []
    word_lr = []
    word_length = len(word)
    if word_length < DEPENDENCY_CHAIN_LENGTH:
        raise Exception("Dependency chain cannot have length greater than a word")
    for i in range(0, len(word)):
        word_tr.append(translate_to_features(word, i))
        word_lr.append(raw_training_data[word])
    X_train.append(word_tr)
    Y_train.append(word_lr)
print("Feature encoding in completed")
# Training snippet
crf = sklearn_crfsuite.CRF(
    algorithm='lbfgs',
    c1=0.1,
    c2=0.1,
    max_iterations=1,
    all_possible_transitions=True
)
print("Training in progress")
crf.fit(X_train, Y_train)
print("Training completed")

print("Beginning  predictions")
# Prediction Snippet
for word in raw_testing_data:
    # Encode into features
    word_enc = []
    for i in range(0, len(word)):
        word_enc.append(translate_to_features(word, i))

    # Predict using the encoded features
    pred_values = crf.predict_marginals_single(word_enc)

    # Aggregate scores across spans per label
    label_scores = {}
    for span_prediction in pred_values:
        for label in span_prediction.keys():
            if label in label_scores:
                label_scores[label].append(span_prediction[label])
            else:
                label_scores[label] = [span_prediction[label]]

    # Print aggregated score
    print("Predicted label for the word '{}'  is :".format(word))
    for label in label_scores:
        print("\tLabel {} Score {}".format(label, statistics.mean(label_scores[label])))
print("Predictions  completed")

Sample output:

Learning dataset with labels : {'Titles': '1', 'itTels': '0'}
Feature encoding in progress ... 
Feature encoding in completed
Training in progress
Training completed
Beginning  predictions
Predicted label for the word 'titles'  is :
    Label 1 Score 0.6821365857513837
    Label 0 Score 0.3178634142486163
Predicted label for the word 'ittsle'  is :
    Label 1 Score 0.36701890171374996
    Label 0 Score 0.6329810982862499
Predictions  completed
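
The averaged scores above can be reduced to a single 0/1 decision per word by picking the label with the highest mean. The following is a minimal sketch of that step; the function name `aggregate_word_label` and the toy marginals are assumptions for illustration, and the input is assumed to have the shape used in the snippet above (one dict of label probabilities per character).

```python
import statistics


def aggregate_word_label(marginals):
    """Average per-character label probabilities and pick the best label.

    `marginals` is a list with one dict per character, mapping each label
    to its probability.
    """
    label_scores = {}
    for char_marginals in marginals:
        for label, prob in char_marginals.items():
            label_scores.setdefault(label, []).append(prob)
    mean_scores = {label: statistics.mean(probs)
                   for label, probs in label_scores.items()}
    best_label = max(mean_scores, key=mean_scores.get)
    return best_label, mean_scores


# Hand-made marginals for a three-letter word (illustrative values only).
toy_marginals = [
    {"1": 0.7, "0": 0.3},
    {"1": 0.6, "0": 0.4},
    {"1": 0.8, "0": 0.2},
]
best, means = aggregate_word_label(toy_marginals)
print(best)  # 1
```

Averaging is only one possible aggregation; taking the product of the per-letter probabilities or a majority vote over per-letter labels would be alternatives worth comparing on held-out data.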

【Discussion】: