【Question title】: Feature engineering for a string
【Posted】: 2020-04-30 12:59:41
【Question description】:

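Suppose we have a dataset of 6-letter strings (all lowercase), such as "olmido", each with a corresponding binary value. For example, "olmido" has the value 1 and "lgoad" has the value 0. For new 6-letter strings (all lowercase), I want to predict the value (1 or 0).

My question is: what is a good way to convert the strings into numeric features so that a machine learning model can be trained on them? So far I have simply split each string into its letters and converted each letter to a number, which gives 6 features. With this variant, however, my model still does not give satisfactory results.

In my current variant the order of the letters does not matter (so "olmido" is treated the same as, say, "loimod"), but the order of the letters should play an important role. How can I best take this into account?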

【Comments】:

    Tags: string machine-learning encoding feature-selection supervised-learning


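    【Solution 1】:

    It seems to me that your problem can be addressed with character n-grams. You say you have only 6 features because you are only considering character unigrams. Since you say the order of the characters plays an important role in your classifier, you should use character bigrams or even trigrams as features.

    【Comments】:

    • More on character n-grams: "olmido" yields the unigrams "o", "l", "m", "i", "d", "o" and the bigrams "ol", "lm", "mi", "id", "do".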
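
The n-gram extraction described above can be sketched in a few lines of plain Python. The helper names `char_ngrams` and `ngram_features` are illustrative assumptions, not part of the answer, and the `"n:gram"` key format is just one possible convention.

```python
from collections import Counter


def char_ngrams(word, n):
    """Return the contiguous character n-grams of `word`, left to right."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def ngram_features(word, max_n=3):
    """Count every n-gram for n = 1..max_n; keys look like '2:ol'."""
    features = Counter()
    for n in range(1, max_n + 1):
        for gram in char_ngrams(word, n):
            features[f"{n}:{gram}"] += 1
    return dict(features)


print(char_ngrams("olmido", 2))  # ['ol', 'lm', 'mi', 'id', 'do']

# Unigram counts cannot tell the two anagrams apart, bigram counts can.
print(ngram_features("olmido", max_n=1) == ngram_features("loimod", max_n=1))
print(ngram_features("olmido", max_n=2) == ngram_features("loimod", max_n=2))
```

Feature dictionaries like these could then be turned into a numeric matrix, for example with scikit-learn's `DictVectorizer`; `CountVectorizer(analyzer="char", ngram_range=(1, 3))` is a ready-made alternative that does extraction and counting in one step.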
    【Solution 2】:

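    I'm not entirely sure of the use case here, but I assume you want to predict based on sub-sequences of letters.

    If it is a full-string match and you have no memory constraints, a dictionary is sufficient. If it is a partial-string match, look at the Aho-Corasick algorithm, which lets you do substring matching.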
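
For illustration, here is a minimal sketch of both options in plain Python: a dictionary lookup for exact matches and a compact Aho-Corasick automaton for substring matches. The names `build_automaton` and `find_matches`, the sample patterns, and the sample label table are assumptions made for this example, not part of the answer.

```python
from collections import deque

# Exact match: a plain dictionary from known strings to labels.
known_labels = {"olmido": 1}          # hypothetical lookup table
print(known_labels.get("olmido"))     # label if seen before, else None


def build_automaton(patterns):
    """Build trie transitions, failure links and output lists."""
    goto, fail, out = [{}], [0], [[]]
    for pattern in patterns:
        node = 0
        for ch in pattern:
            if ch not in goto[node]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[node][ch] = len(goto) - 1
            node = goto[node][ch]
        out[node].append(pattern)

    # Breadth-first pass: depth-1 nodes keep failure link 0 (the root).
    queue = deque(goto[0].values())
    while queue:
        node = queue.popleft()
        for ch, child in goto[node].items():
            queue.append(child)
            f = fail[node]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[child] = goto[f].get(ch, 0)
            # Inherit patterns that end at the failure target.
            out[child] = out[child] + out[fail[child]]
    return goto, fail, out


def find_matches(text, automaton):
    """Return (start_index, pattern) for every pattern occurrence in text."""
    goto, fail, out = automaton
    node, matches = 0, []
    for i, ch in enumerate(text):
        while node and ch not in goto[node]:
            node = fail[node]
        node = goto[node].get(ch, 0)
        for pattern in out[node]:
            matches.append((i - len(pattern) + 1, pattern))
    return matches


automaton = build_automaton(["ol", "mid", "do", "id"])
print(find_matches("olmido", automaton))
```

Each matched pattern could serve as a binary feature ("contains `mid`") for a downstream classifier. For real workloads a maintained implementation such as the `pyahocorasick` package is probably a better choice than this sketch.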

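    A more probabilistic approach is to use a sequence learning algorithm such as Conditional Random Fields (CRF). Treating this as a sequence learning problem, the snippet below learns left-neighbor and right-neighbor letter features for each letter of a word. I added a DEPENDENCY_CHAIN_LENGTH parameter that controls how many neighboring letters are used as dependencies for each letter. If you want the model to learn only the immediate left and right letter dependencies, set it to 1. It is set to 3 in the snippet below.

    During prediction, a label is predicted for each (encoded) letter (together with its left and right dependencies). I average the per-letter predictions and aggregate them into a single output per word.

    If crfsuite is not installed yet, run pip install sklearn_crfsuite to install it.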

    import sklearn_crfsuite
    import statistics
    
    DEPENDENCY_CHAIN_LENGTH = 3
    
    
    def translate_to_features(word, i):
        """Build the CRF feature dict for the letter at index i of word."""
        alphabet = word[i]
        features = {
            'bias': 1.0,
            'alphabet.lower()': alphabet.lower(),
            'alphabet.isupper()': alphabet.isupper(),
            'alphabet.isdigit()': alphabet.isdigit(),
        }
        j = 1
        # Builds dependency towards the left side characters upto
        # DEPENDENCY_CHAIN_LENGTH characters
        while i - j >= 0 and j <= DEPENDENCY_CHAIN_LENGTH:
            position = (i - j)
            alphabet1 = word[position]
            features.update({
                '-' + str(position) + ':alphabet.lower()': alphabet1.lower(),
                '-' + str(position) + ':alphabet.isupper()': alphabet1.isupper(),
                '-' + str(position) + ':alphabet.isdigit()': alphabet1.isdigit(),
            })
            j = j + 1
        else:
            # The loop has no `break`, so this branch runs for every letter
            # and 'BOW' is always set.
            features['BOW'] = True
    
        j = 1
        # Builds dependency towards the right side characters upto
        # DEPENDENCY_CHAIN_LENGTH characters
        while i + j < len(word) and j <= DEPENDENCY_CHAIN_LENGTH:
            position = (i + j)
            alphabet1 = word[position]
            features.update({
                '+' + str(position) + ':alphabet.lower()': alphabet1.lower(),
                '+' + str(position) + ':alphabet.isupper()': alphabet1.isupper(),
                '+' + str(position) + ':alphabet.isdigit()': alphabet1.isdigit(),
            })
            j = j + 1
    
        else:
            # As above: without a `break`, 'EOW' is set for every letter.
            features['EOW'] = True
    
        return features
    
    
    raw_training_data = {"Titles": "1",
                         "itTels": "0",
                         }
    
    print("Learning dataset with labels : {}".format(raw_training_data))
    raw_testing_data = ["titles", "ittsle"]
    
    X_train = []
    Y_train = []
    
    print("Feature encoding in progress ... ")
    # Prepare encoded features from words
    for word in raw_training_data.keys():
        word_tr = []
        word_lr = []
        word_length = len(word)
        if word_length < DEPENDENCY_CHAIN_LENGTH:
            raise ValueError("Dependency chain cannot have length greater than a word")
        for i in range(0, len(word)):
            word_tr.append(translate_to_features(word, i))
            word_lr.append(raw_training_data[word])
        X_train.append(word_tr)
        Y_train.append(word_lr)
    print("Feature encoding in completed")
    # Training snippet
    crf = sklearn_crfsuite.CRF(
        algorithm='lbfgs',
        c1=0.1,
        c2=0.1,
        max_iterations=1,
        all_possible_transitions=True
    )
    print("Training in progress")
    crf.fit(X_train, Y_train)
    print("Training completed")
    
    print("Beginning  predictions")
    # Prediction Snippet
    for word in raw_testing_data:
        # Encode into features
        word_enc = []
        for i in range(0, len(word)):
            word_enc.append(translate_to_features(word, i))
    
        # Predict using the encoded features
        pred_values = crf.predict_marginals_single(word_enc)
    
        # Aggregate scores across spans per label
        label_scores = {}
        for span_prediction in pred_values:
            for label in span_prediction.keys():
                if label in label_scores:
                    label_scores[label].append(span_prediction[label])
                else:
                    label_scores[label] = [span_prediction[label]]
    
        # Print aggregated score
        print("Predicted label for the word '{}'  is :".format(word))
        for label in label_scores:
            print("\tLabel {} Score {}".format(label, statistics.mean(label_scores[label])))
    print("Predictions  completed")
    

    The answer reports the following output:

    Learning dataset with labels : {'Titles': '1', 'itTels': '0'}
    Feature encoding in progress ... 
    Feature encoding in completed
    Training in progress
    Training completed
    Beginning  predictions
    Predicted label for the word 'titles'  is :
        Label 1 Score 0.6821365857513837
        Label 0 Score 0.3178634142486163
    Predicted label for the word 'ittsle'  is :
        Label 1 Score 0.36701890171374996
        Label 0 Score 0.6329810982862499
    Predictions  completed
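
The snippet above prints the averaged score per label but stops short of a final 0/1 decision. A minimal sketch of that last step, assuming the per-letter marginals arrive as a list of `{label: probability}` dicts (the shape the loop over `pred_values` iterates over), could look like this. The function name `aggregate_word_label` and the sample numbers are made up for illustration.

```python
from statistics import mean


def aggregate_word_label(marginals):
    """Average per-letter label probabilities and pick the best label.

    marginals: list of {label: probability} dicts, one dict per letter.
    Returns (best_label, {label: mean_probability}).
    """
    scores = {}
    for letter_marginals in marginals:
        for label, prob in letter_marginals.items():
            scores.setdefault(label, []).append(prob)
    averaged = {label: mean(probs) for label, probs in scores.items()}
    best_label = max(averaged, key=averaged.get)
    return best_label, averaged


# Hypothetical marginals for a three-letter word.
example = [
    {"1": 0.7, "0": 0.3},
    {"1": 0.6, "0": 0.4},
    {"1": 0.8, "0": 0.2},
]
label, averaged = aggregate_word_label(example)
print(label, averaged)
```

Averaging is only one way to pool the per-letter scores; taking the product of probabilities or a majority vote over per-letter labels are alternatives that may behave differently on short words.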
    

    【Comments】:
