【Question title】: What features could help to classify the end of sentence? Sequence classification
【Posted】: 2019-09-05 00:21:08
【Question description】:

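Problem:

I have pairs of sentences with no full stop and no capital letter between them, and I need to split them apart. I'm looking for help picking good features to improve the model.

Background:

I am using pycrfsuite to perform sequence classification and find the end of the first sentence, as follows:

From the Brown corpus, I join every two sentences together and get their POS tags. Then I label each token with 'S' if a space follows it and with 'P' if the full stop that ends the first sentence follows it. Then I delete the full stop between the sentences and lowercase the token that follows it. I get something like this:

Input: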

data = ['I love Harry Potter.', 'It is my favorite book.']

Output:

sent = [('I', 'PRP'), ('love', 'VBP'), ('Harry', 'NNP'), ('Potter', 'NNP'), ('it', 'PRP'), ('is', 'VBZ'), ('my', 'PRP$'), ('favorite', 'JJ'), ('book', 'NN')]
labels = ['S', 'S', 'S', 'P', 'S', 'S', 'S', 'S', 'S']
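The labeling step described above can be sketched roughly as follows. This is a minimal illustration, not the asker's actual code: `strip_final_stop` and `join_pair` are hypothetical helper names, and the input is assumed to be already POS-tagged, with each sentence ending in a `('.', '.')` token.

```python
def strip_final_stop(tagged):
    """Return a copy of the tagged sentence without its final full stop."""
    if tagged and tagged[-1][0] == '.':
        return tagged[:-1]
    return list(tagged)


def join_pair(tagged_a, tagged_b):
    """Join two tagged sentences and label the end of the first with 'P'."""
    first = strip_final_stop(tagged_a)
    second = strip_final_stop(tagged_b)

    # Lowercase the token that used to follow the removed full stop.
    if second:
        word, tag = second[0]
        second[0] = (word.lower(), tag)

    sent = first + second
    labels = ['S'] * len(sent)
    if first:
        # Only the last token of the first sentence gets 'P'.
        labels[len(first) - 1] = 'P'
    return sent, labels


tagged_a = [('I', 'PRP'), ('love', 'VBP'), ('Harry', 'NNP'),
            ('Potter', 'NNP'), ('.', '.')]
tagged_b = [('It', 'PRP'), ('is', 'VBZ'), ('my', 'PRP$'),
            ('favorite', 'JJ'), ('book', 'NN'), ('.', '.')]

sent, labels = join_pair(tagged_a, tagged_b)
```

With the example pair, this reproduces the `sent` and `labels` shown above: nine tokens, with 'P' on 'Potter' and 'It' lowercased to 'it'.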

Currently, I extract these general features:

def word2features2(sent, i):
    word = sent[i][0]
    postag = sent[i][1]

    # Common features for all words
    features = [
        'bias',
        'word.lower=' + word.lower(),
        'word[-3:]=' + word[-3:],
        'word[-2:]=' + word[-2:],
        'word.isupper=%s' % word.isupper(),
        'word.isdigit=%s' % word.isdigit(),
        'postag=' + postag
    ]

    # Features for words that are not
    # at the beginning of a document
    if i > 0:
        word1 = sent[i-1][0]
        postag1 = sent[i-1][1]
        features.extend([
            '-1:word.lower=' + word1.lower(),
            '-1:word.isupper=%s' % word1.isupper(),
            '-1:word.isdigit=%s' % word1.isdigit(),
            '-1:postag=' + postag1
        ])
    else:
        # Indicate that it is the 'beginning of a sentence'
        features.append('BOS')

    # Features for words that are not
    # at the end of a document
    if i < len(sent)-1:
        word1 = sent[i+1][0]
        postag1 = sent[i+1][1]
        features.extend([
            '+1:word.lower=' + word1.lower(),
            '+1:word.isupper=%s' % word1.isupper(),
            '+1:word.isdigit=%s' % word1.isdigit(),
            '+1:postag=' + postag1
        ])
    else:
        # Indicate that it is the 'end of a sentence'
        features.append('EOS')

    return features

And I train the CRF with these parameters:

    import pycrfsuite

    trainer = pycrfsuite.Trainer(verbose=True)

    # Submit training data to the trainer
    for xseq, yseq in zip(X_train, y_train):
        trainer.append(xseq, yseq)

    # Set the parameters of the model
    trainer.set_params({
        # coefficient for L1 penalty
        'c1': 0.1,

        # coefficient for L2 penalty
        'c2': 0.01,

        # maximum number of iterations
        'max_iterations': 200,

        # whether to include transitions that
        # are possible, but not observed
        'feature.possible_transitions': True
    })

    trainer.train('crf.model')

Results:

The classification report shows:

              precision    recall  f1-score   support

           S       0.99      1.00      0.99    214627
           P       0.81      0.57      0.67      5734

   micro avg       0.99      0.99      0.99    220361
   macro avg       0.90      0.79      0.83    220361
weighted avg       0.98      0.99      0.98    220361

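In what ways could I edit word2features2() to improve the model? (Or any other part.)

Here is the link to the full code, as it stands today.

Also, I'm just a beginner in NLP, so I would greatly appreciate any general feedback, links to relevant or useful sources, and fairly simple explanations. Thank you very much!

【Question discussion】:

Tags: python machine-learning nlp nltk crf

【Solution 1】: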

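Because of the nature of the problem, your classes are highly imbalanced (5,734 P tokens against 214,627 S tokens in your report). I would suggest using a weighted loss in which errors on the P label cost more than errors on the S label. I think the problem may be that, with both classes weighted equally, the classifier does not pay enough attention to the P labels, because they contribute very little to the loss.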
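A per-label loss weight does not appear among the training parameters used in the question (c1, c2, max_iterations, feature.possible_transitions). One commonly used substitute, which is not the weighted loss itself, is to leave training unchanged and lower the decision threshold for the rare P label at prediction time. The sketch below works on a plain list of per-token probabilities of P; `labels_from_marginals` and `best_boundary` are hypothetical helper names and the numbers are toy values.

```python
# Assumption: with pycrfsuite, per-token probabilities like these could be
# obtained by calling tagger.set(xseq) and then tagger.marginal('P', t)
# for each position t. Here they are toy values for illustration only.
p_marginals = [0.01, 0.02, 0.05, 0.35, 0.10, 0.01]


def labels_from_marginals(p_marginals, threshold=0.5):
    """Label a token 'P' when its probability of 'P' reaches the threshold."""
    return ['P' if p >= threshold else 'S' for p in p_marginals]


def best_boundary(p_marginals):
    """Index of the token most likely to end the first sentence.

    Relies on the setup in the question, where every input is a pair of
    sentences and therefore contains exactly one boundary.
    """
    return max(range(len(p_marginals)), key=p_marginals.__getitem__)


default_labels = labels_from_marginals(p_marginals)                 # threshold 0.5
lenient_labels = labels_from_marginals(p_marginals, threshold=0.3)  # favors recall
boundary_index = best_boundary(p_marginals)
```

Lowering the threshold trades precision for recall on P, which is the direction the report suggests (precision 0.81, recall 0.57). The threshold itself would need to be chosen on held-out data.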

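Another thing you can try is hyperparameter tuning, making sure you optimize for the macro F1-score, since it gives both classes the same weight regardless of how many support instances each one has.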
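To make the macro F1 criterion concrete, here is a small pure-Python sketch that computes it over flattened token labels and uses it to choose between candidate settings. The helper names, the (c1, c2) keys, and the predictions are illustrative assumptions, not results from the model in the question.

```python
def f1_for_label(y_true, y_pred, label):
    """F1-score of a single label over flattened token labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred, labels=('S', 'P')):
    """Unweighted mean of the per-label F1-scores."""
    return sum(f1_for_label(y_true, y_pred, l) for l in labels) / len(labels)


def pick_best(candidates, y_true):
    """Return the key of the candidate whose predictions maximize macro F1."""
    return max(candidates, key=lambda k: macro_f1(y_true, candidates[k]))


# Toy development set: 8 'S' tokens and 2 'P' tokens.
y_dev = ['S'] * 8 + ['P'] * 2

# Hypothetical predictions from two (c1, c2) settings.
candidates = {
    (0.1, 0.01): ['S'] * 10,                 # never predicts 'P'
    (1.0, 0.001): ['S'] * 7 + ['P'] * 3,     # finds both 'P', one false alarm
}

all_s_score = macro_f1(y_dev, candidates[(0.1, 0.01)])
best_setting = pick_best(candidates, y_dev)
```

The always-'S' candidate is right on 8 of 10 tokens, yet its macro F1 is only about 0.44 because its F1 on P is zero. The same averaging explains the report in the question: (0.99 + 0.67) / 2 gives the macro F1 of 0.83.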

【Discussion】:

Thank you very much! Do you happen to know how to implement a weighted loss with pycrfsuite or sklearn_crfsuite?