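I'm not quite sure about the use case here, but I assume you want to make predictions based on sub-sequences of letters.
If it is an exact, full-string match and you have no memory constraints, a dictionary is sufficient. If it is a partial string match, look at the Aho-Corasick algorithm, which lets you do substring matching.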
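As a rough illustration of those two options, here is a minimal sketch: a plain dict for exact matches, and a small pure-Python Aho-Corasick automaton for substring matches. The names `labels`, `build_automaton` and `find_all` are made up for this example and the patterns are arbitrary; a maintained library would be a better choice for real workloads.

```python
from collections import deque

# Exact match: a plain dict maps each known string to its label.
labels = {"Titles": "1", "itTels": "0"}
print(labels.get("Titles"))  # lookup by the full string


def build_automaton(patterns):
    """Build Aho-Corasick tables: goto transitions, failure links, outputs."""
    goto = [{}]   # goto[state][char] -> next state
    fail = [0]    # fail[state] -> state for the longest proper suffix
    out = [[]]    # out[state] -> patterns that end at this state
    for pattern in patterns:
        state = 0
        for ch in pattern:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(pattern)
    # Breadth-first pass over the trie to fill in the failure links.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit patterns that end at the failure state.
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def find_all(text, automaton):
    """Return (start_index, pattern) for every pattern occurrence in text."""
    goto, fail, out = automaton
    state = 0
    matches = []
    for end, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for pattern in out[state]:
            matches.append((end - len(pattern) + 1, pattern))
    return matches


automaton = build_automaton(["tit", "les", "it"])
print(find_all("titles", automaton))
```

The automaton scans the text once regardless of how many patterns it holds, which is the main reason to prefer it over looping with `in` for each pattern.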
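A more probabilistic approach is to use a sequence learning algorithm such as Conditional Random Fields (CRF). Treating this as a sequence learning problem, the snippet below learns left-side and right-side letter features for each letter in a word. I added a DEPENDENCY_CHAIN_LENGTH parameter that controls how many neighbouring letters on each side are learned as dependencies for each letter. So if you want the model to learn only the immediate left and right letter dependencies, set it to 1. I have set it to 3 for the snippet below.
During prediction, a label is predicted for every (encoded) letter, taking its left and right dependencies into account. I average the per-letter predictions and aggregate them into a single output per word.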
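To make the effect of DEPENDENCY_CHAIN_LENGTH concrete, the sketch below lists which neighbouring letters fall inside the window for each position. `context_window` is a hypothetical helper written only for this illustration; it is not part of the CRF snippet.

```python
def context_window(word, i, chain_length):
    """Return the letters up to `chain_length` positions left and right of word[i]."""
    left = word[max(0, i - chain_length):i]
    right = word[i + 1:i + 1 + chain_length]
    return left, right


word = "Titles"
for chain_length in (1, 3):
    print("chain length", chain_length)
    for i, letter in enumerate(word):
        left, right = context_window(word, i, chain_length)
        print("  {!r}: left={!r} right={!r}".format(letter, left, right))
```

With a chain length of 1 each letter only sees its immediate neighbours; with 3 the window is truncated near the word boundaries, so letters at the edges get fewer features than letters in the middle.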
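The averaging step can be looked at in isolation. The sketch below assumes the per-letter marginals arrive as a list of `{label: probability}` dicts, one per letter, which is the shape `predict_marginals_single` returns. The numbers are made up for illustration, and `aggregate_marginals` is a name chosen for this example.

```python
import statistics


def aggregate_marginals(marginals):
    """Average per-letter label probabilities into one score per label."""
    scores = {}
    for letter_probs in marginals:
        for label, prob in letter_probs.items():
            scores.setdefault(label, []).append(prob)
    return {label: statistics.mean(probs) for label, probs in scores.items()}


# Made-up marginals for a three-letter word (illustrative values only).
example_marginals = [
    {"0": 0.25, "1": 0.75},
    {"0": 0.5, "1": 0.5},
    {"0": 0.25, "1": 0.75},
]
word_scores = aggregate_marginals(example_marginals)
best_label = max(word_scores, key=word_scores.get)
print(word_scores, best_label)
```

Taking the mean is only one possible aggregation; it treats every letter as equally informative, which may or may not suit your data.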
If crfsuite is not installed yet, install it with pip install sklearn_crfsuite.
import sklearn_crfsuite
import statistics
DEPENDENCY_CHAIN_LENGTH = 3
def translate_to_features(word, i):
    """Encode the letter word[i] and its neighbours as a CRF feature dict."""
    alphabet = word[i]
    features = {
        'bias': 1.0,
        'alphabet.lower()': alphabet.lower(),
        'alphabet.isupper()': alphabet.isupper(),
        'alphabet.isdigit()': alphabet.isdigit(),
    }
    j = 1
    # Builds dependency towards the left side characters up to
    # DEPENDENCY_CHAIN_LENGTH characters. Feature names are keyed by the
    # absolute index of the neighbouring letter.
    while i - j >= 0 and j <= DEPENDENCY_CHAIN_LENGTH:
        position = (i - j)
        alphabet1 = word[position]
        features.update({
            '-' + str(position) + ':alphabet.lower()': alphabet1.lower(),
            '-' + str(position) + ':alphabet.isupper()': alphabet1.isupper(),
            '-' + str(position) + ':alphabet.isdigit()': alphabet1.isdigit(),
        })
        j = j + 1
    # Mark the beginning of the word. A while/else would run its else branch
    # on every call here (the loop never breaks), so use an explicit check.
    if i == 0:
        features['BOW'] = True
    j = 1
    # Builds dependency towards the right side characters up to
    # DEPENDENCY_CHAIN_LENGTH characters
    while i + j < len(word) and j <= DEPENDENCY_CHAIN_LENGTH:
        position = (i + j)
        alphabet1 = word[position]
        features.update({
            '+' + str(position) + ':alphabet.lower()': alphabet1.lower(),
            '+' + str(position) + ':alphabet.isupper()': alphabet1.isupper(),
            '+' + str(position) + ':alphabet.isdigit()': alphabet1.isdigit(),
        })
        j = j + 1
    # Mark the end of the word.
    if i == len(word) - 1:
        features['EOW'] = True
    return features
raw_training_data = {
    "Titles": "1",
    "itTels": "0",
}
print("Learning dataset with labels : {}".format(raw_training_data))
raw_testing_data = ["titles", "ittsle"]
X_train = []
Y_train = []
print("Feature encoding in progress ... ")
# Prepare encoded features from words
for word in raw_training_data.keys():
    word_tr = []
    word_lr = []
    word_length = len(word)
    if word_length < DEPENDENCY_CHAIN_LENGTH:
        raise Exception("Dependency chain cannot have length greater than a word")
    for i in range(0, len(word)):
        # One feature dict and one copy of the word's label per letter
        word_tr.append(translate_to_features(word, i))
        word_lr.append(raw_training_data[word])
    X_train.append(word_tr)
    Y_train.append(word_lr)
print("Feature encoding in completed")
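# Training snippet
crf = sklearn_crfsuite.CRF(
    algorithm='lbfgs',
    c1=0.1,  # L1 regularization coefficient
    c2=0.1,  # L2 regularization coefficient
    max_iterations=1,  # a single optimizer iteration; raise this for real datasets
    all_possible_transitions=True
)
print("Training in progress")
crf.fit(X_train, Y_train)
print("Training completed")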
print("Beginning predictions")
# Prediction Snippet
for word in raw_testing_data:
    # Encode into features
    word_enc = []
    for i in range(0, len(word)):
        word_enc.append(translate_to_features(word, i))
    # Predict using the encoded features
    pred_values = crf.predict_marginals_single(word_enc)
    # Aggregate scores across spans per label
    label_scores = {}
    for span_prediction in pred_values:
        for label in span_prediction.keys():
            if label in label_scores:
                label_scores[label].append(span_prediction[label])
            else:
                label_scores[label] = [span_prediction[label]]
    # Print aggregated score
    print("Predicted label for the word '{}' is :".format(word))
    for label in label_scores:
        print("\tLabel {} Score {}".format(label, statistics.mean(label_scores[label])))
print("Predictions completed")
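This produces output like the following (the exact scores depend on the feature set, the training settings and the library version):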
Learning dataset with labels : {'Titles': '1', 'itTels': '0'}
Feature encoding in progress ...
Feature encoding in completed
Training in progress
Training completed
Beginning predictions
Predicted label for the word 'titles' is :
    Label 1 Score 0.6821365857513837
    Label 0 Score 0.3178634142486163
Predicted label for the word 'ittsle' is :
    Label 1 Score 0.36701890171374996
    Label 0 Score 0.6329810982862499
Predictions completed