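【Posted】: 2018-11-22 09:49:55
【Question】:
I am developing an image captioning system in Python with Keras. With argmax (greedy) search I get reasonable results (a Bleu_1 score of about 0.58, and the sentences are quite diverse).
However, when I switch to beam search, I get almost the same sentence for every image.
I use the following code to generate the captions: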
# create an array of captions for a chunk of images; first token
# of each caption is the start token
test_x = np.zeros((chunk_size, self.max_len - 1), dtype=int)  # np.int was removed in NumPy 1.24
test_x[:, 0] = self.start_idx + 1
# probability of each caption is 1
captions_probs = np.ones(chunk_size)
# for every image, maintain a heap with the best captions
self.best_captions = [FixedCapacityMaxHeap(20) for i in range(chunk_size)]
# call beam search using the current cnn features
self.beam_search(cnn_feats, test_x, captions_probs, 0, beam_size)
The beam search method is as follows:
def beam_search(self, cnn_feats, generated_captions, captions_probs, t, beam_size):
    # base case: the generated captions have max_len length, so
    # we can remove the (zero) pad at the end and for each image
    # we can insert the generated caption and its probability into
    # the heap with the best captions
    if t == self.max_len - 1:
        for i in range(len(generated_captions)):
            caption = self.remove_zero_pad(list(generated_captions[i]))
            self.best_captions[i].push(list(caption), captions_probs[i])
    else:
        # otherwise, make a prediction (we only keep the element at time
        # step t + 1, as the LSTM has a many-to-many architecture, but we
        # are only interested in the next token for each image)
        pred = self.model.predict(x=[cnn_feats, generated_captions],
                                  batch_size=128,
                                  verbose=1)[:, t + 1, :]
        # efficiently get the indices of the tokens with the greatest probability
        # for each image (they are not necessarily sorted)
        top_idx = np.argpartition(-pred, range(beam_size), axis=1)[:, :beam_size]
        # store the probability of those tokens
        top_probs = pred[np.arange(top_idx.shape[0])[:, None], top_idx]
        # for every 'neighbour' (set of newly generated tokens for every image)
        # get the indices of these tokens, add them to the current captions and
        # update the captions probabilities by multiplying them with the probabilities
        # of the current tokens, then recursively call beam_search
        for i in range(beam_size):
            curr_idx = top_idx[:, i]
            generated_captions[:, t + 1] = curr_idx
            curr_captions_probs = top_probs[:, i] * captions_probs
            self.beam_search(cnn_feats, generated_captions, curr_captions_probs, t+1, beam_size)
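For comparison, the recursion above follows every one of the beam_size continuations at every step and never discards a partial caption, and it does not stop at an end token. A conventional beam search instead keeps only the beam_size best partial captions after each step. The following is a minimal single-image sketch of that pruning step, not a drop-in replacement: `next_token_probs`, `end_idx`, and the toy bigram table are hypothetical names and data introduced only for illustration, and log-probabilities are summed instead of multiplying raw probabilities.

```python
import numpy as np

def beam_search_single(next_token_probs, start_idx, end_idx, max_len, beam_size):
    """Keep only the `beam_size` best partial captions after every step.

    `next_token_probs(tokens)` is a hypothetical callable returning a 1-D array
    of next-token probabilities for a prefix; in the question's setting it
    would wrap `model.predict` for one image.
    """
    beams = [(0.0, [start_idx])]  # (sum of log-probabilities, tokens)
    finished = []
    for _ in range(max_len - 1):
        candidates = []
        for score, tokens in beams:
            probs = np.asarray(next_token_probs(tokens), dtype=float)
            # only the beam_size best continuations of a prefix can survive pruning
            for idx in np.argsort(-probs)[:beam_size]:
                new_score = score + float(np.log(probs[idx] + 1e-12))
                candidates.append((new_score, tokens + [int(idx)]))
        # prune across all prefixes, not per prefix
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for score, tokens in candidates[:beam_size]:
            if tokens[-1] == end_idx:
                finished.append((score, tokens))  # caption is complete
            else:
                beams.append((score, tokens))
        if not beams:
            break
    finished.extend(beams)  # captions that reached max_len without an end token
    return sorted(finished, key=lambda c: c[0], reverse=True)

def toy_next_token_probs(tokens):
    # made-up bigram table indexed by the last token: 0=start, 1='a', 2='dog', 3=end
    table = np.array([
        [0.0, 0.6, 0.4, 0.0],
        [0.0, 0.1, 0.7, 0.2],
        [0.0, 0.1, 0.1, 0.8],
        [0.0, 0.0, 0.0, 1.0],
    ])
    return table[tokens[-1]]

results = beam_search_single(toy_next_token_probs, start_idx=0, end_idx=3,
                             max_len=5, beam_size=2)
best_score, best_tokens = results[0]
```

With this toy table the best caption is `[0, 1, 2, 3]`. The number of model calls grows linearly with the caption length here, because at most beam_size prefixes are expanded per step.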
The FixedCapacityMaxHeap I use is:
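import heapq

class FixedCapacityMaxHeap(object):
    def __init__(self, capacity):
        self.capacity = capacity
        self.h = []

    def push(self, value, priority):
        # heapq is a min-heap, so once the heap is full, heappushpop
        # discards the entry with the lowest priority
        if len(self.h) < self.capacity:
            heapq.heappush(self.h, (priority, value))
        else:
            heapq.heappushpop(self.h, (priority, value))

    def pop(self):
        # returns the highest-priority entry without removing it;
        # the original check `>= 0` was always true and failed on an empty heap
        if len(self.h) > 0:
            return heapq.nlargest(1, self.h)[0]
        else:
            return None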
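A small usage sketch of the heap, to make its behaviour concrete. The class is repeated so the snippet runs on its own, the empty-heap guard is written as `> 0`, and the captions and probabilities are made-up values.

```python
import heapq

class FixedCapacityMaxHeap(object):
    def __init__(self, capacity):
        self.capacity = capacity
        self.h = []

    def push(self, value, priority):
        if len(self.h) < self.capacity:
            heapq.heappush(self.h, (priority, value))
        else:
            # drops the lowest-priority entry once the heap is full
            heapq.heappushpop(self.h, (priority, value))

    def pop(self):
        if len(self.h) > 0:  # guard against an empty heap
            return heapq.nlargest(1, self.h)[0]
        return None

heap = FixedCapacityMaxHeap(2)
# made-up (caption, probability) pairs
for caption, prob in [([1, 5], 0.10), ([1, 7], 0.40), ([1, 9], 0.25)]:
    heap.push(caption, prob)

kept = sorted(heap.h, reverse=True)  # [(0.4, [1, 7]), (0.25, [1, 9])]
best = heap.pop()                    # (0.4, [1, 7])
```

Two properties are worth noticing: the heap keeps the `capacity` entries with the highest priority, and `pop` only peeks at the best entry, so the heap still holds two items after the call.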
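The problem is that the captions generated with beam search are almost identical for every image (for example: 'scaling a in on', 'scaling a are in in', 'scaling a are in'), while the argmax version, which simply takes the highest-probability token at each time step, produces good captions. I have been stuck on this for a long time. I tried a different implementation (one beam_search call per image instead of computing the captions for all images at once), and I also tried a softmax temperature parameter (which controls how confident the LSTM is in its predictions), but neither seems to fix the problem, so any ideas are appreciated.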
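For readers unfamiliar with the temperature parameter mentioned above, here is a minimal sketch of how it is commonly applied to a next-token distribution. `apply_temperature` is a hypothetical helper, not code from the question, and the probability vector is made up.

```python
import numpy as np

def apply_temperature(probs, temperature):
    """Rescale a probability vector: T < 1 sharpens it, T > 1 flattens it."""
    logits = np.log(np.asarray(probs, dtype=float) + 1e-12)
    scaled = logits / temperature
    scaled -= scaled.max()  # subtract the max for numerical stability
    exp = np.exp(scaled)
    return exp / exp.sum()

probs = np.array([0.7, 0.2, 0.1])      # made-up next-token distribution
sharp = apply_temperature(probs, 0.5)  # more confident
flat = apply_temperature(probs, 2.0)   # less confident
```

Rescaling changes how much probability mass the top token holds, but it does not change the order of the tokens within a single time step, so the argmax choice at each step stays the same.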
【Discussion】:
Tags: python machine-learning keras nlp lstm