Connectionist Temporal Classification (CTC) is an algorithm used to train deep neural networks for speech recognition, handwriting recognition, and other sequence problems.

1. Problem

  • Given a dataset of audio clips and corresponding transcripts, we don’t know how the characters in the transcript align to the audio.
  • People’s rates of speech vary.
  • Hand-aligning takes a lot of time.
  • Applications include speech recognition, handwriting recognition from images or from sequences of pen strokes, and action labelling in videos.

2. Problem Definition

We map input sequences X=[x_1, x_2, \ldots, x_T], such as audio, to corresponding output sequences Y=[y_1, y_2, \ldots, y_U], such as transcripts. We want to find an accurate mapping from X's to Y's, which is hard because:

  • Both X and Y can vary in length.
  • The ratio of the lengths of X and Y can vary.
  • We don’t have an accurate alignment (correspondence of the elements) of X and Y.

For a given X, the CTC algorithm gives us an output distribution over all possible Y's. We can use this distribution either to infer a likely output or to assess the probability of a given output.

  • Loss Function: maximize the probability the model assigns to the right answer, i.e., compute the conditional probability p(Y|X);
  • Inference: infer a likely Y given an X, Y^*=argmax_Y p(Y|X)

3. Alignment

  • Often, it doesn’t make sense to force every input step to align to some output. In speech recognition, for example, the input can have stretches of silence with no corresponding output.
  • We have no way to produce outputs with multiple characters in a row. Consider the alignment [h, h, e, l, l, l, o]. Collapsing repeats will produce “helo” instead of “hello”.

CTC handles both problems by introducing a blank token ϵ: an alignment maps every input step to an output token or to ϵ, and the output is obtained by first merging repeats and then removing the ϵ's. The allowed alignments have three properties:

  • the allowed alignments between X and Y are monotonic;
  • the alignment of X to Y is many-to-one;
  • the length of Y cannot be greater than the length of X.
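The collapsing rule (merge repeats, then drop blanks) can be sketched in a few lines of Python; the token characters and the `-` blank symbol here are illustrative, not from the original post:

```python
def collapse(path, blank="-"):
    """Map an alignment to an output: merge repeated tokens, then drop blanks."""
    out, prev = [], None
    for token in path:
        if token != prev and token != blank:
            out.append(token)
        prev = token
    return "".join(out)

print(collapse("hhelllo"))   # no blank between the l's -> "helo"
print(collapse("hhel-lo"))   # a blank separates the repeats -> "hello"
```

This is exactly why the blank is needed: without an ϵ between the two l's, the repeats collapse into one.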

4. Searching Methods

The CTC loss marginalizes over alignments: p(Y|X)=\sum_A\prod_{t=1}^T p_t(a_t|X). Summing over the alignments one by one is far too slow, so the sum is computed with dynamic programming over the sequence Z, which is Y with ϵ inserted between every pair of tokens and at both ends:

Z=[ϵ, y_1, ϵ, y_2, \ldots, ϵ, y_U, ϵ]

Let α_t(s) be the total probability of aligning the first t input steps to the first s tokens of Z. The recursion has two cases:

  • Case 1: we can’t skip z_{s-1}, the previous token in Z. This happens when z_s=ϵ (skipping z_{s-1} would skip a token of Y) or when z_s=z_{s-2} (the ϵ between repeated characters must be kept): α_t(s)=(α_{t-1}(s)+α_{t-1}(s-1))·p_t(z_s|X).

  • Case 2: otherwise we are allowed to skip the previous token in Z, since z_{s-1} is an ϵ between two distinct characters: α_t(s)=(α_{t-1}(s)+α_{t-1}(s-1)+α_{t-1}(s-2))·p_t(z_s|X).

  • Loss Function: for a training set D, the model’s parameters are tuned to minimize the negative log-likelihood rather than to maximize the likelihood directly:

\sum_{(X,Y)\in D} -\log p(Y|X)
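The dynamic program over Z can be sketched directly in probability space with NumPy. This is a minimal sketch, not the original post's code; the function name and the integer encoding of tokens (0 as ϵ) are assumptions for illustration:

```python
import numpy as np

def ctc_neg_log_likelihood(probs, y, blank=0):
    """probs: (T, V) array, probs[t] is the model's output distribution
    at time-step t. y: target token ids (must not contain the blank)."""
    T = probs.shape[0]
    z = [blank]                        # Z = [eps, y_1, eps, ..., y_U, eps]
    for c in y:
        z += [c, blank]
    S = len(z)
    alpha = np.zeros((T, S))
    alpha[0, 0] = probs[0, blank]      # a valid alignment starts with eps ...
    alpha[0, 1] = probs[0, z[1]]       # ... or with y_1
    for t in range(1, T):
        for s in range(S):
            a = alpha[t - 1, s] + (alpha[t - 1, s - 1] if s > 0 else 0.0)
            if s > 1 and z[s] != blank and z[s] != z[s - 2]:
                a += alpha[t - 1, s - 2]   # Case 2: skip the eps at z_{s-1}
            alpha[t, s] = a * probs[t, z[s]]
    # a valid alignment ends in the last eps or the last token of Y
    return -np.log(alpha[T - 1, S - 1] + alpha[T - 1, S - 2])
```

As a sanity check: with uniform per-step distributions over {ϵ, a} and T=2, the three alignments [a,a], [a,ϵ], [ϵ,a] each have probability 0.25, so p("a"|X)=0.75.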

  • Inference: a simple approximation picks the single most likely token at each time-step, but this doesn’t take into account the fact that a single output can have many alignments:

Y^*=argmax_Y p(Y|X)
A^*=argmax_A \prod_{t=1}^T p_t(a_t|X)
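This approximation is greedy (best-path) decoding: take the argmax at every time-step, then collapse the resulting alignment. A minimal sketch, with names and encoding assumed for illustration:

```python
import numpy as np

def best_path_decode(probs, blank=0):
    """Greedy CTC decoding: argmax token per time-step,
    then merge repeats and drop blanks."""
    path = probs.argmax(axis=1)
    out, prev = [], None
    for token in path:
        if token != prev and token != blank:
            out.append(int(token))
        prev = token
    return out
```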

5. Properties of CTC

  • Conditional Independence: given the input, CTC assumes the output at each time-step is conditionally independent of the outputs at other time-steps. This is a strong assumption, and in practice a language model is often used to rescore outputs.

  • Alignment Properties

CTC only allows monotonic alignments. In problems such as speech recognition this may be a valid assumption. For other problems like machine translation where a future word in a target sentence can align to an earlier part of the source sentence, this assumption is a deal-breaker.

6. Usage

  • Baidu Research has open-sourced warp-ctc. The package is written in C++ and CUDA. The CTC loss function runs on either the CPU or the GPU. Bindings are available for Torch, TensorFlow and PyTorch.
  • TensorFlow has built-in CTC loss and CTC beam search functions for the CPU.
  • Nvidia also provides a GPU implementation of CTC in cuDNN versions 7 and up.

Computing the CTC loss naively in probability space is numerically unstable, since the α's underflow for long inputs; a common fix is to normalize the α's at each time-step.
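A sketch of that rescaling trick: at each time-step the forward variables α are renormalized to sum to one, and the log of each normalizer is accumulated so the log-likelihood can still be recovered. The function name and integer token encoding (0 as ϵ) are illustrative assumptions:

```python
import numpy as np

def ctc_log_likelihood_rescaled(probs, y, blank=0):
    """CTC forward algorithm with per-time-step normalization of the alphas,
    so long inputs don't underflow. probs: (T, V); y: target token ids."""
    T = probs.shape[0]
    z = [blank]                        # Z = [eps, y_1, eps, ..., y_U, eps]
    for c in y:
        z += [c, blank]
    S = len(z)
    alpha = np.zeros(S)
    alpha[0] = probs[0, blank]
    alpha[1] = probs[0, z[1]]
    norm = alpha.sum()
    log_p = np.log(norm)               # keep the normalizer in log-space
    alpha = alpha / norm
    for t in range(1, T):
        new = np.zeros(S)
        for s in range(S):
            a = alpha[s] + (alpha[s - 1] if s > 0 else 0.0)
            if s > 1 and z[s] != blank and z[s] != z[s - 2]:
                a += alpha[s - 2]      # skip the eps between distinct tokens
            new[s] = a * probs[t, z[s]]
        norm = new.sum()
        log_p += np.log(norm)
        alpha = new / norm
    return log_p + np.log(alpha[S - 1] + alpha[S - 2])
```

The returned value is log p(Y|X); even for inputs hundreds of steps long, no individual α ever underflows.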

A common question when using a beam search decoder is the size of the beam to use. There is a trade-off between accuracy and runtime.
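A toy beam search over alignments illustrates the beam-size knob. Note that this sketch only keeps the top-k alignments and does not merge alignments that collapse to the same output, which the full CTC beam search described in the article does; all names are illustrative:

```python
import numpy as np

def naive_beam_decode(probs, beam_size=3, blank=0):
    """Keep the `beam_size` highest-scoring partial alignments per step,
    then collapse the best complete alignment."""
    T, V = probs.shape
    beams = [((), 0.0)]                        # (alignment, log-probability)
    for t in range(T):
        candidates = [
            (path + (v,), score + np.log(probs[t, v]))
            for path, score in beams
            for v in range(V)
        ]
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_size]         # prune to the beam size
    best, _ = beams[0]
    out, prev = [], None                       # collapse: merge repeats, drop blanks
    for token in best:
        if token != prev and token != blank:
            out.append(token)
        prev = token
    return out
```

With beam_size=1 this reduces to greedy decoding; a larger beam explores more alignments at a higher cost per step, which is the accuracy/runtime trade-off.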

**From:** https://distill.pub/2017/ctc/
