David D. Palmer, Chapter 2: Tokenisation and Sentence Segmentation. 2000
https://scholar.google.com/citations?user=flDouC0AAAAJ&hl=zh-CN

Word segmentation is the same as tokenisation, but sentence segmentation is a different task.

Tokenisation is the process of breaking up the sequence of characters in a text by locating the word boundaries, the points where one word ends and another begins. For computational linguistics purposes, the words thus identified are frequently referred to as tokens. In written languages where no word boundaries are explicitly marked in the writing system, tokenisation is also known as word segmentation, and this term is frequently used synonymously with tokenisation.
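For a space-delimited language like English, the idea above can be sketched with a simple regular expression that separates word characters from punctuation. This is my own minimal illustration, not the chapter's method; real tokenisers must also handle abbreviations, numbers, hyphenation, and clitics:

```python
import re

def tokenise(text):
    """Naive tokeniser: word-character runs become tokens, and each
    punctuation mark becomes its own token. A sketch only -- it wrongly
    splits contractions ("isn't") and abbreviations ("Dr.")."""
    # \w+ matches a run of word characters; [^\w\s] matches one punctuation mark
    return re.findall(r"\w+|[^\w\s]", text)

print(tokenise("Dr. Smith isn't here."))
# → ['Dr', '.', 'Smith', 'isn', "'", 't', 'here', '.']
```

The output shows exactly why tokenisation is non-trivial: the period after "Dr" is not a word boundary in the intended sense, and "isn't" should arguably be one or two tokens, not four.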

Sentence segmentation is the process of determining the longer
processing units consisting of one or more words. This task involves
identifying sentence boundaries between words in different sentences.
Since most written languages have punctuation marks which occur at
sentence boundaries, sentence segmentation is frequently referred to
as sentence boundary detection, sentence boundary disambiguation, or
sentence boundary recognition.
All these terms refer to the same
task: determining how a text should be divided into sentences for
further processing.
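A baseline sentence boundary detector can be sketched as a rule: split after a sentence-final punctuation mark when it is followed by whitespace and a capital letter. This sketch is my own illustration of the task, not an algorithm from the chapter, and it fails on exactly the ambiguous cases (e.g. abbreviations like "Dr.") that make boundary disambiguation a research problem:

```python
import re

def split_sentences(text):
    """Naive sentence boundary detection: break after '.', '!', or '?'
    when followed by whitespace and an uppercase letter. A sketch only --
    it mis-splits after abbreviations such as "Dr. Smith"."""
    # Lookbehind keeps the punctuation with its sentence;
    # lookahead requires the next sentence to start with a capital.
    parts = re.split(r"(?<=[.!?])\s+(?=[A-Z])", text)
    return [p.strip() for p in parts if p.strip()]

print(split_sentences("It rained. We stayed inside! Did you see?"))
# → ['It rained.', 'We stayed inside!', 'Did you see?']
```

On the ambiguous input "Dr. Smith arrived." this rule incorrectly splits after "Dr.", which is the disambiguation problem the terms above all refer to.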

Tokenisation & word segmentation & sentence segmentation

Related articles: