【Question title】: Does Sphinx handle content in Asian languages well?
【Posted】: 2014-10-08 18:14:40
【Question】:

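I'm considering using Sphinx as the search engine for my website. Since I have a lot of Korean content, and other languages such as Chinese and Thai may follow, I'd like to know how well Sphinx handles this kind of content.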

【Comments】:

    Tags: internationalization sphinx cjk


    【Answer 1】:

    I use Sphinx to search CJK text (Chinese, Japanese, and Korean). You need to add the following lines to the index block of your configuration file.

    index test {
      ...
      charset_type = utf-8
      ngram_len = 1
      ngram_chars = U+3000..U+2FA1F
    }
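
To make the effect of `ngram_len = 1` more concrete, here is a rough Python sketch of the idea: every character inside the `ngram_chars` range becomes its own token, while other alphanumeric runs stay whole words. This only illustrates the concept and is not Sphinx's actual tokenizer; the helper names `is_ngram_char` and `tokenize` are made up for this example, and the lowercasing of non-CJK words is an assumption.

```python
# Range copied from the ngram_chars line above: U+3000..U+2FA1F.
NGRAM_LO, NGRAM_HI = 0x3000, 0x2FA1F


def is_ngram_char(ch):
    """True if ch falls inside the configured ngram_chars range."""
    return NGRAM_LO <= ord(ch) <= NGRAM_HI


def tokenize(text):
    """Approximate ngram_len = 1 (hypothetical helper, not Sphinx code).

    Characters in the ngram range are emitted one per token; other
    alphanumeric characters are grouped into lowercased words.
    """
    tokens = []
    word = []
    for ch in text:
        if is_ngram_char(ch):
            if word:  # flush any pending non-CJK word first
                tokens.append("".join(word))
                word = []
            tokens.append(ch)
        elif ch.isalnum():
            word.append(ch.lower())
        else:  # separator: flush the pending word
            if word:
                tokens.append("".join(word))
                word = []
    if word:
        tokens.append("".join(word))
    return tokens


print(tokenize("Sphinx 검색 엔진"))  # ['sphinx', '검', '색', '엔', '진']
```

Because each Hangul syllable is indexed separately, a query for 검색 can match as two adjacent one-character tokens, without needing a word segmenter.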
    

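    【Comments】:

    • I found your answer very useful.
    【Answer 2】:

    Sphinx works with UTF-8 characters (including Korean, I believe), but you have to list in the Sphinx configuration file the UTF-8 character codes you want indexed.

    Here is what my charset_table variable looks like in the Sphinx config, set up to add various characters from European languages: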

    charset_table       = 0..9, A..Z, U+00C0..U+00DE, U+0100, U+0102, U+0104, U+0106, U+0108, U+010A, U+010C, U+010E, U+0110, U+0112, U+0114, U+0116, U+0118, U+011A, U+011C, U+011E, U+0120, U+0122, U+0124, U+0126, U+0128, U+012A, U+012C, U+012E, U+0130, U+0132, U+0134, U+0136, U+0139, U+013B, U+013D, U+013F, U+0141, U+0143, U+0145, U+0147, U+014A, U+014C, U+014E, U+0150, U+0152, U+0154, U+0156, U+0158, U+015A, U+015C, U+015E, U+0160, U+0162, U+0164, U+0166, U+0168, U+016A, U+016C, U+016E, U+0170, U+0172, U+0174, U+0176, U+0178, U+0179, U+017B, U+017D, a..z, U+00DF..U+00F6, U+00F8..U+00FF, U+0101, U+0103, U+0105, U+0107, U+0109, U+010B, U+010D, U+010F, U+0111, U+0113, U+0115, U+0117, U+0119, U+011B, U+011D, U+011F, U+0121, U+0123, U+0125, U+0127, U+0129, U+012B, U+012D, U+012F, U+0131, U+0133, U+0135, U+0137, U+0138, U+013A, U+013C, U+013E, U+0140, U+0142, U+0144, U+0146, U+0148, U+0149, U+014B, U+014D, U+014F, U+0151, U+0153, U+0155, U+0157, U+0159, U+015B, U+015D, U+015F, U+0161, U+0163, U+0165, U+0167, U+0169, U+016B, U+016D, U+016F, U+0171, U+0173, U+0175, U+0177, U+017A, U+017C, U+017E, U+017F, U+0027
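
One quick way to see which characters a table like this covers is to parse the range list and test individual characters against it. The sketch below is a minimal illustration that handles only the plain forms used above (`A..Z`, `U+00C0..U+00DE`, single `U+0027`); it ignores the mapping and stride syntax that `charset_table` also supports. The function names and the shortened `EUROPEAN` excerpt are assumptions made for the example.

```python
def parse_codepoint(token):
    """Turn 'U+00C0' or a literal character such as 'A' into a code point."""
    token = token.strip()
    if token.upper().startswith("U+"):
        return int(token[2:], 16)
    if len(token) == 1:
        return ord(token)
    raise ValueError("unsupported token: %r" % token)


def parse_ranges(spec):
    """Parse a comma-separated list of single code points and lo..hi ranges."""
    ranges = []
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue
        if ".." in item:
            lo, hi = item.split("..", 1)
            ranges.append((parse_codepoint(lo), parse_codepoint(hi)))
        else:
            cp = parse_codepoint(item)
            ranges.append((cp, cp))
    return ranges


def covers(ranges, ch):
    """True if the character lies in any of the parsed ranges."""
    return any(lo <= ord(ch) <= hi for lo, hi in ranges)


# Shortened excerpt of the European table above (not the full list).
EUROPEAN = parse_ranges("0..9, A..Z, a..z, U+00C0..U+00DE, U+00DF..U+00F6, U+00F8..U+00FF")
# Range from the ngram_chars setting in Answer 1.
CJK = parse_ranges("U+3000..U+2FA1F")

print(covers(EUROPEAN, "é"))   # True
print(covers(EUROPEAN, "한"))  # False
print(covers(CJK, "한"))       # True
```

Run against this excerpt, Hangul is not covered, which suggests that a European-only table would need CJK ranges added before Korean text could be indexed.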
    

    【Comments】:

      【Answer 3】:

      In Thinking Sphinx 3:

      Create a thinking_sphinx.yml file in the config folder and add these lines:

      development:
        enable_star: 1
        min_infix_len: 3
        ngram_len: 1
        ngram_chars: U+4E00..U+9FBB, U+3400..U+4DB5, U+20000..U+2A6D6, U+FA0E, U+FA0F, U+FA11, U+FA13, U+FA14, U+FA1F, U+FA21, U+FA23, U+FA24, U+FA27, U+FA28, U+FA29, U+3105..U+312C, U+31A0..U+31B7, U+3041, U+3043, U+3045, U+3047, U+3049, U+304B, U+304D, U+304F, U+3051, U+3053, U+3055, U+3057, U+3059, U+305B, U+305D, U+305F, U+3061, U+3063, U+3066, U+3068, U+306A..U+306F, U+3072, U+3075, U+3078, U+307B, U+307E..U+3083, U+3085, U+3087, U+3089..U+308E, U+3090..U+3093, U+30A1, U+30A3, U+30A5, U+30A7, U+30A9, U+30AD, U+30AF, U+30B3, U+30B5, U+30BB, U+30BD, U+30BF, U+30C1, U+30C3, U+30C4, U+30C6, U+30CA, U+30CB, U+30CD, U+30CE, U+30DE, U+30DF, U+30E1, U+30E2, U+30E3, U+30E5, U+30E7, U+30EE, U+30F0..U+30F3, U+30F5, U+30F6, U+31F0, U+31F1, U+31F2, U+31F3, U+31F4, U+31F5, U+31F6, U+31F7, U+31F8, U+31F9, U+31FA, U+31FB, U+31FC, U+31FD, U+31FE, U+31FF, U+AC00..U+D7A3, U+1100..U+1159, U+1161..U+11A2, U+11A8..U+11F9, U+A000..U+A48C, U+A492..U+A4C6
        charset_table: 0..9, A..Z, U+00C0..U+00DE, U+0100, U+0102, U+0104, U+0106, U+0108, U+010A, U+010C, U+010E, U+0110, U+0112, U+0114, U+0116, U+0118, U+011A, U+011C, U+011E, U+0120, U+0122, U+0124, U+0126, U+0128, U+012A, U+012C, U+012E, U+0130, U+0132, U+0134, U+0136, U+0139, U+013B, U+013D, U+013F, U+0141, U+0143, U+0145, U+0147, U+014A, U+014C, U+014E, U+0150, U+0152, U+0154, U+0156, U+0158, U+015A, U+015C, U+015E, U+0160, U+0162, U+0164, U+0166, U+0168, U+016A, U+016C, U+016E, U+0170, U+0172, U+0174, U+0176, U+0178, U+0179, U+017B, U+017D, a..z, U+00DF..U+00F6, U+00F8..U+00FF, U+0101, U+0103, U+0105, U+0107, U+0109, U+010B, U+010D, U+010F, U+0111, U+0113, U+0115, U+0117, U+0119, U+011B, U+011D, U+011F, U+0121, U+0123, U+0125, U+0127, U+0129, U+012B, U+012D, U+012F, U+0131, U+0133, U+0135, U+0137, U+0138, U+013A, U+013C, U+013E, U+0140, U+0142, U+0144, U+0146, U+0148, U+0149, U+014B, U+014D, U+014F, U+0151, U+0153, U+0155, U+0157, U+0159, U+015B, U+015D, U+015F, U+0161, U+0163, U+0165, U+0167, U+0169, U+016B, U+016D, U+016F, U+0171, U+0173, U+0175, U+0177, U+017A, U+017C, U+017E, U+017F, U+0027
      test:
        enable_star: 1
        min_infix_len: 1
      production:
        enable_star: 1
        min_infix_len: 3
        ngram_len: 1
        ngram_chars: U+4E00..U+9FBB, U+3400..U+4DB5, U+20000..U+2A6D6, U+FA0E, U+FA0F, U+FA11, U+FA13, U+FA14, U+FA1F, U+FA21, U+FA23, U+FA24, U+FA27, U+FA28, U+FA29, U+3105..U+312C, U+31A0..U+31B7, U+3041, U+3043, U+3045, U+3047, U+3049, U+304B, U+304D, U+304F, U+3051, U+3053, U+3055, U+3057, U+3059, U+305B, U+305D, U+305F, U+3061, U+3063, U+3066, U+3068, U+306A..U+306F, U+3072, U+3075, U+3078, U+307B, U+307E..U+3083, U+3085, U+3087, U+3089..U+308E, U+3090..U+3093, U+30A1, U+30A3, U+30A5, U+30A7, U+30A9, U+30AD, U+30AF, U+30B3, U+30B5, U+30BB, U+30BD, U+30BF, U+30C1, U+30C3, U+30C4, U+30C6, U+30CA, U+30CB, U+30CD, U+30CE, U+30DE, U+30DF, U+30E1, U+30E2, U+30E3, U+30E5, U+30E7, U+30EE, U+30F0..U+30F3, U+30F5, U+30F6, U+31F0, U+31F1, U+31F2, U+31F3, U+31F4, U+31F5, U+31F6, U+31F7, U+31F8, U+31F9, U+31FA, U+31FB, U+31FC, U+31FD, U+31FE, U+31FF, U+AC00..U+D7A3, U+1100..U+1159, U+1161..U+11A2, U+11A8..U+11F9, U+A000..U+A48C, U+A492..U+A4C6
        charset_table: 0..9, A..Z, U+00C0..U+00DE, U+0100, U+0102, U+0104, U+0106, U+0108, U+010A, U+010C, U+010E, U+0110, U+0112, U+0114, U+0116, U+0118, U+011A, U+011C, U+011E, U+0120, U+0122, U+0124, U+0126, U+0128, U+012A, U+012C, U+012E, U+0130, U+0132, U+0134, U+0136, U+0139, U+013B, U+013D, U+013F, U+0141, U+0143, U+0145, U+0147, U+014A, U+014C, U+014E, U+0150, U+0152, U+0154, U+0156, U+0158, U+015A, U+015C, U+015E, U+0160, U+0162, U+0164, U+0166, U+0168, U+016A, U+016C, U+016E, U+0170, U+0172, U+0174, U+0176, U+0178, U+0179, U+017B, U+017D, a..z, U+00DF..U+00F6, U+00F8..U+00FF, U+0101, U+0103, U+0105, U+0107, U+0109, U+010B, U+010D, U+010F, U+0111, U+0113, U+0115, U+0117, U+0119, U+011B, U+011D, U+011F, U+0121, U+0123, U+0125, U+0127, U+0129, U+012B, U+012D, U+012F, U+0131, U+0133, U+0135, U+0137, U+0138, U+013A, U+013C, U+013E, U+0140, U+0142, U+0144, U+0146, U+0148, U+0149, U+014B, U+014D, U+014F, U+0151, U+0153, U+0155, U+0157, U+0159, U+015B, U+015D, U+015F, U+0161, U+0163, U+0165, U+0167, U+0169, U+016B, U+016D, U+016F, U+0171, U+0173, U+0175, U+0177, U+017A, U+017C, U+017E, U+017F, U+0027
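
Long lists like the ones above are easier to maintain when consecutive code points are collapsed into ranges; for instance, the sixteen entries U+31F0 through U+31FF could be written as one range. The following sketch shows one way to generate that form. The helper names `fmt` and `to_sphinx_ranges` are hypothetical, and the sketch assumes that a range and the equivalent run of single entries mean the same thing to `ngram_chars`.

```python
def fmt(cp):
    """Format a code point in the U+XXXX style used in the config."""
    return "U+%04X" % cp


def to_sphinx_ranges(codepoints):
    """Collapse runs of consecutive code points into 'U+A..U+B' ranges."""
    cps = sorted(set(codepoints))
    parts = []
    i = 0
    while i < len(cps):
        start = end = cps[i]
        # Extend the run while the next code point is adjacent.
        while i + 1 < len(cps) and cps[i + 1] == end + 1:
            i += 1
            end = cps[i]
        parts.append(fmt(start) if start == end else fmt(start) + ".." + fmt(end))
        i += 1
    return ", ".join(parts)


print(to_sphinx_ranges(range(0x31F0, 0x3200)))    # U+31F0..U+31FF
print(to_sphinx_ranges([0x3041, 0x3043, 0x3045]))  # U+3041, U+3043, U+3045
```

The resulting string can be pasted after `ngram_chars:` in each environment block.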
      

      See Unicode Character Set Tables for more information.

      【Comments】:
