【Question title】: Converting punycode with dash character to Unicode
【Posted】: 2010-09-16 01:23:01
【Question】:

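I need to convert the punycode NIATO-OTABD to nñiñatoñ.

The other day I found a text converter in JavaScript, but its punycode conversion doesn't work if there is a dash in the middle of the string.

Any suggestions for fixing the "dash" issue?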

【Comments】:

    Tags: javascript unicode punycode


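    【Answer 1】:

    I took the time to create the punycode below. It is based on the C code in RFC 3492. To use it with domain names, you have to strip xn-- from the input before decoding and add xn-- to the output after encoding.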

    The utf16 class is necessary to convert from JavaScript's internal character representation (UTF-16) to Unicode code points and back.
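
    As a rough illustration of what that conversion involves, the sketch below turns a string into an array of Unicode code points and back using the built-in `codePointAt` and `String.fromCodePoint` (ES2015). The names `toCodePoints` and `fromCodePoints` are hypothetical and not part of the answer's code, and unlike the utf16 class this sketch does not throw on lone surrogates.

```javascript
// Hypothetical helpers, not part of the answer's punycode object.
function toCodePoints(str) {
    // Array.from iterates a string by code point, so a surrogate pair yields one entry.
    return Array.from(str, function (ch) { return ch.codePointAt(0); });
}

function fromCodePoints(codePoints) {
    // String.fromCodePoint emits a surrogate pair for values above 0xFFFF.
    return String.fromCodePoint.apply(null, codePoints);
}

var cps = toCodePoints("a\uD83D\uDE00"); // "a" followed by U+1F600
console.log(cps);                        // two entries: 97 and 128512
console.log(fromCodePoints(cps).length); // 3 UTF-16 code units
```

    Punycode operates on code points, which is why a character outside the Basic Multilingual Plane has to be counted as one value rather than as two UTF-16 code units.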

    There are also ToASCII and ToUnicode functions to make it easier to convert between puny-coded IDN and ASCII.
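
    The per-label handling of the xn-- prefix can be sketched on its own as follows. This is a minimal sketch rather than the answer's code: `mapLabels`, `toAsciiWith` and `toUnicodeWith` are hypothetical names, and the encoder and decoder are passed in as parameters so that any Punycode implementation can be plugged in. A stub codec is used here only to make the label handling visible.

```javascript
var ACE_PREFIX = "xn--";

// Apply fn to each dot-separated label of a domain name.
function mapLabels(domain, fn) {
    return domain.split(".").map(fn).join(".");
}

// encode: function(string) -> Punycode string (assumed, supplied by the caller)
function toAsciiWith(encode, domain) {
    return mapLabels(domain, function (label) {
        // Only labels containing something other than letters, digits or "-" are encoded.
        return /[^A-Za-z0-9-]/.test(label) ? ACE_PREFIX + encode(label) : label;
    });
}

// decode: function(string) -> Unicode string (assumed, supplied by the caller)
function toUnicodeWith(decode, domain) {
    return mapLabels(domain, function (label) {
        // Only labels carrying the prefix are decoded; the prefix is stripped first.
        return label.indexOf(ACE_PREFIX) === 0 ? decode(label.slice(ACE_PREFIX.length)) : label;
    });
}

// Stub codec: wraps its argument so the text handed to the codec is visible.
function stub(s) { return "<" + s + ">"; }
console.log(toUnicodeWith(stub, "www.xn--niato-otabd.com")); // www.<niato-otabd>.com
console.log(toAsciiWith(stub, "www.nñiñatoñ.com"));          // www.xn--<nñiñatoñ>.com
```

    Note that the decoder receives niato-otabd with its inner dash intact; only the leading xn-- is removed.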

    //Javascript Punycode converter derived from example in RFC3492.
    //This implementation is created by some@domain.name and released into public domain
    var punycode = new function Punycode() {
        // This object converts to and from puny-code used in IDN
        //
        // punycode.ToASCII ( domain )
        // 
        // Returns a puny coded representation of "domain".
        // It only converts the part of the domain name that
        // has non-ASCII characters. I.e. it doesn't matter if
        // you call it with a domain that already is in ASCII.
        //
        // punycode.ToUnicode (domain)
        //
        // Converts a puny-coded domain name to unicode.
        // It only converts the puny-coded parts of the domain name.
        // I.e. it doesn't matter if you call it on a string
        // that already has been converted to unicode.
        //
        //
        this.utf16 = {
            // The utf16-class is necessary to convert from javascripts internal character representation to unicode and back.
            decode:function(input){
                var output = [], i=0, len=input.length,value,extra;
                while (i < len) {
                    value = input.charCodeAt(i++);
                    if ((value & 0xF800) === 0xD800) {
                        extra = input.charCodeAt(i++);
                        if ( ((value & 0xFC00) !== 0xD800) || ((extra & 0xFC00) !== 0xDC00) ) {
                            throw new RangeError("UTF-16(decode): Illegal UTF-16 sequence");
                        }
                        value = ((value & 0x3FF) << 10) + (extra & 0x3FF) + 0x10000;
                    }
                    output.push(value);
                }
                return output;
            },
            encode:function(input){
                var output = [], i=0, len=input.length,value;
                while (i < len) {
                    value = input[i++];
                    if ( (value & 0xF800) === 0xD800 ) {
                        throw new RangeError("UTF-16(encode): Illegal UTF-16 value");
                    }
                    if (value > 0xFFFF) {
                        value -= 0x10000;
                        output.push(String.fromCharCode(((value >>>10) & 0x3FF) | 0xD800));
                        value = 0xDC00 | (value & 0x3FF);
                    }
                    output.push(String.fromCharCode(value));
                }
                return output.join("");
            }
        }
    
        //Default parameters
        var initial_n = 0x80;
        var initial_bias = 72;
        var delimiter = "\x2D";
        var base = 36;
        var damp = 700;
        var tmin=1;
        var tmax=26;
        var skew=38;
        var maxint = 0x7FFFFFFF;
    
        // decode_digit(cp) returns the numeric value of a basic code 
        // point (for use in representing integers) in the range 0 to
        // base-1, or base if cp does not represent a value.
    
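        function decode_digit(cp) {
            // Explicit lower bounds: the C original relies on unsigned wraparound, which JavaScript lacks.
            return cp >= 48 && cp < 58 ? cp - 22 : cp >= 65 && cp < 91 ? cp - 65 : cp >= 97 && cp < 123 ? cp - 97 : base;
        }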
    
        // encode_digit(d,flag) returns the basic code point whose value
        // (when used for representing integers) is d, which needs to be in
        // the range 0 to base-1. The lowercase form is used unless flag is
        // nonzero, in which case the uppercase form is used. The behavior
        // is undefined if flag is nonzero and digit d has no uppercase form. 
    
        function encode_digit(d, flag) {
            return d + 22 + 75 * (d < 26) - ((flag != 0) << 5);
            //  0..25 map to ASCII a..z or A..Z 
            // 26..35 map to ASCII 0..9
        }
        //** Bias adaptation function **
        function adapt(delta, numpoints, firsttime ) {
            var k;
            delta = firsttime ? Math.floor(delta / damp) : (delta >> 1);
            delta += Math.floor(delta / numpoints);
    
            for (k = 0; delta > (((base - tmin) * tmax) >> 1); k += base) {
                    delta = Math.floor(delta / ( base - tmin ));
            }
            return Math.floor(k + (base - tmin + 1) * delta / (delta + skew));
        }
    
        // encode_basic(bcp,flag) forces a basic code point to lowercase if flag is zero,
        // uppercase if flag is nonzero, and returns the resulting code point.
        // The code point is unchanged if it is caseless.
        // The behavior is undefined if bcp is not a basic code point.
    
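        function encode_basic(bcp, flag) {
            // Explicit range checks: the C original relies on unsigned wraparound, which JavaScript lacks.
            bcp -= (bcp >= 97 && bcp < 123) << 5;
            return bcp + ((!flag && bcp >= 65 && bcp < 91) << 5);
        }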
    
        // Main decode
        this.decode=function(input,preserveCase) {
            // The input is plain ASCII, so it does not need to go through utf16.decode
            var output=[];
            var case_flags=[];
            var input_length = input.length;
    
            var n, out, i, bias, basic, j, ic, oldi, w, k, digit, t, len;
    
            // Initialize the state: 
    
            n = initial_n;
            i = 0;
            bias = initial_bias;
    
            // Handle the basic code points: Let basic be the number of input code 
            // points before the last delimiter, or 0 if there is none, then
            // copy the first basic code points to the output.
    
            basic = input.lastIndexOf(delimiter);
            if (basic < 0) basic = 0;
    
            for (j = 0; j < basic; ++j) {
                if (preserveCase) case_flags[output.length] = (input.charCodeAt(j) >= 65 && input.charCodeAt(j) < 91);
                if ( input.charCodeAt(j) >= 0x80) {
                    throw new RangeError("Illegal input >= 0x80");
                }
                output.push( input.charCodeAt(j) );
            }
    
            // Main decoding loop: Start just after the last delimiter if any
            // basic code points were copied; start at the beginning otherwise. 
    
            for (ic = basic > 0 ? basic + 1 : 0; ic < input_length; ) {
    
                // ic is the index of the next character to be consumed,
    
                // Decode a generalized variable-length integer into delta,
                // which gets added to i. The overflow checking is easier
                // if we increase i as we go, then subtract off its starting 
                // value at the end to obtain delta.
                for (oldi = i, w = 1, k = base; ; k += base) {
                        if (ic >= input_length) {
                            throw RangeError ("punycode_bad_input(1)");
                        }
                        digit = decode_digit(input.charCodeAt(ic++));
    
                        if (digit >= base) {
                            throw RangeError("punycode_bad_input(2)");
                        }
                        if (digit > Math.floor((maxint - i) / w)) {
                            throw RangeError ("punycode_overflow(1)");
                        }
                        i += digit * w;
                        t = k <= bias ? tmin : k >= bias + tmax ? tmax : k - bias;
                        if (digit < t) { break; }
                        if (w > Math.floor(maxint / (base - t))) {
                            throw RangeError("punycode_overflow(2)");
                        }
                        w *= (base - t);
                }
    
                out = output.length + 1;
                bias = adapt(i - oldi, out, oldi === 0);
    
                // i was supposed to wrap around from out to 0,
                // incrementing n each time, so we'll fix that now: 
                if ( Math.floor(i / out) > maxint - n) {
                    throw RangeError("punycode_overflow(3)");
                }
                n += Math.floor( i / out ) ;
                i %= out;
    
                // Insert n at position i of the output: 
                // Case of last character determines uppercase flag: 
                if (preserveCase) { case_flags.splice(i, 0, input.charCodeAt(ic - 1) >= 65 && input.charCodeAt(ic - 1) < 91); }
    
                output.splice(i, 0, n);
                i++;
            }
            if (preserveCase) {
                for (i = 0, len = output.length; i < len; i++) {
                    if (case_flags[i]) {
                        output[i] = (String.fromCharCode(output[i]).toUpperCase()).charCodeAt(0);
                    }
                }
            }
            return this.utf16.encode(output);
        };
    
        //** Main encode function **
    
        this.encode = function (input,preserveCase) {
            // Encodes a Unicode string as Punycode. If preserveCase is set,
            // the case of each input character is recorded in the output.
    
            var n, delta, h, b, bias, j, m, q, k, t, ijv, case_flags;
    
            if (preserveCase) {
                // Preserve case, step1 of 2: Get a list of the unaltered string
                case_flags = this.utf16.decode(input);
            }
            // Converts the input in UTF-16 to Unicode
            input = this.utf16.decode(input.toLowerCase());
    
            var input_length = input.length; // Cache the length
    
            if (preserveCase) {
                // Preserve case, step2 of 2: Modify the list to true/false
                for (j=0; j < input_length; j++) {
                    case_flags[j] = input[j] != case_flags[j];
                }
            }
    
            var output=[];
    
    
            // Initialize the state: 
            n = initial_n;
            delta = 0;
            bias = initial_bias;
    
            // Handle the basic code points: 
            for (j = 0; j < input_length; ++j) {
                if ( input[j] < 0x80) {
                    output.push(
                        String.fromCharCode(
                            case_flags ? encode_basic(input[j], case_flags[j]) : input[j]
                        )
                    );
                }
            }
    
            h = b = output.length;
    
            // h is the number of code points that have been handled, b is the
            // number of basic code points 
    
            if (b > 0) output.push(delimiter);
    
            // Main encoding loop: 
            //
            while (h < input_length) {
                // All non-basic code points < n have been
                // handled already. Find the next larger one: 
    
                for (m = maxint, j = 0; j < input_length; ++j) {
                    ijv = input[j];
                    if (ijv >= n && ijv < m) m = ijv;
                }
    
                // Increase delta enough to advance the decoder's
                // <n,i> state to <m,0>, but guard against overflow: 
    
                if (m - n > Math.floor((maxint - delta) / (h + 1))) {
                    throw RangeError("punycode_overflow (1)");
                }
                delta += (m - n) * (h + 1);
                n = m;
    
                for (j = 0; j < input_length; ++j) {
                    ijv = input[j];
    
                    if (ijv < n ) {
                        if (++delta > maxint) throw RangeError("punycode_overflow(2)");
                    }
    
                    if (ijv == n) {
                        // Represent delta as a generalized variable-length integer: 
                        for (q = delta, k = base; ; k += base) {
                            t = k <= bias ? tmin : k >= bias + tmax ? tmax : k - bias;
                            if (q < t) break;
                            output.push( String.fromCharCode(encode_digit(t + (q - t) % (base - t), 0)) );
                            q = Math.floor( (q - t) / (base - t) );
                        }
                        output.push( String.fromCharCode(encode_digit(q, preserveCase && case_flags[j] ? 1:0 )));
                        bias = adapt(delta, h + 1, h == b);
                        delta = 0;
                        ++h;
                    }
                }
    
                ++delta, ++n;
            }
            return output.join("");
        }
    
        this.ToASCII = function ( domain ) {
            var domain_array = domain.split(".");
            var out = [];
            for (var i=0; i < domain_array.length; ++i) {
                var s = domain_array[i];
                out.push(
                    s.match(/[^A-Za-z0-9-]/) ?
                    "xn--" + punycode.encode(s) :
                    s
                );
            }
            return out.join(".");
        }
        this.ToUnicode = function ( domain ) {
            var domain_array = domain.split(".");
            var out = [];
            for (var i=0; i < domain_array.length; ++i) {
                var s = domain_array[i];
                out.push(
                    s.match(/^xn--/i) ?
                    punycode.decode(s.slice(4)) :
                    s
                );
            }
            return out.join(".");
        }
    }();
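
    To make the handling of the dash concrete: in RFC 3492 the last delimiter in the input separates the literal basic code points from the encoded insertions, so everything before the final "-" is copied through unchanged, including any earlier dashes. The following is a condensed, standalone sketch of that decoding procedure, checked against the example from the question. The name `decodeSketch` is hypothetical, and overflow checks and case preservation are left out for brevity.

```javascript
// Condensed decoder sketch following the RFC 3492 procedure (no overflow checks).
function decodeSketch(input) {
    var base = 36, tmin = 1, tmax = 26, skew = 38, damp = 700;
    var n = 128, bias = 72, i = 0;

    // Everything before the LAST "-" is literal ASCII; the rest encodes insertions.
    var split = input.lastIndexOf("-");
    var output = split > 0 ? Array.from(input.slice(0, split)) : [];
    var pos = split > 0 ? split + 1 : 0;

    function digitOf(ch) {
        var cp = ch.charCodeAt(0);
        if (cp >= 48 && cp < 58) return cp - 22;  // 0-9 -> 26..35
        if (cp >= 65 && cp < 91) return cp - 65;  // A-Z -> 0..25
        if (cp >= 97 && cp < 123) return cp - 97; // a-z -> 0..25
        throw new RangeError("not a Punycode digit: " + ch);
    }

    function adapt(delta, numPoints, first) {
        delta = first ? Math.floor(delta / damp) : delta >> 1;
        delta += Math.floor(delta / numPoints);
        var k = 0;
        while (delta > ((base - tmin) * tmax) >> 1) {
            delta = Math.floor(delta / (base - tmin));
            k += base;
        }
        return k + Math.floor((base - tmin + 1) * delta / (delta + skew));
    }

    while (pos < input.length) {
        var oldi = i, w = 1;
        // Read one generalized variable-length integer and add it to i.
        for (var k = base; ; k += base) {
            if (pos >= input.length) throw new RangeError("truncated input");
            var digit = digitOf(input.charAt(pos++));
            i += digit * w;
            var t = k <= bias ? tmin : k >= bias + tmax ? tmax : k - bias;
            if (digit < t) break;
            w *= base - t;
        }
        var len = output.length + 1;
        bias = adapt(i - oldi, len, oldi === 0);
        n += Math.floor(i / len); // each wrap past the end advances the code point
        i %= len;
        output.splice(i++, 0, String.fromCodePoint(n)); // insert code point n at position i
    }
    return output.join("");
}

console.log(decodeSketch("niato-otabd")); // nñiñatoñ
console.log(decodeSketch("a-b-"));        // a-b
```

    In the first call only niato is literal text and otabd describes where the three ñ characters go. In the second call the trailing dash is the delimiter and the inner dash is ordinary text, so nothing is left to decode after the split.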
    

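    Update (license):
    From RFC3492:

    Disclaimer and license

    Regarding this entire document or any portion of it (including the pseudocode and C code), the author makes no guarantees and is not responsible for any damage resulting from its use. The author grants irrevocable permission to anyone to use, modify, and distribute it in any way that does not diminish the rights of anyone else to use, modify, and distribute it, provided that redistributed derivative works do not contain misleading author or version information. Derivative works need not be licensed under similar terms.

    I release my work on this punycode and utf16 code into the public domain. It would be nice to get an email telling me which project you use it in.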

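    【Comments】:

    • If you don't provide a valid email address somewhere in your user profile, users can't email you. The convention is to put it in the "About Me" field.
    • @Jeff: I thought it was already there when I wrote this. Fixed.
    • Great work, some! This is one of the Punycode implementations I compared against when writing my own. I hope you don't mind that I reused your UTF16 class :)
    • @Mathias Bynens: Thanks! I have no problem with you reusing the code, that's what it's there for! But I'm curious why you felt you needed to write your own. Did you find anything wrong with it?
    • I'm going to use it in my social network to auto-detect URLs. github.com/kuchumovn/sociopathy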