[LeetCode][Facebook Interview Question] Wildcard Matching and Regular Expression Matching

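Introduction

Pattern matching usually falls into two categories. The first is regular expression matching, where the pattern contains special characters. For example, '*' directly follows a character in the pattern and means that this character may appear any number of times (including zero).

The second is wildcard matching, which is what we use when searching for files in an operating system. Take "*.pdf": here '*' no longer denotes a repetition count. It is a wildcard that matches a string of any characters of any length, so "*.pdf" means "find all PDF files".

Algorithm problems often ask you to simulate this kind of matching. To keep the task feasible within an interview, they cut the wildcards or regular expression special characters down to just a few. Even so, these problems are still among the harder ones.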

Regular Expression Matching

The example problem is as follows:

Regular Expression Matching

http://basicalgos.blogspot.com/2012/03/10-regular-expression-matching.html

'.' Matches any single character.
'*' Matches zero or more of the preceding element.

The matching should cover the entire input string (not partial).

The function prototype should be:
bool isMatch(const char *s, const char *p)

Some examples:
isMatch("aa","a") → false
isMatch("aa","aa") → true
isMatch("aaa","aa") → false
isMatch("aa", "a*") → true
isMatch("aa", ".*") → true
isMatch("ab", ".*") → true
isMatch("aab", "c*a*b") → true

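I was asked this problem in a Facebook interview.

There are two special characters to handle, '*' and '.'. The second is easy; the first is trickier.

Because '*' can stand for any count, when *(p+1) == '*' we have two options. We can skip the character before '*' together with the '*' itself (advance p by 2), or, if *s == *p or *p == '.', we can consume any number of such characters from s. Handling '*' therefore splits into subproblems according to how many characters of s are consumed, and I use a recursive function to solve them. My code at the time was not this concise; this is the revised version: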

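bool isMatch(const char *s, const char *p){
    if(*p == '\0')
        return *s == '\0'; // pattern exhausted: match only if s is exhausted too

    if(*(p+1) == '*'){
        // while *s matches *p, let "x*" absorb 0, 1, 2, ... characters of s
        while(*s != '\0' && (*p == *s || *p == '.')){
            if(isMatch(s, p+2))
                return true;
            ++s;
        }
        return isMatch(s, p+2); // *s no longer matches *p (or s has ended): skip "x*"
    }

    // no '*' follows: the current characters must match one to one
    if(*s != '\0' && (*s == *p || *p == '.'))
        return isMatch(s+1, p+1);

    return false;
}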
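
To check the recursion against the examples in the problem statement, here is a minimal sketch of the same idea written with std::string and indices instead of raw pointers. The names matchRegexFrom, matchRegex and checkRegexExamples are hypothetical and only serve this illustration; the structure is one possible arrangement, not the only one.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: does s[i..] match the pattern p[j..]?
bool matchRegexFrom(const std::string& s, std::size_t i,
                    const std::string& p, std::size_t j) {
    if (j == p.size()) return i == s.size();   // pattern used up

    // Does the current pattern character match the current text character?
    bool first = i < s.size() && (p[j] == s[i] || p[j] == '.');

    if (j + 1 < p.size() && p[j + 1] == '*') {
        // Either drop "x*" entirely, or consume one matching character
        // and stay on the same "x*".
        return matchRegexFrom(s, i, p, j + 2) ||
               (first && matchRegexFrom(s, i + 1, p, j));
    }
    return first && matchRegexFrom(s, i + 1, p, j + 1);
}

bool matchRegex(const std::string& s, const std::string& p) {
    return matchRegexFrom(s, 0, p, 0);
}

// Runs the examples listed in the problem statement.
void checkRegexExamples() {
    assert(!matchRegex("aa", "a"));
    assert(matchRegex("aa", "aa"));
    assert(!matchRegex("aaa", "aa"));
    assert(matchRegex("aa", "a*"));
    assert(matchRegex("aa", ".*"));
    assert(matchRegex("ab", ".*"));
    assert(matchRegex("aab", "c*a*b"));
}
```

Using indices makes the end-of-string checks explicit (i < s.size(), j + 1 < p.size()), which is where pointer-based versions of this recursion tend to go wrong.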

Wildcard Matching

Let's take a LeetCode problem as the example.

Wildcard Matching

Implement wildcard pattern matching with support for '?' and '*'.

'?' Matches any single character.
'*' Matches any sequence of characters (including the empty sequence).

The matching should cover the entire input string (not partial).

The function prototype should be:
bool isMatch(const char *s, const char *p)

Some examples:
isMatch("aa","a") → false
isMatch("aa","aa") → true
isMatch("aaa","aa") → false
isMatch("aa", "*") → true
isMatch("aa", "a*") → true
isMatch("ab", "?*") → true
isMatch("aab", "c*a*b") → false

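There are two wildcards: '?' and '*'.

Because '*' can match any string, the problem again splits into subproblems. My first idea was to handle them recursively whenever a '*' is met, just as in the previous problem.

Code: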

class Solution {
public:
    bool isMatch(const char *s, const char *p) {
        if(*s == '\0'){
            if(*p == '\0') return true;   // both exhausted
            if(*p != '*') return false;   // only '*' can match the empty string
        }
        if(*p == '?') return isMatch(++s, ++p);
        else if(*p == '*'){
            while(*(++p) == '*');         // collapse consecutive '*'
            for(; *s != '\0'; ++s){       // try every suffix of s
                if(isMatch(s, p)) return true;
            }
            return isMatch(s, p);         // finally, the empty suffix
        }else{
            if(*p == *s) return isMatch(++s, ++p);
            return false;
        }
        return false;
    }
};

However, this exceeds the time limit.
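
To get a feel for why the plain recursion is too slow, the sketch below instruments the same recursion with a call counter and runs it on an input that can never match. The counter g_calls, the functions naiveMatch and countCalls, and the test pattern "*a*a*a*b" are all assumptions made for this illustration.

```cpp
#include <cassert>
#include <string>

// Hypothetical global counter of recursive calls (illustration only).
static long long g_calls = 0;

// Same recursion as the first attempt, instrumented to count calls.
bool naiveMatch(const char* s, const char* p) {
    ++g_calls;
    if (*s == '\0') {
        if (*p == '\0') return true;
        if (*p != '*') return false;
    }
    if (*p == '?') return naiveMatch(s + 1, p + 1);
    if (*p == '*') {
        while (*(++p) == '*') {}          // collapse consecutive '*'
        for (; *s != '\0'; ++s) {         // try every suffix of s
            if (naiveMatch(s, p)) return true;
        }
        return naiveMatch(s, p);          // finally, the empty suffix
    }
    return *p == *s && naiveMatch(s + 1, p + 1);
}

// Number of calls needed to match "aaa...a" (n times) against pattern.
long long countCalls(int n, const std::string& pattern) {
    g_calls = 0;
    std::string s(n, 'a');
    naiveMatch(s.c_str(), pattern.c_str());
    return g_calls;
}
```

Comparing countCalls(10, "*a*a*a*b") with countCalls(20, "*a*a*a*b") shows the call count growing much faster than the length of s, because every '*' restarts the search over all remaining suffixes, and each extra '*' in the pattern nests one more such loop.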

To save time, I traded space for time and used rec[][] to record the comparison results: rec[i][j] stores whether s+i matches p+j.

class Solution {
public:
    bool isMatch(const char *s, const char *p) {
        int lens = 0, lenp = 0;
        const char *s1 = s, *p1 = p;
        for(; *s1 != '\0'; ++s1, ++lens);
        for(; *p1 != '\0'; ++p1, ++lenp);
        // rec[i][j]: -1 = not computed yet, 0 = no match, 1 = match
        rec = new int*[lens+1];
        for(int i = 0; i <= lens; ++i){
            rec[i] = new int[lenp+1];
            for(int j = 0; j <= lenp; ++j){
                rec[i][j] = -1;
            }
        }
        bool result = isMatchCore(s, s, p, p);
        for(int i = 0; i <= lens; ++i) delete[] rec[i];
        delete[] rec;
        return result;
    }
private:
    int** rec;
    bool isMatchCore(const char *oris, const char *s, const char *orip, const char *p) {
        if(*s == '\0'){
            if(*p == '\0') return true;
            if(*p != '*') return false;
        }
        int &memo = rec[s-oris][p-orip];   // slot for this (s, p) pair
        if(memo >= 0) return memo;
        bool res = false;
        if(*p == '?') res = isMatchCore(oris, s+1, orip, p+1);
        else if(*p == '*'){
            while(*(++p) == '*');          // collapse consecutive '*'
            for(; *s != '\0' && !res; ++s){
                res = isMatchCore(oris, s, orip, p);
            }
            if(!res) res = isMatchCore(oris, s, orip, p);   // empty suffix
        }else if(*p == *s){
            res = isMatchCore(oris, s+1, orip, p+1);
        }
        memo = res;                        // record the result before returning
        return res;
    }
};

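It still exceeds the time limit.

The reason is that even with memoized recursion, every '*' in p still forces us to try every way of matching what follows it. For example, with p = "c*ab*c" and s = "cddabbac", on meeting the first '*' we recursively match the rest of p, "ab*c", against every suffix of the rest of s, "ddabbac". That is: "ab*c" against "ddabbac", "ab*c" against "dabbac", "ab*c" against "abbac", ..., "ab*c" against "c", and "ab*c" against the empty string.

The same happens at the second '*'. Each '*' means the rest of p has to be matched against every suffix of the rest of s.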

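On closer thought, however, when p contains more than one '*' we do not need to match against every suffix as above.

Again take p = "c*ab*c" and s = "cddabbac".

For p = "c*ab*c", we can guess what a matching s must look like: "c....ab.....c", where the dots stand for zero or more characters. The troublesome part is the "ab" in the middle of p, which has to be matched by an "ab" in s. So as soon as an "ab" is found in the middle of s, everything after it can be left to the later '*'.

In other words, as we compare p and s character by character and meet the first '*' in p, all we need to do is keep searching the rest of s for a part that matches "ab".

Put differently, when we meet a '*' we record the position in p just after it and the current position in s, calling them presp and press, and then continue comparing *(++p) and *(++s) one character at a time. If *p != *s, we backtrack: p = presp, s = press+1, ++press. This goes on until we reach the end or meet the next '*'. If we meet the next '*', the "ab" part is settled and the rest is left to the second '*'. If p and s both reach the end, we return true. If we reach the end without meeting a new '*', a mismatch remains, and press has also reached the end, we return false.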

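Compared with the recursion above, the main differences of this approach are:

On meeting a '*', we only consider the subproblem up to the next '*', not the subproblem that runs all the way to the end. This avoids a large number of subproblem computations.

By recording presp and press and backtracking to them each time, we avoid recursion altogether.

Code: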

class Solution {
public:
    bool isMatch(const char *s, const char *p) {
        // positions in p and s where comparison restarts after the last '*'
        const char *presp = NULL, *press = NULL;
        bool startFound = false;              // has any '*' been seen yet?
        while(*s != '\0'){
            if(*p == '?'){++s; ++p;}
            else if(*p == '*'){
                presp = ++p;                  // resume just after the '*'
                press = s;                    // '*' absorbs nothing so far
                startFound = true;
            }else{
                if(*p == *s){
                    ++p;
                    ++s;
                }else if(startFound){
                    p = presp;                // backtrack: the last '*' absorbs one more character
                    s = (++press);
                }else return false;           // mismatch with no '*' to fall back on
            }
        }
        while(*p == '*') ++p;                 // trailing '*' match the empty string
        return *p == '\0';
    }
};
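
To watch the backtracking on the running example, the sketch below repeats the same bookkeeping with indices and additionally records every position in s at which the comparison restarts. The TraceResult struct and the traceMatch function are hypothetical names introduced for this illustration only.

```cpp
#include <cassert>   // for the checks mentioned in the note below
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical result type: the verdict plus the restart positions in s.
struct TraceResult {
    bool matched;
    std::vector<std::size_t> retries;
};

TraceResult traceMatch(const std::string& s, const std::string& p) {
    TraceResult out{false, {}};
    std::size_t i = 0, j = 0;            // current positions in s and p
    std::size_t press = 0, presp = 0;    // restart positions set at the last '*'
    bool starSeen = false;
    while (i < s.size()) {
        if (j < p.size() && p[j] == '*') {
            presp = ++j;                 // resume just after the '*'
            press = i;                   // '*' absorbs nothing so far
            starSeen = true;
        } else if (j < p.size() && (p[j] == '?' || p[j] == s[i])) {
            ++i;
            ++j;
        } else if (starSeen) {
            j = presp;                   // the last '*' absorbs one more character
            i = ++press;
            out.retries.push_back(press);
        } else {
            return out;                  // mismatch with no '*' to fall back on
        }
    }
    while (j < p.size() && p[j] == '*') ++j;   // trailing '*' match the empty string
    out.matched = (j == p.size());
    return out;
}
```

For s = "cddabbac" and p = "c*ab*c", traceMatch reports a match with restarts at positions 2, 3, 6 and 7 of s: two retries while the first '*' looks for "ab", and two more while the second '*' looks for the final "c". Restarts only ever go back to the most recent '*', which is the saving over the recursive versions.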