有什么方法可以加快 anagram 求解器的排列创建速度？答案

【问题标题】：is there any way to speed up the permuation creation for an anagramsolver?有什么方法可以加快 anagram 求解器的排列创建速度？
【发布时间】：2022-01-26 19:48:05
【问题描述】：

我目前正在尝试制作一个非常快速的字谜求解器，但现在它因创建排列而遇到瓶颈。是否有另一种方法来完成整个程序或优化排列创建？这是我的代码：

#include <string>
#include <vector>
#include <algorithm>
#include <iostream>
#include <fstream>
#include <unordered_set>
#include <vector>
#include <boost/asio/thread_pool.hpp>
#include <boost/asio/post.hpp>



void get_permutations(std::string s, std::vector<std::string> &permutations)
{
    std::sort(s.begin(), s.end());
    do
    {
        permutations.push_back(s);
    } while (std::next_permutation(s.begin(), s.end()));
}


void load_file(std::unordered_set<std::string> &dictionary, std::string filename)
{
    std::ifstream words(filename);
    std::string element;
    while (words >> element)
    {
        std::transform(element.begin(), element.end(), element.begin(), ::tolower);
        dictionary.insert(element);
    }
}

void print_valid(const std::unordered_set<std::string>& dictionary, const std::vector<std::string>::const_iterator start, const std::vector<std::string>::const_iterator stop)
{
    for (auto iter = start; iter != stop; iter++)
    {

        if (dictionary.contains(*iter) == true)
        {
            std::cout << *iter << "\n";
        }
    }
}

int main()
{
    const std::string s = "asdfghjklq";
    std::vector<std::string> permutations;
    boost::asio::thread_pool pool(2);
    std::cout << "Loading english dictionary\n";
    std::unordered_set<std::string> dictionary;
    load_file(dictionary, "words");
    std::cout << "Done\n";

    //std::cout << "Enter the anagram: ";
    //getline(std::cin, s);

    clock_t start = clock();


    get_permutations(s, permutations);
    //std::cout << permutations.size() << std::endl;

    std::cout << "finished permutations\n";

    
    if (permutations.size() > 500000)
    {
        std::cout << "making new\n";
        for (size_t pos = 0; pos < permutations.size(); pos += (permutations.size() / 3))
        {
            boost::asio::post(pool, [&dictionary, &permutations, pos] { print_valid(dictionary, (permutations.begin() + pos), (permutations.begin() + pos + (permutations.size() /3) ) ); });
        }
        pool.join();
    } 
    else
    {
        print_valid(dictionary, permutations.begin(), permutations.end());
    }

    

    clock_t finish = clock();
    double time_elapsed = (finish - start) / static_cast<double>(CLOCKS_PER_SEC);
    std::cout << time_elapsed << "\n";
    std::cout << permutations.size() << std::endl;

    return 0;
}

排列的创建在get_permutations 线程池是用来测试非常大的排列集的东西

【问题讨论】：

将您的字典组织成一棵树，每个节点一个字母，然后开始按顺序遍历每个排列。如果你走到了死胡同，这可以让你切掉整个无效的排列。

标签： c++ permutation anagram

【解决方案1】：

想一想你将如何手工处理 - 你如何检查两个单词是否是彼此的字谜？

例如：banana aaannb

你会如何在一张纸上解决这个问题？你会创建所有 720 个排列并检查是否有任何一个匹配？还是有更简单、更直观的方法？

那么是什么让一个词成为另一个词的变位词，即需要满足什么条件？

这都是关于字母计数的。如果两个单词包含相同数量的所有字母，则它们是彼此的字谜。

例如：

banana -> 3x a, 2x n, 1x b
aaannb -> 3x a, 2x n, 1x b
相同的字母计数，所以它们必须是字谜！

有了这些知识，你能构建一个不需要迭代所有可能排列的算法吗？

解决方案

我只建议您在尝试提出自己的优化算法后阅读本文

您只需要为字典单词建立一个字母计数查找表，例如：

1x a, 1x n -> ["an"]
3x a, 1x b, 2x n -> ["banana", "nanaba"]
1x a, 1x p, 1x r, 1x t -> ["part", "trap"]
... etc ...

然后您可以将搜索词分解为字母数，例如banana -> 3x a, 1x b, 2x n 并在查找表中搜索分解。

结果将是您的字典中的单词列表，您可以使用给定的字母集合构建 - 也就是给定字符串的所有可能的字谜。

假设某种名为 letter_counts 的结构包含算法可能看起来像这样的字母组合：

std::vector<std::string> find_anagrams(std::vector<std::string> const& dictionary, std::string const& wordToCheck) {
    // build a lookup map for letter composition -> word
    std::unordered_map<letter_counts, std::vector<std::string>> compositionMap;
    for(auto& str : dictionary)
        compositionMap[letter_counts{str}].push_back(str);

    // get all words that are anagrams of the given one
    auto it = compositionMap.find(letter_counts{wordToCheck});
    // no matches in dictionary
    if(it == compositionMap.end())
        return {};

    // list of all anagrams
    auto result = it->second;

    // remove workToCheck from result if it is present
    result.erase(std::remove_if(result.begin(), result.end(), [&wordToCheck](std::string const& str) { return str == wordToCheck; }), result.end());

    return result;
}

这将在O(n) 时间运行，空间复杂度为O(n)，n 是字典中的单词数。

（如果您不将 compositionMap 的构造作为算法的一部分，它将被装甲化 O(1) 时间）

与基于排列的方法相比，它具有O(n!) 时间复杂度（或者我喜欢如何称呼它O(scary)）。

这是一个仅处理字母 a-z 的完整代码示例，但您可以轻松修改 letter_counts 以使其也适用于其他字符：

godbolt example

#include <string_view>
#include <cctype>
#include <vector>
#include <string>
#include <unordered_map>
#include <iostream>

struct letter_counts {
    static const int num_letters = 26;
    int counts[num_letters];

    explicit letter_counts(std::string_view str) : counts{0} {
        for(char c : str) {
            c = std::tolower(c);
            if(c >= 'a' && c <= 'z')
                counts[c - 'a']++;
        }
    }
};

bool operator==(letter_counts const& lhs, letter_counts const& rhs) {
    for(int i = 0; i < letter_counts::num_letters; i++) {
        if(lhs.counts[i] != rhs.counts[i]) return false;
    }

    return true;
}

template <class T>
inline void hash_combine(std::size_t& seed, const T& v)
{
    std::hash<T> hasher;
    seed ^= hasher(v) + 0x9e3779b9 + (seed<<6) + (seed>>2);
}

namespace std {
    template<>
    struct hash<letter_counts> {
        size_t operator()(const letter_counts& letterCounts) const
        {
            size_t result = 0;
            auto hasher = std::hash<int>{};
            for(int i : letterCounts.counts)
                hash_combine(result, hasher(i));

            return result;
        }
    };
}



std::vector<std::string> find_anagrams(std::vector<std::string> const& dictionary, std::string const& wordToCheck) {
    // build a lookup map for letter composition -> word
    std::unordered_map<letter_counts, std::vector<std::string>> compositionMap;
    for(auto& str : dictionary)
        compositionMap[letter_counts{str}].push_back(str);

    // get all words that are anagrams of the given one
    auto it = compositionMap.find(letter_counts{wordToCheck});
    // no matches in dictionary
    if(it == compositionMap.end())
        return {};

    // list of all anagrams
    auto result = it->second;

    // remove workToCheck from result if it is present
    result.erase(std::remove_if(result.begin(), result.end(), [&wordToCheck](std::string const& str) { return str == wordToCheck; }), result.end());

    return result;
}

int main() {
    std::vector<std::string> dict = {
        "banana",
        "nanaba",
        "foobar",
        "bazinga"
    };

    std::string word = "aaannb";

    for(auto& str : find_anagrams(dict, word)) {
        std::cout << str << std::endl;
    }
}

【讨论】：

【解决方案2】：

您使用的排列方法太慢了，尤其是因为 n 个不同字符的字符串的排列数量以超指数方式扩展。尝试诸如散列和相等谓词之类的东西，其中哈希基于排序的字符串，而相等谓词仅测试两个字符串的排序版本是否相等。您可以使用 boost::unordered_map 来创建自定义散列函数并将符合字谜的单词添加到键集中。

【讨论】：

正如目前所写，您的答案尚不清楚。请edit 添加其他详细信息，以帮助其他人了解这如何解决所提出的问题。你可以找到更多关于如何写好答案的信息in the help center。

【解决方案3】：

请注意，组合的数量往往会很快变得非常大。如果您按字母顺序对两个单词的字符进行排序，然后排序的字符串匹配，则两个单词是字谜。基于这一事实，我制作了以下示例，将字典放入多重映射中，可以快速找到单词的所有字谜。它通过使用按字母顺序排序的输入字符串作为映射的键来实现这一点。

现场演示：https://onlinegdb.com/fXUVZruwq

#include <algorithm>
#include <iostream>
#include <locale>
#include <map>
#include <vector>
#include <set>

// create a class to hold anagram information
class anagram_dictionary_t
{
public:
    
    // create a dictionary based on an input list of words.
    template<typename std::size_t N>
    explicit anagram_dictionary_t(const std::string (&words)[N])
    {
        for (std::string word : words)
        {
            auto key = make_key(word);
            std::string lower{ word };
            to_lower(lower);
            m_anagrams.insert({ key, lower});
        }
    }

    // find all the words that match the anagram
    auto find_words(const std::string& anagram)
    {
        // get the unique key for input word
        // this is done by sorting all the characters in the input word alphabetically
        auto key = make_key(anagram);

        // lookup all the words with the same key in the dictionary
        auto range = m_anagrams.equal_range(key);

        // create a set of found words
        std::set<std::string> words;
        for (auto it = range.first; it != range.second; ++it)
        {
            words.insert(it->second);
        }

        // return the words
        return words;
    }

    // function to check if two words are an anagram
    bool is_anagram(const std::string& anagram, const std::string& word)
    {
        auto words = find_words(anagram);
        return (words.find(word) != words.end());
    }

private:
    // make a unique key out of an input word
    // all anagrams should map to the same key value
    static std::string make_key(const std::string& word)
    {
        std::string key{ word };
        to_lower(key);

        // two words are anagrams if they sort to the same key
        std::sort(key.begin(), key.end());
        return key;
    }

    static void to_lower(std::string& word)
    {
        for (char& c : word)
        {
            c = std::tolower(c, std::locale());
        }
    }

    std::multimap<std::string, std::string> m_anagrams;
};

int main()
{
    anagram_dictionary_t anagram_dictionary{ {"Apple", "Apricot", "Avocado", "Banana", "Bilberry", "Blackberry", "Blueberry" } };
    
    std::string anagram{ "aaannb"};
    auto words = anagram_dictionary.find_words(anagram);
    
    std::cout << "input word = " << anagram << "\n found words : ";
    for (const auto& word : words)
    {
        std::cout << word << "\n";
    }

    return 0;
}

【讨论】：