Find combination of string sentences - Combinations of frequency tables to target frequency table: answers

【Question title】: Find combination of string sentences - Combinations of frequency tables to target frequency table
【Posted】: 2022-01-10 11:19:05
【Question description】:

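The problem is explained in the following article.

I have a list of sentences, for example a list of 1000 sentences.

I want to find a combination of sentences that matches, or most closely matches, a certain frequency table: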

[a:100, b:80, c:90, d:150, e:100, f:100, g:47, h:10 ..... z:900]

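I thought of finding all possible combinations from the list of sentences, using combinations as done here (so comb(1000, 1) through comb(1000, 1000)), and then comparing each combination with the frequency table so that the distance is minimal. That is, sum the frequency tables of the sentences in each candidate combination, compare that sum with the target, and record the combination with the smallest difference from the target. There may be several combinations that match best.

The problem is that computing all combinations takes far too long, apparently days. Is there a known algorithm that solves this efficiently? Ideally within a few minutes at most?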
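For reference, the exhaustive search described above can be sketched in a few lines of Python. This is only a baseline for very small inputs. It assumes that "distance" means the sum of absolute per-letter differences, which the question does not specify, and the names `letter_counts`, `distance` and `brute_force` are hypothetical.

```python
from collections import Counter
from itertools import combinations
import string

def letter_counts(text):
    """Case-insensitive frequencies of the letters a-z; everything else is skipped."""
    return Counter(ch for ch in text.lower() if ch in string.ascii_lowercase)

def distance(total, target):
    """Sum of absolute per-letter differences (an assumed metric)."""
    return sum(abs(total[ch] - target[ch]) for ch in string.ascii_lowercase)

def brute_force(sentences, target):
    """Try every non-empty subset; the work doubles with each extra sentence."""
    counts = [letter_counts(s) for s in sentences]
    best_dist, best_subset = None, None
    for r in range(1, len(sentences) + 1):
        for idxs in combinations(range(len(sentences)), r):
            total = Counter()
            for i in idxs:
                total.update(counts[i])
            d = distance(total, target)
            if best_dist is None or d < best_dist:
                best_dist, best_subset = d, idxs
    return best_dist, [sentences[i] for i in best_subset]

demo_target = Counter({'a': 2, 'b': 2, 'c': 1})
demo_sentences = ['ab', 'abc', 'aaa', 'c']
print(brute_force(demo_sentences, demo_target))  # (0, ['ab', 'abc'])
```

Because every subset is visited, the result is optimal for the chosen metric, but the running time grows as 2**n, which is why this cannot work for 1000 sentences.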

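Input sentences:

More RVs were seen in the storage lot than at the campground.

She did her best to help him. There have been days when I wished to be separated from my body, but today wasn’t one of those days.

The swirled lollipop had issues with the pop rock candy.

The two walked down the slot canyon oblivious to the sound of thunder in the distance.

Acres of almond trees lined the interstate highway which complimented the crazy driving nuts.

He is no James Bond; his name is Roger Moore.

The tumbleweed refused to tumble but was more than willing to prance.

She was disgusted he couldn’t tell the difference between lemonade and limeade.

He didn’t want to go to the dentist, yet he went anyway.

Find the combination of sentences that is closest to the following frequency table: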

[a:5, b:5, c:5, d:5, e:5, f:5, g:5, h:5 ..... z:5]

Example:

The frequency table of the sixth sentence

He is no James Bond; his name is Roger Moore.

is [a:2, b:1, d:1, e:5, g:1, h:2, i:3, j:1, m:3, n:3, o:5, r:3, s:4]

The frequency table treats upper and lower case as equal and excludes special characters.
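A minimal sketch of how such a table could be computed, assuming only the 26 ASCII letters count and that case is folded before counting. The function name `frequency_table` is hypothetical.

```python
from collections import Counter

def frequency_table(sentence):
    """Count the letters a-z, folding case and skipping spaces and punctuation."""
    return Counter(ch for ch in sentence.lower() if 'a' <= ch <= 'z')

table = frequency_table('He is no James Bond; his name is Roger Moore.')
print(sorted(table.items()))   # letters in alphabetical order with their counts
print(sum(table.values()))     # total number of letters counted
```

Sorting the items only makes the output easier to compare by eye; the Counter itself can be summed or subtracted directly.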

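【Question comments】:

It is a bit hard to follow your thinking and understand what you are trying to achieve. Could you give a concrete example? With an actual list of sentences (but no more than 10), an actual frequency table, and the actual desired output?
Also, my understanding of your question makes me think of "balancing a chemical reaction". A chemical reaction is a list of molecules rather than a list of sentences; molecules contain atoms the way sentences contain letters; to balance the equation, the algorithm must determine the right number of each molecule so that the count of every atom agrees, just as you want to determine the number of each sentence so that the count of every letter agrees.
Alternatively, your problem may be similar to the multiset cover problem, where the frequencies form a multiset, each sentence is a sub-multiset, and you want to choose the smallest number of sentences that covers the frequency multiset.
Not the length of each sentence; the number of selected sentences. In the multiset cover problem, a valid solution is one whose frequencies are at least those in the target; an optimal solution is one whose frequencies are at least those in the target and which uses the fewest multisets. But in your case you do not just want the frequencies to be at least as high as the target: you want them to be as close to the target as possible. So you do not need to optimize the number of sentences.
How do you define the "closest" solution?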

Tags: c++ string algorithm data-structures computer-science

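【Solution 1】:

If anyone can find a combination giving 3 c, 3 a, 3 b, 3 d, or 30 c, 30 a, 30 b, 30 d, from the sentences below, within 5% above or below, then the problem can be solved.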

S1: aaaaaaaaaaaaaaaaaa bbbbbb c
S2: aaaaaaaa bbbbbbbb d
S3: aaaaaaaaaaa bbbbbbbbb c dd
S4: aaaaaaaaaa bbbbbbbb

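Be realistic. There is no solution; it is neither NP-hard nor NP-complete, there is simply no solution. The number of occurrences of some letters in sentences (for example vowels such as i or a) is not equal to that of other letters (such as x or w). We can only find the best match, as the code provided here does, or change the requirements. I tried to solve this with the knapsack algorithm, Euclidean distance and standard deviation, but none of them gave me such an answer, because no sentence contains every letter the same number of times.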

【Comments】:

【Solution 2】:

Greedy algorithm

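Your first idea, testing all possible combinations of sentences, is too slow. If you have n sentences, then there are 2**n (2 to the power of n) possible combinations of sentences. For example, with n=1000 there are 2**1000 ≈ 10**301 possible combinations. That is a 1 followed by 301 zeros: more than the number of particles in the universe, and more than the number of possible chess games!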
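The size of that number is easy to check directly, since Python integers have arbitrary precision. The variable names below are only illustrative.

```python
n = 1000
subsets = 2 ** n              # number of subsets of n sentences, computed exactly
digits = len(str(subsets))    # number of decimal digits
print(digits)                 # 302, so 2**1000 is a little over 10**301
```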

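Here is a suggestion for a greedy algorithm. It is not particularly optimized, and it runs in O(k * n**2), where n is the number of sentences and k is the length of the longest sentence.

The idea is as follows:

Assign each sentence a score equal to number of useful characters - number of superfluous characters. For example, if a sentence contains 20 'a' and the target only needs 15 'a', we count 15 useful 'a' and 5 superfluous 'a', so the character 'a' contributes 10 to the score of that sentence.
Add the sentence with the highest score to the result;
Update the target to remove the characters that are already in the result;
Update the score of every sentence to reflect the updated target.
Loop until no sentence has a positive score.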
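The scoring rule in the first step can be written out per letter. This is a small sketch using plain dictionaries, with the hypothetical helper names `letter_contribution` and `sentence_score`; it reproduces the 20-versus-15 example from the description.

```python
def letter_contribution(in_sentence, needed):
    """Score contribution of one letter: useful copies minus superfluous copies."""
    useful = min(in_sentence, needed)
    superfluous = max(0, in_sentence - needed)
    return useful - superfluous

def sentence_score(sentence_counts, target_counts):
    """Sum the per-letter contributions over every letter in the sentence."""
    return sum(letter_contribution(n, target_counts.get(ch, 0))
               for ch, n in sentence_counts.items())

print(letter_contribution(20, 15))                           # 15 useful - 5 superfluous = 10
print(sentence_score({'a': 20, 'b': 3}, {'a': 15, 'b': 5}))  # 10 + 3 = 13
```

A letter that the target no longer needs at all contributes a purely negative amount, which is what eventually drives every remaining score to zero or below and ends the loop.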

I was too lazy to implement it in C++, so here it is in Python, using a max-heap and a Counter. After the code I wrote a quick explanation to help you translate it to C++.

from collections import Counter
import heapq

sentences = ['More RVs were seen in the storage lot than at the campground.', 'She did her best to help him.', 'There have been days when I wished to be separated from my body, but today wasn’t one of those days.', 'The swirled lollipop had issues with the pop rock candy.', 'The two walked down the slot canyon oblivious to the sound of thunder in the distance.', 'Acres of almond trees lined the interstate highway which complimented the crazy driving nuts.', 'He is no James Bond; his name is Roger Moore.', 'The tumbleweed refused to tumble but was more than willing to prance.', 'She was disgusted he couldn’t tell the difference between lemonade and limeade.', 'He didn’t want to go to the dentist, yet he went anyway.']

target = Counter('abcdefghijklmnopqrstuvwxyz' * 10)
print(target)
# Counter({'a': 10, 'b': 10, 'c': 10, 'd': 10, 'e': 10, 'f': 10, 'g': 10, 'h': 10, 'i': 10, 'j': 10, 'k': 10, 'l': 10, 'm': 10, 'n': 10, 'o': 10, 'p': 10, 'q': 10, 'r': 10, 's': 10, 't': 10, 'u': 10, 'v': 10, 'w': 10, 'x': 10, 'y': 10, 'z': 10})

counts = [Counter(''.join(filter(str.isalpha, s)).lower()) for s in sentences]  # remove punctuation, spaces, uncapitalize, then count frequencies

def get_score(sentence_count, target):
    return sum((sentence_count & target).values()) - sum((sentence_count - target).values())

candidates = []
for sentence, count in zip(sentences, counts):
    score = get_score(count, target)
    candidates.append((-score, sentence, count))

heapq.heapify(candidates)    # order candidates by score
                             # python's heapq only handles min-heap
                             # but we need a max-heap
                             # so I added a minus sign in front of every score

selection = []
while candidates and candidates[0][0] < 0:  # while there is a candidate with positive score
    score, sentence, count = heapq.heappop(candidates)  # greedily selecting best candidate
    selection.append(sentence)
    target = target - count                             # update target by removing characters already accounted for
    candidates = [(-get_score(c,target), s, c) for _,s,c in candidates]  # update scores of remaining candidates
    heapq.heapify(candidates)                       # reorder candidates according to new scores

# HERE ARE THE SELECTED SENTENCES:
print(selection)
# ['Acres of almond trees lined the interstate highway which complimented the crazy driving nuts.', 'There have been days when I wished to be separated from my body, but today wasn’t one of those days.']

# HERE ARE THE TOTAL FREQUENCIES FOR THE SELECTED SENTENCES:
final_frequencies = Counter(filter(str.isalpha, ''.join(selection).lower()))
print(final_frequencies)
# Counter({'e': 22, 't': 15, 'a': 12, 'h': 11, 's': 10, 'o': 10, 'n': 10, 'd': 10, 'i': 9, 'r': 8, 'y': 7, 'm': 5, 'w': 5, 'c': 4, 'b': 4, 'f': 3, 'l': 3, 'g': 2, 'p': 2, 'v': 2, 'u': 2, 'z': 1})

# CHARACTERS IN EXCESS:
target = Counter('abcdefghijklmnopqrstuvwxyz' * 10)
print(final_frequencies - target)
# Counter({'e': 12, 't': 5, 'a': 2, 'h': 1})

# CHARACTERS IN DEFICIT:
print(target - final_frequencies)
# Counter({'j': 10, 'k': 10, 'q': 10, 'x': 10, 'z': 9, 'g': 8, 'p': 8, 'u': 8, 'v': 8, 'f': 7, 'l': 7, 'b': 6, 'c': 6, 'm': 5, 'w': 5, 'y': 3, 'r': 2, 'i': 1})

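Explanation:

Python's Counter( ) converts a sentence into a mapping character -> frequency;
For two Counters a and b, a & b is multiset intersection and a - b is multiset difference;
For a Counter a, sum(a.values()) is the total count (the sum of all frequencies);
heapq.heapify turns a list into a min-heap, a data structure that gives easy access to the element with the lowest score. We actually want the sentence with the highest score, not the lowest, so I replaced all scores with their negatives.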
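The operations listed above can be tried in isolation. The toy values here are made up for illustration; only `Counter`, `heapq.heapify` and `heapq.heappop` from the standard library are used.

```python
from collections import Counter
import heapq

a = Counter('aaab')   # a:3, b:1
b = Counter('abbc')   # a:1, b:2, c:1

common = a & b        # multiset intersection: minimum of each count (a:1, b:1)
extra = a - b         # multiset difference: counts that would be <= 0 are dropped (a:2)
total = sum(a.values())   # total number of letters in a (4)
print(common, extra, total)

scores = [3, 10, 7]
heap = [-s for s in scores]     # negate so the min-heap behaves like a max-heap
heapq.heapify(heap)
best = -heapq.heappop(heap)     # the highest original score comes out first
print(best)                     # 10
```

When translating to C++, `std::priority_queue` is a max-heap by default, so the negation trick is not needed there.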

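Non-optimality of the greedy algorithm

I should mention that this greedy algorithm is an approximation algorithm. At every iteration it picks the sentence with the highest score, but there is no guarantee that the optimal solution actually contains that sentence.

It is easy to build an example where the greedy algorithm fails to find the optimal solution: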

target = Counter('abcdefghijklmnopqrstuvwxyz')
print(target)
# Counter({'a': 1, 'b': 1, 'c': 1, 'd': 1, 'e': 1, 'f': 1, 'g': 1, 'h': 1, 'i': 1, 'j': 1, 'k': 1, 'l': 1, 'm': 1, 'n': 1, 'o': 1, 'p': 1, 'q': 1, 'r': 1, 's': 1, 't': 1, 'u': 1, 'v': 1, 'w': 1, 'x': 1, 'y': 1, 'z': 1})

sentences = [
    'The quick brown fox jumps over the lazy dog.',
    'abcdefghijklm',
    'nopqrstuvwxyz'
]

With this target, the scores are as follows:

[
    (17, 'The quick brown fox jumps over the lazy dog.'),
    (13, 'abcdefghijklm'),
    (13, 'nopqrstuvwxyz')
]

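The two "half-alphabets" each score 13, because they contain 13 letters of the alphabet. The sentence "The quick brown fox..." scores 17 = 26 - 9, because it contains all 26 letters of the alphabet plus 9 superfluous letters (for example, there are 3 superfluous 'o' and 2 superfluous 'e').

Clearly, the best solution is to cover the target perfectly with the two halves of the alphabet. But our greedy algorithm will pick the "quick brown fox" sentence first, because it has a higher score.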
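The gap between greedy and optimal on this example can be checked with a short script. It is a sketch, not the answer's own code: `greedy` re-implements the loop with `max` instead of a heap, `mismatch` is an assumed quality measure (surplus letters plus missing letters), and the optimum is found by trying all 7 non-empty subsets.

```python
from collections import Counter
from itertools import combinations

def counts_of(s):
    return Counter(filter(str.isalpha, s.lower()))

def get_score(count, target):
    return sum((count & target).values()) - sum((count - target).values())

def greedy(sentences, target):
    remaining = Counter(target)
    pool = [(s, counts_of(s)) for s in sentences]
    chosen = []
    while pool:
        best = max(pool, key=lambda sc: get_score(sc[1], remaining))
        if get_score(best[1], remaining) <= 0:
            break                            # no candidate improves the result
        chosen.append(best[0])
        remaining = remaining - best[1]      # letters still needed
        pool.remove(best)
    return chosen

def mismatch(selection, target):
    """Surplus letters plus missing letters (an assumed measure of closeness)."""
    total = Counter()
    for s in selection:
        total.update(counts_of(s))
    return sum((total - target).values()) + sum((target - total).values())

target = Counter('abcdefghijklmnopqrstuvwxyz')
sentences = ['The quick brown fox jumps over the lazy dog.', 'abcdefghijklm', 'nopqrstuvwxyz']

greedy_pick = greedy(sentences, target)
best_pick = min(
    (list(c) for r in range(1, len(sentences) + 1) for c in combinations(sentences, r)),
    key=lambda sel: mismatch(sel, target),
)
print(greedy_pick, mismatch(greedy_pick, target))   # the pangram alone, mismatch 9
print(best_pick, mismatch(best_pick, target))       # the two half-alphabets, mismatch 0
```

After the pangram is selected the remaining target is empty, so both half-alphabets score -13 and the greedy loop stops with 9 surplus letters.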

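【Comments】:

Hi Stef, thanks for the addition on optimality. If g is removed from the full-alphabet sentence, the algorithm should choose the two half-alphabets. I am not sure whether the algorithm currently covers this, but it should... I have also converted your Python code to C++, shown below.
The bounty will be awarded to the best solution.

【Solution 3】: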

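#include <algorithm>
#include <climits>
#include <cstring>
#include <string>
#include <vector>

using namespace std;

struct Sentence
{
    wstring text{ L"" };
    vector<int> encoded_text;
    int counter[26] = {};   // frequency table, one slot per letter a..z
    int score = INT_MIN;    // INT_MIN marks "not scored yet" or "already selected"
};

// target frequency for each letter a..z
int m_target[26]
{
    10,10,10,10,10,
    10,10,10,10,10,
    10,10,10,10,10,
    10,10,10,10,10,
    10,10,10,10,10,
    10
};

// orders candidates by descending score
bool orderByScore(const Sentence &a, const Sentence &b)
{
    return b.score < a.score;
}

// class declaration assumed here; the original post shows only the member definitions
class SentencesCounter
{
public:
    int GetScore(Sentence sentence, int* target);
    vector<Sentence> SolveSO(vector<Sentence> &sentences);
};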

int SentencesCounter::GetScore(Sentence sentence, int* target)
{
    int sum1 = 0;
    int sum2 = 0;

    for (size_t i = 0; i < 26; i++)
    {
        int sentenceFreq = sentence.counter[i];
        int targetFreq = target[i];

        sum1 += min(sentenceFreq, targetFreq);
        sum2 += max(0, sentenceFreq - targetFreq);
    }

    return sum1 - sum2;
}

vector<Sentence> SentencesCounter::SolveSO(vector<Sentence> &sentences)
{
    vector<Sentence> candidates{ sentences };

    for (size_t i = 0; i < candidates.size(); i++)
    {
        candidates[i].score = GetScore(candidates[i], m_target);
    }

    sort(candidates.begin(), candidates.end(), orderByScore);

    int target[26];
    memcpy(target, m_target, 26 * sizeof(int));

    vector<Sentence> selection;
    while (!candidates.empty() && candidates.front().score > 0) // while there is a candidate with positive score
    {
        Sentence s = candidates.front();
        if(s.encoded_text.size() > 0) selection.push_back(s);
        candidates.front().score = INT_MIN;

        for (size_t i = 0; i < 26; i++) { target[i] = max(0, target[i] - s.counter[i]); } // update target, clamped at 0 like Python's Counter subtraction

        size_t i;
        for (i = 0; i < candidates.size(); i++)
        {
            if (candidates[i].score > INT_MIN) // int min means already added to selection
                candidates[i].score = GetScore(candidates[i], target);
            else if (i != 0) break; // int min found at other index than top
        }

        partial_sort(candidates.begin(), candidates.begin() + i, candidates.end(), orderByScore);
    }
    return selection;
}

An attempt at replicating Stef's Python code in pseudo-C++.

【Comments】:

【Solution 4】:

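This can be reduced to the problem of finding the subsequence sum with the minimum absolute difference from a target.

The problem is as follows: you have an array A of integer values, for example [1,5,3,2,6], and an integer value T, the target. You want to find a subsequence A' of elements from A such that abs(T - sum(A')) is minimized.

In your case the individual values of A are not single integers but vectors, each holding the character frequency table of one sentence, and the target is likewise a vector of character counts. You want to minimize the sum of the absolute differences.

This is clearly a dynamic programming problem. Without optimization the time complexity is exponential: we need to check 2^n possibilities (for each element there are two options: we either take it or leave it). I think this is what you described in the question as creating all combinations.

But with optimization we can achieve n * T, where n is the number of elements in A and T is the value of the target. That is, of course, if we only want the closest number itself and not the elements that sum to it.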

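To get the elements of the subsequence that leads to the best solution, you have 2 options:

Backtracking, which has the exponential time complexity explained earlier.
DP with path reconstruction, where the time complexity stays manageable, as described above.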
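The second option can be sketched for the scalar version of the problem. This is an illustrative implementation, not code from the linked resources: it assumes non-negative integers, the name `closest_subset_sum` is hypothetical, and sums above 2 * target are pruned because they can never beat the empty subset, which keeps the table at O(n * T) entries.

```python
def closest_subset_sum(values, target):
    """Return (best_sum, chosen_values) minimising abs(target - best_sum).

    parent maps each reachable sum to (previous_sum, index_of_value_used),
    which is enough to rebuild one subset that produces the sum.
    """
    parent = {0: None}                          # the empty subset reaches sum 0
    for i, v in enumerate(values):
        for s in list(parent):                  # snapshot: each value is used at most once
            new_sum = s + v
            if new_sum <= 2 * target and new_sum not in parent:
                parent[new_sum] = (s, i)
    best = min(parent, key=lambda s: abs(target - s))
    chosen, s = [], best
    while parent[s] is not None:                # walk the back-pointers to the empty subset
        s, i = parent[s]
        chosen.append(values[i])
    return best, chosen[::-1]

print(closest_subset_sum([1, 5, 3, 2, 6], 10))  # (10, [5, 3, 2])
```

For frequency tables the states become vectors instead of single sums, so the number of reachable states can grow far beyond n * T; that is the main obstacle to applying this directly to the sentence problem.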

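These problems and algorithms are well known, and I do not think they need explaining.

As far as I understand, how your specific problem maps onto this one is also obvious. Of course, there is some complexity in how you would implement it. But if the relationship between your problem and the subsequence sum problem above is unclear, let me know so that I can explain further.

Here are some links I found that may help you with this problem. Note that they are not a straightforward answer, since the problem is relatively complex.

Closest Subsequence Sum Problem on LeetCode. This handles the case where you are only looking for the closest sum, not the path that leads to it. The discussion page is full of different ideas and detailed explanations (sort by most votes).
DP and Path Reconstruction: this is part of a series on DP.
Primer on DP
Reconstructing the Path of the Optimal Solution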

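【Comments】:

Hi user1984, thanks for your analysis. Do you have some example code that applies DP/backtracking to a similar problem? Unfortunately I do not have much experience building dynamic programming solutions myself.
You're welcome. Let me see if I can find anything. @BigChief
Unfortunately I do not have anything ready-made, but I added some resources at the bottom of the answer. Most of them are long and need some study, but that is the nature of this kind of problem, IMHO. @BigChief
@BigChief I have not made any new edits since your last comment yesterday. I still think this is a DP with path reconstruction problem, where the DP part is conceptually similar to the closest subsequence sum problem, as described in the first bullet.
I agree, this is NP-hard. In fact, if you go to en.wikipedia.org/wiki/NP-hardness, the example given is the subset sum problem.

【Solution 5】: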

We tried to find a solution, shown in this article, but I do not think the solution is good. https://www.codeproject.com/Articles/5320281/A-problem-finding-optimal-number-of-sentences-and

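【Comments】:

【Solution 6】:

This looks to me like an advanced knapsack problem. The upper bound on the input size (1000) also helps; it looks like O(n^2) complexity should be acceptable here.

In the standard knapsack problem you have 2 inputs, value and weight, and a limit on the total weight you can carry, such that the total value is maximized.

Here, your limit would be your target frequency table, e.g.

[a:100, b:80, c:90, d:150, e:100, f:100, g:47, h:10 ..... z:900]

The input weights would be the frequency tables of the individual sentences. For instance, in the 10-sentence example you gave, do not think of the inputs as sentences but look at them as follows: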

More RVs were seen in the storage lot than at the campground ->
{'m': 2, 'o': 4, 'r': 5, 'e': 8, 'v': 1, 's': 3, 'w': 1, 'n': 4, 'i': 1, 't': 6, 'h': 3, 'a': 4, 'g': 2, 'l': 1, 'c': 1, 'p': 1, 'u': 1, 'd': 1}
She did her best to help him. There have been days when I wished to be separated from my body, but today wasn’t one of those days ->
{'s': 8, 'h': 9, 'e': 16, 'd': 8, 'i': 4, 'r': 4, 'b': 5, 't': 9, 'o': 8, 'l': 1, 'p': 2, 'm': 3, 'a': 7, 'v': 1, 'n': 4, 'y': 5, 'w': 3, 'f': 2, 'u': 1}
The swirled lollipop had issues with the pop rock candy ->
{'t': 3, 'h': 4, 'e': 4, 's': 4, 'w': 2, 'i': 4, 'r': 2, 'l': 4, 'd': 3, 'o': 4, 'p': 4, 'a': 2, 'u': 1, 'c': 2, 'k': 1, 'n': 1, 'y': 1}
...
...
...
He didn’t want to go to the dentist, yet he went anyway ->
{'h': 3, 'e': 6, 'd': 3, 'i': 2, 'n': 5, 't': 9, 'w': 3, 'a': 3, 'o': 3, 'g': 1, 's': 1, 'y': 3}
and so on...

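Now, in this case we do not have the list of values that we would maximize in the standard knapsack. Our value will come only from the combined frequency table, since our maximisation condition is min differential of the target freq table and combined freq table. We need a function that expresses this condition, rather than the plain additive maximization.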
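One possible reading of this idea, sketched under loud assumptions: the DP state is the combined frequency vector itself, identical vectors reached by different subsets are merged, and the "value" is the sum of absolute differences from the target. The three-letter `ALPHABET` and the names `to_vector` and `vector_dp` are hypothetical, and the number of states can still grow exponentially with realistic alphabets and targets.

```python
from collections import Counter

ALPHABET = 'abc'   # tiny alphabet chosen for illustration only

def to_vector(text):
    """Frequency table of a sentence as a tuple ordered like ALPHABET."""
    c = Counter(text.lower())
    return tuple(c[ch] for ch in ALPHABET)

def vector_dp(sentences, target):
    """Exact search over reachable frequency vectors, with back-pointers."""
    vectors = [to_vector(s) for s in sentences]
    parent = {tuple(0 for _ in ALPHABET): None}   # state -> (previous state, sentence index)
    for i, vec in enumerate(vectors):
        for state in list(parent):                # snapshot: each sentence used at most once
            new_state = tuple(a + b for a, b in zip(state, vec))
            if new_state not in parent:           # merge subsets with identical totals
                parent[new_state] = (state, i)

    def cost(state):
        return sum(abs(s - t) for s, t in zip(state, target))

    best = min(parent, key=cost)
    chosen, state = [], best
    while parent[state] is not None:              # rebuild one subset that reaches best
        state, i = parent[state]
        chosen.append(sentences[i])
    return cost(best), chosen[::-1]

print(vector_dp(['aab', 'bcc', 'abc', 'c'], (2, 2, 2)))  # (0, ['aab', 'bcc'])
```

The merge step is the only saving over plain enumeration, so this is practical only when many subsets collapse onto the same combined table.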

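Note: in writing this answer I assume that you have prior knowledge of DP and the standard knapsack algorithm. If not, you really need to study it first, since it forms the basis of this solution.

Note 2: there are certainly things in this answer that I could elaborate on further. If anything is unclear or needs an explicit explanation, feel free to ask in the comments and I will be happy to edit the answer in response.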

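【Comments】:

I have implemented this before, but the performance is too poor for large inputs. For example, with a letter target of 100 the number of states is (100^36). It is fine for a small number of input sentences, though.
@MajidHajibaba The number of states should not be that large. It should only equal n^2, where n=number of input sentences.
@BigChief I do not have code right now. I will try to update the answer with something that at least helps you write the complete code. Would pseudocode or Python work? I have not written C++ in a long time, so...
Also, @MajidHajibaba I do not think you need to "reach" the "letter target".
@BigChief I would also like some limits/upper bounds on the total number of sentences, the sentence size, and the values in the target array.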