【Question title】: Duplicate text-finding
【Posted】: 2010-10-25 23:24:37
【Question description】:

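My main problem is finding a proper way to automatically turn this, for example:

d+c+d+f+d+c+d+f+d+c+d+f+d+c+d+f+

into this:

[d+c+d+f+]4

That is, find duplicates that sit next to each other and then build a shorter "loop" out of them. So far I have not found a suitable solution, and I look forward to a response.

P.S. To avoid confusion, the sample above is not the only thing that needs "looping"; it differs from file to file. This is meant for a C++ or C# program, either is fine, though I am open to any other suggestions as well. Also, the main idea is that all the work is done by the program itself, with no user input other than the file.

Here is the full file, for reference. My apologies for the stretched page:

#0 @16 v225 y10 w250 t76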

l16 $ED $EF $A9 p20,20 >ecegb>dd+d+f+a+>c+f+d+ccegbgegecec d+c+d+f+d+c+d+f+d+c+d+f+d+c+d+f+ r1^1

/ l8 r1r1r1r1 f+f+g+cg+r4 a+c+a+g+cg+r4f+f+g+cg+r4 a+c+a+g+cg+r4f+f+g+cg+r4 a+c+a+g+cg+r4 f+f+g+cg+r4 a+c+a+g+r4g+16f16c+ a+2^g+f+g+4 f+ff+4fd+f4 d+c+d+4c+cc4d+ c+d+4g+4a+4 r1^2^4^a+2^g+f+g+4 f+ff+4fd+f4 d+c+d+4c+cc4d+ c+d+4g+4a+4 r1^2^4^ r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1 P>

#4 @22 v250 y10

l8 o3 rg+rg+rg+rg+rg+rg+rg+rg+rg+rg+rg+rg+rg+rg+rg+rg+rg+rg+rg+rg+rg+rg+rg+rg+ / r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1 P>

#2 @4 v155 y10

l8 $ED $F8 $8F o4 r1r1r1 d+4f4f+4g+4 a+4r1^4^2 / d+4^fr2 f+4^fr2d+4^fr2 f+4^fr2d+4^fr2 f+4^fr2d+4^fr2 f+4^fr2 > d+4^fr2 f+4^fr2d+4^fr2 f+4^fr2 a+4^g+r2 f+1a+4^g+r2 f+1 f+4^fr2 d+1 f+4^fr2 d+2^d+4^ r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1 P>

#3 @10 v210 y10

r1^1 o3 c8r8d8r8 c8r8c8r8c8r8c8r8c8r8c8r8c8r8c8r8c8r8c8r8c8r8 c8 @10d16d16@21 c8 @10d16d16@21 c8 @10d16d16@21 / c4@10d8@21c8c8@10d8@21c8c4@10d8@21c8c8@10d8@21c8c4@10d8@21c8c8@10d8@21c8c4@10d8@21c8c8@10d8@21c8c4@ 10d8@21c8c8@10d8@21c8c4@10d8@21c8c8@10d8@21c8c4@10d8@21c8c8@10d8@21c8c4@10d8@21c8c8@10d8@21c8c4@10d8@ 21c8c8@10d8@21c8c4@10d8@21c8c8@10d8@21c8c4@10d8@21c8c8@10d8@21c8 c4@10d8@21c8 @10d16d16d16d16d16r16 c4@10d8@21c8c8@10d8@21c8c4@10d8@21c8c8@10d8@21c8c4@10d8@21c8c8@10d8@21c8c4@10d8@21c8c8@10d8@21c8c4@ 10d8@21c8c8@10d8@21c8c4@10d8@21c8c8@10d8@21c8c4@10d8@21c8c8@10d8@21c8c4@10d8@21c8c8@10d8@21c8c4@10d8@ 21c8c8@10d8@21c8c4@10d8@21c8c8@10d8@21c8c4@10d8@21c8c8@10d8@21c8c4@10d8@21c8c8@10d8@21c8c4@10d8@21c8c8@10d8@21c8c4@10d8@21c8c8@10d8@21c8c4@10d8@21c8c8@10d8@21c8c4@10d8@21c8c8@10d8@21c8c4@10d8@21c8 c8@10d8@21c8c4@10d8@21c8c8@10d8@21c8c4@10d8@21c8c8@10d8@21c8c4@10d8@21c8c8@10d8@21c8c4@10d8@21c8c8@ 10d8@21c8c4@10d8@21c8c8@10d8@21c8c4@10d8@21c8c8@10d8@21c8 c4@10d8@21c8 @10b16b16>c16c16

#7 @16 v230 y10

l16 $ED $EF $A9 cceeggbbggeeccee d+d+f+f+a+a+f+f+d+d+d+d+ cceeggeecccc d+d+ffd+d+

#5 @4 v155 y10

l8 $ED $F8 $8F o4 r1r1r1r1 d+4r1^2^4 / cr2 c+4^cr2cr2 c+4^cr2cr2 c+4^cr2cr2 c+4^cr2 a+4^>cr2 c+4^cr2 cr2 c+4^c r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1 r2 f+4^fr2 d+1f+4^fr2 d+1 c+4^cr2 c+4^cr2

【Question comments】:

  • School compression project or something? :p
  • Could you wrap the code in Markdown?

Tags: c# c++ text compression duplicates


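【Solution 1】:

Not sure if this is what you are looking for.

I took the string "testtesttesttest4notaduped+c+d+f+d+c+d+f+d+c+d+f+d+c+d+f+testtesttest" and converted it into "[test]4 4notadupe[d+c+d+f+]4 [test]3".

I'm sure someone will come up with a better, more efficient solution, since this one is a bit slow when processing the full file. I look forward to other answers.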

        string stringValue = "testtesttesttest4notaduped+c+d+f+d+c+d+f+d+c+d+f+d+c+d+f+testtesttest";

        for (int i = 0; i < stringValue.Length; i++)
        {
            // Try each unit length k that leaves room for at least one repeat.
            for (int k = 1; i + (k * 2) <= stringValue.Length; k++)
            {
                // Count how many times the k-character unit at i repeats back to back.
                // A separate offset is used so the loop variable k is never modified.
                int count = 1;
                int next = i + k;
                while (next + k <= stringValue.Length
                       && string.CompareOrdinal(stringValue, i, stringValue, next, k) == 0)
                {
                    count++;
                    next += k;
                }

                if (count > 1)
                {
                    // The trailing space keeps the count apart from a digit that
                    // follows it; otherwise "[test]4" + "4..." would read as "[test]44".
                    string addString = "[" + stringValue.Substring(i, k) + "]" + count + " ";

                    // Only replace when the loop form actually saves space;
                    // otherwise keep trying longer units from the same position.
                    if (addString.Length < k * count)
                    {
                        stringValue = stringValue.Remove(i, count * k).Insert(i, addString);
                        i += addString.Length - 1; // resume right after the inserted loop
                        break;
                    }
                }
            }
        }
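
Since the question also allows C++, here is a minimal sketch of a decoder for the `[unit]count ` notation produced above, which can be handy for checking that a compressed string still round-trips to the original. The function name `expandLoops` is made up for illustration, and the sketch assumes the raw text never contains `[` or `]` itself and that loops are not nested.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Expands every "[unit]N " group back into N copies of unit.
// Assumption: '[' and ']' never occur in the raw (uncompressed) text.
std::string expandLoops(const std::string& packed) {
    std::string out;
    std::size_t i = 0;
    while (i < packed.size()) {
        if (packed[i] != '[') {
            out += packed[i++];  // ordinary character, copy as-is
            continue;
        }
        std::size_t close = packed.find(']', i);
        assert(close != std::string::npos && "unterminated '[' group");
        std::string unit = packed.substr(i + 1, close - i - 1);

        // Read the decimal repeat count that follows ']'.
        std::size_t j = close + 1;
        int count = 0;
        while (j < packed.size() &&
               std::isdigit(static_cast<unsigned char>(packed[j]))) {
            count = count * 10 + (packed[j] - '0');
            ++j;
        }
        for (int n = 0; n < count; ++n) out += unit;

        // Skip the single separator space the compressor appends.
        if (j < packed.size() && packed[j] == ' ') ++j;
        i = j;
    }
    return out;
}
```

Feeding it the compressed form of the test string, including the trailing space, should give back the original input.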

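【Comments】:

  • Thanks, this works great. I'm not sure what you meant by it being slow on the whole file, though, since it seems quite fast to me (1-3 seconds at most). But thanks a lot, anyway.
【Solution 2】:

You can use the Smith-Waterman algorithm to do a local alignment, comparing the string against itself.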

http://en.wikipedia.org/wiki/Smith-Waterman_algorithm

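EDIT: To adapt the algorithm for self-alignment, you need to force the values on the diagonal to zero; that is, penalize the trivial solution of aligning the whole string exactly with itself. The "second-best" alignment then pops out instead, which will be the longest pair of matching substrings. Repeat the same thing to find progressively shorter matching substrings.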
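
A rough C++ sketch of that idea, under some assumptions: the scores (match +2, mismatch -1, gap -1) are arbitrary illustrative choices, the names `SelfAlignment` and `bestSelfAlignment` are made up here, and only the upper triangle of the matrix is filled so the diagonal stays at zero. It uses O(n²) memory, so it is meant for experimenting rather than for very large files.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

struct SelfAlignment {
    int score = 0;
    std::size_t firstStart = 0, firstEnd = 0;    // [start, end) of the earlier copy
    std::size_t secondStart = 0, secondEnd = 0;  // [start, end) of the later copy
};

// Smith-Waterman local alignment of s against itself. Only cells with j > i
// are filled, so the diagonal (and everything below it) stays zero and the
// trivial "whole string vs. itself" alignment cannot win.
SelfAlignment bestSelfAlignment(const std::string& s,
                                int match = 2, int mismatch = -1, int gap = -1) {
    const std::size_t n = s.size();
    std::vector<std::vector<int>> h(n + 1, std::vector<int>(n + 1, 0));
    SelfAlignment best;
    std::size_t bi = 0, bj = 0;

    for (std::size_t i = 1; i <= n; ++i) {
        for (std::size_t j = i + 1; j <= n; ++j) {
            int diag = h[i - 1][j - 1] + (s[i - 1] == s[j - 1] ? match : mismatch);
            int up = h[i - 1][j] + gap;
            int left = h[i][j - 1] + gap;
            h[i][j] = std::max({0, diag, up, left});
            if (h[i][j] > best.score) {
                best.score = h[i][j];
                bi = i;
                bj = j;
            }
        }
    }

    // Trace back from the best cell until the score drops to zero.
    std::size_t i = bi, j = bj;
    while (i > 0 && j > 0 && h[i][j] > 0) {
        int sub = (s[i - 1] == s[j - 1]) ? match : mismatch;
        if (h[i][j] == h[i - 1][j - 1] + sub) { --i; --j; }
        else if (h[i][j] == h[i - 1][j] + gap) { --i; }
        else { --j; }
    }
    best.firstStart = i;  best.firstEnd = bi;
    best.secondStart = j; best.secondEnd = bj;
    return best;
}
```

For a run of adjacent repeats the two aligned regions overlap, and the offset `secondStart - firstStart` is the length of the repeating unit. On the 32-character sample `d+c+d+f+d+c+d+f+d+c+d+f+d+c+d+f+` that offset comes out as 8.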

【Comments】:

【Solution 3】:

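LZW may help: it uses a dictionary of prefixes to search for repeated patterns and replaces such data with references to previous entries. I think it should not be hard to adapt it to your needs.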
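
For reference, a minimal sketch of the textbook LZW compression step in C++ (the function name `lzwCompress` is made up for this example). Note that it emits a list of integer codes rather than the `[unit]count` notation from the question, so it only illustrates the dictionary mechanism that would need adapting.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <vector>

// Textbook LZW: the dictionary starts with all single bytes (codes 0-255)
// and grows by one entry every time an unseen sequence is encountered.
std::vector<int> lzwCompress(const std::string& text) {
    std::unordered_map<std::string, int> dict;
    for (int c = 0; c < 256; ++c) {
        dict[std::string(1, static_cast<char>(c))] = c;
    }
    int nextCode = 256;

    std::string w;         // longest sequence matched so far
    std::vector<int> out;  // emitted codes
    for (char ch : text) {
        std::string wc = w + ch;
        if (dict.count(wc) != 0) {
            w = wc;  // keep extending the current match
        } else {
            assert(!w.empty());
            out.push_back(dict[w]);  // emit code for the longest known prefix
            dict[wc] = nextCode++;   // remember the new, longer sequence
            w = std::string(1, ch);
        }
    }
    if (!w.empty()) out.push_back(dict[w]);
    return out;
}
```

On repetitive input such as the sample `d+c+d+f+...` string, the number of emitted codes is smaller than the number of input characters, because later repeats are covered by multi-character dictionary entries.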

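【Comments】:

【Solution 4】:

Why not just use System.IO.Compression?

【Comments】:

  • Mostly because methods that amount to "the string needs to be more than 300-400 characters" aren't welcome.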