Filter (search and replace) array of bytes in an InputStream: answers

【Question title】: Filter (search and replace) array of bytes in an InputStream
【Posted】: 2011-12-06 07:27:11
【Question description】:

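I have an InputStream that takes an HTML file as its input. I have to get the bytes from the input stream.

I have a string: "XYZ". I want to convert this string to bytes and check whether that byte sequence occurs in the bytes I get from the InputStream. If it does, I have to replace the match with the byte sequence of some other string.

Can anyone help me with this? I have used regular expressions to find and replace in text, but I have no idea how to find and replace in a byte stream.

Previously I used jsoup to parse the HTML and replace the string, but because of some UTF encoding problems the file appeared corrupted when I did that.

TL;DR: my question is:

Is there a way to find and replace a string, in byte form, in a raw InputStream in Java?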

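【Question discussion】:

Why are you reading the file as a byte stream? If you read it as a string (for example, with a StringReader), you can solve your problem and forget about the encoding.
Why do you want to convert the string to a byte array and compare that, rather than comparing the original strings?
Basically what you need is tutorials.jenkov.com/java-howto/….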
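For files small enough to hold in memory, the idea behind the first comment can be sketched without any streaming filter: decode the bytes with the correct charset, replace on the String, and re-encode. The class and method names here are made up for illustration, and the sketch assumes the caller knows the file's encoding.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;

// Hypothetical helper, not from the thread: buffers the whole stream in memory.
class DecodedReplace {

    static InputStream replace(InputStream in, Charset charset,
                               String search, String replacement) throws IOException {
        // Read every byte of the source into a buffer.
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1)
            buffer.write(chunk, 0, n);

        // Decode, replace on characters, then encode with the same charset.
        String text = new String(buffer.toByteArray(), charset);
        byte[] result = text.replace(search, replacement).getBytes(charset);
        return new ByteArrayInputStream(result);
    }
}
```

This only avoids corruption if the charset passed in really matches the file; the memory cost grows with the file size, which is what the streaming answers avoid.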

Tags: java input bytearray

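【Solution 1】:

I am not sure you have chosen the best approach to your problem.

That said, I don't like (and have a policy against) answering questions with "don't", so here goes...

Have a look at FilterInputStream.

From the documentation:

A FilterInputStream contains some other input stream, which it uses as its basic source of data, possibly transforming the data along the way or providing additional functionality.

It was a fun exercise to write this up. Here is a complete example: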

import java.io.*;
import java.util.*;

class ReplacingInputStream extends FilterInputStream {

    LinkedList<Integer> inQueue = new LinkedList<Integer>();
    LinkedList<Integer> outQueue = new LinkedList<Integer>();
    final byte[] search, replacement;

    protected ReplacingInputStream(InputStream in,
                                   byte[] search,
                                   byte[] replacement) {
        super(in);
        this.search = search;
        this.replacement = replacement;
    }

    private boolean isMatchFound() {
        Iterator<Integer> inIter = inQueue.iterator();
        for (int i = 0; i < search.length; i++)
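            // Mask the byte so values >= 0x80 compare as unsigned (0..255), the way read() returns them.
            if (!inIter.hasNext() || (search[i] & 0xFF) != inIter.next())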
                return false;
        return true;
    }

    private void readAhead() throws IOException {
        // Work up some look-ahead.
        while (inQueue.size() < search.length) {
            int next = super.read();
            inQueue.offer(next);
            if (next == -1)
                break;
        }
    }

    @Override
    public int read() throws IOException {    
        // Refill the output queue only when no byte is already waiting to be returned.
        if (outQueue.isEmpty()) {
            readAhead();

            if (isMatchFound()) {
                for (int i = 0; i < search.length; i++)
                    inQueue.remove();

                for (byte b : replacement)
                    outQueue.offer(b & 0xFF); // unsigned value, so 0xFF is not mistaken for EOF (-1)
            } else
                outQueue.add(inQueue.remove());
        }

        return outQueue.remove();
    }

    // TODO: Override the other read methods.
}

Example usage

class Test {
    public static void main(String[] args) throws Exception {

        byte[] bytes = "hello xyz world.".getBytes("UTF-8");

        ByteArrayInputStream bis = new ByteArrayInputStream(bytes);

        byte[] search = "xyz".getBytes("UTF-8");
        byte[] replacement = "abc".getBytes("UTF-8");

        InputStream ris = new ReplacingInputStream(bis, search, replacement);

        ByteArrayOutputStream bos = new ByteArrayOutputStream();

        int b;
        while (-1 != (b = ris.read()))
            bos.write(b);

        System.out.println(new String(bos.toByteArray(), "UTF-8"));

    }
}

Given the string "hello xyz world." it prints:

hello abc world.
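The `// TODO: Override the other read methods.` left in the class above matters in practice: FilterInputStream's bulk read delegates straight to the wrapped stream and would bypass the replacing read(). One possible way to close that gap is sketched below. ByteWiseFilterInputStream and UpperCasingInputStream are hypothetical names invented for this sketch, not part of the answer.

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical base class: routes every bulk read through the single-byte
// read(), so a subclass only has to override read().
abstract class ByteWiseFilterInputStream extends FilterInputStream {

    protected ByteWiseFilterInputStream(InputStream in) {
        super(in);
    }

    @Override
    public abstract int read() throws IOException;

    @Override
    public int read(byte[] b, int off, int len) throws IOException {
        if (off < 0 || len < 0 || len > b.length - off)
            throw new IndexOutOfBoundsException();
        if (len == 0)
            return 0;
        int count = 0;
        while (count < len) {
            int next = read();              // the filtering read(), not in.read()
            if (next == -1)
                break;
            b[off + count++] = (byte) next;
        }
        return count == 0 ? -1 : count;     // -1 only if nothing could be read
    }

    // FilterInputStream.read(byte[]) already calls read(b, 0, b.length).

    @Override
    public long skip(long n) throws IOException {
        long skipped = 0;
        while (skipped < n && read() != -1) // skip filtered bytes, not raw ones
            skipped++;
        return skipped;
    }

    @Override
    public boolean markSupported() {
        return false;                       // filter state cannot be rewound
    }
}

// Tiny subclass used only to exercise the base class.
class UpperCasingInputStream extends ByteWiseFilterInputStream {

    UpperCasingInputStream(InputStream in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        int c = in.read();
        return (c >= 'a' && c <= 'z') ? c - 32 : c;
    }
}
```

A replacing stream could extend such a base class instead of FilterInputStream directly. Note that this bulk read blocks until the buffer is full or the stream ends, which is simpler but less responsive than the contract strictly requires.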

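【Discussion】:

+1 for a clean, queue-based implementation, but depending on the application it may matter that this simple approach is slow: O(M*N), where M is the pattern length and N is the file length. Also, depending on what you are searching for, ignoring the structure of the HTML may get you into trouble.
Good points, I completely agree with you. I just did the part I thought was fun :-) I didn't even implement all the read methods...
Try searching for new byte[] { (byte) 0xFF, (byte) 0x00 } and you will be surprised. When converting byte to integer you have to mask the value with 0xFF instead of simply casting it; for example, outQueue.offer((int) b); must be outQueue.offer((int) (b & 0xFF));
This is nice, but it also does not work for special characters. I had to use ` byte next = (byte) super.read(); inQueue.offer((int) next);` because otherwise super.read() converts things to int, which makes the lookup fail.
What do you mean by "special characters"?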
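The pitfall these comments circle around is Java's signed byte: InputStream.read() returns 0..255 (or -1 at end of stream), while a byte widened to int ranges from -128 to 127. A minimal illustration follows; ByteMaskDemo is a made-up name.

```java
// Hypothetical demo class: compares a pattern byte with a value from read().
class ByteMaskDemo {

    // Plain comparison: the byte is sign-extended, so (byte) 0xFF becomes -1.
    static boolean naiveEquals(byte patternByte, int readValue) {
        return patternByte == readValue;
    }

    // Masked comparison: the byte is mapped to 0..255 first, matching read().
    static boolean maskedEquals(byte patternByte, int readValue) {
        return (patternByte & 0xFF) == readValue;
    }
}
```

With these definitions, naiveEquals((byte) 0xFF, 0xFF) is false while maskedEquals((byte) 0xFF, 0xFF) is true. Worse, naiveEquals((byte) 0xFF, -1) is true, so an unmasked 0xFF can be confused with the end-of-stream marker. ASCII bytes are unaffected, which is why the problem only shows up with "special" (non-ASCII) characters.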

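【Solution 2】:

I needed something like this too and decided to roll my own solution instead of using the example above by @aioobe. Have a look at the code. You can pull the library from Maven Central, or just copy the source code.

This is how you use it. In this case, I use nested instances to replace two patterns, fixing both DOS and Mac line endings.

new ReplacingInputStream(new ReplacingInputStream(is, "\r\n", "\n"), "\r", "\n");

Here is the full source code: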

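import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Assumption: Validate is the Apache Commons Lang 3 helper; the original listing omits its imports.
import org.apache.commons.lang3.Validate;

/**
 * Simple FilterInputStream that can replace occurrences of bytes with something else.
 */
public class ReplacingInputStream extends FilterInputStream {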

    // while matching, this is where the bytes go.
    int[] buf=null;
    int matchedIndex=0;
    int unbufferIndex=0;
    int replacedIndex=0;

    private final byte[] pattern;
    private final byte[] replacement;
    private State state=State.NOT_MATCHED;

    // simple state machine for keeping track of what we are doing
    private enum State {
        NOT_MATCHED,
        MATCHING,
        REPLACING,
        UNBUFFER
    }

    /**
     * @param is input
     * @return nested replacing stream that replaces \r\n (DOS) and \r (old Mac) line endings with UNIX ones "\n".
     */
    public static InputStream newLineNormalizingInputStream(InputStream is) {
        return new ReplacingInputStream(new ReplacingInputStream(is, "\r\n", "\n"), "\r", "\n");
    }

    /**
     * Replace occurrences of pattern in the input. Note: input is assumed to be UTF-8 encoded. If that is not the case, use the byte[] based pattern and replacement.
     * @param in input
     * @param pattern pattern to replace.
     * @param replacement the replacement or null
     */
    public ReplacingInputStream(InputStream in, String pattern, String replacement) {
        this(in,pattern.getBytes(StandardCharsets.UTF_8), replacement==null ? null : replacement.getBytes(StandardCharsets.UTF_8));
    }

    /**
     * Replace occurrences of pattern in the input.
     * @param in input
     * @param pattern pattern to replace
     * @param replacement the replacement or null
     */
    public ReplacingInputStream(InputStream in, byte[] pattern, byte[] replacement) {
        super(in);
        Validate.notNull(pattern);
        Validate.isTrue(pattern.length>0, "pattern length should be > 0", pattern.length);
        this.pattern = pattern;
        this.replacement = replacement;
        // we will never match more than the pattern length
        buf = new int[pattern.length];
    }

    @Override
    public int read(byte[] b, int off, int len) throws IOException {
        // copy of parent logic; we need to call our own read() instead of super.read(), which delegates instead of calling our read
        if (b == null) {
            throw new NullPointerException();
        } else if (off < 0 || len < 0 || len > b.length - off) {
            throw new IndexOutOfBoundsException();
        } else if (len == 0) {
            return 0;
        }

        int c = read();
        if (c == -1) {
            return -1;
        }
        b[off] = (byte)c;

        int i = 1;
        try {
            for (; i < len ; i++) {
                c = read();
                if (c == -1) {
                    break;
                }
                b[off + i] = (byte)c;
            }
        } catch (IOException ee) {
            // deliberately ignored, as in InputStream.read(byte[], int, int): return the bytes read so far
        }
        return i;

    }

    @Override
    public int read(byte[] b) throws IOException {
        // call our own read
        return read(b, 0, b.length);
    }

    @Override
    public int read() throws IOException {
        // use a simple state machine to figure out what we are doing
        int next;
        switch (state) {
        case NOT_MATCHED:
            // we are not currently matching, replacing, or unbuffering
            next=super.read();
            if((pattern[0] & 0xFF) == next) { // mask: read() returns 0..255, bytes are signed
                // start with a fresh buffer, discarding whatever was there
                buf=new int[pattern.length];
                // make sure we start at 0
                matchedIndex=0;

                buf[matchedIndex++]=next;
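                if(pattern.length == 1) {
                    // edge case: a one-byte pattern is already fully matched
                    if(replacement==null || replacement.length==0) {
                        // nothing to emit; drop the byte and stay in NOT_MATCHED
                        matchedIndex=0;
                    } else {
                        state=State.REPLACING;
                        // reset replace counter
                        replacedIndex=0;
                    }
                } else {
                    // pattern is longer than one byte, keep matching
                    state=State.MATCHING;
                }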
                // recurse to continue matching
                return read();
            } else {
                return next;
            }
        case MATCHING:
            // the previous bytes matched part of the pattern
            next=super.read();
            if((pattern[matchedIndex] & 0xFF)==next) { // mask: compare as unsigned
                buf[matchedIndex++]=next;
                if(matchedIndex==pattern.length) {
                    // we've found a full match!
                    if(replacement==null || replacement.length==0) {
                        // the replacement is empty, go straight to NOT_MATCHED
                        state=State.NOT_MATCHED;
                        matchedIndex=0;
                    } else {
                        // start replacing
                        state=State.REPLACING;
                        replacedIndex=0;
                    }
                }
            } else {
                // mismatch -> unbuffer
                buf[matchedIndex++]=next;
                state=State.UNBUFFER;
                unbufferIndex=0;
            }
            return read();
        case REPLACING:
            // we've fully matched the pattern and are returning bytes from the replacement
            next=replacement[replacedIndex++] & 0xFF; // unsigned, so 0xFF is not returned as -1 (EOF)
            if(replacedIndex==replacement.length) {
                state=State.NOT_MATCHED;
                replacedIndex=0;
            }
            return next;
        case UNBUFFER:
            // we partially matched the pattern before encountering a non matching byte
            // we need to serve up the buffered bytes before we go back to NOT_MATCHED
            next=buf[unbufferIndex++];
            if(unbufferIndex==matchedIndex) {
                state=State.NOT_MATCHED;
                matchedIndex=0;
            }
            return next;

        default:
            throw new IllegalStateException("no such state " + state);
        }
    }

    @Override
    public String toString() {
        return state.name() + " " + matchedIndex + " " + replacedIndex + " " + unbufferIndex;
    }

}

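【Discussion】:

【Solution 3】:

The following approach will work, but I don't know how big the performance impact is.

Wrap the InputStream with an InputStreamReader,
wrap the InputStreamReader with a FilterReader that replaces strings, then
wrap the FilterReader with a ReaderInputStream.

Choosing the right encoding is essential; otherwise the contents of the stream will be corrupted.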
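The first two steps of that chain might look roughly like the sketch below. ReplacingReader and decodeAndReplace are names invented here, and UTF-8 is an assumption about the input. The last step is left out so the sketch runs on the JDK alone: the JDK has no ReaderInputStream, and the answer presumably means the one from Apache Commons IO.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

// Hypothetical FilterReader that replaces every occurrence of a string.
class ReplacingReader extends FilterReader {
    private final String search;
    private final String replacement;
    private final StringBuilder window = new StringBuilder(); // look-ahead characters
    private int pendingPos;       // next index of replacement still to be returned
    private boolean eof = false;

    ReplacingReader(Reader in, String search, String replacement) {
        super(in);
        if (search.isEmpty())
            throw new IllegalArgumentException("search must not be empty");
        this.search = search;
        this.replacement = replacement;
        this.pendingPos = replacement.length();   // nothing pending yet
    }

    @Override
    public int read() throws IOException {
        while (true) {
            // Hand out the rest of a replacement that is in progress.
            if (pendingPos < replacement.length())
                return replacement.charAt(pendingPos++);
            // Top up the look-ahead window to the length of the search string.
            while (!eof && window.length() < search.length()) {
                int c = in.read();
                if (c == -1) eof = true;
                else window.append((char) c);
            }
            if (window.length() == 0)
                return -1;
            if (search.contentEquals(window)) {
                window.setLength(0);
                pendingPos = 0;   // start emitting the replacement
                continue;
            }
            char first = window.charAt(0);
            window.deleteCharAt(0);
            return first;
        }
    }

    @Override
    public int read(char[] cbuf, int off, int len) throws IOException {
        if (len == 0) return 0;
        int n = 0;
        while (n < len) {
            int c = read();
            if (c == -1) break;
            cbuf[off + n++] = (char) c;
        }
        return n == 0 ? -1 : n;
    }

    // Steps 1 and 2 of the chain; the charset is an assumption of this sketch.
    static Reader decodeAndReplace(InputStream in, String search, String replacement) {
        return new ReplacingReader(
                new InputStreamReader(in, StandardCharsets.UTF_8), search, replacement);
    }
}
```

Because matching happens on decoded characters, a multi-byte character can never be split by a replacement; the price is the decode and re-encode work on every character.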

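If you want to use regular expressions to replace strings, you can use Streamflyer, a tool of mine, which is a convenient alternative to FilterReader. You will find an example for byte streams on the Streamflyer web page. Hope this helps.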

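【Discussion】:

【Solution 4】:

There is no built-in functionality for search and replace on byte streams (InputStream).

And a method for completing this task efficiently and correctly is not immediately obvious. I have implemented the Boyer-Moore algorithm for streams, and it works well, but it took some time. Without an algorithm like this, you have to resort to a brute-force approach where you look for the pattern starting at every position in the stream, which can be slow.

Even if you decode the HTML as text, using a regular expression to match patterns might be a bad idea, since HTML is not a "regular" language.

So, even though you have run into some difficulties, I suggest you pursue your original approach of parsing the HTML as a document. While you are having trouble with the character encoding, it will probably be easier in the long run to fix the right solution than to jury-rig the wrong one.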
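The Boyer-Moore implementation mentioned in this answer is not shown. As a rough illustration of what a non-brute-force streaming matcher can look like, here is a sketch that swaps in a different, simpler linear-time technique: Knuth-Morris-Pratt (KMP). The class name and structure are invented for this example, and matches are replaced left to right without overlap.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayDeque;

// Hypothetical KMP-based replacing stream (KMP, not the answer's Boyer-Moore).
class KmpReplacingInputStream extends InputStream {
    private final InputStream source;
    private final byte[] pattern;
    private final byte[] replacement;
    private final int[] fail;      // fail[i]: longest proper border of pattern[0..i]
    private final ArrayDeque<Integer> out = new ArrayDeque<Integer>();
    private int matched = 0;       // how many pattern bytes are currently held back
    private boolean eof = false;

    KmpReplacingInputStream(InputStream source, byte[] pattern, byte[] replacement) {
        if (pattern.length == 0)
            throw new IllegalArgumentException("empty pattern");
        this.source = source;
        this.pattern = pattern.clone();
        this.replacement = replacement.clone();
        this.fail = new int[pattern.length];
        for (int i = 1, k = 0; i < pattern.length; i++) {
            while (k > 0 && pattern[i] != pattern[k]) k = fail[k - 1];
            if (pattern[i] == pattern[k]) k++;
            fail[i] = k;
        }
    }

    // Queues the first count bytes of the array as unsigned values.
    private void emit(byte[] bytes, int count) {
        for (int i = 0; i < count; i++) out.add(bytes[i] & 0xFF);
    }

    @Override
    public int read() throws IOException {
        while (out.isEmpty() && !eof) {
            int c = source.read();
            if (c == -1) {
                eof = true;
                emit(pattern, matched);   // flush a trailing partial match
                matched = 0;
                break;
            }
            // On a mismatch, fall back to the longest border and release the
            // held-back bytes that can no longer be part of a match.
            while (matched > 0 && (pattern[matched] & 0xFF) != c) {
                int shorter = fail[matched - 1];
                emit(pattern, matched - shorter);
                matched = shorter;
            }
            if ((pattern[matched] & 0xFF) == c) matched++;
            else out.add(c);
            if (matched == pattern.length) {
                emit(replacement, replacement.length);
                matched = 0;
            }
        }
        return out.isEmpty() ? -1 : out.poll();
    }

    @Override
    public void close() throws IOException {
        source.close();
    }
}
```

Each source byte is read once and the fallback loop is amortized constant time, so the cost is linear in the stream length plus the pattern length. The held-back bytes never need a separate buffer because they are, by construction, a prefix of the pattern. None of this addresses the answer's other point: matching raw bytes still ignores the structure of the HTML.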

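【Discussion】:

【Solution 5】:

I needed a solution to this, but found that the answers here incurred too much memory and/or CPU overhead. In simple benchmarks, the solution below significantly outperformed the others in these respects.

This solution is especially memory-efficient, incurring no measurable cost even with streams larger than a gigabyte.

That said, this is not a zero-CPU-cost solution. The CPU/processing-time overhead is probably reasonable for all but the most demanding or resource-sensitive scenarios, but the overhead is real and should be considered when evaluating whether this solution is worthwhile in a given context.

In my case, the largest real-world file we process is about 6 MB, where we see added latency of about 170 ms with 44 URL replacements. This is for a Zuul-based reverse proxy running on AWS ECS with a single CPU share (1024). For most files (under 100 KB), the added latency is sub-millisecond. Under high concurrency (and thus CPU contention) the added latency could increase, but we are currently able to process hundreds of files concurrently on a single node with no noticeable latency impact.

The solution we are using: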

import java.io.IOException;
import java.io.InputStream;

public class TokenReplacingStream extends InputStream {

    private final InputStream source;
    private final byte[] oldBytes;
    private final byte[] newBytes;
    private int tokenMatchIndex = 0;
    private int bytesIndex = 0;
    private boolean unwinding;
    private int mismatch;
    private int numberOfTokensReplaced = 0;

    public TokenReplacingStream(InputStream source, byte[] oldBytes, byte[] newBytes) {
        assert oldBytes.length > 0;
        this.source = source;
        this.oldBytes = oldBytes;
        this.newBytes = newBytes;
    }

    @Override
    public int read() throws IOException {

        if (unwinding) {
            if (bytesIndex < tokenMatchIndex) {
                return oldBytes[bytesIndex++] & 0xFF; // unsigned, so 0xFF is not returned as -1 (EOF)
            } else {
                bytesIndex = 0;
                tokenMatchIndex = 0;
                unwinding = false;
                return mismatch;
            }
        } else if (tokenMatchIndex == oldBytes.length) {
            if (bytesIndex == newBytes.length) {
                bytesIndex = 0;
                tokenMatchIndex = 0;
                numberOfTokensReplaced++;
            } else {
                return newBytes[bytesIndex++] & 0xFF; // unsigned, so 0xFF is not returned as -1 (EOF)
            }
        }

        int b = source.read();
        if (b == (oldBytes[tokenMatchIndex] & 0xFF)) { // mask: read() returns 0..255, bytes are signed
            tokenMatchIndex++;
        } else if (tokenMatchIndex > 0) {
            mismatch = b;
            unwinding = true;
        } else {
            return b;
        }

        return read();

    }

    @Override
    public void close() throws IOException {
        source.close();
    }

    public int getNumberOfTokensReplaced() {
        return numberOfTokensReplaced;
    }

}

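【Discussion】:

【Solution 6】:

I came up with this simple piece of code when I needed to serve a template file in a Servlet, replacing a certain keyword with a value. It should be pretty fast and use little memory. Combined with piped streams, I guess you could use it for all sorts of things.

/JC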

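public static void replaceStream(InputStream in, OutputStream out, String search, String replace) throws IOException
{
    // Note: both wrappers use the platform default charset, as in the original.
    Writer writer = new OutputStreamWriter(out);
    replaceStream(new InputStreamReader(in), writer, search, replace);
    // Flush, otherwise characters still buffered in the writer never reach out.
    writer.flush();
}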

public static void replaceStream(Reader in, Writer out, String search, String replace) throws IOException
{
    char[] searchChars = search.toCharArray();
    int[] buffer = new int[searchChars.length];

    int x, r, si = 0, sm = searchChars.length;
    while ((r = in.read()) != -1) { // -1 marks end of stream; a NUL character (0) is valid data

        if (searchChars[si] == r) {
            // The char matches our pattern
            buffer[si++] = r;

            if (si == sm) {
                // We have reached a matching string
                out.write(replace);
                si = 0;
            }
        } else if (si > 0) {
            // No match and buffered char(s), empty buffer and pass the char forward
            for (x = 0; x < si; x++) {
                out.write(buffer[x]);
            }
            si = 0;
            out.write(r);
        } else {
            // No match and nothing buffered, just pass the char forward
            out.write(r);
        }
    }

    // Empty buffer
    for (x = 0; x < si; x++) {
        out.write(buffer[x]);
    }
}

【Discussion】: