std::string manipulation: whitespace, "newline escapes '\'" and comments # (answers)

【Question title】: std::string manipulation: whitespace, "newline escapes '\'" and comments #
【Posted】: 2010-05-22 14:49:34
【Question】:

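Looking for affirmation here. I have some hand-written code, which I'm not ashamed to say I'm proud of, that reads a file, removes leading whitespace, processes newline escapes '\' and removes comments starting with #. It also removes all empty lines (including whitespace-only ones). Any thoughts/suggestions? I could probably replace some of the std::cout calls with std::runtime_error exceptions... but that's not a priority here :)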

const int RecipeReader::readRecipe()
{
    ifstream is_recipe(s_buffer.c_str());
    if (!is_recipe)
        cout << "unable to open file" << endl;
    while (getline(is_recipe, s_buffer))
    {
        // whitespace+comment
        removeLeadingWhitespace(s_buffer);
        processComment(s_buffer);
        // newline escapes + append all subsequent lines with '\'
        processNewlineEscapes(s_buffer, is_recipe);
        // store the real text line
        if (!s_buffer.empty())
            v_s_recipe.push_back(s_buffer);
        s_buffer.clear();
    }
    is_recipe.close();
    return 0;
}

void RecipeReader::processNewlineEscapes(string &s_string, ifstream &is_stream)
{
    string s_temp;
    size_t sz_index = s_string.find_first_of("\\");
    while (sz_index <= s_string.length())
    {
        if (getline(is_stream,s_temp))
        {
            removeLeadingWhitespace(s_temp);
            processComment(s_temp);
            s_string = s_string.substr(0,sz_index-1) + " " + s_temp;
        }
        else
            cout << "Error: newline escape '\' found at EOF" << endl;
        sz_index = s_string.find_first_of("\\");
    }
}

void RecipeReader::processComment(string &s_string)
{
    size_t sz_index = s_string.find_first_of("#");
    s_string = s_string.substr(0,sz_index);
}

void RecipeReader::removeLeadingWhitespace(string &s_string)
{
    const size_t sz_length = s_string.size();
    size_t sz_index = s_string.find_first_not_of(" \t");
    if (sz_index <= sz_length)
    s_string = s_string.substr(sz_index);
    else if ((sz_index > sz_length) && (sz_length != 0)) // "empty" lines with only whitespace
        s_string.clear();
}

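Some extra info: the first s_buffer passed to the ifstream contains the filename; std::string s_buffer is a class data member, and so is std::vector v_s_recipe. Any comments are welcome :)

UPDATE: so as not to be ungrateful, here is my replacement, an all-in-one function that does what I want for now (the future holds: parentheses, maybe quotes...):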

void readRecipe(const std::string &filename)
{
    string buffer;
    string line;
    size_t index;
    ifstream file(filename.c_str());
    if (!file)
        throw runtime_error("Unable to open file.");

    while (getline(file, line))
    {
        // whitespace removal
        line.erase(0, line.find_first_not_of(" \t\r\n\v\f"));
        // comment removal TODO: store these for later output
        index = line.find_first_of("#");
        if (index != string::npos)
            line.erase(index, string::npos);
        // ignore empty buffer
        if (line.empty())
            continue;
        // process newline escapes
        index = line.find_first_of("\\");
        if (index != string::npos)
        {
            line.erase(index,string::npos); // ignore everything after '\'
            buffer += line;
            continue; // read next line
        }
        else // no newline escapes found
        {
            buffer += line;
            recipe.push_back(buffer);
            buffer.clear();
        }
    }
}

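【Question comments】:

Since you're not asking a specific question here, shouldn't this be CW?
Only return references as const; const int is unnecessary, since you are not manipulating a member variable.

Tags: c++ algorithm string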

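【Answer 1】:

Definitely drop the Hungarian notation.

【Comments】:

The "is_" prefix is particularly unfortunate.
I hadn't taken the time to comment on this, but Joel Spolsky wrote a great article on Hungarian notation and how it came to be misrepresented. The form you are using here is the perverted form: indicating the type of the variable is redundant and brings nothing. If you want to read more, I can only recommend Joel's article joelonsoftware.com/articles/Wrong.html. As usual with Joel it is a bit lengthy; you can use your browser's search function to cut to the chase ;)
Thanks for the read. I've abandoned it because of all the negative comments on StackOverflow :).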

【Answer 2】:

It's not bad, but I think you are treating std::basic_string<T> too much as a string and not enough as an STL container. For example:

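void RecipeReader::removeLeadingWhitespace(string &s_string)
{
    // std::not1(isspace) does not compile: isspace is not an adaptable functor.
    // A lambda taking unsigned char also avoids undefined behaviour for
    // negative char values. Requires <algorithm> and <cctype> (C++11).
    s_string.erase(s_string.begin(),
        std::find_if(s_string.begin(), s_string.end(),
                     [](unsigned char c) { return !std::isspace(c); }));
}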

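【Comments】:

I've been reading up on isspace, and it seems there are issues with locales and all. Should I worry, or try something else that is closer to my current implementation?
@rubenvb: It has locale issues, yes, but no more than using " \t" does ;) If you're worried, you can use boost::is_any_of(" \t").
This is easier to do with istringstream::skipws. Technically, string isn't a container.
@Potatoswatter: Why isn't string a container? It seems to satisfy the requirements listed here: sgi.com/tech/stl/Container.html
@Billy: It requires char_traits to be specialized for the contained type. On second thought, I guess it depends on whether 23.1/3 is read as a necessary or a sufficient condition. And 21.3/2 does assert that basic_string is a reversible sequence. So I guess technically you're right :v) . Anyway, the right tool for the job here is a stringstream, since the whole line will eventually be parsed. For a multi-pass parser, which looks like the goal here but may be overkill, OP could store an array of stringstreams instead of an array of strings.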

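【Answer 3】:

A few comments:

As another answer (which I +1'd) says: drop the Hungarian notation. It does nothing except add unimportant clutter to every line. Also, an ifstream yielding an is_ prefix is ugly; is_ usually indicates a boolean.
Naming a function processXXX gives almost no information about what it actually does. Use removeXXX, as you did with the removeLeadingWhitespace function.
The processComment function performs an unnecessary copy and assignment. Use s.erase(index, string::npos); (npos is the default, but this is more obvious).
It's not clear what your program does, but if you need to post-process a file like this, you might consider storing a different file format (such as html or xml). That depends on the goal.
Using find_first_of('#') may give you some false positives. If it appears inside quotes, it doesn't necessarily indicate a comment. (But again, that depends on your file format.)
Using find_first_of(c) with a single character can be simplified to find(c).
The processNewlineEscapes function duplicates some of the functionality of the readRecipe function. You might consider refactoring along these lines: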


string s_buffer;
string s_line;
while (getline(is_recipe, s_line)) {
  // Sanitize the raw line.
  removeLeadingWhitespace(s_line);
  removeComments(s_line);
  // Skip empty lines.
  if (s_line.empty()) continue;
  // A trailing backslash marks a continued line; strip it so it does not
  // end up in the stored text.
  const bool escaped = (*s_line.rbegin() == '\\');
  if (escaped) s_line.erase(s_line.size() - 1);
  // Add the sanitized line to the buffer.
  s_buffer += s_line;
  // Collect buffer across all escaped lines.
  if (escaped) continue;
  // This line is not escaped, now I can process the buffer.
  v_s_recipe.push_back(s_buffer);
  s_buffer.clear();
}
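
The find/erase suggestions in the list above can be sketched as follows. This is only a minimal illustration, not the answerer's code: the name removeComments is borrowed from the refactoring above, and the npos guard is an addition, because std::string::erase throws std::out_of_range when the start index is larger than the string's size, which is what an unguarded erase(npos, npos) would do on a line without '#'.

```cpp
#include <string>

// Strip a trailing '#' comment in place (no copy, no reassignment).
// Assumption: '#' always starts a comment; a quoted '#' is not handled here.
void removeComments(std::string& line)
{
    const std::string::size_type index = line.find('#'); // find(c) instead of find_first_of
    if (index != std::string::npos)   // erase(npos, ...) would throw std::out_of_range
        line.erase(index, std::string::npos);
}
```

Calling it on "foo # bar" should leave "foo " behind, while a line without '#' is left untouched.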

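【Comments】:

Thanks. This is actually a preprocessor for the "project files" of a "build system" (a personal, academic project; don't take "build system" too seriously). Newline escapes are used here the way Qt4's qmake reads them, to improve the readability of the file itself. I'll also look at the duplicated functionality; maybe it will all fit into one function in the end. I'll work on the issues raised and retry with more STL functionality and better function names.
@rubenvb: Ah, I see. On re-reading, I realized what the "escape" was :) I've enhanced the answer with an idea of how to combine the two functions.

【Answer 4】:

I don't like methods that modify their arguments. Why not return strings instead of modifying the input parameter? For example: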

string RecipeReader::processComment(const string &s)
{
    size_t index = s.find_first_of("#");
    return s.substr(0, index); // use the parameter s; s_string does not exist here
}

Personally, I feel this clarifies the intent and makes it more obvious what the method does.
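
As a rough sketch of where this style leads, the two cleanup steps can be written as value-returning helpers that compose in a single expression. The names withoutComment and withoutLeadingWhitespace are hypothetical and not part of the original RecipeReader.

```cpp
#include <string>

// Both helpers take their input by const reference and return a new string.
// Names are illustrative only; they are not part of the original class.
std::string withoutComment(const std::string& s)
{
    return s.substr(0, s.find('#')); // substr(0, npos) returns the whole string
}

std::string withoutLeadingWhitespace(const std::string& s)
{
    const std::string::size_type first = s.find_first_not_of(" \t");
    return first == std::string::npos ? std::string() : s.substr(first);
}
```

A caller can then write withoutComment(withoutLeadingWhitespace(line)) and keep the original line unchanged, at the cost of an extra copy per step.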

【Comments】:

【Answer 5】:

I would consider replacing all of your processing code (almost everything you have written) with boost::regex code.
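
The answer gives no code, so the following is only a guess at what a regex-based cleanup might look like. It uses std::regex from C++11 in place of boost::regex (an assumption made to keep the sketch dependency-free), and the function name sanitizeLine is hypothetical.

```cpp
#include <regex>
#include <string>

// Remove leading blanks and a trailing '#' comment with one regex_replace.
// Assumptions: std::regex stands in for boost::regex, and '#' always starts
// a comment (quoted '#' is not handled).
std::string sanitizeLine(const std::string& line)
{
    static const std::regex junk("^[ \\t]+|#.*$");
    return std::regex_replace(line, junk, "");
}
```

Newline escapes would still need separate handling, since they span more than one line of input.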

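【Comments】:

Bah! (Even though it isn't bad, who wants to pull in a library of a few hundred kB for a couple of simple string routines like these?)
That was my thought when I was working out how to solve this. Heck, I'd use tr1/regex rather than boost :)

【Answer 6】:

A few comments:

If s_buffer contains the name of the file to open, it should have a better name, such as s_filename.
The s_buffer member variable should not be reused to store temporary data while reading the file. A local variable in the function would do just as well; the buffer does not need to be a member variable.
If the filename does not need to be stored as a member variable, it could be passed as an argument to readRecipe().
processNewlineEscapes() should check that the backslash it found is at the end of the line before appending the next line. Currently, any backslash at any position triggers appending the next line at the position of that backslash. Also, if there are several backslashes, find_last_of() might be easier to use than find_first_of().
When checking the result of find_first_of() in processNewlineEscapes() and removeLeadingWhitespace(), it would be cleaner to compare against string::npos to see whether anything was found.

The logic at the end of removeLeadingWhitespace() can be simplified: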

size_t sz_index = s_string.find_first_not_of(" \t");
if (sz_index != s_string.npos)
   s_string = s_string.substr(sz_index);
else // "empty" lines with only whitespace
   s_string.clear();
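
The end-of-line check suggested for processNewlineEscapes() could look roughly like this. It is a sketch under the assumption that trailing blanks after the backslash are tolerated; the helper name endsWithEscape is hypothetical.

```cpp
#include <string>

// True if the last backslash in the line is followed only by blanks,
// i.e. it really sits at the end of the line.
bool endsWithEscape(const std::string& line)
{
    const std::string::size_type slash = line.find_last_of('\\');
    if (slash == std::string::npos)
        return false;                       // no backslash at all
    // Nothing but spaces/tabs may follow the backslash.
    return line.find_first_not_of(" \t", slash + 1) == std::string::npos;
}
```

With this check, a line such as "a \ b" is no longer treated as continued, while "a b \" still is.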

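【Comments】:

I'm changing the names and rearranging my class right now. The backslash behaviour is what I want: everything after the backslash is ignored. I'm also using npos now (I didn't know how to use it before). You guys are great :D

【Answer 7】:

You may want to look at Boost.String. It is a simple collection of algorithms for working with strings, and notably has trim methods :)

Now, on to the review itself:

Don't bother removing the Hungarian notation; if that is your style, use it. However, you should try to improve the names of your methods and variables. processXXX definitely does not indicate anything useful...

Functionally, I am worried about your assumptions. The main issue here is that you don't care about escape sequences (\n uses a backslash, for example) and you don't worry about the presence of character strings: a line with a # inside a quoted string would yield an invalid line because of your "comment" preprocessing.

Also, because you remove comments before processing the newline escapes: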

i = 3; # comment \
         running comment

will be parsed as

i = 3; running comment

which is syntactically incorrect.
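
One way around this, sketched under the assumptions noted in the comments, is to join continued lines first and strip the comment afterwards, so that a comment carried over with '\' is dropped as a whole. The helper name logicalLine is hypothetical.

```cpp
#include <string>
#include <vector>

// Join physical lines that end in '\' into one logical line, then strip the
// comment. Assumptions: '\' is the very last character of a continued line,
// and '#' never appears inside quotes.
std::string logicalLine(const std::vector<std::string>& physical)
{
    std::string joined;
    for (std::vector<std::string>::size_type i = 0; i < physical.size(); ++i)
    {
        const std::string& part = physical[i];
        const bool continued = !part.empty() && part[part.size() - 1] == '\\';
        joined += continued ? part.substr(0, part.size() - 1) : part;
        if (!continued)
            break;                 // the logical line ends here
    }
    return joined.substr(0, joined.find('#')); // comment stripped last
}
```

For the two-line example above, this ordering keeps only the text before the '#', so the running comment no longer leaks into the result.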

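From an interface point of view: there is no benefit in making these methods class members; you don't really need an instance of RecipeReader...

Finally, I find it awkward that two methods read from the stream.

A pet peeve of mine: returning by const value serves no purpose.

Here is my own version, since I believe it is easier to show than to discuss: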

// header file

#include <fstream>
#include <string>
#include <utility>
#include <vector>

std::vector<std::string> readRecipe(const std::string& fileName);

std::string extractLine(std::ifstream& file);

std::pair<std::string,bool> removeNewlineEscape(const std::string& line);
std::string removeComment(const std::string& line);

// source file

#include <iostream>
#include <boost/algorithm/string.hpp>

std::vector<std::string> readRecipe(const std::string& fileName)
{
  std::vector<std::string> result;

  std::ifstream file(fileName.c_str());
  if (!file) std::cout << "Could not open: " << fileName << std::endl;

  // Keep reading until the stream fails; an empty logical line (blank or
  // comment-only) is skipped instead of being mistaken for end of file
  while (file)
  {
    std::string line = extractLine(file);
    if (!line.empty()) result.push_back(line);
  } // looping on the lines

  return result;
} // readRecipe


std::string extractLine(std::ifstream& file)
{
  std::string line, buffer;
  while(getline(file, buffer))
  {
    std::pair<std::string,bool> r = removeNewlineEscape(buffer);
    line += boost::trim_left_copy(r.first); // remove leading whitespace
                                            // based on the current locale
    if (!r.second) break;
    line += " "; // as we append, we insert a whitespace
                 // in order to avoid unintended token concatenation
  }

  return removeComment(line);
} // extractLine

//< Returns the line, minus the '\' character
//<         if it was the last significant one
//< Returns a boolean indicating whether or not the line continues
//<         (true if it's necessary to concatenate with the next line)
std::pair<std::string,bool> removeNewlineEscape(const std::string& line)
{
  std::pair<std::string,bool> result;
  result.second = false;

  // index of the last significant character, or npos if there is none
  size_t pos = line.find_last_not_of(" \t");
  if (std::string::npos != pos && line[pos] == '\\')
  {
    result.second = true;
    result.first = line.substr(0, pos); // drop the '\' and what follows it
  }
  else
  {
    // keep everything up to and including line[pos];
    // npos + 1 wraps to 0, so a blank line yields an empty string
    result.first = line.substr(0, pos + 1);
  }

  return result;
} // removeNewlineEscape

//< The main difficulty here is NOT to confuse a # inside a string
//< with a # signalling a comment
//< assuming strings are contained within "", let's roll
std::string removeComment(const std::string& line)
{
  size_t pos = line.find_first_of("\"#");
  while(std::string::npos != pos)
  {
    if (line[pos] == '"')
    {
      // We have detected the beginning of a string, we move pos to its end
      // beware of the tricky presence of a '\' right before '"'...
      pos = line.find_first_of("\"", pos+1);
      while (std::string::npos != pos && line[pos-1] == '\\')
        pos = line.find_first_of("\"", pos+1);
    }
    else // line[pos] == '#'
    {
      // We have found the comment marker in a significant position
      break;
    }
    pos = line.find_first_of("\"#", pos+1);
  } // looking for comment marker

  return line.substr(0, pos);
} // removeComment

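It is fairly inefficient (though I trust the compiler to optimize), but I believe it behaves correctly, although it is untested, so take it with a grain of salt. I mainly focused on solving the functional issues; the naming convention I follow differs from yours, but I don't think that matters.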
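
For comparison, the same quote-aware comment removal can be sketched as a single pass with two flags instead of repeated find_first_of calls. This is an alternative illustration, not the answerer's code; the name stripComment is hypothetical, and it assumes strings use double quotes with '\' escaping the next character.

```cpp
#include <string>

// Single pass over the line: track whether we are inside a "..." string and
// whether the previous character was a backslash escape.
// Assumption: strings use double quotes and '\' escapes the next character.
std::string stripComment(const std::string& line)
{
    bool inString = false;
    bool escaped = false;
    for (std::string::size_type i = 0; i < line.size(); ++i)
    {
        const char c = line[i];
        if (escaped)
            escaped = false;                 // this character was escaped
        else if (c == '\\')
            escaped = inString;              // escapes only count inside strings
        else if (c == '"')
            inString = !inString;            // string starts or ends
        else if (c == '#' && !inString)
            return line.substr(0, i);        // comment starts here
    }
    return line; // no comment marker outside a string
}
```

Because the escape flag consumes exactly one character, an escaped backslash right before a closing quote is also treated as a pair rather than as an escaped quote.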

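【Comments】:

Thanks for the review, I've addressed most of the issues. This thing is part of a bigger picture and I want this functionality in a class; it's C++ after all, not plain C. I'm trying not to use boost, and in terms of code volume the end result (see the update) is shorter than your boost version. I'll think about the newline/comment ordering, but I plan for every comment line to have its own '#', much like C++ '//' comments, so your running comment really is a syntax error (and will be dealt with once I have the courage to write a syntax checker). Quotes and parentheses are WIP :) Thanks

【Answer 8】:

I'd like to point out a small and sweet version, which lacks \ support but skips blank lines and comments. (Note the std::ws in the call to std::getline.)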

#include <algorithm>
#include <iostream>
#include <sstream>
#include <string>

int main()
{
  std::stringstream input(
      "    # blub\n"
      "# foo bar\n"
      " foo# foo bar\n"
      "bar\n"
      );

  std::string line;
  while (std::getline(input >> std::ws, line)) {
    line.erase(std::find(line.begin(), line.end(), '#'), line.end());

    if (line.empty()) {
      continue;
    }

    std::cout << "line: \"" << line << "\"\n";
  }
}

Output:

line: "foo"
line: "bar"
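
The missing \ support could be layered on top of the same std::ws loop roughly as follows. This is a sketch, not part of the answer: the function name readLogicalLines is hypothetical, and it assumes a continuation backslash must be the last character of the line once the comment has been removed.

```cpp
#include <algorithm>
#include <sstream>
#include <string>
#include <vector>

// Read logical lines: skip leading whitespace and blank lines via std::ws,
// drop '#' comments, and join lines whose last character is '\'.
// Assumption: '\' must be the final character once the comment is removed.
std::vector<std::string> readLogicalLines(std::istream& input)
{
    std::vector<std::string> result;
    std::string line, buffer;
    while (std::getline(input >> std::ws, line)) {
        line.erase(std::find(line.begin(), line.end(), '#'), line.end());
        if (line.empty())
            continue;
        const bool continued = line[line.size() - 1] == '\\';
        if (continued)
            line.erase(line.size() - 1);   // drop the '\' itself
        buffer += line;
        if (continued)
            continue;                      // keep collecting
        result.push_back(buffer);
        buffer.clear();
    }
    if (!buffer.empty())
        result.push_back(buffer);          // input ended on a continued line
    return result;
}
```

It takes any std::istream, so it can be fed a std::istringstream for testing or a std::ifstream for a real recipe file.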

【Comments】: