How to count the number of bold words and italic words in a Markdown syntax file: answers

【Question title】: How to count the number of bold words and italic words in a markdown syntax file
【Posted】: 2016-04-22 03:56:33
【Question description】:

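I have read that bold and italic words can be denoted in Markdown by ** bold_text ** and * italic_text * respectively. To make text both bold and italic, you can use 4 asterisks for bold (two on each side) and 2 underscores for italic (one on each side), or vice versa.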

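I want to write a bash script that determines the number of bold words and italic words. I suppose this boils down to counting the double asterisks, single asterisks, double underscores and single underscores. My question is how to count the occurrences of a specific string (such as "**" or "__") in a file, so that I can tell how many bold and italic words there are.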

#!/bin/bash

if [ -z "$1" ]; then
    echo "No input file specified."
else 
    ls $1 > /dev/null 2> /dev/null && 
    echo $(cat $1 | grep -o '\<**>\' | wc -c) || echo "File $1 does not exist."
fi

Sample input file:

**This is bold and _italic_** text.

Expected output:

Bold words: 5 Italic words: 1 Bold and italic words: 1

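【Comments on the question】:

Please show your coding effort.
OK, I'll add the script I have so far.
Added sample input and expected output.
On the one hand, you say "my question is: how do I count the number of ** and __ in a file" (not that hard), but on the other hand your sample input/output seems to expect this to be translated into the exact number of words enclosed between them (not that easy). Which is it?
I suggest you get a copy of pandoc, the de facto standard for parsing Markdown, take the standard writer script (easy-to-read Lua, and you don't need a compiler) and modify it to count bold and italic words... github.com/jgm/pandoc/blob/master/data/sample.lua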

Tags: regex linux bash markdown

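【Solution 1】:

The simple approach

Some assumptions:

Bold uses __ and italics use * (although it could just as well be ** and _)
There is no "funny stuff" such as (inline) code containing these characters, escaped _ or *, or list items with a leading *, any of which would throw off the count

Now, to count bold words, we can use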

grep -Po '__.*?__' infile.md | grep -o '[^[:space:]]\+' | wc -l

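This looks for anything between two pairs of __. I use the Perl regex engine (-P) to enable non-greedy matching (.*?); otherwise, something like __bold__ not bold __bold__ would be a single match. -o returns only the matching parts, one per line.

The second grep matches words: any sequence of one or more non-whitespace characters. wc -l then counts the lines of output, which is the number of words.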
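To make the difference between greedy and non-greedy matching concrete, here is a minimal sketch that runs both patterns on the line mentioned above. It assumes GNU grep built with PCRE support (for -P); the variable names are only illustrative.

```bash
#!/bin/bash
# Sample line: two separate bold spans with plain text in between.
line='__bold__ not bold __bold__'

# Non-greedy: each __...__ pair is its own match, one match per output line.
lazy_matches=$(printf '%s\n' "$line" | grep -Po '__.*?__' | wc -l)

# Greedy: .* runs up to the last __ on the line, giving one long match.
greedy_matches=$(printf '%s\n' "$line" | grep -Po '__.*__' | wc -l)

# Words inside the non-greedy matches (the __ stay attached to the words).
bold_words=$(printf '%s\n' "$line" | grep -Po '__.*?__' |
    grep -o '[^[:space:]]\+' | wc -l)

echo "non-greedy matches: $lazy_matches"
echo "greedy matches: $greedy_matches"
echo "bold words: $bold_words"
```

With the greedy pattern, the words "not bold" would end up inside the single match and be counted as bold.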

The same for italics:

grep -Po '\*.*?\*' infile.md | grep -o '[^[:space:]]\+' | wc -l

To combine the two (bold and italic), the commands have to be chained. For italics within bold:

grep -Po '__.*?__' infile.md | grep -Po '\*.*?\*' | grep -o '[^[:space:]]\+' | wc -l

And for bold within italics:

grep -Po '\*.*?\*' infile.md | grep -Po '__.*?__' | grep -o '[^[:space:]]\+' | wc -l
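As a quick check of the two chained pipelines, the following sketch feeds them two sample lines, one with bold nested in italics and one with italics nested in bold. It assumes GNU grep with -P support and input that already follows the assumptions above; the variable names are made up for the example.

```bash
#!/bin/bash
# Two sample lines: bold inside italics, and italics inside bold.
sample='And *italics with a __bold__ word inside*
And __bold words with *italics* inside__'

# Italic words inside bold spans: extract bold spans first, then italics.
italic_in_bold=$(printf '%s\n' "$sample" | grep -Po '__.*?__' |
    grep -Po '\*.*?\*' | grep -o '[^[:space:]]\+' | wc -l)

# Bold words inside italic spans: extract italic spans first, then bold.
bold_in_italic=$(printf '%s\n' "$sample" | grep -Po '\*.*?\*' |
    grep -Po '__.*?__' | grep -o '[^[:space:]]\+' | wc -l)

echo "italic within bold: $italic_in_bold"
echo "bold within italic: $bold_in_italic"
echo "bold and italic total: $((italic_in_bold + bold_in_italic))"
```

Each pipeline only sees the nesting in one direction, which is why the two results have to be added to get the total.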

Cleaning up a more realistic file

Now, a real Markdown file may hold a few extra surprises (see "Some assumptions"):

* List item with **bold word**

Line with **bold words and \* an escaped asterisk**

Here is an *italicized* word

And *italics with a **bold** word inside*

And **bold words with *italics* inside**

    Code can have tons of *, ** and _ and we want to ignore them all

Also `inline code can have * and ** and _ to be ignored`, right?

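would be rendered as

List item with bold word

Line with bold words and * an escaped asterisk

Here is an italicized word

And italics with a bold word inside

And bold words with italics inside

    Code can have tons of *, ** and _ and we want to ignore them all

Also inline code can have * and ** and _ to be ignored, right?

One way to clean up such content is with a sed script: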

/^$/d                           # Delete empty lines
/^    /d                        # Delete code lines (start with four spaces)
s/`[^`]*`//g                    # Remove inline code
/^\* /s/^\* (.*)/\1/            # Remove asterisk from list items
s/\\\*//g                       # Remove escaped asterisks
s/\\_//g                        # Remove escaped underscores
s/\*\*/__/g                     # Make sure bold uses underscores
s/(^|[^_])_([^_]|$)/\1\*\2/g    # Make sure italics use asterisks

This results in:

$ sed -rf md.sed infile.md
List item with __bold word__
Line with __bold words and  an escaped asterisk__
Here is an *italicized* word
And *italics with a __bold__ word inside*
And __bold words with *italics* inside__
Also , right?

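which is ready to be consumed by the commands from the first part.

Putting it all together

Everything in one script that takes the Markdown file name as its argument: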

#!/bin/bash

fname="$1"
tempfile="$(mktemp)"

sed -r '
    /^$/d
    /^    /d
    s/`[^`]*`//g
    /^\* /s/^\* (.*)/\1/
    s/\\\*//g
    s/\\_//g
    s/\*\*/__/g
    s/(^|[^_])_([^_]|$)/\1\*\2/g
' "$fname" > "$tempfile"

bold=$(grep -Po '__.*?__' "$tempfile" | grep -o '[^[:space:]]\+' | wc -l)
italic=$(grep -Po '\*.*?\*' "$tempfile" | grep -o '[^[:space:]]\+' | wc -l)
both=$((
    $(grep -Po '__.*?__' "$tempfile" |
        grep -Po '\*.*?\*' | grep -o '[^[:space:]]\+' | wc -l)
    +
    $(grep -Po '\*.*?\*' "$tempfile" |
        grep -Po '__.*?__' | grep -o '[^[:space:]]\+' | wc -l)
))

rm -f "$tempfile"

echo "Bold words: $bold"
echo "Italic words: $italic"
echo "Bold and italic words: $both"

It can be used like this:

$ ./wordcount infile.md
Bold words: 14
Italic words: 8
Bold and italic words: 2
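The script repeats the same grep chain several times. One possible way to cut the repetition, sketched here under the same assumptions (GNU grep with -P, input already cleaned), is to wrap the two steps in small functions. The names spans and count_words are hypothetical helpers, not part of the original script.

```bash
#!/bin/bash
# Hypothetical helper: print every match of a PCRE pattern, one per line.
spans() {
    grep -Po -e "$1"
}

# Hypothetical helper: count whitespace-separated words on standard input.
count_words() {
    grep -o '[^[:space:]]\+' | wc -l
}

# Already-cleaned sample text (bold as __...__, italics as *...*).
cleaned='Here is an *italicized* word
And __bold words with *italics* inside__'

bold=$(printf '%s\n' "$cleaned" | spans '__.*?__' | count_words)
italic=$(printf '%s\n' "$cleaned" | spans '\*.*?\*' | count_words)

echo "Bold words: $bold"
echo "Italic words: $italic"
```

The nested counts can then be written as spans '__.*?__' | spans '\*.*?\*' | count_words and the reverse.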

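Shortcomings

This can be thrown off by words that contain underscores. Some Markdown flavours ignore these and treat them as part of the word.
I'm sure I have missed some edge cases in the cleanup.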
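The underscore problem can be reproduced with a short sketch. It applies the last substitution of the cleanup script to a line containing a snake_case identifier and then runs the italic count on the result. It assumes GNU sed and GNU grep; the sample line and variable names are invented for the illustration.

```bash
#!/bin/bash
# A line with no emphasis at all, only an identifier with underscores.
line='call my_function_name here'

# The "make sure italics use asterisks" substitution from the cleanup script.
converted=$(printf '%s\n' "$line" | sed -r 's/(^|[^_])_([^_]|$)/\1\*\2/g')

# The italic word count now reports a word that was never italic.
false_italics=$(printf '%s\n' "$converted" | grep -Po '\*.*?\*' |
    grep -o '[^[:space:]]\+' | wc -l)

echo "after cleanup: $converted"
echo "italic words counted: $false_italics"
```

The underscores inside the identifier are turned into asterisks, so the middle part of the name is counted as one italic word.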

【Comments】:

【Solution 2】:

My solution is to change ** into something else to make the problem easier.
I chose ~; you can use a different character instead.

$ cat test
**bold**
*italic*
**bold**

$ sed 's/\*\*/~/g' test
~bold~
*italic*
~bold~

Now, for bold, count the number of ~ characters and then divide by 2. To count the ~:

$ sed 's/\*\*/~/g' test | tr -d -c '~'
~~~~
$ sed 's/\*\*/~/g' test | tr -d -c '~' | wc -c
4

Now divide it by 2. First save the output in a variable.

$ bold=$(sed 's/\*\*/~/g' test | tr -d -c '~' | wc -c)
$ expr "$bold" / 2
2

Do something similar for italics.
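A minimal sketch of what "something similar" could look like: once every ** has been replaced by ~, any asterisk that is left should be an italic marker, so the same count-and-halve step applies. This assumes the file contains no ~ of its own and no list markers or escaped asterisks. Like the bold count above, it counts emphasized spans, not the words inside them. The temporary file stands in for the test file from the answer.

```bash
#!/bin/bash
# Recreate the three-line test file from the answer in a temporary location.
testfile=$(mktemp)
printf '%s\n' '**bold**' '*italic*' '**bold**' > "$testfile"

# Replace every ** with ~ so the only asterisks left are italic markers.
marked=$(sed 's/\*\*/~/g' "$testfile")

# Count the delimiter characters; each span has an opening and a closing one.
bold_spans=$(( $(printf '%s' "$marked" | tr -d -c '~' | wc -c) / 2 ))
italic_spans=$(( $(printf '%s' "$marked" | tr -d -c '*' | wc -c) / 2 ))

rm -f "$testfile"

echo "Bold spans: $bold_spans"
echo "Italic spans: $italic_spans"
```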

【Comments】: