sed to remove URLs from a file

【Question title】: sed to remove URLs from a file
【Posted】: 2010-11-26 07:41:26
【Question】:

I am trying to write a sed expression that removes URLs from a file.

Example:

http://samgovephotography.blogspot.com/ updated my blog just a little bit ago. Take a chance to check out my latest work. Hope all is well:)   

Meet Former Child Star & Author Melissa Gilbert 6/15/09 at LA's B&N https://hollywoodmomblog.com/?p=2442 Thx to HMB Contributor @kdpartak :)

But I can't get it to work:

sed 's/[\w \W \s]*http[s]*:\/\/\([\w \W]\)\+[\w \W \s]*/ /g' posFile

Fixed!

This handles almost all cases, even malformed URLs:

sed 's/[\w \W \s]*http[s]*[a-zA-Z0-9 : \. \/ ; % " \W]*/ /g' positiveTweets | grep "http" | more

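【Comments】:

When dealing with URLs, file paths and the like, I prefer to use "|" as the sed delimiter so that I don't have to escape /. Example: sed 's|/path/to/some/file/|/newpath/to/new/file/|g'
@JP19, I like that, I'll give it a try.

Tags: sed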

【Solution 1】:

The following removes http:// or https:// and everything up to the next whitespace character:

sed -e 's!http\(s\)\{0,1\}://[^[:space:]]*!!g' posFile  
 updated my blog just a little bit ago. Take a chance to check out my latest work. Hope all is well:)   

Meet Former Child Star & Author Melissa Gilbert 6/15/09 at LA's B&N  Thx to HMB Contributor @kdpartak :)

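Edit:

I should have used:

sed -e 's!http[s]\?://\S*!!g' posFile

"[s]\?" is a far more readable way of writing "an optional s" than "\(s\)\{0,1\}".

"\S*" is a more readable version of "any non-space characters" than "[^[:space:]]*".

I must have been using the sed that ships with the Mac when I wrote this answer (brew install gnu-sed FTW). Both "\?" and "\S" are GNU extensions, so the edited command needs GNU sed.

There are better URL regular expressions out there (ones that account for schemes other than HTTP(S), for instance), but this one works for the examples you gave. Why complicate things?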
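As a rough check of the portable form, the sketch below runs the substitution over shortened versions of the two sample lines from the question. The function name `strip_urls` is made up for illustration, and `https\{0,1\}` is used as an equivalent spelling of `http\(s\)\{0,1\}`; it sticks to POSIX syntax, so it should behave the same with macOS sed and GNU sed.

```shell
#!/bin/sh
# strip_urls is a hypothetical helper name, not part of sed.
# It removes http:// or https:// plus everything up to the next whitespace.
# \{0,1\} means "zero or one of the preceding item" (here, the letter s).
strip_urls() {
    sed -e 's!https\{0,1\}://[^[:space:]]*!!g'
}

# Two shortened sample lines taken from the question.
cleaned=$(printf '%s\n' \
    'http://samgovephotography.blogspot.com/ updated my blog' \
    "at LA's B&N https://hollywoodmomblog.com/?p=2442 Thx to HMB" | strip_urls)

printf '%s\n' "$cleaned"
```

The space that used to sit next to each URL is left in place, which is why the first line starts with a space and the second contains a double space.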

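【Discussion】:

Johnsyweb, could you explain your sed expression? In particular the {0,1} notation.
Thanks for the Mac comment. I spent 10 minutes on my Mac testing perfectly valid regexes before I read your answer and tried it on a CentOS box, where it worked first time.
For anyone wondering about the 's! ... !!g' bit in the edited answer: "!" is simply used as the delimiter instead of "/", so the slashes in the pattern don't have to be escaped. From my testing, sed -e 's!http[s]\?://\S*!!g' appears to be identical to sed -e 's/http[s]\?:\/\/\S*//g'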
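The delimiter equivalence described in the last comment can be checked with a small sketch like the one below. The sample line and the variable names are invented for illustration, and the POSIX spelling of the pattern is used so that it does not depend on GNU sed.

```shell
#!/bin/sh
# Sample line is made up for illustration.
line='see https://example.com/a/b?x=1 for details'

# The same substitution written with two different delimiters.
# With "!" the slashes need no escaping; with "/" each one becomes "\/".
with_bang=$(printf '%s\n' "$line" | sed -e 's!https\{0,1\}://[^[:space:]]*!!g')
with_slash=$(printf '%s\n' "$line" | sed -e 's/https\{0,1\}:\/\/[^[:space:]]*//g')

printf '%s\n%s\n' "$with_bang" "$with_slash"
```

Both commands should print the same line with the URL removed.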

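【Solution 2】:

The accepted answer provides the approach I used to remove URLs and the like from my files. However, it leaves "blank" lines behind. Here is a solution.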

sed -i -e 's/http[s]\?:\/\/\S*//g ; s/www\.\S*//g ; s/ftp:\S*//g' input_file

perl -i -pe 's/^'`echo "\012"`'${2,}//g' input_file

The GNU sed flags and expressions used are:

-i    Edit in-place
-e    [-e script] --expression=script : basically, add the commands in script
      (expression) to the set of commands to be run while processing the input
 ^    Match start of line
 $    Match end of line


 \?   Match zero or one of the preceding regular expression
{2,}  Match 2 or more of the preceding regular expression
\S*   Zero or more non-whitespace characters; alternative to: [^[:space:]]*

However,

sed -i -e 's/http[s]\?:\/\/\S*//g ; s/www\.\S*//g ; s/ftp:\S*//g'

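leaves non-printing characters behind, presumably \n (newlines). The standard sed-based approaches for stripping leading and trailing tabs and spaces from "blank" lines, e.g.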

sed -i 's/^[ \t]*//; s/[ \t]*$//'

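don't work here: sed reads its input one line at a time, so a substitution cannot replace newlines unless you use a "branch label" to pull several lines into the pattern space.

The workaround is to use the following perl expression: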

perl -i -pe 's/^'`echo "\012"`'${2,}//g'

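It uses a shell substitution,

'`echo "\012"`'

to substitute the octal value

\012

(i.e. the newline character, \n) that occurs 2 or more times,

{2,}

(otherwise we would unwrap all the lines) with something else; here:

//

that is, nothing.

[The second reference below provides a wonderful table of these values!]

The perl flags used are: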

-p  Places a printing loop around your command,
    so that it acts on each line of standard input

-i  Edit in-place

-e  Allows you to provide the program as an argument,
    rather than in a file

References:

perl flags: Perl flags -pe, -pi, -p, -w, -d, -i, -t?
ASCII control codes: https://www.cyberciti.biz/faq/unix-linux-sed-ascii-control-codes-nonprintable/
Removing URLs: sed to remove URLs from a file
Branch labels: How can I replace a newline (\n) using sed?
GNU sed manual: https://www.gnu.org/software/sed/manual/sed.html
Regular expression quick guide: https://www.gnu.org/software/sed/manual/html_node/Regular-Expressions.html

Example:

$ cat url_test_input.txt

Some text ...
https://stackoverflow.com/questions/4283344/sed-to-remove-urls-from-a-file
https://www.google.ca/search?dcr=0&ei=QCsyWtbYF43YjwPpzKyQAQ&q=python+remove++citations&oq=python+remove++citations&gs_l=psy-ab.3...1806.1806.0.2004.1.1.0.0.0.0.61.61.1.1.0....0...1c.1.64.psy-ab..0.0.0....0.-cxpNc6youY
http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
https://bbengfort.github.io/tutorials/2016/05/19/text-classification-nltk-sckit-learn.html
http://datasynce.org/2017/05/sentiment-analysis-on-python-through-textblob/
https://www.google.ca/?q=halifax&gws_rd=cr&dcr=0&ei=j7UyWuGKM47SjwOq-ojgCw
http://www.google.ca/?q=halifax&gws_rd=cr&dcr=0&ei=j7UyWuGKM47SjwOq-ojgCw
www.google.ca/?q=halifax&gws_rd=cr&dcr=0&ei=j7UyWuGKM47SjwOq-ojgCw
ftp://ftp.ncbi.nlm.nih.gov/
ftp://ftp.ncbi.nlm.nih.gov/1000genomes/ftp/alignment_indices/20100804.alignment.index
Some more text.

$ sed -e 's/http[s]\?:\/\/\S*//g ; s/www\.\S*//g ; s/ftp:\S*//g' url_test_input.txt > a

$ cat a

Some text ...










Some more text.

$ perl -i -pe 's/^'`echo "\012"`'${2,}//g' a

Some text ...
Some more text.

$
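As a possible alternative to the separate perl step, the emptied lines can also be dropped in the same sed run with a `d` command. This is only a sketch and not the method used in this answer: the function name `remove_urls_and_blank_lines` is made up, the input is a shortened version of the test file above, and the POSIX class `[^[:space:]]` stands in for `\S`. Unlike the substitution-only commands, it also deletes lines that were already blank in the input, which may or may not be what you want.

```shell
#!/bin/sh
# remove_urls_and_blank_lines is a hypothetical helper name.
# The first three expressions strip the URLs; the last one deletes any
# line that is empty or whitespace-only after the substitutions have run.
remove_urls_and_blank_lines() {
    sed -e 's!https\{0,1\}://[^[:space:]]*!!g' \
        -e 's!www\.[^[:space:]]*!!g' \
        -e 's!ftp:[^[:space:]]*!!g' \
        -e '/^[[:space:]]*$/d'
}

# Shortened version of the test input used in the example above.
result=$(printf '%s\n' \
    'Some text ...' \
    'https://stackoverflow.com/questions/4283344/sed-to-remove-urls-from-a-file' \
    'www.google.ca/?q=halifax' \
    'ftp://ftp.ncbi.nlm.nih.gov/' \
    'Some more text.' | remove_urls_and_blank_lines)

printf '%s\n' "$result"
```

Because sed applies its commands to each line in order, the `d` command sees the line after the URLs have been removed, so no multi-line handling is needed.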

【Discussion】: