Extracting text from HTML files with bash

【Question title】: Extracting text from an HTML file with bash
【Posted】: 2017-01-01 12:50:05
【Question description】:

I have a script:

cd ../data;
dossier=$(ls crawl);

let "compte = 1";

for file in $dossier
do

lynx --dump --nolist $file >> ../data/txt/$compte'.txt';

let "compte = compte + 1"; 
done

I am using lynx to retrieve the text from all of my HTML files, but when I open the resulting text files, this is what they contain:

410 GONE

This doesn't exist any more. Try html.com.

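I don't understand why. When I am in the terminal, inside the crawl folder, and run a lynx dump on each HTML file by hand, the text file is generated correctly. But when I use the script to read all of my HTML files and run lynx on them, the result is wrong.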

【Question comments】:

Tags: html bash lynx

【Solution 1】:

You need the protocol and (I'm not sure about this part) the path. For example:

lynx -dump file:///where/my/file/is/file.html
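Applied to the loop in the question, this could look like the sketch below. It assumes the likely cause is that `ls crawl` returns bare file names while the script runs from `../data`, so lynx cannot find `$file` locally and tries to fetch it as a remote address. The function names `to_file_url` and `dump_all` are made up for illustration. Note that the sketch overwrites each output file with `>` rather than appending with `>>`, and it does not percent-encode special characters such as spaces in paths.

```shell
#!/usr/bin/env bash

# Build a file:// URL from a path (relative or absolute).
# Hypothetical helper, not part of lynx.
to_file_url() {
    local dir
    # Resolve the containing directory to an absolute path.
    dir=$(cd "$(dirname "$1")" && pwd) || return 1
    printf 'file://%s/%s\n' "$dir" "$(basename "$1")"
}

# Dump every regular file in directory $1 to numbered text files in $2.
dump_all() {
    local src=$1 out=$2 compte=1 file
    mkdir -p "$out"
    for file in "$src"/*; do
        [ -f "$file" ] || continue
        # Pass lynx an explicit file:// URL instead of a bare name.
        lynx -dump -nolist "$(to_file_url "$file")" > "$out/$compte.txt"
        compte=$((compte + 1))
    done
}

# Usage, with the directory names from the question:
# dump_all ../data/crawl ../data/txt
```

Iterating over `"$src"/*` instead of parsing the output of `ls` keeps the directory prefix on every name and also copes with file names that contain spaces.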

【Comments】: