Web scraping with bash: how to write the script (answers)

【Question title】: Web scraping with bash how to write script
【Posted】: 2020-06-04 21:25:57
【Question】:

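I am trying to scrape https://www.mcdelivery.com.pk/pk/browse/menu.html with bash and filter the menu on the left to show the deal names and prices. I need help; please let me know how to write a bash script that gets the names and prices of the deals. Thanks in advance.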

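【Comments】:

Welcome to Stack Overflow! Questions seeking code help must include the shortest code necessary to reproduce the problem in the question itself, preferably in a Stack Snippet. See How to create a Minimal, Reproducible Example. It is also very helpful to show the expected results in your question and to quote any (exact) errors you are getting. You are expected to show any research you have put into solving this question yourself.
If your question is about pure bash, why does it carry the python tag?
This is too broad/vague and probably off-topic. See How to Ask and the help center.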

Tags: bash web-scraping

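【Solution 1】:

Use sed or grep to pull the names and prices out of the HTML file. Every deal name is wrapped in an h5 element with class product-title, and every price is wrapped in a span element with class starting-price. Take a look at the examples in this answer [1]; they should help you move forward.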

[1] Extract part of the code and parse HTML in bash
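
As a rough illustration of the sed approach described above, here is a minimal sketch. The sample markup, the deal names, and the helper names `extract_titles` and `extract_prices` are made up for the example; only the class names product-title and starting-price come from the answer, and the real page may be laid out differently.

```bash
#!/bin/bash
# Hypothetical sample markup that mimics the class names mentioned above.
sample_html='<div class="panel panel-default panel-product">
  <h5 class="product-title">Sample Deal 1</h5>
  <span class="starting-price">Rs 100</span>
</div>
<div class="panel panel-default panel-product">
  <h5 class="product-title">Sample Deal 2</h5>
  <span class="starting-price">Rs 200</span>
</div>'

# Print the text of every <h5 class="product-title"> element read from stdin.
extract_titles() {
    sed -n 's/.*<h5 class="product-title">\([^<]*\)<\/h5>.*/\1/p'
}

# Print the text of every <span class="starting-price"> element read from stdin.
extract_prices() {
    sed -n 's/.*<span class="starting-price">\([^<]*\)<\/span>.*/\1/p'
}

printf '%s\n' "$sample_html" | extract_titles
printf '%s\n' "$sample_html" | extract_prices
```

This only works when each element sits on a single line, which is one reason a regex-based approach is fragile against real HTML.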

【Discussion】:

Hey, it would be great if you could help me with the code. Please provide a proper sample script.

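【Solution 2】:

Note that the site's layout can change, so relying on line numbers alone may not work. If you do not want to use an HTML tag parser, you can do everything in the scripting language itself.

(In bash, do the following with arrays; if it also has to run in other shells, use plain variables only.)

Use curl to download the HTML file
($0 = script name, used here as the base of the temporary file name)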
```
curl "${URL}${LINE}" > "${0%.*}.tmp~"
```
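Use the string manipulation ${LINE##*&} to print only 'catId=12' from your URI.
Use grep to search for the string 'catId=12', with the -w flag so that only whole-word matches count.
Pipe the output into another grep and search for the item name with '<span.*</span>'.

Use cut to extract the item's text into the variable file, which becomes the file name.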

file="$(grep -w "${LINE##*&}" "${0%.*}.tmp~" | grep -o '<span.*</span>' | cut -d\> -f2 | cut -d\< -f1)"
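
The two parameter expansions used so far can be tried in isolation. This is a small sketch; the variable names `script`, `cat_param`, `cat_id`, and `tmp_file` are illustrative, and `script` is a stand-in for `$0`.

```bash
#!/bin/bash
LINE='?daypartId=1&catId=12'
script='main.sh'                 # stand-in for $0 (assumed script name)

# ${var##pattern} removes the longest prefix matching the pattern,
# here everything up to and including the last '&'.
cat_param="${LINE##*&}"          # catId=12

# ${var#pattern} removes the shortest matching prefix, here 'catId='.
cat_id="${cat_param#*=}"         # 12

# ${var%pattern} removes the shortest matching suffix, here '.sh'.
tmp_file="${script%.*}.tmp~"     # main.tmp~

printf '%s\n' "$cat_param" "$cat_id" "$tmp_file"
```

If the script is started as ./main with no extension, `${0%.*}` strips from the leading dot and leaves an empty string, so a fixed temporary name or mktemp may be the safer choice.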

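Convert the item name into a convenient file name (we do not want characters such as * or ' in file names):
use a whitelist of alphanumeric characters, spaces (and ._-), and remove every other character by string manipulation;
replace all spaces with _.

Loop over the file name and replace every run of underscores (__, ___, ...) with a single _.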

file="${file//[^[:alnum:]._ -]/}"    # keep only alphanumerics, '.', '_', space and '-'
file="${file// /_}"
while [[ "${file}" =~ "__" ]]
  do
    file="${file//__/_}"
done
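
The same clean-up can be wrapped in a function so it is easy to try on a few inputs. This is a sketch; the function name `sanitize_name` and the sample input are made up for the example.

```bash
#!/bin/bash
# Turn an arbitrary item name into a conservative file name.
sanitize_name() {
    local name=$1
    # Whitelist: drop everything except alphanumerics, '.', '_', space and '-'.
    name="${name//[^[:alnum:]._ -]/}"
    # Replace every space with an underscore.
    name="${name// /_}"
    # Collapse runs of underscores into a single one.
    while [[ $name == *__* ]]; do
        name="${name//__/_}"
    done
    printf '%s\n' "$name"
}

sanitize_name "Deals & Offers*"    # prints Deals_Offers
```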

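Use grep to search for <div class="panel panel-default panel-product"> and save the byte offsets as product_offset_1 - product_offset_5.
Save the file size in bytes as the last entry, product_offset_6.

Have grep count the matches and store the count in the variable j.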

product_offset=()
for offset in $(grep -bha '<div class="panel panel-default panel-product">' "${file}.html" | cut -d: -f1)
  do
     product_offset+=($offset)
done
# file size in bytes as last product_offset
product_offset+=($(stat -c%s "${file}.html"))
# count matches
j=$(grep -hac '<div class="panel panel-default panel-product">' "${file}.html")
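
To see what the offsets look like, the step can be run against a small local file. This is a sketch under assumptions: the sample markup and the function name `collect_offsets` are invented, and `wc -c` is used for the file size instead of `stat -c%s`, because that stat syntax is specific to GNU coreutils.

```bash
#!/bin/bash
marker='<div class="panel panel-default panel-product">'

# Fill the global array product_offset with the byte offset of every line
# that contains the marker, followed by the file size as a closing sentinel.
collect_offsets() {
    local html=$1 offset
    product_offset=()
    while IFS= read -r offset; do
        product_offset+=("$offset")
    done < <(grep -abF -- "$marker" "$html" | cut -d: -f1)
    product_offset+=("$(( $(wc -c < "$html") ))")
}

# Hypothetical sample page written to a temporary file.
html_file=$(mktemp)
cat > "$html_file" <<'EOF'
<html>
<div class="panel panel-default panel-product">
<h5 class="product-title">Sample Deal 1</h5>
</div>
<div class="panel panel-default panel-product">
<h5 class="product-title">Sample Deal 2</h5>
</div>
</html>
EOF

collect_offsets "$html_file"
j=$(( ${#product_offset[@]} - 1 ))    # number of product sections
printf 'sections: %s\n' "$j"
printf 'offsets:  %s\n' "${product_offset[*]}"
rm -f "$html_file"
```

Note that grep -b reports the offset of the start of the matching line, not of the match itself, which is fine here because each section is cut from one marker line to the next.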

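Create a loop over product_offset_1 - product_offset_6 and compute the length of each section in bytes.

Inside the loop: use dd to print only the section that starts at offset_1.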

i=0
while [ $i -lt $j ]
  do
    # calculate product_offset
    beg=${product_offset[$i]}
    end=${product_offset[$((i+1))]}
    len=$(expr ${end:-0} - ${beg:-0})

    # print only part of file
    dd if="${file}.html" bs=1 skip=$beg count=$len status=none

    i=$((i+1))
done
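
dd with bs=1 copies one byte per read, which can be slow on large pages. A possible alternative, not part of the original answer, is to cut the same byte range with tail and head. The function name `slice_bytes` is made up, and `head -c` is assumed to be available (it is in GNU and BSD userlands, although it is not required by POSIX).

```bash
#!/bin/bash
# Print LEN bytes of FILE starting at the 0-based byte offset BEG.
slice_bytes() {
    local file=$1 beg=$2 len=$3
    # tail -c +N starts output at byte N, counted from 1.
    tail -c "+$((beg + 1))" "$file" | head -c "$len"
}

demo=$(mktemp)
printf '0123456789' > "$demo"
slice_bytes "$demo" 3 4; echo    # prints 3456
rm -f "$demo"
```

Inside the loop it would be called as `slice_bytes "${file}.html" "$beg" "$len"` in place of the dd command.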

Inside the loop: search each section for <h5 class="product-title">.

Use cut to extract the value of product-title into title_1 - title_5.

title=()    # declare once, before the loop
    # inside the loop: search title within product_offset
    item="$(dd if="${file}.html" bs=1 skip=${beg:-0} count=${len:-0} status=none | grep '<h5 class="product-title">' | cut -d\> -f2 | cut -d\< -f1)"
    title+=("$item")

Inside the loop: search each section for <span class="starting-price">.

Use cut to extract the value of starting-price into price_1 - price_5
(same as above).

price=()    # declare once, before the loop
    # inside the loop: search price within product_offset
    item="$(dd if="${file}.html" bs=1 skip=${beg:-0} count=${len:-0} status=none | grep '<span class="starting-price">' | cut -d\> -f2 | cut -d\< -f1)"
    price+=("$item")
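
The point of working section by section is that every title stays paired with its own price, even when a product has no price. The same pairing can be sketched without byte offsets by reading the page line by line. This is an alternative illustration, not the method used in this answer; the function name `parse_products` and the sample markup are invented, and it assumes each element sits on a single line.

```bash
#!/bin/bash
# Read HTML from stdin and fill the global arrays title and price,
# one entry per product panel, so that index i always refers to the same product.
parse_products() {
    local line idx=-1
    local re_title='<h5 class="product-title">([^<]*)<'
    local re_price='<span class="starting-price">([^<]*)<'
    title=()
    price=()
    while IFS= read -r line || [[ -n $line ]]; do
        # A new panel starts a new (initially empty) record.
        if [[ $line == *'class="panel panel-default panel-product"'* ]]; then
            idx=$((idx + 1))
            title[idx]=''
            price[idx]=''
        fi
        # Ignore everything before the first panel.
        if (( idx < 0 )); then
            continue
        fi
        if [[ $line =~ $re_title ]]; then
            title[idx]=${BASH_REMATCH[1]}
        fi
        if [[ $line =~ $re_price ]]; then
            price[idx]=${BASH_REMATCH[1]}
        fi
    done
    return 0
}

# Hypothetical sample: the second product has no price.
sample='<div class="panel panel-default panel-product">
<h5 class="product-title">Sample Deal 1</h5>
<span class="starting-price">Rs 100</span>
</div>
<div class="panel panel-default panel-product">
<h5 class="product-title">Sample Deal 2</h5>
</div>'

parse_products <<< "$sample"
for i in "${!title[@]}"; do
    printf '%s | %s\n' "${title[i]}" "${price[i]:-n/a}"
done
```

The function has to be fed through a redirection or here-string rather than a pipe, because a pipe would run it in a subshell and the arrays would be lost.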

Finally, loop over all the variables, using $i as the index.

Print the title and price arrays to a file (or do this in the same loop as above).

# read array title + price    
i=0
while [ $i -lt $j ]
  do
    # echo -e "(${title[$i]}, ${price[$i]})\n" | tee -a "${file}.txt"
    printf "%s-\tName:\t%s\n\tPrice:\t%s\n\n" $((i+1)) "${title[$i]}" "${price[$i]}" | tee -a "${file}.txt"
    i=$((i+1))
done
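
The counter loop can also be written by iterating over the array indices directly. This is a sketch with made-up sample data; `print_products` and the temporary output file are illustrative names.

```bash
#!/bin/bash
# Hypothetical sample data standing in for the arrays filled earlier.
title=("Sample Deal 1" "Sample Deal 2")
price=("Rs 100" "Rs 200")

# Print every title/price pair, numbered from 1.
print_products() {
    local i
    for i in "${!title[@]}"; do
        printf '%s-\tName:\t%s\n\tPrice:\t%s\n\n' "$((i + 1))" "${title[i]}" "${price[i]}"
    done
}

out_file=$(mktemp)
print_products | tee -a "$out_file"    # show on screen and append to the file
rm -f "$out_file"
```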

For a better overview, here is everything in one script:

#!/bin/bash

# just for demo
> URI.txt
URI='?daypartId=1&catId='
URL=https://www.mcdelivery.com.pk/pk/browse/menu.html

# just for demo
for id in 1 2 3 4 5 6 8 10 11 12 14
  do
    echo -e "${URI}${id}" >> URI.txt
done

ARRAY=()
while read -r LINE || [[ -n $LINE ]]
do
    [ "$LINE" ] && ARRAY+=("$LINE")
done < URI.txt

for LINE in "${ARRAY[@]}"
  do
    # curl into file main.tmp~
    echo -e "${URL}${LINE}"
    curl "${URL}${LINE}" > "${0%.*}.tmp~" 2> /dev/null

    # get item name and convert into simple file name
    file="$(grep -w "${LINE##*&}" "${0%.*}.tmp~" | grep -o '<span.*</span>' | cut -d\> -f2 | cut -d\< -f1)"
    file="${file//[^[:alnum:]._ -]/}"    # keep only alphanumerics, '.', '_', space and '-'
    file="${file// /_}"
    while [[ "${file}" =~ "__" ]]
      do
        file="${file//__/_}"
    done

    # fallback file name
    [ "$file" ] || file="file${LINE##*&}"

    # rename main.tmp~ into Deals.html and create empty file Deals.txt
    mv -f "${0%.*}.tmp~" "${file}.html"
    : > "${file}.txt"

    # declare arrays
    product_offset=()
    title=()
    price=()

    for offset in $(grep -bha '<div class="panel panel-default panel-product">' "${file}.html" | cut -d: -f1)
      do
         product_offset+=($offset)
    done
    # file size in bytes as last product_offset
    product_offset+=($(stat -c%s "${file}.html"))
    # count matches
    j=$(grep -hac '<div class="panel panel-default panel-product">' "${file}.html")

    # write array title + price from Deals.html    
    i=0
    while [ $i -lt $j ]
      do
        # calculate product_offset
        beg=${product_offset[$i]}
        end=${product_offset[$((i+1))]}
        len=$(expr ${end:-0} - ${beg:-0})

        # search title within product_offset
        item="$(dd if="${file}.html" bs=1 skip=${beg:-0} count=${len:-0} status=none | grep '<h5 class="product-title">' | cut -d\> -f2 | cut -d\< -f1)"
        title+=("$item")

        # search price within product_offset
        item="$(dd if="${file}.html" bs=1 skip=${beg:-0} count=${len:-0} status=none | grep '<span class="starting-price">' | cut -d\> -f2 | cut -d\< -f1)"
        price+=("$item")

        i=$((i+1))
    done

    echo -e "################################################\n#"
    echo -e "# ${file//_/ }\n#"
    echo -e "################################################\n"

    # read array title + price    
    i=0
    while [ $i -lt $j ]
      do
        # echo -e "(${title[$i]}, ${price[$i]})\n" | tee -a "${file}.txt"
        printf "%s-\tName:\t%s\n\tPrice:\t%s\n\n" $((i+1)) "${title[$i]}" "${price[$i]}" | tee -a "${file}.txt"
        i=$((i+1))
    done
done

exit 0

Happy coding and good luck!

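【Discussion】:

var = curl https://www.mcdelivery.com.pk/pk/browse/menu.html"$LINE" | grep '<span>.*</span>' | sed 's/<[^>]\+>//g' > $var.txt I want to save the data from curl into a file named after whatever grep extracts. For example, if the grep command gets beverages, it should create a file called beverage.txt
I hope you see why it matters to check the position of every title + price and keep them paired. There are several ways to do this; this is just one simple idea. (Imagine your grep returns 5 titles and 8 prices: once they are split into separate text files, how would you pair them up?)
Any idea how to do that? For example, extracting the data (the title) into a variable and then creating a file named after that variable?
Updated with the output you wanted; everything is now saved to the file "beverages.txt".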
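
For the file-naming question raised in this discussion, the usual pattern is to capture the name in a variable first and only then redirect into a file built from it. Here is a minimal sketch; the function name `category_file_name`, the sample markup, and the temporary directory are assumptions for the example, and a real run would fill `page` from curl instead.

```bash
#!/bin/bash
# Read HTML from stdin and print a file name derived from the first <span> text.
category_file_name() {
    local name
    name=$(grep -o '<span>[^<]*</span>' | head -n 1 | sed 's/<[^>]*>//g')
    # Lower-case the name and replace anything outside the whitelist with '_'.
    name=$(printf '%s' "${name:-category}" | tr '[:upper:]' '[:lower:]' | tr -cs '[:alnum:]._-' '_')
    printf '%s.txt\n' "$name"
}

# Hypothetical page content; in the real script: page=$(curl -s "${URL}${LINE}")
page='<li class="active"><span>Beverages</span></li>'

out_dir=$(mktemp -d)
out_name=$(printf '%s\n' "$page" | category_file_name)    # beverages.txt

# Strip the tags and save the remaining text under the derived name.
printf '%s\n' "$page" | sed 's/<[^>]*>//g' > "$out_dir/$out_name"
printf 'saved to %s\n' "$out_dir/$out_name"
```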