【Question title】: Webscraping data from a table but the tbody tag is missing
【Posted】: 2021-04-08 21:25:06
【Question description】:

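I am trying to scrape a data table from this site: https://fl511.com/list/events/traffic?start=0&length=25&filters%5B0%5D%5Bi%5D=5&filters%5B0%5D%5Bs%5D=Incidents&order%5Bi%5D=8&order%5Bdir%5D=asc

Unfortunately, when I print the table, the output does not include the tbody tag, which is where the information is stored. All the other tags are shown. Is there a way around this?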

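from urllib.request import Request, urlopen

from bs4 import BeautifulSoup

# `url` holds the page address given above
req = Request(
    url,
    headers={'User-Agent': 'Mozilla/5.0'}
    )
webpage = urlopen(req).read()

# Parse the downloaded HTML before searching it
soup = BeautifulSoup(webpage, 'html.parser')
table = soup.find_all('table')
print(table)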

【Comments】:

  • Maybe the page uses JavaScript to load the tbody and its contents?

Tags: python python-3.x web-scraping beautifulsoup
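
One way to test that suggestion is to count the `<tr>` elements inside `<tbody>` in the HTML exactly as the server sends it. The following is a minimal sketch using only the standard library; the class name, the function name, and both HTML samples are made up for illustration and are not taken from the site. A count of zero on the downloaded page would be consistent with the rows being added later by a script.

```python
from html.parser import HTMLParser


class TbodyRowCounter(HTMLParser):
    """Counts <tr> start tags that appear inside a <tbody> element."""

    def __init__(self):
        super().__init__()
        self.in_tbody = False
        self.rows = 0

    def handle_starttag(self, tag, attrs):
        if tag == "tbody":
            self.in_tbody = True
        elif tag == "tr" and self.in_tbody:
            self.rows += 1

    def handle_endtag(self, tag):
        if tag == "tbody":
            self.in_tbody = False


def count_tbody_rows(html):
    """Return the number of table rows found inside <tbody> in `html`."""
    counter = TbodyRowCounter()
    counter.feed(html)
    return counter.rows


# Hypothetical sample: markup as a server might send it before any script runs.
static_html = "<table><thead><tr><th>County</th></tr></thead></table>"
# Hypothetical sample: the same table after a script has filled it in.
rendered_html = (
    "<table><thead><tr><th>County</th></tr></thead>"
    "<tbody><tr><td>Polk</td></tr><tr><td>Duval</td></tr></tbody></table>"
)

print(count_tbody_rows(static_html))    # 0
print(count_tbody_rows(rendered_html))  # 2
```

To apply it to the real page, pass the decoded `webpage` bytes from the question's code to `count_tbody_rows`.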


【Solution 1】:

The data is loaded via JavaScript from an external source. You can use this example to load the data:

import json
import requests

data = {
    "draw": 1,
    "columns": [
        {
            "data": None,
            "name": "",
            "searchable": False,
            "orderable": False,
            "search": {"value": "", "regex": False},
            "title": "",
            "visible": True,
            "isUtcDate": False,
            "isCollection": False,
        },
        {
            "data": "region",
            "name": "region",
            "searchable": False,
            "orderable": True,
            "search": {"value": "", "regex": False},
            "isUtcDate": False,
            "isCollection": False,
        },
        {
            "data": "county",
            "name": "county",
            "searchable": False,
            "orderable": True,
            "search": {"value": "", "regex": False},
            "isUtcDate": False,
            "isCollection": False,
        },
        {
            "data": "roadwayName",
            "name": "roadwayName",
            "searchable": False,
            "orderable": True,
            "search": {"value": "", "regex": False},
            "isUtcDate": False,
            "isCollection": False,
        },
        {
            "data": "direction",
            "name": "direction",
            "searchable": False,
            "orderable": True,
            "search": {"value": "", "regex": False},
            "isUtcDate": False,
            "isCollection": False,
        },
        {
            "data": "type",
            "name": "type",
            "searchable": False,
            "orderable": True,
            "search": {"value": "Incidents", "regex": False},
            "isUtcDate": False,
            "isCollection": False,
        },
        {
            "data": "severity",
            "name": "severity",
            "searchable": False,
            "orderable": True,
            "search": {"value": "", "regex": False},
            "isUtcDate": False,
            "isCollection": False,
        },
        {
            "data": "description",
            "name": "description",
            "searchable": False,
            "orderable": False,
            "search": {"value": "", "regex": False},
            "isUtcDate": False,
            "isCollection": False,
        },
        {
            "data": "startTime",
            "name": "startTime",
            "searchable": False,
            "orderable": True,
            "search": {"value": "", "regex": False},
            "isUtcDate": False,
            "isCollection": False,
        },
        {
            "data": "lastUpdated",
            "name": "lastUpdated",
            "searchable": False,
            "orderable": True,
            "search": {"value": "", "regex": False},
            "isUtcDate": False,
            "isCollection": False,
        },
        {
            "data": 10,
            "name": "",
            "searchable": False,
            "orderable": False,
            "search": {"value": "", "regex": False},
            "isUtcDate": False,
            "isCollection": False,
        },
    ],
    "order": [{"column": 8, "dir": "asc"}],
    "start": 0,
    "length": 25,
    "search": {"value": "", "regex": False},
}

url = "https://fl511.com/List/GetData/traffic"

# POST the payload as JSON and parse the JSON response
result = requests.post(url, json=data).json()

# uncomment this to print all data:
# print(json.dumps(result, indent=4))

# Print a numbered list of the incident descriptions
for i, d in enumerate(result["data"], 1):
    print(i, d["description"])

print()
print("Records total:", result["recordsTotal"])
print("Records filtered:", result["recordsFiltered"])

Prints:

1 Crash in Highlands County on US-27 South, at Lake Josephine Dr. Right lane blocked. Last updated at 04:24 PM.
2 Emergency vehicles in Highlands County on US-27 North, at Lake Josephine Dr. Right lane blocked. Last updated at 04:25 PM.
3 Crash in Manatee County on US-41 North, at Pearl Ave. All lanes blocked. Last updated at 04:29 PM.
4 Crash in Polk County on I-4 East, beyond CR-557. 2 Left lanes blocked. Last updated at 04:32 PM.
5 Emergency vehicles in Manatee County on US-41 South, at Pearl Ave. Left lane blocked. Last updated at 04:35 PM.
6 Crash in Miami-Dade County on I-195 East, beyond North Miami Ave. Right lane blocked. Last updated at 05:03 PM.
7 Crash in Santa Rosa County on I-10 East, ramp to Exit 22 (SR-281/Avalon Blvd). Right shoulder blocked. Last updated at 05:05 PM.
8 Emergency vehicles in Santa Rosa County on I-10 West, at Exit 22 (SR-281/Avalon Blvd). Left shoulder blocked. Last updated at 05:02 PM.
9 Multi-vehicle crash in Duval County on I-295 E South, before Between Atlantic Blvd/St Johns Bluff Rd. Left shoulder blocked. Last updated at 05:30 PM.

Records total: 93
Records filtered: 9
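
The `start` and `length` fields in the request, together with the record counts in the response, suggest that the endpoint returns results one page at a time. Assuming it honors `start` as a row offset (not verified here), a filtered result larger than `length` could be fetched with one request per page. The helper below is a hypothetical sketch that only builds the per-page payloads; sending them is left to the `requests.post` call shown in the answer.

```python
import copy


def page_payloads(payload, records_filtered):
    """Yield one copy of `payload` per page, advancing "start" by "length"."""
    length = payload["length"]
    for start in range(0, records_filtered, length):
        page = copy.deepcopy(payload)  # leave the caller's payload untouched
        page["start"] = start
        yield page


# Trimmed-down stand-in for the full request payload (columns omitted here).
base = {"draw": 1, "start": 0, "length": 25}

# 93 is the "Records total" figure from the sample output, used only as an example count.
print([p["start"] for p in page_payloads(base, 93)])  # [0, 25, 50, 75]
```

With the 9 filtered records from the sample output, the helper yields a single payload with `start` equal to 0, so one request is enough.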

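【Comments】:

  • Thank you very much for your help. I hope you don't mind my asking: how did you find the URL the data is fetched from? And how did you get all the filter information?
  • @Kab2k You can inspect all the requests the page makes in Firefox Developer Tools -> Network tab (Chrome has something similar).
  • Oh yes, I see what you mean about the URL. But what about the filters? I can't find any JSON file.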
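
On the filter question, comparing the page URL in the question with the payload in the answer suggests where the filter lives: the URL carries `filters[0][i]=5` and `filters[0][s]=Incidents`, and entry 5 of the payload's `columns` list is the `type` column with the search value `"Incidents"`. Likewise `order[i]=8` with `order[dir]=asc` lines up with `"order": [{"column": 8, "dir": "asc"}]`. The sketch below decodes the URL to show that correspondence. That the site translates one into the other in exactly this way is an inference from the two snippets, not something confirmed by the thread, and the `columns` list here is a simplified stand-in.

```python
from urllib.parse import parse_qs, urlsplit

page_url = (
    "https://fl511.com/list/events/traffic?start=0&length=25"
    "&filters%5B0%5D%5Bi%5D=5&filters%5B0%5D%5Bs%5D=Incidents"
    "&order%5Bi%5D=8&order%5Bdir%5D=asc"
)

# parse_qs decodes %5B and %5D to [ and ] and wraps each value in a list
query = parse_qs(urlsplit(page_url).query)

column_index = int(query["filters[0][i]"][0])
filter_value = query["filters[0][s]"][0]
order = {"column": int(query["order[i]"][0]), "dir": query["order[dir]"][0]}

print(column_index, filter_value)  # 5 Incidents
print(order)                       # {'column': 8, 'dir': 'asc'}

# Simplified stand-in for the 11 column entries of the answer's payload
columns = [{"search": {"value": "", "regex": False}} for _ in range(11)]
columns[column_index]["search"]["value"] = filter_value
print(columns[5]["search"])        # {'value': 'Incidents', 'regex': False}
```

Under that assumption, filtering on another column would mean setting the `search` value of the matching entry in `columns` before posting the payload.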