【Question Title】: Extract data from Dell Community Forum for a specific date
【Posted】: 2022-11-03 01:34:45
【Question Description】:

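I want to extract the username, post title, post time and message content from a Dell Community Forum thread for a specific date, and store them in an Excel file.

For example, URL: https://www.dell.com/community/Inspiron-Desktops/I-am-getting-time-sync-errror-and-the-last-synced-time-shown-as/m-p/8290678#M36017

I want to extract the post title: "I am getting time sync errror and the last synced time shown as a day in 2015"

and the details (username, post time, message) of only the comments dated 10-25-2022: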

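  1. jraju, 04:20 AM, "This pc is desktop inspiron 3910 model . The dell supplied only this week."
  2. Mary G, 09:10 AM, "Try rebooting the computer and connecting to the internet again to see if that clears it up. Don't forget to run Windows Update to get all the necessary updates on a new computer."
  3. RoHe, 01:00 PM, "You might want to read Fix: Time synchronization failed on Windows 11. Totally ignore the part about downloading the software tool, and scroll down that same page to the part: How to manually sync time on a Windows 11 PC. NOTE: In step #6, if time.windows.com doesn't work, pick a different server from the drop-down menu on that screen."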

    I do not want any of the other comments.

    I am very new to this.

    So far I have only managed to extract the information without a date filter (and without the username).

    
    import csv  # needed for csv.writer below
    import requests
    from bs4 import BeautifulSoup
    
    url = "https://www.dell.com/community/Inspiron-Desktops/I-am-getting-time-sync-errror-and-the-last-synced-time-shown-as/m-p/8290678#M36017"
    
    result = requests.get(url)
    doc = BeautifulSoup(result.text, "html.parser")
    
    ###### time ######
    time = doc.find_all('span', attrs={'class':'local-time'})
    print(time)
    ##################
    
    ##### date #######
    date = doc.find_all('span', attrs={'class':'local-date'})
    print(date)
    #################
    
    #### message ######
    article_text = ''
    article = doc.find_all("div", {"class":"lia-message-body-content"})
    for element in article:
        article_text += '\n' + ''.join(element.find_all(text = True))
        
    print(article_text)
    ##################
    all_data = []
    for t, d, m in zip(time, date, article):
        all_data.append([t.text, d.get_text(strip=True),m.get_text(strip=True, separator='\n')])
    
    with open('data.csv', 'w', newline='', encoding="utf-8") as csvfile:
        writer = csv.writer(csvfile, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
        for row in all_data:
            writer.writerow(row)
    

【Question Discussion】:

    Tags: python web-scraping beautifulsoup csvwriter


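    【Solution 1】:

    It seems to me that your selectors are the problem: you are searching for them in the global scope (the whole HTML body), so nothing ties a time, a date and a message to the same comment. My approach is to narrow the search down to "components" and look inside each one:

    1. Find the div that contains all the comments
    2. Within it, find each comment container
    3. From each comment container, get the username, date and comment text

      Here is how to achieve this: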

      import requests
      from bs4 import BeautifulSoup
      
      url = "https://www.dell.com/community/Inspiron-Desktops/I-am-getting-time-sync-errror-and-the-last-synced-time-shown-as/m-p/8290678#M36017"
      
      result = requests.get(url)
      soup = BeautifulSoup(result.text, "html.parser")
      
      date = '10-25-2022'
      comments = []
      
      comments_section = soup.find('div', {'class':'lia-component-message-list-detail-with-inline-editors'})
      comments_body = comments_section.find_all('div', {'class':'lia-linear-display-message-view'})
      for comment in comments_body:
          if date in comment.find('span',{'class':'local-date'}).text:
              comments.append({
                  'name': comment.find('a',{'class':'lia-user-name-link'}).text,
                  'date': comment.find('span',{'class':'local-date'}).text,
                  'comment': comment.find('div',{'class':'lia-message-body-content'}).text,
              })
      
      data = {
          "title": soup.find('div', {'class':'lia-message-subject'}).text,
          "comments": comments
      }
      
      print(data)
      

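      This script prints a dictionary (JSON-like once stringified) of the following shape. The sample below includes comments from other dates, so it appears to come from a run without the `date` filter; with the filter in place, only the three 10-25-2022 entries are kept: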

      {
         "title":"
      
      
      
      
      							I am getting time sync errror and the last synced time shown as a day in 2015
      						
      
      
      
      ",
         "comments":[
            {
               "name":"Mary G",
               "date":"
      
      \u200e10-24-2022
      11:01 AM
      
      ",
               "comment":"
      What model computer?
      \xa0
      "
            },
            {
               "name":"jraju",
               "date":"
      
      \u200e10-25-2022
      04:20 AM
      
      ",
               "comment":"
      This pc is desktop inspiron 3910 model . The dell supplied only this week.
      "
            },
            {
               "name":"Mary G",
               "date":"
      
      \u200e10-25-2022
      09:10 AM
      
      ",
               "comment":"
      Try rebooting the computer and connecting to the internet again to see if that clears it up.\xa0
      Don't forget to run Windows Update to get all the necessary updates on a new computer.\xa0
      \xa0
      "
            },
            {
               "name":"RoHe",
               "date":"
      
      \u200e10-25-2022
      01:00 PM
      
      ",
               "comment":"
      You might want to read Fix: Time synchronization failed on Windows 11.
      Totally ignore the part about downloading the software tool, and scroll down that same page to the part: How to manually sync time on a Windows 11 PC.
      NOTE: In step #6, if time.windows.com doesn't work, pick a different server from the drop-down menu on that screen.
      
      Ron\xa0\xa0 Forum Member since 2004\xa0\xa0 I'm not a Dell employee
      
      "
            },
            {
               "name":"jraju",
               "date":"
      
      \u200e10-26-2022
      02:18 AM
      
      ",
               "comment":"
      Hi, Rohe, I already I tried all the things in the link posted in manual section in the link. Changed the servers but always get an error occurred in syncing. It is a bug in the windows system , i think.I have tried all other things except registry tweaks.\xa0I think that the issue is connected to time server.I went to security and set the location default to my place and then tried once more. Now the 2015 synced time gone and the\xa0 synced time was changed to yesterday time.But this is a temporary solution because the next click check produced the same failed sync.I request the dell to give a lasting solution to this as time is an important factor , be it file saving having access to the internet etc.I still come across the sync time failure in some of the recent forum threads .it appears to be time expired error, some times peer not reachable etc.only sync does not work often.thanks.
      "
            },
            {
               "name":"NischalP",
               "date":"
      
      \u200e10-26-2022
      04:42 AM
      
      ",
               "comment":"
      Thanks!\xa0
      "
            },
            {
               "name":"RoHe",
               "date":"
      
      \u200e10-26-2022
      02:34 PM
      
      ",
               "comment":"
      @jraju\xa0 It\'s more likely a Windows problem that Microsoft has to fix, especially since there are lots of posts about this all over the internet. 
      Did you open Start>Run>services.msc and stop the Windows Time service? Then manually start it and set its Startup type to Automatic. Don\'t change anything else in services.msc. Just reboot PC and monitor for a few days to see if it\'s working.
      If that doesn\'t help, you could also try this:
      
      At desktop, open a CMD prompt window, Run as administrator
      At the prompt, type in: DISM.exe /Online /Cleanup-image /Restorehealth and press Enter. Be sure to include a space in front of each / and note any error messages when that\'s done.
      Assuming no errors in #2, at the CMD prompt again, type in: sfc /scannow and press Enter. Be sure to include a space in front of the / and note any errors when that\'s done.
      Assuming no "unfixed" errors in #2 or #3, just reboot PC and monitor for a few days...
      
      
      Ron\xa0\xa0 Forum Member since 2004\xa0\xa0 I\'m not a Dell employee
      
      "
            },
            {
               "name":"jraju",
               "date":"
      
      \u200e10-27-2022
      04:36 AM
      
      ",
               "comment":"
      I have tried dism command and got restore health command completed successfully.what does that mean
      "
            },
            {
               "name":"RoHe",
               "date":"
      
      \u200e10-27-2022
      12:43 PM
      
      ",
               "comment":"
      That means it either didn't find any problems and/or was able to fix something. So that's good.
      Did you run sfc /scannow ?
      
      Ron\xa0\xa0 Forum Member since 2004\xa0\xa0 I'm not a Dell employee
      
      "
            }
         ]
      }
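The raw `.text` values in the output above carry layout whitespace, a left-to-right mark (`\u200e`) and non-breaking spaces (`\xa0`), and the date and time share one field. The following is a minimal sketch of tidying such records and filtering them by exact date, assuming the timestamp always has the `MM-DD-YYYY` / `HH:MM AM|PM` layout seen in the sample. The helper names `clean_text`, `parse_posted` and `comments_on` are hypothetical, and collapsing whitespace also flattens paragraph breaks inside a comment.

```python
from datetime import datetime


def clean_text(raw: str) -> str:
    """Drop the left-to-right mark, replace non-breaking spaces, collapse whitespace."""
    return " ".join(raw.replace("\u200e", "").replace("\xa0", " ").split())


def parse_posted(raw: str) -> datetime:
    """Parse a forum timestamp such as '10-25-2022 04:20 AM' (assumed layout)."""
    return datetime.strptime(clean_text(raw), "%m-%d-%Y %I:%M %p")


def comments_on(comments: list[dict], day: str) -> list[dict]:
    """Keep only comments posted on `day` (MM-DD-YYYY) and return tidy records."""
    wanted = datetime.strptime(day, "%m-%d-%Y").date()
    rows = []
    for c in comments:
        posted = parse_posted(c["date"])
        if posted.date() == wanted:
            rows.append({
                "name": clean_text(c["name"]),
                "date": posted.strftime("%m-%d-%Y"),
                "time": posted.strftime("%I:%M %p"),
                "comment": clean_text(c["comment"]),
            })
    return rows


# Two records copied from the sample output, with their raw whitespace
sample = [
    {"name": "Mary G",
     "date": "\n\n\u200e10-24-2022\n11:01 AM\n\n",
     "comment": "\nWhat model computer?\n\xa0\n"},
    {"name": "jraju",
     "date": "\n\n\u200e10-25-2022\n04:20 AM\n\n",
     "comment": "\nThis pc is desktop inspiron 3910 model . The dell supplied only this week.\n"},
]

print(comments_on(sample, "10-25-2022"))
```

Comparing parsed dates rather than testing `date in text` avoids accidental substring matches and gives separate date and time columns for later export.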
      
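The question also asks for the result to end up in a file that Excel can open. One possible approach, sketched here with only the standard library, is a CSV file with a header row. The function name `write_comments_csv` and the output file name are hypothetical, and the rows are assumed to be dictionaries with the `name`, `date` and `comment` keys used in the answer, already stripped of surrounding whitespace.

```python
import csv
import os
import tempfile


def write_comments_csv(path: str, title: str, rows: list[dict]) -> None:
    """Write one CSV row per comment, repeating the thread title in the first column."""
    fields = ["title", "name", "date", "comment"]
    # utf-8-sig prepends a byte order mark, which helps Excel detect UTF-8
    with open(path, "w", newline="", encoding="utf-8-sig") as f:
        writer = csv.DictWriter(f, fieldnames=fields)
        writer.writeheader()
        for row in rows:
            writer.writerow({"title": title, **row})


# Example data taken from the thread; the file goes to the temp directory
title = "I am getting time sync errror and the last synced time shown as a day in 2015"
rows = [
    {"name": "jraju",
     "date": "10-25-2022 04:20 AM",
     "comment": "This pc is desktop inspiron 3910 model . The dell supplied only this week."},
]
out_path = os.path.join(tempfile.gettempdir(), "dell_comments.csv")
write_comments_csv(out_path, title, rows)
print(out_path)
```

If a real `.xlsx` workbook is required instead of CSV, `pandas.DataFrame(rows).to_excel("comments.xlsx", index=False)` is one option, provided pandas and an Excel writer such as openpyxl are installed.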

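      As an engineer at WebScrapingAPI, I can also recommend our tool, which prevents detection and makes your scraper more reliable in the long run.

      The only thing you need to change to make it work is the URL you request. In this case, the target website becomes a parameter of our API endpoint. Everything else stays the same.

      The url variable becomes: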

      url = 'https://api.webscrapingapi.com/v1?api_key=<YOUR_API_KEY>&url=https://www.dell.com/community/Inspiron-Desktops/I-am-getting-time-sync-errror-and-the-last-synced-time-shown-as/m-p/8290678#M36017'
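When one URL is passed as a query parameter of another, characters such as `#`, `&` and `?` in the inner URL can be misread as part of the outer one. In particular, everything after `#` is a fragment and is not sent to the server. A minimal sketch of building the request URL with percent-encoding follows. The endpoint and parameter names are the ones given in the answer; the helper name `build_api_url` is hypothetical, and it is an assumption that the API accepts an encoded `url` value.

```python
from urllib.parse import urlencode

API_ENDPOINT = "https://api.webscrapingapi.com/v1"  # endpoint as given in the answer


def build_api_url(api_key: str, target_url: str) -> str:
    """Return the endpoint URL with the target page as a percent-encoded `url` parameter."""
    query = urlencode({"api_key": api_key, "url": target_url})
    return f"{API_ENDPOINT}?{query}"


target = (
    "https://www.dell.com/community/Inspiron-Desktops/"
    "I-am-getting-time-sync-errror-and-the-last-synced-time-shown-as/"
    "m-p/8290678#M36017"
)
# "<YOUR_API_KEY>" is a placeholder; substitute a real key before sending a request
url = build_api_url("<YOUR_API_KEY>", target)
print(url)
```

With this encoding the `#M36017` part travels inside the `url` parameter as `%23M36017` instead of being dropped as a fragment of the API URL.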
      

    【Discussion】:
