Scraping online social forum - not able to get date

【Question title】: Scraping online social forum - not able to get date
【Posted】: 2020-02-03 17:45:27
【Question description】:

I am currently trying to scrape this online forum: https://community.whattoexpect.com/forums/postpartum-depression.html

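As you can see, the main page is a listing page containing all the posts, and I have to click on each post to get its full content and replies. For each post I need the title, the author, the date and the message content. I also need the same information (author, date and content) for every reply to each post.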

The problem I have so far is that I can see the post date on the website, but I cannot get it with Scrapy. Here is the element inspection:

But in the Scrapy shell, there is only blank space where the date text should be:

I have tried many CSS and XPath approaches, but nothing has worked.

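Finally, I would like the resulting CSV file to contain the original post's information on the first row and then each reply's information on the following rows, so that I can track the interaction. The post title, or some other message ID, would be the common identifier. Here is an example: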
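The layout described above could be sketched roughly as follows. This is only an illustration, not the asker's code: the column names (`thread_title`, `role`, `author`, `date`, `content`), the helper functions and the sample values are all hypothetical placeholders. The idea is to emit one row for the original post and one row per reply, with the thread title repeated on every row as the shared identifier.

```python
import csv
import io

# Hypothetical column layout; the names are assumptions for illustration.
FIELDS = ["thread_title", "role", "author", "date", "content"]


def thread_to_rows(title, post, replies):
    """Return the original post row followed by one row per reply."""
    rows = [{"thread_title": title, "role": "post", **post}]
    for reply in replies:
        rows.append({"thread_title": title, "role": "reply", **reply})
    return rows


def rows_to_csv(rows):
    """Serialize the rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Placeholder values, not real forum data.
rows = thread_to_rows(
    "Example thread",
    {"author": "user_a", "date": "02/03/2020 17:45", "content": "Original message"},
    [{"author": "user_b", "date": "02/03/2020 18:10", "content": "First reply"}],
)
print(rows_to_csv(rows))
```

In a Scrapy spider, the same shape could probably be obtained by yielding one dictionary per row and letting the feed export write the file (for example with the `-o output.csv` option of `scrapy crawl`).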

OK, here is my Scrapy spider so far. It is messy because I have been trying to work around the date problem.

    import scrapy


    class PerinatalSpider(scrapy.Spider):
        name = 'perinatal'

        start_urls = ['http://www.community.whattoexpect.com/forums/postpartum-depression.html']

        def parse(self, response):
            # Follow every post link found on the listing page.
            for post_link in response.xpath('//*[@class="group-discussions__list__item__block"]/a/@href').extract():
                link = response.urljoin(post_link)
                yield scrapy.Request(link, callback=self.parse_thread)

            # If the listing page links to a next page, keep parsing.
            next_page = response.xpath('(//a[@class="page-link"])[1]/@href').extract_first()
            if next_page:
                yield scrapy.Request(response.urljoin(next_page), callback=self.parse)

        def parse_thread(self, response):
            original_post = response.xpath("//*[@class='__messageContent fr-element fr-view']/p/text()").extract()
            title = response.xpath("//*[@class='discussion-original-post__title']/text()").extract_first()
            author_name = response.xpath("//*[@class='discussion-original-post__author__name']/text()").extract_first()
            # The problem line: text() comes back blank in the Scrapy shell.
            timestamp = response.xpath("//*[@class='discussion-original-post__author__updated']/text()").extract_first()
            # Keep selectors (not the strings returned by .getall()) so each reply can be queried.
            replies_list = response.xpath("//*[@class='discussion-replies__list']")

            # Defaults in case no reply container is found.
            replies, reply_author = "", None
            for reply in replies_list:
                # Reply content and author, relative to the current reply node.
                replies = "".join(reply.xpath(".//*[@class='wte-reply__content']/p/text()").extract())
                reply_author = reply.xpath(".//*[@class='wte-reply__author']/text()").extract_first()

            # Yielded once per thread, after the loop, so only the last match is kept.
            yield {
                "title": title,
                "post": original_post,
                "author_name": author_name,
                "replies": replies,
                "reply_author": reply_author,
                "time": timestamp
            }

【Question discussion】:

Tags: python web-scraping scrapy

【Solution 1】:

JavaScript renders the human-readable date on the page from the data-date attribute. That number is a Unix timestamp.

You can handle it easily.

    import datetime  # needed to convert the timestamp into a readable date

    # The raw Unix timestamp (in milliseconds) is stored in the data-date attribute
    unixtimestamp = response.xpath("//*[@class='discussion-original-post__author__updated']/@data-date").extract_first()

    unixtimestamp = int(unixtimestamp) / 1000  # convert milliseconds to seconds
    timestamp = datetime.datetime.utcfromtimestamp(unixtimestamp).strftime("%m/%d/%Y %H:%M")

The date is in UTC. If you need another time zone, you have to add the hour offset or convert to another tz.
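A minimal sketch of the "add the hour offset" option, assuming the `data-date` value is a millisecond Unix timestamp as described above. The helper name `format_post_date`, its `utc_offset_hours` parameter and the sample value are hypothetical and only there for illustration.

```python
import datetime


def format_post_date(data_date_ms, utc_offset_hours=0):
    """Format a millisecond Unix timestamp as MM/DD/YYYY HH:MM at a fixed UTC offset."""
    seconds = int(data_date_ms) / 1000  # milliseconds to seconds
    tz = datetime.timezone(datetime.timedelta(hours=utc_offset_hours))
    return datetime.datetime.fromtimestamp(seconds, tz=tz).strftime("%m/%d/%Y %H:%M")


# Hypothetical data-date value, not taken from the site.
sample = "1580751927000"
print(format_post_date(sample))      # 02/03/2020 17:45 (UTC)
print(format_post_date(sample, -5))  # 02/03/2020 12:45 (fixed UTC-5 offset)
```

A fixed offset ignores daylight saving time. For a named zone, passing a `zoneinfo.ZoneInfo` object (Python 3.9+) as `tz` is likely the more robust choice.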

【Discussion】: