【发布时间】:2019-04-13 21:09:14
【问题描述】:
网址:https://myanimelist.net/anime/236/Es_Otherwise
我试图在 URL 中抓取以下内容:
我试过了:
for i in response.css('span[class = dark_text]') :
i.xpath('/following-sibling::text()')
或者当前的 XPath 不工作或者我错过了什么......
aired_xpath = response.xpath('//*[@id="content"]/table/tbody/tr/td[1]/div/div[11]/text()')
producer_xpath = response.xpath("//*[@id='content']/table/tbody/tr/td[1]/div/div[12]/span/a/@href/text()")
licensor_xpath = response.xpath('//*[@id="content"]/table/tbody/tr/td[1]/div/div[13]/a/text()')
studio_xpath response.xpath('//*[@id="content"]/table/tbody/tr/td[1]/div/div[14]/a/@href/title/text()')
studio_xpath = response.xpath('//*[@id="content"]/table/tbody/tr/td[1]/div/div[17]/text()')
str_rating_xpath = response.xpath('//*[@id="content"]/table/tbody/tr/td[1]/div/div[18]/text()')
ranked_xpath = response.xpath('//*[@id="content"]/table/tbody/tr/td[1]/div/div[20]/span/text()')
japanese_title_xpath = response.xpath('//*[@id="content"]/table/tbody/tr/td[1]/div/div[7]/text()')
source_xpath = response.xpath('//*[@id="content"]/table/tbody/tr/td[1]/div/div[15]/text()')
genre_xpath = [response.xpath('//*[@id="content"]/table/tbody/tr/td[1]/div/div[16]/a[{0}]'.format(i)) for i in range(1,4)]
genre_xpath_v2 = response.xpath('//*[@id="content"]/table/tbody/tr/td[1]/div/div[16]/a/@href/text()')
number_of_users_rated_anime_xpath = response.xpath('//*[@id="content"]/table/tbody/tr/td[1]/div/div[19]/span[3]/text()')
popularity_xpath = response.xpath('//*[@id="content"]/table/tbody/tr/td[1]/div/div[21]/span/text()')
members_xpath = response.xpath('//*[@id="content"]/table/tbody/tr/td[1]/div/div[22]/span/text()')
favorite_xpath = response.xpath('//*[@id="content"]/table/tbody/tr/td[1]/div/div[23]/span/text()')
但我发现某些文本超出了跨度类,因此我想使用 css/XPath 公式使该文本超出跨度。
【问题讨论】:
-
嗨。请你写一段左右来更好地解释你的问题?
-
您想使用什么语言?您是否与该网站达成协议以抓取内容?
-
我使用 python 和 scrapy 框架