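【Posted】: 2014-05-12 14:28:08
【Question】:
I am on this page: http://www.metacritic.com/browse/games/title/ps4/a?view=condensed
I want to go into each item and get its developer and genre, but my code doesn't seem to work.
For example, I want to go into this page: http://www.metacritic.com/game/playstation-4/angry-birds-star-wars
then leave it, continue doing the same for the remaining items, and add the results to the database. What can I change in my code to make this work? Right now the dev and genre columns in the database are empty, but the rest of the data is fetched, so it looks as if parseGame is never entered.
I also added print statements in parseGame, but none of them print.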
from scrapy.spider import BaseSpider
from scrapy.selector import Selector
from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector
from metacritic.items import MetacriticItem
import MySQLdb
import re
from string import lowercase

class MetacriticSpider(BaseSpider):
    def start_requests(self):
        # iterate through ps4 pages
        for c in lowercase:
            for i in range(self.max_id):
                yield Request('http://www.metacritic.com/browse/games/title/ps4/{0}?page={1}'.format(c, i), callback = self.parseps4)

    # gets the developer and genre of a game
    def parseGame(self, response):
        print("Here")
        item = response.meta['item']
        db1 = MySQLdb.connect("localhost", "root", "andy", "metacritic")
        cursor = db1.cursor()
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//div[@class="product_wrap"]')
        items = []
        item['dev'] = site.xpath('.//span[contains(@class, "summary_detail developer")]/span[1]/text()').extract()
        item['genre'] = site.xpath('.//span[contains(@class, "summary_detail product_genre")]/span[1]/text()').extract()
        cursor.execute("INSERT INTO ps4 (dev, genre) VALUES (%s,%s)",[item['dev'][0],item['genre'][0]])
        items.append(item)
        print item['dev']
        print item['genre']

    def parseps4(self, response):
        # some local variables
        db1 = MySQLdb.connect("localhost", "root", "andy", "metacritic")
        cursor = db1.cursor()
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//div[@class="product_wrap"]')
        items = []
        # iterates through each site
        for site in sites:
            with db1:
                item = MetacriticItem()
                # sets the item
                item['title'] = site.xpath('.//div[contains(@class, "basic_stat product_title")]/a/text()').extract()
                item['cscore'] = site.xpath('.//div[contains(@class, "basic_stat product_score brief_metascore")]/div[1]/text()').extract()
                item['uscore'] = site.xpath('.//div/ul/li/span[contains(@class, "data textscore")]/text()').extract()
                item['release'] = site.xpath('.//li[contains(@class, "stat release_date full_release_date")]/span[2]/text()').extract()
                # some processing to check if there is a score attached, if there is, it adds it to the database
                if ("tbd" in item['cscore'][0] and "tbd" not in item['uscore'][0]) or ("tbd" not in item['cscore'][0] and "tbd" in item['uscore'][0]) or ("tbd" not in item['cscore'][0] and "tbd" not in item['uscore'][0]):
                    cursor.execute("INSERT INTO ps4 (title, criticalscore, userscore, releasedate) VALUES (%s,%s,%s, %s)",[(' '.join(item['title'][0].split())).replace("(PS4)","",1),item['cscore'][0],item['uscore'][0],item['release'][0]])
                    items.append(item)
                    itemLink = site.xpath('.//div[contains(@class, "basic_stat product_title")]/a/@href' ).extract()
                    req = Request('http://www.metacritic.com' + itemLink[0], callback = self.parseGame)
                    req.meta['item'] = item
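【Discussion】:
- It looks like you forgot to add yield before Request('http://www.metacritic.com' + itemLink[0], callback = self.parseGame).
- @alecxe I tried this, and unfortunately it doesn't work. Any other ideas?
- There is at least one more problem: item is not defined in parseGame. You need to pass item from parseps4 to parseGame in meta: see doc.scrapy.org/en/latest/topics/….
- You are missing yield req at the end of parseps4.
- Could you also add the spider's class definition and import statements so that I can try it and debug locally?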
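To illustrate the two points raised in the comments (a request that is built but never yielded is never followed, and the item travels to the second callback through meta), here is a rough sketch in plain Python 3. It does not use Scrapy: the Request and Response classes and the crawl function are hypothetical stand-ins that only mimic the idea of a scheduler consuming whatever a callback returns, and the URL and field values are placeholders.

```python
class Request:
    """Hypothetical stand-in for a crawler request: URL, callback, meta dict."""
    def __init__(self, url, callback, meta=None):
        self.url = url
        self.callback = callback
        self.meta = dict(meta or {})


class Response:
    """Hypothetical stand-in response; carries the meta of its request."""
    def __init__(self, request):
        self.url = request.url
        self.meta = request.meta


def crawl(start_request):
    """Toy scheduler: it only follows Request objects that a callback hands back."""
    queue, items = [start_request], []
    while queue:
        request = queue.pop(0)
        # A callback without yield/return gives None, i.e. nothing to schedule.
        results = request.callback(Response(request)) or []
        for result in results:
            if isinstance(result, Request):
                queue.append(result)
            else:
                items.append(result)
    return items


def parse_game(response):
    # The item arrives through meta, set by the listing callback.
    item = response.meta['item']
    item['dev'] = 'placeholder developer'  # a real spider would extract this
    yield item


def parse_list_without_yield(response):
    item = {'title': 'Example Game'}
    req = Request(response.url + '/game', callback=parse_game)
    req.meta['item'] = item
    # req is created but never yielded, so the scheduler never sees it.


def parse_list_with_yield(response):
    item = {'title': 'Example Game'}
    req = Request(response.url + '/game', callback=parse_game)
    req.meta['item'] = item
    yield req  # handing the request back is what gets parse_game called


broken = crawl(Request('http://example.invalid/list', parse_list_without_yield))
fixed = crawl(Request('http://example.invalid/list', parse_list_with_yield))
print(broken)
print(fixed)
```

In this sketch the first run collects nothing because parse_game is never reached, which matches the symptom in the question (no prints, empty dev and genre). The second run reaches parse_game and finds the item in response.meta.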
Tags: python mysql database parsing web-scraping