【Question title】: Special character being extracted using scrapy
【Posted】: 2023-01-12 22:52:49
【Question description】:

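I am a beginner at data scraping, and I am currently using scrapy to scrape the Quotes to Scrape website.

My problem is that when I scrape the text in the div box, I use the code text = div.css('.text::text').extract() to extract the paragraph. However, when I store the text in a .csv file, the double quotes are treated as special characters, misinterpreted, and changed into other characters.

How can I add an if condition so that these double quotes are not stored during extraction?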

import scrapy

from ..items import QuotetutorialItem  # assumed location: the project's items.py


class QuoteSpider(scrapy.Spider):
    # Scrapy looks for these two attributes by name on every spider class.
    name = 'quotes'  # spider name
    start_urls = [
        'http://quotes.toscrape.com/'
    ]

    def parse(self, response):  # response holds the source of the page being scraped
        allDiv = response.css('.quote')
        for div in allDiv:
            text = div.css('.text::text').extract()  # text of the .text element
            authors = div.css('.author::text').extract()  # text of the .author element
            # href of every <a> inside a <span> of this .quote div
            aboutAuthors = div.css('.quote span a').xpath('@href').extract()
            tags = div.css('.tags .tag::text').extract()

            # A fresh item per quote; the keys must match the fields declared in items.py.
            items = QuotetutorialItem()
            items['storeText'] = text
            items['storeAuthors'] = authors
            items['storeAboutAuthors'] = aboutAuthors
            items['storeTags'] = tags

            yield items

【Discussion】:

    Tags: python csv web-scraping


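    【Solution 1】:

    Since each quote begins with the “ character and ends with the ” character, you can consider this approach:

    • Remove the first and the last character from the string.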

    Example:

    # Sample quote:
    quote_sample = "“A woman is like a tea bag; you never know how strong it is until it's in hot water.”"
    
    # Modify the string - by taking all the characters after the first and before the last character: 
    quote_sample = quote_sample[1:-1]
    
    # Print the modified quote:
    print(quote_sample)
    

    Result - the quote without the “ and ” characters:

    A woman is like a tea bag; you never know how strong it is until it's in hot water.
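
The question asks for an `if` condition, so here is a minimal sketch of a guarded version of the slice above. It assumes the text may or may not be wrapped in curly quotes; `strip_outer_quotes`, `OPEN_QUOTE` and `CLOSE_QUOTE` are hypothetical names introduced only for illustration and are not part of Scrapy.

```python
OPEN_QUOTE = "\u201c"   # the “ character
CLOSE_QUOTE = "\u201d"  # the ” character


def strip_outer_quotes(text):
    """Hypothetical helper: slice off the curly quotes only when both are present."""
    if text.startswith(OPEN_QUOTE) and text.endswith(CLOSE_QUOTE):
        return text[1:-1]
    # Leave any other string untouched instead of losing real characters.
    return text


quote_sample = "“A woman is like a tea bag; you never know how strong it is until it's in hot water.”"
print(strip_outer_quotes(quote_sample))
print(strip_outer_quotes("No curly quotes here"))
```

The check keeps the slice from eating the first and last letters of a string that was never quoted in the first place.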
    

    Alternatively, once you have the quote, you can replace the characters.

    Code:

    quote_sample = quote_sample.replace("“", "").replace("”", "")
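
In the spider, `extract()` returns a list of strings rather than a single string, so the replacement has to be applied to each element. The following is a rough sketch under that assumption; `clean_texts` and `CURLY_QUOTES` are hypothetical names, and the `extracted` list is only a stand-in for the real selector result. It uses `str.translate`, which deletes both characters in one pass and also catches them in the middle of a sentence.

```python
# Translation table that maps each curly quote to None, i.e. deletes it.
CURLY_QUOTES = str.maketrans("", "", "\u201c\u201d")


def clean_texts(texts):
    """Hypothetical helper: remove curly quotes from every string in a list."""
    return [t.translate(CURLY_QUOTES) for t in texts]


# Stand-in for the list returned by div.css('.text::text').extract()
extracted = ["“A woman is like a tea bag; you never know how strong it is until it's in hot water.”"]
print(clean_texts(extracted))
```

Inside `parse`, the same call could wrap the extracted list before it is stored on the item, for example `items['storeText'] = clean_texts(text)`.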
    

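    【Comments】:

    • Thanks, this is a good approach, but I would like to know how to remove `“` and `”` so that I can also remove them when they appear somewhere in the middle of a sentence.
    • @FaizanUlHaq In my opinion that is not the best option, but you can replace the characters. I have edited my answer.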