【Question title】: Scrapy - Messy text in a csv format
【Posted】: 2015-05-05 18:15:37
【Question】:

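I have successfully extracted the text I need from a list of websites. The problem is that when I save it in CSV format, some rows get messed up because of the long text and the line breaks inside the text. For example:

(Unable to upload the image :( )

So the rows starting with 0s/1s come from different websites, but the last website in this image starts several new rows in the CSV file. This prevents me from going ahead with the text analysis.

Any help would be highly appreciated, since I have not been able to find a solution so far.

Many thanks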

Edit - adding code: Neither this line:

data = "".join(sel.select("//body//text()").extract()).strip()

nor this line:

data = " ".join(" ".join(sel.select("//body//text()").extract()).strip().split())

works.

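【Comments】:

  • Could you add more details about the extracted text, or provide some sample links and the entities you are extracting from those pages?

Tags: csv text scrapy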


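【Solution 1】:

You can remove newlines and other whitespace characters by calling split() on the extracted text and then join() on the result. Make sure the extracted text is properly cleaned before you yield the item.

Suppose I want to grab some sports news from a page. It would look like this: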

In [1]: text = response.xpath('//div[@id="page-1"]/p//text()').extract()

In [2]: text 
Out[2]: 
[u'\nThe retirement of Jonathan Trott from international cricket last night cast \nfurther doubt on the position of Peter Moores, who admitted he was uncertain \nabout his own future as head coach. The decision to recall Trott as an \nopening batsman for the series against West Indies, 18 months after his \nbreakdown in Australia, backfired spectacularly as England slid to a defeat \nin Bridgetown on Sunday that enabled the home side to level the Test series \n1-1.\n',
 u'\nThe defeat in Bridgetown added to the pressure on Moores after a disastrous \nWorld Cup earlier this year. The head coach conceded yesterday that']

In [3]: cleaned_text = ' '.join(' '.join(text).split())

In [4]: cleaned_text 
Out[4]: u'The retirement of Jonathan Trott from international cricket last night cast further doubt on the position of Peter Moores, who admitted he was uncertain about his own future as head coach. The decision to recall Trott as an opening batsman for the series against West Indies, 18 months after his breakdown in Australia, backfired spectacularly as England slid to a defeat in Bridgetown on Sunday that enabled the home side to level the Test series 1-1. The defeat in Bridgetown added to the pressure on Moores after a disastrous World Cup earlier this year. The head coach conceded yesterday that'
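
The same cleaning step can be wrapped in a small helper and checked against the CSV output. This is only a minimal sketch in Python 3 (the session above is Python 2); the `normalize_whitespace` name and the sample fragments are made up for illustration, and the `csv` module from the standard library stands in for whatever exporter the spider actually uses.

```python
import csv
import io


def normalize_whitespace(fragments):
    """Join text fragments and collapse every whitespace run into one space."""
    # str.split() with no arguments splits on any run of whitespace
    # (spaces, tabs, \n, \r) and ignores leading/trailing whitespace.
    return " ".join(" ".join(fragments).split())


# Hypothetical fragments, shaped like the list returned by .extract()
fragments = ["\nFirst line of \nthe page.\r\n", "\tSecond   fragment.\n"]
cleaned = normalize_whitespace(fragments)

# Write a header plus one row per page into an in-memory CSV.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["label", "text"])
writer.writerow([0, cleaned])

# With the line breaks removed, each page occupies exactly one physical line.
csv_lines = buffer.getvalue().splitlines()
```

If the cleaned text still spills over several rows when the file is opened, the line breaks are probably not the cause, and the tool reading the file is worth checking.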

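Hope this helps.

【Comments】:

  • Thank you very much for your reply. Sorry for not posting my code at first, but here it is: data = "".join(sel.select("//body//text()").extract()).strip() and after your suggestion I changed it to: data = " ".join(" ".join(sel.select("//body//text()").extract()).strip().split()) Unfortunately it still results in the same problem (though in a friendlier way).
  • Can you give me the URL?
  • It still doesn't work, although I think I have found the problem - the cell length limit in a CSV file: stackoverflow.com/questions/18842866/… I will try to break/cut the text once it exceeds this limit. Thanks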
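
The "break/cut the text" idea from the last comment could look roughly like the sketch below. It assumes the limit in question is Excel's documented maximum of 32,767 characters per cell (the linked question is truncated here, so that is a guess), and that the text has already had its whitespace collapsed. The `split_for_cells` helper is hypothetical and not part of Scrapy.

```python
import textwrap

# Assumption: the limit being hit is Excel's per-cell character limit.
EXCEL_CELL_LIMIT = 32767


def split_for_cells(text, limit=EXCEL_CELL_LIMIT):
    """Split text into chunks of at most `limit` characters, breaking at spaces."""
    # textwrap.wrap never returns a line longer than `width`; a single word
    # longer than the limit is cut because break_long_words defaults to True.
    return textwrap.wrap(text, width=limit, break_on_hyphens=False)


# Small limit so the behaviour is easy to see.
chunks = split_for_cells("aa bb cc dd", limit=5)

# A long, already-normalized text split under the assumed Excel limit.
long_text = " ".join(["word"] * 10000)
long_chunks = split_for_cells(long_text)
```

Each chunk can then go into its own column or row, and joining the chunks with a single space restores the normalized text.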