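[Posted]: 2016-05-13 17:06:55
[Question]:
To make my current problem understandable, here is some background on the wider problem:
I have a large text file made up of many documents, and I need a way to split it into its component parts. Unfortunately, the individual documents are all formatted differently. The only thing they share is that each one begins with a date, always written in the same format: dd MONTH yyyy. I use the dates as bookends to isolate the text between them.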
import regex  # third-party module; unlike the stdlib re, it accepts variable-width lookbehind

# the date pattern with positive lookbehind
bookend_1 = r"(?<=\d{1,2}\sJANUARY\s\d{4}|\d{1,2}\sFEBRUARY\s\d{4}|\d{1,2}\sMARCH\s\d{4}|\d{1,2}\sAPRIL\s\d{4}|\d{1,2}\sMAY\s\d{4}|\d{1,2}\sJUNE\s\d{4}|\d{1,2}\sJULY\s\d{4}|\d{1,2}\sAUGUST\s\d{4}|\d{1,2}\sSEPTEMBER\s\d{4}|\d{1,2}\sOCTOBER\s\d{4}|\d{1,2}\sNOVEMBER\s\d{4}|\d{1,2}\sDECEMBER\s\d{4})"
# the date pattern with positive lookahead
bookend_2 = r"(?=\d{1,2}\sJANUARY\s\d{4}|\d{1,2}\sFEBRUARY\s\d{4}|\d{1,2}\sMARCH\s\d{4}|\d{1,2}\sAPRIL\s\d{4}|\d{1,2}\sMAY\s\d{4}|\d{1,2}\sJUNE\s\d{4}|\d{1,2}\sJULY\s\d{4}|\d{1,2}\sAUGUST\s\d{4}|\d{1,2}\sSEPTEMBER\s\d{4}|\d{1,2}\sOCTOBER\s\d{4}|\d{1,2}\sNOVEMBER\s\d{4}|\d{1,2}\sDECEMBER\s\d{4})"
# using the bookends to find the text in between dates
docs = regex.findall(bookend_1 + '(.*?)' + bookend_2, psc_comm_raw, regex.DOTALL | regex.MULTILINE)
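Using regular expressions, I created two lists: one of all the dates, and one of all the passages of text that appear between the dates. I zipped these lists into a list of tuples. I can't zip them into a dictionary because some of the dates are repeated.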
psc_comm_tuple = list(zip(date, docs))
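For comparison, the date/text pairing can also be sketched in a single pass with the standard-library `re` module instead of `regex`. This is only an illustration, not the code from the question: `MONTHS`, `DATE_RE`, `split_by_date` and the `sample` string are hypothetical names and data. Unlike the lookahead version, it should also keep the text that follows the last date.

```python
import re

# Month names exactly as they appear in the date headings (upper case only).
MONTHS = ("JANUARY|FEBRUARY|MARCH|APRIL|MAY|JUNE|JULY|AUGUST|"
          "SEPTEMBER|OCTOBER|NOVEMBER|DECEMBER")
DATE_RE = re.compile(r"\b\d{1,2}\s(?:" + MONTHS + r")\s\d{4}\b")


def split_by_date(raw):
    """Return (date, text) pairs; each text runs up to the next date heading."""
    matches = list(DATE_RE.finditer(raw))
    pairs = []
    for i, m in enumerate(matches):
        # The text ends where the next date starts, or at the end of the input.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(raw)
        pairs.append((m.group(), raw[m.end():end]))
    return pairs


# Hypothetical miniature input, loosely modelled on the rows in the question.
sample = "27 JULY 2004 ADDIS ABABA\n\nCouncil,\n29 JANUARY 2001\n\nThe Central Organ"
pairs = split_by_date(sample)
```

Because the pattern is case-sensitive, a mixed-case date inside the body text (such as "29 January 2001") is not treated as a document boundary.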
Here are a few rows of psc_comm_tuple.
[('27 JULY 2004',
' ADDIS ABABA, ETHIOPIA\n\nPSC/PR/Comm.(XIII)\n\nCOMMUNIQUÉ\n\nPSC/PR/Comm.(XIII) Page l\n\nCOMMUNIQUÉ OF THE THIRTEENTH MEETING OF THE PEACE AND SECURITY COUNCIL\n\nThe Peace and Security Council (PSC) of the African Union (AU), at its thirteenth meeting, held on 27 July 2004, adopted the following communiqué on the crisis in the Darfur region of the Sudan:\n\nCouncil,\n\n1.\tReiterates its deep concern over the grave situation that still prevails in the Darfur region of the Sudan, in particular the continued attacks by the Janjaweed militia against the civilian population, as well as other human rights abuses and the humanitarian crisis;\n\n2.\tUnderlines the urgent need to implement decision AU/Dec.54(111) on Darfur, adopted by the 3rd Ordinary Session of the Assembly...'),
('29 JANUARY 2001',
'\n\nThe Central Organ of the OAU Mechanism for Conflict Prevention, Management and Resolution held its seventy-third * ordinary session at the level of Ambassadors on 29 January 2001, in Addis Ababa. The session was chaired by Ambassador Kati Ohara Korga, Permanent Representative of Togo to the OAU.\n\nHaving considered the Report of the Secretary General on the Democratic Republic of the Congo (DRC) and the situation in that country, the Central Organ:\n\n1.\tstrongly condemns the assassination of Pre...'),
('20 MARCH 2001',
"\n\nThe Central Organ of the OAU Mechanism for Conflict Prevention, Management and Resolution held its 74th ordinary session at ambassadorial level, in Addis Ababa, Ethiopia, on Tuesday March 20, 2001. The session was chaired by Ambassador Ohara Korga, Permanent representative of Togo to the OAU....'),
('22 AUGUST 2001',
'\n\nThe Central Organ of the OAU Mechanism for Conflict Prevention, Management and Resolution held its 75th Ordinary Session at Ambassadorial level in Addis Ababa, Ethiopia, on Wednesday 22 August 2001....')...]
My end goal is to create a CSV with two columns: one for the date and one for the body of text associated with that date.
import csv
import os

with open('psc_comm.csv', 'w') as out:
    csv_out = csv.writer(out)
    csv_out.writerow(['date', 'text'])
    for row in psc_comm_tuple:
        csv_out.writerow(row)
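When I write the tuples out to CSV, some rows are perfectly fine. But some of the output gets scrambled: the text is split into seemingly random chunks, and there are blank rows and rows holding sentence fragments. There are hundreds of these cases. When I go back to the original document and find the places where the sentences break, I can't see anything special or unique about the text itself. There are no special characters; it is just plain text. These do seem to be particularly long sections of text, though, so I wonder whether there is a limit on how much information a single cell in a CSV file can hold.
My question is: why is the CSV output so funky in some places but not in others? Is there a limit on how much text can go into each cell?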
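One way to check whether the file itself is damaged is to read it back with `csv.reader` instead of judging it in a spreadsheet. The sketch below is illustrative only: the `rows` data, the 50,000-character filler and the file name `psc_comm_check.csv` are made up. It assumes Python 3 and opens the file with `newline=''`, as the `csv` documentation recommends. The filler is deliberately longer than Excel's 32,767-character limit per cell; as far as the `csv` module is concerned, the writer does not appear to impose a per-field length limit.

```python
import csv
import os
import tempfile

# Hypothetical rows: one with embedded newlines, one with a very long body.
rows = [
    ("27 JULY 2004", " ADDIS ABABA, ETHIOPIA\n\nCOMMUNIQUÉ\n\nCouncil,\n\n1.\tReiterates"),
    ("29 JANUARY 2001", "\n\n" + "x" * 50000),
]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "psc_comm_check.csv")

    # newline='' lets the csv module control line endings itself.
    with open(path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        writer.writerow(["date", "text"])
        writer.writerows(rows)

    # Count physical lines: embedded newlines spread one record over several.
    with open(path, newline="", encoding="utf-8") as f:
        physical_lines = sum(1 for _ in f)

    # Parse the file properly: each record comes back as one logical row.
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        header = next(reader)
        read_back = [tuple(r) for r in reader]

round_trip_ok = read_back == rows
```

If `round_trip_ok` is true while `physical_lines` is larger than the number of records, the file is intact and the scrambling more likely comes from whatever program displays it.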
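[Comments]:
- There is a limit on how much text a spreadsheet will show you. Could a lot of it be getting cut off in your Excel program? What happens if you open the file as plain text?
- Welcome to Stack Overflow, @chickpeaze. Nice try at handing your problem over to the net, but you are far more likely to get help if you narrow the problem down. At the very least, give an example of the data you are working with: edit your question and add a few rows from psc_comm_tuple.
- It's also possible that csv.writer is adding escaping for characters that would otherwise break the format (such as commas), and that your program is not interpreting the result the way csv.writer intends.
- @alexis, I think you meant "edit your question".
- @alexis, thanks for the tip! I've added a few rows above, and I'll make sure to keep doing that in future questions.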
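The comment about `csv.writer` escaping can be made concrete with a small in-memory sketch. The row below is invented for illustration. With the default dialect, a field containing a comma, a double quote or a newline is wrapped in double quotes, and embedded quotes are doubled. A consumer that splits the output on line breaks instead of parsing it as CSV would then see fragments and blank lines, which may be what is happening here.

```python
import csv
import io

# Hypothetical row: the text holds quotes, a comma and a blank line.
row = ["20 MARCH 2001",
       'held "at ambassadorial level", in Addis Ababa\n\nsecond paragraph']

buf = io.StringIO()
writer = csv.writer(buf)  # default dialect: minimal quoting, '\r\n' row terminator
writer.writerow(row)
serialized = buf.getvalue()

# Naive line splitting breaks the single record into three pieces.
naive_pieces = serialized.splitlines()

# A real CSV parser restores the record as one row with two fields.
parsed = list(csv.reader(io.StringIO(serialized, newline="")))
```

Here `naive_pieces` has three entries, one of them empty, while `parsed` contains a single row equal to `row`.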
Tags: python regex string python-3.x csv