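【Posted】: 2013-11-23 14:58:44
【Question】:
The pandas script below keeps modifying the data I export to CSV, which it should not do.
If you compare the original file with the modified testing2.csv, you will see that a number such as 0.357 in the first row becomes 0.35700000000000004, while in the second row the number 0.1128 is not changed at all.
The numbers should not be modified; they should all stay exactly as they are.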
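To make the symptom concrete, here is a minimal standard-library sketch (not part of the original script) that compares the two values from the description. It shows that 0.35700000000000004 is not just a different way of printing 0.357: it is a different, adjacent double. Since the script does no arithmetic on these columns, that suggests the value shifts when the text is parsed into a float, but that is an inference rather than something confirmed in the post. `math.nextafter` assumes Python 3.9 or newer.

```python
import math

original = float('0.357')
changed = float('0.35700000000000004')

# Both strings round-trip to themselves, so they denote two distinct doubles
print(repr(original))        # 0.357
print(repr(changed))         # 0.35700000000000004
print(original == changed)   # False

# The changed value is the very next representable double above 0.357
print(math.nextafter(original, 1.0) == changed)   # True

# Formatting with fewer significant digits hides the one-step difference
print('%.10g' % changed)     # 0.357
```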
testing.py
import re
import pandas

# each block in the text file will be one element of this list
matchers = [[]]
i = 0
with open('testing.txt') as infile:
    for line in infile:
        line = line.strip()
        # Blocks are separated by blank lines
        if len(line) == 0:
            i += 1
            matchers.append([])
            # assume there are always two blank lines between items
            # and just skip to the next line
            next(infile)
            continue
        matchers[i].append(line)

# This regular expression matches the variable number of students in each block
studentlike = re.compile(r'(\d+) (.+) (\d+/\d+)')
# These are the names of the fields we expect at the end of each block
datanames = ['Data', 'misc2', 'bla3']
# We will build a table containing a list of elements for each student
table = []
for matcher in matchers:
    # We use an iterator over the block lines to make indexing simpler
    it = iter(matcher)
    # The first two elements are match values
    m1, m2 = next(it), next(it)
    # then there are a number of students
    students = []
    for possiblestudent in it:
        m = studentlike.match(possiblestudent)
        if m:
            students.append(list(m.groups()))
        else:
            break
    # After the students come the data elements, which we read into a dictionary
    # We also add in the last possible student line as that didn't match the student re
    dataitems = dict(item.split() for item in [possiblestudent] + list(it))
    # Finally we construct the table
    for student in students:
        # We use the dictionary .get() method to return blanks for the missing fields
        table.append([m1, m2] + student + [dataitems.get(d, '') for d in datanames])

textcols = ['MATCH2', 'MATCH1', 'TITLE01', 'MATCH3', 'TITLE02', 'Data', 'misc2', 'bla3']
csvdata = pandas.read_csv('testing.csv')
textdata = pandas.DataFrame(table, columns=textcols)
# Add any new columns (Index.difference replaces the older `columns - columns` set difference)
newCols = textdata.columns.difference(csvdata.columns)
for c in newCols:
    csvdata[c] = None
mergecols = ['MATCH2', 'MATCH1', 'MATCH3']
csvdata.set_index(mergecols, inplace=True, drop=False)
textdata.set_index(mergecols, inplace=True, drop=False)
csvdata.update(textdata)
csvdata.to_csv('testing2.csv', index=False)
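As a quick standalone check of the parsing step, this sketch applies the same `studentlike` pattern to two lines taken from testing.txt. It only illustrates what the pattern captures; it does not touch the CSV files or the float issue.

```python
import re

# Same pattern as in testing.py: a count, a name (may contain spaces), then an "a/b" pair
studentlike = re.compile(r'(\d+) (.+) (\d+/\d+)')

# A student line: the greedy (.+) backtracks so the name keeps its inner space
student = studentlike.match('1 Elz M 90000/222311')
print(student.groups())   # ('1', 'Elz M', '90000/222311')

# A data line does not start with digits, so it ends the student loop
data_line = studentlike.match('Data $50.90')
print(data_line)          # None
```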
testing.csv
- http://pastebin.com/raw.php?i=HxVE0nA0 (uploaded externally because of the file size)
testing.txt
MData (N/A)
DMATCH1
3 Tommy 144512/23332
1 Jim 90000/222311
1 Elz M 90000/222311
1 Ben 90000/222311
Data $50.90
misc2 $10.40
bla3 $20.20


MData (B/B)
DMATCH2
4 James Smith 2333/114441
4 Mike 90000/222311
4 Jessica Long 2333/114441
Data $50.90
bla3 $5.44
Does anyone know how to fix this?
(The example above reproduces the problem 100% reliably. It took me a long time to narrow down what triggers it.)
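For reference, a hedged sketch of two ways the round trip can be kept text-stable, assuming the numeric columns only need to pass through unchanged. The inline sample data and the names `raw`, `as_text`, and `as_float` are made up for illustration; `dtype=str` on `read_csv` and `float_format` on `to_csv` are existing pandas parameters, but whether either one fits the real testing.csv is an assumption.

```python
import io
import pandas as pd

# Hypothetical two-row sample standing in for testing.csv
raw = "name,value\na,0.357\nb,0.1128\n"

# Option 1: read every column as text, so no float conversion ever happens
as_text = pd.read_csv(io.StringIO(raw), dtype=str)
out_text = as_text.to_csv(index=False)

# Option 2: parse floats as usual, but limit the digits written back out
as_float = pd.read_csv(io.StringIO(raw))
out_float = as_float.to_csv(index=False, float_format='%.10g')

print(out_text.splitlines())
print(out_float.splitlines())
```

Option 1 is the safer choice when the script never does arithmetic on the values, since the original text is copied verbatim. Option 2 keeps numeric dtypes but would also shorten any value that genuinely has more than 10 significant digits. Newer pandas versions additionally accept `float_precision='round_trip'` in `read_csv`, which may avoid the parsing difference at the source.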
【Discussion】:
Tags: python python-2.7 csv pandas