【发布时间】:2020-03-11 02:30:39
【问题描述】:
我不知道如何描述我遇到的问题,所以我只是展示一下。 我有 2 个数据表,我正在使用正则表达式来搜索和提取这些表中的值,这取决于它是否与正确的单词匹配。我会把整个脚本供参考。
import re
import os
import pandas as pd
import numpy as np
os.chdir('C:/Users/Sams PC/Desktop')
f=open('test5.txt', 'w')
NHSQC=pd.read_csv('NHSQC.txt', sep='\s+', header=None)
NHSQC.columns=['Column_1','Column_2','Column_3']
HNCA=pd.read_csv('HNCA.txt', sep='\s+', header=None)
HNCA.columns=['Column_1','Column_2','Column_3','Column_4']
x=re.findall('[A-Z][0-9][0-9][A-Z]-[H][N]',str(NHSQC))
y=re.findall('[A-Z][0-9][0-9][A-Z]-[C][A]-[H][N]',str(HNCA))
print (NHSQC)
print (HNCA)
print(x)
print (y)
data=[]
label=[]
for i in range (0,6):
if x[i] in str(NHSQC):
data2=NHSQC.set_index('Column_1',drop=False)
data3=(data2.loc[str(x[i]), 'Column_2':'Column_3'])
data.extend(list(data3))
a=[x[i]]
label.extend(a)
label.extend(a)
if y[i] in str(HNCA):
data2=HNCA.set_index('Column_1',drop=False)
data3=(data2.loc[str(y[i]),'Column_3'])
data.append(data3)
a=[y[i]]
label.extend(a)
else:
print('Not Found')
else:
print('Not Found')
data6=[label,data]
matrix=data6
data5=np.transpose(matrix)
print(data5)
f.write(str(data5))
f.close()
此脚本完全按照我的意愿执行,并且在我运行测试数据文件时按预期工作,但在我运行实际数据文件时失败。我不知道如何解释这个问题,所以我只是展示一下。这是输出:
Column_1 Column_2 Column_3
0 S31N-HN 114.424 7.390
1 Y32N-HN 121.981 7.468
2 Q33N-HN 120.740 8.578
3 A34N-HN 118.317 7.561
4 G35N-HN 106.764 7.870
.. ... ... ...
89 R170N-HN 118.078 7.992
90 S171N-HN 110.960 7.930
91 R172N-HN 119.112 7.268
92 999_XN-HN 116.703 8.096
93 1000_XN-HN 117.530 8.040
[94 rows x 3 columns]
Column_1 Column_2 Column_3 Column_4
0 Assignment w1 w2 w3
1 S31N-A30CA-S31HN 114.424 54.808 7.393
2 S31N-A30CA-S31HN 126.854 53.005 9.277
3 S31N-CA-HN 114.424 61.717 7.391
4 S31N-HA-HN 126.864 59.633 9.287
.. ... ... ... ...
173 R170N-CA-HN 118.016 60.302 7.999
174 S171N-R170CA-S171HN 110.960 60.239 7.932
175 S171N-CA-HN 110.960 60.946 7.931
176 R172N-S171CA-R172HN 119.112 60.895 7.264
177 R172N-CA-HN 119.112 55.093 7.265
[178 rows x 4 columns]
['S31N-HN', 'Y32N-HN', 'Q33N-HN', 'A34N-HN', 'G35N-HN']
['S31N-CA-HN']
Traceback (most recent call last):
File "test.py", line 29, in <module>
if y[i] in str(HNCA):
IndexError: list index out of range
如您所见,存在一个问题,因为我的 y 正则表达式没有找到所有值。此外,我的 x 正则表达式找到了多少个问题(只有 5 个而不是应该的数百个)。最初我认为这只是一个显示的东西(它没有显示数百个匹配项,因为它需要太长时间),我还认为它中间的 ... 打印我的表格也是为了显示目的。但是,如果我复制部分 HNCA.txt 数据并将其另存为单独的文件,则可以解决问题。
[94 rows x 3 columns]
Column_1 Column_2 Column_3 Column_4
0 Assignment w1 w2 w3
1 S31N-A30CA-S31HN 114.424 54.808 7.393
2 S31N-A30CA-S31HN 126.854 53.005 9.277
3 S31N-CA-HN 114.424 61.717 7.391
4 S31N-HA-HN 126.864 59.633 9.287
5 Y32N-S31CA-Y32HN 121.981 61.674 7.467
6 Y32N-CA-HN 121.981 60.789 7.469
7 Q33N-Y32CA-Q33HN 120.770 60.775 8.582
8 Q33N-CA-HN 120.701 58.706 8.585
9 A34N-Q33CA-A34HN 118.317 58.740 7.559
10 A34N-CA-HN 118.317 52.260 7.565
11 G35N-A34CA-G35HN 106.764 52.195 7.868
12 G35N-CA-HN 106.764 46.507 7.868
13 R36N-G35CA-R36HN 117.833 46.414 8.111
14 R36N-CA-HN 117.833 54.858 8.112
15 G37N-R36CA-G37HN 110.365 54.808 8.482
16 G37N-CA-HN 110.365 44.901 8.484
17 I55N-CA-HN 118.132 65.360 7.935
18 Y56N-I55CA-Y56HN 123.025 65.464 8.088
19 Y56N-CA-HN 123.025 62.195 8.082
20 A57N-Y56CA-A57HN 120.470 62.159 7.978
21 A57N-CA-HN 120.447 55.522 7.980
22 S72N-K71CA-S72HN 117.239 55.390 8.368
23 S72N-CA-HN 117.259 58.583 8.362
24 C73N-S72CA-C73HN 128.142 58.569 9.690
25 C73N-CA-HN 128.142 61.410 9.677
26 G74N-C73CA-G74HN 116.187 61.439 9.439
27 G74N-CA-HN 116.194 46.528 9.437
28 H75N-G74CA-H75HN 122.640 46.307 9.642
29 H75N-CA-HN 122.621 56.784 9.644
30 C76N-H75CA-C76HN 122.775 56.741 7.152
31 C76N-CA-HN 122.738 57.527 7.146
32 R77N-C76CA-R77HN 120.104 57.532 8.724
33 R77N-CA-HN 120.135 59.674 8.731
['S31N-HN', 'Y32N-HN', 'Q33N-HN', 'A34N-HN', 'G35N-HN']
['S31N-CA-HN', 'Y32N-CA-HN', 'Q33N-CA-HN', 'A34N-CA-HN', 'G35N-CA-HN', 'R36N-CA-HN', 'G37N-CA-HN', 'I55N-CA-HN', 'Y56N-CA-HN', 'A57N-CA-HN', 'S72N-CA-HN', 'C73N-CA-HN', 'G74N-CA-HN', 'H75N-CA-HN', 'C76N-CA-HN', 'R77N-CA-HN']
[['S31N-HN' '114.42399999999999']
我不会发布整个输出,但如您所见,现在它找到了所有正确的匹配项。它现在也显示整个表格,而不是做......并且只显示上半部分和下半部分。我不完全明白这个问题是从哪里引起的。为什么它只显示我表格的上半部分和下半部分,但是如果我将其复制并粘贴到另一个文件中,它会显示整个内容。为什么即使没有显示正则表达式也不搜索整个表格(基于它显示上半部分和下半部分的事实,让我认为整个表格都在那里,但它又没有显示它,因为它试图简化显示,但是为什么显示的内容会影响正则表达式正在搜索的内容)?
【问题讨论】: