【Posted】: 2021-04-03 23:05:35
【Question】:
Recently I asked the following question, "Unwanted white spaces resulting into distorted column", and @sharathnatraj's answer was satisfactory and worked well.
The answer was:
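import io
import re
import pandas as pd

with open('trial1.txt', 'r') as f:
    lines = f.readlines()
# Remove the single whitespace between two adjacent lowercase words of 5+ letters
l = [re.sub(r"([a-z]{5,})\s([a-z]{5,})", r"\1\2", line) for line in lines]
df = pd.read_csv(io.StringIO('\n'.join(l)), delim_whitespace=True)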
Sample dataset:
1 CAgF3O3S silver trifluoromethanesulfonate 2923-28-6 256.937 629.15 1 --- --- --- --- --- --- --- ---
2 CAgN silver cyanide 506-64-9 133.884 >573.15 1 --- --- --- --- --- --- --- ---
3 CAgNO silver cyanate 3315-16-0 149.883 --- --- --- --- --- --- --- --- --- ---
4 CAgNS silver-i- thiocyanate 1701-93-5 165.950 --- --- --- --- --- --- --- --- --- ---
5 CAgN3O6 silver trinitromethanide 25987-94-4 257.894 370.95 1 --- --- --- --- --- --- --- ---
6 CAgN3S2 silver azidodithioformate 74093-43-9 226.030 --- --- 1154.15 3 --- --- --- --- --- ---
7 CAg2Cl3O3P silver trichloromethanephosphonate --- 413.073 --- --- --- --- --- --- --- --- --- ---
8 CAg2N2 disilver cyanamide --- 255.757 --- --- --- --- --- --- --- --- --- ---
9 CAg2O3 silver carbonate 534-16-7 275.741 487.15 1 --- --- --- --- --- --- --- ---
10 CAsCl2F3 dichloro-trifluoro-methyl-arsine 421-32-9 214.833 --- --- 353.30 3 --- --- --- --- --- ---
11 CAuN gold-i- cyanide 506-65-0 222.985 --- --- --- --- --- --- --- --- --- ---
12 CB4 boron carbide 12069-32-8 55.255 2623.15 1 3773.15 3 --- --- --- --- --- ---
13 CBaO3 barium carbonate 513-77-9 197.336 811.00 1 1723.15 3 --- --- --- --- --- ---
14 CBrClF2 bromochlorodifluoromethane 353-59-3 165.365 113.65 1 270.60 1 25 1.8100 1 25 1.3371 2
15 CBrClN2O4 bromochlorodinitromethane 33829-48-0 219.379 282.45 1 --- --- 20 2.3040 3 25 1.5710 2
16 CBrCl2F bromodichlorofluoromethane 353-58-2 181.819 113.65 1 325.90 1 25 1.6960 3 25 1.5755 2
17 CBrCl3 bromotrichloromethane 75-62-7 198.273 252.15 1 376.65 1 25 1.9940 1 25 1.5060 2
18 CBrFO carbonic bromide fluoride 753-56-0 126.913 --- --- 252.59 3 --- --- --- 25 1.5660 2
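However, I realized that the solution above works when the chemical name consists of two words, but the columns get distorted when the name has more than two words (e.g. row 18).
So I tried modifying it as follows, but it does not work: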
l = [re.sub(r"([a-z]{5,})\s([a-z]{5,})\s([a-z]{5,})", r"\1\2\3", line) for line in lines]
With this pattern, row 18 is fixed, but other rows (e.g. 1 to 5) get distorted instead.
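The behaviour of the two patterns can be reproduced on shortened copies of rows 2 and 18. This is only an illustrative sketch: the constant names are made up here, and the rows are truncated after the molecular weight. The two-word pattern leaves a space in the three-word name because `re.sub` replaces non-overlapping matches, while the three-word pattern never matches a two-word name at all.

```python
import re

# Hypothetical names for the two patterns used in the question
TWO_WORDS = r"([a-z]{5,})\s([a-z]{5,})"
THREE_WORDS = r"([a-z]{5,})\s([a-z]{5,})\s([a-z]{5,})"

# Rows 2 and 18 of the sample, truncated after the molecular weight
row_2 = "2 CAgN silver cyanide 506-64-9 133.884"
row_18 = "18 CBrFO carbonic bromide fluoride 753-56-0 126.913"

# Two-word pattern: joins row 2 fully, but only the first two words of row 18
two_on_2 = re.sub(TWO_WORDS, r"\1\2", row_2)
two_on_18 = re.sub(TWO_WORDS, r"\1\2", row_18)

# Three-word pattern: joins row 18 fully, but leaves row 2 untouched
three_on_2 = re.sub(THREE_WORDS, r"\1\2\3", row_2)
three_on_18 = re.sub(THREE_WORDS, r"\1\2\3", row_18)

for result in (two_on_2, two_on_18, three_on_2, three_on_18):
    print(result)
```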
In my dataset, chemical names contain up to 4 spaces (not shown here).
So I would like to know whether there is a way to solve this problem.
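One possible direction, sketched here under assumptions rather than offered as a confirmed fix, is to stop counting words and instead anchor on the fields around the name. The sketch assumes every row has the layout seen in the sample: an index, a formula, the name, then either a CAS number or `---`, then a decimal molecular weight. The names `ROW` and `split_row` are hypothetical.

```python
import re

# Assumed layout (from the sample only): index, formula, name with any
# number of words, CAS number or '---', molecular weight, remaining fields.
ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+(.+?)\s+(\d+-\d+-\d+|---)\s+(\d+\.\d+)\s+(.*)$"
)

def split_row(line):
    """Split one line into fields, keeping the chemical name as one field."""
    m = ROW.match(line)
    if m is None:
        raise ValueError(f"unexpected row layout: {line!r}")
    index, formula, name, cas, weight, rest = m.groups()
    # The lazy name group stops at the first CAS-like token that is
    # followed by a decimal number; the tail is split on whitespace.
    return [index, formula, name, cas, weight, *rest.split()]

sample = [
    "4 CAgNS silver-i- thiocyanate 1701-93-5 165.950 --- --- --- --- --- --- --- --- --- ---\n",
    "7 CAg2Cl3O3P silver trichloromethanephosphonate --- 413.073 --- --- --- --- --- --- --- --- --- ---\n",
    "18 CBrFO carbonic bromide fluoride 753-56-0 126.913 --- --- 252.59 3 --- --- --- 25 1.5660 2\n",
]
rows = [split_row(line) for line in sample]
for row in rows:
    print(row[2], len(row))
```

If this layout assumption holds for the full file, the resulting lists could be passed to `pd.DataFrame(rows)`. Names that themselves contain a CAS-like token followed by a decimal number would break the sketch.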
【Comments】:
-
Have you tried the second solution proposed in stackoverflow.com/questions/65380200/…? That one works for me.
-
@Lydia van Dyke: Yes, I did. It gives me similar output, especially when there are more than 2 spaces.
Tags: python pandas string dataframe csv