Variable number of unwanted white spaces resulting into distorted column: answers

【Question title】: Variable number of unwanted white spaces resulting into distorted column
【Posted】: 2021-04-03 23:05:35
【Question description】:

Recently I asked the following question, Unwanted white spaces resulting into distorted column, and the answer from @sharathnatraj was satisfactory and worked well.

The answer was:

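import io
import re
import pandas as pd

with open('trial1.txt', 'r') as f:
    lines = f.readlines()
# Glue two adjacent lowercase words (5+ letters each) together so the name stays in one column
l = [re.sub(r"([a-z]{5,})\s([a-z]{5,})", r"\1\2", line) for line in lines]
# sep=r"\s+" splits on runs of whitespace (same effect as delim_whitespace=True)
df = pd.read_csv(io.StringIO('\n'.join(l)), sep=r"\s+")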

Sample dataset:

    1 CAgF3O3S silver trifluoromethanesulfonate 2923-28-6 256.937 629.15 1 --- --- --- --- --- --- --- ---
    2 CAgN silver cyanide 506-64-9 133.884 >573.15 1 --- --- --- --- --- --- --- ---
    3 CAgNO silver cyanate 3315-16-0 149.883 --- --- --- --- --- --- --- --- --- ---
    4 CAgNS silver-i- thiocyanate 1701-93-5 165.950 --- --- --- --- --- --- --- --- --- ---
    5 CAgN3O6 silver trinitromethanide 25987-94-4 257.894 370.95 1 --- --- --- --- --- --- --- ---
    6 CAgN3S2 silver azidodithioformate 74093-43-9 226.030 --- --- 1154.15 3 --- --- --- --- --- ---
    7 CAg2Cl3O3P silver trichloromethanephosphonate --- 413.073 --- --- --- --- --- --- --- --- --- ---
    8 CAg2N2 disilver cyanamide --- 255.757 --- --- --- --- --- --- --- --- --- ---
    9 CAg2O3 silver carbonate 534-16-7 275.741 487.15 1 --- --- --- --- --- --- --- ---
    10 CAsCl2F3 dichloro-trifluoro-methyl-arsine 421-32-9 214.833 --- --- 353.30 3 --- --- --- --- --- ---
    11 CAuN gold-i- cyanide 506-65-0 222.985 --- --- --- --- --- --- --- --- --- ---
    12 CB4 boron carbide 12069-32-8 55.255 2623.15 1 3773.15 3 --- --- --- --- --- ---
    13 CBaO3 barium carbonate 513-77-9 197.336 811.00 1 1723.15 3 --- --- --- --- --- ---
    14 CBrClF2 bromochlorodifluoromethane 353-59-3 165.365 113.65 1 270.60 1 25 1.8100 1 25 1.3371 2
    15 CBrClN2O4 bromochlorodinitromethane 33829-48-0 219.379 282.45 1 --- --- 20 2.3040 3 25 1.5710 2
    16 CBrCl2F bromodichlorofluoromethane 353-58-2 181.819 113.65 1 325.90 1 25 1.6960 3 25 1.5755 2
    17 CBrCl3 bromotrichloromethane 75-62-7 198.273 252.15 1 376.65 1 25 1.9940 1 25 1.5060 2
    18 CBrFO carbonic bromide fluoride 753-56-0 126.913 --- --- 252.59 3 --- --- --- 25 1.5660 2

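However, I realized that the above solution works when there are 2 white spaces in the chemical name, and that the columns are distorted when there are more than 2 white spaces (e.g. row 18).

So I tried modifying it as follows, but it does not work: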

l = [re.sub(r"([a-z]{5,})\s([a-z]{5,})\s([a-z]{5,})", r"\1\2\3", line) for line in lines]

With this solution, row 18 is fixed, but other rows (e.g. 1 to 5) are distorted.
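The behaviour described above can be reproduced on two rows of the sample dataset. The sketch below is only an illustration (the variable names are made up for this example): the two-group pattern needs exactly one adjacent pair of words, while the three-group pattern only matches when three lowercase words follow each other, so it leaves two-word names untouched.

```python
import re

two_words = re.compile(r"([a-z]{5,})\s([a-z]{5,})")
three_words = re.compile(r"([a-z]{5,})\s([a-z]{5,})\s([a-z]{5,})")

row_2 = "2 CAgN silver cyanide 506-64-9 133.884 >573.15 1 --- --- --- --- --- --- --- ---"
row_18 = "18 CBrFO carbonic bromide fluoride 753-56-0 126.913 --- --- 252.59 3 --- --- --- 25 1.5660 2"

# Two-group pattern: merges "silver cyanide", but only the first pair of row 18
row_2_two = two_words.sub(r"\1\2", row_2)
row_18_two = two_words.sub(r"\1\2", row_18)

# Three-group pattern: merges all of "carbonic bromide fluoride", but finds nothing in row 2
row_2_three = three_words.sub(r"\1\2\3", row_2)
row_18_three = three_words.sub(r"\1\2\3", row_18)

for label, text in [("row 2 / two", row_2_two), ("row 18 / two", row_18_two),
                    ("row 2 / three", row_2_three), ("row 18 / three", row_18_three)]:
    print(label, len(text.split()), "fields")
```

A row is only parsed correctly when it ends up with the same number of fields as the other rows, which neither fixed-length pattern achieves for both rows at once.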

In my dataset, chemical names have up to 4 white spaces (not shown here).

So I would like to know whether there is any way to solve this problem.

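【Comments on the question】:

Did you try the second solution proposed in stackoverflow.com/questions/65380200/…? For me, that one works.
@Lydia van Dyke: Yes, I did. It gave me similar output, especially when I have more than 2 white spaces.

Tags: python pandas string dataframe csv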

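【Solution 1】:

So, it seems the name column should collect all strings until we reach something that looks like a number or a bunch of minus signs. My approach goes like this: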

import re
import pandas as pd

# A token starting with a digit or a minus sign ends the name column
# (e.g. the CAS number 2923-28-6 or the placeholder ---).
numeric = re.compile(r"[0-9-]+")
sep = "|"

if __name__ == "__main__":
    with open('trial1.txt', 'r') as f:
        with open('tmp.txt', 'w') as tmp_file:
            for line_no, line in enumerate(f, start=1):
                # split() without arguments ignores leading, trailing and repeated whitespace
                raw_cols = line.split()
                fixed_cols = []
                merging = False

                for i, col in enumerate(raw_cols):
                    if numeric.match(col):
                        merging = False
                    if merging:
                        fixed_cols[2] += " " + col
                    else:
                        fixed_cols.append(col)

                    # Index 2 holds the first word of the name; line 1 is treated as the header
                    if i == 2 and line_no > 1:
                        merging = True

                tmp_file.write(sep.join(fixed_cols) + "\n")

    df = pd.read_csv("tmp.txt", sep=sep)

    print(df)

I assume there is no pipe symbol | in the file. The intermediate result is stored in the file tmp.txt. When merging columns, I add an extra space: fixed_cols[2] += " " + col.
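The same idea can also be sketched without a temporary file. The helper below is hypothetical (merge_name_tokens and its name_index parameter are not part of the answer); it assumes the name always starts at token index 2 and never contains a token that begins with a digit or a minus sign.

```python
import re

# Same stop condition as in the answer: a token starting with a digit or "-"
NUMERIC = re.compile(r"[0-9-]+")


def merge_name_tokens(line, name_index=2):
    """Join the name tokens until a numeric-looking token shows up."""
    tokens = line.split()
    end = name_index + 1
    while end < len(tokens) and not NUMERIC.match(tokens[end]):
        end += 1
    name = " ".join(tokens[name_index:end])
    return tokens[:name_index] + [name] + tokens[end:]


# Three rows copied from the sample dataset in the question
sample_lines = [
    "4 CAgNS silver-i- thiocyanate 1701-93-5 165.950 --- --- --- --- --- --- --- --- --- ---",
    "14 CBrClF2 bromochlorodifluoromethane 353-59-3 165.365 113.65 1 270.60 1 25 1.8100 1 25 1.3371 2",
    "18 CBrFO carbonic bromide fluoride 753-56-0 126.913 --- --- 252.59 3 --- --- --- 25 1.5660 2",
]

rows = [merge_name_tokens(line) for line in sample_lines]
for row in rows:
    print(len(row), row[2])
```

Since every resulting list has the same length, the rows could be passed to pd.DataFrame(rows) directly, which would avoid both tmp.txt and the assumption about the | symbol.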

【Comments】:

【Solution 2】:

You can try the following solution, which is similar to the second one in this question (that one is also mine):

Unwanted white spaces resulting into distorted column

import pandas as pd

with open('trial1.txt') as f:
    l=f.readlines()

l=[i.split() for i in l]
target=len(l[1])  # number of fields in a correctly split row
for i in range(1,len(l)):
    if len(l[i])>target:
        # merge the extra token back into the name column (index 2)
        l[i][2]=l[i][2]+' '+l[i][3]
        l[i].pop(3)
# supposing that there is no '#' in your entire file, otherwise use some other rare symbol that doesn't exist in your file
l=['#'.join(k) for k in l]
l=[i+'\n' for i in l]

with open('trial2.txt', 'w') as f:
    f.writelines(l)

df = pd.read_csv('trial2.txt', sep='#', index_col=0)

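Some additional notes:

Take care with target. I use the first row as the correct length; if this row is not correct, you must use another row or, better still, assign target manually. In your case it is 14.
If you have more spaces that split the elements of your 3rd column into more than 2 columns, you can use the same logic as below: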
```
if len(l[i])>target:
    l[i][2]=l[i][2]+' '+l[i][3]
    l[i].pop(3)
```

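For example, if the length is 16, this means the 3rd column has been split into 3 parts, and you can use this:

if len(l[i])==16:
    l[i][2]=l[i][2]+' '+l[i][3]+' '+l[i][4]
    l[i].pop(4)
    l[i].pop(3)

And combine all of these in one if statement, like this: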

if len(l[i])==16:
    l[i][2]=l[i][2]+' '+l[i][3]+' '+l[i][4]
    l[i].pop(4)
    l[i].pop(3)
elif len(l[i])==15:
    l[i][2]=l[i][2]+' '+l[i][3]
    l[i].pop(3)

You can add as many branches as you need above this code, for length == 17, length == 18, etc.
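Instead of one branch per length, the surplus can be computed once and merged in a single step. This is only a sketch of that generalisation, not part of the answer: merge_extra_tokens is a hypothetical helper, and target is set to 15 because that is how many tokens a one-word-name row has in the sample rows shown in the question (the right value for the full file may differ).

```python
def merge_extra_tokens(tokens, target, name_index=2):
    """Merge surplus tokens into the name column so the row has `target` fields."""
    tokens = list(tokens)  # work on a copy
    extra = len(tokens) - target
    if extra > 0:
        stop = name_index + extra + 1
        # Replace the name token and the `extra` tokens after it with one joined token
        tokens[name_index:stop] = [" ".join(tokens[name_index:stop])]
    return tokens


# Three rows copied from the sample dataset: one-, two- and three-word names
sample_lines = [
    "14 CBrClF2 bromochlorodifluoromethane 353-59-3 165.365 113.65 1 270.60 1 25 1.8100 1 25 1.3371 2",
    "1 CAgF3O3S silver trifluoromethanesulfonate 2923-28-6 256.937 629.15 1 --- --- --- --- --- --- --- ---",
    "18 CBrFO carbonic bromide fluoride 753-56-0 126.913 --- --- 252.59 3 --- --- --- 25 1.5660 2",
]

target = 15  # assumed field count of a correctly split row in this sample
fixed = [merge_extra_tokens(line.split(), target) for line in sample_lines]
for row in fixed:
    print(len(row), row[2])
```

This relies on the same assumption as the if/elif version: every surplus token belongs to the name column.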

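【Comments】:

Hi @IoaTzimas: a sample of my txt file is given above under "Sample dataset". My txt file has the same structure. Thanks.
What about the header row from stackoverflow.com/questions/65380200/…? With the header, I was able to parse the data. Without the header, I get an error. pandas.errors.ParserError: Error tokenizing data. C error: Expected 15 fields in line 7, saw 16
Hi @LydiavanDyke, I updated that answer to make it more general. The problem is that the header row has 16 items and the remaining rows have 14. I used len(l[0]) (the header) as target, which does not work for the OP, but it works with target=len(l[1]), which is why I updated it. The best way is to set target manually, since the number of columns of the desired output is known. In your case you can set it manually by changing target=len(l[1]) to target=16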
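When setting target manually, it can help to first look at how many whitespace-separated fields each line has. The following is a minimal sketch on three rows from the sample dataset; taking the smallest count as target assumes that at least one row has a one-word name and that all rows share the same number of other columns.

```python
from collections import Counter

# Three rows copied from the sample dataset in the question
sample_lines = [
    "2 CAgN silver cyanide 506-64-9 133.884 >573.15 1 --- --- --- --- --- --- --- ---",
    "17 CBrCl3 bromotrichloromethane 75-62-7 198.273 252.15 1 376.65 1 25 1.9940 1 25 1.5060 2",
    "18 CBrFO carbonic bromide fluoride 753-56-0 126.913 --- --- 252.59 3 --- --- --- 25 1.5660 2",
]

# Map "number of fields" to "number of lines with that many fields"
field_counts = Counter(len(line.split()) for line in sample_lines)
print(field_counts)

# Assumption: the shortest rows are the ones whose name is a single word
target = min(field_counts)
print("suggested target:", target)
```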