【Question title】: Variable number of unwanted white spaces resulting into distorted column
【Posted】: 2021-04-03 23:05:35
【Question description】:

Recently, I asked the following question: Unwanted white spaces resulting into distorted column. The answer by @sharathnatraj was satisfactory and worked well.

The answer was:

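import io
import re
import pandas as pd

with open('trial1.txt', 'r') as f:
    lines = f.readlines()
# Glue two adjacent lowercase words (5+ letters each) so the name stays in one column
l = [re.sub(r"([a-z]{5,})\s([a-z]{5,})", r"\1\2", line) for line in lines]
df = pd.read_csv(io.StringIO('\n'.join(l)), delim_whitespace=True)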

Sample dataset:

    1 CAgF3O3S silver trifluoromethanesulfonate 2923-28-6 256.937 629.15 1 --- --- --- --- --- --- --- ---
    2 CAgN silver cyanide 506-64-9 133.884 >573.15 1 --- --- --- --- --- --- --- ---
    3 CAgNO silver cyanate 3315-16-0 149.883 --- --- --- --- --- --- --- --- --- ---
    4 CAgNS silver-i- thiocyanate 1701-93-5 165.950 --- --- --- --- --- --- --- --- --- ---
    5 CAgN3O6 silver trinitromethanide 25987-94-4 257.894 370.95 1 --- --- --- --- --- --- --- ---
    6 CAgN3S2 silver azidodithioformate 74093-43-9 226.030 --- --- 1154.15 3 --- --- --- --- --- ---
    7 CAg2Cl3O3P silver trichloromethanephosphonate --- 413.073 --- --- --- --- --- --- --- --- --- ---
    8 CAg2N2 disilver cyanamide --- 255.757 --- --- --- --- --- --- --- --- --- ---
    9 CAg2O3 silver carbonate 534-16-7 275.741 487.15 1 --- --- --- --- --- --- --- ---
    10 CAsCl2F3 dichloro-trifluoro-methyl-arsine 421-32-9 214.833 --- --- 353.30 3 --- --- --- --- --- ---
    11 CAuN gold-i- cyanide 506-65-0 222.985 --- --- --- --- --- --- --- --- --- ---
    12 CB4 boron carbide 12069-32-8 55.255 2623.15 1 3773.15 3 --- --- --- --- --- ---
    13 CBaO3 barium carbonate 513-77-9 197.336 811.00 1 1723.15 3 --- --- --- --- --- ---
    14 CBrClF2 bromochlorodifluoromethane 353-59-3 165.365 113.65 1 270.60 1 25 1.8100 1 25 1.3371 2
    15 CBrClN2O4 bromochlorodinitromethane 33829-48-0 219.379 282.45 1 --- --- 20 2.3040 3 25 1.5710 2
    16 CBrCl2F bromodichlorofluoromethane 353-58-2 181.819 113.65 1 325.90 1 25 1.6960 3 25 1.5755 2
    17 CBrCl3 bromotrichloromethane 75-62-7 198.273 252.15 1 376.65 1 25 1.9940 1 25 1.5060 2
    18 CBrFO carbonic bromide fluoride 753-56-0 126.913 --- --- 252.59 3 --- --- --- 25 1.5660 2

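However, I realised that the above solution works when there are 2 white spaces in the chemical name, and that the columns get distorted when there are more than 2 white spaces (e.g. row 18).

So I tried modifying it as follows, but it does not work: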

l = [re.sub(r"([a-z]{5,})\s([a-z]{5,})\s([a-z]{5,})", r"\1\2\3", line) for line in lines]

With this solution, row 18 is fixed, but other rows (e.g. 1 to 5) get distorted.
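The behaviour described above can be reproduced on two of the sample rows. This is only an illustrative sketch (the variable names are made up for the example, and the rows are shortened): the two-word pattern merges just the first pair of a three-word name, while the three-word pattern cannot match a two-word name at all, so that row is left unmerged.

```python
import re

# Shortened copies of rows 1 and 18 from the sample dataset
two_words = "1 CAgF3O3S silver trifluoromethanesulfonate 2923-28-6 256.937"
three_words = "18 CBrFO carbonic bromide fluoride 753-56-0 126.913"

pair = r"([a-z]{5,})\s([a-z]{5,})"
triple = r"([a-z]{5,})\s([a-z]{5,})\s([a-z]{5,})"

# The two-word pattern glues only the first pair of a three-word name,
# so "fluoride" still ends up in a column of its own.
pair_on_three = re.sub(pair, r"\1\2", three_words)

# The three-word pattern never matches a two-word name, so nothing is merged.
triple_on_two = re.sub(triple, r"\1\2\3", two_words)

print(pair_on_three)
print(triple_on_two)
```

A fixed number of capture groups therefore only fits names with exactly that many words, which is why each variant repairs some rows and breaks others.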

In my dataset, chemical names have up to 4 white spaces (not shown here).

So I would like to know whether there is any way to solve this problem.

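【Question discussion】:

  • Have you tried the second solution proposed in stackoverflow.com/questions/65380200/…? That one works for me.
  • @Lydia van Dyke: Yes, I did. It gave me a similar output, especially when I have more than 2 white spaces.

Tags: python pandas string dataframe csv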


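【Solution 1】:

So it seems the name column should collect all strings until we reach something that looks like a number or a run of minus signs. My approach looks like this: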

import re
import pandas as pd

numeric = re.compile("[0-9-]+")
sep = "|"

if __name__ == "__main__":
    with open('trial1.txt', 'r') as f:
        with open('tmp.txt', 'w') as tmp_file:
            for line_no, line in enumerate(f, start=1):
                raw_cols = line.split(" ")
                fixed_cols = []
                merging = False

                for i, raw_col in enumerate(raw_cols):
                    col = raw_col
                    if numeric.match(col):
                        merging = False
                    if merging:
                        fixed_cols[2] += " " + col
                    else:
                        fixed_cols.append(col)

                    if i == 2 and line_no > 1:
                        merging = True

                tmp_file.write(sep.join(fixed_cols))

    df = pd.read_csv("tmp.txt", sep=sep)

    print(df)

I assume there is no pipe symbol | in the file. The intermediate result is stored in the file tmp.txt. When merging columns, I add an extra space: fixed_cols[2] += " " + col
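The same idea can also be sketched without a temporary file. The following is an illustrative variant, not the code from this answer: the helper name merge_name and the shortened sample rows are assumptions made for the example, and it splits on any whitespace rather than on single spaces. It treats every field after the first name word as part of the name until a field starts with a digit or a minus sign.

```python
import re

# A field that starts with a digit or '-' is taken to end the name column
NUMERIC = re.compile(r"[0-9-]")

def merge_name(line, name_index=2):
    """Return the fields of `line` with the name words joined into one field.

    Hypothetical helper: assumes the name starts at `name_index` and that
    the first field after the name begins with a digit or '-'.
    """
    fields = line.split()
    end = name_index + 1
    while end < len(fields) and not NUMERIC.match(fields[end]):
        end += 1
    name = " ".join(fields[name_index:end])
    return fields[:name_index] + [name] + fields[end:]

# Shortened copies of rows 4, 14 and 18 from the sample dataset
sample = """\
4 CAgNS silver-i- thiocyanate 1701-93-5 165.950 --- ---
14 CBrClF2 bromochlorodifluoromethane 353-59-3 165.365 113.65 1
18 CBrFO carbonic bromide fluoride 753-56-0 126.913 --- ---
"""

rows = [merge_name(line) for line in sample.splitlines() if line.strip()]
names = [row[2] for row in rows]
print(names)
```

Every row comes back with the same number of fields, so the list of rows could then be passed to pd.DataFrame(rows). As with the original approach, a name word that itself starts with a digit would end the name too early.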

【Discussion】:

    【Solution 2】:

    You can try the following solution, which is similar to the second one in this question (that one is mine too):

    Unwanted white spaces resulting into distorted column

    import pandas as pd

    with open('trial1.txt') as f:
        l = f.readlines()

    l = [i.split() for i in l]
    target = len(l[1])  # number of fields in a correctly split row
    for i in range(1, len(l)):
        if len(l[i]) > target:
            # merge the two parts of the name back into the 3rd column
            l[i][2] = l[i][2] + ' ' + l[i][3]
            l[i].pop(3)
    # assuming there is no '#' anywhere in your file; otherwise use some other rare symbol that does not exist in your file
    l = ['#'.join(k) for k in l]
    l = [i + '\n' for i in l]

    with open('trial2.txt', 'w') as f:
        f.writelines(l)

    df = pd.read_csv('trial2.txt', sep='#', index_col=0)
    

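    Some additional notes:

    1. Take care with target. I use a single line as the reference for the correct length (l[1] in the code above); if that line is not correct, you have to use another line or, better still, assign target manually. In your case it is 14.

    2. If more white spaces split the elements of your 3rd column into more than 2 columns, you can use the same logic as the following: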

       if len(l[i])>target:
           l[i][2]=l[i][2]+' '+l[i][3]
           l[i].pop(3)
      

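    For example, if the length is 16, which means the 3rd column was split into 3 parts, you can use this:

    if len(l[i])==16:
        l[i][2]=l[i][2]+' '+l[i][3]+' '+l[i][4]
        l[i].pop(4)
        l[i].pop(3)

    And combine all of these in a single if statement, like this: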

    if len(l[i])==16:
        l[i][2]=l[i][2]+' '+l[i][3]+' '+l[i][4]
        l[i].pop(4)
        l[i].pop(3)
    elif len(l[i])==15:
        l[i][2]=l[i][2]+' '+l[i][3]
        l[i].pop(3)
    
    You can add as many branches as you need above this code, for lengths of 17, 18, etc.
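Instead of one branch per length, the merge can be repeated until the row reaches the target length. The following is a minimal sketch of that generalisation; the function name merge_extra_fields and the shortened sample rows are made up for the example. It relies on the same assumption as the code above: every surplus field belongs to the 3rd column, and all other columns are always present.

```python
def merge_extra_fields(fields, target, name_index=2):
    """Merge surplus fields into the name column until len(fields) == target.

    Hypothetical helper: assumes any fields beyond `target` are extra words
    of the name that starts at `name_index`.
    """
    fields = list(fields)  # work on a copy so the caller's list is not modified
    while len(fields) > target:
        extra = fields.pop(name_index + 1)
        fields[name_index] = fields[name_index] + " " + extra
    return fields

# Shortened copies of rows 14, 1 and 18 from the sample dataset (5 real columns)
rows = [
    "14 CBrClF2 bromochlorodifluoromethane 353-59-3 165.365".split(),
    "1 CAgF3O3S silver trifluoromethanesulfonate 2923-28-6 256.937".split(),
    "18 CBrFO carbonic bromide fluoride 753-56-0 126.913".split(),
]

fixed = [merge_extra_fields(row, target=5) for row in rows]
for row in fixed:
    print("#".join(row))
```

With target set manually to the known number of columns, the same loop covers names of two, three or more words without further branches.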
    

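    【Discussion】:

    • Hi @IoaTzimas: A sample of my txt file is given above under "Sample dataset". My txt file is structured the same way. Thanks.
    • What about the header line from stackoverflow.com/questions/65380200/…? With the header, I was able to parse the data. Without the header, I get an error. pandas.errors.ParserError: Error tokenizing data. C error: Expected 15 fields in line 7, saw 16
    • Hi @LydiavanDyke, I updated that answer to make it more general. The problem is that the header line has 16 items and the remaining lines have 14. I was using len(l[0]) (the header) as target, which does not work for the OP, but it does work with target=l[1], which is why I updated it. The best approach is to set target manually, since the number of columns in the desired output is known. In your case, you can set it manually by changing target=len(l[1]) to target=16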