【Question title】: python scrape next strings to a given string
【Posted】: 2019-08-16 23:04:13
【Question】:

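I have 1000+ txt files to scrape (Python). I have already created a file_list variable that lists the paths of all the .txt files. I have five fields to scrape: file form, date, company, company ID, and price range. The first four give me no trouble, because they are well structured on separate lines at the top of each .txt file: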

FILE FORM:      10-K
DATE:           20050630
COMPANY:        APPLE INC
COMPANY CIK:    123456789

I used the following code for these four:

    import sys, os, re

    exemptions = []
    for eachfile in file_list:
        line2 = ""  # accumulates the file's text line by line for later use
        with open(eachfile, 'r') as f:
            for line in f:
                line2 += line  # append each line
                if "FILE FORM" in line:
                    # drop the label, then strip the newline and padding
                    exemptions.append(line.replace("FILE FORM:", "").strip())
                elif "COMPANY CIK" in line:
                    # checked before "COMPANY", which would otherwise match this line too
                    exemptions.append(line.replace("COMPANY CIK:", "").strip())
                elif "COMPANY" in line:
                    exemptions.append(line.replace("COMPANY:", "").strip())
                elif "DATE" in line:
                    exemptions.append(line.replace("DATE:", "").strip())
    print(exemptions)

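This gives me a list, exemptions, with all the associated values from the example above. However, the "price range" field sits in the middle of the .txt file, like this: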

We anticipate that the initial public offering price will be between $         and
$         per share.

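I don't know how to capture $whateveritis;and $whateveritis;per share. as my fifth and final variable. The good news is that many files use the same structure, and sometimes I have dollar amounts instead of "&nbsp;". Example: We anticipate that the initial public offering price will be between $12.00 and $15.00  per share.

I would like "12.00;and;15.00" as the fifth variable in my exemptions list (or something similar that I can easily handle in a csv file).

Thank you very much.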

【Comments】:

    Tags: python string parsing


    【Solution 1】:

    It looks like you have already imported re, so why not use it? A regular expression such as \$[\d.]+ and \$[\d.]+ should match the prices, and you can easily refine it from there:

        import sys, os, re

        exemptions = []
        for eachfile in file_list:
            line2 = ""
            with open(eachfile, 'r') as f:
                for line in f:
                    line2 = line2 + line

                    # raw string, so the backslashes reach the regex engine unchanged
                    m = re.search(r'\$[\d.]+ and \$[\d.]+', line)

                    if "FILE FORM" in line:
                        ...  # field handling from the question, elided here
                    elif m:
                        exemptions.append(m.group(0))   # m.group(0) is the full text of the first match on this line; refine it from there

        print(exemptions)
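
One thing to watch: in the sample from the question, the price sentence breaks across two lines ("... between $ and" / "$ per share."), so a line-by-line search may miss it. A minimal sketch of one way around that is to search the accumulated text (line2 in the code above) once per file, allowing any whitespace or "&nbsp;" between the pieces. The names PRICE_RANGE and extract_price_range are hypothetical, and the pattern assumes the sentence always reads "between $... and $... per share".

```python
import re

# Hypothetical pattern, not part of the original answer.
# Amounts may be missing (blank or "&nbsp;" placeholders), so each group
# is allowed to be empty; \s also covers the line break inside the sentence.
PRICE_RANGE = re.compile(
    r"between\s+\$(?P<low>[\d.]*)(?:\s|&nbsp;?)*and(?:\s|&nbsp;?)+"
    r"\$(?P<high>[\d.]*)(?:\s|&nbsp;?)*per\s+share",
    re.IGNORECASE,
)


def extract_price_range(text):
    """Return 'low;and;high', or None when the sentence is not found."""
    m = PRICE_RANGE.search(text)
    if m is None:
        return None
    return "{};and;{}".format(m.group("low"), m.group("high"))


sample = (
    "We anticipate that the initial public offering price will be between $12.00 and\n"
    "$15.00  per share."
)
print(extract_price_range(sample))  # 12.00;and;15.00
```

Inside the loop this could be called once after the file has been read, for example exemptions.append(extract_price_range(line2)). Files where the amounts are still blank would yield ";and;", and files without the sentence would yield None, which makes the unfilled cases easy to spot in a csv.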
    

    【Comments】:
