【Question title】: How to do sed-like text replacement with Python?
【Posted】: 2011-05-24 13:47:52
【Question】:

I would like to enable all apt repositories in this file:

cat /etc/apt/sources.list
## Note, this file is written by cloud-init on first boot of an instance                                                                                                            
## modifications made here will not survive a re-bundle.                                                                                                                            
## if you wish to make changes you can:                                                                                                                                             
## a.) add 'apt_preserve_sources_list: true' to /etc/cloud/cloud.cfg                                                                                                                
##     or do the same in user-data
## b.) add sources in /etc/apt/sources.list.d                                                                                                                                       
#                                                                                                                                                                                   

# See http://help.ubuntu.com/community/UpgradeNotes for how to upgrade to                                                                                                           
# newer versions of the distribution.                                                                                                                                               
deb http://us-east-1.ec2.archive.ubuntu.com/ubuntu/ maverick main                                                                                                                   
deb-src http://us-east-1.ec2.archive.ubuntu.com/ubuntu/ maverick main                                                                                                               

## Major bug fix updates produced after the final release of the                                                                                                                    
## distribution.                                                                                                                                                                    
deb http://us-east-1.ec2.archive.ubuntu.com/ubuntu/ maverick-updates main                                                                                                           
deb-src http://us-east-1.ec2.archive.ubuntu.com/ubuntu/ maverick-updates main                                                                                                       

## N.B. software from this repository is ENTIRELY UNSUPPORTED by the Ubuntu                                                                                                         
## team. Also, please note that software in universe WILL NOT receive any                                                                                                           
## review or updates from the Ubuntu security team.                                                                                                                                 
deb http://us-east-1.ec2.archive.ubuntu.com/ubuntu/ maverick universe                                                                                                               
deb-src http://us-east-1.ec2.archive.ubuntu.com/ubuntu/ maverick universe                                                                                                           
deb http://us-east-1.ec2.archive.ubuntu.com/ubuntu/ maverick-updates universe
deb-src http://us-east-1.ec2.archive.ubuntu.com/ubuntu/ maverick-updates universe

## N.B. software from this repository is ENTIRELY UNSUPPORTED by the Ubuntu 
## team, and may not be under a free licence. Please satisfy yourself as to
## your rights to use the software. Also, please note that software in 
## multiverse WILL NOT receive any review or updates from the Ubuntu
## security team.
# deb http://us-east-1.ec2.archive.ubuntu.com/ubuntu/ maverick multiverse
# deb-src http://us-east-1.ec2.archive.ubuntu.com/ubuntu/ maverick multiverse
# deb http://us-east-1.ec2.archive.ubuntu.com/ubuntu/ maverick-updates multiverse
# deb-src http://us-east-1.ec2.archive.ubuntu.com/ubuntu/ maverick-updates multiverse

## Uncomment the following two lines to add software from the 'backports'
## repository.
## N.B. software from this repository may not have been tested as
## extensively as that contained in the main release, although it includes
## newer versions of some applications which may provide useful features.
## Also, please note that software in backports WILL NOT receive any review
## or updates from the Ubuntu security team.
# deb http://us-east-1.ec2.archive.ubuntu.com/ubuntu/ maverick-backports main restricted universe multiverse
# deb-src http://us-east-1.ec2.archive.ubuntu.com/ubuntu/ maverick-backports main restricted universe multiverse

## Uncomment the following two lines to add software from Canonical's
## 'partner' repository.
## This software is not part of Ubuntu, but is offered by Canonical and the
## respective vendors as a service to Ubuntu users.
# deb http://archive.canonical.com/ubuntu maverick partner
# deb-src http://archive.canonical.com/ubuntu maverick partner

deb http://security.ubuntu.com/ubuntu maverick-security main
deb-src http://security.ubuntu.com/ubuntu maverick-security main
deb http://security.ubuntu.com/ubuntu maverick-security universe
deb-src http://security.ubuntu.com/ubuntu maverick-security universe
# deb http://security.ubuntu.com/ubuntu maverick-security multiverse
# deb-src http://security.ubuntu.com/ubuntu maverick-security multiverse

With sed this is a simple sed -i 's/^# deb/deb/' /etc/apt/sources.list. What is the most elegant ("Pythonic") way to do the same thing in Python?

【Comments】:

  • pythonpy (github.com/russell91/pythonpy) gives you a nice way to do this from the command line: cat /etc/apt/sources.list | py -x 're.sub(r"^# deb", "deb", x)'

Tags: python regex linux


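【Solution 1】:

You can do it like this:

import re

# Read every line first, then reopen in "w" mode (which truncates) and write the result.
with open("/etc/apt/sources.list", "r") as sources:
    lines = sources.readlines()
with open("/etc/apt/sources.list", "w") as sources:
    for line in lines:
        sources.write(re.sub(r'^# deb', 'deb', line))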

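The with statement ensures that the file is closed properly, and reopening the file in "w" mode empties it before you write to it. re.sub(pattern, replace, string) is the equivalent of s/pattern/replace/ in sed/perl.

Edit: fixed the syntax in the example

【Comments】:

  • Using with is a good idea, but this way you just append the new sources.list to the old one.
  • This looks great (syntactically), but it duplicates the file. Do I need to truncate? Also, does this load the whole file into memory, or is it a "streaming" approach that works line by line?
  • As plundra pointed out, your solution writes non-atomically and so invites race conditions (e.g., other processes and/or threads trying to read the file while it is being rewritten). That is a problem. But it is still elegant and snappy.
  • Copying/moving the original file and then nesting with open(copied_or_moved_original, "r") as source: with open(original_name, "w") as destination inside a try...except might be a bit safer. You can then easily restore the original file if an error occurs, and it works for files too large to hold entirely in memory... (compared with writing to a temporary file and then replacing the original, overwriting the original from its copy has the advantage that it probably works better with filesystem versioning such as Shadow Copy and NTFS Streams)
  • ...fileinput.input(..., inplace=True) seems to basically do this for you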
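The fileinput approach mentioned in the last comment can be sketched roughly as follows. This is a minimal illustration, not code from any of the answers: the function name uncomment_deb_lines and the throwaway demo file are made up for the example. With inplace=True, fileinput moves the original file aside and redirects standard output into the new file, so whatever is printed inside the loop becomes the new content.

```python
import fileinput
import os
import re
import tempfile


def uncomment_deb_lines(path):
    """Rewrite `path` in place, turning '# deb...' lines into 'deb...' lines."""
    # inplace=True redirects print() output into the file being processed.
    with fileinput.input(files=[path], inplace=True) as lines:
        for line in lines:
            # Each line keeps its trailing newline, so suppress print's own.
            print(re.sub(r"^# deb", "deb", line), end="")


# Demo on a throwaway file rather than the real /etc/apt/sources.list.
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "sources.list")
with open(demo_path, "w") as f:
    f.write("# deb http://example.com/ubuntu maverick multiverse\n"
            "## a real comment\n"
            "deb http://example.com/ubuntu maverick main\n")

uncomment_deb_lines(demo_path)
with open(demo_path) as f:
    result = f.read()
```

Like the answer above, this processes one line at a time, so it should also be usable on files that do not fit comfortably in memory.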
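【Solution 2】:

Authoring a homegrown sed replacement in pure Python, with no external commands or additional dependencies, is a noble task laden with noble landmines. Who would have thought?

Nonetheless, it is feasible. It is also desirable. We've all been there, people: "I need to munge some plain-text files, but I only have Python, two plastic shoelaces, and a moldy can of bunker-grade Maraschino cherries. Help."

In this answer, we offer a best-of-breed solution that combines the awesomeness of the prior answers without all of their unpleasant not-awesomeness. As plundra notes, David Miller's otherwise top-notch answer writes the desired file non-atomically and therefore invites race conditions (e.g., from other threads and/or processes attempting to read that file concurrently). That's bad. Plundra's otherwise excellent answer solves that issue while introducing yet more, including numerous fatal encoding errors, a critical security vulnerability (failing to preserve the permissions and other metadata of the original file), and the premature optimization of replacing regular expressions with low-level character indexing. That's also bad.

Awesomeness, unite!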

import re, shutil, tempfile

def sed_inplace(filename, pattern, repl):
    '''
    Perform the pure-Python equivalent of in-place `sed` substitution: e.g.,
    `sed -i -e 's/'${pattern}'/'${repl}'/' "${filename}"`.
    '''
    # For efficiency, precompile the passed regular expression.
    pattern_compiled = re.compile(pattern)

    # For portability, NamedTemporaryFile() defaults to mode "w+b" (i.e., binary
    # writing with updating). This is usually a good thing. In this case,
    # however, binary writing imposes non-trivial encoding constraints trivially
    # resolved by switching to text writing. Let's do that.
    with tempfile.NamedTemporaryFile(mode='w', delete=False) as tmp_file:
        with open(filename) as src_file:
            for line in src_file:
                tmp_file.write(pattern_compiled.sub(repl, line))

    # Overwrite the original file with the munged temporary file in a
    # manner preserving file attributes (e.g., permissions).
    shutil.copystat(filename, tmp_file.name)
    shutil.move(tmp_file.name, filename)

# Do it for Johnny.
sed_inplace('/etc/apt/sources.list', r'^\# deb', 'deb')

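【Comments】:

  • This fails in the following case, s/^Q(.*)/"&"/: you will replace the match with a literal "&" instead of performing the intended task of wrapping the match in quotes.
  • @Adrian That is not a failure as such: it fails only because Python has a different regex dialect from the one sed uses, and it does not interpret & as the matched text. In that case you should use \g<0> (in Python, \0 is read as an octal escape rather than as the whole match). Simply telling users that they should use Python's re dialect should be fine. In that case, "sed" probably should not be in the name.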
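To make the difference discussed in these comments concrete, here is a small sketch using only the standard re module. The sample line is invented for the example. In Python's replacement syntax, \g<0> stands for the whole match (the role & plays in sed), whereas a bare & is just a literal ampersand.

```python
import re

line = "Quick brown fox\n"

# sed:    s/^Q(.*)/"&"/    (& is the whole match)
# Python: \g<0> is the whole match; a bare & has no special meaning.
quoted = re.sub(r"^Q(.*)", r'"\g<0>"', line)
literal = re.sub(r"^Q(.*)", r'"&"', line)
```

Here quoted wraps the matched text in double quotes, while literal contains the two quotes around an ampersand, which is the surprise the first comment describes.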
【Solution 3】:

massedit.py (http://github.com/elmotec/massedit) does the scaffolding for you, leaving only the regular expression to write. It is still in beta, but we are looking for feedback.

python -m massedit -e "re.sub(r'^# deb', 'deb', line)" /etc/apt/sources.list

will show the differences (before/after) in diff format.

Add the -w option to write the changes to the original file:

python -m massedit -e "re.sub(r'^# deb', 'deb', line)" -w /etc/apt/sources.list

Alternatively, you can now use the API:

>>> import massedit
>>> filenames = ['/etc/apt/sources.list']
>>> massedit.edit_files(filenames, ["re.sub(r'^# deb', 'deb', line)"], dry_run=True)

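【Comments】:

  • @MaximVeksler This fails in the following case, s/^Q(.*)/"&"/: you will replace the match with a literal "&" instead of performing the intended task of wrapping the match in quotes.
  • Is it possible to apply several regex replacements at once?
  • Yes, look at the -g (generate) and -f (function or file) options, which respectively let you create a template Python file to modify and use it on each input file processed by your source-to-source tool. If that is unclear, just generate the file with -g and inspect it. It should make more sense.
【Solution 4】:

This is such a different approach that I did not want to edit my other answer. The with statements are nested because I am not using 3.1 (where with A() as a, B() as b: works).

It might be overkill for changing sources.list, but I wanted to put it out there for future searches.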

#!/usr/bin/env python
from shutil   import move
from tempfile import NamedTemporaryFile

with NamedTemporaryFile(delete=False) as tmp_sources:
    with open("sources.list") as sources_file:
        for line in sources_file:
            if line.startswith("# deb"):
                tmp_sources.write(line[2:])
            else:
                tmp_sources.write(line)

move(tmp_sources.name, sources_file.name)

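This should make sure that nobody else hits a race condition while reading the file. Oh, and I prefer str.startswith(...) when you can do without a regex.

【Comments】:

  • I am all for not involving regular expressions where possible :) At the same time, str.startswith() and NamedTemporaryFile show a batteries-included Python approach, which is what makes it so useful for simple tasks like this so much of the time.
  • Out of interest, why did you use shutil.move instead of os.rename?
  • @Mark Longair: os.rename does not work across filesystems. For example, if /tmp is on tmpfs, it will fail.
  • As of 2015, this is probably the best answer. In fact, it is a good answer. Unfortunately, it is also quite wrong. Because NamedTemporaryFile() defaults to mode='w+b', an encoding must be explicitly specified when writing text strings. Likewise, all metadata of the original file (e.g., permissions) must be preserved across the move.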
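The concerns raised in these comments (text mode, preserving metadata, and renames that fail across filesystems) can be addressed together. The following is a minimal sketch under stated assumptions, not the answer author's code: rewrite_atomically and uncomment are hypothetical names, and the demo uses a throwaway file. Creating the temporary file in the same directory as the target keeps both on one filesystem, so os.replace can swap them in a single rename.

```python
import os
import shutil
import tempfile


def rewrite_atomically(path, transform):
    """Replace `path` with transform(line) applied to every line."""
    directory = os.path.dirname(os.path.abspath(path))
    # mode="w" avoids the str/bytes mismatch of the default "w+b" mode;
    # dir=directory keeps the temporary file on the target's filesystem.
    tmp = tempfile.NamedTemporaryFile(mode="w", dir=directory, delete=False)
    try:
        with tmp, open(path) as src:
            for line in src:
                tmp.write(transform(line))
        shutil.copystat(path, tmp.name)  # permissions and timestamps (not ownership)
        os.replace(tmp.name, path)       # single rename within one filesystem
    except BaseException:
        os.unlink(tmp.name)              # do not leave the temporary file behind
        raise


def uncomment(line):
    """Strip the leading '# ' from commented-out deb lines."""
    return line[2:] if line.startswith("# deb") else line


# Demo on a throwaway file.
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "sources.list")
with open(demo_path, "w") as f:
    f.write("# deb http://example.com/ubuntu maverick partner\n# plain comment\n")
os.chmod(demo_path, 0o640)

rewrite_atomically(demo_path, uncomment)
with open(demo_path) as f:
    result = f.read()
```

Note that this needs write permission on the containing directory, and that the rename is only atomic on filesystems and platforms where os.replace is.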
【Solution 5】:

If you are using Python 3, the following module will help you: https://github.com/mahmoudadel2/pysed

wget https://raw.githubusercontent.com/mahmoudadel2/pysed/master/pysed.py

Put the module file in your Python 3 module path, then:

import pysed
pysed.replace(<Old string>, <Replacement String>, <Text File>)
pysed.rmlinematch(<Unwanted string>, <Text File>)
pysed.rmlinenumber(<Unwanted Line Number>, <Text File>)

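【Comments】:

【Solution 6】:

If you really want to use the sed command without installing a new Python module, you can simply do the following:

import subprocess
# Pass the command as an argument list; a single string would need shell=True.
subprocess.call(["sed", "-i", "s/^# deb/deb/", "/etc/apt/sources.list"])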

【Comments】:

【Solution 7】:

Try pysed:

pysed -r '# deb' 'deb' /etc/apt/sources.list

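【Comments】:

【Solution 8】:

Not sure about elegant, but this should at least be readable. For sources.list it is fine to read all the lines up front; for something larger you may want to change it "in place" while looping.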

#!/usr/bin/env python
# Open file for reading and writing
with open("sources.list", "r+") as sources_file:
    # Read all the lines
    lines = sources_file.readlines()

    # Rewind and truncate
    sources_file.seek(0)
    sources_file.truncate()

    # Loop through the lines, adding them back to the file.
    for line in lines:
        if line.startswith("# deb"):
            sources_file.write(line[2:])
        else:
            sources_file.write(line)

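Edit: better file handling with the with statement. I had also forgotten to rewind before truncating.

【Comments】:

• I would just read the file, close it, reopen it in write mode, and then write the modified version. That way you don't have to worry about seeking and truncating.
• @Thomas, yes. It didn't feel Pythonic :-P I was thinking of doing it with a temporary file and then moving it into place as well, to be atomic(-ish).
• I don't know whether there is a Pythonic way to modify a file. The temporary-file idea has some merit, though.
【Solution 9】:

You can do it like this: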

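import re

# A raw string avoids the invalid "\#" escape; "#" needs no escaping in a regex.
p = re.compile(r"^# *deb", re.MULTILINE)
with open("sources.list", "r") as f:
    text = f.read()
with open("sources.list", "w") as f:
    f.write(p.sub("deb", text))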
        

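Alternatively (and IMHO this is better from an organizational point of view), you can split your sources.list into several parts (one entry per repository) and place them under /etc/apt/sources.list.d/

【Comments】:

【Solution 10】:

Cecil Curry has a great answer, but it does not work for multiline regular expressions, because it applies the pattern one line at a time. Multiline regular expressions are used less often, but they are sometimes handy.

Here is an improvement to his sed_inplace function that allows it to work with multiline regular expressions when asked to.

Warning: in multiline mode it reads the entire file and then performs the regex substitution, so you will only want to use this mode on smallish files; don't try to run it on gigabyte-sized files in multiline mode.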

import re, shutil, tempfile

def sed_inplace(filename, pattern, repl, multiline = False):
    '''
    Perform the pure-Python equivalent of in-place `sed` substitution: e.g.,
    `sed -i -e 's/'${pattern}'/'${repl}'/' "${filename}"`.
    '''
    re_flags = 0
    if multiline:
        re_flags = re.M

    # For efficiency, precompile the passed regular expression.
    pattern_compiled = re.compile(pattern, re_flags)

    # For portability, NamedTemporaryFile() defaults to mode "w+b" (i.e., binary
    # writing with updating). This is usually a good thing. In this case,
    # however, binary writing imposes non-trivial encoding constraints trivially
    # resolved by switching to text writing. Let's do that.
    with tempfile.NamedTemporaryFile(mode='w', delete=False) as tmp_file:
        with open(filename) as src_file:
            if multiline:
                content = src_file.read()
                tmp_file.write(pattern_compiled.sub(repl, content))
            else:
                for line in src_file:
                    tmp_file.write(pattern_compiled.sub(repl, line))

    # Overwrite the original file with the munged temporary file in a
    # manner preserving file attributes (e.g., permissions).
    shutil.copystat(filename, tmp_file.name)
    shutil.move(tmp_file.name, filename)

from os.path import expanduser
sed_inplace('%s/.gitconfig' % expanduser("~"), r'^(\[user\]$\n[ \t]*name = ).*$(\n[ \t]*email = ).*', r'\1John Doe\2jdoe@example.com', multiline=True)

【Comments】:

【Solution 11】:

If I want something like sed, I usually just call sed itself using the sh library.

from sh import sed

sed(['-i', 's/^# deb/deb/', '/etc/apt/sources.list'])
            

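Of course, there are downsides. For example, the locally installed version of sed might not be the same as the one you tested with. In my cases, this kind of thing can easily be handled at another layer (e.g., by checking the target environment beforehand, or by deploying in a docker image with a known version of sed).

【Comments】:

【Solution 12】:

Here is a single-module Python replacement for perl -p: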

# Provide compatibility with `perl -p`

# Usage:
#
#     python -mloop_over_stdin_lines '<program>'

# In `<program>`, use the variable `line` to read and change the current line.

# Example:
#
#         python -mloop_over_stdin_lines 'line = re.sub("pattern", "replacement", line)'

# From the perlrun documentation:
#
#        -p   causes Perl to assume the following loop around your
#             program, which makes it iterate over filename arguments
#             somewhat like sed:
#
#               LINE:
#                 while (<>) {
#                     ...             # your program goes here
#                 } continue {
#                     print or die "-p destination: $!\n";
#                 }
#
#             If a file named by an argument cannot be opened for some
#             reason, Perl warns you about it, and moves on to the next
#             file. Note that the lines are printed automatically. An
#             error occurring during printing is treated as fatal. To
#             suppress printing use the -n switch. A -p overrides a -n
#             switch.
#
#             "BEGIN" and "END" blocks may be used to capture control
#             before or after the implicit loop, just as in awk.
#

import re
import sys

for line in sys.stdin:
    # Run the user's program; at module level it can rebind `line`.
    exec(sys.argv[1], globals(), locals())
    try:
        # `line` keeps its trailing newline, so do not add another one.
        print(line, end="")
    except OSError as err:
        sys.exit('-p destination: %s' % err)
              

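【Comments】:

【Solution 13】:

I wanted to be able to find and replace text while also including matched groups in the content I insert. I wrote this short script to do that:

https://gist.github.com/turtlemonvh/0743a1c63d1d27df3f17

The key part of it looks like this:

print(re.sub(pattern, template, text).rstrip("\n"))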
                

Here is an example of how it works:

# Find everything that looks like 'dog' or 'cat' followed by a space and a number
pattern = r"((cat|dog) (\d+))"

# Replace with 'turtle' and the number. '3' because the number is the 3rd matched group.
# The double '\' is needed because you need to escape '\' when running this in a python shell
template = "turtle \\3"

# The text to operate on
text = "cat 976 is my favorite"

Calling the function above with these values yields:

turtle 976 is my favorite
                

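【Comments】:

【Solution 14】:

[None of the answers above worked for my case!]

I had a case of multiple key-value replacements in one file of about 1000 lines. The file structure should remain the same after the replacement. For example:

key1=value_tobe_replaced1
key2=value_tobe_replaced1
.     .
.     .
key1000=value_tobe_replaced1000

I tried:

1. The answer by @elmotec recommending massedit.

2. The answer by @Cecil Curry.

3. The answer by @Keithel.

All three answers really helped me a lot, but after testing I found that the 1st and 2nd took nearly 40-50 seconds. The 3rd is not suited to multiple replacements, so I fixed it.

Note: please refer to those answers before continuing.

Here is my code:

Line replacement mode: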

import datetime, re, shutil, tempfile

# abs_keypair_file (path to the file) and tuple_list (a list of (key, value)
# pairs) are assumed to be defined earlier.
start_time = datetime.datetime.now()
with tempfile.NamedTemporaryFile(mode='w', delete=False) as tmp_file:
    with open(abs_keypair_file) as kf:
        for line in kf:
            line_to_write = ''
            match_flag = False
            for (key, value) in tuple_list:
                # The original snippet searched with an undefined name; this is
                # assumed to be the same $(key) pattern it passed to re.sub.
                pattern = r'\$\({}\)'.format(key)
                if not re.search(pattern, line, flags=re.I):
                    continue
                line_to_write = re.sub(pattern, value, line, flags=re.I)
                match_flag = True

            if not match_flag:
                line_to_write = line
            tmp_file.write(line_to_write)

shutil.copystat(abs_keypair_file, tmp_file.name)
shutil.move(tmp_file.name, abs_keypair_file)

time_costs = datetime.datetime.now() - start_time
print('time costs: %s' % time_costs)

time costs: 0:00:42.533879
                  

File replacement mode:

import datetime, re, shutil, tempfile

start_time = datetime.datetime.now()
with tempfile.NamedTemporaryFile(mode='w', delete=False) as tmp_file:
    with open(abs_keypair_file) as kf:
        text = kf.read()
        for (key, value) in tuple_list:
            # The original used an undefined name here; this is assumed to be the
            # $(key) pattern that the line replacement snippet passes to re.sub.
            pattern = r'\$\({}\)'.format(key)
            text = re.sub(pattern, value, text, flags=re.M|re.I)
        tmp_file.write(text)
shutil.copystat(abs_keypair_file, tmp_file.name)
shutil.move(tmp_file.name, abs_keypair_file)

time_costs = datetime.datetime.now() - start_time
print('time costs: %s' % time_costs)

time costs: 0:00:00.348458
                  

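So if your situation matches mine and your file is not too large, I suggest going with file replacement mode.

How do you do the replacement if the file is very large? I don't know.

Hope this helps.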
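For the large-file case left open above, one possible direction is to stay line by line but use a single generic pattern with a dictionary lookup, instead of running one regex per key on every line. This is only a sketch under assumptions: it supposes the placeholders have the form $(key), as the re.sub pattern in this answer suggests, and the names PLACEHOLDER, substitute_line and substitute_stream are hypothetical. The demo uses in-memory streams; real code would pass the source file and a temporary file instead.

```python
import io
import re

# Matches any $(name) placeholder and captures the name.
PLACEHOLDER = re.compile(r"\$\(([^)]+)\)")


def substitute_line(line, values):
    """Replace every $(key) in `line` whose key is in `values` (case-insensitive)."""
    def lookup(match):
        # Unknown keys are left untouched rather than removed.
        return values.get(match.group(1).lower(), match.group(0))
    return PLACEHOLDER.sub(lookup, line)


def substitute_stream(src, dst, pairs):
    """Copy lines from `src` to `dst`, substituting placeholders one line at a time."""
    values = {key.lower(): value for key, value in pairs}
    for line in src:
        dst.write(substitute_line(line, values))


# Demo with in-memory streams standing in for the real files.
tuple_list = [("HOST", "example.org"), ("port", "8080")]
src = io.StringIO("url=http://$(host):$(PORT)/\nother=$(unknown)\n")
dst = io.StringIO()
substitute_stream(src, dst, tuple_list)
output = dst.getvalue()
```

Each line is scanned once regardless of how many keys there are, and only one line is held in memory at a time, so the cost should grow with the file size rather than with file size times key count.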

【Comments】:
