【Question title】: Regex for detecting company names in Python
【Posted】: 2021-07-04 20:59:54
【Question description】:

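I want to detect company names with a regular expression in Python.

Here is my idea:

  1. The company name should contain 1 to 3 words
  2. The first word of the company name should be capitalized
  3. One word of the company name may contain .com or .co (Amazon.com Inc)
  4. The last word of the company name (the fourth word) should be Inc., Ltd, GmbH, AG, Group, Holding, etc.
  5. Between the last word of the name and Inc., Ltd, GmbH, AG there can sometimes be ',' or ', '

I tried something like this, but it does not work: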

import re

address_1 = 'I work in Amazon.com Inc.'
address_2 = 'Company named Swiss Medic Holding invested in vaccine'
address_3 = 'what do you think about Abercrombie & Fitch Co. ?'
address_4 = 'do you work in Delta Group?'
address_5 = 'I have worked in CocaCola Gmbh'

regex_company = r'([A-Z][\w]+[ -]+){1,3}(Ltd|ltd|LTD|llc|LLC|Inc|inc|INC|plc|Corp|Group)'
# Try the pattern on every sample string
for address in (address_1, address_2, address_3, address_4, address_5):
    found = re.search(regex_company, address)
    print(found)

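I want to print the detected companies. I used the same regex logic to find street addresses and it works well, but it does not work for company names. This is the regex I used:

regex_street = r"(\d{0,6})(?:\w)\s([A-Z][\w]+[ -]+){1,3}(Street|St|Road|Rd)"

Regex logic: number + 1-3 words + street/st/road/rd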
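
As a quick check of what "does not work" means here, the pattern from the question can be run over the five sample strings. This is only a diagnostic sketch; the names `samples` and `results` are introduced for illustration and are not part of the original post.

```python
import re

# The pattern from the question, unchanged apart from the raw-string prefix.
regex_company = r'([A-Z][\w]+[ -]+){1,3}(Ltd|ltd|LTD|llc|LLC|Inc|inc|INC|plc|Corp|Group)'

samples = [
    'I work in Amazon.com Inc.',
    'Company named Swiss Medic Holding invested in vaccine',
    'what do you think about Abercrombie & Fitch Co. ?',
    'do you work in Delta Group?',
    'I have worked in CocaCola Gmbh',
]

results = []
for text in samples:
    found = re.search(regex_company, text)
    # group(0) is the whole match; None means the pattern found nothing
    results.append(found.group(0) if found else None)

print(results)
# [None, None, None, 'Delta Group', None]
```

Only 'Delta Group' is found. The other samples fail for different reasons: `\w` does not match the '.' in 'Amazon.com', the separator `[ -]+` does not allow '&', and 'Holding', 'Co' and 'Gmbh' are missing from the list of suffixes.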

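【Comments】:

  • The range of possibilities here may be too broad for a regex. Is there another way to obtain the company names?
  • "I want to": what problem are you facing? What does "does not work" mean?
  • Condition 3 makes this hard, because regex is not made for "exactly one". Would a pattern that lets every word end in .com/.co be acceptable?
  • @peer Well, the answer is yes and no. I do not have such a case; my names have 2 or 3 words and only one of them ends with .com

Tags: python regex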


【Solution 1】:

Use

\b[A-Z]\w+(?:\.com?)?(?:[ -]+(?:&[ -]+)?[A-Z]\w+(?:\.com?)?){0,2}[,\s]+(?i:ltd|llc|inc|plc|co(?:rp)?|group|holding|gmbh)\b

regex proof

Explanation

--------------------------------------------------------------------------------
  \b                       the boundary between a word char (\w) and
                           something that is not a word char
--------------------------------------------------------------------------------
  [A-Z]                    any character of: 'A' to 'Z'
--------------------------------------------------------------------------------
  \w+                      word characters (a-z, A-Z, 0-9, _) (1 or
                           more times (matching the most amount
                           possible))
--------------------------------------------------------------------------------
  (?:                      group, but do not capture (optional
                           (matching the most amount possible)):
--------------------------------------------------------------------------------
    \.                       '.'
--------------------------------------------------------------------------------
    co                       'co'
--------------------------------------------------------------------------------
    m?                       'm' (optional (matching the most amount
                             possible))
--------------------------------------------------------------------------------
  )?                       end of grouping
--------------------------------------------------------------------------------
  (?:                      group, but do not capture (between 0 and 2
                           times (matching the most amount
                           possible)):
--------------------------------------------------------------------------------
    [ -]+                    any character of: ' ', '-' (1 or more
                             times (matching the most amount
                             possible))
--------------------------------------------------------------------------------
    (?:                      group, but do not capture (optional
                             (matching the most amount possible)):
--------------------------------------------------------------------------------
      &                        '&'
--------------------------------------------------------------------------------
      [ -]+                    any character of: ' ', '-' (1 or more
                               times (matching the most amount
                               possible))
--------------------------------------------------------------------------------
    )?                       end of grouping
--------------------------------------------------------------------------------
    [A-Z]                    any character of: 'A' to 'Z'
--------------------------------------------------------------------------------
    \w+                      word characters (a-z, A-Z, 0-9, _) (1 or
                             more times (matching the most amount
                             possible))
--------------------------------------------------------------------------------
    (?:                      group, but do not capture (optional
                             (matching the most amount possible)):
--------------------------------------------------------------------------------
      \.                       '.'
--------------------------------------------------------------------------------
      co                       'co'
--------------------------------------------------------------------------------
      m?                       'm' (optional (matching the most
                               amount possible))
--------------------------------------------------------------------------------
    )?                       end of grouping
--------------------------------------------------------------------------------
  ){0,2}                   end of grouping
--------------------------------------------------------------------------------
  [,\s]+                   any character of: ',', whitespace (\n, \r,
                           \t, \f, and " ") (1 or more times
                           (matching the most amount possible))
--------------------------------------------------------------------------------
  (?i:                     group, but do not capture (case-
                           insensitive) (with ^ and $ matching
                           normally) (with . not matching \n)
                           (matching whitespace and # normally):
--------------------------------------------------------------------------------
    ltd                      'ltd'
--------------------------------------------------------------------------------
   |                        OR
--------------------------------------------------------------------------------
    llc                      'llc'
--------------------------------------------------------------------------------
   |                        OR
--------------------------------------------------------------------------------
    inc                      'inc'
--------------------------------------------------------------------------------
   |                        OR
--------------------------------------------------------------------------------
    plc                      'plc'
--------------------------------------------------------------------------------
   |                        OR
--------------------------------------------------------------------------------
    co                       'co'
--------------------------------------------------------------------------------
    (?:                      group, but do not capture (optional
                             (matching the most amount possible)):
--------------------------------------------------------------------------------
      rp                       'rp'
--------------------------------------------------------------------------------
    )?                       end of grouping
--------------------------------------------------------------------------------
   |                        OR
--------------------------------------------------------------------------------
    group                    'group'
--------------------------------------------------------------------------------
   |                        OR
--------------------------------------------------------------------------------
    holding                  'holding'
--------------------------------------------------------------------------------
   |                        OR
--------------------------------------------------------------------------------
    gmbh                     'gmbh'
--------------------------------------------------------------------------------
  )                        end of grouping
--------------------------------------------------------------------------------
  \b                       the boundary between a word char (\w) and
                           something that is not a word char
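
The same breakdown can be kept next to the pattern itself by compiling it with `re.VERBOSE`, which ignores whitespace outside character classes and allows `#` comments. The sketch below assumes Python 3.6 or later (needed for the scoped `(?i:...)` flag); the name `COMPANY_RE` is introduced here for illustration.

```python
import re

# Same pattern as above, laid out with comments via re.VERBOSE.
COMPANY_RE = re.compile(r"""
    \b[A-Z]\w+(?:\.com?)?        # first word: capitalized, optional .com or .co
    (?:                          # up to two further words
        [ -]+(?:&[ -]+)?         #   space or hyphen, optionally followed by '&'
        [A-Z]\w+(?:\.com?)?      #   another capitalized word
    ){0,2}
    [,\s]+                       # comma and/or whitespace before the suffix
    (?i:ltd|llc|inc|plc|co(?:rp)?|group|holding|gmbh)   # suffix, case-insensitive
    \b
""", re.VERBOSE)

print(COMPANY_RE.findall("what do you think about Abercrombie & Fitch Co. ?"))
# ['Abercrombie & Fitch Co']
```

The space inside `[ -]` is kept because verbose mode does not strip whitespace inside a character class.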

Python code

import re

regex = r"\b[A-Z]\w+(?:\.com?)?(?:[ -]+(?:&[ -]+)?[A-Z]\w+(?:\.com?)?){0,2}[,\s]+(?i:ltd|llc|inc|plc|co(?:rp)?|group|holding|gmbh)\b"

test_str = ("I work in Amazon.com Inc.\n"
    "Company named Swiss Medic Holding invested in vaccine\n"
    "what do you think about Abercrombie & Fitch Co. ?\n"
    "do you work in Delta Group?\n"
    "I have worked in CocaCola Gmbh")

print(re.findall(regex, test_str))

Results: ['Amazon.com Inc', 'Swiss Medic Holding', 'Abercrombie & Fitch Co', 'Delta Group', 'CocaCola Gmbh']
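
To reuse the pattern above on arbitrary text, it can be wrapped in a small helper. This is a minimal sketch; `find_companies` and `COMPANY_RE` are hypothetical names, and the sample sentence is made up to exercise the comma case from rule 5 of the question.

```python
import re

COMPANY_RE = re.compile(
    r"\b[A-Z]\w+(?:\.com?)?(?:[ -]+(?:&[ -]+)?[A-Z]\w+(?:\.com?)?){0,2}"
    r"[,\s]+(?i:ltd|llc|inc|plc|co(?:rp)?|group|holding|gmbh)\b"
)

def find_companies(text):
    """Return every company-like substring found in text, in order."""
    # finditer yields match objects, so the position (m.start(), m.end())
    # is also available if it is needed later.
    return [m.group(0) for m in COMPANY_RE.finditer(text)]

names = find_companies("She joined Acme, Inc. after leaving Delta Group")
print(names)
# ['Acme, Inc', 'Delta Group']
```

Two limitations are worth keeping in mind. The trailing period of 'Inc.' is not part of the match, because the pattern ends at the word boundary. A capitalized word directly in front of the name is absorbed as well, so "Does Delta Group hire?" yields 'Does Delta Group'.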

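【Comments】:

  • It does not seem to work correctly, because it returns non-capitalized words (i.e. work, named, about...).
  • @JoanLaraGanau No, it works fine. I provided the code and the results.
  • It works now because you edited it. Anyway, nice work, this is a very tricky regex. Edit: maybe you only added the r"...
  • But the company name must be followed by ltd, group, plc, holding, etc.
  • @taga That is exactly what my solution does. It finds them at the end.
【Solution 2】:

Use https://regex101.com to test regular expressions; it is great. For your specific examples, this is the regex you want. I do not think testing for the optional .com is needed in this example.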

import re

# address_1 ... address_5 are the sample strings defined in the question
regex_company = r'[A-Z]([^ ]*[ &]*){0,2}(Inc\.|Ltd|GmbH|AG|Gmbh|Group|Holding|Co\.)'

for address in [address_1, address_2, address_3, address_4, address_5]:
    found = re.search(regex_company, address)
    if found:
        print(found)

# prints:
# <re.Match object; span=(10, 25), match='Amazon.com Inc.'>
# <re.Match object; span=(14, 33), match='Swiss Medic Holding'>
# <re.Match object; span=(24, 47), match='Abercrombie & Fitch Co.'>
# <re.Match object; span=(15, 26), match='Delta Group'>
# <re.Match object; span=(17, 30), match='CocaCola Gmbh'>

【Comments】:

  • Please look at the kind of regex I use for street detection; could you replicate something similar for company names?
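
One possible pattern in the same shape as the street regex (one to three capitalized words followed by a fixed keyword list) is sketched below. It is not taken from either answer; the name `regex_company_like_street` and the exact suffix list are assumptions made for illustration.

```python
import re

# 1-3 capitalized words (dots allowed inside a word, optional trailing comma,
# optional '& ' after the separator), then one keyword from a fixed list.
regex_company_like_street = (
    r"(?:\b[A-Z][\w.]*\w,?[ -]+(?:& )?){1,3}"
    r"(?:Inc|Ltd|LLC|plc|Corp|Co|GmbH|Gmbh|AG|Group|Holding)\b"
)

samples = [
    'I work in Amazon.com Inc.',
    'Company named Swiss Medic Holding invested in vaccine',
    'what do you think about Abercrombie & Fitch Co. ?',
    'do you work in Delta Group?',
    'I have worked in CocaCola Gmbh',
]

matches = []
for text in samples:
    found = re.search(regex_company_like_street, text)
    matches.append(found.group(0) if found else None)

print(matches)
# ['Amazon.com Inc', 'Swiss Medic Holding', 'Abercrombie & Fitch Co', 'Delta Group', 'CocaCola Gmbh']
```

The word groups are non-capturing, so `re.findall` would also return whole names rather than tuples of groups.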