Python - Parsing company name from Domain and Page Title: Answers

【Question title】: Python - Parsing company name from Domain and Page Title
【Posted】: 2017-06-15 10:54:45
【Question】:

I have been trying to parse a company name out of the domain and the page title found in the HTML. Suppose my domain is:

http://thisismycompany.com

and the page title is:

This is an example page title | My Company

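My assumption is that if I lowercase both strings, remove everything except alphanumeric characters, and then take their longest common substring, the result is most likely the company name.

So the longest common substring (Link to python 3 code) returns mycompany. How would I match this substring back to the original page title, so that I can recover the correct positions of the spaces and uppercase characters?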
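
For reference, the normalization and longest-common-substring step described above could be sketched roughly as follows. This is not the linked code: it uses `difflib.SequenceMatcher` from the standard library as a stand-in, takes only the hostname of the URL (an assumption, so that `http` does not take part in the match), and the helper names `normalize` and `longest_common_substring` are made up for illustration.

```python
import re
from difflib import SequenceMatcher
from urllib.parse import urlparse


def normalize(text):
    # Lowercase and drop everything except ASCII letters and digits
    return re.sub(r'[^0-9a-zA-Z]+', '', text).lower()


def longest_common_substring(a, b):
    # autojunk=False disables difflib's popularity heuristic for long inputs
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    m = matcher.find_longest_match(0, len(a), 0, len(b))
    return a[m.a:m.a + m.size]


domain = 'http://thisismycompany.com'
page_title = 'This is an example page title | My Company'

host = urlparse(domain).hostname  # 'thisismycompany.com'
print(longest_common_substring(normalize(host), normalize(page_title)))
```

With the example domain and title this prints mycompany, which is the normalized string that still has to be mapped back onto the title.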

【Question comments】:

Tags: python parsing

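【Solution 1】:

I considered whether a regular expression would work, but I think plain string operations and comparisons are easier here, especially since this does not seem to be a time-sensitive task.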

def find_name(normalized_name, full_name_container):
  n = 0
  full_name = ''
  for ch in full_name_container:
    if n == len(normalized_name):
      return full_name

    # If the characters at the current position in both
    # strings match, add the proper case to the final string
    # and move onto the next character
    if normalized_name[n].upper() == ch.upper():
      full_name += ch
      n += 1

    # If the name is interrupted by a separator, add that to the result
    # (only once a match has started, so the result has no leading separator)
    elif ch in ['-', '_', '.', ' ']:
      if full_name:
        full_name += ch

    # If a character is encountered that is definitely not part of the name,
    # restart the search, letting this character begin a new match
    else:
      n = 0
      full_name = ''
      if normalized_name[:1].upper() == ch.upper():
        full_name = ch
        n = 1

  # Only report a result if the whole normalized name was matched
  return full_name if n == len(normalized_name) else ''

print(find_name('mycompany', 'Some stuff My Company Some Stuff'))

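This should print "My Company". The hard-coded list of characters that may interrupt the normalized name (such as spaces and commas) is probably something you will have to improve.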
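
For comparison, the regular-expression route mentioned at the top of this answer might look something like the sketch below. It assumes that any run of non-alphanumeric characters may sit between two characters of the name, which replaces the hard-coded separator list; the function name `find_name_regex` is hypothetical.

```python
import re


def find_name_regex(normalized_name, full_name_container):
    if not normalized_name:
        return None
    # Allow any run of non-alphanumeric characters between two characters
    pattern = '[^0-9a-zA-Z]*'.join(re.escape(ch) for ch in normalized_name)
    found = re.search(pattern, full_name_container, re.IGNORECASE)
    return found.group(0) if found else None


print(find_name_regex('mycompany', 'Some stuff My Company Some Stuff'))
```

This also prints "My Company". Because the pattern starts and ends on a character of the name, the result never carries leading or trailing separators.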

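【Comments】:

Awesome, thanks. This approach is actually the implementation I first had in mind, but I could not get it to work. In the meantime I also found a different implementation. I will add it as an answer as well, so that you and others can take a look.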

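【Solution 2】:

I solved it by generating a list of all possible substrings of the title and then comparing each one, after normalization, against the match I got from the longest-common-substring function.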

import re

def get_all_substrings(input_string):
    length = len(input_string)
    return set([input_string[i:j+1] for i in range(length) for j in range(i,length)])

longest_substring_match = 'mycompany'
page_title = 'This is an example page title | My Company'

# Sets are unordered, so try the shortest substrings first; otherwise the
# first hit could carry extra separators, e.g. '| My Company'
match = None
for substring in sorted(get_all_substrings(page_title), key=len):
    if re.sub('[^0-9a-zA-Z]+', '', substring).lower() == longest_substring_match.lower():
        match = substring
        break

print(match)

Edit: source used

【Comments】:

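I think this might be the better solution. It probably works for more cases than mine. However, mine is probably more efficient on simpler examples.
I agree. Another improvement might be to combine the two loops and break as soon as a match is found. That way it only has to generate some of the substrings rather than all of them (unless, of course, the last one is the match).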
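
A minimal sketch of that combined-loop idea, assuming the same normalization as in the answer (ASCII letters and digits only, lowercased). The names `normalize` and `find_original_substring` are hypothetical, and the early `break` on a failed prefix is an extra shortcut that the comment above does not mention.

```python
import re


def normalize(text):
    # Same cleaning rule as in the answer: keep ASCII letters and digits
    return re.sub('[^0-9a-zA-Z]+', '', text).lower()


def find_original_substring(normalized_name, page_title):
    target = normalized_name.lower()
    for start in range(len(page_title)):
        # A match cannot begin on a character that normalization would drop
        if not normalize(page_title[start]):
            continue
        for end in range(start, len(page_title)):
            candidate = page_title[start:end + 1]
            cleaned = normalize(candidate)
            if cleaned == target:
                return candidate  # stop as soon as a match is found
            if not target.startswith(cleaned):
                break  # this start position can no longer lead to a match
    return None


print(find_original_substring('mycompany', 'This is an example page title | My Company'))
```

For the example title this prints "My Company" without building the full set of substrings first.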