【Question title】: Getting Variables inside Javascript Function using BeautifulSoup, Python, Regex
【Posted】: 2020-02-10 01:10:01
【Question】:

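An array images, defined inside a Javascript function, needs to be extracted from the page source as a string and converted to a Python list object.

Python's BeautifulSoup is being used to do the parsing.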

        var images = [
            {   
                src: "http://example.com/bar/001.jpg",  
                title: "FooBar One" 
            },  
            {   
                src: "http://example.com/bar/002.jpg",  
                title: "FooBar Two" 
            },  
        ]
        ;

Question: Why is my code below unable to capture this images array, and how can we fix it?

Thanks!

Desired output: a Python list object.

[
    {   
        src: "http://example.com/bar/001.jpg",  
        title: "FooBar One" 
    },  
    {   
        src: "http://example.com/bar/002.jpg",  
        title: "FooBar Two" 
    },  
]

Actual code

import re
from bs4 import BeautifulSoup

# Example of a HTML source code containing `images` array
html = '''
<html>
<head>
<script type="text/javascript">

    $(document).ready(function(){
        var images = [
            {   
                src: "http://example.com/bar/001.jpg",  
                title: "FooBar One" 
            },  
            {   
                src: "http://example.com/bar/002.jpg",  
                title: "FooBar Two" 
            },  
        ]
        ;
        var other_data = [{"name": "Tom", "type": "cat"}, {"name": "Jerry", "type": "dog"}];

</script>
<body>
<p>Some content</p>
</body>
</head>
</html>
'''

pattern = re.compile('var images = (.*?);')
soup = BeautifulSoup(html, 'lxml')
scripts = soup.find_all('script')  # successfully captures the <script> element
for script in scripts:
    data = pattern.match(str(script.string))  # NOT extracting the array!!
    if data:
        print('Found:', data.groups()[0])     # NOT being printed

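【Comments】:

  • Are you expecting something like { src: "http://example.com/bar/001.jpg", title: "FooBar One" }?
  • @JackFleeting Sorry, I did not mention the desired output. Updated the question. I am looking for the entire array, including the [], so that it can be converted to a Python list object.

Tags: python regex python-3.x beautifulsoup lxml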


【Solution 1】:

You can use a shorter lazy regex and the hjson library to handle the unquoted keys

import re, hjson

html = '''
<html>
<head>
<script type="text/javascript">

    $(document).ready(function(){
        var images = [
            {   
                src: "http://example.com/bar/001.jpg",  
                title: "FooBar One" 
            },  
            {   
                src: "http://example.com/bar/002.jpg",  
                title: "FooBar Two" 
            },  
        ]
        ;
        var other_data = [{"name": "Tom", "type": "cat"}, {"name": "Jerry", "type": "dog"}];

</script>
'''
p = re.compile(r'var images = (.*?);', re.DOTALL)
data = hjson.loads(p.findall(html)[0])
print(data)

【Comments】:

    【Solution 2】:

    re.match matches from the beginning of the string, so your regex has to account for the whole string. Use

    pattern = re.compile('.*var images = (.*?);.*', re.DOTALL)
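
The difference between anchored and unanchored matching can be checked in isolation. The following is a minimal sketch, assuming a shortened stand-in for the script text (`script_text` is hypothetical, not the asker's full page). It shows that `re.search` with `re.DOTALL` is an alternative to padding the pattern with `.*` on both sides.

```python
import re

# Hypothetical stand-in for the text of the <script> element.
script_text = '''
    $(document).ready(function(){
        var images = [
            { src: "http://example.com/bar/001.jpg", title: "FooBar One" },
        ]
        ;
'''

# Without DOTALL, "." cannot cross newlines, so the group cannot span the array.
plain = re.compile(r'var images = (.*?);')
# With DOTALL, "." also matches newline characters.
dotall = re.compile(r'var images = (.*?);', re.DOTALL)

match_result = dotall.match(script_text)    # only tries position 0
search_result = dotall.search(script_text)  # scans the whole string
plain_result = plain.search(script_text)    # scans, but "." stops at newlines

print(match_result)                     # no match object
print(plain_result)                     # no match object
print(search_result.group(1).strip())   # the bracketed array text
```

Either form should work on the question's input. `search` simply avoids having to describe the text before and after the array.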
    

    The captured string is still not in a valid Python list format. You have to do some manipulation before you can apply ast.literal_eval

    import ast  # needed for ast.literal_eval below

    for script in scripts:
        data = pattern.match(str(script.string))
        if data:
            list_str = data.groups()[0]
            # Remove the last (trailing) comma
            last_comma_index = list_str.rfind(',')
            list_str = list_str[:last_comma_index] + list_str[last_comma_index+1:]
            # Quote the bare keys: src -> "src", title -> "title"
            list_str = re.sub(r'\s([a-z]+):', r'"\1":', list_str)
            # Strip surrounding whitespace, then evaluate as a Python literal
            final_list = ast.literal_eval(list_str.strip())
            print(final_list)
    

    Output

    [{'src': 'http://example.com/bar/001.jpg', 'title': 'FooBar One'}, {'src': 'http://example.com/bar/002.jpg', 'title': 'FooBar Two'}]
    

    【Comments】:

    • Great, your solution matches the string representation of the images array! However, when I try ast.literal_eval(data.groups()[0]) to convert it to a Python list object, I get the error SyntaxError: unexpected EOF while parsing
    • json.loads(data.groups()[0]) gives the error json.decoder.JSONDecodeError: Expecting property name enclosed in double quotes
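
The JSONDecodeError in the last comment is what `json.loads` reports when object keys are not double-quoted, and JSON also rejects trailing commas. The following is a rough stdlib-only sketch of normalizing the captured text before parsing. `js_literal_to_json` is a hypothetical helper, and it assumes that string values never contain text that looks like `, key:` or a comma directly before a closing bracket.

```python
import json
import re

# The captured array text, laid out as in the question.
js_array = '''[
    {
        src: "http://example.com/bar/001.jpg",
        title: "FooBar One"
    },
    {
        src: "http://example.com/bar/002.jpg",
        title: "FooBar Two"
    },
]'''

def js_literal_to_json(text):
    """Hypothetical helper: quote bare keys and drop trailing commas."""
    # Quote identifiers that directly follow "{" or "," and precede ":".
    text = re.sub(r'([{,]\s*)([A-Za-z_]\w*)\s*:', r'\1"\2":', text)
    # Remove a comma that is followed only by whitespace and "]" or "}".
    text = re.sub(r',\s*([\]}])', r'\1', text)
    return text

images = json.loads(js_literal_to_json(js_array))
print(images)
```

For input that is less regular than this, a tolerant parser such as the hjson library mentioned in Solution 1 is probably the safer choice.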
    【Solution 3】:

    Method 1

    Maybe,

     \bvar\s+images\s*=\s*(\[[^\]]*\])
    

    might work to some extent:

    Test

    import re
    from bs4 import BeautifulSoup
    
    # Example of a HTML source code containing `images` array
    html = '''
    <html>
    <head>
    <script type="text/javascript">
    
        $(document).ready(function(){
            var images = [
                {   
                    src: "http://example.com/bar/001.jpg",  
                    title: "FooBar One" 
                },  
                {   
                    src: "http://example.com/bar/002.jpg",  
                    title: "FooBar Two" 
                },  
            ]
            ;
            var other_data = [{"name": "Tom", "type": "cat"}, {"name": "Jerry", "type": "dog"}];
    
    </script>
    <body>
    <p>Some content</p>
    </body>
    </head>
    </html>
    '''
    
    soup = BeautifulSoup(html, 'html.parser')
    scripts = soup.find_all('script')  # successfully captures the <script> element
    
    for script in scripts:
        data = re.findall(
            r'\bvar\s+images\s*=\s*(\[[^\]]*\])', script.string, re.DOTALL)
        print(data[0])
    

    Output

    [ {
    src: "http://example.com/bar/001.jpg",
    title: "FooBar One" },
    {
    src: "http://example.com/bar/002.jpg",
    title: "FooBar Two" },
    ]


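    If you wish to simplify/modify/explore the expression, it is explained in the top right panel of regex101.com. If you'd like, you can also watch in this link how it would match against some sample inputs.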
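
One limitation of `\[[^\]]*\]` is that it stops at the first `]`, so it would cut the capture short if the array held a nested array or a `]` inside a string. The following is a sketch of a bracket-counting alternative. `extract_array` and the `sample` string are hypothetical, and the sketch assumes only double-quoted strings appear in the script.

```python
import re

def extract_array(source, marker="var images ="):
    """Hypothetical helper: return the balanced [...] text after marker."""
    start = source.find("[", source.index(marker))
    if start == -1:
        raise ValueError("no opening bracket after marker")
    depth = 0
    in_string = False
    i = start
    while i < len(source):
        ch = source[i]
        if in_string:
            if ch == "\\":
                i += 1              # skip the escaped character
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch == "[":
            depth += 1
        elif ch == "]":
            depth -= 1
            if depth == 0:
                return source[start:i + 1]
        i += 1
    raise ValueError("unbalanced brackets after marker")

# Made-up input with a "]" inside a string and a nested array.
sample = 'var images = [ { src: "a]b.jpg", tags: ["x", "y"] }, ];'

regex_result = re.search(r'\[[^\]]*\]', sample).group(0)
result = extract_array(sample)
print(regex_result)  # stops at the "]" inside the string
print(result)        # the whole array
```

For the question's input, which has no nested brackets, both approaches return the same text.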


    Method 2

    Another option would be:

    import re
    
    string = '''
    <html>
    <head>
    <script type="text/javascript">
    
        $(document).ready(function(){
            var images = [
                {   
                    src: "http://example.com/bar/001.jpg",  
                    title: "FooBar One" 
                },  
                {   
                    src: "http://example.com/bar/002.jpg",  
                    title: "FooBar Two" 
                },  
            ]
            ;
            var other_data = [{"name": "Tom", "type": "cat"}, {"name": "Jerry", "type": "dog"}];
    
    </script>
    <body>
    <p>Some content</p>
    </body>
    </head>
    </html>
    '''
    
    expression = r'src:\s*"([^"]*)"\s*,\s*title:\s*"([^"]*)"'
    
    matches = re.findall(expression, string, re.DOTALL)
    
    output = []
    for match in matches:
        output.append(dict({"src": match[0], "title": match[1]}))
    
    print(output)
    

    Output

    [{'src': 'http://example.com/bar/001.jpg', 'title': 'FooBar One'}, {'src': 'http://example.com/bar/002.jpg', 'title': 'FooBar Two'}]
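
A small variation on the same idea uses named groups, so each match can be turned into a dict directly. This is only a sketch. The group names and the shortened `sample` text are this example's choices, and like the original expression it assumes that `src` always comes immediately before `title`.

```python
import re

# Same two-field pattern as above, with named groups.
expression = re.compile(
    r'src:\s*"(?P<src>[^"]*)"\s*,\s*title:\s*"(?P<title>[^"]*)"')

# Shortened stand-in for the page source.
sample = '''
    { src: "http://example.com/bar/001.jpg", title: "FooBar One" },
    { src: "http://example.com/bar/002.jpg", title: "FooBar Two" },
'''

# groupdict() maps each group name to the text it captured.
output = [m.groupdict() for m in expression.finditer(sample)]
print(output)
```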
    

    Demo

    【Comments】:

      【Solution 4】:

      Here's one way to get there without regex or even beautifulsoup, using just plain Python string manipulation in 4 easy steps :)

      step_1 = html.split('var images = [')
      step_2 = " ".join(step_1[1].split())
      step_3 = step_2.split('] ; var other_data = ')
      step_4= step_3[0].replace('}, {','}xxx{').split('xxx')
      print(step_4)
      

      Output:

      ['{ src: "http://example.com/bar/001.jpg", title: "FooBar One" }',
       '{ src: "http://example.com/bar/002.jpg", title: "FooBar Two" }, ']
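
The result above is still a list of strings rather than dicts. In the same no-regex spirit, the fragments could be converted with a few more string operations. This is a sketch only: `fragment_to_dict` is a hypothetical helper, and it assumes every value is double-quoted and contains neither `", ` nor a `: ` that would confuse the splitting.

```python
# The fragments printed above (the value of step_4).
step_4 = [
    '{ src: "http://example.com/bar/001.jpg", title: "FooBar One" }',
    '{ src: "http://example.com/bar/002.jpg", title: "FooBar Two" }, ',
]

def fragment_to_dict(fragment):
    """Hypothetical helper: turn one '{ key: "value", ... }' string into a dict."""
    # Drop surrounding whitespace, a trailing comma, and the braces.
    body = fragment.strip().rstrip(",").strip().strip("{}").strip()
    result = {}
    # Pairs are separated by the closing quote of a value plus ", ".
    for item in body.split('", '):
        key, _, value = item.partition(": ")
        result[key.strip()] = value.strip().strip('"')
    return result

images = [fragment_to_dict(fragment) for fragment in step_4]
print(images)
```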
      

      【Comments】:
