【Question title】: Parsing information from content for database
【Posted】: 2014-03-14 17:57:23
【Question】:

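I have a database table full of articles. In some cases, the bottom of an article contains a block that I would like to parse to pull information out of. For example, here are two possible values from the articles table: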

<p>Test test <blockquote class="pull">text quote</blockquote></p>

<p>&nbsp;</p>

<p><span class="italic">italic text</span></p>

<div class="bottom-block"><div class="picture" style="background-image:url('/generator?f=somepicture.jpg');"></div><div class="blurb">Blurb about person<a href="http://website.com">http://website.com</a></div></div>

And another example:

<p>Some content</p>
<div class="bottom-block"><img alt="John Doe" class="picture" src="/assets/images/JOHN_DOE_1.jpg"><div class="blurb"><p>John Doe is a guy from Texas. <a href="http://johnswebsite.com" target="_blank">John's Website</a> and has a large following.</p></div></div>

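These are two examples of values seen in the database. Now I would like to be able to extract certain pieces of information. More precisely, I want to extract Name, Url, ImageName and Blurb.

For the first example, after running a query on that value, I would like to see:

Name: Url: http://website.com ImageName: somepicture.jpg Blurb: Blurb about person<a href="http://website.com">http://website.com</a>

For the second example:

Name: John Doe Url: http://johnswebsite.com ImageName: JOHN_DOE_1.jpg Blurb: <p>John Doe is a guy from Texas. <a href="http://johnswebsite.com" target="_blank">John's Website</a> and has a large following.</p>

I have been playing with an SQL query that does a decent job, but there are still a lot of inconsistencies.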

SELECT id, url, content,
       TRIM(BOTH '\n' FROM TRIM(TRAILING '</div>\n</div>' FROM TRIM(TRAILING '</div></div>' FROM TRIM(SUBSTRING(content, LOCATE('class="bottom-block"',content)+18))))) as block_extract,
       TRIM(BOTH '\n' FROM TRIM(TRAILING '</div>\n</div>' FROM TRIM(TRAILING '</div></div>' FROM TRIM(SUBSTRING(content, LOCATE('class="blurb"',content)+12))))) as blurb
FROM articles
WHERE content LIKE '%bottom-block%'
GROUP BY block_extract;

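【Comments】:

  • That is a monster of an SQL statement.
  • Why is your question tagged PHP? Do you need to do this inside the SQL statement, or can you use PHP to parse the data (which is clearly easier)? More generally, in what context do you want to parse this data?

Tags: php mysql regex parsing trim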


【Solution 1】:

Here is a DOM way:

$results = array();

$fields = array('name', 'img', 'url', 'blurb');

$queries = array('name'  => '//img/@alt',
                 'img'   => '//img[@class = "picture"]/@style |
                             //img/@src |
                             //div[@class = "picture"]/@style',
                 'url'   => '//div[@class = "blurb"]//a/@href',
                 'blurb' => '//div[@class = "blurb"]');

$imgPattern = <<<'EOD'
~
(?|
    .*? background-image:url\( [^)]*? ([^?="\')/]+ \.(?:png|jpe?g|gif) ).*
  | 
    .*? ([^=;/]+)$
)
~ix
EOD;

// $data is assumed to be an array of HTML strings, one per article row.
foreach ($data as $html) {
    $srcDom = new DOMDocument();
    @$srcDom->loadHTML($html);

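    $elts = $srcDom->getElementsByTagName("body")->item(0)->childNodes;

    // Reset per-article state so values from a previous row cannot leak through.
    $tmp = array('other' => '');
    $bbnode = null;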
    foreach ($elts as $elt) {
        if ( $elt->nodeType === XML_ELEMENT_NODE &&
             $elt->hasAttribute('class') &&
             $elt->getAttribute('class') == 'bottom-block' )
            $bbnode = $elt;
        else
            $tmp['other'] .= $srcDom->saveHTML($elt);
    }
    echo htmlspecialchars(print_r($tmp['other'], true));
    if ( $bbnode ):
        $bbDom = new DOMDocument();
        $bbDom->appendChild($bbDom->importNode($bbnode, true));

        $xpath = new DOMXPath($bbDom);

        foreach($fields as $field) {
            $$field = $xpath->query($queries[$field]);

            if ( $field == 'blurb' ):
                $tmp[$field] = '';
                // Skip blocks without a blurb div instead of dereferencing null.
                if ($$field->length) {
                    foreach ($$field->item(0)->childNodes as $child) {
                        $tmp[$field] .= $bbDom->saveHTML($child);
                    }
                }
            else:
                $tmp[$field] = ($$field->length) ? $$field->item(0)->nodeValue : '';
            endif;
        }
        $tmp['img'] = preg_replace($imgPattern, '$1', $tmp['img']);
    endif;
    $results[] = $tmp;
}

echo htmlspecialchars(print_r($results, true));
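
The loop above collects everything outside the bottom block in the `other` key. As an alternative, here is a minimal sketch that removes the block from the parsed document and serializes what is left. The function name `stripBottomBlock` is hypothetical and not part of the answer; it assumes the block is always a `div` whose class list contains `bottom-block`.

```php
<?php
// Hypothetical helper, not part of the answer above: returns the article
// HTML with every div carrying the "bottom-block" class removed.
function stripBottomBlock($html)
{
    if (trim($html) === '') {
        return '';
    }

    // Collect libxml warnings quietly; stored fragments are rarely valid HTML.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    // Match "bottom-block" as one class token, even if other classes are present.
    $blocks = $xpath->query(
        '//div[contains(concat(" ", normalize-space(@class), " "), " bottom-block ")]'
    );
    foreach (iterator_to_array($blocks) as $block) {
        $block->parentNode->removeChild($block);
    }

    // Serialize the children of <body> to drop the wrapper tags loadHTML() adds.
    $body = $dom->getElementsByTagName('body')->item(0);
    $out = '';
    if ($body !== null) {
        foreach ($body->childNodes as $child) {
            $out .= $dom->saveHTML($child);
        }
    }
    return trim($out);
}
```

Note that `loadHTML()` assumes ISO-8859-1 when the input declares no charset, so articles with non-ASCII text would likely need an encoding hint before this is usable.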

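【Comments】:

  • Thanks for the answer. I have other images encoded as: <a href="#"><img class="picture" src="/generator?type=1&amp;w=80&amp;h=80&amp;f=picture.jpg"></a> How would I edit the preg_replace to match this?
  • Also, what kind of query could I add to the queries array to get the original HTML minus the "bottom-block" div? For example, I was messing around with something like: $queries['original_content_without_blurb'] => '*[not(self::div[@class = "bottom-block"])]'
  • Thanks Casimir, but the bb field contains the content of the 'bottom-block'. What I actually want is for the bb field to contain the original HTML minus the bottom-block div. Any ideas?
  • I found yet another case for image_name. Something like: <img class="picture" style="background-image:url('/imageGenerator?type=15&w=80&h=80&f=scotiabank.jpg');" /> Sorry, I am not very good with regular expressions, otherwise I would try adding this myself. I feel the current regex should catch this case, since the only difference is basically a "div" tag versus an "img" tag.
  • I figured out the image problem I mentioned above: 'image_name' => '//img/@src | //img[@class = "picture"]/@style | //div[@class = "picture"]/@style',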
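
The image variants raised in these comments (a plain path, a generator URL with an `f=` parameter, and either of those wrapped in `background-image:url(...)`) could also be handled without extending the regex. Below is a minimal sketch using `parse_url()` and `parse_str()`. The function name `extractImageName` is hypothetical, and it assumes the generator script always passes the file name in a parameter called `f`, as in the samples quoted above.

```php
<?php
// Hypothetical helper: pull a bare file name out of an @src or @style value.
function extractImageName($value)
{
    // For style attributes, keep only what is inside url(...).
    if (preg_match('~url\(\s*[\'"]?([^\'")]+)~i', $value, $m)) {
        $value = $m[1];
    }

    // Raw strings may still carry &amp; between query parameters.
    $value = html_entity_decode($value, ENT_QUOTES);

    // Prefer the f= parameter when the image goes through a generator script.
    $query = parse_url($value, PHP_URL_QUERY);
    if (is_string($query)) {
        parse_str($query, $params);
        if (!empty($params['f'])) {
            return basename($params['f']);
        }
    }

    // Otherwise fall back to the last segment of the path.
    $path = parse_url($value, PHP_URL_PATH);
    return is_string($path) ? basename($path) : '';
}
```

With a helper like this, the `preg_replace($imgPattern, ...)` step in the answer could be swapped for `extractImageName($tmp['img'])`, though that replacement is a suggestion and not something the answer's author proposed.
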
【Solution 2】:

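OK, so I don't know how to do this with an SQL query, but here is how I would do it in PHP. The basic premise is to run five separate matching queries and then print the results. The matching queries are:

  1. Bottom-block content
  2. Images
  3. URLs
  4. Blurbs
  5. Names

Here is some code to demonstrate.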

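// GET THE BOTTOM BLOCK CONTENT ($mysql_row HOLDS ONE ARTICLE'S CONTENT AS A STRING)
preg_match('~(?<=<div class="bottom-block">).*?(?=</div>$)~ims', $mysql_row, $bottom_block_array);
$string = isset($bottom_block_array[0]) ? $bottom_block_array[0] : ''; // EMPTY WHEN THE ROW HAS NO BOTTOM BLOCK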

// GRAB THE IMAGES
preg_match_all('~[A-Z0-9_]+\.(?:jpg|jpeg|gif|png)(?=\'|")~i', $string, $images);
$images = $images[0];

// GRAB THE URLS
preg_match_all('~(?<=href=").*?(?=")~ims', $string, $urls);
$urls = $urls[0];

// GRAB THE BLURBS
preg_match_all('~(?<=<div class="blurb">).*?(?=</div>)~ims', $string, $blurbs);
$blurbs = $blurbs[0];

// GRAB THE NAMES
preg_match_all('~(?<=alt=").*?(?=")~ims', $string, $names);
$names = $names[0];



// LOOP THROUGH AND PRINT OUT ALL OF THE NAMES (OR ONLY ONE, IF DESIRED)
if ($names) {
    foreach ($names AS $name) {print "\nName: ".$name;} // USE THIS IF YOU WANT ALL OF THE NAMES
    // print "\nName: ".$names[0]; // USE THIS IF YOU ONLY WANT ONE POSSIBLE NAME TO SHOW UP
}
else {print "\nName:";}


if ($urls) {
    foreach ($urls AS $url) {print "\nUrl: ".$url;} // PRINT OUT ALL URLS
    // print "\nUrl: ".$urls[0]; // PRINT OUT ONLY ONE URL    
}
else {print "\nUrl:";}


if ($images) {
    foreach ($images AS $image) {print "\nImageName: ".$image;} // PRINT OUT ALL THE IMAGES
    // print "\nImageName: ".$images[0]; // PRINT OUT ONLY ONE IMAGE
}
else {print "\nImageName:";}


if ($blurbs) {
    foreach ($blurbs AS $blurb) {print "\nBlurb: ".$blurb;} // PRINT OUT ALL OF THE BLURBS
    // print "\nBlurb: ".$blurbs[0]; // PRINT OUT ONLY ONE BLURB
}
else {print "\nBlurb:";}


print "\n\n\n\n\n";

Here is a working demo
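
For reference, the five patterns above can be wrapped in a single function that returns one row of fields, which is closer to the Name / Url / ImageName / Blurb output the question asks for. This is only a sketch: the name `parseBottomBlock` is hypothetical, and it keeps just the first match for each field.

```php
<?php
// Hypothetical wrapper around the patterns above; keeps the first match only.
function parseBottomBlock($content)
{
    $result = array('Name' => '', 'Url' => '', 'ImageName' => '', 'Blurb' => '');

    // Isolate the bottom block; return empty fields when the article has none.
    if (!preg_match('~(?<=<div class="bottom-block">).*?(?=</div>$)~ims', $content, $block)) {
        return $result;
    }
    $string = $block[0];

    $patterns = array(
        'Name'      => '~(?<=alt=").*?(?=")~is',
        'Url'       => '~(?<=href=").*?(?=")~is',
        'ImageName' => '~[A-Z0-9_]+\.(?:jpg|jpeg|gif|png)(?=\'|")~i',
        'Blurb'     => '~(?<=<div class="blurb">).*?(?=</div>)~is',
    );
    foreach ($patterns as $key => $pattern) {
        if (preg_match($pattern, $string, $m)) {
            $result[$key] = $m[0];
        }
    }
    return $result;
}
```

Because the blurb pattern stops at the first `</div>`, a blurb that itself contains a nested `div` would be cut short; the DOM approach in Solution 1 does not have that limitation.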

【Comments】:
