Converting HTML with PhpWord: Error - DOMDocument::loadXML(): Namespace prefix o on p is not defined in Entity

2019-06-22 08:16:00

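I am trying to convert formatted HTML using PhpWord.

I created an HTML form with Summernote. Summernote lets users format text, and that text is saved to the database together with its HTML tags.

Next, using PhpWord, I want to output the captured information to a Word document. See the code below: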

$rational = DB::table('rationals')->where('qualificationheader_id', $qualId)->value('rational');

$wordTest = new \PhpOffice\PhpWord\PhpWord();
$newSection = $wordTest->addSection();
$newSection->getStyle()->setPageNumberingStart(1);

\PhpOffice\PhpWord\Shared\Html::addHtml($newSection, $rational);
$footer = $newSection->addFooter();
$footer->addText($curriculum->curriculum_code.'-'.$curriculum->curriculum_title);

$objectWriter = \PhpOffice\PhpWord\IOFactory::createWriter($wordTest, 'Word2007');
try {
    $objectWriter->save(storage_path($curriculum->curriculum_code.'-'.$curriculum->curriculum_title.'.docx'));
} catch (Exception $e) {
}

return response()->download(storage_path($curriculum->curriculum_code.'-'.$curriculum->curriculum_title.'.docx'));

The text saved in the database looks like this:

<p class="MsoNormal"><span lang="EN-GB" style="background-image: initial; background-position: initial; background-size: initial; background-repeat: initial; background-attachment: initial; background-origin: initial; background-clip: initial;"><span style="font-family: Arial;">The want for this qualification originated from the energy crisis in
South Africa in 2008 together with the fact that no existing qualifications
currently focuses on energy efficiency as one of the primary solutions.  </span><span style="font-family: Arial;">The fact that energy supply remains under
severe pressure demands the development of skills sets that can deliver the
necessary solutions.</span><span style="font-family: Arial;">  </span><o:p></o:p></span></p><p class="MsoNormal"><span lang="EN-GB" style="background-image: initial; background-position: initial; background-size: initial; background-repeat: initial; background-attachment: initial; background-origin: initial; background-clip: initial; font-family: Arial;">This qualification addresses the need from Industry to acquire credible
and certified professionals with specialised skill sets in the energy
efficiency field. The need for this skill set has been confirmed as a global
requirement in few of the International commitment to the reduction of carbon

I receive the following error:

ErrorException (E_WARNING)
DOMDocument::loadXML(): Namespace prefix o on p is not defined in Entity, line: 1

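Answer:

Problem

The parser complains that your text contains namespaces in element tags, more specifically the prefix on the tag <o:p> (where o: is the prefix). It appears to be some kind of formatting for Word.

Analysis

For a well-formed HTML structure, it is possible to include namespaces in the declaration, which tells the parser what these prefixes actually are. But since what gets parsed here seems to be only a fragment of HTML, that is not possible.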
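To make the role of the declaration concrete, here is a minimal sketch (not taken from PHPWord) that parses the same fragment twice: once as-is, and once wrapped in a root element that declares the o prefix. The helper name countLibxmlErrors and the wrapper element name root are made up for this illustration. libxml_use_internal_errors is used so the parser messages are collected instead of being raised as PHP warnings.

```php
<?php
// Hypothetical helper: parse $xml with DOMDocument::loadXML and return how
// many libxml messages (errors or warnings) the parser recorded.
function countLibxmlErrors(string $xml): int
{
    // Collect libxml messages internally instead of raising PHP warnings.
    $previous = libxml_use_internal_errors(true);
    libxml_clear_errors();

    $doc = new DOMDocument();
    $doc->loadXML($xml);

    $count = count(libxml_get_errors());

    // Restore the previous error-handling mode.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $count;
}

$fragment = '<o:p>Foo</o:p>';
// Assumed wrapper: any root element works as long as it declares xmlns:o.
$declared = '<root xmlns:o="urn:schemas-microsoft-com:office:office">' . $fragment . '</root>';

var_dump(countLibxmlErrors($fragment) > 0);   // the undeclared prefix is reported
var_dump(countLibxmlErrors($declared) === 0); // the declared prefix parses cleanly
```

The fragment coming out of Summernote has no such root element, which is why the declaration route is not directly available here.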

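It would be possible to provide DOMXPath with the namespace so that PHPWord could use it. Unfortunately, the DOMXPath isn't public in the API, so this cannot be done.

Instead, the best approach seems to be to strip the prefixes from the tags, which makes the warning go away.

Edit 2018-10-04: I have since found a way to keep the prefixes in the tags and still make the error go away, but the implementation is not optimal. If anyone can provide a better solution, feel free to edit my post or leave a comment.

Solution

Based on the analysis, the solution is to remove the prefixes, which in turn means we have to pre-parse the code. Since PHPWord is using DOMDocument, we can use it as well and be sure that we don't need to install any (additional) dependencies.

PHPWord parses the HTML with loadXML, which is the function that complains about the formatting. Error messages can be suppressed in this method, and we will do so in both solutions. This is done by passing an additional parameter to the loadXML and loadHTML functions.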
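As a minimal sketch of that additional parameter: the second argument of loadXML is a bitmask of libxml option constants, and the two flags below are the ones the solutions rely on to keep the parser quiet. The input string is just the small test fragment used throughout this answer.

```php
<?php
$doc = new DOMDocument();

// Second argument: a bitmask of LIBXML_* option constants.
$loaded = $doc->loadXML(
    '<o:p>Foo <o:b>Bar</o:b></o:p>',
    LIBXML_NOERROR |   // do not report errors
    LIBXML_NOWARNING   // do not report warnings
);

var_dump($loaded);

// The document is still built; the prefix is simply kept as part of the
// tag name, which is what Solution 1 cleans up afterwards.
echo $doc->documentElement->tagName, "\n";
```

Suppressing the messages does not change the parsed tree, so a clean-up step is still needed before the markup is handed to PHPWord.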

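Solution 1: Pre-parse as XML and remove the prefixes

The first approach parses the HTML code as XML, recursively walks the tree, and removes any prefix that occurs on a tag name.

I created a class that should solve this problem.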

class TagPrefixFixer {

    /**
      * @desc Removes all prefixes from tags
      * @param string $xml The XML code to replace against.
      * @return string The XML code with no prefixes in the tags.
    */
    public static function Clean(string $xml) {
        $doc = new DOMDocument();
        /* Load the XML. Only the suppression flags are passed here: the
         * LIBXML_HTML_* options are meant for loadHTML, not loadXML. */
        $doc->loadXML($xml,
            LIBXML_NOERROR |        # Suppress any errors
            LIBXML_NOWARNING        # or warnings about prefixes.
        );
        /* Run the code */
        self::removeTagPrefixes($doc);
        /* Return only the XML */
        return $doc->saveXML();
    }

    private static function removeTagPrefixes(DOMNode $domNode) {
        /* Iterate over a snapshot of the children: childNodes is a live
         * list, and replacing nodes while iterating it can skip siblings. */
        foreach (iterator_to_array($domNode->childNodes) as $node) {
            /* Only element nodes have a tag name that can be renamed */
            if ($node->nodeType === XML_ELEMENT_NODE) {
                /* Iterate recursively over the children.
                 * This is done before the renaming on purpose.
                 * If we rename this element, then the children, the element
                 * would need to be moved a lot more times due to how
                 * renameNode works. */
                if ($node->hasChildNodes()) {
                    self::removeTagPrefixes($node);
                }
                /* Check if the tag contains a ':' */
                if (strpos($node->tagName, ':') !== false) {
                    /* Get the last part of the tag name */
                    $parts = explode(':', $node->tagName);
                    $newTagName = end($parts);
                    /* Change the name of the tag */
                    self::renameNode($node, $newTagName);
                }
            }
        }
    }

    private static function renameNode($node, $newName) {
        /* Create a new node with the new name */
        $newNode = $node->ownerDocument->createElement($newName);
        /* Copy over every attribute from the old node to the new one */
        foreach ($node->attributes as $attribute) {
            $newNode->setAttribute($attribute->nodeName, $attribute->nodeValue);
        }
        /* Copy over every child node to the new node */
        while ($node->firstChild) {
            $newNode->appendChild($node->firstChild);
        }
        /* Replace the old node with the new one */
        $node->parentNode->replaceChild($newNode, $node);
    }
}

To use the code, just call the TagPrefixFixer::Clean function.

$xml = '<o:p>Foo <o:b>Bar</o:b></o:p>';
print TagPrefixFixer::Clean($xml);

Output

<?xml version="1.0"?>
<p>Foo <b>Bar</b></p>

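Solution 2: Pre-parse as HTML

I noticed that if you use loadHTML instead of loadXML, which is what PHPWord is using, it removes the prefixes when loading the HTML into the class.

This code is considerably shorter.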

function cleanHTML($html) {
    $doc = new DOMDocument();
    /* Load the HTML */
    $doc->loadHTML($html,
            LIBXML_HTML_NOIMPLIED | # Make sure no extra BODY
            LIBXML_HTML_NODEFDTD |  # or DOCTYPE is created
            LIBXML_NOERROR |        # Suppress any errors
            LIBXML_NOWARNING        # or warnings about prefixes.
    );
    /* Immediately save the HTML and return it. */
    return $doc->saveHTML();
}

To use this code, just call the cleanHTML function:

$html = '<o:p>Foo <o:b>Bar</o:b></o:p>';
print cleanHTML($html);

Output

<p>Foo <b>Bar</b></p>

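Solution 3: Keep the prefixes and add namespaces

Before feeding the data to the parser, I tried wrapping the code with the given Microsoft Office namespaces, which also solves the problem. Ironically, I haven't found a way to add the namespaces with the DOMDocument parser without actually raising the original warning. So the implementation of this solution is a bit hacky, and I would not recommend using it as is; build your own instead. But you get the idea: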

function addNamespaces($xml) {
    $root = '<w:wordDocument
        xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml"
        xmlns:wx="http://schemas.microsoft.com/office/word/2003/auxHint"
        xmlns:o="urn:schemas-microsoft-com:office:office">';
    $root .= $xml;
    $root .= '</w:wordDocument>';
    return $root;
}

To use this code, just call the addNamespaces function:

$xml = '<o:p>Foo <o:b>Bar</o:b></o:p>';
print addNamespaces($xml);

Output

<w:wordDocument
    xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml"
    xmlns:wx="http://schemas.microsoft.com/office/word/2003/auxHint"
    xmlns:o="urn:schemas-microsoft-com:office:office">
    <o:p>Foo <o:b>Bar</o:b></o:p>
</w:wordDocument>

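This code can then be passed to the PHPWord function addHtml without raising any warnings.

Optional solutions (deprecated)

In previous revisions of this answer, these were presented as (optional) solutions; for the sake of problem solving I will leave them below. Keep in mind that none of them are recommended and they should be used with caution.

Turning off warnings

Since it is only a warning and not a fatal, halting exception, you can turn the warnings off. You can do this by including the code below at the top of your script. However, this can still slow down your application, and the best approach is always to make sure there are no warnings or errors in the first place.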

// Show the default reporting except from warnings
error_reporting(E_ALL & ~E_NOTICE & ~E_STRICT & ~E_DEPRECATED & ~E_WARNING);

The setting is derived from the default reporting level.
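If warnings are going to be masked at all, one way to limit the damage is to mask them only around the parsing call instead of for the whole script. The sketch below is an assumption of mine rather than part of the original answer; withoutWarnings is a hypothetical helper name. It relies on error_reporting() returning the previous level when a new one is set.

```php
<?php
// Hypothetical helper: run $callback with E_WARNING masked out, then restore
// whatever reporting level was active before.
function withoutWarnings(callable $callback)
{
    // error_reporting() returns the old level when given a new one.
    $previous = error_reporting(error_reporting() & ~E_WARNING);
    try {
        return $callback();
    } finally {
        // Always restore the level, even if the callback throws.
        error_reporting($previous);
    }
}

$levelBefore = error_reporting();

$doc = withoutWarnings(function () {
    $doc = new DOMDocument();
    $doc->loadXML('<o:p>Foo <o:b>Bar</o:b></o:p>'); // would normally warn
    return $doc;
});

var_dump(error_reporting() === $levelBefore); // the level is restored afterwards
echo $doc->saveXML();
```

Frameworks that install their own error handler typically consult error_reporting() before turning a warning into an exception, so the same scoping should apply there, but that is worth verifying in your own setup.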

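Using regex

It is (probably) possible to remove (most of) the namespaces by using regex on your text, either before saving it to the database or after fetching it for use in this function. Since the text is already stored in the database, it is better to apply the code below after fetching it. Note that the regex may miss some occurrences and, in the worst case, mess up the HTML.

Regex: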

$text_after = preg_replace('/[a-zA-Z]+:([a-zA-Z]+[=>])/', '$1', $text_before);

Example:

$text = '<o:p>Foo <o:b>Bar</o:b></o:p>';
$text = preg_replace('/[a-zA-Z]+:([a-zA-Z]+[=>])/', '$1', $text);
echo $text; // Outputs '<p>Foo <b>Bar</b></p>'
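To illustrate how the regex can miss an occurrence: it only matches a prefixed name that is immediately followed by = or >, so an opening tag with attributes is skipped while its closing tag is rewritten. The sketch below shows that case next to an alternative pattern that only strips a prefix directly after < or </. The alternative pattern and the variable names are my own assumptions, not part of the original answer, and it is still a regex over markup with all the usual limits.

```php
<?php
// Pattern from the answer above.
$original = '/[a-zA-Z]+:([a-zA-Z]+[=>])/';
// Assumed alternative: only strip a prefix that directly follows "<" or "</".
$anchored = '~(</?)[a-zA-Z][\w-]*:([a-zA-Z][\w-]*)~';

// An opening tag that carries an attribute.
$text = '<o:p class="a">Foo</o:p>';

// Only the closing tag is rewritten, leaving mismatched tags.
echo preg_replace($original, '$1', $text), "\n";   // <o:p class="a">Foo</p>

// Both the opening and the closing tag lose their prefix.
echo preg_replace($anchored, '$1$2', $text), "\n"; // <p class="a">Foo</p>
```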

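Reproducing the problem

To reproduce the problem I had to dig a bit deeper, since it is not PHPWord that raises the warning but the DOMDocument that PHPWord is using. The code below uses the same parsing method that PHPWord is using and should output all warnings and notices about the code.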

# Make sure to display all errors
ini_set("display_errors", "1");
error_reporting(E_ALL);

$html = '<o:p>Foo <o:b>Bar</o:b></o:p>';

# Set up and parse the code
$doc = new DOMDocument();
$doc->loadXML($html); # This is the line that's causing the warning.
# Print it back
echo $doc->saveXML();

Tags: php, laravel, summernote, phpword
Source: https://codeday.me/bug/20190622/1261801.html