PHP 5.4.16 DOMDocument removes parts of Javascript


Solution:

Here's a hack that might be helpful. The idea is to replace the contents of every script tag with a placeholder that is valid HTML and effectively unique (here, the MD5 hash of those contents), let DOMDocument process the document, and then swap the original contents back in.

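$scriptContainer = [];
// Swap each script body for its MD5 hash and remember the original under that hash.
$str = preg_replace_callback("#<script([^>]*)>(.*?)</script>#s", function ($matches) use (&$scriptContainer) {
    $hash = md5($matches[2]);
    $scriptContainer[$hash] = $matches[2];
    return "<script" . $matches[1] . ">" . $hash . "</script>";
}, $str);
$dom = new \DOMDocument();
// @ suppresses the warnings raised while parsing the markup.
@$dom->loadHTML($str);
// Put the original script bodies back in place of their hashes.
$final = strtr($dom->saveHTML(), $scriptContainer);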

Here strtr is just convenient because of the way the array is keyed; using str_replace(array_keys($scriptContainer), $scriptContainer, $dom->saveHTML()) would also work.
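To illustrate the restore step, here is a minimal sketch that masks a sample string with the same callback and checks that strtr and str_replace restore it identically. The sample markup and the variable names $html, $masked, $viaStrtr and $viaStrReplace are made up for the demonstration, and DOMDocument is deliberately left out so that only the mask/restore round trip is shown.

```php
<?php
// Hypothetical sample input; any HTML string with script blocks would do.
$html = '<p>a</p><script>if (a < b) { document.write("<b>x</b>"); }</script>';

$scriptContainer = [];
$masked = preg_replace_callback(
    '#<script([^>]*)>(.*?)</script>#s',
    function ($matches) use (&$scriptContainer) {
        $hash = md5($matches[2]);
        $scriptContainer[$hash] = $matches[2];
        return '<script' . $matches[1] . '>' . $hash . '</script>';
    },
    $html
);

// Both calls map each hash back to the original script body.
$viaStrtr = strtr($masked, $scriptContainer);
$viaStrReplace = str_replace(array_keys($scriptContainer), $scriptContainer, $masked);

var_dump($viaStrtr === $html);          // bool(true)
var_dump($viaStrtr === $viaStrReplace); // bool(true)
```

While masked, the script body contains nothing but 32 hexadecimal characters, so there is nothing in it for the parser to misread.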

I find it very surprising that PHP does not parse HTML content properly. It seems to be parsing it as XML instead (and wrongly at that, because CDATA content is parsed rather than treated literally). However, it is what it is, and if you want a real document parser then you should probably look into a Node.js solution with jsdom.

Undefined asked

2023-03-31


Additional Information:

Date the issue was resolved:

2023-03-31


Answer

Solution:

If you have a <script> within a <script>, the (not so smart) solution below will handle that. There is still a problem: if the <script> tags are not balanced, the solution will not work. This could occur if your JavaScript uses String.fromCharCode to print the string </script>.
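To make the caveat concrete before the answer's own code, here is a small sketch of a pre-check that counts opening and closing script tags in the raw markup. The helper scriptTagsBalanced and the two sample strings are hypothetical and not part of the answer; the check only compares counts, so treat it as a rough guard rather than a validator.

```php
<?php
// Hypothetical helper: true when opening and closing script tags
// occur equally often in the raw markup.
function scriptTagsBalanced($html)
{
    $open = preg_match_all('#<script[^>]*>#i', $html);
    $close = preg_match_all('#</script>#i', $html);
    return $open === $close;
}

// A script tag inside a script tag, with both closing tags written literally.
$nested = '<script>var s = "<script>alert(1)</script>";</script>';

// The inner closing tag is produced via String.fromCharCode, so only one
// literal closing tag appears in the markup.
$unbalanced = '<script>document.write("<script>alert(1)" + '
    . 'String.fromCharCode(60,47,115,99,114,105,112,116,62));</script>';

var_dump(scriptTagsBalanced($nested));     // bool(true)
var_dump(scriptTagsBalanced($unbalanced)); // bool(false)
```

In the unbalanced case the counter in mask() never returns to zero, so as far as the code reads, that block would simply be left unmasked.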

$scriptContainer = array();

function getPosition($tag) {
    return $tag[0][1];
}

function getContent($tag) {
    return $tag[0][0];
}

function isStart($tag) {
    $x = getContent($tag);
    return ($x[0].$x[1] === "<s");
}

function isEnd($tag) {
    $x = getContent($tag);
    return ($x[0].$x[1] === "</");
}

function mask($str, $scripts) {
    global $scriptContainer;

    $res = "";
    $start = null;
    $stop = null;
    $idx = 0;

    $count = 0;
    foreach ($scripts as $tag) {

        if (isStart($tag)) {
            $count++;
            $start = ($start === null) ? $tag : $start;
        }

        if (isEnd($tag)) {
            $count--;
            $stop = ($count == 0) ? $tag : $stop;
        }

        if ($start !== null && $stop !== null) {
            $res .= substr($str, $idx, getPosition($start) - $idx);
            $res .= getContent($start);
            $code = substr($str, getPosition($start) + strlen(getContent($start)), getPosition($stop) - getPosition($start) - strlen(getContent($start)));
            $hash = md5($code);
            $res .= $hash;
            $res .= getContent($stop);

            $scriptContainer[$hash] = $code;

            $idx = getPosition($stop) + strlen(getContent($stop));
            $start = null;
            $stop = null;
        }
    }

    $res .= substr($str, $idx);
    return $res;
}

preg_match_all("#\<script[^\>]*\>|\<\/script\>#s", $html, $scripts, PREG_OFFSET_CAPTURE|PREG_SET_ORDER);
$html = mask($html, $scripts);

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_use_internal_errors(false);

// handle some things within DOM

echo strtr($dom->saveHTML(), $scriptContainer);

If you replace the string "script" in the preg_match_all pattern with "style", you can also mask CSS styles, which can contain tag names too (e.g. within comments).
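One way to avoid editing the pattern by hand is to build it from the tag name. The following is only a sketch: tagPattern is a hypothetical helper, the sample markup is invented, and the unnecessary backslash escapes of the original pattern are dropped.

```php
<?php
// Hypothetical helper: builds the tag-matching pattern for a given element name.
function tagPattern($tagName)
{
    $name = preg_quote($tagName, '#');
    return '#<' . $name . '[^>]*>|</' . $name . '>#s';
}

$html = '<style>/* <p> looks like a tag */ p { color: red; }</style>'
    . '<script>var a = 1;</script>';

preg_match_all(tagPattern('style'), $html, $styles, PREG_OFFSET_CAPTURE | PREG_SET_ORDER);
preg_match_all(tagPattern('script'), $html, $scripts, PREG_OFFSET_CAPTURE | PREG_SET_ORDER);

// Each match has the shape [[text, offset]], which is what
// getContent() and getPosition() read.
foreach ($styles as $tag) {
    echo $tag[0][1], ' ', $tag[0][0], "\n";
}
```

Note that isStart() only compares the first two characters with "<s", which happens to fit both "script" and "style"; a tag name starting with another letter would need that check adjusted.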
