php - Extract img src from a text element in an XML feed

I have an XML feed that looks like this:
<?xml version="1.0" encoding="UTF-8"?>
<smf:xml-feed xmlns:smf="http://www.simplemachines.org/" xmlns="http://www.simplemachines.org/xml/recent" xml:lang="en-US">
<recent-post>
<time>April 04, 2021, 04:20:47 pm</time>
<id>1909114</id>
<subject>Title</subject>
<body><![CDATA[<a href="#"><img src="image.png">Lorem ipsum dolor sit amet, consectetur adipisicing elit. Iure rerum in tempore sit ducimus doloribus quod commodi eligendi ipsam porro non fugiat nisi eaque delectus harum aspernatur recusandae incidunt quasi.</a>]]></body>
</recent-post>
</smf:xml-feed>
I want to extract the image src from the body and then save it to a new XML file that includes an element for image.
So far, I have:
$xml = 'https://example.com/feed.xml';
$dom = new DOMDocument();
$dom->preserveWhiteSpace = false;
$dom->formatOutput = true;
$dom->recover = true;
libxml_use_internal_errors(true);
$dom->loadXML($xml);
$xpath = new DOMXPath( $dom );
$nodes = $xpath->query( 'smf:xml-feed/recent-post/body' );
foreach( $nodes as $node )
{
    $html = new DOMDocument();
    $html->loadHTML( $node->nodeValue );
    $src = $html->getElementsByTagName( 'img' )->item(0)->getAttribute('src');
    echo $src;
}
But when I try to print out $nodes, I get nothing. What am I missing?
Answer
Solution:
This looks like a Simple Machines feed. However, the namespaces are missing, and the "body" element should be a CDATA section with an HTML fragment as text. I would expect it to look like this:
<smf:xml-feed
xmlns:smf="http://www.simplemachines.org/"
xmlns="http://www.simplemachines.org/xml/recent"
xml:lang="en-US">
<recent-post>
<time>April 04, 2021, 04:20:47 pm</time>
<id>1909114</id>
<subject>Title</subject>
<body><![CDATA[
<a href="#"><img src="image.png">Lorem ipsum dolor sit amet, consectetur adipisicing elit. Iure rerum in tempore sit ducimus doloribus quod commodi eligendi ipsam porro non fugiat nisi eaque delectus harum aspernatur recusandae incidunt quasi.</a>
]]>
</body>
</recent-post>
</smf:xml-feed>
The XML defines two namespaces. To use them in XPath expressions you have to register prefixes for them. I suggest iterating the recent-post elements, then fetching the text content of specific child nodes using expressions with string casts.
The body element contains the HTML fragment as text, so you need to load it into a separate document. Then you can use XPath on this document to fetch the src of the img:
$feedDocument = new DOMDocument();
$feedDocument->preserveWhiteSpace = false;
$feedDocument->loadXML($xmlString);
$feedXpath = new DOMXPath($feedDocument);
// register namespaces
$feedXpath->registerNamespace('smf', 'http://www.simplemachines.org/');
$feedXpath->registerNamespace('recent', 'http://www.simplemachines.org/xml/recent');
// iterate the posts
foreach($feedXpath->evaluate('/smf:xml-feed/recent:recent-post') as $post) {
    // demo: fetch post subject as string
    var_dump($feedXpath->evaluate('string(recent:subject)', $post));
    // create a document for the HTML fragment
    $html = new DOMDocument();
    $html->loadHTML(
        // load the text content of the body element
        $feedXpath->evaluate('string(recent:body)', $post),
        // just a fragment, no need for html document elements or DTD
        LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD
    );
    // XPath instance for the html document
    $htmlXpath = new DOMXPath($html);
    // fetch first src attribute of an img
    $src = $htmlXpath->evaluate('string(//img/@src)');
    var_dump($src);
}
Output:
string(5) "Title"
string(9) "image.png"
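The question also asks about saving the result to a new XML file with an element for the image. A minimal sketch of that step, assuming the feed structure shown above, might look like the following. The output element names (posts, post, subject, image) and the file name posts.xml are invented for illustration and are not part of the Simple Machines format.

```php
<?php
// Sample feed inlined so the sketch runs without network access.
$xmlString = <<<'XML'
<?xml version="1.0" encoding="UTF-8"?>
<smf:xml-feed xmlns:smf="http://www.simplemachines.org/" xmlns="http://www.simplemachines.org/xml/recent" xml:lang="en-US">
  <recent-post>
    <id>1909114</id>
    <subject>Title</subject>
    <body><![CDATA[<a href="#"><img src="image.png">Lorem ipsum</a>]]></body>
  </recent-post>
</smf:xml-feed>
XML;

$feedDocument = new DOMDocument();
$feedDocument->loadXML($xmlString);
$feedXpath = new DOMXPath($feedDocument);
$feedXpath->registerNamespace('smf', 'http://www.simplemachines.org/');
$feedXpath->registerNamespace('recent', 'http://www.simplemachines.org/xml/recent');

// Target document; the element names here are hypothetical.
$target = new DOMDocument('1.0', 'UTF-8');
$target->formatOutput = true;
$posts = $target->appendChild($target->createElement('posts'));

// Collect HTML parser complaints instead of emitting warnings.
libxml_use_internal_errors(true);

foreach ($feedXpath->evaluate('/smf:xml-feed/recent:recent-post') as $post) {
    $fragment = $feedXpath->evaluate('string(recent:body)', $post);
    $src = '';
    if (trim($fragment) !== '') {
        $html = new DOMDocument();
        $html->loadHTML($fragment, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
        $src = (new DOMXPath($html))->evaluate('string(//img/@src)');
    }

    $item = $posts->appendChild($target->createElement('post'));

    // Text nodes take care of escaping special characters.
    $subject = $item->appendChild($target->createElement('subject'));
    $subject->appendChild(
        $target->createTextNode($feedXpath->evaluate('string(recent:subject)', $post))
    );

    $image = $item->appendChild($target->createElement('image'));
    $image->appendChild($target->createTextNode($src));
}
libxml_clear_errors();

$outputXml = $target->saveXML();
echo $outputXml;

// To write a file instead of a string (hypothetical file name):
// $target->save('posts.xml');
```

Posts without an img end up with an empty image element in this sketch; skipping the element entirely in that case would be an equally reasonable choice.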
Answer
Solution:
There are several problems with your code, some of which I have to make assumptions about...
The call
$dom->loadXML($xml);
expects the actual XML source as a string, not a URL; to read from a URL or file path you would need to use load() instead.
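The difference between the two methods can be checked with a small sketch. It uses a temporary file in place of the feed URL so that it runs offline; the sample XML and the variable names are placeholders, not part of the original feed.

```php
<?php
// Keep parse errors internal so a failed load returns false quietly.
libxml_use_internal_errors(true);

$xmlString = '<feed><item>one</item></feed>';

// loadXML() parses a string that already contains the XML source.
$fromString = new DOMDocument();
$okString = $fromString->loadXML($xmlString);

// load() takes a file path or URL and reads the XML from there.
$path = tempnam(sys_get_temp_dir(), 'feed');
file_put_contents($path, $xmlString);
$fromFile = new DOMDocument();
$okFile = $fromFile->load($path);

// Passing the path to loadXML() treats the path itself as markup, which fails.
$wrong = new DOMDocument();
$okWrong = $wrong->loadXML($path);

unlink($path);
libxml_clear_errors();

var_dump($okString, $okFile, $okWrong);
```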
I would have to assume that the smf namespace is defined somewhere in the document; for testing purposes I have altered the sample XML to...
<smf:xml-feed xml:lang="en-US" xmlns:smf="http://a.com">
I've also altered the query to
//smf:xml-feed/recent-post/body
to test this code.
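Finally, I'm not sure why you create another document inside the loop. If the body holds real child elements, you should be able to process them directly from the node in the loop, so I use $node as the base for the getElementsByTagName() call. Note that this does not apply when the body is a CDATA section, because its markup is then plain text rather than elements.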
$xml = 'https://example.com/feed.xml';
$dom = new DOMDocument();
$dom->preserveWhiteSpace = false;
$dom->formatOutput = true;
$dom->recover = true;
libxml_use_internal_errors(true);
$dom->load($xml);
$xpath = new DOMXPath( $dom );
$nodes = $xpath->query( '//smf:xml-feed/recent-post/body' );
foreach( $nodes as $node )
{
    $src = $node->getElementsByTagName( 'img' )->item(0)->getAttribute('src');
    echo $src;
}
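Whether getElementsByTagName() can be called on the body node directly depends on how the feed stores the markup. The following sketch, using a simplified one-element document as a stand-in for the feed, suggests that a CDATA body exposes no img element of its own, so its text has to be parsed as HTML first.

```php
<?php
// Simplified stand-in for the feed: the body markup sits inside CDATA.
$xmlString = '<post><body><![CDATA[<a href="#"><img src="image.png">Lorem ipsum</a>]]></body></post>';

$dom = new DOMDocument();
$dom->loadXML($xmlString);
$body = $dom->getElementsByTagName('body')->item(0);

// The CDATA content is text, so no img element exists below body.
$directCount = $body->getElementsByTagName('img')->length;

// Parsing the text as an HTML fragment makes the img reachable.
libxml_use_internal_errors(true);
$html = new DOMDocument();
$html->loadHTML($body->textContent, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
libxml_clear_errors();

$img = $html->getElementsByTagName('img')->item(0);
$src = $img !== null ? $img->getAttribute('src') : '';

var_dump($directCount, $src);
```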