Have you ever wanted to save a website locally? In most browsers you can do this with a couple of clicks, but what if you want to do it programmatically? Maybe you want to save all the resources, not just the HTML page but also the external JavaScript and style sheets, and control where they are saved. In PHP, we generally think of web scraping via cURL because it gives us the flexibility and features we need.
So, are you looking for a PHP script to extract URLs and resources from a web page? This tutorial provides a code snippet that extracts all URLs/links from a given web page.
There are many code snippets available online and on other blogs and websites, but not all of them are optimized for your blog or website. Check out our code snippet below, then copy the ready-to-use code and paste it wherever you need it.
Resources That You Can Extract From A WebPage:
<link rel="stylesheet" href="https://www.example.com/style.css">
<script src="https://www.example.com/script.js"></script>
<iframe src="https://www.example.com" width="468" height="60"></iframe>
<a href="https://www.example.com">Test</a>
<object data="flash.swf"></object>
<embed src="flash.swf"></embed>
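As a rough illustration of how the tags listed above map to URLs, here is a minimal sketch that reads each tag's URL attribute with PHP's built-in DOMDocument class instead of regular expressions. The function name extractResourceUrls and the sample HTML are assumptions made for this example; they are not part of the snippet later in this article.

```php
<?php
// Hypothetical helper (an assumption for illustration, not the article's snippet):
// collect resource URLs from an HTML string, one tag/attribute pair at a time.
function extractResourceUrls(string $html): array
{
    // Tag name => attribute that carries the resource URL
    $targets = [
        'link'   => 'href',
        'script' => 'src',
        'iframe' => 'src',
        'a'      => 'href',
        'object' => 'data',
        'embed'  => 'src',
    ];

    $doc = new DOMDocument();
    // Real-world HTML is rarely valid, so collect parser warnings instead of printing them
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $urls = [];
    foreach ($targets as $tag => $attribute) {
        foreach ($doc->getElementsByTagName($tag) as $element) {
            $value = trim($element->getAttribute($attribute));
            if ($value !== '') { // skips tags without the attribute, e.g. inline scripts
                $urls[] = $value;
            }
        }
    }
    // Drop duplicates and reindex the array
    return array_values(array_unique($urls));
}

// Made-up sample page built from the tags listed above
$sample = '<html><head><link rel="stylesheet" href="https://www.example.com/style.css">'
        . '<script src="https://www.example.com/script.js"></script></head>'
        . '<body><a href="https://www.example.com">Test</a>'
        . '<object data="flash.swf"></object><embed src="flash.swf"></body></html>';

print_r(extractResourceUrls($sample));
```

The URLs come back grouped by tag in the order of the $targets array, not in page order, and the HTML string passed in must not be empty.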
How To Get All External Resources Like href & src From A WebPage Using PHP?
Here is the code snippet that returns everything in one request and one run. Just copy and paste the code and see how to use it.
<?php
// Download the remote web page
$websiteURL = "https://www.google.com";
$curl = curl_init($websiteURL);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
$webPageContent = curl_exec($curl);
if ($webPageContent === false) {
    exit("Download failed: " . curl_error($curl) . "\n");
}
// Print the download size of the main page (in bytes)
print("Download size of main page: " . curl_getinfo($curl, CURLINFO_SIZE_DOWNLOAD) . "\n");

// Match and extract the src and href URLs
preg_match_all('/src="([^"]*)"/', $webPageContent, $matchessrc); // all src URLs
preg_match_all('/<link\b[^>]*?href="([^"]*)"/i', $webPageContent, $matcheslink); // all <link href> URLs
$matches = array_merge($matchessrc[1], $matcheslink[1]);
print_r($matches); // print all resource URLs as found in the page

$scheme = parse_url($websiteURL, PHP_URL_SCHEME);
$domain = $scheme . '://' . parse_url($websiteURL, PHP_URL_HOST);
$path = (string) parse_url($websiteURL, PHP_URL_PATH);
$dir = substr($path, 0, (int) strrpos($path, '/')) . '/'; // directory of the page, e.g. "/blog/"

$checked = array();
foreach ($matches as $m) {
    if ($m === '') { // skip empty attributes
        continue;
    }
    if (substr($m, 0, 2) == '//') { // protocol-relative URL: add the page's scheme
        $m = $scheme . ':' . $m;
    } elseif ($m[0] == '/') { // root-relative URL: prepend the main domain
        $m = $domain . $m;
    } elseif (substr($m, 0, 5) != 'http:' and substr($m, 0, 6) != 'https:') { // relative URL: prepend domain and directory
        $m = $domain . $dir . $m;
    }
    if (in_array($m, $checked)) { // skip duplicate resource URLs
        continue;
    }
    $checked[] = $m;
}
print_r($checked); // print the unique, absolute resource URLs
?>
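The loop at the end of the snippet converts the URLs it finds into absolute ones. If you want to try that step on its own without downloading anything, the sketch below isolates it in a small function. The name resolveResourceUrl and the example.com URLs are assumptions made for this illustration. It only handles the simple cases and does not deal with ports, "../" segments, or data: URIs.

```php
<?php
// Hypothetical helper (an assumption for illustration): turn a URL found in a
// page into an absolute URL, relative to the page it was found on.
function resolveResourceUrl(string $url, string $pageUrl): string
{
    $scheme = parse_url($pageUrl, PHP_URL_SCHEME);
    $origin = $scheme . '://' . parse_url($pageUrl, PHP_URL_HOST);

    // Already absolute: http://... or https://...
    if (preg_match('#^https?://#i', $url)) {
        return $url;
    }
    // Protocol-relative: //cdn.example.com/lib.js
    if (strncmp($url, '//', 2) === 0) {
        return $scheme . ':' . $url;
    }
    // Root-relative: /js/app.js
    if (strncmp($url, '/', 1) === 0) {
        return $origin . $url;
    }
    // Document-relative: resolve against the directory part of the page path
    $path = (string) parse_url($pageUrl, PHP_URL_PATH);
    $slash = strrpos($path, '/');
    $directory = $slash === false ? '/' : substr($path, 0, $slash + 1);
    return $origin . $directory . $url;
}

// Made-up page URL and resource URLs, for demonstration only
$page = 'https://www.example.com/blog/post.html';
$found = ['style.css', '/js/app.js', '//cdn.example.com/lib.js', 'https://other.example/x.png'];
foreach ($found as $url) {
    echo resolveResourceUrl($url, $page), "\n";
}
```

For the sample page this prints https://www.example.com/blog/style.css, https://www.example.com/js/app.js, https://cdn.example.com/lib.js and https://other.example/x.png, one per line.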
Final Words:
Make sure the code is placed correctly in your document. The rest is in your hands if you want to customize it or play with it. That's all we have. If you have any problem with this code in your template, feel free to contact us with a full explanation of your problem. We will reply as time allows. Don't forget to share this with your friends so they can also benefit from it, and leave a comment.
Very good info. I have been using PHP, but never thought there was something like this. Thank you.
Welcome, and thanks for reading our article and sharing your view. This helps motivate us to provide you with more awesome and valuable content from a different perspective. Thanks for reading this article.
VERY VERY NICE SCRIPT!!
Welcome here and thanks for reading our article and sharing your view.