There are plenty of reasons you might need to find all of the URLs on a website, but your exact goal will determine what you're looking for. For example, you might want to:
Find every indexed URL to analyze issues like cannibalization or index bloat
Collect current and historic URLs Google has seen, especially for site migrations
Find all 404 URLs to recover from post-migration errors
In each scenario, a single tool won't give you everything you need. Unfortunately, Google Search Console isn't exhaustive, and a "site:example.com" search is limited and hard to extract data from.
In this post, I'll walk you through some tools to build your URL list before deduplicating the data using a spreadsheet or Jupyter Notebook, depending on your site's size.
Old sitemaps and crawl exports
If you're looking for URLs that disappeared from the live site recently, there's a chance someone on your team may have saved a sitemap file or a crawl export before the changes were made. If you haven't already, check for these files; they can often provide what you need. But if you're reading this, you probably didn't get so lucky.
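If you do turn up an old sitemap file, extracting its URLs takes only a few lines. Here's a minimal sketch assuming a standard sitemap.xml in the sitemaps.org namespace; the filename is a placeholder:

```python
# Minimal sketch: pull <loc> URLs out of a saved sitemap file.
# Assumes a standard sitemap using the sitemaps.org namespace;
# "old-sitemap.xml" is a placeholder filename.
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
tree = ET.parse("old-sitemap.xml")
urls = [loc.text.strip() for loc in tree.getroot().findall(".//sm:loc", NS)]
print(len(urls), "URLs recovered from the old sitemap")
```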
Archive.org
Archive.org is an invaluable tool for SEO tasks, funded by donations. If you search for a domain and select the "URLs" option, you can access up to 10,000 listed URLs.
However, there are a few limitations:
URL limit: You can only retrieve up to 10,000 URLs, which is insufficient for larger sites.
Quality: Many URLs may be malformed or reference resource files (e.g., images or scripts).
No export option: There isn't a built-in way to export the list.
To bypass the lack of an export button, use a browser scraping plugin like Dataminer.io. However, these limitations mean Archive.org may not offer a complete solution for larger sites. Also, Archive.org doesn't indicate whether Google indexed a URL, but if Archive.org saw it, there's a good chance Google did, too.
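If you'd rather skip the scraping plugin entirely, the same data is available programmatically through the Wayback Machine's CDX API. Here's a minimal sketch; the domain is a placeholder, and the limit parameter mirrors the 10,000-URL cap mentioned above:

```python
# Minimal sketch: pull archived URLs for a domain from the Wayback
# Machine's CDX API, the programmatic counterpart to the "URLs" view.
# Uses the free, unauthenticated endpoint; swap in your own domain.
import requests

resp = requests.get(
    "https://web.archive.org/cdx/search/cdx",
    params={
        "url": "example.com/*",  # placeholder domain; /* matches all paths
        "output": "json",
        "fl": "original",        # return only the original URL field
        "collapse": "urlkey",    # deduplicate repeated captures of a URL
        "limit": 10000,
    },
    timeout=60,
)
rows = resp.json()
urls = [row[0] for row in rows[1:]]  # first row is the header
print(len(urls), "archived URLs found")
```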
Moz Pro
While you might typically use a link index to find external sites linking to you, these tools also discover URLs on your own site in the process.
How to use it:
Export your inbound links in Moz Pro to get a quick and easy list of target URLs from your site. If you're dealing with a massive website, consider using the Moz API to export data beyond what's manageable in Excel or Google Sheets.
It's important to note that Moz Pro doesn't confirm whether URLs are indexed or discovered by Google. However, since most sites apply the same robots.txt rules to Moz's bots as they do to Google's, this method generally works well as a proxy for Googlebot's discoverability.
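If you're working from the standard CSV export, a few lines of pandas can reduce it to a clean list of target URLs. This is a rough sketch: the filename and the "Target URL" column name are assumptions, so check them against your export's actual headers:

```python
# Sketch: condense a Moz Pro inbound-links CSV export into a unique,
# sorted list of target URLs on your site. The filename and the
# "Target URL" column name are assumptions; verify against your export.
import pandas as pd

links = pd.read_csv("moz_inbound_links.csv")
targets = links["Target URL"].dropna().drop_duplicates().sort_values()
targets.to_csv("moz_target_urls.csv", index=False, header=False)
print(f"{len(targets)} unique target URLs")
```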
Google Search Console
Google Search Console offers several valuable sources for building your list of URLs.
Links reports:
Similar to Moz Pro, the Links section provides exportable lists of target URLs. Unfortunately, these exports are capped at 1,000 URLs each. You can apply filters for specific pages, but since filters don't apply to the export, you might need to rely on browser scraping tools, which are limited to 500 filtered URLs at a time. Not ideal.
Performance → Search results:
This export gives you a list of pages receiving search impressions. While the export is limited, you can use the Google Search Console API for larger datasets. There are also free Google Sheets plugins that simplify pulling more extensive data.
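For reference, here's a minimal sketch of that API approach, paging through page-level rows past the UI's export cap. It assumes you've already created a service-account key with read access to the property; the key path, dates, and property URL are placeholders:

```python
# Sketch: page-level export via the Search Console API, paging past
# the UI's cap. Assumes a service-account JSON key with access to the
# property; the key path, dates, and property URL are placeholders.
from google.oauth2 import service_account
from googleapiclient.discovery import build

creds = service_account.Credentials.from_service_account_file(
    "service-account.json",
    scopes=["https://www.googleapis.com/auth/webmasters.readonly"],
)
service = build("searchconsole", "v1", credentials=creds)

pages, start_row = set(), 0
while True:
    resp = service.searchanalytics().query(
        siteUrl="https://example.com/",
        body={
            "startDate": "2024-01-01",
            "endDate": "2024-03-31",
            "dimensions": ["page"],
            "rowLimit": 25000,   # API maximum per request
            "startRow": start_row,
        },
    ).execute()
    rows = resp.get("rows", [])
    pages.update(row["keys"][0] for row in rows)
    if len(rows) < 25000:  # last page of results
        break
    start_row += 25000

print(f"{len(pages)} pages with impressions")
```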
Indexing → Pages report:
This section offers exports filtered by issue type, though these are also limited in scope.
Google Analytics
The Engagement → Pages and screens default report in GA4 is an excellent source for collecting URLs, with a generous limit of 100,000 URLs.
Even better, you can apply filters to create different URL lists, effectively surpassing the 100k limit. For example, if you want to export only blog URLs, follow these steps:
Step 1: Add a segment to the report
Step 2: Click "Create a new segment."
Step 3: Define the segment with a narrower URL pattern, such as URLs containing /blog/
Note: URLs found in Google Analytics may not be discoverable by Googlebot or indexed by Google, but they offer valuable insights.
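If the UI's limits still get in the way, the GA4 Data API can pull the same report programmatically. Below is a rough sketch mirroring the /blog/ segment above; it assumes the google-analytics-data client library and Application Default Credentials are set up, and the property ID is a placeholder:

```python
# Sketch: pull pagePath values from GA4 via the Data API, filtered to
# paths containing /blog/ (mirroring the segment described above).
# Assumes Application Default Credentials; property ID is a placeholder.
from google.analytics.data_v1beta import BetaAnalyticsDataClient
from google.analytics.data_v1beta.types import (
    DateRange, Dimension, Filter, FilterExpression, Metric, RunReportRequest,
)

client = BetaAnalyticsDataClient()  # reads GOOGLE_APPLICATION_CREDENTIALS
request = RunReportRequest(
    property="properties/123456789",  # placeholder GA4 property ID
    dimensions=[Dimension(name="pagePath")],
    metrics=[Metric(name="screenPageViews")],
    date_ranges=[DateRange(start_date="2024-01-01", end_date="2024-03-31")],
    dimension_filter=FilterExpression(
        filter=Filter(
            field_name="pagePath",
            string_filter=Filter.StringFilter(
                match_type=Filter.StringFilter.MatchType.CONTAINS,
                value="/blog/",
            ),
        )
    ),
    limit=100000,
)
response = client.run_report(request)
blog_paths = [row.dimension_values[0].value for row in response.rows]
print(f"{len(blog_paths)} blog paths")
```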
Server log files
Server or CDN log files are perhaps the ultimate tool at your disposal. These logs capture an exhaustive record of every URL path requested by users, Googlebot, or other bots during the recorded period.
Considerations:
Data size: Log files can be huge, so many sites only keep the last two months of data.
Complexity: Analyzing log files can be challenging, but many tools are available to simplify the process (a minimal parsing sketch follows this list).
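As a starting point, here's that sketch: it pulls unique request paths out of an access log in the common/combined format. Real log formats vary by server and CDN, so treat the regex and filename as assumptions to adapt, not a universal parser:

```python
# Minimal sketch: extract unique requested URL paths from an access log
# in the common/combined format. Assumes one request per line and a
# placeholder filename; adapt the regex to your server's log format.
import re

LINE = re.compile(r'"(?:GET|HEAD|POST) (?P<path>\S+) HTTP/[\d.]+"')

paths = set()
with open("access.log", encoding="utf-8", errors="replace") as fh:
    for line in fh:
        match = LINE.search(line)
        if match:
            paths.add(match.group("path"))

print(f"{len(paths)} unique paths")
```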
Combine, and good luck
Once you've gathered URLs from these sources, it's time to combine them. If your site is small enough, use Excel; for larger datasets, use tools like Google Sheets or a Jupyter Notebook. Ensure all URLs are consistently formatted, then deduplicate the list.
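For the Jupyter Notebook route, here's a rough sketch of that combine-and-dedupe step with pandas. It assumes each source was saved as a one-column, headerless CSV of URLs; the filenames and normalization rules are assumptions to adapt to your data:

```python
# Sketch: combine per-source URL lists, normalize formatting, and
# deduplicate. Assumes one-column CSVs without header rows; filenames
# are placeholders from the earlier steps.
import pandas as pd
from urllib.parse import urlsplit, urlunsplit

def normalize(url: str) -> str:
    # Lowercase scheme and host, strip trailing slashes, drop fragments
    # so variants like /page and /page/ collapse to one entry.
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       path, parts.query, ""))

sources = ["archive_org.csv", "moz_target_urls.csv", "gsc_pages.csv",
           "ga4_pages.csv", "log_paths.csv"]
frames = [pd.read_csv(f, header=None, names=["url"]) for f in sources]
urls = pd.concat(frames, ignore_index=True)["url"].astype(str).map(normalize)
combined = urls.drop_duplicates().sort_values()
combined.to_csv("all_urls.csv", index=False, header=False)
print(f"{len(combined)} unique URLs")
```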
And voilà: you now have a comprehensive list of current, old, and archived URLs. Good luck!