How to Find All Current and Archived URLs on a Website

There are many reasons you might need to find all the URLs on a website, and your specific goal will determine what you're looking for. For example, you may want to:

Identify every indexed URL to analyze issues like cannibalization or index bloat
Collect current and historic URLs Google has seen, especially for site migrations
Find all 404 URLs to recover from post-migration errors
In each scenario, a single tool won't give you everything you need. Unfortunately, Google Search Console isn't exhaustive, and a "site:example.com" search is limited and difficult to extract data from.

In this post, I'll walk you through some tools for building your URL list, and then through deduplicating the data using a spreadsheet or Jupyter Notebook, depending on your site's size.

Old sitemaps and crawl exports
If you're looking for URLs that recently disappeared from the live site, there's a chance someone on your team saved a sitemap file or a crawl export before the changes were made. If you haven't already, check for these files; they can often provide what you need. But if you're reading this, you probably didn't get that lucky.

Archive.org
Archive.org is an invaluable tool for SEO tasks, funded by donations. If you search for a domain and select the "URLs" option, you can access up to 10,000 listed URLs.

However, there are a few limitations:

URL limit: You can only retrieve up to 10,000 URLs, which is insufficient for larger sites.
Quality: Many URLs may be malformed or reference resource files (e.g., images or scripts).
No export option: There isn't a built-in way to export the list.
To work around the lack of an export button, use a browser scraping plugin like Dataminer.io. Still, these limits mean Archive.org may not provide a complete solution for larger sites. Also, Archive.org doesn't indicate whether Google indexed a URL, but if Archive.org found it, there's a good chance Google did, too.
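If you'd rather skip the scraping plugin, the Wayback Machine also exposes its index through the public CDX API. The sketch below builds a query for that endpoint and parses its JSON output; the function names and the 10,000-row cap are my choices (the cap mirrors the UI limit, and the API can return more if you raise or page it).

```python
from urllib.parse import urlencode

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"

def cdx_query(domain, limit=10000):
    """Build a CDX API URL that lists unique archived URLs for a domain."""
    params = {
        "url": f"{domain}/*",   # match every path under the domain
        "output": "json",
        "fl": "original",       # return only the original URL field
        "collapse": "urlkey",   # collapse repeated snapshots of the same URL
        "limit": str(limit),
    }
    return f"{CDX_ENDPOINT}?{urlencode(params)}"

def parse_cdx_json(rows):
    """CDX JSON output is a list of rows; the first row is the header."""
    return [row[0] for row in rows[1:]]
```

Fetching the query URL with `urllib.request` (or `requests`) and feeding the decoded JSON to `parse_cdx_json` gives you a clean list, ready to paste into your spreadsheet. Expect the same quality caveats as the UI: resource files and malformed URLs will still show up and need filtering.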

Moz Pro
While you might typically use a link index to find external sites linking to you, these tools also discover URLs on your own site in the process.


How to use it:
Export your inbound links in Moz Pro to get a quick and easy list of target URLs from your site. If you're dealing with a massive website, consider using the Moz API to export data beyond what's manageable in Excel or Google Sheets.

It's important to note that Moz Pro doesn't confirm whether URLs are indexed or discovered by Google. However, since most sites apply the same robots.txt rules to Moz's bots as they do to Google's, this approach generally works well as a proxy for Googlebot's discoverability.
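That proxy argument rests on robots.txt treating both crawlers alike, and you can spot-check it with Python's standard library. The robots.txt below is hypothetical, written to show a case where the assumption breaks (rogerbot is Moz's real crawler name; the paths are invented):

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that treats the two crawlers differently
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /private/

User-agent: rogerbot
Disallow: /private/
Disallow: /drafts/
"""

def allowed(user_agent, url):
    """Check whether a user agent may fetch a URL under the rules above."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)

# /drafts/ is blocked for rogerbot but open to Googlebot, so Moz's
# index would miss URLs that Google can still crawl.
print(allowed("Googlebot", "https://example.com/drafts/post"))  # True
print(allowed("rogerbot", "https://example.com/drafts/post"))   # False
```

In practice you'd fetch the live robots.txt instead of a string; if the two agents get different answers for the sections you care about, treat the Moz export as incomplete.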

Google Search Console
Google Search Console offers several valuable sources for building your list of URLs.

Links reports:


Similar to Moz Pro, the Links section provides exportable lists of target URLs. Unfortunately, these exports are capped at 1,000 URLs each. You can apply filters for specific pages, but since filters don't apply to the export, you might need to rely on browser scraping tools, which are limited to 500 filtered URLs at a time. Not ideal.

Performance → Search Results:


This export gives you a list of pages receiving search impressions. While the export is limited, you can use the Google Search Console API for larger datasets. There are also free Google Sheets plugins that simplify pulling more extensive data.
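The Search Analytics part of that API pages through results with a rowLimit/startRow pair, so the core of any larger pull is a paging loop like the one below. `fetch_page` here is a stand-in for your actual API call (with google-api-python-client, that would be a `searchanalytics().query()` request with `dimensions: ["page"]`); the function names are mine, and credential setup is omitted.

```python
def collect_pages(fetch_page, page_size=25000):
    """Page through a rowLimit/startRow style API.

    fetch_page(start_row, page_size) must return a list of row dicts
    shaped like the Search Analytics response, i.e. {"keys": [page_url]}.
    A short page signals the end of the data.
    """
    urls, start_row = [], 0
    while True:
        rows = fetch_page(start_row, page_size)
        urls.extend(row["keys"][0] for row in rows)  # 'page' dimension -> keys[0]
        if len(rows) < page_size:
            break
        start_row += page_size
    return urls
```

Because only pages with impressions appear in this report, treat the result as a lower bound on your URL inventory, not the full list.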

Indexing → Pages report:


This section provides exports filtered by issue type, though these are also limited in scope.

Google Analytics
The Engagement → Pages and Screens default report in GA4 is an excellent source for collecting URLs, with a generous limit of 100,000 URLs.


Better yet, you can apply filters to create different URL lists, effectively working around the 100k limit. For example, if you want to export only blog URLs, follow these steps:

Step 1: Add a segment to the report

Step 2: Click “Create a new segment.”


Step 3: Define the segment with a narrower URL pattern, such as URLs containing /blog/


Note: URLs found in Google Analytics might not be discoverable by Googlebot or indexed by Google, but they provide valuable insights.

Server log files
Server or CDN log files are perhaps the ultimate tool at your disposal. These logs capture an exhaustive list of every URL path requested by users, Googlebot, or other bots during the recorded period.

Challenges:

Data size: Log files can be enormous, so many sites only retain the last two weeks of data.
Complexity: Analyzing log files can be challenging, but many tools are available to simplify the process.
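If your logs use the common or combined format, even a few lines of Python can pull out the requested paths. The regex below assumes that format, and the Googlebot check is a simple user-agent substring match; adapt both to what your server actually writes.

```python
import re

# Matches the request line of common/combined log format entries, e.g.:
# 66.249.66.1 - - [10/Oct/2024:13:55:36 +0000] "GET /blog/post-1 HTTP/1.1" 200 ...
LOG_PATTERN = re.compile(r'"(?:GET|HEAD) (\S+) HTTP/[\d.]+"')

def paths_from_log(lines, require_googlebot=False):
    """Collect the unique URL paths requested in a server log."""
    paths = set()
    for line in lines:
        if require_googlebot and "Googlebot" not in line:
            continue
        match = LOG_PATTERN.search(line)
        if match:
            paths.add(match.group(1))
    return sorted(paths)
```

Running this once over the whole log and once with `require_googlebot=True` gives you both an "everything requested" list and a "what Googlebot actually fetched" list, which is exactly the split you want for migration forensics.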
Combine, and good luck
Once you've gathered URLs from all these sources, it's time to combine them. If your site is small enough, use Excel; for larger datasets, use tools like Google Sheets or a Jupyter Notebook. Ensure all URLs are consistently formatted, then deduplicate the list.
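The "consistently formatted" step is where most duplicates hide. Here's a minimal sketch of one way to normalize before deduplicating; whether you strip trailing slashes or drop fragments is a judgment call that should match how your site actually canonicalizes, so treat these rules as assumptions, not a standard.

```python
from urllib.parse import urlsplit, urlunsplit

def normalize(url):
    """Lowercase the scheme and host, drop the fragment, and strip a
    trailing slash so cosmetic variants collapse to one entry.
    Path case is preserved because /Blog and /blog can be distinct URLs."""
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       path, parts.query, ""))

def dedupe(urls):
    """Return a sorted list of unique, normalized URLs."""
    return sorted({normalize(u) for u in urls})
```

In a Jupyter Notebook you'd run `dedupe` over the concatenated lists from every source above; in a spreadsheet, the equivalent is a lowercase/trim pass followed by a remove-duplicates step.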

And voilà: you now have a comprehensive list of current, old, and archived URLs. Good luck!
