I’m going to get technical with you all because once I see an SEO issue with more than one client, I feel compelled to put together a sort of PSA. This is one of those “don’t do this at home” types of posts. The SEO issue I’m going to outline probably only happens on large, enterprise-type sites where developers are trying to save time building out sites or reuse similar content across multiple sites in the same family. It all comes down to understanding relative vs absolute links AND making sure that your site returns a 404 status for URLs that are not “real” on your site.
Let’s start out with a few definitions.
What is an absolute link?
An absolute link is a hyperlink that contains a full URL, so for example:
<a href="https://www.domain.com/folder/folder/page.html">Page</a>
What is a relative link?
A relative link is a link that leaves out the protocol (“http://” or “https://”) and/or the domain (www.example.com), and sometimes the subdomain or folder as well. A relative link looks like this:
<a href="/page.html">Page</a>
When search bots see these links, they are forced to assume that the link points to the same site as the page on which the relative link appears (and, depending on the type of relative link, the same folder level). Developers tend to use relative links because they are shorter, seem more convenient, and are easier to port from development to live servers or to reuse across multiple domains on the same server. However, you can only use them to reference resources on the same server as the page that hosts them.
Relative links come in different flavors.
Path relative link
These links look like this:
<a href="page.html">Page</a>
Because these links don’t reference a domain or start with a forward slash (/), bots and browsers will resolve them against the folder of the page on which the link appears.
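To make the resolution rule concrete, here is a small sketch using Python’s standard urllib.parse.urljoin, which follows the same general rules browsers use for resolving relative references. The URLs and the file name other.html are made up for illustration.

```python
from urllib.parse import urljoin

# Hypothetical page that contains the links.
page = "https://www.example.com/folder/page.html"

# Path-relative: no domain, no leading slash. Resolved against the page's folder.
path_relative = urljoin(page, "other.html")

# Root-relative: leading slash. Resolved against the root of the domain.
root_relative = urljoin(page, "/other.html")

# A missing trailing slash changes what counts as "the folder".
with_slash = urljoin("https://www.example.com/folder/", "other.html")
without_slash = urljoin("https://www.example.com/folder", "other.html")

print(path_relative)   # https://www.example.com/folder/other.html
print(root_relative)   # https://www.example.com/other.html
print(with_slash)      # https://www.example.com/folder/other.html
print(without_slash)   # https://www.example.com/other.html
```

Note how the same href lands in a different place depending on the URL of the page it sits on, which is exactly why these links are fragile.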
Directory Traversal Relative Link
These links look like this:
<a href="../page.html">Page</a>
The dot-dot-slash (“../”) tells the browser or bot to resolve the link against the parent directory of the current page. Sometimes these links have typos in them, and if you remove a dot you then have dot-slash (“./”), which is instead a reference to “here” (the current directory). This can turn into a trap for search engine bots and a waste of crawl budget. Especially if you don’t have a 404 status being triggered, the bot will keep appending the link to the current page’s folder structure indefinitely.
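Here is a quick sketch of how much difference that one dot makes, again using Python’s urllib.parse.urljoin as a stand-in for how a browser or bot resolves the link. The page URL and file name are hypothetical.

```python
from urllib.parse import urljoin

# Hypothetical page two folders deep.
page = "https://www.example.com/folder/sub/page.html"

# "../" climbs one level up from the page's folder.
parent = urljoin(page, "../other.html")

# "./" (one dot dropped by a typo) stays in the page's own folder.
same_dir = urljoin(page, "./other.html")

print(parent)    # https://www.example.com/folder/other.html
print(same_dir)  # https://www.example.com/folder/sub/other.html
```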
Why are 404s important in this scenario?
404 is the HTTP status code that a server returns when the requested resource is not found. The browser usually displays it as a “page not found” error.
I’m going to walk you through how miscoded relative links and a lack of a 404 page that triggers a 404 status can create crawl waste as Googlebot tries to crawl these links.
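If you want to spot-check whether a site actually returns a 404 status for a URL that is not “real”, a rough sketch like the one below can help. It is only an illustration using Python’s standard library: the function names fetch_status and returns_real_404 and the random probe path are my own inventions, and a proper audit would use a crawler.

```python
import uuid
import urllib.error
import urllib.request
from urllib.parse import urljoin


def fetch_status(url, timeout=10.0):
    """Return the HTTP status code for a GET request to `url`.

    urlopen follows redirects, so a fake URL that redirects to the
    homepage will come back as 200 here.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # urllib raises HTTPError for 4xx/5xx responses; the code is what we want.
        return error.code


def returns_real_404(site_root, fetch=fetch_status):
    """Request a random path that should not exist and check for a 404 status.

    `fetch` can be swapped out (for example in tests) so no network is needed.
    """
    probe_url = urljoin(site_root, "/no-such-page-" + uuid.uuid4().hex + "/")
    return fetch(probe_url) == 404


# Example (makes a real request, so it is left commented out):
# print(returns_real_404("https://www.example.com/"))
```

If this returns False for a site, fake URLs are being answered with something other than a 404 (often a 200 with a “not found” template), which is the setup that lets a crawl run away.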
Here’s why absolute URLs are better than relative links
1. An absolute URL makes it harder for scrapers to lift information from your website.
If you have all of your internal links as relative URLs, it would be very easy for a scraper to simply scrape your entire website and put it up on a new domain.
2. Relative URLs can create duplicate content
If your site has multiple versions of the root domain indexed in Google (www vs non-www) and you do not have a canonical tag pointing to the correct version of the site, or a working 404 page, things can go awry.
From Google’s perspective, Googlebot could potentially enter your site on any one of these versions. At this point, if all of the internal links are relative URLs, it can then crawl and index your entire site using whatever version it started from, creating a duplicate content issue. So say your “main” domain is www.domain.com but Googlebot starts the crawl with domain.com, and you have relative links like:
<a href="/page.html">Page</a>
Then Google will crawl and index all of those URLs as domain.com/page.html. Or maybe it will crawl and index both versions, diluting your inbound link equity and page authority.
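The www vs non-www scenario can be sketched in a few lines. This assumes both hostnames serve the same pages over https, and it uses urllib.parse.urljoin to stand in for the crawler’s link resolution.

```python
from urllib.parse import urljoin

# Two versions of the same site that a bot might enter on (assumed for illustration).
entry_points = ["https://www.domain.com/", "https://domain.com/"]

relative_href = "/page.html"
absolute_href = "https://www.domain.com/page.html"

# A relative link inherits whatever host the bot happened to start on.
relative_targets = {urljoin(entry, relative_href) for entry in entry_points}

# An absolute link always points to the one version you chose.
absolute_targets = {urljoin(entry, absolute_href) for entry in entry_points}

print(sorted(relative_targets))
# ['https://domain.com/page.html', 'https://www.domain.com/page.html']
print(sorted(absolute_targets))
# ['https://www.domain.com/page.html']
```

One page, two crawlable URLs with the relative link, and a single URL with the absolute one.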
3. Relative URLs can also create spider traps
This is because they associate the link with the current DIRECTORY, not the domain. Here’s an example:
A link to ./ means you’re linking to the current directory.
When the current URL ends in /contact/ this means that a link to ./example/ resolves to /contact/example/.
So clicking that link would take me to http://www.example.com/contact/example/.
If your CMS is set up to serve the same page for /contact/example/ as it serves for /contact/, which is often the case, you’ll now have a spider trap. That /contact/example/ page also links to ./example/, which now resolves to /contact/example/example/, and so on.
When you combine this with the lack of 404 status being generated for some of the fake pages, you have an epic waste of crawl budget.
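To see how quickly this gets out of hand, here is a small simulation of a bot following the same ./example/ link over and over, assuming every URL returns a 200 and serves the same page. The helper name simulate_trap is made up for this sketch.

```python
from urllib.parse import urljoin


def simulate_trap(start_url, relative_href, hops):
    """Follow the same relative link `hops` times and return every URL visited."""
    urls = [start_url]
    for _ in range(hops):
        # Each new page contains the same link, resolved against its own folder.
        urls.append(urljoin(urls[-1], relative_href))
    return urls


for url in simulate_trap("http://www.example.com/contact/", "./example/", 3):
    print(url)
# http://www.example.com/contact/
# http://www.example.com/contact/example/
# http://www.example.com/contact/example/example/
# http://www.example.com/contact/example/example/example/
```

Every hop produces a brand new URL, so without a 404 to stop it, the bot never runs out of “pages” to crawl.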
In the crawls I do for SEO audits, I have seen a single URL duplicated this way 129,876 times, and that is just in a sample crawl of 270K pages.
Here’s what happens when you fix this issue
Here’s the traffic impact of fixing this issue on an enterprise site – the crawl was infinite and no 404 page (or status) was triggered. We fixed the relative link issue and ensured a 404 page was generated:
Other issues with relative links
- Don’t use them for canonical tags. Always specify canonical URLs in absolute form, or don’t use the canonical tag.
- Don’t use them for images. You want to ensure that Google has the full domain as the owner of the image to reduce duplication from competing sites in image search.
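A simple way to audit canonical and image URLs is to check whether each value is a full absolute URL. The sketch below is one possible check using Python’s urllib.parse.urlparse. The function name is_absolute_url is hypothetical, and it deliberately treats protocol-relative values (starting with //) as not absolute.

```python
from urllib.parse import urlparse


def is_absolute_url(href):
    """Return True only if `href` has both an http(s) scheme and a host."""
    parts = urlparse(href)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


# Example values you might pull from canonical tags or image src attributes.
print(is_absolute_url("https://www.domain.com/page.html"))  # True
print(is_absolute_url("/page.html"))                        # False
print(is_absolute_url("//www.domain.com/image.png"))        # False
```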
Want to read more about absolute URLs? Check out this blog and Google guidance around their handling of relative URLs.