When we refer to crawl accessibility, we are referring to how search engines see your website. Whilst users see content dressed up in a beautiful, appealing design, search engines only see code.
When visiting a site for the first time, most search engines will work their way through the code, opening each page referenced in the content and storing the vital pieces of that information in their search index.
Search engines don’t have forever, so if your site is larger than normal there are ways to help them out: ensure they focus their time on the content that matters, and present your website’s pages consistently.
We implement best practice each time we develop and launch a new website for a client. Whilst some of the areas covered in this article are technical, it is worth noting that they can be implemented with basic coding knowledge.
The topics covered in this article are the robots.txt file, the XML sitemap, the canonical link, the .htaccess file and Google Webmaster Tools.
I want to start with the robots.txt file, as it is the first port of call for any search engine that visits your website. It tells search engines which parts of the site they may crawl, which parts are off limits, and where to find your sitemap.
As an example:

```
User-agent: *
Sitemap: https://divergencehosting.com.au/sitemap.xml
Disallow: /admin
Disallow: /*.pdf
Disallow: /*.doc
Allow: /*.xml$
```
This example allows access to all parts of the website: the search engine can access .xml files, but the /admin area is closed off. This is done for security reasons, as we don’t want Google indexing the admin area and making it more prone to attack by hackers. Files ending in .pdf or .doc are also not permitted to be registered by a search engine.
Search engines will register everything referenced in a website, including .doc, .xml and .pdf files. Once a document is registered in a search database it is hard to remove, so it is beneficial to prevent it from being directly referred to via search in the first place.
The XML sitemap provides all search engines with easy access to a list of every page on your website. Once you have built an initial website, it is important to register the sitemap with search engines to ensure that your website gets registered and begins to appear in their search results as quickly as possible. For Google this can be as easy as submitting a link via Webmaster Tools.
The sitemap.xml file is also extremely beneficial because it includes the last modified date of each page. This allows search engines to quickly identify the pages that have changed since their last visit.
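As a sketch, a minimal sitemap.xml with the last modified field looks like the following (the domain, page names and dates are purely illustrative):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://yourwebsite.com.au/</loc>
    <lastmod>2024-01-15</lastmod>
  </url>
  <url>
    <loc>https://yourwebsite.com.au/about</loc>
    <lastmod>2024-01-10</lastmod>
  </url>
</urlset>
```

Each `<url>` entry pairs the page address (`<loc>`) with its `<lastmod>` date, which is what lets a search engine skip pages that haven’t changed.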
The sitemap also provides search engines with a clear idea of how you want each web page address to appear in search. As an example, the 12 links below all point to the same page. Without a sitemap or any modification to the .htaccess file, search engines will register the page based on whichever address they find first and consider all the others duplicates.
```
http://yoursite.com.au
http://yoursite.com.au/
http://yoursite.com.au/index.html
http://www.yoursite.com.au
http://www.yoursite.com.au/
http://www.yoursite.com.au/index.html
https://yoursite.com.au
https://yoursite.com.au/
https://yoursite.com.au/index.html
https://www.yoursite.com.au
https://www.yoursite.com.au/
https://www.yoursite.com.au/index.html
```
It is best practice to ensure that the way you want the URL to appear in search is the way it is presented in the sitemap. Apply relevant changes in the .htaccess file to prevent duplications.
There are a large number of online tools that can generate an XML sitemap for you, as well as add-ons for content management systems and shopping carts that you can consider. Best practice is simply to ensure the file is in place, so develop it and update it regularly.
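If you would rather generate the file yourself, it takes very little code. The sketch below, in Python, is one hypothetical approach: the page list and dates are placeholders for whatever your CMS or filesystem would supply.

```python
from datetime import date
from xml.sax.saxutils import escape

# Hypothetical page list: (URL, last-modified date). In practice this
# would come from your CMS, database or filesystem.
pages = [
    ("https://yourwebsite.com.au/", date(2024, 1, 15)),
    ("https://yourwebsite.com.au/about", date(2024, 1, 10)),
]

def build_sitemap(pages):
    """Render a minimal sitemap.xml with <loc> and <lastmod> entries."""
    entries = "\n".join(
        "  <url>\n"
        f"    <loc>{escape(url)}</loc>\n"
        f"    <lastmod>{lastmod.isoformat()}</lastmod>\n"
        "  </url>"
        for url, lastmod in pages
    )
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
        f"{entries}\n"
        "</urlset>"
    )

print(build_sitemap(pages))
```

Running a script like this on a schedule keeps the sitemap current, which is the “update it regularly” part of the advice above.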
The canonical link reference can be added to each page you create on your website. It gives search engines, once again, a clear indication of the preferred address for the page. It is beneficial to align this link with the same presentation of the page in the sitemap.
It is best practice to include a canonical link reference like the example below in the header of each web page:

```
<link rel="canonical" href="https://yourwebsite.com.au/pagename">
```
The .htaccess file is the most technical part of crawl accessibility. I would strongly suggest that you engage the help of an experienced coder to perform these changes, as any mistake in this file will cause an outage of your website.
There are a lot of changes you can make in this file to improve the performance of your website. From a crawl accessibility perspective, I would focus on redirecting every variation of a page address (http vs https, www vs non-www, /index.html vs the bare path) to the single preferred URL. By doing this, any links to your website from a third party won’t cause a duplicate page issue with search engines, as you are automatically directing the search engine to the correct page address.
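As a sketch of what those redirects might look like, assuming an Apache server with mod_rewrite enabled and https + www as the preferred form (the domain is illustrative):

```
RewriteEngine On

# Send http and non-www requests to the single preferred address.
RewriteCond %{HTTPS} off [OR]
RewriteCond %{HTTP_HOST} !^www\. [NC]
RewriteRule ^(.*)$ https://www.yourwebsite.com.au/$1 [R=301,L]

# Drop index.html so the bare directory URL is the one that gets indexed.
RewriteRule ^(.*)index\.html$ /$1 [R=301,L]
```

The `R=301` flag issues a permanent redirect, which tells search engines to transfer any ranking signals to the preferred URL rather than splitting them across duplicates.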
Finally, we have Google Webmaster Tools, a fantastic way to monitor the crawl accessibility of your website. Google will flag any issues with your site, allowing you to quickly address them. As mentioned earlier, Webmaster Tools also has a feature for submitting a link to your website’s sitemap.
Improving the crawl accessibility of your website will not only benefit your appearance in search results, it will also ensure that your analytics don’t register multiple versions of the same page.
If you wish to learn more about improving the crawl accessibility of your website when dealing with Google, check out Google’s great resource, Consolidate duplicate URLs.