As we mentioned in our previous blog, search engines use bots to crawl the web, collect information about website pages, and index them. They then use algorithms to analyze these pages and determine how they rank on SERPs. To show up on SERPs, you must optimize your website and content so that they are visible to search engines.
What is Search Engine Crawling?
Crawling is the discovery process search engine web crawlers (also known as bots or spiders) use to find and visit new pages and to locate dead links. Pages that search engines already know about are also re-crawled to detect any content changes. Some of the major search engine crawlers include Googlebot, Bingbot and Baidu Spider.
Search engines use algorithms to determine which sites to crawl and how often. If you regularly make changes to your website, or get a large influx of visitors every day, your website may be crawled more frequently.
How do Site Crawlers Index Your Site?
The crawling process begins with a list of web addresses from past crawls and sitemaps provided by website owners. As the crawlers visit these websites, they detect and record any links they find and use these to discover new pages. They go from link to link and bring back data about these webpages to search engines.
Once pages are crawled, they are indexed in a database of billions of web pages. Their content is analyzed, organized and interpreted by the search engines' algorithms. When someone performs a search, search engines scour the pages in their index to show relevant content in the SERPs.
Robots.txt
You can’t directly control what pages search engine crawlers decide to crawl, but you can give them clues as to which pages they should ignore.
Search crawlers access your robots.txt file before they start a crawl. This file points to your sitemap and lists the URLs that should or should not be crawled. You can use the robots.txt Disallow directive to keep crawlers away from pages you don't want to appear in search results, such as thank you and confirmation pages.
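As a minimal sketch, a robots.txt file that blocks a couple of post-conversion pages and points to a sitemap could look like this (the paths and domain are placeholders, not recommendations for your site):

```
# Applies to all crawlers
User-agent: *

# Keep crawlers away from pages that shouldn't appear in search results
Disallow: /thank-you/
Disallow: /order-confirmation/

# Point crawlers to your sitemap
Sitemap: https://www.example.com/sitemap.xml
```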
If the crawlers can't find a robots.txt file for a site, they will proceed to crawl and index the site normally.
Crawling Non-Text Files
Search engines normally attempt to crawl and index every URL they encounter, including URLs for non-text files such as images and videos. However, since crawlers can't read the content of these files, they can only extract a limited amount of information such as filename and metadata.
Google, for example, can index the content of most types of pages and files, including PDF and XLS documents, multimedia files (such as mp4 and mpg), and graphic files (such as png and jpeg). Google publishes a full list of the file types it can index.
Utilizing Your Sitemap
A sitemap is a file that lists a website's URLs to give search engines a complete set of pages to crawl. One of the easiest ways to ensure Google is indexing all of your website pages is to submit your sitemap through Google Search Console. You can also use this tool to submit individual pages, which is typically done when new content is published or recent changes have been made.
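For illustration, a minimal XML sitemap following the sitemaps.org protocol might look like the sketch below; the URLs and dates are placeholders:

```
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.example.com/</loc>
    <lastmod>2024-01-15</lastmod>
  </url>
  <url>
    <loc>https://www.example.com/blog/sample-post/</loc>
    <lastmod>2024-02-01</lastmod>
  </url>
</urlset>
```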
Optimizing your Website for Search Crawlers
In order for your website to rank well in SERPs, it’s important to make sure search engines can crawl and index your site correctly. Here are some techniques and considerations to help crawlers find your important pages.
- It’s essential that your website has clear navigation. If you structure your website navigation in a way that is inaccessible to search engines, you will not get listed in search results. A few common mistakes include coding links only in JavaScript and using inconsistent navigation across different website pages (see the HTML sketch after this list).
- Pay close attention to on-site/internal links. The more integrated your internal linking structure, the easier it will be to crawl your site. If your pages are inaccessible due to broken links, crawlers will have a hard time discovering and indexing them.
- Don't put your website content only in non-text files such as images, video, animations, or other formats. This makes it hard for search engines to understand and crawl your content.
- Avoid duplicate content by using the rel="canonical" element (shown in the sketch below).
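To make the navigation and canonical points above concrete, the snippet below shows a standard crawlable link, a JavaScript-only link that crawlers may struggle to follow, and a canonical element in the page head; all URLs are hypothetical placeholders:

```
<!-- In the <head>: the canonical element tells search engines which URL is
     the preferred version when similar or duplicate pages exist -->
<link rel="canonical" href="https://www.example.com/blog/seo-basics/" />

<!-- Crawlable link: a standard anchor with an href crawlers can follow -->
<a href="/blog/seo-basics/">SEO basics</a>

<!-- Harder to crawl: the destination exists only in JavaScript -->
<span onclick="window.location='/blog/seo-basics/'">SEO basics</span>
```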