1. Understanding Search Engines: Crawling, Indexing and Ranking Explained
- piperjacobsen
- Mar 20, 2024
- 9 min read
Updated: Mar 25, 2024
I began to touch on how search engines rank pages for keywords in the first chapter of my "What is SEO" beginner's guide. However, in order to even show up as a suggested page when users search for something, Google or any other search engine must first be able to access and log your website or web page.
Use my quick link menu, or keep scrolling to start...
How do search engines find web pages?
All search engines execute the following functions:
Crawling: They use so-called 'spider' bots that scour the internet to read and understand any content they come across
Indexing: Once content is found, they store and organise it (or index it) in a large database
Ranking: Once there's a record of a page, a search engine can rank that indexed page on how relevant it is to a particular user search. These results are then ordered in the search engine results pages (SERPs).
Step 1: What is Crawling?
Crawling is the process that search engines use to find new pages and content. Search engines use crawlers - sometimes dubbed spiders or bots, like Google's very own Googlebot - to discover this information. The formats that crawlers can find include PDFs, web pages, videos, Microsoft files like PowerPoint and even CSVs. (The full list of indexable file formats is given by Google in its documentation.)
Crucially, spider bots can only discover new content through links on content they have already indexed. So in the case of Googlebot, it will fetch previously indexed content, search through it for links, and follow this chain of links to find new pieces of content it hasn't indexed before.
Once it has found fresh content, it adds that content to its central database. This process is called indexing.
Step 2: Indexing: What is a search index?
When search engines find new content through links, they store this new information in a search index: a large database of all the content they have discovered thus far.
How much content do search indexes hold?
As you can imagine, the number of documents held in Google's search index is behemoth. In fact, a recent cross-examination of Google's VP of Search in the USA v. Google antitrust trial revealed an estimated 400 billion documents stored in its database.
Step 3: Ranking: Search Engine Results
Once Google or any other search engine has finished indexing a page to its database, it can decide how worthy that page is of being recommended for the keywords a user types into the search bar. Google uses around 200 ranking factors to determine this, including but not limited to some of the most important:
High-quality Content
Number & Quality Backlinks
Core Web Vitals (load speed)
Expertise and Authority of your brand/site (Domain ranking)
Information architecture of the site
Keyword Optimisation
Security of the site & other technical considerations
Schema Markup
If you want to pore over the full list, Backlinko has a wonderful article with extensive detail on each.
When someone performs a keyword search, a search engine will run through its index to find the most relevant information it has already stored.
It presents this in a search engine results page, ordered by which pages it thinks will best satisfy the user's need. Generally, the closer a page is to the top of the search results page, the more useful Google thinks it will be for your initial search.
How can I check if Google can crawl my site?
You can check if a page is likely to be crawled and indexed in several ways:
Method 1: Using a Chrome extension which checks your robots.txt file
Method 2: Using a site search operator to see the number of listed results
Method 3: Using the URL inspection tool in GSC
Method 1: Using a Chrome extension
To check if a web page is indexable, you can use a Chrome extension like Robots Exclusion Checker.
Simply load the page you want to check, click the extension, and you should see a pop-up like the one below.
[Image: Robots Exclusion Checker pop-up showing the robots.txt, meta robots and X-Robots-Tag statuses]
If the robots.txt status shows 'Allowed', your site is not blocking search engines from crawling the page. An 'Allowed' status for the meta robots and X-Robots-Tag checks means the page code is set up to tell Google "please index and store my page in your database".
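For reference, an indexable page will either have no robots meta tag at all (indexing is the default behaviour) or carry one that explicitly allows it, like this:
Example: <meta name="robots" content="index, follow">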
There are plenty of good SEO extensions out there which can give you an indexation status plus much more information about your site. Let's use the same example with another favourite of mine: the Detailed SEO extension.
[Image: Detailed SEO extension pop-up showing the page's indexability status]
Here you can see it tells you whether a URL is indexable, and thus storable in Google's database.
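Method 2: Using a site search operator
A quick, rough check is to type the site: operator followed by your domain into Google. This returns the pages Google has listed from that domain, along with an approximate count of results (the count is only an estimate, and yourdomain.com below is a placeholder):
site:yourdomain.com
If a specific page doesn't show up when you narrow the search (e.g. site:yourdomain.com/new-page), Google most likely hasn't indexed it yet.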
Method 3: Using the URL Inspection tool in Google Search Console
A quicker way to see large-scale indexing and ranking data for your site is to connect your domain to Google Search Console. GSC offers a wealth of data on your site's impressions and clicks, and crucially in this example, gives you an indexing report on the number of pages indexed.
You can use the URL Inspection tool to test whether a URL might be indexable.
[Image: URL Inspection tool result in Google Search Console]
As a note, make sure the page is part of the domain you have registered with Google, but this should give you a quick answer on whether Google has found your content yet, plus any troubleshooting to help it find you.
Check out Google's full documentation on the URL Inspection tool for more info.
Why am I not showing up in search?
Don't panic: there can be several causes of your page or site not being crawled, including it simply being very new and Google not having had time to crawl it yet.
Other causes include:
Cause 1: Your robots.txt file blocks search engines from crawling your site
Cause 2: Your internal linking makes it hard for crawlers to find pages in your site
Cause 3: Your site doesn't have any links from external sources
Cause 4: Google is penalising your site for violating its terms of service. (Google's Spam Policies)
Cause 5: Your site is new and Google needs some more time to find it
Cause 1: Robots.txt
Your site's robots.txt file tells search engine crawlers which URLs they can and cannot access on your site, through a set of rules tied to URL path structure. If you've just read that sentence and you're thinking... "English please?!", don't worry. At this stage all you need to worry about is that you have a robots.txt file and that it allows crawlers like Googlebot.
Robots.txt files are usually located at the root of your site, i.e. 'domain.com/robots.txt'
A robots.txt file will look something like this. (I've taken this example from Clinique by typing in "clinique.co.uk/robots.txt".)
You should check that your site allows crawling under this line: "User-agent: *"
[Image: Clinique's robots.txt file]
This line indicates to all crawlers: you can crawl any pages on the site I mark under 'Allow', and you must not crawl any paths I mark under 'Disallow'. I'll go into robots.txt in more detail at some later point.
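For illustration, a minimal robots.txt sketch might look like this (the paths here are hypothetical examples, not Clinique's actual rules):
User-agent: *
Allow: /
Disallow: /checkout/
Disallow: /admin/
Sitemap: https://www.example.com/sitemap.xml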
Cause 2: Bad Internal Linking
Crawlers find new pages on your site through links, so if you have a new page that's not linked to from anywhere on the site (sometimes called an orphan page), it will be impossible for search engines to find and then index this content. To help with this, make sure you always link to new pages from other relevant copy with relevant anchor text, and provide an updated sitemap to Google (a minimal sitemap sketch is below).
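As a sketch, a bare-bones sitemap.xml might look like this (example.com and its page URLs are hypothetical placeholders):
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.example.com/</loc>
    <lastmod>2024-03-20</lastmod>
  </url>
  <url>
    <loc>https://www.example.com/new-orphan-page</loc>
  </url>
</urlset>
Submitting a sitemap like this in Google Search Console gives crawlers a direct list of URLs to fetch, so even orphan pages can be discovered.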
Cause 3: Lacking External Links
If your site has no links anywhere outside of your own pages, it can be very hard to get your content recognised by Google. At the very least, you can promote a link to your new site through personal socials like Instagram or LinkedIn. This is a good low-cost option, but you can also put some investment into your digital PR outreach to push brand awareness.
Cause 4: Google penalties
Sites that violate Google's policies protecting users from spammy and unhelpful content can see their content removed from the index. Read up on Google's Spam Policies to learn more.
Cause 5: New site needs extra time
If your site is extremely new, Google can naturally struggle to find a link connecting to your site. Keep in mind it can take anywhere from several days to several weeks before Google begins to recognise your content.
What if I do not want Google to find my site?
There are some rare scenarios where you may not want pages on your site to be found. You may be building a new section of your site and not want users to find unfinished pages, or you may not want Google to waste time looking at some e-commerce filter pages if your site is already very big. In these cases, you use the same tools - robots.txt rules and noindex tags - to manage what you want Google to find and index.
What are tactics to help my site get crawled quicker and better?
Ensure text is hard-coded into HTML and not embedded within pictures or other non-text formats
Ensure your site navigation and general information on site follows a logical structure so crawlers can find pages easily
Keep an up-to-date sitemap
Reduce the number of 404/broken pages a crawler comes across by implementing appropriate 301 redirects (these redirect users to non-broken pages; see the sketch below)
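As a sketch, assuming an Apache server with mod_alias enabled, a 301 redirect for a single retired URL might look like this in your .htaccess file (/old-page and /new-page are hypothetical paths):
# Permanently send visitors and crawlers from the broken URL to its replacement
Redirect 301 /old-page https://www.example.com/new-page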
After crawling comes Indexing
By now we know the steps for allowing search engines to crawl your site. But the next step is understanding how Google will store the content it finds. Just because Google has found your page through crawling doesn't mean it will always end up indexing it.
When Google finds a page, it renders it (a process of loading and showing it as it would be viewed by the user) and then logs said page.
It may be that you want Google to access some pages at one point in time, or to access other links through the site, and then later down the line you want to stop Google from storing the same page. (Maybe the content is no longer relevant, maybe it's a specific URL filter, who knows.) Either way, there are several ways you can indicate to Google: 'don't store my page in your database!'
Directives to tell search engines how to index your pages
There are two main tags you can add to page code to indicate to Google or any other search engine that you do not want a page indexed and stored, and ultimately that you don't think it should be recommended to users.
These are robots meta tags and X-Robots-Tags.
Robots Meta Tags
This is placed in the <head> of a page's HTML and gives instructions specific to different search engine crawlers.
Types of Robots Meta Directives:
noindex (telling crawlers you don't want your page indexed)
Example: <meta name="robots" content="noindex">
nofollow (telling crawlers that links on a page shouldn't be followed)
Example: <meta name="robots" content="nofollow">
These two are often seen combined, telling search engine crawlers that you don't want a page indexed AND you want to stop the crawler from following the links it will find on the page, as shown below.
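A combined tag looks like this:
Example: <meta name="robots" content="noindex, nofollow">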
X Robots Tags
X-Robots-Tags are added to the HTTP response header of a URL and tell a search engine crawler whether to index your page. It might look something like this:
HTTP/1.1 200 OK
Date: Tue, 25 May 2010 21:42:43 GMT
(…)
X-Robots-Tag: noindex
(…)
What is the difference between a robots meta tag and an X-Robots-Tag?
Because an X-Robots-Tag sits in the HTTP response header rather than in the page's HTML, it can be applied to non-HTML files (like PDFs) that can't carry a meta tag, and can be set at server level across many URLs at once. For standard web pages, however, you can use either method to stop a page being indexed.
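As a sketch, assuming an Apache server with mod_headers enabled, you could apply an X-Robots-Tag to every PDF on a site like this:
# Tell crawlers not to index or follow links in any PDF file, since PDFs can't carry a meta tag
<Files ~ "\.pdf$">
  Header set X-Robots-Tag "noindex, nofollow"
</Files>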
Now your site should not only be crawled right, but indexed and stored properly in a search engine's database. 'Properly' here meaning just the way you desire the site to be logged.
Ranking: How do search engines rank pages?
Ranking is all about search engines deciding which page is the most relevant to suggest when a user types a search into the search bar. To do this, search engines run a complex algorithm in the background which considers a number of factors, including but not limited to the quality of the content, how many backlinks it has, the relevancy of the content to the search, and the site's authority. The machine learning part of Google's algorithm - dubbed RankBrain - is constantly evolving how it ranks pages and the metrics it does so by. Numerous updates over the years have caused fluctuations in SERPs. I cover these exact metrics across the rest of the guides.
Next Steps: Understanding Keywords & Keyword Research
So now we understand how search engines find, store and decide to suggest the content of your website. Now would be a good time to learn more about the keywords you would want your site to rank for: how to find them, how to figure out business priorities from them, and what kind of content you need to rank for them.
Next Article is coming soon...