My Search Engine Algorithm Testing Process

My approach to directly testing Google's algorithm, built on Kyle Roof's testing methodology. If you don't test what you're doing, are you really doing SEO? Or just guessing?

Disclaimer: This is a frequently evolving process, with plenty of ideas and processes that still need to be fully fleshed out.

There's plenty of information on the internet about "how to do SEO": what things are ranking factors, how you should link your content, what Google is looking for, etc. A lot of that information is even good. But how many people who do SEO in some capacity actually know? How many have actually tested it?

Without actually running a proper test, you could argue it's all technically guesswork.

This is why, when I came across Kyle Roof's ideas on testing, I almost immediately got a test site set up and started running tests. The frequency of the tests quickly died off, however. I'd built a process that was ultimately too cumbersome for easily setting up and running a test, and between work and other things I rarely tested things the way I wanted.

This V2 iteration is an ongoing effort to systematize testing as much as I can in order to keep the testing (and learning) flowing as smoothly as possible.

Setting Up Your Testing Environment

First and foremost, I've got to reiterate that I came across this idea via Kyle Roof. I didn't make up the testing process on my own.

How To Run Clean SEO Tests - Kyle Roof

The process, simplified, is as follows: get yourself a nonsense domain and a nonsense keyword with zero search results.

This gives you a clean environment to run a test. From here you can publish pages on that website, using lorem ipsum text and inserting your gibberish keyword, to get those pages indexed. This lets you make changes and compare different variables in about as clean an environment as is possible against Google's algorithm.
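
Since the whole point of this iteration is making setup less cumbersome, here's a minimal sketch of how a batch of identical test pages might be generated; the keyword, word counts, and file layout are all placeholder assumptions rather than anything from Kyle Roof's process.

```python
import random
from pathlib import Path

# Placeholder gibberish keyword and lorem ipsum source; swap in your own.
KEYWORD = "flooglebartz quinzelplop"
LOREM_WORDS = ("lorem ipsum dolor sit amet consectetur adipiscing elit "
               "sed do eiusmod tempor incididunt ut labore").split()

def lorem(word_count: int) -> str:
    """Generate word_count words of pseudo lorem ipsum."""
    return " ".join(random.choice(LOREM_WORDS) for _ in range(word_count))

def build_page(page_id: str, words: int = 800) -> str:
    """One test page: the keyword appears exactly once, in the first paragraph."""
    first_paragraph = f"{KEYWORD} {lorem(100)}"
    rest = lorem(words - 100)
    return (f"<html><head><title>Test page {page_id}</title></head>"
            f"<body><p>{first_paragraph}</p><p>{rest}</p></body></html>")

# Write five identical pages, ready to publish on the test domain.
out = Path("test-pages")
out.mkdir(exist_ok=True)
for page_id in "abcde":
    (out / f"{page_id}.html").write_text(build_page(page_id))
```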

Testing Methods

Right now I have three primary test methods.

  1. Just like Kyle Roof's LinkedIn post above, rank 5 pages and change something on the 3rd.
  2. Publish pages with different variables changed to see their ranking order, and then try to reverse or change their order.
  3. Some variation of this on a non-dedicated-test site to test other factors (I'm still sorting this one out...)

Rank 5 Pages, Change the 3rd.

The idea here is to get 5 identical pages ranking on your target keyword. This sets up the environment for you to make a change and see how it affects the rankings.

By making changes to the 3rd, such as adding words or extra keyword insertions, you can see how it changes the middle result. Then you can revert the changes to see how the algorithm reacts.

For example, below is a test on word count. In the below screenshot all of the pages are identical, each with 800 words of lorem ipsum text, and the keyword placed once in the first paragraph.

From here, I went to the middle option and added 200 words of lorem ipsum to see how it reacted.

From here I can continue experimenting. Does a keyword in the H1 outrank extra words? Do more words always help, or is there a band, outside the current pages' word counts, in which a page will rank worse?
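
The observation side is just bookkeeping. As a rough illustration (my own sketch, not part of Kyle Roof's method), here's how you might diff ranking snapshots taken before and after the change; how you collect the positions, manually or via a rank tracker, is up to you.

```python
# Hypothetical rank snapshots for the five test pages (a-e), recorded
# manually or via whatever rank tracker you happen to use.
before = ["a", "b", "c", "d", "e"]  # c sits in the middle
after = ["a", "c", "b", "d", "e"]   # order after adding 200 words to c

def rank_changes(before: list[str], after: list[str]) -> dict[str, int]:
    """Positive values mean the page moved up after the change."""
    return {page: before.index(page) - after.index(page) for page in before}

print(rank_changes(before, after))  # {'a': 0, 'b': -1, 'c': 1, 'd': 0, 'e': 0}
```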

Publishing Multiple Variables

The other idea is to take several pages, each with a different variable, and publish them all to see the order in which they rank.

Once the pages have ranked you have part of the answer, though their initial order doesn't always tell the full story. From there, you try to change the results by switching variables.

Below is a super foundational test (also via Kyle Roof, credit where credit is due) looking at various ranking factors. 11 pages were published, each placing the keyword in a different location on the page, and only in that location (a sketch of generating such pages follows the list below).

Each letter below corresponds to the letter in that page's meta title.

Ranking Factors & Their Pages

  • a. Meta title
  • b. Meta description
  • c. H1
  • d. H2
  • e. H3
  • f. H4
  • g. First paragraph
  • h. Last paragraph
  • i. Image alt attribute
  • j. Bolded text
  • k. URL
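
To give a concrete sense of how these could be built, here's a minimal sketch along the lines of the page generator above; the HTML templates and the handful of variants shown are illustrative assumptions, not the actual test pages.

```python
# Hypothetical templates: each places the keyword in exactly one location.
KEYWORD = "flooglebartz quinzelplop"  # placeholder gibberish keyword
FILLER = "lorem ipsum dolor sit amet " * 160  # roughly 800 words of filler

TEMPLATES = {
    "a": "<title>{kw} page a</title><h1>Page a</h1><p>{filler}</p>",
    "c": "<title>Page c</title><h1>{kw}</h1><p>{filler}</p>",
    "g": "<title>Page g</title><h1>Page g</h1><p>{kw} {filler}</p>",
    "i": '<title>Page i</title><h1>Page i</h1><img src="x.png" alt="{kw}"><p>{filler}</p>',
    "j": "<title>Page j</title><h1>Page j</h1><p><b>{kw}</b> {filler}</p>",
    # ...and so on for b, d, e, f, h, and k.
}

for letter, template in TEMPLATES.items():
    with open(f"variant-{letter}.html", "w") as f:
        f.write(template.format(kw=KEYWORD, filler=FILLER))
```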

And here's how they ranked on first index. Note: the meta description page (b) is missing entirely, ultimately confirming the meta description itself is not a ranking factor.

Ranking order on initial index of "standard ranking factors" test.

However, this doesn't mean it's the exact hierarchy Google looks for. From here you'd want to change and swap various elements, see how items move around, and experiment with it.

Live Site Testing

This is more of a placeholder than anything. I'd like to create a similar testing environment that runs on a live site with other ranking content. It's a less controlled environment, but it allows for testing other things, like theories related to "link juice," for example...

Outside of the personal reminder to actually get this put together, I don't have much more to put here. I guess come back later if you're reading this?

Managing Tests and Hypotheses

I've built out an Airtable workspace to manage tests, data, and hypotheses. While it's not shareable at the moment because it's still relatively work-in-progress, I eventually plan to create a template to share here. Get in touch if you're reading this and it's something you'd actively be interested in.

Its primary goal is to help me track hypotheses that have been tested and archived, are actively being tested, or are ideas to test in the future. It aims to answer questions such as:

  • Is word count a ranking factor?
  • Does direct traffic vs organic traffic to a page make any difference or impact on rankings?
  • Does exact match vs. keyword variations impact rankings (think healthcare vs. health care)?

When aiming to answer these questions I form a hypothesis to test. Example: "Word count is a positive ranking factor: word count has an upper bound, some band beyond the average of the top 5, past which it becomes a negative factor. Otherwise, a higher word count will rank higher."

Through a series of tests I aim to evaluate that hypothesis and use a formula to calculate a p-value for it.

Each test is outlined as a child of the hypothesis and scored on a 0-1 scale: 0 supports H0 (the null hypothesis), 1 is in full support of H1 (the alternative hypothesis), and 0.5 means no effect.

Hypothesis Significance Formula

At the hypothesis level I have a formula that rolls the results up across multiple tests into a "confidence score." The formula below gives me a score for how confident I am in the data I've seen. The more tests I run, the more I trust this score.

This formula is in Airtable's formula format but should be relatively easily readable.

1/(1+(POWER(2.71828,(-({Average Result Value}-(1/5))/SQRT((0.95*(1-0.95))/{Tests Count})))))
  • 1/(1+(...)) is the logistic function. It transforms any real number into a value between 0 and 1, giving the final probability-like score.
  • e (2.71828) is Euler's number, the base of natural logarithms. POWER(2.71828, x) is equivalent to e^x, the exponential function.
  • ({Average Result Value}-(1/5)) calculates how much the observed results deviate from what you'd expect by random chance (1/5 for a test with 5 search results).
  • SQRT((0.95*(1-0.95))/{Tests Count}) calculates the standard error. It's based on my target probability of success (0.95) and the sample size ({Tests Count}).
  • The entire expression inside POWER(2.71828, ...) is essentially a z-score: it measures how many standard errors the result sits from the mean.
  • The negative sign before this z-score ensures that positive results (better than random) yield a final score above 0.5, and negative results yield a score below 0.5.
  • The 0.95 in the formula represents my hypothesis confidence, i.e., how often I want to see an effect in ideal conditions. It's not the same as the final confidence score.
  • As {Tests Count} increases, the standard error decreases, making the formula more sensitive to small differences from random chance.
  • The formula effectively combines concepts from hypothesis testing (z-scores, standard error) with logistic regression (the logistic function) to produce a final score.

All of this aims to give you a confidence score between 0 and 1 (a Python version of the formula follows the list below).

  • Scores close to 1 mean the hypothesis is highly likely to be true.
  • Scores close to 0.5 mean the changes have no or near random effect.
  • Scores close to 0 mean the inverse or opposite of the hypothesis is highly likely to be true.
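
For anyone who'd rather read it outside Airtable's syntax, here's the same calculation as a short Python sketch; the function name and the example numbers are mine, not pulled from the actual workspace.

```python
import math

def confidence_score(avg_result_value: float, tests_count: int) -> float:
    """Mirror of the Airtable rollup: logistic confidence across tests."""
    chance_rate = 1 / 5        # expected score under pure chance (5 results)
    target_probability = 0.95  # how often I want to see an effect in ideal conditions
    # Standard error shrinks as more tests are run
    standard_error = math.sqrt(target_probability * (1 - target_probability) / tests_count)
    # z-score: how far the average result sits from chance, in standard errors
    z = (avg_result_value - chance_rate) / standard_error
    # Logistic function squashes the z-score into a 0-1 confidence score
    return 1 / (1 + math.exp(-z))

# Made-up example: 3 tests averaging 0.25 yields roughly 0.6
print(round(confidence_score(0.25, 3), 2))
```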

In the instance of the word count example given above, I've currently got a confidence score of 0.59 across 3 tests, meaning word count in and of itself does not appear to positively or negatively affect page ranking on its own (although I plan to test this more).

How I'm Doing It: My Workflow and Tech Stack

I wanted to keep this relatively straightforward but still have some sort of graphical interface to use. Through some research I landed on Ghost, hosted on a PikaPods pod.

PikaPods is great and cost effective (it's also how I'm hosting this website). I then went about stripping down a Ghost theme to have a barebones, easy-to-manage environment for running these tests.

From here I'm manually keeping track of data and results in a document. I'm working on a proper framework and workflow for tracking new and old data, but at this point it's relatively unrefined and just a markdown document...
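
Until that framework exists, even appending structured rows to a file keeps results queryable later. Here's a minimal sketch of what that could look like; the fields are my own guess at what's worth capturing.

```python
import csv
from datetime import date

FIELDS = ["date", "hypothesis", "test", "result_score", "notes"]

def log_result(path: str, hypothesis: str, test: str,
               result_score: float, notes: str = "") -> None:
    """Append one test result (the 0-1 score described above) to a CSV log."""
    with open(path, "a", newline="") as f:
        writer = csv.writer(f)
        if f.tell() == 0:  # brand-new file: write the header row first
            writer.writerow(FIELDS)
        writer.writerow([date.today().isoformat(), hypothesis, test,
                         result_score, notes])

log_result("results.csv", "Word count is a positive ranking factor",
           "Add 200 words to page c", 0.5, "No movement after reindex")
```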

List of Tools And Resources