Solving the Sub-Domain Equation: Predicting Traffic and Value when Merging Sub-Domains

Posted by russvirante

To sub-domain or not to sub-domain, that is the question. Should you keep your content on separate sub-domains or the same domain? If I do merge my sub-domains, will I gain or lose traffic? How much?

Since my first days in SEO back in 2004, the sub-folder vs. sub-domain debate has echoed through nearly every site architecture discussion in which I have participated. It seems trivial in many respects that we would focus so intently on what essentially boils down to the ordering of words in a URL, especially given that www. itself is a sub-domain. However, for a long time, there has been good reason to consider the question very carefully. Today I am writing about the problem in general, and I propose a programmatic strategy for answering the sub-domain/sub-folder debate.

For the purposes of this article, let's assume there is a company named Example Business that sells baseball cards, baseball jerseys and baseball hats. They have two choices for setting up their site architecture.

They can use sub-domains...

Or, they can use directories...

Many of you have probably dealt with this exact question, and for some of you it has reared its head dozens if not hundreds of times. For those of you less familiar with the problem, let's run through a brief history of sub-domains, sub-folders, and their interaction with Google's algo so we can get a feel for the landscape.

Sub-domains and SEOs: A quick historical recap

First, really quickly, here is the breakdown of your average URL. We are most interested in comparing the sub-domain with the directory to determine which might be better for rankings.

parts of a url

This may date me a bit, either as a Noob or an Old-Timer depending on when you got in the game. I started directly after the Florida update in 2003. At that time, if I recall correctly, the sub-domain / sub-folder debate was not quite as pronounced. Most of the decisions we were making at the time regarding sub-domains had more to do with quick technical solutions (ie: putting one sub-domain on a different machine) than with explicit search optimization.

However, it seemed at that time our goal as SEOs was merely to find one more place to shove a keyword. Whether we used dashes (hell, I bought a double--dashed domain at one point) or sub-domains, Google's algos seemed to, at least temporarily, reward keyword-rich sub-domains. Domains were expensive, but sub-domains were free. Many SEOs, myself included, began rolling out sites with tons of unique, keyword-rich sub-domains.

Google wasn't blind to this manipulation, though, and beginning around 2004 it managed, with some degree of effectiveness, to kill off the apparent benefit of sub-domain spam. However, the tactic still seemed to persist to some degree in discussions from 2006, 2007, 2008, and 2009. For a while, sub-domains seemed to hold a slight edge specifically for SEO.

Fast forward a few years and Google introduced a new, wonderful feature called host crowding and indented results. Many of you likely remember this feature, but essentially, if you had two pages from the same host ranking in the top 10, the second would be pulled up directly under the other and given an indent for helpful organization. This was a huge blow to sub-domain strategies. Ranking in positions 1 and 10 on the same host was now essentially the same as owning the top two positions, while holding those positions on separate hosts brought no such benefit. In this case, it would make sense for "Example Business" to use sub-folders rather than sub-domains. If the content shared the same sub-domain, every time their website had 2 listings in the top 10 for a keyword, the second would be tucked up nicely under the first, effectively jumping multiple positions. If the listings were on separate sub-domains, they would not get this benefit.

Host Crowding Made Consolidating to a Single Domain Beneficial

Google was not done, however. They have since taken away our beautiful indented listings and deliberate host crowding and, at the same time, given us Panda. Initial takes on Panda indicated that segregating content by sub-domain, especially by topic, could bring positive results, as Panda was applied at the host-name level. Now it might make sense for "Example Business" to use sub-domains, especially if segmenting off low-quality user-generated content.

Given these changes, it is understandable why the sub-domain debate has raged on. While many have tried to discredit the debate altogether, there are legitimate, algorithmic reasons to choose a sub-domain or a sub-folder.

Solving the sub-domain equation

One of the beauties of contemporary SEO is having access to far better data than we've ever had. While I do lament the loss of keyword data in Google Analytics, far more data is available at our fingertips than ever before. We now have the ability to transform the intuition of smart SEOs into cold, hard math.

When Virante, the company of which I am CTO, was approached a few months ago by a large website to help answer this question, we jumped at the opportunity. I now had the capability of turning my assumptions and my confidences into variables and variances and building a better solution. The client had chosen to go with the sub-domain method for many years. They had heard of concepts like "Domain Authority" and wondered if their sub-domains spread themselves too thin. Should they merge their sub-domains together? All of them, or just a few?

Choosing a mathematical model for analysis

OK, now for the fun stuff. There are a lot of things that we as SEOs don't know, but have a pretty good idea about. We might call these assumptions, gut instincts, experience, or intuitions, but in math we can refer to them as variables. For each of these assumptions, we also have confidence levels. We might be very confident about one assumption of ours (like backlinks improve rankings) and less confident about another (longer content improves rankings). So, we have our variables and we have how confident we are about them. When we don't know the actual values of these variables (in science we would refer to them as independent variables), Monte Carlo simulations often prove to be one of the most effective mathematical models we can use.

Definition: Monte Carlo methods (or Monte Carlo experiments) are a broad class of computational algorithms that rely on repeated random sampling to obtain numerical results; i.e., by running simulations many times over in order to calculate those same probabilities heuristically just like actually playing and recording your results in a real casino situation: hence the name. - Wikipedia

With Monte Carlo simulations, we essentially brute force our way to an answer. We come up with all of the possibilities, drop them into a bag, and pick one from the bag over and over again until we have an average result. Or think about it this way. Let's say I handed you a bag with 10,000 marbles and asked you which color of marble in the bag is most common. You could pour them all out and try to count them, or you could shake the bag and then pick out 1 marble at a time. Eventually, you would have a good sample of the marbles and be able to estimate the answer without having to count them all.
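To make the marble analogy concrete, here is a tiny sketch. The bag's contents are invented, of course; the point is that a modest random sample recovers the right answer without a full count.

```python
import random
from collections import Counter

# A hypothetical bag of 10,000 marbles we would rather not count by hand.
bag = ["red"] * 5200 + ["blue"] * 3100 + ["green"] * 1700
random.shuffle(bag)

# Draw 500 marbles at random and tally what we see.
sample = Counter(random.choices(bag, k=500))
print(sample.most_common(1))  # almost always reports "red" without a full count
```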

We can do the same thing here. Instead of asking which color a marble is, we ask "If I merge one URL with another, what is the likelihood that it will receive more traffic from Google?". We then just have to load all of the variables that go into answering that question into our proverbial bag (a database) and randomly select over and over again to get an estimate.

So here are the details; hopefully you can follow along and do this yourself.

Step 1: Determine the keyword landscape

First, we need to know every possible keyword for which the client might rank, how much potential traffic is available for that keyword, and how valuable that keyword is in terms of CPC. The CPC value allows us to determine the true value of the traffic, not just the volume; we want to improve rankings for valuable keywords more than random ones. This client in particular is in a very competitive industry that relies on a huge number of mid/long-tail keywords. We built a list of over 46,000 keywords related to their industry using GrepWords (you could use SEMRush to do the same).
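To give a sense of what the keyword landscape looks like as data, here is a minimal sketch that loads a hypothetical CSV export into a lookup of volume and CPC per keyword. The file name and column names are made up; a real GrepWords or SEMRush export will look different.

```python
import csv

keywords = {}
with open("keyword_landscape.csv", newline="") as f:   # hypothetical export
    for row in csv.DictReader(f):                      # keyword,search_volume,cpc
        keywords[row["keyword"]] = {
            "volume": int(row["search_volume"]),        # monthly searches
            "cpc": float(row["cpc"]),                   # value of a click
        }

# The most valuable keywords by volume * CPC
top = sorted(keywords, key=lambda k: keywords[k]["volume"] * keywords[k]["cpc"],
             reverse=True)
print(top[:10])
```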

Step 2: Determine the search landscape

We now need to know where they actually rank for these keywords, and we need to know all the potential sub-domains we might need to test. We queued all 46K keywords with the AuthorityLabs API and within 24 hours we had the top 100 results in Google for each. We then parsed the data and extracted the position of every ranking sub-domain belonging to the site. We discovered around 25 sub-domains, but ultimately chose to analyze only the 9 that made up the majority of non-branded traffic.

Step 3: Determine the link overlap

Finally, we need to know about the links pointing to these sub-domains. If they all have links from the same sites, we might not get any benefit when we merge the sub-domains together. Using the Mozscape API's Link Metrics call, we pulled down the root linking domains for each sub-domain. When we do our Monte Carlo simulation, we can determine how their link profiles overlap and make decisions based on that impact.
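As a rough illustration, the link overlap between any two sub-domains can be expressed as a simple Jaccard ratio of their root linking domain sets. The domains below are invented.

```python
# Hypothetical root linking domains pulled for two sub-domains via Mozscape
links_jerseys = {"espn.com", "mlb.com", "fanblog.com", "reddit.com"}
links_hats = {"espn.com", "mlb.com", "capcollector.com"}

shared = links_jerseys & links_hats
overlap = len(shared) / len(links_jerseys | links_hats)
print(f"Link overlap: {overlap:.0%}")   # 2 shared out of 5 total = 40%
```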

Step 4: Create our assumptions

As we have mentioned, there are a lot of things we don't know, but we have a good idea about. Here we get to add in our assumptions as variables. You will see variables expressed as X and Y in these assumptions. This is where your expertise as an SEO comes into play.


Question 1: If two sub-domains rank for the same keyword in the top 10, what happens to the lower-ranked listing?
Assumption 1: X% of the time, the second ranking will be lost as Google values domain diversity.
Example: It turns out that http://baseball-jerseys.example.com and http://baseball-hats.example.com both rank in the top 10 for the keyword "Baseball Hats and Jerseys". We assume that 30% of the time, the lower of the two rankings will be lost because Google values domain diversity.

Question 2: If two sub-domains rank for the same keyword in the top 10, what happens to the top ranked subdomain?
Assumption 2: Depending on the X% of link overlap, there is a Y% chance of improving 1 position.
Example: It turns out that http://baseball-jerseys.example.com and http://baseball-hats.example.com both rank in the top 10 for the keyword "Baseball Hats and Jerseys". We assume that 70% of the time, based on X% of link overlap, the top ranking page will move up 1 position.

Question 3: If two sub-domains merge, what happens to all of the rankings of the top-ranked sub-domain, even when dual rankings are not present?
Assumption 3: Depending on X% of link overlap, there is a Y% chance of improving 1 position.
Example: On keywords where http://baseball-jerseys.example.com and http://baseball-hats.example.com don't have overlapping keyword rankings, we assume that 20% of the time, based on X% of link overlap, their rankings will improve 1 position.

These are just some of the questions you might want to include in your modeling method. There might be other factors you want to take into account, and you certainly can. The model can be quite flexible.

Step 5: Try not to set fire to the computer

So now that we have our variables, the idea is to pick the proverbial marble out of the bag. We will create a random scenario using our assumptions, sub-domains and keywords and determine what the result of that single random scenario is. We will then repeat this hundreds of thousands of times to get the average result for each sub-domain grouping.


We essentially need to do the following (a minimal code sketch follows this list)...

  1. Select a random set of sub-domains.
    For example, it might be sub-domains 1, 2 and 4. It could also be all of the sub-domains.
  2. Determine the link overlap between the sub-domains
  3. Loop through every keyword ranking for those sub-domains, which we determined when building the keyword and search landscapes back in Steps 1 and 2. Then, for each ranking...
    1. Randomly select our answer to #1 (ie: is this the 3 out of 10 times that we will lose rankings?)
    2. Randomly select our answer to #2 (ie: is this the 7 out of 10 times that we will increase rankings?)
    3. Randomly select our answer to #3 (ie: is this the 2 out of 10 times we will increase rankings?)
  4. Find out what our new traffic and search value will be.
    Once you apply those variables above, you can guess what the new ranking will be. Use the Search Volume, CPC, and estimated CTR by ranking to determine what the new traffic and traffic value will be.
  5. Add It Up
    Add up the estimated search volume and the estimated search value for each of the keywords.
  6. Store that result
  7. Repeat hundreds of thousands of times.
    In our case, we ended up repeating around 800,000 times to make sure we had a tight variance around the individual combinations.
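Here is a minimal, self-contained sketch of that loop. Everything in it (the keyword data, the CTR curve, the link profiles, and the way the assumption percentages are scaled by link overlap) is hypothetical and deliberately tiny; it shows the shape of the simulation, not our client's actual model.

```python
import random
from itertools import combinations

# --- All data below is invented for illustration; plug in your own landscape ---

# Rough CTR curve by ranking position (assumed, not measured)
CTR_BY_POSITION = {1: 0.31, 2: 0.14, 3: 0.10, 4: 0.07, 5: 0.06,
                   6: 0.04, 7: 0.03, 8: 0.03, 9: 0.02, 10: 0.02}

# keyword: (monthly volume, CPC, {sub-domain: current position})
KEYWORDS = {
    "baseball hats": (4400, 1.20, {"hats": 4, "jerseys": 9}),
    "baseball jerseys": (2900, 1.75, {"jerseys": 3}),
    "baseball cards": (8100, 0.90, {"cards": 6, "hats": 10}),
}

# Root linking domains per sub-domain (from Step 3)
LINKS = {
    "hats": {"espn.com", "blogspot.com", "mlb.com"},
    "jerseys": {"espn.com", "fanblog.com"},
    "cards": {"beckett.com", "mlb.com"},
}

# Assumptions 1-3 expressed as probabilities; tune these to your own confidence.
# The boost probabilities are scaled by (1 - overlap) below, which is one simple
# (assumed) way of making them "depend on" link overlap.
P_LOSE_LOWER = 0.30    # dual ranking: lower listing lost to domain diversity
P_BOOST_TOP = 0.70     # dual ranking: top listing gains one position
P_BOOST_OTHER = 0.20   # non-overlapping rankings gain one position

def link_overlap(subs):
    """Jaccard overlap of root linking domains across the merged sub-domains."""
    sets = [LINKS[s] for s in subs]
    return len(set.intersection(*sets)) / len(set.union(*sets))

def simulate_once(merged):
    """One random scenario for merging the given sub-domains."""
    overlap = link_overlap(merged)
    traffic = value = 0.0
    for volume, cpc, ranks in KEYWORDS.values():
        positions = sorted(p for s, p in ranks.items() if s in merged)
        if not positions:
            continue
        new_positions = []
        if len(positions) >= 2:                          # dual ranking in the top 10
            top, rest = positions[0], positions[1:]
            if random.random() < P_BOOST_TOP * (1 - overlap):      # assumption 2
                top = max(1, top - 1)
            new_positions.append(top)
            new_positions += [p for p in rest
                              if random.random() >= P_LOSE_LOWER]  # assumption 1
        else:
            p = positions[0]
            if random.random() < P_BOOST_OTHER * (1 - overlap):    # assumption 3
                p = max(1, p - 1)
            new_positions.append(p)
        kw_traffic = sum(volume * CTR_BY_POSITION.get(p, 0) for p in new_positions)
        traffic += kw_traffic
        value += kw_traffic * cpc
    return traffic, value

def simulate(merged, runs=10_000):
    """Average traffic and traffic value for one sub-domain combination."""
    totals = [simulate_once(merged) for _ in range(runs)]
    return (sum(t for t, _ in totals) / runs, sum(v for _, v in totals) / runs)

if __name__ == "__main__":
    subs = sorted(LINKS)
    for size in range(2, len(subs) + 1):
        for combo in combinations(subs, size):
            t, v = simulate(combo)
            print(f"{'+'.join(combo):22s} traffic={t:7.0f}  value=${v:8.0f}")
```

In the real analysis, the keyword table came from Steps 1 and 2, the link sets from Step 3, and each combination was run enough times to keep the variance tight.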

Step 6: Analyze the results

OK, so now you have 800,000 results; what do we do with them? The first thing we do is segment those results by their sub-domain combination. In this case, we had a little over 500 different sub-domain combinations. Second, we compute an average traffic and traffic value for each of those sub-domain combinations from those 800,000 results. We can then graph all those results to see which sub-domain combination had, on average, the highest predicted traffic and value.
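If the stored results live in something like a pandas DataFrame (the column names below are invented), the segmentation and averaging is a short groupby:

```python
import pandas as pd

# Hypothetical shape of the stored results: one row per simulation run
results = pd.DataFrame([
    {"combination": "jerseys+hats", "traffic": 10450, "value": 13900},
    {"combination": "jerseys+hats", "traffic": 10980, "value": 14400},
    {"combination": "jerseys+cards", "traffic": 9300, "value": 12100},
    {"combination": "jerseys+cards", "traffic": 9550, "value": 12600},
    # ... roughly 800,000 rows in the real run ...
])

summary = (results.groupby("combination")[["traffic", "value"]]
                  .agg(["mean", "std"])
                  .sort_values(("value", "mean"), ascending=False))
print(summary)
```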

To be honest, graphs are a terrible way of figuring out the answer, but they are the best tool we have to convey it in a blog post. You can see exactly why below. With over 500 different potential sub-domain combinations, it is difficult to visualize all of them at the same time. In the graph below, you see all of them, with each bar representing the average score for an individual sub-domain combination. For all subsequent graphs, I have taken a random sample of only 50 of the sub-domain combinations so it is easier to visualize.

Big graph

As mentioned previously, one of the things we try to predict is not just the volume of the traffic, but also the value of that traffic, by multiplying it by the CPC value of each keyword for which they rank. This is important if you care more about valuable commercial terms than just any keyword for which they might rank.

As the graph above shows, there were some sub-domain combinations that influenced traffic more than value, and vice-versa. With this simulation, we could find a sub-domain combination that maximized either the traffic or the value side of the equation. A company that makes money off of display advertising might prefer to look at traffic, while one that makes money off of selling goods would likely pay more attention to the traffic value number.

There were some neat trends that the Monte Carlo simulation revealed. Of the sub-domains tested, 3 in particular tended to have a negative rankings effect on nearly all of the combinations. Whenever one of these 3 appeared in a combination with good sub-domains, it slightly lowered the predicted traffic volume and traffic value. It turned out these 3 sub-domains had very few backlinks and only brand keyword rankings. Consequently, there was huge keyword overlap and almost no net link benefit when they were merged. We were easily able to exclude these from the sub-domain merger plan. We would never have guessed this, or seen this trend, without this kind of mathematical modeling.

Finally, we were able to look closely at sub-domain merger combinations that offered more search value and less search traffic, or vice-versa. Ultimately, though, 3 options vied for the top spot. They were statistically indistinguishable from one another in terms of potential traffic and traffic value. This meant the client wasn't tied to a single potential solution; they could weigh other factors, like the difficulty of merging some sub-domains and internal political concerns.

Modeling uncertainty

As SEOs, there is a ton we don't know. Over time, we build up a huge number of assumptions and, with those assumptions, levels of confidence in each. I am very confident that a 301 redirect will pass along rankings, but not 100%. I am pretty confident that keyword usage in the title improves rankings, but not 100% confident. The beauty of the Monte Carlo approach is that it allows us to graph our uncertainties.

The graphs you saw above were the averages (means) for each of the sub-domain combinations. There were actually hundreds of different outcomes generated for each of those sub-domain combinations. If we were to plot those different outcomes, they may look like what you see in the image directly above. If I had just made a gut decision and modeled what I thought, without giving a range, I would have come up with only a single data point. Instead, I estimated my uncertainties, turned them into a range of values, and allowed the math to tell me how those uncertainties would play out. We put what we don't know in the graph, not just what we do know. By graphing all of the possibilities, I can present a more accurate, albeit less specific, answer to my client. Perhaps a better way of putting it is this: when we just go with our gut, we are choosing 1 marble out of the bag and hoping it is the right one.

Takeaways

  1. If you are an agency or consultant, it is time to step up your game. Your gut instinct may be better than anyone else's, but there are better ways to use your knowledge to get at an answer than just thinking it through.

  2. Don't assume that anything in our industry is unknowable. The uncertainty that exists is largely because we, as an industry, have not yet chosen to adopt the tools that are plainly available to us in other sciences that can take into account those uncertainties. Stop looking confused and grab a scientist or statistician to bring on board.

  3. Whenever possible, look to data. As a small business owner or marketer, demand that your provider give you sound, verifiable reasons for making changes.

  4. When in doubt, repeat. Always be testing and always repeat your tests. Making confident, research-driven decisions will give you an advantage over your competition that they can't hope to undo.

Follow up

This is an exciting time for search marketers. Our industry is rapidly maturing in both its access to data and its usage of improved techniques. If you have any more questions about this, feel free to ask in the comments below or hit me up on twitter (@rjonesx). I'd love to talk through more ideas for improvements you might have!



Improving Search Rank by Optimizing Your Time to First Byte

Posted by Zoompf

Back in August, Zoompf published newly uncovered research findings examining the effect of web performance on Google's search rankings. Working with Matt Peters from Moz, we tested the performance of over 100,000 websites returned in the search results for 2000 different search queries. In that study, we found a clear correlation between a faster time to first byte (TTFB) and a higher search engine rank. While it could not be outright proven that decreasing TTFB directly caused a higher search rank, there was enough of a correlation to at least warrant some further discussion of the topic.

The TTFB metric captures how long it takes your browser to receive the first byte of a response from a web server when you request a particular website URL. In the graph captured below from our research results, you can see websites with a faster TTFB in general ranked more highly than websites with a slower one.

We found this to be true not only for general searches with one or two keywords, but also for "long tail" searches of four or five keywords. Clearly this data showed an interesting trend that we wanted to explore further. If you haven't already checked out our prior article on Moz, we recommend you check it out now, as it provides useful background for this post: How Website Speed Actually Impacts Search Ranking.

In this article, we continue exploring the concept of Time to First Byte (TTFB), providing an overview of what TTFB is and steps you can take to improve this metric and (hopefully) improve your search ranking.

What affects TTFB?

The TTFB metric is affected by 3 components:

  1. The time it takes for your request to propagate through the network to the web server
  2. The time it takes for the web server to process the request and generate the response
  3. The time it takes for the response to propagate back through the network to your browser.

To improve TTFB, you must decrease the amount of time for each of these components. To know where to start, you first need to know how to measure TTFB.

Measuring TTFB

While there are a number of tools to measure TTFB, we're partial to an open source tool called WebPageTest.

Using WebPageTest is a great way to see where your site performance stands, and whether you even need to apply energy to optimizing your TTFB metric. To use, simply visit http://webpagetest.org, select a location that best fits your user profile, and run a test against your site. In about 30 seconds, WebPageTest will return you a "waterfall" chart showing all the resources your web page loads, with detailed measurements (including TTFB) on the response times of each.

If you look at the very first line of the waterfall chart, the "green" part of the line shows you your "Time to First Byte" for your root HTML page. You don't want to see a chart that looks like this:

A waterfall chart with a slow time to first byte

In the above example, a full six seconds is getting devoted to the TTFB of the root page! Ideally this should be under 500 ms.
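If you want a quick, scriptable number alongside WebPageTest, a rough TTFB can be measured with a few lines of Python over plain HTTP. This sketch ignores HTTPS, redirects, and retries, so treat it as a sanity check rather than a benchmark; the function name and target URL are just examples.

```python
import socket
import time
from urllib.parse import urlparse

def measure_ttfb(url):
    """Rough TTFB: DNS + connect + request + wait for the first response byte."""
    parts = urlparse(url)
    host, port = parts.hostname, parts.port or 80
    path = parts.path or "/"
    start = time.perf_counter()
    sock = socket.create_connection((host, port), timeout=10)
    sock.sendall(f"GET {path} HTTP/1.1\r\nHost: {host}\r\n"
                 "Connection: close\r\n\r\n".encode())
    sock.recv(1)                       # blocks until the first byte arrives
    elapsed = time.perf_counter() - start
    sock.close()
    return elapsed * 1000              # milliseconds

print(f"TTFB: {measure_ttfb('http://www.example.com/'):.0f} ms")
```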

So if you do have a "slow" TTFB, the next step is to determine what is making it slow and what you can do about it. But before we dive into that, we need to take a brief aside to talk about "Latency."

Latency

Latency is a commonly misunderstood concept. Latency is the amount of time it takes to transmit a single piece of data from one location to another. A common misunderstanding is that if you have a fast internet connection, you should always have low latency.

A fast internet connection is only part of the story: the time it takes to load a page is not just dictated by how fast your connection is, but also how FAR that page is from your browser. The best analogy is to think of your internet connection as a pipe. The higher your connection bandwidth (aka "speed"), the fatter the pipe is. The fatter the pipe, the more data that can be downloaded in parallel. While this is helpful for overall throughput of data, you still have a minimum "distance" that needs to be covered by each specific connection your browser makes.

The figure below helps demonstrate the differences between bandwidth and latency.

latency

As you can see above, the same JPG still has to travel the same "distance" in both the higher and lower bandwidth scenarios, where "distance" is defined by two primary factors:

  1. The physical distance from A to B. (For example, a user in Atlanta hitting a server in Sydney.)
  2. The number of "hops" between A and B, since internet traffic is routed through an increasing number of routers and switches the further it has to travel.

So while higher bandwidth is most definitely beneficial for overall throughput, you still have to travel the initial "distance" of the connection to load your page, and that's where latency comes in.

So how do you measure your latency?

Measuring latency and processing time

The best tool to separate latency from server processing time is surprisingly accessible: ping.

The ping tool is pre-installed by default on most Windows, Mac and Linux systems. What ping does is send a very small packet of information over the internet to your destination URL, measuring the amount of time it takes for that information to get there and back. Ping uses virtually no processing overhead on the server side, so measuring your ping response times gives you a good feel for the latency component of TTFB.

In this simple example I measure my ping time between my home computer in Roswell, GA and a nearby server at www.cs.gatech.edu in Atlanta, GA. You can see a screenshot of the ping command below:

ping

Ping tested the response time of the server several times and reported an average of 15.8 milliseconds. Ideally you want your ping times to be under 100 ms, so this is a good result (though admittedly the distance traveled here is very small; more on that later).
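If ICMP happens to be blocked for a server you care about, a similar latency estimate can be taken from how long a bare TCP connection takes to open, which is roughly one network round trip. This is a small sketch, assuming the target host accepts connections on port 80; the host below is a placeholder.

```python
import socket
import time

def tcp_connect_ms(host, port=80, samples=5):
    """Average TCP handshake time: roughly one network round trip."""
    times = []
    for _ in range(samples):
        start = time.perf_counter()
        socket.create_connection((host, port), timeout=5).close()
        times.append((time.perf_counter() - start) * 1000)
    return sum(times) / len(times)

print(f"Approx. latency: {tcp_connect_ms('www.example.com'):.1f} ms")
```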

By subtracting the ping time from your overall TTFB time, you can then break out the network latency components (TTFB parts 1 and 3) from the server back-end processing component (part 2) to properly focus your optimization efforts.
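As a quick worked example with made-up numbers:

```python
ttfb_ms = 620      # measured TTFB (hypothetical)
latency_ms = 90    # measured round-trip latency from ping (hypothetical)
backend_ms = ttfb_ms - latency_ms
print(f"Roughly {backend_ms} ms of the {ttfb_ms} ms TTFB is back-end processing")
```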

Grading yourself

From the research shown earlier, we found that websites with the top search rankings had TTFB as low as 350 ms, with the lower-ranking sites pushing up to 650 ms. We recommend a total TTFB of 500 ms or less.

Of that 500ms, a roundtrip network latency of no more than 100ms is recommended. If you have a large number of users coming from another continent, network latency may be as high as 200ms, but if that traffic is important to you, there are additional measures you can take to help here which we'll get to shortly.

To summarize, your ideal targets for your initial HTML page load should be:

  1. Time to First Byte of 500 ms or less
  2. Roundtrip network latency of 100 ms or less
  3. Back-end processing of 400 ms or less

So if your numbers are higher than this, what can you do about it?

Improving latency with CDNs

The solution to improving latency is pretty simple: reduce the "distance" between your content and your visitors. If your servers are in Atlanta, but your users are in Sydney, you don't want your users to request content halfway around the world. Instead, you want to move that content as close to your users as possible.

Fortunately, there's an easy way to do this: move your static content onto a Content Delivery Network (CDN). CDNs automatically replicate your content to multiple locations around the world, geographically closer to your users. So now if you publish content in Atlanta, it will automatically be copied to a server in Sydney, from which your Australian users will download it. As you can see in the diagram below, CDNs make a considerable difference in reducing the distance of your user requests, and hence reduce the latency component of TTFB:

CDN network diagram

To impact TTFB, make sure the CDN you choose can cache the static HTML of your website homepage, and not just dependent resources like images, JavaScript, and CSS, since that homepage HTML is the initial resource the Google bot will request and measure TTFB against.

There are a number of great CDNs out there including Akamai, Amazon Cloudfront, Cloudflare, and many more.

Optimizing back-end infrastructure performance

The second factor in TTFB is the amount of time the server spends processing the request and generating the response. Essentially the back-end processing time is the performance of all the other "stuff" that makes up your website:

  • The operating system and computer hardware which runs your website and how it is configured
  • The application code that's running on that hardware (like your CMS) as well as how it is configured
  • Any database queries that the application makes to generate the page, how many queries it makes, the amount of data that is returned, and the configuration of the database

How to optimize the back-end of a website is a huge topic that would (and does) fill several books. I can hardly scratch the surface in this blog post. However, there are a few areas specific to TTFB that I will mention that you should investigate.

A good starting point is to make sure that you have the needed equipment to run your website. If possible, you should skip any form of "shared hosting" for your website. What we mean by shared hosting is utilizing a platform where your site shares the same server resources as other sites from other companies. While cheaper, shared hosting passes on considerable risk to your own website as your server processing speed is now at the mercy of the load and performance of other, unrelated websites. To best protect your server processing assets, insist on using dedicated hosting resources from your cloud provider.

Also, be wary of virtual or "on-demand" hosting systems. These systems will suspend or pause your virtual server if you have not received traffic for a certain period of time. Then, when a new user accesses your site, they will initiate a "resume" activity to spin that server back up for processing. Depending on the provider, this initial resume could take 10 or more seconds to complete. If that first user is the Google search bot, your TTFB metric from that request could be truly awful.

Optimize back-end software performance

Check the configuration of your application or CMS. Are there any features or logging settings that can be disabled? Is it in a "debugging mode?" You want to get rid of nonessential operations that are happening to improve how quickly the site can respond to a request.

If your application or CMS is using an interpreted language like PHP or Ruby, you should investigate ways to decrease execution time. Interpreted languages have a step that converts them into machine-understandable code, which is what is actually executed by the server. Ideally you want the server to do this conversion once, instead of with each incoming request. This is often called "compiling" or "op-code caching," though those names can vary depending on the underlying technology. For example, with PHP you can use software like APC to speed up execution. A more extreme example would be HipHop, a compiler created and used by Facebook that converts PHP into C++ code for faster execution.

When possible, utilizing server-side caching is a great way to generate dynamic pages quickly. If your page is loading content that changes infrequently, utilizing a local cache to return those resources is a highly effective way of improving your page load time.

Effective caching can be done at different levels by different tools and is highly dependent on the technology you are using for the back-end of your website. Some caching software only caches one kind of data, while other tools cache at multiple levels. For example, W3 Total Cache is a WordPress plug-in that does both database query caching and page caching. Batcache is a WordPress plug-in created by Automattic that does whole-page caching. Memcached is a great general object cache that can be used for pretty much anything, but requires more development setup. Regardless of what technology you use, finding ways to reduce the amount of work needed to create the page by reusing previously created fragments can be a big win.
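As an illustration of the idea rather than a recommendation for any particular stack, here is a minimal time-based cache decorator in Python; the helper and function names are invented for this sketch. W3 Total Cache, Batcache, and memcached all apply the same principle at much larger scale.

```python
import time
from functools import wraps

def cache_for(seconds):
    """Cache a function's return value for a fixed time window."""
    def decorator(fn):
        store = {}
        @wraps(fn)
        def wrapper(*args):
            value, stamp = store.get(args, (None, 0))
            if time.time() - stamp < seconds:
                return value                  # serve the cached fragment
            value = fn(*args)
            store[args] = (value, time.time())
            return value
        return wrapper
    return decorator

@cache_for(300)                               # rebuild at most every 5 minutes
def render_sidebar(category):
    # Stand-in for an expensive database query plus template render
    return f"<div>popular items in {category}</div>"
```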

As with any software changes you'd make, make sure to continually test the impact on your TTFB as you incrementally make each change. You can also use Zoompf's free performance report to identify back-end issues which are affecting performance, such as not using chunked encoding and much more.

Conclusions

As we discussed, TTFB has 3 components: the time it takes for your request to propagate to the web server; the time it takes for the web server to process the request and generate the response; and the time it takes for the response to propagate back to your browser. Latency captures the first and third components of TTFB, and can be measured effectively through tools like WebPageTest and ping. Server processing time is simply the overall TTFB time minus the latency.

We recommend a TTFB time of 500 ms or less. Of that TTFB, no more than 100 ms should be spent on network latency, and no more than 400 ms on back-end processing.

You can improve your latency by moving your content geographically closer to your visitors. A CDN is a great way to accomplish this as long as it can be used to serve your dynamic base HTML page. You can improve the performance of the back-end of your website in a number of ways, usually through better server configuration and caching expensive operations like database calls and code execution that occur when generating the content. We provide a free web performance scanner that can help you identify the root causes of slow TTFB, as well as other performance-impacting areas of your website code, at http://zoompf.com/free.



Operation Clean Air: Clearing Up Misconceptions of Yelp's Review Filter

Posted by David-Mihm

Last week, the New York Attorney General's "Operation Clean Turf" fined 19 companies a total of $350,000 for writing fake reviews on behalf of their clients. The case sets a laudable precedent not only for the future of local search, but for digital marketing more broadly.

While the amount of the fines is hardly Earth-shattering, the outcome of this operation should give pause to any SEO or reputation-management company considering quick-and-dirty, underhanded tactics to boost their clients' rankings, "improve" their clients' reputations, or launch negative attacks on competitors.

In the wake of this settlement, however, a wave of media coverage and a study by researchers at the Harvard Business School have clouded the reality of Yelp's review filter—already poorly understood by typical business owners—even further. In this piece I hope to dispel four misconceptions that it would be easy to conclude from these recent publications.

Likely elements of review filters

Review characteristics

  • Use of extreme adjectives or profanity in the review
  • Overuse of keywords in the review
  • Inclusion of links in the review
  • 1-star or 5-star rating (see discussion of HBS study below)

User characteristics

  • Total number of reviews a user has left on the site
  • Distribution of ratings across all of a user's reviews
  • Distribution of business types among all of a user's reviews
  • Frequency of reviews that a user has left on the site
  • IP address(es) of the user when leaving reviews

Business characteristics

  • A sudden burst of reviews preceded by or followed by a long lull between them.
  • Referring URL string to business page (or lack thereof)

1. "Most aggressive" review filter ≠ "most successful" review filter

Yelp representatives made little effort to contain their glee at being cited by the NYAG as having the "most aggressive filter" of well-known local review sites. In an interview with Fortune, Yelp's corporate communications VP spun this statement by the NYAG as validation that his company's filter was "presumably the most progressive and successful."

As I stated in the same Fortune story, I agree 100% with the NYAG that Yelp's filter is indeed the most aggressive. Unfortunately, this aggressiveness leads, in my experience, to a far higher percentage of false positives—i.e. legitimate reviews that end up being filtered—than the review filters on other sites.

Google, for example, has struggled for almost as long as Yelp to find the perfect balance between algorithmic aggression and giving users (and indirectly, business owners) the benefit of the doubt on "suspicious" reviews. Now that a Google+ account is required to leave a review of a business, I suspect that the corresponding search history and social data of these accounts give Google a huge leg up on Yelp in identifying truly fraudulent reviews.

I'm not necessarily saying that Google, TripAdvisor, Yahoo, or any other search engine presents the most representative review corpus, but it's a pretty big stretch for Yelp to equate aggression with success.

2. "Filtered reviews" ≠ "fraudulent reviews"

To Yelp's credit, even they admit that legitimate reviews are sometimes filtered out by their algorithm. But you sure wouldn't know it by reading a recently published study by the Harvard Business School.

In a throwaway line that would be easy to miss, the authors state that they "focus on reviews that Yelp's algorithmic indicator has identified as fraudulent. Using this proxy…" they go on to draw four—actually five—conclusions about "fraudulent" reviews:

  1. Their star ratings tend to be more extreme than other reviews.
  2. They tend to appear more often at restaurants with few reviews or negative reviews.
  3. They tend to appear more often on independent restaurants rather than chains.
  4. They tend to appear more in competitive markets.
  5. "Fraudulent" 5-star reviews tend to appear more on claimed Yelp pages than unclaimed ones.

The authors attempt to use statistical equations to justify the foundation of their study, but the fundamental logic of their equations is flawed. I'm by no means a statistical wizard, but the authors suggest that readers like me scan filtered reviews to validate their assumption.

I would only highlight my friend Joanne Rollins' Yelp page, and thousands of other business owners' pages just like hers, as qualitative evidence to rebut their logic. I don't dispute that Yelp's review filter is directionally accurate, but it's crazy to assume it's anywhere near foolproof enough to use as the foundation for a study like this. It leads to a self-fulfilling prophecy.

In fact, there are five very easy explanations of their conclusions that in no way lead you to believe that the overlap between filtered reviews and fraudulent reviews is even close.

  1. Yelp uses star rating as part of its filtering algorithm. This is an interesting finding, but not applicable to "fraudulent" reviews.
  2. Restaurants with few reviews or negative reviews are engaging in proactive reputation management by asking customers with positive experiences to review them. This is simply a best practice of online marketing. While it violates Yelp's guidelines, by no means does it indicate that the reviews generated by these campaigns are fraudulent.
  3. Independent restaurants tend to be much more engaged in online marketing than chains. Speaking from years of personal experience, chains have by-and-large been very slow to adopt local search marketing best practices, from search-friendly store locators to data management at local search engines to review campaigns. Independent small business owners simply tend to be more engaged in their digital success than corporate managers.
  4. Restaurateurs in competitive markets tend to be much savvier about their digital marketing opportunities than those in less-competitive, typically rural markets.
  5. Engaged restaurateurs are more likely to pursue proactive reputation management campaigns (see bullet-point number two).

While the HBS study highlights a number of interesting attributes of Yelp's review filter, it's simply impossible to draw the kinds of conclusions that the authors do about the truthfulness or fraudulence of filtered reviews.

3. "Filtered reviews" ≠ "useless reviews"

I consider my friend Joanne Rollins to be a fairly typical small business owner. She runs a small frame shop with the help of a couple of employees in a residential neighborhood of NW Portland. She's not shy about sharing her ire with Yelp, not only around some of their shady sales practices, but especially about her customers' reviews getting filtered.

Trying to explain some of the criteria that cause a review to be filtered simply takes too long, and Joanne is easily frustrated by the fact that a faceless computer algorithm is preventing testimonials from 13+ human beings from persuading future customers to patronize her business. On the customer side, they're usually disappointed that they've wasted time writing comments that no one will ever see.

But all is not lost when a review is filtered! With permission from the customer, I encourage you to republish your filtered Yelp reviews on your own website. There's no risk of running afoul of any duplicate content issues, since search engines cannot fill out the CAPTCHA forms required to see filtered reviews.

You as the business owner get the advantage of a few (likely) keyword-rich testimonials, and your customers get the satisfaction in knowing that hundreds of future customers will use their feedback in making a purchase decision. Marking these up in schema.org format would be the icing on the cake.

4. "Filtered reviews" ≠ "reviews lost forever"

A review once filtered does not necessarily mean a review filtered for all time. There are steps your customers can take that I believe will make their reviews more likely to be promoted from the filter onto your actual business page:

  • Complete their personal Yelp profile, including photo and bio information.
  • Download the Yelp app to their mobile device and sign in.
  • Connect their Facebook account to their Yelp profile.
  • Make friends with at least a handful of other Yelpers.
  • Review at least 8-10 other businesses besides yours.
  • Leave at least one review with each star rating (i.e. 1-, 2-, 3-, 4-, 5-).

For those customers who are super-frustrated by Yelp's filtering of their review, or with whom you, as a business owner, have a particularly strong relationship, consider requesting that they undertake at least a couple of those tactics. I certainly don't guarantee their success, but it's worth a shot.

The reality of Yelp's review filter

As the infographic above demonstrates, Yelp's excitement over the citation from the NYAG as having the most aggressive filter underlines a fundamental business problem for the company that I've written about for years.

Yelp's fortunes are tied to their success in selling business owners advertising. Yet these same business owners:

  • don't understand how the site works (at best)
  • think that every Yelp salesperson is out to extort them (at worst)

Despite commendable efforts like their Small Business Advisory Council, Yelp clearly has a long way to go in educating these business owners. And they certainly have a long way to go with reining in rogue salespeople.

But the bigger issue is the consistent disconnect with their customers on the issue most important to their businesses--their guidelines for solicitation and display of reviews. Until they resolve that inherent conflict, I find it hard to see how they'll grow their revenues to the levels that Wall Street clearly expects.



On Our Wait-List? You Get a Moz Analytics!

Posted by Anthony Skinner

It is with great pleasure that I announce the wait is over! That's right, we are now letting people from our wait-list into Moz Analytics!

In many ways, I feel like a not-as-cute version of Oprah Winfrey. I may not be worth 77 million dollars, and I am not giving you a car, but it does feel good to give new subscribers who patiently waited a 30-day free trial of Moz Analytics! Over the next few weeks we will be sending emails inviting people to try out the tools. The invitation is good for seven days, so when you see the email, make sure you click the link and join us right away.

If you're not on our wait-list, you've still got time to get early access. Just head over and sign up!

Before too long, we will open Moz Analytics free trials to the general public. We plan to release improvements and fixes to Moz Analytics every 2-4 weeks. Have questions about the application? Feel free to check out the Moz Help Hub. Feedback or suggestions? Check out the feature request forum.

Otherwise, sit back and enjoy your new ride.

Anthony Skinner
CTO and Oprah Impersonator



What I Learned from Scraping SEOmoz's Active User Base

Posted by iPullRank

Many moons ago, when Moz was SEOmoz, I had the idea to scrape all its publicly available profile data on active users just to see what I could learn about the community. Quantitative market research is an incredibly powerful method to quickly grab insights on a brand's users. Using those insights, we can develop strong content strategies and link-building campaigns, as well as gather competitive insights.

What easier way than scraping the data from a brand's user profiles?

In Soviet iAcquire, the web crawl you.

Oh, you may have heard of Gary and Cogswell, the Russian-coded robots that escaped the Ministry of Education and Science and sought asylum in iRank (our homegrown targeting and reporting technology for scaled content marketing). They were originally assigned some very menial tasks, but I've since reprogrammed them to aid us in better marketing. They are here to lend a hand to their idol, Roger Mozbot, in the hunt for Red October. As the Russian saying goes, "Many hands make light work."

Special thanks to our creative director Robb Dorr for capturing them in the act.

So we built (and by "we" I mean I had our Manager of Research and Development Joshua Giardino build) a multi-threaded crawler in Python, and we fired it at all of the profiles of Moz users who had logged in during the previous 60 days—those people whom I'll call "active users." (A rough sketch of that kind of crawler appears after the field list below.) For those that have forgotten what their Moz profile looks like, it contains a lot of great info ripe for the plucking. I personally don't know what Moz uses it for, but with this post I hope to touch on some potential use cases. Your profile looks (or at least looked) like this, and has all of the following data points in it if you provide them.

How SEOMoz profiles once looked

  • Full Name
  • User Name
  • Email
  • Title
  • Company
  • Type of Job
  • Location
  • Favorite Thing About SEO
  • Bio
  • Favorite Topics
  • Instant Messenger Handles
  • MozPoints
  • Level
  • Membership type
  • Rank
  • # of Comments & Responses
  • Length of Membership
  • Links to other sites
  • Social Media profiles
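For the curious, the crawler mentioned above amounted to little more than the pattern below. The URLs are placeholders, and the real script also parsed each profile's fields and kept the request rate sane.

```python
import concurrent.futures
import urllib.request

# Placeholder URLs; the real list came from paging through the member directory.
PROFILE_URLS = [
    "https://www.example.com/community/users/1",
    "https://www.example.com/community/users/2",
    # ...
]

def fetch(url):
    """Download one profile page; field extraction is left out of this sketch."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return url, resp.read().decode("utf-8", errors="replace")

# A small worker pool keeps the crawl to a few URLs per second.
with concurrent.futures.ThreadPoolExecutor(max_workers=4) as pool:
    for url, html in pool.map(fetch, PROFILE_URLS):
        print(url, len(html), "bytes")
```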

So, now that we have this treasure trove of data on SEOs in a highly engaged community, let's see exactly what we have.

Crawl stats

  • Crawl date 2/15/13 – Yep, Casey, that was us.
  • 14,036 out of 14,872 profiles were successfully crawled – It wasn't a polite crawl at all.
  • Average crawl rate of 4 URLs/sec – I'm surprised we didn't get throttled more.
  • Total URLs Crawled 14,872 + 299 directory pages to extract profile URLs (=15,171 URLs)

Methodology

  • Scrape as many users as we can
  • Cross-tab everything until we find useful insights
  • Run linear regressions to test the validity of correlations (a minimal sketch of these steps follows this list)
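A minimal sketch of those two steps, assuming the scraped profiles have been exported to a CSV; the file and column names are hypothetical.

```python
import pandas as pd
from scipy.stats import linregress

# Hypothetical file and column names; the real scraped export will differ.
profiles = pd.read_csv("moz_profiles.csv")

# Cross-tab: type of work vs. membership type, as row percentages
print((pd.crosstab(profiles["work_type"], profiles["membership_type"],
                   normalize="index") * 100).round(1))

# Simple linear regression: do MozPoints rise with years of membership?
fit = linregress(profiles["years_of_membership"], profiles["mozpoints"])
print(f"slope={fit.slope:.1f}  r={fit.rvalue:.2f}  p={fit.pvalue:.3g}")
```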

Limitations of the data

According to the About page Moz had over 15,000 subscribers in February of 2013, but you can be a user without being a subscriber. I've asked Mozzers in passing how many users the site has, and have gotten much bigger numbers than that. After I originally submitted this post, it was revealed to me that Moz has over 250K+ user accounts. So the issue with this data is that it is just a sample. However, sampling is inherently a part of market research; after all, you can't survey everybody. The more important point, however, is that the users we scraped were all active users within the previous 60 days, and therefore were likely more reflective of the needs of those who are highly engaged in the product.

Also, many users have not completely filled out their profiles, so when performing cross-tabulations we are often dealing with samples of slightly different sizes. Therefore, all of the insights presented only account for respondents. That is to say, we don't mention the number of people that have not filled out a given data point. Again, for those who want to know, the base number of total respondents for this study is 14,036, which makes for an approximate 5.6% sample of all users (but presumably a much larger percent of active users). Feel free to check our work.

I've talked a lot about market research and how SEO as an industry doesn't value it. Many SEOs I've encountered prefer taking shots in the dark or the guess-and-check method. This line of thinking is why the erosion of keyword data in analytics matters so much to SEOs. Market research is why it doesn't matter to channels like social media or (ugh) display.

In fact, for enterprise clients it is only about "are we capturing the right people," and "how many are we getting through each channel?" This way of thinking allows marketers to think bigger and be involved in conversations beyond meta tags and links. For those that are leery of the application to small-business marketers, you can easily leverage canned market segmentation provided by Nielsen, Experian, and others, or you can leverage segmentation in other ways.

So first, let's go over some high-level insights. Our Inbound Marketing Analyst, Jiafeng Li, ultimately cross-tabbed the data a ton of different ways, and the entire analysis that we've performed is available for download at the bottom of this post in the "Parting gifts" section.

Membership type

The Membership Type field in the Moz profile refers to the type of Moz subscription that a user has. For the purposes of this study we basically care whether the user is "basic" or not. Basic means they are a Moz user without a paying account, while any of the other six membership types indicates a paying customer.

As the histogram indicates, the majority of active users are Pro members. Roughly 60% of this group has an active subscription. While interesting, this data doesn't tell us much until we bring it into context of other data points that we will examine shortly. It should be noted that this field is set programmatically, so all "respondents" have this field filled out in their profiles.

Most active users are either basic (unsubscribed) or Pro (standard subscription) users—42% basic and 49% Pro. Therefore, a large segment of these users are active subscribers paying at least the regular rate of $99/month. This also means most users are genuinely affected when the product has issues. However, it's notable that Moz does a great job of being transparent when this happens.

Moz Insight: There's no real actionable insight here without looking at data in context of other data points that we will examine later in the post.

Competitive Insight: Nearly half of Moz's active user base doesn't subscribe to the product. It would be worthwhile to segment further and reach out to these people to understand why.

Years of membership

The profile also tells us when the member signed up for her account. This is interesting for getting a picture of the retention of the Moz active user base. The actual data point is the number of years since signup, which shows that, year over year, Moz has retained more active users.

Note: Remember this data was collected in February of 2013 so that explains the small negative delta between years one and zero.

Congrats to Moz for their sustained user retention. Based on the sample, they've retained more active users every year (not including year 0, which had just started).

From the outside looking in, this is a clear indicator of a growing and thriving community. When researching viable opportunities, this is far more important to me than any link metric. To be clear, though, this data is limited in that we don't know exactly how many users signed up and ultimately canceled altogether, nor do we know how many users have switched user types over time. Therefore the data is a jigsaw puzzle with a couple of middle pieces missing.

This is also how we realized this is just a sample of the user base because Moz reports its subscriber growth on the About page as:

  • 2009 – 5K
  • 2011 – 10K+
  • 2012 – 15K+

However, given that there is an account base of over 250K, this is clearly not indicative of all user accounts. Also, in a recent conversation with Rand, I learned that the subscriber base has continued to grow well beyond the number displayed on the About page at the time of this writing.

Time spent on SEO

One of the more interesting data points requested in the user profile is the amount of time a given user spends on SEO per week. This is particularly interesting because we can use this as an indicator of savvy or engagement in the space—especially in context with job titles.

The biggest segment (20% of respondents) spend more than 50 hrs/week on SEO, and as you might imagine, the active user base is mostly made up of people that spend a ton of time on SEO. However, there are also very large segments that spend smaller amounts of time on SEO.

Insight: As a content creator, there is space for really advanced content, but there's likely an even more lucrative opportunity for basic content built for people with a shorter attention span for SEO.

Moz Insight: Moz should consider some cartoon-based shorts starring Roger explaining SEO basics and quick-hit tactics for less advanced users.

Level/MozPoints

Moz has a rudimentary system of gamification that comes into play based on how active a user is on the blog or in Q&A. Points are awarded for—you guessed it—filling out your profile, publishing blog posts on YouMoz or being promoted to the main blog, commenting, and acquiring thumbs up.

This value is set by the system and the data indicates that 90% of active users are lurkers. There's only a handful of Gianlucas out there. Based on how MozPoints are awarded, this histogram helps me understand how many users are engaged enough to be "thought leaders" as defined by the Journeyman, Authority, Guru and Oracle levels. These are the influencers I would reach out to if I wanted to place links or I wanted to get buy-in before I posted on YouMoz and wanted to ensure I got traction.

Moz Insight: Moz's gamification needs work, and actually isn't very TAGFEE. There are more actions that are beneficial to Moz that should also award points to users. For example, sharing a post on Twitter should result in a point for the sharer and the author. The rewards are also not that compelling. With all the Mozperks and free swag Moz gives away they would be well served to build a marketplace where users can redeem their points for fun stuff.

Note the change in the level names since the change to Moz. Guru has become Expert, and Journeyman has become Specialist.

Competitive Insight: 90.16% of Moz's active users are not that engaged in the blog, Q&A or comments. While the community thrives in different ways on different channels there is an opportunity for another site to spring up that rewards user engagement in a more in-depth and (dare I say it?) transparent way.

Type of work

Users self-identify the classifications of their work, and with this data point Moz better understands how well they are capturing their targets.

Moz speaks to all segments of the audience with its offering and content, but as Rand mentioned enthusiastically at MozCon, they are focused on helping small business owners do better marketing. However, the active user base is 25.7% agency or independents that are likely floating across many clients.

The remaining big segments are:

  • 16.69% Business Owners
  • 15.65% In-House

Moz Insight: Moz's active user base is not primarily made up of their core target. The real question that needs answering is, why is that? I believe cross-tabbing a little further gives us some more clues later in this analysis.

Competitive Insight: Moz's user base is full of people that make great targets for agencies and enterprise products. Product brands that serve the enterprise like Conductor or Brightedge; and agencies like Distilled, SEER, Portent, and (ahem) iAcquire are obviously well served by being featured here or at Moz events.

Years of membership vs. membership type

Since we don't have any indication of how user account types change over time, the best we can do is look at account types in context of account age to try and understand if there are any trends.

For users who have been members for less than a year, a higher percentage are basic users, while at more than 1 year, a higher percentage are Pro users, indicating possible conversion to Pro after the first year. The data indicates that the longer people are engaged with Moz, the more likely they are to subscribe to Pro.

Competitive Insight: The best time to convince users to try another product is in their first year of using Moz. The data indicates that Year 0 members aren't quite convinced this is the product for them. A competitor would be well-served to offer a longer free trial than Moz does, and actively engage the user with how-to content via email to keep them actively engaged throughout their free trial so they can understand the value of the product.

Moz Insight: The data indicates that Moz does a good job of keeping these active members happy—if they can keep them around. Users are likely kept due to Moz's investment in upgrades and remarkable content. The real question is which types of content lead to those initial conversions and which types reduce the churn? Don't worry, I've got some ways to figure that out as well.

Naturally, Moz would also be well-served to develop ways to keep users highly engaged during their free trial process with "Did You Know" weekly emails based on app usage and non-usage.

Type of work vs. membership type

We wanted to understand how the type of work correlates to membership type. What types of users own what type of membership?

Pro usage is dominated by in-house professionals, and independents are the only segment that is mostly basic users.

Moz Insight: The hypothesis I've drawn here based on the data about these active users is that independents either don't see the value in subscribing to Moz or they can't afford it. Moz should consider a certification program similar to that of HubSpot, which would allow independents to generate leads. Once certified, these independents can enjoy a cheaper subscription rate. After all, independents are even smaller-business owners.

Competitive Insight: There is an independent market worth tapping with a tool suite that costs less than $1,188 per year. It would be worth performing exploratory research to understand what type of tools independents believe are worth investing in.

Time spent on SEO (heavy users) vs. membership type

We wanted to know what types of memberships the most engaged SEO practitioners have as these people are likely the hardest to please and may have the most influence of the bunch.

Among heavy SEO users in the active user base (those who spend more than 50 hours per week on SEO), agency and in-house users have a higher percentage of Pro subscribers, while business owners and other types of users comprise higher percentages of basic users.

Moz Insight: The data about these active users indicates that a large portion of business owners that are heavy SEO users are basic users of Moz. Moz may be too expensive for the people it wants to serve most, or even worse, these people may not truly see the value of Moz. This may be the most useful insight to Moz, and is definitely worth exploring further through interviews of this segment.

Competitive Insight: The independent and small-business owner is the battleground for those competing with Moz. Agencies and in-house professionals typically have access to bigger budgets and a variety of tools, whereas independents and small business owners often have to choose. Therefore, this may be where all-in-one products like RavenTools and HubSpot outperform Moz. It's worth following up with exploratory research and examining any publicly available data on their users.

Level/MozPoints vs. years of membership

We wanted to see if there was any correlation between the number of years of membership and the amount of contribution to the community, wondering if it would be possible to predict when the next John Doherty or Tom Critchlow would pop up.

Among the less active "aspirant" users, most are comparatively newer members, while most "contributor," "journeyman," and "authority" users are comparatively older members.

The insight here is an obvious one: The longer you're with the Moz community, the more likely you are to become more engaged. The biggest group of contributors lies at the two-year mark. It would appear that Moz is already proactively cherry-picking the best-of-breed posters to add to the Associate program. Competitors looking to quickly identify people for potential guest posting could look here, but again, this is obvious: if someone is good, their posts tend to get tons of visibility anyway.

Regressions on membership type

There have been many discussions as of late on the value of correlation in SEO. Rand has already gone in-depth as to why correlation studies are worthwhile, but I will briefly say that while correlation != causation, it does surface some interesting insights. That said, we ran linear regressions on the data that we cross-tabbed in the last few charts as follows:

  • X = "years of membership"
  • Y = "membership type value"
  • Y = 4.74X + 55.75
  • Adjusted R-square = 0.0017 (extremely low, meaning the regression can't really explain the data)
  • The adjusted R-square is similarly low when X = "time spent on SEO," "type of work," or "level."

The results of our regressions indicate how strongly membership type correlates with time spent on SEO, type of work, and level. We found that membership type is not strongly correlated with any one of those metrics, which means that while there are a lot of happy "coincidences" here, we can't say that any given factor is a driving force behind membership type.
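
If you'd like to run the same sort of sanity check on your own cross-tabbed data, a one-variable regression is simple to compute. Below is a minimal sketch in plain JavaScript; it is not the tooling we actually used, and the sample rows are made up, with x standing in for years of membership and y for the numeric membership type value.

// Ordinary least squares for one predictor, plus adjusted R-square.
// "points" is an array of {x, y} pairs; the sample rows at the bottom are invented.
function linearRegression(points) {
  var n = points.length;
  var sumX = 0, sumY = 0, sumXY = 0, sumXX = 0, sumYY = 0;
  points.forEach(function (p) {
    sumX += p.x; sumY += p.y;
    sumXY += p.x * p.y; sumXX += p.x * p.x; sumYY += p.y * p.y;
  });
  var slope = (n * sumXY - sumX * sumY) / (n * sumXX - sumX * sumX);
  var intercept = (sumY - slope * sumX) / n;
  var r = (n * sumXY - sumX * sumY) /
          Math.sqrt((n * sumXX - sumX * sumX) * (n * sumYY - sumY * sumY));
  // Adjusted R-square for a single predictor: 1 - (1 - R^2) * (n - 1) / (n - 2)
  var adjRSquared = 1 - (1 - r * r) * (n - 1) / (n - 2);
  return { slope: slope, intercept: intercept, adjRSquared: adjRSquared };
}

var fit = linearRegression([{ x: 1, y: 55 }, { x: 3, y: 80 }, { x: 7, y: 62 }]);
console.log('Y = ' + fit.slope.toFixed(2) + 'X + ' + fit.intercept.toFixed(2) +
            ', adjusted R-square = ' + fit.adjRSquared.toFixed(4));

If the adjusted R-square comes back near zero, as ours did, treat the slope as trivia rather than a lever you can pull.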

Job titles

Users have the ability to enter their job titles in their Moz profile. However, free-form text fields are difficult to analyze, since everyone's answer is very different. Enter: the word cloud.

Perhaps I am innumerate, but I've never really been a fan of Word Clouds. Bigger words, bigger value. Big whoop. That said, this one would be pretty useful if I didn't already know a lot about the Moz community. If I'm looking to create content it's probably not best to go with code-heavy stuff. This word cloud tells me that I'm mostly speaking to people that are pretty far in their SEO careers, such as marketing directors and managers. As the marketing lead for an SEO and social media agency, I could quickly verify that my exact audience is here.
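
If you want to recreate this kind of view from your own scraped profiles, the analysis behind a word cloud is just a term-frequency count over the free-form text. Here's a minimal sketch; the jobTitles array and the three sample strings are placeholders, not our actual export.

// Count how often each word appears across free-form job titles.
// Bigger counts would become the bigger words in the cloud.
function termFrequencies(jobTitles) {
  var counts = {};
  jobTitles.forEach(function (title) {
    title.toLowerCase().split(/[^a-z]+/).forEach(function (word) {
      if (word.length < 3) { return; } // skip empty strings and filler like "of"
      counts[word] = (counts[word] || 0) + 1;
    });
  });
  return Object.keys(counts)
    .map(function (word) { return { word: word, count: counts[word] }; })
    .sort(function (a, b) { return b.count - a.count; });
}

var sampleTitles = ['SEO Manager', 'Marketing Director', 'SEO Specialist'];
console.log(termFrequencies(sampleTitles).slice(0, 10));

Feed the sorted list into whatever visualization you like; the cloud itself is just window dressing on top of these counts.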

Moz Insight: There is a large opportunity for higher-level or big-picture content such as what Rand delivers on his personal blog. Since the majority of the active audience appears to be pretty far in their careers, this content may prove more valuable to them.

Competitive Insight: This data further indicates that Moz is a great place to get in front of enterprise professionals, especially in a less "sales-y" capacity. Two words: Case. Studies.

Users' favorite things about SEO

Users also have the ability to share what it is they love about SEO in an in-depth, free-form text area within their profiles. Again, we leverage a word cloud due to the difficulty of segmenting responses otherwise.

This word cloud is also pretty helpful in understanding what content will resonate with the audience. One of the highest-occurring ideas is that users love to get results and see their work on the first page of the SERPs. That, in context with users loving the constant challenge and, to a lesser extent, the creativity required to get there, leads me to believe this is an audience that will be very receptive to new approaches with proven results.

Insight: The active Moz audience is far more interested in results (and therefore case studies) rather than just ideas. This is an insight for both Moz and other marketers looking to appeal to this audience. Bring data or go home.

Users' favorite topics

This section of the user profile is a more succinct version of the last field. Users are given options to choose from, which makes it a lot easier to analyze. Even so, we've leveraged the word cloud here to see what really stands out for the Moz community.

Optimization, content, analytics, research, and link building appear to be the hits with the active users in the community. It looks like I've covered them all in this post, but how the post performs will be truly indicative of how well these types of content reach those people. And that's a good point worth raising right now: how people say they act is not necessarily how they actually act. It will always be up to analytics to prove these insights right or wrong, but the point is to start out with an educated guess backed by data.

Moz Insight: As Moz expands its offering to be more about inbound marketing rather than just SEO, this will be a good data point to measure to determine whether they are capturing more of that broader audience. However, the choices are still reflective of Moz's historical SEO focus, as seen in the screenshot below.

Now would be a good time to update this to reflect more of the granular facets of Inbound.

Competitive Insight: This data really drills in the ideas of what you should focus on if you're trying to get Moz users to come to you. Case studies and how-tos on optimization, content, analytics, research, and link building are the way to go, and a quick look at post analytics seems to back this up.

The real purpose of this post isn't just to show Moz how they can do better marketing; it's to show you how you can leverage your competitors' user profiles to your advantage for a variety of initiatives.

  • Lead Generation – A lot of Moz profiles show email addresses publicly, but they're rendered with JavaScript (darn you, Casey). I could have easily fired a headless browser at the site, pulled in email addresses and sent our sales team at them. (Don't worry, I didn't.)

  • Content Strategy – As noted in the analyses, the data makes it crystal clear what the audience wants in the form of content. A lot of content marketing programs take shots in the dark at what users want while this type of research allows a marketer to make a strong case for the content they would build. It's far easier to convince a client of a creative content approach tied to an audience with data than with just keywords and links.

  • Link Building – This data is basically a personalized Followerwonk. I can slice and dice features of the dataset, grab users' social URLs and sites, then combine them with Domain Authority and Social Authority. That would give me a highly personalized list of link-building prospects that I could segment and target by interest. Say, for example, I only want links from people who've been down with Moz since the beginning: I could just filter by the users that have had accounts for seven years. Done. (There's a quick sketch of that kind of filter right after this list.)
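
Here's roughly what that filter could look like over a scraped export. The field names (yearsOfMembership, domainAuthority, socialAuthority, twitterUrl) and the two sample rows are my own illustrative schema, not the actual data.

// "users" stands in for your scraped profile export; the two rows are invented.
var users = [
  { name: 'Jane', website: 'http://example.com', twitterUrl: 'https://twitter.com/jane',
    yearsOfMembership: 7, domainAuthority: 55, socialAuthority: 61 },
  { name: 'Joe', website: 'http://example.org', twitterUrl: 'https://twitter.com/joe',
    yearsOfMembership: 2, domainAuthority: 30, socialAuthority: 20 }
];

var prospects = users.filter(function (u) {
  return u.yearsOfMembership >= 7 &&   // down with Moz since the beginning
         u.domainAuthority >= 40 &&    // their site is worth chasing for a link
         u.socialAuthority >= 50;      // and they can amplify the content
}).map(function (u) {
  return { name: u.name, site: u.website, twitter: u.twitterUrl };
});

console.log(prospects); // a segmented, personalized prospect list

Swap the thresholds for whatever interest or authority cutoffs match the campaign you're running.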

This is quantitative research with the qualitative insights coming out of my own experiences with the Moz community. Moz has, in the past, done a great job of quantitative research in the form of surveys they run on their community and user base. In fact, we could have layered that data over the data we've collected to get a more complete picture of the user base, including demographics with data from GlassDoor and Payscale to figure out salaries by title. We also could have leveraged Moz's transparent analytics feature to show how content of the different types performs by subject and use those insights to get closer to what actually works for Moz.

We could have also performed qualitative research, much like Moz does with its various initiatives wherein they watch users using their products and ask questions. As a part of Moz's Customer Advisory Board (CAB), the product team often reaches out to me to get my thoughts about using Moz Analytics and to collect specific feedback. The next step would be to pull out a set of users representative of the most valuable segments and hold similar question-and-answer sessions with them.

  • Exploratory Research – I've mentioned it several times, but this is the process of speaking to people in small groups with open-ended conversations to understand how your audience is thinking about your product. This is usually done through focus groups or open-ended surveys to help define what needs to be answered with more data.
  • More Quantitative Research based on those findings – Once we collect findings from exploratory research, we could then send out survey questions based on those findings to get a bigger sample of the segment, or find those people through other channels like LinkedIn.

In other words, insights can always be refined and fine-tuned when used as a basis for determining or answering new questions.

The mad scientists at Moz could also pull the entire 250k+ user base and perform the same analysis. However, I think the analysis of the active users proves to be more actionable, as it limits the research to just those that are actively engaged. That said, analyzing all users may lead to insights into why certain user segments have become completely inactive. Moz could also layer this data with app usage data for a more complete picture of what content keeps users using the product.

Measurement and targeting applications

The slide below, from my MozCon 2012 presentation, may have been forgettable at the time, but it is the foundation for what I believe is the future of digital marketing. This is the framework by which arbitrage and dynamic targeting become stronger, more viable solutions.

The concept is actually called cohort analysis. Before your eyes glaze over, this is nowhere near as complicated as the Keyword-Level Demographics methodology I developed at the end of 2011. With cohort analysis, we segment users based on their shared features and track them accordingly. With Keyword-Level Demographics, we did that using Facebook data to match user data to the features we'd identified for our predefined personas. Cohort analysis works from the other direction: first collect data, then define segments based on actual usage rather than just panels and surveys.

That is to say, Moz doesn't have to go as far as building personas complete with demographics and user stories; they can stop at segments. Much like your Google Analytics segments, Moz could develop affinity segments to see what content resonates with which user types throughout the site. With all the data provided in the user profile, Moz can segment any number of ways, and may choose to go with membership types as the base since it is one of the lowest common denominators between users. However, for the sake of understanding, let's use Time Spent on SEO as our defining characteristic.

Moz could define high level segments as follows:

  • Super Heavy Users – Time spent on SEO: more than 50 hours/week
  • Heavy Users – Time spent on SEO: 35-50 hours/week
  • Medium Users – Time spent on SEO: 20-35 hours/week
  • Light Users – Time spent on SEO: 5-20 hours/week

We know Moz wants to target business owners. From the high-level insights, we have identified business owners that are super-heavy users as a segment of opportunity, since many of them are currently basic users. Now, to drill down into one of those segments we could target basic users that have "Link Building" listed as one of their interests, and spend more than 50 hours a week on SEO. Let's call this segment "Basic-50-LB." Based on the data this is indeed a valid segment:

We now know a lot about what this segment is interested in, so we can then test and optimize against it.

Now let's compare this to the interests of the business owners that are heavy SEO users and have Pro accounts. It appears to be somewhat different.

The question we want to answer is, why? And how do we push those basic users to become Pro users? There are a lot of things worth testing on the basic users to see if we can discover what affects their perception of Moz's value.

With that segment defined, Moz could track what type of content performs and then dynamically surface that type of content for that user when they log in. Moz could also track how many times that user type has to see a specific type of post before they are likely to become a Pro user. This is where geniuses like Dr. Matt Peters and Dr. Pete Meyers come in and build predictive models, and Moz's entire digital marketing mix starts to make Target's pregnancy-prediction tactic look old school.

Further, Moz could see which products a given segment likes using the most and use that to inform their product roadmap. Did members of this segment become Pro users once Followerwonk was released? Did signups increase once the Social Authority API rolled out? And finally, Moz could get more aggressive with these tests and send segmented emails to users that cancel, in hopes of bringing them back to Pro. For example, a user very interested in link building would get emails with all of the recent link building posts, Q&A, and discussion.

But to do this, we first need to set up Google Analytics to measure these cohorts. To do so, we need to create a new custom segment that looks for the Custom Variable we'll be setting when a user starts their session.

Steps to do so are as follows:

  • Click the Down Arrow below your Segment name
  • Click Create New Segment
  • Click Conditions under Advanced
  • Select Users and Include next to Filter
  • Select the Custom Variable you will be setting from the drop-down of available dimensions
  • Choose "Contains" and then type in the value, which would be the segment name Basic-50-LB

We'd also do this for the segment we'd like to compare it to as well as capture the higher level segment "Basic-50" for bigger-picture insights.

This is something we do in the measurement planning phase with our clients here at iAcquire. It's actually incredibly simple: when a user logs in, pull their profile, identify which segment they belong to, and then fire off a custom variable like so:

_gaq.push(['_setCustomVar', 1, 'userSegment', userSegmentName, 1]);

The steps leading up to firing the custom variable will require some custom programming, but I promise you that it's nothing more than a bunch of if-then statements. Tell your developer to relax.
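
To make that concrete, here is a minimal sketch of those if-then statements, assuming a profile object with fields like hoursPerWeek, membershipType, and interests; those names (and currentUserProfile) are mine for illustration, not Moz's actual schema.

// Bucket a user into a segment name like "Basic-50-LB" based on their profile.
// Field names are illustrative placeholders.
function getUserSegment(profile) {
  var usage;
  if (profile.hoursPerWeek > 50) {
    usage = '50';          // super heavy user
  } else if (profile.hoursPerWeek >= 35) {
    usage = '35-50';       // heavy user
  } else if (profile.hoursPerWeek >= 20) {
    usage = '20-35';       // medium user
  } else {
    usage = '5-20';        // light user
  }
  var segment = profile.membershipType + '-' + usage; // e.g. "Basic-50"
  if (profile.interests.indexOf('Link Building') !== -1) {
    segment += '-LB';      // e.g. "Basic-50-LB"
  }
  return segment;
}

// On login, label the visitor before any pageviews are tracked
// (assumes the classic Google Analytics _gaq snippet is already on the page).
var userSegmentName = getUserSegment(currentUserProfile);
_gaq.push(['_setCustomVar', 1, 'userSegment', userSegmentName, 1]);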

Ultimately, what you'll get is these segments sitting in context with the rest of your analytics data, allowing for very precise user insights that are completely relevant to you. In some ways this approach is actually better than Keyword-Level Demographics, because it doesn't require a user to be logged into Facebook and it leverages the data within the user profiles.

I know what you're thinking: "How does this apply to my site or my clients? It will be impossible for my site to get users to create a profile and fill it out." Well, can you get just a social handle or an email address? OK, then I've got a couple of solutions for that as well: FullContact and RapLeaf.

It turns out that FullContact does more than just give Paid Paid Vacation; they are also a contact data provider. Both RapLeaf and FullContact allow you to pass minimal information about a user and get a ton back. Here is some high-level information from their respective sites.

FullContact

RapLeaf

So remember when I said the email was difficult to scrape? The social handle was not. I'd be all set for lead gen with just a few API calls.

Using one of these solutions, you could pull a user's data when they sign up, use it to determine their segment or persona, save that to your database, and cookie them. This way there's no need for them to create a profile or opt in in any way aside from the initial signup. Also, as long as they don't kill their cookies, the user doesn't even have to explicitly sign in. Sometimes the Internet feels like magic.
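
As a rough sketch of that signup flow, the Node.js snippet below calls a generic person-enrichment endpoint and maps the response onto a segment. The ENRICH_URL and ENRICH_API_KEY environment variables, the query parameters, the response shape, and the segmentation rule are all placeholders; check FullContact's or RapLeaf's current documentation for the real request format.

// Enrich a signup email via a generic person-data API, then derive a segment.
// Endpoint, auth, and response fields are placeholders, not a specific vendor's API.
var https = require('https');

function enrichAndSegment(email, callback) {
  var url = process.env.ENRICH_URL +
            '?email=' + encodeURIComponent(email) +
            '&apiKey=' + process.env.ENRICH_API_KEY;
  https.get(url, function (res) {
    var body = '';
    res.on('data', function (chunk) { body += chunk; });
    res.on('end', function () {
      var person = JSON.parse(body);
      // Map whatever the provider returns onto your own segments (illustrative rule).
      var segment = (person.organizations && person.organizations.length > 1)
        ? 'Agency'
        : 'Independent';
      callback(null, segment); // save to your database and set a cookie from here
    });
  }).on('error', callback);
}

enrichAndSegment('user@example.com', function (err, segment) {
  if (err) { return console.error(err); }
  console.log('New signup segmented as:', segment);
});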

You guys know I can't give you a good idea without leaving you a way to use it.

Josh's scraper code

Since SEOmoz became Moz, there have been enough changes to the structure of the site that this code will no longer work; however, it's a good starting point if you'd like to build a scraper for competitor user profiles in the future. You can find it (and some other cool things) on the iAcquire Github repository for you to enjoy.

More market research resources from J-Li

We take market research pretty seriously here at iAcquire. Here are two posts you shouldn't miss from our Inbound Marketing Analyst Jiafeng Li.

Cohort analytics stuff

At iAcquire, search is our craft, and this post is just another example of an element of the new SEO process at work. This is the type of stuff my team incorporates into SEO on a daily basis, in addition to the creative technical ideas we come up with. The fact is, we live in the information age where big data reigns supreme, but let's not forget smaller data like we've just examined.

So it looks like Roger, Gary, and Cogswell are ready to do better marketing. Are you?

And yes, it feels amazing to be back on the blog.

