Note: if you're new to writing SQL or BigQuery SQL, and are looking to write some custom queries of your WQA data, check out this SQL tutorial on Coding is for Losers.

The Website Quality Audit writes a number of data tables to your BigQuery warehouse, across four buckets:

  • Raw data feeds from Deepcrawl, Google Analytics and Google Search Console

  • Aggregated data tables, rolling up your traffic, keywords and crawls

  • Action recommendation tables, where the aforementioned actions logic is applied

  • Visualization tables, which prep your data for life as a Data Studio or Sheets workbook

When you first crack open your BigQuery console, you may be overwhelmed by the number of tables in your WQA dataset.

Not to worry - most teams really only use one table: actions_data_studio.

The rest of the tables follow a strict naming convention (one we deploy across all Recipes):

BigQuery Table Naming Conventions

_proc

Models ending in _proc.sql (e.g. ga_proc.sql) process raw data directly out of BigQuery, handling steps like deduping, renaming columns, and basic data transformations.

_proc models do *not* join data together.

Every table used in modeling will first be pulled in using a _proc model, before any joins or more complex calculations are done.
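To make that concrete, here's a minimal sketch of what a _proc model typically does - deduping raw rows and cleaning up URLs. The table and column names below are hypothetical, not the exact WQA schema:

  -- a minimal _proc-style sketch (hypothetical names)
  SELECT
    -- strip the query string, then any trailing slash, from the URL
    REGEXP_REPLACE(SPLIT(page_url, '?')[OFFSET(0)], r'/+$', '') AS url,
    session_date,
    sessions
  FROM raw.ga_sessions
  WHERE page_url IS NOT NULL
  -- dedupe: keep the most recently loaded row per URL + date
  QUALIFY ROW_NUMBER() OVER (
    PARTITION BY page_url, session_date
    ORDER BY loaded_at DESC
  ) = 1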

_stats

Models ending in _stats (e.g. ga_stats.sql) denote a final model of KPIs for a given data source.

These _stats models are then rolled up into higher-level reporting models - they are the final stats from a single data source.
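For illustration, a _stats-style rollup might read a _proc model and produce one row of KPIs per URL per report month - something like this sketch (column names are assumptions):

  -- hypothetical _stats-style rollup
  SELECT
    url,
    DATE_TRUNC(session_date, MONTH) AS report_month,
    SUM(sessions) AS sessions,
    SUM(goal_completions) AS goal_completions,
    SAFE_DIVIDE(SUM(goal_completions), SUM(sessions)) AS conversion_rate
  FROM ga_proc
  GROUP BY url, report_month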

agg_

Models beginning in agg_ (e.g. agg_traffic.sql) join data from two or more _stats models together. For example, agg_traffic.sql joins together the stats tables from Search Console and Google Analytics.

Agg_ models do not read BigQuery tables directly, but instead read other models.
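At its core, an agg_ model is just a join across _stats models - roughly in this spirit (column names are assumptions):

  -- hypothetical agg_-style join of two _stats models
  SELECT
    url,
    report_month,
    ga.sessions,
    gsc.clicks,
    gsc.impressions
  FROM ga_stats AS ga
  LEFT JOIN search_console_url_stats AS gsc
    USING (url, report_month)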

actions_

Models beginning in actions_ relate to the actions recommendation algorithm. There are three layers: _proc, _hierarchy, and then the final visualization tables, _data_studio and _count.

BigQuery Table Definitions

See below for details on how each table in your BigQuery dataset is derived, and how you can use it.

As we mentioned, most teams only use the visualization tables, but this table dictionary is here for you as a reference in case you need it.

Admin tables

crawl_dates

Identifies available crawls from Deepcrawl for a given site, and applies them to the appropriate report month.

For a given report month, the preferred crawl will be the latest one run in the month *following* the listed report month.

For July 2020's report, the chosen crawl will be the latest one taken in August 2020.

If no crawl was taken in August 2020, the crawl data will fall back to the latest previous crawl (July 2020 or earlier).
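In SQL terms, that selection logic looks roughly like the sketch below (table and column names are hypothetical, and report_month is assumed to be the first day of the month):

  -- prefer the latest crawl from the following month,
  -- else fall back to the latest earlier crawl
  SELECT
    m.report_month,
    COALESCE(
      MAX(IF(DATE_TRUNC(c.crawl_date, MONTH)
               = DATE_ADD(m.report_month, INTERVAL 1 MONTH),
             c.crawl_date, NULL)),
      MAX(IF(c.crawl_date < DATE_ADD(m.report_month, INTERVAL 1 MONTH),
             c.crawl_date, NULL))
    ) AS chosen_crawl_date
  FROM report_months AS m
  CROSS JOIN crawls AS c
  GROUP BY m.report_month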

dates

Identifies report dates that are available for viewing.

An eligible report date is any date with data from both Google Analytics + Google Search Console for the same site.

For eligible report month dates, this model also specifies the corresponding MoM and YoY dates.
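Deriving those comparison dates is straightforward in BigQuery - roughly (names hypothetical):

  SELECT
    report_month,
    DATE_SUB(report_month, INTERVAL 1 MONTH) AS mom_month,
    DATE_SUB(report_month, INTERVAL 1 YEAR) AS yoy_month
  FROM eligible_dates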

domains_proc

Contains the relevant account settings for a given site: its site name, its domain name (stripped of the protocol, and of the subdomain when it's www), and its Google Analytics + Search Console accounts.
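That domain cleanup can be expressed with a single regex - a sketch, with a hypothetical input table:

  SELECT
    site_name,
    -- drop the protocol, plus the subdomain only when it's www
    REGEXP_REPLACE(site_url, r'^https?://(www\.)?', '') AS domain
  FROM domains_raw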

Base feed processing tables

Deepcrawl

deepcrawl_proc

Processes the raw Deepcrawl data after upload to BigQuery. Dedupes, strips URLs of query strings / trailing slashes, and adds some derived fields based on the crawl: the domain, url protocol, crawl date, etc.

deepcrawl_url_proc

Reads the deepcrawl_proc table, and calculates additional tags for each URL (for use in page type classification): whether the page contains a higher-than-average number of prices, whether it offers form submission options, whether a schema.org schema was found on the page, etc.

deepcrawl_stats

Final Deepcrawl table used in aggregation. Gives each page a page type score across all page types, and applies a crawl date to a given report date.

Options for page classification are:

Homepage, 404, product, article, event, local, lead generation, general content, and thin content (general content of <500 words).
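As a rough illustration of the thin content rule only (the actual WQA classifier scores every page type, and these column names are assumptions):

  SELECT
    url,
    IF(page_type = 'general content' AND word_count < 500,
       'thin content',
       page_type) AS page_type
  FROM deepcrawl_url_proc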

Google Analytics

ga_proc

Processes the raw Google Analytics data after upload to BigQuery. Dedupes, strips URLs of any trailing query strings or slashes, and filters out data from hostnames that are outside of the domain we're analyzing.

Also calculates metrics like conversion rates (for goals + ecommerce), bounce rate, and average time on site.
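Those metrics are simple ratios - roughly like this sketch, with hypothetical column names (SAFE_DIVIDE returns NULL rather than erroring when a page has zero sessions):

  SELECT
    url,
    SAFE_DIVIDE(goal_completions, sessions) AS goal_conversion_rate,
    SAFE_DIVIDE(transactions, sessions) AS ecommerce_conversion_rate,
    SAFE_DIVIDE(bounces, sessions) AS bounce_rate,
    SAFE_DIVIDE(total_session_duration, sessions) AS avg_time_on_site
  FROM ga_deduped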

ga_stats

Final Google Analytics table used in aggregation. Aggregates the MoM and YoY changes for a given URL.
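A MoM change like this is typically computed with a window function - here's a sketch (YoY works the same way with a 12-month offset; names hypothetical):

  SELECT
    url,
    report_month,
    sessions,
    SAFE_DIVIDE(
      sessions - LAG(sessions) OVER w,
      LAG(sessions) OVER w
    ) AS sessions_mom_change
  FROM ga_monthly
  WINDOW w AS (PARTITION BY url ORDER BY report_month)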

Google Search Console

search_console_keyword_proc

Processes raw Search Console keyword-level data after upload to BigQuery. Dedupes, strips URLs, tags branded keywords, and calculates metrics like average position.

The filter keeps only keywords with more than 50 monthly impressions that appear in the top 25 positions.
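In SQL, that filter boils down to a simple WHERE clause - roughly (column names hypothetical):

  SELECT
    keyword,
    url,
    impressions,
    avg_position
  FROM search_console_keyword_raw
  WHERE impressions > 50
    AND avg_position <= 25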

search_console_url_keyword_stats

Reads the search_console_keyword_proc table, and derives 'Main' (by impressions) and 'Top' (by average position) keywords.

Identifies the top URL for those main / top keywords, in order to verify that your site isn't suffering from keyword cannibalization.

Also calculates keyword distribution (top 5 / top 10 / top 20).
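The distribution counts can be sketched with COUNTIF - something like this (names hypothetical):

  SELECT
    url,
    COUNTIF(avg_position <= 5) AS keywords_in_top_5,
    COUNTIF(avg_position <= 10) AS keywords_in_top_10,
    COUNTIF(avg_position <= 20) AS keywords_in_top_20
  FROM search_console_keyword_proc
  GROUP BY url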

search_console_keyword_stats

Reads the search_console_keyword_proc table, and generates a list of all active keyword + URL combinations (filtered for non-branded keywords with > 50 monthly impressions appearing in the top 25 positions).

search_console_url_proc

Processes raw Search Console URL-level data after upload to BigQuery. Dedupes, strips URLs, and calculates metrics like click-through rate and average position.

search_console_url_stats

Reads the search_console_url_proc table. Aggregates the MoM and YoY changes in impressions, clicks, click-through rate + average position for a given URL.

Aggregated data tables

agg_traffic

Joins together data from Search Console + Google Analytics.

agg_all

Joins together data from agg_traffic (Search Console + Google Analytics) with Deepcrawl data from deepcrawl_stats.

Applies a number of conditional flags that are used downstream when applying actions:

- Whether the URL was found in the sitemap

- The rank of the URL within its page type

- Session sums for the URL's first, second and last subfolders

- Whether the page is gaining or losing traffic MoM and YoY

- Whether the URL's title and description contain its main keyword

- Cannibalization in the top / main keyword (ie whether another URL ranks higher for the given keyword)

Also calculates a number of metrics:

- % change in GA + GSC metrics month-over-month and year-over-year

- Sitewide medians for conversion rates and referring domains (see the sketch below)
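For the sitewide medians, BigQuery's PERCENTILE_CONT analytic function is one way to express the calculation - a sketch, with hypothetical names:

  SELECT
    url,
    conversion_rate,
    PERCENTILE_CONT(conversion_rate, 0.5) OVER () AS sitewide_median_conversion_rate
  FROM agg_traffic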

Action recommendation tables

actions_hierarchy

Takes aggregated data from agg_all, and applies action tags for each URL.

Actions are bucketed into technical, on-page, off-page and architectural actions:

Technical actions

  • http_status_action

  • sitemap_action

  • canonical_action

  • crawl_action

On-page actions

  • content_action

  • meta_rewrite_action

  • pagination_action

  • schema_action

Off-page actions

  • external_link_action

Architectural actions

  • internal_link_action

  • category_action

  • cannibalization_action

Detailed action definition logic can be found in this doc (link).

Also contains stats + trends across GA + GSC metrics.

actions_reviewed

Applies action + page type updates that you've made from your Sheets workbook.

Visualization tables

actions_data_studio

This is the primary table used downstream in the Sheets workbook + Data Studio dashboard.

Reads data from actions_hierarchy, but filters out columns that aren’t required for decision-making.

actions_count_data_studio

Counts the number of URLs in various action buckets over time. Used in the Data Studio dashboard for seeing how the site has progressed each month.
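Conceptually, that's a grouped count over the actions data - roughly like this sketch (names hypothetical; the real table counts every action bucket, not just one):

  SELECT
    report_month,
    content_action,
    COUNT(DISTINCT url) AS url_count
  FROM actions_hierarchy
  GROUP BY report_month, content_action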

Questions?

Drop us a note in live chat, and we'll help you navigate your BigQuery dataset.
