Note: if you're new to writing SQL or BigQuery SQL, and are looking to write some custom queries of your WQA data, check out this SQL tutorial on Coding is for Losers.
The Website Quality Audit writes a number of data tables to your BigQuery warehouse, across 3 buckets:
Raw data feeds from Deepcrawl, Google Analytics and Google Search Console
Aggregated data tables, rolling up your traffic, keywords and crawls
Action recommendation tables, where the aforementioned actions logic is applied
Visualization tables, which prep your data for life as a Data Studio or Sheets workbook
When you first crack open your BigQuery console, you may be overwhelmed by the number of tables in your WQA dataset.
Not to worry - most teams really only use one table: actions_data_studio.
And for the rest of the tables, they following a strict naming conversion (a convention that we deploy across all Recipes):
BigQuery Table Naming Conventions
Models ending in _proc.sql (ga_proc.sql) directly process raw data out of BigQuery, and perform functions like deduping, renaming columns, and performing basic transformations of data.
_proc models do *not* join data together.
Every table used in modeling will first be pulled in using a _proc model, before any joins or more complex calculations are done.
Models ending in _stats (ie ga_stats.sql) denote a final model of KPIs for a given data source.
These _stats models are then rolled up into higher-level models reporting - they are the final stats from a single data source.
Models beginning in agg_ (agg_traffic.sql) join data from two or more _stats models together. For example, agg_traffic.sql joins together stats tables from Search Console and Google Analytics.
Agg_ models do not read BigQuery tables directly, but instead read other models.
Models beginning in actions_ relate to the actions recommendation algorithm. There are three layers: _proc, _hierarchy, and then _data_studio and _count as final visualization tables.
BigQuery Table Definitions
See below for details on how each table in your BigQuery dataset is derived, and how you can use it.
As we mentioned, most teams only use the visualization tables, but this table dictionary is here for you as a reference in case you need it.
Identifies available crawls from Deepcrawl for a given site, and applies them to the appropriate report month.
For a given report month, the preferred crawl will be the latest one run in the month *following* the listed report month.
For July 2020's report, the chosen crawl will be the latest one taken in August 2020.
If no crawl was taken in August 2020, the crawl data will fall back to the latest previous crawl (July 2020 or earlier).
Identifies report dates that are available for viewing.
An eligible report date is any that has data from both Google Analytics + Google Search Console for the same site.
For eligible report month dates, this model also specifies the corresponding MoM and YoY dates.
Contains the relevant account settings for a given site: its site name, domain name (stripped off protocol + subdomain if www), and its Google Analytics + Search Console accounts.
Base feed processing tables
Processes the raw Deepcrawl data after upload to BigQuery. Dedupes, strips URLs of query strings / trailing slashes, and adds some derived fields based on the crawl: the domain, url protocol, crawl date, etc.
Reads the deepcrawl_proc table, and calculates additional tags for the URL (for use in page type classification): whether that page contains a higher than average number of prices on the page, form submission options, whether a schema.org schema was found on the page, etc.
Final Deepcrawl table used in aggregation. Gives each page a page type score across all page types, and applies a crawl date to a given report date.
Options for page classification are:
Homepage, 404, product, article, event, local, lead generation, general content, and thin content (general content of <500 words).
Processes the raw Google Analytics data after upload to BigQuery. Dedupes, strips URLs of any trailing query strings or slashes, and filters out data from hostnames that are outside of the domain we're analyzing.
Also calculates metrics like conversion rates (for goals + ecommerce), bounce rate, and average time on site.
Final Google Analytics table used in aggregation. Aggregates the MoM and YoY changes for a given URL.
Google Search Console
Processes raw Search Console keyword-level data after upload to BigQuery. Dedupes, strips URLS, tags branded keywords, and calculates metrics like average position.
Filter includes keywords with only > 50 monthly impressions that appear in the top 25 positions.
Reads the search_console_keyword_proc table, and derives 'Main' (by impressions) and 'Top' (by average position) keywords.
Identifies the top URL for those main / top keywords, in order to verify that your site isn't suffering from keyword cannibalization.
Also calculates keyword distribution (top 5 / top 10 / top 20).
Reads the search_console_proc table, and generates a list of all active keyword + URL combinations present (filtered for non-branded keywords with > 50 monthly impressions appearing above position 25).
Processes raw Search Console URL-level data after upload to BigQuery. Dedupes, strips URLs, and calculates metrics like click-through rate and average position.
Reads the search_console_url_proc table. Aggregates the MoM and YoY changes in impressions, clicks, click-through rate + average position for a given URL.
Aggregated data tables
Joins together data from Search Console + Google Analytics.
Joins together data from agg_traffic (Search Console + Google Analytics) with Deepcrawl data from deepcrawl_stats.
Applies a number of conditional flags, that are used downstream when applying actions:
- Found in sitemap
- Rank of the URL by page type
- Sums sessions of the URL's first, second and last subfolders
- Whether the page is gaining or losing traffic MoM and YoY
- Identifies whether the URL's title and description contain their main keyword
- Identifies cannibalization in top / main keyword (ie whether another URL ranks higher for the given keyword)
Also calculates a number of metrics:
- % change in GA + GSC metrics month-over-month and year-over-year
- Sitewide medians for conversion rates, referring domains
Action recommendation tables
Takes aggregated data from agg_all, and applies action tags for each URL.
Actions are bucketed into technical, on-page, off-page and architectural actions:
Detailed action definition logic can be found in this doc (link).
Also contains stats + trends across GA + GSC metrics.
Applies action + page type updates that you've made from your Sheets workbook.
This is the primary table used downstream in the Sheets workbook + Data Studio dashboard.
Reads data from actions_hierarchy, but filters out columns that aren’t required for decision-making.
Counts the number of URLs in various action buckets over time. Used in the Data Studio dashboard for seeing how the site has progressed each month.
Drop us a note in live chat, and we'll help you navigate your BigQuery dataset.