Note: if you're new to writing SQL or BigQuery SQL, and are looking to write some custom queries of your WQA data, check out this SQL tutorial on Coding is for Losers.
The Website Quality Audit writes a number of data tables to your BigQuery warehouse, across 3 buckets:
- Raw data feeds from Deepcrawl, Google Analytics and Google Search Console
- Aggregated data tables, rolling up your traffic, keywords and crawls
- Action recommendation tables, where the aforementioned actions logic is applied
- Visualization tables, which prep your data for life as a Data Studio or Sheets workbook
When you first crack open your BigQuery console, you may be overwhelmed by the number of tables in your WQA dataset.
Not to worry - most teams really only use one table: actions_data_studio.
And for the rest of the tables, they following a strict naming conversion (a convention that we deploy across all Recipes):
BigQuery Table Naming Conventions
_proc
Models ending in _proc.sql (ga_proc.sql) directly process raw data out of BigQuery, and perform functions like deduping, renaming columns, and performing basic transformations of data.
_proc models do *not* join data together.
Every table used in modeling will first be pulled in using a _proc model, before any joins or more complex calculations are done.
_stats
Models ending in _stats (ie ga_stats.sql) denote a final model of KPIs for a given data source.
These _stats models are then rolled up into higher-level models reporting - they are the final stats from a single data source.
agg_
Models beginning in agg_ (agg_traffic.sql) join data from two or more _stats models together. For example, agg_traffic.sql joins together stats tables from Search Console and Google Analytics.
Agg_ models do not read BigQuery tables directly, but instead read other models.
actions_
Models beginning in actions_ relate to the actions recommendation algorithm. There are three layers: _proc, _hierarchy, and then _data_studio and _count as final visualization tables.
BigQuery Table Definitions
See below for details on how each table in your BigQuery dataset is derived, and how you can use it.
As we mentioned, most teams only use the visualization tables, but this table dictionary is here for you as a reference in case you need it.
Admin tables
crawl_dates
Identifies available crawls from Deepcrawl for a given site, and applies them to the appropriate report month.
For a given report month, the preferred crawl will be the latest one run in the month *following* the listed report month.
For July 2020's report, the chosen crawl will be the latest one taken in August 2020.
If no crawl was taken in August 2020, the crawl data will fall back to the latest previous crawl (July 2020 or earlier).
dates
Identifies report dates that are available for viewing.
An eligible report date is any that has data from both Google Analytics + Google Search Console for the same site.
For eligible report month dates, this model also specifies the corresponding MoM and YoY dates.
domains_proc
Contains the relevant account settings for a given site: its site name, domain name (stripped off protocol + subdomain if www), and its Google Analytics + Search Console accounts.
Base feed processing tables
Deepcrawl
deepcrawl_proc
Processes the raw Deepcrawl data after upload to BigQuery. Dedupes, strips URLs of query strings / trailing slashes, and adds some derived fields based on the crawl: the domain, url protocol, crawl date, etc.
deepcrawl_url_proc
Reads the deepcrawl_proc table, and calculates additional tags for the URL (for use in page type classification): whether that page contains a higher than average number of prices on the page, form submission options, whether a schema.org schema was found on the page, etc.
deepcrawl_stats
Final Deepcrawl table used in aggregation. Gives each page a page type score across all page types, and applies a crawl date to a given report date.
Options for page classification are:
Homepage, 404, product, article, event, local, lead generation, general content, and thin content (general content of <500 words).
Google Analytics
ga_proc
Processes the raw Google Analytics data after upload to BigQuery. Dedupes, strips URLs of any trailing query strings or slashes, and filters out data from hostnames that are outside of the domain we're analyzing.
Also calculates metrics like conversion rates (for goals + ecommerce), bounce rate, and average time on site.
ga_stats
Final Google Analytics table used in aggregation. Aggregates the MoM and YoY changes for a given URL.
Google Search Console
search_console_keyword_proc
Processes raw Search Console keyword-level data after upload to BigQuery. Dedupes, strips URLS, tags branded keywords, and calculates metrics like average position.
Filter includes keywords with only > 50 monthly impressions that appear in the top 25 positions.
search_console_url_keyword_stats
Reads the search_console_keyword_proc table, and derives 'Main' (by impressions) and 'Top' (by average position) keywords.
Identifies the top URL for those main / top keywords, in order to verify that your site isn't suffering from keyword cannibalization.
Also calculates keyword distribution (top 5 / top 10 / top 20).
search_console_keyword_stats
Reads the search_console_proc table, and generates a list of all active keyword + URL combinations present (filtered for non-branded keywords with > 50 monthly impressions appearing above position 25).
search_console_url_proc
Processes raw Search Console URL-level data after upload to BigQuery. Dedupes, strips URLs, and calculates metrics like click-through rate and average position.
search_console_url_stats
Reads the search_console_url_proc table. Aggregates the MoM and YoY changes in impressions, clicks, click-through rate + average position for a given URL.
Aggregated data tables
agg_traffic
Joins together data from Search Console + Google Analytics.
agg_all
Joins together data from agg_traffic (Search Console + Google Analytics) with Deepcrawl data from deepcrawl_stats.
Applies a number of conditional flags, that are used downstream when applying actions:
- Found in sitemap
- Rank of the URL by page type
- Sums sessions of the URL's first, second and last subfolders
- Whether the page is gaining or losing traffic MoM and YoY
- Identifies whether the URL's title and description contain their main keyword
- Identifies cannibalization in top / main keyword (ie whether another URL ranks higher for the given keyword)
Also calculates a number of metrics:
- % change in GA + GSC metrics month-over-month and year-over-year
- Sitewide medians for conversion rates, referring domains
Action recommendation tables
actions_hierarchy
Takes aggregated data from agg_all, and applies action tags for each URL.
Actions are bucketed into technical, on-page, off-page and architectural actions:
Technical actions
- http_status_action
- sitemap_action
- canonical_action
- crawl_action
On-page actions
- content_action
- meta_rewrite_action
- pagination_action
- schema_action
Off-page actions
- external_link_action
Architectural actions
- internal_link_action
- category_action
- cannibalization_action
Detailed action definition logic can be found in this doc (link).
Also contains stats + trends across GA + GSC metrics.
actions_reviewed
Applies action + page type updates that you've made from your Sheets workbook.
Visualization tables
actions_data_studio
This is the primary table used downstream in the Sheets workbook + Data Studio dashboard.
Reads data from actions_hierarchy, but filters out columns that aren’t required for decision-making.
actions_count_data_studio
Counts the number of URLs in various action buckets over time. Used in the Data Studio dashboard for seeing how the site has progressed each month.
Questions?
Drop us a note in live chat, and we'll help you navigate your BigQuery dataset.