Training Data

The dataset behind SAGE

SAGE is trained on roughly 10,000 Steam games merged from SteamSpy (ownership & review aggregates) and the Steam Web API (store-page metadata). The pipeline below shows exactly how the raw sources become the steam_10k_prelaunch.csv file used by the model.

10,000 Games (rows)

114 Columns total (including top-50 tags)

99 Pre-launch features used (including top-50 tags)

6 Owner-tier classes

1 · Data sources

SteamSpy (base CSV)

Provides aggregate ownership buckets, review counts, playtime statistics, price, genre flags, and release month. This forms the base table keyed by appid.

owners
positive / negative
price
release_month
genre flags

Steam Web API (JSON)

Per-game store-page metadata: descriptions, screenshots, trailers, supported languages, platforms, categories, tags, packages, and achievements. Loaded as a dict keyed by appid.

screenshots
movies
categories
tags
supported_languages
publishers

2 · Preprocessing pipeline

Two scripts produce the final training file. The first enriches and merges; the second selects pre-launch-only features and builds the classification target.

Step 1 · Load base CSV & Steam JSON

Read SteamSpy CSV into a DataFrame; load Steam Web API JSON keyed by appid string.
Step 2 · Extract pre-launch features from JSON

For each game, derive columns: platform support, language list & weighted score, screenshot counts, description lengths, Steam category flags (achievements, cloud save, controller, VR…), top-50 tag binary columns, package & SKU counts, and game_age_days from release date.
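
The extraction step can be illustrated with a minimal sketch that derives a handful of the columns named above from a single store-page record. The JSON key names (`platforms`, `screenshots`, `short_description`, `about_the_game`, `website`) and the helper `extract_prelaunch_features` are assumptions made for illustration; they are not taken from enrich_prelaunch.py.

```python
def extract_prelaunch_features(app: dict) -> dict:
    """Derive a few pre-launch columns from one store-page record.

    The key names read from `app` are assumed, not confirmed by the pipeline.
    Missing or null fields fall back to empty values so every game gets a row.
    """
    platforms = app.get("platforms") or {}
    flags = {
        f"platform_{name}": int(bool(platforms.get(name)))
        for name in ("windows", "mac", "linux")
    }
    return {
        **flags,
        "platform_count": sum(flags.values()),
        "screenshot_count": len(app.get("screenshots") or []),
        "short_desc_length": len(app.get("short_description") or ""),
        "about_length": len(app.get("about_the_game") or ""),
        "has_website": int(bool(app.get("website"))),
    }


# Hypothetical record used only to exercise the function.
example = {
    "platforms": {"windows": True, "mac": False, "linux": True},
    "screenshots": [{"id": 0}, {"id": 1}, {"id": 2}],
    "short_description": "A cozy farming sim.",
    "website": None,
}
features = extract_prelaunch_features(example)
print(features)
```

Using `or` fallbacks rather than bare `.get()` defaults keeps the function safe when a key is present but null, which is common in scraped metadata.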
Step 3 · Merge on appid (left join)

Base CSV ⟕ JSON features. Rows with no JSON match keep their base columns; new numeric features are filled with 0.
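
A small pandas sketch of this merge, using toy data, is given here. It assumes the base table stores appid as an integer while the JSON dict uses string keys (JSON object keys are always strings), so the keys are cast before joining. The column values are made up for illustration.

```python
import pandas as pd

# Base table keyed by integer appid, standing in for the SteamSpy CSV.
base = pd.DataFrame({"appid": [10, 20, 30], "price": [999, 0, 1999]})

# JSON-derived features keyed by appid string; appid 20 has no JSON match.
json_features = {
    "10": {"screenshot_count": 8, "platform_count": 3},
    "30": {"screenshot_count": 5, "platform_count": 1},
}

feats = pd.DataFrame.from_dict(json_features, orient="index")
feats.index = feats.index.astype(int)  # "10" -> 10 so the join keys match
feats = feats.rename_axis("appid").reset_index()

# Left join: every base row survives, matched or not.
merged = base.merge(feats, on="appid", how="left")

# Zero-fill only the newly added numeric columns; base columns stay untouched.
new_cols = [c for c in feats.columns if c != "appid"]
merged[new_cols] = merged[new_cols].fillna(0).astype(int)
print(merged)
```

Restricting `fillna` to the new columns matters: filling the whole frame would also overwrite genuine gaps in the base columns with zeros.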
Step 4 · Engineer composite scores

Five derived scores combine raw signals into interpretable proxies: store_page_score, platform_reach, marketing_score, localization_score, and steam_integration. An is_mature_content flag is added alongside them. Language support is encoded as weighted_language_score using per-language market weights.
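
As an illustration of the weighted language encoding, here is a minimal sketch that sums a market weight per supported language. The weights in `LANGUAGE_WEIGHTS`, the `DEFAULT_WEIGHT` fallback, and the choice of a plain sum are all placeholders; the actual weights and formula used by SAGE are not stated in this document.

```python
# Hypothetical per-language market weights (placeholders, not SAGE's values).
LANGUAGE_WEIGHTS = {
    "English": 1.0,
    "Simplified Chinese": 0.8,
    "German": 0.4,
}
DEFAULT_WEIGHT = 0.1  # assumed fallback for languages without an explicit weight


def weighted_language_score(languages: list[str]) -> float:
    """Sum the weight of each distinct supported language."""
    return sum(LANGUAGE_WEIGHTS.get(lang, DEFAULT_WEIGHT) for lang in set(languages))


score = weighted_language_score(["English", "German", "Klingon"])
print(round(score, 2))
```

Compared with a raw language count, a weighted sum lets one large-market language count for more than several small-market ones.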
Step 5 · Drop post-launch columns (leakage prevention)

Reviews, playtime, CCU, and owners themselves are excluded from the model. None of them exist before launch, and the target is built from owners, so keeping them would leak it.
Step 6 · Build ordinal target & split

Owners bucket → ordinal class 0–5 (top sparse buckets merged into ≥750K). Median imputation for any remaining numeric NaNs, then a stratified 80/20 train-test split followed by z-score scaling.
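
The imputation, split, and scaling order described in this step can be sketched with the standard library alone, which keeps the mechanics visible. This is a swapped-in, from-scratch version for illustration; a real pipeline would more likely rely on scikit-learn's `train_test_split(..., stratify=y)` and `StandardScaler`. The helper names and toy data are hypothetical.

```python
import random
import statistics
from collections import defaultdict


def median_impute(column):
    """Replace None entries with the median of the observed values."""
    median = statistics.median(v for v in column if v is not None)
    return [median if v is None else v for v in column]


def stratified_split(labels, test_frac=0.2, seed=0):
    """Return (train_idx, test_idx), splitting each class at roughly test_frac."""
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_frac)
        test_idx += indices[:n_test]
        train_idx += indices[n_test:]
    return sorted(train_idx), sorted(test_idx)


def zscore_params(values):
    """Mean and population standard deviation of the given values."""
    return statistics.fmean(values), statistics.pstdev(values)


# Toy data: one numeric feature with gaps, and a 3:1 class imbalance.
feature = median_impute([5.0, None, 7.0, 9.0, 11.0] * 8)
labels = ([0] * 3 + [1] * 1) * 10

train_idx, test_idx = stratified_split(labels)

# Scaling statistics come from the training rows only and are reused for test.
mean, std = zscore_params([feature[i] for i in train_idx])
train_scaled = [(feature[i] - mean) / std for i in train_idx]
test_scaled = [(feature[i] - mean) / std for i in test_idx]
print(len(train_idx), len(test_idx))
```

Stratifying matters here because the majority class holds about 69% of the rows; an unstratified split could leave the rarest tier almost absent from the test set.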

3 · Target variable

The model predicts an ordinal owner tier, not a raw owner count. The original 9 SteamSpy buckets are collapsed to 6 because the top 4 had too few samples (n = 136, 88, 53, 19) for reliable learning.

Class	Owner tier	Notes
`0`	≤ 10K	Majority class (~69%)
`1`	≤ 35K
`2`	≤ 75K
`3`	≤ 150K
`4`	≤ 350K
`5`	≥ 750K	Merged top 4 sparse buckets
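
The bucket-to-class mapping might look like the following sketch. It rests on two assumptions that are not confirmed here: that the SteamSpy owners field is a range string of the form "low .. high", and that the tier labels in the table are the midpoints of those ranges. The function name `owners_to_class` is hypothetical.

```python
# Assumed bucket midpoints for classes 0 to 4, read off the tier column above.
MIDPOINTS = [10_000, 35_000, 75_000, 150_000, 350_000]
TOP_CLASS = len(MIDPOINTS)  # class 5: every bucket with midpoint >= 750K


def owners_to_class(owners: str) -> int:
    """Map a range string such as '20,000 .. 50,000' to an ordinal class 0-5."""
    low, high = (int(part.replace(",", "")) for part in owners.split(".."))
    midpoint = (low + high) // 2
    if midpoint >= 750_000:
        return TOP_CLASS
    # Raises ValueError for an unexpected bucket instead of guessing a class.
    return MIDPOINTS.index(midpoint)


print(owners_to_class("0 .. 20,000"))
print(owners_to_class("200,000 .. 500,000"))
print(owners_to_class("2,000,000 .. 5,000,000"))
```

Keeping the classes as ordered integers, rather than unordered category labels, preserves the fact that misclassifying tier 0 as tier 1 is a smaller error than misclassifying it as tier 5.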

4 · Pre-launch features (99)

The 99 features include binary indicator columns for the top-50 tags (tag_*). Only signals that exist before a game ships are used. Features are grouped by origin:

Pricing & release

price
initialprice
is_free
release_month
release_quarter
release_dayofweek
release_is_q4
release_is_holiday
release_is_summer
release_is_tuesday
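
The calendar features in this group can all be derived from a single planned release date, as in the sketch below. The month windows used for `release_is_summer` (June to August) and `release_is_holiday` (November to December) are assumptions; the pipeline's actual definitions are not given in this document.

```python
from datetime import date


def release_features(d: date) -> dict:
    """Derive calendar features from a planned release date."""
    quarter = (d.month - 1) // 3 + 1
    return {
        "release_month": d.month,
        "release_quarter": quarter,
        "release_dayofweek": d.weekday(),  # Monday = 0
        "release_is_q4": int(quarter == 4),
        "release_is_tuesday": int(d.weekday() == 1),
        # Assumed windows, chosen only for illustration.
        "release_is_summer": int(d.month in (6, 7, 8)),
        "release_is_holiday": int(d.month in (11, 12)),
    }


release = release_features(date(2023, 11, 14))
print(release)
```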

Genre flags (one-hot)

Action
Adventure
RPG
Strategy
Simulation
Sports
Racing

Store-page

screenshot_count
about_length
short_desc_length
has_detailed_desc
has_website
has_support_email
required_age

Platform & localization

platform_count
platform_windows
platform_mac
platform_linux
supported_languages_count
full_audio_languages_count
weighted_language_score

Steam ecosystem flags

has_achievements
has_cloud_save
has_controller_support
has_vr_support
has_in_app_purchases
has_family_sharing
achievement_count
category_count

Tags & packaging

tag_count
is_multiplayer
tag_* (top-50 binaries)
package_count
sku_count
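
One way to build the tag columns is to count how many games carry each tag, keep the k most frequent, and emit one binary column per kept tag. The sketch below does this on toy data with k = 2; the `tag_<name>` column naming and the helper `top_tag_columns` are assumptions for illustration.

```python
from collections import Counter


def top_tag_columns(game_tags: dict, k: int = 50):
    """Return (top_tags, rows); each row holds tag_count plus tag_* binaries."""
    # Count each tag once per game, so duplicates within a game do not inflate it.
    counts = Counter(tag for tags in game_tags.values() for tag in set(tags))
    top_tags = [tag for tag, _ in counts.most_common(k)]
    rows = {}
    for appid, tags in game_tags.items():
        row = {"tag_count": len(tags)}
        for tag in top_tags:
            row[f"tag_{tag}"] = int(tag in tags)
        rows[appid] = row
    return top_tags, rows


# Hypothetical tag lists keyed by appid.
game_tags = {
    10: ["Action", "FPS", "Multiplayer"],
    20: ["Action", "Indie"],
    30: ["Indie", "Puzzle", "Action"],
}
top_tags, rows = top_tag_columns(game_tags, k=2)
print(top_tags)
print(rows[20])
```

Because a game can carry several tags at once, each row may have more than one tag_* column set to 1.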

Engineered composite scores

store_page_score
platform_reach
marketing_score
localization_score
steam_integration
is_mature_content

Columns explicitly excluded (post-launch / leakage)

positive
negative
total_reviews
positive_ratio
average_forever
average_2weeks
median_forever
median_2weeks
ccu
owners
log_owners
json_price_raw
appid
has_trailer
trailer_count
dlc_count
has_trading_cards
has_workshop
is_solo_dev
has_publisher
publisher_count
developer_count
has_multiplayer_tag
Indie
publisher_backing
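
A simple guard against leakage is to filter the column list through the exclusion set above and fail loudly if anything slips through. The sketch below is one way to do that; the function names and the toy header are hypothetical, while the set contents are copied from the list above.

```python
# Copied from the exclusion list above.
EXCLUDED = {
    "positive", "negative", "total_reviews", "positive_ratio",
    "average_forever", "average_2weeks", "median_forever", "median_2weeks",
    "ccu", "owners", "log_owners", "json_price_raw", "appid",
    "has_trailer", "trailer_count", "dlc_count", "has_trading_cards",
    "has_workshop", "is_solo_dev", "has_publisher", "publisher_count",
    "developer_count", "has_multiplayer_tag", "Indie", "publisher_backing",
}


def select_features(columns):
    """Keep columns in their original order, minus anything excluded."""
    return [c for c in columns if c not in EXCLUDED]


def assert_no_leakage(feature_columns):
    """Raise if an excluded column appears in the feature set."""
    leaked = EXCLUDED.intersection(feature_columns)
    if leaked:
        raise ValueError(f"leaky columns in feature set: {sorted(leaked)}")


# Toy header standing in for the merged table's columns.
header = ["appid", "price", "owners", "ccu", "screenshot_count", "Indie", "RPG"]
features = select_features(header)
assert_no_leakage(features)
print(features)
```

Running the check again just before training catches columns that are re-added by later feature engineering.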

5 · Reproducibility

The full pipeline lives in two Python scripts. Output is the file the model trains on:

enrich_prelaunch.py: Merges SteamSpy CSV with Steam API JSON and engineers composite features.

train_prelaunch_model.py: Selects pre-launch features, builds the ordinal target, trains the stacked ensemble.

steam_10k_prelaunch.csv: Final 10,000 × 114 training file consumed by the model.
