Training Data

The dataset behind SAGE

SAGE is trained on roughly 10,000 Steam games merged from SteamSpy (ownership & review aggregates) and the Steam Web API (store-page metadata). The pipeline below shows exactly how the raw sources become the steam_10k_prelaunch.csv file used by the model.

10,000 Games (rows)
114 Columns total (including top-50 tags)
99 Pre-launch features used (including top-50 tags)
6 Owner-tier classes

1 · Data sources

SteamSpy (base CSV)

Provides aggregate ownership buckets, review counts, playtime statistics, price, genre flags, and release month. This forms the base table keyed by appid.

  • owners
  • positive / negative
  • price
  • release_month
  • genre flags

Steam Web API (JSON)

Per-game store-page metadata: descriptions, screenshots, trailers, supported languages, platforms, categories, tags, packages, and achievements. Loaded as a dict keyed by appid.

  • screenshots
  • movies
  • categories
  • tags
  • supported_languages
  • publishers
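
Because both sources are keyed by appid, they can be joined directly. The following is a minimal sketch, not the project's actual code: the rows are invented, and `extract_features` is a hypothetical helper that derives only two columns. It illustrates a left join that keeps unmatched base rows and fills the new numeric features with 0.

```python
import pandas as pd

# Toy stand-in for the SteamSpy base CSV (values invented for illustration).
base = pd.DataFrame(
    {
        "appid": [10, 20, 30],
        "price": [999, 0, 1999],
        "release_month": [11, 6, 2],
    }
)

# Toy stand-in for the Steam Web API JSON: a dict keyed by appid string.
steam_json = {
    "10": {"screenshots": [{"id": 0}, {"id": 1}], "movies": [{"id": 7}]},
    "30": {"screenshots": [{"id": 0}]},
}


def extract_features(appid: str, entry: dict) -> dict:
    """Hypothetical helper: derive a few numeric columns from one record."""
    return {
        "appid": int(appid),  # cast so the key matches the base table's dtype
        "screenshot_count": len(entry.get("screenshots", [])),
        "trailer_count": len(entry.get("movies", [])),
    }


json_features = pd.DataFrame(
    [extract_features(appid, entry) for appid, entry in steam_json.items()]
)

# Left join: every base row survives, matched or not.
merged = base.merge(json_features, on="appid", how="left")

# Rows without a JSON match get NaN in the new columns; fill those with 0.
new_cols = [c for c in json_features.columns if c != "appid"]
merged[new_cols] = merged[new_cols].fillna(0).astype(int)
```

In this toy data, appid 20 has no JSON record, so it keeps its base columns and receives 0 for both derived counts.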

2 · Preprocessing pipeline

Two scripts produce the final training file. The first enriches and merges; the second selects pre-launch-only features and builds the classification target.

  1. Load base CSV & Steam JSON

    Read SteamSpy CSV into a DataFrame; load Steam Web API JSON keyed by appid string.

  2. Extract pre-launch features from JSON

    For each game, derive columns: platform support, language list & weighted score, screenshot counts, description lengths, Steam category flags (achievements, cloud save, controller, VR…), top-50 tag binary columns, package & SKU counts, and game_age_days from release date.

  3. Merge on appid (left join)

    Base CSV ⟕ JSON features. Rows with no JSON match keep their base columns; new numeric features are filled with 0.

  4. Engineer composite scores

    Five composite scores (store_page_score, platform_reach, marketing_score, localization_score, steam_integration) and an is_mature_content flag combine raw signals into interpretable proxies. Language support is encoded as weighted_language_score using per-language market weights.

  5. Drop post-launch columns (leakage prevention)

    Reviews, playtime, CCU, and the owners column itself are excluded from the model. None of these exist before launch, so including them would leak the target.

  6. Build ordinal target & split

    Owners bucket → ordinal class 0–5 (top sparse buckets merged into ≥750K). Median imputation for any remaining numeric NaNs, then a stratified 80/20 train-test split followed by z-score scaling.
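
The last step of the pipeline above (imputation, stratified split, scaling) can be sketched as follows. This is an illustration on synthetic data, not the contents of train_prelaunch_model.py. The column subset, the random seed, and the choice to fit the scaler on the training split only are all assumptions, since the text does not specify them.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for the feature table: 200 rows, 3 numeric features.
X = pd.DataFrame(
    rng.normal(size=(200, 3)),
    columns=["price", "screenshot_count", "tag_count"],
)
X.iloc[::25, 0] = np.nan  # inject a few missing values into the first column

# Imbalanced ordinal target, loosely mimicking a dominant class 0.
y = pd.Series(rng.choice([0, 1, 2], size=200, p=[0.7, 0.2, 0.1]), name="owner_class")

# 1) Median imputation for remaining numeric NaNs.
X = X.fillna(X.median(numeric_only=True))

# 2) Stratified 80/20 split: class proportions are preserved in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# 3) Z-score scaling. Assumption: statistics come from the training split
#    only and are then applied unchanged to the test split.
scaler = StandardScaler()
X_train_z = scaler.fit_transform(X_train)
X_test_z = scaler.transform(X_test)
```

Stratifying matters here because the majority class holds roughly 69% of the rows. An unstratified split could leave the sparse upper tiers badly under-represented in the test set.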

3 · Target variable

The model predicts an ordinal owner tier, not a raw owner count. The original 9 SteamSpy buckets are collapsed to 6 because the top 4 had too few samples (n = 136, 88, 53, 19) for reliable learning.

Class   Owner tier   Notes
0       ≤ 10K        Majority class (~69%)
1       ≤ 35K
2       ≤ 75K
3       ≤ 150K
4       ≤ 350K
5       ≥ 750K       Merged top 4 sparse buckets
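
One way the bucket-to-class mapping could work is sketched below. It rests on two assumptions that the text does not confirm: that the owners column holds range strings of the form "low .. high", and that the tier values in the table (10K, 35K, 75K, ...) are the midpoints of those ranges. The function name `owners_to_class` is hypothetical.

```python
from bisect import bisect_left

# Tier values for classes 0-4, taken from the table above and assumed to be
# bucket midpoints. Any midpoint above the last value falls into class 5.
TIER_MIDPOINTS = [10_000, 35_000, 75_000, 150_000, 350_000]


def owners_to_class(owners: str) -> int:
    """Map an owners range string such as '20,000 .. 50,000' to class 0-5."""
    low, high = (int(part.replace(",", "")) for part in owners.split(".."))
    midpoint = (low + high) // 2
    # bisect_left returns the index of the first tier value >= midpoint,
    # or len(TIER_MIDPOINTS) == 5 when the midpoint exceeds all of them.
    return bisect_left(TIER_MIDPOINTS, midpoint)


print(owners_to_class("0 .. 20,000"))            # class 0
print(owners_to_class("200,000 .. 500,000"))     # class 4
print(owners_to_class("2,000,000 .. 5,000,000")) # class 5 (merged top tier)
```

Merging the four sparse buckets gives the top class 136 + 88 + 53 + 19 = 296 samples, and the class count works out to 9 − 4 + 1 = 6.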

4 · Pre-launch features (99)

All 99 features, including the top-50 tag binary columns (tag_*), are signals that exist before a game ships. They are grouped below by origin:

Pricing & release

  • price
  • initialprice
  • is_free
  • release_month
  • release_quarter
  • release_dayofweek
  • release_is_q4
  • release_is_holiday
  • release_is_summer
  • release_is_tuesday
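
The calendar features listed above can all be derived from a single release date. The sketch below shows one plausible derivation. The definitions of "holiday" (November and December) and "summer" (June through August) are assumptions made for illustration, as is the Monday=0 weekday convention. The source does not state how the pipeline defines them.

```python
from datetime import date


def release_features(d: date) -> dict:
    """Derive release-timing features from a planned release date."""
    quarter = (d.month - 1) // 3 + 1
    weekday = d.weekday()  # Monday=0 ... Sunday=6
    return {
        "release_month": d.month,
        "release_quarter": quarter,
        "release_dayofweek": weekday,
        "release_is_q4": int(quarter == 4),
        # Assumption: "holiday" means the November-December sales window.
        "release_is_holiday": int(d.month in (11, 12)),
        # Assumption: "summer" means June through August.
        "release_is_summer": int(d.month in (6, 7, 8)),
        "release_is_tuesday": int(weekday == 1),
    }


features = release_features(date(2023, 11, 14))  # a Tuesday in Q4
```

All of these depend only on the planned release date, which is known before launch.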

Genre flags (one-hot)

  • Action
  • Adventure
  • RPG
  • Strategy
  • Simulation
  • Sports
  • Racing

Store-page

  • screenshot_count
  • about_length
  • short_desc_length
  • has_detailed_desc
  • has_website
  • has_support_email
  • required_age

Platform & localization

  • platform_count
  • platform_windows
  • platform_mac
  • platform_linux
  • supported_languages_count
  • full_audio_languages_count
  • weighted_language_score
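
The three language features above could be derived from the store page's supported_languages field roughly as follows. The sketch assumes the field is an HTML-flavoured, comma-separated string in which an asterisk marks full audio support. The weights in `LANGUAGE_WEIGHTS` and the `DEFAULT_WEIGHT` fallback are entirely hypothetical placeholders, not the per-language market weights the pipeline actually uses.

```python
import re

# Hypothetical market weights, for illustration only.
LANGUAGE_WEIGHTS = {"English": 1.0, "Simplified Chinese": 0.8, "German": 0.4}
DEFAULT_WEIGHT = 0.1  # hypothetical fallback for languages not listed above


def language_features(raw: str) -> dict:
    """Count languages and full-audio languages, and compute a weighted score."""
    # Keep only the language listing; drop any footnote after a line break.
    listing = raw.split("<br>")[0]
    names = []
    full_audio = 0
    for item in listing.split(","):
        has_audio = "*" in item  # asterisk marks full audio support
        name = re.sub(r"<[^>]+>", "", item).replace("*", "").strip()
        if not name:
            continue
        names.append(name)
        full_audio += int(has_audio)
    score = sum(LANGUAGE_WEIGHTS.get(name, DEFAULT_WEIGHT) for name in names)
    return {
        "supported_languages_count": len(names),
        "full_audio_languages_count": full_audio,
        "weighted_language_score": round(score, 3),
    }


example = language_features(
    "English<strong>*</strong>, German, Japanese"
    "<br><strong>*</strong>languages with full audio support"
)
```

A weighted score lets a single large-market language count for more than several small ones, which a plain language count cannot express.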

Steam ecosystem flags

  • has_achievements
  • has_cloud_save
  • has_controller_support
  • has_vr_support
  • has_in_app_purchases
  • has_family_sharing
  • achievement_count
  • category_count

Tags & packaging

  • tag_count
  • is_multiplayer
  • tag_* (top-50 binaries)
  • package_count
  • sku_count
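
The tag_* binaries can be built by counting tag frequency across all games and keeping the k most common. The sketch below uses a toy input and k = 2 instead of 50. The function name `top_k_tag_columns` and the input shape (appid mapped to a list of tag names) are assumptions made for illustration.

```python
from collections import Counter


def top_k_tag_columns(games: dict[int, list[str]], k: int = 50):
    """Return the k most frequent tags and one binary row per game."""
    # Count each tag at most once per game.
    counts = Counter(tag for tags in games.values() for tag in set(tags))
    top_tags = [tag for tag, _ in counts.most_common(k)]

    rows = {}
    for appid, tags in games.items():
        present = set(tags)
        row = {f"tag_{tag}": int(tag in present) for tag in top_tags}
        row["tag_count"] = len(present)  # counts all tags, not just the top k
        rows[appid] = row
    return top_tags, rows


toy_games = {
    1: ["Indie", "Action"],
    2: ["Indie", "RPG"],
    3: ["Indie", "Action", "Puzzle"],
}
top_tags, rows = top_k_tag_columns(toy_games, k=2)
```

A game can carry several tags at once, so each row may have more than one tag_* column set to 1.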

Engineered composite scores

  • store_page_score
  • platform_reach
  • marketing_score
  • localization_score
  • steam_integration
  • is_mature_content

Columns explicitly excluded (post-launch / leakage)

  • positive
  • negative
  • total_reviews
  • positive_ratio
  • average_forever
  • average_2weeks
  • median_forever
  • median_2weeks
  • ccu
  • owners
  • log_owners
  • json_price_raw
  • appid
  • has_trailer
  • trailer_count
  • dlc_count
  • has_trading_cards
  • has_workshop
  • is_solo_dev
  • has_publisher
  • publisher_count
  • developer_count
  • has_multiplayer_tag
  • Indie
  • publisher_backing
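
A simple guard against leakage is to keep the exclusion list in one place and filter every candidate column through it. The sketch below copies the excluded names from the list above. The helper `select_prelaunch` is hypothetical and is not taken from the project's scripts.

```python
# Excluded column names, copied from the list above.
EXCLUDED_COLUMNS = frozenset({
    "positive", "negative", "total_reviews", "positive_ratio",
    "average_forever", "average_2weeks", "median_forever", "median_2weeks",
    "ccu", "owners", "log_owners", "json_price_raw", "appid",
    "has_trailer", "trailer_count", "dlc_count", "has_trading_cards",
    "has_workshop", "is_solo_dev", "has_publisher", "publisher_count",
    "developer_count", "has_multiplayer_tag", "Indie", "publisher_backing",
})


def select_prelaunch(columns, excluded=EXCLUDED_COLUMNS):
    """Return the columns that are safe to use as features, in original order."""
    return [c for c in columns if c not in excluded]


candidate = ["price", "owners", "screenshot_count", "ccu", "tag_count"]
features = select_prelaunch(candidate)

# Fail loudly if an excluded column ever reaches the feature set.
assert not set(features) & EXCLUDED_COLUMNS
```

With a pandas DataFrame, the same effect is available through `df.drop(columns=list(EXCLUDED_COLUMNS), errors="ignore")`.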

5 · Reproducibility

The full pipeline lives in two Python scripts, which together produce the file the model trains on:

  • enrich_prelaunch.py: merges the SteamSpy CSV with the Steam API JSON and engineers the composite features.
  • train_prelaunch_model.py: selects pre-launch features, builds the ordinal target, and trains the stacked ensemble.
  • steam_10k_prelaunch.csv: the final 10,000 × 114 training file consumed by the model.