3  Data Understanding

CRISP-DM Phase 2. Collect initial data and build familiarity with it, identify data-quality problems, and form first hypotheses. See The CRISP-DM Process for the methodology overview.

This chapter executes the second CRISP-DM phase for PortfolioLens. It picks up the Business Understanding hand-off — baseline-first, multi-horizon, a Sharpe-based objective, and the six transmission channels (energy, agricultural commodities, defense, safe-havens & FX, shipping, regime-dependent crypto) — and asks the Phase-2 question: what data can we actually get, for free, and is it good enough to build on?

This revision also contextualizes the data against the indicator catalog in strategy.md: §5 maps every series onto that catalog’s leading/coincident/lagging and regime framework, and a companion appendix, Indicator Council Deliberation, records a six-expert debate on which indicators are genuinely useful for predicting the best-performing investments.

Everything below is computed from a pinned, cached data snapshot (vintage 2026-06-02) produced by a one-time pull script. The chapter performs no live API calls when it renders; see §10 Reproducibility for how the snapshot is created and refreshed.

3.1 1. From business goals to a data problem

The Phase-1 thesis — geopolitical shock → economic transmission channel → asset re-pricing → opportunity — tells us where to look. Phase 2 tests whether the look is even possible with the mandated free sources: Yahoo Finance (yfinance), FRED, Alpha Vantage, and US-government APIs (EIA, US Treasury). Detailed source notes live in financial-data-sources.md; this chapter is the working record of actually pulling and vetting the data.

The four Phase-2 deliverables follow in order (collect → describe → explore → verify quality), with the strategy-catalog mapping of §5 inserted between describe and explore.

3.2 2. Data sources & the free-source constraint

| Source | Auth | Free-tier limit | Used here for | Note |
| --- | --- | --- | --- | --- |
| yfinance (Yahoo) | none | unofficial scraper | equities, ETFs, indices, crypto, copper OHLCV | breaks silently |
| FRED | API key | ~120 req/min | rates, macro, credit, conditions, commodity spots | pulled via keyless fredgraph.csv (full history) |
| Alpha Vantage | API key | 25 req/day, 5/min | FX (EUR/USD demo) | one call, cached |
| EIA v2 | API key | ~no hard cap | petroleum (WTI demo) | cross-checks FRED WTI |
| US Treasury fiscaldata | none | open | reporting exchange rates | keyless |

The Alpha Vantage ceiling of 25 requests per day is the decisive design constraint: an edit-render loop that called the API on every render would exhaust it within a single working session. That is why this project uses a cached architecture: a one-time pull writes the data to disk, and every later render reads from disk. As a bonus, the rendered book needs no API keys at all, so it builds on any machine from the committed snapshot.
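The cache-first pattern can be sketched in a few lines. This is a minimal illustration, not the project's actual code: `load_or_fetch`, the JSON cache format, and the file name are hypothetical (the real pipeline writes parquet through scripts/poc_pull.py), and the fetcher is a stand-in that makes no network call.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable


def load_or_fetch(cache_path: Path, fetch: Callable[[], dict]) -> dict:
    """Return cached data if present; otherwise call `fetch` once and cache it."""
    if cache_path.exists():
        return json.loads(cache_path.read_text(encoding="utf-8"))
    data = fetch()  # the only place a network call (and a quota hit) can happen
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    cache_path.write_text(json.dumps(data), encoding="utf-8")
    return data


# Demo with a stand-in fetcher that counts how often it is called.
calls = {"n": 0}


def fake_fetch() -> dict:
    calls["n"] += 1
    return {"symbol": "EUR/USD", "rows": 5000}


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "snapshot" / "eurusd.json"
    first = load_or_fetch(path, fake_fetch)   # cache miss: fetches and writes
    second = load_or_fetch(path, fake_fetch)  # cache hit: reads from disk

print(calls["n"])  # 1
```

However many times the chapter renders, the quota is touched only on the first miss.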

Key handling. Keys live in a gitignored .env (template: .env.example); yfinance and Treasury need none. FRED is pulled via its keyless fredgraph.csv endpoint — it returns full history and sidesteps the per-key rate-limit/windowing that truncated some series, so the FRED API key is reserved for ALFRED point-in-time (vintage) data in Phase 3 (see §9).

3.3 3. Collect initial data

3.3.1 3.1 The instrument universe

The PoC universe maps each transmission channel and indicator role to concrete, free-to-obtain instruments. It is defined once in scripts/poc_universe.py:

symbol name channel source asset_type frequency needs_key
0 SPY SPDR S&P 500 ETF Trust Broad-market baseline yfinance equity-ETF business-day False
1 XLE Energy Select Sector SPDR Fund Energy yfinance equity-ETF business-day False
2 ITA iShares U.S. Aerospace & Defense ETF Defense procurement yfinance equity-ETF business-day False
3 DBA Invesco DB Agriculture Fund Agricultural commodities yfinance equity-ETF business-day False
4 BDRY Breakwave Dry Bulk Shipping ETF Shipping & insurance yfinance equity-ETF business-day False
5 GLD SPDR Gold Shares Safe havens & FX yfinance equity-ETF business-day False
6 ^VIX CBOE Volatility Index Safe havens & FX yfinance index business-day False
7 UUP Invesco DB US Dollar Index Bullish Fund Safe havens & FX yfinance equity-ETF business-day False
8 BTC-USD Bitcoin (USD) Crypto (regime-dependent) yfinance crypto daily False
9 ETH-USD Ethereum (USD) Crypto (regime-dependent) yfinance crypto daily False
10 LMT Lockheed Martin Corp. Defense procurement yfinance equity business-day False
11 DGS10 10-Year Treasury Constant Maturity Rate Safe havens & FX fred rate business-day True
12 DGS2 2-Year Treasury Constant Maturity Rate Safe havens & FX fred rate business-day True
13 T10Y3M 10Y minus 3M Treasury Spread Macro context fred rate-spread business-day True
14 DCOILWTICO WTI Crude Oil Spot Price Energy fred commodity-price business-day True
15 DHHNGSP Henry Hub Natural Gas Spot Price Energy fred commodity-price business-day True
16 VIXCLS CBOE Volatility Index (FRED) Safe havens & FX fred index business-day True
17 CPILFESL Core CPI (All Urban, less food & energy) Macro context fred macro-index monthly True
18 UNRATE Unemployment Rate Macro context fred macro-rate monthly True
19 WPU01210101 PPI by Commodity: Farm Products: Wheat Agricultural commodities fred macro-index monthly True
20 BAMLH0A0HYM2 ICE BofA US High Yield OAS Credit / risk regime fred credit-spread business-day True
21 BAMLC0A0CM ICE BofA US Corporate (IG) OAS Credit / risk regime fred credit-spread business-day True
22 BAA10Y Moody's Baa Corporate minus 10Y Treasury (cred... Credit / risk regime fred credit-spread business-day True
23 AAA10Y Moody's Aaa Corporate minus 10Y Treasury (cred... Credit / risk regime fred credit-spread business-day True
24 NFCI Chicago Fed National Financial Conditions Index Financial conditions fred index weekly True
25 ANFCI Chicago Fed Adjusted NFCI Financial conditions fred index weekly True
26 T10YIE 10-Year Breakeven Inflation Rate Inflation expectations fred rate business-day True
27 DFII10 10-Year TIPS Real Yield Real rates fred rate business-day True
28 DFF Effective Federal Funds Rate Monetary policy fred rate business-day True
29 M2SL M2 Money Supply (seasonally adjusted) Liquidity fred macro-level monthly True
30 INDPRO Industrial Production Index Growth (coincident) fred macro-index monthly True
31 SAHMREALTIME Sahm Rule Recession Indicator (real-time) Recession regime fred macro-rate monthly True
32 HG=F COMEX Copper Futures (Dr. Copper) Commodity / growth barometer yfinance commodity business-day False
33 EUR/USD EUR/USD daily (Alpha Vantage FX_DAILY) Safe havens & FX alphavantage FX business-day True
34 PET.RWTC.D WTI Crude Oil Spot (EIA) Energy eia commodity-price business-day True
35 EURO Treasury Reporting Rate of Exchange — Euro Safe havens & FX treasury FX quarterly False

3.3.2 3.2 How each source is pulled

The real fetchers live in scripts/poc_fetch.py; the orchestrator is scripts/poc_pull.py. The canonical call per source (shown, not executed at render time):

# yfinance — split/dividend-adjusted OHLCV, no key
import yfinance as yf
spy = yf.download("SPY", start="2000-01-01", end="2026-06-02", auto_adjust=True)

# FRED — keyless public CSV endpoint (full history; no pandas-datareader, which breaks on pandas 3.x)
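import io, requests, pandas as pd
resp = requests.get("https://fred.stlouisfed.org/graph/fredgraph.csv",
                    params={"id": "BAA10Y", "cosd": "2000-01-01", "coed": "2026-06-02"},
                    timeout=30)
resp.raise_for_status()  # fail loudly rather than parse an error page as CSV
baa10y = pd.read_csv(io.StringIO(resp.text), na_values=["."])  # "." marks a missing value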

# Alpha Vantage — FX (key required; <=25 calls/day, 12s spacing)
av = requests.get("https://www.alphavantage.co/query",
                  params={"function": "FX_DAILY", "from_symbol": "EUR", "to_symbol": "USD",
                          "outputsize": "full", "apikey": "<ALPHAVANTAGE_API_KEY>"}).json()

# EIA v2 — petroleum series (key required)
eia = requests.get("https://api.eia.gov/v2/seriesid/PET.RWTC.D",
                   params={"api_key": "<EIA_API_KEY>"}).json()

# US Treasury fiscaldata — reporting rates of exchange (no key)
tr = requests.get("https://api.fiscaldata.treasury.gov/services/api/fiscal_service"
                  "/v1/accounting/od/rates_of_exchange",
                  params={"filter": "country_currency_desc:in:(Euro Zone-Euro)"}).json()
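The Alpha Vantage limits quoted above (25 calls per day, 12 s spacing) could be enforced with a small guard like the one below. This is a sketch under stated assumptions: `CallBudget` is a hypothetical helper, not something in scripts/poc_fetch.py; the cap is per process run and does not reset at midnight; and the clock is injectable so the demo runs without actually sleeping.

```python
import time
from typing import Callable, List, Optional


class CallBudget:
    """Enforce a call cap and a minimum spacing between calls."""

    def __init__(self, max_calls: int = 25, min_spacing_s: float = 12.0,
                 clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep) -> None:
        self.max_calls = max_calls
        self.min_spacing_s = min_spacing_s
        self._clock = clock
        self._sleep = sleep
        self._used = 0
        self._last: Optional[float] = None

    def acquire(self) -> None:
        """Wait until the next call is allowed, or raise if the cap is spent."""
        if self._used >= self.max_calls:
            raise RuntimeError("call budget exhausted")
        if self._last is not None:
            wait = self.min_spacing_s - (self._clock() - self._last)
            if wait > 0:
                self._sleep(wait)
        self._last = self._clock()
        self._used += 1


class FakeClock:
    """Test double: 'sleeping' just advances the clock."""

    def __init__(self) -> None:
        self.now = 0.0
        self.slept: List[float] = []

    def time(self) -> float:
        return self.now

    def sleep(self, seconds: float) -> None:
        self.slept.append(seconds)
        self.now += seconds


fake = FakeClock()
budget = CallBudget(max_calls=2, clock=fake.time, sleep=fake.sleep)
budget.acquire()   # first call: no wait
budget.acquire()   # second call: waits the full 12 s
print(fake.slept)  # [12.0]
```

A third `budget.acquire()` raises, so a runaway loop stops before it burns the rest of the quota.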

3.3.3 3.3 Initial data collection report

What the snapshot pull actually retrieved (the executed manifest of cached files):

symbol channel source frequency start end n_rows
0 SPY Broad-market baseline yfinance business-day 2000-01-03 2026-06-01 6642
1 XLE Energy yfinance business-day 2000-01-03 2026-06-01 6642
2 ITA Defense procurement yfinance business-day 2006-05-05 2026-06-01 5049
3 DBA Agricultural commodities yfinance business-day 2007-01-05 2026-06-01 4881
4 BDRY Shipping & insurance yfinance business-day 2018-03-22 2026-06-01 2059
5 GLD Safe havens & FX yfinance business-day 2004-11-18 2026-06-01 5416
6 ^VIX Safe havens & FX yfinance business-day 2000-01-03 2026-06-01 6643
7 UUP Safe havens & FX yfinance business-day 2007-03-01 2026-06-01 4844
8 BTC-USD Crypto (regime-dependent) yfinance daily 2014-09-17 2026-06-01 4276
9 ETH-USD Crypto (regime-dependent) yfinance daily 2017-11-09 2026-06-01 3127
10 LMT Defense procurement yfinance business-day 2000-01-03 2026-06-01 6642
11 DGS10 Safe havens & FX fred business-day 2000-01-03 2026-06-01 6891
12 DGS2 Safe havens & FX fred business-day 2000-01-03 2026-06-01 6891
13 T10Y3M Macro context fred business-day 2000-01-03 2026-06-02 6892
14 DCOILWTICO Energy fred business-day 2000-01-04 2026-05-26 6886
15 DHHNGSP Energy fred business-day 2000-01-04 2026-05-26 6886
16 VIXCLS Safe havens & FX fred business-day 2000-01-03 2026-06-01 6891
17 CPILFESL Macro context fred monthly 2000-01-01 2026-04-01 316
18 UNRATE Macro context fred monthly 2000-01-01 2026-04-01 316
19 WPU01210101 Agricultural commodities fred monthly 2000-01-01 2026-04-01 316
20 BAMLH0A0HYM2 Credit / risk regime fred business-day 2023-06-05 2026-06-01 793
21 BAMLC0A0CM Credit / risk regime fred business-day 2023-06-05 2026-06-01 793
22 BAA10Y Credit / risk regime fred business-day 2000-01-03 2026-06-01 6891
23 AAA10Y Credit / risk regime fred business-day 2000-01-03 2026-06-01 6891
24 NFCI Financial conditions fred weekly 2000-01-07 2026-05-22 1377
25 ANFCI Financial conditions fred weekly 2000-01-07 2026-05-22 1377
26 T10YIE Inflation expectations fred business-day 2003-01-02 2026-06-02 6109
27 DFII10 Real rates fred business-day 2003-01-02 2026-06-01 6108
28 DFF Monetary policy fred business-day 2000-01-01 2026-06-01 9649
29 M2SL Liquidity fred monthly 2000-01-01 2026-04-01 316
30 INDPRO Growth (coincident) fred monthly 2000-01-01 2026-04-01 316
31 SAHMREALTIME Recession regime fred monthly 2000-01-01 2026-04-01 316
32 HG=F Commodity / growth barometer yfinance business-day 2000-08-30 2026-06-01 6466
33 EUR/USD Safe havens & FX alphavantage business-day 2007-04-03 2026-06-02 5000
34 PET.RWTC.D Energy eia business-day 2006-06-28 2026-05-26 5000
35 EURO Safe havens & FX treasury quarterly 2001-03-31 2026-03-31 101
Pulled 36/36 instruments across all four free sources.
'All instruments pulled successfully.'

Issues encountered (honest log). Three findings worth recording, because they shaped the pipeline:

  1. The keyless FRED path initially failed (pandas-datareader is incompatible with pandas 3.x) and was replaced with FRED’s public fredgraph.csv endpoint.
  2. The IMF “Global price of Wheat” series (PWHEAMTUSD) was discontinued on FRED; substituted the maintained PPI wheat series (WPU01210101).
  3. The ICE BofA credit-spread series (BAMLH0A0HYM2 HY OAS, BAMLC0A0CM IG OAS) return only ~mid-2023 onward from FRED: a licensing restriction on the free ICE data, identical via the API and the CSV endpoint. Because a credit spread with no past recession in-sample is nearly useless (see §7), we added the free Moody’s BAA10Y/AAA10Y spreads, which span 1990→today on FRED (the snapshot holds them from 2000-01-03, which covers 2008 and 2020). This fix was prompted directly by the indicator council.
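A truncation like the one in item 3 is easy to miss because the pull itself succeeds. One way to surface it is to compare each series' first observation with the requested start date. The sketch below uses start dates from the collection report and the 2000-01-01 start used in the calls shown earlier; `late_starters` and the 31-day tolerance are illustrative assumptions.

```python
from datetime import date

REQUESTED_START = date(2000, 1, 1)

# (symbol, first observation) taken from the collection report above.
manifest = [
    ("DGS10", date(2000, 1, 3)),
    ("BAA10Y", date(2000, 1, 3)),
    ("BAMLH0A0HYM2", date(2023, 6, 5)),
    ("BAMLC0A0CM", date(2023, 6, 5)),
]


def late_starters(rows, requested, tolerance_days=31):
    """Return symbols whose history begins well after the requested start."""
    return [sym for sym, start in rows
            if (start - requested).days > tolerance_days]


flagged = late_starters(manifest, REQUESTED_START)
print(flagged)  # ['BAMLH0A0HYM2', 'BAMLC0A0CM']
```

The check cannot tell a licensing cap from a genuine late inception (BDRY would be flagged too), so its output is a list for human review rather than a verdict.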

3.4 4. Describe data

filekey name channel source asset_type frequency start end n_rows n_fields tz_aware pct_missing_value
0 DBA Invesco DB Agriculture Fund Agricultural commodities yfinance equity-ETF business-day 2007-01-05 2026-06-01 4881 6 False 0.00
1 WPU_WHEAT PPI by Commodity: Farm Products: Wheat Agricultural commodities fred macro-index monthly 2000-01-01 2026-04-01 316 1 False 0.00
2 SPY SPDR S&P 500 ETF Trust Broad-market baseline yfinance equity-ETF business-day 2000-01-03 2026-06-01 6642 6 False 0.00
3 COPPER COMEX Copper Futures (Dr. Copper) Commodity / growth barometer yfinance commodity business-day 2000-08-30 2026-06-01 6466 6 False 0.00
4 AAA10Y Moody's Aaa Corporate minus 10Y Treasury (cred... Credit / risk regime fred credit-spread business-day 2000-01-03 2026-06-01 6891 1 False 4.22
5 BAA10Y Moody's Baa Corporate minus 10Y Treasury (cred... Credit / risk regime fred credit-spread business-day 2000-01-03 2026-06-01 6891 1 False 4.22
6 HY_OAS ICE BofA US High Yield OAS Credit / risk regime fred credit-spread business-day 2023-06-05 2026-06-01 793 1 False 1.01
7 IG_OAS ICE BofA US Corporate (IG) OAS Credit / risk regime fred credit-spread business-day 2023-06-05 2026-06-01 793 1 False 1.13
8 BTC-USD Bitcoin (USD) Crypto (regime-dependent) yfinance crypto daily 2014-09-17 2026-06-01 4276 6 False 0.00
9 ETH-USD Ethereum (USD) Crypto (regime-dependent) yfinance crypto daily 2017-11-09 2026-06-01 3127 6 False 0.00
10 ITA iShares U.S. Aerospace & Defense ETF Defense procurement yfinance equity-ETF business-day 2006-05-05 2026-06-01 5049 6 False 0.00
11 LMT Lockheed Martin Corp. Defense procurement yfinance equity business-day 2000-01-03 2026-06-01 6642 6 False 0.00
12 DCOILWTICO WTI Crude Oil Spot Price Energy fred commodity-price business-day 2000-01-04 2026-05-26 6886 1 False 3.89
13 DHHNGSP Henry Hub Natural Gas Spot Price Energy fred commodity-price business-day 2000-01-04 2026-05-26 6886 1 False 3.75
14 EIA_WTI WTI Crude Oil Spot (EIA) Energy eia commodity-price business-day 2006-06-28 2026-05-26 5000 1 False 0.00
15 XLE Energy Select Sector SPDR Fund Energy yfinance equity-ETF business-day 2000-01-03 2026-06-01 6642 6 False 0.00
16 ANFCI Chicago Fed Adjusted NFCI Financial conditions fred index weekly 2000-01-07 2026-05-22 1377 1 False 0.00
17 NFCI Chicago Fed National Financial Conditions Index Financial conditions fred index weekly 2000-01-07 2026-05-22 1377 1 False 0.00
18 INDPRO Industrial Production Index Growth (coincident) fred macro-index monthly 2000-01-01 2026-04-01 316 1 False 0.00
19 T10YIE 10-Year Breakeven Inflation Rate Inflation expectations fred rate business-day 2003-01-02 2026-06-02 6109 1 False 4.11
20 M2SL M2 Money Supply (seasonally adjusted) Liquidity fred macro-level monthly 2000-01-01 2026-04-01 316 1 False 0.00
21 CPILFESL Core CPI (All Urban, less food & energy) Macro context fred macro-index monthly 2000-01-01 2026-04-01 316 1 False 0.32
22 T10Y3M 10Y minus 3M Treasury Spread Macro context fred rate-spread business-day 2000-01-03 2026-06-02 6892 1 False 4.14
23 UNRATE Unemployment Rate Macro context fred macro-rate monthly 2000-01-01 2026-04-01 316 1 False 0.32
24 DFF Effective Federal Funds Rate Monetary policy fred rate business-day 2000-01-01 2026-06-01 9649 1 False 0.00
25 DFII10 10-Year TIPS Real Yield Real rates fred rate business-day 2003-01-02 2026-06-01 6108 1 False 4.11
26 SAHM Sahm Rule Recession Indicator (real-time) Recession regime fred macro-rate monthly 2000-01-01 2026-04-01 316 1 False 0.32
27 DGS10 10-Year Treasury Constant Maturity Rate Safe havens & FX fred rate business-day 2000-01-03 2026-06-01 6891 1 False 4.14
28 DGS2 2-Year Treasury Constant Maturity Rate Safe havens & FX fred rate business-day 2000-01-03 2026-06-01 6891 1 False 4.14
29 EURUSD_AV EUR/USD daily (Alpha Vantage FX_DAILY) Safe havens & FX alphavantage FX business-day 2007-04-03 2026-06-02 5000 5 False 0.00
30 GLD SPDR Gold Shares Safe havens & FX yfinance equity-ETF business-day 2004-11-18 2026-06-01 5416 6 False 0.00
31 TREAS_EUR Treasury Reporting Rate of Exchange — Euro Safe havens & FX treasury FX quarterly 2001-03-31 2026-03-31 101 1 False 0.00
32 UUP Invesco DB US Dollar Index Bullish Fund Safe havens & FX yfinance equity-ETF business-day 2007-03-01 2026-06-01 4844 6 False 0.00
33 VIX CBOE Volatility Index Safe havens & FX yfinance index business-day 2000-01-03 2026-06-01 6643 6 False 0.00
34 VIXCLS CBOE Volatility Index (FRED) Safe havens & FX fred index business-day 2000-01-03 2026-06-01 6891 1 False 3.16
35 BDRY Breakwave Dry Bulk Shipping ETF Shipping & insurance yfinance equity-ETF business-day 2018-03-22 2026-06-01 2059 6 False 0.00

Two structural facts dominate the description and drive Phase 3:

  • Inception heterogeneity. Histories start at very different dates — SPY/LMT/^VIX and many FRED series reach back to 2000, but GLD begins 2004, ITA 2006, UUP/DBA 2007, BTC 2014, ETH 2017, and BDRY only 2018; the ICE OAS series only ~2023. Any cross-asset model must handle ragged start dates rather than assume a common window.
  • Frequency mix. Business-day equities/ETFs (~252 obs/yr), 7-day crypto, business-day FRED rates/spots (weekend gaps), weekly NFCI/ANFCI, and monthly macro (CPI, unemployment, M2, industrial production, Sahm, wheat PPI) coexist. DFF is labelled business-day in the universe but is published for every calendar day: its 9,649 rows from 2000-01-01 match a 7-day calendar. Naive joining would silently drop weekends or fabricate values; this is handled as a quality finding in §7 and an alignment task in §9.
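To see what a forced common window would cost, the sketch below takes the inception dates of the tradable proxies from the description table and finds the latest one, which is where any all-series window has to begin. The variable names are illustrative, and calendar days are used as a rough measure of history.

```python
from datetime import date

# First observation per tradable proxy, from the description table above.
starts = {
    "SPY": date(2000, 1, 3), "XLE": date(2000, 1, 3), "LMT": date(2000, 1, 3),
    "GLD": date(2004, 11, 18), "ITA": date(2006, 5, 5), "DBA": date(2007, 1, 5),
    "UUP": date(2007, 3, 1), "BTC-USD": date(2014, 9, 17),
    "ETH-USD": date(2017, 11, 9), "BDRY": date(2018, 3, 22),
}
snapshot_end = date(2026, 6, 1)

common_start = max(starts.values())      # the latest inception sets the window
binding = max(starts, key=starts.get)    # the series responsible for it
full_span = (snapshot_end - min(starts.values())).days
kept_span = (snapshot_end - common_start).days
discarded_pct = 100 * (1 - kept_span / full_span)

print(binding, common_start)  # BDRY 2018-03-22
print(f"{discarded_pct:.0f}% of the longest history would be discarded")
```

A single short-lived ETF would throw away most of SPY's history, including 2008, which is the practical argument for pairwise or per-model windows.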

3.5 5. Indicator framework: mapping data to strategy

strategy.md’s organizing principle is that individual indicators are noisy and regime-dependent; their value comes from combining them (diffusion indices, z-scores, multi-signal confirmation). Its highest-conviction signals form a “Stage-1 regime dashboard”: yield curve, credit spreads, financial conditions, the Sahm rule, VIX, copper/gold, and the dollar. The expansion in this revision was chosen to assemble that dashboard from free data.
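As a rough illustration of "combining", the sketch below computes a trailing z-score for one indicator and a diffusion index (the share of signals currently on) across several. The numbers are made up, the 1.5 threshold and signal names are placeholders, and neither function is taken from strategy.md or the project code.

```python
from statistics import mean, stdev


def zscore_latest(values, window):
    """Z-score of the last value against the trailing `window` observations.

    The window includes the latest point itself, which slightly damps the score.
    """
    tail = values[-window:]
    sd = stdev(tail)
    return (tail[-1] - mean(tail)) / sd if sd > 0 else 0.0


def diffusion_index(signals):
    """Share of signals that are on (truthy), in percent."""
    signals = list(signals)
    return 100.0 * sum(bool(s) for s in signals) / len(signals)


# Toy inputs (made-up numbers, for illustration only).
spread = [1.8, 1.9, 2.0, 1.9, 2.1, 2.0, 2.2, 3.4]  # the last value jumps
z = zscore_latest(spread, window=8)

risk_off = {
    "credit_wide": z > 1.5,      # spread unusually wide versus its own history
    "curve_inverted": False,
    "vix_elevated": True,
    "conditions_tight": False,
}
score = diffusion_index(risk_off.values())
print(score)  # 50.0
```

An equal-weighted share like this has no fitted parameters beyond the thresholds, which is the property the Phase-3 hand-off in §9 asks for.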

The mapping of our snapshot onto the catalog’s taxonomy:

| strategy.md category | Type | Our instruments | Signal use | Reliability caveat |
| --- | --- | --- | --- | --- |
| Yield curve | Market-based lead | DGS10, DGS2, T10Y3M | recession lead; slope and disinversion | gave a false signal 2022–24 |
| Credit spreads | Market-based lead | BAA10Y, AAA10Y (1990+); HY_OAS, IG_OAS (2023+) | risk-off early warning | ICE OAS history licensing-capped → lean on Moody’s |
| Financial conditions | Market-based | NFCI, ANFCI | tightening leads slowdowns | ANFCI strips the cycle |
| Volatility | Market-based | VIX, VIXCLS | risk-off regime; vol-scaling | coincident; low VIX = complacency |
| Inflation exp. / real rates | Market-based | T10YIE, DFII10 | reflation-vs-stagflation quadrant; gold driver | from 2003 only |
| Policy / liquidity | Monetary | DFF, M2SL | risk anchor; liquidity | money→inflation link loose |
| Growth | Coincident | INDPRO, SPY | business-cycle state | |
| Recession onset | Regime | SAHMREALTIME (+UNRATE) | onset trigger (≥0.50) | coincident-early; labor-supply distortion |
| Inflation / labor | Lagging | CPILFESL, UNRATE | not forward signals | catalog’s named trap |
| Commodities / Dr. Copper | Commodity | COPPER, DCOILWTICO, DHHNGSP, wheat | growth barometer; copper/gold | spot vs ETF proxy |
| FX / dollar | Currency | UUP, EURUSD_AV, TREAS_EUR | dollar smile; risk-off | proxies |
| Equity channels | | XLE, ITA, LMT, DBA, BDRY | transmission-channel proxies | proxies, not the underlying |
| Crypto | | BTC-USD, ETH-USD | regime-dependent | on-chain MVRV not free (gap) |
| Factor premia | Equity factor | none buildable | value/mom/quality | needs single-name cross-section (gap) |
| LEI / ISM-PMI | Composite lead | none | | licensed, not free |

The dashboard is now (mostly) assembled from free data: curve ✓, credit ✓ (BAA10Y), financial conditions ✓ (NFCI/ANFCI), Sahm ✓, VIX ✓, copper/gold ✓, dollar ✓. Only the licensed composites (LEI, ISM) and survivorship-free single-name fundamentals remain out of reach.
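For readers unfamiliar with the Sahm entry and its ≥0.50 trigger, the calculation behind it can be sketched as follows: the 3-month average unemployment rate minus the lowest 3-month average of the preceding 12 months. The snapshot takes SAHMREALTIME ready-made from FRED, so this is only an illustration; `sahm_gap` is a hypothetical helper, the unemployment path is invented, and excluding the current month from the 12-month low is an assumption about the exact window.

```python
def sahm_gap(unrate, lookback=12):
    """3-month average unemployment minus its low over the prior `lookback` months.

    Returns None for months without enough history.
    """
    ma3 = [None, None] + [sum(unrate[i - 2:i + 1]) / 3
                          for i in range(2, len(unrate))]
    gaps = []
    for t, cur in enumerate(ma3):
        prior = [x for x in ma3[max(0, t - lookback):t] if x is not None]
        if cur is None or len(prior) < lookback:
            gaps.append(None)
        else:
            gaps.append(cur - min(prior))
    return gaps


# Invented path: 14 flat months at 4.0%, then three monthly rises of 0.3 pp.
unrate = [4.0] * 14 + [4.3, 4.6, 4.9]
gaps = sahm_gap(unrate)
triggered = [t for t, g in enumerate(gaps) if g is not None and g >= 0.5]
print(triggered)  # [16]
```

The trigger fires only in the third month of the rise, which illustrates the "coincident-early" caveat in the table: the rule confirms an onset rather than forecasting it.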

3.5.1 5.1 Credit spread — the recession-tested risk gauge

fig, ax = plt.subplots(figsize=(10, 4.2))
baa = panel["BAA10Y"].dropna()
ax.plot(baa.index, baa, color="tab:purple", linewidth=1.0, label="Baa − 10Y (BAA10Y)")
ax.axhline(baa.median(), color="grey", ls="--", lw=0.8, label="median")
ax.set_ylabel("Credit spread (pp)")
ax.set_title("Moody's Baa credit spread — the recession-spanning risk-regime signal")
ax.legend(loc="upper right", fontsize=9)
plt.tight_layout()
plt.show()
Figure 3.1: Moody’s Baa-minus-10Y credit spread (free, 1990+): widening spikes mark 2008, 2020, and 2022 stress. The ICE HY OAS we’d prefer is licensing-capped to ~2023+.

3.5.2 5.2 Yield curve — slope and inversions

fig, ax = plt.subplots(figsize=(10, 4.2))
curve = panel["T10Y3M"].dropna()
ax.plot(curve.index, curve, color="tab:blue", linewidth=0.9)
ax.axhline(0, color="black", lw=0.8)
ax.fill_between(curve.index, curve, 0, where=(curve < 0), color="tab:red", alpha=0.4,
                label="inverted (recession lead / false-signal risk)")
ax.set_ylabel("10y − 3m spread (pp)")
ax.set_title("Yield-curve slope")
ax.legend(loc="lower right", fontsize=9)
plt.tight_layout()
plt.show()
Figure 3.2: 10y–3m Treasury spread (T10Y3M). Shaded = inverted (negative); inversions preceded recessions but gave a long false signal in 2022–24.
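The shaded regions in the figure can also be listed as discrete episodes, which is handier for counting how many inversions were followed by a recession. A minimal sketch on toy values (not the snapshot); `inversion_episodes` is a hypothetical helper, and missing readings are dropped before runs are formed.

```python
from itertools import groupby


def inversion_episodes(dates, spread):
    """Return (start, end, n_obs) for each run of consecutive negative readings."""
    episodes = []
    pairs = [(d, s) for d, s in zip(dates, spread) if s is not None]
    for inverted, run in groupby(pairs, key=lambda p: p[1] < 0):
        run = list(run)
        if inverted:
            episodes.append((run[0][0], run[-1][0], len(run)))
    return episodes


# Toy data (illustrative values only).
dates = ["2022-10-24", "2022-10-25", "2022-10-26",
         "2022-10-27", "2022-10-28", "2022-10-31"]
spread = [0.05, -0.02, -0.10, None, -0.04, 0.03]
episodes = inversion_episodes(dates, spread)
print(episodes)  # [('2022-10-25', '2022-10-28', 3)]
```

Because missing days are dropped first, a holiday gap does not split one inversion into two.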

3.5.3 5.3 Copper/gold ratio vs. the 10-year yield

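fig, ax1 = plt.subplots(figsize=(10, 4.2))
# Copper futures (HG=F) over the GLD share price: a proxy for the spot copper/gold ratio.
cg = (panel["COPPER"] / panel["GLD"]).dropna()
(line_cg,) = ax1.plot(cg.index, cg, color="tab:orange", linewidth=1.0, label="copper/gold (proxy)")
ax1.set_ylabel("copper / gold (ratio, proxy)", color="tab:orange")
ax2 = ax1.twinx()  # second y-axis sharing the same dates
dgs10 = panel["DGS10"].dropna()
(line_y,) = ax2.plot(dgs10.index, dgs10, color="tab:blue", linewidth=0.9, label="10y yield (DGS10)")
ax2.set_ylabel("10y yield (%)", color="tab:blue")
# The labels above only appear if a legend is drawn; combine both axes into one.
ax1.legend(handles=[line_cg, line_y], loc="upper left", fontsize=9)
ax1.set_title("Copper/gold vs. the 10-year yield")
plt.tight_layout()
plt.show()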
Figure 3.3: Copper/gold ratio (growth/risk-appetite barometer) vs the 10y Treasury yield — strategy.md notes they track each other.

3.6 6. Explore data

3.6.1 6.1 Summary statistics

count mean std min 25% 50% 75% max pct_missing
SPY 6642.0 203.70 163.48 49.81 84.97 124.00 267.67 758.54 31.17
XLE 6642.0 21.45 11.06 5.18 13.82 21.31 25.63 62.56 31.17
ITA 5049.0 70.80 49.94 11.72 26.48 55.78 100.89 250.42 47.68
DBA 4881.0 20.70 4.68 11.61 17.24 20.89 23.97 36.25 49.42
BDRY 2059.0 13.34 7.24 3.91 7.78 10.72 18.61 41.51 78.66
GLD 5416.0 141.97 72.28 41.26 107.26 125.54 167.38 495.90 43.88
UUP 4844.0 21.48 2.77 17.48 19.25 20.97 22.62 28.86 49.80
BTC-USD 4276.0 28603.50 32439.66 178.10 3455.37 11533.04 46366.20 124752.53 55.69
ETH-USD 3127.0 1716.06 1274.73 84.31 380.30 1732.25 2662.89 4831.35 67.60
LMT 6642.0 167.56 158.55 8.55 42.14 66.01 298.00 672.30 31.17
COPPER 6466.0 2.89 1.23 0.60 2.15 3.06 3.72 6.64 32.99
VIX 6643.0 19.84 8.32 9.14 14.03 17.82 23.21 82.69 31.16

Note that pct_missing in this table is measured against the aligned panel’s union calendar, not against each series’ own index: SPY’s 6,642 rows at 31.17% missing imply a panel of about 9,650 daily rows. The figure therefore reflects weekends, holidays, and pre-inception dates rather than gaps in the source data, for which §4 reports 0.00% on the native index.

3.6.2 6.2 Normalized price history by channel

Five representative tradable proxies (SPY, XLE, ITA, GLD, BTC-USD) are each indexed to 100 at their first available observation and plotted on a log scale, so different inception dates and magnitudes are comparable:

fig, ax = plt.subplots(figsize=(10, 5.5))
for fk in ["SPY", "XLE", "ITA", "GLD", "BTC-USD"]:
    s = panel[fk].dropna()
    if len(s):
        ax.plot(s.index, 100 * s / s.iloc[0], label=fk, linewidth=1.3)
ax.set_yscale("log")
ax.set_ylabel("Indexed to 100 at inception (log)")
ax.set_title("Transmission-channel proxies, normalized")
ax.legend(loc="upper left", ncol=3, fontsize=9)
plt.tight_layout()
plt.show()
Figure 3.4: Channel proxies normalized to 100 at each series’ inception (log scale).

3.6.3 6.3 Rolling volatility (risk regimes)

rets = q.returns_panel(panel[["SPY", "GLD", "BTC-USD"]], kind="log")
vol = rets.rolling(30).std() * np.sqrt(252)
fig, ax = plt.subplots(figsize=(10, 4.5))
for c in vol.columns:
    ax.plot(vol.index, vol[c], label=c, linewidth=1.1)
ax.set_ylabel("Annualized volatility")
ax.set_title("30-day rolling volatility")
ax.legend(loc="upper right")
plt.tight_layout()
plt.show()
Figure 3.5: 30-day annualized volatility of daily log returns — risk regimes are visible (2008, 2020, 2022).
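One caveat when reading this figure: the cell annualizes every series with √252, while BTC trades seven days a week. Whether `q.returns_panel` first aligns BTC to the business-day calendar is not shown here, so this is a point to check rather than a known error. The sketch below, on toy prices with hypothetical helper names, shows how much the choice of annualization factor matters.

```python
import math
from statistics import stdev


def log_returns(prices):
    """Period-over-period log returns; requires strictly positive prices."""
    return [math.log(b / a) for a, b in zip(prices, prices[1:])]


def rolling_ann_vol(returns, window, periods_per_year):
    """Annualized sample standard deviation over a trailing window."""
    out = [None] * (window - 1)          # not enough history yet
    for i in range(window, len(returns) + 1):
        out.append(stdev(returns[i - window:i]) * math.sqrt(periods_per_year))
    return out


prices = [100.0, 101.0, 99.0, 102.0, 100.0, 103.0]  # toy prices
rets = log_returns(prices)
vol_252 = rolling_ann_vol(rets, window=3, periods_per_year=252)  # equity convention
vol_365 = rolling_ann_vol(rets, window=3, periods_per_year=365)  # 7-day market
ratio = vol_365[-1] / vol_252[-1]
print(round(ratio, 2))  # 1.2
```

The same daily returns look about 20% more volatile under a 365-day convention, so cross-asset comparisons should state which one is used.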

3.6.4 6.4 Cross-asset return correlation

cm = q.correlation_matrix(panel, TRADABLE)
fig, ax = plt.subplots(figsize=(9, 7.5))
sns.heatmap(cm, annot=True, fmt=".2f", cmap="vlag", center=0,
            square=True, cbar_kws={"shrink": 0.8}, ax=ax)
ax.set_title("Daily-return correlation")
plt.tight_layout()
plt.show()
Figure 3.6: Correlation of daily log returns across the tradable universe (min 60 overlapping obs).

3.6.5 6.5 First hypotheses (to be tested, not findings)

These are explicitly labelled hypotheses — exploratory patterns to be validated with rigor in later phases, never treated as confirmed edges:

  • H1 (channel structure). Energy equity (XLE) co-moves with crude; defense (ITA/LMT) and gold (GLD) show partly diversifying profiles vs. the broad market (SPY).
  • H2 (crypto regime-switching). BTC’s correlation with SPY is not constant — likely rising in risk-on periods and breaking around stress, consistent with the Phase-1 “regime-dependent” claim.
  • H3 (risk-off signature). VIX spikes, credit-spread (BAA10Y) widening, and curve dynamics should cluster around equity drawdowns — motivating a combined regime read rather than any single signal.

3.7 7. Verify data quality

3.7.1 7.1 Quality scorecard

filekey channel freq n_rows missing pct_missing duplicate_idx ordered stale(>=5) max_stale_run positivity lookahead_release_lag
0 DBA Agricultural commodities business-day 4881 pass 0.00 pass pass pass 2 pass pass
1 WPU_WHEAT Agricultural commodities monthly 316 pass 0.00 pass pass pass 1 n/a review
2 SPY Broad-market baseline business-day 6642 pass 0.00 pass pass pass 1 pass pass
3 COPPER Commodity / growth barometer business-day 6466 pass 0.00 pass pass pass 2 n/a pass
4 AAA10Y Credit / risk regime business-day 6891 warn 4.22 pass pass warn 7 n/a pass
5 BAA10Y Credit / risk regime business-day 6891 warn 4.22 pass pass warn 6 n/a pass
6 HY_OAS Credit / risk regime business-day 793 warn 1.01 pass pass pass 2 n/a pass
7 IG_OAS Credit / risk regime business-day 793 warn 1.13 pass pass warn 10 n/a pass
8 BTC-USD Crypto (regime-dependent) daily 4276 pass 0.00 pass pass pass 1 pass pass
9 ETH-USD Crypto (regime-dependent) daily 3127 pass 0.00 pass pass pass 0 pass pass
10 ITA Defense procurement business-day 5049 pass 0.00 pass pass pass 2 pass pass
11 LMT Defense procurement business-day 6642 pass 0.00 pass pass pass 1 pass pass
12 DCOILWTICO Energy business-day 6886 warn 3.89 pass pass pass 2 fail pass
13 DHHNGSP Energy business-day 6886 warn 3.75 pass pass warn 10 pass pass
14 EIA_WTI Energy business-day 5000 pass 0.00 pass pass pass 1 fail pass
15 XLE Energy business-day 6642 pass 0.00 pass pass pass 2 pass pass
16 ANFCI Financial conditions weekly 1377 pass 0.00 pass pass pass 2 n/a pass
17 NFCI Financial conditions weekly 1377 pass 0.00 pass pass pass 3 n/a pass
18 INDPRO Growth (coincident) monthly 316 pass 0.00 pass pass pass 0 n/a review
19 T10YIE Inflation expectations business-day 6109 warn 4.11 pass pass warn 5 n/a pass
20 M2SL Liquidity monthly 316 pass 0.00 pass pass pass 0 n/a review
21 CPILFESL Macro context monthly 316 warn 0.32 pass pass pass 2 n/a review
22 T10Y3M Macro context business-day 6892 warn 4.14 pass pass pass 3 n/a pass
23 UNRATE Macro context monthly 316 warn 0.32 pass pass pass 4 n/a review
24 DFF Monetary policy business-day 9649 pass 0.00 pass pass warn 419 n/a pass
25 DFII10 Real rates business-day 6108 warn 4.11 pass pass pass 3 n/a pass
26 SAHM Recession regime monthly 316 warn 0.32 pass pass pass 3 n/a review
27 DGS10 Safe havens & FX business-day 6891 warn 4.14 pass pass pass 4 n/a pass
28 DGS2 Safe havens & FX business-day 6891 warn 4.14 pass pass warn 7 n/a pass
29 EURUSD_AV Safe havens & FX business-day 5000 pass 0.00 pass pass pass 2 pass pass
30 GLD Safe havens & FX business-day 5416 pass 0.00 pass pass pass 1 pass pass
31 TREAS_EUR Safe havens & FX quarterly 101 pass 0.00 pass pass pass 1 pass pass
32 UUP Safe havens & FX business-day 4844 pass 0.00 pass pass pass 3 pass pass
33 VIX Safe havens & FX business-day 6643 pass 0.00 pass pass pass 2 n/a pass
34 VIXCLS Safe havens & FX business-day 6891 warn 3.16 pass pass pass 2 n/a pass
35 BDRY Shipping & insurance business-day 2059 pass 0.00 pass pass warn 7 pass pass
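Several scorecard columns reduce to short checks. The sketch below is an illustration, not the project's scoring code: `scorecard` and `max_repeat_run` are hypothetical, and it assumes that max_stale_run counts consecutive repeats of the previous value (consistent with ETH-USD showing 0) and that stale warns at 5 or more.

```python
def max_repeat_run(values):
    """Longest streak of observations equal to the one before (0 = never repeats)."""
    best = cur = 0
    for prev, nxt in zip(values, values[1:]):
        cur = cur + 1 if nxt == prev else 0
        best = max(best, cur)
    return best


def scorecard(index, values, stale_threshold=5):
    """A few of the scorecard columns for one series."""
    present = [v for v in values if v is not None]  # repeats are judged on observed values only
    run = max_repeat_run(present)
    return {
        "pct_missing": round(100 * (len(values) - len(present)) / len(values), 2),
        "duplicate_idx": "pass" if len(set(index)) == len(index) else "fail",
        "ordered": "pass" if list(index) == sorted(index) else "fail",
        "max_stale_run": run,
        "stale": "warn" if run >= stale_threshold else "pass",
        "positivity": "pass" if all(v > 0 for v in present) else "fail",
    }


# Toy series: a policy-rate-like plateau, one gap, and one negative print.
index = list(range(10))
values = [5.25, 5.25, 5.25, 5.25, 5.25, 5.25, None, 5.5, 5.5, -1.0]
card = scorecard(index, values)
print(card["max_stale_run"], card["stale"], card["positivity"])  # 5 warn fail
```

A stale warning is not automatically a defect: an administered rate such as DFF legitimately sits unchanged for long stretches, which is why the column says warn and not fail.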

3.7.2 7.2 Cross-source consistency

Where two sources measure the same concept, they should agree — a direct quality probe:

series_a series_b overlap_rows level_corr
0 VIX VIXCLS 6643 1.0
1 DCOILWTICO EIA_WTI 4988 1.0

^VIX (yfinance) vs VIXCLS (FRED) and DCOILWTICO (FRED WTI) vs the EIA WTI series give two clean cross-source validations across three independent providers.

3.7.3 7.3 Material quality findings

Minimum WTI value in snapshot: -36.98 on 2020-04-20
ICE HY OAS free history: 785 rows from 2023-06-05 (licensing-capped)
  • Negative oil price (real, not an error). WTI’s minimum is the negative print of 20 April 2020, a genuine market event, and the reason both WTI series (DCOILWTICO, EIA_WTI) fail the positivity check in the scorecard. It breaks log-return math and any “prices are positive” assumption.
  • Credit-spread history is licensing-capped. The ICE BofA OAS series everyone reaches for first only exist from ~2023 on the free tier — zero past recessions in-sample. The council flagged this as the debate’s blind spot; the fix was the free Moody’s BAA10Y (1990+). A signal you can’t observe across a recession can’t be calibrated on data you own.
  • Calendar misalignment. Crypto trades 7 days/week; equities ~5; NFCI is weekly; macro is monthly. Per-series statistics use each series’ native frequency; any aligned view must not be read as if all series share a calendar.
  • Look-ahead / release lag. Monthly macro (CPI, unemployment, M2, industrial production, Sahm) is stamped by reference period, known only on the later release date. Phase 3 must lag these by their publication delay — the exact trap Chapter 1 warned about.
  • Survivorship & venue caveats. Free equity sources omit delisted names (ETFs only partly mitigate); yfinance crypto is an aggregate, not a single venue. Acceptable for this PoC; flagged before any Phase-4 backtest.
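The negative-price finding has a direct coding consequence: a log return is undefined whenever either price is zero or negative. One defensive option, sketched below with hypothetical helper names, is to return None in that case and fall back to plain price differences for series that can go negative. Only the middle price is from the snapshot; its neighbours are invented.

```python
import math


def safe_log_return(prev, cur):
    """Log return, or None when either price is missing or non-positive."""
    if prev is None or cur is None or prev <= 0 or cur <= 0:
        return None
    return math.log(cur / prev)


def price_change(prev, cur):
    """Plain difference in price units; defined for any sign."""
    return cur - prev


prices = [18.0, -36.98, 9.0]  # middle value is the snapshot minimum; neighbours are made up
log_rets = [safe_log_return(a, b) for a, b in zip(prices, prices[1:])]
diffs = [round(price_change(a, b), 2) for a, b in zip(prices, prices[1:])]
print(log_rets)  # [None, None]
print(diffs)     # [-54.98, 45.98]
```

Both the move into and the move out of the negative print are lost to log returns, so a single bad day removes two observations.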

3.8 8. Gaps vs. the strategy & geopolitical thesis

The free sources cannot supply several things strategy.md and the Chapter-1 thesis ultimately want. Named honestly so they are not silently assumed:

  • Licensed composites — Conference Board LEI, ISM/PMI — not free; their market-based components (curve, credit, S&P) are in the snapshot directly, a partial substitute.
  • Long ICE BofA credit history — licensing-capped to ~2023; Moody’s BAA10Y substitutes.
  • Cross-sectional equity factors (value/momentum/quality): these need a survivorship-bias-free single-name universe (CRSP-style) and cannot be built from the seven ETFs and one single-name stock (LMT) in the snapshot.
  • Quantified geopolitical event data (GDELT/ACLED/ICEWS) — deferred to a later phase, consistent with baseline-first; proxied for now by VIX, credit, and the conflict-sensitive sleeves.
  • Crypto on-chain (MVRV/SOPR) and real freight indices (Baltic Dry) — paid; proxied by price/volume and the BDRY ETF.

3.9 9. Implications for Data Preparation (Phase 3 hand-off)

Phase 3 must enforce the discipline strategy.md spells out, or any “edge” we find will be an artifact:

  • Point-in-time / vintage data. Source macro features from ALFRED (not revised FRED) and lag every release to its actual publication date — this is the real reason to hold the FRED API key.
  • Stationary, comparable features. Transform levels to YoY/MoM changes, rolling z-scores, percentile ranks, and diffusion indices; spreads/ratios (curve slope, copper/gold, BAA−AAA); standardized surprise (actual − consensus) where a consensus feed exists.
  • Master calendar & alignment. Business-day master index; resample crypto to it (keeping a native 7-day view); lag-aware fills for weekly/monthly macro — never fill across the release frontier.
  • Anti-overfitting discipline. With only ~3–4 genuine regime episodes in 25 years, prefer a small, fixed, equal-weighted regime score over fitted weights; validate with purged/embargoed walk-forward CV, deflated Sharpe, and elevated t-stat hurdles; benchmark against dumb buy-and-hold SPY and 60/40.
  • Council’s steer. Use the combined regime read as a gross-exposure / volatility-scaling dial, not a buy/sell trigger; route the conflict-sensitive sleeve by the reflation-vs-stagflation quadrant (T10YIE vs DFII10); defer factors and crypto on-chain until their data is funded. See the Indicator Council Deliberation. → Data Preparation.
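The release-lag rule in the list above can be made concrete with an as-of lookup: each monthly reading becomes usable only on its assumed release date, and a query returns the latest reading already released. This is a sketch with hypothetical helpers, made-up values, and a flat 45-day lag as a placeholder. Real publication dates differ by series and would come from ALFRED vintages.

```python
from bisect import bisect_right
from datetime import date, timedelta


def available_from(reference_period, lag_days):
    """Date on which a reading for `reference_period` is assumed to be public."""
    return reference_period + timedelta(days=lag_days)


def as_of(release_dates, values, when):
    """Latest value already released on `when`, or None if nothing is out yet.

    `release_dates` must be sorted ascending and aligned with `values`.
    """
    i = bisect_right(release_dates, when)
    return values[i - 1] if i else None


# Toy monthly series stamped by reference month (values are made up).
ref = [date(2026, 1, 1), date(2026, 2, 1), date(2026, 3, 1)]
vals = [4.1, 4.2, 4.4]
released = [available_from(d, 45) for d in ref]  # 45 days is a placeholder lag

print(as_of(released, vals, date(2026, 3, 10)))  # 4.1
print(as_of(released, vals, date(2026, 2, 1)))   # None
```

On 10 March a join on the reference-period stamp would already serve the March reading (4.4); the as-of lookup serves January's 4.1, the only value that was public by then.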

3.10 10. Reproducibility

Snapshot vintage : 2026-06-02
Instruments      : 36
Python           : 3.12.13
pandas           : 3.0.3
numpy            : 2.4.6
yfinance         : 1.4.1

To reproduce or refresh the snapshot:

  1. Create the environment: python -m venv .venv, activate it, pip install -r requirements.txt.
  2. Register the kernel: python -m ipykernel install --user --name portfoliolens (matches jupyter: portfoliolens in _quarto.yml).
  3. Copy .env.example to .env and add free keys (optional — yfinance/FRED/Treasury run without).
  4. Pull: python scripts/poc_pull.py → writes data/raw/ and the committed data/snapshot/.
  5. Render: with the venv active, quarto render — it reads the snapshot only (no network, no keys).

The small data/snapshot/ (parquet + manifest.csv) is committed so a fresh clone renders this chapter without any pull; bulk data/raw/ is gitignored.


Phase-2 deliverables — complete: initial data collection report (§3), data description report (§4), an indicator-framework mapping to strategy.md (§5), data exploration report (§6), and data quality report (§7) — over 36 instruments of real, free-sourced data spanning 2000–2026, with the expert debate in the Indicator Council appendix.