
WOE Encoding

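WOE (Weight of Evidence) maps each category of a categorical variable, or each bin of a binned continuous variable, to a numeric value on a log-odds scale, which suits the linear (logistic regression) models behind scorecards. It is standard practice in scorecard modeling.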
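As a rough illustration of this mapping, the sketch below computes per-bin WOE and IV with pandas. It assumes the convention WOE = ln(good share / bad share) with a target of 1 marking a bad account; the sign convention, the smoothing constant `eps`, and the helper name `woe_iv_table` are assumptions made for this example and are not taken from SuperModelingFactory.

```python
import numpy as np
import pandas as pd


def woe_iv_table(bins: pd.Series, target: pd.Series, eps: float = 0.5) -> pd.DataFrame:
    """Per-bin WOE and IV contribution for a binary target (1 = bad)."""
    df = pd.DataFrame({"bin": bins, "bad": target})
    grouped = df.groupby("bin", observed=True)["bad"].agg(["sum", "count"])
    bad = grouped["sum"].astype(float)
    good = grouped["count"] - bad
    # Additive smoothing (an assumption) keeps the log finite for pure bins.
    bad_dist = (bad + eps) / (bad.sum() + eps * len(bad))
    good_dist = (good + eps) / (good.sum() + eps * len(good))
    woe = np.log(good_dist / bad_dist)
    iv = (good_dist - bad_dist) * woe  # each term is non-negative
    return pd.DataFrame({"good": good, "bad": bad, "WOE": woe, "IV": iv})


# Toy data: the "<30" bin is mostly bad, the "30+" bin is mostly good.
demo = pd.DataFrame({
    "age_bin": ["<30"] * 4 + ["30+"] * 4,
    "bad_flag": [1, 1, 1, 0, 0, 0, 0, 1],
})
table = woe_iv_table(demo["age_bin"], demo["bad_flag"])
print(table)
```

Under this convention a bin with more bads than average gets a negative WOE; a library that uses ln(bad share / good share) would flip every sign while leaving IV unchanged.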

The WOE subpackage of SuperModelingFactory provides a master class, a monotone binner, a transformer, a plotter, and a binning-engine adapter.

1. Master class: WOE_Master

from Modeling_Tool import WOE_Master

woe = WOE_Master(
    train_data=train_df,
    varlist=features,
    dep="bad_flag",
    missing_ref_value=-999999,
)
woe.fit(nbins=10, equal_freq=True)

train_woe = woe.transform(train_df)
test_woe = woe.transform(test_df)
oot_woe = woe.transform(oot_df)

Persisting the mapping table

woe.save_mapping_table("./output/woe_mapping.csv")

from Modeling_Tool import load_mapping_table
varlist, woe_dict = load_mapping_table("./output/woe_mapping.csv")

2. Greedy monotone binner: MonotoneWOEBinner

If the scorecard needs a stronger monotonicity constraint, MonotoneWOEBinner is the recommended choice:

from Modeling_Tool.WOE.WOE_Monotone_Binner import MonotoneWOEBinner

binner = MonotoneWOEBinner(
    feature_cols=features,
    target_col="bad_flag",
    n_init_bins=20,
    min_bin_size=0.03,
    special_values=[-1, -100, -999999],
    cate_feats=["city_grade"],
)
binner.fit(train_df, chi2_binning=True, chi2_p=0.95)
binner.refine_cate(max_bins=5)

train_woe = binner.apply_woe(train_df)
bins = binner.get_final_bins()
edges = binner.get_bin_edges()

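Methods

| Method | Description |
| --- | --- |
| fit(df, chi2_binning, chi2_p, n_jobs) | Fits the bins on the training data |
| refine_cate(max_bins) | Merges the levels of categorical features by clustering on bad rate |
| apply_woe(df) | Applies the WOE transformation |
| get_final_bins() | Exports the binning result (including WOE/IV) |
| load_woe_bins(bins_dict) | Loads existing bins |
| get_bin_edges() | Returns the list of bin edges |
| export_woe_report(path) | Exports an Excel report |
| plot_woe_graph(dir, group_name=) | Writes WOE plots as PNG files |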
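To make the idea of greedy monotone binning concrete, here is a minimal sketch that merges adjacent bins until the bad rate is monotone, in the style of pool-adjacent-violators. It is not the algorithm inside MonotoneWOEBinner, which this page does not describe; the function name `merge_to_monotone` is hypothetical, and the sketch ignores options such as `n_init_bins`, `min_bin_size`, special values, and chi-square merging.

```python
import numpy as np


def merge_to_monotone(bad, total):
    """Merge adjacent ordered bins until the bad rate is monotone.

    bad[i] and total[i] are the bad count and row count of ordered bin i.
    Returns (start, end) inclusive index ranges of the original bins.
    """
    bad = np.asarray(bad, dtype=float)
    total = np.asarray(total, dtype=float)
    rates = bad / total
    idx = np.arange(len(rates))
    # Pick the direction from the sign of the covariance between position and bad rate.
    increasing = np.sum((idx - idx.mean()) * (rates - rates.mean())) >= 0

    # Each group holds [first_bin, last_bin, bad_count, row_count].
    groups = [[i, i, bad[i], total[i]] for i in range(len(rates))]
    i = 0
    while i < len(groups) - 1:
        left = groups[i][2] / groups[i][3]
        right = groups[i + 1][2] / groups[i + 1][3]
        violates = right < left if increasing else right > left
        if violates:
            # Fold the right neighbour into the current group.
            groups[i][1] = groups[i + 1][1]
            groups[i][2] += groups[i + 1][2]
            groups[i][3] += groups[i + 1][3]
            del groups[i + 1]
            i = max(i - 1, 0)  # the merged group may now break order with its left neighbour
        else:
            i += 1
    return [(g[0], g[1]) for g in groups]


# Bad rates 0.05, 0.10, 0.08, 0.20, 0.30: bins 1 and 2 break the upward trend.
merged = merge_to_monotone(bad=[5, 10, 8, 20, 30], total=[100] * 5)
print(merged)
```

In this toy input only bins 1 and 2 are pooled; the other bins already follow the increasing trend and are left alone.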

3. Unified binning engine: as_woe_engine

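WOE_Master and MonotoneWOEBinner store their fitted results in different internal formats. as_woe_engine() wraps either one in a unified interface so that PSI, IV, and correlation filtering can reuse it.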

from Modeling_Tool import as_woe_engine

engine = as_woe_engine(binner)   # a WOE_Master instance can be passed as well
woe_table = engine.get_woe_table(features)
train_woe = engine.transform(train_df, features)

For more details, see the WOE Binning Engine page.
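The adapter idea can be sketched as follows. Everything in this example is hypothetical: `EdgeBins`, `TableBins`, `UnifiedEngine`, and `as_engine_sketch` are stand-ins invented to show how two differently shaped binning results might be normalized behind one `transform` method. They do not reflect the real internals of WOE_Master, MonotoneWOEBinner, or as_woe_engine.

```python
from dataclasses import dataclass

import numpy as np
import pandas as pd


@dataclass
class EdgeBins:
    """Hypothetical producer A: interior cut points plus one WOE per interval."""
    edges: dict  # feature -> sorted interior cut points
    woes: dict   # feature -> WOE values, len(cut points) + 1


@dataclass
class TableBins:
    """Hypothetical producer B: one long table with columns var, lower, upper, woe."""
    table: pd.DataFrame


class UnifiedEngine:
    """Common interface: look up the WOE of each value from cut points."""

    def __init__(self, edges, woes):
        self.edges, self.woes = edges, woes

    def transform(self, df, features):
        out = pd.DataFrame(index=df.index)
        for f in features:
            # Intervals are [lower, upper): a value equal to a cut point goes right.
            pos = np.searchsorted(self.edges[f], df[f].to_numpy(), side="right")
            out[f] = np.asarray(self.woes[f])[pos]
        return out


def as_engine_sketch(obj):
    """Normalize either producer into a UnifiedEngine."""
    if isinstance(obj, EdgeBins):
        return UnifiedEngine(obj.edges, obj.woes)
    if isinstance(obj, TableBins):
        edges, woes = {}, {}
        for var, g in obj.table.sort_values("lower").groupby("var"):
            edges[var] = g["upper"].to_list()[:-1]  # drop the final +inf bound
            woes[var] = g["woe"].to_list()
        return UnifiedEngine(edges, woes)
    raise TypeError(f"unsupported binning object: {type(obj).__name__}")


edge_bins = EdgeBins(edges={"age": [30, 50]}, woes={"age": [-0.4, 0.1, 0.6]})
table_bins = TableBins(pd.DataFrame({
    "var": ["age"] * 3,
    "lower": [-np.inf, 30, 50],
    "upper": [30, 50, np.inf],
    "woe": [-0.4, 0.1, 0.6],
}))
df = pd.DataFrame({"age": [25, 30, 49, 70]})
from_edges = as_engine_sketch(edge_bins).transform(df, ["age"])
from_table = as_engine_sketch(table_bins).transform(df, ["age"])
```

Both calls yield the same WOE column, which is the property downstream consumers such as PSI or IV calculators rely on.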

4. Integration with feature selection

Fit the binner once during training, then reuse the same object for selection, monitoring, and modeling:

from Modeling_Tool import PSICalculator, VarExtractionInsights, CorrelationFilter

psi = PSICalculator(binning_engine=binner).calculate(train_df, oot_df, features)

iv_report = VarExtractionInsights(
    train_df, "bad_flag", "./iv_plots/",
    woe_engine="monotone", woe_binner=binner,
).get_var_analysis_report(train_df, features)

keep_vars = CorrelationFilter(
    train_df, "bad_flag", corr_cutpoint=0.7,
    woe_engine="monotone", woe_binner=binner,
).remove_highly_correlated(features)

train_woe = binner.apply_woe(train_df)

5. Monotonicity check

from Modeling_Tool import is_monotonic, get_overall_woe_table

for var in features:
    woe_table = get_overall_woe_table(woe, train_df, [var])
    mono, direction = is_monotonic(woe_table, "WOE", direction="auto")
    print(var, mono, direction)
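
For readers who want to see what such a check amounts to, the sketch below tests whether a sequence of per-bin WOE values is monotone. `check_monotonic` is a hypothetical stand-in, not the library's `is_monotonic`; its `(flag, direction)` return shape merely mirrors the tuple unpacked in the loop above, and it treats ties as monotone, which is an assumption.

```python
import pandas as pd


def check_monotonic(values, direction="auto"):
    """Return (is_monotone, direction) for an ordered sequence of WOE values."""
    s = pd.Series(values, dtype=float).dropna()
    inc = bool(s.is_monotonic_increasing)  # non-strict: equal neighbours are allowed
    dec = bool(s.is_monotonic_decreasing)
    if direction == "increasing":
        return inc, "increasing"
    if direction == "decreasing":
        return dec, "decreasing"
    # "auto": accept whichever direction holds, preferring increasing on ties.
    if inc:
        return True, "increasing"
    if dec:
        return True, "decreasing"
    return False, None


rising = check_monotonic([-0.8, -0.1, 0.3])
falling = check_monotonic([0.9, 0.2, -0.5])
zigzag = check_monotonic([0.5, -0.2, 0.1])
print(rising, falling, zigzag)
```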

6. Standalone WOE transformation

from Modeling_Tool import woe_transform, woe_transformation

single_df, single_map = woe_transform(train_df, var="age", dep="bad_flag", nbins=10)
batch_result = woe_transformation(train_df, varlist=features, dep="bad_flag", nbins=10)

FAQ

When should I choose MonotoneWOEBinner?

Prefer MonotoneWOEBinner when a variable will enter the scorecard and needs stronger interpretability and a monotonicity constraint.

Why pass the binner in during the selection stage?

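Because PSI, IV, and KS should be computed on the same binning that goes to production. Otherwise the selection metrics and the modeling inputs may be inconsistent.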
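
A minimal sketch of why the shared bins matter: PSI compares how two samples distribute over a fixed set of bins, so the edges must come from the fitted binner rather than being recomputed per sample. The function `psi_on_fixed_edges`, the clipping constant `eps`, and the synthetic data are assumptions made for this example; this is not the implementation behind PSICalculator.

```python
import numpy as np


def psi_on_fixed_edges(expected, actual, edges, eps=1e-6):
    """Population Stability Index of `actual` vs `expected` over shared cut points."""
    n_bins = len(edges) + 1
    # Assign both samples to the same bins, using intervals of the form [lower, upper).
    e_pos = np.searchsorted(edges, expected, side="right")
    a_pos = np.searchsorted(edges, actual, side="right")
    e_pct = np.bincount(e_pos, minlength=n_bins) / len(expected)
    a_pct = np.bincount(a_pos, minlength=n_bins) / len(actual)
    # Clip empty bins so the logarithm stays finite.
    e_pct = np.clip(e_pct, eps, None)
    a_pct = np.clip(a_pct, eps, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))


rng = np.random.default_rng(0)
train_scores = rng.normal(0.0, 1.0, 5000)   # stand-in for a training feature
oot_scores = rng.normal(0.5, 1.0, 5000)     # same feature, shifted out-of-time sample
shared_edges = [-1.0, 0.0, 1.0]             # edges taken from the training-time binning

psi_same = psi_on_fixed_edges(train_scores, train_scores, shared_edges)
psi_shift = psi_on_fixed_edges(train_scores, oot_scores, shared_edges)
print(psi_same, psi_shift)
```

If the out-of-time sample were re-binned with its own quantiles instead, each bin would hold roughly the same share in both samples and the shift would be hidden.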