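The Byzer-python chapter introduced Byzer's support for Python. In cluster mode, however, users have to do some configuration and install conda on every node, which is somewhat cumbersome.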

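By contrast, Byzer also provides a set of built-in, out-of-the-box algorithms. This chapter explains the principles behind these algorithms and gives examples.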

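  1. Automated machine learning / AutoML
  2. K-means clustering / KMeans
  3. Naive Bayes / NaiveBayes
  4. Alternating least squares / ALS
  5. Random forest / RandomForest
  6. Linear regression / LinearRegression
  7. Logistic regression / LogisticRegression
  8. Latent Dirichlet allocation / LDA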

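Automated machine learning / AutoML

AutoML is the process of automating the end-to-end workflow of applying machine learning to real-world problems.

AutoML can train a set of algorithms in turn: NaiveBayes, LogisticRegression, LinearRegression, RandomForest, and GBT classification. The AutoML plugin trains multiple models on the user's input data, ranks them by their performance metrics, and returns the best-performing model to the user.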

-- Create test data
set jsonStr='''
{"features":[5.1,3.5,1.4,0.2],"label":0.0},
{"features":[5.1,3.5,1.4,0.2],"label":1.0}
{"features":[5.1,3.5,1.4,0.2],"label":0.0}
{"features":[4.4,2.9,1.4,0.2],"label":0.0}
{"features":[5.1,3.5,1.4,0.2],"label":1.0}
{"features":[5.1,3.5,1.4,0.2],"label":0.0}
{"features":[5.1,3.5,1.4,0.2],"label":0.0}
{"features":[4.7,3.2,1.3,0.2],"label":1.0}
{"features":[5.1,3.5,1.4,0.2],"label":0.0}
{"features":[5.1,3.5,1.4,0.2],"label":0.0}
''';

load jsonStr.`jsonStr` as data;

select vec_dense(features) as features ,label as label from data
as data1;

train data1 as AutoML.`/tmp/auto_ml` where

-- If algos is not set, the data is trained with all of these algorithms: GBTs, LinearRegression, LogisticRegression, NaiveBayes, RandomForest

algos="LogisticRegression,NaiveBayes"

-- If keepVersion is set to true, Byzer saves a new version of the model every time the script runs

and keepVersion="true"

-- evaluateTable names the validation set, which the evaluator uses to compute metrics such as F1 and accuracy

and evaluateTable="data1";

The final output is as follows:

[Result table not preserved in the source.]

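AutoML supports the following features:

  • The keepVersion parameter controls whether model versions are kept.
  • AutoML can train within a user-specified set of algorithms. Set the algos parameter (currently any subset of "GBTs, LinearRegression, LogisticRegression, NaiveBayes, RandomForest") to train the dataset with only those algorithms and obtain the best model among them.
  • AutoML ranks the algorithms by performance. The default metric is accuracy; users can rank by f1 or another metric instead.
  • At prediction time, AutoML picks the best-performing model among all models trained so far, so users do not need to name a specific model.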
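
The ranking step can be pictured with a small Python sketch. This is not Byzer's implementation: `rank_models` is a hypothetical helper and the metric values are made up purely for illustration. It only shows that the "best" model can change with the metric used for sorting.

```python
def rank_models(candidates, metric="accuracy"):
    """Return the candidates sorted from best to worst on the chosen metric."""
    return sorted(candidates, key=lambda c: c["metrics"][metric], reverse=True)

# Made-up metric values, for illustration only.
candidates = [
    {"algo": "LogisticRegression", "metrics": {"accuracy": 0.80, "f1": 0.76}},
    {"algo": "NaiveBayes", "metrics": {"accuracy": 0.70, "f1": 0.78}},
]

# Accuracy is the default ranking metric; f1 is an alternative.
best_by_accuracy = rank_models(candidates)[0]["algo"]
best_by_f1 = rank_models(candidates, metric="f1")[0]["algo"]
```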

Batch prediction

Users can run batch prediction on a dataset with the predict syntax. The Byzer code below predicts the dataset data1 with the AutoML model saved under /tmp/auto_ml:

predict data1 as AutoML.`/tmp/auto_ml`;

The result is as follows:

[Result table not preserved in the source.]

 

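K-means clustering / KMeans

KMeans (the k-means clustering algorithm) is an iterative clustering algorithm. To partition the data into K groups, it first randomly selects K objects as the initial cluster centers, then computes the distance between each object and every cluster center and assigns each object to its nearest center. The centers are then recomputed from the objects assigned to them, and the two steps repeat until the assignments stop changing.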
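
The steps above can be sketched in plain Python. This is a minimal illustration of the textbook procedure, not the Spark implementation that Byzer calls; the function names and the four sample points are assumptions made for the example.

```python
import random

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, seed=1, max_iter=100):
    rng = random.Random(seed)
    # Randomly pick K objects as the initial cluster centers.
    centers = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Assign every object to its nearest center.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Move each center to the mean of the objects assigned to it.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old center if a cluster ends up empty
                centers[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels, centers

# Two well-separated groups of hypothetical points.
points = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
labels, centers = kmeans(points, k=2, seed=1)
```

The `k` and `seed` arguments play the same role as the `k` and `seed` parameters in the train statement later in this section.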

First, create some data.

set jsonStr='''
{"features":[5.1,3.5,1.4,0.2],"label":0.0},
{"features":[5.1,3.5,1.4,0.2],"label":1.0}
{"features":[5.1,3.5,1.4,0.2],"label":0.0}
{"features":[4.4,2.9,1.4,0.2],"label":0.0}
{"features":[5.1,3.5,1.4,0.2],"label":1.0}
{"features":[5.1,3.5,1.4,0.2],"label":0.0}
{"features":[5.1,3.5,1.4,0.2],"label":0.0}
{"features":[4.7,3.2,1.3,0.2],"label":1.0}
{"features":[5.1,3.5,1.4,0.2],"label":0.0}
{"features":[5.1,3.5,1.4,0.2],"label":0.0}
''';
load jsonStr.`jsonStr` as data;
select vec_dense(features) as features from data
as data1;

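The result is as follows:

[Result table not preserved in the source.]

Clustering is an unsupervised technique, so there is no concept of a label. Next, train the model: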

train data1 as KMeans.`/tmp/alg/kmeans`
where k="2"
and seed="1";

Batch prediction

API prediction

After training, register the model as a function and use it for prediction:

register KMeans.`/tmp/alg/kmeans` as kcluster;
select kcluster(features) as category from data1 as output;

The result is as follows:

[Result table not preserved in the source.]

 

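Naive Bayes / NaiveBayes

NaiveBayes is a classification algorithm. Compared with decision tree models, the naive Bayes classifier (NBC) is rooted in classical mathematical theory, so it has a solid mathematical foundation and stable classification efficiency. The NBC model also needs few estimated parameters, is not very sensitive to missing data, and is relatively simple.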
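
To make "few estimated parameters" concrete, here is a minimal sketch of a multinomial naive Bayes classifier with additive smoothing. It is a simplified illustration, not the Spark code Byzer runs; the helper names and the toy count data are assumptions. The `smoothing` argument corresponds in spirit to the `smoothing` parameter used in the training statement of this section.

```python
import math
from collections import defaultdict

def train_multinomial_nb(rows, smoothing=1.0):
    """rows: list of (features, label); features are non-negative counts."""
    n_features = len(rows[0][0])
    class_counts = defaultdict(int)
    feature_sums = defaultdict(lambda: [0.0] * n_features)
    for features, label in rows:
        class_counts[label] += 1
        for i, value in enumerate(features):
            feature_sums[label][i] += value
    model = {}
    for label, count in class_counts.items():
        total = sum(feature_sums[label])
        log_prior = math.log(count / len(rows))
        # Additive smoothing keeps unseen features from getting probability 0.
        log_likelihood = [
            math.log((s + smoothing) / (total + smoothing * n_features))
            for s in feature_sums[label]
        ]
        model[label] = (log_prior, log_likelihood)
    return model

def predict_label(model, features):
    def score(label):
        log_prior, log_likelihood = model[label]
        return log_prior + sum(v * w for v, w in zip(features, log_likelihood))
    return max(model, key=score)

# Hypothetical toy data: two count features, two classes.
rows = [([3, 0], 0.0), ([2, 1], 0.0), ([0, 3], 1.0), ([1, 2], 1.0)]
model = train_multinomial_nb(rows, smoothing=0.5)
```

The whole model is one prior and one likelihood vector per class, which is why training is cheap.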

-- Create test data
set jsonStr='''
{"features":[5.1,3.5,1.4,0.2],"label":0.0},
{"features":[5.1,3.5,1.4,0.2],"label":1.0}
{"features":[5.1,3.5,1.4,0.2],"label":0.0}
{"features":[4.4,2.9,1.4,0.2],"label":0.0}
{"features":[5.1,3.5,1.4,0.2],"label":1.0}
{"features":[5.1,3.5,1.4,0.2],"label":0.0}
{"features":[5.1,3.5,1.4,0.2],"label":0.0}
{"features":[4.7,3.2,1.3,0.2],"label":1.0}
{"features":[5.1,3.5,1.4,0.2],"label":0.0}
{"features":[5.1,3.5,1.4,0.2],"label":0.0}
''';
load jsonStr.`jsonStr` as data;
select vec_dense(features) as features ,label as label from data
as data1;

-- Train the data with naive Bayes
train data1 as NaiveBayes.`/tmp/model` where

-- If keepVersion is set to true, Byzer saves a new version of the model every time the script runs
keepVersion="true"

-- evaluateTable names the validation set, which the evaluator uses to compute metrics such as F1 and accuracy
and evaluateTable="data1"

-- Parameters for group 0 (the first parameter group)
and `fitParam.0.featuresCol`="features"
and `fitParam.0.labelCol`="label"
and `fitParam.0.smoothing`="0.5"

-- Parameters for group 1 (the second parameter group)
and `fitParam.1.featuresCol`="features"
and `fitParam.1.labelCol`="label"
and `fitParam.1.smoothing`="0.2"
;

The final output is as follows:

name   value
---------------------------------
modelPath    /tmp/model/_model_10/model/1
algIndex     1
alg          org.apache.spark.ml.classification.NaiveBayes
metrics      f1: 0.7625000000000001 weightedPrecision: 0.8444444444444446 weightedRecall: 0.7999999999999999 accuracy: 0.8
status       success
startTime    20180913 59:15:32:685
endTime      20180913 59:15:36:317
trainParams  Map(smoothing -> 0.2, featuresCol -> features, labelCol -> label)
---------------------------------
modelPath    /tmp/model/_model_10/model/0
algIndex     0
alg          org.apache.spark.ml.classification.NaiveBayes
metrics      f1: 0.7625000000000001 weightedPrecision: 0.8444444444444446 weightedRecall: 0.7999999999999999 accuracy: 0.8
status       success
startTime    20180913 59:15:36:318
endTime      20180913 59:15:38:024
trainParams  Map(smoothing -> 0.5, featuresCol -> features, labelCol -> label)

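Most built-in algorithms support the following features:

  1. keepVersion controls whether model versions are kept.
  2. fitParam.<index> configures multiple parameter groups; once evaluateTable is set, the system computes the metrics automatically.

Batch prediction

predict data1 as NaiveBayes.`/tmp/model`;

The result is as follows: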

features                                label  rawPrediction                                            probability  prediction
{"type":1,"values":[5.1,3.5,1.4,0.2]}    0    {"type":1,"values":[16.28594461094461,3.7140553890553893]}    {"type":1,"values":[0.8142972305472306,0.18570276945276948]}    0
{"type":1,"values":[5.1,3.5,1.4,0.2]}    1    {"type":1,"values":[16.28594461094461,3.7140553890553893]}    {"type":1,"values":[0.8142972305472306,0.18570276945276948]}    0

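API prediction

register NaiveBayes.`/tmp/model` as rf_predict;

-- Use algIndex to pick the model trained with a specific parameter group
register NaiveBayes.`/tmp/model` as rf_predict where
algIndex="0";

-- Use autoSelectByMetric to name the metric that decides which model is best
register NaiveBayes.`/tmp/model` as rf_predict where
autoSelectByMetric="f1";

select rf_predict(features) as predict_label, label from data1 as output;

  • The algIndex parameter lets users choose the model produced by a specific parameter group.
  • Users can also let the system choose automatically. This requires that evaluateTable was set during training so that model performance was evaluated in advance; autoSelectByMetric then names the metric used to pick the best model.
  • Finally, the registered model can be called like a function to predict on a feature.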
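
As a rough picture of what selecting by metric means, the sketch below parses metric strings in the format shown in the training output and returns the algIndex with the highest value. It is only an illustration: `parse_metrics` and `select_alg_index` are hypothetical helpers, the metric values are made up, and Byzer's internal selection logic may differ.

```python
import re

def parse_metrics(text):
    """Turn 'f1: 0.76 accuracy: 0.8' into {'f1': 0.76, 'accuracy': 0.8}."""
    return {name: float(value) for name, value in re.findall(r"(\w+):\s*(\S+)", text)}

def select_alg_index(results, metric):
    """Return the algIndex of the result row with the best value for metric."""
    best = max(results, key=lambda row: parse_metrics(row["metrics"])[metric])
    return best["algIndex"]

# Made-up rows shaped like the training output (algIndex plus a metrics string).
results = [
    {"algIndex": 0, "metrics": "f1: 0.70 accuracy: 0.80"},
    {"algIndex": 1, "metrics": "f1:0.76 accuracy: 0.75"},
]
```

With these rows, selecting by f1 gives parameter group 1, while selecting by accuracy gives group 0.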

 

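Alternating least squares / ALS

ALS is popular in collaborative filtering. It makes it easy to build a recommender system.

Its data format is simple: it needs three columns, userCol, itemCol, and ratingCol.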
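
The idea of alternating least squares can be shown with a deliberately tiny sketch that uses a single latent factor per user and per item. This is an assumption-laden simplification (rank 1, plain L2 regularization, hypothetical function name), not the Spark ALS that Byzer invokes. Each half-step fixes one side and solves a regularized least-squares problem for the other.

```python
from collections import defaultdict

def als_rank1(ratings, reg=0.01, max_iter=5):
    """ratings: list of (user, item, rating). Returns (user_factor, item_factor)."""
    by_user, by_item = defaultdict(list), defaultdict(list)
    for user, item, rating in ratings:
        by_user[user].append((item, rating))
        by_item[item].append((user, rating))
    u = {user: 1.0 for user in by_user}
    v = {item: 1.0 for item in by_item}
    for _ in range(max_iter):
        # Fix the item factors and solve for each user factor in closed form.
        for user, pairs in by_user.items():
            u[user] = sum(r * v[i] for i, r in pairs) / (sum(v[i] ** 2 for i, _ in pairs) + reg)
        # Fix the user factors and solve for each item factor in closed form.
        for item, pairs in by_item.items():
            v[item] = sum(r * u[a] for a, r in pairs) / (sum(u[a] ** 2 for a, _ in pairs) + reg)
    return u, v

# (user, item, rating) triples in the same shape as the sample data in this section.
ratings = [(1, 2, 1.0), (1, 3, 1.0), (2, 2, 1.0), (2, 7, 1.0), (1, 2, 1.0), (1, 6, 1.0)]
u, v = als_rank1(ratings)
predicted = u[2] * v[3]  # predicted rating for a pair that was never observed
```

The predicted rating for any user and item is the product of their factors, which is how unseen pairs get a score.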

set jsonStr='''
{"a":1,"i":2,"rate":1},
{"a":1,"i":3,"rate":1},
{"a":2,"i":2,"rate":1},
{"a":2,"i":7,"rate":1},
{"a":1,"i":2,"rate":1},
{"a":1,"i":6,"rate":1},
''';

load jsonStr.`jsonStr` as data;

Now train with ALS:

train data as ALSInPlace.`/tmp/model` where

-- First parameter group
`fitParam.0.maxIter`="5"
and `fitParam.0.regParam` = "0.01"
and `fitParam.0.userCol` = "a"
and `fitParam.0.itemCol` = "i"
and `fitParam.0.ratingCol` = "rate"

-- Second parameter group
and `fitParam.1.maxIter`="1"
and `fitParam.1.regParam` = "0.1"
and `fitParam.1.userCol` = "a"
and `fitParam.1.itemCol` = "i"
and `fitParam.1.ratingCol` = "rate"

-- Compute rmse
and evaluateTable="data"
and ratingCol="rate"

-- Number of items to recommend to each user
and `userRec` = "10"

-- Number of users to recommend for each item
-- and `itemRec` = "10"
and coldStartStrategy="drop";

Here we configure two parameter groups and evaluate them with rmse. The final result recommends 10 items to each user. To recommend 10 users for each item instead, set the itemRec parameter.
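
For reference, rmse (root mean squared error) is the square root of the average squared difference between predicted and actual ratings; lower is better. A minimal sketch with made-up numbers:

```python
import math

def rmse(predicted, actual):
    """Root mean squared error between two equally long sequences."""
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical predictions and ratings: only the last pair differs, by 2.
error = rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
```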

The final result is as follows:

[Result table not preserved in the source.]

Take a look at the resulting recommendations:

load parquet.`/tmp/model/data/userRec` as userRec;
select * from userRec as result;

[Result table not preserved in the source.]

Batch prediction

This algorithm supports neither batch prediction nor API prediction.

 

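Random forest / RandomForest

RandomForest trains multiple decision trees on the samples and combines them to classify and predict. It is mainly used in regression and classification scenarios. Besides classifying the data, it can also give an importance score for each variable, assessing the role each one plays in the classification.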
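
How several trees combine into one prediction can be sketched as a simple majority vote. This is a simplification for illustration: the three "trees" below are hand-written stand-ins with hypothetical thresholds rather than trained models, and real implementations may combine per-tree class probabilities instead of hard votes.

```python
from collections import Counter

def forest_predict(trees, features):
    """Return the label that most trees vote for."""
    votes = Counter(tree(features) for tree in trees)
    return votes.most_common(1)[0][0]

# Hand-written stand-ins for trained decision trees (hypothetical thresholds).
trees = [
    lambda f: 1.0 if f[0] < 5.0 else 0.0,
    lambda f: 1.0 if f[1] < 3.3 else 0.0,
    lambda f: 0.0 if f[2] > 1.35 else 1.0,
]

label = forest_predict(trees, [4.7, 3.2, 1.3, 0.2])
```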

-- Create the test dataset
set jsonStr='''
{"features":[5.1,3.5,1.4,0.2],"label":0.0},
{"features":[5.1,3.5,1.4,0.2],"label":1.0}
{"features":[5.1,3.5,1.4,0.2],"label":0.0}
{"features":[4.4,2.9,1.4,0.2],"label":0.0}
{"features":[5.1,3.5,1.4,0.2],"label":1.0}
{"features":[5.1,3.5,1.4,0.2],"label":0.0}
{"features":[5.1,3.5,1.4,0.2],"label":0.0}
{"features":[4.7,3.2,1.3,0.2],"label":1.0}
{"features":[5.1,3.5,1.4,0.2],"label":0.0}
{"features":[5.1,3.5,1.4,0.2],"label":0.0}
''';
load jsonStr.`jsonStr` as data;
select vec_dense(features) as features ,label as label from data
as data1;

-- Train with the random forest algorithm
train data1 as RandomForest.`/tmp/model` where

-- If keepVersion is set to true, Byzer saves a new version of the model every time the script runs
keepVersion="true"

-- evaluateTable names the validation set, which the evaluator uses to compute metrics such as F1 and accuracy
and evaluateTable="data1"

-- Parameters for group 0 (the first parameter group)
and `fitParam.0.featuresCol`="features"
and `fitParam.0.labelCol`="label"
and `fitParam.0.maxDepth`="2"

-- Parameters for group 1 (the second parameter group)
and `fitParam.1.featuresCol`="features"
and `fitParam.1.labelCol`="label"
and `fitParam.1.maxDepth`="10"
;

The final output is as follows:

name   value
---------------------------------
modelPath    /tmp/model/_model_10/model/1
algIndex     1
alg          org.apache.spark.ml.classification.RandomForestClassifier
metrics      f1: 0.7625000000000001 weightedPrecision: 0.8444444444444446 weightedRecall: 0.7999999999999999 accuracy: 0.8
status       success
startTime    20180913 59:15:32:685
endTime      20180913 59:15:36:317
trainParams  Map(maxDepth -> 10)
---------------------------------
modelPath    /tmp/model/_model_10/model/0
algIndex     0
alg          org.apache.spark.ml.classification.RandomForestClassifier
metrics      f1: 0.7625000000000001 weightedPrecision: 0.8444444444444446 weightedRecall: 0.7999999999999999 accuracy: 0.8
status       success
startTime    20180913 59:15:36:318
endTime      20180913 59:15:38:024
trainParams  Map(maxDepth -> 2, featuresCol -> features, labelCol -> label)

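Most built-in algorithms support the following features:

  1. keepVersion controls whether model versions are kept.
  2. fitParam.<index> configures multiple parameter groups; once evaluateTable is set, the system computes the metrics automatically.

Batch prediction

predict data1 as RandomForest.`/tmp/model`;

The result is as follows: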

features                                label  rawPrediction                                            probability  prediction
{"type":1,"values":[5.1,3.5,1.4,0.2]}    0    {"type":1,"values":[16.28594461094461,3.7140553890553893]}    {"type":1,"values":[0.8142972305472306,0.18570276945276948]}    0
{"type":1,"values":[5.1,3.5,1.4,0.2]}    1    {"type":1,"values":[16.28594461094461,3.7140553890553893]}    {"type":1,"values":[0.8142972305472306,0.18570276945276948]}    0

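API prediction

register RandomForest.`/tmp/model` as rf_predict;

-- Use algIndex to pick the model trained with a specific parameter group
register RandomForest.`/tmp/model` as rf_predict where
algIndex="0";

-- Use autoSelectByMetric to name the metric that decides which model is best; here it is F1
register RandomForest.`/tmp/model` as rf_predict where
autoSelectByMetric="f1";

select rf_predict(features) as predict_label, label from data1 as output;

  • The algIndex parameter lets users choose the model produced by a specific parameter group.
  • Users can also let the system choose automatically. This requires that evaluateTable was set during training so that model performance was evaluated in advance; autoSelectByMetric then names the metric used to pick the best model.
  • Finally, the registered model can be called like a function to predict on a feature.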

 

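Logistic regression / LogisticRegression

Logistic regression is a generalized linear regression model, commonly used in data mining, automatic disease diagnosis, economic forecasting, and similar fields.

-- Create test data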
set jsonStr='''
{"features":[5.1,3.5,1.4,0.2],"label":0.0},
{"features":[5.1,3.5,1.4,0.2],"label":1.0}
{"features":[5.1,3.5,1.4,0.2],"label":0.0}
{"features":[4.4,2.9,1.4,0.2],"label":0.0}
{"features":[5.1,3.5,1.4,0.2],"label":1.0}
{"features":[5.1,3.5,1.4,0.2],"label":0.0}
{"features":[5.1,3.5,1.4,0.2],"label":0.0}
{"features":[4.7,3.2,1.3,0.2],"label":1.0}
{"features":[5.1,3.5,1.4,0.2],"label":0.0}
{"features":[5.1,3.5,1.4,0.2],"label":0.0}
''';
load jsonStr.`jsonStr` as data;
select vec_dense(features) as features , label as label from data
as data1;

-- select * from data1 as output1;
-- Train with logistic regression
train data1 as LogisticRegression.`/tmp/model_2` where

-- If keepVersion is set to true, Byzer saves a new version of the model every time the script runs
keepVersion="true"

-- evaluateTable names the validation set, which the evaluator uses to compute metrics such as F1 and accuracy
and evaluateTable="data1"

-- Parameters for group 0 (the first parameter group)
and `fitParam.0.labelCol`="label"
and `fitParam.0.featuresCol`="features"
and `fitParam.0.fitIntercept`="true"

-- Parameters for group 1 (the second parameter group)
and `fitParam.1.featuresCol`="features"
and `fitParam.1.labelCol`="label"
and `fitParam.1.fitIntercept`="false"
;

The final output is as follows:

name        value
---------------    ------------------
modelPath    /_model_5/model/1
algIndex    1
alg            org.apache.spark.ml.classification.LogisticRegression
metrics        f1: 0.7625000000000001 weightedPrecision: 0.8444444444444446 weightedRecall: 0.7999999999999999 accuracy: 0.8
status        success
message    
startTime    20210824 42:14:33:761
endTime        20210824 42:14:41:984
trainParams    Map(labelCol -> label, featuresCol -> features, fitIntercept -> false)
---------------    ------------------
modelPath    /_model_5/model/0
algIndex    0
alg            org.apache.spark.ml.classification.LogisticRegression
metrics        f1: 0.7625000000000001 weightedPrecision: 0.8444444444444446 weightedRecall: 0.7999999999999999 accuracy: 0.8
status        success
message    
startTime    20210824 42:14:41:985
endTime        20210824 42:14:47:830
trainParams    Map(featuresCol -> features, labelCol -> label, fitIntercept -> true)

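Most built-in algorithms support the following features:

  1. keepVersion controls whether model versions are kept.
  2. fitParam.<index> configures multiple parameter groups; once evaluateTable is set, the system computes the metrics automatically.

Batch prediction

predict data1 as LogisticRegression.`/tmp/model_2`;

The result is as follows: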

features                                label            rawPrediction                                        probability                                                        prediction
{"type":1,"values":[5.1,3.5,1.4,0.2]}    0    {"type":1,"values":[1.0986123051777668,-1.0986123051777668]}    {"type":1,"values":[0.7500000030955607,0.24999999690443933]}    0
{"type":1,"values":[5.1,3.5,1.4,0.2]}    1    {"type":1,"values":[1.0986123051777668,-1.0986123051777668]}    {"type":1,"values":[0.7500000030955607,0.24999999690443933]}    0
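
For a binary model, the probability column can be reproduced from the rawPrediction column with the logistic (sigmoid) function. The sketch below checks this against the values in the first row of the table above; it is an illustration of the relationship, not code taken from Byzer.

```python
import math

def sigmoid(z):
    """Logistic function: maps a raw score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# rawPrediction values copied from the first row of the result table.
raw_prediction = [1.0986123051777668, -1.0986123051777668]
probability = [sigmoid(z) for z in raw_prediction]

# The predicted label is the index of the larger probability.
prediction = probability.index(max(probability))
```

The computed probabilities are approximately 0.75 and 0.25, matching the probability column, and the larger one belongs to class 0.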

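API prediction

register LogisticRegression.`/tmp/model_2` as lr_predict;

-- Use algIndex to pick the model trained with a specific parameter group
register LogisticRegression.`/tmp/model_2` as lr_predict where
algIndex="0";

-- Use autoSelectByMetric to name the metric that decides which model is best; here it is F1
register LogisticRegression.`/tmp/model_2` as lr_predict where
autoSelectByMetric="f1";

select lr_predict(features) as predict_label, label from data1 as output;

  • The algIndex parameter lets users choose the model produced by a specific parameter group.
  • Users can also let the system choose automatically. This requires that evaluateTable was set during training so that model performance was evaluated in advance; autoSelectByMetric then names the metric used to pick the best model.
  • Finally, the registered model can be called like a function to predict on a feature.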

 

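Linear regression / LinearRegression

Linear regression is a form of regression analysis that models the relationship between one or more independent variables and a dependent variable, using a least-squares function known as the linear regression equation.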
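
For a single independent variable, the least-squares fit has a closed form. The sketch below is a minimal illustration with made-up points lying exactly on y = 2x + 1; `fit_line` is a hypothetical helper and is unrelated to the elasticNetParam regularization used in the training statement of this section.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up points on the line y = 2x + 1.
slope, intercept = fit_line([1.0, 2.0, 3.0, 4.0], [3.0, 5.0, 7.0, 9.0])
```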

-- Create test data
set jsonStr='''
{"features":[5.1,3.5,1.4,0.2],"label":0.0},
{"features":[5.1,3.5,1.4,0.2],"label":1.0}
{"features":[5.1,3.5,1.4,0.2],"label":0.0}
{"features":[4.4,2.9,1.4,0.2],"label":0.0}
{"features":[5.1,3.5,1.4,0.2],"label":1.0}
{"features":[5.1,3.5,1.4,0.2],"label":0.0}
{"features":[5.1,3.5,1.4,0.2],"label":0.0}
{"features":[4.7,3.2,1.3,0.2],"label":1.0}
{"features":[5.1,3.5,1.4,0.2],"label":0.0}
{"features":[5.1,3.5,1.4,0.2],"label":0.0}
''';
load jsonStr.`jsonStr` as data;
select vec_dense(features) as features , label as label from data
as data1;

-- select * from data1 as output1;
-- Train with linear regression
train data1 as LinearRegression.`/tmp/model_3` where

-- If keepVersion is set to true, Byzer saves a new version of the model every time the script runs
keepVersion="true"

-- evaluateTable names the validation set, which the evaluator uses to compute metrics such as F1 and accuracy
and evaluateTable="data1"

-- Parameters for group 0 (the first parameter group)
and `fitParam.0.labelCol`="label"
and `fitParam.0.featuresCol`="features"
and `fitParam.0.elasticNetParam`="0.1"

-- Parameters for group 1 (the second parameter group)
and `fitParam.1.featuresCol`="features"
and `fitParam.1.labelCol`="label"
and `fitParam.1.elasticNetParam`="0.8"
;

The final output is as follows:

name   value
---------------    ------------------
modelPath    /_model_4/model/1
algIndex    1
alg            org.apache.spark.ml.regression.LinearRegression
metrics        f1: 0.0 weightedPrecision: 0.0 weightedRecall: 0.0 accuracy: 0.0
status        success
message    
startTime    20210824 52:14:52:441
endTime        20210824 52:14:53:429
trainParams    Map(labelCol -> label, featuresCol -> features, elasticNetParam -> 0.8)
---------------------------------
modelPath    /_model_4/model/0
algIndex    0
alg            org.apache.spark.ml.regression.LinearRegression
metrics        f1: 0.0 weightedPrecision: 0.0 weightedRecall: 0.0 accuracy: 0.0
status        success
message    
startTime    20210824 52:14:53:429
endTime        20210824 52:14:54:228
trainParams    Map(featuresCol -> features, elasticNetParam -> 0.1, labelCol -> label)

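Most built-in algorithms support the following features:

  1. keepVersion controls whether model versions are kept.
  2. fitParam.<index> configures multiple parameter groups; once evaluateTable is set, the system computes the metrics automatically.

Batch prediction

predict data1 as LinearRegression.`/tmp/model_3`;

The result is as follows: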

features                                label    prediction
{"type":1,"values":[5.1,3.5,1.4,0.2]}    0    0.24999999999999645
{"type":1,"values":[5.1,3.5,1.4,0.2]}    1    0.24999999999999645

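API prediction

register LinearRegression.`/tmp/model_3` as lr_predict;

-- Use algIndex to pick the model trained with a specific parameter group
register LinearRegression.`/tmp/model_3` as lr_predict where
algIndex="0";

-- Use autoSelectByMetric to name the metric that decides which model is best; here it is F1
register LinearRegression.`/tmp/model_3` as lr_predict where
autoSelectByMetric="f1";

select lr_predict(features) as predict_label, label from data1 as output;

  • The algIndex parameter lets users choose the model produced by a specific parameter group.
  • Users can also let the system choose automatically. This requires that evaluateTable was set during training so that model performance was evaluated in advance; autoSelectByMetric then names the metric used to pick the best model.
  • Finally, the registered model can be called like a function to predict on a feature.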

 

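Latent Dirichlet allocation / LDA

In machine learning, LDA abbreviates two common models: Linear Discriminant Analysis and Latent Dirichlet Allocation. In this section, LDA refers only to Latent Dirichlet Allocation. LDA plays a very important role in topic modeling and is often used for text classification.

LDA was proposed by David M. Blei, Andrew Y. Ng, and Michael I. Jordan in 2003 to infer the topic distribution of documents. It gives the topics of each document in a corpus as a probability distribution, so once the topic distributions of a set of documents have been extracted, they can be used for topic clustering or text classification.

Here is how to use it: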

set jsonStr='''
{"features":[5.1,3.5,1.4,0.2],"label":0.0},
{"features":[5.1,3.5,1.4,0.2],"label":1.0}
{"features":[5.1,3.5,1.4,0.2],"label":0.0}
{"features":[4.4,2.9,1.4,0.2],"label":0.0}
{"features":[5.1,3.5,1.4,0.2],"label":1.0}
{"features":[5.1,3.5,1.4,0.2],"label":0.0}
{"features":[5.1,3.5,1.4,0.2],"label":0.0}
{"features":[4.7,3.2,1.3,0.2],"label":1.0}
{"features":[5.1,3.5,1.4,0.2],"label":0.0}
{"features":[5.1,3.5,1.4,0.2],"label":0.0}
''';
load jsonStr.`jsonStr` as data;
select vec_dense(features) as features ,label as label from data
as data1;
train data1 as LDA.`/tmp/model` where

-- k: number of topics, i.e. the number of cluster centers
k="3"

-- docConcentration: hyperparameter (Dirichlet parameter) of the per-document topic distribution; must be > 1.0. The larger the value, the smoother the inferred distribution
and docConcentration="3.0"

-- topicConcentration: hyperparameter (Dirichlet parameter) of the per-topic distribution; must be > 1.0. The larger the value, the smoother the inferred distribution
and topicConcentration="3.0"

-- maxIter: number of iterations; it must be large enough for the model to converge, at least 20
and maxIter="100"

-- seed: random seed
and seed="10"

-- checkpointInterval: interval between checkpoints during the iterations
and checkpointInterval="10"

-- optimizer: optimization method, currently "em" or "online". The em method uses more memory; with many iterations it may run out of memory and throw a stack exception
and optimizer="online"
;

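Most of the parameters above do not need to be set. After training, a status result is returned:

[Result table not preserved in the source.]

Batch prediction

predict data1 as LDA.`/tmp/model` ;

The result is as follows:

[Result table not preserved in the source.]

API prediction

Currently only Spark 2.3.x is supported.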

register LDA.`/tmp/model` as lda;
select label,lda(4) topicsMatrix,lda_doc(features) TopicDistribution,lda_topic(label,4) describeTopics from data as result;

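Likewise, registering the LDA function implicitly generates several functions:

  1. lda takes a word.
  2. lda_doc takes a document.
  3. lda_topic takes a topic and the number of words to show.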

 

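Model self-explanation

After a model is trained, you can use the modelExplain syntax to inspect its parameters. The following example first trains two random forest models and saves them under /tmp/model.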

set jsonStr='''
{"features":[5.1,3.5,1.4,0.2],"label":0.0},
{"features":[5.1,3.5,1.4,0.2],"label":1.0}
{"features":[5.1,3.5,1.4,0.2],"label":0.0}
{"features":[4.4,2.9,1.4,0.2],"label":0.0}
{"features":[5.1,3.5,1.4,0.2],"label":1.0}
{"features":[5.1,3.5,1.4,0.2],"label":0.0}
{"features":[5.1,3.5,1.4,0.2],"label":0.0}
{"features":[4.7,3.2,1.3,0.2],"label":1.0}
{"features":[5.1,3.5,1.4,0.2],"label":0.0}
{"features":[5.1,3.5,1.4,0.2],"label":0.0}
''';
load jsonStr.`jsonStr` as mock_data;


select vec_dense(features) as features, label as label from mock_data as mock_data_1;

-- use RandomForest
train mock_data_1 as RandomForest.`/tmp/model` 
where keepVersion="true" 
and evaluateTable="mock_data_1"
and `fitParam.0.labelCol`="label"
and `fitParam.0.featuresCol`="features"
and `fitParam.0.maxDepth`="2"

and `fitParam.1.featuresCol`="features"
and `fitParam.1.labelCol`="label"
and `fitParam.1.maxDepth`="10" ;

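When training finishes, the result shows that the model directories are /tmp/model/_model_8/model/1 and /tmp/model/_model_8/model/0.

[Result table with the model_path column not preserved in the source.]

Next, view the model parameters.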

load modelExplain.`/tmp/model/` where alg="RandomForest" and index="8" as output;

Here the system combines /tmp/model with index="8" to read the model under /tmp/model/_model_8 and explains it as a random forest model. The statement above is equivalent to:

load modelExplain.`/tmp/model/_model_8` where alg="RandomForest" as output;
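
The directory layout implied by these paths can be written down as a small sketch. The two helper functions are hypothetical and simply mirror the paths shown in this section; they are not part of Byzer.

```python
def model_dir(base_path, index):
    """Directory of one saved version, e.g. /tmp/model/_model_8."""
    return base_path.rstrip("/") + "/_model_" + str(index)

def model_path(base_path, index, alg_index):
    """Directory of the model trained with one parameter group of that version."""
    return model_dir(base_path, index) + "/model/" + str(alg_index)

version_dir = model_dir("/tmp/model/", "8")
second_group = model_path("/tmp/model", 8, 1)
```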