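I am trying to tune the parameters of an ALS matrix factorization model that uses implicit data. To do this, I am trying to use pyspark.ml.tuning.CrossValidator to run through a parameter grid and select the best model. I believe my problem is in the evaluator, but I can't figure it out.
I can get this working for an explicit data model with a regression RMSE evaluator, as follows: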
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext
from pyspark.ml.recommendation import ALS
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.sql.functions import rand
conf = SparkConf() \
    .setAppName("MovieLensALS") \
    .set("spark.executor.memory", "2g")
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)
dfRatings = sqlContext.createDataFrame([(0, 0, 4.0), (0, 1, 2.0), (1, 1, 3.0), (1, 2, 4.0), (2, 1, 1.0), (2, 2, 5.0)],
                                       ["user", "item", "rating"])
dfRatingsTest = sqlContext.createDataFrame([(0, 0), (0, 1), (1, 1), (1, 2), (2, 1), (2, 2)], ["user", "item"])
alsExplicit = ALS()
defaultModel = alsExplicit.fit(dfRatings)
paramMapExplicit = ParamGridBuilder() \
    .addGrid(alsExplicit.rank, [8, 12]) \
    .addGrid(alsExplicit.maxIter, [10, 15]) \
    .addGrid(alsExplicit.regParam, [1.0, 10.0]) \
    .build()
evaluatorR = RegressionEvaluator(metricName="rmse", labelCol="rating")
cvExplicit = CrossValidator(estimator=alsExplicit, estimatorParamMaps=paramMapExplicit, evaluator=evaluatorR)
cvModelExplicit = cvExplicit.fit(dfRatings)
predsExplicit = cvModelExplicit.bestModel.transform(dfRatingsTest)
predsExplicit.show()
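When I try to do this for implicit data (say, counts of views rather than ratings), I get an error that I can't figure out. Here is the code (very similar to the above):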
dfCounts = sqlContext.createDataFrame([(0,0,0), (0,1,12), (0,2,3), (1,0,5), (1,1,9), (1,2,0), (2,0,0), (2,1,11), (2,2,25)],
                                      ["user", "item", "rating"])
dfCountsTest = sqlContext.createDataFrame([(0, 0), (0, 1), (1, 1), (1, 2), (2, 1), (2, 2)], ["user", "item"])
alsImplicit = ALS(implicitPrefs=True)
defaultModelImplicit = alsImplicit.fit(dfCounts)
paramMapImplicit = ParamGridBuilder() \
    .addGrid(alsImplicit.rank, [8, 12]) \
    .addGrid(alsImplicit.maxIter, [10, 15]) \
    .addGrid(alsImplicit.regParam, [1.0, 10.0]) \
    .addGrid(alsImplicit.alpha, [2.0, 3.0]) \
    .build()
evaluatorB = BinaryClassificationEvaluator(metricName="areaUnderROC", labelCol="rating")
evaluatorR = RegressionEvaluator(metricName="rmse", labelCol="rating")
cv = CrossValidator(estimator=alsImplicit, estimatorParamMaps=paramMapImplicit, evaluator=evaluatorR)
cvModel = cv.fit(dfCounts)
predsImplicit = cvModel.bestModel.transform(dfCountsTest)
predsImplicit.show()
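I tried doing this with an RMSE evaluator and I get an error. As I understand it, I should also be able to use the AUC metric with the binary classification evaluator, because the predictions of the implicit matrix factorization are a confidence matrix c_ui for predictions of a binary matrix p_ui, per this paper, which the documentation for pyspark ALS cites.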
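For reference, the mapping from raw counts to preferences and confidences in that paper can be sketched in plain Python. This is a minimal illustration that assumes the linear confidence form c_ui = 1 + alpha * r_ui; the helper name `to_preference_and_confidence` and the sample rows are made up for this example and are not part of pyspark.

```python
def to_preference_and_confidence(r_ui, alpha=1.0):
    """Map a raw implicit observation r_ui to (p_ui, c_ui).

    p_ui is 1 if any interaction was observed, else 0.
    c_ui = 1 + alpha * r_ui grows with the amount of interaction.
    """
    p_ui = 1 if r_ui > 0 else 0
    c_ui = 1.0 + alpha * r_ui
    return p_ui, c_ui


# (user, item, view count) rows, same shape as dfCounts above
counts = [(0, 0, 0), (0, 1, 12), (0, 2, 3), (1, 0, 5)]

for user, item, r in counts:
    p, c = to_preference_and_confidence(r, alpha=2.0)
    print("user=%d item=%d r=%d p=%d c=%.1f" % (user, item, r, p, c))
```

Note that a count of 0 still gets a confidence of 1.0 under this form: it is treated as a low-confidence "no preference", not as missing data.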
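Using either evaluator gives me an error, and I can't find any fruitful discussion online about cross-validating implicit ALS models. I'm looking through the CrossValidator source code to try to figure out what's wrong, but am having trouble. One of my thoughts is that after the process converts the implicit data matrix r_ui to the binary matrix p_ui and the confidence matrix c_ui, I'm not sure what it is comparing the predicted c_ui matrix against during the evaluation stage.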
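To make the comparison concrete: if the predictions are treated as scores for the binary preference p_ui, an AUC can be computed by hand, outside of any evaluator. The sketch below is plain Python with invented scores; `auc_from_scores` is a hypothetical helper, not a pyspark function, and it assumes labels are binarized as r_ui > 0.

```python
def auc_from_scores(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    A tie between a positive and a negative score counts as half.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Illustrative view counts and made-up model scores for the same (user, item) pairs
view_counts = [0, 12, 3, 5, 9, 0]
scores = [0.10, 0.90, 0.40, 0.05, 0.80, 0.20]

# Binarize the raw counts: p_ui = 1 if r_ui > 0, else 0
labels = [1 if r > 0 else 0 for r in view_counts]

auc = auc_from_scores(labels, scores)
print("AUC = %.2f" % auc)
```

This only illustrates the metric on binarized labels; it says nothing about what CrossValidator or the evaluators compare internally.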
Here is the error:
Traceback (most recent call last):
  File "<ipython-input-16-6c43b997005e>", line 1, in <module>
    cvModel = cv.fit(dfCounts)
  File "C:/spark-1.6.1-bin-hadoop2.6/python\pyspark\ml\pipeline.py", line 69, in fit
    return self._fit(dataset)
  File "C:/spark-1.6.1-bin-hadoop2.6/python\pyspark\ml\tuning.py", line 239, in _fit
    model = est.fit(train, epm[j])
  File "C:/spark-1.6.1-bin-hadoop2.6/python\pyspark\ml\pipeline.py", line 67, in fit
    return self.copy(params)._fit(dataset)
  File "C:/spark-1.6.1-bin-hadoop2.6/python\pyspark\ml\wrapper.py", line 133, in _fit
    java_model = self._fit_java(dataset)
  File "C:/spark-1.6.1-bin-hadoop2.6/python\pyspark\ml\wrapper.py", line 130, in _fit_java
    return self._java_obj.fit(dataset._jdf)
  File "C:\spark-1.6.1-bin-hadoop2.6\python\lib\py4j-0.9-src.zip\py4j\java_gateway.py", line 813, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "C:/spark-1.6.1-bin-hadoop2.6/python\pyspark\sql\utils.py", line 45, in deco
    return f(*a, **kw)
  File "C:\spark-1.6.1-bin-hadoop2.6\python\lib\py4j-0.9-src.zip\py4j\protocol.py", line 308, in get_return_value
    format(target_id, ".", name), value)
etc.......
UPDATE
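I tried scaling the input so that it is in the range 0 to 1 and using an RMSE evaluator. It seems to work well until I try to plug it into the CrossValidator.
The following code works. I get predictions and I get an RMSE value from my evaluator.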
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.types import FloatType
import pyspark.sql.functions as F
from pyspark.ml.recommendation import ALS
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.evaluation import RegressionEvaluator
conf = SparkConf() \
    .setAppName("ALSPractice") \
    .set("spark.executor.memory", "2g")
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)
# Users 0, 1, 2, 3 - Items 0, 1, 2, 3, 4, 5 - Ratings 0.0-5.0
dfCounts2 = sqlContext.createDataFrame([(0,0,5.0), (0,1,5.0), (0,3,0.0), (0,4,0.0),
                                        (1,0,5.0), (1,2,4.0), (1,3,0.0), (1,4,0.0),
                                        (2,0,0.0), (2,2,0.0), (2,3,5.0), (2,4,5.0),
                                        (3,0,0.0), (3,1,0.0), (3,3,4.0)],
                                       ["user", "item", "rating"])
dfCountsTest2 = sqlContext.createDataFrame([(0,0), (0,1), (0,2), (0,3), (0,4),
                                            (1,0), (1,1), (1,2), (1,3), (1,4),
                                            (2,0), (2,1), (2,2), (2,3), (2,4),
                                            (3,0), (3,1), (3,2), (3,3), (3,4)], ["user", "item"])
# Normalize rating data to [0,1] range based on max rating
colmax = dfCounts2.agg(F.max('rating')).first()[0]
# udf and col come from pyspark.sql.functions, imported above as F
normalize = F.udf(lambda x: x / colmax, FloatType())
dfCountsNorm = dfCounts2.withColumn('ratingNorm', normalize(F.col('rating')))
# Note: ALS reads ratingCol="rating" by default, so this fit still uses the
# unscaled column unless ratingCol="ratingNorm" is passed
alsImplicit = ALS(implicitPrefs=True)
defaultModelImplicit = alsImplicit.fit(dfCountsNorm)
preds = defaultModelImplicit.transform(dfCountsTest2)
evaluatorR2 = RegressionEvaluator(metricName="rmse", labelCol="ratingNorm")
evaluatorR2.evaluate(defaultModelImplicit.transform(dfCountsNorm))
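I don't understand why the following doesn't work. I'm using the same estimator, the same evaluator, and fitting the same data. Why would these work above but not within the CrossValidator: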
paramMapImplicit = ParamGridBuilder() \
    .addGrid(alsImplicit.rank, [8, 12]) \
    .addGrid(alsImplicit.maxIter, [10, 15]) \
    .addGrid(alsImplicit.regParam, [1.0, 10.0]) \
    .addGrid(alsImplicit.alpha, [2.0, 3.0]) \
    .build()
cv = CrossValidator(estimator=alsImplicit, estimatorParamMaps=paramMapImplicit, evaluator=evaluatorR2)
cvModel = cv.fit(dfCountsNorm)
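
For a sense of how much work this asks of CrossValidator: ParamGridBuilder.build() yields the Cartesian product of the listed values, and CrossValidator fits one model per combination per fold (numFolds defaults to 3). The plain-Python sketch below only counts the fits; the dictionary keys mirror the parameter names above, the variable names are illustrative, and nothing here calls Spark.

```python
from itertools import product

# Same values as the grid above, keyed by parameter name
grid_values = {
    "rank": [8, 12],
    "maxIter": [10, 15],
    "regParam": [1.0, 10.0],
    "alpha": [2.0, 3.0],
}

# One dict per combination, i.e. the Cartesian product of all value lists
names = sorted(grid_values)
param_maps = [dict(zip(names, combo))
              for combo in product(*(grid_values[n] for n in names))]

num_folds = 3  # CrossValidator's default numFolds
# Fits done while scoring the grid; the best model is then refit once on the full data
total_fits = len(param_maps) * num_folds

print("%d parameter combinations" % len(param_maps))
print("%d model fits during cross-validation" % total_fits)
```

With only 15 rows in dfCountsNorm, each of those fits sees roughly two thirds of the data, which is worth keeping in mind when comparing against the single fit that works.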