TF-IDF

2021-06-12 16:30:09


I. TF-IDF

1. What is TF-IDF?

TF-IDF is a weighting technique commonly used in information retrieval and data mining.

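  • TF(t, d): term frequency, the number of times term t appears in document d
  • DF(t, D): document frequency, the number of documents in corpus D that contain term t
  • |D|: the total number of documents in the corpus
  • IDF(t, D): inverse document frequency, defined as

    $$\mathrm{IDF}(t, D) = \log\frac{|D| + 1}{\mathrm{DF}(t, D) + 1}$$

    The larger the ratio of |D| to DF(t, D), the rarer the term is across the corpus and the better it represents the documents that contain it. Without smoothing, a term that appears in every document would get log(|D| / DF(t, D)) = log 1 = 0, and a term that appears in no document would put 0 in the denominator. Adding 1 to the denominator avoids the division by zero; adding 1 to the numerator as well keeps the IDF at 0 when DF(t, D) equals |D|.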
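The effect of the +1 smoothing is easy to check numerically. The following is a minimal sketch in plain Scala (no Spark dependency); the object name `IdfSketch` and the function `idf` are illustrative names, not part of any library, and the natural logarithm is assumed.

```scala
object IdfSketch {
  /** Smoothed IDF: ln((|D| + 1) / (DF(t, D) + 1)). */
  def idf(numDocs: Long, docFreq: Long): Double = {
    require(docFreq >= 0 && docFreq <= numDocs, "need 0 <= docFreq <= numDocs")
    math.log((numDocs + 1.0) / (docFreq + 1.0))
  }

  def main(args: Array[String]): Unit = {
    val numDocs = 3L
    // Print the IDF for every possible document frequency in a 3-document corpus.
    for (df <- 0L to numDocs)
      println(f"DF = $df  IDF = ${idf(numDocs, df)}%.4f")
  }
}
```

With |D| = 3, a term found in one document gets ln(4/2) ≈ 0.6931, a term found in two documents gets ln(4/3) ≈ 0.2877, and a term found in all three gets exactly 0. A term found in no document still yields a finite value, ln 4, because of the smoothed denominator.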

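A document may, however, contain many frequently repeated words that carry little meaning, such as "a", "an", and "the". Term frequency alone would rank these words highly, so TF-IDF is used to measure how important a term is to a document:

$$\mathrm{TFIDF}(t, d, D) = \mathrm{TF}(t, d) \cdot \mathrm{IDF}(t, D)$$

Here TF(t, d) is the number of times term t appears in document d. The formula shows that a term with a high frequency that also appears in many documents gets a small IDF, which pulls its weight down; only a term that is frequent in one document and rare across the corpus scores high. Combining the two factors therefore gives a good measure of a term's importance to a document.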
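To see the product of TF and IDF at work, here is a small sketch in plain Scala that computes TF-IDF by hand. The three-sentence corpus, the whitespace tokenizer, and all names (`TfIdfSketch`, `tokenize`, `tfIdf`, and so on) are assumptions made for illustration; the IDF uses the smoothed form ln((|D| + 1) / (DF + 1)) with the natural logarithm.

```scala
object TfIdfSketch {
  // Lowercase and split on whitespace (a deliberately simplistic tokenizer).
  def tokenize(text: String): Seq[String] =
    text.toLowerCase.split("\\s+").filter(_.nonEmpty).toSeq

  // TF(t, d): raw count of each term in one document.
  def termFrequencies(tokens: Seq[String]): Map[String, Int] =
    tokens.groupBy(identity).map { case (term, occ) => term -> occ.size }

  // DF(t, D): number of documents that contain each term.
  def documentFrequencies(docs: Seq[Seq[String]]): Map[String, Int] =
    docs.flatMap(_.distinct).groupBy(identity).map { case (term, occ) => term -> occ.size }

  // TF-IDF for every term of every document, using the smoothed IDF.
  def tfIdf(docs: Seq[Seq[String]]): Seq[Map[String, Double]] = {
    val numDocs = docs.size
    val df = documentFrequencies(docs)
    docs.map { tokens =>
      termFrequencies(tokens).map { case (term, tf) =>
        term -> tf * math.log((numDocs + 1.0) / (df(term) + 1.0))
      }
    }
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical toy corpus: "the" occurs in every document.
    val corpus = Seq("the cat sat on the mat", "the dog chased the cat", "the end")
    val weights = tfIdf(corpus.map(tokenize))
    weights.head.toSeq.sortBy(-_._2).foreach { case (term, w) =>
      println(f"$term%-4s $w%.4f")
    }
  }
}
```

In the first sentence, "the" has the highest term frequency (2) but appears in all three documents, so its IDF is 0 and its TF-IDF weight is 0. Words such as "mat", which occur in only one document, receive the largest weight.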

2. The official Spark example code

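import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}
import org.apache.spark.sql.SparkSession

def tfidf(): Unit = {
    // No master URL is set here; supply one via spark-submit, or add .master("local[*]") for a local run.
    val spark = SparkSession.builder().appName("TFIDF").getOrCreate()

    val sentenceData = spark.createDataFrame(Seq(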
      (0.0, "Hi I heard about Spark"),
      (0.0, "I wish Java could use case classes"),
      (1.0, "Logistic regression models are neat")
    )).toDF("label","sentence")


    /**
     * Split each sentence into words.
     */
    val tokenizer = new Tokenizer().setInputCol("sentence").setOutputCol("words")
    val wordsData = tokenizer.transform(sentenceData)
    /*
    +-----+-----------------------------------+------------------------------------------+
    |label|sentence                           |words                                     |
    +-----+-----------------------------------+------------------------------------------+
    |0.0  |Hi I heard about Spark             |[hi, i, heard, about, spark]              |
    |0.0  |I wish Java could use case classes |[i, wish, java, could, use, case, classes]|
    |1.0  |Logistic regression models are neat|[logistic, regression, models, are, neat] |
    +-----+-----------------------------------+------------------------------------------+
     */


    /**
     * Build the raw term-frequency feature vectors with hashingTF.transform().
     */
    val hashingTF = new HashingTF().setInputCol("words").setOutputCol("rawFeature")
    val featurizedData =  hashingTF.transform(wordsData)
    featurizedData.show(10,false)
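/*
+-----+-----------------------------------+------------------------------------------+------------------------------------------------------------------------------------+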
|label|sentence                           |words                                     |rawFeature                                                                          |
+-----+-----------------------------------+------------------------------------------+------------------------------------------------------------------------------------+
|0.0  |Hi I heard about Spark             |[hi, i, heard, about, spark]              |(262144,[18700,19036,33808,66273,173558],[1.0,1.0,1.0,1.0,1.0])                     |
|0.0  |I wish Java could use case classes |[i, wish, java, could, use, case, classes]|(262144,[19036,20719,55551,58672,98717,109547,192310],[1.0,1.0,1.0,1.0,1.0,1.0,1.0])|
|1.0  |Logistic regression models are neat|[logistic, regression, models, are, neat] |(262144,[46243,58267,91006,160975,190884],[1.0,1.0,1.0,1.0,1.0])                    |
+-----+-----------------------------------+------------------------------------------+------------------------------------------------------------------------------------+

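Each rawFeature value is a sparse vector (size, indices, values). The five words [hi, i, heard, about, spark] hash to the five indices [18700,19036,33808,66273,173558]; the indices are listed in ascending order, not in word order. The values [1.0,1.0,1.0,1.0,1.0] are the number of times each word occurs in the sentence.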
 */


    /**
     * Use IDF to rescale the feature vectors. IDF is an Estimator: calling fit() on the
     * feature vectors produces an IDFModel, whose transform() then rescales each vector.
     */
    val idf = new IDF().setInputCol("rawFeature").setOutputCol("feature")
    val idfModel = idf.fit(featurizedData)
    val rescaledData = idfModel.transform(featurizedData)
    rescaledData.show(10,false)

/*
+-----+-----------------------------------+------------------------------------------+------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|label|sentence                           |words                                     |rawFeature                                                                          |feature                                                                                                                                                                                       |
+-----+-----------------------------------+------------------------------------------+------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|0.0  |Hi I heard about Spark             |[hi, i, heard, about, spark]              |(262144,[18700,19036,33808,66273,173558],[1.0,1.0,1.0,1.0,1.0])                     |(262144,[18700,19036,33808,66273,173558],[0.6931471805599453,0.28768207245178085,0.6931471805599453,0.6931471805599453,0.6931471805599453])                                                   |
|0.0  |I wish Java could use case classes |[i, wish, java, could, use, case, classes]|(262144,[19036,20719,55551,58672,98717,109547,192310],[1.0,1.0,1.0,1.0,1.0,1.0,1.0])|(262144,[19036,20719,55551,58672,98717,109547,192310],[0.28768207245178085,0.6931471805599453,0.6931471805599453,0.6931471805599453,0.6931471805599453,0.6931471805599453,0.6931471805599453])|
|1.0  |Logistic regression models are neat|[logistic, regression, models, are, neat] |(262144,[46243,58267,91006,160975,190884],[1.0,1.0,1.0,1.0,1.0])                    |(262144,[46243,58267,91006,160975,190884],[0.6931471805599453,0.6931471805599453,0.6931471805599453,0.6931471805599453,0.6931471805599453])                                                   |
+-----+-----------------------------------+------------------------------------------+------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

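As the table shows, hi appears only in the first sentence (DF = 1), so its TF-IDF value is ln(4/2) ≈ 0.693, whereas i appears in two sentences (DF = 2) and gets ln(4/3) ≈ 0.288. hi is therefore more representative of the first sentence than i.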
 */
  }
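The 262144 in the output vectors is 2^18, the default number of features of HashingTF: each word is mapped to a bucket index by hashing it rather than by looking it up in a vocabulary. The sketch below illustrates that idea in plain Scala. It is only an approximation of the technique: it uses `scala.util.hashing.MurmurHash3.stringHash` from the standard library, whereas Spark uses its own hashing implementation, so the indices produced here will not match the ones in the tables above. The names `HashingTfSketch`, `bucket`, and `hashedCounts` are hypothetical.

```scala
import scala.util.hashing.MurmurHash3

object HashingTfSketch {
  // 2^18 = 262144 buckets, the same vector size as in the article's output.
  val NumFeatures: Int = 1 << 18

  // Map a term to a non-negative bucket index in [0, numFeatures).
  def bucket(term: String, numFeatures: Int = NumFeatures): Int =
    Math.floorMod(MurmurHash3.stringHash(term), numFeatures)

  // Sparse term-frequency vector: (index, count) pairs sorted by index.
  // Terms that collide in the same bucket have their counts added together.
  def hashedCounts(tokens: Seq[String], numFeatures: Int = NumFeatures): Seq[(Int, Double)] =
    tokens
      .groupBy(t => bucket(t, numFeatures))
      .map { case (index, ts) => index -> ts.size.toDouble }
      .toSeq
      .sortBy(_._1)

  def main(args: Array[String]): Unit = {
    val words = Seq("hi", "i", "heard", "about", "spark")
    println(hashedCounts(words))
  }
}
```

Because the result is sorted by index, the order of the (index, count) pairs is unrelated to the order of the words in the sentence, which is also why the indices in the rawFeature column appear in ascending order.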

Source: https://blog.csdn.net/qq_40365655/article/details/117826283
