Mapping and the Analysis Process in Elasticsearch

2020-09-27 08:01:25

Tags: analysis, name, doc, mapping, Elasticsearch, text, POST, type, tokenization

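Mapping

A mapping lets you define the structure of an index yourself, much like a table schema. Without an explicit mapping, Elasticsearch infers one automatically.

Each index has a single mapping type (before 6.x an index could have several). The examples below use the 6.x request syntax with a type named doc.

Reference blog: https://www.cnblogs.com/Neeo/articles/10585039.html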

Field data types:

1. Simple types:
*text	*keyword	date	long (integer)	double	boolean	ip

2. Types that reflect the hierarchical nature of JSON:
object		nested

3. Specialized types:
geo_point		geo_shape	  completion (corrections and suggestions)

A mapping example

PUT a2
{
  "mappings": {
    "doc":{
      "properties":{
        "name":{
          "type":"text"
        },
        "age":{
          "type":"long"
        }
      }
    }
  }
}

GET a2/_mapping

POST a2/doc/1		# POST can also be used to create a document
{
  "name":"黄飞鸿",
  "age":19
}

GET a2/doc/1
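
The request above nests the mapping under a type name (doc), which is the 6.x syntax. From Elasticsearch 7.0 onward the type level is dropped from the request body. As a minimal sketch, the helper below converts a typed body into the typeless form. The name to_typeless is hypothetical and not part of any client library, and it assumes exactly one mapping type under "mappings".

```python
import json


def to_typeless(body):
    """Return a copy of a 6.x-style create-index body without the type level.

    Hypothetical helper for illustration only: it assumes the body has
    exactly one mapping type (such as "doc") under "mappings".
    """
    mappings = body.get("mappings", {})
    if "properties" in mappings or len(mappings) != 1:
        # already typeless, or not the single-type shape we expect
        return dict(body)
    type_mapping = next(iter(mappings.values()))
    result = dict(body)
    result["mappings"] = type_mapping
    return result


typed_body = {
    "mappings": {
        "doc": {
            "properties": {
                "name": {"type": "text"},
                "age": {"type": "long"},
            }
        }
    }
}

typeless_body = to_typeless(typed_body)
print(json.dumps(typeless_body, indent=2))
```

The printed JSON has "properties" directly under "mappings", which is the shape a 7.x or later cluster expects for PUT a2.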

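The three dynamic settings

dynamic true: dynamic mapping

# a2 already exists from the previous example, so delete it before recreating it
DELETE a2

PUT a2
{
  "mappings": {
    "doc":{
      "dynamic":true,		# this is the key setting (true is also the default)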
      "properties":{
        "name":{
          "type":"text"
        },
        "age":{
          "type":"long"
        }
      }
    }
  }
}

POST a2/doc/1		# index a document with exactly the mapped fields
{
  "name":"黄飞鸿",
  "age":19
}

POST a2/doc/2		# adds a city field that the mapping does not define
{
  "name":"李晓龙",
  "age":19,
  "city":"广州"
}

POST a2/doc/3		# omits the mapped name field
{
  "age":19,
  "city":"广州"
}

GET a2/doc/_search		# searching on the new field works
{
  "query": {
    "match": {
      "city": "广州"
    }
  }
}

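With dynamic set to true, a document may add new fields or omit mapped ones. Elasticsearch adds each new field to the mapping automatically, so it can be used as a query condition.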

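dynamic false: static mapping (commonly used)

"dynamic":false,
A document may still add new fields or omit mapped ones, but new fields are not added to the mapping and are not indexed. They are kept in _source and appear in search results, but they cannot be used as query conditions.

dynamic strict: strict mode

"dynamic":"strict",
New fields are not allowed: indexing a document that contains an unmapped field fails with an error. Mapped fields may still be omitted.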
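
To compare the three settings side by side, here is a toy model in Python. It is only a sketch of the behavior described above, not how Elasticsearch implements it. The function name apply_dynamic and its return shape are assumptions made for this illustration.

```python
def apply_dynamic(mapped_fields, doc, dynamic):
    """Toy model of the `dynamic` setting (not Elasticsearch's real code).

    mapped_fields: set of field names already in the mapping
    doc:           the document being indexed
    dynamic:       True, False or "strict"
    Returns (fields in the mapping afterwards,
             fields of this document that can be queried).
    """
    unknown = set(doc) - set(mapped_fields)
    if dynamic == "strict" and unknown:
        raise ValueError(f"strict mapping rejects new fields: {sorted(unknown)}")
    new_mapping = set(mapped_fields)
    if dynamic is True:
        # dynamic mapping: unknown fields are added to the mapping
        new_mapping |= unknown
    searchable = set(doc) & new_mapping
    return new_mapping, searchable


mapped = {"name", "age"}
doc = {"age": 19, "city": "广州"}  # adds city, omits name

print(apply_dynamic(mapped, doc, True))   # city joins the mapping
print(apply_dynamic(mapped, doc, False))  # city is accepted but not queryable
try:
    apply_dynamic(mapped, doc, "strict")
except ValueError as err:
    print("rejected:", err)
```

In all three cases the missing name field causes no problem. Only the handling of the extra city field differs.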

Other mapping settings

The index property

# the index property
PUT a5
{
  "mappings": {
    "doc":{
      "dynamic":"strict",
      "properties":{
        "name":{
          "type":"text"
        },
        "age":{
          "type":"long",
          "index":true
        },
        "city":{
          "type":"text",
          "index":false
        }
      }
    }
  }
}


POST a5/doc/1
{
  "name":"李尔新",
  "age":19,
  "city":"长春"
}

POST a5/doc/2
{
  "name":"周子谦",
  "age":19,
  "city":"长春"
}

GET a5/doc/_search
{
  "query": {
    "match": {
      "city": "长春"
    }
  }
}

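# If a field's index property is false, Elasticsearch does not index that field, so it cannot be used as a query condition and the search on city above does not return the documents.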

The copy_to property

PUT a6
{
  "mappings": {
    "doc":{
      "properties":{
        "name":{
          "type":"text",
          "copy_to":"full_name"		# copy this field's value into the full_name field
        },
        "age":{
          "type":"long",
          "copy_to":"full_name"
        },
        "full_name":{		# the target field
          "type":"text"
        }
      }
    }
  }
}

POST a6/doc/1
{
  "name":"周子谦",
  "age":19
}

GET a6/doc/_search
{
  "query": {
    "match": {
      "name": "周子谦"
    }
  }
}

GET a6/doc/_search
{
  "query": {
    "match": {
      "full_name": 19		# full_name matches on the name as well as the age
    }
  }
}

PUT a7
{
  "mappings": {
    "doc":{
      "properties":{
        "name":{
          "type":"text",
          "copy_to":["f1","f2"]		# a value can be copied to several fields
        },
        "age":{
          "type":"long"
        },
        "f1":{
          "type":"text"
        },
        "f2":{
          "type":"text"
        }
      }
    }
  }
}

POST a7/doc/1
{
  "name":"周子谦",
  "age":19
}

GET a7/doc/_search
{
  "query": {
    "match": {
      "f1": "周子谦"		# f1 and f2 can both stand in for name as the query field
    }
  }
}
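
The effect of copy_to can be pictured with a small Python model. This is a simplified sketch: the function index_with_copy_to is hypothetical, and it only tracks which values end up indexed under which field. The point it shows is that the target field collects values at index time while the source document itself stays unchanged.

```python
def index_with_copy_to(mapping, doc):
    """Toy model of copy_to: return the values indexed under each field.

    The original document (what Elasticsearch keeps in _source) is not
    modified; only the indexed values of the target fields grow.
    """
    indexed = {}
    for field, value in doc.items():
        indexed.setdefault(field, []).append(str(value))
        targets = mapping.get(field, {}).get("copy_to", [])
        if isinstance(targets, str):
            # copy_to accepts a single field name or a list of names
            targets = [targets]
        for target in targets:
            indexed.setdefault(target, []).append(str(value))
    return indexed


mapping = {
    "name": {"type": "text", "copy_to": "full_name"},
    "age": {"type": "long", "copy_to": "full_name"},
    "full_name": {"type": "text"},
}
doc = {"name": "周子谦", "age": 19}

indexed = index_with_copy_to(mapping, doc)
print(indexed["full_name"])  # values collected from name and age
print("full_name" in doc)    # False: the source document is untouched
```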

Object fields (properties)

# object fields
PUT a8
{
  "mappings": {
    "doc":{
      "properties":{
        "name":{
          "type":"text"
        },
        "age":{
          "type":"long"
        },
        "info":{
          "properties":{
            "addr":{
              "type":"text"
            },
            "tel":{
              "type":"text"
            }
          }
        }
      }
    }
  }
}

PUT a8/doc/1
{
  "name":"王涛",
  "age":33,
  "info":{
    "addr":"长春",
    "tel":"10018"
  }
}

GET a8/doc/_search
{
  "query": {
    "match": {
      "info.addr": "长春"
    }
  }
}
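
The query uses the dotted path info.addr because object fields are indexed as a flat list of key-value pairs. A short Python sketch of that flattening follows. The helper name flatten is an assumption for this example, and arrays and nested-type fields are deliberately left out.

```python
def flatten(doc, prefix=""):
    """Flatten nested dicts into dotted field paths such as "info.addr"."""
    flat = {}
    for key, value in doc.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            # descend into the inner object, extending the path
            flat.update(flatten(value, prefix=path + "."))
        else:
            flat[path] = value
    return flat


doc = {"name": "王涛", "age": 33, "info": {"addr": "长春", "tel": "10018"}}
print(flatten(doc))
# {'name': '王涛', 'age': 33, 'info.addr': '长春', 'info.tel': '10018'}
```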

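# Tip: the usual order is to PUT the mapping, POST documents, and then GET to search. You can also POST a document first, run GET a8/_mapping to see the mapping Elasticsearch generated, and then copy and adapt it.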

The ignore_above property

PUT w1
{
  "mappings": {
    "doc":{
      "properties":{
        "t1":{
          "type":"keyword",
          "ignore_above": 5		# set ignore_above
        },
        "t2":{
          "type":"keyword",
          "ignore_above": 10	# set ignore_above
        }
      }
    }
  }
}
PUT w1/doc/1
{
  "t1":"elk",
  "t2":"elasticsearch"
}
GET w1/doc/_search
{
  "query":{
    "term": {
      "t1": "elk"	# searching t1 returns the document
    }
  }
}

GET w1/doc/_search
{
  "query": {
    "term": {
      "t2": "elasticsearch"   # searching t2 returns nothing: the value exceeds the configured maximum length
    }
  }
}

# ignore_above sets a maximum length: a string longer than the limit is not indexed at all (it is not truncated).
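
The length check can be reproduced in a few lines of Python. This is a sketch of the rule only, using a hypothetical indexed_values helper: "elk" has 3 characters (within the limit of 5), while "elasticsearch" has 13 (over the limit of 10).

```python
def indexed_values(mapping, doc):
    """Toy model: keep only values whose length is within ignore_above."""
    kept = {}
    for field, value in doc.items():
        limit = mapping[field].get("ignore_above")
        if limit is None or len(value) <= limit:
            kept[field] = value
        # otherwise the whole value is skipped, not truncated
    return kept


mapping = {
    "t1": {"type": "keyword", "ignore_above": 5},
    "t2": {"type": "keyword", "ignore_above": 10},
}
doc = {"t1": "elk", "t2": "elasticsearch"}
print(indexed_values(mapping, doc))  # {'t1': 'elk'}
```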

Index settings

PUT a9
{
  "settings": {
    "number_of_shards": 1,		# number of primary shards for the index
    "number_of_replicas": 0		# number of replica copies of each primary shard
  }
}
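
Because number_of_replicas is counted per primary shard, the total number of shard copies for an index is the number of primaries multiplied by one plus the number of replicas. A quick arithmetic sketch, where the second call uses made-up values for comparison:

```python
def total_shard_copies(number_of_shards, number_of_replicas):
    """Primary shards plus all of their replica copies for one index."""
    return number_of_shards * (1 + number_of_replicas)


print(total_shard_copies(1, 0))  # 1: the a9 settings above
print(total_shard_copies(5, 1))  # 10: hypothetical index with 5 primaries, 1 replica each
```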

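The analysis process

After a document is sent to Elasticsearch and before it is added to the inverted index, Elasticsearch runs its text through a series of steps:

Character filtering: character filters transform characters (special characters, for example & --> and)
Tokenization: the text is split into one or more tokens
Token filtering: token filters transform each token
Token indexing: the resulting tokens are stored in the Lucene inverted index
Reference blog: https://www.cnblogs.com/Neeo/articles/10401392.html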
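
The four steps above can be sketched as a tiny pipeline in Python. This is a simplified illustration, not what Lucene actually does: the function names, the regex-based tokenizer, and the two sample documents are all assumptions chosen for the example.

```python
import re
from collections import defaultdict


def char_filter(text):
    # character filter: rewrite characters before tokenizing, e.g. & -> and
    return text.replace("&", " and ")


def tokenize(text):
    # tokenizer: split the text into tokens on runs of non-word characters
    return [tok for tok in re.split(r"\W+", text) if tok]


def token_filter(tokens):
    # token filter: transform each token, here by lowercasing it
    return [tok.lower() for tok in tokens]


def analyze(text):
    return token_filter(tokenize(char_filter(text)))


def build_inverted_index(docs):
    # inverted index: token -> sorted ids of the documents containing it
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in analyze(text):
            index[token].add(doc_id)
    return {token: sorted(ids) for token, ids in index.items()}


docs = {1: "Tom & Jerry", 2: "Jerry likes cheese"}
index = build_inverted_index(docs)
print(analyze("Tom & Jerry"))  # ['tom', 'and', 'jerry']
print(index["jerry"])          # [1, 2]
```

A search for "jerry" then becomes a lookup in the index rather than a scan over every document.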

Analyzers

Standard analyzer

POST _analyze
{
  "analyzer": "standard",
  "text":"To be or not to be,  That is a question ———— 莎士比亚"
}

Simple analyzer (works poorly for Asian languages)

POST _analyze
{
  "analyzer": "simple",
  "text":"To be or not to be,  That is a question ———— 莎士比亚"
}

Whitespace analyzer

# splits on whitespace only
POST _analyze
{
  "analyzer": "whitespace",
  "text":"To be or not to be,  That is a question ———— 莎士比亚"
}
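
For a rough idea of the tokens this request returns, Python's str.split() with no argument also splits on runs of whitespace. This is only an approximation of the whitespace analyzer, but it shows the key point: nothing is lowercased and punctuation stays attached to the tokens.

```python
text = "To be or not to be,  That is a question ———— 莎士比亚"

# split on runs of whitespace; case and punctuation are left untouched
tokens = text.split()
print(tokens)
```

Note that "be," keeps its comma and the run of dashes survives as a token of its own.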

Stop analyzer

POST _analyze
{
  "analyzer": "stop",
  "text":"To be or not to be,  That is a question ———— 莎士比亚"
}

Stop words:
1. Function words: is, the, on...
2. Lexical words: want...
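
A minimal Python sketch of what the stop analyzer does with the sample text: lowercase it, split on non-letter characters, then drop stop words. The stop list below is assumed to match the default English list; treat it as illustrative rather than authoritative. The function name stop_analyze is made up for this example.

```python
import re

# assumed default English stop words (illustrative, may differ by version)
ENGLISH_STOP_WORDS = {
    "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if",
    "in", "into", "is", "it", "no", "not", "of", "on", "or", "such",
    "that", "the", "their", "then", "there", "these", "they", "this",
    "to", "was", "will", "with",
}


def stop_analyze(text):
    # lowercase, then keep runs of letters only (no digits, no underscore)
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    # drop every token that appears in the stop list
    return [tok for tok in tokens if tok not in ENGLISH_STOP_WORDS]


text = "To be or not to be,  That is a question ———— 莎士比亚"
print(stop_analyze(text))  # ['question', '莎士比亚']
```

Almost the whole English sentence consists of stop words, which is why so little of it survives.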

Keyword analyzer

# treats the entire field value as a single token; rarely used directly
POST _analyze
{
  "analyzer": "keyword",
  "text":"To be or not to be,  That is a question ———— 莎士比亚"
}

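Pattern analyzer

# Lets you specify a pattern for splitting text into tokens. A custom analyzer that combines the pattern tokenizer with the token filters you need is usually the better choice.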
POST _analyze
{
  "analyzer": "pattern",
  "explain": false, 
  "text":"To be or not to be,  That is a question ———— 莎士比亚"
}

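# Let's define a custom pattern analyzer, here one that splits email addresses on any non-word character or underscore.
# Note that backslashes in the regular expression must be escaped inside a JSON string!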
PUT pattern_test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_email_analyzer":{
          "type":"pattern",
          "pattern":"\\W|_",
          "lowercase":true
        }
      }
    }
  }
}
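
The same pattern can be tried out in Python before putting it into the index settings. This sketch mimics the analyzer with re.split. The sample address is made up, and the function pattern_analyze is a stand-in for illustration, not an Elasticsearch API. In a Python raw string the pattern is written \W|_, whereas inside JSON the backslash has to be doubled.

```python
import re


def pattern_analyze(text, pattern=r"\W|_", lowercase=True):
    """Split text wherever the pattern matches and drop empty tokens."""
    if lowercase:
        text = text.lower()
    return [tok for tok in re.split(pattern, text) if tok]


print(pattern_analyze("John_Smith@foo-bar.com"))
# ['john', 'smith', 'foo', 'bar', 'com']
```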

Language and multi-language analyzers

# also rarely used
POST _analyze
{
  "analyzer": "chinese",
  "text":"To be or not to be,  That is a question ———— 莎士比亚"
}

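Snowball analyzer

# Uses the standard tokenizer and token filter (like the standard analyzer) together with the lowercase and stop token filters, and additionally stems the tokens with the Snowball stemmer.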
POST _analyze
{
  "analyzer": "snowball",
  "text":"To be or not to be,  That is a question ———— 莎士比亚"
}

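Character filters (char_filter)

HTML strip character filter
Mapping character filter (can be used to filter sensitive words)
Pattern replace character filter
Reference blog: https://www.cnblogs.com/Neeo/articles/10613612.html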

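Tokenizers (tokenizer)

Standard tokenizer
Keyword tokenizer
Letter tokenizer (splits on non-letter characters)
Lowercase tokenizer
Whitespace tokenizer
Pattern tokenizer
UAX URL email tokenizer *
Path hierarchy tokenizer *
Reference blog: https://www.cnblogs.com/Neeo/articles/10402742.html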

Token filters (token filter)

Common token filters
Custom token filters
Custom lowercase token filter
Reference blog: https://www.cnblogs.com/Neeo/articles/10403757.html

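The ik analyzer

An open-source, lightweight toolkit for Chinese word segmentation
Reference blog: https://www.cnblogs.com/Neeo/articles/10614012.html
Make sure the ik plugin version matches your Elasticsearch version
Unzip the package, place the files in the plugins directory of the Elasticsearch installation, and restart Elasticsearch

GET _analyze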
{
  "analyzer": "ik_smart",
  "text": "上海自来水来自海上"
}

# finer-grained segmentation
GET _analyze
{
  "analyzer": "ik_max_word",
  "text": "上海自来水来自海上"
}

PUT ik1
{
  "mappings": {
    "doc": {
      "dynamic": false,
      "properties": {
        "content": {
          "type": "text",
          "analyzer": "ik_max_word"   # use the ik analyzer
        }
      }
    }
  }
}
# add documents
PUT ik1/doc/1
{
  "content":"今天是个好日子"
}
PUT ik1/doc/2
{
  "content":"心想的事儿都能成"
}
PUT ik1/doc/3
{
  "content":"我今天不活了"
}
# searching on a Chinese word works
GET ik1/doc/_search
{
  "query": {
    "match": {
      "content": "今天"
    }
  }
}

Source: https://www.cnblogs.com/straightup/p/13737584.html
