
Full-text search based on Elasticsearch

2022-01-27 01:30:41 IT1124

Catalog

  • Abstract
  • 1 Technology selection
    • 1.1 Elasticsearch
    • 1.2 Spring Boot
    • 1.3 ik word segmenter
  • 2 Environment preparation
  • 3 Project framework
  • 4 Implementation results
    • 4.1 Search page
    • 4.2 Search results page
  • 5 Code implementation
    • 5.1 The entity for full-text retrieval
    • 5.2 Client configuration
    • 5.3 Business code
    • 5.4 External interface
    • 5.5 Pages
  • 6 Summary

Abstract

As a company accumulates more and more data, finding information quickly becomes a hard problem. Computer science has a dedicated field, IR (Information Retrieval), that studies how to obtain information. Search engines such as Baidu belong to this field. Implementing a search engine from scratch is very difficult, yet information search matters to every company, so developers can pick an open-source project to build their own on-site search engine. This article uses Elasticsearch to build such an information retrieval project.

1 Technology selection

  • The search engine service uses Elasticsearch
  • The externally exposed web service uses Spring Boot Web

1.1 Elasticsearch

Elasticsearch is a search server based on Lucene. It provides a distributed, multi-tenant-capable full-text search engine with a RESTful web interface. Elasticsearch is developed in Java and released as open source under the terms of the Apache License, and it is a popular enterprise search engine. It is designed for cloud computing and offers real-time search that is stable, reliable, fast, and easy to install and use.

Official clients are available in Java, .NET (C#), PHP, Python, Apache Groovy, Ruby, and many other languages. According to the DB-Engines ranking, Elasticsearch is the most popular enterprise search engine, followed by Apache Solr, which is also based on Lucene.1

The most common open-source search engines today are Elasticsearch and Solr, both built on Lucene. Elasticsearch is the more heavyweight of the two and also performs better in a distributed environment. Choosing between them depends on the specific business scenario and data volume. When the amount of data is small, there is no need for a Lucene-style search engine service at all; searching through a relational database is enough.

1.2 Spring Boot

Spring Boot makes it easy to create stand-alone, production-grade Spring based Applications that you can “just run”.2

Spring Boot is now the mainstream choice for web development. Beyond its development advantages, it performs well in deployment and in operations and maintenance, and the Spring ecosystem is so influential that mature solutions can be found for almost anything.

1.3 ik word segmenter

Out of the box, Elasticsearch does not segment Chinese text into words, so a Chinese word segmentation plugin has to be installed. Chinese word segmentation is the foundation of Chinese information retrieval. Here we choose ik: after downloading it, place it in the plugins directory of the Elasticsearch installation.
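To make the role of a dictionary-based segmenter concrete, here is a minimal sketch of forward maximum matching in plain Java. It is not ik's actual algorithm; the class name `ToySegmenter` and its tiny dictionary are hypothetical and exist only for illustration. The ik plugin itself ships two analyzers, `ik_max_word` (finest-grained splitting) and `ik_smart` (coarsest-grained splitting), which the index mapping in section 5.1 refers to.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy dictionary-based segmenter (hypothetical, for illustration only). */
public class ToySegmenter {

    private final Set<String> dictionary;
    private final int maxWordLength;

    public ToySegmenter(String... words) {
        this.dictionary = new HashSet<>(Arrays.asList(words));
        int max = 1;
        for (String w : words) {
            max = Math.max(max, w.length());
        }
        this.maxWordLength = max;
    }

    /**
     * Forward maximum matching: at each position take the longest
     * dictionary word; if none matches, emit a single character.
     */
    public List<String> segment(String text) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            int end = Math.min(text.length(), i + maxWordLength);
            // Shrink the candidate until it is a dictionary word or one character long
            while (end > i + 1 && !dictionary.contains(text.substring(i, end))) {
                end--;
            }
            tokens.add(text.substring(i, end));
            i = end;
        }
        return tokens;
    }
}
```

With a dictionary containing 全文检索 and 引擎, the text 全文检索引擎 is split into those two words. With an empty dictionary every character becomes its own token, which is roughly what happens to Chinese text under Elasticsearch's standard analyzer and is why a plugin such as ik is needed.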

2 Environment preparation

Elasticsearch and Kibana (optional) need to be installed, along with the ik word segmentation plugin.

  • Install Elasticsearch from the Elasticsearch official website. I used 7.5.1.
  • Download the ik plugin from its GitHub repository. Make sure the ik plugin version matches the Elasticsearch version you downloaded.
  • Under the plugins folder of the Elasticsearch installation directory, create a new folder named ik and unzip the downloaded plugin into it. The plugin is loaded automatically when es starts.
  • Create the Spring Boot project in IDEA: New Project -> Spring Initializr

3 Project framework

  • Fetch the data and segment it with the ik word segmentation plugin
  • Store the data in the es engine
  • Retrieve the stored data through es queries
  • Expose the service externally through the es Java client
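The steps above (analyze, store, retrieve) can be pictured with a toy in-memory inverted index. This is only a sketch of the idea and says nothing about how Elasticsearch or Lucene implement it internally; the class `TinyInvertedIndex` and its whitespace-style analyzer are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

/** Toy inverted index: analyze text into terms, store postings, look terms up. */
public class TinyInvertedIndex {

    // term -> ids of the documents that contain it
    private final Map<String, Set<Integer>> postings = new TreeMap<>();
    private final List<String> documents = new ArrayList<>();

    /** Stand-in for an analyzer: lower-case, then split on anything that is not a letter or digit. */
    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                terms.add(t);
            }
        }
        return terms;
    }

    /** Stores the document and returns its id. */
    public int add(String text) {
        int id = documents.size();
        documents.add(text);
        for (String term : analyze(text)) {
            postings.computeIfAbsent(term, k -> new LinkedHashSet<>()).add(id);
        }
        return id;
    }

    /** Returns the ids of documents that contain at least one term of the query. */
    public Set<Integer> search(String query) {
        Set<Integer> hits = new LinkedHashSet<>();
        for (String term : analyze(query)) {
            hits.addAll(postings.getOrDefault(term, Collections.emptySet()));
        }
        return hits;
    }
}
```

In the real project the analyzer role is played by ik, storage and lookup by es, and the `search` method by the service exposed through the es Java client.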

4 Implementation results

4.1 Search page

A simple search box similar to Baidu's.

4.2 Search results page

The first search result is one of my personal blog posts. To avoid data copyright issues, the author filled the es engine with personal blog data only.

5 Code implementation

5.1 The entity for full-text retrieval

The following entity class is defined from the basic information of a blog post. What we mainly need is each post's url, so that clicking a retrieved article jumps to that url.

package com.lbh.es.entity;

import com.fasterxml.jackson.annotation.JsonIgnore;

import javax.persistence.*;

/**
 * PUT articles
 * {
 * "mappings":
 * {"properties":{
 * "author":{"type":"text"},
 * "content":{"type":"text","analyzer":"ik_max_word","search_analyzer":"ik_smart"},
 * "title":{"type":"text","analyzer":"ik_max_word","search_analyzer":"ik_smart"},
 * "createDate":{"type":"date","format":"yyyy-MM-dd HH:mm:ss||yyyy-MM-dd"},
 * "url":{"type":"text"}
 * } },
 * "settings":{
 *     "index":{
 *       "number_of_shards":1,
 *       "number_of_replicas":2
 *     }
 *   }
 * }
 * ---------------------------------------------------------------------------------------------------------------------
 * Copyright(c)[email protected]
 * @author liubinhao
 * @date 2021/3/3
 */
@Entity
@Table(name = "es_article")
public class ArticleEntity {
    @Id
    @JsonIgnore
    @GeneratedValue(strategy = GenerationType.IDENTITY)
    private long id;
    @Column(name = "author")
    private String author;
    @Column(name = "content",columnDefinition="TEXT")
    private String content;
    @Column(name = "title")
    private String title;
    @Column(name = "createDate")
    private String createDate;
    @Column(name = "url")
    private String url;

    public String getAuthor() {
        return author;
    }

    public void setAuthor(String author) {
        this.author = author;
    }

    public String getContent() {
        return content;
    }

    public void setContent(String content) {
        this.content = content;
    }

    public String getTitle() {
        return title;
    }

    public void setTitle(String title) {
        this.title = title;
    }

    public String getCreateDate() {
        return createDate;
    }

    public void setCreateDate(String createDate) {
        this.createDate = createDate;
    }

    public String getUrl() {
        return url;
    }

    public void setUrl(String url) {
        this.url = url;
    }
}
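The mapping in the class comment declares createDate as a date with the format yyyy-MM-dd HH:mm:ss||yyyy-MM-dd, while the entity keeps the value as a plain String. A small client-side check like the following sketch can catch values that do not fit either pattern before they are indexed. The class name `CreateDateCheck` is hypothetical, and the check only approximates Elasticsearch's own date parsing.

```java
import java.time.LocalDate;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;

/** Checks a createDate string against the two patterns named in the index mapping. */
public class CreateDateCheck {

    private static final DateTimeFormatter DATE_TIME =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");
    private static final DateTimeFormatter DATE_ONLY =
            DateTimeFormatter.ofPattern("yyyy-MM-dd");

    public static boolean matchesMappingFormat(String value) {
        if (value == null) {
            return false;
        }
        try {
            // First alternative: date and time
            LocalDateTime.parse(value, DATE_TIME);
            return true;
        } catch (DateTimeParseException ignored) {
            // fall through to the date-only pattern
        }
        try {
            // Second alternative: date only
            LocalDate.parse(value, DATE_ONLY);
            return true;
        } catch (DateTimeParseException e) {
            return false;
        }
    }
}
```

For example, a value written as 2021/3/3 would be rejected by this check, whereas 2021-03-03 and 2021-03-03 10:15:30 would pass.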

5.2 Client configuration

Configure the es client in Java.

package com.lbh.es.config;

import org.apache.http.HttpHost;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestClientBuilder;
import org.elasticsearch.client.RestHighLevelClient;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

import java.util.ArrayList;
import java.util.List;

/**
 * Copyright(c)[email protected]
 * @author liubinhao
 * @date 2021/3/3
 */
@Configuration
public class EsConfig {

    @Value("${elasticsearch.schema}")
    private String schema;
    @Value("${elasticsearch.address}")
    private String address;
    @Value("${elasticsearch.connectTimeout}")
    private int connectTimeout;
    @Value("${elasticsearch.socketTimeout}")
    private int socketTimeout;
    @Value("${elasticsearch.connectionRequestTimeout}")
    private int tryConnTimeout;
    @Value("${elasticsearch.maxConnectNum}")
    private int maxConnNum;
    @Value("${elasticsearch.maxConnectPerRoute}")
    private int maxConnectPerRoute;

    @Bean
    public RestHighLevelClient restHighLevelClient() {
        //  Split address 
        List<HttpHost> hostLists = new ArrayList<>();
        String[] hostList = address.split(",");
        for (String addr : hostList) {
            String host = addr.split(":")[0];
            String port = addr.split(":")[1];
            hostLists.add(new HttpHost(host, Integer.parseInt(port), schema));
        }
        //  convert to  HttpHost  Array 
        HttpHost[] httpHost = hostLists.toArray(new HttpHost[]{});
        //  Building connection objects 
        RestClientBuilder builder = RestClient.builder(httpHost);
        // Timeout configuration for asynchronous connections
        builder.setRequestConfigCallback(requestConfigBuilder -> {
            requestConfigBuilder.setConnectTimeout(connectTimeout);
            requestConfigBuilder.setSocketTimeout(socketTimeout);
            requestConfigBuilder.setConnectionRequestTimeout(tryConnTimeout);
            return requestConfigBuilder;
        });
        // Connection count limits for asynchronous connections
        builder.setHttpClientConfigCallback(httpClientBuilder -> {
            httpClientBuilder.setMaxConnTotal(maxConnNum);
            httpClientBuilder.setMaxConnPerRoute(maxConnectPerRoute);
            return httpClientBuilder;
        });
        return new RestHighLevelClient(builder);
    }

}
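EsConfig reads its settings through @Value placeholders, so matching entries have to exist in the application configuration. The article does not show that file; the fragment below is a sketch of what an application.properties could look like. Only the key names come from the class above, and every value is an assumed example for a single local node.

```properties
# Assumed example values; only the key names are taken from EsConfig
elasticsearch.schema=http
# Comma-separated host:port pairs, as expected by address.split(",")
elasticsearch.address=127.0.0.1:9200
# Timeouts in milliseconds
elasticsearch.connectTimeout=5000
elasticsearch.socketTimeout=5000
elasticsearch.connectionRequestTimeout=5000
# Connection limits
elasticsearch.maxConnectNum=100
elasticsearch.maxConnectPerRoute=100
```

Each address entry needs an explicit port, because the configuration class splits every entry on ":" and parses the second part as an integer.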

5.3 Business code

This covers searching for articles: relevant results are looked up across the article title, the article content, and the author information.

package com.lbh.es.service;

import com.google.gson.Gson;
import com.lbh.es.entity.ArticleEntity;
import com.lbh.es.repository.ArticleRepository;
import org.elasticsearch.action.admin.indices.delete.DeleteIndexRequest;
import org.elasticsearch.action.get.GetRequest;
import org.elasticsearch.action.get.GetResponse;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.action.index.IndexResponse;
import org.elasticsearch.action.search.SearchRequest;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.action.support.master.AcknowledgedResponse;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.client.indices.CreateIndexRequest;
import org.elasticsearch.client.indices.CreateIndexResponse;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.common.xcontent.XContentType;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.SearchHit;
import org.elasticsearch.search.builder.SearchSourceBuilder;
import org.springframework.stereotype.Service;

import javax.annotation.Resource;
import java.io.IOException;

import java.util.*;

/**
 * Copyright(c)[email protected]
 * @author liubinhao
 * @date 2021/3/3
 */
@Service
public class ArticleService {

    private static final String ARTICLE_INDEX = "article";

    @Resource
    private RestHighLevelClient client;
    @Resource
    private ArticleRepository articleRepository;

    public boolean createIndexOfArticle(){
        Settings settings = Settings.builder()
                .put("index.number_of_shards", 1)
                .put("index.number_of_replicas", 1)
                .build();
        // {"properties":{"author":{"type":"text"},
        // "content":{"type":"text","analyzer":"ik_max_word","search_analyzer":"ik_smart"}
        // ,"title":{"type":"text","analyzer":"ik_max_word","search_analyzer":"ik_smart"}
        // ,"createDate":{"type":"date","format":"yyyy-MM-dd HH:mm:ss||yyyy-MM-dd"}
        // ,"url":{"type":"text"}
        // }}
        String mapping = "{\"properties\":{\"author\":{\"type\":\"text\"},\n" +
                "\"content\":{\"type\":\"text\",\"analyzer\":\"ik_max_word\",\"search_analyzer\":\"ik_smart\"}\n" +
                ",\"title\":{\"type\":\"text\",\"analyzer\":\"ik_max_word\",\"search_analyzer\":\"ik_smart\"}\n" +
                ",\"createDate\":{\"type\":\"date\",\"format\":\"yyyy-MM-dd HH:mm:ss||yyyy-MM-dd\"}\n" +
                ",\"url\":{\"type\":\"text\"}\n" +
                "}}";
        CreateIndexRequest indexRequest = new CreateIndexRequest(ARTICLE_INDEX)
                .settings(settings).mapping(mapping,XContentType.JSON);
        CreateIndexResponse response = null;
        try {
            response = client.indices().create(indexRequest, RequestOptions.DEFAULT);
        } catch (IOException e) {
            e.printStackTrace();
        }
        if (response!=null) {
            System.err.println(response.isAcknowledged() ? "success" : "default");
            return response.isAcknowledged();
        } else {
            return false;
        }
    }

    public boolean deleteArticle(){
        DeleteIndexRequest request = new DeleteIndexRequest(ARTICLE_INDEX);
        try {
            AcknowledgedResponse response = client.indices().delete(request, RequestOptions.DEFAULT);
            return response.isAcknowledged();
        } catch (IOException e) {
            e.printStackTrace();
        }
        return false;
    }

    public IndexResponse addArticle(ArticleEntity article){
        Gson gson = new Gson();
        String s = gson.toJson(article);
        // Create the index request object
        IndexRequest indexRequest = new IndexRequest(ARTICLE_INDEX);
        // Document content
        indexRequest.source(s,XContentType.JSON);
        // Send the HTTP request through the client
        IndexResponse re = null;
        try {
            re = client.index(indexRequest, RequestOptions.DEFAULT);
        } catch (IOException e) {
            e.printStackTrace();
        }
        return re;
    }

    public void transferFromMysql(){
        articleRepository.findAll().forEach(this::addArticle);
    }

    public List<ArticleEntity> queryByKey(String keyword){
        // Restrict the search to the article index instead of searching every index
        SearchRequest request = new SearchRequest(ARTICLE_INDEX);
        /*
         * Create the search parameter object: SearchSourceBuilder.
         * Compared with matchQuery, multiMatchQuery targets multiple fields. When fieldNames
         * has a single entry, it behaves the same as matchQuery. When fieldNames has several
         * entries, such as field1 and field2, a document matches if either field1 or field2
         * contains the text.
         */
        SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder();

        searchSourceBuilder.query(QueryBuilders
                .multiMatchQuery(keyword, "author","content","title"));
        request.source(searchSourceBuilder);
        List<ArticleEntity> result = new ArrayList<>();
        try {
            SearchResponse search = client.search(request, RequestOptions.DEFAULT);
            for (SearchHit hit:search.getHits()){
                Map<String, Object> map = hit.getSourceAsMap();
                ArticleEntity item = new ArticleEntity();
                item.setAuthor((String) map.get("author"));
                item.setContent((String) map.get("content"));
                item.setTitle((String) map.get("title"));
                item.setUrl((String) map.get("url"));
                result.add(item);
            }
            return result;
        } catch (IOException e) {
            e.printStackTrace();
        }
        return null;
    }

    public ArticleEntity queryById(String indexId){
        GetRequest request = new GetRequest(ARTICLE_INDEX, indexId);
        GetResponse response = null;
        try {
            response = client.get(request, RequestOptions.DEFAULT);
        } catch (IOException e) {
            e.printStackTrace();
        }
        if (response!=null&&response.isExists()){
            Gson gson = new Gson();
            return gson.fromJson(response.getSourceAsString(),ArticleEntity.class);
        }
        return null;
    }
}

5.4 External interface

This is the same as developing any web application with Spring Boot.

package com.lbh.es.controller;

import com.lbh.es.entity.ArticleEntity;
import com.lbh.es.service.ArticleService;
import org.elasticsearch.action.index.IndexResponse;
import org.springframework.web.bind.annotation.*;

import javax.annotation.Resource;
import java.util.List;

/**
 * Copyright(c)[email protected]
 * @author liubinhao
 * @date 2021/3/3
 */
@RestController
@RequestMapping("article")
public class ArticleController {

    @Resource
    private ArticleService articleService;

    @GetMapping("/create")
    public boolean create(){
        return articleService.createIndexOfArticle();
    }

    @GetMapping("/delete")
    public boolean delete() {
        return articleService.deleteArticle();
    }

    @PostMapping("/add")
    public IndexResponse add(@RequestBody ArticleEntity article){
        return articleService.addArticle(article);
    }

    @GetMapping("/transfer")
    public String transfer(){
        articleService.transferFromMysql();
        return "successful";
    }

    @GetMapping("/query")
    public List<ArticleEntity> query(String keyword){
        return articleService.queryByKey(keyword);
    }
}
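As a usage sketch, the query endpoint defined above can be called from plain Java (11 or later) with the JDK's built-in HTTP client. The base URL http://localhost:8080 and the class name `ArticleQueryClient` are assumptions; only the /article/query path and the keyword parameter come from the controller.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

/** Minimal caller for the /article/query endpoint (hypothetical helper). */
public class ArticleQueryClient {

    /** Builds the GET URI, URL-encoding the keyword so spaces and Chinese text are safe. */
    public static URI buildQueryUri(String baseUrl, String keyword) {
        String encoded = URLEncoder.encode(keyword, StandardCharsets.UTF_8);
        return URI.create(baseUrl + "/article/query?keyword=" + encoded);
    }

    /** Sends the request and returns the raw response body (the JSON list of articles). */
    public static String query(String baseUrl, String keyword) throws Exception {
        HttpRequest request = HttpRequest.newBuilder(buildQueryUri(baseUrl, keyword))
                .GET()
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        return response.body();
    }
}
```

Calling `ArticleQueryClient.query("http://localhost:8080", "elasticsearch")` against a running instance should return whatever `queryByKey` found, serialized as JSON by Spring.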

5.5 Pages

The pages use thymeleaf, mainly because I really don't know front-end development and only know simple h5, so I just put together pages that can display the results.

Search page

<!DOCTYPE html>
<html lang="en" xmlns:th="http://www.thymeleaf.org">
<head>
    <meta charset="UTF-8" />
    <meta name="viewport" content="width=device-width, initial-scale=1.0" />
    <title>YiyiDu</title>
    <!--
        input:focus Set when the input box is clicked , The blue outer border appears 
        text-indent: 11px; and padding-left: 11px; Set the distance between the starting position of the input character and the left border 
    -->
    <style>
        input:focus {
            border: 2px solid rgb(62, 88, 206);
        }
        input {
            text-indent: 11px;
            padding-left: 11px;
            font-size: 16px;
        }
    </style>
    <!--input The initial state -->
    <style class="input/css">
        .input {
            width: 33%;
            height: 45px;
            vertical-align: top;
            box-sizing: border-box;
            border: 2px solid rgb(207, 205, 205);
            border-right: 2px solid rgb(62, 88, 206);
            border-bottom-left-radius: 10px;
            border-top-left-radius: 10px;
            outline: none;
            margin: 0;
            display: inline-block;
            background: url(/static/img/camera.jpg) no-repeat 0 0;
            background-position: 565px 7px;
            background-size: 28px;
            padding-right: 49px;
            padding-top: 10px;
            padding-bottom: 10px;
            line-height: 16px;
        }
    </style>
    <!--button The initial state -->
    <style class="button/css">
        .button {
            height: 45px;
            width: 130px;
            vertical-align: middle;
            text-indent: -8px;
            padding-left: -8px;
            background-color: rgb(62, 88, 206);
            color: white;
            font-size: 18px;
            outline: none;
            border: none;
            border-bottom-right-radius: 10px;
            border-top-right-radius: 10px;
            margin: 0;
            padding: 0;
        }
    </style>
</head>
<body>
<!-- contain table Of div-->
<!-- contain input and button Of div-->
    <div style="font-size: 0px;">
        <div align="center" style="margin-top: 0px;">
            <img src="../static/img/yyd.png" th:src = "@{/static/img/yyd.png}"  alt="YiyiDu" width="280px" class="pic" />
        </div>
        <div align="center">
            <!--action Realize jump -->
            <form action="/home/query">
                <input type="text" class="input" name="keyword" />
                <input type="submit" class="button" value="Search" />
            </form>
        </div>
    </div>
</body>
</html>

Search results page

<!DOCTYPE html>
<html lang="en" xmlns:th="http://www.thymeleaf.org">
<head>
    <link rel="stylesheet" href="https://cdn.staticfile.org/twitter-bootstrap/4.3.1/css/bootstrap.min.css">
    <meta charset="UTF-8">
    <title>xx-manager</title>
</head>
<body>
<header th:replace="search.html"></header>
<div class="container my-2">
    <ul th:each="article : ${articles}">
        <a th:href="${article.url}"><li th:text="${article.author}+${article.content}"></li></a>
    </ul>
</div>
<footer th:replace="footer.html"></footer>
</body>
</html>

6 Summary

I write code at work and keep writing code and blog posts after work. I spent two days studying es, and it turned out to be quite interesting. The foundations of IR are still largely statistical, so a search engine like es performs well when there is a lot of data. Every time I write a hands-on piece I find it hard to get started, because I don't know what to build. So I hope to hear some interesting ideas that I can turn into hands-on projects.

copyright notice
author[IT1124],Please bring the original link to reprint, thank you.
https://en.cdmana.com/2022/01/202201270130375022.html
