Using Elasticsearch 7.x from Java for full-text search of Word and PDF file contents_荔枝味的真知棒


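Preface: our company's online document preview used to be Flash-based, with full-text search running on ES 2. The project is now being overhauled. Flash is no longer usable, so preview has moved to KKFileView, and ES needs to be upgraded as well, with the IK Chinese analyzer added. I wrote this post as a summary and for the record.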

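I won't go into setting up and using KKFileView here; the KKFileView website covers most of it. For a few specific problems around copying text, see my other post, "KKFileView在线预览初使用记录" (first notes on using KKFileView online preview), which mainly deals with restrictions such as content that cannot be copied.

Using Elasticsearch from Java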

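The utility class I wrote is pasted below for later reuse.

File handling

Plain-text content can be uploaded directly, but Word, PDF, and other file formats need preprocessing first, so we start by setting up an ingest pipeline. If you are curious why a pipeline is needed, read up on how ES handles PUT requests.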

Installing the text extraction plugin

# Run the following command from the ES installation directory
./bin/elasticsearch-plugin install ingest-attachment

Defining the text extraction pipeline

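Run the snippet below in Kibana; a response containing true means it worked. [Remember to restart ES after installing the plugin, otherwise defining the pipeline will fail. In a cluster, every node must be restarted.] If you are not familiar with Kibana, it is worth learning. For installing and running it, see my other posts, "Linux安装运行" and "Mac安装运行"; both also cover the IK analyzer.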

PUT /_ingest/pipeline/attachment
{
  "description": "Extract attachment information",
  "processors": [
    {
      "attachment": {
        "field": "content",
        "ignore_missing": true
      }
    },
    {
      "remove": {
        "field": "content"
      }
    }
  ]
}

Creating the index

The request is PUT /<index name>. The properties here can be adjusted to match your actual fields.

PUT /fileindex
{
  "mappings": {
    "properties": {
      "id": {
        "type": "keyword"
      },
      "name": {
        "type": "text",
        "analyzer": "ik_max_word"
      },
      "sfName": {
        "type": "text",
        "analyzer": "ik_max_word"
      },
      "createBy": {
        "type": "text",
        "analyzer": "ik_max_word"
      },
      "type": {
        "type": "keyword"
      },
      "attachment": {
        "properties": {
          "content": {
            "type": "text",
            "analyzer": "ik_smart"
          }
        }
      }
    }
  }
}

If both steps above succeeded, you can run a test.

Because Elasticsearch is a JSON-based document store, an attachment must be Base64-encoded before it is inserted into Elasticsearch. First convert a PDF file to Base64 text using the site below: PDF to Base64
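If you would rather do the conversion locally than through a website, the JDK can produce the same Base64 text. The following is a minimal sketch, not part of the original setup: the class name Base64FileEncoder and its method names are made up for illustration, and it assumes the file is small enough to read into memory in one go.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Base64;

// Hypothetical helper (name chosen for this example only).
public class Base64FileEncoder {

    /** Encodes raw bytes as a standard Base64 string without line breaks. */
    static String encodeBytes(byte[] bytes) {
        return Base64.getEncoder().encodeToString(bytes);
    }

    /** Reads the whole file into memory and returns its Base64 form. */
    static String encodeFile(Path path) throws IOException {
        return encodeBytes(Files.readAllBytes(path));
    }

    public static void main(String[] args) throws IOException {
        if (args.length > 0) {
            // Usage: java Base64FileEncoder /path/to/file.pdf
            System.out.println(encodeFile(Paths.get(args[0])));
        } else {
            // Small self-check when no file is given
            System.out.println(encodeBytes("hello".getBytes(StandardCharsets.UTF_8))); // prints aGVsbG8=
        }
    }
}
```

The printed string is what goes into the content field of the document.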

POST /docwrite/_doc?pipeline=attachment
{
  "name": "进口红酒",
  "type": "pdf",
  "content": "<paste your converted Base64 text here>"
}

Then use GET to check that the document we just uploaded was indexed successfully.

GET /docwrite/_search

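If all went well, you should see the parsed content in the response; I won't paste a screenshot here. If you do not specify the pipeline, ES does not parse the file, and what comes back from the query is not the Chinese text you would recognize, haha.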
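The pipeline is selected purely by the pipeline query parameter on the index request, which is easy to drop when building the call by hand. As a rough sketch of the same test request sent from plain Java, assuming JDK 11+ (for java.net.http) and a node at 127.0.0.1:9200; the class name PipelineIndexRequest is hypothetical and the content value is only a placeholder:

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Hypothetical helper (name chosen for this example only).
public class PipelineIndexRequest {

    /** Builds POST {baseUrl}/{index}/_doc?pipeline={pipeline} with a JSON body. */
    static HttpRequest build(String baseUrl, String index, String pipeline, String jsonBody) {
        URI uri = URI.create(baseUrl + "/" + index + "/_doc?pipeline=" + pipeline);
        return HttpRequest.newBuilder(uri)
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(jsonBody))
                .build();
    }

    public static void main(String[] args) {
        // "aGVsbG8=" is a placeholder; use the Base64 text of your own file
        String body = "{\"name\":\"进口红酒\",\"type\":\"pdf\",\"content\":\"aGVsbG8=\"}";
        HttpRequest request = build("http://127.0.0.1:9200", "docwrite", "attachment", body);
        System.out.println(request.method() + " " + request.uri());
        // To actually send it (requires a running node):
        // java.net.http.HttpClient.newHttpClient()
        //         .send(request, java.net.http.HttpResponse.BodyHandlers.ofString());
    }
}
```

Leaving off ?pipeline=attachment still indexes the document, but the content field then stays as raw Base64.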

Java example

POM dependencies

<dependency>
    <groupId>org.elasticsearch.client</groupId>
    <artifactId>elasticsearch-rest-high-level-client</artifactId>
    <version>7.13.4</version>
</dependency>
<dependency>
    <groupId>org.elasticsearch</groupId>
    <artifactId>elasticsearch</artifactId>
    <version>7.13.4</version>
</dependency>
<dependency>
    <groupId>org.apache.httpcomponents</groupId>
    <artifactId>httpclient</artifactId>
    <version>4.5.8</version>
</dependency>
<dependency>
    <groupId>org.apache.httpcomponents</groupId>
    <artifactId>httpcore</artifactId>
    <version>4.4.9</version>
</dependency>
<dependency>
    <groupId>com.alibaba</groupId>
    <artifactId>fastjson</artifactId>
    <version>1.2.7</version>
</dependency>

Entity class for the required fields

// Shape this entity class around your own business needs, but keep the content field
public class fileMessage {
    String id;      // file id
    String name;    // file name
    String type;    // file type: pdf, word, or txt
    String content; // the entire file content, Base64-encoded
    // Getters and setters omitted; the upload code calls the setters,
    // and fastjson needs the getters to serialize these fields.
}

Upload code

Use the updateESFile method below to upload a file. Adapt the business logic, and the entity class, to your own requirements.

private void updateESFile(String filePath) {
    File file = new File(filePath);
    if (!file.exists()) {
        System.out.println("File not found");
        return;
    }
    fileMessage fileM = new fileMessage();
    try {
        byte[] bytes = getContent(file);
        String base64 = Base64.getEncoder().encodeToString(bytes);
        fileM.setId("1");
        fileM.setName(file.getName());
        fileM.setContent(base64);
        IndexRequest indexRequest = new IndexRequest("fileindex");
        // Run the attachment pipeline on upload so the file content is extracted
        indexRequest.source(JSON.toJSONString(fileM), XContentType.JSON);
        indexRequest.setPipeline("attachment");
        IndexResponse indexResponse = EsUtil.client.index(indexRequest, RequestOptions.DEFAULT);
        logger.info("send to eSearch:" + file.getName());
        logger.info("send to eSearch results:" + indexResponse);
    } catch (IOException e) {
        e.printStackTrace();
    }
}

/**
 * Reads a file into a byte array (the caller Base64-encodes it).
 * @param file the file to read
 * @return the file content
 * @throws IOException if the file is too large or cannot be read completely
 */
private byte[] getContent(File file) throws IOException {
    long fileSize = file.length();
    if (fileSize > Integer.MAX_VALUE) {
        throw new IOException("File too big: " + file.getName());
    }
    byte[] buffer = new byte[(int) fileSize];
    // try-with-resources closes the stream even if reading fails
    try (FileInputStream fi = new FileInputStream(file)) {
        int offset = 0;
        int numRead;
        while (offset < buffer.length
                && (numRead = fi.read(buffer, offset, buffer.length - offset)) >= 0) {
            offset += numRead;
        }
        // Make sure all the data was read
        if (offset != buffer.length) {
            throw new IOException("Could not completely read file " + file.getName());
        }
    }
    return buffer;
}

Java EsUtil utility class

package util;

import com.alibaba.fastjson.JSON;
import org.apache.http.HttpHost;
import org.elasticsearch.action.admin.indices.delete.DeleteIndexRequest;
import org.elasticsearch.action.bulk.BulkRequest;
import org.elasticsearch.action.bulk.BulkResponse;
import org.elasticsearch.action.delete.DeleteRequest;
import org.elasticsearch.action.delete.DeleteResponse;
import org.elasticsearch.action.get.GetRequest;
import org.elasticsearch.action.get.GetResponse;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.action.index.IndexResponse;
import org.elasticsearch.action.search.SearchRequest;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.action.support.master.AcknowledgedResponse;
import org.elasticsearch.action.update.UpdateRequest;
import org.elasticsearch.action.update.UpdateResponse;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.client.indices.CreateIndexRequest;
import org.elasticsearch.client.indices.CreateIndexResponse;
import org.elasticsearch.client.indices.GetIndexRequest;
import org.elasticsearch.common.text.Text;
import org.elasticsearch.common.unit.TimeValue;
import org.elasticsearch.common.xcontent.XContentType;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.SearchHit;
import org.elasticsearch.search.builder.SearchSourceBuilder;
import org.elasticsearch.search.fetch.subphase.highlight.HighlightBuilder;
import org.elasticsearch.search.fetch.subphase.highlight.HighlightField;

import java.io.IOException;
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

public class EsUtil {

    /**
     * Client; connects to a local node here
     */
    public static RestHighLevelClient client = new RestHighLevelClient(
            RestClient.builder(
                    new HttpHost("127.0.0.1", 9200, "http")));

    /**
     * Close the connection
     */
    public static void shutdown() {
        if (client != null) {
            try {
                client.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }

    /**
     * Default type (mapping types are deprecated in 7.x and slated for removal in 8.0; kept here for 7.13)
     */
    public static final String DEFAULT_TYPE = "_doc";

    /**
     * Prefix of setter methods
     */
    public static final String SET_METHOD_PREFIX = "set";

    /**
     * Response status: CREATED
     */
    public static final String RESPONSE_STATUS_CREATED = "CREATED";

    /**
     * Response status: OK
     */
    public static final String RESPONSE_STATUS_OK = "OK";

    /**
     * Response status: NOT_FOUND
     */
    public static final String RESPONSE_STATUS_NOT_FOUND = "NOT_FOUND";

    /**
     * Document keys to filter out
     */
    public static final String[] IGNORE_KEY = {"@timestamp", "@version", "type"};

    /**
     * Timeout: 1s
     */
    public static final TimeValue TIME_VALUE_SECONDS = TimeValue.timeValueSeconds(1);

    /**
     * Bulk insert
     */
    public static final String PATCH_OP_TYPE_INSERT = "insert";

    /**
     * Bulk delete
     */
    public static final String PATCH_OP_TYPE_DELETE = "delete";

    /**
     * Bulk update
     */
    public static final String PATCH_OP_TYPE_UPDATE = "update";

    // ========== Data helpers (utilities, no ES calls) ==========

    /**
     * Removes the ignored keys from the document data to avoid unnecessary iterations.
     *
     * @param map document data
     * @author: gxf
     * @date: 2021-07-27
     * @time: 10:39 AM
     */
    public static void ignoreSource(Map<String, Object> map) {
        for (String key : IGNORE_KEY) {
            map.remove(key);
        }
    }

    /**
     * Converts document data into an object of the given class.
     *
     * @param sourceAsMap document data
     * @param clazz       target class
     * @return the populated object, or null on failure
     * @author: gxf
     * @date: 2021-07-27
     * @time: 10:38 AM
     */
    public static <T> T dealObject(Map<String, Object> sourceAsMap, Class<T> clazz) {
        try {
            ignoreSource(sourceAsMap);
            Iterator<String> keyIterator = sourceAsMap.keySet().iterator();
            T t = clazz.newInstance();
            while (keyIterator.hasNext()) {
                String key = keyIterator.next();
                Object value = sourceAsMap.get(key);
                if (value == null) {
                    // No runtime type to match a setter against; skip this field
                    continue;
                }
                // Capitalize the first letter to build the setter name, e.g. name -> setName
                String replaceKey = key.substring(0, 1).toUpperCase() + key.substring(1);
                Method method;
                try {
                    method = clazz.getMethod(SET_METHOD_PREFIX + replaceKey, value.getClass());
                } catch (NoSuchMethodException e) {
                    continue;
                }
                method.invoke(t, value);
            }
            return t;
        } catch (Exception e) {
            e.printStackTrace();
        }
        return null;
    }

    // ========== Index operations ==========

    /**
     * Creates an index. Returns true if the index did not exist and was created,
     * false if an index with the same name already exists.
     *
     * @param index index name
     * @return: boolean
     * @author: gxf
     * @date: 2021-07-27
     * @time: 11:01 AM
     */
    public static boolean insertIndex(String index) {
        // Build the create-index request
        CreateIndexRequest request = new CreateIndexRequest(index);
        // Execute it through the IndicesClient and read the response
        try {
            CreateIndexResponse response = client.indices().create(request, RequestOptions.DEFAULT);
            return response != null;
        } catch (Exception e) {
            e.printStackTrace();
        }
        return false;
    }

    /**
     * Checks whether an index exists. Returns true if it does,
     * false if it does not or if an error occurred.
     *
     * @param index index name
     * @return: boolean
     * @author: gxf
     * @date: 2021-07-27
     * @time: 11:09 AM
     */
    public static boolean isExitsIndex(String index) {
        GetIndexRequest request = new GetIndexRequest(index);
        try {
            return client.indices().exists(request, RequestOptions.DEFAULT);
        } catch (Exception e) {
            e.printStackTrace();
        }
        return false;
    }

    /**
     * Deletes an index. Returns true on success, false on failure.
     *
     * @param index index name
     * @return: boolean
     * @author: gxf
     * @date: 2021-07-27
     * @time: 11:23 AM
     */
    public static boolean deleteIndex(String index) {
        DeleteIndexRequest request = new DeleteIndexRequest(index);
        try {
            AcknowledgedResponse response = client.indices().delete(request, RequestOptions.DEFAULT);
            return response.isAcknowledged();
        } catch (Exception e) {
            e.printStackTrace();
        }
        return false;
    }

    // ========== Document operations (insert, delete, update) ==========

    /**
     * Inserts or updates a document.
     *
     * @param index index
     * @param id    document id
     * @param data  data
     * @return: boolean
     * @author: gxf
     * @date: 2021-07-27
     * @time: 10:34 AM
     */
    public static boolean insertOrUpdateDocument(String index, String id, Object data) {
        try {
            IndexRequest request = new IndexRequest(index);
            request.timeout(TIME_VALUE_SECONDS);
            if (id != null && id.length() > 0) {
                request.id(id);
            }
            request.source(JSON.toJSONString(data), XContentType.JSON);
            IndexResponse response = client.index(request, RequestOptions.DEFAULT);
            String status = response.status().toString();
            if (RESPONSE_STATUS_CREATED.equals(status) || RESPONSE_STATUS_OK.equals(status)) {
                return true;
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
        return false;
    }

    /**
     * Updates a document.
     *
     * @param index index
     * @param id    document id
     * @param data  data
     * @return: boolean
     * @author: gxf
     * @date: 2021-07-27
     * @time: 10:34 AM
     */
    public static boolean updateDocument(String index, String id, Object data) {
        try {
            UpdateRequest request = new UpdateRequest(index, id);
            request.doc(JSON.toJSONString(data), XContentType.JSON);
            UpdateResponse response = client.update(request, RequestOptions.DEFAULT);
            String status = response.status().toString();
            if (RESPONSE_STATUS_OK.equals(status)) {
                return true;
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
        return false;
    }

    /**
     * Deletes a document.
     *
     * @param index index
     * @param id    document id
     * @return: boolean
     * @author: gxf
     * @date: 2021-07-27
     * @time: 10:33 AM
     */
    public static boolean deleteDocument(String index, String id) {
        try {
            DeleteRequest request = new DeleteRequest(index, id);
            DeleteResponse response = client.delete(request, RequestOptions.DEFAULT);
            String status = response.status().toString();
            if (RESPONSE_STATUS_OK.equals(status)) {
                return true;
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
        return false;
    }

    /**
     * Bulk insert for small data sets.
     *
     * @param index    index
     * @param dataList the documents to insert
     * @param timeout  timeout in seconds
     * @return: boolean
     * @author: gxf
     * @date: 2021-07-27
     * @time: 10:31 AM
     */
    public static boolean simplePatchInsert(String index, List<Object> dataList, long timeout) {
        try {
            BulkRequest bulkRequest = new BulkRequest();
            bulkRequest.timeout(TimeValue.timeValueSeconds(timeout));
            if (dataList != null && dataList.size() > 0) {
                for (Object obj : dataList) {
                    bulkRequest.add(
                            new IndexRequest(index)
                                    .source(JSON.toJSONString(obj), XContentType.JSON)
                    );
                }
                BulkResponse response = client.bulk(bulkRequest, RequestOptions.DEFAULT);
                if (!response.hasFailures()) {
                    return true;
                }
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
        return false;
    }

    /**
     * Bulk delete by id.
     *
     * @param index  index name
     * @param idList ids of the documents to delete
     * @return : boolean
     * @author : gxf
     * @date : 2021/6/30 1:22
     */
    public static boolean patchDelete(String index, List<String> idList) {
        BulkRequest request = new BulkRequest();
        for (String id : idList) {
            request.add(new DeleteRequest().index(index).id(id));
        }
        try {
            BulkResponse response = EsUtil.client.bulk(request, RequestOptions.DEFAULT);
            return !response.hasFailures();
        } catch (Exception e) {
            e.printStackTrace();
        }
        return false;
    }

    // ========== Document operations (queries) ==========

    /**
     * Checks whether a document exists.
     *
     * @param index index
     * @param id    document id
     * @return: boolean
     * @author: gxf
     * @date: 2021-07-27
     * @time: 10:36 AM
     */
    public static boolean isExistsDocument(String index, String id) {
        return isExistsDocument(index, DEFAULT_TYPE, id);
    }

    /**
     * Checks whether a document exists.
     *
     * @param index index
     * @param type  type
     * @param id    document id
     * @return: boolean
     * @author: gxf
     * @date: 2021-07-27
     * @time: 10:36 AM
     */
    public static boolean isExistsDocument(String index, String type, String id) {
        GetRequest request = new GetRequest(index, type, id);
        try {
            GetResponse response = client.get(request, RequestOptions.DEFAULT);
            return response.isExists();
        } catch (Exception e) {
            e.printStackTrace();
        }
        return false;
    }

    /**
     * Fetches a document by id.
     *
     * @param index index
     * @param id    document id
     * @param clazz target class
     * @return the object
     * @author: gxf
     * @date: 2021-07-27
     * @time: 10:36 AM
     */
    public static <T> T selectDocumentById(String index, String id, Class<T> clazz) {
        return selectDocumentById(index, DEFAULT_TYPE, id, clazz);
    }

    /**
     * Fetches a document by id.
     *
     * @param index index
     * @param type  type
     * @param id    document id
     * @param clazz target class
     * @return the object
     * @author: gxf
     * @date: 2021-07-27
     * @time: 10:35 AM
     */
    public static <T> T selectDocumentById(String index, String type, String id, Class<T> clazz) {
        try {
            type = type == null || type.equals("") ? DEFAULT_TYPE : type;
            GetRequest request = new GetRequest(index, type, id);
            GetResponse response = client.get(request, RequestOptions.DEFAULT);
            if (response.isExists()) {
                Map<String, Object> sourceAsMap = response.getSourceAsMap();
                return dealObject(sourceAsMap, clazz);
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
        return null;
    }

    /**
     * Runs a filtered query and returns the results as a list of objects.
     *
     * @param index         index
     * @param sourceBuilder query conditions
     * @param clazz         target class
     * @return: java.util.List<T>
     * @author: gxf
     * @date: 2021-07-27
     * @time: 10:35 AM
     */
    public static <T> List<T> selectDocumentList(String index, SearchSourceBuilder sourceBuilder, Class<T> clazz) {
        try {
            SearchRequest request = new SearchRequest(index);
            if (sourceBuilder != null) {
                // Return the actual hit count
                sourceBuilder.trackTotalHits(true);
                request.source(sourceBuilder);
            }
            SearchResponse response = client.search(request, RequestOptions.DEFAULT);
            if (response.getHits() != null) {
                List<T> list = new ArrayList<>();
                SearchHit[] hits = response.getHits().getHits();
                for (SearchHit documentFields : hits) {
                    Map<String, Object> sourceAsMap = documentFields.getSourceAsMap();
                    list.add(dealObject(sourceAsMap, clazz));
                }
                return list;
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
        return null;
    }

    /**
     * Runs a filtered query and returns the raw response.
     *
     * @param index         index
     * @param sourceBuilder query conditions
     * @return: SearchResponse, or null on failure
     * @author: gxf
     * @date: 2021-07-27
     * @time: 10:35 AM
     */
    public static SearchResponse selectDocument(String index, SearchSourceBuilder sourceBuilder) {
        try {
            SearchRequest request = new SearchRequest(index);
            if (sourceBuilder != null) {
                // Return the actual hit count
                sourceBuilder.trackTotalHits(true);
                sourceBuilder.size(10000);
                request.source(sourceBuilder);
            }
            return client.search(request, RequestOptions.DEFAULT);
        } catch (Exception e) {
            e.printStackTrace();
        }
        return null;
    }

    /**
     * Runs a filtered query and returns data in which matches are wrapped
     * in <span style='color:red'></span>.
     *
     * @param: index         index name
     * @param: sourceBuilder sourceBuilder object
     * @param: clazz         class of the objects to return
     * @param: highLight     field whose matches should be highlighted
     * @return: java.util.List<T>
     * @author: gxf
     * @date: 2021-07-27
     * @time: 6:39 PM
     */
    public static <T> List<T> selectDocumentListHighLight(String index, SearchSourceBuilder sourceBuilder, Class<T> clazz, String highLight) {
        try {
            SearchRequest request = new SearchRequest(index);
            if (sourceBuilder != null) {
                // Return the actual hit count
                sourceBuilder.trackTotalHits(true);
                // Highlighting
                HighlightBuilder highlightBuilder = new HighlightBuilder();
                highlightBuilder.field(highLight);
                // Highlight the field even when the query matched a different field
                highlightBuilder.requireFieldMatch(false);
                highlightBuilder.preTags("<span style='color:red'>");
                highlightBuilder.postTags("</span>");
                sourceBuilder.highlighter(highlightBuilder);
                request.source(sourceBuilder);
            }
            SearchResponse response = client.search(request, RequestOptions.DEFAULT);
            if (response.getHits() != null) {
                List<T> list = new ArrayList<>();
                for (SearchHit documentFields : response.getHits().getHits()) {
                    Map<String, HighlightField> highlightFields = documentFields.getHighlightFields();
                    HighlightField title = highlightFields.get(highLight);
                    Map<String, Object> sourceAsMap = documentFields.getSourceAsMap();
                    if (title != null) {
                        Text[] fragments = title.fragments();
                        String n_title = "";
                        for (Text fragment : fragments) {
                            n_title += fragment;
                        }
                        // Replace the original content with the highlighted version
                        sourceAsMap.put(highLight, n_title);
                    }
                    list.add(dealObject(sourceAsMap, clazz));
                }
                return list;
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
        return null;
    }

    /**
     * Returns everything in the index as a raw SearchResponse;
     * the caller parses it, no wrapping is applied.
     *
     * @param: index index name
     * @return: SearchResponse
     * @author: gxf
     * @date: 2021/6/30
     * @time: 1:28 AM
     */
    public static SearchResponse queryAllData(String index) {
        // Create the search request
        SearchRequest request = new SearchRequest(index);
        // Build the request body
        SearchSourceBuilder sourceBuilder = new SearchSourceBuilder();
        // Match all documents
        sourceBuilder.query(QueryBuilders.matchAllQuery());
        request.source(sourceBuilder);
        try {
            return client.search(request, RequestOptions.DEFAULT);
        } catch (IOException e) {
            e.printStackTrace();
        }
        return null;
    }

    /**
     * Returns everything in the index, converted to the given type.
     *
     * @param: index index name
     * @param: clazz class to convert each document into
     * @return: java.util.List<T>
     * @author: gxf
     * @date: 2021/6/30
     * @time: 1:32 AM
     */
    public static <T> List<T> queryAllData(String index, Class<T> clazz) {
        // Create the search request
        SearchRequest request = new SearchRequest(index);
        // Build the request body
        SearchSourceBuilder sourceBuilder = new SearchSourceBuilder();
        // Match all documents
        sourceBuilder.query(QueryBuilders.matchAllQuery());
        request.source(sourceBuilder);
        try {
            SearchResponse response = client.search(request, RequestOptions.DEFAULT);
            if (response.getHits() != null) {
                List<T> list = new ArrayList<>();
                SearchHit[] hits = response.getHits().getHits();
                for (SearchHit documentFields : hits) {
                    Map<String, Object> sourceAsMap = documentFields.getSourceAsMap();
                    list.add(dealObject(sourceAsMap, clazz));
                }
                return list;
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
        return null;
    }
}

Full-text search with ES from Java

public List<Map<String, Object>> eSearch(String msg) throws UnknownHostException {
    List<Map<String, Object>> matchResult = new LinkedList<Map<String, Object>>();
    SearchSourceBuilder builder = new SearchSourceBuilder();
    // My business case also needs to search other fields, so this query covers several of them.
    // To search only the document content, uncomment the line below and comment this one out.
    builder.query(QueryBuilders.multiMatchQuery(msg, "attachment.content", "name", "sfName", "createBy").analyzer("ik_smart"));
    //builder.query(QueryBuilders.matchQuery("attachment.content", msg).analyzer("ik_smart"));
    SearchResponse searchResponse = EsUtil.selectDocument("fileindex", builder);
    if (searchResponse == null) {
        // selectDocument returns null when the search fails
        return matchResult;
    }
    SearchHits hits = searchResponse.getHits();
    for (SearchHit hit : hits.getHits()) {
        hit.getSourceAsMap().put("msg", "");
        matchResult.add(hit.getSourceAsMap());
        // System.out.println(hit.getSourceAsString());
    }
    System.out.println("over in the main");
    return matchResult;
}

Wrapping up

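That completes a simple walkthrough of uploading documents to ES from Java and running full-text search over them. Hope it all works perfectly for you ヽ(°▽°)ノ If this helped, please leave a like before you go.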