Example 1 with Entity

Use of org.icij.datashare.Entity in project datashare by ICIJ.

From the class ElasticsearchIndexer, method bulkUpdate.

@Override
public <T extends Entity> boolean bulkUpdate(String indexName, List<T> entities) throws IOException {
    BulkRequest bulkRequest = new BulkRequest();
    entities.stream().map(e -> createUpdateRequest(indexName, getType(e), e.getId(), getJson(e), getParent(e), getRoot(e))).forEach(bulkRequest::add);
    return executeBulk(bulkRequest);
}
Also used : GetResponse(org.elasticsearch.action.get.GetResponse) DEFAULT_SEARCH_SIZE(org.icij.datashare.text.indexing.elasticsearch.ElasticsearchConfiguration.DEFAULT_SEARCH_SIZE) Inject(com.google.inject.Inject) QueryBuilders(org.elasticsearch.index.query.QueryBuilders) EntityUtils(org.apache.http.util.EntityUtils) Tag(org.icij.datashare.text.Tag) IndexRequest(org.elasticsearch.action.index.IndexRequest) UpdateResponse(org.elasticsearch.action.update.UpdateResponse) SearchResponse(org.elasticsearch.action.search.SearchResponse) RequestOptions(org.elasticsearch.client.RequestOptions) XContentFactory.jsonBuilder(org.elasticsearch.common.xcontent.XContentFactory.jsonBuilder) JsonNode(com.fasterxml.jackson.databind.JsonNode) NStringEntity(org.apache.http.nio.entity.NStringEntity) Pipeline(org.icij.datashare.text.nlp.Pipeline) SearchHit(org.elasticsearch.search.SearchHit) GetRequest(org.elasticsearch.action.get.GetRequest) SliceBuilder(org.elasticsearch.search.slice.SliceBuilder) PropertiesProvider(org.icij.datashare.PropertiesProvider) DocumentField(org.elasticsearch.common.document.DocumentField) Project(org.icij.datashare.text.Project) BulkItemResponse(org.elasticsearch.action.bulk.BulkItemResponse) HttpEntity(org.apache.http.HttpEntity) ContentType(org.apache.http.entity.ContentType) BulkResponse(org.elasticsearch.action.bulk.BulkResponse) Indexer(org.icij.datashare.text.indexing.Indexer) Collectors(java.util.stream.Collectors) Document(org.icij.datashare.text.Document) Stream(java.util.stream.Stream) Response(org.elasticsearch.client.Response) org.elasticsearch.index.query(org.elasticsearch.index.query) RestStatus(org.elasticsearch.rest.RestStatus) Arrays.stream(java.util.Arrays.stream) JsonObjectMapper(org.icij.datashare.json.JsonObjectMapper) NamedEntity(org.icij.datashare.text.NamedEntity) java.util(java.util) ClearScrollRequest(org.elasticsearch.action.search.ClearScrollRequest) ScriptType(org.elasticsearch.script.ScriptType) 
SearchRequest(org.elasticsearch.action.search.SearchRequest) WriteRequest(org.elasticsearch.action.support.WriteRequest) TimeValue(org.elasticsearch.common.unit.TimeValue) SearchSourceBuilder(org.elasticsearch.search.builder.SearchSourceBuilder) Entity(org.icij.datashare.Entity) StreamSupport(java.util.stream.StreamSupport) UpdateByQueryRequest(org.elasticsearch.index.reindex.UpdateByQueryRequest) BulkByScrollResponse(org.elasticsearch.index.reindex.BulkByScrollResponse) Script(org.elasticsearch.script.Script) Optional.ofNullable(java.util.Optional.ofNullable) UpdateRequest(org.elasticsearch.action.update.UpdateRequest) IOException(java.io.IOException) DocWriteResponse(org.elasticsearch.action.DocWriteResponse) RestHighLevelClient(org.elasticsearch.client.RestHighLevelClient) Request(org.elasticsearch.client.Request) TimeUnit(java.util.concurrent.TimeUnit) Collectors.toList(java.util.stream.Collectors.toList) SearchScrollRequest(org.elasticsearch.action.search.SearchScrollRequest) BulkRequest(org.elasticsearch.action.bulk.BulkRequest)
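
The bulkUpdate method above maps each entity to one request and adds the requests to a single bulk object. The sketch below isolates that map-then-collect pattern using only the standard library. The names BulkSketch, Op and buildBulk are hypothetical stand-ins, not part of datashare or Elasticsearch, and the nested Entity interface only assumes that an entity exposes getId(), as the listing does.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in types: this sketch does not call Elasticsearch.
final class BulkSketch {
    /** Stand-in for org.icij.datashare.Entity: anything that exposes an id. */
    interface Entity {
        String getId();
    }

    /** Stand-in for a single update or index request. */
    static final class Op {
        final String index;
        final String id;

        Op(String index, String id) {
            this.index = index;
            this.id = id;
        }
    }

    /** Maps each entity to one operation and collects them in encounter order. */
    static <T extends Entity> List<Op> buildBulk(String indexName, List<T> entities) {
        List<Op> bulk = new ArrayList<>();
        entities.stream()
                .map(e -> new Op(indexName, e.getId()))
                .forEach(bulk::add);
        return bulk;
    }
}
```

Because the stream is sequential, the operations end up in the same order as the input list.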

Example 2 with Entity

Use of org.icij.datashare.Entity in project datashare by ICIJ.

From the class BatchDownloadRunner, method call.

@Override
public File call() throws Exception {
    int throttleMs = parseInt(propertiesProvider.get(BATCH_THROTTLE).orElse("0"));
    int maxResultSize = parseInt(propertiesProvider.get(BATCH_DOWNLOAD_MAX_NB_FILES).orElse(valueOf(MAX_BATCH_RESULT_SIZE)));
    int scrollSize = min(parseInt(propertiesProvider.get(SCROLL_SIZE).orElse("1000")), MAX_SCROLL_SIZE);
    long maxZipSizeBytes = HumanReadableSize.parse(propertiesProvider.get(BATCH_DOWNLOAD_MAX_SIZE).orElse("100M"));
    long zippedFilesSize = 0;
    logger.info("running batch download for user {} on project {} with throttle {}ms and scroll size of {}", batchDownload.user.getId(), batchDownload.project, throttleMs, scrollSize);
    Indexer.Searcher searcher = indexer.search(batchDownload.project.getId(), Document.class).withoutSource("content").limit(scrollSize);
    if (batchDownload.isJsonQuery()) {
        searcher.set(batchDownload.queryAsJson());
    } else {
        searcher.with(batchDownload.query);
    }
    // Fetch the first page of results; scroll() is called again below for each following page.
    List<? extends Entity> docsToProcess = searcher.scroll().collect(toList());
    if (docsToProcess.isEmpty()) {
        logger.warn("no results for batchDownload {}", batchDownload.uuid);
        return null;
    }
    docsToProcessSize = searcher.totalHits();
    if (docsToProcessSize > maxResultSize) {
        logger.warn("number of results for batch download > {} for {}/{} (nb zip entries will be limited)", maxResultSize, batchDownload.uuid, batchDownload.user);
    }
    try (Zipper zipper = createZipper(batchDownload, propertiesProvider, mailSenderSupplier)) {
        HashMap<String, Object> taskProperties = new HashMap<>();
        taskProperties.put("batchDownload", batchDownload);
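        // Keep scrolling until a page comes back empty.
        while (!docsToProcess.isEmpty()) {
            // Entries are only added while both the file-count and the zip-size limits hold.
            for (int i = 0; i < docsToProcess.size() && numberOfResults.get() < maxResultSize && zippedFilesSize <= maxZipSizeBytes; i++) {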
                Entity doc = docsToProcess.get(i);
                int addedBytes = zipper.add((Document) doc);
                if (addedBytes > 0) {
                    zippedFilesSize += addedBytes;
                    numberOfResults.incrementAndGet();
                    batchDownload.setZipSize(zippedFilesSize);
                    updateCallback.apply(new TaskView<>(new MonitorableFutureTask<>(this, taskProperties)));
                }
            }
            docsToProcess = searcher.scroll().collect(toList());
        }
    }
    logger.info("created batch download file {} ({} bytes/{} entries) for user {}", batchDownload.filename, Files.size(batchDownload.filename), numberOfResults, batchDownload.user.getId());
    return batchDownload.filename.toFile();
}
Also used : Entity(org.icij.datashare.Entity) HashMap(java.util.HashMap) Indexer(org.icij.datashare.text.indexing.Indexer)
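
The call method above pages through results until a page comes back empty, and it only adds entries while an entry-count limit and a size limit both hold. The sketch below reproduces that control flow with plain integers standing in for the number of bytes added per document. ScrollSketch, drain and nextPage are hypothetical names. Unlike the listing, which keeps scrolling after a limit is reached, this sketch returns as soon as a limit is hit; that early exit is a choice made here, not the project's behavior.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

// Hypothetical sketch: integers stand in for the bytes added per document.
final class ScrollSketch {
    /**
     * Pulls pages from nextPage until an empty page is returned, or until
     * maxItems entries were accepted or the accepted total exceeds maxBytes.
     */
    static List<Integer> drain(Supplier<List<Integer>> nextPage, int maxItems, long maxBytes) {
        List<Integer> accepted = new ArrayList<>();
        long total = 0;
        List<Integer> page = nextPage.get();
        while (!page.isEmpty()) {
            for (int size : page) {
                // Same limits as the listing: count below the cap, total not above the cap.
                if (accepted.size() >= maxItems || total > maxBytes) {
                    return accepted;
                }
                // Entries that added nothing are skipped, as with addedBytes > 0 above.
                if (size > 0) {
                    accepted.add(size);
                    total += size;
                }
            }
            page = nextPage.get();
        }
        return accepted;
    }
}
```

The size check runs before each entry is added, so the accepted total can go past maxBytes by at most one entry, which matches the `<=` comparison in the listing.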

Example 3 with Entity

Use of org.icij.datashare.Entity in project datashare by ICIJ.

From the class ElasticsearchIndexer, method bulkAdd.

@Override
public boolean bulkAdd(final String indexName, Pipeline.Type nerType, List<NamedEntity> namedEntities, Document parent) throws IOException {
    BulkRequest bulkRequest = new BulkRequest();
    // Route every request by the root document id when there is one, otherwise by the parent's own id.
    String routing = ofNullable(parent.getRootDocument()).orElse(parent.getId());
    // Set the parent document's status to DONE.
    bulkRequest.add(new UpdateRequest(indexName, parent.getId()).doc(jsonBuilder().startObject().field("status", Document.Status.DONE).endObject()).routing(routing));
    // Add the pipeline's tag to the parent's nerTags unless it is already listed.
    bulkRequest.add(new UpdateRequest(indexName, parent.getId()).script(new Script(ScriptType.INLINE, "painless", "if (!ctx._source.nerTags.contains(params.nerTag)) ctx._source.nerTags.add(params.nerTag);", new HashMap<String, Object>() {
        {
            put("nerTag", nerType.toString());
        }
    })).routing(routing));
    for (Entity child : namedEntities) {
        bulkRequest.add(createIndexRequest(indexName, JsonObjectMapper.getType(child), child.getId(), getJson(child), parent.getId(), routing));
    }
    bulkRequest.setRefreshPolicy(esCfg.refreshPolicy);
    BulkResponse bulkResponse = client.bulk(bulkRequest, RequestOptions.DEFAULT);
    if (bulkResponse.hasFailures()) {
        for (BulkItemResponse resp : bulkResponse.getItems()) {
            if (resp.isFailed()) {
                LOGGER.error("bulk add failed : {}", resp.getFailureMessage());
            }
        }
        return false;
    }
    return true;
}
Also used : Script(org.elasticsearch.script.Script) NStringEntity(org.apache.http.nio.entity.NStringEntity) HttpEntity(org.apache.http.HttpEntity) NamedEntity(org.icij.datashare.text.NamedEntity) Entity(org.icij.datashare.Entity) UpdateRequest(org.elasticsearch.action.update.UpdateRequest) BulkRequest(org.elasticsearch.action.bulk.BulkRequest) BulkItemResponse(org.elasticsearch.action.bulk.BulkItemResponse) BulkResponse(org.elasticsearch.action.bulk.BulkResponse)
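
Two small pieces of the bulkAdd method above can be looked at in isolation: the routing fallback (root document id if present, otherwise the parent's own id) and the one-entry parameter map passed to the script. The sketch below shows both with the standard library only. BulkAddSketch, routingFor and nerTagParams are hypothetical names, and using Collections.singletonMap in place of the anonymous HashMap subclass is a suggestion made here, not what the project does; note that the resulting map is immutable.

```java
import java.util.Collections;
import java.util.Map;
import java.util.Optional;

// Hypothetical helper names; only java.util is used.
final class BulkAddSketch {
    /** Routing key: the root document id when present, otherwise the document's own id. */
    static String routingFor(String rootDocumentId, String documentId) {
        return Optional.ofNullable(rootDocumentId).orElse(documentId);
    }

    /** One-entry, immutable parameter map built without an anonymous HashMap subclass. */
    static Map<String, Object> nerTagParams(String nerTag) {
        return Collections.singletonMap("nerTag", nerTag);
    }
}
```

For example, routingFor(null, "doc1") yields "doc1", while routingFor("root", "doc1") yields "root".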

Example 4 with Entity

Use of org.icij.datashare.Entity in project datashare by ICIJ.

From the class ElasticsearchIndexer, method bulkAdd.

@Override
public <T extends Entity> boolean bulkAdd(final String indexName, List<T> objs) throws IOException {
    BulkRequest bulkRequest = new BulkRequest();
    objs.stream().map(e -> createIndexRequest(indexName, getType(e), e.getId(), getJson(e), getParent(e), getRoot(e))).forEach(bulkRequest::add);
    return executeBulk(bulkRequest);
}
Also used : GetResponse(org.elasticsearch.action.get.GetResponse) DEFAULT_SEARCH_SIZE(org.icij.datashare.text.indexing.elasticsearch.ElasticsearchConfiguration.DEFAULT_SEARCH_SIZE) Inject(com.google.inject.Inject) QueryBuilders(org.elasticsearch.index.query.QueryBuilders) EntityUtils(org.apache.http.util.EntityUtils) Tag(org.icij.datashare.text.Tag) IndexRequest(org.elasticsearch.action.index.IndexRequest) UpdateResponse(org.elasticsearch.action.update.UpdateResponse) SearchResponse(org.elasticsearch.action.search.SearchResponse) RequestOptions(org.elasticsearch.client.RequestOptions) XContentFactory.jsonBuilder(org.elasticsearch.common.xcontent.XContentFactory.jsonBuilder) JsonNode(com.fasterxml.jackson.databind.JsonNode) NStringEntity(org.apache.http.nio.entity.NStringEntity) Pipeline(org.icij.datashare.text.nlp.Pipeline) SearchHit(org.elasticsearch.search.SearchHit) GetRequest(org.elasticsearch.action.get.GetRequest) SliceBuilder(org.elasticsearch.search.slice.SliceBuilder) PropertiesProvider(org.icij.datashare.PropertiesProvider) DocumentField(org.elasticsearch.common.document.DocumentField) Project(org.icij.datashare.text.Project) BulkItemResponse(org.elasticsearch.action.bulk.BulkItemResponse) HttpEntity(org.apache.http.HttpEntity) ContentType(org.apache.http.entity.ContentType) BulkResponse(org.elasticsearch.action.bulk.BulkResponse) Indexer(org.icij.datashare.text.indexing.Indexer) Collectors(java.util.stream.Collectors) Document(org.icij.datashare.text.Document) Stream(java.util.stream.Stream) Response(org.elasticsearch.client.Response) org.elasticsearch.index.query(org.elasticsearch.index.query) RestStatus(org.elasticsearch.rest.RestStatus) Arrays.stream(java.util.Arrays.stream) JsonObjectMapper(org.icij.datashare.json.JsonObjectMapper) NamedEntity(org.icij.datashare.text.NamedEntity) java.util(java.util) ClearScrollRequest(org.elasticsearch.action.search.ClearScrollRequest) ScriptType(org.elasticsearch.script.ScriptType) 
SearchRequest(org.elasticsearch.action.search.SearchRequest) WriteRequest(org.elasticsearch.action.support.WriteRequest) TimeValue(org.elasticsearch.common.unit.TimeValue) SearchSourceBuilder(org.elasticsearch.search.builder.SearchSourceBuilder) Entity(org.icij.datashare.Entity) StreamSupport(java.util.stream.StreamSupport) UpdateByQueryRequest(org.elasticsearch.index.reindex.UpdateByQueryRequest) BulkByScrollResponse(org.elasticsearch.index.reindex.BulkByScrollResponse) Script(org.elasticsearch.script.Script) Optional.ofNullable(java.util.Optional.ofNullable) UpdateRequest(org.elasticsearch.action.update.UpdateRequest) IOException(java.io.IOException) DocWriteResponse(org.elasticsearch.action.DocWriteResponse) RestHighLevelClient(org.elasticsearch.client.RestHighLevelClient) Request(org.elasticsearch.client.Request) TimeUnit(java.util.concurrent.TimeUnit) Collectors.toList(java.util.stream.Collectors.toList) SearchScrollRequest(org.elasticsearch.action.search.SearchScrollRequest) BulkRequest(org.elasticsearch.action.bulk.BulkRequest)

Aggregations

HttpEntity (org.apache.http.HttpEntity): 3
NStringEntity (org.apache.http.nio.entity.NStringEntity): 3
BulkItemResponse (org.elasticsearch.action.bulk.BulkItemResponse): 3
BulkRequest (org.elasticsearch.action.bulk.BulkRequest): 3
BulkResponse (org.elasticsearch.action.bulk.BulkResponse): 3
UpdateRequest (org.elasticsearch.action.update.UpdateRequest): 3
Entity (org.icij.datashare.Entity): 3
JsonNode (com.fasterxml.jackson.databind.JsonNode): 2
Inject (com.google.inject.Inject): 2
IOException (java.io.IOException): 2
java.util (java.util): 2
Arrays.stream (java.util.Arrays.stream): 2
Optional.ofNullable (java.util.Optional.ofNullable): 2
TimeUnit (java.util.concurrent.TimeUnit): 2
Collectors (java.util.stream.Collectors): 2
Collectors.toList (java.util.stream.Collectors.toList): 2
Stream (java.util.stream.Stream): 2
StreamSupport (java.util.stream.StreamSupport): 2
ContentType (org.apache.http.entity.ContentType): 2
EntityUtils (org.apache.http.util.EntityUtils): 2