Example 16 with GetRequest

use of org.graylog.shaded.elasticsearch7.org.elasticsearch.action.get.GetRequest in project datashare by ICIJ.

the class ElasticsearchSpewerTest method test_duplicate_file.

@Test
public void test_duplicate_file() throws Exception {
    DocumentFactory tikaFactory = new DocumentFactory().configure(Options.from(new HashMap<String, String>() {
        {
            put("idDigestMethod", Document.HASHER.toString());
        }
    }));
    Extractor extractor = new Extractor(tikaFactory);
    extractor.setDigester(new UpdatableDigester("project", Document.HASHER.toString()));
    final TikaDocument document = extractor.extract(get(Objects.requireNonNull(getClass().getResource("/docs/doc.txt")).getPath()));
    final TikaDocument document2 = extractor.extract(get(Objects.requireNonNull(getClass().getResource("/docs/doc-duplicate.txt")).getPath()));
    spewer.write(document);
    spewer.write(document2);
    GetResponse actualDocument = es.client.get(new GetRequest(TEST_INDEX, document.getId()), RequestOptions.DEFAULT);
    GetResponse actualDocument2 = es.client.get(new GetRequest(TEST_INDEX, new Duplicate(document2.getPath(), document.getId()).getId()), RequestOptions.DEFAULT);
    assertThat(actualDocument.isExists()).isTrue();
    assertThat(actualDocument.getSourceAsMap()).includes(entry("type", "Document"));
    assertThat(actualDocument2.isExists()).isTrue();
    assertThat(actualDocument2.getSourceAsMap()).includes(entry("type", "Duplicate"));
}
Also used : DocumentFactory(org.icij.extract.document.DocumentFactory) UpdatableDigester(org.icij.extract.extractor.UpdatableDigester) HashMap(java.util.HashMap) GetRequest(org.elasticsearch.action.get.GetRequest) TikaDocument(org.icij.extract.document.TikaDocument) Duplicate(org.icij.datashare.text.Duplicate) Extractor(org.icij.extract.extractor.Extractor) GetResponse(org.elasticsearch.action.get.GetResponse) Test(org.junit.Test)

Example 17 with GetRequest

use of org.graylog.shaded.elasticsearch7.org.elasticsearch.action.get.GetRequest in project datashare by ICIJ.

the class ElasticsearchSpewerTest method test_truncated_content.

@Test
public void test_truncated_content() throws Exception {
    ElasticsearchSpewer limitedContentSpewer = new ElasticsearchSpewer(es.client, text -> Language.ENGLISH, new FieldNames(), publisher, new PropertiesProvider(new HashMap<String, String>() {
        {
            put("maxContentLength", "20");
        }
    })).withRefresh(IMMEDIATE).withIndex("test-datashare");
    final TikaDocument document = new DocumentFactory().withIdentifier(new PathIdentifier()).create(get("fake-file.txt"));
    final ParsingReader reader = new ParsingReader(new ByteArrayInputStream("this content should be truncated".getBytes()));
    document.setReader(reader);
    limitedContentSpewer.write(document);
    GetResponse documentFields = es.client.get(new GetRequest(TEST_INDEX, document.getId()), RequestOptions.DEFAULT);
    assertThat(documentFields.getSourceAsMap()).includes(entry("content", "this content should"));
}
Also used : PropertiesProvider(org.icij.datashare.PropertiesProvider) DocumentFactory(org.icij.extract.document.DocumentFactory) FieldNames(org.icij.spewer.FieldNames) HashMap(java.util.HashMap) ParsingReader(org.apache.tika.parser.ParsingReader) ByteArrayInputStream(java.io.ByteArrayInputStream) GetRequest(org.elasticsearch.action.get.GetRequest) PathIdentifier(org.icij.extract.document.PathIdentifier) TikaDocument(org.icij.extract.document.TikaDocument) GetResponse(org.elasticsearch.action.get.GetResponse) Test(org.junit.Test)

Example 18 with GetRequest

use of org.graylog.shaded.elasticsearch7.org.elasticsearch.action.get.GetRequest in project datashare by ICIJ.

the class ElasticsearchSpewerTest method test_metadata.

@Test
public void test_metadata() throws Exception {
    Path path = get(Objects.requireNonNull(getClass().getResource("/docs/a/b/c/doc.txt")).getPath());
    TikaDocument document = new Extractor().extract(path);
    spewer.write(document);
    GetResponse documentFields = es.client.get(new GetRequest(TEST_INDEX, document.getId()), RequestOptions.DEFAULT);
    assertThat(documentFields.getSourceAsMap()).includes(entry("contentEncoding", "ISO-8859-1"), entry("contentType", "text/plain"), entry("nerTags", new ArrayList<>()), entry("contentLength", 45), entry("status", "INDEXED"), entry("path", path.toString()), entry("dirname", path.getParent().toString()));
}
Also used : Path(java.nio.file.Path) GetRequest(org.elasticsearch.action.get.GetRequest) ArrayList(java.util.ArrayList) TikaDocument(org.icij.extract.document.TikaDocument) Extractor(org.icij.extract.extractor.Extractor) GetResponse(org.elasticsearch.action.get.GetResponse) Test(org.junit.Test)

Example 19 with GetRequest

use of org.graylog.shaded.elasticsearch7.org.elasticsearch.action.get.GetRequest in project datashare by ICIJ.

the class ElasticsearchSpewerTest method test_long_content_length.

@Test
public void test_long_content_length() throws Exception {
    final TikaDocument document = new DocumentFactory().withIdentifier(new PathIdentifier()).create(get("t-file.txt"));
    final ParsingReader reader = new ParsingReader(new ByteArrayInputStream("test".getBytes()));
    document.setReader(reader);
    document.getMetadata().set("Content-Length", "7862117376");
    spewer.write(document);
    GetResponse documentFields = es.client.get(new GetRequest(TEST_INDEX, document.getId()), RequestOptions.DEFAULT);
    assertThat(documentFields.getSourceAsMap()).includes(entry("contentLength", 7862117376L));
}
Also used : DocumentFactory(org.icij.extract.document.DocumentFactory) ParsingReader(org.apache.tika.parser.ParsingReader) ByteArrayInputStream(java.io.ByteArrayInputStream) GetRequest(org.elasticsearch.action.get.GetRequest) PathIdentifier(org.icij.extract.document.PathIdentifier) TikaDocument(org.icij.extract.document.TikaDocument) GetResponse(org.elasticsearch.action.get.GetResponse) Test(org.junit.Test)

Example 20 with GetRequest

use of org.graylog.shaded.elasticsearch7.org.elasticsearch.action.get.GetRequest in project incubator-gobblin by apache.

the class ElasticsearchWriterIntegrationTest method testSingleRecordWrite.

@Test
public void testSingleRecordWrite() throws IOException {
    for (WriterVariant writerVariant : variants) {
        for (RecordTypeGenerator recordVariant : recordGenerators) {
            String indexName = "posts" + writerVariant.getName().toLowerCase();
            String indexType = recordVariant.getName();
            Config config = writerVariant.getConfigBuilder().setIndexName(indexName).setIndexType(indexType).setTypeMapperClassName(recordVariant.getTypeMapperClassName()).setHttpPort(_esTestServer.getHttpPort()).setTransportPort(_esTestServer.getTransportPort()).build();
            TestClient testClient = writerVariant.getTestClient(config);
            SequentialBasedBatchAccumulator<Object> batchAccumulator = new SequentialBasedBatchAccumulator<>(config);
            BufferedAsyncDataWriter bufferedAsyncDataWriter = new BufferedAsyncDataWriter(batchAccumulator, writerVariant.getBatchAsyncDataWriter(config));
            String id = TestUtils.generateRandomAlphaString(10);
            Object testRecord = recordVariant.getRecord(id, PayloadType.STRING);
            DataWriter writer = AsyncWriterManager.builder().failureAllowanceRatio(0.0).retriesEnabled(false).config(config).asyncDataWriter(bufferedAsyncDataWriter).build();
            try {
                testClient.recreateIndex(indexName);
                writer.write(testRecord);
                writer.commit();
            } finally {
                writer.close();
            }
            try {
                GetResponse response = testClient.get(new GetRequest(indexName, indexType, id));
                // TestNG prints the message only on failure, so it describes the failure.
                Assert.assertEquals(response.getId(), id, "Response id does not match request");
                Assert.assertTrue(response.isExists(), "Document not found");
            } catch (Exception e) {
                Assert.fail("Failed to get a response", e);
            } finally {
                testClient.close();
            }
        }
    }
}
Also used : Config(com.typesafe.config.Config) BufferedAsyncDataWriter(org.apache.gobblin.writer.BufferedAsyncDataWriter) SequentialBasedBatchAccumulator(org.apache.gobblin.writer.SequentialBasedBatchAccumulator) GetResponse(org.elasticsearch.action.get.GetResponse) IOException(java.io.IOException) GetRequest(org.elasticsearch.action.get.GetRequest) RecordTypeGenerator(org.apache.gobblin.test.RecordTypeGenerator) DataWriter(org.apache.gobblin.writer.DataWriter) BufferedAsyncDataWriter(org.apache.gobblin.writer.BufferedAsyncDataWriter) BatchAsyncDataWriter(org.apache.gobblin.writer.BatchAsyncDataWriter) Test(org.testng.annotations.Test)

Aggregations

GetRequest (org.elasticsearch.action.get.GetRequest) 45
GetResponse (org.elasticsearch.action.get.GetResponse) 29
Test (org.junit.Test) 14
IOException (java.io.IOException) 13
IndexRequest (org.elasticsearch.action.index.IndexRequest) 9
HashMap (java.util.HashMap) 7
TikaDocument (org.icij.extract.document.TikaDocument) 7
FetchSourceContext (org.elasticsearch.search.fetch.subphase.FetchSourceContext) 6
ArrayList (java.util.ArrayList) 5
DocumentFactory (org.icij.extract.document.DocumentFactory) 5
ByteArrayInputStream (java.io.ByteArrayInputStream) 4
ParsingReader (org.apache.tika.parser.ParsingReader) 4
ElasticsearchException (org.elasticsearch.ElasticsearchException) 4
BulkItemResponse (org.elasticsearch.action.bulk.BulkItemResponse) 4
DeleteRequest (org.elasticsearch.action.delete.DeleteRequest) 4
SearchRequest (org.elasticsearch.action.search.SearchRequest) 4
UpdateRequest (org.elasticsearch.action.update.UpdateRequest) 4
PathIdentifier (org.icij.extract.document.PathIdentifier) 4
BulkRequest (org.elasticsearch.action.bulk.BulkRequest) 3
MultiGetRequest (org.elasticsearch.action.get.MultiGetRequest) 3