Example 16 with NamedEntity

use of org.icij.datashare.text.NamedEntity in project datashare by ICIJ.

the class EmailPipeline method process.

@Override
public List<NamedEntity> process(Document doc, int contentLength, int contentOffset) {
    Matcher matcher = pattern.matcher(doc.getContent().substring(contentOffset, Math.min(contentLength + contentOffset, doc.getContentTextLength())));
    NamedEntitiesBuilder namedEntitiesBuilder = new NamedEntitiesBuilder(EMAIL, doc.getId(), doc.getLanguage()).withRoot(doc.getRootDocument());
    while (matcher.find()) {
        String email = matcher.group(0);
        int start = matcher.start();
        namedEntitiesBuilder.add(NamedEntity.Category.EMAIL, email, start + contentOffset);
    }
    if ("message/rfc822".equals(doc.getContentType())) {
        String metadataString = parsedEmailHeaders.stream().map(key -> doc.getMetadata().getOrDefault(key, "").toString()).collect(joining(" "));
        Matcher metaMatcher = pattern.matcher(metadataString);
        while (metaMatcher.find()) {
            namedEntitiesBuilder.add(NamedEntity.Category.EMAIL, metaMatcher.group(0), -1);
        }
    }
    return namedEntitiesBuilder.build();
}
Also used : NamedEntitiesBuilder(org.icij.datashare.text.NamedEntitiesBuilder) EMAIL(org.icij.datashare.text.nlp.Pipeline.Type.EMAIL) java.util(java.util) NamedEntity.allFrom(org.icij.datashare.text.NamedEntity.allFrom) AbstractPipeline(org.icij.datashare.text.nlp.AbstractPipeline) PropertiesProvider(org.icij.datashare.PropertiesProvider) Inject(com.google.inject.Inject) Document(org.icij.datashare.text.Document) Collectors.joining(java.util.stream.Collectors.joining) Matcher(java.util.regex.Matcher) Collections.unmodifiableSet(java.util.Collections.unmodifiableSet) Charset(java.nio.charset.Charset) Arrays.asList(java.util.Arrays.asList) Annotations(org.icij.datashare.text.nlp.Annotations) Pattern(java.util.regex.Pattern) Language(org.icij.datashare.text.Language) NlpStage(org.icij.datashare.text.nlp.NlpStage) NamedEntity(org.icij.datashare.text.NamedEntity)
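
The offset handling in the process method above can be tried in isolation. The following is a minimal sketch using only java.util.regex: it matches inside a window of the content and shifts each match position by the window offset, so the reported position refers to the full text. The class EmailOffsetSketch, the nested Mention type and the simple email pattern are hypothetical names chosen for illustration; they are not the pattern or the types used by datashare.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EmailOffsetSketch {
    // Deliberately simple pattern for illustration only (assumption, not datashare's pattern).
    private static final Pattern EMAIL =
            Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}");

    /** A matched string together with its position in the full content. */
    public static final class Mention {
        public final String text;
        public final int offset;

        Mention(String text, int offset) {
            this.text = text;
            this.offset = offset;
        }
    }

    /** Searches the window [contentOffset, contentOffset + contentLength) of the content. */
    public static List<Mention> find(String content, int contentLength, int contentOffset) {
        // Clamp the end of the window to the end of the content.
        int end = Math.min(contentOffset + contentLength, content.length());
        Matcher matcher = EMAIL.matcher(content.substring(contentOffset, end));
        List<Mention> mentions = new ArrayList<>();
        while (matcher.find()) {
            // matcher.start() is relative to the window, so shift it back.
            mentions.add(new Mention(matcher.group(), matcher.start() + contentOffset));
        }
        return mentions;
    }

    public static void main(String[] args) {
        String content = "hello world, write to jane@example.org please";
        for (Mention m : find(content, 40, 10)) {
            System.out.println(m.text + " at " + m.offset);
        }
    }
}
```

With this approach, an address that straddles the end of a window is not found whole, because each window is matched independently.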

Example 17 with NamedEntity

use of org.icij.datashare.text.NamedEntity in project datashare by ICIJ.

the class NlpConsumer method findNamedEntities.

void findNamedEntities(final String projectName, final String id, final String routing) throws InterruptedException {
    try {
        Document doc = indexer.get(projectName, id, routing);
        if (doc != null) {
            logger.info("extracting {} entities for document {}", nlpPipeline.getType(), doc.getId());
            if (nlpPipeline.initialize(doc.getLanguage())) {
                int nbEntities = 0;
                if (doc.getContent().length() < this.maxContentLengthChars) {
                    List<NamedEntity> namedEntities = nlpPipeline.process(doc);
                    indexer.bulkAdd(projectName, nlpPipeline.getType(), namedEntities, doc);
                    nbEntities = namedEntities.size();
                } else {
                    int nbChunks = doc.getContent().length() / this.maxContentLengthChars + 1;
                    logger.info("document is too large, extracting entities for {} document chunks", nbChunks);
                    for (int chunkIndex = 0; chunkIndex < nbChunks; chunkIndex++) {
                        List<NamedEntity> namedEntities = nlpPipeline.process(doc, maxContentLengthChars, chunkIndex * maxContentLengthChars);
                        if (chunkIndex < nbChunks - 1) {
                            indexer.bulkAdd(projectName, namedEntities);
                        } else {
                            indexer.bulkAdd(projectName, nlpPipeline.getType(), namedEntities, doc);
                        }
                        nbEntities += namedEntities.size();
                    }
                }
                logger.info("added {} named entities to document {}", nbEntities, doc.getId());
                nlpPipeline.terminate(doc.getLanguage());
            }
        } else {
            logger.warn("no document found in index with id {}", id);
        }
    } catch (IOException e) {
        logger.error("cannot extract entities of doc " + id, e);
    }
}
Also used : NamedEntity(org.icij.datashare.text.NamedEntity) IOException(java.io.IOException) Document(org.icij.datashare.text.Document)
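
The chunk arithmetic in findNamedEntities can be checked on its own. The sketch below reproduces the same computation (length / max + 1 windows, each starting at chunkIndex * max) and clamps the end of each window the way the EmailPipeline listing does. ChunkSketch and chunks are hypothetical names for illustration, not part of datashare.

```java
import java.util.ArrayList;
import java.util.List;

public class ChunkSketch {
    /** Splits [0, length) into windows of at most maxLen characters, as {offset, end} pairs. */
    public static List<int[]> chunks(int length, int maxLen) {
        // Same arithmetic as the listing above: integer division plus one.
        int nbChunks = length / maxLen + 1;
        List<int[]> windows = new ArrayList<>();
        for (int chunkIndex = 0; chunkIndex < nbChunks; chunkIndex++) {
            int offset = chunkIndex * maxLen;
            // Clamp the end so the last window stops at the end of the content.
            int end = Math.min(offset + maxLen, length);
            windows.add(new int[] {offset, end});
        }
        return windows;
    }

    public static void main(String[] args) {
        for (int[] w : chunks(25, 10)) {
            System.out.println("[" + w[0] + ", " + w[1] + ")");
        }
    }
}
```

When the length is an exact multiple of the window size (for example 20 and 10), this arithmetic yields a final empty window. In the listing, that last iteration is still the one that calls bulkAdd with the pipeline type and the parent document.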

Example 18 with NamedEntity

use of org.icij.datashare.text.NamedEntity in project datashare by ICIJ.

the class BenchDocument method testReadsAndWrites.

@Test
public void testReadsAndWrites() {
    int nbDocs = 100;
    int nbNes = 100;
    LinkedList<String> neIds = new LinkedList<>();
    logger.info("writing {} documents with {} named entities", nbDocs, nbNes);
    long beginTime = System.currentTimeMillis();
    for (int docIdx = 0; docIdx < nbDocs; docIdx++) {
        Document document = new Document(project("prj"), Paths.get("/foo/bar_" + docIdx + ".txt"), "This is a content with Gael Giraud " + docIdx, Language.FRENCH, Charset.defaultCharset(), "text/plain", new HashMap<String, Object>() {

            {
                put("key1", "value1");
                put("key2", "value2");
                put("key3", "value3");
                put("key4", "value4");
                put("key5", "value5");
                put("key6", "value6");
                put("key7", "value7");
                put("key8", "value8");
                put("key9", "value9");
                put("key10", "value10");
            }
        }, Document.Status.INDEXED, 345L);
        repository.create(document);
        List<NamedEntity> neList = new ArrayList<>();
        for (int neIdx = 0; neIdx < nbNes; neIdx++) {
            NamedEntity ne = NamedEntity.create(NamedEntity.Category.PERSON, "Gael Giraud" + neIdx, Arrays.asList(23L), document.getId(), "root", Pipeline.Type.CORENLP, Language.FRENCH);
            neIds.add(ne.getId());
            neList.add(ne);
        }
        repository.create(neList);
        if (docIdx % 10 == 0) {
            logger.info("wrote {} docs", docIdx);
        }
    }
    long endTime = System.currentTimeMillis();
    logger.info("done in {}ms", endTime - beginTime);
    logger.info("reading {} NamedEntities", neIds.size());
    beginTime = System.currentTimeMillis();
    for (String neId : neIds) {
        repository.getNamedEntity(neId);
    }
    endTime = System.currentTimeMillis();
    logger.info("done in {}ms", endTime - beginTime);
}
Also used : NamedEntity(org.icij.datashare.text.NamedEntity) Document(org.icij.datashare.text.Document) Test(org.junit.Test)

Example 19 with NamedEntity

use of org.icij.datashare.text.NamedEntity in project datashare by ICIJ.

the class ElasticsearchIndexer method bulkAdd.

@Override
public boolean bulkAdd(final String indexName, Pipeline.Type nerType, List<NamedEntity> namedEntities, Document parent) throws IOException {
    BulkRequest bulkRequest = new BulkRequest();
    String routing = ofNullable(parent.getRootDocument()).orElse(parent.getId());
    bulkRequest.add(new UpdateRequest(indexName, parent.getId()).doc(jsonBuilder().startObject().field("status", Document.Status.DONE).endObject()).routing(routing));
    bulkRequest.add(new UpdateRequest(indexName, parent.getId()).script(new Script(ScriptType.INLINE, "painless", "if (!ctx._source.nerTags.contains(params.nerTag)) ctx._source.nerTags.add(params.nerTag);", new HashMap<String, Object>() {

        {
            put("nerTag", nerType.toString());
        }
    })).routing(routing));
    for (Entity child : namedEntities) {
        bulkRequest.add(createIndexRequest(indexName, JsonObjectMapper.getType(child), child.getId(), getJson(child), parent.getId(), routing));
    }
    bulkRequest.setRefreshPolicy(esCfg.refreshPolicy);
    BulkResponse bulkResponse = client.bulk(bulkRequest, RequestOptions.DEFAULT);
    if (bulkResponse.hasFailures()) {
        for (BulkItemResponse resp : bulkResponse.getItems()) {
            if (resp.isFailed()) {
                LOGGER.error("bulk add failed : {}", resp.getFailureMessage());
            }
        }
        return false;
    }
    return true;
}
Also used : Script(org.elasticsearch.script.Script) NStringEntity(org.apache.http.nio.entity.NStringEntity) HttpEntity(org.apache.http.HttpEntity) NamedEntity(org.icij.datashare.text.NamedEntity) Entity(org.icij.datashare.Entity) UpdateRequest(org.elasticsearch.action.update.UpdateRequest) BulkRequest(org.elasticsearch.action.bulk.BulkRequest) BulkItemResponse(org.elasticsearch.action.bulk.BulkItemResponse) BulkResponse(org.elasticsearch.action.bulk.BulkResponse)

Example 20 with NamedEntity

use of org.icij.datashare.text.NamedEntity in project datashare by ICIJ.

the class NamedEntityResourceTest method test_get_named_entity_in_prod_mode.

@Test
public void test_get_named_entity_in_prod_mode() {
    configure(routes -> routes.add(new NamedEntityResource(indexer)).filter(new BasicAuthFilter("/", "icij", DatashareUser.singleUser("anne"))));
    NamedEntity toBeReturned = create(PERSON, "mention", asList(123L), "docId", "root", CORENLP, FRENCH);
    doReturn(toBeReturned).when(indexer).get("anne-datashare", "my_id", "root_parent");
    get("/api/anne-datashare/namedEntities/my_id?routing=root_parent").withAuthentication("anne", "notused").should().respond(200).haveType("application/json");
}
Also used : BasicAuthFilter(net.codestory.http.filters.basic.BasicAuthFilter) NamedEntity(org.icij.datashare.text.NamedEntity) AbstractProdWebServerTest(org.icij.datashare.web.testhelpers.AbstractProdWebServerTest) Test(org.junit.Test)

Aggregations

NamedEntity (org.icij.datashare.text.NamedEntity): 20
Test (org.junit.Test): 16
Document (org.icij.datashare.text.Document): 11
Arrays.asList (java.util.Arrays.asList): 2
HashMap (java.util.HashMap): 2
PropertiesProvider (org.icij.datashare.PropertiesProvider): 2
Language (org.icij.datashare.text.Language): 2
NamedEntitiesBuilder (org.icij.datashare.text.NamedEntitiesBuilder): 2
AbstractPipeline (org.icij.datashare.text.nlp.AbstractPipeline): 2
Annotations (org.icij.datashare.text.nlp.Annotations): 2
AbstractProdWebServerTest (org.icij.datashare.web.testhelpers.AbstractProdWebServerTest): 2
Inject (com.google.inject.Inject): 1
AbstractSequenceClassifier (edu.stanford.nlp.ie.AbstractSequenceClassifier): 1
CoreAnnotations (edu.stanford.nlp.ling.CoreAnnotations): 1
Triple (edu.stanford.nlp.util.Triple): 1
IOException (java.io.IOException): 1
Charset (java.nio.charset.Charset): 1
Path (java.nio.file.Path): 1
java.util (java.util): 1
Collection (java.util.Collection): 1