Example 1 with EMAIL

Use of org.icij.datashare.text.nlp.Pipeline.Type.EMAIL in the datashare project by ICIJ.

From the class EmailPipeline, method process.

@Override
public List<NamedEntity> process(Document doc, int contentLength, int contentOffset) {
    // Match only inside the requested window, clamped to the end of the document text.
    Matcher matcher = pattern.matcher(doc.getContent().substring(contentOffset, Math.min(contentLength + contentOffset, doc.getContentTextLength())));
    NamedEntitiesBuilder namedEntitiesBuilder = new NamedEntitiesBuilder(EMAIL, doc.getId(), doc.getLanguage()).withRoot(doc.getRootDocument());
    while (matcher.find()) {
        String email = matcher.group(0);
        int start = matcher.start();
        // matcher.start() is relative to the substring, so shift it back to a document offset.
        namedEntitiesBuilder.add(NamedEntity.Category.EMAIL, email, start + contentOffset);
    }
    // For email documents, also scan the parsed header metadata for addresses.
    if ("message/rfc822".equals(doc.getContentType())) {
        String metadataString = parsedEmailHeaders.stream().map(key -> doc.getMetadata().getOrDefault(key, "").toString()).collect(joining(" "));
        Matcher metaMatcher = pattern.matcher(metadataString);
        while (metaMatcher.find()) {
            // Header matches have no position in the content, hence the offset of -1.
            namedEntitiesBuilder.add(NamedEntity.Category.EMAIL, metaMatcher.group(0), -1);
        }
    }
    return namedEntitiesBuilder.build();
}
Also used: NamedEntitiesBuilder(org.icij.datashare.text.NamedEntitiesBuilder) EMAIL(org.icij.datashare.text.nlp.Pipeline.Type.EMAIL) java.util(java.util) NamedEntity.allFrom(org.icij.datashare.text.NamedEntity.allFrom) AbstractPipeline(org.icij.datashare.text.nlp.AbstractPipeline) PropertiesProvider(org.icij.datashare.PropertiesProvider) Inject(com.google.inject.Inject) Document(org.icij.datashare.text.Document) Collectors.joining(java.util.stream.Collectors.joining) Matcher(java.util.regex.Matcher) Collections.unmodifiableSet(java.util.Collections.unmodifiableSet) Charset(java.nio.charset.Charset) Arrays.asList(java.util.Arrays.asList) Annotations(org.icij.datashare.text.nlp.Annotations) Pattern(java.util.regex.Pattern) Language(org.icij.datashare.text.Language) NlpStage(org.icij.datashare.text.nlp.NlpStage) NamedEntity(org.icij.datashare.text.NamedEntity)
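
The windowed matching and offset arithmetic in process can be tried in isolation with a minimal sketch like the one below. It uses only java.util.regex. The class EmailWindowSketch, the Hit holder, and the simplified EMAIL pattern are hypothetical names and assumptions made for illustration; the actual pattern used by EmailPipeline is not shown in the snippet above and may differ.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class EmailWindowSketch {
    // Simplified pattern, an assumption for this sketch; not the pattern used by EmailPipeline.
    static final Pattern EMAIL =
            Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}");

    /** A matched address together with its offset in the full text. */
    static final class Hit {
        final String email;
        final int offset;

        Hit(String email, int offset) {
            this.email = email;
            this.offset = offset;
        }
    }

    /** Finds addresses inside [contentOffset, contentOffset + contentLength), clamped to the text length. */
    static List<Hit> find(String content, int contentLength, int contentOffset) {
        int end = Math.min(contentOffset + contentLength, content.length());
        Matcher matcher = EMAIL.matcher(content.substring(contentOffset, end));
        List<Hit> hits = new ArrayList<>();
        while (matcher.find()) {
            // start() is relative to the substring, so add the window offset back.
            hits.add(new Hit(matcher.group(), matcher.start() + contentOffset));
        }
        return hits;
    }

    public static void main(String[] args) {
        String text = "Contact: alice@example.org or bob@example.com";
        // The window covers only the first address; the second lies outside it.
        for (Hit hit : find(text, 20, 9)) {
            System.out.println(hit.offset + " " + hit.email);
        }
    }
}
```

Adding contentOffset back to matcher.start() is what keeps reported positions valid for the whole document when the content is processed in chunks. An address that straddles a window boundary would be missed or truncated by this approach.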
