
Example 11 with CharSequenceReader

Use of org.apache.commons.io.input.CharSequenceReader in project stanbol by apache.

From the class KuromojiNlpEngine, the method computeEnhancements:

/**
 * Compute enhancements for supplied ContentItem. The results of the process
 * are expected to be stored in the metadata of the content item.
 * <p/>
 * The client (usually an {@link org.apache.stanbol.enhancer.servicesapi.EnhancementJobManager}) should take care of
 * persistent storage of the enhanced {@link org.apache.stanbol.enhancer.servicesapi.ContentItem}.
 * <p/>
 * This method creates a new POSContentPart using {@link org.apache.stanbol.enhancer.engines.pos.api.POSTaggerHelper#createContentPart} from a text/plain part and
 * stores it as a new part in the content item. The metadata is not changed.
 *
 * @throws org.apache.stanbol.enhancer.servicesapi.EngineException
 *          if the underlying process failed to work as
 *          expected
 */
@Override
public void computeEnhancements(ContentItem ci) throws EngineException {
    final AnalysedText at = initAnalysedText(this, analysedTextFactory, ci);
    String language = getLanguage(this, ci, false);
    if (!("ja".equals(language) || (language != null && language.startsWith("ja-")))) {
        throw new IllegalStateException("The detected language is NOT 'ja'! "
                + "As this is also checked within the #canEnhance(..) method this "
                + "indicates a bug in the used EnhancementJobManager implementation. "
                + "Please report this on dev@apache.stanbol.org or create a "
                + "JIRA issue about this.");
    }
    // start with the Tokenizer
    TokenStream tokenStream = tokenizerFactory.create(new CharSequenceReader(at.getText()));
    // build the analyzing chain by adding all TokenFilters
    for (TokenFilterFactory filterFactory : filterFactories) {
        tokenStream = filterFactory.create(tokenStream);
    }
    // Try to extract sentences based on POS tags ...
    int sentStartOffset = -1;
    // NER data
    List<NerData> nerList = new ArrayList<NerData>();
    // the next index where the NerData.context need to be set
    int nerSentIndex = 0;
    NerData ner = null;
    OffsetAttribute offset = null;
    try {
        // required with Solr 4
        tokenStream.reset();
        while (tokenStream.incrementToken()) {
            offset = tokenStream.addAttribute(OffsetAttribute.class);
            Token token = at.addToken(offset.startOffset(), offset.endOffset());
            // Get the POS attribute and init the PosTag
            PartOfSpeechAttribute posAttr = tokenStream.addAttribute(PartOfSpeechAttribute.class);
            PosTag posTag = POS_TAG_SET.getTag(posAttr.getPartOfSpeech());
            if (posTag == null) {
                posTag = adhocTags.get(posAttr.getPartOfSpeech());
                if (posTag == null) {
                    posTag = new PosTag(posAttr.getPartOfSpeech());
                    adhocTags.put(posAttr.getPartOfSpeech(), posTag);
                    log.warn(" ... missing PosTag mapping for {}", posAttr.getPartOfSpeech());
                }
            }
            // Sentence detection by POS tag
            if (sentStartOffset < 0) {
                // the last token was a sentence ending
                sentStartOffset = offset.startOffset();
            }
            if (posTag.hasPos(Pos.Point)) {
                Sentence sent = at.addSentence(sentStartOffset, offset.startOffset());
                // add the sentence as context to the NerData instances
                while (nerSentIndex < nerList.size()) {
                    nerList.get(nerSentIndex).context = sent.getSpan();
                    nerSentIndex++;
                }
                sentStartOffset = -1;
            }
            // POS
            token.addAnnotation(POS_ANNOTATION, Value.value(posTag));
            // NER
            NerTag nerTag = NER_TAG_SET.getTag(posAttr.getPartOfSpeech());
            if (ner != null && (nerTag == null || !ner.tag.getType().equals(nerTag.getType()))) {
                // write NER annotation
                Chunk chunk = at.addChunk(ner.start, ner.end);
                chunk.addAnnotation(NlpAnnotations.NER_ANNOTATION, Value.value(ner.tag));
                // NOTE that the fise:TextAnnotation are written later based on the nerList
                // clean up
                ner = null;
            }
            if (nerTag != null) {
                if (ner == null) {
                    ner = new NerData(nerTag, offset.startOffset());
                    nerList.add(ner);
                }
                ner.end = offset.endOffset();
            }
            BaseFormAttribute baseFormAttr = tokenStream.addAttribute(BaseFormAttribute.class);
            MorphoFeatures morpho = null;
            if (baseFormAttr != null && baseFormAttr.getBaseForm() != null) {
                morpho = new MorphoFeatures(baseFormAttr.getBaseForm());
                // and add the posTag
                morpho.addPos(posTag);
            }
            InflectionAttribute inflectionAttr = tokenStream.addAttribute(InflectionAttribute.class);
            // the inflection form and type are read but their values are not used here
            inflectionAttr.getInflectionForm();
            inflectionAttr.getInflectionType();
            if (morpho != null) {
                // if present add the morpho
                token.addAnnotation(MORPHO_ANNOTATION, Value.value(morpho));
            }
        }
        // we still need to write the last sentence
        Sentence lastSent = null;
        if (offset != null && sentStartOffset >= 0 && offset.endOffset() > sentStartOffset) {
            lastSent = at.addSentence(sentStartOffset, offset.endOffset());
        }
        // and set the context of remaining named entities
        while (nerSentIndex < nerList.size()) {
            if (lastSent != null) {
                nerList.get(nerSentIndex).context = lastSent.getSpan();
            } else {
                // no sentence detected
                nerList.get(nerSentIndex).context = at.getSpan();
            }
            nerSentIndex++;
        }
    } catch (IOException e) {
        throw new EngineException(this, ci, "Exception while reading from " + "AnalyzedText contentpart", e);
    } finally {
        try {
            tokenStream.close();
        } catch (IOException e) {
            /* ignore */
        }
    }
    // finally write the NER annotations to the metadata of the ContentItem
    final Graph metadata = ci.getMetadata();
    ci.getLock().writeLock().lock();
    try {
        Language lang = new Language("ja");
        for (NerData nerData : nerList) {
            IRI ta = EnhancementEngineHelper.createTextEnhancement(ci, this);
            metadata.add(new TripleImpl(ta, ENHANCER_SELECTED_TEXT, new PlainLiteralImpl(at.getSpan().substring(nerData.start, nerData.end), lang)));
            metadata.add(new TripleImpl(ta, DC_TYPE, nerData.tag.getType()));
            metadata.add(new TripleImpl(ta, ENHANCER_START, lf.createTypedLiteral(nerData.start)));
            metadata.add(new TripleImpl(ta, ENHANCER_END, lf.createTypedLiteral(nerData.end)));
            metadata.add(new TripleImpl(ta, ENHANCER_SELECTION_CONTEXT, new PlainLiteralImpl(nerData.context, lang)));
        }
    } finally {
        ci.getLock().writeLock().unlock();
    }
}
Also used:
- NerTag (org.apache.stanbol.enhancer.nlp.ner.NerTag)
- IRI (org.apache.clerezza.commons.rdf.IRI)
- TokenStream (org.apache.lucene.analysis.TokenStream)
- ArrayList (java.util.ArrayList)
- EngineException (org.apache.stanbol.enhancer.servicesapi.EngineException)
- Token (org.apache.stanbol.enhancer.nlp.model.Token)
- NlpEngineHelper.initAnalysedText (org.apache.stanbol.enhancer.nlp.utils.NlpEngineHelper.initAnalysedText)
- AnalysedText (org.apache.stanbol.enhancer.nlp.model.AnalysedText)
- CharSequenceReader (org.apache.commons.io.input.CharSequenceReader)
- PosTag (org.apache.stanbol.enhancer.nlp.pos.PosTag)
- Language (org.apache.clerezza.commons.rdf.Language)
- NlpEngineHelper.getLanguage (org.apache.stanbol.enhancer.nlp.utils.NlpEngineHelper.getLanguage)
- BaseFormAttribute (org.apache.lucene.analysis.ja.tokenattributes.BaseFormAttribute)
- TripleImpl (org.apache.clerezza.commons.rdf.impl.utils.TripleImpl)
- MorphoFeatures (org.apache.stanbol.enhancer.nlp.morpho.MorphoFeatures)
- Sentence (org.apache.stanbol.enhancer.nlp.model.Sentence)
- InflectionAttribute (org.apache.lucene.analysis.ja.tokenattributes.InflectionAttribute)
- PlainLiteralImpl (org.apache.clerezza.commons.rdf.impl.utils.PlainLiteralImpl)
- PartOfSpeechAttribute (org.apache.lucene.analysis.ja.tokenattributes.PartOfSpeechAttribute)
- IOException (java.io.IOException)
- Chunk (org.apache.stanbol.enhancer.nlp.model.Chunk)
- TokenFilterFactory (org.apache.lucene.analysis.util.TokenFilterFactory)
- Graph (org.apache.clerezza.commons.rdf.Graph)
- OffsetAttribute (org.apache.lucene.analysis.tokenattributes.OffsetAttribute)
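The sentence-detection bookkeeping in the loop above (track a pending sentStartOffset, close a sentence whenever a token tagged Pos.Point is seen, flush a trailing sentence at the end) can be sketched without Lucene or Stanbol. The class and method names below are my own, not from the engine; it walks characters instead of tokens and treats a single terminator character such as the Japanese full stop as the sentence boundary.

```java
import java.util.ArrayList;
import java.util.List;

public class SentenceOffsets {

    /**
     * Returns [start, end) offsets of sentences in text, mirroring the
     * sentStartOffset bookkeeping of the engine: -1 means "no open sentence",
     * a terminator closes the current sentence, and a trailing sentence
     * without a terminator is flushed after the loop.
     */
    public static List<int[]> split(String text, char terminator) {
        List<int[]> sentences = new ArrayList<>();
        int sentStart = -1;
        for (int i = 0; i < text.length(); i++) {
            if (sentStart < 0) {
                sentStart = i; // the previous char ended a sentence; open a new one
            }
            if (text.charAt(i) == terminator) {
                sentences.add(new int[] { sentStart, i + 1 });
                sentStart = -1;
            }
        }
        if (sentStart >= 0) { // flush a trailing sentence without a terminator
            sentences.add(new int[] { sentStart, text.length() });
        }
        return sentences;
    }
}
```

The engine differs in one detail: it closes each sentence at the start offset of the punctuation token, while this sketch includes the terminator in the span.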

Example 12 with CharSequenceReader

Use of org.apache.commons.io.input.CharSequenceReader in project tutorials by eugenp.

From the class JavaReaderToXUnitTest, the method givenUsingCommonsIO_whenWritingReaderContentsToFile_thenCorrect:

@Test
public void givenUsingCommonsIO_whenWritingReaderContentsToFile_thenCorrect() throws IOException {
    final Reader initialReader = new CharSequenceReader("CharSequenceReader extends Reader");
    final File targetFile = new File("src/test/resources/targetFile.txt");
    FileUtils.touch(targetFile);
    final byte[] buffer = IOUtils.toByteArray(initialReader);
    FileUtils.writeByteArrayToFile(targetFile, buffer);
    initialReader.close();
}
Also used:
- CharSequenceReader (org.apache.commons.io.input.CharSequenceReader)
- Reader (java.io.Reader)
- StringReader (java.io.StringReader)
- File (java.io.File)
- Test (org.junit.Test)
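For comparison, the Reader-to-file copy above can be done with the JDK alone, streaming through a char buffer instead of materializing a byte[]. This is a hedged stdlib-only sketch; the class and the copy helper name are my own, not part of the tutorial or of Commons IO.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.Writer;

public class ReaderCopy {

    /** Streams the reader's chars into the writer through a fixed-size buffer. */
    public static long copy(Reader in, Writer out) throws IOException {
        char[] buf = new char[4096];
        long total = 0;
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
            total += n;
        }
        return total;
    }
}
```

With Files.newBufferedWriter(path) as the Writer this writes straight to a file and never holds the whole content in memory, unlike the IOUtils.toByteArray variant above.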

Example 13 with CharSequenceReader

Use of org.apache.commons.io.input.CharSequenceReader in project tutorials by eugenp.

From the class JavaXToReaderUnitTest, the method givenUsingCommonsIO_whenConvertingStringIntoReader_thenCorrect:

@Test
public void givenUsingCommonsIO_whenConvertingStringIntoReader_thenCorrect() throws IOException {
    final String initialString = "With Apache Commons IO";
    final Reader targetReader = new CharSequenceReader(initialString);
    targetReader.close();
}
Also used:
- CharSequenceReader (org.apache.commons.io.input.CharSequenceReader)
- Reader (java.io.Reader)
- InputStreamReader (java.io.InputStreamReader)
- StringReader (java.io.StringReader)
- FileReader (java.io.FileReader)
- Test (org.junit.Test)
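To see what CharSequenceReader adds over the JDK's StringReader, here is a hypothetical minimal re-implementation: a Reader over any CharSequence (String, StringBuilder, CharBuffer), so no intermediate String copy is needed. The real Commons IO class additionally supports mark/reset and start/end ranges; this sketch covers only read and is not the actual implementation.

```java
import java.io.IOException;
import java.io.Reader;

/** Minimal Reader over any CharSequence; a sketch of the idea behind CharSequenceReader. */
public class MiniCharSequenceReader extends Reader {

    private final CharSequence chars;
    private int pos = 0;

    public MiniCharSequenceReader(CharSequence chars) {
        this.chars = chars;
    }

    @Override
    public int read(char[] cbuf, int off, int len) throws IOException {
        if (pos >= chars.length()) {
            return -1; // end of the sequence
        }
        int n = Math.min(len, chars.length() - pos);
        for (int i = 0; i < n; i++) {
            cbuf[off + i] = chars.charAt(pos + i);
        }
        pos += n;
        return n;
    }

    @Override
    public void close() {
        // nothing to release; the sequence stays usable
    }
}
```

Because it accepts a CharSequence, a StringBuilder under construction can be read directly, which is exactly how the Kuromoji engine in Example 11 feeds AnalysedText content into a Lucene tokenizer.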

Example 14 with CharSequenceReader

Use of org.apache.commons.io.input.CharSequenceReader in project tutorials by eugenp.

From the class JavaXToReaderUnitTest, the method givenUsingCommonsIO_whenConvertingInputStreamIntoReader_thenCorrect:

@Test
public void givenUsingCommonsIO_whenConvertingInputStreamIntoReader_thenCorrect() throws IOException {
    final InputStream initialStream = IOUtils.toInputStream("With Commons IO", StandardCharsets.UTF_8);
    final byte[] buffer = IOUtils.toByteArray(initialStream);
    // decode with an explicit charset; new String(buffer) alone uses the platform default
    final Reader targetReader = new CharSequenceReader(new String(buffer, StandardCharsets.UTF_8));
    targetReader.close();
}
Also used:
- CharSequenceReader (org.apache.commons.io.input.CharSequenceReader)
- ByteArrayInputStream (java.io.ByteArrayInputStream)
- InputStream (java.io.InputStream)
- Reader (java.io.Reader)
- InputStreamReader (java.io.InputStreamReader)
- StringReader (java.io.StringReader)
- FileReader (java.io.FileReader)
- Test (org.junit.Test)
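The byte[] detour in Example 14 is unnecessary when only a Reader is wanted: the JDK's InputStreamReader decodes a stream directly, with an explicit charset. A small stdlib-only sketch; the class and the readAll helper name are my own, introduced for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.stream.Collectors;

public class StreamToReader {

    /** Wraps the stream in a decoding Reader and drains it line by line. */
    public static String readAll(InputStream in) {
        try (BufferedReader reader =
                new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            return reader.lines().collect(Collectors.joining("\n"));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

CharSequenceReader still earns its place when the source is already an in-memory CharSequence; for a true byte stream, InputStreamReader avoids buffering the whole payload first.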

Aggregations

- CharSequenceReader (org.apache.commons.io.input.CharSequenceReader): 14
- Test (org.junit.Test): 7
- Reader (java.io.Reader): 6
- StringReader (java.io.StringReader): 5
- FileReader (java.io.FileReader): 4
- IOException (java.io.IOException): 4
- InputStreamReader (java.io.InputStreamReader): 4
- TokenStream (org.apache.lucene.analysis.TokenStream): 4
- OffsetAttribute (org.apache.lucene.analysis.tokenattributes.OffsetAttribute): 3
- AnalysedText (org.apache.stanbol.enhancer.nlp.model.AnalysedText): 3
- Sentence (org.apache.stanbol.enhancer.nlp.model.Sentence): 3
- NlpEngineHelper.initAnalysedText (org.apache.stanbol.enhancer.nlp.utils.NlpEngineHelper.initAnalysedText): 3
- File (java.io.File): 2
- StringWriter (java.io.StringWriter): 2
- CharBuffer (java.nio.CharBuffer): 2
- ArrayList (java.util.ArrayList): 2
- Collectors.joining (java.util.stream.Collectors.joining): 2
- IntStream (java.util.stream.IntStream): 2
- IOUtils (org.apache.commons.io.IOUtils): 2
- SentenceTokenizer (org.apache.lucene.analysis.cn.smart.SentenceTokenizer): 2