Example 1 with TokenFilter

Use of com.joliciel.talismane.tokeniser.filters.TokenFilter in project talismane by joliciel-informatique.

Class Tokeniser, method tokeniseWithDecisions.

/**
 * Similar to {@link #tokeniseWithDecisions(String)}, but the text to be
 * tokenised is contained within a Sentence object.
 *
 * @param sentence
 *          the sentence to tokenise
 * @param labels
 *          the labels to add to any annotations added.
 * @throws IOException
 */
public List<TokenisedAtomicTokenSequence> tokeniseWithDecisions(Sentence sentence, String... labels) throws TalismaneException, IOException {
    // Initially, separate the sentence into tokens using the separators
    // provided
    TokenSequence tokenSequence = new TokenSequence(sentence, this.sessionId);
    tokenSequence.findDefaultTokens();
    List<TokenisedAtomicTokenSequence> sequences = this.tokeniseInternal(tokenSequence, sentence);
    LOG.debug("####Final token sequences:");
    int j = 1;
    for (TokenisedAtomicTokenSequence sequence : sequences) {
        TokenSequence newTokenSequence = sequence.inferTokenSequence();
        // run every configured filter over the inferred token sequence
        for (TokenFilter filter : filters) {
            filter.apply(newTokenSequence);
        }
        if (j == 1) {
            // add annotations for the very first token sequence
            List<Annotation<TokenBoundary>> tokenBoundaries = new ArrayList<>();
            for (Token token : newTokenSequence) {
                TokenBoundary boundary = new TokenBoundary(token.getText(), token.getAnalyisText(), token.getAttributes());
                Annotation<TokenBoundary> tokenBoundary = new Annotation<>(token.getStartIndex(), token.getEndIndex(), boundary, labels);
                tokenBoundaries.add(tokenBoundary);
            }
            sentence.addAnnotations(tokenBoundaries);
        }
        if (LOG.isDebugEnabled()) {
            LOG.debug("Token sequence " + j);
            LOG.debug("Atomic sequence: " + sequence);
            LOG.debug("Resulting sequence: " + newTokenSequence);
        }
        j++;
    }
    return sequences;
}
Also used: ArrayList (java.util.ArrayList), Annotation (com.joliciel.talismane.Annotation), TokenFilter (com.joliciel.talismane.tokeniser.filters.TokenFilter)

Example 2 with TokenFilter

Use of com.joliciel.talismane.tokeniser.filters.TokenFilter in project talismane by joliciel-informatique.

Class TokenRegexBasedCorpusReader, method processSentence.

@Override
protected void processSentence(Sentence sentence, List<CorpusLine> corpusLines) throws TalismaneException, IOException {
    try {
        super.processSentence(sentence, corpusLines);
        tokenSequence = new PretokenisedSequence(sentence, sessionId);
        for (CorpusLine corpusLine : corpusLines) {
            this.convertToToken(tokenSequence, corpusLine);
        }
        // run every configured filter over the pre-tokenised sequence
        for (TokenFilter filter : filters) {
            filter.apply(tokenSequence);
        }
        tokenSequence.cleanSlate();
    } catch (TalismaneException e) {
        this.clearSentence();
        throw e;
    }
}
Also used: TalismaneException (com.joliciel.talismane.TalismaneException), CorpusLine (com.joliciel.talismane.corpus.CorpusLine), TokenFilter (com.joliciel.talismane.tokeniser.filters.TokenFilter)
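
Both examples share one pattern: build a token sequence, then pass it through each filter in order, letting every filter modify the sequence in place. The following is a minimal, self-contained sketch of that pattern. It does not use the Talismane API: SimpleTokenFilter, LowercaseFilter, DropEmptyFilter and applyAll are hypothetical stand-ins invented for illustration, and a plain List<String> stands in for TokenSequence.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class FilterChainSketch {
    // Hypothetical stand-in for a token filter: modifies the token list in place.
    interface SimpleTokenFilter {
        void apply(List<String> tokens);
    }

    // Illustrative filter: lower-cases every token.
    static class LowercaseFilter implements SimpleTokenFilter {
        @Override
        public void apply(List<String> tokens) {
            tokens.replaceAll(String::toLowerCase);
        }
    }

    // Illustrative filter: removes empty tokens.
    static class DropEmptyFilter implements SimpleTokenFilter {
        @Override
        public void apply(List<String> tokens) {
            tokens.removeIf(String::isEmpty);
        }
    }

    // Copies the input, then applies each filter in list order to the copy.
    static List<String> applyAll(List<String> tokens, List<SimpleTokenFilter> filters) {
        List<String> result = new ArrayList<>(tokens);
        for (SimpleTokenFilter filter : filters) {
            filter.apply(result);
        }
        return result;
    }

    public static void main(String[] args) {
        List<SimpleTokenFilter> filters =
                Arrays.<SimpleTokenFilter>asList(new LowercaseFilter(), new DropEmptyFilter());
        List<String> tokens = Arrays.asList("Hello", "", "World");
        System.out.println(applyAll(tokens, filters)); // prints [hello, world]
    }
}
```

Because each filter sees the output of the previous one, the order of the filter list matters; the sketch copies the input first so the caller's list is left untouched, which is a choice of this sketch rather than something the examples above do.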

Aggregations

TokenFilter (com.joliciel.talismane.tokeniser.filters.TokenFilter): 2
Annotation (com.joliciel.talismane.Annotation): 1
TalismaneException (com.joliciel.talismane.TalismaneException): 1
CorpusLine (com.joliciel.talismane.corpus.CorpusLine): 1
ArrayList (java.util.ArrayList): 1