Example 41 with TalismaneException

use of com.joliciel.talismane.TalismaneException in project talismane by joliciel-informatique.

the class TokenSequenceProcessor method getProcessors.

/**
 * Collect the processors specified in the configuration key
 * talismane.core.[sessionId].tokeniser.output.processors.<br>
 * <br>
 * Each processor must implement this interface and must have a constructor
 * matching one of the following signatures:<br>
 * - ( {@link File} outputDir, {@link String} sessionId)<br>
 * - ( {@link String} sessionId)<br>
 * <br>
 * Optionally, it can have a constructor with the following signature:<br>
 * - ( {@link Writer} writer, {@link String} sessionId)<br>
 * If a writer is provided here, then the first processor with the above
 * constructor will be given the writer.
 *
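 * @param writer
 *          if specified, will be used for the first processor in the list
 *          with a writer in the constructor
 * @param outDir
 *          directory in which to write the various outputs
 * @param sessionId
 *          the session id, used to locate the configuration and passed to
 *          each processor's constructor
 * @return the instantiated processors, in the order they are configured
 * @throws IOException
 * @throws ReflectiveOperationException
 *           if a configured class cannot be loaded or instantiated
 * @throws TalismaneException
 *           if a processor does not implement this interface, or if no
 *           constructor is found with the correct signature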
 */
public static List<TokenSequenceProcessor> getProcessors(Writer writer, File outDir, String sessionId) throws IOException, ReflectiveOperationException, ClassNotFoundException, TalismaneException {
    Config config = ConfigFactory.load();
    Config myConfig = config.getConfig("talismane.core." + sessionId + ".tokeniser");
    List<TokenSequenceProcessor> processors = new ArrayList<>();
    List<String> classes = myConfig.getStringList("output.processors");
    if (outDir != null)
        outDir.mkdirs();
    Writer firstProcessorWriter = writer;
    for (String className : classes) {
        @SuppressWarnings("rawtypes") Class untypedClass = Class.forName(className);
        if (!TokenSequenceProcessor.class.isAssignableFrom(untypedClass))
            throw new TalismaneException("Class " + className + " does not implement interface " + TokenSequenceProcessor.class.getSimpleName());
        @SuppressWarnings("unchecked") Class<? extends TokenSequenceProcessor> clazz = untypedClass;
        Constructor<? extends TokenSequenceProcessor> cons = null;
        TokenSequenceProcessor processor = null;
        if (firstProcessorWriter != null) {
            try {
                cons = clazz.getConstructor(Writer.class, String.class);
            } catch (NoSuchMethodException e) {
                // no (Writer, String) constructor: try the next signature
            }
            if (cons != null) {
                processor = cons.newInstance(firstProcessorWriter, sessionId);
                firstProcessorWriter = null;
            }
        }
        if (cons == null) {
            try {
                cons = clazz.getConstructor(File.class, String.class);
            } catch (NoSuchMethodException e) {
                // no (File, String) constructor: try the next signature
            }
            if (cons != null) {
                processor = cons.newInstance(outDir, sessionId);
            }
        }
        if (cons == null) {
            try {
                cons = clazz.getConstructor(String.class);
            } catch (NoSuchMethodException e) {
                // no (String) constructor: handled by the null check below
            }
            if (cons != null) {
                processor = cons.newInstance(sessionId);
            } else {
                throw new TalismaneException("No constructor found with correct signature for: " + className);
            }
        }
        processors.add(processor);
    }
    return processors;
}
Also used : TalismaneException(com.joliciel.talismane.TalismaneException) Config(com.typesafe.config.Config) ArrayList(java.util.ArrayList) File(java.io.File) Writer(java.io.Writer)
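
The constructor lookup described in the Javadoc above can be sketched in isolation. This is a minimal illustration only, not the Talismane implementation: Processor, SessionOnlyProcessor, FileProcessor and ConstructorFallback are hypothetical names, the writer case is left out, and failures are reported with unchecked JDK exceptions instead of TalismaneException.

```java
import java.io.File;
import java.lang.reflect.Constructor;

// Hypothetical stand-in for TokenSequenceProcessor, for illustration only.
interface Processor {
    String describe();
}

// Hypothetical processor exposing only the (String sessionId) constructor.
class SessionOnlyProcessor implements Processor {
    private final String sessionId;

    public SessionOnlyProcessor(String sessionId) {
        this.sessionId = sessionId;
    }

    @Override
    public String describe() {
        return "session:" + sessionId;
    }
}

// Hypothetical processor exposing the (File outputDir, String sessionId) constructor.
class FileProcessor implements Processor {
    private final File outDir;
    private final String sessionId;

    public FileProcessor(File outDir, String sessionId) {
        this.outDir = outDir;
        this.sessionId = sessionId;
    }

    @Override
    public String describe() {
        return "file:" + outDir.getName() + ":" + sessionId;
    }
}

final class ConstructorFallback {
    // Returns the public constructor with the given parameter types, or null if there is none.
    static Constructor<? extends Processor> find(Class<? extends Processor> clazz, Class<?>... paramTypes) {
        try {
            return clazz.getConstructor(paramTypes);
        } catch (NoSuchMethodException e) {
            return null;
        }
    }

    // Tries (File, String) first, then (String), and fails if neither constructor exists.
    static Processor instantiate(String className, File outDir, String sessionId) {
        try {
            Class<?> raw = Class.forName(className);
            if (!Processor.class.isAssignableFrom(raw)) {
                throw new IllegalArgumentException("Class " + className + " does not implement Processor");
            }
            Class<? extends Processor> clazz = raw.asSubclass(Processor.class);
            Constructor<? extends Processor> cons = find(clazz, File.class, String.class);
            if (cons != null) {
                return cons.newInstance(outDir, sessionId);
            }
            cons = find(clazz, String.class);
            if (cons != null) {
                return cons.newInstance(sessionId);
            }
            throw new IllegalArgumentException("No constructor found with correct signature for: " + className);
        } catch (ReflectiveOperationException e) {
            // Wrapped here only to keep the sketch free of checked exceptions.
            throw new IllegalStateException("Could not instantiate " + className, e);
        }
    }

    public static void main(String[] args) {
        File outDir = new File("out");
        System.out.println(instantiate(SessionOnlyProcessor.class.getName(), outDir, "demo").describe());
        System.out.println(instantiate(FileProcessor.class.getName(), outDir, "demo").describe());
    }
}
```

Using asSubclass avoids the raw Class variable and the unchecked cast, at the cost of a second type check after isAssignableFrom.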

Example 42 with TalismaneException

use of com.joliciel.talismane.TalismaneException in project talismane by joliciel-informatique.

the class PatternEventStream method next.

@Override
public ClassificationEvent next() throws TalismaneException, IOException {
    ClassificationEvent event = null;
    if (this.hasNext()) {
        TokenPatternMatch tokenPatternMatch = currentPatternMatches.get(currentIndex);
        TokeniserOutcome outcome = currentOutcomes.get(currentIndex);
        String classification = outcome.name();
        LOG.debug("next event, pattern match: " + tokenPatternMatch.toString() + ", outcome:" + classification);
        List<FeatureResult<?>> tokenFeatureResults = new ArrayList<FeatureResult<?>>();
        for (TokenPatternMatchFeature<?> feature : tokenPatternMatchFeatures) {
            RuntimeEnvironment env = new RuntimeEnvironment();
            FeatureResult<?> featureResult = feature.check(tokenPatternMatch, env);
            if (featureResult != null) {
                tokenFeatureResults.add(featureResult);
            }
        }
        if (LOG.isTraceEnabled()) {
            SortedSet<String> featureResultSet = tokenFeatureResults.stream().map(f -> f.toString()).collect(Collectors.toCollection(() -> new TreeSet<String>()));
            for (String featureResultString : featureResultSet) {
                LOG.trace(featureResultString);
            }
        }
        event = new ClassificationEvent(tokenFeatureResults, classification);
        currentIndex++;
        if (currentIndex == currentPatternMatches.size()) {
            currentPatternMatches = null;
        }
    }
    return event;
}
Also used : TokeniserAnnotatedCorpusReader(com.joliciel.talismane.tokeniser.TokeniserAnnotatedCorpusReader) SortedSet(java.util.SortedSet) LoggerFactory(org.slf4j.LoggerFactory) TokenSequence(com.joliciel.talismane.tokeniser.TokenSequence) TaggedToken(com.joliciel.talismane.tokeniser.TaggedToken) TreeSet(java.util.TreeSet) TalismaneException(com.joliciel.talismane.TalismaneException) TalismaneSession(com.joliciel.talismane.TalismaneSession) ArrayList(java.util.ArrayList) LinkedHashMap(java.util.LinkedHashMap) RuntimeEnvironment(com.joliciel.talismane.machineLearning.features.RuntimeEnvironment) ClassificationEventStream(com.joliciel.talismane.machineLearning.ClassificationEventStream) TokenPatternMatchFeature(com.joliciel.talismane.tokeniser.features.TokenPatternMatchFeature) FeatureResult(com.joliciel.talismane.machineLearning.features.FeatureResult) Map(java.util.Map) Logger(org.slf4j.Logger) Set(java.util.Set) IOException(java.io.IOException) TokeniserOutcome(com.joliciel.talismane.tokeniser.TokeniserOutcome) ClassificationEvent(com.joliciel.talismane.machineLearning.ClassificationEvent) Decision(com.joliciel.talismane.machineLearning.Decision) Collectors(java.util.stream.Collectors) List(java.util.List) Token(com.joliciel.talismane.tokeniser.Token) Sentence(com.joliciel.talismane.rawText.Sentence)

Example 43 with TalismaneException

use of com.joliciel.talismane.TalismaneException in project talismane by joliciel-informatique.

the class TokenPattern method parsePattern.

/**
 * Break the regexp up into chunks, where each chunk will match one token.
 *
 * @throws TalismaneException
 *           if separators and non-separators are mixed in the same grouping
 */
List<Pattern> parsePattern(String regexp) throws TalismaneException {
    boolean inLiteral = false;
    boolean inException = false;
    boolean inGrouping = false;
    boolean groupingHasLetters = false;
    int groupingStart = 0;
    List<Pattern> parsedPattern = new ArrayList<Pattern>();
    int currentStart = 0;
    int currentEnd = 0;
    for (int i = 0; i < regexp.length(); i++) {
        char c = regexp.charAt(i);
        if (!inLiteral && c == '\\') {
            inLiteral = true;
        } else if (inLiteral) {
            if (c == 'd' || c == 'D' || c == 'z') {
                // digit or non-digit = not a separator
                // \z is included here because we're only expecting it
                // inside negative lookahead
                currentEnd = i + 1;
            } else if (inGrouping) {
                currentEnd = i + 1;
            } else {
                // always a separator
                // either an actual separator, or the patterns \p (all
                // separators) or \s (whitespace)
                // or \b (whitespace/sentence start/sentence end)
                this.addPattern(regexp, currentStart, currentEnd, parsedPattern, inException);
                this.addPattern(regexp, i - 1, i + 1, parsedPattern, inException);
                currentStart = i + 1;
                currentEnd = i + 1;
            }
            inLiteral = false;
        } else if (c == '[') {
            inGrouping = true;
            groupingHasLetters = false;
            groupingStart = i;
            currentEnd = i + 1;
        } else if (c == ']') {
            if (!groupingHasLetters) {
                if (groupingStart > 0) {
                    this.addPattern(regexp, currentStart, groupingStart, parsedPattern, inException);
                }
                this.addPattern(regexp, groupingStart, i + 1, parsedPattern, inException);
                currentStart = i + 1;
                currentEnd = i + 1;
            } else {
                currentEnd = i + 1;
            }
            inGrouping = false;
        } else if (c == '{') {
            this.addPattern(regexp, currentStart, currentEnd, parsedPattern, inException);
            inException = true;
            currentStart = i + 1;
            currentEnd = i + 1;
        } else if (c == '}') {
            this.addPattern(regexp, currentStart, currentEnd, parsedPattern, inException);
            inException = false;
            currentStart = i + 1;
            currentEnd = i + 1;
        } else if (c == '.' || c == '+' || c == '(' || c == '|' || c == ')' || c == '^' || c == '?' || c == '!') {
            // special meaning characters, not separators
            currentEnd = i + 1;
        } else if (c == '-') {
            // either the dash separator, or a character range (e.g. A-Z)
            if (inGrouping) {
                // do nothing: we don't know yet whether this is a separator
                // grouping or a character range
            } else {
                // a separator
                this.addPattern(regexp, currentStart, currentEnd, parsedPattern, inException);
                this.addPattern(regexp, i, i + 1, parsedPattern, inException);
                currentStart = i + 1;
                currentEnd = i + 1;
            }
        } else if (separatorPattern.matcher(String.valueOf(c)).find()) {
            if (inGrouping) {
                if (groupingHasLetters) {
                    throw new TalismaneException("Cannot mix separators and non-separators in same grouping");
                }
            } else {
                // a separator
                this.addPattern(regexp, currentStart, currentEnd, parsedPattern, inException);
                this.addPattern(regexp, i, i + 1, parsedPattern, inException);
                currentStart = i + 1;
                currentEnd = i + 1;
            }
        } else {
            // any other non-separating character
            if (inGrouping) {
                groupingHasLetters = true;
            }
            currentEnd = i + 1;
        }
    }
    this.addPattern(regexp, currentStart, currentEnd, parsedPattern, inException);
    if (LOG.isTraceEnabled()) {
        int i = 0;
        LOG.trace("Parsed " + regexp);
        for (Pattern pattern : parsedPattern) {
            boolean test = indexesToTest.contains(i);
            LOG.trace("Added " + pattern.pattern() + " Test? " + test);
            i++;
        }
    }
    if (indexesToTest.size() == 0) {
        throw new InvalidTokenPatternException("No indexes to test in pattern: " + this.getName());
    }
    return parsedPattern;
}
Also used : Pattern(java.util.regex.Pattern) TalismaneException(com.joliciel.talismane.TalismaneException) ArrayList(java.util.ArrayList)
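
The core idea of parsePattern, cutting the input at separators so that every separator becomes a chunk of its own, can be shown on plain text. This is a heavily simplified sketch under stated assumptions: SeparatorChunker and its SEPARATOR pattern (whitespace or ASCII punctuation) are hypothetical, and none of the escape, grouping, or brace handling from the method above is reproduced.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

final class SeparatorChunker {
    // Assumption for this sketch: a separator is any whitespace or ASCII punctuation character.
    private static final Pattern SEPARATOR = Pattern.compile("[\\s\\p{Punct}]");

    // Splits text into chunks; each separator is emitted as its own single-character chunk.
    static List<String> chunk(String text) {
        List<String> chunks = new ArrayList<>();
        int currentStart = 0;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (SEPARATOR.matcher(String.valueOf(c)).find()) {
                // Flush the pending non-separator chunk, if there is one.
                if (i > currentStart) {
                    chunks.add(text.substring(currentStart, i));
                }
                chunks.add(String.valueOf(c));
                currentStart = i + 1;
            }
        }
        // Flush whatever follows the last separator.
        if (currentStart < text.length()) {
            chunks.add(text.substring(currentStart));
        }
        return chunks;
    }

    public static void main(String[] args) {
        System.out.println(chunk("aujourd'hui")); // prints [aujourd, ', hui]
    }
}
```

As in parsePattern, a running start index marks the beginning of the pending chunk, and a final flush after the loop emits the trailing chunk.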

Example 44 with TalismaneException

use of com.joliciel.talismane.TalismaneException in project talismane by joliciel-informatique.

the class TokenEvaluationObserver method getTokenEvaluationObservers.

/**
 * Collect the observers specified in the configuration key
 * talismane.core.[sessionId].tokeniser.evaluate.observers.<br>
 * <br>
 * Each observer must implement this interface and must have a constructor
 * matching one of the following signatures:<br>
 * - ( {@link File} outputDir, {@link String} sessionId)<br>
 * - ( {@link String} sessionId)<br>
 * <br>
 *
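 * @param outDir
 *          directory in which to write the various outputs
 * @param sessionId
 *          the session id, used to locate the configuration and passed to
 *          each observer's constructor
 * @return the wrapped token sequence processors, followed by the
 *         configured observers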
 * @throws IOException
 * @throws TalismaneException
 *           if an observer does not implement this interface, or if no
 *           constructor is found with the correct signature
 */
public static List<TokenEvaluationObserver> getTokenEvaluationObservers(File outDir, String sessionId) throws IOException, TalismaneException, ReflectiveOperationException {
    if (outDir != null)
        outDir.mkdirs();
    Config config = ConfigFactory.load();
    Config tokeniserConfig = config.getConfig("talismane.core." + sessionId + ".tokeniser");
    Config evalConfig = tokeniserConfig.getConfig("evaluate");
    List<TokenEvaluationObserver> observers = new ArrayList<>();
    List<TokenSequenceProcessor> processors = TokenSequenceProcessor.getProcessors(null, outDir, sessionId);
    for (TokenSequenceProcessor processor : processors) {
        TokenSequenceProcessorWrapper wrapper = new TokenSequenceProcessorWrapper(processor);
        observers.add(wrapper);
    }
    List<String> classes = evalConfig.getStringList("observers");
    for (String className : classes) {
        @SuppressWarnings("rawtypes") Class untypedClass = Class.forName(className);
        if (!TokenEvaluationObserver.class.isAssignableFrom(untypedClass))
            throw new TalismaneException("Class " + className + " does not implement interface " + TokenEvaluationObserver.class.getSimpleName());
        @SuppressWarnings("unchecked") Class<? extends TokenEvaluationObserver> clazz = untypedClass;
        Constructor<? extends TokenEvaluationObserver> cons = null;
        TokenEvaluationObserver observer = null;
        if (cons == null) {
            try {
                cons = clazz.getConstructor(File.class, String.class);
            } catch (NoSuchMethodException e) {
                // no (File, String) constructor: try the next signature
            }
            if (cons != null) {
                observer = cons.newInstance(outDir, sessionId);
            }
        }
        if (cons == null) {
            try {
                cons = clazz.getConstructor(String.class);
            } catch (NoSuchMethodException e) {
                // no (String) constructor: handled by the null check below
            }
            if (cons != null) {
                observer = cons.newInstance(sessionId);
            } else {
                throw new TalismaneException("No constructor found with correct signature for: " + className);
            }
        }
        observers.add(observer);
    }
    return observers;
}
Also used : TokenSequenceProcessor(com.joliciel.talismane.tokeniser.output.TokenSequenceProcessor) TalismaneException(com.joliciel.talismane.TalismaneException) Config(com.typesafe.config.Config) ArrayList(java.util.ArrayList) File(java.io.File)

Example 45 with TalismaneException

use of com.joliciel.talismane.TalismaneException in project talismane by joliciel-informatique.

the class TokenRegexBasedCorpusReader method processSentence.

@Override
protected void processSentence(Sentence sentence, List<CorpusLine> corpusLines) throws TalismaneException, IOException {
    try {
        super.processSentence(sentence, corpusLines);
        tokenSequence = new PretokenisedSequence(sentence, sessionId);
        for (CorpusLine corpusLine : corpusLines) {
            this.convertToToken(tokenSequence, corpusLine);
        }
        for (TokenFilter filter : filters) {
            filter.apply(tokenSequence);
        }
        tokenSequence.cleanSlate();
    } catch (TalismaneException e) {
        this.clearSentence();
        throw e;
    }
}
Also used : TalismaneException(com.joliciel.talismane.TalismaneException) CorpusLine(com.joliciel.talismane.corpus.CorpusLine) TokenFilter(com.joliciel.talismane.tokeniser.filters.TokenFilter)

Aggregations

TalismaneException (com.joliciel.talismane.TalismaneException): 47
ArrayList (java.util.ArrayList): 27
Config (com.typesafe.config.Config): 14
File (java.io.File): 11
List (java.util.List): 10
TreeSet (java.util.TreeSet): 10
FeatureResult (com.joliciel.talismane.machineLearning.features.FeatureResult): 9
IOException (java.io.IOException): 9
HashMap (java.util.HashMap): 9
Set (java.util.Set): 9
Decision (com.joliciel.talismane.machineLearning.Decision): 8
RuntimeEnvironment (com.joliciel.talismane.machineLearning.features.RuntimeEnvironment): 8
PosTaggedToken (com.joliciel.talismane.posTagger.PosTaggedToken): 8
Token (com.joliciel.talismane.tokeniser.Token): 8
Map (java.util.Map): 8
SortedSet (java.util.SortedSet): 8
Collectors (java.util.stream.Collectors): 8
Logger (org.slf4j.Logger): 8
LoggerFactory (org.slf4j.LoggerFactory): 8
Sentence (com.joliciel.talismane.rawText.Sentence): 7