Example 1 with RuleBasedBreakIterator

use of com.ibm.icu.text.RuleBasedBreakIterator in project lucene-solr by apache.

the class RBBIRuleCompiler method compile.

static void compile(File srcDir, File destDir) throws Exception {
    File[] files = srcDir.listFiles(new FilenameFilter() {

        public boolean accept(File dir, String name) {
            return name.endsWith("rbbi");
        }
    });
    if (files == null)
        throw new IOException("Path does not exist: " + srcDir);
    for (int i = 0; i < files.length; i++) {
        File file = files[i];
        File outputFile = new File(destDir, file.getName().replaceAll("rbbi$", "brk"));
        String rules = getRules(file);
        System.err.print("Compiling " + file.getName() + " to " + outputFile.getName() + ": ");
        /*
         * If there is a syntax error, compileRules() may still succeed. The way
         * to check is to try to instantiate from the string. Additionally, if
         * the rules are invalid, you get a useful syntax error.
         */
        try {
            new RuleBasedBreakIterator(rules);
        } catch (IllegalArgumentException e) {
            /*
             * Do this intentionally, so you get a useful syntax error
             * instead of a massive stack trace.
             */
            System.err.println(e.getMessage());
            System.exit(1);
        }
        // try-with-resources closes the stream even if compileRules() throws.
        try (FileOutputStream os = new FileOutputStream(outputFile)) {
            RuleBasedBreakIterator.compileRules(rules, os);
        }
        System.err.println(outputFile.length() + " bytes.");
    }
}
Also used : FilenameFilter(java.io.FilenameFilter) RuleBasedBreakIterator(com.ibm.icu.text.RuleBasedBreakIterator) FileOutputStream(java.io.FileOutputStream) IOException(java.io.IOException) File(java.io.File)
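
The file selection and output naming in Example 1 can be tried in isolation. The following is a minimal sketch using only the JDK; the class name RbbiFileNames and its two helper methods are hypothetical and exist only for this illustration. It reproduces the suffix check from accept() and the "rbbi$" replacement used to derive the output name.

```java
class RbbiFileNames {
    // Mirrors the accept() check in Example 1: a plain suffix test, no dot required.
    static boolean isRuleFile(String name) {
        return name.endsWith("rbbi");
    }

    // Mirrors the output naming in Example 1: "$" anchors the pattern at the end,
    // so only a trailing "rbbi" is rewritten to "brk".
    static String compiledName(String name) {
        return name.replaceAll("rbbi$", "brk");
    }

    public static void main(String[] args) {
        String[] names = {"Default.rbbi", "rbbi-notes.txt", "Thai.rbbi"};
        for (String name : names) {
            if (isRuleFile(name)) {
                System.out.println(name + " -> " + compiledName(name));
            }
        }
    }
}
```

Because the check is endsWith("rbbi") rather than endsWith(".rbbi"), any file whose name merely ends in those four letters is picked up as well.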

Example 2 with RuleBasedBreakIterator

use of com.ibm.icu.text.RuleBasedBreakIterator in project elasticsearch by elastic.

the class IcuTokenizerFactory method parseRules.

// Parse a single RBBI rule file.
private BreakIterator parseRules(String filename, Environment env) throws IOException {
    final Path path = env.configFile().resolve(filename);
    String rules = Files.readAllLines(path).stream().filter((v) -> v.startsWith("#") == false).collect(Collectors.joining("\n"));
    // rules is already a String, so no toString() call is needed.
    return new RuleBasedBreakIterator(rules);
}
Also used : Path(java.nio.file.Path) ElasticsearchException(org.elasticsearch.ElasticsearchException) UCharacter(com.ibm.icu.lang.UCharacter) Tokenizer(org.apache.lucene.analysis.Tokenizer) UScript(com.ibm.icu.lang.UScript) Files(java.nio.file.Files) RuleBasedBreakIterator(com.ibm.icu.text.RuleBasedBreakIterator) Environment(org.elasticsearch.env.Environment) BreakIterator(com.ibm.icu.text.BreakIterator) IOException(java.io.IOException) HashMap(java.util.HashMap) Collectors(java.util.stream.Collectors) ICUTokenizer(org.apache.lucene.analysis.icu.segmentation.ICUTokenizer) DefaultICUTokenizerConfig(org.apache.lucene.analysis.icu.segmentation.DefaultICUTokenizerConfig) Settings(org.elasticsearch.common.settings.Settings) Map(java.util.Map) IndexSettings(org.elasticsearch.index.IndexSettings) UProperty(com.ibm.icu.lang.UProperty) ICUTokenizerConfig(org.apache.lucene.analysis.icu.segmentation.ICUTokenizerConfig)

Example 3 with RuleBasedBreakIterator

use of com.ibm.icu.text.RuleBasedBreakIterator in project lucene-solr by apache.

the class DefaultICUTokenizerConfig method readBreakIterator.

private static RuleBasedBreakIterator readBreakIterator(String filename) {
    // try-with-resources closes the stream even if loading the compiled rules fails.
    try (InputStream is = DefaultICUTokenizerConfig.class.getResourceAsStream(filename)) {
        return RuleBasedBreakIterator.getInstanceFromCompiledRules(is);
    } catch (IOException e) {
        throw new RuntimeException(e);
    }
}
Also used : RuleBasedBreakIterator(com.ibm.icu.text.RuleBasedBreakIterator) InputStream(java.io.InputStream) IOException(java.io.IOException)
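
None of the examples show what a caller does with the iterator once it is built. The following is a minimal sketch of walking the boundaries of a text. To stay runnable without ICU4J on the classpath it uses the JDK's java.text.BreakIterator as a stand-in; com.ibm.icu.text.BreakIterator offers the same setText(), first(), next() and DONE members, so the loop should carry over, but that substitution is an assumption of this sketch. The class name BoundaryWalk and the method boundaries are hypothetical.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class BoundaryWalk {
    // Collects every boundary offset the iterator reports for the given text.
    static List<Integer> boundaries(BreakIterator bi, String text) {
        bi.setText(text);
        List<Integer> result = new ArrayList<>();
        // first() returns the first boundary; next() returns DONE once the text is exhausted.
        for (int pos = bi.first(); pos != BreakIterator.DONE; pos = bi.next()) {
            result.add(pos);
        }
        return result;
    }

    public static void main(String[] args) {
        // JDK word iterator used as a stand-in for an ICU RuleBasedBreakIterator.
        BreakIterator words = BreakIterator.getWordInstance(Locale.ROOT);
        System.out.println(boundaries(words, "Hello world"));
    }
}
```

Each pair of consecutive offsets delimits one segment of the text, which is how a tokenizer would typically consume the iterator.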

Example 4 with RuleBasedBreakIterator

use of com.ibm.icu.text.RuleBasedBreakIterator in project lucene-solr by apache.

the class ICUTokenizerFactory method parseRules.

private BreakIterator parseRules(String filename, ResourceLoader loader) throws IOException {
    StringBuilder rules = new StringBuilder();
    InputStream rulesStream = loader.openResource(filename);
    // try-with-resources closes the reader (and the underlying stream) even if reading fails.
    try (BufferedReader reader = new BufferedReader(IOUtils.getDecodingReader(rulesStream, StandardCharsets.UTF_8))) {
        String line;
        while ((line = reader.readLine()) != null) {
            // Comment lines are dropped, but a newline is still appended for
            // every line, which keeps the line count unchanged.
            if (!line.startsWith("#")) {
                rules.append(line);
            }
            rules.append('\n');
        }
    }
    return new RuleBasedBreakIterator(rules.toString());
}
Also used : RuleBasedBreakIterator(com.ibm.icu.text.RuleBasedBreakIterator) InputStream(java.io.InputStream) BufferedReader(java.io.BufferedReader)
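
Examples 2 and 4 strip "#" comment lines in slightly different ways: Example 2 filters them out, while Example 4 replaces each one with an empty line. The following is a minimal sketch contrasting the two using only the JDK; the class name RuleCommentStripping and both method names are hypothetical. A plausible reason for the second variant, offered here as an assumption, is that any line number reported in a rule-syntax error then still matches the original file.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class RuleCommentStripping {
    // Approach of Example 2: comment lines are removed entirely.
    static String dropCommentLines(List<String> lines) {
        return lines.stream()
                .filter(line -> !line.startsWith("#"))
                .collect(Collectors.joining("\n"));
    }

    // Approach of Example 4: comment lines become empty lines,
    // so the number of lines stays the same.
    static String blankCommentLines(List<String> lines) {
        StringBuilder rules = new StringBuilder();
        for (String line : lines) {
            if (!line.startsWith("#")) {
                rules.append(line);
            }
            rules.append('\n');
        }
        return rules.toString();
    }

    public static void main(String[] args) {
        // Illustrative input only; not taken from a real rule file.
        List<String> lines = Arrays.asList("# header", "$Letter = [a-z];", "# note", "$Letter+;");
        System.out.println(dropCommentLines(lines));
        System.out.println("----");
        System.out.print(blankCommentLines(lines));
    }
}
```

Both variants only recognize a comment when "#" is the first character of the line; an indented or trailing comment is passed through to the rule parser unchanged.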

Aggregations

RuleBasedBreakIterator (com.ibm.icu.text.RuleBasedBreakIterator): 4
IOException (java.io.IOException): 3
InputStream (java.io.InputStream): 2
UCharacter (com.ibm.icu.lang.UCharacter): 1
UProperty (com.ibm.icu.lang.UProperty): 1
UScript (com.ibm.icu.lang.UScript): 1
BreakIterator (com.ibm.icu.text.BreakIterator): 1
BufferedReader (java.io.BufferedReader): 1
File (java.io.File): 1
FileOutputStream (java.io.FileOutputStream): 1
FilenameFilter (java.io.FilenameFilter): 1
Files (java.nio.file.Files): 1
Path (java.nio.file.Path): 1
HashMap (java.util.HashMap): 1
Map (java.util.Map): 1
Collectors (java.util.stream.Collectors): 1
Tokenizer (org.apache.lucene.analysis.Tokenizer): 1
DefaultICUTokenizerConfig (org.apache.lucene.analysis.icu.segmentation.DefaultICUTokenizerConfig): 1
ICUTokenizer (org.apache.lucene.analysis.icu.segmentation.ICUTokenizer): 1
ICUTokenizerConfig (org.apache.lucene.analysis.icu.segmentation.ICUTokenizerConfig): 1