
Example 1 with PreConfiguredTokenizer

Use of org.opensearch.index.analysis.PreConfiguredTokenizer in project OpenSearch by opensearch-project.

From the class AnalysisModule, method setupPreConfiguredTokenizers:

static Map<String, PreConfiguredTokenizer> setupPreConfiguredTokenizers(List<AnalysisPlugin> plugins) {
    NamedRegistry<PreConfiguredTokenizer> preConfiguredTokenizers = new NamedRegistry<>("pre-configured tokenizer");
    // Temporary shim to register old style pre-configured tokenizers
    for (PreBuiltTokenizers tokenizer : PreBuiltTokenizers.values()) {
        String name = tokenizer.name().toLowerCase(Locale.ROOT);
        PreConfiguredTokenizer preConfigured;
        switch(tokenizer.getCachingStrategy()) {
            case ONE:
                preConfigured = PreConfiguredTokenizer.singleton(name, () -> tokenizer.create(Version.CURRENT));
                break;
            default:
                throw new UnsupportedOperationException("Caching strategy unsupported by temporary shim [" + tokenizer + "]");
        }
        preConfiguredTokenizers.register(name, preConfigured);
    }
    for (AnalysisPlugin plugin : plugins) {
        for (PreConfiguredTokenizer tokenizer : plugin.getPreConfiguredTokenizers()) {
            preConfiguredTokenizers.register(tokenizer.getName(), tokenizer);
        }
    }
    return unmodifiableMap(preConfiguredTokenizers.getRegistry());
}
Also used: NamedRegistry (org.opensearch.common.NamedRegistry), PreConfiguredTokenizer (org.opensearch.index.analysis.PreConfiguredTokenizer), AnalysisPlugin (org.opensearch.plugins.AnalysisPlugin)
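
The shim above only migrates the legacy PreBuiltTokenizers enum; the intended extension point is AnalysisPlugin#getPreConfiguredTokenizers, which the second loop consumes. Below is a minimal sketch of that plugin side, assuming only the APIs shown in these examples; the class name and the tokenizer name my_whitespace are hypothetical.

import java.util.List;

import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.opensearch.index.analysis.PreConfiguredTokenizer;
import org.opensearch.plugins.AnalysisPlugin;
import org.opensearch.plugins.Plugin;

public class MyAnalysisPlugin extends Plugin implements AnalysisPlugin {

    @Override
    public List<PreConfiguredTokenizer> getPreConfiguredTokenizers() {
        // "my_whitespace" is a hypothetical name for illustration. singleton()
        // registers one shared factory used for every index regardless of version,
        // matching the CachingStrategy.ONE case handled by the shim above.
        return List.of(PreConfiguredTokenizer.singleton("my_whitespace", WhitespaceTokenizer::new));
    }
}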

Example 2 with PreConfiguredTokenizer

Use of org.opensearch.index.analysis.PreConfiguredTokenizer in project OpenSearch by opensearch-project.

From the class CommonAnalysisPlugin, method getPreConfiguredTokenizers:

@Override
public List<PreConfiguredTokenizer> getPreConfiguredTokenizers() {
    List<PreConfiguredTokenizer> tokenizers = new ArrayList<>();
    tokenizers.add(PreConfiguredTokenizer.singleton("keyword", KeywordTokenizer::new));
    tokenizers.add(PreConfiguredTokenizer.singleton("classic", ClassicTokenizer::new));
    tokenizers.add(PreConfiguredTokenizer.singleton("uax_url_email", UAX29URLEmailTokenizer::new));
    tokenizers.add(PreConfiguredTokenizer.singleton("path_hierarchy", PathHierarchyTokenizer::new));
    tokenizers.add(PreConfiguredTokenizer.singleton("letter", LetterTokenizer::new));
    tokenizers.add(PreConfiguredTokenizer.singleton("whitespace", WhitespaceTokenizer::new));
    tokenizers.add(PreConfiguredTokenizer.singleton("ngram", NGramTokenizer::new));
    tokenizers.add(PreConfiguredTokenizer.openSearchVersion("edge_ngram", (version) -> {
        if (version.onOrAfter(LegacyESVersion.V_7_3_0)) {
            return new EdgeNGramTokenizer(NGramTokenizer.DEFAULT_MIN_NGRAM_SIZE, NGramTokenizer.DEFAULT_MAX_NGRAM_SIZE);
        }
        return new EdgeNGramTokenizer(EdgeNGramTokenizer.DEFAULT_MIN_GRAM_SIZE, EdgeNGramTokenizer.DEFAULT_MAX_GRAM_SIZE);
    }));
    tokenizers.add(PreConfiguredTokenizer.singleton("pattern", () -> new PatternTokenizer(Regex.compile("\\W+", null), -1)));
    tokenizers.add(PreConfiguredTokenizer.singleton("thai", ThaiTokenizer::new));
    // TODO deprecate and remove in API
    // This is already broken with normalization, so backwards compat isn't necessary?
    tokenizers.add(PreConfiguredTokenizer.singleton("lowercase", XLowerCaseTokenizer::new));
    // Temporary shim for aliases. TODO deprecate after they are moved
    tokenizers.add(PreConfiguredTokenizer.openSearchVersion("nGram", (version) -> {
        if (version.onOrAfter(LegacyESVersion.V_7_6_0)) {
            deprecationLogger.deprecate("nGram_tokenizer_deprecation", "The [nGram] tokenizer name is deprecated and will be removed in a future version. " + "Please change the tokenizer name to [ngram] instead.");
        }
        return new NGramTokenizer();
    }));
    tokenizers.add(PreConfiguredTokenizer.openSearchVersion("edgeNGram", (version) -> {
        if (version.onOrAfter(LegacyESVersion.V_7_6_0)) {
            deprecationLogger.deprecate("edgeNGram_tokenizer_deprecation", "The [edgeNGram] tokenizer name is deprecated and will be removed in a future version. " + "Please change the tokenizer name to [edge_ngram] instead.");
        }
        if (version.onOrAfter(LegacyESVersion.V_7_3_0)) {
            return new EdgeNGramTokenizer(NGramTokenizer.DEFAULT_MIN_NGRAM_SIZE, NGramTokenizer.DEFAULT_MAX_NGRAM_SIZE);
        }
        return new EdgeNGramTokenizer(EdgeNGramTokenizer.DEFAULT_MIN_GRAM_SIZE, EdgeNGramTokenizer.DEFAULT_MAX_GRAM_SIZE);
    }));
    tokenizers.add(PreConfiguredTokenizer.singleton("PathHierarchy", PathHierarchyTokenizer::new));
    return tokenizers;
}
Also used: ArrayList (java.util.ArrayList), List (java.util.List), KeywordTokenizer (org.apache.lucene.analysis.core.KeywordTokenizer), LetterTokenizer (org.apache.lucene.analysis.core.LetterTokenizer), WhitespaceTokenizer (org.apache.lucene.analysis.core.WhitespaceTokenizer), ClassicTokenizer (org.apache.lucene.analysis.standard.ClassicTokenizer), UAX29URLEmailTokenizer (org.apache.lucene.analysis.standard.UAX29URLEmailTokenizer), PathHierarchyTokenizer (org.apache.lucene.analysis.path.PathHierarchyTokenizer), NGramTokenizer (org.apache.lucene.analysis.ngram.NGramTokenizer), EdgeNGramTokenizer (org.apache.lucene.analysis.ngram.EdgeNGramTokenizer), PatternTokenizer (org.apache.lucene.analysis.pattern.PatternTokenizer), ThaiTokenizer (org.apache.lucene.analysis.th.ThaiTokenizer), Regex (org.opensearch.common.regex.Regex), LegacyESVersion (org.opensearch.LegacyESVersion), DeprecationLogger (org.opensearch.common.logging.DeprecationLogger), PreConfiguredTokenizer (org.opensearch.index.analysis.PreConfiguredTokenizer), AnalysisPlugin (org.opensearch.plugins.AnalysisPlugin)
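
Registered names become available to every index's analysis settings; the openSearchVersion variants receive the version the index was created with, which is how edge_ngram above can pick different defaults per index. A minimal sketch of the consuming side, using the Settings builder from these examples (the analyzer name my_analyzer is hypothetical):

import org.opensearch.common.settings.Settings;

public class PreConfiguredTokenizerSettingsExample {

    public static void main(String[] args) {
        // "my_analyzer" is a hypothetical analyzer name; "edge_ngram" is one of
        // the pre-configured tokenizer names registered above.
        Settings indexSettings = Settings.builder()
            .put("index.analysis.analyzer.my_analyzer.type", "custom")
            .put("index.analysis.analyzer.my_analyzer.tokenizer", "edge_ngram")
            .build();
        // Prints "edge_ngram": the tokenizer is referenced purely by name.
        System.out.println(indexSettings.get("index.analysis.analyzer.my_analyzer.tokenizer"));
    }
}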

Example 3 with PreConfiguredTokenizer

Use of org.opensearch.index.analysis.PreConfiguredTokenizer in project OpenSearch by opensearch-project.

From the class AnalysisModuleTests, method testPluginPreConfiguredTokenizers:

/**
 * Tests that plugins can register pre-configured tokenizers that vary in behavior based on the OpenSearch version, based on the
 * Lucene version, or that do not vary with version at all.
 */
public void testPluginPreConfiguredTokenizers() throws IOException {
    // Simple tokenizer that always spits out a single token with some preconfigured characters
    final class FixedTokenizer extends Tokenizer {

        private final CharTermAttribute term = addAttribute(CharTermAttribute.class);

        private final char[] chars;

        private boolean read = false;

        protected FixedTokenizer(String chars) {
            this.chars = chars.toCharArray();
        }

        @Override
        public boolean incrementToken() throws IOException {
            if (read) {
                return false;
            }
            clearAttributes();
            read = true;
            term.resizeBuffer(chars.length);
            System.arraycopy(chars, 0, term.buffer(), 0, chars.length);
            term.setLength(chars.length);
            return true;
        }

        @Override
        public void reset() throws IOException {
            super.reset();
            read = false;
        }
    }
    AnalysisRegistry registry = new AnalysisModule(TestEnvironment.newEnvironment(emptyNodeSettings), singletonList(new AnalysisPlugin() {

        @Override
        public List<PreConfiguredTokenizer> getPreConfiguredTokenizers() {
            return Arrays.asList(
                PreConfiguredTokenizer.singleton("no_version", () -> new FixedTokenizer("no_version")),
                PreConfiguredTokenizer.luceneVersion("lucene_version", luceneVersion -> new FixedTokenizer(luceneVersion.toString())),
                PreConfiguredTokenizer.openSearchVersion("opensearch_version", esVersion -> new FixedTokenizer(esVersion.toString()))
            );
        }
    })).getAnalysisRegistry();
    Version version = VersionUtils.randomVersion(random());
    IndexAnalyzers analyzers = getIndexAnalyzers(
        registry,
        Settings.builder()
            .put("index.analysis.analyzer.no_version.tokenizer", "no_version")
            .put("index.analysis.analyzer.lucene_version.tokenizer", "lucene_version")
            .put("index.analysis.analyzer.opensearch_version.tokenizer", "opensearch_version")
            .put(IndexMetadata.SETTING_VERSION_CREATED, version)
            .build()
    );
    assertTokenStreamContents(analyzers.get("no_version").tokenStream("", "test"), new String[] { "no_version" });
    assertTokenStreamContents(analyzers.get("lucene_version").tokenStream("", "test"), new String[] { version.luceneVersion.toString() });
    assertTokenStreamContents(analyzers.get("opensearch_version").tokenStream("", "test"), new String[] { version.toString() });
// These are currently broken by https://github.com/elastic/elasticsearch/issues/24752
// assertEquals("test" + (noVersionSupportsMultiTerm ? "no_version" : ""),
// analyzers.get("no_version").normalize("", "test").utf8ToString());
// assertEquals("test" + (luceneVersionSupportsMultiTerm ? version.luceneVersion.toString() : ""),
// analyzers.get("lucene_version").normalize("", "test").utf8ToString());
// assertEquals("test" + (opensearchVersionSupportsMultiTerm ? version.toString() : ""),
// analyzers.get("opensearch_version").normalize("", "test").utf8ToString());
}
Also used: AnalysisRegistry (org.opensearch.index.analysis.AnalysisRegistry), CharTermAttribute (org.apache.lucene.analysis.tokenattributes.CharTermAttribute), Version (org.opensearch.Version), LegacyESVersion (org.opensearch.LegacyESVersion), IndexAnalyzers (org.opensearch.index.analysis.IndexAnalyzers), Collections.singletonList (java.util.Collections.singletonList), List (java.util.List), PreConfiguredTokenizer (org.opensearch.index.analysis.PreConfiguredTokenizer), Tokenizer (org.apache.lucene.analysis.Tokenizer), MockTokenizer (org.apache.lucene.analysis.MockTokenizer), AnalysisPlugin (org.opensearch.plugins.AnalysisPlugin)
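
assertTokenStreamContents hides the standard Lucene consumption loop that FixedTokenizer is written against. A minimal sketch of that loop with a stock tokenizer (WhitespaceTokenizer stands in for the test-private FixedTokenizer):

import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class TokenStreamWalkthrough {

    public static void main(String[] args) throws IOException {
        Tokenizer tokenizer = new WhitespaceTokenizer();
        tokenizer.setReader(new StringReader("hello world"));
        CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
        tokenizer.reset();                    // required before the first incrementToken()
        while (tokenizer.incrementToken()) {  // one token per call; false at end of input
            System.out.println(term.toString());
        }
        tokenizer.end();
        tokenizer.close();
    }
}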

Aggregations

PreConfiguredTokenizer (org.opensearch.index.analysis.PreConfiguredTokenizer): 3
AnalysisPlugin (org.opensearch.plugins.AnalysisPlugin): 3
List (java.util.List): 2
LegacyESVersion (org.opensearch.LegacyESVersion): 2
ArrayList (java.util.ArrayList): 1
Collection (java.util.Collection): 1
Collections (java.util.Collections): 1
Collections.singletonList (java.util.Collections.singletonList): 1
Map (java.util.Map): 1
TreeMap (java.util.TreeMap): 1
Supplier (java.util.function.Supplier): 1
Analyzer (org.apache.lucene.analysis.Analyzer): 1
CharArraySet (org.apache.lucene.analysis.CharArraySet): 1
MockTokenizer (org.apache.lucene.analysis.MockTokenizer): 1
StopFilter (org.apache.lucene.analysis.StopFilter): 1
TokenStream (org.apache.lucene.analysis.TokenStream): 1
Tokenizer (org.apache.lucene.analysis.Tokenizer): 1
ArabicAnalyzer (org.apache.lucene.analysis.ar.ArabicAnalyzer): 1
ArabicNormalizationFilter (org.apache.lucene.analysis.ar.ArabicNormalizationFilter): 1
ArabicStemFilter (org.apache.lucene.analysis.ar.ArabicStemFilter): 1