Examples with Analyzer - org.apache.lucene.analysis.Analyzer

Example 1 with Analyzer

use of org.apache.lucene.analysis.Analyzer in project zeppelin by apache.

the class LuceneSearch method query.

/* (non-Javadoc)
   * @see org.apache.zeppelin.search.Search#query(java.lang.String)
   */
@Override
public List<Map<String, String>> query(String queryStr) {
    if (null == ramDirectory) {
        throw new IllegalStateException("Something went wrong on instance creation time, index dir is null");
    }
    List<Map<String, String>> result = Collections.emptyList();
    try (IndexReader indexReader = DirectoryReader.open(ramDirectory)) {
        IndexSearcher indexSearcher = new IndexSearcher(indexReader);
        Analyzer analyzer = new StandardAnalyzer();
        MultiFieldQueryParser parser = new MultiFieldQueryParser(new String[] { SEARCH_FIELD_TEXT, SEARCH_FIELD_TITLE }, analyzer);
        Query query = parser.parse(queryStr);
        LOG.debug("Searching for: " + query.toString(SEARCH_FIELD_TEXT));
        SimpleHTMLFormatter htmlFormatter = new SimpleHTMLFormatter();
        Highlighter highlighter = new Highlighter(htmlFormatter, new QueryScorer(query));
        result = doSearch(indexSearcher, query, analyzer, highlighter);
        indexReader.close();
    } catch (IOException e) {
        LOG.error("Failed to open index dir {}, make sure indexing finished OK", ramDirectory, e);
    } catch (ParseException e) {
        LOG.error("Failed to parse query " + queryStr, e);
    }
    return result;
}

Also used : IndexSearcher(org.apache.lucene.search.IndexSearcher) MultiFieldQueryParser(org.apache.lucene.queryparser.classic.MultiFieldQueryParser) Query(org.apache.lucene.search.Query) WildcardQuery(org.apache.lucene.search.WildcardQuery) QueryScorer(org.apache.lucene.search.highlight.QueryScorer) IOException(java.io.IOException) Analyzer(org.apache.lucene.analysis.Analyzer) StandardAnalyzer(org.apache.lucene.analysis.standard.StandardAnalyzer) StandardAnalyzer(org.apache.lucene.analysis.standard.StandardAnalyzer) IndexReader(org.apache.lucene.index.IndexReader) ParseException(org.apache.lucene.queryparser.classic.ParseException) SimpleHTMLFormatter(org.apache.lucene.search.highlight.SimpleHTMLFormatter) Map(java.util.Map) ImmutableMap(com.google.common.collect.ImmutableMap) Highlighter(org.apache.lucene.search.highlight.Highlighter)

Example 2 with Analyzer

use of org.apache.lucene.analysis.Analyzer in project crate by crate.

the class MatchQueryBuilderTest method mockMapperService.

private MapperService mockMapperService() {
    Analyzer analyzer = new GermanAnalyzer();
    MapperService mapperService = mock(MapperService.class);
    when(mapperService.searchAnalyzer()).thenReturn(analyzer);
    return mapperService;
}

Also used : GermanAnalyzer(org.apache.lucene.analysis.de.GermanAnalyzer) Analyzer(org.apache.lucene.analysis.Analyzer) GermanAnalyzer(org.apache.lucene.analysis.de.GermanAnalyzer) MapperService(org.elasticsearch.index.mapper.MapperService)

Example 3 with Analyzer

use of org.apache.lucene.analysis.Analyzer in project elasticsearch by elastic.

the class NoisyChannelSpellCheckerTests method testMultiGenerator.

public void testMultiGenerator() throws IOException {
    RAMDirectory dir = new RAMDirectory();
    Map<String, Analyzer> mapping = new HashMap<>();
    mapping.put("body_ngram", new Analyzer() {

        @Override
        protected TokenStreamComponents createComponents(String fieldName) {
            Tokenizer t = new StandardTokenizer();
            ShingleFilter tf = new ShingleFilter(t, 2, 3);
            tf.setOutputUnigrams(false);
            return new TokenStreamComponents(t, new LowerCaseFilter(tf));
        }
    });
    mapping.put("body", new Analyzer() {

        @Override
        protected TokenStreamComponents createComponents(String fieldName) {
            Tokenizer t = new StandardTokenizer();
            return new TokenStreamComponents(t, new LowerCaseFilter(t));
        }
    });
    mapping.put("body_reverse", new Analyzer() {

        @Override
        protected TokenStreamComponents createComponents(String fieldName) {
            Tokenizer t = new StandardTokenizer();
            return new TokenStreamComponents(t, new ReverseStringFilter(new LowerCaseFilter(t)));
        }
    });
    PerFieldAnalyzerWrapper wrapper = new PerFieldAnalyzerWrapper(new WhitespaceAnalyzer(), mapping);
    IndexWriterConfig conf = new IndexWriterConfig(wrapper);
    IndexWriter writer = new IndexWriter(dir, conf);
    String[] strings = new String[] { "Xorr the God-Jewel", "Grog the God-Crusher", "Xorn", "Walter Newell", "Wanda Maximoff", "Captain America", "American Ace", "Wundarr the Aquarian", "Will o' the Wisp", "Xemnu the Titan", "Fantastic Four", "Quasar", "Quasar II" };
    for (String line : strings) {
        Document doc = new Document();
        doc.add(new Field("body", line, TextField.TYPE_NOT_STORED));
        doc.add(new Field("body_reverse", line, TextField.TYPE_NOT_STORED));
        doc.add(new Field("body_ngram", line, TextField.TYPE_NOT_STORED));
        writer.addDocument(doc);
    }
    DirectoryReader ir = DirectoryReader.open(writer);
    LaplaceScorer wordScorer = new LaplaceScorer(ir, MultiFields.getTerms(ir, "body_ngram"), "body_ngram", 0.95d, new BytesRef(" "), 0.5f);
    NoisyChannelSpellChecker suggester = new NoisyChannelSpellChecker();
    DirectSpellChecker spellchecker = new DirectSpellChecker();
    spellchecker.setMinQueryLength(1);
    DirectCandidateGenerator forward = new DirectCandidateGenerator(spellchecker, "body", SuggestMode.SUGGEST_ALWAYS, ir, 0.95, 10);
    DirectCandidateGenerator reverse = new DirectCandidateGenerator(spellchecker, "body_reverse", SuggestMode.SUGGEST_ALWAYS, ir, 0.95, 10, wrapper, wrapper, MultiFields.getTerms(ir, "body_reverse"));
    CandidateGenerator generator = new MultiCandidateGeneratorWrapper(10, forward, reverse);
    Correction[] corrections = suggester.getCorrections(wrapper, new BytesRef("american cae"), generator, 1, 1, ir, "body", wordScorer, 1, 2).corrections;
    assertThat(corrections.length, equalTo(1));
    assertThat(corrections[0].join(new BytesRef(" ")).utf8ToString(), equalTo("american ace"));
    generator = new MultiCandidateGeneratorWrapper(5, forward, reverse);
    corrections = suggester.getCorrections(wrapper, new BytesRef("american ame"), generator, 1, 1, ir, "body", wordScorer, 1, 2).corrections;
    assertThat(corrections.length, equalTo(1));
    assertThat(corrections[0].join(new BytesRef(" ")).utf8ToString(), equalTo("american ace"));
    corrections = suggester.getCorrections(wrapper, new BytesRef("american cae"), forward, 1, 1, ir, "body", wordScorer, 1, 2).corrections;
    // only use forward with constant prefix
    assertThat(corrections.length, equalTo(0));
    corrections = suggester.getCorrections(wrapper, new BytesRef("america cae"), generator, 2, 1, ir, "body", wordScorer, 1, 2).corrections;
    assertThat(corrections.length, equalTo(1));
    assertThat(corrections[0].join(new BytesRef(" ")).utf8ToString(), equalTo("american ace"));
    corrections = suggester.getCorrections(wrapper, new BytesRef("Zorr the Got-Jewel"), generator, 0.5f, 4, ir, "body", wordScorer, 0, 2).corrections;
    assertThat(corrections.length, equalTo(4));
    assertThat(corrections[0].join(new BytesRef(" ")).utf8ToString(), equalTo("xorr the god jewel"));
    assertThat(corrections[1].join(new BytesRef(" ")).utf8ToString(), equalTo("zorr the god jewel"));
    assertThat(corrections[2].join(new BytesRef(" ")).utf8ToString(), equalTo("four the god jewel"));
    corrections = suggester.getCorrections(wrapper, new BytesRef("Zorr the Got-Jewel"), generator, 0.5f, 1, ir, "body", wordScorer, 1.5f, 2).corrections;
    assertThat(corrections.length, equalTo(1));
    assertThat(corrections[0].join(new BytesRef(" ")).utf8ToString(), equalTo("xorr the god jewel"));
    corrections = suggester.getCorrections(wrapper, new BytesRef("Xor the Got-Jewel"), generator, 0.5f, 1, ir, "body", wordScorer, 1.5f, 2).corrections;
    assertThat(corrections.length, equalTo(1));
    assertThat(corrections[0].join(new BytesRef(" ")).utf8ToString(), equalTo("xorr the god jewel"));
    // Test a special case where one of the suggest term is unchanged by the postFilter, 'II' here is unchanged by the reverse analyzer.
    corrections = suggester.getCorrections(wrapper, new BytesRef("Quazar II"), generator, 1, 1, ir, "body", wordScorer, 1, 2).corrections;
    assertThat(corrections.length, equalTo(1));
    assertThat(corrections[0].join(new BytesRef(" ")).utf8ToString(), equalTo("quasar ii"));
}

Also used : HashMap(java.util.HashMap) WhitespaceAnalyzer(org.apache.lucene.analysis.core.WhitespaceAnalyzer) Analyzer(org.apache.lucene.analysis.Analyzer) Document(org.apache.lucene.document.Document) Field(org.apache.lucene.document.Field) TextField(org.apache.lucene.document.TextField) ShingleFilter(org.apache.lucene.analysis.shingle.ShingleFilter) ReverseStringFilter(org.apache.lucene.analysis.reverse.ReverseStringFilter) Tokenizer(org.apache.lucene.analysis.Tokenizer) StandardTokenizer(org.apache.lucene.analysis.standard.StandardTokenizer) DirectSpellChecker(org.apache.lucene.search.spell.DirectSpellChecker) BytesRef(org.apache.lucene.util.BytesRef) WhitespaceAnalyzer(org.apache.lucene.analysis.core.WhitespaceAnalyzer) DirectoryReader(org.apache.lucene.index.DirectoryReader) RAMDirectory(org.apache.lucene.store.RAMDirectory) PerFieldAnalyzerWrapper(org.apache.lucene.analysis.miscellaneous.PerFieldAnalyzerWrapper) IndexWriter(org.apache.lucene.index.IndexWriter) StandardTokenizer(org.apache.lucene.analysis.standard.StandardTokenizer) LowerCaseFilter(org.apache.lucene.analysis.LowerCaseFilter) IndexWriterConfig(org.apache.lucene.index.IndexWriterConfig)

Example 4 with Analyzer

use of org.apache.lucene.analysis.Analyzer in project elasticsearch by elastic.

the class SmoothingModelTestCase method testBuildWordScorer.

/**
     * Test the WordScorer emitted by the smoothing model
     */
public void testBuildWordScorer() throws IOException {
    SmoothingModel testModel = createTestModel();
    Map<String, Analyzer> mapping = new HashMap<>();
    mapping.put("field", new WhitespaceAnalyzer());
    PerFieldAnalyzerWrapper wrapper = new PerFieldAnalyzerWrapper(new WhitespaceAnalyzer(), mapping);
    IndexWriter writer = new IndexWriter(new RAMDirectory(), new IndexWriterConfig(wrapper));
    Document doc = new Document();
    doc.add(new Field("field", "someText", TextField.TYPE_NOT_STORED));
    writer.addDocument(doc);
    DirectoryReader ir = DirectoryReader.open(writer);
    WordScorer wordScorer = testModel.buildWordScorerFactory().newScorer(ir, MultiFields.getTerms(ir, "field"), "field", 0.9d, BytesRefs.toBytesRef(" "));
    assertWordScorer(wordScorer, testModel);
}

Also used : WhitespaceAnalyzer(org.apache.lucene.analysis.core.WhitespaceAnalyzer) HashMap(java.util.HashMap) DirectoryReader(org.apache.lucene.index.DirectoryReader) WhitespaceAnalyzer(org.apache.lucene.analysis.core.WhitespaceAnalyzer) Analyzer(org.apache.lucene.analysis.Analyzer) Document(org.apache.lucene.document.Document) RAMDirectory(org.apache.lucene.store.RAMDirectory) PerFieldAnalyzerWrapper(org.apache.lucene.analysis.miscellaneous.PerFieldAnalyzerWrapper) Field(org.apache.lucene.document.Field) TextField(org.apache.lucene.document.TextField) IndexWriter(org.apache.lucene.index.IndexWriter) IndexWriterConfig(org.apache.lucene.index.IndexWriterConfig)

Example 5 with Analyzer

use of org.apache.lucene.analysis.Analyzer in project elasticsearch by elastic.

the class PercolateQueryBuilder method doToQuery.

@Override
protected Query doToQuery(QueryShardContext context) throws IOException {
    // Call nowInMillis() so that this query becomes un-cacheable since we
    // can't be sure that it doesn't use now or scripts
    context.nowInMillis();
    if (indexedDocumentIndex != null || indexedDocumentType != null || indexedDocumentId != null) {
        throw new IllegalStateException("query builder must be rewritten first");
    }
    if (document == null) {
        throw new IllegalStateException("no document to percolate");
    }
    MapperService mapperService = context.getMapperService();
    DocumentMapperForType docMapperForType = mapperService.documentMapperWithAutoCreate(documentType);
    DocumentMapper docMapper = docMapperForType.getDocumentMapper();
    ParsedDocument doc = docMapper.parse(source(context.index().getName(), documentType, "_temp_id", document, documentXContentType));
    FieldNameAnalyzer fieldNameAnalyzer = (FieldNameAnalyzer) docMapper.mappers().indexAnalyzer();
    // Need to this custom impl because FieldNameAnalyzer is strict and the percolator sometimes isn't when
    // 'index.percolator.map_unmapped_fields_as_string' is enabled:
    Analyzer analyzer = new DelegatingAnalyzerWrapper(Analyzer.PER_FIELD_REUSE_STRATEGY) {

        @Override
        protected Analyzer getWrappedAnalyzer(String fieldName) {
            Analyzer analyzer = fieldNameAnalyzer.analyzers().get(fieldName);
            if (analyzer != null) {
                return analyzer;
            } else {
                return context.getIndexAnalyzers().getDefaultIndexAnalyzer();
            }
        }
    };
    final IndexSearcher docSearcher;
    if (doc.docs().size() > 1) {
        assert docMapper.hasNestedObjects();
        docSearcher = createMultiDocumentSearcher(analyzer, doc);
    } else {
        MemoryIndex memoryIndex = MemoryIndex.fromDocument(doc.rootDoc(), analyzer, true, false);
        docSearcher = memoryIndex.createSearcher();
        docSearcher.setQueryCache(null);
    }
    Version indexVersionCreated = context.getIndexSettings().getIndexVersionCreated();
    boolean mapUnmappedFieldsAsString = context.getIndexSettings().getValue(PercolatorFieldMapper.INDEX_MAP_UNMAPPED_FIELDS_AS_STRING_SETTING);
    // We have to make a copy of the QueryShardContext here so we can have a unfrozen version for parsing the legacy
    // percolator queries
    QueryShardContext percolateShardContext = new QueryShardContext(context);
    MappedFieldType fieldType = context.fieldMapper(field);
    if (fieldType == null) {
        throw new QueryShardException(context, "field [" + field + "] does not exist");
    }
    if (!(fieldType instanceof PercolatorFieldMapper.FieldType)) {
        throw new QueryShardException(context, "expected field [" + field + "] to be of type [percolator], but is of type [" + fieldType.typeName() + "]");
    }
    PercolatorFieldMapper.FieldType pft = (PercolatorFieldMapper.FieldType) fieldType;
    PercolateQuery.QueryStore queryStore = createStore(pft, percolateShardContext, mapUnmappedFieldsAsString);
    return pft.percolateQuery(documentType, queryStore, document, docSearcher);
}

Also used : IndexSearcher(org.apache.lucene.search.IndexSearcher) FieldNameAnalyzer(org.elasticsearch.index.analysis.FieldNameAnalyzer) DocumentMapper(org.elasticsearch.index.mapper.DocumentMapper) Analyzer(org.apache.lucene.analysis.Analyzer) FieldNameAnalyzer(org.elasticsearch.index.analysis.FieldNameAnalyzer) MappedFieldType(org.elasticsearch.index.mapper.MappedFieldType) MemoryIndex(org.apache.lucene.index.memory.MemoryIndex) ParsedDocument(org.elasticsearch.index.mapper.ParsedDocument) DelegatingAnalyzerWrapper(org.apache.lucene.analysis.DelegatingAnalyzerWrapper) Version(org.elasticsearch.Version) DocumentMapperForType(org.elasticsearch.index.mapper.DocumentMapperForType) MappedFieldType(org.elasticsearch.index.mapper.MappedFieldType) QueryShardContext(org.elasticsearch.index.query.QueryShardContext) QueryShardException(org.elasticsearch.index.query.QueryShardException) MapperService(org.elasticsearch.index.mapper.MapperService)

Aggregations

Analyzer (org.apache.lucene.analysis.Analyzer)1020 MockAnalyzer (org.apache.lucene.analysis.MockAnalyzer)396 Tokenizer (org.apache.lucene.analysis.Tokenizer)265 MockTokenizer (org.apache.lucene.analysis.MockTokenizer)228 Document (org.apache.lucene.document.Document)207 Directory (org.apache.lucene.store.Directory)192 KeywordTokenizer (org.apache.lucene.analysis.core.KeywordTokenizer)176 BytesRef (org.apache.lucene.util.BytesRef)122 Test (org.junit.Test)119 TokenStream (org.apache.lucene.analysis.TokenStream)107 RandomIndexWriter (org.apache.lucene.index.RandomIndexWriter)92 Term (org.apache.lucene.index.Term)92 IndexReader (org.apache.lucene.index.IndexReader)67 InputArrayIterator (org.apache.lucene.search.suggest.InputArrayIterator)65 StandardAnalyzer (org.apache.lucene.analysis.standard.StandardAnalyzer)64 Input (org.apache.lucene.search.suggest.Input)63 CharArraySet (org.apache.lucene.analysis.CharArraySet)58 ArrayList (java.util.ArrayList)57 IndexWriterConfig (org.apache.lucene.index.IndexWriterConfig)57 TextField (org.apache.lucene.document.TextField)55