Example 6 with OptimaizeLangDetector

Use of org.apache.tika.langdetect.OptimaizeLangDetector in project tika by apache.

In class Language, method languageDetectionWithWriter:

public static void languageDetectionWithWriter() throws IOException {
    // TODO support version of LanguageWriter that doesn't need a detector.
    LanguageDetector detector = new OptimaizeLangDetector().loadModels();
    // try-with-resources closes the writer even if detection throws
    try (LanguageWriter writer = new LanguageWriter(detector)) {
        // Text can be appended in chunks; detection covers everything written so far
        writer.append("Minden emberi lény");
        writer.append(" szabadon születik és");
        writer.append(" egyenlő méltósága és");
        writer.append(" joga van.");
        LanguageResult result = writer.getLanguage();
        System.out.println(result.getLanguage());
    }
}
Also used: LanguageDetector (org.apache.tika.language.detect.LanguageDetector), LanguageResult (org.apache.tika.language.detect.LanguageResult), OptimaizeLangDetector (org.apache.tika.langdetect.OptimaizeLangDetector), LanguageWriter (org.apache.tika.language.detect.LanguageWriter)

Example 7 with OptimaizeLangDetector

Use of org.apache.tika.langdetect.OptimaizeLangDetector in project tika by apache.

In class MyFirstTika, method parseUsingComponents:

public static String parseUsingComponents(String filename, TikaConfig tikaConfig, Metadata metadata) throws Exception {
    MimeTypes mimeRegistry = tikaConfig.getMimeRepository();
    File file = new File(filename);
    System.out.println("Examining: [" + filename + "]");
    metadata.set(Metadata.RESOURCE_NAME_KEY, filename);
    System.out.println("The MIME type (based on filename) is: [" + mimeRegistry.detect(null, metadata) + "]");
    // Use a separate stream for magic-byte detection and close it when done
    try (InputStream magicStream = TikaInputStream.get(file)) {
        System.out.println("The MIME type (based on MAGIC) is: [" + mimeRegistry.detect(magicStream, metadata) + "]");
    }
    try (InputStream stream = TikaInputStream.get(file)) {
        Detector detector = tikaConfig.getDetector();
        // Detect once and reuse the result when telling the parser the content type
        MediaType type = detector.detect(stream, metadata);
        System.out.println("The MIME type (based on the Detector interface) is: [" + type + "]");
        LanguageDetector langDetector = new OptimaizeLangDetector().loadModels();
        LanguageResult lang = langDetector.detect(FileUtils.readFileToString(file, UTF_8));
        System.out.println("The language of this content is: [" + lang.getLanguage() + "]");
        // Get a non-detecting parser that handles all the types it can
        Parser parser = tikaConfig.getParser();
        // Tell it what we think the content is
        metadata.set(Metadata.CONTENT_TYPE, type.toString());
        // Have the file parsed to get the content and metadata
        ContentHandler handler = new BodyContentHandler();
        parser.parse(stream, handler, metadata, new ParseContext());
        return handler.toString();
    }
}
Also used: LanguageDetector (org.apache.tika.language.detect.LanguageDetector), BodyContentHandler (org.apache.tika.sax.BodyContentHandler), Detector (org.apache.tika.detect.Detector), OptimaizeLangDetector (org.apache.tika.langdetect.OptimaizeLangDetector), LanguageResult (org.apache.tika.language.detect.LanguageResult), TikaInputStream (org.apache.tika.io.TikaInputStream), InputStream (java.io.InputStream), ParseContext (org.apache.tika.parser.ParseContext), MediaType (org.apache.tika.mime.MediaType), MimeTypes (org.apache.tika.mime.MimeTypes), File (java.io.File), ContentHandler (org.xml.sax.ContentHandler), Parser (org.apache.tika.parser.Parser), AutoDetectParser (org.apache.tika.parser.AutoDetectParser)
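
Example 7 hands the same stream first to detector.detect(...) and then to parser.parse(...). This works because Tika's Detector contract expects a stream that supports mark/reset, so detection can look at the leading bytes and rewind. The following is a minimal JDK-only sketch of that pattern, not Tika's implementation; the class MarkResetSketch and the helpers peek and readAll are hypothetical names introduced for illustration.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class MarkResetSketch {

    // Hypothetical helper: reads up to `length` leading bytes, then rewinds the stream.
    static byte[] peek(InputStream stream, int length) throws IOException {
        if (!stream.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        stream.mark(length);
        try {
            byte[] buffer = new byte[length];
            int total = 0;
            while (total < length) {
                int read = stream.read(buffer, total, length - total);
                if (read == -1) {
                    break; // stream is shorter than the requested prefix
                }
                total += read;
            }
            return Arrays.copyOf(buffer, total);
        } finally {
            // Restore the position so the next consumer sees the stream from the start
            stream.reset();
        }
    }

    // Hypothetical helper: drains the remaining bytes into a UTF-8 string.
    static String readAll(InputStream stream) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[1024];
        int read;
        while ((read = stream.read(chunk)) != -1) {
            out.write(chunk, 0, read);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "%PDF-1.7 sample".getBytes(StandardCharsets.UTF_8);
        try (InputStream stream = new BufferedInputStream(new ByteArrayInputStream(data))) {
            // "Detection" step: inspect the first four bytes only
            String magic = new String(peek(stream, 4), StandardCharsets.US_ASCII);
            System.out.println("Leading bytes: " + magic);
            // "Parsing" step: the full content is still available after the peek
            System.out.println("Full content: " + readAll(stream));
        }
    }
}
```

A plain FileInputStream does not support mark/reset, which is one reason to wrap the file in a buffered or otherwise mark-capable stream before handing it to a detector.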

Aggregations

OptimaizeLangDetector (org.apache.tika.langdetect.OptimaizeLangDetector): 7
LanguageResult (org.apache.tika.language.detect.LanguageResult): 7
LanguageDetector (org.apache.tika.language.detect.LanguageDetector): 4
Consumes (javax.ws.rs.Consumes): 3
POST (javax.ws.rs.POST): 3
PUT (javax.ws.rs.PUT): 3
Path (javax.ws.rs.Path): 3
Produces (javax.ws.rs.Produces): 3
File (java.io.File): 1
InputStream (java.io.InputStream): 1
Detector (org.apache.tika.detect.Detector): 1
TikaException (org.apache.tika.exception.TikaException): 1
TikaInputStream (org.apache.tika.io.TikaInputStream): 1
LanguageWriter (org.apache.tika.language.detect.LanguageWriter): 1
MediaType (org.apache.tika.mime.MediaType): 1
MimeTypes (org.apache.tika.mime.MimeTypes): 1
AutoDetectParser (org.apache.tika.parser.AutoDetectParser): 1
ParseContext (org.apache.tika.parser.ParseContext): 1
Parser (org.apache.tika.parser.Parser): 1
BodyContentHandler (org.apache.tika.sax.BodyContentHandler): 1