Example 16 with LanguageResult

Use of org.apache.tika.language.detect.LanguageResult in the Apache Tika project.

In class MyFirstTika, method parseUsingComponents.

public static String parseUsingComponents(String filename, TikaConfig tikaConfig, Metadata metadata) throws Exception {
    MimeTypes mimeRegistry = tikaConfig.getMimeRepository();
    System.out.println("Examining: [" + filename + "]");
    metadata.set(Metadata.RESOURCE_NAME_KEY, filename);
    System.out.println("The MIME type (based on filename) is: [" + mimeRegistry.detect(null, metadata) + "]");
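    // Open a short-lived stream for magic-based detection and close it when done
    try (InputStream magicStream = TikaInputStream.get(new File(filename))) {
        System.out.println("The MIME type (based on MAGIC) is: [" + mimeRegistry.detect(magicStream, metadata) + "]");
    }
    // Fresh stream for Detector-based detection and for parsing
    InputStream stream = TikaInputStream.get(new File(filename));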
    Detector detector = tikaConfig.getDetector();
    System.out.println("The MIME type (based on the Detector interface) is: [" + detector.detect(stream, metadata) + "]");
    LanguageDetector langDetector = new OptimaizeLangDetector().loadModels();
    LanguageResult lang = langDetector.detect(FileUtils.readFileToString(new File(filename), UTF_8));
    System.out.println("The language of this content is: [" + lang.getLanguage() + "]");
    // Get a non-detecting parser that handles all the types it can
    Parser parser = tikaConfig.getParser();
    // Tell it what we think the content is
    MediaType type = detector.detect(stream, metadata);
    metadata.set(Metadata.CONTENT_TYPE, type.toString());
    // Have the file parsed to get the content and metadata
    ContentHandler handler = new BodyContentHandler();
    parser.parse(stream, handler, metadata, new ParseContext());
    return handler.toString();
}
Also used : LanguageDetector(org.apache.tika.language.detect.LanguageDetector) BodyContentHandler(org.apache.tika.sax.BodyContentHandler) Detector(org.apache.tika.detect.Detector) OptimaizeLangDetector(org.apache.tika.langdetect.OptimaizeLangDetector) LanguageResult(org.apache.tika.language.detect.LanguageResult) TikaInputStream(org.apache.tika.io.TikaInputStream) InputStream(java.io.InputStream) ParseContext(org.apache.tika.parser.ParseContext) MediaType(org.apache.tika.mime.MediaType) MimeTypes(org.apache.tika.mime.MimeTypes) File(java.io.File) ContentHandler(org.xml.sax.ContentHandler) Parser(org.apache.tika.parser.Parser) AutoDetectParser(org.apache.tika.parser.AutoDetectParser)
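
The method above calls detector.detect(stream, metadata) and then hands the same stream to parser.parse(...). That only works if detection leaves the stream position where it found it, which is why a mark-capable stream is used. The following is a minimal sketch of that peek-and-rewind idea using only java.io. The class MarkResetSketch, its methods, and the "%PDF" check are hypothetical illustrations, not Tika code.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class MarkResetSketch {

    // Hypothetical magic-byte check: reads the first bytes, then rewinds the stream.
    public static boolean startsWith(InputStream in, byte[] magic) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        in.mark(magic.length);
        try {
            byte[] head = new byte[magic.length];
            int n = 0;
            while (n < head.length) {
                int r = in.read(head, n, head.length - n);
                if (r < 0) {
                    break; // stream ended before the full prefix was read
                }
                n += r;
            }
            return n == magic.length && Arrays.equals(head, magic);
        } finally {
            in.reset(); // restore the position so a later consumer sees every byte
        }
    }

    // Drains the stream from its current position.
    public static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        int r;
        while ((r = in.read(buf)) != -1) {
            out.write(buf, 0, r);
        }
        return out.toByteArray();
    }

    // Detects on the stream, then reads the same stream in full.
    public static String demo() {
        byte[] data = "%PDF-1.4 body".getBytes(StandardCharsets.US_ASCII);
        try (InputStream in = new BufferedInputStream(new ByteArrayInputStream(data))) {
            boolean pdf = startsWith(in, "%PDF".getBytes(StandardCharsets.US_ASCII));
            String content = new String(readAll(in), StandardCharsets.US_ASCII);
            return pdf + ":" + content;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Because the check rewinds in a finally block, the content read afterwards still includes the bytes that were inspected during detection.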

Example 17 with LanguageResult

Use of org.apache.tika.language.detect.LanguageResult in the Apache Tika project.

In class TextLangDetector, method detectAll.

@Override
public List<LanguageResult> detectAll() {
    List<LanguageResult> result = new ArrayList<>();
    String language = detect(writer.toString());
    if (language != null) {
        result.add(new LanguageResult(language, LanguageConfidence.MEDIUM, 1));
    } else {
        // No language detected: language is null here, so the single entry carries NONE confidence
        result.add(new LanguageResult(language, LanguageConfidence.NONE, 0));
    }
    return result;
}
Also used : LanguageResult(org.apache.tika.language.detect.LanguageResult)
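
detectAll() above always returns exactly one entry, and that entry may be a placeholder with NONE confidence, so callers need to check it before using the language code. The sketch below shows one way a caller might do that. It uses stand-in types so that it runs without Tika on the classpath: Confidence, Result, and languageOrDefault are hypothetical names, and the stand-in isUnknown() assumes that "unknown" means NONE confidence.

```java
import java.util.ArrayList;
import java.util.List;

public class DetectAllSketch {

    // Hypothetical stand-in for LanguageConfidence; only the two levels used above.
    public enum Confidence { NONE, MEDIUM }

    // Hypothetical stand-in for LanguageResult: language code (may be null), confidence, raw score.
    public static final class Result {
        final String language;
        final Confidence confidence;
        final float rawScore;

        Result(String language, Confidence confidence, float rawScore) {
            this.language = language;
            this.confidence = confidence;
            this.rawScore = rawScore;
        }

        // Assumption: a result counts as unknown when its confidence is NONE.
        boolean isUnknown() {
            return confidence == Confidence.NONE;
        }
    }

    // Mirrors the branch structure of detectAll(): one entry, real or placeholder.
    public static List<Result> detectAll(String detected) {
        List<Result> result = new ArrayList<>();
        if (detected != null) {
            result.add(new Result(detected, Confidence.MEDIUM, 1));
        } else {
            result.add(new Result(null, Confidence.NONE, 0));
        }
        return result;
    }

    // Caller-side guard: returns the detected code, or the fallback for a placeholder result.
    public static String languageOrDefault(List<Result> results, String fallback) {
        if (results.isEmpty() || results.get(0).isUnknown()) {
            return fallback;
        }
        return results.get(0).language;
    }

    public static void main(String[] args) {
        System.out.println(languageOrDefault(detectAll("en"), "und"));
        System.out.println(languageOrDefault(detectAll(null), "und"));
    }
}
```

Checking the confidence rather than the list size matters here, because the list is never empty even when nothing was detected.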

Example 18 with LanguageResult

Use of org.apache.tika.language.detect.LanguageResult in the Apache Tika project.

In class TranslateResource, method autoTranslate.

@PUT
@POST
@Path("/all/{translator}/{dest}")
@Consumes("*/*")
@Produces("text/plain")
public String autoTranslate(final InputStream is, @PathParam("translator") String translator, @PathParam("dest") String dLang) throws TikaException, IOException {
    final String content = IOUtils.toString(is, UTF_8);
    LanguageResult language = new OptimaizeLangDetector().loadModels().detect(content);
    if (language.isUnknown()) {
        throw new TikaException("Unable to detect language to use for translation of text");
    }
    String sLang = language.getLanguage();
    LOG.info("LanguageIdentifier: detected source lang: [{}]", sLang);
    return doTranslate(content, translator, sLang, dLang);
}
Also used : LanguageResult(org.apache.tika.language.detect.LanguageResult) TikaException(org.apache.tika.exception.TikaException) OptimaizeLangDetector(org.apache.tika.langdetect.OptimaizeLangDetector) Path(javax.ws.rs.Path) POST(javax.ws.rs.POST) Consumes(javax.ws.rs.Consumes) Produces(javax.ws.rs.Produces) PUT(javax.ws.rs.PUT)

Example 19 with LanguageResult

Use of org.apache.tika.language.detect.LanguageResult in the Apache Tika project.

In class LanguageResource, method detect.

@PUT
@POST
@Path("/stream")
@Consumes("*/*")
@Produces("text/plain")
public String detect(final InputStream is) throws IOException {
    String fileTxt = IOUtils.toString(is, UTF_8);
    LanguageResult language = new OptimaizeLangDetector().loadModels().detect(fileTxt);
    String detectedLang = language.getLanguage();
    LOG.info("Detecting language for incoming resource: [{}]", detectedLang);
    return detectedLang;
}
Also used : LanguageResult(org.apache.tika.language.detect.LanguageResult) OptimaizeLangDetector(org.apache.tika.langdetect.OptimaizeLangDetector) Path(javax.ws.rs.Path) POST(javax.ws.rs.POST) Consumes(javax.ws.rs.Consumes) Produces(javax.ws.rs.Produces) PUT(javax.ws.rs.PUT)

Example 20 with LanguageResult

Use of org.apache.tika.language.detect.LanguageResult in the Apache Tika project.

In class LanguageResource, method detect.

@PUT
@POST
@Path("/string")
@Consumes("*/*")
@Produces("text/plain")
public String detect(final String string) throws IOException {
    LanguageResult language = new OptimaizeLangDetector().loadModels().detect(string);
    String detectedLang = language.getLanguage();
    LOG.info("Detecting language for incoming resource: [{}]", detectedLang);
    return detectedLang;
}
Also used : LanguageResult(org.apache.tika.language.detect.LanguageResult) OptimaizeLangDetector(org.apache.tika.langdetect.OptimaizeLangDetector) Path(javax.ws.rs.Path) POST(javax.ws.rs.POST) Consumes(javax.ws.rs.Consumes) Produces(javax.ws.rs.Produces) PUT(javax.ws.rs.PUT)

Aggregations

LanguageResult (org.apache.tika.language.detect.LanguageResult): 20
LanguageDetector (org.apache.tika.language.detect.LanguageDetector): 10
OptimaizeLangDetector (org.apache.tika.langdetect.OptimaizeLangDetector): 7
LanguageWriter (org.apache.tika.language.detect.LanguageWriter): 7
Test (org.junit.Test): 6
Consumes (javax.ws.rs.Consumes): 3
POST (javax.ws.rs.POST): 3
PUT (javax.ws.rs.PUT): 3
Path (javax.ws.rs.Path): 3
Produces (javax.ws.rs.Produces): 3
ArrayList (java.util.ArrayList): 2
LanguageHandler (org.apache.tika.language.detect.LanguageHandler): 2
AutoDetectParser (org.apache.tika.parser.AutoDetectParser): 2
ParseContext (org.apache.tika.parser.ParseContext): 2
ContentHandler (org.xml.sax.ContentHandler): 2
DetectedLanguage (com.optimaize.langdetect.DetectedLanguage): 1
File (java.io.File): 1
IOException (java.io.IOException): 1
InputStream (java.io.InputStream): 1
Detector (org.apache.tika.detect.Detector): 1