
Example 1 with LanguageIdentifier

Use of zemberek.langid.LanguageIdentifier in project zemberek-nlp by ahmetaa.

In class LanguageIdServiceImpl, method detectFast.

@Override
public void detectFast(LanguageIdRequest request, StreamObserver<LanguageIdResponse> responseObserver) {
    // Choose the identifier according to the request's trGroup flag.
    LanguageIdentifier identifier = request.getTrGroup() ? languageIdentifierTr : languageIdentifier;
    // Fast identification, passing the request's maximum sample count.
    String id = identifier.identifyFast(request.getInput(), request.getMaxSampleCount());
    LanguageIdResponse.Builder builder = LanguageIdResponse.newBuilder().setLangId(id);
    if (request.getIncludeScores()) {
        // Attach the per-language scores only when the caller asked for them.
        List<LanguageIdentifier.IdResult> scores = identifier.getScoresFast(request.getInput(), request.getMaxSampleCount());
        for (LanguageIdentifier.IdResult item : scores) {
            builder.addIdResult(IdResult.newBuilder().setId(item.id).setScore(item.score).build());
        }
    }
    // Send the single response and complete the call.
    responseObserver.onNext(builder.build());
    responseObserver.onCompleted();
}
Also used : LanguageIdentifier(zemberek.langid.LanguageIdentifier)

Example 2 with LanguageIdentifier

Use of zemberek.langid.LanguageIdentifier in project zemberek-nlp by ahmetaa.

In class WordHistogram, method removeNonTurkish.

static List<String> removeNonTurkish(Path input) throws IOException {
    // Build an identifier from the library's internal models.
    LanguageIdentifier identifier = LanguageIdentifier.fromInternalModels();
    List<String> chunks = Files.readAllLines(input, StandardCharsets.UTF_8);
    // Keep only the lines identified as "tr" (fast mode, sample count 200).
    return chunks.stream().filter(s -> identifier.identifyFast(s, 200).equalsIgnoreCase("tr")).collect(Collectors.toList());
}
Also used : WebCorpus(zemberek.corpus.WebCorpus) Strings(zemberek.core.io.Strings) SentenceAnalysis(zemberek.morphology.analysis.SentenceAnalysis) WebDocument(zemberek.corpus.WebDocument) SentenceWordAnalysis(zemberek.morphology.analysis.SentenceWordAnalysis) ArrayList(java.util.ArrayList) HashSet(java.util.HashSet) Turkish(zemberek.core.turkish.Turkish) Token(zemberek.tokenization.Token) Charset(java.nio.charset.Charset) SingleAnalysis(zemberek.morphology.analysis.SingleAnalysis) PrimaryPos(zemberek.core.turkish.PrimaryPos) TurkishTokenizer(zemberek.tokenization.TurkishTokenizer) Path(java.nio.file.Path) Histogram(zemberek.core.collections.Histogram) SecondaryPos(zemberek.core.turkish.SecondaryPos) Files(java.nio.file.Files) TurkishMorphology(zemberek.morphology.TurkishMorphology) Set(java.util.Set) IOException(java.io.IOException) Deasciifier(zemberek.normalization.deasciifier.Deasciifier) Collectors(java.util.stream.Collectors) StandardCharsets(java.nio.charset.StandardCharsets) List(java.util.List) Paths(java.nio.file.Paths) TurkishSentenceExtractor(zemberek.tokenization.TurkishSentenceExtractor) LanguageIdentifier(zemberek.langid.LanguageIdentifier)
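
The filtering pattern in removeNonTurkish can be tried without loading any models by passing the detector in as a function. The following is only a sketch under that assumption: the class LanguageFilter, the method keepLanguage, and the toy detector are hypothetical names introduced here and are not part of zemberek-nlp. In real use the detector argument would presumably be something like s -> identifier.identifyFast(s, 200).

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

public class LanguageFilter {

    // Keeps only the lines whose detected language id equals `wanted`
    // (case-insensitive). `detector` is a hypothetical stand-in for a
    // real identifier call; it is an assumption of this sketch.
    public static List<String> keepLanguage(List<String> lines,
                                            Function<String, String> detector,
                                            String wanted) {
        return lines.stream()
                .filter(s -> wanted.equalsIgnoreCase(detector.apply(s)))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Toy detector for demonstration only; it is NOT a language model.
        Function<String, String> toy = s -> s.contains("bir") ? "tr" : "en";
        List<String> lines = Arrays.asList("bu bir deneme", "this is a test");
        System.out.println(keepLanguage(lines, toy, "tr"));
    }
}
```

Taking the detector as a parameter keeps the filter independent of how the identifier is constructed, so the same helper could be reused with either of the identification methods shown in these examples.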

Example 3 with LanguageIdentifier

Use of zemberek.langid.LanguageIdentifier in project zemberek-nlp by ahmetaa.

In class ProperNounLanguage, method main.

public static void main(String[] args) throws IOException {
    // Read the candidate proper nouns, one per line.
    List<String> candidates = Files.readAllLines(Paths.get("/home/ahmetaa/projects/zemberek-nlp/zemberek.proper.vocab"));
    List<String> potentiallyForeign = new ArrayList<>();
    // Build an identifier from the internal "tr_group" model group.
    LanguageIdentifier lid = LanguageIdentifier.fromInternalModelGroup("tr_group");
    for (String candidate : candidates) {
        String l = lid.identify(candidate);
        // Collect the candidates identified as English.
        if (l.equals("en")) {
            potentiallyForeign.add(candidate);
        }
    }
    // Write the collected candidates next to the input file.
    Files.write(Paths.get("/home/ahmetaa/projects/zemberek-nlp/zemberek.proper.vocab.en"), potentiallyForeign);
}
Also used : LanguageIdentifier(zemberek.langid.LanguageIdentifier) ArrayList(java.util.ArrayList)

Example 4 with LanguageIdentifier

Use of zemberek.langid.LanguageIdentifier in project zemberek-nlp by ahmetaa.

In class Trainer, method generateModelsToDir.

private void generateModelsToDir(File countDir, File modelDir, String[] languages, boolean compressed) throws IOException {
    // Build an identifier from the count files for the given languages.
    LanguageIdentifier identifier = LanguageIdentifier.generateFromCounts(countDir, languages);
    List<CharNgramLanguageModel> models = identifier.getModels();
    mkDir(modelDir);
    for (CharNgramLanguageModel model : models) {
        System.out.println("Generating model for:" + model.getId());
        // This method expects every model to be map-based; the cast fails otherwise.
        MapBasedCharNgramLanguageModel mbm = (MapBasedCharNgramLanguageModel) model;
        if (compressed) {
            // Compressed models are written as <id>.clm.
            File modelFile = new File(modelDir, model.getId() + ".clm");
            CompressedCharNgramModel.compress(mbm, modelFile);
        } else {
            // Uncompressed models are written as <id>.lm.
            File modelFile = new File(modelDir, model.getId() + ".lm");
            mbm.saveCustom(modelFile);
        }
    }
}
Also used : CharNgramLanguageModel(zemberek.langid.model.CharNgramLanguageModel) MapBasedCharNgramLanguageModel(zemberek.langid.model.MapBasedCharNgramLanguageModel) LanguageIdentifier(zemberek.langid.LanguageIdentifier) File(java.io.File)

Example 5 with LanguageIdentifier

Use of zemberek.langid.LanguageIdentifier in project zemberek-nlp by ahmetaa.

In class LanguageIdServiceImpl, method detect.

@Override
public void detect(LanguageIdRequest request, StreamObserver<LanguageIdResponse> responseObserver) {
    // Choose the identifier according to the request's trGroup flag.
    LanguageIdentifier identifier = request.getTrGroup() ? languageIdentifierTr : languageIdentifier;
    // Regular (non-fast) identification, passing the request's maximum sample count.
    String id = identifier.identify(request.getInput(), request.getMaxSampleCount());
    LanguageIdResponse.Builder builder = LanguageIdResponse.newBuilder().setLangId(id);
    if (request.getIncludeScores()) {
        // Attach the per-language scores only when the caller asked for them.
        List<LanguageIdentifier.IdResult> scores = identifier.getScores(request.getInput(), request.getMaxSampleCount());
        for (LanguageIdentifier.IdResult item : scores) {
            builder.addIdResult(IdResult.newBuilder().setId(item.id).setScore(item.score).build());
        }
    }
    // Send the single response and complete the call.
    responseObserver.onNext(builder.build());
    responseObserver.onCompleted();
}
Also used : LanguageIdentifier(zemberek.langid.LanguageIdentifier)

Aggregations

LanguageIdentifier (zemberek.langid.LanguageIdentifier): 5
ArrayList (java.util.ArrayList): 2
File (java.io.File): 1
IOException (java.io.IOException): 1
Charset (java.nio.charset.Charset): 1
StandardCharsets (java.nio.charset.StandardCharsets): 1
Files (java.nio.file.Files): 1
Path (java.nio.file.Path): 1
Paths (java.nio.file.Paths): 1
HashSet (java.util.HashSet): 1
List (java.util.List): 1
Set (java.util.Set): 1
Collectors (java.util.stream.Collectors): 1
Histogram (zemberek.core.collections.Histogram): 1
Strings (zemberek.core.io.Strings): 1
PrimaryPos (zemberek.core.turkish.PrimaryPos): 1
SecondaryPos (zemberek.core.turkish.SecondaryPos): 1
Turkish (zemberek.core.turkish.Turkish): 1
WebCorpus (zemberek.corpus.WebCorpus): 1
WebDocument (zemberek.corpus.WebDocument): 1