
Example 11 with SimpleTextWriter

Use of zemberek.core.io.SimpleTextWriter in project zemberek-nlp by ahmetaa.

Class OflazerAnalyzerRunner, method prepareForAnalysis.

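private void prepareForAnalysis(File f, List<String> sentences) throws IOException {
    // try-with-resources closes the writer even if a write fails.
    try (SimpleTextWriter stw = SimpleTextWriter.keepOpenUTF8Writer(f)) {
        for (String sentence : sentences) {
            // One token per line; p is a field that is not shown in this excerpt.
            for (String s : Splitter.on(p).omitEmptyStrings().split(sentence)) {
                stw.writeLine(s);
            }
            // "#" marks the end of each sentence.
            stw.writeLine("#");
        }
    }
}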
Also used: SimpleTextWriter (zemberek.core.io.SimpleTextWriter)
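
The file layout that prepareForAnalysis produces (one token per line, with a "#" line closing each sentence) can be sketched with the JDK alone. This is a minimal illustration, not the project's code: the class name TokenPerLineSketch, the helper toLines, and the whitespace pattern standing in for the unseen field p are all assumptions.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

class TokenPerLineSketch {

    // Hypothetical stand-in for the pattern `p`, which the excerpt does not show.
    private static final String WHITESPACE = "\\s+";

    /** Returns one token per line, with a "#" line closing each sentence. */
    static List<String> toLines(List<String> sentences) {
        List<String> lines = new ArrayList<>();
        for (String sentence : sentences) {
            for (String token : sentence.split(WHITESPACE)) {
                if (!token.isEmpty()) { // mirrors omitEmptyStrings()
                    lines.add(token);
                }
            }
            lines.add("#");
        }
        return lines;
    }

    /** Writes the lines to a UTF-8 file; Files.write closes the file itself. */
    static void write(Path out, List<String> sentences) throws IOException {
        Files.write(out, toLines(sentences), StandardCharsets.UTF_8);
    }
}
```

Keeping the line-building step separate from the file write makes the format easy to check without touching the file system.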

Example 12 with SimpleTextWriter

Use of zemberek.core.io.SimpleTextWriter in project zemberek-nlp by ahmetaa.

Class DisambiguatorPreprocessor, method processOflazerAnalyzerOutputYuret.

public void processOflazerAnalyzerOutputYuret(File oflazerResult, File out) throws IOException {
    SimpleTextWriter yuretFileWriter = SimpleTextWriter.keepOpenWriter(new FileOutputStream(out), "ISO-8859-9");
    yuretFileWriter.writeLine("<DOC>\t<DOC>");
    yuretFileWriter.writeLine();
    LineIterator li = new SimpleTextReader(oflazerResult, "UTF-8").getLineIterator();
    boolean sentenceStarted = false;
    List<String> parses = new ArrayList<>();
    while (li.hasNext()) {
        String line = li.next().trim().replaceAll("AorPart", "PresPart");
        String word = Strings.subStringUntilFirst(line, "\t").trim();
        if (line.length() == 0 && !sentenceStarted) {
            continue;
        }
        if (line.length() == 0 && parses.size() > 0) {
            yuretFileWriter.writeLines(parses);
            yuretFileWriter.writeLine();
            yuretFileWriter.writeLine();
            parses = new ArrayList<>();
        }
        if (line.length() > 0) {
            if (parses.size() == 0) {
                if (!sentenceStarted) {
                    yuretFileWriter.writeLine("<S>\t<S>");
                    yuretFileWriter.writeLine();
                }
                sentenceStarted = true;
            }
            if (punctuations.contains(word)) {
                // The analyzer used here does not parse punctuation, so it is handled manually.
                parses.add(word + "\t" + word + "\t+Punc");
            } else if (!line.endsWith("?")) {
                parses.add(line);
            } else if (!word.equals("#")) {
                String inferred = inferUnknownWordParse(word);
                System.out.println("Bad word: [" + line + "] inferred to [" + inferred + "]");
                parses.add(inferred);
            }
        }
        if (word.equals("#")) {
            sentenceStarted = false;
            yuretFileWriter.writeLine("</S>\t</S>\n");
            parses = new ArrayList<>();
        }
    }
    yuretFileWriter.writeLine("</DOC>\t</DOC>");
    yuretFileWriter.close();
}
Also used: FileOutputStream (java.io.FileOutputStream), SimpleTextReader (zemberek.core.io.SimpleTextReader), ArrayList (java.util.ArrayList), LineIterator (zemberek.core.io.LineIterator), SimpleTextWriter (zemberek.core.io.SimpleTextWriter)
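
The sentence-framing part of the method above can be hard to follow among the file handling. The sketch below isolates it under simplifying assumptions: it works on an in-memory list, skips the blank separator lines, and leaves out the punctuation handling and unknown-word inference. The class name SentenceFramingSketch and the method frame are hypothetical and not part of zemberek-nlp.

```java
import java.util.ArrayList;
import java.util.List;

class SentenceFramingSketch {

    /** Wraps "#"-terminated analyzer lines in <DOC> and <S> marker lines. */
    static List<String> frame(List<String> analyzerLines) {
        List<String> out = new ArrayList<>();
        out.add("<DOC>\t<DOC>");
        boolean sentenceStarted = false;
        for (String raw : analyzerLines) {
            String line = raw.trim();
            if (line.isEmpty()) {
                continue; // blank separators carry no token
            }
            if (line.equals("#")) {
                // "#" closes the current sentence, if one is open.
                if (sentenceStarted) {
                    out.add("</S>\t</S>");
                    sentenceStarted = false;
                }
                continue;
            }
            if (!sentenceStarted) {
                // The first token after a boundary opens a new sentence.
                out.add("<S>\t<S>");
                sentenceStarted = true;
            }
            out.add(line);
        }
        out.add("</DOC>\t</DOC>");
        return out;
    }
}
```

Unlike the original method, this sketch emits a closing marker only when a sentence is actually open, which is a deliberate simplification rather than a description of the original behavior.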

Aggregations

SimpleTextWriter (zemberek.core.io.SimpleTextWriter): 12
File (java.io.File): 2
ArrayList (java.util.ArrayList): 2
Histogram (zemberek.core.collections.Histogram): 2
LineIterator (zemberek.core.io.LineIterator): 2
SimpleTextReader (zemberek.core.io.SimpleTextReader): 2
BufferedReader (java.io.BufferedReader): 1
FileInputStream (java.io.FileInputStream): 1
FileOutputStream (java.io.FileOutputStream): 1
InputStreamReader (java.io.InputStreamReader): 1
Collator (java.text.Collator): 1
LinkedHashSet (java.util.LinkedHashSet): 1