Example 1 with Conll03NameSampleStream

Use of opennlp.tools.formats.Conll03NameSampleStream in project epadd by ePADD.

From the class SequenceModelTest, method testCONLL.

// we are missing F.C's like F.C. La Valletta
/**
 * Tested on 28th Jan. 2016 on what is believed to be the testa.dat file of original CONLL.
 * I procured this data-set from a prof's (UMass Prof., don't remember the name) home page where he provided the test files for a homework, guess who topped the assignment :)
 * (So, don't use this data to report results at any serious venue)
 * The results on multi-word names are as follows.
 * Note that the test considered only PERSON, LOCATION and ORG. It also does not distinguish between the types, because the type assigned by the Sequence Labeler is almost always right; importantly, this avoids any dispute over the mapping from fine-grained types to coarse types.
 *  -------------
 *  Found: 8861 -- Total: 7781 -- Correct: 6675
 *  Precision: 0.75330096
 *  Recall: 0.8578589
 *  F1: 0.80218726
 *  ------------
 * I went through 2691 sentences of which only 200 had any unrecognised entities and identified various sources of error.
 * The sources of missing names are listed below, in approximately decreasing order of their contribution. I have put some examples with each source; the example phrases are recognized as one chunk with a type.
 * Obviously, this list is not exhaustive, USE IT WITH CAUTION!
 *  1. Bad segmentation -- which is minor for ePADD and depends on training data and principles.
 *     For example: "Overseas Development Minister <PERSON>Lynda Chalker</PERSON>", "Czech <PERSON>Daniel Vacek</PERSON>", "Frenchman <PERSON>Cedric Pioline</PERSON>"
 *     "President <PERSON>Nelson Mandela</PERSON>","<BANK>Reserve Bank of India</BANK> Governor <PERSON>Chakravarty Rangarajan</PERSON>"
 *     "Third-seeded <PERSON>Wayne Ferreira</PERSON>",
 *     Hong Kong Newsroom -- we got only Hong Kong, <BANK>Hong Kong Interbank</BANK> Offered Rate, Privately-owned <BANK>Bank Duta</BANK>
 *     [SERIOUS]
 *  2. Bad training data -- since our training data (DBpedia instances) contain phrases like "of Romania" a lot
 *     Ex: <PERSON>Yayuk Basuki</PERSON> of Indonesia, <PERSON>Karim Alami</PERSON> of Morocco
 *     This also leads to errors such as National Bank of Holland being segmented as National Bank
 *     [SERIOUS]
 *  3. Some unknown names, mostly personal -- we see very weird names in CONLL; Hopefully, we can avoid this problem in ePADD by considering the address book of the archive.
 *     Ex: NOVYE ATAGI, Hans-Otto Sieg, NS Kampfruf, Marie-Jose Perec, Billy Mayfair--Paul Goydos--Hidemichi Tanaki
 *     we miss many (almost all) names of the form "M. Dowman" because of uncommon or unknown last name.
 *  4. Bad segmentation due to limitations of CIC
 *     Ex: Hassan al-Turabi, National Democratic party, Department of Humanitarian affairs, Reserve bank of India, Saint of the Gutters, Queen of the South, Queen's Park
 *  5. Very Long entities -- we refrain from seq. labelling if the #tokens>7
 *     Ex: National Socialist German Workers ' Party Foreign Organisation
 *  6. We are missing OCEANs?!
 *     Ex: Atlantic Ocean, Indian Ocean
 *  7. Bad segments -- why are some segments starting with weird chars like '&'
 *     Ex: Goldman Sachs & Co Wertpapier GmbH -> {& Co Wertpapier GmbH, Goldman Sachs}
 *  8. We are missing Times of London?! We get nothing that contains "Newsroom" -- "Amsterdam Newsroom", "Hong Kong News Room"
 *     Why are we getting "Students of South Korea" instead of "South Korea"?
 *
 * 1/50th on only MWs
 * 13 Feb 13:24:54 BMMModel INFO  - -------------
 * 13 Feb 13:24:54 BMMModel INFO  - Found: 4238 -- Total: 4236 -- Correct: 3242 -- Missed due to wrong type: 358
 * 13 Feb 13:24:54 BMMModel INFO  - Precision: 0.7649835
 * 13 Feb 13:24:54 BMMModel INFO  - Recall: 0.7653447
 * 13 Feb 13:24:54 BMMModel INFO  - F1: 0.765164
 * 13 Feb 13:24:54 BMMModel INFO  - ------------
 *
 * Best performance on testa with [ignore segmentation] and single word with CONLL data is
 * 25 Sep 13:27:03 SequenceModel INFO  - -------------
 * 25 Sep 13:27:03 SequenceModel INFO  - Found: 4117 -- Total: 4236 -- Correct: 3368 -- Missed due to wrong type: 266
 * 25 Sep 13:27:03 SequenceModel INFO  - Precision: 0.8180714
 * 25 Sep 13:27:03 SequenceModel INFO  - Recall: 0.7950897
 * 25 Sep 13:27:03 SequenceModel INFO  - F1: 0.80641687
 * 25 Sep 13:27:03 SequenceModel INFO  - ------------
 *
 * on testa, *not* ignoring segmentation (exact match), any number of words
 * 25 Sep 17:23:14 SequenceModel INFO  - -------------
 * 25 Sep 17:23:14 SequenceModel INFO  - Found: 6006 -- Total: 7219 -- Correct: 4245 -- Missed due to wrong type: 605
 * 25 Sep 17:23:14 SequenceModel INFO  - Precision: 0.7067932
 * 25 Sep 17:23:14 SequenceModel INFO  - Recall: 0.5880316
 * 25 Sep 17:23:14 SequenceModel INFO  - F1: 0.6419659
 * 25 Sep 17:23:14 SequenceModel INFO  - ------------
 *
 * on testa, exact matches, multi-word names
 * 25 Sep 17:28:04 SequenceModel INFO  - -------------
 * 25 Sep 17:28:04 SequenceModel INFO  - Found: 4117 -- Total: 4236 -- Correct: 3096 -- Missed due to wrong type: 183
 * 25 Sep 17:28:04 SequenceModel INFO  - Precision: 0.7520039
 * 25 Sep 17:28:04 SequenceModel INFO  - Recall: 0.7308782
 * 25 Sep 17:28:04 SequenceModel INFO  - F1: 0.74129057
 * 25 Sep 17:28:04 SequenceModel INFO  - ------------
 *
 * With a model that is not trained on CONLL lists
 * On testa, ignoring segmentation, any number of words.
 * 25 Sep 19:22:26 SequenceModel INFO  - -------------
 * 25 Sep 19:22:26 SequenceModel INFO  - Found: 6129 -- Total: 7219 -- Correct: 4725 -- Missed due to wrong type: 964
 * 25 Sep 19:22:26 SequenceModel INFO  - Precision: 0.7709251
 * 25 Sep 19:22:26 SequenceModel INFO  - Recall: 0.6545228
 * 25 Sep 19:22:26 SequenceModel INFO  - F1: 0.7079712
 * 25 Sep 19:22:26 SequenceModel INFO  - ------------
 *
 * testa -- model trained on CONLL, ignore segmentation, any phrase
 * 26 Sep 20:23:58 SequenceModelTest INFO  - -------------
 * Found: 6391 -- Total: 7219 -- Correct: 4900 -- Missed due to wrong type: 987
 * Precision: 0.7667032
 * Recall: 0.67876434
 * F1: 0.7200588
 * ------------
 *
 * testb -- model trained on CONLL, ignore segmentation, any phrase
 * 26 Sep 20:24:01 SequenceModelTest INFO  - -------------
 * Found: 2198 -- Total: 2339 -- Correct: 1597 -- Missed due to wrong type: 425
 * Precision: 0.7265696
 * Recall: 0.68277043
 * F1: 0.7039894
 * ------------
 */
public static PerfStats testCONLL(SequenceModel seqModel, boolean verbose, ParamsCONLL params) {
    PerfStats stats = new PerfStats();
    try {
        // when true, only multi-word names are considered
        boolean onlyMW = params.onlyMultiWord;
        // use ignoreSegmentation=true only together with onlyMW=true; other combinations are untested
        boolean ignoreSegmentation = params.ignoreSegmentation;
        String test = params.testType.toString();
        InputStream in = Config.getResourceAsStream("CONLL" + File.separator + "annotation" + File.separator + test + "spacesep.txt");
        // 7 == 0b0111: bit mask that selects PERSON, ORGANIZATION and LOCATION entities and leaves MISC out
        Conll03NameSampleStream sampleStream = new Conll03NameSampleStream(Conll03NameSampleStream.LANGUAGE.EN, in, 7);
        Set<String> correct = new LinkedHashSet<>(), found = new LinkedHashSet<>(), real = new LinkedHashSet<>(), wrongType = new LinkedHashSet<>();
        Multimap<String, String> matchMap = ArrayListMultimap.create();
        Map<String, String> foundTypes = new LinkedHashMap<>(), benchmarkTypes = new LinkedHashMap<>();
        NameSample sample = sampleStream.read();
        CICTokenizer tokenizer = new CICTokenizer();
        while (sample != null) {
            String[] words = sample.getSentence();
            // join the tokens with single spaces; unlike the substring trick, this is also safe for an empty sentence
            String sent = String.join(" ", words);
            Map<String, String> names = new LinkedHashMap<>();
            opennlp.tools.util.Span[] nspans = sample.getNames();
            for (opennlp.tools.util.Span nspan : nspans) {
                String n = "";
                for (int si = nspan.getStart(); si < nspan.getEnd(); si++) {
                    if (si < words.length - 1 && words[si + 1].equals("'s"))
                        n += words[si];
                    else
                        n += words[si] + " ";
                }
                if (n.endsWith(" "))
                    n = n.substring(0, n.length() - 1);
                if (!onlyMW || n.contains(" "))
                    names.put(n, nspan.getType());
            }
            Span[] chunks = seqModel.find(sent);
            Map<String, String> foundSample = new LinkedHashMap<>();
            if (chunks != null)
                for (Span chunk : chunks) {
                    String text = chunk.text;
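                    // primitive short keeps the comparisons below numeric and avoids boxing
                    short type = chunk.type;
                    // DISEASE, EVENT and AWARD chunks are skipped; every other type is mapped to person, location or organization
                    if (type == NEType.Type.DISEASE.getCode() || type == NEType.Type.EVENT.getCode() || type == NEType.Type.AWARD.getCode())
                        continue;
                    short coarseType = NEType.getCoarseType(type).getCode();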
                    String typeText;
                    if (coarseType == NEType.Type.PERSON.getCode())
                        typeText = "person";
                    else if (coarseType == NEType.Type.PLACE.getCode())
                        typeText = "location";
                    else
                        typeText = "organization";
                    double s = chunk.typeScore;
                    if (s > 0 && (!onlyMW || text.contains(" ")))
                        foundSample.put(text, typeText);
                }
            Set<String> foundNames = new LinkedHashSet<>();
            Map<String, String> localMatchMap = new LinkedHashMap<>();
            for (Map.Entry<String, String> entry : foundSample.entrySet()) {
                foundTypes.put(entry.getKey(), entry.getValue());
                boolean foundEntry = false;
                String foundType = null;
                for (String name : names.keySet()) {
                    String cname = EmailUtils.uncanonicaliseName(name).toLowerCase();
                    String ek = EmailUtils.uncanonicaliseName(entry.getKey()).toLowerCase();
                    if (cname.equals(ek) || (ignoreSegmentation && ((cname.startsWith(ek + " ") || cname.endsWith(" " + ek) || ek.startsWith(cname + " ") || ek.endsWith(" " + cname))))) {
                        foundEntry = true;
                        foundType = names.get(name);
                        matchMap.put(entry.getKey(), name);
                        localMatchMap.put(entry.getKey(), name);
                        break;
                    }
                }
                if (foundEntry) {
                    if (entry.getValue().equals(foundType)) {
                        foundNames.add(entry.getKey());
                        correct.add(entry.getKey());
                    } else {
                        wrongType.add(entry.getKey());
                    }
                }
            }
            if (verbose) {
                log.info("CIC tokens: " + tokenizer.tokenizeWithoutOffsets(sent));
                // log the contents of the array rather than its identity hash
                log.info(Arrays.toString(chunks));
                String fn = "Found names: ";
                for (String f : foundNames) fn += f + "[" + foundSample.get(f) + "] with " + localMatchMap.get(f) + "--";
                if (fn.endsWith("--"))
                    log.info(fn);
                String extr = "Extra names: ";
                for (String f : foundSample.keySet()) if (!localMatchMap.containsKey(f))
                    extr += f + "[" + foundSample.get(f) + "]--";
                if (extr.endsWith("--"))
                    log.info(extr);
                String miss = "Missing names: ";
                for (String name : names.keySet()) if (!localMatchMap.values().contains(name))
                    miss += name + "[" + names.get(name) + "]--";
                if (miss.endsWith("--"))
                    log.info(miss);
                String misAssign = "Mis-assigned Types: ";
                for (String f : foundSample.keySet()) if (localMatchMap.containsKey(f)) {
                    // matchMap.get(f) returns a Collection, which can never be a key of names,
                    // so look up the benchmark name matched in this sentence instead
                    String expected = names.get(localMatchMap.get(f));
                    if (expected != null && !expected.equals(foundSample.get(f)))
                        misAssign += f + "[" + foundSample.get(f) + "] Expected [" + expected + "]--";
                }
                if (misAssign.endsWith("--"))
                    log.info(misAssign);
                log.info(sent + "\n------------------");
            }
            for (String name : names.keySet()) benchmarkTypes.put(name, names.get(name));
            real.addAll(names.keySet());
            found.addAll(foundSample.keySet());
            sample = sampleStream.read();
        }
        float prec = (float) correct.size() / (float) found.size();
        float recall = (float) correct.size() / (float) real.size();
        if (verbose) {
            log.info("----Correct names----");
            for (String str : correct) log.info(str + " with " + new LinkedHashSet<>(matchMap.get(str)));
            log.info("----Missed names----");
            real.stream().filter(str -> !matchMap.values().contains(str)).forEach(log::info);
            log.info("---Extra names------");
            found.stream().filter(str -> !matchMap.keySet().contains(str)).forEach(log::info);
            log.info("---Assigned wrong type------");
            for (String str : wrongType) {
                Set<String> bMatches = new LinkedHashSet<>(matchMap.get(str));
                for (String bMatch : bMatches) {
                    String ft = foundTypes.get(str);
                    String bt = benchmarkTypes.get(bMatch);
                    if (!ft.equals(bt))
                        log.info(str + "[" + ft + "] expected " + bMatch + "[" + bt + "]");
                }
            }
        }
        stats.f1 = (2 * prec * recall / (prec + recall));
        stats.precision = prec;
        stats.recall = recall;
        stats.numFound = found.size();
        stats.numReal = real.size();
        stats.numCorrect = correct.size();
        stats.numWrongType = wrongType.size();
        log.info(stats.toString());
    } catch (IOException e) {
        // report through the same logger as the rest of the test output
        log.error("Could not read the CONLL test data", e);
    }
    return stats;
}
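
The name comparison in the inner loop of testCONLL accepts either an exact match or, when ignoreSegmentation is set, a match where one name is a whole-token prefix or suffix of the other. The following is a minimal standalone sketch of that rule only. The class and method names are hypothetical, and the EmailUtils.uncanonicaliseName step of the original is left out, so it is an illustration rather than the project's implementation.

```java
import java.util.Locale;

/** Sketch of the boundary-tolerant name comparison; all names here are hypothetical. */
final class LooseMatchSketch {

    /**
     * Returns true when the two names are equal ignoring case, or, if
     * ignoreSegmentation is set, when one name is a whole-token prefix
     * or suffix of the other.
     */
    static boolean matches(String gold, String predicted, boolean ignoreSegmentation) {
        String g = gold.toLowerCase(Locale.ROOT);
        String p = predicted.toLowerCase(Locale.ROOT);
        if (g.equals(p))
            return true;
        if (!ignoreSegmentation)
            return false;
        // the added space forces the overlap to end or start on a token boundary
        return g.startsWith(p + " ") || g.endsWith(" " + p)
                || p.startsWith(g + " ") || p.endsWith(" " + g);
    }

    public static void main(String[] args) {
        // a chunk that covers only the first two tokens of the benchmark name
        System.out.println(matches("Hong Kong Newsroom", "Hong Kong", true));   // true
        System.out.println(matches("Hong Kong Newsroom", "Hong Kong", false));  // false
        // a chunk from the middle of the name is not accepted
        System.out.println(matches("Reserve Bank of India", "Bank of", true));  // false
    }
}
```

Because only prefixes and suffixes are accepted, a chunk such as "National Bank" counts as a hit for "National Bank of Holland" under ignoreSegmentation, which is why the scores reported with that flag are higher than the exact-match ones.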
Also used: ArrayListMultimap (com.google.common.collect.ArrayListMultimap), Span (edu.stanford.muse.util.Span), Config (edu.stanford.muse.Config), java.util (java.util), GZIPInputStream (java.util.zip.GZIPInputStream), DecimalFormat (java.text.DecimalFormat), Test (org.junit.Test), Multimap (com.google.common.collect.Multimap), SequenceModel (edu.stanford.muse.ner.model.SequenceModel), Collectors (java.util.stream.Collectors), Pair (edu.stanford.muse.util.Pair), Stream (java.util.stream.Stream), java.io (java.io), NEType (edu.stanford.muse.ner.model.NEType), CICTokenizer (edu.stanford.muse.ner.tokenize.CICTokenizer), Conll03NameSampleStream (opennlp.tools.formats.Conll03NameSampleStream), Log (org.apache.commons.logging.Log), GZIPOutputStream (java.util.zip.GZIPOutputStream), NERModel (edu.stanford.muse.ner.model.NERModel), LogFactory (org.apache.commons.logging.LogFactory), EmailUtils (edu.stanford.muse.util.EmailUtils), Assert (org.junit.Assert), NameSample (opennlp.tools.namefind.NameSample)
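
The precision, recall and F1 figures quoted in the Javadoc of testCONLL follow directly from the three counts it logs (Found, Total, Correct). As a rough sketch, the arithmetic can be reproduced in isolation as below. The class and method names are hypothetical, and returning 0 for an empty denominator is a choice made for this sketch; the method above divides without such a guard.

```java
/** Sketch of the scoring arithmetic; all names here are hypothetical. */
final class PrfSketch {

    /** Fraction of found names that are correct; 0 when nothing was found. */
    static float precision(int correct, int found) {
        return found == 0 ? 0f : (float) correct / found;
    }

    /** Fraction of benchmark names that were correctly found; 0 when the benchmark is empty. */
    static float recall(int correct, int total) {
        return total == 0 ? 0f : (float) correct / total;
    }

    /** Harmonic mean of precision and recall; 0 when both are 0. */
    static float f1(float precision, float recall) {
        float sum = precision + recall;
        return sum == 0f ? 0f : 2 * precision * recall / sum;
    }

    public static void main(String[] args) {
        // counts from the first result block of the Javadoc:
        // Found: 8861 -- Total: 7781 -- Correct: 6675
        float p = precision(6675, 8861);
        float r = recall(6675, 7781);
        System.out.println("Precision: " + p);
        System.out.println("Recall: " + r);
        System.out.println("F1: " + f1(p, r));
    }
}
```

With these counts the sketch gives roughly 0.753, 0.858 and 0.802, in line with the first block of reported results.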
