
Example 1 with NameSample

Use of NameSample in project epadd by ePADD.

The class SequenceModelTest, method testCONLL.

// we are missing F.C's like F.C. La Valletta
/**
 * Tested on 28th Jan. 2016 on what is believed to be the testa.dat file of the original CONLL.
 * I procured this data set from the home page of a professor (at UMass, I don't remember the name) who provided the test files for a homework; guess who topped the assignment :)
 * (So, don't use this data to report results at any serious venue.)
 * The results on multi-word names are as follows.
 * Note that the test only considered PERSON, LOCATION and ORG. Also, it does not distinguish between the types, because the type assigned by the Sequence Labeler is almost always right. Importantly, this also avoids any scuffle over the mapping from fine-grained types to coarse types.
 *  -------------
 *  Found: 8861 -- Total: 7781 -- Correct: 6675
 *  Precision: 0.75330096
 *  Recall: 0.8578589
 *  F1: 0.80218726
 *  ------------
 * I went through 2691 sentences, of which only 200 had any unrecognised entities, and identified various sources of error.
 * The sources of missing names are listed below in (approximately) decreasing order of their contribution, with some examples for each source. The example phrases are recognized as one chunk with a type.
 * Obviously, this list is not exhaustive, USE IT WITH CAUTION!
 *  1. Bad segmentation -- which is minor for ePADD and depends on training data and principles.
 *     For example: "Overseas Development Minister <PERSON>Lynda Chalker</PERSON>", "Czech <PERSON>Daniel Vacek</PERSON>", "Frenchman <PERSON>Cedric Pioline</PERSON>"
 *     "President <PERSON>Nelson Mandela</PERSON>", "<BANK>Reserve Bank of India</BANK> Governor <PERSON>Chakravarty Rangarajan</PERSON>"
 *     "Third-seeded <PERSON>Wayne Ferreira</PERSON>",
 *     Hong Kong Newsroom -- we got only Hong Kong, <BANK>Hong Kong Interbank</BANK> Offered Rate, Privately-owned <BANK>Bank Duta</BANK>
 *     [SERIOUS]
 *  2. Bad training data -- since our training data (DBpedia instances) contains phrases like "of Romania" a lot
 *     Ex: <PERSON>Yayuk Basuki</PERSON> of Indonesia, <PERSON>Karim Alami</PERSON> of Morocco
 *     This also leads to errors such as National Bank of Holland being segmented as National Bank
 *     [SERIOUS]
 *  3. Some unknown names, mostly personal -- we see very weird names in CONLL. Hopefully, we can avoid this problem in ePADD by considering the address book of the archive.
 *     Ex: NOVYE ATAGI, Hans-Otto Sieg, NS Kampfruf, Marie-Jose Perec, Billy Mayfair--Paul Goydos--Hidemichi Tanaki
 *     We miss many (almost all) names of the form "M. Dowman" because of an uncommon or unknown last name.
 *  4. Bad segmentation due to limitations of CIC
 *     Ex: Hassan al-Turabi, National Democratic party, Department of Humanitarian affairs, Reserve bank of India, Saint of the Gutters, Queen of the South, Queen's Park
 *  5. Very long entities -- we refrain from seq. labelling if #tokens > 7
 *     Ex: National Socialist German Workers ' Party Foreign Organisation
 *  6. We are missing OCEANs?!
 *     Ex: Atlantic Ocean, Indian Ocean
 *  7. Bad segments -- why are some segments starting with weird chars like '&'?
 *     Ex: Goldman Sachs & Co Wertpapier GmbH -> {& Co Wertpapier GmbH, Goldman Sachs}
 *  8. We are missing Times of London?! We get nothing that contains "Newsroom" -- "Amsterdam Newsroom", "Hong Kong News Room"
 *     Why are we getting "Students of South Korea" instead of "South Korea"?
 * 1/50th on only MWs
 * 13 Feb 13:24:54 BMMModel INFO  - -------------
 * 13 Feb 13:24:54 BMMModel INFO  - Found: 4238 -- Total: 4236 -- Correct: 3242 -- Missed due to wrong type: 358
 * 13 Feb 13:24:54 BMMModel INFO  - Precision: 0.7649835
 * 13 Feb 13:24:54 BMMModel INFO  - Recall: 0.7653447
 * 13 Feb 13:24:54 BMMModel INFO  - F1: 0.765164
 * 13 Feb 13:24:54 BMMModel INFO  - ------------
 * Best performance on testa with [ignore segmentation] and single word with CONLL data is
 * 25 Sep 13:27:03 SequenceModel INFO  - -------------
 * 25 Sep 13:27:03 SequenceModel INFO  - Found: 4117 -- Total: 4236 -- Correct: 3368 -- Missed due to wrong type: 266
 * 25 Sep 13:27:03 SequenceModel INFO  - Precision: 0.8180714
 * 25 Sep 13:27:03 SequenceModel INFO  - Recall: 0.7950897
 * 25 Sep 13:27:03 SequenceModel INFO  - F1: 0.80641687
 * 25 Sep 13:27:03 SequenceModel INFO  - ------------
 * on testa, *not* ignoring segmentation (exact match), any number of words
 * 25 Sep 17:23:14 SequenceModel INFO  - -------------
 * 25 Sep 17:23:14 SequenceModel INFO  - Found: 6006 -- Total: 7219 -- Correct: 4245 -- Missed due to wrong type: 605
 * 25 Sep 17:23:14 SequenceModel INFO  - Precision: 0.7067932
 * 25 Sep 17:23:14 SequenceModel INFO  - Recall: 0.5880316
 * 25 Sep 17:23:14 SequenceModel INFO  - F1: 0.6419659
 * 25 Sep 17:23:14 SequenceModel INFO  - ------------
 * on testa, exact matches, multi-word names
 * 25 Sep 17:28:04 SequenceModel INFO  - -------------
 * 25 Sep 17:28:04 SequenceModel INFO  - Found: 4117 -- Total: 4236 -- Correct: 3096 -- Missed due to wrong type: 183
 * 25 Sep 17:28:04 SequenceModel INFO  - Precision: 0.7520039
 * 25 Sep 17:28:04 SequenceModel INFO  - Recall: 0.7308782
 * 25 Sep 17:28:04 SequenceModel INFO  - F1: 0.74129057
 * 25 Sep 17:28:04 SequenceModel INFO  - ------------
 * With a model that is not trained on CONLL lists
 * On testa, ignoring segmentation, any number of words.
 * 25 Sep 19:22:26 SequenceModel INFO  - -------------
 * 25 Sep 19:22:26 SequenceModel INFO  - Found: 6129 -- Total: 7219 -- Correct: 4725 -- Missed due to wrong type: 964
 * 25 Sep 19:22:26 SequenceModel INFO  - Precision: 0.7709251
 * 25 Sep 19:22:26 SequenceModel INFO  - Recall: 0.6545228
 * 25 Sep 19:22:26 SequenceModel INFO  - F1: 0.7079712
 * 25 Sep 19:22:26 SequenceModel INFO  - ------------
 * testa -- model trained on CONLL, ignore segmentation, any phrase
 * 26 Sep 20:23:58 SequenceModelTest INFO  - -------------
 * Found: 6391 -- Total: 7219 -- Correct: 4900 -- Missed due to wrong type: 987
 * Precision: 0.7667032
 * Recall: 0.67876434
 * F1: 0.7200588
 * ------------
 * testb -- model trained on CONLL, ignore segmentation, any phrase
 * 26 Sep 20:24:01 SequenceModelTest INFO  - -------------
 * Found: 2198 -- Total: 2339 -- Correct: 1597 -- Missed due to wrong type: 425
 * Precision: 0.7265696
 * Recall: 0.68277043
 * F1: 0.7039894
 * ------------
 */
public static PerfStats testCONLL(SequenceModel seqModel, boolean verbose, ParamsCONLL params) {
    PerfStats stats = new PerfStats();
    try {
        // when true, only multi-word names are considered
        boolean onlyMW = params.onlyMultiWord;
        // use ignoreSegmentation=true only together with onlyMW=true; other combinations are untested
        boolean ignoreSegmentation = params.ignoreSegmentation;
        String test = params.testType.toString();
        InputStream in = Config.getResourceAsStream("CONLL" + File.separator + "annotation" + File.separator + test + "spacesep.txt");
        // 7 == 0111 in binary: PER, LOC, ORG
        Conll03NameSampleStream sampleStream = new Conll03NameSampleStream(Conll03NameSampleStream.LANGUAGE.EN, in, 7);
        Set<String> correct = new LinkedHashSet<>(), found = new LinkedHashSet<>(), real = new LinkedHashSet<>(), wrongType = new LinkedHashSet<>();
        // recognized chunk -> benchmark names it was matched against
        Multimap<String, String> matchMap = ArrayListMultimap.create();
        Map<String, String> foundTypes = new LinkedHashMap<>(), benchmarkTypes = new LinkedHashMap<>();
        NameSample sample = sampleStream.read();
        CICTokenizer tokenizer = new CICTokenizer();
        while (sample != null) {
            String[] words = sample.getSentence();
            String sent = String.join(" ", words);

            // benchmark names of this sentence: name -> type
            Map<String, String> names = new LinkedHashMap<>();
            // fully qualified, because Span alone refers to edu.stanford.muse.util.Span here
            opennlp.tools.util.Span[] nspans = sample.getNames();
            for (opennlp.tools.util.Span nspan : nspans) {
                String n = "";
                for (int si = nspan.getStart(); si < nspan.getEnd(); si++) {
                    // no space before a possessive "'s" token
                    if (si < words.length - 1 && words[si + 1].equals("'s"))
                        n += words[si];
                    else
                        n += words[si] + " ";
                }
                if (n.endsWith(" "))
                    n = n.substring(0, n.length() - 1);
                if (!onlyMW || n.contains(" "))
                    names.put(n, nspan.getType());
            }

            // names recognized by the model in this sentence: text -> coarse type
            Span[] chunks = seqModel.find(sent);
            Map<String, String> foundSample = new LinkedHashMap<>();
            if (chunks != null)
                for (Span chunk : chunks) {
                    String text = chunk.text;
                    Short type = chunk.type;
                    // these types have no CONLL counterpart
                    if (type == NEType.Type.DISEASE.getCode() || type == NEType.Type.EVENT.getCode() || type == NEType.Type.AWARD.getCode())
                        continue;
                    Short coarseType = NEType.getCoarseType(type).getCode();
                    String typeText;
                    if (coarseType == NEType.Type.PERSON.getCode())
                        typeText = "person";
                    else if (coarseType == NEType.Type.PLACE.getCode())
                        typeText = "location";
                    else
                        typeText = "organization";
                    double s = chunk.typeScore;
                    if (s > 0 && (!onlyMW || text.contains(" ")))
                        foundSample.put(text, typeText);
                }

            Set<String> foundNames = new LinkedHashSet<>();
            Map<String, String> localMatchMap = new LinkedHashMap<>();
            for (Map.Entry<String, String> entry : foundSample.entrySet()) {
                foundTypes.put(entry.getKey(), entry.getValue());
                boolean foundEntry = false;
                String foundType = null;
                for (String name : names.keySet()) {
                    String cname = EmailUtils.uncanonicaliseName(name).toLowerCase();
                    String ek = EmailUtils.uncanonicaliseName(entry.getKey()).toLowerCase();
                    // exact match, or (when ignoring segmentation) one phrase is a whole-word prefix or suffix of the other
                    if (cname.equals(ek) || (ignoreSegmentation && ((cname.startsWith(ek + " ") || cname.endsWith(" " + ek) || ek.startsWith(cname + " ") || ek.endsWith(" " + cname))))) {
                        foundEntry = true;
                        foundType = names.get(name);
                        matchMap.put(entry.getKey(), name);
                        localMatchMap.put(entry.getKey(), name);
                    }
                }
                if (foundEntry) {
                    // The bodies of both branches were lost in extraction; they are
                    // assumed from how correct, foundNames and wrongType are used below.
                    if (entry.getValue().equals(foundType)) {
                        foundNames.add(entry.getKey());
                        correct.add(entry.getKey());
                    } else {
                        wrongType.add(entry.getKey());
                    }
                }
            }

            if (verbose) {
                log.info("CIC tokens: " + tokenizer.tokenizeWithoutOffsets(sent));
                String fn = "Found names:";
                for (String f : foundNames) fn += f + "[" + foundSample.get(f) + "] with " + localMatchMap.get(f) + "--";
                if (fn.endsWith("--"))
                    log.info(fn);
                String extr = "Extra names: ";
                for (String f : foundSample.keySet()) if (!localMatchMap.containsKey(f))
                    extr += f + "[" + foundSample.get(f) + "]--";
                if (extr.endsWith("--"))
                    log.info(extr);
                String miss = "Missing names: ";
                for (String name : names.keySet()) if (!localMatchMap.values().contains(name))
                    miss += name + "[" + names.get(name) + "]--";
                if (miss.endsWith("--"))
                    log.info(miss);
                String misAssign = "Mis-assigned Types: ";
                for (String f : foundSample.keySet()) {
                    // matchMap.get(f) returns a Collection, which can never be a key of names,
                    // so the per-sentence match is looked up instead
                    String expected = names.get(localMatchMap.get(f));
                    if (expected != null && !expected.equals(foundSample.get(f)))
                        misAssign += f + "[" + foundSample.get(f) + "] Expected [" + expected + "]--";
                }
                if (misAssign.endsWith("--"))
                    log.info(misAssign);
                // the logged prefix was lost in extraction; the sentence is an assumption
                log.info(sent + "\n------------------");
            }
            for (String name : names.keySet()) benchmarkTypes.put(name, names.get(name));
            // assumed (lost in extraction): accumulate the per-sentence sets used for the metrics below
            real.addAll(names.keySet());
            found.addAll(foundSample.keySet());
            sample = sampleStream.read();
        }

        float prec = (float) correct.size() / (float) found.size();
        float recall = (float) correct.size() / (float) real.size();
        if (verbose) {
            log.info("----Correct names----");
            for (String str : correct) log.info(str + " with " + new LinkedHashSet<>(matchMap.get(str)));
            log.info("----Missed names----");
            real.stream().filter(str -> !matchMap.values().contains(str)).forEach(log::info);
            log.info("---Extra names------");
            found.stream().filter(str -> !matchMap.keySet().contains(str)).forEach(log::info);
            log.info("---Assigned wrong type------");
            for (String str : wrongType) {
                Set<String> bMatches = new LinkedHashSet<>(matchMap.get(str));
                for (String bMatch : bMatches) {
                    String ft = foundTypes.get(str);
                    String bt = benchmarkTypes.get(bMatch);
                    if (!ft.equals(bt))
                        log.info(str + "[" + ft + "] expected " + bMatch + "[" + bt + "]");
                }
            }
        }
        stats.f1 = (2 * prec * recall / (prec + recall));
        stats.precision = prec;
        stats.recall = recall;
        stats.numFound = found.size();
        stats.numReal = real.size();
        stats.numCorrect = correct.size();
        stats.numWrongType = wrongType.size();
    } catch (IOException e) {
        // handler body lost in extraction; logging the failure is an assumption
        log.warn("Could not read the CONLL test data", e);
    }
    return stats;
}
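
The way testCONLL rebuilds a benchmark name from the tokens of a span can be tried in isolation. The following is a minimal sketch, not part of ePADD: the class SpanJoinSketch and the method joinSpan are hypothetical names, and the span is passed as plain start and end indices (end exclusive) instead of an OpenNLP Span. It assumes that the possessive is a separate "'s" token, as the check in the method implies.

```java
public class SpanJoinSketch {

    /**
     * Joins words[start..end) with single spaces, but puts no space
     * before a following "'s" token, mirroring the loop in testCONLL.
     */
    static String joinSpan(String[] words, int start, int end) {
        StringBuilder sb = new StringBuilder();
        for (int i = start; i < end; i++) {
            sb.append(words[i]);
            boolean nextIsPossessive = i < words.length - 1 && words[i + 1].equals("'s");
            if (!nextIsPossessive)
                sb.append(' ');
        }
        // drop the single trailing space, if any
        if (sb.length() > 0 && sb.charAt(sb.length() - 1) == ' ')
            sb.setLength(sb.length() - 1);
        return sb.toString();
    }

    public static void main(String[] args) {
        String[] words = {"Queen", "'s", "Park", "won"};
        // the span covers the first three tokens
        System.out.println(joinSpan(words, 0, 3));
    }
}
```

With these tokens the possessive is glued back onto the preceding word, so the benchmark name has the same surface form as the running text that is handed to the model.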
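
The matching rule behind the "ignore segmentation" runs can be sketched on its own. This is a simplified stand-in, not the ePADD code: SegmentMatchSketch and matches are hypothetical names, and the call to EmailUtils.uncanonicaliseName is left out, so only lower-casing is applied before comparing.

```java
public class SegmentMatchSketch {

    /**
     * Returns true when the recognized phrase equals the benchmark name or,
     * if ignoreSegmentation is set, when one of the two is a whole-word
     * prefix or suffix of the other.
     */
    static boolean matches(String benchmark, String recognized, boolean ignoreSegmentation) {
        String b = benchmark.toLowerCase();
        String r = recognized.toLowerCase();
        if (b.equals(r))
            return true;
        if (!ignoreSegmentation)
            return false;
        return b.startsWith(r + " ") || b.endsWith(" " + r)
                || r.startsWith(b + " ") || r.endsWith(" " + b);
    }

    public static void main(String[] args) {
        // over-segmented chunk from the notes: counted only when segmentation is ignored
        System.out.println(matches("South Korea", "Students of South Korea", true));
        System.out.println(matches("South Korea", "Students of South Korea", false));
        // an overlap in the middle of the name does not count in either mode
        System.out.println(matches("Reserve Bank of India", "Bank of", true));
    }
}
```

Appending or prepending a space before the prefix and suffix checks keeps the comparison at word boundaries, so "Bank" would not match "Banking Group".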


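
The precision, recall and F1 figures quoted in the comment follow directly from the Found, Total and Correct counts. A minimal sketch of that arithmetic, using the first set of counts reported above (Found: 8861, Total: 7781, Correct: 6675); the class PrfSketch and its zero guards are illustrative assumptions, since testCONLL itself divides without any guard:

```java
public class PrfSketch {

    /** Fraction of recognized names that are correct. */
    static double precision(int correct, int found) {
        return found == 0 ? 0.0 : (double) correct / found;
    }

    /** Fraction of benchmark names that were recognized correctly. */
    static double recall(int correct, int total) {
        return total == 0 ? 0.0 : (double) correct / total;
    }

    /** Harmonic mean of precision and recall. */
    static double f1(double p, double r) {
        return (p + r) == 0.0 ? 0.0 : 2 * p * r / (p + r);
    }

    public static void main(String[] args) {
        int found = 8861, total = 7781, correct = 6675;
        double p = precision(correct, found);
        double r = recall(correct, total);
        System.out.println("Precision: " + p);
        System.out.println("Recall: " + r);
        System.out.println("F1: " + f1(p, r));
    }
}
```

Rounded to three decimals this gives 0.753, 0.858 and 0.802, in line with the reported values. Names counted as "Missed due to wrong type" are not part of Correct, so they lower both precision and recall.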