use of edu.stanford.muse.ner.tokenize.CICTokenizer in project epadd by ePADD.
the class SequenceModelTest method testCONLL.
// we are missing F.C's like F.C. La Valletta
/**
* Tested on 28th Jan. 2016 on what is believed to be the testa.dat file of the original CONLL.
* I procured this data set from the home page of a UMass professor (I don't remember the name), who provided the test files for a homework assignment; guess who topped the assignment :)
* (So, don't use this data to report results at any serious venue)
* The results on multi-word names are as follows.
* Note that the test only considered PERSON, LOCATION and ORG. Also, it does not distinguish between the types, because the type assigned by the Sequence Labeler is almost always right. Importantly, this avoids any scuffle over the mapping from fine-grained types to the coarse types.
* -------------
* Found: 8861 -- Total: 7781 -- Correct: 6675
* Precision: 0.75330096
* Recall: 0.8578589
* F1: 0.80218726
* ------------
* I went through 2691 sentences, of which only 200 had any unrecognised entities, and identified various sources of error.
* The sources of missing names are listed below in approximately decreasing order of their contribution, with some examples for each source. The example phrases are recognized as one chunk with a type.
* Obviously, this list is not exhaustive, USE IT WITH CAUTION!
* 1. Bad segmentation -- which is minor for ePADD and depends on training data and principles.
* For example: "Overseas Development Minister <PERSON>Lynda Chalker</PERSON>", "Czech <PERSON>Daniel Vacek</PERSON>", "Frenchman <PERSON>Cedric Pioline</PERSON>"
* "President <PERSON>Nelson Mandela</PERSON>","<BANK>Reserve Bank of India</BANK> Governor <PERSON>Chakravarty Rangarajan</PERSON>"
* "Third-seeded <PERSON>Wayne Ferreira</PERSON>",
* Hong Kong Newsroom -- we got only Hong Kong, <BANK>Hong Kong Interbank</BANK> Offered Rate, Privately-owned <BANK>Bank Duta</BANK>
* [SERIOUS]
* 2. Bad training data -- our training data (DBpedia instances) contain a lot of phrases like "of Romania"
* Ex: <PERSON>Yayuk Basuki</PERSON> of Indonesia, <PERSON>Karim Alami</PERSON> of Morocco
* This also leads to errors such as National Bank of Holland being segmented as National Bank
* [SERIOUS]
* 3. Some unknown names, mostly personal -- we see very weird names in CONLL; Hopefully, we can avoid this problem in ePADD by considering the address book of the archive.
* Ex: NOVYE ATAGI, Hans-Otto Sieg, NS Kampfruf, Marie-Jose Perec, Billy Mayfair--Paul Goydos--Hidemichi Tanaki
* we miss many (almost all) names of the form "M. Dowman" because of uncommon or unknown last name.
* 4. Bad segmentation due to limitations of CIC
* Ex: Hassan al-Turabi, National Democratic party, Department of Humanitarian affairs, Reserve bank of India, Saint of the Gutters, Queen of the South, Queen's Park
* 5. Very Long entities -- we refrain from seq. labelling if the #tokens>7
* Ex: National Socialist German Workers ' Party Foreign Organisation
* 6. We are missing OCEANs?!
* Ex: Atlantic Ocean, Indian Ocean
* 7. Bad segments -- why are some segments starting with weird chars like '&'
* Ex: Goldman Sachs & Co Wertpapier GmbH -> {& Co Wertpapier GmbH, Goldman Sachs}
* 8. We are missing Times of London?! We get nothing that contains "Newsroom" -- "Amsterdam Newsroom", "Hong Kong News Room"
* Why are we getting "Students of South Korea" instead of "South Korea"?
*
* 1/50th on only MWs
* 13 Feb 13:24:54 BMMModel INFO - -------------
* 13 Feb 13:24:54 BMMModel INFO - Found: 4238 -- Total: 4236 -- Correct: 3242 -- Missed due to wrong type: 358
* 13 Feb 13:24:54 BMMModel INFO - Precision: 0.7649835
* 13 Feb 13:24:54 BMMModel INFO - Recall: 0.7653447
* 13 Feb 13:24:54 BMMModel INFO - F1: 0.765164
* 13 Feb 13:24:54 BMMModel INFO - ------------
*
* Best performance on testa with [ignore segmentation] and single word with CONLL data is
* 25 Sep 13:27:03 SequenceModel INFO - -------------
* 25 Sep 13:27:03 SequenceModel INFO - Found: 4117 -- Total: 4236 -- Correct: 3368 -- Missed due to wrong type: 266
* 25 Sep 13:27:03 SequenceModel INFO - Precision: 0.8180714
* 25 Sep 13:27:03 SequenceModel INFO - Recall: 0.7950897
* 25 Sep 13:27:03 SequenceModel INFO - F1: 0.80641687
* 25 Sep 13:27:03 SequenceModel INFO - ------------
*
* on testa, *not* ignoring segmentation (exact match), any number of words
* 25 Sep 17:23:14 SequenceModel INFO - -------------
* 25 Sep 17:23:14 SequenceModel INFO - Found: 6006 -- Total: 7219 -- Correct: 4245 -- Missed due to wrong type: 605
* 25 Sep 17:23:14 SequenceModel INFO - Precision: 0.7067932
* 25 Sep 17:23:14 SequenceModel INFO - Recall: 0.5880316
* 25 Sep 17:23:14 SequenceModel INFO - F1: 0.6419659
* 25 Sep 17:23:14 SequenceModel INFO - ------------
*
* on testa, exact matches, multi-word names
* 25 Sep 17:28:04 SequenceModel INFO - -------------
* 25 Sep 17:28:04 SequenceModel INFO - Found: 4117 -- Total: 4236 -- Correct: 3096 -- Missed due to wrong type: 183
* 25 Sep 17:28:04 SequenceModel INFO - Precision: 0.7520039
* 25 Sep 17:28:04 SequenceModel INFO - Recall: 0.7308782
* 25 Sep 17:28:04 SequenceModel INFO - F1: 0.74129057
* 25 Sep 17:28:04 SequenceModel INFO - ------------
*
* With a model that is not trained on CONLL lists
* On testa, ignoring segmentation, any number of words.
* 25 Sep 19:22:26 SequenceModel INFO - -------------
* 25 Sep 19:22:26 SequenceModel INFO - Found: 6129 -- Total: 7219 -- Correct: 4725 -- Missed due to wrong type: 964
* 25 Sep 19:22:26 SequenceModel INFO - Precision: 0.7709251
* 25 Sep 19:22:26 SequenceModel INFO - Recall: 0.6545228
* 25 Sep 19:22:26 SequenceModel INFO - F1: 0.7079712
* 25 Sep 19:22:26 SequenceModel INFO - ------------
*
* testa -- model trained on CONLL, ignore segmentation, any phrase
* 26 Sep 20:23:58 SequenceModelTest INFO - -------------
* Found: 6391 -- Total: 7219 -- Correct: 4900 -- Missed due to wrong type: 987
* Precision: 0.7667032
* Recall: 0.67876434
* F1: 0.7200588
* ------------
*
* testb -- model trained on CONLL, ignore segmentation, any phrase
* 26 Sep 20:24:01 SequenceModelTest INFO - -------------
* Found: 2198 -- Total: 2339 -- Correct: 1597 -- Missed due to wrong type: 425
* Precision: 0.7265696
* Recall: 0.68277043
* F1: 0.7039894
* ------------
*/
public static PerfStats testCONLL(SequenceModel seqModel, boolean verbose, ParamsCONLL params) {
PerfStats stats = new PerfStats();
try {
// only multi-word are considered
boolean onlyMW = params.onlyMultiWord;
// use ignoreSegmentation=true only together with onlyMW=true; other combinations are not tested
boolean ignoreSegmentation = params.ignoreSegmentation;
String test = params.testType.toString();
InputStream in = Config.getResourceAsStream("CONLL" + File.separator + "annotation" + File.separator + test + "spacesep.txt");
// 7==0111 PER, LOC, ORG
Conll03NameSampleStream sampleStream = new Conll03NameSampleStream(Conll03NameSampleStream.LANGUAGE.EN, in, 7);
Set<String> correct = new LinkedHashSet<>(), found = new LinkedHashSet<>(), real = new LinkedHashSet<>(), wrongType = new LinkedHashSet<>();
Multimap<String, String> matchMap = ArrayListMultimap.create();
Map<String, String> foundTypes = new LinkedHashMap<>(), benchmarkTypes = new LinkedHashMap<>();
NameSample sample = sampleStream.read();
CICTokenizer tokenizer = new CICTokenizer();
while (sample != null) {
String[] words = sample.getSentence();
// join the tokens with single spaces to rebuild the sentence
String sent = String.join(" ", words);
Map<String, String> names = new LinkedHashMap<>();
opennlp.tools.util.Span[] nspans = sample.getNames();
for (opennlp.tools.util.Span nspan : nspans) {
String n = "";
for (int si = nspan.getStart(); si < nspan.getEnd(); si++) {
if (si < words.length - 1 && words[si + 1].equals("'s"))
n += words[si];
else
n += words[si] + " ";
}
if (n.endsWith(" "))
n = n.substring(0, n.length() - 1);
if (!onlyMW || n.contains(" "))
names.put(n, nspan.getType());
}
Span[] chunks = seqModel.find(sent);
Map<String, String> foundSample = new LinkedHashMap<>();
if (chunks != null)
for (Span chunk : chunks) {
String text = chunk.text;
Short type = chunk.type;
if (type == NEType.Type.DISEASE.getCode() || type == NEType.Type.EVENT.getCode() || type == NEType.Type.AWARD.getCode())
continue;
Short coarseType = NEType.getCoarseType(type).getCode();
String typeText;
if (coarseType == NEType.Type.PERSON.getCode())
typeText = "person";
else if (coarseType == NEType.Type.PLACE.getCode())
typeText = "location";
else
typeText = "organization";
double s = chunk.typeScore;
if (s > 0 && (!onlyMW || text.contains(" ")))
foundSample.put(text, typeText);
}
Set<String> foundNames = new LinkedHashSet<>();
Map<String, String> localMatchMap = new LinkedHashMap<>();
for (Map.Entry<String, String> entry : foundSample.entrySet()) {
foundTypes.put(entry.getKey(), entry.getValue());
boolean foundEntry = false;
String foundType = null;
for (String name : names.keySet()) {
String cname = EmailUtils.uncanonicaliseName(name).toLowerCase();
String ek = EmailUtils.uncanonicaliseName(entry.getKey()).toLowerCase();
if (cname.equals(ek) || (ignoreSegmentation && ((cname.startsWith(ek + " ") || cname.endsWith(" " + ek) || ek.startsWith(cname + " ") || ek.endsWith(" " + cname))))) {
foundEntry = true;
foundType = names.get(name);
matchMap.put(entry.getKey(), name);
localMatchMap.put(entry.getKey(), name);
break;
}
}
if (foundEntry) {
if (entry.getValue().equals(foundType)) {
foundNames.add(entry.getKey());
correct.add(entry.getKey());
} else {
wrongType.add(entry.getKey());
}
}
}
if (verbose) {
log.info("CIC tokens: " + tokenizer.tokenizeWithoutOffsets(sent));
log.info(chunks);
String fn = "Found names:";
for (String f : foundNames) fn += f + "[" + foundSample.get(f) + "] with " + localMatchMap.get(f) + "--";
if (fn.endsWith("--"))
log.info(fn);
String extr = "Extra names: ";
for (String f : foundSample.keySet()) if (!localMatchMap.containsKey(f))
extr += f + "[" + foundSample.get(f) + "]--";
if (extr.endsWith("--"))
log.info(extr);
String miss = "Missing names: ";
for (String name : names.keySet()) if (!localMatchMap.values().contains(name))
miss += name + "[" + names.get(name) + "]--";
if (miss.endsWith("--"))
log.info(miss);
String misAssign = "Mis-assigned Types: ";
for (String f : foundSample.keySet()) if (matchMap.containsKey(f)) {
// log.warn("This is not expected: " + f + " in matchMap not found names -- " + names);
if (names.get(matchMap.get(f)) != null && !names.get(matchMap.get(f)).equals(foundSample.get(f)))
misAssign += f + "[" + foundSample.get(f) + "] Expected [" + names.get(matchMap.get(f)) + "]--";
}
if (misAssign.endsWith("--"))
log.info(misAssign);
log.info(sent + "\n------------------");
}
for (String name : names.keySet()) benchmarkTypes.put(name, names.get(name));
real.addAll(names.keySet());
found.addAll(foundSample.keySet());
sample = sampleStream.read();
}
float prec = (float) correct.size() / (float) found.size();
float recall = (float) correct.size() / (float) real.size();
if (verbose) {
log.info("----Correct names----");
for (String str : correct) log.info(str + " with " + new LinkedHashSet<>(matchMap.get(str)));
log.info("----Missed names----");
real.stream().filter(str -> !matchMap.values().contains(str)).forEach(log::info);
log.info("---Extra names------");
found.stream().filter(str -> !matchMap.keySet().contains(str)).forEach(log::info);
log.info("---Assigned wrong type------");
for (String str : wrongType) {
Set<String> bMatches = new LinkedHashSet<>(matchMap.get(str));
for (String bMatch : bMatches) {
String ft = foundTypes.get(str);
String bt = benchmarkTypes.get(bMatch);
if (!ft.equals(bt))
log.info(str + "[" + ft + "] expected " + bMatch + "[" + bt + "]");
}
}
}
stats.f1 = (2 * prec * recall / (prec + recall));
stats.precision = prec;
stats.recall = recall;
stats.numFound = found.size();
stats.numReal = real.size();
stats.numCorrect = correct.size();
stats.numWrongType = wrongType.size();
log.info(stats.toString());
} catch (IOException e) {
log.error("Could not read the CONLL test data", e);
}
return stats;
}
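The matching and scoring steps in testCONLL can be illustrated in isolation. What follows is a minimal sketch, not the project's code: the class EvalSketch and its methods are hypothetical names, the call to EmailUtils.uncanonicaliseName is left out, and the zero guards are an added assumption (the original divides directly). The counts used in main are the ones reported in the Javadoc above (Found: 8861, Total: 7781, Correct: 6675).

```java
import java.util.Locale;

// Hypothetical helper class; none of these names exist in ePADD.
final class EvalSketch {

    // True when the found chunk equals the benchmark name, or, with
    // ignoreSegmentation, when one is a whole-word prefix or suffix of the other.
    static boolean looseMatch(String gold, String found, boolean ignoreSegmentation) {
        String g = gold.toLowerCase(Locale.ROOT);
        String f = found.toLowerCase(Locale.ROOT);
        if (g.equals(f)) return true;
        return ignoreSegmentation
                && (g.startsWith(f + " ") || g.endsWith(" " + f)
                 || f.startsWith(g + " ") || f.endsWith(" " + g));
    }

    // Precision = correct / found (0 when nothing was found; assumed guard).
    static double precision(int correct, int found) {
        return found == 0 ? 0.0 : (double) correct / found;
    }

    // Recall = correct / real (0 when the benchmark is empty; assumed guard).
    static double recall(int correct, int real) {
        return real == 0 ? 0.0 : (double) correct / real;
    }

    // F1 is the harmonic mean of precision and recall.
    static double f1(double p, double r) {
        return (p + r) == 0.0 ? 0.0 : 2 * p * r / (p + r);
    }

    public static void main(String[] args) {
        double p = precision(6675, 8861);
        double r = recall(6675, 7781);
        System.out.printf("Precision: %.4f Recall: %.4f F1: %.4f%n", p, r, f1(p, r));
        System.out.println(looseMatch("South Korea", "Students of South Korea", true));
    }
}
```

With ignoreSegmentation enabled, a chunk such as "Students of South Korea" still counts as a hit for "South Korea", which is why the "ignore segmentation" runs in the Javadoc report higher numbers than the exact-match runs.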
use of edu.stanford.muse.ner.tokenize.CICTokenizer in project epadd by ePADD.
the class TokenizerTest method testCICTokenizer.
@Ignore
@Test
public void testCICTokenizer() {
Tokenizer tokenizer = new CICTokenizer();
String[] contents = new String[] { "A book named Information Retrieval by Christopher Manning", "I have visited Museum of Modern Arts aka. MoMA, MMA, MoMa", "Sound of the Music and Arts program by SALL Studios", "Performance by Chaurasia, Hariprasad was great!", "Dummy of the and Something", "Mr. HariPrasad was present.", "We traveled through A174 Road.", "The MIT school has many faculty members who were awarded the Nobel Prize in Physics", "We are celebrating Amy's first birthday", "We are meeting at Barnie's and then go to Terry's", "Patrick's portrayal of Barney is wonderful", "He won a gold in 1874 Winter Olympics", "India got independence in 1947", ">Holly Crumpton in an interview said he will never speak to public directly", "The popular Ellen de Generes show made a Vincent van Gogh themed episode", "Barack-O Obama is the President of USA", "CEO--Sundar attended a meeting in Delhi", "Subject: Jeb Bush, the presidential candidate", "From: Ted Cruz on Jan 15th, 2015", "I met Frank'O Connor in the CCD", "I have met him in the office yesterday", "Annapoorna Residence,\nHouse No: 1975,\nAlma Street,\nPalo Alto,\nCalifornia", // It fails here, because OpenNLP sentence model marks Mt. as end of the sentence.
"Met Mr. Robert Creeley at his place yesterday", "Dear Folks, it is party time!", "Few years ago, I wrote an article on \"Met The President\"", "This is great! I am meeting with Barney Stinson", "The Department of Geology is a hard sell!", "Sawadika!\n" + "\n" + "fondly,\n\n", "Judith C Stern MA PT\n" + "AmSAT Certified Teacher of the Alexander Technique\n" + "31 Purchase Street\n" + "Rye NY 10580", "Currently I am working in a Company", "Unfortunately I cannot attend the meeting", "Personally I prefer this over anything else", "On Behalf of Mr. Spider Man, we would like to apologise", "Quoting Robert Creeley, a Black Mountain Poet", "Hi Mrs. Senora, glad we have met", "Our XXX Company, produces the best detergents in the world", "My Thought on Thought makes an infinite loop", "Regarding The Bangalore Marathon, it has been cancelled due to stray dogs", "I am meeting with him in Jan, and will request for one in Feb, will say OK to everything and disappear on the very next Mon or Tue, etc.", "North Africa is the northern portion of Africa", "Center of Evaluation has developed some evaluation techniques.", "Hi Professor Winograd, this is your student from nowhere", ">> Hi Professor Winograd, this is your student from nowhere", "Hello this is McGill & Wexley Co.", "Why Benjamin Netanyahu may look", "I am good Said Netanyahu", "Even Netanyahu was present at the party", "The New York Times is a US based daily", "Do you know about The New York Times Company that brutally charges for Digital subscription", "Fischler proposed EU-wide measures after reports from Britain and France that under laboratory conditions sheep could contract Bovine Spongiform Encephalopathy ( BSE ) -- mad cow disease", "Spanish Farm Minister Loyola de Palacio had earlier accused Fischler at an EU farm ministers ' meeting of causing unjustified alarm through \" dangerous generalisation .", "P.V. Krishnamoorthi", "Should Rubin be told about this?", "You are talking to Robert Who?", "I will never say a thing SAID REBECCA HALL", "\" Airport officials declared an emergency situation at the highest level and the fire brigade put out the flames while the plane was landing , he said .", "Brussels received 5.6 cm ( 2.24 inches ) of water in the past 24 hours -- compared to an average 7.4 cm ( 2.96 inches ) per month -- but in several communes in the south of the country up to 8 cm ( 3.2 inches ) fell , the Royal Meteorological Institute ( RMT ) said", "Danish cleaning group ISS on Wednesday said it had signed a letter of intent to sell its troubled U.S unit ISS Inc to Canadian firm Aaxis Limited", "That was one hell of a Series!", "I am from India said No one.", "Rachel and I went for a date in the imaginary land of geeks.", "I'm the one invited.", "Shares in Slough , which earlier announced a 14 percent rise in first-half pretax profit to 37.4 million stg , climbed nearly six percent , or 14p to 250 pence at 1009 GMT , while British Land added 12-1 / 2p to 468p , Land Securities rose 5-1 / 2p to 691p and Hammerson was 8p higher at 390 ." };
String[][] tokens = new String[][] { new String[] { "Information Retrieval", "Christopher Manning" }, new String[] { "Museum of Modern Arts", "MoMA", "MMA", "MoMa" }, new String[] { "Music and Arts", "SALL Studios" }, new String[] { "Chaurasia", "Hariprasad" }, new String[] {}, new String[] { "Mr. HariPrasad" }, new String[] { "A174 Road" }, new String[] { "MIT", "Nobel Prize in Physics" }, new String[] { "Amy" }, new String[] { "Barnie", "Terry" }, new String[] { "Patrick", "Barney" }, new String[] { "Winter Olympics" }, new String[] { "India" }, new String[] { "Holly Crumpton" }, new String[] { "Ellen de Generes", "Vincent van Gogh" }, new String[] { "Barack-O Obama", "President of USA" }, new String[] { "CEO", "Sundar", "Delhi" }, new String[] { "Jeb Bush" }, // Can we do a better job here? without knowing that Ted Cruz is a person.
new String[] { "Ted Cruz" }, new String[] { "Frank'O Connor", "CCD" }, new String[] {}, new String[] { "Annapoorna Residence", "House No", "Alma Street", "Palo Alto", "California" }, new String[] { "Mr. Robert Creeley" }, new String[] {}, new String[] { "President" }, new String[] { "Barney Stinson" }, new String[] { "Department of Geology" }, new String[] { "Sawadika" }, new String[] { "Judith C Stern MA PT", "AmSAT Certified Teacher", "Alexander Technique", "Purchase Street", "Rye NY" }, new String[] {}, new String[] {}, new String[] {}, new String[] { "Mr. Spider Man" }, new String[] { "Robert Creeley", "Black Mountain Poet" }, new String[] { "Mrs. Senora" }, new String[] { "XXX Company" }, new String[] { "Thought" }, new String[] { "Bangalore Marathon" }, new String[] {}, new String[] { "North Africa", "Africa" }, new String[] { "Center of Evaluation" }, new String[] { "Professor Winograd" }, new String[] { "Professor Winograd" }, new String[] { "McGill & Wexley Co" }, new String[] { "Benjamin Netanyahu" }, new String[] { "Netanyahu" }, new String[] { "Netanyahu" }, new String[] { "New York Times", "US" }, new String[] { "New York Times Company" }, new String[] { "Fischler", "EU-wide", "Britain and France", "Bovine Spongiform Encephalopathy", "BSE" }, new String[] { "Spanish Farm Minister Loyola de Palacio", "Fischler", "EU" }, new String[] { "P.V. Krishnamoorthi" }, new String[] { "Rubin" }, new String[] { "Robert" }, new String[] { "REBECCA HALL" }, new String[] {}, new String[] { "Royal Meteorological Institute", "RMT", "Brussels" }, new String[] { "ISS", "ISS Inc", "Canadian", "Wednesday", "Aaxis Limited", "U.S", "Danish" }, new String[] {}, new String[] { "India" }, new String[] { "Rachel" }, new String[] {}, // this is a bad case
new String[] { "Shares in", "GMT", "British Land", "Land Securities", "Hammerson" } };
for (int ci = 0; ci < contents.length; ci++) {
String content = contents[ci];
List<String> ts = Arrays.asList(tokens[ci]);
// want to specifically test person names tokenize for index 3.
List<Triple<String, Integer, Integer>> cics = tokenizer.tokenize(content);
List<String> cicTokens = cics.stream().map(t -> t.first).collect(Collectors.toList());
boolean missing = ts.stream().anyMatch(t -> !cicTokens.contains(CICTokenizer.canonicalize(t)));
boolean wrongOffsets = cics.stream().anyMatch(t -> {
if (!ts.contains(content.substring(t.second, t.third))) {
System.out.println("Fail at: " + content.substring(t.second, t.third));
return true;
}
return false;
});
String str = "------------\n" + "Test failed!\n" + "Content: " + content + "\n" + "Expected tokens: " + ts + "\n" + "Found: " + cics + "\n";
assertTrue("Missing tokens: " + str, cics.size() == ts.size() && !missing);
assertTrue("Wrong offsets: " + str, cics.size() == ts.size() && !wrongOffsets);
}
}
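The offset check in testCICTokenizer relies on each returned triple slicing the original content back to one of the expected tokens. The sketch below shows that check on its own, under stated assumptions: OffsetToken is a hypothetical stand-in for the Triple<String, Integer, Integer> that the tokenizer returns, and no real CICTokenizer is invoked.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for one tokenizer result: text plus [start, end) offsets.
final class OffsetToken {
    final String text;
    final int start;
    final int end;

    OffsetToken(String text, int start, int end) {
        this.text = text;
        this.start = start;
        this.end = end;
    }
}

final class OffsetCheckSketch {

    // True when the token count matches the expectation and every
    // [start, end) span of the content is one of the expected tokens.
    static boolean offsetsAgree(String content, List<OffsetToken> found, List<String> expected) {
        if (found.size() != expected.size()) return false;
        for (OffsetToken t : found) {
            if (t.start < 0 || t.end > content.length() || t.start > t.end) return false;
            if (!expected.contains(content.substring(t.start, t.end))) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        String content = "Met Mr. Robert Creeley at his place yesterday";
        List<String> expected = Arrays.asList("Mr. Robert Creeley");
        List<OffsetToken> found = Arrays.asList(new OffsetToken("Mr. Robert Creeley", 4, 22));
        System.out.println(offsetsAgree(content, found, expected));
    }
}
```

Unlike the test above, this sketch also rejects out-of-range offsets instead of letting substring throw; that bounds check is an addition, not something the original test does.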
use of edu.stanford.muse.ner.tokenize.CICTokenizer in project epadd by ePADD.
the class NameTypes method computeNameMap.
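/**
* Returns a map from canonical title (the trimmed, lower-cased name) to its NameInfo.
* Only multi-word names are counted, and each name is counted at most once per document.
*/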
public static Map<String, NameInfo> computeNameMap(Archive archive, Collection<EmailDocument> allDocs) {
if (allDocs == null)
allDocs = (List) archive.getAllDocs();
// compute name -> nameInfo
Map<String, NameInfo> hitTitles = new LinkedHashMap<>();
int i = 0;
List<String> upnames = new ArrayList<>(), unernames = new ArrayList<>();
Tokenizer tokenizer = new CICTokenizer();
for (EmailDocument ed : allDocs) {
if (i % 1000 == 0)
log.info("Collected names from: " + i + "/" + allDocs.size());
i++;
String id = ed.getUniqueId();
// String content = archive.getContents(ed, false);
// Set<String> pnames = tokenize.tokenizeWithoutOffsets(content, true);
// Note that archive.getAllNames does not fetch the corr. names, but openNLPNER names.
List<String> pnames = ed.getAllNames();
List<String> names = new ArrayList<>();
// temp to remove duplication.
Set<String> unames = new HashSet<>();
unames.addAll(pnames);
names.addAll(unames);
for (String name : names) {
if (name == null || !name.contains(" "))
continue;
// canonical title
String cTitle = name.trim().toLowerCase();
// these are noisy "names"
if ("best_wishes".equals(cTitle) || "best_regards".equals(cTitle) || "uncensored".equals(cTitle))
continue;
NameInfo I = hitTitles.get(cTitle);
if (I == null) {
I = new NameInfo(name);
I.times = 1;
I.snippet = "";
hitTitles.put(cTitle, I);
} else
I.times++;
}
}
Set<String> unp = new HashSet<>(), unner = new HashSet<>();
unp.addAll(upnames);
for (String u : unernames) unner.add(u);
return hitTitles;
}
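The counting loop in computeNameMap reduces to: de-duplicate the names of each document, keep only multi-word names, canonicalise them, and increment a per-title counter. The following is a rough sketch of that idea using Map.merge. NameCountSketch and countMultiWordNames are hypothetical names, plain integer counts stand in for NameInfo, and the noise set is passed in as a parameter rather than hard-coded, which is an assumption made for illustration.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch; not part of ePADD.
final class NameCountSketch {

    // namesPerDoc: one list of extracted names per document.
    // noise: canonical titles to skip (assumed parameter).
    static Map<String, Integer> countMultiWordNames(List<List<String>> namesPerDoc, Set<String> noise) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (List<String> docNames : namesPerDoc) {
            // de-duplicate within a document so each name counts once per document
            for (String name : new LinkedHashSet<>(docNames)) {
                if (name == null || !name.contains(" ")) continue; // multi-word names only
                String cTitle = name.trim().toLowerCase(Locale.ROOT);
                if (noise.contains(cTitle)) continue;
                counts.merge(cTitle, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<List<String>> docs = Arrays.asList(
                Arrays.asList("Ted Cruz", "Ted Cruz", "India"),
                Arrays.asList("ted cruz ", null, "Jeb Bush"));
        System.out.println(countMultiWordNames(docs, Collections.<String>emptySet()));
    }
}
```

Map.merge replaces the explicit get / null check / put sequence; the original keeps that sequence because it also initialises other NameInfo fields on first sight of a title.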