use of edu.stanford.muse.util.Span in project epadd by ePADD.
the class EntityBook method getDisplayNameToFreq.
public Map<String, Integer> getDisplayNameToFreq(Archive archive, short type) {
    Map<String, Entity> displayNameToEntity = new LinkedHashMap<>();
    double theta = 0.001;
    EntityBook entityBook = archive.getEntityBook();
    for (Document doc : archive.getAllDocs()) {
        Span[] spans = archive.getEntitiesInDoc(doc, true);
        Set<String> seenInThisDoc = new LinkedHashSet<>();
        for (Span span : spans) {
            // bail out if not of the entity type that we're looking for, or not enough confidence
            if (span.type != type || span.typeScore < theta)
                continue;
            String name = span.getText();
            String displayName = name;
            // map the name to its display name. if no mapping, we should get the same name back as its displayName
            if (entityBook != null)
                displayName = entityBook.getDisplayName(name, span.type);
            displayName = displayName.trim();
            // count an entity in a doc only once
            if (seenInThisDoc.contains(displayName))
                continue;
            seenInThisDoc.add(displayName);
            if (!displayNameToEntity.containsKey(displayName))
                displayNameToEntity.put(displayName, new Entity(displayName, span.typeScore));
            else
                displayNameToEntity.get(displayName).freq++;
        }
    }
    // convert from displayNameToEntity to displayNameToFreq
    Map<String, Integer> displayNameToFreq = new LinkedHashMap<>();
    for (Entity e : displayNameToEntity.values()) displayNameToFreq.put(e.entity, e.freq);
    return displayNameToFreq;
}
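The count-each-name-once-per-document pattern above can be sketched standalone; this is a minimal sketch assuming plain strings in place of ePADD's Span/Document types, with a hypothetical `countOncePerDoc` helper:

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class FreqSketch {
    // Count how many documents each name appears in; duplicates inside one doc count once.
    public static Map<String, Integer> countOncePerDoc(List<List<String>> docs) {
        Map<String, Integer> freq = new LinkedHashMap<>();
        for (List<String> doc : docs) {
            Set<String> seenInThisDoc = new LinkedHashSet<>();
            for (String name : doc) {
                String displayName = name.trim();
                if (!seenInThisDoc.add(displayName))
                    continue; // already counted in this doc
                freq.merge(displayName, 1, Integer::sum);
            }
        }
        return freq;
    }
}
```

The LinkedHashMap mirrors the original's choice of insertion-ordered maps, so entities come back in first-seen order.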
use of edu.stanford.muse.util.Span in project epadd by ePADD.
the class SequenceModelTest method testCONLL.
// we are missing F.C's like F.C. La Valletta
/**
* Tested on 28th Jan. 2016 on what is believed to be the testa.dat file of original CONLL.
* I procured this data-set from a prof's (UMass Prof., don't remember the name) home page where he provided the test files for a homework, guess who topped the assignment :)
* (So, don't use this data to report results at any serious venue)
 * The results on multi-word names are as follows.
 * Note that the test only considered PERSON, LOCATION and ORG; also, it does not distinguish between the types, because the type assigned by the Sequence Labeler is almost always right. And, importantly, this avoids any scuffle over the mapping from fine-grained types to the coarse types.
* -------------
* Found: 8861 -- Total: 7781 -- Correct: 6675
* Precision: 0.75330096
* Recall: 0.8578589
* F1: 0.80218726
* ------------
* I went through 2691 sentences of which only 200 had any unrecognised entities and identified various sources of error.
 * The sources of missing names are as follows, in decreasing order of their contribution (approximately); I have put some examples with the sources. The example phrases are recognized as one chunk with a type.
* Obviously, this list is not exhaustive, USE IT WITH CAUTION!
* 1. Bad segmentation -- which is minor for ePADD and depends on training data and principles.
* For example: "Overseas Development Minister <PERSON>Lynda Chalker</PERSON>",Czech <PERSON>Daniel Vacek</PERSON>, "Frenchman <PERSON>Cedric Pioline</PERSON>"
* "President <PERSON>Nelson Mandela</PERSON>","<BANK>Reserve Bank of India</BANK> Governor <PERSON>Chakravarty Rangarajan</PERSON>"
* "Third-seeded <PERSON>Wayne Ferreira</PERSON>",
* Hong Kong Newsroom -- we got only Hong Kong, <BANK>Hong Kong Interbank</BANK> Offered Rate, Privately-owned <BANK>Bank Duta</BANK>
* [SERIOUS]
* 2. Bad training data -- since our training data (DBpedia instances) contain phrases like "of Romania" a lot
* Ex: <PERSON>Yayuk Basuki</PERSON> of Indonesia, <PERSON>Karim Alami</PERSON> of Morocc
 * This also leads to errors, e.g. when National Bank of Holand is segmented as just National Bank
* [SERIOUS]
* 3. Some unknown names, mostly personal -- we see very weird names in CONLL; Hopefully, we can avoid this problem in ePADD by considering the address book of the archive.
* Ex: NOVYE ATAGI, Hans-Otto Sieg, NS Kampfruf, Marie-Jose Perec, Billy Mayfair--Paul Goydos--Hidemichi Tanaki
 * We miss many (almost all) names of the form "M. Dowman" because of an uncommon or unknown last name.
* 4. Bad segmentation due to limitations of CIC
* Ex: Hassan al-Turabi, National Democratic party, Department of Humanitarian affairs, Reserve bank of India, Saint of the Gutters, Queen of the South, Queen's Park
* 5. Very Long entities -- we refrain from seq. labelling if the #tokens>7
* Ex: National Socialist German Workers ' Party Foreign Organisation
* 6. We are missing OCEANs?!
* Ex: Atlantic Ocean, Indian Ocean
* 7. Bad segments -- why are some segments starting with weird chars like '&'
* Ex: Goldman Sachs & Co Wertpapier GmbH -> {& Co Wertpapier GmbH, Goldman Sachs}
* 8. We are missing Times of London?! We get nothing that contains "Newsroom" -- "Amsterdam Newsroom", "Hong Kong News Room"
* Why are we getting "Students of South Korea" instead of "South Korea"?
*
* 1/50th on only MWs
* 13 Feb 13:24:54 BMMModel INFO - -------------
* 13 Feb 13:24:54 BMMModel INFO - Found: 4238 -- Total: 4236 -- Correct: 3242 -- Missed due to wrong type: 358
* 13 Feb 13:24:54 BMMModel INFO - Precision: 0.7649835
* 13 Feb 13:24:54 BMMModel INFO - Recall: 0.7653447
* 13 Feb 13:24:54 BMMModel INFO - F1: 0.765164
* 13 Feb 13:24:54 BMMModel INFO - ------------
*
 * Best performance on testa, with [ignore segmentation] and single-word names, on CONLL data:
* 25 Sep 13:27:03 SequenceModel INFO - -------------
* 25 Sep 13:27:03 SequenceModel INFO - Found: 4117 -- Total: 4236 -- Correct: 3368 -- Missed due to wrong type: 266
* 25 Sep 13:27:03 SequenceModel INFO - Precision: 0.8180714
* 25 Sep 13:27:03 SequenceModel INFO - Recall: 0.7950897
* 25 Sep 13:27:03 SequenceModel INFO - F1: 0.80641687
* 25 Sep 13:27:03 SequenceModel INFO - ------------
 *
* on testa, *not* ignoring segmentation (exact match), any number of words
* 25 Sep 17:23:14 SequenceModel INFO - -------------
* 25 Sep 17:23:14 SequenceModel INFO - Found: 6006 -- Total: 7219 -- Correct: 4245 -- Missed due to wrong type: 605
* 25 Sep 17:23:14 SequenceModel INFO - Precision: 0.7067932
* 25 Sep 17:23:14 SequenceModel INFO - Recall: 0.5880316
* 25 Sep 17:23:14 SequenceModel INFO - F1: 0.6419659
* 25 Sep 17:23:14 SequenceModel INFO - ------------
*
* on testa, exact matches, multi-word names
* 25 Sep 17:28:04 SequenceModel INFO - -------------
* 25 Sep 17:28:04 SequenceModel INFO - Found: 4117 -- Total: 4236 -- Correct: 3096 -- Missed due to wrong type: 183
* 25 Sep 17:28:04 SequenceModel INFO - Precision: 0.7520039
* 25 Sep 17:28:04 SequenceModel INFO - Recall: 0.7308782
* 25 Sep 17:28:04 SequenceModel INFO - F1: 0.74129057
* 25 Sep 17:28:04 SequenceModel INFO - ------------
*
* With a model that is not trained on CONLL lists
* On testa, ignoring segmentation, any number of words.
 * 25 Sep 19:22:26 SequenceModel INFO - -------------
* 25 Sep 19:22:26 SequenceModel INFO - Found: 6129 -- Total: 7219 -- Correct: 4725 -- Missed due to wrong type: 964
* 25 Sep 19:22:26 SequenceModel INFO - Precision: 0.7709251
* 25 Sep 19:22:26 SequenceModel INFO - Recall: 0.6545228
* 25 Sep 19:22:26 SequenceModel INFO - F1: 0.7079712
* 25 Sep 19:22:26 SequenceModel INFO - ------------
*
 * testa -- model trained on CONLL, ignore segmentation, any phrase
* 26 Sep 20:23:58 SequenceModelTest INFO - -------------
* Found: 6391 -- Total: 7219 -- Correct: 4900 -- Missed due to wrong type: 987
* Precision: 0.7667032
* Recall: 0.67876434
* F1: 0.7200588
* ------------
*
 * testb -- model trained on CONLL, ignore segmentation, any phrase
* 26 Sep 20:24:01 SequenceModelTest INFO - -------------
* Found: 2198 -- Total: 2339 -- Correct: 1597 -- Missed due to wrong type: 425
* Precision: 0.7265696
* Recall: 0.68277043
* F1: 0.7039894
* ------------
*/
public static PerfStats testCONLL(SequenceModel seqModel, boolean verbose, ParamsCONLL params) {
    PerfStats stats = new PerfStats();
    try {
        // only multi-word names are considered
        boolean onlyMW = params.onlyMultiWord;
        // use ignoreSegmentation=true only with onlyMW=true; it is not tested otherwise
        boolean ignoreSegmentation = params.ignoreSegmentation;
        String test = params.testType.toString();
        InputStream in = Config.getResourceAsStream("CONLL" + File.separator + "annotation" + File.separator + test + "spacesep.txt");
        // 7 == 0111 -- PER, LOC, ORG
        Conll03NameSampleStream sampleStream = new Conll03NameSampleStream(Conll03NameSampleStream.LANGUAGE.EN, in, 7);
        Set<String> correct = new LinkedHashSet<>(), found = new LinkedHashSet<>(), real = new LinkedHashSet<>(), wrongType = new LinkedHashSet<>();
        Multimap<String, String> matchMap = ArrayListMultimap.create();
        Map<String, String> foundTypes = new LinkedHashMap<>(), benchmarkTypes = new LinkedHashMap<>();
        NameSample sample = sampleStream.read();
        CICTokenizer tokenizer = new CICTokenizer();
        while (sample != null) {
            String[] words = sample.getSentence();
            String sent = "";
            for (String s : words) sent += s + " ";
            sent = sent.substring(0, sent.length() - 1);
            Map<String, String> names = new LinkedHashMap<>();
            opennlp.tools.util.Span[] nspans = sample.getNames();
            for (opennlp.tools.util.Span nspan : nspans) {
                String n = "";
                for (int si = nspan.getStart(); si < nspan.getEnd(); si++) {
                    if (si < words.length - 1 && words[si + 1].equals("'s"))
                        n += words[si];
                    else
                        n += words[si] + " ";
                }
                if (n.endsWith(" "))
                    n = n.substring(0, n.length() - 1);
                if (!onlyMW || n.contains(" "))
                    names.put(n, nspan.getType());
            }
            Span[] chunks = seqModel.find(sent);
            Map<String, String> foundSample = new LinkedHashMap<>();
            if (chunks != null)
                for (Span chunk : chunks) {
                    String text = chunk.text;
                    Short type = chunk.type;
                    if (type == NEType.Type.DISEASE.getCode() || type == NEType.Type.EVENT.getCode() || type == NEType.Type.AWARD.getCode())
                        continue;
                    Short coarseType = NEType.getCoarseType(type).getCode();
                    String typeText;
                    if (coarseType == NEType.Type.PERSON.getCode())
                        typeText = "person";
                    else if (coarseType == NEType.Type.PLACE.getCode())
                        typeText = "location";
                    else
                        typeText = "organization";
                    double s = chunk.typeScore;
                    if (s > 0 && (!onlyMW || text.contains(" ")))
                        foundSample.put(text, typeText);
                }
            Set<String> foundNames = new LinkedHashSet<>();
            Map<String, String> localMatchMap = new LinkedHashMap<>();
            for (Map.Entry<String, String> entry : foundSample.entrySet()) {
                foundTypes.put(entry.getKey(), entry.getValue());
                boolean foundEntry = false;
                String foundType = null;
                for (String name : names.keySet()) {
                    String cname = EmailUtils.uncanonicaliseName(name).toLowerCase();
                    String ek = EmailUtils.uncanonicaliseName(entry.getKey()).toLowerCase();
                    if (cname.equals(ek) || (ignoreSegmentation && (cname.startsWith(ek + " ") || cname.endsWith(" " + ek) || ek.startsWith(cname + " ") || ek.endsWith(" " + cname)))) {
                        foundEntry = true;
                        foundType = names.get(name);
                        matchMap.put(entry.getKey(), name);
                        localMatchMap.put(entry.getKey(), name);
                        break;
                    }
                }
                if (foundEntry) {
                    if (entry.getValue().equals(foundType)) {
                        foundNames.add(entry.getKey());
                        correct.add(entry.getKey());
                    } else {
                        wrongType.add(entry.getKey());
                    }
                }
            }
            if (verbose) {
                log.info("CIC tokens: " + tokenizer.tokenizeWithoutOffsets(sent));
                log.info(chunks);
                String fn = "Found names:";
                for (String f : foundNames) fn += f + "[" + foundSample.get(f) + "] with " + localMatchMap.get(f) + "--";
                if (fn.endsWith("--"))
                    log.info(fn);
                String extr = "Extra names: ";
                for (String f : foundSample.keySet())
                    if (!localMatchMap.containsKey(f))
                        extr += f + "[" + foundSample.get(f) + "]--";
                if (extr.endsWith("--"))
                    log.info(extr);
                String miss = "Missing names: ";
                for (String name : names.keySet())
                    if (!localMatchMap.values().contains(name))
                        miss += name + "[" + names.get(name) + "]--";
                if (miss.endsWith("--"))
                    log.info(miss);
                String misAssign = "Mis-assigned Types: ";
                for (String f : foundSample.keySet())
                    if (matchMap.containsKey(f)) {
                        // log.warn("This is not expected: " + f + " in matchMap not found names -- " + names);
                        if (names.get(matchMap.get(f)) != null && !names.get(matchMap.get(f)).equals(foundSample.get(f)))
                            misAssign += f + "[" + foundSample.get(f) + "] Expected [" + names.get(matchMap.get(f)) + "]--";
                    }
                if (misAssign.endsWith("--"))
                    log.info(misAssign);
                log.info(sent + "\n------------------");
            }
            for (String name : names.keySet()) benchmarkTypes.put(name, names.get(name));
            real.addAll(names.keySet());
            found.addAll(foundSample.keySet());
            sample = sampleStream.read();
        }
        float prec = (float) correct.size() / (float) found.size();
        float recall = (float) correct.size() / (float) real.size();
        if (verbose) {
            log.info("----Correct names----");
            for (String str : correct) log.info(str + " with " + new LinkedHashSet<>(matchMap.get(str)));
            log.info("----Missed names----");
            real.stream().filter(str -> !matchMap.values().contains(str)).forEach(log::info);
            log.info("---Extra names------");
            found.stream().filter(str -> !matchMap.keySet().contains(str)).forEach(log::info);
            log.info("---Assigned wrong type------");
            for (String str : wrongType) {
                Set<String> bMatches = new LinkedHashSet<>(matchMap.get(str));
                for (String bMatch : bMatches) {
                    String ft = foundTypes.get(str);
                    String bt = benchmarkTypes.get(bMatch);
                    if (!ft.equals(bt))
                        log.info(str + "[" + ft + "] expected " + bMatch + "[" + bt + "]");
                }
            }
        }
        stats.f1 = (2 * prec * recall / (prec + recall));
        stats.precision = prec;
        stats.recall = recall;
        stats.numFound = found.size();
        stats.numReal = real.size();
        stats.numCorrect = correct.size();
        stats.numWrongType = wrongType.size();
        log.info(stats.toString());
    } catch (IOException e) {
        e.printStackTrace();
    }
    return stats;
}
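The Found/Total/Correct lines reported in the javadoc reduce to the usual precision/recall/F1 arithmetic that the method computes at the end. A minimal standalone sketch (not ePADD's PerfStats class, just the formulas):

```java
public class F1Sketch {
    // precision = correct / found: what fraction of the emitted names were right
    public static double precision(int found, int total, int correct) {
        return (double) correct / found;
    }

    // recall = correct / total: what fraction of the benchmark names were recovered
    public static double recall(int found, int total, int correct) {
        return (double) correct / total;
    }

    // F1 is the harmonic mean of precision and recall
    public static double f1(int found, int total, int correct) {
        double p = precision(found, total, correct);
        double r = recall(found, total, correct);
        return 2 * p * r / (p + r);
    }
}
```

Plugging in the first reported run (Found: 8861, Total: 7781, Correct: 6675) reproduces the logged Precision/Recall/F1 figures.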
use of edu.stanford.muse.util.Span in project epadd by ePADD.
the class EmailRenderer method getHTMLForHeader.
/**
 * returns an HTML table string for the doc header
*
* @throws IOException
*/
private static StringBuilder getHTMLForHeader(EmailDocument ed, SearchResult searchResult, boolean IA_links, boolean debug) throws IOException {
    AddressBook addressBook = searchResult.getArchive().addressBook;
    Set<String> contactNames = new LinkedHashSet<>();
    Set<String> contactAddresses = new LinkedHashSet<>();
    String archiveID = ArchiveReaderWriter.getArchiveIDForArchive(searchResult.getArchive());
    // get contact ids from searchResult object.
    Set<Integer> highlightContactIds = searchResult.getHLInfoContactIDs().stream().map(Integer::parseInt).collect(Collectors.toSet());
    if (highlightContactIds != null)
        for (Integer hci : highlightContactIds) {
            if (hci == null)
                continue;
            Contact c = searchResult.getArchive().addressBook.getContact(hci);
            if (c == null)
                continue;
            contactNames.addAll(c.getNames());
            contactAddresses.addAll(c.getEmails());
        }
    // get highlight terms from searchResult object for this document.
    Set<String> highlightTerms = searchResult.getHLInfoTerms(ed);
    StringBuilder result = new StringBuilder();
    // header table
    result.append("<table class=\"docheader rounded\">\n");
    // + this.folderName + "</td></tr>\n");
    if (debug)
        result.append("<tr><td>docId: </td><td>" + ed.getUniqueId() + "</td></tr>\n");
    result.append(JSPHelper.getHTMLForDate(archiveID, ed.date));
    final String style = "<tr><td align=\"right\" class=\"muted\" valign=\"top\">";
    // email specific headers
    result.append(style + "From: </td><td align=\"left\">");
    Address[] addrs = ed.from;
    // get ArchiveID
    if (addrs != null) {
        result.append(formatAddressesAsHTML(archiveID, addrs, addressBook, TEXT_WRAP_WIDTH, highlightTerms, contactNames, contactAddresses));
    }
    result.append("\n</td></tr>\n");
    result.append(style + "To: </td><td align=\"left\">");
    addrs = ed.to;
    if (addrs != null)
        result.append(formatAddressesAsHTML(archiveID, addrs, addressBook, TEXT_WRAP_WIDTH, highlightTerms, contactNames, contactAddresses) + "");
    result.append("\n</td></tr>\n");
    if (ed.cc != null && ed.cc.length > 0) {
        result.append(style + "Cc: </td><td align=\"left\">");
        result.append(formatAddressesAsHTML(archiveID, ed.cc, addressBook, TEXT_WRAP_WIDTH, highlightTerms, contactNames, contactAddresses) + "");
        result.append("\n</td></tr>\n");
    }
    if (ed.bcc != null && ed.bcc.length > 0) {
        result.append(style + "Bcc: </td><td align=\"left\">");
        result.append(formatAddressesAsHTML(archiveID, ed.bcc, addressBook, TEXT_WRAP_WIDTH, highlightTerms, contactNames, contactAddresses) + "");
        result.append("\n</td></tr>\n");
    }
    String x = ed.description;
    if (x == null)
        x = "<None>";
    result.append(style + "Subject: </td>");
    // <pre> to escape special chars if any in the subject. max 70 chars in
    // one line, otherwise spill to next line
    result.append("<td align=\"left\"><b>");
    x = DatedDocument.formatStringForMaxCharsPerLine(x, 70).toString();
    if (x.endsWith("\n"))
        x = x.substring(0, x.length() - 1);
    Span[] names = searchResult.getArchive().getAllNamesInDoc(ed, false);
    // Contains all entities and id if it is authorised else null
    Map<String, Entity> entitiesWithId = new HashMap<>();
    // we annotate three specially recognized types
    Map<Short, String> recMap = new HashMap<>();
    recMap.put(NEType.Type.PERSON.getCode(), "cp");
    recMap.put(NEType.Type.PLACE.getCode(), "cl");
    recMap.put(NEType.Type.ORGANISATION.getCode(), "co");
    Arrays.stream(names).filter(n -> recMap.containsKey(NEType.getCoarseType(n.type).getCode())).forEach(n -> {
        Set<String> types = new HashSet<>();
        types.add(recMap.get(NEType.getCoarseType(n.type).getCode()));
        entitiesWithId.put(n.text, new Entity(n.text, null, types));
    });
    x = searchResult.getArchive().annotate(x, ed.getDate(), ed.getUniqueId(), searchResult.getRegexToHighlight(), highlightTerms, entitiesWithId, IA_links, false);
    result.append(x);
    result.append("</b>\n");
    result.append("\n</td></tr>\n");
    // String messageId = Util.hash (ed.getSignature());
    // String messageLink = "(<a href=\"browse?archiveID="+archiveID+"&adv-search=1&uniqueId=" + messageId + "\">Link</a>)";
    // result.append ("\n" + style + "ID: " + "</td><td>" + messageId + " " + messageLink + "</td></tr>");
    // end docheader table
    result.append("</table>\n");
    if (ModeConfig.isPublicMode())
        return new StringBuilder(Util.maskEmailDomain(result.toString()));
    return result;
}
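The type-to-CSS-class mapping used when annotating the subject can be sketched standalone. This is a hedged sketch, assuming made-up type codes and plain string replacement in place of ePADD's NEType codes and Archive.annotate:

```java
import java.util.HashMap;
import java.util.Map;

public class AnnotateSketch {
    // hypothetical coarse type codes, standing in for NEType.Type.*.getCode()
    static final short PERSON = 0, PLACE = 1, ORGANISATION = 2, OTHER = 99;

    // wrap each recognized name in a span carrying its css class;
    // names whose type is not one of the three recognized types are left alone
    public static String annotate(String text, Map<String, Short> nameToType) {
        Map<Short, String> recMap = new HashMap<>();
        recMap.put(PERSON, "cp");
        recMap.put(PLACE, "cl");
        recMap.put(ORGANISATION, "co");
        String out = text;
        for (Map.Entry<String, Short> e : nameToType.entrySet()) {
            String css = recMap.get(e.getValue());
            if (css == null)
                continue; // only the three specially recognized types are annotated
            out = out.replace(e.getKey(), "<span class=\"" + css + "\">" + e.getKey() + "</span>");
        }
        return out;
    }
}
```

The real method additionally threads dates, highlight terms, and IA links through the annotation; only the type filtering is illustrated here.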
use of edu.stanford.muse.util.Span in project epadd by ePADD.
the class EntityBookManager method getEntitiesInDoc.
/* body = true => in message body, false => in subject */
/**
 * This method is a wrapper over @getEntitiesInDocFromLucene. It reads the entities from Lucene and filters out those which are not present
 * in the entitybooks. This makes sure the entitybook remains the single point of deciding whether an entity is present in the doc.
 * @param document the document whose entities are wanted
 * @param body true to look in the message body, false in the subject
 * @return the entities of the doc that are present in some entitybook
 */
public Span[] getEntitiesInDoc(Document document, boolean body) {
    Span[] names = getEntitiesInDocFromLucene(document, body);
    Set<Span> res = new LinkedHashSet<>();
    for (NEType.Type t : NEType.Type.values()) {
        EntityBook ebook = this.getEntityBookForType(t.getCode());
        for (Span name : names) {
            if (ebook.nameToMappedEntity.get(EntityBook.canonicalize(name.text)) != null)
                res.add(name);
        }
    }
    // return res as array
    return res.toArray(new Span[res.size()]);
}
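The filter-against-the-entitybook pattern can be sketched standalone; a minimal sketch assuming plain strings, a lowercase-trim `canonicalize` stand-in (ePADD's EntityBook.canonicalize differs), and a Set in place of the nameToMappedEntity lookup:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class FilterSketch {
    // hypothetical canonicalizer, standing in for EntityBook.canonicalize
    static String canonicalize(String s) {
        return s.trim().toLowerCase(Locale.ROOT);
    }

    // keep only the names whose canonical form is known to the entity book
    public static List<String> filterKnown(List<String> names, Set<String> canonicalKnown) {
        List<String> res = new ArrayList<>();
        for (String name : names)
            if (canonicalKnown.contains(canonicalize(name)))
                res.add(name);
        return res;
    }
}
```

As in the original, the lookup decides membership while the returned values keep their original (uncanonicalized) form.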
use of edu.stanford.muse.util.Span in project epadd by ePADD.
the class EntityBookManager method fillEntityBookFromLucene.
/*
This is a slow path, but the assumption is that it is used only once, when porting old archives (where entitybooks are not yet factored out as files). After that, only the other
path, 'fillEntityBookFromText', is used repeatedly (when loading the archive).
*/
private void fillEntityBookFromLucene(Short type) {
    EntityBook ebook = new EntityBook(type);
    mTypeToEntityBook.put(type, ebook);
    double theta = 0.001;
    // the docset map maps a mapped entity to its score and the set of documents it occurs in.
    Map<MappedEntity, Pair<Double, Set<Document>>> docsetmap = new LinkedHashMap<>();
    for (Document doc : mArchive.getAllDocs()) {
        Span[] spansbody = getEntitiesInDocFromLucene(doc, true);
        Span[] spans = getEntitiesInDocFromLucene(doc, false);
        Span[] allspans = ArrayUtils.addAll(spans, spansbody);
        Set<String> seenInThisDoc = new LinkedHashSet<>();
        for (Span span : allspans) {
            // bail out if not of the entity type that we're looking for, or not enough confidence
            if (span.type != type || span.typeScore < theta)
                continue;
            String name = span.getText();
            String canonicalizedname = EntityBook.canonicalize(name);
            Double score = Double.valueOf(span.typeScore);
            // map the name to its display name. if no mapping, we should get the same name back as its displayName
            MappedEntity mappedEntity = ebook.nameToMappedEntity.get(canonicalizedname);
            if (mappedEntity == null) {
                // add this name as a mapped entity in the entitybook.
                mappedEntity = new MappedEntity();
                // don't canonicalize for display purposes, otherwise 'University of Florida' becomes 'florida of university'
                mappedEntity.setDisplayName(name);
                mappedEntity.setEntityType(type);
                mappedEntity.addAltNames(name);
                ebook.nameToMappedEntity.put(canonicalizedname, mappedEntity);
                Set<Document> docset = new LinkedHashSet<>();
                docsetmap.put(mappedEntity, new Pair(score, docset));
                // no doc exists yet for this mapped entity
                docset.add(doc);
            } else {
                // add it to the docset. what about the score? for now, take the score as the max of all scores.
                Double oldscore = docsetmap.get(mappedEntity).first;
                Double finalscore = Double.max(oldscore, score);
                Set<Document> docset = docsetmap.get(mappedEntity).second;
                docset.add(doc);
                docsetmap.put(mappedEntity, new Pair(finalscore, docset));
            }
        }
    }
    // fill the cached summary for the ebook in its other fields.
    ebook.fillSummaryFields(docsetmap, mArchive);
}
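The merge logic above (max score, union of documents per entity) can be sketched standalone; this sketch assumes a hypothetical Occurrence record and integer doc ids in place of MappedEntity/Document:

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class MergeSketch {
    // one sighting of an entity name in a document, with the labeler's score
    public record Occurrence(String name, int docId, double score) {}

    // score and docset accumulated for one entity
    public static class Agg {
        public double score;                                    // max score seen so far
        public final Set<Integer> docs = new LinkedHashSet<>(); // docs the entity occurs in
    }

    // fold occurrences into one Agg per name, keeping the max score
    // and the union of doc ids (duplicates collapse via the set)
    public static Map<String, Agg> aggregate(List<Occurrence> occs) {
        Map<String, Agg> map = new LinkedHashMap<>();
        for (Occurrence o : occs) {
            Agg agg = map.computeIfAbsent(o.name(), k -> new Agg());
            agg.score = Double.max(agg.score, o.score());
            agg.docs.add(o.docId());
        }
        return map;
    }
}
```

Taking the max is the same arbitrary-but-simple choice the original comment flags; an average or sum would need the occurrence count tracked as well.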