
Example 1 with Pair

use of edu.stanford.muse.util.Pair in project epadd by ePADD.

the class Highlighter method getHTMLAnnotatedDocumentContents.

/**
 * A convenience method to do the bulk job of annotating all the terms in termsToHighlight, termsToHyperlink and entitiesWithId.
 * Also hyperlinks any URLs found in the content.
 * @param contents the content to be annotated, typically the text of an email body
 * @param regexToHighlight the output will highlight all strings matching this regex
 * @param showDebugInfo when set, appends to the output some debug info related to the entities present in the content and passed through entitiesWithId
 *
 * Note: do NOT modify any of the objects passed in as parameters;
 *       if one needs to be modified, clone it and modify the local copy.
 */
// TODO: can also get rid of termsToHyperlink
public static String getHTMLAnnotatedDocumentContents(String contents, Date d, String docId, String regexToHighlight, Set<String> termsToHighlight, Map<String, EmailRenderer.Entity> entitiesWithId, Set<String> termsToHyperlink, boolean showDebugInfo) {
    Set<String> highlightTerms = new LinkedHashSet<>(), hyperlinkTerms = new LinkedHashSet<>();
    if (termsToHighlight != null)
        highlightTerms.addAll(termsToHighlight);
    if (termsToHyperlink != null)
        hyperlinkTerms.addAll(termsToHyperlink);
    if (log.isDebugEnabled())
        log.debug("DocId: " + docId + "; Highlight terms: " + highlightTerms + "; Entities: " + entitiesWithId + "; Hyperlink terms: " + hyperlinkTerms);
    // System.err.println("DocId: " + docId + "; Highlight terms: " + highlightTerms + "; Entities: " + entitiesWithId + "; Hyperlink terms: " + hyperlinkTerms);
    short HIGHLIGHT = 0, HYPERLINK = 1;
    // pp for post process, as we cannot add complex tags while highlighting
    String preHighlightTag = "<span class='hilitedTerm rounded' >", postHighlightTag = "</span>";
    String preHyperlinkTag = "<span data-process='pp'>", postHyperlinkTag = "</span>";
    // since URLs are not tokenized as a single token, the Lucene highlighter cannot highlight them; handle them with a regex instead
    Pattern p = Pattern.compile("https?://[^\\s\\n]*");
    Matcher m = p.matcher(contents);
    StringBuffer sb = new StringBuffer();
    while (m.find()) {
        String link = m.group();
        String url = link;
        if (d != null) {
            Calendar c = new GregorianCalendar();
            c.setTime(d);
            // Calendar.MONTH is 0-based, hence the +1
            String archiveDate = c.get(Calendar.YEAR) + String.format("%02d", c.get(Calendar.MONTH) + 1) + String.format("%02d", c.get(Calendar.DATE)) + "120000";
            url = "http://web.archive.org/web/" + archiveDate + "/" + link;
        }
        m.appendReplacement(sb, Matcher.quoteReplacement("<a target=\"_blank\" href=\"" + url + "\">" + link + "</a> "));
    }
    m.appendTail(sb);
    contents = sb.toString();
    if (!Util.nullOrEmpty(regexToHighlight)) {
        contents = annotateRegex(contents, regexToHighlight, preHighlightTag, postHighlightTag);
    }
    List<String> catchTerms = Arrays.asList("class", "span", "data", "ignore");
    Set<String> ignoreTermsForHyperlinking = catchTerms.stream().map(String::toLowerCase).collect(Collectors.toSet());
    // the entitiesWithId keys are already canonicalized by the same tokenizer used with the analyzer
    if (entitiesWithId != null)
        hyperlinkTerms.addAll(entitiesWithId.keySet().stream().filter(term -> !ignoreTermsForHyperlinking.contains(term.trim().toLowerCase())).map(term -> "\"" + term + "\"").collect(Collectors.toSet()));
    // If there are overlapping annotations, they need to be applied in a well-defined order.
    // The list below holds that order: each entry maps a string to be annotated to a flag
    // denoting whether to highlight or hyperlink it.
    List<Pair<String, Short>> order = new ArrayList<>();
    // should preserve order so that highlight terms are seen before hyperlink
    Set<String> allTerms = new LinkedHashSet<>();
    allTerms.addAll(highlightTerms);
    /*
		 * We want to assign the order in which terms are highlighted or hyperlinked.
		 * For example: if we want to annotate both "Robert" and "Robert Creeley", and we annotate "Robert" first, we may miss "Robert Creeley";
		 * so we assign an order over strings that share any common words, as done in the loop below.
		 * This test can still miss cases where a regular expression eventually matches a word that was already annotated,
		 * or where two terms like "Robert Creeley" and "Mr Robert" should both match a text like "Mr Robert Creeley";
		 * in such cases one of the terms may not get annotated.
		 * Terms added to o are those that share at least one word.
		 * TODO: give preference to highlight over hyperlink
		 * TODO: remove order and simplify
		 */
    // should preserve order so that highlight terms that are added first stay that way
    Map<Pair<String, Short>, Integer> o = new LinkedHashMap<>();
    // prioritised terms
    // Note that a term can be marked both for highlight and hyperlink
    Set<String> consTermsHighlight = new HashSet<>(), consTermsHyperlink = new HashSet<>();
    for (String at : allTerms) {
        // Catch: if we are trying to highlight terms like "class", "span", etc.,
        // we had better annotate them first, as they may land inside span tags and get annotated there, breaking the highlighter
        Set<String> substrs = IndexUtils.computeAllSubstrings(at);
        for (String substr : substrs) {
            if (at.equals(substr) || at.equals("\"" + substr + "\""))
                continue;
            boolean match = catchTerms.contains(substr.toLowerCase());
            int val = match ? Integer.MAX_VALUE : substr.length();
            // The highlight or hyperlink terms may have quotes; the special handling below is for that. Is there a better way?
            if (highlightTerms.contains(substr) || highlightTerms.contains("\"" + substr + "\"")) {
                highlightTerms.remove(substr);
                highlightTerms.remove("\"" + substr + "\"");
                // there should be no repetitions in the order array, else it leads to multiple annotations i.e. two spans around one single element
                if (!consTermsHighlight.contains(substr)) {
                    o.put(new Pair<>(substr, HIGHLIGHT), val);
                    consTermsHighlight.add(substr);
                }
            }
            if (hyperlinkTerms.contains(substr) || hyperlinkTerms.contains("\"" + substr + "\"")) {
                hyperlinkTerms.remove(substr);
                hyperlinkTerms.remove("\"" + substr + "\"");
                if (!consTermsHyperlink.contains(substr)) {
                    o.put(new Pair<>(substr, HYPERLINK), val);
                    consTermsHyperlink.add(substr);
                }
            }
        }
    }
    // now sort the phrases from longest to shortest
    List<Pair<Pair<String, Short>, Integer>> os = Util.sortMapByValue(o);
    // collect into a List, not a Set, so the sorted order is preserved
    order.addAll(os.stream().map(pair -> pair.first).collect(Collectors.toList()));
    // System.err.println(order+" hit: "+highlightTerms+" -- hyt: "+hyperlinkTerms);
    // annotate whatever is left in highlight and hyperlink Terms.
    // String result = contents;
    String result = highlightBatch(contents, highlightTerms.toArray(new String[highlightTerms.size()]), preHighlightTag, postHighlightTag);
    result = highlightBatch(result, hyperlinkTerms.toArray(new String[hyperlinkTerms.size()]), preHyperlinkTag, postHyperlinkTag);
    // now highlight terms in order.
    for (Pair<String, Short> ann : order) {
        short type = ann.second;
        String term = ann.first;
        String preTag = null, postTag = null;
        if (type == HYPERLINK) {
            preTag = preHyperlinkTag;
            postTag = postHyperlinkTag;
        } else if (type == HIGHLIGHT) {
            preTag = preHighlightTag;
            postTag = postHighlightTag;
        }
        try {
            result = highlight(result, term, preTag, postTag);
        } catch (IOException | InvalidTokenOffsetsException | ParseException e) {
            Util.print_exception("Exception while adding html annotation: " + ann.first, e, log);
            e.printStackTrace();
        }
    }
    // do some line breaking and show overflow.
    String[] lines = result.split("\\n");
    StringBuilder htmlResult = new StringBuilder();
    boolean overflow = false; // note: never set to true below, so the "More" block that follows is currently dead code
    for (String line : lines) {
        htmlResult.append(line);
        htmlResult.append("\n<br/>");
    }
    if (overflow) {
        htmlResult.append("</div>\n");
        // the nojog class ensures that the jog doesn't pop up when the more
        // button is clicked
        htmlResult.append("<span class=\"nojog\" style=\"color:#500050;text-decoration:underline;font-size:12px\" onclick=\"muse.reveal(this, false);\">More</span><br/>\n");
    }
    // Now do post-processing to add complex tags that depend on the text inside. title, link and cssclass
    org.jsoup.nodes.Document doc = Jsoup.parse(htmlResult.toString());
    Elements elts = doc.select("[data-process]");
    for (int j = 0; j < elts.size(); j++) {
        Element elt = elts.get(j);
        Element par = elt.parent();
        // Do not touch nested entities. (Jsoup's attr() returns "" for a missing
        // attribute, never null, so hasAttr() is the right check here.)
        if (par != null && par.hasAttr("data-process"))
            continue;
        String entity = elt.text();
        int span_j = j;
        String link = "browse?adv-search=1&termBody=on&termSubject=on&termAttachments=on&termOriginalBody=on&term=\"" + Util.escapeHTML(entity) + "\"";
        // note &quot here because the quotes have to survive
        // through the html page and reflect back in the URL
        // may need to URI escape docId?
        link += "&initDocId=" + docId;
        String title = "";
        try {
            String cssclass = "";
            EmailRenderer.Entity info = entitiesWithId.get(entity);
            if (info != null) {
                if (info.ids != null) {
                    title += "<div id=\"fast_" + info.ids + "\"></div>";
                    title += "<script>getFastData(\"" + info.ids + "\");</script>";
                    cssclass = "resolved";
                } else {
                    // the first three are custom types; the rest come from OpenNLP (plus acronyms)
                    // overlapping sub-classes could have reduced repetition in the css file, but this way gives more flexibility
                    String[] types = new String[] { "cp", "cl", "co", "person", "org", "place", "acr" };
                    String[] cssclasses = new String[] { "custom-people", "custom-loc", "custom-org", "opennlp-person", "opennlp-org", "opennlp-place", "acronym" };
                    outer: for (String et : info.types) {
                        for (int t = 0; t < types.length; t++) {
                            String type = types[t];
                            if (type.equals(et)) {
                                if (t < 3) {
                                    cssclass += cssclasses[t] + " ";
                                    // consider no other class
                                    continue outer;
                                } else {
                                    cssclass += cssclasses[t] + " ";
                                }
                            }
                        }
                    }
                }
            } else {
                cssclass += " unresolved";
            }
            // enables completion (expansion) of words while browsing of messages.
            if (entity != null) {
                // enable for only few types
                if (cssclass.contains("custom-people") || cssclass.contains("acronym") || cssclass.contains("custom-org") || cssclass.contains("custom-loc")) {
                    // TODO: remove regexs
                    entity = entity.replaceAll("(^\\s+|\\s+$)", "");
                    if (!entity.contains(" ")) {
                        // String rnd = rand.nextInt() + "";
                        // <img src="images/spinner.gif" style="height:15px"/>
                        // <script>expand("" + entity + "\",\"" + StringEscapeUtils.escapeJava(docId) + "\",\"" + rnd + "");</script>
                        // if(info.expandsTo!=null)
                        // title += "<div class=\"resolutions\" id=\"expand_" + rnd + "\"><a href='browse?term=\""+info.expandsTo+"\"'>"+info.expandsTo+"</a></div>";
                        cssclass += " expand";
                    }
                }
            }
            for (int k = j; k <= span_j; k++) {
                elt = elts.get(k);
                // don't annotate nested tags: double-check whether the parent tag is a highlight-related or entity-related annotation
                if (elt.parent().tag().getName().toLowerCase().equals("span") && elt.parent().classNames().toString().contains("custom")) {
                    continue;
                }
                String cc = elt.attr("class");
                elt.attr("class", cc + " " + cssclass);
                elt.attr("title", title);
                elt.attr("onclick", "window.location='" + link + "'");
                // A tag may contain nested tags; data-text preserves the plain text inside it
                elt.attr("data-text", entity);
                elt.attr("data-docId", StringEscapeUtils.escapeHtml(docId));
            }
        } catch (Exception e) {
            Util.print_exception("Some unknown error while highlighting", e, log);
        }
    }
    // Note: Jsoup's .html() emits each tag on a separate line
    String html = doc.html();
    if (showDebugInfo) {
        String debug_html = html + "<br>";
        debug_html += "<div class='debug' style='display:none'>";
        debug_html += "docId: " + docId;
        debug_html += "<br>-------------------------------------------------<br>";
        for (String str : entitiesWithId.keySet()) debug_html += str + ":" + entitiesWithId.get(str).types + ";;; ";
        debug_html += "<br>-------------------------------------------------<br>";
        String[] opennlp = new String[] { "person", "place", "org" };
        String[] custom = new String[] { "cp", "cl", "co" };
        for (int j = 0; j < opennlp.length; j++) {
            String t1 = opennlp[j];
            String t2 = custom[j];
            Set<String> e1 = new HashSet<>();
            Set<String> e2 = new HashSet<>();
            for (String str : entitiesWithId.keySet()) {
                Set<String> types = entitiesWithId.get(str).types;
                if (types.contains(t1) && !types.contains(t2))
                    e1.add(entitiesWithId.get(str).name);
                else if (types.contains(t2) && !types.contains(t1))
                    e2.add(entitiesWithId.get(str).name);
            }
            debug_html += opennlp[j] + " entities recognised by only opennlp: " + e1;
            debug_html += "<br>";
            debug_html += opennlp[j] + " entities recognised by only custom: " + e2;
            debug_html += "<br><br>";
        }
        debug_html += "-------------------------------------------------<br>";
        lines = contents.split("\\n");
        for (String line : lines) debug_html += line + "<br>";
        debug_html += "</div>";
        // jQuery uses .css(), not .style()
        debug_html += "<button onclick='$(\".debug\").css(\"display\",\"block\");'>Show Debug Info</button>";
        return debug_html;
    }
    return html;
}
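The longest-first ordering above is easy to get wrong: collecting the sorted pairs into a Set can destroy the order. Below is a minimal sketch of the intended behavior, using a hypothetical stand-in for edu.stanford.muse.util.Pair and a simplified sort (not ePADD's actual Util.sortMapByValue):

```java
import java.util.*;
import java.util.stream.*;

public class OrderSketch {
    // Stand-in for edu.stanford.muse.util.Pair (assumed to be a plain 2-tuple)
    static class Pair<A, B> {
        final A first; final B second;
        Pair(A first, B second) { this.first = first; this.second = second; }
    }

    // Sort terms by descending value (here, phrase length) so that longer
    // phrases like "Robert Creeley" are annotated before "Robert".
    static List<String> longestFirst(Map<String, Integer> termToLength) {
        return termToLength.entrySet().stream()
                .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
                .map(Map.Entry::getKey)
                // collect into a List, not a Set, so the sorted order survives
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Map<String, Integer> o = new LinkedHashMap<>();
        o.put("Robert", "Robert".length());
        o.put("Robert Creeley", "Robert Creeley".length());
        System.out.println(longestFirst(o)); // [Robert Creeley, Robert]
    }
}
```

Annotating the longer phrase first avoids splitting "Robert Creeley" by an inner span around "Robert".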

Example 2 with Pair

use of edu.stanford.muse.util.Pair in project epadd by ePADD.

the class SearchResult method selectBlobs.

/**
 * this is used only by the attachments page right now, not by advanced search.
 * TODO: make adv. search page also use it
 */
public static SearchResult selectBlobs(SearchResult inputSet) {
    Collection<Document> docs = inputSet.archive.getAllDocs();
    String neededFilesize = JSPHelper.getParam(inputSet.queryParams, "attachmentFilesize");
    String[] extensions = JSPHelper.getParams(inputSet.queryParams, "attachmentExtension").toArray(new String[0]);
    // entries should be lower-case strings, with no "." included
    Set<String> extensionsToMatch = new LinkedHashSet<>();
    if (!Util.nullOrEmpty(extensions)) {
        for (String s : extensions) extensionsToMatch.add(s.trim().toLowerCase());
    }
    // OR the given extensions with extensions implied by the attachment type;
    // this can add more semicolon-separated extensions
    String[] types = JSPHelper.getParams(inputSet.queryParams, "attachmentType").toArray(new String[0]);
    if (!Util.nullOrEmpty(types)) {
        for (String t : types) {
            String exts = Config.attachmentTypeToExtensions.get(t);
            if (exts == null)
                exts = t;
            // The front end should uniformly pass attachment types as extensions like mp3;mov;ogg; earlier it passed video, audio, doc, etc.
            // To accommodate both cases, we first check whether there is a mapping from the attachment type to actual extensions using .get(t);
            // if no such mapping is present, we assume the input is already of the form mp3;mov;ogg and work with that.
            String[] components = exts.split(";");
            Collections.addAll(extensionsToMatch, components);
        }
    }
    // flag: does the set of needed extensions contain "others"?
    boolean isOtherSelected = extensionsToMatch.contains("others");
    // get the options that were displayed for attachment types. This will be used to select attachment extensions if the option 'other'
    // was selected by the user in the drop down box of export.jsp.
    List<String> attachmentTypeOptions = Config.attachmentTypeToExtensions.values().stream().map(x -> Util.tokenize(x, ";")).flatMap(Collection::stream).collect(Collectors.toList());
    SearchResult outputSet = filterDocsByDate(inputSet);
    // Collection<EmailDocument> eDocs = (Collection) filterDocsByDate (params, new HashSet<>((Collection) docs));
    Map<Document, Pair<BodyHLInfo, AttachmentHLInfo>> outputDocs = new HashMap<>();
    for (Document k : outputSet.matchedDocs.keySet()) {
        EmailDocument ed = (EmailDocument) k;
        Set<Blob> matchedBlobs = new HashSet<>();
        for (Blob b : ed.attachments) {
            if (!Util.filesizeCheck(neededFilesize, b.getSize()))
                continue;
            if (!(Util.nullOrEmpty(extensionsToMatch))) {
                Pair<String, String> pair = Util.splitIntoFileBaseAndExtension(b.getName());
                String ext = pair.getSecond();
                if (ext == null)
                    continue;
                ext = ext.toLowerCase();
                // Proceed to add this attachment only if either:
                // 1. "others" is selected and this extension is not present in attachmentTypeOptions, or
                // 2. this extension is present in extensionsToMatch. [Q: what if a file has the literal extension .others?]
                boolean firstcondition = isOtherSelected && !attachmentTypeOptions.contains(ext);
                boolean secondcondition = extensionsToMatch.contains(ext);
                if (!firstcondition && !secondcondition)
                    continue;
            }
            // ok, we've survived all filters, add b
            matchedBlobs.add(b);
        }
        // some attachment of this document matched; carry the document into the output set
        if (matchedBlobs.size() != 0) {
            BodyHLInfo bhlinfo = inputSet.matchedDocs.get(k).first;
            AttachmentHLInfo attachmentHLInfo = inputSet.matchedDocs.get(k).second;
            attachmentHLInfo.addMultipleInfo(matchedBlobs);
            outputDocs.put(k, new Pair<>(bhlinfo, attachmentHLInfo));
        }
    }
    // Collections.reverse (allAttachments); // reverse, so most recent attachment is first
    return new SearchResult(outputDocs, inputSet.archive, inputSet.queryParams, inputSet.commonHLInfo, inputSet.regexToHighlight);
}
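Util.splitIntoFileBaseAndExtension above returns a Pair of (base, extension). Below is a minimal sketch of that contract with a hypothetical implementation and a stand-in Pair; ePADD's actual code may differ (e.g. in how it treats hidden files or multiple dots):

```java
public class SplitSketch {
    // Stand-in for edu.stanford.muse.util.Pair (assumed to be a plain 2-tuple)
    static class Pair<A, B> {
        final A first; final B second;
        Pair(A a, B b) { first = a; second = b; }
        A getFirst() { return first; }
        B getSecond() { return second; }
    }

    // Hypothetical equivalent of Util.splitIntoFileBaseAndExtension:
    // split on the last '.', returning (base, extension) with a null
    // extension when no dot is present -- matching the null check in selectBlobs.
    static Pair<String, String> splitIntoFileBaseAndExtension(String filename) {
        int idx = filename.lastIndexOf('.');
        if (idx < 0)
            return new Pair<>(filename, null);
        return new Pair<>(filename.substring(0, idx), filename.substring(idx + 1));
    }

    public static void main(String[] args) {
        Pair<String, String> p = splitIntoFileBaseAndExtension("report.PDF");
        // selectBlobs lower-cases the extension before comparing, as here
        System.out.println(p.getFirst() + " / " + p.getSecond().toLowerCase()); // report / pdf
    }
}
```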

Example 3 with Pair

use of edu.stanford.muse.util.Pair in project epadd by ePADD.

the class SearchResult method filterForAttachmentEntities.

/**
 ******************************ATTACHMENT SPECIFIC FILTERS************************************
 */
/**
 * returns only those docs with attachments matching params[attachmentEntity]
 * (this field is or-delimiter separated)
 * Todo: review usage of this and BlobStore.getKeywordsForBlob()
 */
private static SearchResult filterForAttachmentEntities(SearchResult inputSet) {
    String val = JSPHelper.getParam(inputSet.queryParams, "attachmentEntity");
    if (Util.nullOrEmpty(val))
        return inputSet;
    val = val.toLowerCase();
    Set<String> entities = Util.splitFieldForOr(val);
    BlobStore blobStore = inputSet.archive.blobStore;
    Map<Document, Pair<BodyHLInfo, AttachmentHLInfo>> outputDocs = new HashMap<>();
    inputSet.matchedDocs.keySet().stream().forEach((Document k) -> {
        EmailDocument ed = (EmailDocument) k;
        // Here.. check for all attachments of ed for match.
        Collection<Blob> blobs = ed.attachments;
        Set<Blob> matchedBlobs = new HashSet<>();
        for (Blob blob : blobs) {
            Collection<String> keywords = blobStore.getKeywordsForBlob(blob);
            if (keywords != null) {
                keywords.retainAll(entities);
                if (keywords.size() > 0) // this blob is of interest; add it to matchedBlobs
                    matchedBlobs.add(blob);
            }
        }
        // some attachment of this document matched; carry the document into the output set
        if (matchedBlobs.size() != 0) {
            BodyHLInfo bhlinfo = inputSet.matchedDocs.get(k).first;
            AttachmentHLInfo attachmentHLInfo = inputSet.matchedDocs.get(k).second;
            attachmentHLInfo.addMultipleInfo(matchedBlobs);
            outputDocs.put(k, new Pair<>(bhlinfo, attachmentHLInfo));
        }
    });
    return new SearchResult(outputDocs, inputSet.archive, inputSet.queryParams, inputSet.commonHLInfo, inputSet.regexToHighlight);
}
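The intersection test in the loop above can be sketched in isolation. Note that retainAll mutates the collection it is called on, so this sketch takes a defensive copy instead of modifying the keyword collection returned by the blob store (a hypothetical safety tweak, not necessarily what ePADD does):

```java
import java.util.*;

public class KeywordMatchSketch {
    // Intersect a blob's keywords with the queried entities, as
    // filterForAttachmentEntities does. A defensive copy is taken so the
    // caller's keyword collection is not mutated by retainAll.
    static boolean blobMatches(Collection<String> keywords, Set<String> entities) {
        Set<String> common = new HashSet<>(keywords);
        common.retainAll(entities); // keep only keywords that are also queried entities
        return !common.isEmpty();
    }

    public static void main(String[] args) {
        Set<String> entities = new HashSet<>(Arrays.asList("stanford", "creeley"));
        System.out.println(blobMatches(Arrays.asList("poem", "creeley"), entities)); // true
        System.out.println(blobMatches(Arrays.asList("invoice"), entities));         // false
    }
}
```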

Example 4 with Pair

use of edu.stanford.muse.util.Pair in project epadd by ePADD.

the class SearchResult method searchForTerm.

/**
 * returns SearchResult containing docs and attachments matching the given term.
 *
 * @param inputSet Input search result object on which this term filtering needs to be done
 * @param term     term to search for
 * @return searchresult obj
 */
public static SearchResult searchForTerm(SearchResult inputSet, String term) {
    // go in the order subject, body, attachment
    Set<Document> docsForTerm = new LinkedHashSet<>();
    SearchResult outputSet;
    if ("on".equals(JSPHelper.getParam(inputSet.queryParams, "termSubject"))) {
        Indexer.QueryOptions options = new Indexer.QueryOptions();
        options.setQueryType(Indexer.QueryType.SUBJECT);
        docsForTerm.addAll(inputSet.archive.docsForQuery(term, options));
    }
    if ("on".equals(JSPHelper.getParam(inputSet.queryParams, "termBody"))) {
        Indexer.QueryOptions options = new Indexer.QueryOptions();
        options.setQueryType(Indexer.QueryType.FULL);
        docsForTerm.addAll(inputSet.archive.docsForQuery(term, options));
    } else if ("on".equals(JSPHelper.getParam(inputSet.queryParams, "termOriginalBody"))) {
        // this is an else because we don't want to look at both body and body original
        Indexer.QueryOptions options = new Indexer.QueryOptions();
        options.setQueryType(Indexer.QueryType.ORIGINAL);
        docsForTerm.addAll(inputSet.archive.docsForQuery(term, options));
    }
    Map<Document, Pair<BodyHLInfo, AttachmentHLInfo>> attachmentSearchResult;
    if ("on".equals(JSPHelper.getParam(inputSet.queryParams, "termAttachments"))) {
        attachmentSearchResult = new HashMap<>();
        Set<Blob> blobsForTerm = inputSet.archive.blobsForQuery(term);
        // iterate over 'all attachments' of docs present in 'inputSet'
        inputSet.matchedDocs.keySet().stream().forEach(d -> {
            EmailDocument edoc = (EmailDocument) d;
            Set<Blob> commonAttachments = new HashSet<>(edoc.attachments);
            commonAttachments.retainAll(blobsForTerm);
            // Cases: the term may occur in the body only, in an attachment only, or in both.
            // If it is found in the body but not in any attachment, keep its info in bodyHLInfo only.
            if (commonAttachments.size() > 0) {
                if (docsForTerm.contains(edoc)) {
                    BodyHLInfo bhlinfo = inputSet.matchedDocs.get(d).first;
                    AttachmentHLInfo attachmentHLInfo = inputSet.matchedDocs.get(d).second;
                    // the body and the attachment both matched the term; record this in the body highlighter and the attachment highlighter
                    bhlinfo.addTerm(term);
                    attachmentHLInfo.addMultipleInfo(commonAttachments);
                    attachmentSearchResult.put(d, new Pair<>(bhlinfo, attachmentHLInfo));
                } else {
                    // means only attachment matched the term. add this information in attachment highlighter
                    BodyHLInfo bhlinfo = inputSet.matchedDocs.get(d).first;
                    AttachmentHLInfo attachmentHLInfo = inputSet.matchedDocs.get(d).second;
                    attachmentHLInfo.addMultipleInfo(commonAttachments);
                    attachmentSearchResult.put(d, new Pair<>(bhlinfo, attachmentHLInfo));
                }
            } else if (commonAttachments.size() == 0 && docsForTerm.contains(d)) {
                // means the document had the term only in its body and not in the attachment.
                BodyHLInfo bhlinfo = inputSet.matchedDocs.get(d).first;
                AttachmentHLInfo attachmentHLInfo = inputSet.matchedDocs.get(d).second;
                bhlinfo.addTerm(term);
                attachmentSearchResult.put(d, new Pair<>(bhlinfo, attachmentHLInfo));
            }
        });
        outputSet = new SearchResult(attachmentSearchResult, inputSet.archive, inputSet.queryParams, inputSet.commonHLInfo, inputSet.regexToHighlight);
    } else {
        // retain only those documents in inputSet.matchedDocs that are present in the docsForTerm set
        inputSet.matchedDocs.keySet().retainAll(docsForTerm);
        outputSet = inputSet;
    }
    // blobsForTerm.retainAll(inputSet.matchInAttachment.second);
    /*
        //query for the docs where these blobs are present. Note that we do not need to search for these blobs in all docs
        //only those present in the input search object (matchInAttachment.first) are sufficient as by our invariant of
        //matchInAttachment, the set of documents where matchInAttachment.second are present is same as matchInAttachment.first.
        Set<Document> blobDocsForTerm = (Set<Document>) EmailUtils.getDocsForAttachments((Collection) inputSet.matchInAttachment.first, blobsForTerm);
        attachmentSearchResult = new Pair(blobDocsForTerm,blobsForTerm);
        */
    // Add term to common highlighting info (as it is without parsing) for highlighting.
    // The term will be in lucene syntax (OR,AND etc.)
    // lucene highlighter will take care of highlighting that.
    outputSet.commonHLInfo.addTerm(term);
    return outputSet;
}
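The kept-document cases in the attachment branch above (body and attachment matched, attachment only, body only) can be condensed into one sketch. BodyHLInfo and AttachmentHLInfo here are hypothetical stand-ins for ePADD's classes, which carry richer state:

```java
import java.util.*;

public class TermResultSketch {
    // Hypothetical stand-ins for ePADD's highlight-info classes
    static class BodyHLInfo { final Set<String> terms = new LinkedHashSet<>(); }
    static class AttachmentHLInfo { final Set<String> attachments = new LinkedHashSet<>(); }

    // Mirrors searchForTerm's case analysis: record the term in the body
    // highlighter only when the body matched, always record any matched
    // attachments, and keep the doc only if something matched at all.
    static boolean record(BodyHLInfo b, AttachmentHLInfo a, String term,
                          boolean bodyMatch, Set<String> matchedAttachments) {
        if (bodyMatch) b.terms.add(term);
        a.attachments.addAll(matchedAttachments);
        return bodyMatch || !matchedAttachments.isEmpty(); // keep this doc?
    }

    public static void main(String[] args) {
        BodyHLInfo b = new BodyHLInfo();
        AttachmentHLInfo a = new AttachmentHLInfo();
        boolean keep = record(b, a, "creeley", true, Collections.emptySet());
        System.out.println(keep + " " + b.terms); // true [creeley]
    }
}
```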

Example 5 with Pair

use of edu.stanford.muse.util.Pair in project epadd by ePADD.

the class SearchResult method filterForAttachments.

/**
 * Looks in the given docs for messages with an attachment that satisfies all the requirements.
 * The set of such messages, along with the matching blobs, is returned.
 * If there are no requirements, Pair<docs, null> is returned.
 */
private static SearchResult filterForAttachments(SearchResult inputSet) {
    String neededFilesize = JSPHelper.getParam(inputSet.queryParams, "attachmentFilesize");
    String neededFilename = JSPHelper.getParam(inputSet.queryParams, "attachmentFilename");
    // this can come in as a single parameter with multiple values (in case of multiple selections by the user)
    Collection<String> neededTypeStr = JSPHelper.getParams(inputSet.queryParams, "attachmentType");
    String neededExtensionStr = JSPHelper.getParam(inputSet.queryParams, "attachmentExtension");
    if (Util.nullOrEmpty(neededFilesize) && Util.nullOrEmpty(neededFilename) && Util.nullOrEmpty(neededTypeStr) && Util.nullOrEmpty(neededExtensionStr)) {
        return inputSet;
    }
    // set up the file names incl. regex pattern if applicable
    String neededFilenameRegex = JSPHelper.getParam(inputSet.queryParams, "attachmentFilenameRegex");
    Set<String> neededFilenames = null;
    Pattern filenameRegexPattern = null;
    if ("on".equals(neededFilenameRegex) && !Util.nullOrEmpty(neededFilename)) {
        filenameRegexPattern = Pattern.compile(neededFilename);
    } else {
        if (!Util.nullOrEmpty(neededFilename)) // will be in lower case
            neededFilenames = Util.splitFieldForOr(neededFilename);
    }
    // set up the extensions
    // will be in lower case
    Set<String> neededExtensions = new LinkedHashSet<>();
    if (!Util.nullOrEmpty(neededTypeStr) || !Util.nullOrEmpty(neededExtensionStr)) {
        // compile the list of all extensions from type (audio/video, etc) and explicitly provided extensions
        if (!Util.nullOrEmpty(neededTypeStr)) {
            // will be something like "mp3;ogg,avi;mp4"; the multiselect picker separates types with ",", so convert that to ";"
            for (String s : neededTypeStr) neededExtensions.addAll(Util.splitFieldForOr(s));
        }
        if (!Util.nullOrEmpty(neededExtensionStr)) {
            neededExtensions.addAll(Util.splitFieldForOr(neededExtensionStr));
        }
    } else {
        // if neither attachment type nor attachment extensions are provided, fill
        // neededExtensions with all known extensions/types
        Map<String, String> allTypes = Config.attachmentTypeToExtensions;
        for (String s : allTypes.values()) {
            neededExtensions.addAll(Util.splitFieldForOr(s));
        }
    }
    // We cannot use streams' forEach here because a lambda cannot capture the non-final
    // local variables declared outside it (filenameRegexPattern and neededFilenames),
    // so we use a plain loop instead.
    Map<Document, Pair<BodyHLInfo, AttachmentHLInfo>> outputDocs = new HashMap<>();
    for (Document k : inputSet.matchedDocs.keySet()) {
        EmailDocument ed = (EmailDocument) k;
        Set<Blob> matchedBlobs = new HashSet<>();
        for (Blob b : ed.attachments) {
            // 1. filename matches?
            if (filenameRegexPattern == null) {
                // non-regex check: neededFilenames holds the lower-cased names to match
                if (neededFilenames != null && (b.filename == null || !neededFilenames.contains(b.filename.toLowerCase())))
                    continue;
            } else {
                // regex check
                if (!Util.nullOrEmpty(neededFilename)) {
                    if (b.filename == null)
                        continue;
                    // use find() rather than matches(): a partial match on the filename is enough
                    if (!filenameRegexPattern.matcher(b.filename).find())
                        continue;
                }
            }
            // 2. extension matches?
            // was the catch-all "others" option selected among the needed extensions?
            boolean isOtherSelected = neededExtensions.contains("others");
            // the extensions offered by the attachment-type dropdown in export.jsp; used to
            // decide what counts as "other" (an extension not listed under any known type)
            List<String> attachmentTypeOptions = Config.attachmentTypeToExtensions.values().stream().map(x -> Util.tokenize(x, ";")).flatMap(Collection::stream).collect(Collectors.toList());
            // neededExtensions is always non-null here (initialized above); guard kept for safety
            if (neededExtensions != null) {
                if (b.filename == null)
                    // just over-defensive, if no name, effectively doesn't match
                    continue;
                String extension = Util.getExtension(b.filename);
                if (extension == null)
                    continue;
                extension = extension.toLowerCase();
                // Proceed to add this attachment only if either
                // 1. other is selected and this extension is not present in the list attachmentOptionType, or
                // 2. this extension is present in the variable neededExtensions [Q. What if there is a file with extension .others?]
                boolean firstcondition = isOtherSelected && !attachmentTypeOptions.contains(extension);
                boolean secondcondition = neededExtensions.contains(extension);
                if (!firstcondition && !secondcondition)
                    continue;
            }
            // 3. size matches?
            long size = b.getSize();
            /*
                // these attachmentFilesizes parameters are hardcoded -- could make it more flexible if needed in the future
                // "1".."5" are the only valid filesizes. If none of these, this parameter not set and we can include the blob
                if ("1".equals(neededFilesize) || "2".equals(neededFilesize) || "3".equals(neededFilesize) ||"4".equals(neededFilesize) ||"5".equals(neededFilesize)) { // any other value, we ignore this param
                    boolean include = ("1".equals(neededFilesize) && size < 5 * KB) ||
                            ("2".equals(neededFilesize) && size >= 5 * KB && size <= 20 * KB) ||
                            ("3".equals(neededFilesize) && size >= 20 * KB && size <= 100 * KB) ||
                            ("4".equals(neededFilesize) && size >= 100 * KB && size <= 2 * KB * KB) ||
                            ("5".equals(neededFilesize) && size >= 2 * KB * KB);
                }
                */
            boolean include = Util.filesizeCheck(neededFilesize, size);
            if (!include)
                continue;
            // if we reached here, all conditions must be satisfied
            matchedBlobs.add(b);
        }
        // if any attachments of this document matched, carry it into the output set
        if (!matchedBlobs.isEmpty()) {
            BodyHLInfo bhlinfo = inputSet.matchedDocs.get(k).first;
            AttachmentHLInfo attachmentHLInfo = inputSet.matchedDocs.get(k).second;
            attachmentHLInfo.addMultipleInfo(matchedBlobs);
            outputDocs.put(k, new Pair<>(bhlinfo, attachmentHLInfo));
        }
    }
    return new SearchResult(outputDocs, inputSet.archive, inputSet.queryParams, inputSet.commonHLInfo, inputSet.regexToHighlight);
}
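The exact behavior of `Util.splitFieldForOr` is not shown in this excerpt. Going by the comments above (the multiselect picker separates types with ',' and extensions within a type with ';', and the stored names are lower-cased), a plausible stand-in looks like this; the semantics here are an assumption, not ePADD's actual helper:

```java
import java.util.LinkedHashSet;
import java.util.Set;

public class SplitFieldDemo {
    // hypothetical stand-in for Util.splitFieldForOr: split a multiselect value
    // such as "mp3;ogg,avi;mp4" on both ';' and ',', trim, and lower-case
    static Set<String> splitFieldForOr(String field) {
        Set<String> result = new LinkedHashSet<>();
        for (String tok : field.split("[;,]"))
            if (!tok.trim().isEmpty())
                result.add(tok.trim().toLowerCase());
        return result;
    }

    public static void main(String[] args) {
        System.out.println(splitFieldForOr("mp3;ogg,avi;MP4"));
        // prints [mp3, ogg, avi, mp4]
    }
}
```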
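The regex branch uses `find()` rather than `matches()`; the difference matters because the user's pattern should match anywhere inside the filename, not the whole string. A minimal sketch of that check (class and method names here are illustrative, not part of ePADD):

```java
import java.util.regex.Pattern;

public class FilenameRegexDemo {
    // mirrors the check in the loop above: a partial match anywhere in the
    // filename is enough, and a null filename never matches
    static boolean filenameMatches(Pattern p, String filename) {
        return filename != null && p.matcher(filename).find();
    }

    public static void main(String[] args) {
        Pattern p = Pattern.compile("budget.*\\.xls");
        System.out.println(filenameMatches(p, "2019-budget-final.xlsx")); // true: partial match
        System.out.println(filenameMatches(p, "notes.txt"));              // false
        System.out.println(filenameMatches(p, null));                     // false
    }
}
```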
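The "others" handling above reduces to a small predicate: keep an attachment when its extension was explicitly requested, or when "others" was selected and the extension is not covered by any known attachment type. A sketch of that logic with illustrative names (not ePADD code):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ExtensionFilterDemo {
    // same two conditions as in the loop above: an explicit request wins, and
    // "others" acts as a catch-all for extensions outside the known options
    static boolean accept(String extension, Set<String> neededExtensions, List<String> knownOptions) {
        boolean isOtherSelected = neededExtensions.contains("others");
        return neededExtensions.contains(extension)
                || (isOtherSelected && !knownOptions.contains(extension));
    }

    public static void main(String[] args) {
        List<String> known = Arrays.asList("mp3", "ogg", "avi", "mp4", "pdf");
        Set<String> needed = new HashSet<>(Arrays.asList("pdf", "others"));
        System.out.println(accept("pdf", needed, known)); // true: explicitly requested
        System.out.println(accept("xyz", needed, known)); // true: caught by "others"
        System.out.println(accept("mp3", needed, known)); // false: known type, not requested
    }
}
```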
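`Util.filesizeCheck` itself is not shown in this excerpt; the commented-out block above documents the size buckets ("1" = under 5 KB through "5" = over 2 MB) that it presumably implements. A hypothetical re-implementation of those buckets, for illustration only:

```java
public class FilesizeCheckDemo {
    static final long KB = 1024;

    // hypothetical equivalent of Util.filesizeCheck, following the bucket
    // semantics in the commented-out block: any value outside "1".."5" means
    // the parameter is unset, so every size is included
    static boolean filesizeCheck(String needed, long size) {
        switch (needed == null ? "" : needed) {
            case "1": return size < 5 * KB;
            case "2": return size >= 5 * KB && size <= 20 * KB;
            case "3": return size >= 20 * KB && size <= 100 * KB;
            case "4": return size >= 100 * KB && size <= 2 * KB * KB;
            case "5": return size >= 2 * KB * KB;
            default:  return true; // filesize param not set: include everything
        }
    }

    public static void main(String[] args) {
        System.out.println(filesizeCheck("1", 4 * KB));      // true: under 5 KB
        System.out.println(filesizeCheck("3", 50 * KB));     // true: between 20 and 100 KB
        System.out.println(filesizeCheck(null, 123));        // true: no size constraint
    }
}
```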
Also used : Pattern(java.util.regex.Pattern) Blob(edu.stanford.muse.datacache.Blob) Pair(edu.stanford.muse.util.Pair)

Aggregations

Pair (edu.stanford.muse.util.Pair): 25
Blob (edu.stanford.muse.datacache.Blob): 6
AnnotationManager (edu.stanford.muse.AnnotationManager.AnnotationManager): 3
BlobStore (edu.stanford.muse.datacache.BlobStore): 3
IOException (java.io.IOException): 3
Matcher (java.util.regex.Matcher): 3
EmailDocument (edu.stanford.muse.index.EmailDocument): 2
Triple (edu.stanford.muse.util.Triple): 2
InputStream (java.io.InputStream): 2
Pattern (java.util.regex.Pattern): 2
Span (opennlp.tools.util.Span): 2
AddressBook (edu.stanford.muse.AddressBookManager.AddressBook): 1
Contact (edu.stanford.muse.AddressBookManager.Contact): 1
AuthorityMapper (edu.stanford.muse.AuthorityMapper.AuthorityMapper): 1
CancelledException (edu.stanford.muse.exceptions.CancelledException): 1
Archive (edu.stanford.muse.index.Archive): 1
Document (edu.stanford.muse.index.Document): 1
NEType (edu.stanford.muse.ner.model.NEType): 1
Span (edu.stanford.muse.util.Span): 1
Util (edu.stanford.muse.util.Util): 1