Example 11 with AddressBook

use of edu.stanford.muse.AddressBookManager.AddressBook in project epadd by ePADD.

the class Lens method getHitsQuick.

/**
 * looks up given names in address book + message content index and returns a json of scores. lensPrefs has the user's term preferences
 */
public static List<JSONObject> getHitsQuick(List<Pair<String, Float>> names, LensPrefs lensPrefs, Archive archive, String baseURL, Collection<EmailDocument> allDocs) throws JSONException, IOException {
    List<JSONObject> list = new ArrayList<>();
    Indexer indexer = archive.indexer;
    AddressBook ab = archive.addressBook;
    String archiveID = ArchiveReaderWriter.getArchiveIDForArchive(archive);
    if (indexer == null)
        return list;
    for (Pair<String, Float> pair : names) {
        String term = pair.getFirst();
        if (term.length() <= 2)
            continue;
        float pageScore = pair.getSecond();
        term = JSPHelper.convertRequestParamToUTF8(term);
        // Strip line breaks, drop everything except letters, digits, whitespace and periods, then collapse whitespace runs
        term = term.replaceAll("[\\r\\n]", "");
        term = term.replaceAll("[^\\p{L}\\p{Nd}\\s\\.]", "");
        term = term.replaceAll("\\s+", " ");
        JSONObject json = new JSONObject();
        json.put("text", term);
        json.put("pageScore", pageScore);
        NameInfo ni = archive.nameLookup(term);
        if (ni != null && ni.type != null && !"notype".equals(ni.type))
            json.put("type", ni.type);
        int NAME_IN_ADDRESS_BOOK_WEIGHT = 100;
        // look up term in 2 places -- AB and in the index
        // temporarily disabled AB - sgh. IndexUtils.selectDocsByPersons(ab, allDocs, new String[]{term}).size();
        int hitsInAddressBook = 0;
        // To check: does this include subject line also...
        int hitsInMessageContent = archive.countHitsForQuery("\"" + term + "\"");
        // weigh any docs for name in addressbook hugely more!
        double termScore = hitsInAddressBook * NAME_IN_ADDRESS_BOOK_WEIGHT + hitsInMessageContent;
        json.put("indexScore", termScore);
        int totalHits = hitsInAddressBook + hitsInMessageContent;
        // this is an over-estimate since the same message might match both in addressbook and in body. it is used only for scoring and should NEVER be shown to the user. getTermHitDetails will get the accurate count
        json.put("nMessages", totalHits);
        log.info(term + ": " + hitsInAddressBook + " in address book, " + hitsInMessageContent + " in messages");
        String url = baseURL + "/browse?archiveID=" + archiveID + "&adv-search=1&termBody=on&termSubject=on&termAttachments=on&termOriginalBody=on&term=\"" + term + "\"";
        json.put("url", url);
        // JSONArray messages = new JSONArray();
        // json.put("messages", messages); // empty messages
        list.add(json);
    }
    log.info(list.size() + " terms hit");
    list = scoreHits(list, lensPrefs);
    return list;
}
Also used : NameInfo(edu.stanford.muse.ie.NameInfo) JSONObject(org.json.JSONObject) AddressBook(edu.stanford.muse.AddressBookManager.AddressBook)
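
The term clean-up and the score weighting in getHitsQuick can be tried in isolation. The following is a minimal sketch, not ePADD code: the class name TermScoringSketch and its method names are hypothetical, and only the three regular expressions and the weight of 100 are taken from the listing above.

```java
// Hypothetical helper for illustration; only the regexes and the weight mirror getHitsQuick.
final class TermScoringSketch {
    static final int NAME_IN_ADDRESS_BOOK_WEIGHT = 100;

    /** Applies the same three clean-up steps, in the same order, as getHitsQuick. */
    static String sanitize(String term) {
        term = term.replaceAll("[\\r\\n]", "");                 // drop line breaks
        term = term.replaceAll("[^\\p{L}\\p{Nd}\\s\\.]", "");   // keep letters, digits, whitespace, periods
        term = term.replaceAll("\\s+", " ");                    // collapse whitespace runs to one space
        return term;
    }

    /** Address-book hits count 100 times as much as hits in message content. */
    static double termScore(int hitsInAddressBook, int hitsInMessageContent) {
        return hitsInAddressBook * NAME_IN_ADDRESS_BOOK_WEIGHT + hitsInMessageContent;
    }
}
```

Punctuation other than periods is removed rather than replaced by a space, so a term such as O'Neil is looked up as ONeil. With the address-book lookup disabled (hitsInAddressBook fixed at 0), termScore reduces to the message-content hit count.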

Example 12 with AddressBook

use of edu.stanford.muse.AddressBookManager.AddressBook in project epadd by ePADD.

the class CrossCollectionSearch method initialize.

/**
 * initializes lookup structures (entity infos and ctokenToInfos) for cross collection search
 * reads all archives available in the base dir.
 * should be synchronized so there's no chance of doing it multiple times at the same time.
 */
private static synchronized void initialize(String baseDir) {
    // this is created only once in one run. if it has already been created, reuse it.
    // in the future, this may be read from a serialized file, etc.
    cTokenToInfos = LinkedHashMultimap.create();
    File[] files = new File(baseDir).listFiles();
    if (files == null) {
        log.warn("Trying to initialize cross collection search from an invalid directory: " + baseDir);
        return;
    }
    int archiveNum = 0;
    for (File f : files) {
        if (!f.isDirectory())
            continue;
        try {
            String archiveFile = f.getAbsolutePath() + File.separator + Archive.BAG_DATA_FOLDER + File.separator + Archive.SESSIONS_SUBDIR + File.separator + "default" + SimpleSessions.getSessionSuffix();
            if (!new File(archiveFile).exists()) {
                log.warn("Unable to find archive file " + archiveFile + ".. Serious error");
                continue;
            }
            // Assumption is that this feature is present only in discovery mode. In future when we want to add it to processing, we need proper care.
            Archive archive = ArchiveReaderWriter.readArchiveIfPresent(f.getAbsolutePath(), ModeConfig.Mode.DISCOVERY);
            if (archive == null) {
                log.warn("failed to read archive from " + f.getAbsolutePath());
                continue;
            }
            log.info("Loaded archive from " + f.getAbsolutePath());
            log.info("Loaded archive metadata from " + f.getAbsolutePath());
            // process all docs in this archive to set up centityToInfo map
            String archiveID = ArchiveReaderWriter.getArchiveIDForArchive(archive);
            Map<String, EntityInfo> centityToInfo = new LinkedHashMap<>();
            {
                // get all contacts from the addressbook
                Set<Pair<String, Pair<Pair<Date, Date>, Integer>>> correspondentEntities = new LinkedHashSet<>();
                {
                    Map<Contact, DetailedFacetItem> res = IndexUtils.partitionDocsByPerson(archive.getAllDocs(), archive.getAddressBook());
                    res.entrySet().forEach(s -> {
                        // get contactname
                        Contact c = s.getKey();
                        // get duration (first and last doc where this contact was used)
                        Set<EmailDocument> edocs = s.getValue().docs.stream().map(t -> (EmailDocument) t).collect(Collectors.toSet());
                        Pair<Date, Date> duration = EmailUtils.getFirstLast(edocs);
                        if (duration == null) {
                            duration = new Pair<>(archive.collectionMetadata.firstDate, archive.collectionMetadata.lastDate);
                        }
                        if (duration.first == null)
                            duration.first = archive.collectionMetadata.firstDate;
                        if (duration.second == null)
                            duration.second = archive.collectionMetadata.lastDate;
                        // get number of messages where this was used.
                        Integer count = s.getValue().docs.size();
                        if (c.getNames() != null) {
                            Pair<Date, Date> finalDuration = duration;
                            c.getNames().forEach(w -> {
                                if (!Util.nullOrEmpty(w) && finalDuration != null && count != null)
                                    correspondentEntities.add(new Pair<>(canonicalize(w), new Pair<>(finalDuration, count)));
                            });
                        }
                        if (c.getEmails() != null) {
                            Pair<Date, Date> finalDuration1 = duration;
                            c.getEmails().forEach(w -> {
                                if (!Util.nullOrEmpty(w) && finalDuration1 != null && count != null)
                                    correspondentEntities.add(new Pair<>(canonicalize(w), new Pair<>(finalDuration1, count)));
                            });
                        }
                    });
                }
                // get all entities from entitybookmanager
                Set<Pair<String, Pair<Pair<Date, Date>, Integer>>> entitiessummary = new LinkedHashSet<>();
                {
                    entitiessummary = archive.getEntityBookManager().getAllEntitiesSummary();
                    // filter out any null or empty strings (just in case)
                    // don't canonicalize right away because we need to keep the original form of the name
                    entitiessummary = entitiessummary.stream().filter(s -> !Util.nullOrEmpty(s.first)).collect(Collectors.toSet());
                }
                // if an entity is present as a person entity as well as in correspondent then consider the count of the person entity as the final count.  Therefore start with
                // processing of correspondent entities.
                correspondentEntities.forEach(entity -> {
                    String centity = canonicalize(entity.first);
                    EntityInfo ei = centityToInfo.get(centity);
                    if (ei == null) {
                        ei = new EntityInfo();
                        ei.archiveID = archiveID;
                        ei.displayName = entity.first;
                        centityToInfo.put(centity, ei);
                    }
                    ei.isCorrespondent = true;
                    ei.firstDate = entity.second.first.first;
                    ei.lastDate = entity.second.first.second;
                    ei.count = entity.second.second;
                });
                // Now process entities (except correspondents).
                entitiessummary.forEach(entity -> {
                    String centity = canonicalize(entity.first);
                    EntityInfo ei = centityToInfo.get(centity);
                    if (ei == null) {
                        ei = new EntityInfo();
                        ei.archiveID = archiveID;
                        ei.displayName = entity.first;
                        centityToInfo.put(centity, ei);
                    }
                    // ei.isCorrespondent=true;
                    ei.firstDate = entity.second.first.first;
                    ei.lastDate = entity.second.first.second;
                    ei.count = entity.second.second;
                });
            }
            log.info("Archive # " + archiveNum + " read " + centityToInfo.size() + " entities");
            // now set up this map as a token map
            for (EntityInfo ei : centityToInfo.values()) {
                String entity = ei.displayName;
                String centity = canonicalize(entity);
                allCEntities.add(centity);
                // consider a set of tokens because we don't want repeats
                Set<String> ctokens = new LinkedHashSet<>(Util.tokenize(centity));
                for (String ctoken : ctokens) cTokenToInfos.put(ctoken, ei);
            }
        } catch (Exception e) {
            Util.print_exception("Error loading archive in directory " + f.getAbsolutePath(), e, log);
        }
        archiveNum++;
    }
}
Also used : Config(edu.stanford.muse.Config) java.util(java.util) edu.stanford.muse.index(edu.stanford.muse.index) AddressBook(edu.stanford.muse.AddressBookManager.AddressBook) Util(edu.stanford.muse.util.Util) Multimap(com.google.common.collect.Multimap) Collectors(java.util.stream.Collectors) File(java.io.File) MappedEntity(edu.stanford.muse.ie.variants.MappedEntity) DetailedFacetItem(edu.stanford.muse.util.DetailedFacetItem) Contact(edu.stanford.muse.AddressBookManager.Contact) Pair(edu.stanford.muse.util.Pair) Logger(org.apache.logging.log4j.Logger) EntityBook(edu.stanford.muse.ie.variants.EntityBook) EmailUtils(edu.stanford.muse.util.EmailUtils) SimpleSessions(edu.stanford.muse.webapp.SimpleSessions) ModeConfig(edu.stanford.muse.webapp.ModeConfig) LogManager(org.apache.logging.log4j.LogManager) LinkedHashMultimap(com.google.common.collect.LinkedHashMultimap)
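
The two-pass merge in initialize (correspondents first, then entities, so that the entity count wins when a name appears in both) followed by the token index can be sketched without the ePADD types. This is an illustration under stated assumptions: EntityMergeSketch, Info and the canonicalize rule shown here are hypothetical (the real canonicalize is not shown in the listing), dates are omitted, and a plain Map of Sets stands in for Guava's LinkedHashMultimap.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

final class EntityMergeSketch {
    /** Hypothetical, reduced stand-in for EntityInfo. */
    static final class Info {
        String displayName;
        boolean isCorrespondent;
        int count;
    }

    // Assumption: trim, lower-case and collapse whitespace; the real rules may differ.
    static String canonicalize(String s) {
        return s.trim().toLowerCase(Locale.ROOT).replaceAll("\\s+", " ");
    }

    private static Info newInfo(String displayName) {
        Info info = new Info();
        info.displayName = displayName;
        return info;
    }

    /** Pass 1 records correspondents; pass 2 overwrites the count for names that are also entities. */
    static Map<String, Info> merge(Map<String, Integer> correspondents, Map<String, Integer> entities) {
        Map<String, Info> byCanonicalName = new LinkedHashMap<>();
        correspondents.forEach((name, count) -> {
            Info info = byCanonicalName.computeIfAbsent(canonicalize(name), k -> newInfo(name));
            info.isCorrespondent = true;
            info.count = count;
        });
        entities.forEach((name, count) -> {
            Info info = byCanonicalName.computeIfAbsent(canonicalize(name), k -> newInfo(name));
            info.count = count; // entity count is final; isCorrespondent is left untouched
        });
        return byCanonicalName;
    }

    /** Maps each distinct token of a canonical name to the infos whose name contains it. */
    static Map<String, Set<Info>> tokenIndex(Map<String, Info> infos) {
        Map<String, Set<Info>> index = new LinkedHashMap<>();
        for (Info info : infos.values()) {
            // a set of tokens, because repeated tokens in one name should not produce repeats
            Set<String> tokens = new LinkedHashSet<>(Arrays.asList(canonicalize(info.displayName).split(" ")));
            for (String token : tokens)
                index.computeIfAbsent(token, k -> new LinkedHashSet<>()).add(info);
        }
        return index;
    }
}
```

Because the display name is set only when an entry is first created, a name seen first as a correspondent keeps the correspondent's spelling even after the entity pass updates its count.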

Example 13 with AddressBook

use of edu.stanford.muse.AddressBookManager.AddressBook in project epadd by ePADD.

the class MuseEmailFetcher method fetchAndIndexEmails.

/**
 * key method to fetch actual email messages. can take a long time.
 * @param selectedFolders is in the format <account name>^-^<folder name>
 * @param session is used only to set the status provider object. callers who do not need to track status can leave it as null
 * emailDocs, addressBook and blobstore
 * @throws NoDefaultFolderException
 */
public void fetchAndIndexEmails(Archive archive, String[] selectedFolders, boolean useDefaultFolders, FetchConfig fetchConfig, HttpSession session, Consumer<StatusProvider> setStatusProvider) throws InterruptedException, JSONException, NoDefaultFolderException, CancelledException {
    setupFetchers(-1);
    long startTime = System.currentTimeMillis();
    setStatusProvider.accept(new StaticStatusProvider("Starting to process messages..."));
    // if (session != null)
    // session.setAttribute("statusProvider", new StaticStatusProvider("Starting to process messages..."));
    boolean op_cancelled = false, out_of_mem = false;
    BlobStore attachmentsStore = archive.getBlobStore();
    fetchConfig.downloadAttachments = fetchConfig.downloadAttachments && attachmentsStore != null;
    if (Util.nullOrEmpty(fetchers)) {
        log.warn("Trying to fetch email with no fetchers, setup not called ?");
        return;
    }
    setupFoldersForFetchers(fetchers, selectedFolders, useDefaultFolders);
    List<FolderInfo> fetchedFolderInfos = new ArrayList<>();
    // one fetcher will aggregate everything
    FetchStats stats = new FetchStats();
    MTEmailFetcher aggregatingFetcher = null;
    // a fetcher is one source, like an account or a top-level mbox dir. A fetcher could include multiple folders.
    long startTimeMillis = System.currentTimeMillis();
    for (MTEmailFetcher fetcher : fetchers) {
        // in theory, different iterations of this loop could be run in parallel ("archive" access will be synchronized)
        setStatusProvider.accept(fetcher);
        // if (session != null)
        // session.setAttribute("statusProvider", fetcher);
        fetcher.setArchive(archive);
        fetcher.setFetchConfig(fetchConfig);
        log.info("Memory status before fetching emails: " + Util.getMemoryStats());
        // this is the big call, can run for a long time. Note: running in the same thread, it's not fetcher.start();
        List<FolderInfo> foldersFetchedByThisFetcher = fetcher.run();
        // but don't abort immediately, only at the end, after addressbook has been built for at least the processed messages
        if (fetcher.isCancelled()) {
            log.info("NOTE: fetcher operation was cancelled");
            op_cancelled = true;
            break;
        }
        if (fetcher.mayHaveRunOutOfMemory()) {
            log.warn("Fetcher operation ran out of memory " + fetcher);
            out_of_mem = true;
            break;
        }
        fetchedFolderInfos.addAll(foldersFetchedByThisFetcher);
        if (aggregatingFetcher == null && !Util.nullOrEmpty(foldersFetchedByThisFetcher))
            // first non-empty fetcher
            aggregatingFetcher = fetcher;
        if (aggregatingFetcher != null)
            aggregatingFetcher.merge(fetcher);
        // add the indexed folders to the stats
        EmailStore store = fetcher.getStore();
        String fetcherDescription = store.displayName + ":" + store.emailAddress;
        for (FolderInfo fi : fetchedFolderInfos) stats.selectedFolders.add(new Pair<>(fetcherDescription, fi));
    }
    if (op_cancelled)
        throw new CancelledException();
    if (out_of_mem)
        throw new OutOfMemoryError();
    if (aggregatingFetcher != null) {
        stats.importStats = aggregatingFetcher.stats;
        if (aggregatingFetcher.mayHaveRunOutOfMemory())
            throw new OutOfMemoryError();
    }
    // save memory
    aggregatingFetcher = null;
    long endTimeMillis = System.currentTimeMillis();
    long elapsedMillis = endTimeMillis - startTimeMillis;
    log.info(elapsedMillis + " ms for fetch+index, Memory status: " + Util.getMemoryStats());
    // note: this is all archive docs, not just the ones that may have been just imported
    List<EmailDocument> allEmailDocs = (List) archive.getAllDocs();
    archive.addFetchedFolderInfos(fetchedFolderInfos);
    if (allEmailDocs.size() == 0)
        log.warn("0 messages from email fetcher");
    // EmailUtils.cleanDates(allEmailDocs);
    // create a new address book
    // if (session != null)
    // session.setAttribute("statusProvider", new StaticStatusProvider("Building address book..."));
    setStatusProvider.accept(new StaticStatusProvider("Building address book..."));
    AddressBook addressBook = EmailDocument.buildAddressBook(allEmailDocs, archive.ownerEmailAddrs, archive.ownerNames);
    log.info("Address book created!!");
    log.info("Address book stats: " + addressBook.getStats());
    // if (session != null)
    // session.setAttribute("statusProvider", new StaticStatusProvider("Finishing up..."));
    setStatusProvider.accept(new StaticStatusProvider("Finishing up..."));
    archive.setAddressBook(addressBook);
    // we shouldn't really have dups now because the archive ensures that only unique docs are added
    // move sorting to archive.postprocess?
    EmailUtils.removeDupsAndSort(allEmailDocs);
    // report stats
    stats.lastUpdate = new Date().getTime();
    // For issue #254.
    stats.archiveOwnerInput = name;
    stats.archiveTitleInput = archiveTitle;
    stats.primaryEmailInput = alternateEmailAddrs;
    stats.emailSourcesInput = emailSources;
    // ////
    // (String) JSPHelper.getSessionAttribute(session, "userKey");
    stats.userKey = "USER KEY UNUSED";
    stats.fetchAndIndexTimeMillis = elapsedMillis;
    updateStats(archive, addressBook, stats);
    // if (session != null)
    // session.removeAttribute("statusProvider");
    log.info("Fetch+index complete: " + Util.commatize(System.currentTimeMillis() - startTime) + " ms");
}
Also used : CancelledException(edu.stanford.muse.exceptions.CancelledException) EmailDocument(edu.stanford.muse.index.EmailDocument) AddressBook(edu.stanford.muse.AddressBookManager.AddressBook) BlobStore(edu.stanford.muse.datacache.BlobStore) Pair(edu.stanford.muse.util.Pair)
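
The control flow of fetchAndIndexEmails (report status through a callback, record a cancellation inside the loop, and throw only once the loop has been left) can be shown in a few lines. This is a sketch under assumptions, not the ePADD implementation: FetchLoopSketch, Source and the unchecked CancelledException below are hypothetical stand-ins, and folders are represented as plain strings.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.function.Consumer;

final class FetchLoopSketch {
    /** Hypothetical stand-in for one fetcher (an account or a top-level mbox dir). */
    interface Source {
        List<String> run();
        boolean isCancelled();
    }

    /** Hypothetical; unchecked here only to keep the sketch short. */
    static final class CancelledException extends RuntimeException {
    }

    /** Builds a source that "fetches" a single folder and reports the given cancelled state. */
    static Source source(String folder, boolean cancelled) {
        return new Source() {
            @Override public List<String> run() { return Collections.singletonList(folder); }
            @Override public boolean isCancelled() { return cancelled; }
        };
    }

    static List<String> fetchAll(List<Source> sources, Consumer<String> setStatus) {
        List<String> fetchedFolders = new ArrayList<>();
        boolean cancelled = false;
        setStatus.accept("Starting to process messages...");
        for (Source source : sources) {
            List<String> folders = source.run(); // runs in the calling thread
            if (source.isCancelled()) {
                cancelled = true; // remember the cancellation; this source's folders are not recorded
                break;
            }
            fetchedFolders.addAll(folders);
        }
        if (cancelled)
            throw new CancelledException(); // raised after the loop, as in the listing
        setStatus.accept("Finishing up...");
        return fetchedFolders;
    }
}
```

Passing a Consumer instead of an HttpSession lets callers that do not track progress supply a no-op such as `s -> {}`.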

Example 14 with AddressBook

use of edu.stanford.muse.AddressBookManager.AddressBook in project epadd by ePADD.

the class NameExpansion method getMatches.

/* Given the string s in emailDocument ed, returns a matches object with candidates matching s */
public static Matches getMatches(String s, Archive archive, EmailDocument ed, int maxResults) {
    Matches matches = new Matches(s, maxResults);
    AddressBook ab = archive.addressBook;
    List<Contact> contactsExceptSelf = ed.getParticipatingContactsExceptOwn(archive.addressBook);
    List<Contact> contacts = new ArrayList<>(contactsExceptSelf);
    contacts.add(ab.getContactForSelf());
    // check if s matches any contacts on this message
    outer: for (Contact c : contacts) {
        if (c.getNames() == null)
            continue;
        for (String name : c.getNames()) {
            StringMatchType matchType = Matches.match(s, name);
            if (matchType != null) {
                float score = 1.0F;
                if (matches.addMatch(name, score, matchType, "Name of a contact on this message", true))
                    return matches;
                continue outer;
            }
        }
    }
    // check if s matches anywhere else in this message
    if (matchAgainstEmailContent(archive, ed, matches, "Mentioned elsewhere in this message", 1.0F)) {
        return matches;
    }
    synchronized (archive) {
        if (ed.threadID == 0L) {
            archive.assignThreadIds();
        }
    }
    // check if s matches anywhere else in this thread
    List<EmailDocument> messagesInThread = (List) archive.docsWithThreadId(ed.threadID);
    for (EmailDocument messageInThread : messagesInThread) {
        if (matchAgainstEmailContent(archive, messageInThread, matches, "Mentioned in this thread", 0.9F)) {
            return matches;
        }
    }
    // check if s matches any other email with any of these correspondents
    for (Contact c : contactsExceptSelf) {
        if (c.getEmails() != null) {
            String correspondentsSearchStr = String.join(";", c.getEmails());
            // filterForCorrespondents does not use queryParams, so it is fine to instantiate the SearchResult
            // object with queryParams as null. After refactoring, filter methods take SearchObject as input and modify it
            // according to the filter.
            SearchResult inputSet = new SearchResult(archive, null);
            SearchResult outputSet = SearchResult.filterForCorrespondents(inputSet, correspondentsSearchStr, true, true, true, true);
            Set<Document> messagesWithSameCorrespondents = outputSet.getDocumentSet();
            for (Document messageWithSameCorrespondents : messagesWithSameCorrespondents) {
                EmailDocument edoc = (EmailDocument) messageWithSameCorrespondents;
                if (matchAgainstEmailContent(archive, edoc, matches, "Mentioned in other messages with these correspondents", 0.8F)) {
                    return matches;
                }
            }
        }
    }
    // search for s anywhere in the archive
    Multimap<String, String> params = LinkedHashMultimap.create();
    params.put("termSubject", "on");
    params.put("termBody", "on");
    String term = s;
    if (s.contains(" ") && (!s.startsWith("\"") || !s.endsWith("\""))) {
        term = "\"" + s + "\"";
    }
    // To search for terms, create a searchResult object and invoke appropriate filter method on it.
    SearchResult inputSet = new SearchResult(archive, params);
    SearchResult outputSet = SearchResult.searchForTerm(inputSet, term);
    Set<Document> docsWithTerm = outputSet.getDocumentSet();
    for (Document docWithTerm : docsWithTerm) {
        EmailDocument edoc = (EmailDocument) docWithTerm;
        if (matchAgainstEmailContent(archive, edoc, matches, "Mentioned elsewhere in this archive", 0.7F))
            return matches;
    }
    return matches;
}
Also used : EmailDocument(edu.stanford.muse.index.EmailDocument) SearchResult(edu.stanford.muse.index.SearchResult) Document(edu.stanford.muse.index.Document) Contact(edu.stanford.muse.AddressBookManager.Contact) AddressBook(edu.stanford.muse.AddressBookManager.AddressBook)
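
getMatches widens its search scope step by step (contacts on the message, the message itself, the thread, other messages with the same correspondents, the whole archive), assigns a lower score at each wider scope, and returns as soon as enough matches have been collected. A minimal sketch of that cascade follows. The names TieredMatchSketch, Tier and Match are hypothetical, each scope is reduced to a list of candidate strings, and a case-insensitive substring test stands in for Matches.match, whose real rules are not shown. quoteIfPhrase mirrors the quoting condition used before the archive-wide search.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class TieredMatchSketch {
    /** One search scope: why a hit counts, how much it counts, and where to look. */
    static final class Tier {
        final String reason;
        final float score;
        final List<String> candidates;
        Tier(String reason, float score, List<String> candidates) {
            this.reason = reason;
            this.score = score;
            this.candidates = candidates;
        }
    }

    static final class Match {
        final String text;
        final float score;
        final String reason;
        Match(String text, float score, String reason) {
            this.text = text;
            this.score = score;
            this.reason = reason;
        }
    }

    /** Walks the tiers in order and stops as soon as maxResults matches are collected. */
    static List<Match> getMatches(String s, List<Tier> tiers, int maxResults) {
        List<Match> matches = new ArrayList<>();
        Set<String> seen = new LinkedHashSet<>();
        String needle = s.toLowerCase(Locale.ROOT);
        for (Tier tier : tiers) {
            for (String candidate : tier.candidates) {
                // Assumption: substring containment stands in for Matches.match
                if (!candidate.toLowerCase(Locale.ROOT).contains(needle))
                    continue;
                if (!seen.add(candidate))
                    continue; // already matched in a narrower, higher-scoring tier
                matches.add(new Match(candidate, tier.score, tier.reason));
                if (matches.size() >= maxResults)
                    return matches;
            }
        }
        return matches;
    }

    /** Wraps a multi-word term in double quotes unless it is already quoted on both ends. */
    static String quoteIfPhrase(String s) {
        if (s.contains(" ") && (!s.startsWith("\"") || !s.endsWith("\"")))
            return "\"" + s + "\"";
        return s;
    }
}
```

Ordering the tiers from narrowest to widest means the expensive archive-wide search is reached only when the cheaper scopes have not already filled the result list.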

Example 15 with AddressBook

use of edu.stanford.muse.AddressBookManager.AddressBook in project epadd by ePADD.

the class ArchiveReaderWriter method readAddressBook.

public static AddressBook readAddressBook(String addressBookPath, Collection<Document> alldocs) {
    // try-with-resources closes the reader even if reading the address book fails part-way
    try (BufferedReader br = new BufferedReader(new FileReader(addressBookPath))) {
        return AddressBook.readObjectFromStream(br, alldocs);
    } catch (IOException e) {
        // also covers FileNotFoundException, which is a subclass of IOException
        e.printStackTrace();
        return null;
    }
}
Also used : AddressBook(edu.stanford.muse.AddressBookManager.AddressBook)

Aggregations

AddressBook (edu.stanford.muse.AddressBookManager.AddressBook) 20
Contact (edu.stanford.muse.AddressBookManager.Contact) 12
java.util (java.util) 7
Collectors (java.util.stream.Collectors) 7
LinkedHashMultimap (com.google.common.collect.LinkedHashMultimap) 6
Blob (edu.stanford.muse.datacache.Blob) 6
BlobStore (edu.stanford.muse.datacache.BlobStore) 6
Multimap (com.google.common.collect.Multimap) 5
AnnotationManager (edu.stanford.muse.AnnotationManager.AnnotationManager) 5
Pair (edu.stanford.muse.util.Pair) 5
Util (edu.stanford.muse.util.Util) 5
Config (edu.stanford.muse.Config) 4
LabelManager (edu.stanford.muse.LabelManager.LabelManager) 4
EntityBook (edu.stanford.muse.ie.variants.EntityBook) 4
EmailUtils (edu.stanford.muse.util.EmailUtils) 4
java.io (java.io) 4
Address (javax.mail.Address) 4
InternetAddress (javax.mail.internet.InternetAddress) 4
LogManager (org.apache.logging.log4j.LogManager) 4
Logger (org.apache.logging.log4j.Logger) 4