Example 1 with GeocodeRule

Use of org.opensextant.extractors.geo.rules.GeocodeRule in the Xponents project by OpenSextant.

Class PlaceGeocoder, method extract.

/**
 * Unfinished beta; ready for experimentation and improvement on rules.
 *
 * Extractor.extract() first calls XCoord to get coordinates, then
 * PlacenameMatcher. In the end you have all geo entities ranked and scored.
 *
 * LangID can be set on TextInput input.langid. Use lowercase langIDs only,
 * for example 'zh' or 'ar', which tag text for those languages in particular.
 * Null and other values are treated as generic as of v2.8.
 *
 * <pre>
 * Use TextMatch.getType()
 * to determine how to interpret TextMatch / Geocoding results:
 *
 * Given TextMatch match
 *
 *    Place tag:   ((PlaceCandidate)match).getGeocoding()
 *    Coord tag:   (Geocoding)match
 *
 * Both methods yield a geocoding.
 * </pre>
 *
 * @param input
 *            input buffer, doc ID, and optional langID.
 * @return TextMatch instances which are all PlaceCandidates.
 * @throws ExtractionException
 *             on error
 */
@Override
public List<TextMatch> extract(TextInput input) throws ExtractionException {
    long t1 = System.currentTimeMillis();
    reset();
    List<TextMatch> matches = new ArrayList<TextMatch>();
    List<TextMatch> coordinates = null;
    // 0. GEOTAG raw text. Flag tag-only = false, in other words, do extra work for geocoding.
    //
    List<PlaceCandidate> candidates = null;
    if (input.langid == null) {
        candidates = tagText(input.buffer, input.id, tagOnly);
    //} else if (TextUtils.isCJK(input.langid)) {
    // candidates = this.tagCJKText(input.buffer, input.id, tagOnly);
    } else if (TextUtils.arabicLang.equals(input.langid)) {
        candidates = this.tagArabicText(input.buffer, input.id, tagOnly);
    } else {
        // Default - unknown language.
        log.debug("Default Language {}. Treating as Generic.", input.langid);
        candidates = tagText(input, tagOnly);
    }
    // 1. COORDINATES. If the caller thinks their data may have coordinates, then attempt to parse
    // lat/lon. Any coordinates found fire rules to resolve lat/lon to a Province/Country if possible.
    //
    coordinates = parseGeoCoordinates(input);
    if (coordinates != null) {
        matches.addAll(coordinates);
    }
    /*
     * 3. RULE EVALUATION: accumulate all the evidence from everything found so far.
     * Assemble some histograms to support some basic counts, weighting and sorting.
     *
     * Rules: Work with observables first, then move on to associations between candidates and more obscure fine-tuning.
     * 1a.  Country -- a named country weighs heavily.
     * 1b.  Place, Boundary -- a city or location, followed/qualified by a geopolitical boundary name or code. Paris, France; Paris, Texas.
     * 1c.  Coordinate rule -- coordinates emit Province ID and Country ID if possible. So inferred Provinces are weighted heavily.
     * b.  Person name rule -- filters out heavily, making use of JRC Names and your own data sets as a TaxCat catalog/tagger.
     * d.  Major Places rule -- well-known large cities, capitals or provinces are weighted moderately.
     * e.  Province association rule -- for each found place, weight geos falling in Provinces positively ID'd.
     * f.  Location Chooser rule -- assemble all evidence and account for weights.
     */
    countryRule.evaluate(candidates);
    nameWithAdminRule.evaluate(candidates);
    // 2. NON-PLACE ID. Tag person and org names to negate celebrity names or well-known
    // individuals who share a city name. "Tom Jackson", "Bill Clinton"
    //
    parseKnownNonPlaces(input, candidates, matches);
    // Measure duration of tagging.
    this.taggingTimes.addTimeSince(t1);
    // 
    for (GeocodeRule r : rules) {
        r.evaluate(candidates);
    }
    // Last rule: score, choose, add confidence.
    // 
    chooser.evaluate(candidates);
    // For each candidate, if PlaceCandidate.chosen is not null,
    // add chosen (Geocoding) to matches
    // Otherwise add PlaceCandidates to matches.
    // non-geocoded matches will appear in non-GIS formats.
    //
    // Downstream recipients of 'matches' must know how to parse through
    // evaluated place candidates. We send the candidates and all evidence.
    matches.addAll(candidates);
    // Measure full processing duration for this doc.
    this.matcherTotalTimes.addBytes(input.buffer.length());
    this.matcherTotalTimes.addTimeSince(t1);
    return matches;
}
Also used: ArrayList (java.util.ArrayList), TextMatch (org.opensextant.extraction.TextMatch), GeocodeRule (org.opensextant.extractors.geo.rules.GeocodeRule)
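
The Javadoc on extract() says that callers should check the match type and then either cast a coordinate match directly or ask a place candidate for its geocoding. The following is a minimal sketch of that dispatch pattern only. Every type in it (Geo, Point, Match, CoordMatch, PlaceMatch) and the "coord"/"place" type strings are hypothetical stand-ins invented for illustration; they are not the Xponents classes, and the real TextMatch, PlaceCandidate and Geocoding signatures may differ.

```java
import java.util.ArrayList;
import java.util.List;

class MatchDispatchSketch {
    // Hypothetical stand-in for a geocoding result; NOT the Xponents Geocoding interface.
    interface Geo {
        double lat();
        double lon();
    }

    // Simple immutable location used by both kinds of match in this sketch.
    static class Point implements Geo {
        private final double lat;
        private final double lon;
        Point(double lat, double lon) { this.lat = lat; this.lon = lon; }
        @Override public double lat() { return lat; }
        @Override public double lon() { return lon; }
    }

    // Hypothetical stand-in for a generic text match carrying a type label.
    static class Match {
        final String type; // "coord" or "place" (assumed labels for this sketch)
        final String text;
        Match(String type, String text) { this.type = type; this.text = text; }
    }

    // A coordinate match is itself a geocoding, so a cast is enough.
    static class CoordMatch extends Match implements Geo {
        private final Point point;
        CoordMatch(String text, double lat, double lon) {
            super("coord", text);
            this.point = new Point(lat, lon);
        }
        @Override public double lat() { return point.lat(); }
        @Override public double lon() { return point.lon(); }
    }

    // A place match holds a chosen geocoding, which is null when nothing was chosen.
    static class PlaceMatch extends Match {
        final Geo chosen;
        PlaceMatch(String text, Geo chosen) { super("place", text); this.chosen = chosen; }
    }

    // Collect one geocoding per match, skipping places that were never resolved.
    static List<Geo> geocodings(List<Match> matches) {
        List<Geo> out = new ArrayList<>();
        for (Match m : matches) {
            if ("coord".equals(m.type)) {
                out.add((Geo) m);
            } else if ("place".equals(m.type)) {
                Geo g = ((PlaceMatch) m).chosen;
                if (g != null) {
                    out.add(g);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Match> matches = new ArrayList<>();
        matches.add(new CoordMatch("42.1N 71.0W", 42.1, -71.0));
        matches.add(new PlaceMatch("Paris", new Point(48.85, 2.35)));
        matches.add(new PlaceMatch("Tom Jackson", null)); // filtered as a non-place
        for (Geo g : geocodings(matches)) {
            System.out.println(g.lat() + ", " + g.lon());
        }
    }
}
```

Because extract() returns coordinates and place candidates in one list, a consumer along these lines has to tolerate candidates with no chosen location, which the sketch models as a null.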

Example 2 with GeocodeRule

Use of org.opensextant.extractors.geo.rules.GeocodeRule in the Xponents project by OpenSextant.

Class PlaceGeocoder, method reset.

private void reset() {
    this.relevantCountries.clear();
    this.relevantProvinces.clear();
    this.relevantLocations.clear();
    this.nationalities.clear();
    personNameRule.reset();
    countryRule.reset();
    majorPlaceRule.reset();
    chooser.reset();
    for (GeocodeRule r : rules) {
        r.reset();
    }
}
Also used: GeocodeRule (org.opensextant.extractors.geo.rules.GeocodeRule)
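
Both examples iterate over a list of rules, calling reset() before a document is processed and evaluate() on its candidates. The sketch below illustrates why that per-document reset matters when a rule keeps state. The Rule, Candidate, MentionCountRule and Pipeline types are hypothetical and written only for this illustration; they assume nothing about the real GeocodeRule API beyond the evaluate/reset pair visible in the snippets.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class RuleLifecycleSketch {
    // Hypothetical candidate: a matched name plus an accumulated score.
    static class Candidate {
        final String text;
        int score = 0;
        Candidate(String text) { this.text = text; }
    }

    // Hypothetical rule contract mirroring the evaluate/reset pair.
    static abstract class Rule {
        abstract void evaluate(List<Candidate> candidates);
        void reset() { /* stateless rules have nothing to clear */ }
    }

    // Stateful rule: scores each candidate by how often its name occurs in the document.
    static class MentionCountRule extends Rule {
        private final Map<String, Integer> counts = new HashMap<>();

        @Override
        void evaluate(List<Candidate> candidates) {
            for (Candidate c : candidates) {
                counts.merge(c.text, 1, Integer::sum);
            }
            for (Candidate c : candidates) {
                c.score += counts.get(c.text);
            }
        }

        @Override
        void reset() {
            counts.clear(); // drop evidence gathered from the previous document
        }
    }

    // Runs every rule over one document, clearing per-document state first.
    static class Pipeline {
        final List<Rule> rules = new ArrayList<>();

        List<Candidate> process(List<Candidate> candidates) {
            for (Rule r : rules) {
                r.reset();
            }
            for (Rule r : rules) {
                r.evaluate(candidates);
            }
            return candidates;
        }
    }

    public static void main(String[] args) {
        Pipeline pipeline = new Pipeline();
        pipeline.rules.add(new MentionCountRule());

        List<Candidate> doc1 = pipeline.process(
                Arrays.asList(new Candidate("Paris"), new Candidate("Paris")));
        List<Candidate> doc2 = pipeline.process(
                Arrays.asList(new Candidate("Paris")));

        System.out.println("doc1 score: " + doc1.get(0).score);
        System.out.println("doc2 score: " + doc2.get(0).score);
    }
}
```

Without the reset pass, the counts from the first document would carry over and inflate the score of the single mention in the second one.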

Aggregations

GeocodeRule (org.opensextant.extractors.geo.rules.GeocodeRule): 2
ArrayList (java.util.ArrayList): 1
TextMatch (org.opensextant.extraction.TextMatch): 1