use of org.opensextant.extractors.geo.rules.GeocodeRule in project Xponents by OpenSextant.
the class PlaceGeocoder method extract.
/**
* Unfinished Beta; ready for experimentation and improvement on rules.
*
* Extractor.extract() runs PlacenameMatcher to tag place names and XCoord to
* parse coordinates. In the end you have all geo entities ranked and scored.
*
* LangID can be set on TextInput input.langid. Use lowercase langIDs only:
* 'zh' and 'ar' tag text for those languages in particular. Null and other values
* are treated as generic as of v2.8.
*
* <pre>
* Use TextMatch.getType()
* to determine how to interpret TextMatch / Geocoding results:
*
* Given TextMatch match
*
* Place tag: ((PlaceCandidate)match).getGeocoding()
* Coord tag: (Geocoding)match
*
* Both methods yield a geocoding.
* </pre>
*
* @param input
* input buffer, doc ID, and optional langID.
* @return TextMatch instances: the evaluated PlaceCandidates plus any coordinate matches.
* @throws ExtractionException
* on error
*/
@Override
public List<TextMatch> extract(TextInput input) throws ExtractionException {
long t1 = System.currentTimeMillis();
reset();
List<TextMatch> matches = new ArrayList<TextMatch>();
List<TextMatch> coordinates = null;
// 0. GEOTAG raw text. Flag tag-only = false; in other words, do extra work for geocoding.
//
List<PlaceCandidate> candidates = null;
if (input.langid == null) {
candidates = tagText(input.buffer, input.id, tagOnly);
//} else if (TextUtils.isCJK(input.langid)) {
// candidates = this.tagCJKText(input.buffer, input.id, tagOnly);
} else if (TextUtils.arabicLang.equals(input.langid)) {
candidates = this.tagArabicText(input.buffer, input.id, tagOnly);
} else {
// Default - unknown language.
log.debug("Default Language {}. Treating as Generic.", input.langid);
candidates = tagText(input, tagOnly);
}
// 1. COORDINATES. If caller thinks their data may have coordinates, then attempt to parse
// lat/lon. Any coordinates found fire rules to resolve lat/lon to a Province/Country if possible.
//
coordinates = parseGeoCoordinates(input);
if (coordinates != null) {
matches.addAll(coordinates);
}
/*
* 3. RULE EVALUATION: accumulate all the evidence from everything found so far.
* Assemble some histograms to support basic counts, weighting and sorting.
*
* Rules: Work with observables first, then move on to associations between candidates and more obscure fine tuning.
* a. Country rule -- a named country weighs heavily.
* b. Place, Boundary rule -- a city or location, followed/qualified by a geopolitical boundary name or code. Paris, France; Paris, Texas.
* c. Coordinate rule -- coordinates emit Province ID and Country ID if possible, so inferred Provinces are weighted heavily.
* d. Person name rule -- filters out person names heavily, making use of JRC Names and your own data sets as a TaxCat catalog/tagger.
* e. Major Places rule -- well-known large cities, capitals or provinces are weighted moderately.
* f. Province association rule -- for each found place, weight geos falling in Provinces positively ID'd.
* g. Location Chooser rule -- assemble all evidence and account for weights.
*/
countryRule.evaluate(candidates);
nameWithAdminRule.evaluate(candidates);
// 2. NON-PLACE ID. Tag person and org names to negate celebrity names or well-known
// individuals who share a city name. "Tom Jackson", "Bill Clinton"
//
parseKnownNonPlaces(input, candidates, matches);
// Measure duration of tagging.
this.taggingTimes.addTimeSince(t1);
//
for (GeocodeRule r : rules) {
r.evaluate(candidates);
}
// Last rule: score, choose, add confidence.
//
chooser.evaluate(candidates);
// All candidates are added to matches, whether or not PlaceCandidate.chosen
// (the selected Geocoding) is set; callers read the chosen location from each candidate.
// Non-geocoded matches will appear in non-GIS formats.
//
// Downstream recipients of 'matches' must know how to parse through
// evaluated place candidates. We send the candidates and all evidence.
matches.addAll(candidates);
// Measure full processing duration for this doc.
this.matcherTotalTimes.addBytes(input.buffer.length());
this.matcherTotalTimes.addTimeSince(t1);
return matches;
}
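The Javadoc for extract() asks callers to branch on the match type: a place tag carries its geocoding in a separate chosen location, while a coordinate tag is itself the geocoding. A minimal caller-side sketch of that branching follows. It uses stand-in classes rather than the real Xponents types: Geo, Point, Match, PlaceMatch, CoordMatch, MatchInterpreter and the type strings "place" and "coord" are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins for the Xponents types; names and type strings are assumptions.
interface Geo {
    double lat();
    double lon();
}

class Point implements Geo {
    private final double lat, lon;
    Point(double lat, double lon) { this.lat = lat; this.lon = lon; }
    public double lat() { return lat; }
    public double lon() { return lon; }
}

class Match {
    final String text;
    Match(String text) { this.text = text; }
    String getType() { return "generic"; }
}

// Place tag: the geocoding is a separately chosen location and may be null.
class PlaceMatch extends Match {
    private final Geo chosen;
    PlaceMatch(String text, Geo chosen) { super(text); this.chosen = chosen; }
    @Override String getType() { return "place"; }
    Geo getGeocoding() { return chosen; }
}

// Coord tag: the match itself is the geocoding.
class CoordMatch extends Match implements Geo {
    private final double lat, lon;
    CoordMatch(String text, double lat, double lon) { super(text); this.lat = lat; this.lon = lon; }
    @Override String getType() { return "coord"; }
    public double lat() { return lat; }
    public double lon() { return lon; }
}

class MatchInterpreter {
    /** Collects one geocoding per match, skipping places with no chosen location. */
    static List<Geo> geocodings(List<Match> matches) {
        List<Geo> out = new ArrayList<>();
        for (Match m : matches) {
            switch (m.getType()) {
            case "place": {
                Geo g = ((PlaceMatch) m).getGeocoding();
                if (g != null) {
                    out.add(g);
                }
                break;
            }
            case "coord":
                out.add((Geo) m);
                break;
            default:
                break; // other match types carry no location in this sketch
            }
        }
        return out;
    }
}
```

The null check matters because extract() returns every candidate, including those for which no location was chosen.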
use of org.opensextant.extractors.geo.rules.GeocodeRule in project Xponents by OpenSextant.
the class PlaceGeocoder method reset.
private void reset() {
this.relevantCountries.clear();
this.relevantProvinces.clear();
this.relevantLocations.clear();
this.nationalities.clear();
personNameRule.reset();
countryRule.reset();
majorPlaceRule.reset();
chooser.reset();
for (GeocodeRule r : rules) {
r.reset();
}
}
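Because extract() calls reset() before each document, every rule has to drop whatever per-document evidence it accumulated, or counts from one document would leak into the next. A minimal sketch of that reset-then-evaluate cycle follows. Rule, CountingRule and RulePipeline are hypothetical names standing in for GeocodeRule and PlaceGeocoder; they are not part of Xponents, and candidates are reduced to plain strings.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for GeocodeRule: evaluate per document, reset between documents.
interface Rule {
    void evaluate(List<String> candidates);
    void reset();
}

/** Counts how often each candidate name appears in the current document. */
class CountingRule implements Rule {
    final Map<String, Integer> counts = new HashMap<>();

    @Override
    public void evaluate(List<String> candidates) {
        for (String c : candidates) {
            counts.merge(c, 1, Integer::sum);
        }
    }

    @Override
    public void reset() {
        counts.clear();
    }
}

class RulePipeline {
    private final List<Rule> rules = new ArrayList<>();

    void add(Rule r) {
        rules.add(r);
    }

    /** Processes one document: clear all rule state first, then run the rules in order. */
    void process(List<String> candidates) {
        for (Rule r : rules) {
            r.reset();
        }
        for (Rule r : rules) {
            r.evaluate(candidates);
        }
    }
}
```

Resetting at the start of process(), rather than at the end, leaves the accumulated evidence available for inspection after each document.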