Use of org.apache.nutch.util.domain.DomainSuffix in the Apache Nutch project.
The filter method of the class DomainDenylistURLFilter.
@Override
public String filter(String url) {
    try {
        // match for suffix, domain, and host in that order. more general will
        // override more specific
        String domain = URLUtil.getDomainName(url).toLowerCase().trim();
        String host = URLUtil.getHost(url);
        String suffix = null;
        DomainSuffix domainSuffix = URLUtil.getDomainSuffix(url);
        if (domainSuffix != null) {
            suffix = domainSuffix.getDomain();
        }
        if (domainSet.contains(suffix) || domainSet.contains(domain)
                || domainSet.contains(host)) {
            // Matches, filter!
            return null;
        }
        // doesn't match, allow
        return url;
    } catch (Exception e) {
        // if an error happens, log it and reject the url
        // (returning null filters the url out)
        LOG.error("Could not apply filter on url: " + url + "\n"
                + org.apache.hadoop.util.StringUtils.stringifyException(e));
        return null;
    }
}
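
The suffix, domain, and host checks in the method above can be tried outside Nutch with a minimal standalone sketch. Everything in it is an assumption made for illustration: the class name DenylistSketch is hypothetical, java.net.URI stands in for URLUtil, and the "domain" is approximated as the last two host labels, whereas Nutch resolves domains and suffixes through its own DomainSuffix tables. It keeps the same contract as the original: return the URL to allow it, return null to reject it.

```java
import java.net.URI;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

public class DenylistSketch {
    // Hypothetical denylist entries: suffixes, domains, or full host names.
    private final Set<String> denied;

    public DenylistSketch(Set<String> denied) {
        this.denied = new HashSet<>(denied);
    }

    /** Returns the url if it is allowed, or null if it is rejected. */
    public String filter(String url) {
        try {
            String host = URI.create(url).getHost();
            if (host == null) {
                return null; // no host could be parsed: reject
            }
            host = host.toLowerCase(Locale.ROOT);
            String[] labels = host.split("\\.");
            // Simplification: suffix = last label, domain = last two labels.
            String suffix = labels[labels.length - 1];
            String domain = labels.length >= 2
                    ? labels[labels.length - 2] + "." + suffix
                    : host;
            // Same order as the original: suffix, then domain, then host.
            if (denied.contains(suffix) || denied.contains(domain)
                    || denied.contains(host)) {
                return null; // matches the denylist: reject
            }
            return url; // no match: allow
        } catch (IllegalArgumentException e) {
            return null; // malformed url: reject, like the catch branch above
        }
    }

    public static void main(String[] args) {
        DenylistSketch f = new DenylistSketch(Set.of("example.com", "xyz"));
        System.out.println(f.filter("http://www.example.com/a")); // rejected via domain
        System.out.println(f.filter("http://foo.xyz/"));          // rejected via suffix
        System.out.println(f.filter("http://nutch.apache.org/")); // allowed
    }
}
```

Because any one of the three checks is enough to reject, a broad entry such as a suffix blocks every host beneath it, which is what the "more general will override more specific" comment describes.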