Example 6 with DomainSuffix

Use of org.apache.nutch.util.domain.DomainSuffix in the project nutch by apache.

In the class DomainDenylistURLFilter, the method filter.

@Override
public String filter(String url) {
    try {
        // Match against suffix, domain, and host, in that order. A match on a
        // more general entry (e.g. the suffix) rejects the URL even if the more
        // specific domain or host is not listed.
        String domain = URLUtil.getDomainName(url).toLowerCase().trim();
        String host = URLUtil.getHost(url);
        String suffix = null;
        DomainSuffix domainSuffix = URLUtil.getDomainSuffix(url);
        if (domainSuffix != null) {
            suffix = domainSuffix.getDomain();
        }
        if (domainSet.contains(suffix) || domainSet.contains(domain) || domainSet.contains(host)) {
            // Matches, filter!
            return null;
        }
        // doesn't match, allow
        return url;
    } catch (Exception e) {
        // if an error happens, log it and reject the url (returning null filters it out)
        LOG.error("Could not apply filter on url: " + url + "\n" + org.apache.hadoop.util.StringUtils.stringifyException(e));
        return null;
    }
}
Also used: DomainSuffix (org.apache.nutch.util.domain.DomainSuffix), IOException (java.io.IOException)
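
The matching order used in the method above (suffix, then domain, then host) can be tried without a Nutch installation. The following is only a minimal sketch under stated assumptions: the class name DenylistSketch is hypothetical, java.net.URI stands in for URLUtil, and the suffix and domain are approximated by the last one and last two host labels, which is much cruder than Nutch's DomainSuffix lookup (it would treat "co.uk" as a domain, for example).

```java
import java.net.URI;
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-in for a denylist URL filter; not part of Nutch.
public class DenylistSketch {
    // Entries may be a suffix ("xyz"), a domain ("example.org"), or a full host.
    private final Set<String> denylist;

    public DenylistSketch(Set<String> entries) {
        // Copy into a HashSet so lookups tolerate any argument safely.
        this.denylist = new HashSet<>(entries);
    }

    // Returns null when the URL is rejected, otherwise the URL unchanged.
    public String filter(String url) {
        try {
            String host = URI.create(url).getHost();
            if (host == null) {
                return null; // nothing to match against; this sketch rejects
            }
            host = host.toLowerCase();
            String[] labels = host.split("\\.");
            int n = labels.length;
            // Assumption: last label approximates the suffix,
            // last two labels approximate the domain name.
            String suffix = labels[n - 1];
            String domain = n >= 2 ? labels[n - 2] + "." + suffix : host;
            if (denylist.contains(suffix) || denylist.contains(domain) || denylist.contains(host)) {
                return null; // listed, filter it out
            }
            return url; // not listed, allow
        } catch (IllegalArgumentException e) {
            return null; // unparsable URL; rejected, mirroring the method above
        }
    }

    public static void main(String[] args) {
        DenylistSketch f = new DenylistSketch(Set.of("example.org", "xyz"));
        System.out.println(f.filter("http://www.example.org/page")); // domain is listed
        System.out.println(f.filter("http://shop.site.xyz/"));       // suffix is listed
        System.out.println(f.filter("http://example.com/"));         // not listed
    }
}
```

Because the three lookups are joined with ||, a listed suffix rejects every host under it, whether or not the more specific domain or host appears in the set.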

Aggregations

DomainSuffix (org.apache.nutch.util.domain.DomainSuffix): 6
IOException (java.io.IOException): 3
URL (java.net.URL): 1
IndexingException (org.apache.nutch.indexer.IndexingException): 1
NutchField (org.apache.nutch.indexer.NutchField): 1
DomainSuffixes (org.apache.nutch.util.domain.DomainSuffixes): 1