Example 1 with ResultData

Use of org.codelibs.fess.crawler.entity.ResultData in project fess by codelibs.

Class AbstractFessFileTransformer, method transform.

@Override
public ResultData transform(final ResponseData responseData) {
    if (responseData == null || !responseData.hasResponseBody()) {
        throw new CrawlingAccessException("No response body.");
    }
    final ResultData resultData = new ResultData();
    resultData.setTransformerName(getName());
    try {
        resultData.setData(SerializeUtil.fromObjectToBinary(generateData(responseData)));
    } catch (final Exception e) {
        throw new CrawlingAccessException("Could not serialize object", e);
    }
    resultData.setEncoding(fessConfig.getCrawlerCrawlingDataEncoding());
    return resultData;
}
Also used : AccessResultData(org.codelibs.fess.crawler.entity.AccessResultData) ResultData(org.codelibs.fess.crawler.entity.ResultData) CrawlingAccessException(org.codelibs.fess.crawler.exception.CrawlingAccessException) CrawlerSystemException(org.codelibs.fess.crawler.exception.CrawlerSystemException)
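
The transform method above stores the generated data as a byte array, and Example 2 later reads it back into a Map. The following is a minimal sketch of that round trip, assuming SerializeUtil behaves like plain Java object serialization. The class SerializeRoundTrip and its methods toBinary and toObject are hypothetical names for illustration and are not part of Fess.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper (not part of Fess): uses java.io serialization as a
// stand-in for SerializeUtil.fromObjectToBinary / fromBinaryToObject.
public class SerializeRoundTrip {

    // Serialize any Serializable object into a byte array.
    public static byte[] toBinary(final Object obj) {
        final ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(obj);
        } catch (final IOException e) {
            throw new UncheckedIOException("Could not serialize object", e);
        }
        return bos.toByteArray();
    }

    // Rebuild the object from the bytes produced by toBinary.
    public static Object toObject(final byte[] bytes) {
        try (ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return ois.readObject();
        } catch (final IOException e) {
            throw new UncheckedIOException("Could not create an instance from bytes.", e);
        } catch (final ClassNotFoundException e) {
            throw new IllegalStateException("Could not create an instance from bytes.", e);
        }
    }

    public static void main(final String[] args) {
        final Map<String, Object> dataMap = new HashMap<>();
        dataMap.put("url", "http://example.com/");
        dataMap.put("content_length", 42L);

        final byte[] data = toBinary(dataMap);
        @SuppressWarnings("unchecked")
        final Map<String, Object> restored = (Map<String, Object>) toObject(data);

        System.out.println(restored.equals(dataMap)); // prints "true"
    }
}
```

As in the original method, serialization failures are wrapped in an unchecked exception so callers do not have to declare IOException.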

Example 2 with ResultData

Use of org.codelibs.fess.crawler.entity.ResultData in project fess by codelibs.

Class FileListIndexUpdateCallbackImpl, method processRequest.

protected String processRequest(final Map<String, String> paramMap, final Map<String, Object> dataMap, final String url, final CrawlerClient client) {
    final long startTime = System.currentTimeMillis();
    try (final ResponseData responseData = client.execute(RequestDataBuilder.newRequestData().get().url(url).build())) {
        if (responseData.getRedirectLocation() != null) {
            return responseData.getRedirectLocation();
        }
        responseData.setExecutionTime(System.currentTimeMillis() - startTime);
        if (dataMap.containsKey(Constants.SESSION_ID)) {
            responseData.setSessionId((String) dataMap.get(Constants.SESSION_ID));
        } else {
            responseData.setSessionId(paramMap.get(Constants.CRAWLING_INFO_ID));
        }
        final RuleManager ruleManager = SingletonLaContainer.getComponent(RuleManager.class);
        final Rule rule = ruleManager.getRule(responseData);
        if (rule == null) {
            logger.warn("No url rule. Data: " + dataMap);
        } else {
            responseData.setRuleId(rule.getRuleId());
            final ResponseProcessor responseProcessor = rule.getResponseProcessor();
            if (responseProcessor instanceof DefaultResponseProcessor) {
                final Transformer transformer = ((DefaultResponseProcessor) responseProcessor).getTransformer();
                final ResultData resultData = transformer.transform(responseData);
                final byte[] data = resultData.getData();
                if (data != null) {
                    try {
                        @SuppressWarnings("unchecked") final Map<String, Object> responseDataMap = (Map<String, Object>) SerializeUtil.fromBinaryToObject(data);
                        dataMap.putAll(responseDataMap);
                    } catch (final Exception e) {
                        throw new CrawlerSystemException("Could not create an instance from bytes.", e);
                    }
                }
                // Remove fields that must not be stored in the index
                String[] ignoreFields;
                if (paramMap.containsKey("ignore.field.names")) {
                    ignoreFields = paramMap.get("ignore.field.names").split(",");
                } else {
                    ignoreFields = new String[] { Constants.INDEXING_TARGET, Constants.SESSION_ID };
                }
                stream(ignoreFields).of(stream -> stream.map(s -> s.trim()).forEach(s -> dataMap.remove(s)));
                indexUpdateCallback.store(paramMap, dataMap);
            } else {
                logger.warn("The response processor is not DefaultResponseProcessor. responseProcessor: " + responseProcessor + ", Data: " + dataMap);
            }
        }
        return null;
    } catch (final ChildUrlsException e) {
        throw new DataStoreCrawlingException(url, "Redirected to " + e.getChildUrlList().stream().map(r -> r.getUrl()).collect(Collectors.joining(", ")), e);
    } catch (final Exception e) {
        throw new DataStoreCrawlingException(url, "Failed to add: " + dataMap, e);
    }
}
Also used : Constants(org.codelibs.fess.Constants) IndexingHelper(org.codelibs.fess.helper.IndexingHelper) ThreadPoolExecutor(java.util.concurrent.ThreadPoolExecutor) LoggerFactory(org.slf4j.LoggerFactory) SerializeUtil(org.codelibs.core.io.SerializeUtil) DefaultResponseProcessor(org.codelibs.fess.crawler.processor.impl.DefaultResponseProcessor) IndexUpdateCallback(org.codelibs.fess.ds.IndexUpdateCallback) CrawlerSystemException(org.codelibs.fess.crawler.exception.CrawlerSystemException) Transformer(org.codelibs.fess.crawler.transformer.Transformer) ArrayList(java.util.ArrayList) CrawlerClient(org.codelibs.fess.crawler.client.CrawlerClient) FessConfig(org.codelibs.fess.mylasta.direction.FessConfig) Map(java.util.Map) ResponseProcessor(org.codelibs.fess.crawler.processor.ResponseProcessor) ExecutorService(java.util.concurrent.ExecutorService) DataStoreCrawlingException(org.codelibs.fess.exception.DataStoreCrawlingException) StreamUtil.stream(org.codelibs.core.stream.StreamUtil.stream) Logger(org.slf4j.Logger) ResultData(org.codelibs.fess.crawler.entity.ResultData) FessEsClient(org.codelibs.fess.es.client.FessEsClient) RuleManager(org.codelibs.fess.crawler.rule.RuleManager) Rule(org.codelibs.fess.crawler.rule.Rule) LinkedBlockingQueue(java.util.concurrent.LinkedBlockingQueue) Collectors(java.util.stream.Collectors) TimeUnit(java.util.concurrent.TimeUnit) List(java.util.List) ComponentUtil(org.codelibs.fess.util.ComponentUtil) SingletonLaContainer(org.lastaflute.di.core.SingletonLaContainer) ChildUrlsException(org.codelibs.fess.crawler.exception.ChildUrlsException) RequestDataBuilder(org.codelibs.fess.crawler.builder.RequestDataBuilder) CrawlerClientFactory(org.codelibs.fess.crawler.client.CrawlerClientFactory) ResponseData(org.codelibs.fess.crawler.entity.ResponseData)
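
The field-removal step in processRequest reads a comma-separated list from the "ignore.field.names" parameter, falls back to a default list when the parameter is absent, trims each name, and removes it from the data map. The sketch below shows the same logic with only java.util streams in place of StreamUtil.stream. The class IgnoreFieldsSketch, the method removeIgnoredFields, and the sample field names are hypothetical and chosen for illustration only.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper (not part of Fess) mirroring the field-removal step.
public class IgnoreFieldsSketch {

    // Remove every ignored field from dataMap. The names come from the
    // "ignore.field.names" parameter if present, otherwise from defaultFields.
    public static void removeIgnoredFields(final Map<String, String> paramMap,
            final Map<String, Object> dataMap, final String... defaultFields) {
        final String[] ignoreFields;
        if (paramMap.containsKey("ignore.field.names")) {
            ignoreFields = paramMap.get("ignore.field.names").split(",");
        } else {
            ignoreFields = defaultFields;
        }
        // Trim whitespace around each name before removing it.
        Arrays.stream(ignoreFields).map(String::trim).forEach(dataMap::remove);
    }

    public static void main(final String[] args) {
        final Map<String, String> paramMap = new HashMap<>();
        paramMap.put("ignore.field.names", "secret, tmp");

        final Map<String, Object> dataMap = new HashMap<>();
        dataMap.put("title", "Hello");
        dataMap.put("secret", "x");
        dataMap.put("tmp", "y");

        // "placeholder_default" is an assumed default, unused here because
        // the parameter is set.
        removeIgnoredFields(paramMap, dataMap, "placeholder_default");
        System.out.println(dataMap); // prints "{title=Hello}"
    }
}
```

Trimming matters because a value such as "secret, tmp" would otherwise yield the key " tmp", which matches nothing in the map.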

Example 3 with ResultData

Use of org.codelibs.fess.crawler.entity.ResultData in project fess by codelibs.

Class FessXpathTransformerTest, method test_processMetaRobots_noindex.

public void test_processMetaRobots_noindex() throws Exception {
    final String data = "<meta name=\"robots\" content=\"noindex\" /><a href=\"index.html\">aaa</a>";
    final Document document = getDocument(data);
    final FessXpathTransformer transformer = new FessXpathTransformer();
    final ResponseData responseData = new ResponseData();
    responseData.setUrl("http://example.com/");
    responseData.setResponseBody(data.getBytes());
    try {
        transformer.processMetaRobots(responseData, new ResultData(), document);
        fail();
    } catch (ChildUrlsException e) {
        assertTrue(e.getChildUrlList().isEmpty());
    } catch (Exception e) {
        fail();
    }
}
Also used : ChildUrlsException(org.codelibs.fess.crawler.exception.ChildUrlsException) ResultData(org.codelibs.fess.crawler.entity.ResultData) ResponseData(org.codelibs.fess.crawler.entity.ResponseData) Document(org.w3c.dom.Document) ComponentNotFoundException(org.lastaflute.di.core.exception.ComponentNotFoundException)

Example 4 with ResultData

Use of org.codelibs.fess.crawler.entity.ResultData in project fess by codelibs.

Class FessXpathTransformerTest, method test_processMetaRobots_noindexnofollow.

public void test_processMetaRobots_noindexnofollow() throws Exception {
    final String data = "<meta name=\"ROBOTS\" content=\"NOINDEX,NOFOLLOW\" />";
    final Document document = getDocument(data);
    final FessXpathTransformer transformer = new FessXpathTransformer();
    final ResponseData responseData = new ResponseData();
    responseData.setUrl("http://example.com/");
    try {
        transformer.processMetaRobots(responseData, new ResultData(), document);
        fail();
    } catch (ChildUrlsException e) {
        assertTrue(e.getChildUrlList().isEmpty());
    } catch (Exception e) {
        fail();
    }
}
Also used : ChildUrlsException(org.codelibs.fess.crawler.exception.ChildUrlsException) ResultData(org.codelibs.fess.crawler.entity.ResultData) ResponseData(org.codelibs.fess.crawler.entity.ResponseData) Document(org.w3c.dom.Document) ComponentNotFoundException(org.lastaflute.di.core.exception.ComponentNotFoundException)

Example 5 with ResultData

Use of org.codelibs.fess.crawler.entity.ResultData in project fess by codelibs.

Class FessXpathTransformerTest, method test_processMetaRobots_no.

public void test_processMetaRobots_no() throws Exception {
    final String data = "<html><body>foo</body></html>";
    final Document document = getDocument(data);
    final FessXpathTransformer transformer = new FessXpathTransformer();
    final ResponseData responseData = new ResponseData();
    responseData.setUrl("http://example.com/");
    transformer.processMetaRobots(responseData, new ResultData(), document);
    assertFalse(responseData.isNoFollow());
}
Also used : ResultData(org.codelibs.fess.crawler.entity.ResultData) ResponseData(org.codelibs.fess.crawler.entity.ResponseData) Document(org.w3c.dom.Document)

Aggregations

ResultData (org.codelibs.fess.crawler.entity.ResultData): 9
ResponseData (org.codelibs.fess.crawler.entity.ResponseData): 8
ChildUrlsException (org.codelibs.fess.crawler.exception.ChildUrlsException): 5
Document (org.w3c.dom.Document): 5
CrawlerSystemException (org.codelibs.fess.crawler.exception.CrawlerSystemException): 3
ComponentNotFoundException (org.lastaflute.di.core.exception.ComponentNotFoundException): 3
Map (java.util.Map): 2
CrawlerClient (org.codelibs.fess.crawler.client.CrawlerClient): 2
CrawlerClientFactory (org.codelibs.fess.crawler.client.CrawlerClientFactory): 2
CrawlingAccessException (org.codelibs.fess.crawler.exception.CrawlingAccessException): 2
ResponseProcessor (org.codelibs.fess.crawler.processor.ResponseProcessor): 2
DefaultResponseProcessor (org.codelibs.fess.crawler.processor.impl.DefaultResponseProcessor): 2
Rule (org.codelibs.fess.crawler.rule.Rule): 2
RuleManager (org.codelibs.fess.crawler.rule.RuleManager): 2
Transformer (org.codelibs.fess.crawler.transformer.Transformer): 2
IOException (java.io.IOException): 1
ArrayList (java.util.ArrayList): 1
Date (java.util.Date): 1
HashSet (java.util.HashSet): 1
List (java.util.List): 1