Example 16 with CrawlerSystemException

Use of org.codelibs.fess.crawler.exception.CrawlerSystemException in project fess-crawler by codelibs.

Class WebDriverClient, method execute.

@Override
public ResponseData execute(final RequestData request) {
    WebDriver webDriver = null;
    try {
        webDriver = webDriverPool.borrowObject();
        Map<String, String> paramMap = null;
        final String url = request.getUrl();
        final String metaData = request.getMetaData();
        if (StringUtil.isNotBlank(metaData)) {
            paramMap = parseParamMap(metaData);
        }
        if (!url.equals(webDriver.getCurrentUrl())) {
            webDriver.get(url);
        }
        if (logger.isDebugEnabled()) {
            logger.debug("Base URL: " + url + "\nContent: " + webDriver.getPageSource());
        }
        if (paramMap != null) {
            final String processorName = paramMap.get(UrlAction.URL_ACTION);
            final UrlAction urlAction = urlActionMap.get(processorName);
            if (urlAction == null) {
                throw new CrawlerSystemException("Unknown processor: " + processorName);
            }
            urlAction.navigate(webDriver, paramMap);
        }
        final String source = webDriver.getPageSource();
        final ResponseData responseData = new ResponseData();
        responseData.setUrl(webDriver.getCurrentUrl());
        responseData.setMethod(request.getMethod().name());
        final String charSet = getCharSet(webDriver);
        responseData.setCharSet(charSet);
        responseData.setHttpStatusCode(getStatusCode(webDriver));
        responseData.setLastModified(getLastModified(webDriver));
        responseData.setMimeType(getContentType(webDriver));
        final byte[] body = source.getBytes(charSet);
        // Use the encoded byte length so the content length matches the response body.
        responseData.setContentLength(body.length);
        responseData.setResponseBody(body);
        for (final UrlAction urlAction : urlActionMap.values()) {
            urlAction.collect(url, webDriver, responseData);
        }
        return responseData;
    } catch (final Exception e) {
        throw new CrawlerSystemException("Failed to access " + request.getUrl(), e);
    } finally {
        if (webDriver != null) {
            try {
                webDriverPool.returnObject(webDriver);
            } catch (final Exception e) {
                logger.warn("Failed to return a returned object.", e);
            }
        }
    }
}
Also used: WebDriver (org.openqa.selenium.WebDriver), UrlAction (org.codelibs.fess.crawler.client.http.action.UrlAction), CrawlerSystemException (org.codelibs.fess.crawler.exception.CrawlerSystemException), ResponseData (org.codelibs.fess.crawler.entity.ResponseData)
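
The webDriverPool field is not shown in this snippet. Below is a minimal sketch of how such a pool could be wired, assuming an Apache Commons Pool 2 backend and ChromeDriver; both are assumptions, and the names are illustrative rather than fess-crawler's actual configuration.

import org.apache.commons.pool2.BasePooledObjectFactory;
import org.apache.commons.pool2.PooledObject;
import org.apache.commons.pool2.impl.DefaultPooledObject;
import org.apache.commons.pool2.impl.GenericObjectPool;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;

// Hypothetical factory; fess-crawler's real pool setup may differ.
public class WebDriverFactory extends BasePooledObjectFactory<WebDriver> {
    @Override
    public WebDriver create() {
        // Assumes a chromedriver binary is available on the PATH.
        return new ChromeDriver();
    }

    @Override
    public PooledObject<WebDriver> wrap(final WebDriver driver) {
        return new DefaultPooledObject<>(driver);
    }

    @Override
    public void destroyObject(final PooledObject<WebDriver> p) {
        // Quit the browser when the pooled instance is evicted.
        p.getObject().quit();
    }
}

// borrowObject()/returnObject() then match the calls in execute() above:
final GenericObjectPool<WebDriver> webDriverPool = new GenericObjectPool<>(new WebDriverFactory());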

Example 17 with CrawlerSystemException

Use of org.codelibs.fess.crawler.exception.CrawlerSystemException in project fess-crawler by codelibs.

Class ZipExtractor, method getText.

@Override
public ExtractData getText(final InputStream in, final Map<String, String> params) {
    if (in == null) {
        throw new CrawlerSystemException("The inputstream is null.");
    }
    final MimeTypeHelper mimeTypeHelper = getMimeTypeHelper();
    final ExtractorFactory extractorFactory = getExtractorFactory();
    final StringBuilder buf = new StringBuilder(1000);
    try (final ArchiveInputStream ais = archiveStreamFactory.createArchiveInputStream(in.markSupported() ? in : new BufferedInputStream(in))) {
        ZipArchiveEntry entry = null;
        long contentSize = 0;
        while ((entry = (ZipArchiveEntry) ais.getNextEntry()) != null) {
            contentSize += entry.getSize();
            if (maxContentSize != -1 && contentSize > maxContentSize) {
                throw new MaxLengthExceededException("Extracted size is " + contentSize + " > " + maxContentSize);
            }
            final String filename = entry.getName();
            final String mimeType = mimeTypeHelper.getContentType(null, filename);
            if (mimeType != null) {
                final Extractor extractor = extractorFactory.getExtractor(mimeType);
                if (extractor != null) {
                    try {
                        final Map<String, String> map = new HashMap<>();
                        map.put(TikaMetadataKeys.RESOURCE_NAME_KEY, filename);
                        buf.append(extractor.getText(new IgnoreCloseInputStream(ais), map).getContent());
                        buf.append('\n');
                    } catch (final Exception e) {
                        if (logger.isDebugEnabled()) {
                            logger.debug("Exception in an internal extractor.", e);
                        }
                    }
                }
            }
        }
    } catch (final MaxLengthExceededException e) {
        throw e;
    } catch (final Exception e) {
        if (buf.length() == 0) {
            throw new ExtractException("Could not extract the content.", e);
        }
    }
    return new ExtractData(buf.toString().trim());
}
Also used: ExtractException (org.codelibs.fess.crawler.exception.ExtractException), ExtractData (org.codelibs.fess.crawler.entity.ExtractData), MaxLengthExceededException (org.codelibs.fess.crawler.exception.MaxLengthExceededException), HashMap (java.util.HashMap), MimeTypeHelper (org.codelibs.fess.crawler.helper.MimeTypeHelper), ExtractorFactory (org.codelibs.fess.crawler.extractor.ExtractorFactory), CrawlerSystemException (org.codelibs.fess.crawler.exception.CrawlerSystemException), ArchiveInputStream (org.apache.commons.compress.archivers.ArchiveInputStream), BufferedInputStream (java.io.BufferedInputStream), ZipArchiveEntry (org.apache.commons.compress.archivers.zip.ZipArchiveEntry), Extractor (org.codelibs.fess.crawler.extractor.Extractor), IgnoreCloseInputStream (org.codelibs.fess.crawler.util.IgnoreCloseInputStream)
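
A quick usage sketch for the extractor above. The file path and the zipExtractor instance are hypothetical; in fess-crawler the extractor would normally be obtained from the container.

import java.io.FileInputStream;
import java.io.InputStream;
import org.codelibs.fess.crawler.entity.ExtractData;

// Hypothetical call site: extract text from every supported entry in a ZIP file.
try (final InputStream in = new FileInputStream("/tmp/sample.zip")) {
    final ExtractData data = zipExtractor.getText(in, null);
    // getContent() holds the concatenated text of all entries that had a matching extractor.
    System.out.println(data.getContent());
}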

Example 18 with CrawlerSystemException

Use of org.codelibs.fess.crawler.exception.CrawlerSystemException in project fess-crawler by codelibs.

Class AbstractXmlExtractor, method getText.

public ExtractData getText(final InputStream in, final Map<String, String> params) {
    if (in == null) {
        throw new CrawlerSystemException("The input stream is null.");
    }
    try {
        final BufferedInputStream bis = new BufferedInputStream(in);
        final String enc = getEncoding(bis);
        final String content = UNESCAPE_HTML4.translate(new String(InputStreamUtil.getBytes(bis), enc));
        return new ExtractData(extractString(content));
    } catch (final Exception e) {
        throw new ExtractException(e);
    }
}
Also used: ExtractException (org.codelibs.fess.crawler.exception.ExtractException), ExtractData (org.codelibs.fess.crawler.entity.ExtractData), BufferedInputStream (java.io.BufferedInputStream), CrawlerSystemException (org.codelibs.fess.crawler.exception.CrawlerSystemException), IOException (java.io.IOException)
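
UNESCAPE_HTML4 is referenced but not defined in this snippet. Assuming it behaves like the Commons Text translator of the same name (an assumption, since the actual field may be project-defined), its effect looks like this:

import org.apache.commons.text.StringEscapeUtils;

// Illustrative only: UNESCAPE_HTML4 turns HTML 4 entities back into characters.
final String raw = "&lt;title&gt;Caf&eacute;&lt;/title&gt;";
// Prints: <title>Café</title>
System.out.println(StringEscapeUtils.UNESCAPE_HTML4.translate(raw));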

Example 19 with CrawlerSystemException

Use of org.codelibs.fess.crawler.exception.CrawlerSystemException in project fess by codelibs.

Class FileListIndexUpdateCallbackImpl, method processRequest.

protected String processRequest(final Map<String, String> paramMap, final Map<String, Object> dataMap, final String url, final CrawlerClient client) {
    final long startTime = System.currentTimeMillis();
    try (final ResponseData responseData = client.execute(RequestDataBuilder.newRequestData().get().url(url).build())) {
        if (responseData.getRedirectLocation() != null) {
            return responseData.getRedirectLocation();
        }
        responseData.setExecutionTime(System.currentTimeMillis() - startTime);
        if (dataMap.containsKey(Constants.SESSION_ID)) {
            responseData.setSessionId((String) dataMap.get(Constants.SESSION_ID));
        } else {
            responseData.setSessionId(paramMap.get(Constants.CRAWLING_INFO_ID));
        }
        final RuleManager ruleManager = SingletonLaContainer.getComponent(RuleManager.class);
        final Rule rule = ruleManager.getRule(responseData);
        if (rule == null) {
            logger.warn("No url rule. Data: {}", dataMap);
        } else {
            responseData.setRuleId(rule.getRuleId());
            final ResponseProcessor responseProcessor = rule.getResponseProcessor();
            if (responseProcessor instanceof DefaultResponseProcessor) {
                final Transformer transformer = ((DefaultResponseProcessor) responseProcessor).getTransformer();
                final ResultData resultData = transformer.transform(responseData);
                final byte[] data = resultData.getData();
                if (data != null) {
                    try {
                        @SuppressWarnings("unchecked") final Map<String, Object> responseDataMap = (Map<String, Object>) SerializeUtil.fromBinaryToObject(data);
                        dataMap.putAll(responseDataMap);
                    } catch (final Exception e) {
                        throw new CrawlerSystemException("Could not create an instance from bytes.", e);
                    }
                }
                // remove ignored fields from the data map before indexing
                String[] ignoreFields;
                if (paramMap.containsKey("ignore.field.names")) {
                    ignoreFields = paramMap.get("ignore.field.names").split(",");
                } else {
                    ignoreFields = new String[] { Constants.INDEXING_TARGET, Constants.SESSION_ID };
                }
                stream(ignoreFields).of(stream -> stream.map(String::trim).forEach(s -> dataMap.remove(s)));
                indexUpdateCallback.store(paramMap, dataMap);
            } else {
                logger.warn("The response processor is not DefaultResponseProcessor. responseProcessor: {}, Data: {}", responseProcessor, dataMap);
            }
        }
        return null;
    } catch (final ChildUrlsException e) {
        throw new DataStoreCrawlingException(url, "Redirected to " + e.getChildUrlList().stream().map(RequestData::getUrl).collect(Collectors.joining(", ")), e);
    } catch (final Exception e) {
        throw new DataStoreCrawlingException(url, "Failed to add: " + dataMap, e);
    }
}
Also used: Constants (org.codelibs.fess.Constants), IndexingHelper (org.codelibs.fess.helper.IndexingHelper), ThreadPoolExecutor (java.util.concurrent.ThreadPoolExecutor), SearchEngineClient (org.codelibs.fess.es.client.SearchEngineClient), SerializeUtil (org.codelibs.core.io.SerializeUtil), DefaultResponseProcessor (org.codelibs.fess.crawler.processor.impl.DefaultResponseProcessor), Deque (java.util.Deque), CrawlerSystemException (org.codelibs.fess.crawler.exception.CrawlerSystemException), Transformer (org.codelibs.fess.crawler.transformer.Transformer), ArrayList (java.util.ArrayList), CrawlerClient (org.codelibs.fess.crawler.client.CrawlerClient), FessConfig (org.codelibs.fess.mylasta.direction.FessConfig), Map (java.util.Map), ResponseProcessor (org.codelibs.fess.crawler.processor.ResponseProcessor), LinkedList (java.util.LinkedList), ExecutorService (java.util.concurrent.ExecutorService), DataStoreCrawlingException (org.codelibs.fess.exception.DataStoreCrawlingException), QueryBuilders (org.opensearch.index.query.QueryBuilders), StreamUtil.stream (org.codelibs.core.stream.StreamUtil.stream), ResultData (org.codelibs.fess.crawler.entity.ResultData), RuleManager (org.codelibs.fess.crawler.rule.RuleManager), Rule (org.codelibs.fess.crawler.rule.Rule), LinkedBlockingQueue (java.util.concurrent.LinkedBlockingQueue), Collectors (java.util.stream.Collectors), TimeUnit (java.util.concurrent.TimeUnit), List (java.util.List), Logger (org.apache.logging.log4j.Logger), RequestData (org.codelibs.fess.crawler.entity.RequestData), ComponentUtil (org.codelibs.fess.util.ComponentUtil), SingletonLaContainer (org.lastaflute.di.core.SingletonLaContainer), ChildUrlsException (org.codelibs.fess.crawler.exception.ChildUrlsException), LogManager (org.apache.logging.log4j.LogManager), RequestDataBuilder (org.codelibs.fess.crawler.builder.RequestDataBuilder), CrawlerClientFactory (org.codelibs.fess.crawler.client.CrawlerClientFactory), ResponseData (org.codelibs.fess.crawler.entity.ResponseData)
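
The stream(ignoreFields).of(...) call above uses the codelibs StreamUtil helper. For clarity, here is a plain java.util.Arrays equivalent of that single line:

import java.util.Arrays;

// Trim each configured field name, then drop it from the document map.
Arrays.stream(ignoreFields)
        .map(String::trim)
        .forEach(dataMap::remove);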

Example 20 with CrawlerSystemException

Use of org.codelibs.fess.crawler.exception.CrawlerSystemException in project fess-crawler by codelibs.

Class EsClient, method createTransportClient.

protected TransportClient createTransportClient() {
    final Settings settings = Settings.builder().put("cluster.name", StringUtil.isBlank(clusterName) ? "elasticsearch" : clusterName).build();
    final TransportClient transportClient = new PreBuiltTransportClient(settings);
    Arrays.stream(addresses).forEach(address -> {
        final String[] values = address.split(":");
        String hostname;
        int port = 9300;
        if (values.length == 1) {
            hostname = values[0];
        } else if (values.length == 2) {
            hostname = values[0];
            port = Integer.parseInt(values[1]);
        } else {
            throw new CrawlerSystemException("Invalid address: " + address);
        }
        try {
            transportClient.addTransportAddress(new TransportAddress(InetAddress.getByName(hostname), port));
        } catch (final Exception e) {
            throw new CrawlerSystemException("Unknown host: " + address);
        }
        logger.info("Connected to " + hostname + ":" + port);
    });
    return transportClient;
}
Also used: TransportClient (org.elasticsearch.client.transport.TransportClient), PreBuiltTransportClient (org.elasticsearch.transport.client.PreBuiltTransportClient), TransportAddress (org.elasticsearch.common.transport.TransportAddress), CrawlerSystemException (org.codelibs.fess.crawler.exception.CrawlerSystemException), Settings (org.elasticsearch.common.settings.Settings), ElasticsearchException (org.elasticsearch.ElasticsearchException), IndexNotFoundException (org.elasticsearch.index.IndexNotFoundException), EsAccessException (org.codelibs.fess.crawler.exception.EsAccessException), VersionConflictEngineException (org.elasticsearch.index.engine.VersionConflictEngineException)
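
A standalone sketch of the host:port parsing above, with illustrative sample addresses. It is simplified: the original also rejects addresses containing more than one colon.

// Each address is "host" or "host:port"; the port defaults to 9300.
final String[] addresses = { "es1.example.com:9300", "es2.example.com" };
for (final String address : addresses) {
    final String[] values = address.split(":");
    final String hostname = values[0];
    final int port = values.length == 2 ? Integer.parseInt(values[1]) : 9300;
    // Prints es1.example.com:9300 and es2.example.com:9300
    System.out.println(hostname + ":" + port);
}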

Aggregations

CrawlerSystemException (org.codelibs.fess.crawler.exception.CrawlerSystemException): 41
IOException (java.io.IOException): 16
CrawlingAccessException (org.codelibs.fess.crawler.exception.CrawlingAccessException): 13
File (java.io.File): 11
InputStream (java.io.InputStream): 11
UnsupportedEncodingException (java.io.UnsupportedEncodingException): 10
BufferedInputStream (java.io.BufferedInputStream): 9
ExtractException (org.codelibs.fess.crawler.exception.ExtractException): 9
ExtractData (org.codelibs.fess.crawler.entity.ExtractData): 8
ResponseData (org.codelibs.fess.crawler.entity.ResponseData): 8
Map (java.util.Map): 7
MaxLengthExceededException (org.codelibs.fess.crawler.exception.MaxLengthExceededException): 7
MalformedURLException (java.net.MalformedURLException): 6
HashMap (java.util.HashMap): 6
AccessResultDataImpl (org.codelibs.fess.crawler.entity.AccessResultDataImpl): 6
RequestData (org.codelibs.fess.crawler.entity.RequestData): 6
ResultData (org.codelibs.fess.crawler.entity.ResultData): 6
ChildUrlsException (org.codelibs.fess.crawler.exception.ChildUrlsException): 6
HashSet (java.util.HashSet): 5
TransformerException (javax.xml.transform.TransformerException): 5