Example 6 with CrawlerSystemException

use of org.codelibs.fess.crawler.exception.CrawlerSystemException in project fess-crawler by codelibs.

the class XmlTransformer method transform.

/*
 * (non-Javadoc)
 *
 * @see org.codelibs.fess.crawler.transformer.impl.AbstractTransformer#transform(org.codelibs.fess.crawler.entity.ResponseData)
 */
@Override
public ResultData transform(final ResponseData responseData) {
    if (responseData == null || !responseData.hasResponseBody()) {
        throw new CrawlingAccessException("No response body.");
    }
    try (final InputStream is = responseData.getResponseBody()) {
        final DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        for (final Map.Entry<String, Object> entry : attributeMap.entrySet()) {
            factory.setAttribute(entry.getKey(), entry.getValue());
        }
        for (final Map.Entry<String, String> entry : featureMap.entrySet()) {
            factory.setFeature(entry.getKey(), "true".equalsIgnoreCase(entry.getValue()));
        }
        factory.setCoalescing(coalescing);
        factory.setExpandEntityReferences(expandEntityRef);
        factory.setIgnoringComments(ignoringComments);
        factory.setIgnoringElementContentWhitespace(ignoringElementContentWhitespace);
        factory.setNamespaceAware(namespaceAware);
        factory.setValidating(validating);
        factory.setXIncludeAware(includeAware);
        final DocumentBuilder builder = factory.newDocumentBuilder();
        final Document doc = builder.parse(is);
        final StringBuilder buf = new StringBuilder(1000);
        buf.append(getResultDataHeader());
        for (final Map.Entry<String, String> entry : fieldRuleMap.entrySet()) {
            final List<String> nodeStrList = new ArrayList<>();
            try {
                final NodeList nodeList = getNodeList(doc, entry.getValue());
                for (int i = 0; i < nodeList.getLength(); i++) {
                    final Node node = nodeList.item(i);
                    nodeStrList.add(node.getTextContent());
                }
            } catch (final TransformerException e) {
                logger.warn("Could not parse a value of " + entry.getKey() + ":" + entry.getValue(), e);
            }
            if (nodeStrList.size() == 1) {
                buf.append(getResultDataBody(entry.getKey(), nodeStrList.get(0)));
            } else if (nodeStrList.size() > 1) {
                buf.append(getResultDataBody(entry.getKey(), nodeStrList));
            }
        }
        buf.append(getAdditionalData(responseData, doc));
        buf.append(getResultDataFooter());
        final ResultData resultData = new ResultData();
        resultData.setTransformerName(getName());
        final String data = buf.toString().trim();
        try {
            resultData.setData(data.getBytes(charsetName));
        } catch (final UnsupportedEncodingException e) {
            if (logger.isInfoEnabled()) {
                logger.info("Invalid charsetName: " + charsetName + ". Changed to " + Constants.UTF_8, e);
            }
            charsetName = Constants.UTF_8_CHARSET.name();
            resultData.setData(data.getBytes(Constants.UTF_8_CHARSET));
        }
        resultData.setEncoding(charsetName);
        return resultData;
    } catch (final CrawlerSystemException e) {
        throw e;
    } catch (final Exception e) {
        throw new CrawlerSystemException("Could not store data.", e);
    }
}
Also used : DocumentBuilderFactory(javax.xml.parsers.DocumentBuilderFactory) CrawlingAccessException(org.codelibs.fess.crawler.exception.CrawlingAccessException) InputStream(java.io.InputStream) NodeList(org.w3c.dom.NodeList) Node(org.w3c.dom.Node) ArrayList(java.util.ArrayList) UnsupportedEncodingException(java.io.UnsupportedEncodingException) Document(org.w3c.dom.Document) TransformerException(javax.xml.transform.TransformerException) CrawlerSystemException(org.codelibs.fess.crawler.exception.CrawlerSystemException) AccessResultData(org.codelibs.fess.crawler.entity.AccessResultData) ResultData(org.codelibs.fess.crawler.entity.ResultData) DocumentBuilder(javax.xml.parsers.DocumentBuilder) HashMap(java.util.HashMap) LinkedHashMap(java.util.LinkedHashMap) Map(java.util.Map)
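
The two catch clauses at the end of transform rethrow a CrawlerSystemException unchanged and wrap every other exception in a new one, so callers only ever see one exception type and it is never double-wrapped. A minimal, self-contained sketch of that pattern follows. SketchSystemException, WrapOrRethrowSketch and readBody are hypothetical names used only for illustration and are not part of fess-crawler.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for an unchecked domain exception such as CrawlerSystemException.
class SketchSystemException extends RuntimeException {
    SketchSystemException(final String message) {
        super(message);
    }

    SketchSystemException(final String message, final Throwable cause) {
        super(message, cause);
    }
}

class WrapOrRethrowSketch {
    // Reads the whole stream as UTF-8 text (InputStream.readAllBytes needs Java 9+).
    static String readBody(final InputStream in) {
        try (InputStream is = in) {
            final String text = new String(is.readAllBytes(), StandardCharsets.UTF_8);
            if (text.isEmpty()) {
                // Raised directly as the domain type.
                throw new SketchSystemException("Empty body.");
            }
            return text;
        } catch (final SketchSystemException e) {
            // Already the domain type: pass it through unchanged.
            throw e;
        } catch (final Exception e) {
            // Anything else (for example an IOException) is wrapped with its cause preserved.
            throw new SketchSystemException("Could not read data.", e);
        }
    }

    public static void main(final String[] args) {
        final byte[] bytes = "abc".getBytes(StandardCharsets.UTF_8);
        System.out.println(readBody(new ByteArrayInputStream(bytes)));
    }
}
```

Without the first catch clause, the "Empty body." exception would be caught by the generic clause and wrapped a second time, hiding the original message one level down in the cause chain.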

Example 7 with CrawlerSystemException

use of org.codelibs.fess.crawler.exception.CrawlerSystemException in project fess-crawler by codelibs.

the class XpathTransformer method getData.

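/**
 * Returns the stored data. If no dataClass is set, the superclass result is returned;
 * if dataClass is Map, the parsed data map is returned; otherwise the map is copied
 * into a new instance of dataClass.
 *
 * @param accessResultData the stored access result to read from
 * @return the data in the form described above
 */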
@Override
public Object getData(final AccessResultData<?> accessResultData) {
    if (dataClass == null) {
        return super.getData(accessResultData);
    }
    final Map<String, Object> dataMap = XmlUtil.getDataMap(accessResultData);
    if (Map.class.equals(dataClass)) {
        return dataMap;
    }
    try {
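        // Class#newInstance() is deprecated since Java 9; go through the no-arg constructor instead.
        final Object obj = dataClass.getDeclaredConstructor().newInstance();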
        BeanUtil.copyMapToBean(dataMap, obj);
        return obj;
    } catch (final Exception e) {
        throw new CrawlerSystemException("Could not create/copy a data map to " + dataClass, e);
    }
}
Also used : CrawlerSystemException(org.codelibs.fess.crawler.exception.CrawlerSystemException) XObject(org.apache.xpath.objects.XObject) CrawlingAccessException(org.codelibs.fess.crawler.exception.CrawlingAccessException) TransformerException(javax.xml.transform.TransformerException) UnsupportedEncodingException(java.io.UnsupportedEncodingException)
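
getData creates an instance of dataClass reflectively and copies the map entries onto it. The sketch below illustrates that idea using only the JDK: it is not the implementation of BeanUtil.copyMapToBean, whose conversion rules are not shown here. MapToBeanSketch, toBean and the Doc bean are hypothetical, and the sketch assumes map values already have the setter's parameter type.

```java
import java.beans.Introspector;
import java.beans.PropertyDescriptor;
import java.lang.reflect.Method;
import java.util.HashMap;
import java.util.Map;

class MapToBeanSketch {
    // Hypothetical target bean with a single String property.
    public static class Doc {
        private String title;

        public String getTitle() {
            return title;
        }

        public void setTitle(final String title) {
            this.title = title;
        }
    }

    // Instantiates dataClass via its no-arg constructor and calls each setter
    // whose property name appears as a key in dataMap. No type conversion is done.
    static <T> T toBean(final Map<String, Object> dataMap, final Class<T> dataClass) {
        try {
            final T bean = dataClass.getDeclaredConstructor().newInstance();
            final PropertyDescriptor[] props =
                    Introspector.getBeanInfo(dataClass, Object.class).getPropertyDescriptors();
            for (final PropertyDescriptor pd : props) {
                final Method setter = pd.getWriteMethod();
                if (setter != null && dataMap.containsKey(pd.getName())) {
                    setter.invoke(bean, dataMap.get(pd.getName()));
                }
            }
            return bean;
        } catch (final Exception e) {
            // The original wraps failures in CrawlerSystemException; a JDK type is used here.
            throw new IllegalStateException("Could not create/copy a data map to " + dataClass, e);
        }
    }

    public static void main(final String[] args) {
        final Map<String, Object> dataMap = new HashMap<>();
        dataMap.put("title", "Example title");
        final Doc doc = toBean(dataMap, Doc.class);
        System.out.println(doc.getTitle());
    }
}
```

A class without an accessible no-arg constructor, or a value of the wrong type for a setter, ends up in the catch block, which mirrors how the original reports both creation and copy failures with one message.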

Example 8 with CrawlerSystemException

use of org.codelibs.fess.crawler.exception.CrawlerSystemException in project fess-crawler by codelibs.

the class HtmlTransformerTest method test_getData_wrongName.

public void test_getData_wrongName() throws Exception {
    final String value = "<html><body>hoge</body></html>";
    final AccessResultDataImpl accessResultDataImpl = new AccessResultDataImpl();
    // Encode with the same charset that is declared via setEncoding below.
    accessResultDataImpl.setData(value.getBytes(Constants.UTF_8));
    accessResultDataImpl.setEncoding(Constants.UTF_8);
    accessResultDataImpl.setTransformerName("transformer");
    try {
        htmlTransformer.getData(accessResultDataImpl);
        fail();
    } catch (final CrawlerSystemException e) {
        // expected: the transformer name does not match
    }
}
Also used : CrawlerSystemException(org.codelibs.fess.crawler.exception.CrawlerSystemException) AccessResultDataImpl(org.codelibs.fess.crawler.entity.AccessResultDataImpl)

Example 9 with CrawlerSystemException

use of org.codelibs.fess.crawler.exception.CrawlerSystemException in project fess-crawler by codelibs.

the class TextTransformerTest method test_getData_wrongName.

public void test_getData_wrongName() throws Exception {
    final AccessResultDataImpl accessResultData = new AccessResultDataImpl();
    accessResultData.setTransformerName("transformer");
    accessResultData.setData("xyz".getBytes());
    try {
        textTransformer.getData(accessResultData);
        fail();
    } catch (final CrawlerSystemException e) {
        // expected: the transformer name does not match
    }
}
Also used : CrawlerSystemException(org.codelibs.fess.crawler.exception.CrawlerSystemException) AccessResultDataImpl(org.codelibs.fess.crawler.entity.AccessResultDataImpl)

Example 10 with CrawlerSystemException

use of org.codelibs.fess.crawler.exception.CrawlerSystemException in project fess-crawler by codelibs.

the class XmlTransformerTest method test_getData_wrongName.

public void test_getData_wrongName() throws Exception {
    final String value = "<?xml version=\"1.0\"?>\n"
            + "<doc>\n"
            + "<field name=\"title\">タイトル</field>\n"
            + "<field name=\"body\">第一章 第一節 ほげほげふがふが LINK 第2章 第2節</field>\n"
            + "</doc>";
    final AccessResultDataImpl accessResultDataImpl = new AccessResultDataImpl();
    accessResultDataImpl.setData(value.getBytes(Constants.UTF_8));
    accessResultDataImpl.setEncoding(Constants.UTF_8);
    accessResultDataImpl.setTransformerName("transformer");
    try {
        xmlTransformer.getData(accessResultDataImpl);
        fail();
    } catch (final CrawlerSystemException e) {
        // expected: the transformer name does not match
    }
}
Also used : CrawlerSystemException(org.codelibs.fess.crawler.exception.CrawlerSystemException) AccessResultDataImpl(org.codelibs.fess.crawler.entity.AccessResultDataImpl)
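
The three test_getData_wrongName methods share one pattern: call the method, call fail() if it returns normally, and catch the expected exception type. A minimal framework-free sketch of that pattern as a reusable helper is shown below. ExpectedExceptionSketch, ThrowingCall and expectThrows are hypothetical names. JUnit 4.13 and later and JUnit 5 ship an assertThrows method that serves the same purpose.

```java
class ExpectedExceptionSketch {
    // A call that may throw a checked exception.
    interface ThrowingCall {
        void run() throws Exception;
    }

    // Runs the call and returns the thrown exception if it has the expected type.
    // Fails with AssertionError if nothing is thrown or the type does not match.
    static <T extends Throwable> T expectThrows(final Class<T> expected, final ThrowingCall call) {
        try {
            call.run();
        } catch (final Throwable t) {
            if (expected.isInstance(t)) {
                return expected.cast(t);
            }
            throw new AssertionError("Unexpected exception type: " + t.getClass().getName(), t);
        }
        throw new AssertionError("Expected " + expected.getName() + " but nothing was thrown");
    }

    public static void main(final String[] args) {
        // Stand-in for a getData call that rejects a mismatched transformer name.
        final IllegalStateException e = expectThrows(IllegalStateException.class, () -> {
            throw new IllegalStateException("wrong transformer name");
        });
        System.out.println(e.getMessage());
    }
}
```

Returning the caught exception lets a test go on to check its message or cause, which an empty catch block cannot do.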

Aggregations

CrawlerSystemException (org.codelibs.fess.crawler.exception.CrawlerSystemException) 41
IOException (java.io.IOException) 16
CrawlingAccessException (org.codelibs.fess.crawler.exception.CrawlingAccessException) 13
File (java.io.File) 11
InputStream (java.io.InputStream) 11
UnsupportedEncodingException (java.io.UnsupportedEncodingException) 10
BufferedInputStream (java.io.BufferedInputStream) 9
ExtractException (org.codelibs.fess.crawler.exception.ExtractException) 9
ExtractData (org.codelibs.fess.crawler.entity.ExtractData) 8
ResponseData (org.codelibs.fess.crawler.entity.ResponseData) 8
Map (java.util.Map) 7
MaxLengthExceededException (org.codelibs.fess.crawler.exception.MaxLengthExceededException) 7
MalformedURLException (java.net.MalformedURLException) 6
HashMap (java.util.HashMap) 6
AccessResultDataImpl (org.codelibs.fess.crawler.entity.AccessResultDataImpl) 6
RequestData (org.codelibs.fess.crawler.entity.RequestData) 6
ResultData (org.codelibs.fess.crawler.entity.ResultData) 6
ChildUrlsException (org.codelibs.fess.crawler.exception.ChildUrlsException) 6
HashSet (java.util.HashSet) 5
TransformerException (javax.xml.transform.TransformerException) 5