Example 1 with DomSerializer

Use of org.htmlcleaner.DomSerializer in the project webmagic by code4craft.

Class Xpath2Selector, method selectList.

@Override
public List<String> selectList(String text) {
    List<String> results = new ArrayList<String>();
    try {
        // Clean the (possibly malformed) HTML and convert it to a W3C DOM document.
        HtmlCleaner htmlCleaner = new HtmlCleaner();
        TagNode tagNode = htmlCleaner.clean(text);
        Document document = new DomSerializer(new CleanerProperties()).createDOM(tagNode);
        Object result;
        try {
            // Try a node-set result first.
            result = xPathExpression.evaluate(document, XPathConstants.NODESET);
        } catch (XPathExpressionException e) {
            // Fall back to a string result when the node-set evaluation fails.
            result = xPathExpression.evaluate(document, XPathConstants.STRING);
        }
        if (result instanceof NodeList) {
            NodeList nodeList = (NodeList) result;
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            StreamResult xmlOutput = new StreamResult();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            for (int i = 0; i < nodeList.getLength(); i++) {
                Node item = nodeList.item(i);
                if (item.getNodeType() == Node.ATTRIBUTE_NODE || item.getNodeType() == Node.TEXT_NODE) {
                    // Attribute and text nodes contribute their text value.
                    results.add(item.getTextContent());
                } else {
                    // Other nodes are serialized as markup, using a fresh writer per node.
                    xmlOutput.setWriter(new StringWriter());
                    transformer.transform(new DOMSource(item), xmlOutput);
                    results.add(xmlOutput.getWriter().toString());
                }
            }
        } else {
            results.add(result.toString());
        }
    } catch (Exception e) {
        logger.error("select text error! " + xpathStr, e);
    }
    return results;
}
Also used: DOMSource(javax.xml.transform.dom.DOMSource) Transformer(javax.xml.transform.Transformer) StreamResult(javax.xml.transform.stream.StreamResult) XPathExpressionException(javax.xml.xpath.XPathExpressionException) NodeList(org.w3c.dom.NodeList) TagNode(org.htmlcleaner.TagNode) Node(org.w3c.dom.Node) ArrayList(java.util.ArrayList) Document(org.w3c.dom.Document) HtmlCleaner(org.htmlcleaner.HtmlCleaner) StringWriter(java.io.StringWriter) DomSerializer(org.htmlcleaner.DomSerializer) CleanerProperties(org.htmlcleaner.CleanerProperties)
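
The node-list handling in selectList can be tried in isolation with only the JDK. The following is a minimal sketch, not the webmagic implementation: it assumes well-formed XML input parsed with DocumentBuilderFactory instead of HtmlCleaner and DomSerializer, and it uses the JDK's default XPath 1.0 factory. The class name NodeListToStringsSketch and its selectList helper are hypothetical names introduced for illustration.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical sketch class; assumes well-formed XML rather than cleaned HTML.
class NodeListToStringsSketch {

    // Evaluates the XPath as a node-set and returns one string per matched node.
    public static List<String> selectList(String xml, String xpath) {
        List<String> results = new ArrayList<>();
        try {
            Document document = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(xpath, document, XPathConstants.NODESET);

            // Identity transformer that omits the <?xml ...?> declaration.
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

            for (int i = 0; i < nodes.getLength(); i++) {
                Node item = nodes.item(i);
                if (item.getNodeType() == Node.ATTRIBUTE_NODE
                        || item.getNodeType() == Node.TEXT_NODE) {
                    // Attribute and text nodes: take the text value directly.
                    results.add(item.getTextContent());
                } else {
                    // Other nodes: serialize the node as markup.
                    StringWriter writer = new StringWriter();
                    transformer.transform(new DOMSource(item), new StreamResult(writer));
                    results.add(writer.toString());
                }
            }
        } catch (Exception e) {
            // Unlike the original, this sketch rethrows instead of logging.
            throw new IllegalStateException("XPath selection failed: " + xpath, e);
        }
        return results;
    }

    public static void main(String[] args) {
        String xml = "<r><a href=\"/a\">first</a><a href=\"/b\">second</a></r>";
        System.out.println(selectList(xml, "//a/@href")); // attribute values
        System.out.println(selectList(xml, "//a"));       // serialized elements
    }
}
```

Attribute and text nodes are returned through getTextContent() because they are not serialized as standalone markup here; element nodes go through the transformer so that their tags and children are preserved.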

Example 2 with DomSerializer

Use of org.htmlcleaner.DomSerializer in the project webmagic by code4craft.

Class Xpath2Selector, method select.

@Override
public String select(String text) {
    try {
        // Clean the (possibly malformed) HTML and convert it to a W3C DOM document.
        HtmlCleaner htmlCleaner = new HtmlCleaner();
        TagNode tagNode = htmlCleaner.clean(text);
        Document document = new DomSerializer(new CleanerProperties()).createDOM(tagNode);
        Object result;
        try {
            // Try a node-set result first.
            result = xPathExpression.evaluate(document, XPathConstants.NODESET);
        } catch (XPathExpressionException e) {
            // Fall back to a string result when the node-set evaluation fails.
            result = xPathExpression.evaluate(document, XPathConstants.STRING);
        }
        if (result instanceof NodeList) {
            NodeList nodeList = (NodeList) result;
            if (nodeList.getLength() == 0) {
                return null;
            }
            // Only the first matched node is returned.
            Node item = nodeList.item(0);
            if (item.getNodeType() == Node.ATTRIBUTE_NODE || item.getNodeType() == Node.TEXT_NODE) {
                return item.getTextContent();
            } else {
                // Serialize the node as markup without the XML declaration.
                StreamResult xmlOutput = new StreamResult(new StringWriter());
                Transformer transformer = TransformerFactory.newInstance().newTransformer();
                transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
                transformer.transform(new DOMSource(item), xmlOutput);
                return xmlOutput.getWriter().toString();
            }
        }
        return result.toString();
    } catch (Exception e) {
        logger.error("select text error! " + xpathStr, e);
    }
    return null;
}
Also used: DOMSource(javax.xml.transform.dom.DOMSource) Transformer(javax.xml.transform.Transformer) StreamResult(javax.xml.transform.stream.StreamResult) XPathExpressionException(javax.xml.xpath.XPathExpressionException) NodeList(org.w3c.dom.NodeList) TagNode(org.htmlcleaner.TagNode) Node(org.w3c.dom.Node) Document(org.w3c.dom.Document) HtmlCleaner(org.htmlcleaner.HtmlCleaner) StringWriter(java.io.StringWriter) DomSerializer(org.htmlcleaner.DomSerializer) CleanerProperties(org.htmlcleaner.CleanerProperties)
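
The NODESET-then-STRING fallback used in select can be illustrated with the JDK alone. This is a rough sketch under stated assumptions rather than the project's code: the input is well-formed XML parsed with DocumentBuilderFactory (no HtmlCleaner), the JDK's default XPath 1.0 factory is used, and for brevity the first match is returned as text content instead of serialized markup. XPathFallbackSketch and its select helper are hypothetical names.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical sketch class; assumes well-formed XML rather than cleaned HTML.
class XPathFallbackSketch {

    // Returns the first match as text, or null when a node-set query matches nothing.
    public static String select(String xml, String xpath) {
        try {
            Document document = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            XPathExpression expression =
                    XPathFactory.newInstance().newXPath().compile(xpath);

            Object result;
            try {
                // Works for expressions that select nodes, such as /page/p.
                result = expression.evaluate(document, XPathConstants.NODESET);
            } catch (XPathExpressionException e) {
                // Expressions such as string(...) do not yield a node-set,
                // so re-evaluate and ask for a string instead.
                result = expression.evaluate(document, XPathConstants.STRING);
            }

            if (result instanceof NodeList) {
                NodeList nodes = (NodeList) result;
                // Simplification: text content only, no markup serialization.
                return nodes.getLength() == 0 ? null : nodes.item(0).getTextContent();
            }
            return result.toString();
        } catch (Exception e) {
            // Unlike the original, this sketch rethrows instead of logging.
            throw new IllegalStateException("XPath selection failed: " + xpath, e);
        }
    }

    public static void main(String[] args) {
        String xml = "<page><title>Hello</title><p>one</p><p>two</p></page>";
        System.out.println(select(xml, "/page/p"));             // node-set path
        System.out.println(select(xml, "string(/page/title)")); // string fallback path
    }
}
```

Evaluating twice keeps the caller free from declaring the expected result type up front, at the cost of a second evaluation whenever the expression does not produce nodes.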

Aggregations

StringWriter (java.io.StringWriter): 2
Transformer (javax.xml.transform.Transformer): 2
DOMSource (javax.xml.transform.dom.DOMSource): 2
StreamResult (javax.xml.transform.stream.StreamResult): 2
XPathExpressionException (javax.xml.xpath.XPathExpressionException): 2
CleanerProperties (org.htmlcleaner.CleanerProperties): 2
DomSerializer (org.htmlcleaner.DomSerializer): 2
HtmlCleaner (org.htmlcleaner.HtmlCleaner): 2
TagNode (org.htmlcleaner.TagNode): 2
Document (org.w3c.dom.Document): 2
Node (org.w3c.dom.Node): 2
NodeList (org.w3c.dom.NodeList): 2
ArrayList (java.util.ArrayList): 1