Example 1 with DOMFragmentParser

Use of org.cyberneko.html.parsers.DOMFragmentParser in project translationstudio8 by heartsome.

Class MessageParser, method htmlToText.

/**
 * Strips the tags from HTML-formatted text.
 *
 * @param html
 *            a string in HTML format
 * @return String
 *            the text with the HTML tags removed; the empty string "" if html is null
 */
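private String htmlToText(String html) {
    if (html == null) {
        return "";
    }
    DOMFragmentParser parser = new DOMFragmentParser();
    // The fragment needs an owner document; CoreDocumentImpl is the Xerces implementation.
    CoreDocumentImpl codeDoc = new CoreDocumentImpl();
    // Note: getBytes() encodes with the platform default charset, while the parser is told
    // to decode with textCharset below; non-ASCII text can be garbled if the two differ.
    InputSource inSource = new InputSource(new ByteArrayInputStream(html.getBytes()));
    inSource.setEncoding(textCharset);
    DocumentFragment doc = codeDoc.createDocumentFragment();
    try {
        parser.parse(inSource, doc);
    } catch (Exception e) {
        // Any parse failure is treated the same as empty input.
        return "";
    }
    // textBuffer is a field of MessageParser; processNode (not shown here) fills it.
    textBuffer = new StringBuffer();
    processNode(doc);
    return textBuffer.toString();
}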
Also used: InputSource (org.xml.sax.InputSource), ByteArrayInputStream (java.io.ByteArrayInputStream), CoreDocumentImpl (org.apache.xerces.dom.CoreDocumentImpl), DOMFragmentParser (org.cyberneko.html.parsers.DOMFragmentParser), DocumentFragment (org.w3c.dom.DocumentFragment), MessagingException (javax.mail.MessagingException), IOException (java.io.IOException), FileNotFoundException (java.io.FileNotFoundException)
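
The processNode helper that htmlToText calls is not shown in this listing. A minimal sketch of what such a step might look like, assuming it simply walks the parsed fragment and concatenates the text nodes, is given below. The class name TextCollector and the methods collectText and sampleFragment are hypothetical and are not taken from translationstudio8. To keep the sketch runnable with the JDK alone, the fragment is built by hand instead of being produced by DOMFragmentParser.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.DocumentFragment;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

// Hypothetical stand-in for MessageParser.processNode; not the project's actual code.
public class TextCollector {

    /** Recursively appends the value of every text or CDATA node under the given node. */
    public static void collectText(Node node, StringBuilder out) {
        short type = node.getNodeType();
        if (type == Node.TEXT_NODE || type == Node.CDATA_SECTION_NODE) {
            out.append(node.getNodeValue());
        }
        // Elements and fragments contribute nothing themselves; only their children do.
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            collectText(child, out);
        }
    }

    /** Builds the fragment for <p>Hello, <b>world</b></p> by hand (assumed sample input). */
    public static DocumentFragment sampleFragment() throws ParserConfigurationException {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        DocumentFragment fragment = doc.createDocumentFragment();
        Element p = doc.createElement("p");
        p.appendChild(doc.createTextNode("Hello, "));
        Element b = doc.createElement("b");
        b.appendChild(doc.createTextNode("world"));
        p.appendChild(b);
        fragment.appendChild(p);
        return fragment;
    }

    public static void main(String[] args) throws Exception {
        StringBuilder out = new StringBuilder();
        collectText(sampleFragment(), out);
        System.out.println(out); // prints "Hello, world"
    }
}
```

Passing the StringBuilder as a parameter, rather than writing to a shared field as htmlToText does with textBuffer, keeps the helper free of instance state. That is a design choice of this sketch only.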

Aggregations

ByteArrayInputStream (java.io.ByteArrayInputStream): 1
FileNotFoundException (java.io.FileNotFoundException): 1
IOException (java.io.IOException): 1
MessagingException (javax.mail.MessagingException): 1
CoreDocumentImpl (org.apache.xerces.dom.CoreDocumentImpl): 1
DOMFragmentParser (org.cyberneko.html.parsers.DOMFragmentParser): 1
DocumentFragment (org.w3c.dom.DocumentFragment): 1
InputSource (org.xml.sax.InputSource): 1