
Example 1 with XmlDeclaration

Use of org.jsoup.nodes.XmlDeclaration in project jsoup by jhy, in the class XmlTreeBuilderTest, method testParseDeclarationAttributes.

@Test
public void testParseDeclarationAttributes() {
    String xml = "<?xml version='1' encoding='UTF-8' something='else'?><val>One</val>";
    Document doc = Jsoup.parse(xml, "", Parser.xmlParser());
    XmlDeclaration decl = (XmlDeclaration) doc.childNode(0);
    assertEquals("1", decl.attr("version"));
    assertEquals("UTF-8", decl.attr("encoding"));
    assertEquals("else", decl.attr("something"));
    assertEquals("version=\"1\" encoding=\"UTF-8\" something=\"else\"", decl.getWholeDeclaration());
    assertEquals("<?xml version=\"1\" encoding=\"UTF-8\" something=\"else\"?>", decl.outerHtml());
}
Also used: XmlDeclaration (org.jsoup.nodes.XmlDeclaration), Document (org.jsoup.nodes.Document), Test (org.junit.Test)
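
The test above checks that each pseudo-attribute of the XML declaration can be read back by name. As a rough illustration of what is being extracted, here is a minimal standard-library sketch that pulls the same name/value pairs out of a leading declaration with a regular expression. The class DeclarationAttrs and its parse method are hypothetical names invented for this sketch. This is not how jsoup parses declarations, and it ignores edge cases such as entity references or a leading byte order mark.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DeclarationAttrs {
    // name='value' or name="value"; the back-reference \2 requires the closing quote to match the opening one.
    private static final Pattern ATTR =
            Pattern.compile("([A-Za-z_][\\w.-]*)\\s*=\\s*(['\"])(.*?)\\2");

    /** Hypothetical helper: pseudo-attributes of a leading <?xml ...?> declaration, in document order. */
    static Map<String, String> parse(String xml) {
        Map<String, String> attrs = new LinkedHashMap<>();
        if (!xml.startsWith("<?xml")) {
            return attrs; // no declaration at the start of the input
        }
        int end = xml.indexOf("?>");
        if (end < 0) {
            return attrs; // unterminated declaration
        }
        // Skip the five characters of "<?xml" and scan only the declaration body.
        Matcher m = ATTR.matcher(xml.substring(5, end));
        while (m.find()) {
            attrs.put(m.group(1), m.group(3));
        }
        return attrs;
    }

    public static void main(String[] args) {
        String xml = "<?xml version='1' encoding='UTF-8' something='else'?><val>One</val>";
        System.out.println(parse(xml)); // prints {version=1, encoding=UTF-8, something=else}
    }
}
```

A LinkedHashMap keeps the pairs in the order they appear, which mirrors the order asserted by getWholeDeclaration() in the test.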

Example 2 with XmlDeclaration

Use of org.jsoup.nodes.XmlDeclaration in project jsoup by jhy, in the class DataUtil, method parseByteData.

// Reads bytes into a buffer first, then decodes with the appropriate charset. Done this way to support
// switching the charset midstream when a meta http-equiv tag defines the charset.
// TODO: this is getting gnarly and needs a rewrite.
static Document parseByteData(ByteBuffer byteData, String charsetName, String baseUri, Parser parser) {
    String docData;
    Document doc = null;
    // look for BOM - overrides any other header or input
    charsetName = detectCharsetFromBom(byteData, charsetName);
    if (charsetName == null) {
        // determine from meta. safe first parse as UTF-8
        // look for <meta http-equiv="Content-Type" content="text/html;charset=gb2312"> or HTML5 <meta charset="gb2312">
        docData = Charset.forName(defaultCharset).decode(byteData).toString();
        doc = parser.parseInput(docData, baseUri);
        Element meta = doc.select("meta[http-equiv=content-type], meta[charset]").first();
        // if not found, will keep utf-8 as best attempt
        String foundCharset = null;
        if (meta != null) {
            if (meta.hasAttr("http-equiv")) {
                foundCharset = getCharsetFromContentType(meta.attr("content"));
            }
            if (foundCharset == null && meta.hasAttr("charset")) {
                foundCharset = meta.attr("charset");
            }
        }
        // look for <?xml encoding='ISO-8859-1'?>
        if (foundCharset == null && doc.childNodeSize() > 0 && doc.childNode(0) instanceof XmlDeclaration) {
            XmlDeclaration prolog = (XmlDeclaration) doc.childNode(0);
            if (prolog.name().equals("xml")) {
                foundCharset = prolog.attr("encoding");
            }
        }
        foundCharset = validateCharset(foundCharset);
        if (foundCharset != null && !foundCharset.equals(defaultCharset)) {
            // need to re-decode
            foundCharset = foundCharset.trim().replaceAll("[\"']", "");
            charsetName = foundCharset;
            byteData.rewind();
            docData = Charset.forName(foundCharset).decode(byteData).toString();
            doc = null;
        }
    } else {
        // specified by content type header (or by user on file load)
        Validate.notEmpty(charsetName, "Must set charset arg to character set of file to parse. Set to null to attempt to detect from HTML");
        docData = Charset.forName(charsetName).decode(byteData).toString();
    }
    if (doc == null) {
        doc = parser.parseInput(docData, baseUri);
        doc.outputSettings().charset(charsetName);
    }
    return doc;
}
Also used: Element (org.jsoup.nodes.Element), XmlDeclaration (org.jsoup.nodes.XmlDeclaration), Document (org.jsoup.nodes.Document)
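
parseByteData calls detectCharsetFromBom before anything else, but that helper's body is not shown here. The following is a minimal sketch of what byte order mark detection on a ByteBuffer could look like, using only the standard library. The class name BomSniffer and the exact behavior (returning the fallback when no BOM is found, and skipping the UTF-8 BOM by advancing the buffer position) are assumptions made for illustration, not jsoup's actual implementation.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class BomSniffer {
    /**
     * Hypothetical helper: returns the charset name implied by a leading BOM,
     * or the given fallback when the buffer does not start with a known BOM.
     */
    static String detectCharsetFromBom(ByteBuffer data, String fallback) {
        int start = data.position();
        // UTF-32 is checked first because FF FE 00 00 also begins with the UTF-16LE BOM.
        if (startsWith(data, start, 0x00, 0x00, 0xFE, 0xFF)
                || startsWith(data, start, 0xFF, 0xFE, 0x00, 0x00)) {
            return "UTF-32";
        }
        if (startsWith(data, start, 0xEF, 0xBB, 0xBF)) {
            // Java's UTF-8 decoder does not strip the BOM, so skip it here.
            data.position(start + 3);
            return "UTF-8";
        }
        if (startsWith(data, start, 0xFE, 0xFF) || startsWith(data, start, 0xFF, 0xFE)) {
            // The BOM is left in place; Java's "UTF-16" decoder reads it to pick the byte order.
            return "UTF-16";
        }
        return fallback;
    }

    // Absolute reads only, so the buffer's position is not disturbed by the check itself.
    private static boolean startsWith(ByteBuffer data, int start, int... expected) {
        if (data.limit() - start < expected.length) {
            return false;
        }
        for (int i = 0; i < expected.length; i++) {
            if ((data.get(start + i) & 0xFF) != expected[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] bytes = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, '<', 'a', '/', '>'};
        ByteBuffer buf = ByteBuffer.wrap(bytes);
        String charset = detectCharsetFromBom(buf, null);
        System.out.println(charset);                                      // prints UTF-8
        System.out.println(StandardCharsets.UTF_8.decode(buf).toString()); // prints <a/>
    }
}
```

Passing null as the fallback reproduces the branch structure of parseByteData: a null result means no BOM was found and no charset was supplied, so the caller falls back to sniffing the meta tag or the XML prolog.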

Aggregations

Document (org.jsoup.nodes.Document): 2
XmlDeclaration (org.jsoup.nodes.XmlDeclaration): 2
Element (org.jsoup.nodes.Element): 1
Test (org.junit.Test): 1