Example 16 with Locator

Use of org.xml.sax.Locator in project tika by apache.

Class HtmlParserTest, method testLocator.

/**
 * Test case for TIKA-820: Locator is unset for HTML parser
 *
 * @see <a href="https://issues.apache.org/jira/browse/TIKA-820">TIKA-820</a>
 */
@Test
public void testLocator() throws Exception {
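    // Indices into textPosition: slot 0 holds the line, slot 1 the column.
    final int line = 0;
    final int col = 1;
    // An array so the anonymous handler below can write the position it observes.
    final int[] textPosition = new int[2];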
    new HtmlParser().parse(HtmlParserTest.class.getResourceAsStream("/test-documents/testHTML.html"), new ContentHandler() {

        Locator locator;

        public void setDocumentLocator(Locator locator) {
            this.locator = locator;
        }

        public void startDocument() throws SAXException {
        }

        public void endDocument() throws SAXException {
        }

        public void startPrefixMapping(String prefix, String uri) throws SAXException {
        }

        public void endPrefixMapping(String prefix) throws SAXException {
        }

        public void startElement(String uri, String localName, String qName, Attributes atts) throws SAXException {
        }

        public void endElement(String uri, String localName, String qName) throws SAXException {
        }

        public void characters(char[] ch, int start, int length) throws SAXException {
            String text = new String(ch, start, length);
            if (text.equals("Test Indexation Html") && locator != null) {
                textPosition[line] = locator.getLineNumber();
                textPosition[col] = locator.getColumnNumber();
            }
        }

        public void ignorableWhitespace(char[] ch, int start, int length) throws SAXException {
        }

        public void processingInstruction(String target, String data) throws SAXException {
        }

        public void skippedEntity(String name) throws SAXException {
        }
    }, new Metadata(), new ParseContext());
    // The text occurs at line 24 (if lines start at 0) or 25 (if lines start at 1).
    assertEquals(24, textPosition[line]);
    // The column reported seems fuzzy, just test it is close enough.
    assertTrue(Math.abs(textPosition[col] - 47) < 10);
}
Also used: Locator (org.xml.sax.Locator), Attributes (org.xml.sax.Attributes), Metadata (org.apache.tika.metadata.Metadata), ParseContext (org.apache.tika.parser.ParseContext), LinkContentHandler (org.apache.tika.sax.LinkContentHandler), TeeContentHandler (org.apache.tika.sax.TeeContentHandler), BodyContentHandler (org.apache.tika.sax.BodyContentHandler), ContentHandler (org.xml.sax.ContentHandler), SAXException (org.xml.sax.SAXException), Test (org.junit.Test), TikaTest (org.apache.tika.TikaTest)
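
The same Locator pattern can be tried without Tika. The following is a minimal sketch, assuming the JDK's built-in SAX parser in place of HtmlParser and a DefaultHandler in place of the hand-written ContentHandler. The names LocatorSketch and findTextPosition are hypothetical and do not come from the Tika sources. It also assumes the searched text arrives in a single characters() call, which SAX does not guarantee for longer inputs.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.helpers.DefaultHandler;

class LocatorSketch {

    /**
     * Parses xml and returns {line, column} as reported by the Locator when
     * a characters() event containing needle is delivered, or {-1, -1} if
     * the text is never seen. (Hypothetical helper for illustration.)
     */
    static int[] findTextPosition(String xml, String needle) throws Exception {
        final int[] position = {-1, -1};

        DefaultHandler handler = new DefaultHandler() {
            private Locator locator;

            // The parser may supply a Locator before startDocument();
            // a parser is not required to call this, so it can stay null.
            @Override
            public void setDocumentLocator(Locator locator) {
                this.locator = locator;
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                String text = new String(ch, start, length);
                if (locator != null && text.contains(needle)) {
                    position[0] = locator.getLineNumber();
                    position[1] = locator.getColumnNumber();
                }
            }
        };

        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return position;
    }

    public static void main(String[] args) throws Exception {
        int[] pos = findTextPosition("<a>\n<b>hello</b>\n</a>", "hello");
        System.out.println("line=" + pos[0] + ", column=" + pos[1]);
    }
}
```

DefaultHandler supplies empty implementations of the remaining callbacks, so only the two methods of interest need overriding. As in the test above, the column is best treated as approximate: the SAX documentation describes the reported position as the end of the text associated with the event, and parsers differ in how precisely they track it.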

Aggregations

Locator (org.xml.sax.Locator) 16
Element (org.w3c.dom.Element) 7
Node (org.w3c.dom.Node) 7
EntityReference (org.w3c.dom.EntityReference) 4
NamedNodeMap (org.w3c.dom.NamedNodeMap) 4
ProcessingInstruction (org.w3c.dom.ProcessingInstruction) 4
Attributes (org.xml.sax.Attributes) 4
LexicalHandler (org.xml.sax.ext.LexicalHandler) 4
Stack (java.util.Stack) 3
DocumentBuilder (javax.xml.parsers.DocumentBuilder) 3
DocumentBuilderFactory (javax.xml.parsers.DocumentBuilderFactory) 3
SAXParser (javax.xml.parsers.SAXParser) 3
SAXParserFactory (javax.xml.parsers.SAXParserFactory) 3
Document (org.w3c.dom.Document) 3
DefaultHandler (org.xml.sax.helpers.DefaultHandler) 3
StringReader (java.io.StringReader) 2
TransformerConfigurationException (javax.xml.transform.TransformerConfigurationException) 2
TransformerException (javax.xml.transform.TransformerException) 2
ElemExtensionCall (org.apache.xalan.templates.ElemExtensionCall) 2
ElemLiteralResult (org.apache.xalan.templates.ElemLiteralResult) 2