Example 6 with HtmlParser

use of org.apache.tika.parser.html.HtmlParser in project acs-aem-commons by Adobe-Consulting-Services.

the class BrokenLinksReport method collectPaths.

/**
 * Collect references from a JCR property.
 * A property can be one of:
 * <ol>
 *     <li>A string containing a reference, e.g., fileReference=/content/dam/image.png.</li>
 *     <li>An array of strings, e.g., fileReference=[/content/dam/image1.png, /content/dam/image2.png]</li>
 *     <li>An HTML fragment containing links, e.g.,
 *     <pre>
 *       &lt;p&gt;
 *         &lt;a href="/content/site/page.html"&gt;hello&lt;/a&gt;
 *         &lt;img src="/content/dam/image1.png"&gt;
 *       &lt;/p&gt;
 *     </pre>
 *     </li>
 * </ol>
 *
 * @param property an entry from a ValueMap
 * @param htmlFields names of the properties that contain HTML
 * @return stream containing extracted references
 */
static Stream<String> collectPaths(Map.Entry<String, Object> property, Set<String> htmlFields) {
    Object p = property.getValue();
    Stream<String> stream;
    if (p.getClass() == String[].class) {
        stream = Arrays.stream((String[]) p);
    } else if (p.getClass() == String.class) {
        stream = Stream.of((String) p);
    } else {
        stream = Stream.empty();
    }
    if (htmlFields.contains(property.getKey())) {
        stream = stream.flatMap(val -> {
            try {
                // parse html and extract links via underlying tagsoup library
                LinkContentHandler linkHandler = new LinkContentHandler();
                HtmlParser parser = new HtmlParser();
                parser.parse(new ByteArrayInputStream(val.getBytes("utf-8")), linkHandler, new Metadata(), new ParseContext());
                return linkHandler.getLinks().stream().map(Link::getUri);
            } catch (Exception e) {
                // treat a value that cannot be parsed as containing no links
                return Stream.empty();
            }
        });
    }
    return stream;
}
Also used : Arrays(java.util.Arrays) ResourceResolver(org.apache.sling.api.resource.ResourceResolver) ResourceUtil(org.apache.sling.api.resource.ResourceUtil) ProcessDefinition(com.adobe.acs.commons.mcp.ProcessDefinition) HashSet(java.util.HashSet) Metadata(org.apache.tika.metadata.Metadata) HtmlParser(org.apache.tika.parser.html.HtmlParser) RepositoryException(javax.jcr.RepositoryException) ByteArrayInputStream(java.io.ByteArrayInputStream) Map(java.util.Map) FormField(com.adobe.acs.commons.mcp.form.FormField) PersistenceException(org.apache.sling.api.resource.PersistenceException) Link(org.apache.tika.sax.Link) PathfieldComponent(com.adobe.acs.commons.mcp.form.PathfieldComponent) EnumMap(java.util.EnumMap) ConcurrentHashMap(java.util.concurrent.ConcurrentHashMap) Resource(org.apache.sling.api.resource.Resource) Set(java.util.Set) ActionManager(com.adobe.acs.commons.fam.ActionManager) Collectors(java.util.stream.Collectors) LinkContentHandler(org.apache.tika.sax.LinkContentHandler) Serializable(java.io.Serializable) LoginException(org.apache.sling.api.resource.LoginException) List(java.util.List) GenericReport(com.adobe.acs.commons.mcp.model.GenericReport) Stream(java.util.stream.Stream) TreeFilteringResourceVisitor(com.adobe.acs.commons.util.visitors.TreeFilteringResourceVisitor) ParseContext(org.apache.tika.parser.ParseContext) CheckboxComponent(com.adobe.acs.commons.mcp.form.CheckboxComponent) Pattern(java.util.regex.Pattern) ProcessInstance(com.adobe.acs.commons.mcp.ProcessInstance)
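
The first half of collectPaths normalizes a property value into a stream of strings before any HTML parsing happens. That step can be tried on its own with nothing but the JDK. The sketch below is an illustration, not the project's code: the class name PropertyValues and the method asStrings are hypothetical, and it uses instanceof instead of the exact-class comparison in the original, so a null value yields an empty stream rather than a NullPointerException.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper: mirrors only the type dispatch of collectPaths.
class PropertyValues {

    // Turns a String or String[] property value into a stream of strings.
    // Any other type (including null) produces an empty stream.
    static Stream<String> asStrings(Object value) {
        if (value instanceof String[]) {
            return Arrays.stream((String[]) value);
        } else if (value instanceof String) {
            return Stream.of((String) value);
        }
        return Stream.empty();
    }

    public static void main(String[] args) {
        List<String> single = asStrings("/content/dam/image.png")
                .collect(Collectors.toList());
        List<String> multi = asStrings(new String[] {"/content/dam/image1.png", "/content/dam/image2.png"})
                .collect(Collectors.toList());
        List<String> other = asStrings(Integer.valueOf(42))
                .collect(Collectors.toList());
        System.out.println(single);
        System.out.println(multi);
        System.out.println(other);
    }
}
```

Only values whose key is listed in htmlFields go on to the Tika parsing step; all other strings are returned as references unchanged.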

Example 7 with HtmlParser

use of org.apache.tika.parser.html.HtmlParser in project data-prep by Talend.

the class HtmlSerializer method deserialize.

private void deserialize(InputStream rawContent, DataSetMetadata dataSetMetadata, OutputStream jsonOutput, long limit) {
    try {
        List<ColumnMetadata> columns = dataSetMetadata.getRowMetadata().getColumns();
        SimpleValuesContentHandler valuesContentHandler = new SimpleValuesContentHandler(columns.size(), limit);
        HtmlParser htmlParser = new HtmlParser();
        Metadata metadata = new Metadata();
        htmlParser.parse(rawContent, valuesContentHandler, metadata, new ParseContext());
        JsonGenerator generator = new JsonFactory().createGenerator(jsonOutput);
        // start the array of records
        generator.writeStartArray();
        for (List<String> values : valuesContentHandler.getValues()) {
            if (values.isEmpty()) {
                // avoid empty record which can fail analysis
                continue;
            }
            generator.writeStartObject();
            int idx = 0;
            for (String value : values) {
                if (idx < columns.size()) {
                    ColumnMetadata columnMetadata = columns.get(idx);
                    generator.writeFieldName(columnMetadata.getId());
                    if (value != null) {
                        generator.writeString(value);
                    } else {
                        generator.writeNull();
                    }
                    idx++;
                }
            }
            generator.writeEndObject();
        }
        // end the array of records
        generator.writeEndArray();
        generator.flush();
    } catch (Exception e) {
        // The consumer may well interrupt consumption of the stream (e.g. when limit(n) is used for sampling).
        // This is not an issue: the consumer is allowed to partially consume results, and it is up to the
        // consumer to ensure the data it consumed is consistent.
        LOGGER.debug("Unable to continue serialization for {}. Skipping remaining content.", dataSetMetadata.getId(), e);
    } finally {
        try {
            jsonOutput.close();
        } catch (IOException e) {
            LOGGER.error("Unable to close output", e);
        }
    }
}
Also used : ColumnMetadata(org.talend.dataprep.api.dataset.ColumnMetadata) Metadata(org.apache.tika.metadata.Metadata) DataSetMetadata(org.talend.dataprep.api.dataset.DataSetMetadata) JsonFactory(com.fasterxml.jackson.core.JsonFactory) TDPException(org.talend.dataprep.exception.TDPException) HtmlParser(org.apache.tika.parser.html.HtmlParser) ParseContext(org.apache.tika.parser.ParseContext) JsonGenerator(com.fasterxml.jackson.core.JsonGenerator)
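
The nested loops in deserialize pair each extracted row with the column ids, skip empty rows, and ignore any cell beyond the last known column. The following JDK-only sketch reproduces that pairing with plain maps instead of a Jackson JsonGenerator. The class RowRecords, the method toRecords, and the sample column ids are assumptions made for illustration and are not part of data-prep.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: mirrors the row-to-record pairing of deserialize.
class RowRecords {

    // Builds one ordered map per non-empty row, keyed by column id.
    static List<Map<String, String>> toRecords(List<String> columnIds, List<List<String>> rows) {
        List<Map<String, String>> records = new ArrayList<>();
        for (List<String> values : rows) {
            if (values.isEmpty()) {
                continue; // skip empty rows, as the original does
            }
            Map<String, String> record = new LinkedHashMap<>();
            // cells beyond the last known column are dropped
            int limit = Math.min(values.size(), columnIds.size());
            for (int idx = 0; idx < limit; idx++) {
                record.put(columnIds.get(idx), values.get(idx)); // a null cell stays null
            }
            records.add(record);
        }
        return records;
    }

    public static void main(String[] args) {
        List<String> ids = Arrays.asList("0000", "0001"); // sample ids, assumed
        List<List<String>> rows = new ArrayList<>();
        rows.add(Arrays.asList("a", "b", "extra"));
        rows.add(new ArrayList<>());
        rows.add(Arrays.asList("c", null));
        System.out.println(toRecords(ids, rows));
    }
}
```

In the original, each record is written to the output as it is produced rather than collected in memory, which is why a consumer can stop reading early without the whole document being serialized.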

Example 8 with HtmlParser

use of org.apache.tika.parser.html.HtmlParser in project data-prep by Talend.

the class HtmlSchemaParser method parse.

/**
 * @see SchemaParser#parse(Request)
 */
@Override
public Schema parse(Request request) {
    try {
        SimpleHeadersContentHandler headersContentHandler = new SimpleHeadersContentHandler();
        InputStream inputStream = request.getContent();
        HtmlParser htmlParser = new HtmlParser();
        Metadata metadata = new Metadata();
        htmlParser.parse(inputStream, headersContentHandler, metadata, new ParseContext());
        List<ColumnMetadata> columns = new ArrayList<>(headersContentHandler.getHeaderValues().size());
        for (String headerValue : headersContentHandler.getHeaderValues()) {
            columns.add(ColumnMetadata.Builder.column()
                    .type(Type.STRING) // ATM not doing any complicated type calculation
                    .name(headerValue)
                    .id(columns.size())
                    .build());
        }
        Schema.SheetContent sheetContent = new Schema.SheetContent();
        sheetContent.setColumnMetadatas(columns);
        return Schema.Builder.parserResult()
                .sheetContents(Collections.singletonList(sheetContent))
                .draft(false)
                .build();
    } catch (Exception e) {
        LOGGER.debug("Exception during parsing html request :" + e.getMessage(), e);
        throw new TDPException(CommonErrorCodes.UNEXPECTED_EXCEPTION, e);
    }
}
Also used : ColumnMetadata(org.talend.dataprep.api.dataset.ColumnMetadata) InputStream(java.io.InputStream) Schema(org.talend.dataprep.schema.Schema) Metadata(org.apache.tika.metadata.Metadata) TDPException(org.talend.dataprep.exception.TDPException) HtmlParser(org.apache.tika.parser.html.HtmlParser) ParseContext(org.apache.tika.parser.ParseContext)

Aggregations

HtmlParser (org.apache.tika.parser.html.HtmlParser): 8
Metadata (org.apache.tika.metadata.Metadata): 7
ParseContext (org.apache.tika.parser.ParseContext): 6
Parser (org.apache.tika.parser.Parser): 5
ByteArrayInputStream (java.io.ByteArrayInputStream): 4
InputStream (java.io.InputStream): 4
BodyContentHandler (org.apache.tika.sax.BodyContentHandler): 3
LinkContentHandler (org.apache.tika.sax.LinkContentHandler): 3
FileInputStream (java.io.FileInputStream): 2
Map (java.util.Map): 2
Set (java.util.Set): 2
GZIPInputStream (java.util.zip.GZIPInputStream): 2
TikaInputStream (org.apache.tika.io.TikaInputStream): 2
AutoDetectParser (org.apache.tika.parser.AutoDetectParser): 2
ColumnMetadata (org.talend.dataprep.api.dataset.ColumnMetadata): 2
TDPException (org.talend.dataprep.exception.TDPException): 2
ContentHandler (org.xml.sax.ContentHandler): 2
ActionManager (com.adobe.acs.commons.fam.ActionManager): 1
ProcessDefinition (com.adobe.acs.commons.mcp.ProcessDefinition): 1
ProcessInstance (com.adobe.acs.commons.mcp.ProcessInstance): 1