Example 1 with ParseException

Use of org.apache.nutch.parse.ParseException in the Apache Nutch project.

Class ZipTextExtractor, method extractText.

public String extractText(InputStream input, String url, List<Outlink> outLinksList) throws IOException {
    String resultText = "";
    ZipInputStream zin = new ZipInputStream(input);
    ZipEntry entry;
    while ((entry = zin.getNextEntry()) != null) {
        if (!entry.isDirectory()) {
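            // Declared uncompressed size. ZipEntry.getSize() returns -1 when the size
            // is unknown, and the int cast assumes the entry is smaller than 2 GB.
            int size = (int) entry.getSize();
            byte[] b = new byte[size];
            // Copy the entry one byte at a time; read() returns -1 at the end of the entry.
            for (int x = 0; x < size; x++) {
                int err = zin.read();
                if (err != -1) {
                    b[x] = (byte) err;
                }
            }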
            String newurl = url + "/";
            String fname = entry.getName();
            newurl += fname;
            URL aURL = new URL(newurl);
            String base = aURL.toString();
            int i = fname.lastIndexOf('.');
            if (i != -1) {
                // Trying to resolve the Mime-Type
                Tika tika = new Tika();
                String contentType = tika.detect(fname);
                try {
                    Metadata metadata = new Metadata();
                    metadata.set(Response.CONTENT_LENGTH, Long.toString(entry.getSize()));
                    metadata.set(Response.CONTENT_TYPE, contentType);
                    Content content = new Content(newurl, base, b, contentType, metadata, this.conf);
                    Parse parse = new ParseUtil(this.conf).parse(content).get(content.getUrl());
                    ParseData theParseData = parse.getData();
                    Outlink[] theOutlinks = theParseData.getOutlinks();
                    for (int count = 0; count < theOutlinks.length; count++) {
                        outLinksList.add(new Outlink(theOutlinks[count].getToUrl(), theOutlinks[count].getAnchor()));
                    }
                    resultText += entry.getName() + " " + parse.getText() + " ";
                } catch (ParseException e) {
                    if (LOG.isInfoEnabled()) {
                        LOG.info("fetch okay, but can't parse " + fname + ", reason: " + e.getMessage());
                    }
                }
            }
        }
    }
    return resultText;
}
Also used: Outlink (org.apache.nutch.parse.Outlink), ParseUtil (org.apache.nutch.parse.ParseUtil), Parse (org.apache.nutch.parse.Parse), ZipEntry (java.util.zip.ZipEntry), Metadata (org.apache.nutch.metadata.Metadata), Tika (org.apache.tika.Tika), URL (java.net.URL), ZipInputStream (java.util.zip.ZipInputStream), ParseData (org.apache.nutch.parse.ParseData), Content (org.apache.nutch.protocol.Content), ParseException (org.apache.nutch.parse.ParseException)
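
The loop in extractText sizes its buffer from entry.getSize(), which java.util.zip documents as returning -1 when the size is not known. The following is a minimal sketch of an alternative, not the Nutch implementation: it reads each entry until the stream reports the end of that entry, so the declared size is never consulted. The class name ZipEntryReader and the methods readEntry and readAll are hypothetical names introduced for illustration; only java.io, java.util, and java.util.zip are used.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper for illustration; not part of Apache Nutch.
class ZipEntryReader {

    // Reads the remaining bytes of the current entry from the given stream.
    // For a ZipInputStream, read(byte[]) returns -1 at the end of the current
    // entry, so no size needs to be known up front.
    public static byte[] readEntry(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            out.write(chunk, 0, n);
        }
        return out.toByteArray();
    }

    // Returns the contents of every non-directory entry, keyed by entry name,
    // in the order the entries appear in the archive.
    public static Map<String, byte[]> readAll(InputStream input) throws IOException {
        Map<String, byte[]> contents = new LinkedHashMap<>();
        try (ZipInputStream zin = new ZipInputStream(input)) {
            ZipEntry entry;
            while ((entry = zin.getNextEntry()) != null) {
                if (!entry.isDirectory()) {
                    contents.put(entry.getName(), readEntry(zin));
                }
            }
        }
        return contents;
    }
}
```

Each byte array returned by readAll could then be handed to a parser in the same way that extractText passes b to Content. Note that this sketch buffers whole entries in memory, so a real crawler would probably also want a size cap.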

Example 2 with ParseException

Use of org.apache.nutch.parse.ParseException in the Apache Nutch project.

Class CCParseFilter, method filter.

/**
 * Adds metadata or otherwise modifies a parse of an HTML document, given the
 * DOM tree of a page.
 */
public ParseResult filter(Content content, ParseResult parseResult, HTMLMetaTags metaTags, DocumentFragment doc) {
    // get parse obj
    Parse parse = parseResult.get(content.getUrl());
    // construct base url
    URL base;
    try {
        base = new URL(content.getBaseUrl());
    } catch (MalformedURLException e) {
        Parse emptyParse = new ParseStatus(e).getEmptyParse(getConf());
        parseResult.put(content.getUrl(), new ParseText(emptyParse.getText()), emptyParse.getData());
        return parseResult;
    }
    try {
        // extract license metadata
        Walker.walk(doc, base, parse.getData().getParseMeta(), getConf());
    } catch (ParseException e) {
        Parse emptyParse = new ParseStatus(e).getEmptyParse(getConf());
        parseResult.put(content.getUrl(), new ParseText(emptyParse.getText()), emptyParse.getData());
        return parseResult;
    }
    return parseResult;
}
Also used: ParseStatus (org.apache.nutch.parse.ParseStatus), MalformedURLException (java.net.MalformedURLException), Parse (org.apache.nutch.parse.Parse), ParseException (org.apache.nutch.parse.ParseException), URL (java.net.URL), ParseText (org.apache.nutch.parse.ParseText)

Aggregations

URL (java.net.URL): 2
Parse (org.apache.nutch.parse.Parse): 2
ParseException (org.apache.nutch.parse.ParseException): 2
MalformedURLException (java.net.MalformedURLException): 1
ZipEntry (java.util.zip.ZipEntry): 1
ZipInputStream (java.util.zip.ZipInputStream): 1
Metadata (org.apache.nutch.metadata.Metadata): 1
Outlink (org.apache.nutch.parse.Outlink): 1
ParseData (org.apache.nutch.parse.ParseData): 1
ParseStatus (org.apache.nutch.parse.ParseStatus): 1
ParseText (org.apache.nutch.parse.ParseText): 1
ParseUtil (org.apache.nutch.parse.ParseUtil): 1
Content (org.apache.nutch.protocol.Content): 1
Tika (org.apache.tika.Tika): 1