Example 6 with IndexDocument

use of org.wso2.carbon.registry.indexing.solr.IndexDocument in project carbon-apimgt by wso2.

the class PDFIndexer method getIndexedDocument.

public IndexDocument getIndexedDocument(File2Index fileData) throws SolrException {
    COSDocument cosDoc = null;
    try {
        PDFParser parser = getPdfParser(fileData);
        parser.parse();
        cosDoc = parser.getDocument();
        PDFTextStripper stripper = getPdfTextStripper();
        String docText = stripper.getText(new PDDocument(cosDoc));
        IndexDocument indexDoc = new IndexDocument(fileData.path, docText, null);
        Map<String, List<String>> fields = new HashMap<String, List<String>>();
        fields.put("path", Collections.singletonList(fileData.path));
        if (fileData.mediaType != null) {
            fields.put(IndexingConstants.FIELD_MEDIA_TYPE, Collections.singletonList(fileData.mediaType));
        } else {
            fields.put(IndexingConstants.FIELD_MEDIA_TYPE, Collections.singletonList("application/pdf"));
        }
        indexDoc.setFields(fields);
        return indexDoc;
    } catch (IOException e) {
        String msg = "Failed to write to the index";
        log.error(msg, e);
        throw new SolrException(ErrorCode.SERVER_ERROR, msg, e);
    } finally {
        if (cosDoc != null) {
            try {
                cosDoc.close();
            } catch (IOException e) {
                log.error("Failed to close pdf doc stream ", e);
            }
        }
    }
}
Also used : IndexDocument(org.wso2.carbon.registry.indexing.solr.IndexDocument) HashMap(java.util.HashMap) PDFParser(org.apache.pdfbox.pdfparser.PDFParser) PDDocument(org.apache.pdfbox.pdmodel.PDDocument) COSDocument(org.apache.pdfbox.cos.COSDocument) List(java.util.List) IOException(java.io.IOException) SolrException(org.apache.solr.common.SolrException) PDFTextStripper(org.apache.pdfbox.text.PDFTextStripper)
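
The indexers on this page all finish the same way: they put the resource path and a media type (with a per-format default) into the field map. A minimal sketch of that shared step, using only `java.util`. The class `IndexFields`, the method `build`, and the literal key `"mediaType"` are hypothetical names for illustration; the real code uses `IndexingConstants.FIELD_MEDIA_TYPE`, whose value is not shown here.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of carbon-apimgt.
final class IndexFields {

    // Assumed stand-in for IndexingConstants.FIELD_MEDIA_TYPE.
    static final String FIELD_MEDIA_TYPE = "mediaType";

    private IndexFields() {
    }

    // Builds the "path" and media-type fields; falls back to
    // defaultMediaType when the file carries no media type of its own.
    static Map<String, List<String>> build(String path, String mediaType, String defaultMediaType) {
        Map<String, List<String>> fields = new HashMap<>();
        fields.put("path", Collections.singletonList(path));
        String effective = (mediaType != null) ? mediaType : defaultMediaType;
        fields.put(FIELD_MEDIA_TYPE, Collections.singletonList(effective));
        return fields;
    }
}
```

With such a helper, each indexer would only need to supply its own default, for example `IndexFields.build(fileData.path, fileData.mediaType, "application/pdf")`.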

Example 7 with IndexDocument

use of org.wso2.carbon.registry.indexing.solr.IndexDocument in project carbon-apimgt by wso2.

the class CustomAPIIndexer method getIndexedDocument.

public IndexDocument getIndexedDocument(AsyncIndexer.File2Index fileData) throws SolrException, RegistryException {
    Registry registry = GovernanceUtils.getGovernanceSystemRegistry(IndexingManager.getInstance().getRegistry(fileData.tenantId));
    String resourcePath = fileData.path.substring(RegistryConstants.GOVERNANCE_REGISTRY_BASE_PATH.length());
    Resource resource = null;
    if (resourcePath.contains("/apimgt/applicationdata/apis/")) {
        return null;
    }
    if (registry.resourceExists(resourcePath)) {
        resource = registry.get(resourcePath);
    }
    if (log.isDebugEnabled()) {
        log.debug("CustomAPIIndexer is currently indexing the api at path " + resourcePath);
    }
    // Here we are adding properties as fields, so that we can search the properties as we do for attributes.
    IndexDocument indexDocument = super.getIndexedDocument(fileData);
    Map<String, List<String>> fields = indexDocument.getFields();
    if (resource != null) {
        Properties properties = resource.getProperties();
        Enumeration<?> propertyNames = properties.propertyNames();
        while (propertyNames.hasMoreElements()) {
            String property = (String) propertyNames.nextElement();
            if (log.isDebugEnabled()) {
                log.debug("API at " + resourcePath + " has " + property + " property");
            }
            if (property.startsWith(APIConstants.API_RELATED_CUSTOM_PROPERTIES_PREFIX)) {
                fields.put((OVERVIEW_PREFIX + property), getLowerCaseList(resource.getPropertyValues(property)));
                if (log.isDebugEnabled()) {
                    log.debug(property + " is added as " + (OVERVIEW_PREFIX + property) + " field for indexing");
                }
            }
        }
        indexDocument.setFields(fields);
    }
    return indexDocument;
}
Also used : IndexDocument(org.wso2.carbon.registry.indexing.solr.IndexDocument) Enumeration(java.util.Enumeration) Resource(org.wso2.carbon.registry.core.Resource) ArrayList(java.util.ArrayList) List(java.util.List) Registry(org.wso2.carbon.registry.core.Registry) Properties(java.util.Properties)
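
The loop above copies every resource property with a given prefix into the index fields, lower-casing the values so they can be searched like attributes. A self-contained sketch of that mapping follows. It models properties as a plain `Map<String, List<String>>` rather than a registry `Resource`, and the class `PropertyFieldMapper`, the method `promote`, and the prefix strings used in the usage note are hypothetical; the values of `APIConstants.API_RELATED_CUSTOM_PROPERTIES_PREFIX` and `OVERVIEW_PREFIX` are not shown on this page, and `getLowerCaseList` may differ in detail.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of carbon-apimgt.
final class PropertyFieldMapper {

    private PropertyFieldMapper() {
    }

    // Returns one field per property whose name starts with propertyPrefix.
    // The field name is fieldPrefix + property name; values are lower-cased.
    static Map<String, List<String>> promote(Map<String, List<String>> properties,
                                             String propertyPrefix, String fieldPrefix) {
        Map<String, List<String>> fields = new HashMap<>();
        for (Map.Entry<String, List<String>> entry : properties.entrySet()) {
            if (!entry.getKey().startsWith(propertyPrefix)) {
                continue; // not a custom property, leave it out of the index
            }
            List<String> lowered = new ArrayList<>();
            for (String value : entry.getValue()) {
                if (value != null) {
                    lowered.add(value.toLowerCase(Locale.ROOT));
                }
            }
            fields.put(fieldPrefix + entry.getKey(), lowered);
        }
        return fields;
    }
}
```

For instance, with the placeholder prefixes `"custom."` and `"overview_"`, a property `custom.Owner` holding `Alice` would become a field `overview_custom.Owner` holding `alice`.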

Example 8 with IndexDocument

use of org.wso2.carbon.registry.indexing.solr.IndexDocument in project carbon-apimgt by wso2.

the class MSExcelIndexer method getIndexedDocument.

public IndexDocument getIndexedDocument(File2Index fileData) throws SolrException {
    try {
        String excelText = null;
        try {
            // Extract Excel 2003 (.xls) document files
            ExcelExtractor extractor = getExcelExtractor(fileData);
            excelText = extractor.getText();
        } catch (OfficeXmlFileException e) {
            // If Excel 2003 (.xls) extraction failed, try the Excel 2007 (.xlsx) document extractor
            XSSFExcelExtractor xssfExcelExtractor = getXssfExcelExtractor(fileData);
            excelText = xssfExcelExtractor.getText();
        } catch (Exception e) {
            String msg = "Failed to extract the document";
            log.error(msg, e);
        }
        IndexDocument indexDoc = new IndexDocument(fileData.path, excelText, null);
        Map<String, List<String>> fields = new HashMap<String, List<String>>();
        fields.put("path", Collections.singletonList(fileData.path));
        if (fileData.mediaType != null) {
            fields.put(IndexingConstants.FIELD_MEDIA_TYPE, Collections.singletonList(fileData.mediaType));
        } else {
            fields.put(IndexingConstants.FIELD_MEDIA_TYPE, Collections.singletonList("application/vnd.ms-excel"));
        }
        indexDoc.setFields(fields);
        return indexDoc;
    } catch (IOException e) {
        String msg = "Failed to write to the index";
        log.error(msg, e);
        throw new SolrException(ErrorCode.SERVER_ERROR, msg, e);
    }
}
Also used : IndexDocument(org.wso2.carbon.registry.indexing.solr.IndexDocument) OfficeXmlFileException(org.apache.poi.poifs.filesystem.OfficeXmlFileException) XSSFExcelExtractor(org.apache.poi.xssf.extractor.XSSFExcelExtractor) HashMap(java.util.HashMap) ExcelExtractor(org.apache.poi.hssf.extractor.ExcelExtractor) List(java.util.List) IOException(java.io.IOException) SolrException(org.apache.solr.common.SolrException)
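
The Office indexers try the legacy binary extractor first and switch to the OOXML extractor only when the legacy one reports the wrong format; any other extraction failure is logged and leaves the text as `null`, so the document is still indexed by path and media type. A library-free sketch of that control flow is below. `FallbackExtraction`, `Extractor`, and `WrongFormatException` are hypothetical stand-ins (the latter for POI's `OfficeXmlFileException`), and the sketch simplifies by using unchecked exceptions only.

```java
// Hypothetical sketch of the "legacy first, then OOXML" extraction flow.
final class FallbackExtraction {

    // Stand-in for a format-specific text extractor.
    @FunctionalInterface
    interface Extractor {
        String extract(byte[] data);
    }

    // Stand-in for OfficeXmlFileException: the data is in the newer format.
    static final class WrongFormatException extends RuntimeException {
        WrongFormatException(String message) {
            super(message);
        }
    }

    private FallbackExtraction() {
    }

    static String extractText(byte[] data, Extractor legacy, Extractor modern) {
        try {
            return legacy.extract(data);
        } catch (WrongFormatException e) {
            // Legacy parser rejected the format: retry with the newer extractor.
            // A failure here is not swallowed; it propagates to the caller.
            return modern.extract(data);
        } catch (RuntimeException e) {
            // Any other failure: report it and index the document without text.
            System.err.println("Failed to extract the document: " + e.getMessage());
            return null;
        }
    }
}
```

Returning `null` instead of rethrowing keeps one unreadable file from aborting the indexing of others, at the cost of that file being searchable by metadata only.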

Example 9 with IndexDocument

use of org.wso2.carbon.registry.indexing.solr.IndexDocument in project carbon-apimgt by wso2.

the class MSWordIndexer method getIndexedDocument.

public IndexDocument getIndexedDocument(File2Index fileData) throws SolrException {
    try {
        String wordText = null;
        try {
            // Extract MSWord 2003 document files
            POIFSFileSystem fs = new POIFSFileSystem(new ByteArrayInputStream(fileData.data));
            WordExtractor msWord2003Extractor = new WordExtractor(fs);
            wordText = msWord2003Extractor.getText();
        } catch (OfficeXmlFileException e) {
            // if 2003 extraction failed, try with MSWord 2007 document files extractor
            XWPFDocument doc = new XWPFDocument(new ByteArrayInputStream(fileData.data));
            XWPFWordExtractor msWord2007Extractor = new XWPFWordExtractor(doc);
            wordText = msWord2007Extractor.getText();
        } catch (Exception e) {
            // The reason for not throwing an exception is that since this is an indexer that runs in the background
            // throwing an exception might lead to adverse behaviors in the client side and might lead to
            // other files not being indexed
            String msg = "Failed to extract the document while indexing";
            log.error(msg, e);
        }
        IndexDocument indexDoc = new IndexDocument(fileData.path, wordText, null);
        Map<String, List<String>> fields = new HashMap<String, List<String>>();
        fields.put("path", Collections.singletonList(fileData.path));
        if (fileData.mediaType != null) {
            fields.put(IndexingConstants.FIELD_MEDIA_TYPE, Collections.singletonList(fileData.mediaType));
        } else {
            fields.put(IndexingConstants.FIELD_MEDIA_TYPE, Collections.singletonList("application/msword"));
        }
        indexDoc.setFields(fields);
        return indexDoc;
    } catch (IOException e) {
        String msg = "Failed to write to the index";
        log.error(msg, e);
        throw new SolrException(ErrorCode.SERVER_ERROR, msg, e);
    }
}
Also used : IndexDocument(org.wso2.carbon.registry.indexing.solr.IndexDocument) HashMap(java.util.HashMap) XWPFWordExtractor(org.apache.poi.xwpf.extractor.XWPFWordExtractor) IOException(java.io.IOException) OfficeXmlFileException(org.apache.poi.poifs.filesystem.OfficeXmlFileException) SolrException(org.apache.solr.common.SolrException) WordExtractor(org.apache.poi.hwpf.extractor.WordExtractor) ByteArrayInputStream(java.io.ByteArrayInputStream) POIFSFileSystem(org.apache.poi.poifs.filesystem.POIFSFileSystem) XWPFDocument(org.apache.poi.xwpf.usermodel.XWPFDocument) List(java.util.List)

Example 10 with IndexDocument

use of org.wso2.carbon.registry.indexing.solr.IndexDocument in project carbon-apimgt by wso2.

the class PlainTextIndexer method getIndexedDocument.

public IndexDocument getIndexedDocument(File2Index fileData) throws SolrException, RegistryException {
    IndexDocument indexDoc = new IndexDocument(fileData.path, RegistryUtils.decodeBytes(fileData.data), null);
    Map<String, List<String>> fields = new HashMap<String, List<String>>();
    fields.put("path", Arrays.asList(fileData.path));
    if (fileData.mediaType != null) {
        fields.put(IndexingConstants.FIELD_MEDIA_TYPE, Arrays.asList(fileData.mediaType));
    } else {
        fields.put(IndexingConstants.FIELD_MEDIA_TYPE, Arrays.asList("text/(.)"));
    }
    indexDoc.setFields(fields);
    return indexDoc;
}
Also used : IndexDocument(org.wso2.carbon.registry.indexing.solr.IndexDocument) HashMap(java.util.HashMap) List(java.util.List)

Aggregations

IndexDocument (org.wso2.carbon.registry.indexing.solr.IndexDocument) 15
List (java.util.List) 9
HashMap (java.util.HashMap) 7
Test (org.junit.Test) 6
IOException (java.io.IOException) 5
POIFSFileSystem (org.apache.poi.poifs.filesystem.POIFSFileSystem) 4
SolrException (org.apache.solr.common.SolrException) 4
OfficeXmlFileException (org.apache.poi.poifs.filesystem.OfficeXmlFileException) 3
AsyncIndexer (org.wso2.carbon.registry.indexing.AsyncIndexer) 3
ByteArrayInputStream (java.io.ByteArrayInputStream) 2
COSDocument (org.apache.pdfbox.cos.COSDocument) 2
PDFParser (org.apache.pdfbox.pdfparser.PDFParser) 2
PDDocument (org.apache.pdfbox.pdmodel.PDDocument) 2
PDFTextStripper (org.apache.pdfbox.text.PDFTextStripper) 2
PowerPointExtractor (org.apache.poi.hslf.extractor.PowerPointExtractor) 2
WordExtractor (org.apache.poi.hwpf.extractor.WordExtractor) 2
XSLFPowerPointExtractor (org.apache.poi.xslf.extractor.XSLFPowerPointExtractor) 2
XMLSlideShow (org.apache.poi.xslf.usermodel.XMLSlideShow) 2
XWPFWordExtractor (org.apache.poi.xwpf.extractor.XWPFWordExtractor) 2
XWPFDocument (org.apache.poi.xwpf.usermodel.XWPFDocument) 2