
Example 21 with WordExtractor

Use of org.apache.poi.hwpf.extractor.WordExtractor in project carbon-apimgt by wso2.

From the class MSWordIndexer, method getIndexedDocument.

public IndexDocument getIndexedDocument(File2Index fileData) throws SolrException {
    try {
        String wordText = null;
        try {
            // Extract MSWord 2003 document files
            POIFSFileSystem fs = new POIFSFileSystem(new ByteArrayInputStream(fileData.data));
            WordExtractor msWord2003Extractor = new WordExtractor(fs);
            wordText = msWord2003Extractor.getText();
        } catch (OfficeXmlFileException e) {
            // The data is not in the binary 2003 format; fall back to the MSWord 2007 (OOXML) extractor
            XWPFDocument doc = new XWPFDocument(new ByteArrayInputStream(fileData.data));
            XWPFWordExtractor msWord2007Extractor = new XWPFWordExtractor(doc);
            wordText = msWord2007Extractor.getText();
        } catch (Exception e) {
            // Log instead of rethrowing: this indexer runs in the background, and an exception here
            // could cause adverse behavior on the client side and prevent other files from being indexed.
            // wordText stays null in this case, so the document is indexed without extracted text.
            String msg = "Failed to extract the document while indexing";
            log.error(msg, e);
        }
        IndexDocument indexDoc = new IndexDocument(fileData.path, wordText, null);
        Map<String, List<String>> fields = new HashMap<String, List<String>>();
        fields.put("path", Collections.singletonList(fileData.path));
        if (fileData.mediaType != null) {
            fields.put(IndexingConstants.FIELD_MEDIA_TYPE, Collections.singletonList(fileData.mediaType));
        } else {
            fields.put(IndexingConstants.FIELD_MEDIA_TYPE, Collections.singletonList("application/msword"));
        }
        indexDoc.setFields(fields);
        return indexDoc;
    } catch (IOException e) {
        String msg = "Failed to write to the index";
        log.error(msg, e);
        throw new SolrException(ErrorCode.SERVER_ERROR, msg, e);
    }
}
Also used: IndexDocument (org.wso2.carbon.registry.indexing.solr.IndexDocument), HashMap (java.util.HashMap), XWPFWordExtractor (org.apache.poi.xwpf.extractor.XWPFWordExtractor), IOException (java.io.IOException), OfficeXmlFileException (org.apache.poi.poifs.filesystem.OfficeXmlFileException), SolrException (org.apache.solr.common.SolrException), WordExtractor (org.apache.poi.hwpf.extractor.WordExtractor), ByteArrayInputStream (java.io.ByteArrayInputStream), POIFSFileSystem (org.apache.poi.poifs.filesystem.POIFSFileSystem), XWPFDocument (org.apache.poi.xwpf.usermodel.XWPFDocument), List (java.util.List)
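
The method above chooses between the 2003 and 2007 extractors by catching OfficeXmlFileException. As an alternative, the container format can be guessed from the leading bytes before any extractor is constructed. The following is a minimal sketch of that idea using only the JDK; the class WordFormatSniffer and its Format enum are hypothetical names introduced here and are not part of POI or carbon-apimgt. It only inspects the file signature, so a ZIP match means "possibly .docx", not a guarantee. Recent POI versions ship their own detection helper (FileMagic), which would likely be preferable in real code.

```java
import java.util.Arrays;

/** Hypothetical helper: guesses the container format of a document from its first bytes. */
final class WordFormatSniffer {

    enum Format { OLE2, ZIP_BASED, UNKNOWN }

    // OLE2 compound document signature, used by binary .doc files.
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // ZIP local file header signature ("PK" 0x03 0x04); .docx files are ZIP archives.
    private static final byte[] ZIP_MAGIC = { 0x50, 0x4B, 0x03, 0x04 };

    /** Returns the format suggested by the leading bytes, or UNKNOWN if neither signature matches. */
    static Format sniff(byte[] data) {
        if (startsWith(data, OLE2_MAGIC)) {
            return Format.OLE2;
        }
        if (startsWith(data, ZIP_MAGIC)) {
            return Format.ZIP_BASED;
        }
        return Format.UNKNOWN;
    }

    // Null-safe prefix check; inputs shorter than the prefix never match.
    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data == null || data.length < prefix.length) {
            return false;
        }
        return Arrays.equals(Arrays.copyOf(data, prefix.length), prefix);
    }

    public static void main(String[] args) {
        byte[] zipHeader = { 0x50, 0x4B, 0x03, 0x04, 0x14, 0x00 };
        System.out.println(sniff(zipHeader));
        System.out.println(sniff(new byte[0]));
    }
}
```

In an indexer like the one above, a caller could pass fileData.data to sniff and use WordExtractor for OLE2 and XWPFWordExtractor for ZIP_BASED, keeping the existing catch block as a safety net for files whose signature is misleading.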

Aggregations

WordExtractor (org.apache.poi.hwpf.extractor.WordExtractor): 21
Test (org.junit.Test): 12
XWPFWordExtractor (org.apache.poi.xwpf.extractor.XWPFWordExtractor): 11
ExcelExtractor (org.apache.poi.hssf.extractor.ExcelExtractor): 8
PowerPointExtractor (org.apache.poi.hslf.extractor.PowerPointExtractor): 7
OutlookTextExtactor (org.apache.poi.hsmf.extractor.OutlookTextExtactor): 7
HWPFDocument (org.apache.poi.hwpf.HWPFDocument): 7
XSSFExcelExtractor (org.apache.poi.xssf.extractor.XSSFExcelExtractor): 7
POIFSFileSystem (org.apache.poi.poifs.filesystem.POIFSFileSystem): 6
XSLFPowerPointExtractor (org.apache.poi.xslf.extractor.XSLFPowerPointExtractor): 6
XSSFEventBasedExcelExtractor (org.apache.poi.xssf.extractor.XSSFEventBasedExcelExtractor): 6
FileInputStream (java.io.FileInputStream): 5
IOException (java.io.IOException): 5
EventBasedExcelExtractor (org.apache.poi.hssf.extractor.EventBasedExcelExtractor): 5
ByteArrayInputStream (java.io.ByteArrayInputStream): 4
InputStream (java.io.InputStream): 4
POITextExtractor (org.apache.poi.POITextExtractor): 4
VisioTextExtractor (org.apache.poi.hdgf.extractor.VisioTextExtractor): 4
PublisherTextExtractor (org.apache.poi.hpbf.extractor.PublisherTextExtractor): 4
Word6Extractor (org.apache.poi.hwpf.extractor.Word6Extractor): 4