
Example 1 with IndexDocument

use of org.wso2.carbon.registry.indexing.solr.IndexDocument in project carbon-apimgt by wso2.

the class PDFIndexerTest method testShouldReturnIndexedDocumentWhenParameterCorrect.

@Test
public void testShouldReturnIndexedDocumentWhenParameterCorrect() throws IOException {
    String mediaType = "application/pdf+test";
    final String MEDIA_TYPE = "mediaType";
    PDFParser parser = Mockito.mock(PDFParser.class);
    COSDocument cosDoc = Mockito.mock(COSDocument.class);
    PDFTextStripper pdfTextStripper = Mockito.mock(PDFTextStripper.class);
    Mockito.doThrow(IOException.class).when(cosDoc).close();
    Mockito.when(parser.getDocument()).thenReturn(new COSDocument()).thenReturn(cosDoc);
    Mockito.when(pdfTextStripper.getText(new PDDocument())).thenReturn("");
    PDFIndexer pdfIndexer = new PDFIndexerWrapper(parser, pdfTextStripper);
    // should return the default media type when media type is not defined in file2Index
    IndexDocument pdf = pdfIndexer.getIndexedDocument(file2Index);
    if (!"application/pdf".equals(pdf.getFields().get(MEDIA_TYPE).get(0))) {
        Assert.fail();
    }
    // should return the media type we have set in the file2Index even if error occurs in finally block
    file2Index.mediaType = mediaType;
    pdf = pdfIndexer.getIndexedDocument(file2Index);
    if (!mediaType.equals(pdf.getFields().get(MEDIA_TYPE).get(0))) {
        Assert.fail();
    }
}
Also used : IndexDocument(org.wso2.carbon.registry.indexing.solr.IndexDocument) PDFIndexerWrapper(org.wso2.carbon.apimgt.impl.indexing.indexer.util.PDFIndexerWrapper) PDFParser(org.apache.pdfbox.pdfparser.PDFParser) PDDocument(org.apache.pdfbox.pdmodel.PDDocument) COSDocument(org.apache.pdfbox.cos.COSDocument) PDFTextStripper(org.apache.pdfbox.text.PDFTextStripper) Test(org.junit.Test)

Example 2 with IndexDocument

use of org.wso2.carbon.registry.indexing.solr.IndexDocument in project carbon-apimgt by wso2.

the class MSWordIndexerTest method testShouldReturnIndexedDocumentWhenParameterCorrect.

@Test
public void testShouldReturnIndexedDocumentWhenParameterCorrect() throws Exception {
    POIFSFileSystem poiFS = Mockito.mock(POIFSFileSystem.class);
    WordExtractor wordExtractor = Mockito.mock(WordExtractor.class);
    XWPFWordExtractor xwpfExtractor = Mockito.mock(XWPFWordExtractor.class);
    XWPFDocument xwpfDocument = Mockito.mock(XWPFDocument.class);
    PowerMockito.whenNew(POIFSFileSystem.class).withParameterTypes(InputStream.class).withArguments(Mockito.any(InputStream.class)).thenThrow(OfficeXmlFileException.class).thenReturn(poiFS).thenThrow(APIManagementException.class);
    PowerMockito.whenNew(WordExtractor.class).withArguments(poiFS).thenReturn(wordExtractor);
    PowerMockito.whenNew(XWPFDocument.class).withParameterTypes(InputStream.class).withArguments(Mockito.any()).thenReturn(xwpfDocument);
    PowerMockito.whenNew(XWPFWordExtractor.class).withArguments(xwpfDocument).thenReturn(xwpfExtractor);
    Mockito.when(wordExtractor.getText()).thenReturn("");
    Mockito.when(xwpfExtractor.getText()).thenReturn("");
    MSWordIndexer indexer = new MSWordIndexer();
    IndexDocument wordDoc = indexer.getIndexedDocument(file2Index);
    // should return the default media type when media type is not defined in file2Index
    if (!"application/pdf".equals(wordDoc.getFields().get(IndexingConstants.FIELD_MEDIA_TYPE).get(0))) {
        Assert.fail();
    }
    // should return the media type we have set in the file2Index
    file2Index.mediaType = "text/html";
    wordDoc = indexer.getIndexedDocument(file2Index);
    if (!"text/html".equals(wordDoc.getFields().get(IndexingConstants.FIELD_MEDIA_TYPE).get(0))) {
        Assert.fail();
    }
    // should return the media type we have set in the file2Index even if exception occurred while reading the file
    file2Index.mediaType = "text/html";
    wordDoc = indexer.getIndexedDocument(file2Index);
    if (!"text/html".equals(wordDoc.getFields().get(IndexingConstants.FIELD_MEDIA_TYPE).get(0))) {
        Assert.fail();
    }
}
Also used : IndexDocument(org.wso2.carbon.registry.indexing.solr.IndexDocument) POIFSFileSystem(org.apache.poi.poifs.filesystem.POIFSFileSystem) XWPFWordExtractor(org.apache.poi.xwpf.extractor.XWPFWordExtractor) XWPFDocument(org.apache.poi.xwpf.usermodel.XWPFDocument) WordExtractor(org.apache.poi.hwpf.extractor.WordExtractor) Test(org.junit.Test) PrepareForTest(org.powermock.core.classloader.annotations.PrepareForTest)

Example 3 with IndexDocument

use of org.wso2.carbon.registry.indexing.solr.IndexDocument in project carbon-apimgt by wso2.

the class XMLIndexerTest method testShouldReturnIndexedDocumentWhenParameterCorrect.

@Test
public void testShouldReturnIndexedDocumentWhenParameterCorrect() throws RegistryException {
    String mediaType = "text/xml";
    final String MEDIA_TYPE = "mediaType";
    AsyncIndexer.File2Index file2Index = new AsyncIndexer.File2Index("".getBytes(), null, "", -1234, "");
    XMLIndexer indexer = new XMLIndexer();
    // should not add a media type field when media type is not defined in file2Index
    IndexDocument xml = indexer.getIndexedDocument(file2Index);
    if (xml.getFields().get(MEDIA_TYPE) != null) {
        Assert.fail();
    }
    // should return the media type we have set in the file2Index
    file2Index.mediaType = mediaType;
    xml = indexer.getIndexedDocument(file2Index);
    if (!mediaType.equals(xml.getFields().get(MEDIA_TYPE).get(0))) {
        Assert.fail();
    }
}
Also used : IndexDocument(org.wso2.carbon.registry.indexing.solr.IndexDocument) AsyncIndexer(org.wso2.carbon.registry.indexing.AsyncIndexer) Test(org.junit.Test)

Example 4 with IndexDocument

use of org.wso2.carbon.registry.indexing.solr.IndexDocument in project carbon-apimgt by wso2.

the class DocumentIndexer method getIndexedDocument.

public IndexDocument getIndexedDocument(AsyncIndexer.File2Index fileData) throws SolrException, RegistryException {
    IndexDocument indexDocument = super.getIndexedDocument(fileData);
    IndexDocument newIndexDocument = indexDocument;
    Registry registry = GovernanceUtils.getGovernanceSystemRegistry(IndexingManager.getInstance().getRegistry(fileData.tenantId));
    String documentResourcePath = fileData.path.substring(RegistryConstants.GOVERNANCE_REGISTRY_BASE_PATH.length());
    if (documentResourcePath.contains("/apimgt/applicationdata/apis/")) {
        return null;
    }
    if (log.isDebugEnabled()) {
        log.debug("Executing document indexer for resource at " + documentResourcePath);
    }
    Resource documentResource = null;
    Map<String, List<String>> fields = indexDocument.getFields();
    if (registry.resourceExists(documentResourcePath)) {
        documentResource = registry.get(documentResourcePath);
    }
    if (documentResource != null) {
        try {
            fetchRequiredDetailsFromAssociatedAPI(registry, documentResource, fields);
            StringBuilder stringBuilder = new StringBuilder();
            stringBuilder.append(fetchDocumentContent(registry, documentResource));
            if (fields.get(APIConstants.DOC_NAME) != null) {
                stringBuilder.append(APIConstants.DOC_NAME + "=" + StringUtils.join(fields.get(APIConstants.DOC_NAME), ","));
            }
            if (fields.get(APIConstants.DOC_SUMMARY) != null) {
                stringBuilder.append(APIConstants.DOC_SUMMARY + "=" + StringUtils.join(fields.get(APIConstants.DOC_SUMMARY), ","));
            }
            newIndexDocument = new IndexDocument(fileData.path, "", stringBuilder.toString(), indexDocument.getTenantId());
            fields.put(APIConstants.DOCUMENT_INDEXER_INDICATOR, Arrays.asList("true"));
            newIndexDocument.setFields(fields);
        } catch (APIManagementException e) {
            // error occurred while fetching details from API, but continuing document indexing
            log.error("Error while updating indexed document.", e);
        } catch (IOException e) {
            // error occurred while fetching document content, but continuing document indexing
            log.error("Error while getting document content.", e);
        }
    }
    return newIndexDocument;
}
Also used : IndexDocument(org.wso2.carbon.registry.indexing.solr.IndexDocument) APIManagementException(org.wso2.carbon.apimgt.api.APIManagementException) Resource(org.wso2.carbon.registry.core.Resource) List(java.util.List) Registry(org.wso2.carbon.registry.core.Registry) IOException(java.io.IOException)
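
The content-building step in Example 4 can be tried in isolation. The following is a minimal sketch, not the project's code: it assumes plain string keys in place of APIConstants.DOC_NAME and APIConstants.DOC_SUMMARY (their real values are not shown above), and the class and method names are hypothetical. Like the original, it appends each present field as key=value1,value2 directly after the raw content, with no separator in between.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DocumentContentSketch {

    // Hypothetical stand-ins for APIConstants.DOC_NAME and APIConstants.DOC_SUMMARY;
    // the real constant values are not shown in the example above.
    static final String DOC_NAME = "docName";
    static final String DOC_SUMMARY = "docSummary";

    // Appends "key=v1,v2" for each field that is present, mirroring the
    // StringBuilder logic in DocumentIndexer.getIndexedDocument.
    static String buildContent(String rawContent, Map<String, List<String>> fields) {
        StringBuilder sb = new StringBuilder();
        sb.append(rawContent);
        for (String key : Arrays.asList(DOC_NAME, DOC_SUMMARY)) {
            List<String> values = fields.get(key);
            if (values != null) {
                sb.append(key).append('=').append(String.join(",", values));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, List<String>> fields = new HashMap<>();
        fields.put(DOC_NAME, Arrays.asList("Guide", "v2"));
        // prints "body docName=Guide,v2"
        System.out.println(buildContent("body ", fields));
    }
}
```

Missing fields are skipped rather than written as empty values, which matches the null checks in the original method.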

Example 5 with IndexDocument

use of org.wso2.carbon.registry.indexing.solr.IndexDocument in project carbon-apimgt by wso2.

the class MSPowerpointIndexer method getIndexedDocument.

public IndexDocument getIndexedDocument(File2Index fileData) throws SolrException {
    try {
        String ppText = null;
        try {
            // Extract Powerpoint 2003 (.ppt) document files
            POIFSFileSystem fs = new POIFSFileSystem(new ByteArrayInputStream(fileData.data));
            PowerPointExtractor extractor = new PowerPointExtractor(fs);
            ppText = extractor.getText();
        } catch (OfficeXmlFileException e) {
            // if 2003 Powerpoint (.ppt) extraction failed, try with Powerpoint 2007 (.pptx) document file extractor
            XMLSlideShow xmlSlideShow = new XMLSlideShow(new ByteArrayInputStream(fileData.data));
            XSLFPowerPointExtractor xslfPowerPointExtractor = new XSLFPowerPointExtractor(xmlSlideShow);
            ppText = xslfPowerPointExtractor.getText();
        } catch (Exception e) {
            String msg = "Failed to extract the document";
            log.error(msg, e);
        }
        IndexDocument indexDoc = new IndexDocument(fileData.path, ppText, null);
        Map<String, List<String>> fields = new HashMap<String, List<String>>();
        fields.put("path", Collections.singletonList(fileData.path));
        if (fileData.mediaType != null) {
            fields.put(IndexingConstants.FIELD_MEDIA_TYPE, Collections.singletonList(fileData.mediaType));
        } else {
            fields.put(IndexingConstants.FIELD_MEDIA_TYPE, Collections.singletonList("application/vnd.ms-powerpoint"));
        }
        indexDoc.setFields(fields);
        return indexDoc;
    } catch (IOException e) {
        String msg = "Failed to write to the index";
        log.error(msg, e);
        throw new SolrException(ErrorCode.SERVER_ERROR, msg, e);
    }
}
Also used : IndexDocument(org.wso2.carbon.registry.indexing.solr.IndexDocument) HashMap(java.util.HashMap) PowerPointExtractor(org.apache.poi.hslf.extractor.PowerPointExtractor) XSLFPowerPointExtractor(org.apache.poi.xslf.extractor.XSLFPowerPointExtractor) IOException(java.io.IOException) OfficeXmlFileException(org.apache.poi.poifs.filesystem.OfficeXmlFileException) SolrException(org.apache.solr.common.SolrException) ByteArrayInputStream(java.io.ByteArrayInputStream) POIFSFileSystem(org.apache.poi.poifs.filesystem.POIFSFileSystem) XMLSlideShow(org.apache.poi.xslf.usermodel.XMLSlideShow) List(java.util.List)
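
The media-type fallback that Example 5 applies, and that the tests in Examples 1 to 3 exercise, can be sketched without the registry or POI dependencies. This is an illustrative sketch only: the class and method names are hypothetical, and the field key "mediaType" is an assumption taken from the MEDIA_TYPE constant used in the tests above rather than from IndexingConstants itself.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MediaTypeFallbackSketch {

    // Assumed value of IndexingConstants.FIELD_MEDIA_TYPE, based on the
    // MEDIA_TYPE constant declared in the tests above.
    static final String FIELD_MEDIA_TYPE = "mediaType";

    // Builds the field map the way MSPowerpointIndexer does: the path is
    // always recorded, and the indexer's default media type is used only
    // when the file carries none of its own.
    static Map<String, List<String>> buildFields(String path, String mediaType, String defaultMediaType) {
        Map<String, List<String>> fields = new HashMap<>();
        fields.put("path", Collections.singletonList(path));
        String effective = (mediaType != null) ? mediaType : defaultMediaType;
        fields.put(FIELD_MEDIA_TYPE, Collections.singletonList(effective));
        return fields;
    }

    public static void main(String[] args) {
        Map<String, List<String>> fields =
                buildFields("/docs/slides.ppt", null, "application/vnd.ms-powerpoint");
        System.out.println(fields.get(FIELD_MEDIA_TYPE).get(0));
    }
}
```

Example 3 asserts a different outcome for XMLIndexer: with no media type set, the field is expected to be absent rather than filled with a default, so this sketch reflects the PowerPoint indexer's behavior specifically.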

Aggregations

IndexDocument (org.wso2.carbon.registry.indexing.solr.IndexDocument): 15
List (java.util.List): 9
HashMap (java.util.HashMap): 7
Test (org.junit.Test): 6
IOException (java.io.IOException): 5
POIFSFileSystem (org.apache.poi.poifs.filesystem.POIFSFileSystem): 4
SolrException (org.apache.solr.common.SolrException): 4
OfficeXmlFileException (org.apache.poi.poifs.filesystem.OfficeXmlFileException): 3
AsyncIndexer (org.wso2.carbon.registry.indexing.AsyncIndexer): 3
ByteArrayInputStream (java.io.ByteArrayInputStream): 2
COSDocument (org.apache.pdfbox.cos.COSDocument): 2
PDFParser (org.apache.pdfbox.pdfparser.PDFParser): 2
PDDocument (org.apache.pdfbox.pdmodel.PDDocument): 2
PDFTextStripper (org.apache.pdfbox.text.PDFTextStripper): 2
PowerPointExtractor (org.apache.poi.hslf.extractor.PowerPointExtractor): 2
WordExtractor (org.apache.poi.hwpf.extractor.WordExtractor): 2
XSLFPowerPointExtractor (org.apache.poi.xslf.extractor.XSLFPowerPointExtractor): 2
XMLSlideShow (org.apache.poi.xslf.usermodel.XMLSlideShow): 2
XWPFWordExtractor (org.apache.poi.xwpf.extractor.XWPFWordExtractor): 2
XWPFDocument (org.apache.poi.xwpf.usermodel.XWPFDocument): 2