use of org.apache.tika.metadata.Metadata in project tika by apache.
the class TestContainerAwareDetector method testTruncatedFiles.
@Test
public void testTruncatedFiles() throws Exception {
    // First up a truncated OOXML (zip) file
    // With only the data supplied, the best we can do is the container
    Metadata m = new Metadata();
    try (TikaInputStream xlsx = getTruncatedFile("testEXCEL.xlsx", 300)) {
        assertEquals(MediaType.application("x-tika-ooxml"), detector.detect(xlsx, m));
    }
    // With truncated data + filename, we can use the filename to specialise
    m = new Metadata();
    m.add(Metadata.RESOURCE_NAME_KEY, "testEXCEL.xlsx");
    try (TikaInputStream xlsx = getTruncatedFile("testEXCEL.xlsx", 300)) {
        assertEquals(
                MediaType.application("vnd.openxmlformats-officedocument.spreadsheetml.sheet"),
                detector.detect(xlsx, m));
    }
    // Now a truncated OLE2 file
    m = new Metadata();
    try (TikaInputStream xls = getTruncatedFile("testEXCEL.xls", 400)) {
        assertEquals(MediaType.application("x-tika-msoffice"), detector.detect(xls, m));
    }
    // Finally a truncated OLE2 file, with a filename available
    m = new Metadata();
    m.add(Metadata.RESOURCE_NAME_KEY, "testEXCEL.xls");
    try (TikaInputStream xls = getTruncatedFile("testEXCEL.xls", 400)) {
        assertEquals(MediaType.application("vnd.ms-excel"), detector.detect(xls, m));
    }
}
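The fallback the test above exercises (report only the generic container type when nothing but a truncated prefix is available, and specialise when a resource name is supplied) can be sketched with the JDK alone. This is a minimal illustration, not Tika's detector: the class `TruncatedDetectSketch` and its `detect` method are hypothetical names, the extension checks cover only the two files used in the test, and treating every ZIP signature as `x-tika-ooxml` is a deliberate simplification.

```java
import java.util.Arrays;
import java.util.Locale;

// Hypothetical sketch: NOT part of Apache Tika.
final class TruncatedDetectSketch {
    // ZIP local file header signature ("PK\003\004")
    private static final byte[] ZIP_MAGIC = {0x50, 0x4B, 0x03, 0x04};
    // OLE2 compound document signature
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // True if data begins with the given magic bytes.
    static boolean startsWith(byte[] data, byte[] magic) {
        return data.length >= magic.length
                && Arrays.equals(Arrays.copyOf(data, magic.length), magic);
    }

    // Returns a media type string for a (possibly truncated) prefix.
    // resourceName may be null, mirroring a Metadata with no resource name set.
    static String detect(byte[] prefix, String resourceName) {
        String name = resourceName == null
                ? "" : resourceName.toLowerCase(Locale.ROOT);
        if (startsWith(prefix, ZIP_MAGIC)) {
            // Simplification: assume the ZIP container is OOXML.
            return name.endsWith(".xlsx")
                    ? "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"
                    : "application/x-tika-ooxml";
        }
        if (startsWith(prefix, OLE2_MAGIC)) {
            return name.endsWith(".xls")
                    ? "application/vnd.ms-excel"
                    : "application/x-tika-msoffice";
        }
        // Unknown signature: fall back to the generic binary type.
        return "application/octet-stream";
    }

    public static void main(String[] args) {
        byte[] zipPrefix = {0x50, 0x4B, 0x03, 0x04, 0x14, 0x00};
        System.out.println(detect(zipPrefix, null));
        System.out.println(detect(zipPrefix, "testEXCEL.xlsx"));
    }
}
```

The name is consulted only after the signature has identified the container, so a misleading extension on unrelated data does not change the result in this sketch.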
use of org.apache.tika.metadata.Metadata in project tika by apache.
the class ExternalEmbedderTest method getMetadataToEmbed.
/**
 * Gets the tika <code>Metadata</code> object containing data to be
 * embedded.
 *
 * @param timestamp the timestamp used to build the expected metadata value
 * @return the populated tika metadata object
 */
protected Metadata getMetadataToEmbed(Date timestamp) {
    Metadata metadata = new Metadata();
    metadata.add(TikaCoreProperties.DESCRIPTION,
            getExpectedMetadataValueString(TikaCoreProperties.DESCRIPTION.toString(), timestamp));
    return metadata;
}
use of org.apache.tika.metadata.Metadata in project tika by apache.
the class TestContainerAwareDetector method assertTypeByNameAndData.
private void assertTypeByNameAndData(String dataFile, String name, String typeFromDetector, String typeFromMagic) throws Exception {
    try (TikaInputStream stream = TikaInputStream.get(
            TestContainerAwareDetector.class.getResource("/test-documents/" + dataFile))) {
        Metadata m = new Metadata();
        if (name != null) {
            m.add(Metadata.RESOURCE_NAME_KEY, name);
        }
        // Mime Magic version is likely to be less precise
        if (typeFromMagic != null) {
            assertEquals(MediaType.parse(typeFromMagic), mimeTypes.detect(stream, m));
        }
        // All being well, the detector should get it perfect
        assertEquals(MediaType.parse(typeFromDetector), detector.detect(stream, m));
    }
}
use of org.apache.tika.metadata.Metadata in project tika by apache.
the class TestParsers method testWORDxtraction.
@Test
public void testWORDxtraction() throws Exception {
    File file = getResourceAsFile("/test-documents/testWORD.doc");
    Parser parser = tika.getParser();
    Metadata metadata = new Metadata();
    try (InputStream stream = new FileInputStream(file)) {
        parser.parse(stream, new DefaultHandler(), metadata, new ParseContext());
    }
    assertEquals("Sample Word Document", metadata.get(TikaCoreProperties.TITLE));
}
use of org.apache.tika.metadata.Metadata in project tika by apache.
the class TensorflowImageRecParser method recognise.
@Override
public List<RecognisedObject> recognise(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context) throws IOException, SAXException, TikaException {
    // Parse into a fresh Metadata object; the metadata argument is not used here
    Metadata md = new Metadata();
    parse(stream, handler, md, context);
    List<RecognisedObject> objects = new ArrayList<>();
    // Each metadata name is read as a recognised object, its value as the confidence
    for (String key : md.names()) {
        double confidence = Double.parseDouble(md.get(key));
        objects.add(new RecognisedObject(key, "eng", key, confidence));
    }
    return objects;
}