Example 6 with SimpleFileIOOutputProperties

use of org.talend.components.simplefileio.output.SimpleFileIOOutputProperties in project components by Talend.

the class SparkSimpleFileIOOutputRuntimeTestIT method testCsv_merge.

@Test
public void testCsv_merge() throws IOException {
    FileSystem fs = FileSystem.get(spark.createHadoopConfiguration());
    String fileSpec = fs.getUri().resolve(new Path(tmp.getRoot().toString(), "output.csv").toUri()).toString();
    // Configure the component.
    SimpleFileIOOutputProperties props = SimpleFileIOOutputRuntimeTest.createOutputComponentProperties();
    props.getDatasetProperties().path.setValue(fileSpec);
    props.getDatasetProperties().format.setValue(SimpleFileIOFormat.CSV);
    props.mergeOutput.setValue(true);
    // Create the runtime.
    SimpleFileIOOutputRuntime runtime = new SimpleFileIOOutputRuntime();
    runtime.initialize(null, props);
    // Use the runtime in a Spark pipeline to test.
    final Pipeline p = spark.createPipeline();
    PCollection<IndexedRecord> input = p.apply(Create.of(
            ConvertToIndexedRecord.convertToAvro(new String[] { "1", "one" }),
            ConvertToIndexedRecord.convertToAvro(new String[] { "2", "two" })));
    input.apply(runtime);
    // And run the test.
    p.run().waitUntilFinish();
    // Check the expected values.
    MiniDfsResource.assertReadFile(fs, fileSpec, "1;one", "2;two");
    MiniDfsResource.assertFileNumber(fs, fileSpec, 1);
}
Also used : Path(org.apache.hadoop.fs.Path) SimpleFileIOOutputProperties(org.talend.components.simplefileio.output.SimpleFileIOOutputProperties) ConvertToIndexedRecord(org.talend.components.adapter.beam.transform.ConvertToIndexedRecord) IndexedRecord(org.apache.avro.generic.IndexedRecord) FileSystem(org.apache.hadoop.fs.FileSystem) Pipeline(org.apache.beam.sdk.Pipeline) Test(org.junit.Test)
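
The two closing assertions in testCsv_merge check the merged content and the number of files written. The behavior of the MiniDfsResource helpers is not shown here, so the following is only a minimal local-filesystem sketch of that kind of check, using java.nio.file instead of Hadoop. The class MergedOutputCheck and its methods are hypothetical names, and skipping files that start with "_" or "." is an assumption borrowed from the common Hadoop convention for marker files.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.stream.Stream;

public class MergedOutputCheck {

    /** Lists regular files directly under dir, skipping marker files (assumed "_" or "." prefix). */
    public static List<Path> dataFiles(Path dir) throws IOException {
        List<Path> files = new ArrayList<>();
        try (Stream<Path> children = Files.list(dir)) {
            children.filter(Files::isRegularFile)
                    .filter(p -> !p.getFileName().toString().startsWith("_"))
                    .filter(p -> !p.getFileName().toString().startsWith("."))
                    .forEach(files::add);
        }
        Collections.sort(files);
        return files;
    }

    /** Reads all lines from all data files and sorts them, so record order does not matter. */
    public static List<String> sortedLines(Path dir) throws IOException {
        List<String> lines = new ArrayList<>();
        for (Path file : dataFiles(dir)) {
            lines.addAll(Files.readAllLines(file, StandardCharsets.UTF_8));
        }
        Collections.sort(lines);
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // Simulate a merged output: one data file plus an empty marker file.
        Path dir = Files.createTempDirectory("merged-output");
        Files.write(dir.resolve("part-00000"), Arrays.asList("2;two", "1;one"), StandardCharsets.UTF_8);
        Files.write(dir.resolve("_SUCCESS"), new byte[0]);

        System.out.println("data files: " + dataFiles(dir).size());
        System.out.println("lines: " + sortedLines(dir));
    }
}
```

Sorting the lines before comparing keeps the check independent of the order in which a runner happens to write records, which is usually not guaranteed.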

Example 7 with SimpleFileIOOutputProperties

use of org.talend.components.simplefileio.output.SimpleFileIOOutputProperties in project components by Talend.

the class SimpleFileIOInputRuntimeTest method testInputParquetByteBufferSerialization.

/**
 * Test that reads a Parquet input and dumps it to CSV. This is a special case to check support for ByteBuffer
 * encoding/decoding. This test is currently not working due to a log statement in the Beam class
 * ExecutorServiceParallelExecutor, which moves the offset of any ByteBuffer.
 */
@Test
public void testInputParquetByteBufferSerialization() throws IOException, URISyntaxException {
    // Copy the test resource onto the mini DFS; both streams are closed by try-with-resources.
    try (InputStream in = getClass().getResourceAsStream("two_lines.snappy.parquet");
            OutputStream inOnMinFS = mini.getFs().create(new Path("/user/test/two_lines.snappy.parquet"))) {
        inOnMinFS.write(IOUtils.toByteArray(in));
    }
    String fileSpec = mini.getFs().getUri().resolve("/user/test/two_lines.snappy.parquet").toString();
    String fileSpecOutput = mini.getLocalFs().getUri().resolve(new Path(mini.newFolder().toString(), "output.csv").toUri()).toString();
    // Configure the component.
    SimpleFileIOInputProperties inputProps = createInputComponentProperties();
    inputProps.getDatasetProperties().format.setValue(SimpleFileIOFormat.PARQUET);
    inputProps.getDatasetProperties().path.setValue(fileSpec);
    // Create the runtime.
    SimpleFileIOInputRuntime runtime = new SimpleFileIOInputRuntime();
    runtime.initialize(null, inputProps);
    SimpleFileIOOutputProperties outputProps = new SimpleFileIOOutputProperties(null);
    outputProps.init();
    outputProps.setDatasetProperties(SimpleFileIODatasetRuntimeTest.createDatasetProperties());
    outputProps.getDatasetProperties().path.setValue(fileSpecOutput);
    outputProps.getDatasetProperties().format.setValue(SimpleFileIOFormat.CSV);
    SimpleFileIOOutputRuntime runtimeO = new SimpleFileIOOutputRuntime();
    runtimeO.initialize(null, outputProps);
    // Use the runtime in a direct pipeline to test.
    final Pipeline p = beam.createPipeline(1);
    p.apply(runtime).apply(runtimeO);
    p.run().waitUntilFinish();
    mini.assertReadFile(mini.getLocalFs(), fileSpecOutput, "1;rdubois", "2;clombard");
}
Also used : Path(org.apache.hadoop.fs.Path) SimpleFileIOOutputProperties(org.talend.components.simplefileio.output.SimpleFileIOOutputProperties) InputStream(java.io.InputStream) OutputStream(java.io.OutputStream) SimpleFileIOInputProperties(org.talend.components.simplefileio.input.SimpleFileIOInputProperties) Pipeline(org.apache.beam.sdk.Pipeline) Test(org.junit.Test)
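
The Javadoc above attributes the failure to a log statement that moves the offset of a ByteBuffer. The Beam code in question is not shown, so the following is only a sketch of the general mechanism with plain java.nio: reading a buffer with relative get() calls advances its position, while reading through duplicate() leaves the original untouched. The class and method names are hypothetical.

```java
import java.nio.ByteBuffer;

public class ByteBufferPositionDemo {

    /** Hex-dumps the remaining bytes using relative gets; this advances the caller's position. */
    public static String hexConsuming(ByteBuffer buf) {
        StringBuilder sb = new StringBuilder();
        while (buf.hasRemaining()) {
            sb.append(String.format("%02x", buf.get()));
        }
        return sb.toString();
    }

    /** Hex-dumps through a duplicate, which shares the bytes but has its own position. */
    public static String hexPreserving(ByteBuffer buf) {
        return hexConsuming(buf.duplicate());
    }

    public static void main(String[] args) {
        ByteBuffer key = ByteBuffer.wrap(new byte[] { 0x00, 0x01, 0x02 });

        // Safe: the original buffer still has all three bytes afterwards.
        System.out.println(hexPreserving(key) + " remaining=" + key.remaining()); // 000102 remaining=3

        // Destructive: a later reader of the same buffer would see no bytes left.
        System.out.println(hexConsuming(key) + " remaining=" + key.remaining()); // 000102 remaining=0
    }
}
```

Any code that inspects a shared ByteBuffer only for diagnostics can avoid this side effect by working on duplicate() or asReadOnlyBuffer().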

Example 8 with SimpleFileIOOutputProperties

use of org.talend.components.simplefileio.output.SimpleFileIOOutputProperties in project components by Talend.

the class SimpleFileIOOutputRuntimeTest method testBasicAvroBytes.

/**
 * Basic unit test writing records with a BYTES field to Avro.
 */
@Test
public void testBasicAvroBytes() throws IOException, URISyntaxException {
    String fileSpec = mini.getLocalFs().getUri().resolve(new Path(mini.newFolder().toString(), "output.avro").toUri()).toString();
    // Configure the component.
    SimpleFileIOOutputProperties props = createOutputComponentProperties();
    props.getDatasetProperties().path.setValue(fileSpec);
    props.getDatasetProperties().format.setValue(SimpleFileIOFormat.AVRO);
    // Create the runtime.
    SimpleFileIOOutputRuntime runtime = new SimpleFileIOOutputRuntime();
    runtime.initialize(null, props);
    Schema s = SchemaBuilder.record("test").fields()
            .name("key").type(Schema.create(Schema.Type.BYTES)).noDefault()
            .name("value").type(Schema.create(Schema.Type.STRING)).noDefault()
            .endRecord();
    IndexedRecord ir1 = new GenericData.Record(s);
    IndexedRecord ir2 = new GenericData.Record(s);
    ir1.put(0, ByteBuffer.wrap(new byte[] { 0x00, 0x01, 0x02 }));
    ir1.put(1, "012");
    ir2.put(0, ByteBuffer.wrap(new byte[] { 0x01, 0x02, 0x03 }));
    ir2.put(1, "123");
    // Use the runtime in a direct pipeline to test.
    final Pipeline p = beam.createPipeline();
    PCollection<IndexedRecord> input = p.apply(Create.of(ir1, ir2));
    input.apply(runtime);
    // And run the test.
    p.run().waitUntilFinish();
    // Check the expected values.
    // TODO(rskraba): Implement a comparison for the file on disk.
    // mini.assertReadFile(mini.getLocalFs(), fileSpec, "1;one", "2;two");
}
Also used : Path(org.apache.hadoop.fs.Path) SimpleFileIOOutputProperties(org.talend.components.simplefileio.output.SimpleFileIOOutputProperties) ConvertToIndexedRecord(org.talend.components.adapter.beam.transform.ConvertToIndexedRecord) IndexedRecord(org.apache.avro.generic.IndexedRecord) Schema(org.apache.avro.Schema) Pipeline(org.apache.beam.sdk.Pipeline) Test(org.junit.Test)

Example 9 with SimpleFileIOOutputProperties

use of org.talend.components.simplefileio.output.SimpleFileIOOutputProperties in project components by Talend.

the class SimpleFileIOOutputRuntimeTest method testBasicCsvFormat.

/**
 * Basic unit test using all default values (except for the path) on an in-memory DFS cluster.
 */
@Test
public void testBasicCsvFormat() throws IOException, URISyntaxException {
    // Fetch the expected results and input dataset.
    List<IndexedRecord> inputs = new ArrayList<>();
    List<String> expected = new ArrayList<>();
    for (CsvExample csvEx : CsvExample.getCsvExamples()) {
        // Ignore lines that don't have the same schema (3 columns)
        if (csvEx.getValues().length == 3) {
            expected.add(csvEx.getExpectedOutputLine());
            inputs.add(ConvertToIndexedRecord.convertToAvro(csvEx.getValues()));
        }
    }
    String fileSpec = mini.getLocalFs().getUri().resolve(new Path(mini.newFolder().toString(), "output.csv").toUri()).toString();
    // Configure the component.
    SimpleFileIOOutputProperties props = createOutputComponentProperties();
    props.getDatasetProperties().path.setValue(fileSpec);
    // Create the runtime.
    SimpleFileIOOutputRuntime runtime = new SimpleFileIOOutputRuntime();
    runtime.initialize(null, props);
    // Use the runtime in a direct pipeline to test.
    final Pipeline p = beam.createPipeline();
    PCollection<IndexedRecord> input = p.apply(Create.of(inputs));
    input.apply(runtime);
    // And run the test.
    p.run().waitUntilFinish();
    // Check the expected values.
    mini.assertReadFile(mini.getLocalFs(), fileSpec, expected.toArray(new String[0]));
}
Also used : Path(org.apache.hadoop.fs.Path) SimpleFileIOOutputProperties(org.talend.components.simplefileio.output.SimpleFileIOOutputProperties) ConvertToIndexedRecord(org.talend.components.adapter.beam.transform.ConvertToIndexedRecord) IndexedRecord(org.apache.avro.generic.IndexedRecord) ArrayList(java.util.ArrayList) Pipeline(org.apache.beam.sdk.Pipeline) Test(org.junit.Test)

Example 10 with SimpleFileIOOutputProperties

use of org.talend.components.simplefileio.output.SimpleFileIOOutputProperties in project components by Talend.

the class SimpleFileIOOutputRuntimeTest method testBasicAvro.

/**
 * Basic unit test writing to Avro.
 */
@Test
public void testBasicAvro() throws IOException, URISyntaxException {
    String fileSpec = mini.getLocalFs().getUri().resolve(new Path(mini.newFolder().toString(), "output.avro").toUri()).toString();
    // Configure the component.
    SimpleFileIOOutputProperties props = createOutputComponentProperties();
    props.getDatasetProperties().path.setValue(fileSpec);
    props.getDatasetProperties().format.setValue(SimpleFileIOFormat.AVRO);
    // Create the runtime.
    SimpleFileIOOutputRuntime runtime = new SimpleFileIOOutputRuntime();
    runtime.initialize(null, props);
    // Use the runtime in a direct pipeline to test.
    final Pipeline p = beam.createPipeline();
    PCollection<IndexedRecord> input = p.apply(Create.of(
            ConvertToIndexedRecord.convertToAvro(new String[] { "1", "one" }),
            ConvertToIndexedRecord.convertToAvro(new String[] { "2", "two" })));
    input.apply(runtime);
    // And run the test.
    p.run().waitUntilFinish();
    // Check the expected values.
    // TODO(rskraba): Implement a comparison for the file on disk.
    // mini.assertReadFile(mini.getLocalFs(), fileSpec, "1;one", "2;two");
}
Also used : Path(org.apache.hadoop.fs.Path) SimpleFileIOOutputProperties(org.talend.components.simplefileio.output.SimpleFileIOOutputProperties) ConvertToIndexedRecord(org.talend.components.adapter.beam.transform.ConvertToIndexedRecord) IndexedRecord(org.apache.avro.generic.IndexedRecord) Pipeline(org.apache.beam.sdk.Pipeline) Test(org.junit.Test)

Aggregations

SimpleFileIOOutputProperties (org.talend.components.simplefileio.output.SimpleFileIOOutputProperties) 20
Test (org.junit.Test) 18
IndexedRecord (org.apache.avro.generic.IndexedRecord) 17
ConvertToIndexedRecord (org.talend.components.adapter.beam.transform.ConvertToIndexedRecord) 17
Pipeline (org.apache.beam.sdk.Pipeline) 13
Path (org.apache.hadoop.fs.Path) 13
SimpleFileIOInputProperties (org.talend.components.simplefileio.input.SimpleFileIOInputProperties) 6
OutputStream (java.io.OutputStream) 4
FileSystem (org.apache.hadoop.fs.FileSystem) 4
RecordSet (org.talend.components.test.RecordSet) 4
TalendRuntimeException (org.talend.daikon.exception.TalendRuntimeException) 4
ArrayList (java.util.ArrayList) 2
InputStream (java.io.InputStream) 1
Schema (org.apache.avro.Schema) 1
Ignore (org.junit.Ignore) 1
Category (org.junit.experimental.categories.Category) 1
SimpleFileIODatasetRuntimeTest (org.talend.components.simplefileio.runtime.SimpleFileIODatasetRuntimeTest) 1
SimpleFileIOInputRuntime (org.talend.components.simplefileio.runtime.SimpleFileIOInputRuntime) 1
SimpleFileIOOutputRuntime (org.talend.components.simplefileio.runtime.SimpleFileIOOutputRuntime) 1