Search in sources:

Example 1 with Hfs

Use of cascading.tap.hadoop.Hfs in the apache/parquet-mr project.

From class TestParquetTBaseScheme, method testWrite:

@Test
public void testWrite() throws Exception {
    Path path = new Path(parquetOutputPath);
    JobConf jobConf = new JobConf();
    final FileSystem fs = path.getFileSystem(jobConf);
    // start each run with a clean output directory
    if (fs.exists(path))
        fs.delete(path, true);
    // text source and Parquet sink, each bound to a path through an Hfs tap
    Scheme sourceScheme = new TextLine(new Fields("first", "last"));
    Tap source = new Hfs(sourceScheme, txtInputPath);
    Scheme sinkScheme = new ParquetTBaseScheme(Name.class);
    Tap sink = new Hfs(sinkScheme, parquetOutputPath);
    // pack each incoming tuple into a Thrift record before it reaches the Parquet sink
    Pipe assembly = new Pipe("namecp");
    assembly = new Each(assembly, new PackThriftFunction());
    HadoopFlowConnector hadoopFlowConnector = new HadoopFlowConnector();
    Flow flow = hadoopFlowConnector.connect("namecp", source, sink, assembly);
    flow.complete();
    // the Parquet output directory should also contain the summary metadata files
    assertTrue(fs.exists(new Path(parquetOutputPath)));
    assertTrue(fs.exists(new Path(parquetOutputPath + "/_metadata")));
    assertTrue(fs.exists(new Path(parquetOutputPath + "/_common_metadata")));
}
Also used: Path (org.apache.hadoop.fs.Path), Each (cascading.pipe.Each), Tap (cascading.tap.Tap), Scheme (cascading.scheme.Scheme), TextLine (cascading.scheme.hadoop.TextLine), Pipe (cascading.pipe.Pipe), HadoopFlowConnector (cascading.flow.hadoop.HadoopFlowConnector), Flow (cascading.flow.Flow), Fields (cascading.tuple.Fields), Hfs (cascading.tap.hadoop.Hfs), FileSystem (org.apache.hadoop.fs.FileSystem), JobConf (org.apache.hadoop.mapred.JobConf), Test (org.junit.Test)
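
PackThriftFunction is referenced but not shown in this snippet. Below is a minimal, hypothetical sketch of what such a Cascading function might look like, assuming the Thrift-generated Name class exposes first_name/last_name setters; the real implementation lives in the parquet-mr test sources.

import cascading.flow.FlowProcess;
import cascading.operation.BaseOperation;
import cascading.operation.Function;
import cascading.operation.FunctionCall;
import cascading.tuple.Fields;
import cascading.tuple.Tuple;
import cascading.tuple.TupleEntry;

// Hypothetical sketch only, not the project's actual PackThriftFunction.
public class PackThriftFunction extends BaseOperation implements Function {

    public PackThriftFunction() {
        // declare a single output field holding the packed Thrift object
        super(new Fields("name"));
    }

    @Override
    public void operate(FlowProcess flowProcess, FunctionCall functionCall) {
        TupleEntry arguments = functionCall.getArguments();
        // assumption: Name is the Thrift-generated class used by the test
        Name name = new Name();
        name.setFirst_name(arguments.getString("first"));
        name.setLast_name(arguments.getString("last"));
        functionCall.getOutputCollector().add(new Tuple(name));
    }
}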

Example 2 with Hfs

Use of cascading.tap.hadoop.Hfs in the apache/parquet-mr project.

From class TestParquetTBaseScheme, method doRead:

private void doRead(Scheme sourceScheme) throws Exception {
    createFileForRead();
    Path path = new Path(txtOutputPath);
    final FileSystem fs = path.getFileSystem(new Configuration());
    // start each run with a clean text output directory
    if (fs.exists(path))
        fs.delete(path, true);
    // Parquet source read through the supplied scheme, plain-text sink
    Tap source = new Hfs(sourceScheme, parquetInputPath);
    Scheme sinkScheme = new TextLine(new Fields("first", "last"));
    Tap sink = new Hfs(sinkScheme, txtOutputPath);
    // unpack each Thrift record back into plain tuple fields before the text sink
    Pipe assembly = new Pipe("namecp");
    assembly = new Each(assembly, new UnpackThriftFunction());
    Flow flow = new HadoopFlowConnector().connect("namecp", source, sink, assembly);
    flow.complete();
    // the flow writes a single part file with tab-separated name fields
    String result = FileUtils.readFileToString(new File(txtOutputPath + "/part-00000"));
    assertEquals("Alice\tPractice\nBob\tHope\nCharlie\tHorse\n", result);
}
Also used: Path (org.apache.hadoop.fs.Path), Each (cascading.pipe.Each), Tap (cascading.tap.Tap), Scheme (cascading.scheme.Scheme), Configuration (org.apache.hadoop.conf.Configuration), TextLine (cascading.scheme.hadoop.TextLine), Pipe (cascading.pipe.Pipe), HadoopFlowConnector (cascading.flow.hadoop.HadoopFlowConnector), Flow (cascading.flow.Flow), Fields (cascading.tuple.Fields), Hfs (cascading.tap.hadoop.Hfs), FileSystem (org.apache.hadoop.fs.FileSystem), File (java.io.File)
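
doRead is a private helper, so its callers are not shown here. A hedged sketch of how a test in the same class might invoke it with the Thrift-backed scheme (the test method name below is an assumption, not taken from the project):

@Test
public void testRead() throws Exception {
    // read back with the full Thrift schema; a projection test could instead pass
    // a ParquetTBaseScheme configured with a projection string
    doRead(new ParquetTBaseScheme(Name.class));
}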

Example 3 with Hfs

Use of cascading.tap.hadoop.Hfs in the apache/parquet-mr project.

From class ParquetScroogeSchemeTest, method verifyScroogeRead:

public <T> void verifyScroogeRead(List<TBase> recordsToWrite, Class<T> readClass, String expectedStr, String projectionFilter) throws Exception {
    Configuration conf = new Configuration();
    deleteIfExist(PARQUET_PATH);
    deleteIfExist(TXT_OUTPUT_PATH);
    // write the Thrift records to a Parquet file first
    final Path parquetFile = new Path(PARQUET_PATH);
    writeParquetFile(recordsToWrite, conf, parquetFile);
    // read them back through the Scrooge scheme, applying the column projection
    Scheme sourceScheme = new ParquetScroogeScheme(new Config().withRecordClass(readClass).withProjectionString(projectionFilter));
    Tap source = new Hfs(sourceScheme, PARQUET_PATH);
    Scheme sinkScheme = new TextLine(new Fields("first", "last"));
    Tap sink = new Hfs(sinkScheme, TXT_OUTPUT_PATH);
    // render each record as a string so the text output can be compared directly
    Pipe assembly = new Pipe("namecp");
    assembly = new Each(assembly, new ObjectToStringFunction());
    Flow flow = new HadoopFlowConnector().connect("namecp", source, sink, assembly);
    flow.complete();
    String result = FileUtils.readFileToString(new File(TXT_OUTPUT_PATH + "/part-00000"));
    assertEquals(expectedStr, result);
}
Also used: Path (org.apache.hadoop.fs.Path), Each (cascading.pipe.Each), Tap (cascading.tap.Tap), Scheme (cascading.scheme.Scheme), Configuration (org.apache.hadoop.conf.Configuration), TextLine (cascading.scheme.hadoop.TextLine), Config (org.apache.parquet.cascading.ParquetValueScheme.Config), Pipe (cascading.pipe.Pipe), HadoopFlowConnector (cascading.flow.hadoop.HadoopFlowConnector), Flow (cascading.flow.Flow), Fields (cascading.tuple.Fields), Hfs (cascading.tap.hadoop.Hfs), File (java.io.File)
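
deleteIfExist is another helper that is not shown. Judging from the delete-before-write pattern in the other examples, it presumably looks roughly like the sketch below (a hypothetical reconstruction, not the project's code):

private void deleteIfExist(String pathStr) throws IOException {
    Path path = new Path(pathStr);
    FileSystem fs = path.getFileSystem(new Configuration());
    if (fs.exists(path)) {
        // recursive delete so a previous run's output directory is fully removed
        fs.delete(path, true);
    }
}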

Example 4 with Hfs

Use of cascading.tap.hadoop.Hfs in the Cascading/Impatient project.

From class Main, method main:

public static void main(String[] args) {
    String docPath = args[0];
    String wcPath = args[1];
    String stopPath = args[2];
    String tfidfPath = args[3];
    String trapPath = args[4];
    String checkPath = args[5];
    Properties properties = new Properties();
    AppProps.setApplicationJarClass(properties, Main.class);
    AppProps.setApplicationName(properties, "Impatient Part 6");
    AppProps.addApplicationTag(properties, "tutorial:impatient");
    AppProps.addApplicationTag(properties, "technology:Cascading");
    FlowConnector flowConnector = new Hadoop2MR1FlowConnector(properties);
    // create source and sink taps
    Tap docTap = new Hfs(new TextDelimited(true, "\t"), docPath);
    Tap wcTap = new Hfs(new TextDelimited(true, "\t"), wcPath);
    Fields stop = new Fields("stop");
    Tap stopTap = new Hfs(new TextDelimited(stop, true, "\t"), stopPath);
    Tap tfidfTap = new Hfs(new TextDelimited(true, "\t"), tfidfPath);
    Tap trapTap = new Hfs(new TextDelimited(true, "\t"), trapPath);
    Tap checkTap = new Hfs(new TextDelimited(true, "\t"), checkPath);
    // use a stream assertion to validate the input data
    Pipe docPipe = new Pipe("token");
    AssertMatches assertMatches = new AssertMatches("doc\\d+\\s.*");
    docPipe = new Each(docPipe, AssertionLevel.STRICT, assertMatches);
    // specify a regex operation to split the "document" text lines into a token stream
    Fields token = new Fields("token");
    Fields text = new Fields("text");
    RegexSplitGenerator splitter = new RegexSplitGenerator(token, "[ \\[\\]\\(\\),.]");
    Fields fieldSelector = new Fields("doc_id", "token");
    docPipe = new Each(docPipe, text, splitter, fieldSelector);
    // define "ScrubFunction" to clean up the token stream
    Fields scrubArguments = new Fields("doc_id", "token");
    docPipe = new Each(docPipe, scrubArguments, new ScrubFunction(scrubArguments), Fields.RESULTS);
    // perform a left join to remove stop words, discarding the rows
    // which joined with stop words, i.e., were non-null after left join
    Pipe stopPipe = new Pipe("stop");
    Pipe tokenPipe = new HashJoin(docPipe, token, stopPipe, stop, new LeftJoin());
    tokenPipe = new Each(tokenPipe, stop, new RegexFilter("^$"));
    tokenPipe = new Retain(tokenPipe, fieldSelector);
    // one branch of the flow tallies the token counts for term frequency (TF)
    Pipe tfPipe = new Pipe("TF", tokenPipe);
    Fields tf_count = new Fields("tf_count");
    tfPipe = new CountBy(tfPipe, new Fields("doc_id", "token"), tf_count);
    Fields tf_token = new Fields("tf_token");
    tfPipe = new Rename(tfPipe, token, tf_token);
    // one branch counts the number of documents (D)
    Fields doc_id = new Fields("doc_id");
    Fields tally = new Fields("tally");
    Fields rhs_join = new Fields("rhs_join");
    Fields n_docs = new Fields("n_docs");
    Pipe dPipe = new Unique("D", tokenPipe, doc_id);
    dPipe = new Each(dPipe, new Insert(tally, 1), Fields.ALL);
    dPipe = new Each(dPipe, new Insert(rhs_join, 1), Fields.ALL);
    dPipe = new SumBy(dPipe, rhs_join, tally, n_docs, long.class);
    // one branch tallies the token counts for document frequency (DF)
    Pipe dfPipe = new Unique("DF", tokenPipe, Fields.ALL);
    Fields df_count = new Fields("df_count");
    dfPipe = new CountBy(dfPipe, token, df_count);
    Fields df_token = new Fields("df_token");
    Fields lhs_join = new Fields("lhs_join");
    dfPipe = new Rename(dfPipe, token, df_token);
    dfPipe = new Each(dfPipe, new Insert(lhs_join, 1), Fields.ALL);
    // example use of a debug, to observe tuple stream; turn off below
    dfPipe = new Each(dfPipe, DebugLevel.VERBOSE, new Debug(true));
    // join to bring together all the components for calculating TF-IDF
    // the D side of the join is smaller, so it goes on the RHS
    Pipe idfPipe = new HashJoin(dfPipe, lhs_join, dPipe, rhs_join);
    // create a checkpoint, to observe the intermediate data in DF stream
    Checkpoint idfCheck = new Checkpoint("checkpoint", idfPipe);
    // the IDF side of the join is smaller, so it goes on the RHS
    Pipe tfidfPipe = new CoGroup(tfPipe, tf_token, idfCheck, df_token);
    // calculate the TF-IDF weights, per token, per document
    Fields tfidf = new Fields("tfidf");
    String expression = "(double) tf_count * Math.log( (double) n_docs / ( 1.0 + df_count ) )";
    ExpressionFunction tfidfExpression = new ExpressionFunction(tfidf, expression, Double.class);
    Fields tfidfArguments = new Fields("tf_count", "df_count", "n_docs");
    tfidfPipe = new Each(tfidfPipe, tfidfArguments, tfidfExpression, Fields.ALL);
    fieldSelector = new Fields("tf_token", "doc_id", "tfidf");
    tfidfPipe = new Retain(tfidfPipe, fieldSelector);
    tfidfPipe = new Rename(tfidfPipe, tf_token, token);
    // keep track of the word counts, which are useful for QA
    Pipe wcPipe = new Pipe("wc", tfPipe);
    Fields count = new Fields("count");
    wcPipe = new SumBy(wcPipe, tf_token, tf_count, count, long.class);
    wcPipe = new Rename(wcPipe, tf_token, token);
    // additionally, sort by count
    wcPipe = new GroupBy(wcPipe, count, count);
    // connect the taps, pipes, traps, checkpoints, etc., into a flow
    FlowDef flowDef = FlowDef.flowDef().setName("tfidf").addSource(docPipe, docTap).addSource(stopPipe, stopTap).addTailSink(tfidfPipe, tfidfTap).addTailSink(wcPipe, wcTap).addTrap(docPipe, trapTap).addCheckpoint(idfCheck, checkTap);
    // set to DebugLevel.VERBOSE for trace, or DebugLevel.NONE in production
    flowDef.setDebugLevel(DebugLevel.VERBOSE);
    // set to AssertionLevel.STRICT for all assertions, or AssertionLevel.NONE in production
    flowDef.setAssertionLevel(AssertionLevel.STRICT);
    // write a DOT file and run the flow
    Flow tfidfFlow = flowConnector.connect(flowDef);
    tfidfFlow.writeDOT("dot/tfidf.dot");
    tfidfFlow.complete();
}
Also used: Each (cascading.pipe.Each), Tap (cascading.tap.Tap), Hadoop2MR1FlowConnector (cascading.flow.hadoop2.Hadoop2MR1FlowConnector), FlowDef (cascading.flow.FlowDef), Properties (java.util.Properties), RegexFilter (cascading.operation.regex.RegexFilter), Insert (cascading.operation.Insert), AssertMatches (cascading.operation.assertion.AssertMatches), RegexSplitGenerator (cascading.operation.regex.RegexSplitGenerator), HashJoin (cascading.pipe.HashJoin), Hfs (cascading.tap.hadoop.Hfs), Retain (cascading.pipe.assembly.Retain), CoGroup (cascading.pipe.CoGroup), Debug (cascading.operation.Debug), FlowConnector (cascading.flow.FlowConnector), GroupBy (cascading.pipe.GroupBy), CountBy (cascading.pipe.assembly.CountBy), ExpressionFunction (cascading.operation.expression.ExpressionFunction), SumBy (cascading.pipe.assembly.SumBy), Pipe (cascading.pipe.Pipe), Rename (cascading.pipe.assembly.Rename), Flow (cascading.flow.Flow), Checkpoint (cascading.pipe.Checkpoint), Fields (cascading.tuple.Fields), LeftJoin (cascading.pipe.joiner.LeftJoin), Unique (cascading.pipe.assembly.Unique), TextDelimited (cascading.scheme.hadoop.TextDelimited)
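
The ExpressionFunction in this flow computes a conventional TF-IDF weight: the per-document term count scaled by the log of the inverse document frequency. A small standalone sketch of the same arithmetic with illustrative numbers (not taken from the tutorial data):

// tf_count = 3 occurrences of the token in one document,
// df_count = 2 documents containing the token, n_docs = 5 documents in total
double tfCount = 3.0, dfCount = 2.0, nDocs = 5.0;
double tfidf = tfCount * Math.log(nDocs / (1.0 + dfCount));
System.out.println(tfidf); // 3 * ln(5 / 3) ≈ 1.53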

Example 5 with Hfs

Use of cascading.tap.hadoop.Hfs in the apache/parquet-mr project.

From class ParquetScroogeSchemeTest, method doWrite:

private void doWrite() throws Exception {
    Path path = new Path(parquetOutputPath);
    final FileSystem fs = path.getFileSystem(new Configuration());
    // start each run with a clean output directory
    if (fs.exists(path))
        fs.delete(path, true);
    // text source, Parquet sink backed by the Scrooge-generated Name class
    Scheme sourceScheme = new TextLine(new Fields("first", "last"));
    Tap source = new Hfs(sourceScheme, txtInputPath);
    Scheme sinkScheme = new ParquetScroogeScheme<Name>(Name.class);
    Tap sink = new Hfs(sinkScheme, parquetOutputPath);
    // pack each incoming tuple into a Thrift record before it reaches the sink
    Pipe assembly = new Pipe("namecp");
    assembly = new Each(assembly, new PackThriftFunction());
    Flow flow = new HadoopFlowConnector().connect("namecp", source, sink, assembly);
    flow.complete();
}
Also used: Path (org.apache.hadoop.fs.Path), Each (cascading.pipe.Each), Tap (cascading.tap.Tap), Scheme (cascading.scheme.Scheme), Configuration (org.apache.hadoop.conf.Configuration), TextLine (cascading.scheme.hadoop.TextLine), Pipe (cascading.pipe.Pipe), HadoopFlowConnector (cascading.flow.hadoop.HadoopFlowConnector), Flow (cascading.flow.Flow), Fields (cascading.tuple.Fields), Hfs (cascading.tap.hadoop.Hfs), FileSystem (org.apache.hadoop.fs.FileSystem)
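
Here the sink scheme is built by passing the Scrooge-generated class directly. Example 3 above uses the Config-based constructor of the same scheme, so the sink could presumably also be declared as below (a sketch combining the two snippets, not verified against the project):

Scheme sinkScheme = new ParquetScroogeScheme(new Config().withRecordClass(Name.class));
Tap sink = new Hfs(sinkScheme, parquetOutputPath);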

Aggregations

Hfs (cascading.tap.hadoop.Hfs): 11
Flow (cascading.flow.Flow): 9
Each (cascading.pipe.Each): 9
Pipe (cascading.pipe.Pipe): 9
Tap (cascading.tap.Tap): 9
Fields (cascading.tuple.Fields): 9
HadoopFlowConnector (cascading.flow.hadoop.HadoopFlowConnector): 8
Scheme (cascading.scheme.Scheme): 7
TextLine (cascading.scheme.hadoop.TextLine): 7
Path (org.apache.hadoop.fs.Path): 7
Configuration (org.apache.hadoop.conf.Configuration): 6
FileSystem (org.apache.hadoop.fs.FileSystem): 6
File (java.io.File): 5
FlowDef (cascading.flow.FlowDef): 2
Insert (cascading.operation.Insert): 2
ExpressionFunction (cascading.operation.expression.ExpressionFunction): 2
RegexFilter (cascading.operation.regex.RegexFilter): 2
RegexSplitGenerator (cascading.operation.regex.RegexSplitGenerator): 2
CoGroup (cascading.pipe.CoGroup): 2
GroupBy (cascading.pipe.GroupBy): 2