Search in sources :

Example 56 with BatchEnvironment

use of edu.iu.dsc.tws.tset.env.BatchEnvironment in project twister2 by DSC-SPIDAL.

the class FileBasedWordCount method execute.

@Override
public void execute(WorkerEnvironment workerEnv) {
    BatchEnvironment env = TSetEnvironment.initBatch(workerEnv);
    int sourcePar = (int) env.getConfig().get("PAR");
    // read the file line by line by using a single worker
    SourceTSet<String> lines = env.createSource(new WordCountFileSource(), 1);
    // distribute the lines among the workers and perform a flatmap operation to extract words
    ComputeTSet<String> words = lines.partition(new HashingPartitioner<>(), sourcePar)
        .flatmap((FlatMapFunc<String, String>) (l, collector) -> {
        StringTokenizer itr = new StringTokenizer(l);
        while (itr.hasMoreTokens()) {
            collector.collect(itr.nextToken());
        }
    });
    // attach count as 1 for each word
    KeyedTSet<String, Integer> groupedWords = words.mapToTuple(w -> new Tuple<>(w, 1));
    // performs reduce by key at each worker
    KeyedReduceTLink<String, Integer> keyedReduce = groupedWords.keyedReduce(Integer::sum);
    // gather the results to worker0 (there is a dummy map op here to pass the values to edges)
    // and write to a file
    keyedReduce.map(i -> i).gather().forEach(new WordcountFileWriter());
}
Also used : Twister2Job(edu.iu.dsc.tws.api.Twister2Job) URL(java.net.URL) ResourceAllocator(edu.iu.dsc.tws.rsched.core.ResourceAllocator) Options(org.apache.commons.cli.Options) LocalTextInputPartitioner(edu.iu.dsc.tws.data.api.formatters.LocalTextInputPartitioner) BatchEnvironment(edu.iu.dsc.tws.tset.env.BatchEnvironment) FlatMapFunc(edu.iu.dsc.tws.api.tset.fn.FlatMapFunc) KeyedTSet(edu.iu.dsc.tws.tset.sets.batch.KeyedTSet) JobConfig(edu.iu.dsc.tws.api.JobConfig) StandardCopyOption(java.nio.file.StandardCopyOption) Level(java.util.logging.Level) DefaultParser(org.apache.commons.cli.DefaultParser) FileInputSplit(edu.iu.dsc.tws.data.api.splits.FileInputSplit) HashingPartitioner(edu.iu.dsc.tws.tset.fn.HashingPartitioner) InputSplit(edu.iu.dsc.tws.data.fs.io.InputSplit) StringTokenizer(java.util.StringTokenizer) Map(java.util.Map) CommandLine(org.apache.commons.cli.CommandLine) DataSource(edu.iu.dsc.tws.dataset.DataSource) TSetContext(edu.iu.dsc.tws.api.tset.TSetContext) BaseApplyFunc(edu.iu.dsc.tws.api.tset.fn.BaseApplyFunc) Tuple(edu.iu.dsc.tws.api.comms.structs.Tuple) ComputeTSet(edu.iu.dsc.tws.tset.sets.batch.ComputeTSet) SourceTSet(edu.iu.dsc.tws.tset.sets.batch.SourceTSet) Files(java.nio.file.Files) CommandLineParser(org.apache.commons.cli.CommandLineParser) BufferedWriter(java.io.BufferedWriter) BaseSourceFunc(edu.iu.dsc.tws.api.tset.fn.BaseSourceFunc) FileWriter(java.io.FileWriter) IOException(java.io.IOException) Logger(java.util.logging.Logger) KeyedReduceTLink(edu.iu.dsc.tws.tset.links.batch.KeyedReduceTLink) File(java.io.File) Serializable(java.io.Serializable) Twister2Submitter(edu.iu.dsc.tws.rsched.job.Twister2Submitter) WorkerEnvironment(edu.iu.dsc.tws.api.resource.WorkerEnvironment) TSetEnvironment(edu.iu.dsc.tws.tset.env.TSetEnvironment) TreeMap(java.util.TreeMap) Paths(java.nio.file.Paths) Path(edu.iu.dsc.tws.api.data.Path) BufferedReader(java.io.BufferedReader) FileReader(java.io.FileReader) Twister2Worker(edu.iu.dsc.tws.api.resource.Twister2Worker) InputStream(java.io.InputStream)
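
The imports above also show the pieces used to launch such a worker. A minimal submission sketch, assuming twister2's standard Twister2Job builder API; the job name, resource sizes, and the "PAR" value below are illustrative assumptions, not taken from the example:

public static void main(String[] args) {
    // hypothetical driver; FileBasedWordCount is assumed to implement Twister2Worker
    JobConfig jobConfig = new JobConfig();
    // the worker reads this back via env.getConfig().get("PAR")
    jobConfig.put("PAR", 4);
    Twister2Job job = Twister2Job.newBuilder()
        .setJobName("filebased-wordcount")
        .setWorkerClass(FileBasedWordCount.class.getName())
        // 1 CPU and 512 MB per worker, 4 workers (illustrative numbers)
        .addComputeResource(1, 512, 4)
        .setConfig(jobConfig)
        .build();
    // submit with the default resource-allocator configuration
    Twister2Submitter.submitJob(job, ResourceAllocator.getDefaultConfig());
}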

Example 57 with BatchEnvironment

use of edu.iu.dsc.tws.tset.env.BatchEnvironment in project twister2 by DSC-SPIDAL.

the class WordCount method execute.

@Override
public void execute(WorkerEnvironment workerEnv) {
    BatchEnvironment env = TSetEnvironment.initBatch(workerEnv);
    int sourcePar = 4;
    Config config = env.getConfig();
    // create a source with fixed number of random words
    SourceTSet<String> source = env.createSource(
        new WordGenerator((int) config.get("NO_OF_SAMPLE_WORDS"), (int) config.get("MAX_CHARS")),
        sourcePar).setName("source");
    // map each word to a tuple <word, 1>, where 1 is the count
    KeyedTSet<String, Integer> groupedWords = source.mapToTuple(w -> new Tuple<>(w, 1));
    // reduce using the sum operation
    KeyedReduceTLink<String, Integer> keyedReduce = groupedWords.keyedReduce(Integer::sum);
    // print the counts
    keyedReduce.forEach(c -> LOG.info(c.toString()));
}
Also used : BatchEnvironment(edu.iu.dsc.tws.tset.env.BatchEnvironment) Config(edu.iu.dsc.tws.api.config.Config) JobConfig(edu.iu.dsc.tws.api.JobConfig) RandomString(edu.iu.dsc.tws.examples.utils.RandomString)
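
WordGenerator itself is not shown above. A minimal sketch of what such a bounded source might look like, assuming the hasNext()/next() contract of twister2's source functions; the field names and the random-word logic are illustrative, and the real example presumably uses the RandomString utility from its imports instead:

// hypothetical sketch of the WordGenerator source (also needs java.util.Random)
static class WordGenerator extends BaseSourceFunc<String> {
    private final int count;     // NO_OF_SAMPLE_WORDS
    private final int maxChars;  // MAX_CHARS
    private final Random random = new Random();
    private int emitted = 0;

    WordGenerator(int count, int maxChars) {
        this.count = count;
        this.maxChars = maxChars;
    }

    @Override
    public boolean hasNext() {
        // a bounded source: stop after emitting count words
        return emitted < count;
    }

    @Override
    public String next() {
        emitted++;
        // build a random lower-case word of 1..maxChars characters
        int len = 1 + random.nextInt(maxChars);
        StringBuilder sb = new StringBuilder(len);
        for (int i = 0; i < len; i++) {
            sb.append((char) ('a' + random.nextInt(26)));
        }
        return sb.toString();
    }
}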

Example 58 with BatchEnvironment

use of edu.iu.dsc.tws.tset.env.BatchEnvironment in project twister2 by DSC-SPIDAL.

the class CSVTSetSourceExample method execute.

@Override
public void execute(WorkerEnvironment workerEnv) {
    BatchEnvironment env = TSetEnvironment.initBatch(workerEnv);
    int dsize = 100;
    int parallelism = 2;
    int dimension = 2;
    SourceTSet<String[]> pointSource = env.createCSVSource("/tmp/dinput", dsize, parallelism, "split");
    ComputeTSet<double[][]> points = pointSource.direct()
        .compute(new ComputeFunc<Iterator<String[]>, double[][]>() {

        private double[][] localPoints = new double[dsize / parallelism][dimension];

        @Override
        public double[][] compute(Iterator<String[]> input) {
            for (int i = 0; i < dsize / parallelism && input.hasNext(); i++) {
                String[] value = input.next();
                for (int j = 0; j < value.length; j++) {
                    localPoints[i][j] = Double.parseDouble(value[j]);
                }
            }
            LOG.info("Double Array Values:" + Arrays.deepToString(localPoints));
            return localPoints;
        }
    });
}
Also used : BatchEnvironment(edu.iu.dsc.tws.tset.env.BatchEnvironment) Iterator(java.util.Iterator)
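
This example expects a CSV data set to already exist under /tmp/dinput. Judging from the sizing of the local arrays, the "split" mode partitions the rows across the parallel source instances (each instance above consumes dsize / parallelism rows), while the "complete" mode in the next example gives every instance the full set. A hypothetical helper for generating a matching input file with plain JDK I/O; the helper name and the data.csv file name are assumptions:

// hypothetical generator for the /tmp/dinput CSV data
// (uses java.nio.file.Files/Paths, java.io.BufferedWriter, java.util.Random)
static void writeRandomPoints(String dir, int dsize, int dimension) throws IOException {
    Files.createDirectories(Paths.get(dir));
    Random random = new Random();
    try (BufferedWriter writer = Files.newBufferedWriter(Paths.get(dir, "data.csv"))) {
        for (int i = 0; i < dsize; i++) {
            StringBuilder row = new StringBuilder();
            for (int j = 0; j < dimension; j++) {
                if (j > 0) {
                    row.append(',');
                }
                row.append(random.nextDouble());
            }
            writer.write(row.toString());
            writer.newLine();
        }
    }
}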

Example 59 with BatchEnvironment

use of edu.iu.dsc.tws.tset.env.BatchEnvironment in project twister2 by DSC-SPIDAL.

the class KMeansTsetJob method execute.

@Override
public void execute(WorkerEnvironment workerEnv) {
    BatchEnvironment env = TSetEnvironment.initBatch(workerEnv);
    int workerId = env.getWorkerID();
    LOG.info("TSet worker starting: " + workerId);
    Config config = env.getConfig();
    int parallelism = config.getIntegerValue(DataObjectConstants.PARALLELISM_VALUE);
    int dimension = config.getIntegerValue(DataObjectConstants.DIMENSIONS);
    int numFiles = config.getIntegerValue(DataObjectConstants.NUMBER_OF_FILES);
    int dsize = config.getIntegerValue(DataObjectConstants.DSIZE);
    int csize = config.getIntegerValue(DataObjectConstants.CSIZE);
    int iterations = config.getIntegerValue(DataObjectConstants.ARGS_ITERATIONS);
    String dataDirectory = config.getStringValue(DataObjectConstants.DINPUT_DIRECTORY) + workerId;
    String centroidDirectory = config.getStringValue(DataObjectConstants.CINPUT_DIRECTORY) + workerId;
    String type = config.getStringValue(DataObjectConstants.FILE_TYPE);
    KMeansUtils.generateDataPoints(env.getConfig(), dimension, numFiles, dsize, csize, dataDirectory, centroidDirectory, type);
    long startTime = System.currentTimeMillis();
    /*CachedTSet<double[][]> points =
        tc.createSource(new PointsSource(type), parallelismValue).setName("dataSource").cache();*/
    SourceTSet<String[]> pointSource = env.createCSVSource(dataDirectory, dsize, parallelism, "split");
    ComputeTSet<double[][]> points = pointSource.direct()
        .compute(new ComputeFunc<Iterator<String[]>, double[][]>() {

        private double[][] localPoints = new double[dsize / parallelism][dimension];

        @Override
        public double[][] compute(Iterator<String[]> input) {
            for (int i = 0; i < dsize / parallelism && input.hasNext(); i++) {
                String[] value = input.next();
                for (int j = 0; j < value.length; j++) {
                    localPoints[i][j] = Double.parseDouble(value[j]);
                }
            }
            return localPoints;
        }
    });
    points.setName("dataSource").cache();
    // CachedTSet<double[][]> centers = tc.createSource(new CenterSource(type), parallelism).cache();
    SourceTSet<String[]> centerSource = env.createCSVSource(centroidDirectory, csize, parallelism, "complete");
    ComputeTSet<double[][]> centers = centerSource.direct()
        .compute(new ComputeFunc<Iterator<String[]>, double[][]>() {

        private double[][] localCenters = new double[csize][dimension];

        @Override
        public double[][] compute(Iterator<String[]> input) {
            for (int i = 0; i < csize && input.hasNext(); i++) {
                String[] value = input.next();
                for (int j = 0; j < dimension; j++) {
                    localCenters[i][j] = Double.parseDouble(value[j]);
                }
            }
            return localCenters;
        }
    });
    CachedTSet<double[][]> cachedCenters = centers.cache();
    long endTimeData = System.currentTimeMillis();
    ComputeTSet<double[][]> kmeansTSet = points.direct().map(new KMeansMap());
    ComputeTSet<double[][]> reduced = kmeansTSet.allReduce((ReduceFunc<double[][]>) (t1, t2) -> {
        double[][] newCentroids = new double[t1.length][t1[0].length];
        for (int j = 0; j < t1.length; j++) {
            for (int k = 0; k < t1[0].length; k++) {
                double newVal = t1[j][k] + t2[j][k];
                newCentroids[j][k] = newVal;
            }
        }
        return newCentroids;
    }).map(new AverageCenters());
    kmeansTSet.addInput("centers", cachedCenters);
    CachedTSet<double[][]> cached = reduced.lazyCache();
    for (int i = 0; i < iterations; i++) {
        env.evalAndUpdate(cached, cachedCenters);
    }
    env.finishEval(cached);
    long endTime = System.currentTimeMillis();
    if (workerId == 0) {
        LOG.info("Data Load time : " + (endTimeData - startTime) + "\n" + "Total Time : " + (endTime - startTime) + "Compute Time : " + (endTime - endTimeData));
        LOG.info("Final Centroids After\t" + iterations + "\titerations\t");
        cachedCenters.direct().forEach(i -> LOG.info(Arrays.deepToString(i)));
    }
}
Also used : BatchEnvironment(edu.iu.dsc.tws.tset.env.BatchEnvironment) Config(edu.iu.dsc.tws.api.config.Config) Iterator(java.util.Iterator) ReduceFunc(edu.iu.dsc.tws.api.tset.fn.ReduceFunc)
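
The tail of this example is twister2's lazy-evaluation idiom: lazyCache() only builds the execution plan, each evalAndUpdate() call runs one round and writes the freshly reduced centroids back into cachedCenters (which the KMeansMap stage reads through addInput), and finishEval() closes the lazy evaluation. A condensed sketch of just that skeleton, reusing the names from the code above:

// skeleton of the iterative update pattern used above
CachedTSet<double[][]> cached = reduced.lazyCache();  // plan only; nothing executes yet
for (int i = 0; i < iterations; i++) {
    // run one k-means round and replace the contents of cachedCenters,
    // so the next round's KMeansMap sees the updated centroids via addInput
    env.evalAndUpdate(cached, cachedCenters);
}
env.finishEval(cached);  // complete the lazy evaluation and release its resources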

Aggregations

BatchEnvironment (edu.iu.dsc.tws.tset.env.BatchEnvironment) 59
Config (edu.iu.dsc.tws.api.config.Config) 24
TSetEnvironment (edu.iu.dsc.tws.tset.env.TSetEnvironment) 24
JobConfig (edu.iu.dsc.tws.api.JobConfig) 23
WorkerEnvironment (edu.iu.dsc.tws.api.resource.WorkerEnvironment) 23
Logger (java.util.logging.Logger) 23
SourceTSet (edu.iu.dsc.tws.tset.sets.batch.SourceTSet) 22
HashMap (java.util.HashMap) 22
ResourceAllocator (edu.iu.dsc.tws.rsched.core.ResourceAllocator) 21
Iterator (java.util.Iterator) 21
Tuple (edu.iu.dsc.tws.api.comms.structs.Tuple) 18
ComputeCollectorFunc (edu.iu.dsc.tws.api.tset.fn.ComputeCollectorFunc) 12
ComputeFunc (edu.iu.dsc.tws.api.tset.fn.ComputeFunc) 12
TSetContext (edu.iu.dsc.tws.api.tset.TSetContext) 7
SinkTSet (edu.iu.dsc.tws.tset.sets.batch.SinkTSet) 6
Twister2Job (edu.iu.dsc.tws.api.Twister2Job) 5
MapFunc (edu.iu.dsc.tws.api.tset.fn.MapFunc) 5
SinkFunc (edu.iu.dsc.tws.api.tset.fn.SinkFunc) 5
Twister2Submitter (edu.iu.dsc.tws.rsched.job.Twister2Submitter) 5
ComputeTSet (edu.iu.dsc.tws.tset.sets.batch.ComputeTSet) 5