Example 1 with Dataset

use of org.apache.gobblin.dataset.Dataset in project incubator-gobblin by apache.

the class DatasetFinderSource method createWorkUnitStream.

private Stream<WorkUnit> createWorkUnitStream(SourceState state) throws IOException {
    IterableDatasetFinder datasetsFinder = createDatasetsFinder(state);
    Stream<Dataset> datasetStream = datasetsFinder.getDatasetsStream(0, null);
    if (this.drilldownIntoPartitions) {
        return datasetStream.flatMap(dataset -> {
            if (dataset instanceof PartitionableDataset) {
                try {
                    return (Stream<PartitionableDataset.DatasetPartition>) ((PartitionableDataset) dataset).getPartitions(0, null);
                } catch (IOException ioe) {
                    log.error("Failed to get partitions for dataset " + dataset.getUrn(), ioe);
                    return Stream.empty();
                }
            } else {
                return Stream.of(new DatasetWrapper(dataset));
            }
        }).map(this::workUnitForPartitionInternal);
    } else {
        return datasetStream.map(this::workUnitForDataset);
    }
}
Also used : DatasetUtils(org.apache.gobblin.data.management.dataset.DatasetUtils) WorkUnitStream(org.apache.gobblin.source.workunit.WorkUnitStream) Getter(lombok.Getter) IOException(java.io.IOException) Collectors(java.util.stream.Collectors) PartitionableDataset(org.apache.gobblin.dataset.PartitionableDataset) IterableDatasetFinder(org.apache.gobblin.dataset.IterableDatasetFinder) List(java.util.List) Slf4j(lombok.extern.slf4j.Slf4j) Stream(java.util.stream.Stream) BasicWorkUnitStream(org.apache.gobblin.source.workunit.BasicWorkUnitStream) SourceState(org.apache.gobblin.configuration.SourceState) WorkUnitStreamSource(org.apache.gobblin.source.WorkUnitStreamSource) HadoopUtils(org.apache.gobblin.util.HadoopUtils) AllArgsConstructor(lombok.AllArgsConstructor) Dataset(org.apache.gobblin.dataset.Dataset) WorkUnit(org.apache.gobblin.source.workunit.WorkUnit)
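
The drill-down in Example 1 is a flatMap in which each element expands either into its partitions or into itself. The following is a minimal, JDK-only sketch of that shape. The types Item and Partitioned and the helper partitioned(...) are hypothetical stand-ins invented for illustration, not Gobblin classes, and the sketch says nothing about how Gobblin builds work units.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

final class DrilldownSketch {
    /** Hypothetical stand-in for a dataset: anything with a URN. */
    interface Item {
        String urn();
    }

    /** Hypothetical stand-in for a partitionable dataset. */
    interface Partitioned extends Item {
        List<String> partitions() throws IOException;
    }

    /** Expands one item into its partition names, or into itself if it has none. */
    static Stream<String> expandOne(Item item) {
        if (item instanceof Partitioned) {
            try {
                return ((Partitioned) item).partitions().stream().map(p -> item.urn() + "/" + p);
            } catch (IOException e) {
                // Skip an item whose partitions cannot be listed instead of failing the whole stream.
                return Stream.empty();
            }
        }
        return Stream.of(item.urn());
    }

    /** Flattens every item through expandOne and collects the results in encounter order. */
    static List<String> expandAll(Stream<Item> items) {
        return items.flatMap(DrilldownSketch::expandOne).collect(Collectors.toList());
    }

    /** Builds a partitioned item; a null partition list simulates a listing failure. */
    static Item partitioned(String urn, List<String> parts) {
        return new Partitioned() {
            @Override
            public String urn() {
                return urn;
            }

            @Override
            public List<String> partitions() throws IOException {
                if (parts == null) {
                    throw new IOException("cannot list " + urn);
                }
                return parts;
            }
        };
    }

    public static void main(String[] args) {
        Item plain = () -> "plain";
        Stream<Item> items = Stream.of(
            plain,
            partitioned("daily", Arrays.asList("p1", "p2")),
            partitioned("broken", null));
        System.out.println(expandAll(items)); // [plain, daily/p1, daily/p2]
    }
}
```

Returning an empty stream from the catch block means one failing item is dropped silently from the result, so logging the exception at that point matters.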

Example 2 with Dataset

use of org.apache.gobblin.dataset.Dataset in project incubator-gobblin by apache.

the class ConfigBasedCleanabledDatasetFinder method findDatasetsCallable.

protected Callable<Void> findDatasetsCallable(final ConfigClient confClient, final URI u, final Properties p, Optional<List<String>> blacklistURNs, final Collection<Dataset> datasets) {
    return new Callable<Void>() {

        @Override
        public Void call() throws Exception {
            // Load the {@link Config} for this URI, build its dataset, and add it to the shared collection
            Config c = confClient.getConfig(u);
            Dataset datasetForConfig = new ConfigurableCleanableDataset(fileSystem, p, new Path(c.getString(DATASET_PATH)), c, log);
            datasets.add(datasetForConfig);
            return null;
        }
    };
}
Also used : Path(org.apache.hadoop.fs.Path) ConfigurableCleanableDataset(org.apache.gobblin.data.management.retention.dataset.ConfigurableCleanableDataset) Config(com.typesafe.config.Config) Dataset(org.apache.gobblin.dataset.Dataset) Callable(java.util.concurrent.Callable)

Example 3 with Dataset

use of org.apache.gobblin.dataset.Dataset in project incubator-gobblin by apache.

the class ComplianceRetentionJob method run.

public void run() throws IOException {
    // Dropping empty tables
    for (HiveDataset dataset : this.tablesToDrop) {
        log.info("Dropping table: " + dataset.getTable().getCompleteName());
        executeDropTableQuery(dataset, this.properties);
    }
    Preconditions.checkNotNull(this.finder, "Dataset finder class is not set");
    List<Dataset> datasets = this.finder.findDatasets();
    this.finishCleanSignal = Optional.of(new CountDownLatch(datasets.size()));
    for (final Dataset dataset : datasets) {
        ListenableFuture<Void> future = this.service.submit(new Callable<Void>() {

            @Override
            public Void call() throws Exception {
                if (dataset instanceof CleanableDataset) {
                    ((CleanableDataset) dataset).clean();
                } else {
                    log.warn("Not an instance of " + CleanableDataset.class + ". Dataset won't be cleaned: " + dataset.datasetURN());
                }
                return null;
            }
        });
        Futures.addCallback(future, new FutureCallback<Void>() {

            @Override
            public void onSuccess(@Nullable Void result) {
                ComplianceRetentionJob.this.finishCleanSignal.get().countDown();
                log.info("Successfully cleaned: " + dataset.datasetURN());
            }

            @Override
            public void onFailure(Throwable t) {
                ComplianceRetentionJob.this.finishCleanSignal.get().countDown();
                log.warn("Exception caught when cleaning " + dataset.datasetURN() + ".", t);
                ComplianceRetentionJob.this.throwables.add(t);
                ComplianceRetentionJob.this.eventSubmitter.submit(ComplianceEvents.Retention.FAILED_EVENT_NAME, ImmutableMap.of(ComplianceEvents.FAILURE_CONTEXT_METADATA_KEY, ExceptionUtils.getFullStackTrace(t), ComplianceEvents.DATASET_URN_METADATA_KEY, dataset.datasetURN()));
            }
        });
    }
}
Also used : CleanableDataset(org.apache.gobblin.data.management.retention.dataset.CleanableDataset) Dataset(org.apache.gobblin.dataset.Dataset) HiveDataset(org.apache.gobblin.data.management.copy.hive.HiveDataset) CountDownLatch(java.util.concurrent.CountDownLatch) SQLException(java.sql.SQLException) TException(org.apache.thrift.TException) IOException(java.io.IOException)
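
Example 3 counts a latch down in both the success and the failure callback, so the latch reaches zero once every submitted task has finished either way. The sketch below shows the same pattern with JDK classes only. It swaps Guava's ListenableFuture and Futures.addCallback for CompletableFuture.whenComplete, and the class LatchCleanupSketch and method runAll are hypothetical names. Awaiting the latch inside the same method is also an assumption, since the excerpt does not show where finishCleanSignal is awaited.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

final class LatchCleanupSketch {
    /** Runs every task on the pool, waits for all of them, and returns the failures. */
    static List<Throwable> runAll(List<Runnable> tasks, ExecutorService pool) {
        CountDownLatch done = new CountDownLatch(tasks.size());
        Queue<Throwable> failures = new ConcurrentLinkedQueue<>();
        for (Runnable task : tasks) {
            CompletableFuture.runAsync(task, pool).whenComplete((ignored, error) -> {
                if (error != null) {
                    // The throwable may arrive wrapped in a CompletionException.
                    failures.add(error);
                }
                // Count down on success and on failure so the waiter is always released.
                done.countDown();
            });
        }
        try {
            done.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for tasks", e);
        }
        return new ArrayList<>(failures);
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            List<Runnable> tasks = Arrays.<Runnable>asList(
                () -> { },
                () -> { throw new IllegalStateException("boom"); });
            System.out.println("failures: " + runAll(tasks, pool).size()); // failures: 1
        } finally {
            pool.shutdown();
        }
    }
}
```

If the failure path did not count down, a single failing task would leave the waiter blocked forever, which is why both callbacks in the example touch the latch.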

Example 4 with Dataset

use of org.apache.gobblin.dataset.Dataset in project incubator-gobblin by apache.

the class CompactionSource method getWorkunitStream.

@Override
public WorkUnitStream getWorkunitStream(SourceState state) {
    try {
        fs = getSourceFileSystem(state);
        state.setProp(COMPACTION_INIT_TIME, DateTimeUtils.currentTimeMillis());
        suite = CompactionSuiteUtils.getCompactionSuiteFactory(state).createSuite(state);
        initRequestAllocator(state);
        initJobDir(state);
        copyJarDependencies(state);
        DatasetsFinder finder = DatasetUtils.instantiateDatasetFinder(state.getProperties(), getSourceFileSystem(state), DefaultFileSystemGlobFinder.class.getName());
        List<Dataset> datasets = finder.findDatasets();
        CompactionWorkUnitIterator workUnitIterator = new CompactionWorkUnitIterator();
        // Spawn a single thread to create work units
        new Thread(new SingleWorkUnitGeneratorService(state, prioritize(datasets, state), workUnitIterator), "SingleWorkUnitGeneratorService").start();
        return new BasicWorkUnitStream.Builder(workUnitIterator).build();
    } catch (IOException e) {
        throw new RuntimeException(e);
    }
}
Also used : DefaultFileSystemGlobFinder(org.apache.gobblin.data.management.dataset.DefaultFileSystemGlobFinder) Dataset(org.apache.gobblin.dataset.Dataset) BasicWorkUnitStream(org.apache.gobblin.source.workunit.BasicWorkUnitStream) DatasetsFinder(org.apache.gobblin.dataset.DatasetsFinder) IOException(java.io.IOException)
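
Example 4 returns a stream backed by an iterator while a separate thread is still generating work units. The excerpt does not show how CompactionWorkUnitIterator is implemented, so the sketch below is only one possible way to get that behavior: a blocking queue with an end-of-stream sentinel. QueueBackedIterator and all of its members are hypothetical and are not Gobblin APIs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

final class QueueBackedIterator<T> implements Iterator<T> {
    /** Sentinel that marks the end of the stream. */
    private static final Object END = new Object();

    private final BlockingQueue<Object> queue = new LinkedBlockingQueue<>();
    private Object buffered; // next element, END, or null if nothing is buffered yet

    /** Called by the producer thread for each element (null elements are not supported). */
    void add(T item) {
        queue.add(item);
    }

    /** Called by the producer thread once, after the last element. */
    void done() {
        queue.add(END);
    }

    @Override
    public boolean hasNext() {
        if (buffered == null) {
            try {
                buffered = queue.take(); // blocks until the producer adds something
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                buffered = END; // treat interruption as end of stream
            }
        }
        return buffered != END;
    }

    @Override
    @SuppressWarnings("unchecked")
    public T next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        T item = (T) buffered;
        buffered = null;
        return item;
    }

    /** Starts a single producer thread and consumes its output on the calling thread. */
    static List<String> drain(List<String> inputs) {
        QueueBackedIterator<String> it = new QueueBackedIterator<>();
        new Thread(() -> {
            for (String input : inputs) {
                it.add("unit-" + input);
            }
            it.done();
        }, "producer").start();
        List<String> out = new ArrayList<>();
        while (it.hasNext()) {
            out.add(it.next());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(drain(Arrays.asList("a", "b"))); // [unit-a, unit-b]
    }
}
```

The consumer can start as soon as the first element is queued, which is the practical benefit of handing back the iterator before generation finishes.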

Example 5 with Dataset

use of org.apache.gobblin.dataset.Dataset in project incubator-gobblin by apache.

the class ConfigBasedCopyableDatasetFinder method findDatasetsCallable.

protected Callable<Void> findDatasetsCallable(final ConfigClient confClient, final URI u, final Properties p, Optional<List<String>> blacklistPatterns, final Collection<Dataset> datasets) {
    return new Callable<Void>() {

        @Override
        public Void call() throws Exception {
            // Load the {@link Config} for this URI, build its datasets, and add them to the shared collection
            Config c = confClient.getConfig(u);
            List<Dataset> datasetForConfig = new ConfigBasedMultiDatasets(c, p, blacklistPatterns).getConfigBasedDatasetList();
            datasets.addAll(datasetForConfig);
            return null;
        }
    };
}
Also used : Config(com.typesafe.config.Config) Dataset(org.apache.gobblin.dataset.Dataset) Callable(java.util.concurrent.Callable)

Aggregations

Dataset (org.apache.gobblin.dataset.Dataset) 15
IOException (java.io.IOException) 7
SourceState (org.apache.gobblin.configuration.SourceState) 6
IterableDatasetFinder (org.apache.gobblin.dataset.IterableDatasetFinder) 6
PartitionableDataset (org.apache.gobblin.dataset.PartitionableDataset) 6
WorkUnit (org.apache.gobblin.source.workunit.WorkUnit) 6
WorkUnitStream (org.apache.gobblin.source.workunit.WorkUnitStream) 6
CountDownLatch (java.util.concurrent.CountDownLatch) 4
SimpleDatasetForTesting (org.apache.gobblin.dataset.test.SimpleDatasetForTesting) 4
SimpleDatasetPartitionForTesting (org.apache.gobblin.dataset.test.SimpleDatasetPartitionForTesting) 4
SimplePartitionableDatasetForTesting (org.apache.gobblin.dataset.test.SimplePartitionableDatasetForTesting) 4
StaticDatasetsFinderForTesting (org.apache.gobblin.dataset.test.StaticDatasetsFinderForTesting) 4
Test (org.testng.annotations.Test) 4
Config (com.typesafe.config.Config) 3
WorkUnitState (org.apache.gobblin.configuration.WorkUnitState) 3
CleanableDataset (org.apache.gobblin.data.management.retention.dataset.CleanableDataset) 3
BasicWorkUnitStream (org.apache.gobblin.source.workunit.BasicWorkUnitStream) 3
TException (org.apache.thrift.TException) 3
List (java.util.List) 2
Callable (java.util.concurrent.Callable) 2