Example 1 with IngestFileNotFoundException

Use of bio.terra.service.dataset.exception.IngestFileNotFoundException in project jade-data-repo by DataBiosphere.

From the class BigQueryPdao, method loadToStagingTable.

// Load data
public PdaoLoadStatistics loadToStagingTable(Dataset dataset, DatasetTable targetTable, String stagingTableName, IngestRequestModel ingestRequest) throws InterruptedException {
    BigQueryProject bigQueryProject = bigQueryProjectForDataset(dataset);
    BigQuery bigQuery = bigQueryProject.getBigQuery();
    TableId tableId = TableId.of(prefixName(dataset.getName()), stagingTableName);
    // Source does not have row_id
    Schema schema = buildSchema(targetTable, true);
    LoadJobConfiguration.Builder loadBuilder = LoadJobConfiguration.builder(tableId, ingestRequest.getPath())
        .setFormatOptions(buildFormatOptions(ingestRequest))
        .setMaxBadRecords((ingestRequest.getMaxBadRecords() == null) ? Integer.valueOf(0) : ingestRequest.getMaxBadRecords())
        .setIgnoreUnknownValues((ingestRequest.isIgnoreUnknownValues() == null) ? Boolean.TRUE : ingestRequest.isIgnoreUnknownValues())
        // docs say this is for target, but CLI provides one for the source
        .setSchema(schema)
        .setCreateDisposition(JobInfo.CreateDisposition.CREATE_IF_NEEDED)
        .setWriteDisposition(JobInfo.WriteDisposition.WRITE_TRUNCATE);
    // The null marker applies only to CSV input, so we have to special-case here. Grumble...
    if (ingestRequest.getFormat() == IngestRequestModel.FormatEnum.CSV) {
        loadBuilder.setNullMarker((ingestRequest.getCsvNullMarker() == null) ? "" : ingestRequest.getCsvNullMarker());
    }
    LoadJobConfiguration configuration = loadBuilder.build();
    Job loadJob = bigQuery.create(JobInfo.of(configuration));
    Instant loadJobMaxTime = Instant.now().plusSeconds(TimeUnit.MINUTES.toSeconds(20L));
    while (!loadJob.isDone()) {
        logger.info("Waiting for staging table load job " + loadJob.getJobId().getJob() + " to complete");
        TimeUnit.SECONDS.sleep(5L);
        if (loadJobMaxTime.isBefore(Instant.now())) {
            loadJob.cancel();
            throw new PdaoException("Staging table load failed to complete within timeout - canceled");
        }
    }
    loadJob = loadJob.reload();
    BigQueryError loadJobError = loadJob.getStatus().getError();
    if (loadJobError == null) {
        logger.info("Staging table load job " + loadJob.getJobId().getJob() + " succeeded");
    } else {
        logger.info("Staging table load job " + loadJob.getJobId().getJob() + " failed: " + loadJobError);
        if ("notFound".equals(loadJobError.getReason())) {
            throw new IngestFileNotFoundException("Ingest source file not found: " + ingestRequest.getPath());
        }
        List<String> loadErrors = new ArrayList<>();
        List<BigQueryError> bigQueryErrors = loadJob.getStatus().getExecutionErrors();
        for (BigQueryError bigQueryError : bigQueryErrors) {
            loadErrors.add("BigQueryError: reason=" + bigQueryError.getReason() + " message=" + bigQueryError.getMessage());
        }
        throw new IngestFailureException("Ingest failed with " + loadErrors.size() + " errors - see error details", loadErrors);
    }
    // Job completed successfully
    JobStatistics.LoadStatistics loadStatistics = loadJob.getStatistics();
    PdaoLoadStatistics pdaoLoadStatistics = new PdaoLoadStatistics()
        .badRecords(loadStatistics.getBadRecords())
        .rowCount(loadStatistics.getOutputRows())
        .startTime(Instant.ofEpochMilli(loadStatistics.getStartTime()))
        .endTime(Instant.ofEpochMilli(loadStatistics.getEndTime()));
    return pdaoLoadStatistics;
}
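
The wait loop in loadToStagingTable polls the job, sleeps, and gives up after a fixed deadline. The following is a minimal, library-free sketch of that pattern, not the project's code: the class name PollWithDeadline and the method waitUntilDone are hypothetical, and a BooleanSupplier stands in for loadJob.isDone(). Unlike the method above, it checks the deadline before sleeping and returns a flag so the caller decides whether to cancel and throw.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.function.BooleanSupplier;

// Hypothetical helper: generic "poll until done or deadline" loop.
final class PollWithDeadline {

    /**
     * Polls isDone until it returns true or the timeout elapses.
     * Returns true if the work finished in time, false if the deadline passed.
     */
    static boolean waitUntilDone(BooleanSupplier isDone, Duration timeout, Duration pollInterval)
            throws InterruptedException {
        Instant deadline = Instant.now().plus(timeout);
        while (!isDone.getAsBoolean()) {
            if (Instant.now().isAfter(deadline)) {
                return false; // caller decides whether to cancel the job and throw
            }
            Thread.sleep(pollInterval.toMillis());
        }
        return true;
    }

    public static void main(String[] args) throws InterruptedException {
        int[] polls = {0};
        // Simulated job that reports done on the third poll.
        boolean finished = waitUntilDone(() -> ++polls[0] >= 3,
                Duration.ofSeconds(1), Duration.ofMillis(10));
        System.out.println("finished=" + finished + " polls=" + polls[0]);
    }
}
```

In the method above, the equivalent of a false return is the branch that calls loadJob.cancel() and throws PdaoException.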
Also used: TableId (com.google.cloud.bigquery.TableId), JobStatistics (com.google.cloud.bigquery.JobStatistics), BigQuery (com.google.cloud.bigquery.BigQuery), BigQueryError (com.google.cloud.bigquery.BigQueryError), Schema (com.google.cloud.bigquery.Schema), Instant (java.time.Instant), ArrayList (java.util.ArrayList), LoadJobConfiguration (com.google.cloud.bigquery.LoadJobConfiguration), IngestFailureException (bio.terra.service.dataset.exception.IngestFailureException), PdaoException (bio.terra.common.exception.PdaoException), IngestFileNotFoundException (bio.terra.service.dataset.exception.IngestFileNotFoundException), Job (com.google.cloud.bigquery.Job), PdaoLoadStatistics (bio.terra.common.PdaoLoadStatistics)
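
The error branch of loadToStagingTable does two things: it maps a job-level error whose reason is "notFound" to IngestFileNotFoundException, and otherwise flattens the execution errors into strings for IngestFailureException. The sketch below isolates that classification so it can be read without the BigQuery client. LoadErrorSummary and its nested LoadError are hypothetical stand-ins (LoadError mimics only the reason and message that the example reads from BigQueryError). The null check on the execution-error list is an added assumption, in case the list is absent; the method above does not make that check.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper illustrating the error classification used in the example.
final class LoadErrorSummary {

    /** Stand-in for BigQueryError: only the two fields the example reads. */
    static final class LoadError {
        final String reason;
        final String message;

        LoadError(String reason, String message) {
            this.reason = reason;
            this.message = message;
        }
    }

    /** True when the job-level error indicates a missing source file. */
    static boolean isSourceNotFound(LoadError jobError) {
        return jobError != null && "notFound".equals(jobError.reason);
    }

    /** Formats each execution error as the example does; a null list yields an empty result. */
    static List<String> describe(List<LoadError> executionErrors) {
        List<String> descriptions = new ArrayList<>();
        if (executionErrors == null) {
            return descriptions;
        }
        for (LoadError error : executionErrors) {
            descriptions.add("BigQueryError: reason=" + error.reason + " message=" + error.message);
        }
        return descriptions;
    }
}
```

A caller would throw the file-not-found exception when isSourceNotFound returns true and otherwise pass the result of describe to the general ingest failure exception.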
