Example 1 with RdfClient

Use of org.wikidata.query.rdf.tool.rdf.client.RdfClient in the wikidata-query-rdf project by wikimedia.

Class StreamingUpdate, method build.

static StreamingUpdaterConsumer build(StreamingUpdateOptions options, MetricRegistry metrics) {
    RDFChunkDeserializer deser = new RDFChunkDeserializer(new RDFParserSuppliers(RDFParserRegistry.getInstance()));
    KafkaStreamConsumer consumer = KafkaStreamConsumer.build(
            options.brokers(), options.topic(), options.partition(), options.consumerGroup(),
            options.batchSize(), deser, parseInitialOffset(options),
            KafkaStreamConsumerMetricsListener.forRegistry(metrics),
            options.bufferedInputMessages(), buildFilter(StreamingUpdateOptions.entityFilterPattern(options)));
    HttpClient httpClient = buildHttpClient(getHttpProxyHost(), getHttpProxyPort());
    Retryer<ContentResponse> retryer = buildHttpClientRetryer();
    Duration rdfClientTimeout = RdfRepositoryUpdater.getRdfClientTimeout();
    RdfClient rdfClient = new RdfClient(httpClient, StreamingUpdateOptions.sparqlUri(options), retryer, rdfClientTimeout);
    UrisScheme uris = UrisSchemeFactory.getURISystem();
    return new StreamingUpdaterConsumer(consumer, new RdfRepositoryUpdater(rdfClient, uris), metrics, options.inconsistenciesWarningThreshold());
}
Also used : RDFParserSuppliers(org.wikidata.query.rdf.tool.rdf.RDFParserSuppliers) ContentResponse(org.eclipse.jetty.client.api.ContentResponse) UrisScheme(org.wikidata.query.rdf.common.uri.UrisScheme) HttpClient(org.eclipse.jetty.client.HttpClient) HttpClientUtils.buildHttpClient(org.wikidata.query.rdf.tool.HttpClientUtils.buildHttpClient) RdfRepositoryUpdater(org.wikidata.query.rdf.tool.rdf.RdfRepositoryUpdater) Duration(java.time.Duration) RdfClient(org.wikidata.query.rdf.tool.rdf.client.RdfClient) RDFChunkDeserializer(org.wikidata.query.rdf.updater.RDFChunkDeserializer)

Example 2 with RdfClient

Use of org.wikidata.query.rdf.tool.rdf.client.RdfClient in the wikidata-query-rdf project by wikimedia.

Class RdfKafkaRepositoryIntegrationTest, method readWriteOffsets.

@Test
public void readWriteOffsets() throws Exception {
    Uris uris = new Uris(new URI("https://acme.test"), singleton(0L), "/api.php", "/entitydata");
    Instant startTime = Instant.ofEpochMilli(BEGIN_DATE);
    HttpClient httpClient = buildHttpClient(getHttpProxyHost(), getHttpProxyPort());
    RdfClient rdfClient = new RdfClient(httpClient, url("/namespace/wdq/sparql"), buildHttpClientRetryer(), Duration.of(-1, SECONDS));
    try {
        rdfClient.update("CLEAR ALL");
        KafkaOffsetsRepository kafkaOffsetsRepository = new RdfKafkaOffsetsRepository(uris.builder().build(), rdfClient);
        Map<TopicPartition, OffsetAndMetadata> offsets = new HashMap<>();
        offsets.put(new TopicPartition("topictest", 0), new OffsetAndMetadata(1L));
        offsets.put(new TopicPartition("othertopic", 0), new OffsetAndMetadata(2L));
        kafkaOffsetsRepository.store(offsets);
        Map<TopicPartition, OffsetAndTimestamp> offsetsAndTimestamps = kafkaOffsetsRepository.load(startTime);
        assertThat(offsetsAndTimestamps.get(new TopicPartition("topictest", 0)).offset()).isEqualTo(1L);
        assertThat(offsetsAndTimestamps.get(new TopicPartition("othertopic", 0)).offset()).isEqualTo(2L);
        offsets = new HashMap<>();
        offsets.put(new TopicPartition("topictest", 0), new OffsetAndMetadata(3L));
        offsets.put(new TopicPartition("othertopic", 0), new OffsetAndMetadata(4L));
        kafkaOffsetsRepository.store(offsets);
        offsetsAndTimestamps = kafkaOffsetsRepository.load(startTime);
        assertThat(offsetsAndTimestamps.get(new TopicPartition("topictest", 0)).offset()).isEqualTo(3L);
        assertThat(offsetsAndTimestamps.get(new TopicPartition("othertopic", 0)).offset()).isEqualTo(4L);
    } finally {
        rdfClient.update("CLEAR ALL");
        httpClient.stop();
    }
}
Also used : HashMap(java.util.HashMap) Instant(java.time.Instant) RdfClient(org.wikidata.query.rdf.tool.rdf.client.RdfClient) URI(java.net.URI) Uris(org.wikidata.query.rdf.tool.wikibase.WikibaseRepository.Uris) TopicPartition(org.apache.kafka.common.TopicPartition) HttpClient(org.eclipse.jetty.client.HttpClient) HttpClientUtils.buildHttpClient(org.wikidata.query.rdf.tool.HttpClientUtils.buildHttpClient) OffsetAndMetadata(org.apache.kafka.clients.consumer.OffsetAndMetadata) OffsetAndTimestamp(org.apache.kafka.clients.consumer.OffsetAndTimestamp) Test(org.junit.Test)

Example 3 with RdfClient

Use of org.wikidata.query.rdf.tool.rdf.client.RdfClient in the wikidata-query-rdf project by wikimedia.

Class Update, method initialize.

private static Updater<? extends Change.Batch> initialize(String[] args, Closer closer) throws URISyntaxException {
    try {
        UpdateOptions options = handleOptions(UpdateOptions.class, args);
        MetricRegistry metricRegistry = createMetricRegistry(closer, options.metricDomain());
        StreamDumper wikibaseStreamDumper = createStreamDumper(dumpDirPath(options));
        WikibaseRepository wikibaseRepository = new WikibaseRepository(
                UpdateOptions.uris(options), options.constraints(), metricRegistry, wikibaseStreamDumper,
                UpdateOptions.revisionDuration(options), RDFParserSuppliers.defaultRdfParser());
        closer.register(wikibaseRepository);
        UrisScheme wikibaseUris = WikibaseOptions.wikibaseUris(options);
        URI root = wikibaseRepository.getUris().builder().build();
        URI sparqlUri = UpdateOptions.sparqlUri(options);
        HttpClient httpClient = buildHttpClient(getHttpProxyHost(), getHttpProxyPort());
        closer.register(wrapHttpClient(httpClient));
        Retryer<ContentResponse> retryer = buildHttpClientRetryer();
        Duration rdfClientTimeout = getRdfClientTimeout();
        RdfClient rdfClient = new RdfClient(httpClient, sparqlUri, retryer, rdfClientTimeout);
        RdfRepository rdfRepository = new RdfRepository(wikibaseUris, rdfClient, MAX_FORM_CONTENT_SIZE);
        Instant startTime = getStartTime(startInstant(options), rdfRepository, options.init());
        Change.Source<? extends Change.Batch> changeSource = buildChangeSource(options, startTime, wikibaseRepository, rdfClient, root, metricRegistry);
        Munger munger = mungerFromOptions(options);
        ExecutorService updaterExecutorService = createUpdaterExecutorService(options.threadCount());
        Updater<? extends Change.Batch> updater = createUpdater(
                wikibaseRepository, wikibaseUris, rdfRepository, changeSource, munger, updaterExecutorService,
                options.importAsync(), options.pollDelay(), options.verify(), metricRegistry);
        closer.register(updater);
        return updater;
    } catch (Exception e) {
        log.error("Error during initialization.", e);
        throw e;
    }
}
Also used : ContentResponse(org.eclipse.jetty.client.api.ContentResponse) UrisScheme(org.wikidata.query.rdf.common.uri.UrisScheme) MetricRegistry(com.codahale.metrics.MetricRegistry) UpdateOptions.startInstant(org.wikidata.query.rdf.tool.options.UpdateOptions.startInstant) Instant(java.time.Instant) Munger(org.wikidata.query.rdf.tool.rdf.Munger) WikibaseRepository(org.wikidata.query.rdf.tool.wikibase.WikibaseRepository) Duration(java.time.Duration) RdfRepository(org.wikidata.query.rdf.tool.rdf.RdfRepository) RdfClient(org.wikidata.query.rdf.tool.rdf.client.RdfClient) Change(org.wikidata.query.rdf.tool.change.Change) URI(java.net.URI) UpdateOptions(org.wikidata.query.rdf.tool.options.UpdateOptions) URISyntaxException(java.net.URISyntaxException) IOException(java.io.IOException) FileStreamDumper(org.wikidata.query.rdf.tool.utils.FileStreamDumper) StreamDumper(org.wikidata.query.rdf.tool.utils.StreamDumper) NullStreamDumper(org.wikidata.query.rdf.tool.utils.NullStreamDumper) HttpClient(org.eclipse.jetty.client.HttpClient) HttpClientUtils.buildHttpClient(org.wikidata.query.rdf.tool.HttpClientUtils.buildHttpClient) ExecutorService(java.util.concurrent.ExecutorService)

Example 4 with RdfClient

Use of org.wikidata.query.rdf.tool.rdf.client.RdfClient in the wikidata-query-rdf project by wikimedia.

Class RdfRepositoryUnitTest, method batchUpdate.

@Test
public void batchUpdate() {
    RdfClient mockClient = mock(RdfClient.class);
    // 1.5M size means ~4k statements or 250K statement size max
    long maxPostSize = 1572864L;
    CollectedUpdateMetrics collectedUpdateMetrics = new CollectedUpdateMetrics();
    collectedUpdateMetrics.setMutationCount(1);
    collectedUpdateMetrics.merge(MultiSyncStep.INSERT_NEW_DATA, UpdateMetrics.builder().build());
    when(mockClient.update(any(String.class), any(UpdateMetricsResponseHandler.class))).thenReturn(collectedUpdateMetrics);
    RdfRepository repo = new RdfRepository(uris, mockClient, maxPostSize);
    // 6000 statements - should go over the limit
    Change change1 = new Change("Q1", 1, Instant.EPOCH, 1);
    StatementBuilder sb = new StatementBuilder("Q1");
    for (int i = 0; i < 6000; i++) {
        sb.withPredicateObject(RDFS.LABEL, new LiteralImpl("some item " + i));
    }
    change1.setStatements(sb.build());
    // One statement with 300K data - should go over the limit
    Change change2 = new Change("Q2", 1, Instant.EPOCH, 1);
    List<Statement> statements2 = new StatementBuilder("Q2").withPredicateObject(RDFS.LABEL, new LiteralImpl(randomizer.randomAsciiOfLength(300 * 1024))).build();
    change2.setStatements(statements2);
    // Just one statement - this will be separated anyway
    Change change3 = new Change("Q3", 1, Instant.EPOCH, 1);
    List<Statement> statements3 = new StatementBuilder("Q3").withPredicateObject(RDFS.LABEL, new LiteralImpl("third item")).build();
    change3.setStatements(statements3);
    List<Change> changes = ImmutableList.of(change1, change2, change3);
    int count = repo.syncFromChanges(changes, false).getMutationCount();
    assertThat(count).isEqualTo(3);
    // We should get 3 calls to update
    verify(mockClient, times(3)).update(any(), any());
}
Also used : Statement(org.openrdf.model.Statement) RdfClient(org.wikidata.query.rdf.tool.rdf.client.RdfClient) Change(org.wikidata.query.rdf.tool.change.Change) LiteralImpl(org.openrdf.model.impl.LiteralImpl) UpdateMetricsResponseHandler(org.wikidata.query.rdf.tool.rdf.client.UpdateMetricsResponseHandler) StatementBuilder(org.wikidata.query.rdf.test.StatementHelper.StatementBuilder) Test(org.junit.Test)
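
The comments in batchUpdate describe changes being split across several update calls when a size limit is exceeded. As a rough illustration of that general idea only, here is a minimal sketch of greedy size-bounded batching. The class SizeBoundedBatcher and its batch method are hypothetical and are not part of wikidata-query-rdf; the sketch does not reproduce the actual splitting rules of RdfRepository, which evidently apply further criteria (the test expects the single small statement to be sent separately as well).

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration; not part of wikidata-query-rdf. */
final class SizeBoundedBatcher {
    private SizeBoundedBatcher() {}

    /**
     * Greedily groups item sizes into batches whose total stays at or below
     * maxBatchSize. An item larger than the limit ends up alone in its own batch.
     */
    static List<List<Long>> batch(List<Long> itemSizes, long maxBatchSize) {
        List<List<Long>> batches = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        long currentSize = 0;
        for (long size : itemSizes) {
            // Close the running batch if this item would push it over the limit.
            if (!current.isEmpty() && currentSize + size > maxBatchSize) {
                batches.add(current);
                current = new ArrayList<>();
                currentSize = 0;
            }
            current.add(size);
            currentSize += size;
        }
        // Flush whatever is left in the last batch.
        if (!current.isEmpty()) {
            batches.add(current);
        }
        return batches;
    }
}
```

Under these assumptions, calling SizeBoundedBatcher.batch(List.of(6L, 3L, 1L), 4L) puts the oversized first item in a batch of its own and groups the remaining two items together, since their combined size fits the limit.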

Aggregations

RdfClient (org.wikidata.query.rdf.tool.rdf.client.RdfClient): 4
HttpClient (org.eclipse.jetty.client.HttpClient): 3
HttpClientUtils.buildHttpClient (org.wikidata.query.rdf.tool.HttpClientUtils.buildHttpClient): 3
URI (java.net.URI): 2
Duration (java.time.Duration): 2
Instant (java.time.Instant): 2
ContentResponse (org.eclipse.jetty.client.api.ContentResponse): 2
Test (org.junit.Test): 2
UrisScheme (org.wikidata.query.rdf.common.uri.UrisScheme): 2
Change (org.wikidata.query.rdf.tool.change.Change): 2
MetricRegistry (com.codahale.metrics.MetricRegistry): 1
IOException (java.io.IOException): 1
URISyntaxException (java.net.URISyntaxException): 1
HashMap (java.util.HashMap): 1
ExecutorService (java.util.concurrent.ExecutorService): 1
OffsetAndMetadata (org.apache.kafka.clients.consumer.OffsetAndMetadata): 1
OffsetAndTimestamp (org.apache.kafka.clients.consumer.OffsetAndTimestamp): 1
TopicPartition (org.apache.kafka.common.TopicPartition): 1
Statement (org.openrdf.model.Statement): 1
LiteralImpl (org.openrdf.model.impl.LiteralImpl): 1