
Example 1 with SinglePartitionReadCommand

use of org.apache.cassandra.db.SinglePartitionReadCommand in project cassandra by apache.

the class ShortReadRowsProtection method moreContents.

/*
     * We only get here once all the rows in this iterator have been iterated over, and so if the node
     * had returned the requested number of rows but we still get here, then some results were skipped
     * during reconciliation.
     */
public UnfilteredRowIterator moreContents() {
    // never try to request additional rows from replicas if our reconciled partition is already filled to the limit
    assert !mergedResultCounter.isDoneForPartition();
    // we do not apply short read protection when we have no limits at all
    assert !command.limits().isUnlimited();
    /*
         * If the returned partition doesn't have enough rows to satisfy even the original limit, don't ask for more.
         *
         * Can only take the shortcut if there is no per-partition limit set. Otherwise it's possible to hit false
         * positives due to some rows being unaccounted for in certain scenarios (see CASSANDRA-13911).
         */
    if (command.limits().isExhausted(singleResultCounter) && command.limits().perPartitionCount() == DataLimits.NO_LIMIT)
        return null;
    /*
         * If the replica has no live rows in the partition, don't try to fetch more.
         *
         * Note that the previous branch [if (!singleResultCounter.isDoneForPartition()) return null] doesn't
         * always cover this scenario:
         * isDoneForPartition() is defined as [isDone() || rowInCurrentPartition >= perPartitionLimit],
         * and will return true if isDone() returns true, even if there are 0 rows counted in the current partition.
         *
         * This can happen with a range read if after 1+ rounds of short read protection requests we managed to fetch
         * enough extra rows for other partitions to satisfy the singleResultCounter's total row limit, but only
         * have tombstones in the current partition.
         *
         * One other way we can hit this condition is when the partition only has a live static row and no regular
         * rows. In that scenario the counter will remain at 0 until the partition is closed - which happens after
         * the moreContents() call.
         */
    if (singleResultCounter.rowsCountedInCurrentPartition() == 0)
        return null;
    /*
         * This is a table with no clustering columns, and has at most one row per partition - with EMPTY clustering.
         * We already have the row, so there is no point in asking for more from the partition.
         */
    if (lastClustering != null && lastClustering.isEmpty())
        return null;
    lastFetched = singleResultCounter.rowsCountedInCurrentPartition() - lastCounted;
    lastCounted = singleResultCounter.rowsCountedInCurrentPartition();
    // getting back fewer rows than we asked for means the partition on the replica has been fully consumed
    if (lastQueried > 0 && lastFetched < lastQueried)
        return null;
    /*
         * At this point we know that:
         *     1. the replica returned [repeatedly?] as many rows as we asked for and potentially has more
         *        rows in the partition
         *     2. at least one of those returned rows was shadowed by a tombstone returned from another
         *        replica
         *     3. we haven't satisfied the client's limits yet, and should attempt to query for more rows to
         *        avoid a short read
         *
         * In the ideal scenario, we would get exactly min(a, b) or fewer rows from the next request, where a and b
         * are defined as follows:
         *     [a] limits.count() - mergedResultCounter.counted()
         *     [b] limits.perPartitionCount() - mergedResultCounter.countedInCurrentPartition()
         *
         * It would be naive to query for exactly that many rows, as it's possible and not unlikely
         * that some of the returned rows would also be shadowed by tombstones from other hosts.
         *
         * Note: we don't know, nor do we care, how many rows from the replica made it into the reconciled result;
         * we can only tell how many in total we queried for, and that [0, mrc.countedInCurrentPartition()) made it.
         *
         * In general, our goal should be to minimise the number of extra requests - *not* to minimise the number
         * of rows fetched: there is a high transactional cost for every individual request, but a relatively low
         * marginal cost for each extra row requested.
         *
         * As such it's better to overfetch than to underfetch extra rows from a host; but at the same
         * time we want to respect paging limits and not blow up spectacularly.
         *
         * Note: it's ok to retrieve more rows than necessary, since singleResultCounter does not stop
         * iteration and only counts rows.
         *
         * With that in mind, we'll just request the minimum of (count(), perPartitionCount()) limits.
         *
         * See CASSANDRA-13794 for more details.
         */
    lastQueried = Math.min(command.limits().count(), command.limits().perPartitionCount());
    ColumnFamilyStore.metricsFor(metadata.id).shortReadProtectionRequests.mark();
    Tracing.trace("Requesting {} extra rows from {} for short read protection", lastQueried, source);
    SinglePartitionReadCommand cmd = makeFetchAdditionalRowsReadCommand(lastQueried);
    return UnfilteredPartitionIterators.getOnlyElement(commandExecutor.apply(cmd), cmd);
}
Also used : SinglePartitionReadCommand(org.apache.cassandra.db.SinglePartitionReadCommand)
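
The bookkeeping above (lastCounted, lastFetched, lastQueried) is what decides when to stop issuing extra requests. The standalone sketch below mirrors that arithmetic with plain integers; the class name, the loop and the counter readings are hypothetical stand-ins, and only illustrate the "asked for more rows than we got back, so the replica is exhausted" exit.

// Hypothetical, self-contained illustration of the short-read-protection bookkeeping.
public class ShortReadBookkeepingSketch {
    public static void main(String[] args) {
        int lastCounted = 0;   // rows counted in the partition after the previous round
        int lastQueried = 0;   // rows requested from the replica in the previous round
        int[] countedAfterRound = { 100, 180 };   // assumed counter readings after each round

        for (int counted : countedAfterRound) {
            int lastFetched = counted - lastCounted;   // rows the last request actually produced
            lastCounted = counted;
            // Fewer rows than requested means the replica's partition is fully consumed.
            if (lastQueried > 0 && lastFetched < lastQueried) {
                System.out.println("Stop: asked for " + lastQueried + " rows, only got " + lastFetched);
                return;
            }
            lastQueried = 100;   // the real code uses min(count(), perPartitionCount())
            System.out.println("Requesting " + lastQueried + " extra rows (last round produced " + lastFetched + ")");
        }
    }
}

With the assumed readings, the first round asks for 100 extra rows; the second round yields only 80 of the 100 requested, so the loop stops, matching the lastFetched < lastQueried exit in moreContents().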

Example 2 with SinglePartitionReadCommand

use of org.apache.cassandra.db.SinglePartitionReadCommand in project cassandra by apache.

the class AbstractReadRepair method maybeSendAdditionalReads.

public void maybeSendAdditionalReads() {
    Preconditions.checkState(command instanceof SinglePartitionReadCommand, "maybeSendAdditionalReads can only be called for SinglePartitionReadCommand");
    DigestRepair<E, P> repair = digestRepair;
    if (repair == null)
        return;
    if (shouldSpeculate() && !repair.readCallback.await(cfs.sampleReadLatencyNanos, NANOSECONDS)) {
        Replica uncontacted = replicaPlan().firstUncontactedCandidate(replica -> true);
        if (uncontacted == null)
            return;
        replicaPlan.addToContacts(uncontacted);
        sendReadCommand(uncontacted, repair.readCallback, true, false);
        ReadRepairMetrics.speculatedRead.mark();
        ReadRepairDiagnostics.speculatedRead(this, uncontacted.endpoint(), replicaPlan());
    }
}
Also used : SinglePartitionReadCommand(org.apache.cassandra.db.SinglePartitionReadCommand) Replica(org.apache.cassandra.locator.Replica)
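
maybeSendAdditionalReads() only speculates when the original responses have not arrived within the sampled latency budget. The snippet below is a minimal, hypothetical stand-in for that wait-then-speculate pattern using a plain CountDownLatch; the latency budget and the replica plumbing are replaced by local variables and a print statement.

// Minimal sketch of the speculative-read pattern, assuming a 5 ms latency budget.
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class SpeculativeReadSketch {
    public static void main(String[] args) throws InterruptedException {
        CountDownLatch responses = new CountDownLatch(2);             // stand-in for the read callback
        long latencyBudgetNanos = TimeUnit.MILLISECONDS.toNanos(5);   // stand-in for cfs.sampleReadLatencyNanos

        // Nothing counts the latch down here, so the wait times out, as it would with slow replicas.
        boolean completed = responses.await(latencyBudgetNanos, TimeUnit.NANOSECONDS);
        if (!completed) {
            // This is where maybeSendAdditionalReads() picks an uncontacted replica,
            // adds it to the contact list and re-sends the read command to it.
            System.out.println("Latency budget exceeded, speculating a read against one more replica");
        }
    }
}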

Example 3 with SinglePartitionReadCommand

use of org.apache.cassandra.db.SinglePartitionReadCommand in project cassandra by apache.

the class StorageProxy method cas.

/**
 * Apply @param updates if and only if the current values in the row for @param key
 * match the provided @param conditions.  The algorithm is "raw" Paxos: that is, Paxos
 * minus leader election -- any node in the cluster may propose changes for any row,
 * and the row (not individual columns) is the unit of values being proposed.
 *
 * The Paxos cohort is only the replicas for the given key, not the entire cluster.
 * So we expect performance to be reasonable, but CAS is still intended to be used
 * "when you really need it," not for all your updates.
 *
 * There are three phases to Paxos:
 *  1. Prepare: the coordinator generates a ballot (timeUUID in our case) and asks replicas to (a) promise
 *     not to accept updates from older ballots and (b) tell us about the most recent update it has already
 *     accepted.
 *  2. Accept: if a majority of replicas respond, the coordinator asks replicas to accept the value of the
 *     highest proposal ballot it heard about, or a new value if no in-progress proposals were reported.
 *  3. Commit (Learn): if a majority of replicas acknowledge the accept request, we can commit the new
 *     value.
 *
 *  The commit procedure is not covered in "Paxos Made Simple," and is only briefly mentioned in "Paxos Made Live,"
 *  so here is our approach:
 *   3a. The coordinator sends a commit message to all replicas with the ballot and value.
 *   3b. Because of 1-2, this will be the highest-seen commit ballot.  The replicas will note that,
 *       and send it with subsequent promise replies.  This allows us to discard acceptance records
 *       for successfully committed replicas, without allowing incomplete proposals to commit erroneously
 *       later on.
 *
 *  Note that since we are performing a CAS rather than a simple update, we perform a read (of committed
 *  values) between the prepare and accept phases.  This gives us a slightly longer window for another
 *  coordinator to come along and trump our own promise with a newer one but is otherwise safe.
 *
 * @param keyspaceName the keyspace for the CAS
 * @param cfName the column family for the CAS
 * @param key the row key for the row to CAS
 * @param request the conditions for the CAS to apply as well as the update to perform if the conditions hold.
 * @param consistencyForPaxos the consistency for the paxos prepare and propose round. This can only be either SERIAL or LOCAL_SERIAL.
 * @param consistencyForCommit the consistency for write done during the commit phase. This can be anything, except SERIAL or LOCAL_SERIAL.
 *
 * @return null if the operation succeeds in updating the row, or the current values corresponding to the conditions
 * (since, if the CAS doesn't succeed, it means the current values do not match the conditions).
 */
public static RowIterator cas(String keyspaceName, String cfName, DecoratedKey key, CASRequest request, ConsistencyLevel consistencyForPaxos, ConsistencyLevel consistencyForCommit, ClientState state, int nowInSeconds, long queryStartNanoTime) throws UnavailableException, IsBootstrappingException, RequestFailureException, RequestTimeoutException, InvalidRequestException, CasWriteUnknownResultException {
    final long startTimeForMetrics = nanoTime();
    try {
        TableMetadata metadata = Schema.instance.validateTable(keyspaceName, cfName);
        if (DatabaseDescriptor.getPartitionDenylistEnabled() && DatabaseDescriptor.getDenylistWritesEnabled() && !partitionDenylist.isKeyPermitted(keyspaceName, cfName, key.getKey())) {
            denylistMetrics.incrementWritesRejected();
            throw new InvalidRequestException(String.format("Unable to CAS write to denylisted partition [0x%s] in %s/%s", key.toString(), keyspaceName, cfName));
        }
        Supplier<Pair<PartitionUpdate, RowIterator>> updateProposer = () -> {
            // read the current values and check they validate the conditions
            Tracing.trace("Reading existing values for CAS precondition");
            SinglePartitionReadCommand readCommand = (SinglePartitionReadCommand) request.readCommand(nowInSeconds);
            ConsistencyLevel readConsistency = consistencyForPaxos == ConsistencyLevel.LOCAL_SERIAL ? ConsistencyLevel.LOCAL_QUORUM : ConsistencyLevel.QUORUM;
            FilteredPartition current;
            try (RowIterator rowIter = readOne(readCommand, readConsistency, queryStartNanoTime)) {
                current = FilteredPartition.create(rowIter);
            }
            if (!request.appliesTo(current)) {
                Tracing.trace("CAS precondition does not match current values {}", current);
                casWriteMetrics.conditionNotMet.inc();
                return Pair.create(PartitionUpdate.emptyUpdate(metadata, key), current.rowIterator());
            }
            // Create the desired updates
            PartitionUpdate updates = request.makeUpdates(current, state);
            long size = updates.dataSize();
            casWriteMetrics.mutationSize.update(size);
            writeMetricsForLevel(consistencyForPaxos).mutationSize.update(size);
            // Apply triggers to cas updates. A consideration here is that
            // triggers emit Mutations, and so a given trigger implementation
            // may generate mutations for partitions other than the one this
            // paxos round is scoped for. In this case, TriggerExecutor will
            // validate that the generated mutations are targeted at the same
            // partition as the initial updates and reject (via an
            // InvalidRequestException) any which aren't.
            updates = TriggerExecutor.instance.execute(updates);
            return Pair.create(updates, null);
        };
        return doPaxos(metadata, key, consistencyForPaxos, consistencyForCommit, consistencyForCommit, queryStartNanoTime, casWriteMetrics, updateProposer);
    } catch (CasWriteUnknownResultException e) {
        casWriteMetrics.unknownResult.mark();
        throw e;
    } catch (CasWriteTimeoutException wte) {
        casWriteMetrics.timeouts.mark();
        writeMetricsForLevel(consistencyForPaxos).timeouts.mark();
        throw new CasWriteTimeoutException(wte.writeType, wte.consistency, wte.received, wte.blockFor, wte.contentions);
    } catch (ReadTimeoutException e) {
        casWriteMetrics.timeouts.mark();
        writeMetricsForLevel(consistencyForPaxos).timeouts.mark();
        throw e;
    } catch (ReadAbortException e) {
        casWriteMetrics.markAbort(e);
        writeMetricsForLevel(consistencyForPaxos).markAbort(e);
        throw e;
    } catch (WriteFailureException | ReadFailureException e) {
        casWriteMetrics.failures.mark();
        writeMetricsForLevel(consistencyForPaxos).failures.mark();
        throw e;
    } catch (UnavailableException e) {
        casWriteMetrics.unavailables.mark();
        writeMetricsForLevel(consistencyForPaxos).unavailables.mark();
        throw e;
    } finally {
        final long latency = nanoTime() - startTimeForMetrics;
        casWriteMetrics.addNano(latency);
        writeMetricsForLevel(consistencyForPaxos).addNano(latency);
    }
}
Also used : TableMetadata(org.apache.cassandra.schema.TableMetadata) ReadFailureException(org.apache.cassandra.exceptions.ReadFailureException) ReadTimeoutException(org.apache.cassandra.exceptions.ReadTimeoutException) SinglePartitionReadCommand(org.apache.cassandra.db.SinglePartitionReadCommand) UnavailableException(org.apache.cassandra.exceptions.UnavailableException) FilteredPartition(org.apache.cassandra.db.partitions.FilteredPartition) ReadAbortException(org.apache.cassandra.exceptions.ReadAbortException) CasWriteUnknownResultException(org.apache.cassandra.exceptions.CasWriteUnknownResultException) ConsistencyLevel(org.apache.cassandra.db.ConsistencyLevel) WriteFailureException(org.apache.cassandra.exceptions.WriteFailureException) RowIterator(org.apache.cassandra.db.rows.RowIterator) InvalidRequestException(org.apache.cassandra.exceptions.InvalidRequestException) CasWriteTimeoutException(org.apache.cassandra.exceptions.CasWriteTimeoutException) PartitionUpdate(org.apache.cassandra.db.partitions.PartitionUpdate) Pair(org.apache.cassandra.utils.Pair)
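
The updateProposer above follows a read-check-update shape: read the committed values at (LOCAL_)QUORUM, return the current row if the precondition fails, otherwise build the update that Paxos will propose. The sketch below mirrors that shape with plain maps; the class, method and column names are hypothetical stand-ins for CASRequest, FilteredPartition and PartitionUpdate.

// Hypothetical illustration of the condition check performed inside updateProposer.
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

public class CasConditionSketch {
    // Returns Optional.empty() when the condition held and the update was applied,
    // or the current row when it did not (mirroring cas() returning the current values).
    static Optional<Map<String, String>> applyIfMatches(Map<String, String> currentRow,
                                                        String column, String expected, String newValue) {
        if (!expected.equals(currentRow.get(column)))
            return Optional.of(currentRow);      // precondition does not match: report current values
        currentRow.put(column, newValue);        // precondition matches: apply the update
        return Optional.empty();                 // success, analogous to cas() returning null
    }

    public static void main(String[] args) {
        Map<String, String> row = new HashMap<>(Map.of("state", "pending"));
        System.out.println(applyIfMatches(row, "state", "pending", "done"));   // Optional.empty
        System.out.println(applyIfMatches(row, "state", "pending", "again"));  // Optional[{state=done}]
    }
}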

Example 4 with SinglePartitionReadCommand

use of org.apache.cassandra.db.SinglePartitionReadCommand in project cassandra by apache.

the class ViewBuilderTask method buildKey.

@SuppressWarnings("resource")
private void buildKey(DecoratedKey key) {
    ReadQuery selectQuery = view.getReadQuery();
    if (!selectQuery.selectsKey(key)) {
        logger.trace("Skipping {}, view query filters", key);
        return;
    }
    int nowInSec = FBUtilities.nowInSeconds();
    SinglePartitionReadCommand command = view.getSelectStatement().internalReadForView(key, nowInSec);
    // We're rebuilding everything from what's on disk, so we read everything, consider that as new updates
    // and pretend that there is nothing pre-existing.
    UnfilteredRowIterator empty = UnfilteredRowIterators.noRowsIterator(baseCfs.metadata(), key, Rows.EMPTY_STATIC_ROW, DeletionTime.LIVE, false);
    try (ReadExecutionController orderGroup = command.executionController();
        UnfilteredRowIterator data = UnfilteredPartitionIterators.getOnlyElement(command.executeLocally(orderGroup), command)) {
        Iterator<Collection<Mutation>> mutations = baseCfs.keyspace.viewManager.forTable(baseCfs.metadata.id).generateViewUpdates(Collections.singleton(view), data, empty, nowInSec, true);
        AtomicLong noBase = new AtomicLong(Long.MAX_VALUE);
        mutations.forEachRemaining(m -> StorageProxy.mutateMV(key.getKey(), m, true, noBase, nanoTime()));
    }
}
Also used : UnfilteredRowIterator(org.apache.cassandra.db.rows.UnfilteredRowIterator) ReadExecutionController(org.apache.cassandra.db.ReadExecutionController) AtomicLong(java.util.concurrent.atomic.AtomicLong) SinglePartitionReadCommand(org.apache.cassandra.db.SinglePartitionReadCommand) Collection(java.util.Collection) ReadQuery(org.apache.cassandra.db.ReadQuery)
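
buildKey() passes an empty UnfilteredRowIterator as the "existing" data so that every base row read from disk is treated as a brand-new update, and therefore regenerates a view mutation. The sketch below illustrates that diff-against-nothing idea with plain collections; the names and the string row representation are hypothetical.

// Hypothetical sketch: diffing base rows against an empty view state turns every row into an update.
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class ViewRebuildSketch {
    // Anything absent from the existing view state becomes a view update.
    static List<String> viewUpdates(List<String> baseRows, Map<String, String> existingViewState) {
        return baseRows.stream()
                       .filter(row -> !existingViewState.containsKey(row))
                       .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> baseRows = List.of("k1:r1", "k1:r2", "k1:r3");
        // During a rebuild the existing state is deliberately empty, so all three rows come back.
        System.out.println(viewUpdates(baseRows, Map.of()));
    }
}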

Example 5 with SinglePartitionReadCommand

use of org.apache.cassandra.db.SinglePartitionReadCommand in project cassandra by apache.

the class DigestResolverTest method transientResponseData.

@Test
public void transientResponseData() {
    SinglePartitionReadCommand command = SinglePartitionReadCommand.fullPartitionRead(cfm, nowInSec, dk);
    EndpointsForToken targetReplicas = EndpointsForToken.of(dk.getToken(), full(EP1), full(EP2), trans(EP3));
    DigestResolver<?, ?> resolver = new DigestResolver<>(command, plan(ConsistencyLevel.QUORUM, targetReplicas), 0);
    PartitionUpdate fullResponse = update(row(1000, 1, 1)).build();
    PartitionUpdate digestResponse = update(row(1000, 1, 1)).build();
    PartitionUpdate transientResponse = update(row(1000, 2, 2)).build();
    Assert.assertFalse(resolver.isDataPresent());
    Assert.assertFalse(resolver.hasTransientResponse());
    // A single full-data response from EP1 is enough for the resolver to report data present.
    resolver.preprocess(response(command, EP1, iter(fullResponse), false));
    Assert.assertTrue(resolver.isDataPresent());
    // EP2 (a full replica) answers with a digest only; EP3 (the transient replica) answers with data.
    resolver.preprocess(response(command, EP2, iter(digestResponse), true));
    resolver.preprocess(response(command, EP3, iter(transientResponse), false));
    Assert.assertTrue(resolver.hasTransientResponse());
    // The transient row (1000, 2, 2) is reconciled into the result alongside the full row (1000, 1, 1).
    assertPartitionsEqual(filter(iter(dk, row(1000, 1, 1), row(1000, 2, 2))), resolver.getData());
}
Also used : EndpointsForToken(org.apache.cassandra.locator.EndpointsForToken) SinglePartitionReadCommand(org.apache.cassandra.db.SinglePartitionReadCommand) PartitionUpdate(org.apache.cassandra.db.partitions.PartitionUpdate) Test(org.junit.Test)

Aggregations

SinglePartitionReadCommand (org.apache.cassandra.db.SinglePartitionReadCommand): 14 usages
PartitionUpdate (org.apache.cassandra.db.partitions.PartitionUpdate): 7 usages
Test (org.junit.Test): 7 usages
EndpointsForToken (org.apache.cassandra.locator.EndpointsForToken): 6 usages
ReadExecutionController (org.apache.cassandra.db.ReadExecutionController): 3 usages
UnfilteredRowIterator (org.apache.cassandra.db.rows.UnfilteredRowIterator): 3 usages
InvalidRequestException (org.apache.cassandra.exceptions.InvalidRequestException): 3 usages
TableMetadata (org.apache.cassandra.schema.TableMetadata): 3 usages
ConsistencyLevel (org.apache.cassandra.db.ConsistencyLevel): 2 usages
DecoratedKey (org.apache.cassandra.db.DecoratedKey): 2 usages
FilteredPartition (org.apache.cassandra.db.partitions.FilteredPartition): 2 usages
PartitionIterator (org.apache.cassandra.db.partitions.PartitionIterator): 2 usages
UnfilteredPartitionIterator (org.apache.cassandra.db.partitions.UnfilteredPartitionIterator): 2 usages
RowIterator (org.apache.cassandra.db.rows.RowIterator): 2 usages
CasWriteTimeoutException (org.apache.cassandra.exceptions.CasWriteTimeoutException): 2 usages
ReadAbortException (org.apache.cassandra.exceptions.ReadAbortException): 2 usages
ReadFailureException (org.apache.cassandra.exceptions.ReadFailureException): 2 usages
ReadTimeoutException (org.apache.cassandra.exceptions.ReadTimeoutException): 2 usages
UnavailableException (org.apache.cassandra.exceptions.UnavailableException): 2 usages
WriteFailureException (org.apache.cassandra.exceptions.WriteFailureException): 2 usages