Example 1 with ReadAbortException

Use of org.apache.cassandra.exceptions.ReadAbortException in project cassandra by apache.

From the class StorageProxy, the method cas:

/**
 * Apply @param updates if and only if the current values in the row for @param key
 * match the provided @param conditions.  The algorithm is "raw" Paxos: that is, Paxos
 * minus leader election -- any node in the cluster may propose changes for any row,
 * and the row (not single columns) is the unit of the values being proposed.
 *
 * The Paxos cohort is only the replicas for the given key, not the entire cluster.
 * So we expect performance to be reasonable, but CAS is still intended to be used
 * "when you really need it," not for all your updates.
 *
 * There are three phases to Paxos:
 *  1. Prepare: the coordinator generates a ballot (timeUUID in our case) and asks replicas to (a) promise
 *     not to accept updates from older ballots and (b) tell us about the most recent update it has already
 *     accepted.
 *  2. Accept: if a majority of replicas respond, the coordinator asks replicas to accept the value of the
 *     highest proposal ballot it heard about, or a new value if no in-progress proposals were reported.
 *  3. Commit (Learn): if a majority of replicas acknowledge the accept request, we can commit the new
 *     value.
 *
 *  Commit procedure is not covered in "Paxos Made Simple," and only briefly mentioned in "Paxos Made Live,"
 *  so here is our approach:
 *   3a. The coordinator sends a commit message to all replicas with the ballot and value.
 *   3b. Because of 1-2, this will be the highest-seen commit ballot.  The replicas will note that,
 *       and send it with subsequent promise replies.  This allows us to discard acceptance records
 *       for successfully committed replicas, without allowing incomplete proposals to commit erroneously
 *       later on.
 *
 *  Note that since we are performing a CAS rather than a simple update, we perform a read (of committed
 *  values) between the prepare and accept phases.  This gives us a slightly longer window for another
 *  coordinator to come along and trump our own promise with a newer one but is otherwise safe.
 *
 * @param keyspaceName the keyspace for the CAS
 * @param cfName the column family for the CAS
 * @param key the row key for the row to CAS
 * @param request the conditions for the CAS to apply as well as the update to perform if the conditions hold.
 * @param consistencyForPaxos the consistency for the paxos prepare and propose round. This can only be either SERIAL or LOCAL_SERIAL.
 * @param consistencyForCommit the consistency for write done during the commit phase. This can be anything, except SERIAL or LOCAL_SERIAL.
 *
 * @return null if the operation succeeds in updating the row, or the current values corresponding to the conditions
 * (since, if the CAS doesn't succeed, the current values do not match the conditions).
 */
public static RowIterator cas(String keyspaceName, String cfName, DecoratedKey key, CASRequest request, ConsistencyLevel consistencyForPaxos, ConsistencyLevel consistencyForCommit, ClientState state, int nowInSeconds, long queryStartNanoTime) throws UnavailableException, IsBootstrappingException, RequestFailureException, RequestTimeoutException, InvalidRequestException, CasWriteUnknownResultException {
    final long startTimeForMetrics = nanoTime();
    try {
        TableMetadata metadata = Schema.instance.validateTable(keyspaceName, cfName);
        if (DatabaseDescriptor.getPartitionDenylistEnabled() && DatabaseDescriptor.getDenylistWritesEnabled() && !partitionDenylist.isKeyPermitted(keyspaceName, cfName, key.getKey())) {
            denylistMetrics.incrementWritesRejected();
            throw new InvalidRequestException(String.format("Unable to CAS write to denylisted partition [0x%s] in %s/%s", key.toString(), keyspaceName, cfName));
        }
        Supplier<Pair<PartitionUpdate, RowIterator>> updateProposer = () -> {
            // read the current values and check that they satisfy the conditions
            Tracing.trace("Reading existing values for CAS precondition");
            SinglePartitionReadCommand readCommand = (SinglePartitionReadCommand) request.readCommand(nowInSeconds);
            ConsistencyLevel readConsistency = consistencyForPaxos == ConsistencyLevel.LOCAL_SERIAL ? ConsistencyLevel.LOCAL_QUORUM : ConsistencyLevel.QUORUM;
            FilteredPartition current;
            try (RowIterator rowIter = readOne(readCommand, readConsistency, queryStartNanoTime)) {
                current = FilteredPartition.create(rowIter);
            }
            if (!request.appliesTo(current)) {
                Tracing.trace("CAS precondition does not match current values {}", current);
                casWriteMetrics.conditionNotMet.inc();
                return Pair.create(PartitionUpdate.emptyUpdate(metadata, key), current.rowIterator());
            }
            // Create the desired updates
            PartitionUpdate updates = request.makeUpdates(current, state);
            long size = updates.dataSize();
            casWriteMetrics.mutationSize.update(size);
            writeMetricsForLevel(consistencyForPaxos).mutationSize.update(size);
            // Apply triggers to cas updates. A consideration here is that
            // triggers emit Mutations, and so a given trigger implementation
            // may generate mutations for partitions other than the one this
            // paxos round is scoped for. In this case, TriggerExecutor will
            // validate that the generated mutations are targeted at the same
            // partition as the initial updates and reject (via an
            // InvalidRequestException) any which aren't.
            updates = TriggerExecutor.instance.execute(updates);
            return Pair.create(updates, null);
        };
        return doPaxos(metadata, key, consistencyForPaxos, consistencyForCommit, consistencyForCommit, queryStartNanoTime, casWriteMetrics, updateProposer);
    } catch (CasWriteUnknownResultException e) {
        casWriteMetrics.unknownResult.mark();
        throw e;
    } catch (CasWriteTimeoutException wte) {
        casWriteMetrics.timeouts.mark();
        writeMetricsForLevel(consistencyForPaxos).timeouts.mark();
        throw new CasWriteTimeoutException(wte.writeType, wte.consistency, wte.received, wte.blockFor, wte.contentions);
    } catch (ReadTimeoutException e) {
        casWriteMetrics.timeouts.mark();
        writeMetricsForLevel(consistencyForPaxos).timeouts.mark();
        throw e;
    } catch (ReadAbortException e) {
        casWriteMetrics.markAbort(e);
        writeMetricsForLevel(consistencyForPaxos).markAbort(e);
        throw e;
    } catch (WriteFailureException | ReadFailureException e) {
        casWriteMetrics.failures.mark();
        writeMetricsForLevel(consistencyForPaxos).failures.mark();
        throw e;
    } catch (UnavailableException e) {
        casWriteMetrics.unavailables.mark();
        writeMetricsForLevel(consistencyForPaxos).unavailables.mark();
        throw e;
    } finally {
        final long latency = nanoTime() - startTimeForMetrics;
        casWriteMetrics.addNano(latency);
        writeMetricsForLevel(consistencyForPaxos).addNano(latency);
    }
}
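
The javadoc above describes the server side; a client reaches this code path with a lightweight transaction, i.e. a conditional write. Below is a minimal sketch using the DataStax Java driver 4.x (the keyspace ks, table users, and columns id/email are hypothetical):

import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.cql.ResultSet;
import com.datastax.oss.driver.api.core.cql.Row;

public class CasClientSketch {
    public static void main(String[] args) {
        try (CqlSession session = CqlSession.builder().build()) {
            // The IF clause turns this into a lightweight transaction, which the
            // coordinator serves via StorageProxy.cas: prepare/accept run at
            // SERIAL (or LOCAL_SERIAL), the commit at the statement's consistency.
            ResultSet rs = session.execute(
                    "UPDATE ks.users SET email = 'new@example.com' " +
                    "WHERE id = 42 IF email = 'old@example.com'");
            // wasApplied() == false corresponds to the non-null RowIterator that
            // StorageProxy.cas returns when the precondition fails: the result
            // row then carries the current values behind the condition.
            if (!rs.wasApplied()) {
                Row current = rs.one();
                System.out.println("CAS rejected; current email = " + current.getString("email"));
            }
        }
    }
}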
Also used: TableMetadata (org.apache.cassandra.schema.TableMetadata), ReadFailureException (org.apache.cassandra.exceptions.ReadFailureException), ReadTimeoutException (org.apache.cassandra.exceptions.ReadTimeoutException), SinglePartitionReadCommand (org.apache.cassandra.db.SinglePartitionReadCommand), UnavailableException (org.apache.cassandra.exceptions.UnavailableException), FilteredPartition (org.apache.cassandra.db.partitions.FilteredPartition), ReadAbortException (org.apache.cassandra.exceptions.ReadAbortException), CasWriteUnknownResultException (org.apache.cassandra.exceptions.CasWriteUnknownResultException), ConsistencyLevel (org.apache.cassandra.db.ConsistencyLevel), WriteFailureException (org.apache.cassandra.exceptions.WriteFailureException), RowIterator (org.apache.cassandra.db.rows.RowIterator), InvalidRequestException (org.apache.cassandra.exceptions.InvalidRequestException), CasWriteTimeoutException (org.apache.cassandra.exceptions.CasWriteTimeoutException), PartitionUpdate (org.apache.cassandra.db.partitions.PartitionUpdate), Pair (org.apache.cassandra.utils.Pair)

Example 2 with ReadAbortException

Use of org.apache.cassandra.exceptions.ReadAbortException in project cassandra by apache.

From the class StorageProxy, the method readRegular:

@SuppressWarnings("resource")
private static PartitionIterator readRegular(SinglePartitionReadCommand.Group group, ConsistencyLevel consistencyLevel, long queryStartNanoTime) throws UnavailableException, ReadFailureException, ReadTimeoutException {
    long start = nanoTime();
    try {
        PartitionIterator result = fetchRows(group.queries, consistencyLevel, queryStartNanoTime);
        // Note that the only difference between the commands in a group must be the partition key on which
        // they apply.
        boolean enforceStrictLiveness = group.queries.get(0).metadata().enforceStrictLiveness();
        // If we have more than one command, then despite each read command honoring the limit, the total result
        // might not honor it and so we should enforce it
        if (group.queries.size() > 1)
            result = group.limits().filter(result, group.nowInSec(), group.selectsFullPartition(), enforceStrictLiveness);
        return result;
    } catch (UnavailableException e) {
        readMetrics.unavailables.mark();
        readMetricsForLevel(consistencyLevel).unavailables.mark();
        logRequestException(e, group.queries);
        throw e;
    } catch (ReadTimeoutException e) {
        readMetrics.timeouts.mark();
        readMetricsForLevel(consistencyLevel).timeouts.mark();
        logRequestException(e, group.queries);
        throw e;
    } catch (ReadAbortException e) {
        recordReadRegularAbort(consistencyLevel, e);
        throw e;
    } catch (ReadFailureException e) {
        readMetrics.failures.mark();
        readMetricsForLevel(consistencyLevel).failures.mark();
        throw e;
    } finally {
        long latency = nanoTime() - start;
        readMetrics.addNano(latency);
        readMetricsForLevel(consistencyLevel).addNano(latency);
        // TODO avoid giving every command the same latency number.  Can fix this in CASSANDRA-5329
        for (ReadCommand command : group.queries) Keyspace.openAndGetStore(command.metadata()).metric.coordinatorReadLatency.update(latency, TimeUnit.NANOSECONDS);
    }
}
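
Worth noting is the error-handling discipline all three examples share: each catch marks a failure-kind meter and rethrows the exception unchanged, while the finally block records latency so that every outcome, success or failure, is timed. Here is a stripped-down sketch of that pattern built on the Dropwizard Metrics classes (com.codahale.metrics) that Cassandra's client-request metrics wrap; doRead and the metric names are hypothetical stand-ins:

import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

import com.codahale.metrics.Meter;
import com.codahale.metrics.MetricRegistry;
import com.codahale.metrics.Timer;

public class MarkAndRethrowSketch {
    private static final MetricRegistry registry = new MetricRegistry();
    private static final Meter timeouts = registry.meter("read.timeouts");
    private static final Timer latency = registry.timer("read.latency");

    public static String read() throws TimeoutException {
        long start = System.nanoTime();
        try {
            return doRead();
        } catch (TimeoutException e) {
            timeouts.mark(); // count the failure by kind...
            throw e;         // ...but let the caller see the original exception
        } finally {
            // runs on success and failure alike, so the timer covers every
            // outcome, exactly like the finally block in readRegular above
            latency.update(System.nanoTime() - start, TimeUnit.NANOSECONDS);
        }
    }

    private static String doRead() throws TimeoutException {
        return "row"; // hypothetical stand-in for the actual replica read
    }
}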
Also used: ReadFailureException (org.apache.cassandra.exceptions.ReadFailureException), ReadTimeoutException (org.apache.cassandra.exceptions.ReadTimeoutException), UnfilteredPartitionIterator (org.apache.cassandra.db.partitions.UnfilteredPartitionIterator), PartitionIterator (org.apache.cassandra.db.partitions.PartitionIterator), UnavailableException (org.apache.cassandra.exceptions.UnavailableException), SinglePartitionReadCommand (org.apache.cassandra.db.SinglePartitionReadCommand), PartitionRangeReadCommand (org.apache.cassandra.db.PartitionRangeReadCommand), ReadCommand (org.apache.cassandra.db.ReadCommand), ReadAbortException (org.apache.cassandra.exceptions.ReadAbortException)

Example 3 with ReadAbortException

Use of org.apache.cassandra.exceptions.ReadAbortException in project cassandra by apache.

From the class StorageProxy, the method readWithPaxos:

private static PartitionIterator readWithPaxos(SinglePartitionReadCommand.Group group, ConsistencyLevel consistencyLevel, ClientState state, long queryStartNanoTime) throws InvalidRequestException, UnavailableException, ReadFailureException, ReadTimeoutException {
    assert state != null;
    if (group.queries.size() > 1)
        throw new InvalidRequestException("SERIAL/LOCAL_SERIAL consistency may only be requested for one partition at a time");
    long start = nanoTime();
    SinglePartitionReadCommand command = group.queries.get(0);
    TableMetadata metadata = command.metadata();
    DecoratedKey key = command.partitionKey();
    // calculate the blockFor before repairing any paxos round, to avoid the replication strategy being altered in between.
    int blockForRead = consistencyLevel.blockFor(Keyspace.open(metadata.keyspace).getReplicationStrategy());
    PartitionIterator result = null;
    try {
        final ConsistencyLevel consistencyForReplayCommitsOrFetch = consistencyLevel == ConsistencyLevel.LOCAL_SERIAL ? ConsistencyLevel.LOCAL_QUORUM : ConsistencyLevel.QUORUM;
        try {
            // Commit an empty update to make sure all in-progress updates that should be finished first are, _and_
            // that no other in-progress update can get resurrected.
            Supplier<Pair<PartitionUpdate, RowIterator>> updateProposer = Paxos.getPaxosVariant() == Config.PaxosVariant.v1_without_linearizable_reads ? () -> null : () -> Pair.create(PartitionUpdate.emptyUpdate(metadata, key), null);
            // When replaying, we commit at quorum/local quorum, as we want to be sure the following read (done at
            // quorum/local_quorum) sees any replayed updates. Our own update is however empty, and those don't even
            // get committed due to an optimization described in doPaxos/beginAndRepairPaxos, so the commit
            // consistency is irrelevant (we use ANY just to emphasize that we don't wait on our commit).
            doPaxos(metadata, key, consistencyLevel, consistencyForReplayCommitsOrFetch, ConsistencyLevel.ANY, start, casReadMetrics, updateProposer);
        } catch (WriteTimeoutException e) {
            throw new ReadTimeoutException(consistencyLevel, 0, blockForRead, false);
        } catch (WriteFailureException e) {
            throw new ReadFailureException(consistencyLevel, e.received, e.blockFor, false, e.failureReasonByEndpoint);
        }
        result = fetchRows(group.queries, consistencyForReplayCommitsOrFetch, queryStartNanoTime);
    } catch (UnavailableException e) {
        readMetrics.unavailables.mark();
        casReadMetrics.unavailables.mark();
        readMetricsForLevel(consistencyLevel).unavailables.mark();
        logRequestException(e, group.queries);
        throw e;
    } catch (ReadTimeoutException e) {
        readMetrics.timeouts.mark();
        casReadMetrics.timeouts.mark();
        readMetricsForLevel(consistencyLevel).timeouts.mark();
        logRequestException(e, group.queries);
        throw e;
    } catch (ReadAbortException e) {
        readMetrics.markAbort(e);
        casReadMetrics.markAbort(e);
        readMetricsForLevel(consistencyLevel).markAbort(e);
        throw e;
    } catch (ReadFailureException e) {
        readMetrics.failures.mark();
        casReadMetrics.failures.mark();
        readMetricsForLevel(consistencyLevel).failures.mark();
        throw e;
    } finally {
        long latency = nanoTime() - start;
        readMetrics.addNano(latency);
        casReadMetrics.addNano(latency);
        readMetricsForLevel(consistencyLevel).addNano(latency);
        Keyspace.open(metadata.keyspace).getColumnFamilyStore(metadata.name).metric.coordinatorReadLatency.update(latency, TimeUnit.NANOSECONDS);
    }
    return result;
}
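
A client reaches readWithPaxos by reading a single partition at SERIAL or LOCAL_SERIAL consistency, which triggers the empty-update Paxos round described above before the quorum data read. A minimal sketch using the DataStax Java driver 4.x (keyspace, table, and column names are hypothetical):

import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.DefaultConsistencyLevel;
import com.datastax.oss.driver.api.core.cql.Row;
import com.datastax.oss.driver.api.core.cql.SimpleStatement;

public class SerialReadSketch {
    public static void main(String[] args) {
        try (CqlSession session = CqlSession.builder().build()) {
            // A SERIAL read is linearizable: the coordinator first replays any
            // in-progress Paxos proposal for the partition (readWithPaxos above),
            // then performs the data read at QUORUM/LOCAL_QUORUM. Note that it
            // must target a single partition, per the check in readWithPaxos.
            SimpleStatement stmt =
                    SimpleStatement.builder("SELECT email FROM ks.users WHERE id = 42")
                                   .setConsistencyLevel(DefaultConsistencyLevel.SERIAL)
                                   .build();
            Row row = session.execute(stmt).one();
            if (row != null)
                System.out.println("email = " + row.getString("email"));
        }
    }
}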
Also used: TableMetadata (org.apache.cassandra.schema.TableMetadata), ReadFailureException (org.apache.cassandra.exceptions.ReadFailureException), ReadTimeoutException (org.apache.cassandra.exceptions.ReadTimeoutException), SinglePartitionReadCommand (org.apache.cassandra.db.SinglePartitionReadCommand), DecoratedKey (org.apache.cassandra.db.DecoratedKey), UnavailableException (org.apache.cassandra.exceptions.UnavailableException), ReadAbortException (org.apache.cassandra.exceptions.ReadAbortException), Hint (org.apache.cassandra.hints.Hint), ConsistencyLevel (org.apache.cassandra.db.ConsistencyLevel), CasWriteTimeoutException (org.apache.cassandra.exceptions.CasWriteTimeoutException), WriteTimeoutException (org.apache.cassandra.exceptions.WriteTimeoutException), WriteFailureException (org.apache.cassandra.exceptions.WriteFailureException), UnfilteredPartitionIterator (org.apache.cassandra.db.partitions.UnfilteredPartitionIterator), PartitionIterator (org.apache.cassandra.db.partitions.PartitionIterator), InvalidRequestException (org.apache.cassandra.exceptions.InvalidRequestException), Pair (org.apache.cassandra.utils.Pair)

Aggregations

SinglePartitionReadCommand (org.apache.cassandra.db.SinglePartitionReadCommand): 3 uses
ReadAbortException (org.apache.cassandra.exceptions.ReadAbortException): 3 uses
ReadFailureException (org.apache.cassandra.exceptions.ReadFailureException): 3 uses
ReadTimeoutException (org.apache.cassandra.exceptions.ReadTimeoutException): 3 uses
UnavailableException (org.apache.cassandra.exceptions.UnavailableException): 3 uses
ConsistencyLevel (org.apache.cassandra.db.ConsistencyLevel): 2 uses
PartitionIterator (org.apache.cassandra.db.partitions.PartitionIterator): 2 uses
UnfilteredPartitionIterator (org.apache.cassandra.db.partitions.UnfilteredPartitionIterator): 2 uses
CasWriteTimeoutException (org.apache.cassandra.exceptions.CasWriteTimeoutException): 2 uses
InvalidRequestException (org.apache.cassandra.exceptions.InvalidRequestException): 2 uses
WriteFailureException (org.apache.cassandra.exceptions.WriteFailureException): 2 uses
TableMetadata (org.apache.cassandra.schema.TableMetadata): 2 uses
Pair (org.apache.cassandra.utils.Pair): 2 uses
DecoratedKey (org.apache.cassandra.db.DecoratedKey): 1 use
PartitionRangeReadCommand (org.apache.cassandra.db.PartitionRangeReadCommand): 1 use
ReadCommand (org.apache.cassandra.db.ReadCommand): 1 use
FilteredPartition (org.apache.cassandra.db.partitions.FilteredPartition): 1 use
PartitionUpdate (org.apache.cassandra.db.partitions.PartitionUpdate): 1 use
RowIterator (org.apache.cassandra.db.rows.RowIterator): 1 use
CasWriteUnknownResultException (org.apache.cassandra.exceptions.CasWriteUnknownResultException): 1 use