Example 6 with Commit

use of org.apache.cassandra.service.paxos.Commit in project cassandra by apache.

the class StorageProxy method cas.

/**
     * Apply the updates in {@code request} if and only if the current values in the row for {@code key}
     * match the conditions in {@code request}.  The algorithm is "raw" Paxos: that is, Paxos
     * minus leader election -- any node in the cluster may propose changes for any row,
     * which (that is, the row) is the unit of values being proposed, not single columns.
     *
     * The Paxos cohort is only the replicas for the given key, not the entire cluster.
     * So we expect performance to be reasonable, but CAS is still intended to be used
     * "when you really need it," not for all your updates.
     *
     * There are three phases to Paxos:
     *  1. Prepare: the coordinator generates a ballot (timeUUID in our case) and asks replicas to (a) promise
     *     not to accept updates from older ballots and (b) report the most recent update they have already
     *     accepted.
     *  2. Accept: if a majority of replicas reply, the coordinator asks replicas to accept the value of the
     *     highest proposal ballot it heard about, or a new value if no in-progress proposals were reported.
     *  3. Commit (Learn): if a majority of replicas acknowledge the accept request, we can commit the new
     *     value.
     *
     *  Commit procedure is not covered in "Paxos Made Simple," and only briefly mentioned in "Paxos Made Live,"
     *  so here is our approach:
     *   3a. The coordinator sends a commit message to all replicas with the ballot and value.
     *   3b. Because of 1-2, this will be the highest-seen commit ballot.  The replicas will note that,
     *       and send it with subsequent promise replies.  This allows us to discard acceptance records
     *       for successfully committed replicas, without allowing incomplete proposals to commit erroneously
     *       later on.
     *
     *  Note that since we are performing a CAS rather than a simple update, we perform a read (of committed
     *  values) between the prepare and accept phases.  This gives us a slightly longer window for another
     *  coordinator to come along and trump our own promise with a newer one but is otherwise safe.
     *
     * @param keyspaceName the keyspace for the CAS
     * @param cfName the column family for the CAS
     * @param key the row key for the row to CAS
     * @param request the conditions for the CAS to apply as well as the update to perform if the conditions hold.
     * @param consistencyForPaxos the consistency for the paxos prepare and propose round. This can only be either SERIAL or LOCAL_SERIAL.
     * @param consistencyForCommit the consistency for write done during the commit phase. This can be anything, except SERIAL or LOCAL_SERIAL.
     * @param state the client state of the request
     * @param queryStartNanoTime the value of System.nanoTime() when the query started to be processed
     *
     * @return null if the operation succeeds in updating the row, or the current values corresponding to the conditions
     * (if the CAS does not succeed, the current values do not match the conditions).
     */
public static RowIterator cas(String keyspaceName, String cfName, DecoratedKey key, CASRequest request, ConsistencyLevel consistencyForPaxos, ConsistencyLevel consistencyForCommit, ClientState state, long queryStartNanoTime) throws UnavailableException, IsBootstrappingException, RequestFailureException, RequestTimeoutException, InvalidRequestException {
    final long startTimeForMetrics = System.nanoTime();
    int contentions = 0;
    try {
        consistencyForPaxos.validateForCas();
        consistencyForCommit.validateForCasCommit(keyspaceName);
        TableMetadata metadata = Schema.instance.getTableMetadata(keyspaceName, cfName);
        long timeout = TimeUnit.MILLISECONDS.toNanos(DatabaseDescriptor.getCasContentionTimeout());
        while (System.nanoTime() - queryStartNanoTime < timeout) {
            // for simplicity, we'll do a single liveness check at the start of each attempt
            Pair<List<InetAddress>, Integer> p = getPaxosParticipants(metadata, key, consistencyForPaxos);
            List<InetAddress> liveEndpoints = p.left;
            int requiredParticipants = p.right;
            final Pair<UUID, Integer> pair = beginAndRepairPaxos(queryStartNanoTime, key, metadata, liveEndpoints, requiredParticipants, consistencyForPaxos, consistencyForCommit, true, state);
            final UUID ballot = pair.left;
            contentions += pair.right;
            // read the current values and check they validate the conditions
            Tracing.trace("Reading existing values for CAS precondition");
            SinglePartitionReadCommand readCommand = request.readCommand(FBUtilities.nowInSeconds());
            ConsistencyLevel readConsistency = consistencyForPaxos == ConsistencyLevel.LOCAL_SERIAL ? ConsistencyLevel.LOCAL_QUORUM : ConsistencyLevel.QUORUM;
            FilteredPartition current;
            try (RowIterator rowIter = readOne(readCommand, readConsistency, queryStartNanoTime)) {
                current = FilteredPartition.create(rowIter);
            }
            if (!request.appliesTo(current)) {
                Tracing.trace("CAS precondition does not match current values {}", current);
                casWriteMetrics.conditionNotMet.inc();
                return current.rowIterator();
            }
            // finish the paxos round w/ the desired updates
            // TODO turn null updates into delete?
            PartitionUpdate updates = request.makeUpdates(current);
            long size = updates.dataSize();
            casWriteMetrics.mutationSize.update(size);
            writeMetricsMap.get(consistencyForPaxos).mutationSize.update(size);
            // Apply triggers to cas updates. A consideration here is that
            // triggers emit Mutations, and so a given trigger implementation
            // may generate mutations for partitions other than the one this
            // paxos round is scoped for. In this case, TriggerExecutor will
            // validate that the generated mutations are targeted at the same
            // partition as the initial updates and reject (via an
            // InvalidRequestException) any which aren't.
            updates = TriggerExecutor.instance.execute(updates);
            Commit proposal = Commit.newProposal(ballot, updates);
            Tracing.trace("CAS precondition is met; proposing client-requested updates for {}", ballot);
            if (proposePaxos(proposal, liveEndpoints, requiredParticipants, true, consistencyForPaxos, queryStartNanoTime)) {
                commitPaxos(proposal, consistencyForCommit, true, queryStartNanoTime);
                Tracing.trace("CAS successful");
                return null;
            }
            Tracing.trace("Paxos proposal not accepted (pre-empted by a higher ballot)");
            contentions++;
            Uninterruptibles.sleepUninterruptibly(ThreadLocalRandom.current().nextInt(100), TimeUnit.MILLISECONDS);
            // continue to retry
        }
        throw new WriteTimeoutException(WriteType.CAS, consistencyForPaxos, 0, consistencyForPaxos.blockFor(Keyspace.open(keyspaceName)));
    } catch (WriteTimeoutException | ReadTimeoutException e) {
        casWriteMetrics.timeouts.mark();
        writeMetricsMap.get(consistencyForPaxos).timeouts.mark();
        throw e;
    } catch (WriteFailureException | ReadFailureException e) {
        casWriteMetrics.failures.mark();
        writeMetricsMap.get(consistencyForPaxos).failures.mark();
        throw e;
    } catch (UnavailableException e) {
        casWriteMetrics.unavailables.mark();
        writeMetricsMap.get(consistencyForPaxos).unavailables.mark();
        throw e;
    } finally {
        recordCasContention(contentions);
        final long latency = System.nanoTime() - startTimeForMetrics;
        casWriteMetrics.addNano(latency);
        writeMetricsMap.get(consistencyForPaxos).addNano(latency);
    }
}
Also used : TableMetadata(org.apache.cassandra.schema.TableMetadata) Hint(org.apache.cassandra.hints.Hint) AtomicInteger(java.util.concurrent.atomic.AtomicInteger) Commit(org.apache.cassandra.service.paxos.Commit) RowIterator(org.apache.cassandra.db.rows.RowIterator) InetAddress(java.net.InetAddress)
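
The three phases described in the Javadoc above (prepare, accept, commit) can be illustrated with a small in-memory model. This is a minimal sketch under stated assumptions, not Cassandra's implementation: PaxosSketch, Acceptor and propose are hypothetical names, ballots are plain long values rather than timeUUIDs, there is no networking or failure handling, and the CAS read between prepare and accept is left out.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory model of one Paxos round; none of these names are Cassandra's.
final class PaxosSketch {
    /** Per-row Paxos state held by one replica. */
    static final class Acceptor {
        long promised = -1;        // highest ballot this replica has promised
        long acceptedBallot = -1;  // ballot of an accepted, not yet committed proposal
        String acceptedValue;      // value of that proposal, or null if none
        String committed;          // most recently committed value

        // Phase 1: promise only if the ballot is newer than every earlier promise.
        boolean prepare(long ballot) {
            if (ballot <= promised) return false;
            promised = ballot;
            return true;
        }

        // Phase 2: accept unless a newer ballot has been promised in the meantime.
        boolean accept(long ballot, String value) {
            if (ballot < promised) return false;
            promised = ballot;
            acceptedBallot = ballot;
            acceptedValue = value;
            return true;
        }

        // Phase 3: learn the value and drop the acceptance record it supersedes.
        void commit(long ballot, String value) {
            committed = value;
            if (acceptedBallot <= ballot) {
                acceptedBallot = -1;
                acceptedValue = null;
            }
        }
    }

    /** Returns the committed value (ours, or an in-progress one we finished), or null if pre-empted. */
    static String propose(List<Acceptor> replicas, long ballot, String value) {
        int quorum = replicas.size() / 2 + 1;

        // Phase 1: gather promises and remember the newest in-progress proposal, if any.
        List<Acceptor> promisers = new ArrayList<>();
        long newestBallot = -1;
        String inProgress = null;
        for (Acceptor a : replicas) {
            if (!a.prepare(ballot)) continue;
            promisers.add(a);
            if (a.acceptedValue != null && a.acceptedBallot > newestBallot) {
                newestBallot = a.acceptedBallot;
                inProgress = a.acceptedValue;
            }
        }
        if (promisers.size() < quorum) return null;

        // Phase 2: propose the in-progress value if one was reported, else our own.
        String chosen = (inProgress != null) ? inProgress : value;
        int accepts = 0;
        for (Acceptor a : promisers) {
            if (a.accept(ballot, chosen)) accepts++;
        }
        if (accepts < quorum) return null;

        // Phase 3: a majority accepted, so tell every replica to commit.
        for (Acceptor a : replicas) a.commit(ballot, chosen);
        return chosen;
    }

    public static void main(String[] args) {
        List<Acceptor> replicas = new ArrayList<>();
        for (int i = 0; i < 3; i++) replicas.add(new Acceptor());
        System.out.println(propose(replicas, 1, "a"));
        System.out.println(propose(replicas, 1, "b")); // reuses a stale ballot
    }
}
```

In this model a coordinator that finds an unfinished proposal completes that proposal instead of its own. The cas method above instead handles that case inside beginAndRepairPaxos before it proposes the client's updates.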

Example 7 with Commit

use of org.apache.cassandra.service.paxos.Commit in project cassandra by apache.

the class StorageProxy method proposePaxos.

private static boolean proposePaxos(Commit proposal, List<InetAddress> endpoints, int requiredParticipants, boolean timeoutIfPartial, ConsistencyLevel consistencyLevel, long queryStartNanoTime) throws WriteTimeoutException {
    ProposeCallback callback = new ProposeCallback(endpoints.size(), requiredParticipants, !timeoutIfPartial, consistencyLevel, queryStartNanoTime);
    MessageOut<Commit> message = new MessageOut<Commit>(MessagingService.Verb.PAXOS_PROPOSE, proposal, Commit.serializer);
    for (InetAddress target : endpoints) {
        MessagingService.instance().sendRR(message, target, callback);
    }
    callback.await();
    if (callback.isSuccessful())
        return true;
    if (timeoutIfPartial && !callback.isFullyRefused())
        throw new WriteTimeoutException(WriteType.CAS, consistencyLevel, callback.getAcceptCount(), requiredParticipants);
    return false;
}
Also used : Commit(org.apache.cassandra.service.paxos.Commit) ProposeCallback(org.apache.cassandra.service.paxos.ProposeCallback) InetAddress(java.net.InetAddress)
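
proposePaxos ends in one of three ways: it returns true, returns false, or throws a WriteTimeoutException. The decision can be written as a small pure function. This is a minimal sketch that assumes plain counters in place of ProposeCallback. The Outcome enum and classify are hypothetical names, and "fully refused" is simplified here to "no replica accepted", whereas the real callback may also track which responses have arrived. Reporting a partial acceptance as a timeout rather than a failure presumably reflects that such a proposal could still be completed by a later round.

```java
// Hypothetical model of how proposePaxos maps accept counts to an outcome.
final class ProposeOutcomeSketch {
    enum Outcome {
        ACCEPTED,  // proposePaxos returns true
        REFUSED,   // proposePaxos returns false, so the caller may retry with a new ballot
        UNKNOWN    // proposePaxos throws WriteTimeoutException
    }

    static Outcome classify(int accepts, int requiredParticipants, boolean timeoutIfPartial) {
        if (accepts >= requiredParticipants) {
            return Outcome.ACCEPTED;
        }
        // Assumption: "fully refused" is modelled as zero accepts.
        boolean fullyRefused = (accepts == 0);
        if (timeoutIfPartial && !fullyRefused) {
            return Outcome.UNKNOWN;
        }
        return Outcome.REFUSED;
    }

    public static void main(String[] args) {
        System.out.println(classify(2, 2, true));
        System.out.println(classify(1, 2, true));
        System.out.println(classify(0, 2, true));
    }
}
```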

Example 8 with Commit

use of org.apache.cassandra.service.paxos.Commit in project cassandra by apache.

the class StorageProxy method preparePaxos.

private static PrepareCallback preparePaxos(Commit toPrepare, List<InetAddress> endpoints, int requiredParticipants, ConsistencyLevel consistencyForPaxos, long queryStartNanoTime) throws WriteTimeoutException {
    PrepareCallback callback = new PrepareCallback(toPrepare.update.partitionKey(), toPrepare.update.metadata(), requiredParticipants, consistencyForPaxos, queryStartNanoTime);
    MessageOut<Commit> message = new MessageOut<Commit>(MessagingService.Verb.PAXOS_PREPARE, toPrepare, Commit.serializer);
    for (InetAddress target : endpoints) {
        MessagingService.instance().sendRR(message, target, callback);
    }
    callback.await();
    return callback;
}
Also used : Commit(org.apache.cassandra.service.paxos.Commit) PrepareCallback(org.apache.cassandra.service.paxos.PrepareCallback) InetAddress(java.net.InetAddress)

Example 9 with Commit

use of org.apache.cassandra.service.paxos.Commit in project cassandra by apache.

the class SystemKeyspace method loadPaxosState.

public static PaxosState loadPaxosState(DecoratedKey key, TableMetadata metadata, int nowInSec) {
    String req = "SELECT * FROM system.%s WHERE row_key = ? AND cf_id = ?";
    UntypedResultSet results = QueryProcessor.executeInternalWithNow(nowInSec, nanoTime(), format(req, PAXOS), key.getKey(), metadata.id.asUUID());
    if (results.isEmpty())
        return new PaxosState(key, metadata);
    UntypedResultSet.Row row = results.one();
    Commit promised = row.has("in_progress_ballot") ? new Commit(row.getUUID("in_progress_ballot"), new PartitionUpdate.Builder(metadata, key, metadata.regularAndStaticColumns(), 1).build()) : Commit.emptyCommit(key, metadata);
    // either we have both a recently accepted ballot and update or we have neither
    Commit accepted = row.has("proposal_version") && row.has("proposal") ? new Commit(row.getUUID("proposal_ballot"), PartitionUpdate.fromBytes(row.getBytes("proposal"), row.getInt("proposal_version"))) : Commit.emptyCommit(key, metadata);
    // either most_recent_commit and most_recent_commit_at will both be set, or neither
    Commit mostRecent = row.has("most_recent_commit_version") && row.has("most_recent_commit") ? new Commit(row.getUUID("most_recent_commit_at"), PartitionUpdate.fromBytes(row.getBytes("most_recent_commit"), row.getInt("most_recent_commit_version"))) : Commit.emptyCommit(key, metadata);
    return new PaxosState(promised, accepted, mostRecent);
}
Also used : UntypedResultSet(org.apache.cassandra.cql3.UntypedResultSet) PaxosState(org.apache.cassandra.service.paxos.PaxosState) Commit(org.apache.cassandra.service.paxos.Commit) PartitionUpdate(org.apache.cassandra.db.partitions.PartitionUpdate)

Example 10 with Commit

use of org.apache.cassandra.service.paxos.Commit in project cassandra by apache.

the class ModificationStatement method casInternal.

static RowIterator casInternal(ClientState state, CQL3CasRequest request, long timestamp, int nowInSeconds) {
    UUID ballot = UUIDGen.getTimeUUIDFromMicros(timestamp);
    SinglePartitionReadQuery readCommand = request.readCommand(nowInSeconds);
    FilteredPartition current;
    try (ReadExecutionController executionController = readCommand.executionController();
        PartitionIterator iter = readCommand.executeInternal(executionController)) {
        current = FilteredPartition.create(PartitionIterators.getOnlyElement(iter, readCommand));
    }
    if (!request.appliesTo(current))
        return current.rowIterator();
    PartitionUpdate updates = request.makeUpdates(current, state);
    updates = TriggerExecutor.instance.execute(updates);
    Commit proposal = Commit.newProposal(ballot, updates);
    proposal.makeMutation().apply();
    return null;
}
Also used : Commit(org.apache.cassandra.service.paxos.Commit)
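
casInternal follows a read, check, apply pattern on a single node, with no Paxos round: read the current partition, return it if the conditions do not hold, otherwise apply the updates and return null. The sketch below shows the same control flow on plain maps. It is a sketch under assumptions, not Cassandra code: CasLocalSketch and casLocal are hypothetical names, a row is modelled as a map from column name to value, and a condition whose value is null means "the column must be absent".

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Hypothetical single-node compare-and-set over an in-memory table.
final class CasLocalSketch {
    /**
     * Applies updates to the row for key only if every condition matches the current row.
     * Returns null when the updates were applied, otherwise a copy of the current row.
     */
    static Map<String, String> casLocal(Map<String, Map<String, String>> table,
                                        String key,
                                        Map<String, String> conditions,
                                        Map<String, String> updates) {
        // Read: a missing row is treated as an empty one.
        Map<String, String> current = table.getOrDefault(key, Collections.<String, String>emptyMap());

        // Check: every condition must equal the current column value.
        for (Map.Entry<String, String> c : conditions.entrySet()) {
            if (!Objects.equals(current.get(c.getKey()), c.getValue())) {
                return new HashMap<>(current);
            }
        }

        // Apply: merge the updates into a fresh copy of the row.
        Map<String, String> merged = new HashMap<>(current);
        merged.putAll(updates);
        table.put(key, merged);
        return null;
    }

    public static void main(String[] args) {
        Map<String, Map<String, String>> table = new HashMap<>();
        // Insert only if column "v" is absent.
        casLocal(table, "k", Collections.singletonMap("v", (String) null), Collections.singletonMap("v", "1"));
        // This condition no longer holds, so the current row is returned instead.
        System.out.println(casLocal(table, "k", Collections.singletonMap("v", "0"), Collections.singletonMap("v", "2")));
    }
}
```

Returning null on success keeps the convention used by cas and casInternal above, where a non-null result carries the values that failed the condition.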

Aggregations

Commit (org.apache.cassandra.service.paxos.Commit): 10
InetAddress (java.net.InetAddress): 4
PartitionUpdate (org.apache.cassandra.db.partitions.PartitionUpdate): 3
Hint (org.apache.cassandra.hints.Hint): 2
TableMetadata (org.apache.cassandra.schema.TableMetadata): 2
PaxosState (org.apache.cassandra.service.paxos.PaxosState): 2
PrepareCallback (org.apache.cassandra.service.paxos.PrepareCallback): 2
ByteBuffer (java.nio.ByteBuffer): 1
UUID (java.util.UUID): 1
AtomicInteger (java.util.concurrent.atomic.AtomicInteger): 1
UntypedResultSet (org.apache.cassandra.cql3.UntypedResultSet): 1
Mutation (org.apache.cassandra.db.Mutation): 1
SystemKeyspace.loadPaxosState (org.apache.cassandra.db.SystemKeyspace.loadPaxosState): 1
RowIterator (org.apache.cassandra.db.rows.RowIterator): 1
ProposeCallback (org.apache.cassandra.service.paxos.ProposeCallback): 1
Test (org.junit.Test): 1