Example 1 with Matrix

use of org.apache.spark.mllib.linalg.Matrix in project gatk by broadinstitute.

the class PCATangentNormalizationUtils method tangentNormalizeSpark.

/**
 * Tangent normalize given the raw PoN data using Spark: the code here is a little more complex for optimization purposes.
 *
 *  Please see notes in docs/PoN ...
 *
 *  Ahat^T = (C^T P^T) A^T
 *  Therefore, C^T is the RowMatrix
 *
 *  pinv: P
 *  panel: A
 *  projection: Ahat
 *  cases: C
 *  betahat: C^T P^T
 *  tangentNormalizedCounts: C - Ahat
 */
private static PCATangentNormalizationResult tangentNormalizeSpark(final ReadCountCollection targetFactorNormalizedCounts, final RealMatrix reducedPanelCounts, final RealMatrix reducedPanelPInvCounts, final CaseToPoNTargetMapper targetMapper, final RealMatrix tangentNormalizationInputCounts, final JavaSparkContext ctx) {
    // Make the C^T a distributed matrix (RowMatrix)
    final RowMatrix caseTDistMat = SparkConverter.convertRealMatrixToSparkRowMatrix(ctx, tangentNormalizationInputCounts.transpose(), TN_NUM_SLICES_SPARK);
    // Spark local matrices (transposed)
    final Matrix pinvTLocalMat = new DenseMatrix(reducedPanelPInvCounts.getRowDimension(), reducedPanelPInvCounts.getColumnDimension(), Doubles.concat(reducedPanelPInvCounts.getData()), true).transpose();
    final Matrix panelTLocalMat = new DenseMatrix(reducedPanelCounts.getRowDimension(), reducedPanelCounts.getColumnDimension(), Doubles.concat(reducedPanelCounts.getData()), true).transpose();
    // Calculate the projection transpose in a distributed matrix, then convert to Apache Commons matrix (not transposed)
    final RowMatrix betahatDistMat = caseTDistMat.multiply(pinvTLocalMat);
    final RowMatrix projectionTDistMat = betahatDistMat.multiply(panelTLocalMat);
    final RealMatrix projection = SparkConverter.convertSparkRowMatrixToRealMatrix(projectionTDistMat, tangentNormalizationInputCounts.transpose().getRowDimension()).transpose();
    // Subtract the projection from the cases
    final RealMatrix tangentNormalizedCounts = tangentNormalizationInputCounts.subtract(projection);
    // Construct the result object and return it with the correct targets.
    final ReadCountCollection tangentNormalized = targetMapper.fromPoNtoCaseCountCollection(tangentNormalizedCounts, targetFactorNormalizedCounts.columnNames());
    final ReadCountCollection preTangentNormalized = targetMapper.fromPoNtoCaseCountCollection(tangentNormalizationInputCounts, targetFactorNormalizedCounts.columnNames());
    final RealMatrix tangentBetaHats = SparkConverter.convertSparkRowMatrixToRealMatrix(betahatDistMat, tangentNormalizedCounts.getColumnDimension());
    return new PCATangentNormalizationResult(tangentNormalized, preTangentNormalized, tangentBetaHats.transpose(), targetFactorNormalizedCounts);
}
Also used : RowMatrix(org.apache.spark.mllib.linalg.distributed.RowMatrix), DenseMatrix(org.apache.spark.mllib.linalg.DenseMatrix), RealMatrix(org.apache.commons.math3.linear.RealMatrix), Matrix(org.apache.spark.mllib.linalg.Matrix), ReadCountCollection(org.broadinstitute.hellbender.tools.exome.ReadCountCollection)
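
The algebra in the Javadoc above (Ahat^T = (C^T P^T) A^T, result C - Ahat) can be checked on small arrays without Spark. The following is a minimal sketch, not the GATK implementation: the class TangentProjectionSketch and its method names are hypothetical, and plain double[][] arrays stand in for RowMatrix and DenseMatrix. It assumes the panel A is targets x components, the pseudoinverse P is components x targets, and the cases C are targets x samples.

```java
/** Hypothetical illustration class; not part of GATK or Spark. */
final class TangentProjectionSketch {

    /** Plain triple-loop product of a (n x k) and b (k x m); inputs are assumed rectangular. */
    static double[][] multiply(final double[][] a, final double[][] b) {
        final int n = a.length;
        final int k = b.length;
        final int m = b[0].length;
        final double[][] out = new double[n][m];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < m; j++) {
                double sum = 0.0;
                for (int x = 0; x < k; x++) {
                    sum += a[i][x] * b[x][j];
                }
                out[i][j] = sum;
            }
        }
        return out;
    }

    /** Returns the transpose of a rectangular matrix. */
    static double[][] transpose(final double[][] a) {
        final double[][] out = new double[a[0].length][a.length];
        for (int i = 0; i < a.length; i++) {
            for (int j = 0; j < a[0].length; j++) {
                out[j][i] = a[i][j];
            }
        }
        return out;
    }

    /**
     * Computes C - Ahat, where Ahat^T = (C^T P^T) A^T.
     * panel = A, pinv = P, cases = C, following the naming in the Javadoc above.
     */
    static double[][] tangentNormalize(final double[][] panel,
                                       final double[][] pinv,
                                       final double[][] cases) {
        // betahat = C^T P^T  (samples x components)
        final double[][] betahat = multiply(transpose(cases), transpose(pinv));
        // Ahat^T = betahat A^T  (samples x targets), then transpose back to targets x samples
        final double[][] projection = transpose(multiply(betahat, transpose(panel)));
        // Subtract the projection from the cases, element by element
        final double[][] result = new double[cases.length][cases[0].length];
        for (int i = 0; i < cases.length; i++) {
            for (int j = 0; j < cases[0].length; j++) {
                result[i][j] = cases[i][j] - projection[i][j];
            }
        }
        return result;
    }
}
```

For example, with A = {{1}, {0}}, P = {{1, 0}} and C = {{3}, {4}}, the projection is {{3}, {0}}, so tangentNormalize returns {{0}, {4}}: the component of the case that the panel explains is removed and the rest is kept.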

Example 2 with Matrix

use of org.apache.spark.mllib.linalg.Matrix in project gatk-protected by broadinstitute.

the class PCATangentNormalizationUtils method composeTangentNormalizationInputMatrix.

/**
 * Prepares the data to perform tangent normalization.
 * <p>
 * This is done by count group or column:
 *   <ol>
 *     <li>we divide counts by the column mean,</li>
 *     <li>then we transform the values to their log_2,</li>
 *     <li>and finally we center them around the median.</li>
 *   </ol>
 * </p>
 *
 * @param matrix input matrix.
 * @return never {@code null}.
 */
private static RealMatrix composeTangentNormalizationInputMatrix(final RealMatrix matrix) {
    final RealMatrix result = matrix.copy();
    // step 1: divide by column means and log_2 transform
    final double[] columnMeans = GATKProtectedMathUtils.columnMeans(matrix);
    result.walkInOptimizedOrder(new DefaultRealMatrixChangingVisitor() {

        @Override
        public double visit(final int row, final int column, final double value) {
            return truncatedLog2(value / columnMeans[column]);
        }
    });
    // step 2: subtract column medians
    final double[] columnMedians = IntStream.range(0, matrix.getColumnDimension()).mapToDouble(c -> new Median().evaluate(result.getColumn(c))).toArray();
    result.walkInOptimizedOrder(new DefaultRealMatrixChangingVisitor() {

        @Override
        public double visit(final int row, final int column, final double value) {
            return value - columnMedians[column];
        }
    });
    return result;
}
Also used : IntStream(java.util.stream.IntStream), DefaultRealMatrixChangingVisitor(org.apache.commons.math3.linear.DefaultRealMatrixChangingVisitor), GATKProtectedMathUtils(org.broadinstitute.hellbender.utils.GATKProtectedMathUtils), JavaSparkContext(org.apache.spark.api.java.JavaSparkContext), RowMatrix(org.apache.spark.mllib.linalg.distributed.RowMatrix), DenseMatrix(org.apache.spark.mllib.linalg.DenseMatrix), ParamUtils(org.broadinstitute.hellbender.utils.param.ParamUtils), List(java.util.List), Logger(org.apache.logging.log4j.Logger), ReadCountCollection(org.broadinstitute.hellbender.tools.exome.ReadCountCollection), CaseToPoNTargetMapper(org.broadinstitute.hellbender.tools.pon.coverage.CaseToPoNTargetMapper), CoveragePanelOfNormals(org.broadinstitute.hellbender.tools.pon.coverage.CoveragePanelOfNormals), Median(org.apache.commons.math3.stat.descriptive.rank.Median), SparkConverter(org.broadinstitute.hellbender.utils.spark.SparkConverter), Doubles(com.google.common.primitives.Doubles), Utils(org.broadinstitute.hellbender.utils.Utils), RealMatrix(org.apache.commons.math3.linear.RealMatrix), Matrix(org.apache.spark.mllib.linalg.Matrix), VisibleForTesting(com.google.common.annotations.VisibleForTesting), LogManager(org.apache.logging.log4j.LogManager)
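
The three steps described in the Javadoc above (divide by the column mean, take log_2, center on the column median) can be tried out on plain arrays. This is a minimal sketch under stated assumptions, not the GATK code: ColumnNormalizationSketch and its methods are hypothetical names, an untruncated log_2 replaces truncatedLog2 (whose definition is not shown in the example), and a textbook median stands in for the Commons Math Median class.

```java
import java.util.Arrays;

/** Hypothetical illustration class; not part of GATK or Commons Math. */
final class ColumnNormalizationSketch {

    /** Base-2 logarithm; unlike truncatedLog2 in the example, no truncation is applied. */
    static double log2(final double x) {
        return Math.log(x) / Math.log(2.0);
    }

    /** Textbook median: middle value, or the mean of the two middle values. */
    static double median(final double[] values) {
        final double[] sorted = values.clone();
        Arrays.sort(sorted);
        final int mid = sorted.length / 2;
        return sorted.length % 2 == 1
                ? sorted[mid]
                : 0.5 * (sorted[mid - 1] + sorted[mid]);
    }

    /** Returns a new matrix; the input (assumed rectangular, positive counts) is not modified. */
    static double[][] normalizeColumns(final double[][] matrix) {
        final int rows = matrix.length;
        final int cols = matrix[0].length;
        final double[][] result = new double[rows][cols];
        for (int c = 0; c < cols; c++) {
            // step 1: divide by the column mean and take log_2
            double sum = 0.0;
            for (int r = 0; r < rows; r++) {
                sum += matrix[r][c];
            }
            final double mean = sum / rows;
            final double[] column = new double[rows];
            for (int r = 0; r < rows; r++) {
                column[r] = log2(matrix[r][c] / mean);
            }
            // step 2: subtract the median of the transformed column
            final double columnMedian = median(column);
            for (int r = 0; r < rows; r++) {
                result[r][c] = column[r] - columnMedian;
            }
        }
        return result;
    }
}
```

For a column with counts 1, 2, 4, 1 the mean is 2, the log_2 ratios are -1, 0, 1, -1, and their median is -0.5, so the normalized column is -0.5, 0.5, 1.5, -0.5. A constant column maps to all zeros.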

Example 3 with Matrix

use of org.apache.spark.mllib.linalg.Matrix in project gatk by broadinstitute.

the class PCATangentNormalizationUtils method composeTangentNormalizationInputMatrix.

/**
 * Prepares the data to perform tangent normalization.
 * <p>
 * This is done by count group or column:
 *   <ol>
 *     <li>we divide counts by the column mean,</li>
 *     <li>then we transform the values to their log_2,</li>
 *     <li>and finally we center them around the median.</li>
 *   </ol>
 * </p>
 *
 * @param matrix input matrix.
 * @return never {@code null}.
 */
private static RealMatrix composeTangentNormalizationInputMatrix(final RealMatrix matrix) {
    final RealMatrix result = matrix.copy();
    // step 1: divide by column means and log_2 transform
    final double[] columnMeans = GATKProtectedMathUtils.columnMeans(matrix);
    result.walkInOptimizedOrder(new DefaultRealMatrixChangingVisitor() {

        @Override
        public double visit(final int row, final int column, final double value) {
            return truncatedLog2(value / columnMeans[column]);
        }
    });
    // step 2: subtract column medians
    final double[] columnMedians = IntStream.range(0, matrix.getColumnDimension()).mapToDouble(c -> new Median().evaluate(result.getColumn(c))).toArray();
    result.walkInOptimizedOrder(new DefaultRealMatrixChangingVisitor() {

        @Override
        public double visit(final int row, final int column, final double value) {
            return value - columnMedians[column];
        }
    });
    return result;
}
Also used : IntStream(java.util.stream.IntStream), DefaultRealMatrixChangingVisitor(org.apache.commons.math3.linear.DefaultRealMatrixChangingVisitor), GATKProtectedMathUtils(org.broadinstitute.hellbender.utils.GATKProtectedMathUtils), JavaSparkContext(org.apache.spark.api.java.JavaSparkContext), RowMatrix(org.apache.spark.mllib.linalg.distributed.RowMatrix), DenseMatrix(org.apache.spark.mllib.linalg.DenseMatrix), ParamUtils(org.broadinstitute.hellbender.utils.param.ParamUtils), List(java.util.List), Logger(org.apache.logging.log4j.Logger), ReadCountCollection(org.broadinstitute.hellbender.tools.exome.ReadCountCollection), CaseToPoNTargetMapper(org.broadinstitute.hellbender.tools.pon.coverage.CaseToPoNTargetMapper), CoveragePanelOfNormals(org.broadinstitute.hellbender.tools.pon.coverage.CoveragePanelOfNormals), Median(org.apache.commons.math3.stat.descriptive.rank.Median), SparkConverter(org.broadinstitute.hellbender.utils.spark.SparkConverter), Doubles(com.google.common.primitives.Doubles), Utils(org.broadinstitute.hellbender.utils.Utils), RealMatrix(org.apache.commons.math3.linear.RealMatrix), Matrix(org.apache.spark.mllib.linalg.Matrix), VisibleForTesting(com.google.common.annotations.VisibleForTesting), LogManager(org.apache.logging.log4j.LogManager)

Example 4 with Matrix

use of org.apache.spark.mllib.linalg.Matrix in project gatk-protected by broadinstitute.

the class PCATangentNormalizationUtils method tangentNormalizeSpark.

/**
 * Tangent normalize given the raw PoN data using Spark: the code here is a little more complex for optimization purposes.
 *
 *  Please see notes in docs/PoN ...
 *
 *  Ahat^T = (C^T P^T) A^T
 *  Therefore, C^T is the RowMatrix
 *
 *  pinv: P
 *  panel: A
 *  projection: Ahat
 *  cases: C
 *  betahat: C^T P^T
 *  tangentNormalizedCounts: C - Ahat
 */
private static PCATangentNormalizationResult tangentNormalizeSpark(final ReadCountCollection targetFactorNormalizedCounts, final RealMatrix reducedPanelCounts, final RealMatrix reducedPanelPInvCounts, final CaseToPoNTargetMapper targetMapper, final RealMatrix tangentNormalizationInputCounts, final JavaSparkContext ctx) {
    // Make the C^T a distributed matrix (RowMatrix)
    final RowMatrix caseTDistMat = SparkConverter.convertRealMatrixToSparkRowMatrix(ctx, tangentNormalizationInputCounts.transpose(), TN_NUM_SLICES_SPARK);
    // Spark local matrices (transposed)
    final Matrix pinvTLocalMat = new DenseMatrix(reducedPanelPInvCounts.getRowDimension(), reducedPanelPInvCounts.getColumnDimension(), Doubles.concat(reducedPanelPInvCounts.getData()), true).transpose();
    final Matrix panelTLocalMat = new DenseMatrix(reducedPanelCounts.getRowDimension(), reducedPanelCounts.getColumnDimension(), Doubles.concat(reducedPanelCounts.getData()), true).transpose();
    // Calculate the projection transpose in a distributed matrix, then convert to Apache Commons matrix (not transposed)
    final RowMatrix betahatDistMat = caseTDistMat.multiply(pinvTLocalMat);
    final RowMatrix projectionTDistMat = betahatDistMat.multiply(panelTLocalMat);
    final RealMatrix projection = SparkConverter.convertSparkRowMatrixToRealMatrix(projectionTDistMat, tangentNormalizationInputCounts.transpose().getRowDimension()).transpose();
    // Subtract the projection from the cases
    final RealMatrix tangentNormalizedCounts = tangentNormalizationInputCounts.subtract(projection);
    // Construct the result object and return it with the correct targets.
    final ReadCountCollection tangentNormalized = targetMapper.fromPoNtoCaseCountCollection(tangentNormalizedCounts, targetFactorNormalizedCounts.columnNames());
    final ReadCountCollection preTangentNormalized = targetMapper.fromPoNtoCaseCountCollection(tangentNormalizationInputCounts, targetFactorNormalizedCounts.columnNames());
    final RealMatrix tangentBetaHats = SparkConverter.convertSparkRowMatrixToRealMatrix(betahatDistMat, tangentNormalizedCounts.getColumnDimension());
    return new PCATangentNormalizationResult(tangentNormalized, preTangentNormalized, tangentBetaHats.transpose(), targetFactorNormalizedCounts);
}
Also used : RowMatrix(org.apache.spark.mllib.linalg.distributed.RowMatrix), DenseMatrix(org.apache.spark.mllib.linalg.DenseMatrix), RealMatrix(org.apache.commons.math3.linear.RealMatrix), Matrix(org.apache.spark.mllib.linalg.Matrix), ReadCountCollection(org.broadinstitute.hellbender.tools.exome.ReadCountCollection)

Aggregations

RealMatrix (org.apache.commons.math3.linear.RealMatrix) 4
DenseMatrix (org.apache.spark.mllib.linalg.DenseMatrix) 4
Matrix (org.apache.spark.mllib.linalg.Matrix) 4
RowMatrix (org.apache.spark.mllib.linalg.distributed.RowMatrix) 4
ReadCountCollection (org.broadinstitute.hellbender.tools.exome.ReadCountCollection) 4
VisibleForTesting (com.google.common.annotations.VisibleForTesting) 2
Doubles (com.google.common.primitives.Doubles) 2
List (java.util.List) 2
IntStream (java.util.stream.IntStream) 2
DefaultRealMatrixChangingVisitor (org.apache.commons.math3.linear.DefaultRealMatrixChangingVisitor) 2
Median (org.apache.commons.math3.stat.descriptive.rank.Median) 2
LogManager (org.apache.logging.log4j.LogManager) 2
Logger (org.apache.logging.log4j.Logger) 2
JavaSparkContext (org.apache.spark.api.java.JavaSparkContext) 2
CaseToPoNTargetMapper (org.broadinstitute.hellbender.tools.pon.coverage.CaseToPoNTargetMapper) 2
CoveragePanelOfNormals (org.broadinstitute.hellbender.tools.pon.coverage.CoveragePanelOfNormals) 2
GATKProtectedMathUtils (org.broadinstitute.hellbender.utils.GATKProtectedMathUtils) 2
Utils (org.broadinstitute.hellbender.utils.Utils) 2
ParamUtils (org.broadinstitute.hellbender.utils.param.ParamUtils) 2
SparkConverter (org.broadinstitute.hellbender.utils.spark.SparkConverter) 2