Example 1 with DataFlowException

use of edu.uci.ics.textdb.api.exception.DataFlowException in project textdb by TextDB.

the class WordCountIndexSource method computeWordCount.

private void computeWordCount() throws TextDBException {
    try {
        HashMap<String, Integer> wordCountMap = new HashMap<>();
        DataReader dataReader = RelationManager.getRelationManager().getTableDataReader(predicate.getTableName(), new MatchAllDocsQuery());
        dataReader.open();
        IndexReader luceneIndexReader = dataReader.getLuceneIndexReader();
        for (int i = 0; i < luceneIndexReader.numDocs(); i++) {
            Terms termVector = luceneIndexReader.getTermVector(i, predicate.getAttribute());
            // getTermVector returns null when no term vector was indexed for this document and field
            if (termVector == null) {
                continue;
            }
            TermsEnum termsEnum = termVector.iterator();
            while (termsEnum.next() != null) {
                String key = termsEnum.term().utf8ToString();
                // add this document's term frequency to the running total for the term
                wordCountMap.merge(key, (int) termsEnum.totalTermFreq(), Integer::sum);
            }
        }
        luceneIndexReader.close();
        dataReader.close();
        // sort the entries by descending count
        sortedWordCountMap = wordCountMap.entrySet().stream()
                .sorted((e1, e2) -> e2.getValue().compareTo(e1.getValue()))
                .collect(Collectors.toList());
        wordCountIterator = sortedWordCountMap.iterator();
    } catch (IOException e) {
        throw new DataFlowException(e);
    }
}
Also used : DataReader(edu.uci.ics.textdb.storage.DataReader) HashMap(java.util.HashMap) IndexReader(org.apache.lucene.index.IndexReader) Terms(org.apache.lucene.index.Terms) DataFlowException(edu.uci.ics.textdb.api.exception.DataFlowException) IOException(java.io.IOException) MatchAllDocsQuery(org.apache.lucene.search.MatchAllDocsQuery) TermsEnum(org.apache.lucene.index.TermsEnum)
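
The counting and sorting steps in Example 1 can be tried without Lucene. The following is a minimal sketch, assuming the terms are already available as a plain list of strings; the class WordCountSketch and the method countAndSort are hypothetical names and are not part of TextDB.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class WordCountSketch {

    // Hypothetical helper (not part of TextDB): counts how often each token
    // occurs and returns the entries sorted by descending count.
    public static List<Map.Entry<String, Integer>> countAndSort(List<String> tokens) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : tokens) {
            // merge inserts 1 for a new key, otherwise adds 1 to the stored value
            counts.merge(token, 1, Integer::sum);
        }
        List<Map.Entry<String, Integer>> sorted = new ArrayList<>(counts.entrySet());
        // same comparator as in Example 1: larger counts come first
        sorted.sort((e1, e2) -> e2.getValue().compareTo(e1.getValue()));
        return sorted;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("lucene", "index", "lucene", "term", "lucene", "index");
        for (Map.Entry<String, Integer> entry : countAndSort(tokens)) {
            System.out.println(entry.getKey() + " " + entry.getValue());
        }
    }
}
```

Map.merge replaces the explicit null check on wordCountMap.get(key); the resulting counts are the same.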

Example 2 with DataFlowException

use of edu.uci.ics.textdb.api.exception.DataFlowException in project textdb by TextDB.

the class ScanBasedSourceOperator method close.

@Override
public void close() throws TextDBException {
    if (!isOpen) {
        return;
    }
    try {
        dataReader.close();
        isOpen = false;
    } catch (Exception e) {
        throw new DataFlowException(e.getMessage(), e);
    }
}
Also used : DataFlowException(edu.uci.ics.textdb.api.exception.DataFlowException) TextDBException(edu.uci.ics.textdb.api.exception.TextDBException) StorageException(edu.uci.ics.textdb.api.exception.StorageException)
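
The close pattern in Example 2 (return early when already closed, rethrow failures with the original cause attached) can be sketched in isolation. PipelineException and Source below are hypothetical stand-ins, not TextDB classes, and the stand-in exception is unchecked only to keep the sketch short.

```java
import java.io.Closeable;
import java.io.IOException;

public class CloseSketch {

    // Hypothetical stand-in for a domain exception such as DataFlowException.
    public static class PipelineException extends RuntimeException {
        public PipelineException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    // Hypothetical operator that owns one closeable resource.
    public static class Source {
        private final Closeable resource;
        private boolean isOpen = true;

        public Source(Closeable resource) {
            this.resource = resource;
        }

        public boolean isOpen() {
            return isOpen;
        }

        // Closing an already closed source is a no-op. A failure is rethrown
        // with the original exception attached as the cause, and the source
        // stays marked as open, as in Example 2.
        public void close() {
            if (!isOpen) {
                return;
            }
            try {
                resource.close();
                isOpen = false;
            } catch (IOException e) {
                throw new PipelineException(e.getMessage(), e);
            }
        }
    }

    public static void main(String[] args) {
        Source ok = new Source(() -> { });
        ok.close();
        ok.close(); // second call returns immediately
        System.out.println("open after close: " + ok.isOpen());

        Source failing = new Source(() -> { throw new IOException("disk error"); });
        try {
            failing.close();
        } catch (PipelineException e) {
            System.out.println("wrapped cause: " + e.getCause().getClass().getSimpleName());
        }
    }
}
```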

Example 3 with DataFlowException

use of edu.uci.ics.textdb.api.exception.DataFlowException in project textdb by TextDB.

the class ExcelSink method open.

@Override
public void open() throws TextDBException {
    if (cursor != CLOSED) {
        return;
    }
    inputOperator.open();
    inputSchema = inputOperator.getOutputSchema();
    // drop the _ID and PAYLOAD attributes and all LIST-typed attributes from the output schema
    outputSchema = new Schema(inputSchema.getAttributes().stream()
            .filter(attr -> !attr.getAttributeName().equalsIgnoreCase(SchemaConstants._ID))
            .filter(attr -> !attr.getAttributeName().equalsIgnoreCase(SchemaConstants.PAYLOAD))
            .filter(attr -> !attr.getAttributeType().equals(AttributeType.LIST))
            .toArray(Attribute[]::new));
    wb = new XSSFWorkbook();
    DateFormat df = new SimpleDateFormat("yyyyMMdd-HHmmss");
    fileName = df.format(new Date()) + ".xlsx";
    try {
        if (Files.notExists(Paths.get(excelIndexDirectory))) {
            Files.createDirectories(Paths.get(excelIndexDirectory));
        }
        fileOut = new FileOutputStream(Paths.get(excelIndexDirectory, fileName).toString());
    } catch (IOException e) {
        throw new DataFlowException(e);
    }
    sheet = wb.createSheet("new sheet");
    Row row = sheet.createRow(0);
    List<String> attributeNames = outputSchema.getAttributeNames();
    for (int i = 0; i < attributeNames.size(); i++) {
        String attributeName = attributeNames.get(i);
        row.createCell(i).setCellValue(attributeName);
    }
    cursor = OPENED;
}
Also used : SchemaConstants(edu.uci.ics.textdb.api.constants.SchemaConstants) DoubleField(edu.uci.ics.textdb.api.field.DoubleField) DateField(edu.uci.ics.textdb.api.field.DateField) Date(java.util.Date) DataFlowException(edu.uci.ics.textdb.api.exception.DataFlowException) SimpleDateFormat(java.text.SimpleDateFormat) ArrayList(java.util.ArrayList) AttributeType(edu.uci.ics.textdb.api.schema.AttributeType) XSSFWorkbook(org.apache.poi.xssf.usermodel.XSSFWorkbook) IntegerField(edu.uci.ics.textdb.api.field.IntegerField) ISink(edu.uci.ics.textdb.api.dataflow.ISink) TextDBException(edu.uci.ics.textdb.api.exception.TextDBException) Cell(org.apache.poi.ss.usermodel.Cell) DateFormat(java.text.DateFormat) Tuple(edu.uci.ics.textdb.api.tuple.Tuple) IOperator(edu.uci.ics.textdb.api.dataflow.IOperator) Sheet(org.apache.poi.ss.usermodel.Sheet) Attribute(edu.uci.ics.textdb.api.schema.Attribute) Files(java.nio.file.Files) FileOutputStream(java.io.FileOutputStream) IOException(java.io.IOException) Utils(edu.uci.ics.textdb.api.utils.Utils) Schema(edu.uci.ics.textdb.api.schema.Schema) List(java.util.List) Workbook(org.apache.poi.ss.usermodel.Workbook) Paths(java.nio.file.Paths) IField(edu.uci.ics.textdb.api.field.IField) Row(org.apache.poi.ss.usermodel.Row)

Example 4 with DataFlowException

use of edu.uci.ics.textdb.api.exception.DataFlowException in project textdb by TextDB.

the class RelationManager method createTable.

/**
 * Creates a new table.
 *   The table name must be unique (case-insensitive).
 *   The Lucene analyzer string must name a valid analyzer.
 *
 * The "_id" attribute is added to the table schema.
 * The system automatically generates a unique ID for each tuple inserted into the table
 *   and stores it in the "_id" field.
 *
 * @param tableName the name of the table; must be unique (case-insensitive)
 * @param indexDirectory the directory that stores the index and data; must not be used by another table
 * @param schema the schema of the table
 * @param luceneAnalyzerString the string representing the Lucene analyzer to use
 * @throws StorageException if the table already exists, the index directory cannot be created or is already taken, or the analyzer string is invalid
 */
public void createTable(String tableName, String indexDirectory, Schema schema, String luceneAnalyzerString) throws StorageException {
    // convert the table name to lower case
    tableName = tableName.toLowerCase();
    // table should not exist
    if (checkTableExistence(tableName)) {
        throw new StorageException(String.format("Table %s already exists.", tableName));
    }
    // create the index directory if it does not exist, and convert it to its absolute path
    try {
        Path indexPath = Paths.get(indexDirectory);
        if (Files.notExists(indexPath)) {
            Files.createDirectories(indexPath);
        }
        indexDirectory = indexPath.toRealPath().toString();
    } catch (IOException e) {
        throw new StorageException(e);
    }
    // check if the indexDirectory overlaps with another table's index directory
    Query indexDirectoryQuery = new TermQuery(new Term(CatalogConstants.TABLE_DIRECTORY, indexDirectory));
    DataReader tableCatalogDataReader = new DataReader(CatalogConstants.TABLE_CATALOG_DATASTORE, indexDirectoryQuery);
    tableCatalogDataReader.setPayloadAdded(false);
    tableCatalogDataReader.open();
    Tuple nextTuple = tableCatalogDataReader.getNextTuple();
    tableCatalogDataReader.close();
    // if the index directory is already taken by another table, throw an exception
    if (nextTuple != null) {
        String overlapTableName = nextTuple.getField(CatalogConstants.TABLE_NAME).getValue().toString();
        throw new StorageException(String.format("Table %s already takes the index directory %s. Please choose another directory.", overlapTableName, indexDirectory));
    }
    // check if the lucene analyzer string is valid
    Analyzer luceneAnalyzer = null;
    try {
        luceneAnalyzer = LuceneAnalyzerConstants.getLuceneAnalyzer(luceneAnalyzerString);
    } catch (DataFlowException e) {
        throw new StorageException("Lucene Analyzer String is not valid.");
    }
    // create the directory and clear all data in the index directory
    Schema tableSchema = Utils.getSchemaWithID(schema);
    DataStore tableDataStore = new DataStore(indexDirectory, tableSchema);
    DataWriter dataWriter = new DataWriter(tableDataStore, luceneAnalyzer);
    dataWriter.open();
    dataWriter.clearData();
    dataWriter.close();
    // write table info to catalog
    writeTableInfoToCatalog(tableName, indexDirectory, schema, luceneAnalyzerString);
}
Also used : Path(java.nio.file.Path) TermQuery(org.apache.lucene.search.TermQuery) Query(org.apache.lucene.search.Query) MatchAllDocsQuery(org.apache.lucene.search.MatchAllDocsQuery) Schema(edu.uci.ics.textdb.api.schema.Schema) IOException(java.io.IOException) Term(org.apache.lucene.index.Term) Analyzer(org.apache.lucene.analysis.Analyzer) DataFlowException(edu.uci.ics.textdb.api.exception.DataFlowException) StorageException(edu.uci.ics.textdb.api.exception.StorageException) Tuple(edu.uci.ics.textdb.api.tuple.Tuple)
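
Example 4 converts the index directory to its real path before checking whether another table already uses it, so that two different spellings of the same directory compare equal. The following is a minimal sketch of that normalization step using only java.nio.file; the class IndexDirectorySketch and the method normalize are hypothetical names, and UncheckedIOException stands in for the StorageException that the original throws.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class IndexDirectorySketch {

    // Hypothetical helper (not part of TextDB): creates the directory if it is
    // missing and returns its real absolute path as a string.
    public static String normalize(String directory) {
        try {
            Path path = Paths.get(directory);
            if (Files.notExists(path)) {
                Files.createDirectories(path);
            }
            // toRealPath resolves "." and ".." segments and symbolic links
            return path.toRealPath().toString();
        } catch (IOException e) {
            // the original wraps the IOException in a StorageException instead
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        Path base = Files.createTempDirectory("index-sketch");
        String direct = normalize(base.resolve("tableA").toString());
        String indirect = normalize(base.resolve(".").resolve("tableA").toString());
        // both spellings refer to the same directory
        System.out.println(direct.equals(indirect));
    }
}
```

Comparing the normalized strings is what lets a catalog lookup on the directory detect an overlap even when the caller passes a relative or non-canonical path.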

Example 5 with DataFlowException

use of edu.uci.ics.textdb.api.exception.DataFlowException in project textdb by TextDB.

the class KeywordMatcherSourceOperator method buildConjunctionQuery.

private Query buildConjunctionQuery() throws DataFlowException {
    BooleanQuery.Builder booleanQueryBuilder = new BooleanQuery.Builder();
    for (String attributeName : this.predicate.getAttributeNames()) {
        AttributeType attributeType = this.inputSchema.getAttribute(attributeName).getAttributeType();
        // types other than TEXT and STRING: throw Exception for now
        if (attributeType != AttributeType.STRING && attributeType != AttributeType.TEXT) {
            throw new DataFlowException("KeywordPredicate: Fields other than STRING and TEXT are not supported yet");
        }
        if (attributeType == AttributeType.STRING) {
            Query termQuery = new TermQuery(new Term(attributeName, predicate.getQuery()));
            booleanQueryBuilder.add(termQuery, BooleanClause.Occur.SHOULD);
        }
        if (attributeType == AttributeType.TEXT) {
            BooleanQuery.Builder fieldQueryBuilder = new BooleanQuery.Builder();
            for (String token : queryTokenSet) {
                Query termQuery = new TermQuery(new Term(attributeName, token.toLowerCase()));
                fieldQueryBuilder.add(termQuery, BooleanClause.Occur.MUST);
            }
            booleanQueryBuilder.add(fieldQueryBuilder.build(), BooleanClause.Occur.SHOULD);
        }
    }
    return booleanQueryBuilder.build();
}
Also used : BooleanQuery(org.apache.lucene.search.BooleanQuery) TermQuery(org.apache.lucene.search.TermQuery) Query(org.apache.lucene.search.Query) PhraseQuery(org.apache.lucene.search.PhraseQuery) MatchAllDocsQuery(org.apache.lucene.search.MatchAllDocsQuery) AttributeType(edu.uci.ics.textdb.api.schema.AttributeType) DataFlowException(edu.uci.ics.textdb.api.exception.DataFlowException) Term(org.apache.lucene.index.Term)

Aggregations

DataFlowException (edu.uci.ics.textdb.api.exception.DataFlowException) 34
TextDBException (edu.uci.ics.textdb.api.exception.TextDBException) 13
AttributeType (edu.uci.ics.textdb.api.schema.AttributeType) 12
Schema (edu.uci.ics.textdb.api.schema.Schema) 11
Tuple (edu.uci.ics.textdb.api.tuple.Tuple) 10
Attribute (edu.uci.ics.textdb.api.schema.Attribute) 8
Span (edu.uci.ics.textdb.api.span.Span) 7
ArrayList (java.util.ArrayList) 7
SchemaConstants (edu.uci.ics.textdb.api.constants.SchemaConstants) 6
List (java.util.List) 6
Collectors (java.util.stream.Collectors) 6
StorageException (edu.uci.ics.textdb.api.exception.StorageException) 5
ListField (edu.uci.ics.textdb.api.field.ListField) 5
IOException (java.io.IOException) 5
IField (edu.uci.ics.textdb.api.field.IField) 4
Utils (edu.uci.ics.textdb.api.utils.Utils) 4
AbstractSingleInputOperator (edu.uci.ics.textdb.exp.common.AbstractSingleInputOperator) 4
Iterator (java.util.Iterator) 4
ErrorMessages (edu.uci.ics.textdb.api.constants.ErrorMessages) 3
IOperator (edu.uci.ics.textdb.api.dataflow.IOperator) 3