
Example 16 with TextField

use of edu.uci.ics.texera.api.field.TextField in project textdb by TextDB.

the class JoinDistanceTest method testIdsMatchFieldsMatchSpanWithinThreshold.

/**
 * This case tests the scenario in which the difference between the keyword
 * spans is within the given span threshold.
 * e.g.
 * [<11, 18>]
 * [<27, 33>]
 * threshold = 20 (within threshold)
 * result: [<11, 33>]
 * Test result: The list contains a tuple with all the fields and a span
 * list consisting of the joined span. The joined span is made up of the
 * field name, start and stop index (computed as <min(span1 spanStartIndex,
 * span2 spanStartIndex), max(span1 spanEndIndex, span2 spanEndIndex)>),
 * key (combination of span1 key and span2 key) and value (combination of
 * span1 value and span2 value).
 */
@Test
public void testIdsMatchFieldsMatchSpanWithinThreshold() throws Exception {
    JoinTestHelper.insertToTable(BOOK_TABLE, JoinTestConstants.bookGroup1.get(0));
    KeywordMatcherSourceOperator keywordSourceOuter = JoinTestHelper.getKeywordSource(BOOK_TABLE, "special", conjunction);
    KeywordMatcherSourceOperator keywordSourceInner = JoinTestHelper.getKeywordSource(BOOK_TABLE, "writer", conjunction);
    List<Tuple> resultList = JoinTestHelper.getJoinDistanceResults(keywordSourceInner, keywordSourceOuter, new JoinDistancePredicate(JoinTestConstants.REVIEW, 20), Integer.MAX_VALUE, 0);
    Schema resultSchema = new Schema.Builder().add(JoinTestConstants.BOOK_SCHEMA).add(SchemaConstants.SPAN_LIST_ATTRIBUTE).build();
    List<Span> spanList = new ArrayList<>();
    Span span1 = new Span(JoinTestConstants.REVIEW, 11, 33, "special_writer", "special kind of " + "writer");
    spanList.add(span1);
    IField[] book1 = { new IntegerField(52), new StringField("Mary Roach"), new StringField("Grunt: The Curious Science of Humans at War"), new IntegerField(288), new TextField("It takes a special kind " + "of writer to make topics ranging from death to our " + "gastrointestinal tract interesting (sometimes " + "hilariously so), and pop science writer Mary Roach is " + "always up to the task."), new ListField<>(spanList) };
    Tuple expectedTuple = new Tuple(resultSchema, book1);
    List<Tuple> expectedResult = new ArrayList<>();
    expectedResult.add(expectedTuple);
    Assert.assertEquals(1, resultList.size());
    Assert.assertTrue(TestUtils.equals(expectedResult, resultList));
}
Also used : Schema(edu.uci.ics.texera.api.schema.Schema) ArrayList(java.util.ArrayList) IntegerField(edu.uci.ics.texera.api.field.IntegerField) IField(edu.uci.ics.texera.api.field.IField) JoinDistancePredicate(edu.uci.ics.texera.dataflow.join.JoinDistancePredicate) Span(edu.uci.ics.texera.api.span.Span) KeywordMatcherSourceOperator(edu.uci.ics.texera.dataflow.keywordmatcher.KeywordMatcherSourceOperator) StringField(edu.uci.ics.texera.api.field.StringField) TextField(edu.uci.ics.texera.api.field.TextField) Tuple(edu.uci.ics.texera.api.tuple.Tuple) Test(org.junit.Test)
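
The span arithmetic described in the test comment can be sketched in isolation. This is a minimal illustration, not the project's JoinDistancePredicate: the `SimpleSpan` class, the `join` method, the inclusive threshold comparison and the `outerKey_innerKey` key format are all assumptions, chosen to agree with the expected values in the tests on this page.

```java
import java.util.Optional;

final class SpanJoinSketch {

    /** Hypothetical stand-in for a span; not the project's Span class. */
    static final class SimpleSpan {
        final String field;
        final int start;
        final int end;
        final String key;
        final String value;

        SimpleSpan(String field, int start, int end, String key, String value) {
            this.field = field;
            this.start = start;
            this.end = end;
            this.key = key;
            this.value = value;
        }
    }

    /**
     * Joins two spans on the same field when both the start offsets and the
     * end offsets differ by at most the threshold (inclusive: an assumption).
     */
    static Optional<SimpleSpan> join(SimpleSpan outer, SimpleSpan inner,
                                     String fieldText, int threshold) {
        if (!outer.field.equals(inner.field)) {
            return Optional.empty();
        }
        boolean withinThreshold =
                Math.abs(outer.start - inner.start) <= threshold
                && Math.abs(outer.end - inner.end) <= threshold;
        if (!withinThreshold) {
            return Optional.empty();
        }
        // The joined span covers both inputs: <min(start), max(end)>.
        int start = Math.min(outer.start, inner.start);
        int end = Math.max(outer.end, inner.end);
        String key = outer.key + "_" + inner.key;
        String value = fieldText.substring(start, end);
        return Optional.of(new SimpleSpan(outer.field, start, end, key, value));
    }

    public static void main(String[] args) {
        String review = "It takes a special kind of writer to make topics interesting.";
        SimpleSpan special = new SimpleSpan("review", 11, 18, "special", "special");
        SimpleSpan writer = new SimpleSpan("review", 27, 33, "writer", "writer");
        join(special, writer, review, 20).ifPresent(s ->
                System.out.println(s.start + ", " + s.end + ", " + s.key + ", " + s.value));
    }
}
```

The same method also covers the case where one span encompasses the other: joining <11, 18> with <3, 33> under a threshold of 20 yields <3, 33>, the larger span.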

Example 17 with TextField

use of edu.uci.ics.texera.api.field.TextField in project textdb by TextDB.

the class JoinDistanceTest method testOneSpanEncompassesOtherAndDifferenceLessThanThreshold.

// This case tests for the scenario when one of the spans to be joined encompasses the other span
// and both |(span 1 spanStartIndex) - (span 2 spanStartIndex)|,
// |(span 1 spanEndIndex) - (span 2 spanEndIndex)| are within threshold.
// e.g.
// [<11, 18>]
// [<3, 33>]
// threshold = 20 (within threshold)
// Test result: The bigger span should be returned.
// [<3, 33>]
@Test
public void testOneSpanEncompassesOtherAndDifferenceLessThanThreshold() throws Exception {
    JoinTestHelper.insertToTable(BOOK_TABLE, JoinTestConstants.bookGroup1.get(0));
    KeywordMatcherSourceOperator keywordSourceOuter = JoinTestHelper.getKeywordSource(BOOK_TABLE, "special", conjunction);
    KeywordMatcherSourceOperator keywordSourceInner = JoinTestHelper.getKeywordSource(BOOK_TABLE, "takes a special kind of writer", phrase);
    List<Tuple> resultList = JoinTestHelper.getJoinDistanceResults(keywordSourceInner, keywordSourceOuter, new JoinDistancePredicate(JoinTestConstants.REVIEW, 20), Integer.MAX_VALUE, 0);
    Schema resultSchema = new Schema.Builder().add(JoinTestConstants.BOOK_SCHEMA).add(SchemaConstants.SPAN_LIST_ATTRIBUTE).build();
    List<Span> spanList = new ArrayList<>();
    Span span1 = new Span(JoinTestConstants.REVIEW, 3, 33, "special_takes a special " + "kind of writer", "takes a special " + "kind of writer");
    spanList.add(span1);
    IField[] book1 = { new IntegerField(52), new StringField("Mary Roach"), new StringField("Grunt: The Curious Science of Humans at War"), new IntegerField(288), new TextField("It takes a special kind " + "of writer to make topics ranging from death to our " + "gastrointestinal tract interesting (sometimes " + "hilariously so), and pop science writer Mary Roach is " + "always up to the task."), new ListField<>(spanList) };
    Tuple expectedTuple = new Tuple(resultSchema, book1);
    List<Tuple> expectedResult = new ArrayList<>();
    expectedResult.add(expectedTuple);
    Assert.assertEquals(1, resultList.size());
    Assert.assertTrue(TestUtils.equals(expectedResult, resultList));
}
Also used : Schema(edu.uci.ics.texera.api.schema.Schema) ArrayList(java.util.ArrayList) IntegerField(edu.uci.ics.texera.api.field.IntegerField) IField(edu.uci.ics.texera.api.field.IField) JoinDistancePredicate(edu.uci.ics.texera.dataflow.join.JoinDistancePredicate) Span(edu.uci.ics.texera.api.span.Span) KeywordMatcherSourceOperator(edu.uci.ics.texera.dataflow.keywordmatcher.KeywordMatcherSourceOperator) StringField(edu.uci.ics.texera.api.field.StringField) TextField(edu.uci.ics.texera.api.field.TextField) Tuple(edu.uci.ics.texera.api.tuple.Tuple) Test(org.junit.Test)

Example 18 with TextField

use of edu.uci.ics.texera.api.field.TextField in project textdb by TextDB.

the class JoinDistanceTest method testForLimitWhenLimitIsLesserThanActualNumberOfResults.

// ---------------------<Limit and offset test cases.>---------------------
/**
 * This case tests the scenario in which the limit is an integer greater
 * than 0 and less than the actual number of results, the offset is 0, and
 * the join is performed.
 * Test result: A list of tuples whose size is equal to the limit.
 */
@Test
public void testForLimitWhenLimitIsLesserThanActualNumberOfResults() throws Exception {
    List<Tuple> tuples = JoinTestConstants.bookGroup1.subList(1, 5);
    JoinTestHelper.insertToTable(BOOK_TABLE, tuples);
    KeywordMatcherSourceOperator keywordSourceOuter = JoinTestHelper.getKeywordSource(BOOK_TABLE, "typical", conjunction);
    KeywordMatcherSourceOperator keywordSourceInner = JoinTestHelper.getKeywordSource(BOOK_TABLE, "actually", conjunction);
    List<Tuple> resultList = JoinTestHelper.getJoinDistanceResults(keywordSourceInner, keywordSourceOuter, new JoinDistancePredicate(JoinTestConstants.REVIEW, 90), 3, 0);
    Schema resultSchema = new Schema.Builder().add(JoinTestConstants.BOOK_SCHEMA).add(SchemaConstants.SPAN_LIST_ATTRIBUTE).build();
    List<Span> spanList = new ArrayList<>();
    Span span1 = new Span(JoinTestConstants.REVIEW, 28, 119, "typical_actually", "typical review. " + "This is a test. A book review test. " + "A test to test queries without actually");
    Span span2 = new Span(JoinTestConstants.REVIEW, 186, 234, "typical_actually", "actually a review " + "even if it is not your typical");
    spanList.add(span1);
    spanList.add(span2);
    IField[] book1 = { new IntegerField(51), new StringField("author unknown"), new StringField("typical"), new IntegerField(300), new TextField("Review of a Book. This is a typical " + "review. This is a test. A book review " + "test. A test to test queries without " + "actually using actual review. From " + "here onwards, we can pretend this to " + "be actually a review even if it is not " + "your typical book review."), new ListField<>(spanList) };
    IField[] book2 = { new IntegerField(53), new StringField("Noah Hawley"), new StringField("Before the Fall"), new IntegerField(400), new TextField("Review of a Book. This is a typical " + "review. This is a test. A book review " + "test. A test to test queries without " + "actually using actual review. From " + "here onwards, we can pretend this to " + "be actually a review even if it is not " + "your typical book review."), new ListField<>(spanList) };
    IField[] book3 = { new IntegerField(54), new StringField("Andria Williams"), new StringField("The Longest Night: A Novel"), new IntegerField(400), new TextField("Review of a Book. This is a typical " + "review. This is a test. A book review " + "test. A test to test queries without " + "actually using actual review. From " + "here onwards, we can pretend this to " + "be actually a review even if it is not " + "your typical book review."), new ListField<>(spanList) };
    IField[] book4 = { new IntegerField(55), new StringField("Matti Friedman"), new StringField("Pumpkinflowers: A Soldier's " + "Story"), new IntegerField(256), new TextField("Review of a Book. This is a typical " + "review. This is a test. A book review " + "test. A test to test queries without " + "actually using actual review. From " + "here onwards, we can pretend this to " + "be actually a review even if it is not " + "your typical book review."), new ListField<>(spanList) };
    Tuple expectedTuple1 = new Tuple(resultSchema, book1);
    Tuple expectedTuple2 = new Tuple(resultSchema, book2);
    Tuple expectedTuple3 = new Tuple(resultSchema, book3);
    Tuple expectedTuple4 = new Tuple(resultSchema, book4);
    List<Tuple> expectedResult = new ArrayList<>(3);
    // Any three of the four candidate tuples may be returned, so all four are listed as acceptable.
    expectedResult.add(expectedTuple1);
    expectedResult.add(expectedTuple2);
    expectedResult.add(expectedTuple3);
    expectedResult.add(expectedTuple4);
    Assert.assertEquals(3, resultList.size());
    Assert.assertTrue(TestUtils.containsAll(expectedResult, resultList));
}
Also used : Schema(edu.uci.ics.texera.api.schema.Schema) ArrayList(java.util.ArrayList) IntegerField(edu.uci.ics.texera.api.field.IntegerField) IField(edu.uci.ics.texera.api.field.IField) JoinDistancePredicate(edu.uci.ics.texera.dataflow.join.JoinDistancePredicate) Span(edu.uci.ics.texera.api.span.Span) KeywordMatcherSourceOperator(edu.uci.ics.texera.dataflow.keywordmatcher.KeywordMatcherSourceOperator) StringField(edu.uci.ics.texera.api.field.StringField) TextField(edu.uci.ics.texera.api.field.TextField) Tuple(edu.uci.ics.texera.api.tuple.Tuple) Test(org.junit.Test)
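
The limit and offset semantics exercised by this test can be illustrated on a plain list. This is only a sketch under assumptions: `LimitOffsetSketch` and `page` are hypothetical names, and the real join operator may well apply the limit and offset lazily while iterating rather than on a materialized list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class LimitOffsetSketch {

    /** Skips the first {@code offset} results, then keeps at most {@code limit} of the rest. */
    static <T> List<T> page(List<T> results, int limit, int offset) {
        if (limit < 0 || offset < 0) {
            throw new IllegalArgumentException("limit and offset must be non-negative");
        }
        if (offset >= results.size()) {
            return new ArrayList<>();
        }
        // long arithmetic avoids overflow when limit is Integer.MAX_VALUE
        int end = (int) Math.min((long) offset + limit, results.size());
        return new ArrayList<>(results.subList(offset, end));
    }

    public static void main(String[] args) {
        List<String> joined = Arrays.asList("book51", "book53", "book54", "book55");
        // limit smaller than the number of results, offset 0
        System.out.println(page(joined, 3, 0));
        // "no limit" expressed as Integer.MAX_VALUE, as in the other tests
        System.out.println(page(joined, Integer.MAX_VALUE, 0));
    }
}
```

With four joined tuples and a limit of 3, exactly three come back, which is what the size assertion in the test checks.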

Example 19 with TextField

use of edu.uci.ics.texera.api.field.TextField in project textdb by TextDB.

the class SimilarityJoinTest method test1.

/**
 * Tests the Similarity Join Predicate on two similar strings:
 *   Donald J. Trump
 *   Donald Trump
 * Under the condition of similarity (NormalizedLevenshtein) > 0.8, these two strings should match.
 */
@Test
public void test1() throws TexeraException {
    JoinTestHelper.insertToTable(NEWS_TABLE_OUTER, JoinTestConstants.getNewsTuples().get(0));
    JoinTestHelper.insertToTable(NEWS_TABLE_INNER, JoinTestConstants.getNewsTuples().get(1));
    String trumpRegex = "[Dd]onald.{1,5}[Tt]rump";
    RegexMatcher regexMatcherInner = JoinTestHelper.getRegexMatcher(JoinTestHelper.NEWS_TABLE_INNER, trumpRegex, JoinTestConstants.NEWS_BODY);
    RegexMatcher regexMatcherOuter = JoinTestHelper.getRegexMatcher(JoinTestHelper.NEWS_TABLE_OUTER, trumpRegex, JoinTestConstants.NEWS_BODY);
    SimilarityJoinPredicate similarityJoinPredicate = new SimilarityJoinPredicate(JoinTestConstants.NEWS_BODY, 0.8);
    List<Tuple> results = JoinTestHelper.getJoinDistanceResults(regexMatcherInner, regexMatcherOuter, similarityJoinPredicate, Integer.MAX_VALUE, 0);
    Schema joinInputSchema = new Schema.Builder().add(JoinTestConstants.NEWS_SCHEMA).add(SchemaConstants.SPAN_LIST_ATTRIBUTE).build();
    Schema resultSchema = similarityJoinPredicate.generateOutputSchema(joinInputSchema, joinInputSchema);
    List<Span> resultSpanList = Arrays.asList(new Span("inner_" + JoinTestConstants.NEWS_BODY, 5, 20, trumpRegex, "Donald J. Trump", -1), new Span("outer_" + JoinTestConstants.NEWS_BODY, 18, 30, trumpRegex, "Donald Trump", -1));
    Tuple resultTuple = new Tuple(resultSchema, new IDField(UUID.randomUUID().toString()), new IntegerField(2), new TextField("Alternative Facts and the Costs of Trump-Branded Reality"), new TextField("When Donald J. Trump swore the presidential oath on Friday, he assumed " + "responsibility not only for the levers of government but also for one of " + "the United States’ most valuable assets, battered though it may be: its credibility. " + "The country’s sentimental reverence for truth and its jealously guarded press freedoms, " + "while never perfect, have been as important to its global standing as the strength of " + "its military and the reliability of its currency. It’s the bedrock of that " + "American exceptionalism we’ve heard so much about for so long."), new IntegerField(1), new TextField("UCI marchers protest as Trump begins his presidency"), new TextField("a few hours after Donald Trump was sworn in Friday as the nation’s 45th president, " + "a line of more than 100 UC Irvine faculty members and students took to the campus " + "in pouring rain to demonstrate their opposition to his policies on immigration and " + "other issues and urge other opponents to keep organizing during Trump’s presidency."), new ListField<>(resultSpanList));
    Assert.assertTrue(TestUtils.equals(Arrays.asList(resultTuple), results));
}
Also used : IDField(edu.uci.ics.texera.api.field.IDField) SimilarityJoinPredicate(edu.uci.ics.texera.dataflow.join.SimilarityJoinPredicate) Schema(edu.uci.ics.texera.api.schema.Schema) IntegerField(edu.uci.ics.texera.api.field.IntegerField) Span(edu.uci.ics.texera.api.span.Span) TextField(edu.uci.ics.texera.api.field.TextField) RegexMatcher(edu.uci.ics.texera.dataflow.regexmatcher.RegexMatcher) Tuple(edu.uci.ics.texera.api.tuple.Tuple) Test(org.junit.Test)
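
For readers unfamiliar with the measure named in the comment, a normalized Levenshtein similarity can be sketched as one minus the edit distance divided by the length of the longer string. The code below is a hand-written illustration with hypothetical names (`NormalizedLevenshteinSketch`, `distance`, `similarity`); it is not the implementation the project uses, and the exact normalization there is an assumption.

```java
final class NormalizedLevenshteinSketch {

    /** Classic dynamic-programming edit distance, keeping two rows of the table. */
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into b[0..j) takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning a[0..i) into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(
                        Math.min(curr[j - 1] + 1, prev[j] + 1),
                        prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Similarity in [0, 1]: 1 - distance / max(length). Two empty strings count as identical. */
    static double similarity(String a, String b) {
        int maxLen = Math.max(a.length(), b.length());
        if (maxLen == 0) {
            return 1.0;
        }
        return 1.0 - (double) distance(a, b) / maxLen;
    }

    public static void main(String[] args) {
        System.out.println(similarity("Donald J. Trump", "Donald Trump"));
        System.out.println(similarity("Galaxy S8", "Galaxy Note 7"));
    }
}
```

Under this definition the pair "Donald J. Trump" / "Donald Trump" is 3 edits apart over 15 characters, which works out to a similarity of 0.8. If the project's measure is defined the same way, the match in this test would depend on the threshold comparison being inclusive.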

Example 20 with TextField

use of edu.uci.ics.texera.api.field.TextField in project textdb by TextDB.

the class SimilarityJoinTest method test3.

/**
 * Tests the Similarity Join Predicate on two similar strings:
 *   Galaxy S8
 *   Galaxy Note 7
 * Under the condition of similarity (NormalizedLevenshtein) > 0.5, these two strings should match.
 */
@Test
public void test3() throws TexeraException {
    JoinTestHelper.insertToTable(NEWS_TABLE_OUTER, JoinTestConstants.getNewsTuples().get(2));
    JoinTestHelper.insertToTable(NEWS_TABLE_INNER, JoinTestConstants.getNewsTuples().get(3));
    String phoneRegex = "[Gg]alaxy.{1,6}\\d";
    RegexMatcher regexMatcherInner = JoinTestHelper.getRegexMatcher(JoinTestHelper.NEWS_TABLE_INNER, phoneRegex, JoinTestConstants.NEWS_BODY);
    RegexMatcher regexMatcherOuter = JoinTestHelper.getRegexMatcher(JoinTestHelper.NEWS_TABLE_OUTER, phoneRegex, JoinTestConstants.NEWS_BODY);
    SimilarityJoinPredicate similarityJoinPredicate = new SimilarityJoinPredicate(JoinTestConstants.NEWS_BODY, 0.5);
    List<Tuple> results = JoinTestHelper.getJoinDistanceResults(regexMatcherInner, regexMatcherOuter, similarityJoinPredicate, Integer.MAX_VALUE, 0);
    Schema joinInputSchema = new Schema.Builder().add(JoinTestConstants.NEWS_SCHEMA).add(SchemaConstants.SPAN_LIST_ATTRIBUTE).build();
    Schema resultSchema = similarityJoinPredicate.generateOutputSchema(joinInputSchema, joinInputSchema);
    List<Span> resultSpanList = Arrays.asList(new Span("inner_" + JoinTestConstants.NEWS_BODY, 327, 336, phoneRegex, "Galaxy S8", -1), new Span("outer_" + JoinTestConstants.NEWS_BODY, 21, 34, phoneRegex, "Galaxy Note 7", -1));
    Tuple resultTuple = new Tuple(resultSchema, new IDField(UUID.randomUUID().toString()), new IntegerField(4), new TextField("This is how Samsung plans to prevent future phones from catching fire"), new TextField("Samsung said that it has implemented a new eight-step testing process for " + "its lithium ion batteries, and that it’s forming a battery advisory board as well, " + "comprised of academics from Cambridge, Berkeley, and Stanford. " + "Note, this is for all lithium ion batteries in Samsung products, " + "not just Note phablets or the anticipated Galaxy S8 phone."), new IntegerField(3), new TextField("Samsung Explains Note 7 Battery Explosions, And Turns Crisis Into Opportunity"), new TextField("Samsung launched the Galaxy Note 7 to record preorders and sales in August, " + "but the rosy start soon turned sour. Samsung had to initiate a recall in September of " + "the first version of the Note 7 due to faulty batteries that overheated and exploded. " + "By October it had to recall over 2 million devices and discontinue the product. " + "It’s estimated that the recall will cost Samsung $5.3 billion."), new ListField<>(resultSpanList));
    Assert.assertTrue(TestUtils.equals(Arrays.asList(resultTuple), results));
}
Also used : IDField(edu.uci.ics.texera.api.field.IDField) SimilarityJoinPredicate(edu.uci.ics.texera.dataflow.join.SimilarityJoinPredicate) Schema(edu.uci.ics.texera.api.schema.Schema) IntegerField(edu.uci.ics.texera.api.field.IntegerField) Span(edu.uci.ics.texera.api.span.Span) TextField(edu.uci.ics.texera.api.field.TextField) RegexMatcher(edu.uci.ics.texera.dataflow.regexmatcher.RegexMatcher) Tuple(edu.uci.ics.texera.api.tuple.Tuple) Test(org.junit.Test)


TextField (edu.uci.ics.texera.api.field.TextField) 115
IField (edu.uci.ics.texera.api.field.IField) 99
Tuple (edu.uci.ics.texera.api.tuple.Tuple) 89
ArrayList (java.util.ArrayList) 84
IntegerField (edu.uci.ics.texera.api.field.IntegerField) 78
StringField (edu.uci.ics.texera.api.field.StringField) 78
Span (edu.uci.ics.texera.api.span.Span) 78
Schema (edu.uci.ics.texera.api.schema.Schema) 77
Test (org.junit.Test) 76
DoubleField (edu.uci.ics.texera.api.field.DoubleField) 63
DateField (edu.uci.ics.texera.api.field.DateField) 58
Attribute (edu.uci.ics.texera.api.schema.Attribute) 56
SimpleDateFormat (java.text.SimpleDateFormat) 56
Dictionary (edu.uci.ics.texera.dataflow.dictionarymatcher.Dictionary) 29
ListField (edu.uci.ics.texera.api.field.ListField) 11
JoinDistancePredicate (edu.uci.ics.texera.dataflow.join.JoinDistancePredicate) 9
KeywordMatcherSourceOperator (edu.uci.ics.texera.dataflow.keywordmatcher.KeywordMatcherSourceOperator) 9
JsonNode (com.fasterxml.jackson.databind.JsonNode) 5
IOperator (edu.uci.ics.texera.api.dataflow.IOperator) 5
ObjectMapper (com.fasterxml.jackson.databind.ObjectMapper) 4