Example 1 with SimilarityJoinPredicate

Use of edu.uci.ics.texera.dataflow.join.SimilarityJoinPredicate in project textdb by TextDB.

In class SimilarityJoinTest, method test3:

/*
 * Tests the Similarity Join Predicate on two similar phrases:
 *   Galaxy S8
 *   Galaxy Note 7
 * With a NormalizedLevenshtein similarity threshold of 0.5, these two phrases should match.
 */
@Test
public void test3() throws TexeraException {
    JoinTestHelper.insertToTable(NEWS_TABLE_OUTER, JoinTestConstants.getNewsTuples().get(2));
    JoinTestHelper.insertToTable(NEWS_TABLE_INNER, JoinTestConstants.getNewsTuples().get(3));
    String phoneRegex = "[Gg]alaxy.{1,6}\\d";
    RegexMatcher regexMatcherInner = JoinTestHelper.getRegexMatcher(JoinTestHelper.NEWS_TABLE_INNER, phoneRegex, JoinTestConstants.NEWS_BODY);
    RegexMatcher regexMatcherOuter = JoinTestHelper.getRegexMatcher(JoinTestHelper.NEWS_TABLE_OUTER, phoneRegex, JoinTestConstants.NEWS_BODY);
    SimilarityJoinPredicate similarityJoinPredicate = new SimilarityJoinPredicate(JoinTestConstants.NEWS_BODY, 0.5);
    List<Tuple> results = JoinTestHelper.getJoinDistanceResults(regexMatcherInner, regexMatcherOuter, similarityJoinPredicate, Integer.MAX_VALUE, 0);
    Schema joinInputSchema = new Schema.Builder().add(JoinTestConstants.NEWS_SCHEMA).add(SchemaConstants.SPAN_LIST_ATTRIBUTE).build();
    Schema resultSchema = similarityJoinPredicate.generateOutputSchema(joinInputSchema, joinInputSchema);
    List<Span> resultSpanList = Arrays.asList(
            new Span("inner_" + JoinTestConstants.NEWS_BODY, 327, 336, phoneRegex, "Galaxy S8", -1),
            new Span("outer_" + JoinTestConstants.NEWS_BODY, 21, 34, phoneRegex, "Galaxy Note 7", -1));
    Tuple resultTuple = new Tuple(resultSchema,
            new IDField(UUID.randomUUID().toString()),
            new IntegerField(4),
            new TextField("This is how Samsung plans to prevent future phones from catching fire"),
            new TextField("Samsung said that it has implemented a new eight-step testing process for "
                    + "its lithium ion batteries, and that it’s forming a battery advisory board as well, "
                    + "comprised of academics from Cambridge, Berkeley, and Stanford. "
                    + "Note, this is for all lithium ion batteries in Samsung products, "
                    + "not just Note phablets or the anticipated Galaxy S8 phone."),
            new IntegerField(3),
            new TextField("Samsung Explains Note 7 Battery Explosions, And Turns Crisis Into Opportunity"),
            new TextField("Samsung launched the Galaxy Note 7 to record preorders and sales in August, "
                    + "but the rosy start soon turned sour. Samsung had to initiate a recall in September of "
                    + "the first version of the Note 7 due to faulty batteries that overheated and exploded. "
                    + "By October it had to recall over 2 million devices and discontinue the product. "
                    + "It’s estimated that the recall will cost Samsung $5.3 billion."),
            new ListField<>(resultSpanList));
    Assert.assertTrue(TestUtils.equals(Arrays.asList(resultTuple), results));
}
Also used : IDField(edu.uci.ics.texera.api.field.IDField) SimilarityJoinPredicate(edu.uci.ics.texera.dataflow.join.SimilarityJoinPredicate) Schema(edu.uci.ics.texera.api.schema.Schema) IntegerField(edu.uci.ics.texera.api.field.IntegerField) Span(edu.uci.ics.texera.api.span.Span) TextField(edu.uci.ics.texera.api.field.TextField) RegexMatcher(edu.uci.ics.texera.dataflow.regexmatcher.RegexMatcher) Tuple(edu.uci.ics.texera.api.tuple.Tuple) Test(org.junit.Test)
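The thresholds in these tests are easier to read with the similarity values worked out. The following is a minimal plain-Java sketch, not the code Texera runs: it assumes the similarity is 1 minus the Levenshtein distance divided by the longer string's length, and the class name NormalizedLevenshteinSketch is hypothetical.

```java
// Hypothetical helper, not part of Texera: a plain-Java sketch of a
// normalized Levenshtein similarity (assumed to be 1 - distance / maxLength).
final class NormalizedLevenshteinSketch {

    // Classic dynamic-programming edit distance; insert, delete, and
    // substitute each cost 1. Only two rows of the table are kept.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Similarity in [0, 1]; two empty strings are treated as identical.
    static double similarity(String a, String b) {
        int maxLen = Math.max(a.length(), b.length());
        if (maxLen == 0) {
            return 1.0;
        }
        return 1.0 - (double) levenshtein(a, b) / maxLen;
    }

    public static void main(String[] args) {
        // Edit distance 3 over length 15: similarity of about 0.8.
        System.out.println(similarity("Donald J. Trump", "Donald Trump"));
        // Edit distance 6 over length 13: similarity of about 0.538.
        System.out.println(similarity("Galaxy S8", "Galaxy Note 7"));
    }
}
```

Under this assumption, the Galaxy pair scores about 0.538, which clears a 0.5 threshold but not 0.8, and the Trump pair scores about 0.8, which clears 0.8 but not 0.9. Because the Trump pair sits right at 0.8 and is still expected to match, the predicate's comparison is probably inclusive, but that is an inference from the tests rather than something the source states.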

Example 2 with SimilarityJoinPredicate

Use of edu.uci.ics.texera.dataflow.join.SimilarityJoinPredicate in project textdb by TextDB.

In class SimilarityJoinTest, method test1:

/*
 * Tests the Similarity Join Predicate on two similar phrases:
 *   Donald J. Trump
 *   Donald Trump
 * With a NormalizedLevenshtein similarity threshold of 0.8, these two phrases should match.
 */
@Test
public void test1() throws TexeraException {
    JoinTestHelper.insertToTable(NEWS_TABLE_OUTER, JoinTestConstants.getNewsTuples().get(0));
    JoinTestHelper.insertToTable(NEWS_TABLE_INNER, JoinTestConstants.getNewsTuples().get(1));
    String trumpRegex = "[Dd]onald.{1,5}[Tt]rump";
    RegexMatcher regexMatcherInner = JoinTestHelper.getRegexMatcher(JoinTestHelper.NEWS_TABLE_INNER, trumpRegex, JoinTestConstants.NEWS_BODY);
    RegexMatcher regexMatcherOuter = JoinTestHelper.getRegexMatcher(JoinTestHelper.NEWS_TABLE_OUTER, trumpRegex, JoinTestConstants.NEWS_BODY);
    SimilarityJoinPredicate similarityJoinPredicate = new SimilarityJoinPredicate(JoinTestConstants.NEWS_BODY, 0.8);
    List<Tuple> results = JoinTestHelper.getJoinDistanceResults(regexMatcherInner, regexMatcherOuter, similarityJoinPredicate, Integer.MAX_VALUE, 0);
    Schema joinInputSchema = new Schema.Builder().add(JoinTestConstants.NEWS_SCHEMA).add(SchemaConstants.SPAN_LIST_ATTRIBUTE).build();
    Schema resultSchema = similarityJoinPredicate.generateOutputSchema(joinInputSchema, joinInputSchema);
    List<Span> resultSpanList = Arrays.asList(
            new Span("inner_" + JoinTestConstants.NEWS_BODY, 5, 20, trumpRegex, "Donald J. Trump", -1),
            new Span("outer_" + JoinTestConstants.NEWS_BODY, 18, 30, trumpRegex, "Donald Trump", -1));
    Tuple resultTuple = new Tuple(resultSchema,
            new IDField(UUID.randomUUID().toString()),
            new IntegerField(2),
            new TextField("Alternative Facts and the Costs of Trump-Branded Reality"),
            new TextField("When Donald J. Trump swore the presidential oath on Friday, he assumed "
                    + "responsibility not only for the levers of government but also for one of "
                    + "the United States’ most valuable assets, battered though it may be: its credibility. "
                    + "The country’s sentimental reverence for truth and its jealously guarded press freedoms, "
                    + "while never perfect, have been as important to its global standing as the strength of "
                    + "its military and the reliability of its currency. It’s the bedrock of that "
                    + "American exceptionalism we’ve heard so much about for so long."),
            new IntegerField(1),
            new TextField("UCI marchers protest as Trump begins his presidency"),
            new TextField("a few hours after Donald Trump was sworn in Friday as the nation’s 45th president, "
                    + "a line of more than 100 UC Irvine faculty members and students took to the campus "
                    + "in pouring rain to demonstrate their opposition to his policies on immigration and "
                    + "other issues and urge other opponents to keep organizing during Trump’s presidency."),
            new ListField<>(resultSpanList));
    Assert.assertTrue(TestUtils.equals(Arrays.asList(resultTuple), results));
}
Also used : IDField(edu.uci.ics.texera.api.field.IDField) SimilarityJoinPredicate(edu.uci.ics.texera.dataflow.join.SimilarityJoinPredicate) Schema(edu.uci.ics.texera.api.schema.Schema) IntegerField(edu.uci.ics.texera.api.field.IntegerField) Span(edu.uci.ics.texera.api.span.Span) TextField(edu.uci.ics.texera.api.field.TextField) RegexMatcher(edu.uci.ics.texera.dataflow.regexmatcher.RegexMatcher) Tuple(edu.uci.ics.texera.api.tuple.Tuple) Test(org.junit.Test)
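The span offsets expected in test1 (5 to 20 and 18 to 30) can be checked independently of Texera. The sketch below uses java.util.regex on the opening words of the two news bodies; it assumes Texera's RegexMatcher reports character offsets the same way (start inclusive, end exclusive), which the expected spans are consistent with but the source does not state. The class name RegexSpanSketch and the helper findSpans are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Texera: lists regex matches with their
// character offsets using the JDK regex engine.
final class RegexSpanSketch {

    // Returns each match as "start-end:value"; end is exclusive, as in Matcher.end().
    static List<String> findSpans(String regex, String text) {
        List<String> spans = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            spans.add(m.start() + "-" + m.end() + ":" + m.group());
        }
        return spans;
    }

    public static void main(String[] args) {
        String trumpRegex = "[Dd]onald.{1,5}[Tt]rump";
        // Opening words of the two news bodies used in test1.
        System.out.println(findSpans(trumpRegex, "When Donald J. Trump swore the presidential oath on Friday"));
        System.out.println(findSpans(trumpRegex, "a few hours after Donald Trump was sworn in Friday"));
    }
}
```

The two matched values, "Donald J. Trump" and "Donald Trump", are the strings the join then compares against the similarity threshold.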

Example 3 with SimilarityJoinPredicate

Use of edu.uci.ics.texera.dataflow.join.SimilarityJoinPredicate in project textdb by TextDB.

In class SimilarityJoinTest, method test2:

/*
 * Tests the Similarity Join Predicate on two similar phrases:
 *   Donald J. Trump
 *   Donald Trump
 * With a NormalizedLevenshtein similarity threshold of 0.9, these two phrases should NOT match.
 */
@Test
public void test2() throws TexeraException {
    JoinTestHelper.insertToTable(NEWS_TABLE_OUTER, JoinTestConstants.getNewsTuples().get(0));
    JoinTestHelper.insertToTable(NEWS_TABLE_INNER, JoinTestConstants.getNewsTuples().get(1));
    String trumpRegex = "[Dd]onald.{1,5}[Tt]rump";
    RegexMatcher regexMatcherInner = JoinTestHelper.getRegexMatcher(JoinTestHelper.NEWS_TABLE_INNER, trumpRegex, JoinTestConstants.NEWS_BODY);
    RegexMatcher regexMatcherOuter = JoinTestHelper.getRegexMatcher(JoinTestHelper.NEWS_TABLE_OUTER, trumpRegex, JoinTestConstants.NEWS_BODY);
    SimilarityJoinPredicate similarityJoinPredicate = new SimilarityJoinPredicate(JoinTestConstants.NEWS_BODY, 0.9);
    List<Tuple> results = JoinTestHelper.getJoinDistanceResults(regexMatcherInner, regexMatcherOuter, similarityJoinPredicate, Integer.MAX_VALUE, 0);
    Assert.assertTrue(results.isEmpty());
}
Also used : SimilarityJoinPredicate(edu.uci.ics.texera.dataflow.join.SimilarityJoinPredicate) RegexMatcher(edu.uci.ics.texera.dataflow.regexmatcher.RegexMatcher) Tuple(edu.uci.ics.texera.api.tuple.Tuple) Test(org.junit.Test)

Example 4 with SimilarityJoinPredicate

Use of edu.uci.ics.texera.dataflow.join.SimilarityJoinPredicate in project textdb by TextDB.

In class SimilarityJoinTest, method test4:

/*
 * Tests the Similarity Join Predicate on two similar phrases:
 *   Galaxy S8
 *   Galaxy Note 7
 * With a NormalizedLevenshtein similarity threshold of 0.8, these two phrases should NOT match.
 */
@Test
public void test4() throws TexeraException {
    JoinTestHelper.insertToTable(NEWS_TABLE_OUTER, JoinTestConstants.getNewsTuples().get(2));
    JoinTestHelper.insertToTable(NEWS_TABLE_INNER, JoinTestConstants.getNewsTuples().get(3));
    String phoneRegex = "[Gg]alaxy.{1,6}\\d";
    RegexMatcher regexMatcherInner = JoinTestHelper.getRegexMatcher(JoinTestHelper.NEWS_TABLE_INNER, phoneRegex, JoinTestConstants.NEWS_BODY);
    RegexMatcher regexMatcherOuter = JoinTestHelper.getRegexMatcher(JoinTestHelper.NEWS_TABLE_OUTER, phoneRegex, JoinTestConstants.NEWS_BODY);
    SimilarityJoinPredicate similarityJoinPredicate = new SimilarityJoinPredicate(JoinTestConstants.NEWS_BODY, 0.8);
    List<Tuple> results = JoinTestHelper.getJoinDistanceResults(regexMatcherInner, regexMatcherOuter, similarityJoinPredicate, Integer.MAX_VALUE, 0);
    Assert.assertTrue(results.isEmpty());
}
Also used : SimilarityJoinPredicate(edu.uci.ics.texera.dataflow.join.SimilarityJoinPredicate) RegexMatcher(edu.uci.ics.texera.dataflow.regexmatcher.RegexMatcher) Tuple(edu.uci.ics.texera.api.tuple.Tuple) Test(org.junit.Test)

Example 5 with SimilarityJoinPredicate

Use of edu.uci.ics.texera.dataflow.join.SimilarityJoinPredicate in project textdb by TextDB.

In class PredicateBaseTest, method testSimilarityJoin:

@Test
public void testSimilarityJoin() throws Exception {
    SimilarityJoinPredicate similarityJoinPredicate = new SimilarityJoinPredicate("attr1", "attr1", 0.8);
    testPredicate(similarityJoinPredicate);
}
Also used : SimilarityJoinPredicate(edu.uci.ics.texera.dataflow.join.SimilarityJoinPredicate) Test(org.junit.Test)

Aggregations

SimilarityJoinPredicate (edu.uci.ics.texera.dataflow.join.SimilarityJoinPredicate): 5
Test (org.junit.Test): 5
Tuple (edu.uci.ics.texera.api.tuple.Tuple): 4
RegexMatcher (edu.uci.ics.texera.dataflow.regexmatcher.RegexMatcher): 4
IDField (edu.uci.ics.texera.api.field.IDField): 2
IntegerField (edu.uci.ics.texera.api.field.IntegerField): 2
TextField (edu.uci.ics.texera.api.field.TextField): 2
Schema (edu.uci.ics.texera.api.schema.Schema): 2
Span (edu.uci.ics.texera.api.span.Span): 2