Example 61 with Feature

Use of org.dkpro.tc.api.features.Feature in project dkpro-tc by dkpro.

In the class TokenRatioPerDocument, method extract.

@Override
public Set<Feature> extract(JCas jcas, TextClassificationTarget aTarget) throws TextClassificationException {
    long maxLen = getMax();
    Collection<Token> tokens = JCasUtil.selectCovered(jcas, Token.class, aTarget);
    double ratio = getRatio(tokens.size(), maxLen);
    return new Feature(FEATURE_NAME, ratio, FeatureType.NUMERIC).asSet();
}
Also used: Token (de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token), Feature (org.dkpro.tc.api.features.Feature)
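
The getRatio helper called in this example is not shown. As a rough sketch only, assuming it divides the token count by the maximum and guards against a zero maximum, it could look like the following. The class and method names here are hypothetical and not part of DKPro TC.

```java
class TokenRatioSketch {

    // Hypothetical stand-in for getRatio: the token count as a fraction of the maximum.
    static double ratio(int tokenCount, long maxTokens) {
        if (maxTokens <= 0) {
            return 0.0; // assumption: treat a missing or zero maximum as ratio 0
        }
        return (double) tokenCount / maxTokens;
    }

    public static void main(String[] args) {
        System.out.println(ratio(25, 100)); // prints 0.25
    }
}
```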

Example 62 with Feature

Use of org.dkpro.tc.api.features.Feature in project dkpro-tc by dkpro.

In the class KeywordNGram, method extract.

@Override
public Set<Feature> extract(JCas jcas, TextClassificationTarget aTarget) throws TextClassificationException {
    Set<Feature> features = new HashSet<Feature>();
    FrequencyDistribution<String> documentNgrams = KeywordNGramUtils.getDocumentKeywordNgrams(jcas, aTarget, ngramMinN, ngramMaxN, markSentenceBoundary, markSentenceLocation, includeCommas, keywords);
    for (String topNgram : topKSet.getKeys()) {
        if (documentNgrams.getKeys().contains(topNgram)) {
            features.add(new Feature(getFeaturePrefix() + "_" + topNgram, 1, FeatureType.BOOLEAN));
        } else {
            features.add(new Feature(getFeaturePrefix() + "_" + topNgram, 0, true, FeatureType.BOOLEAN));
        }
    }
    return features;
}
Also used: Feature (org.dkpro.tc.api.features.Feature), HashSet (java.util.HashSet)

Example 63 with Feature

Use of org.dkpro.tc.api.features.Feature in project dkpro-tc by dkpro.

In the class PosNGram, method extract.

@Override
public Set<Feature> extract(JCas view, TextClassificationTarget classificationUnit) throws TextClassificationException {
    Set<Feature> features = new HashSet<Feature>();
    FrequencyDistribution<String> documentPOSNgrams = PosNGramMC.getDocumentPosNgrams(view, classificationUnit, ngramMinN, ngramMaxN, useCanonicalTags);
    for (String topNgram : topKSet.getKeys()) {
        if (documentPOSNgrams.getKeys().contains(topNgram)) {
            features.add(new Feature(getFeaturePrefix() + "_" + topNgram, 1, FeatureType.BOOLEAN));
        } else {
            features.add(new Feature(getFeaturePrefix() + "_" + topNgram, 0, true, FeatureType.BOOLEAN));
        }
    }
    return features;
}
Also used: Feature (org.dkpro.tc.api.features.Feature), HashSet (java.util.HashSet)
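
The n-gram extractors above follow one pattern: for every n-gram in a precomputed top-k set, emit a boolean feature that is 1 if the document contains the n-gram and 0 otherwise. The sketch below isolates that pattern with plain collections. It is an illustration under stated assumptions, not the library's code: presenceFeatures is a hypothetical helper, a Map stands in for the Set<Feature> result, and a Set<String> stands in for FrequencyDistribution. The extra boolean argument that the original passes in the absent branch is ignored here.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

class TopKPresenceSketch {

    // Hypothetical helper: maps "<prefix>_<ngram>" to 1 if the document contains
    // the n-gram and to 0 otherwise, preserving the order of the top-k list.
    static Map<String, Integer> presenceFeatures(String prefix, List<String> topK,
            Set<String> documentNgrams) {
        Map<String, Integer> features = new LinkedHashMap<>();
        for (String ngram : topK) {
            features.put(prefix + "_" + ngram, documentNgrams.contains(ngram) ? 1 : 0);
        }
        return features;
    }

    public static void main(String[] args) {
        List<String> topK = Arrays.asList("DT_NN", "NN_VB", "JJ_NN");
        Set<String> document = new HashSet<>(Arrays.asList("DT_NN", "JJ_NN"));
        // prints {posngram_DT_NN=1, posngram_NN_VB=0, posngram_JJ_NN=1}
        System.out.println(presenceFeatures("posngram", topK, document));
    }
}
```

Every top-k n-gram yields a feature whether or not it occurs, so each document ends up with the same feature names, which is what a fixed-width feature vector needs.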

Example 64 with Feature

Use of org.dkpro.tc.api.features.Feature in project dkpro-tc by dkpro.

In the class NumberOfHashTagsTest, method numberOfHashTagsFeatureExtractorTest.

@Test
public void numberOfHashTagsFeatureExtractorTest() throws Exception {
    AnalysisEngineDescription desc = createEngineDescription(NoOpAnnotator.class);
    AnalysisEngine engine = createEngine(desc);
    JCas jcas = engine.newJCas();
    jcas.setDocumentLanguage("en");
    jcas.setDocumentText("This is a very #emotional tweet ;-) #icouldcry #ILoveHashTags");
    engine.process(jcas);
    TextClassificationTarget aTarget = new TextClassificationTarget(jcas, 0, jcas.getDocumentText().length());
    aTarget.addToIndexes();
    NumberOfHashTags extractor = new NumberOfHashTags();
    List<Feature> features = new ArrayList<Feature>(extractor.extract(jcas, aTarget));
    Assert.assertEquals(1, features.size());
    for (Feature feature : features) {
        assertFeature(NumberOfHashTags.class.getSimpleName(), 3, feature);
    }
}
Also used: AnalysisEngineDescription (org.apache.uima.analysis_engine.AnalysisEngineDescription), TextClassificationTarget (org.dkpro.tc.api.type.TextClassificationTarget), ArrayList (java.util.ArrayList), JCas (org.apache.uima.jcas.JCas), NumberOfHashTags (org.dkpro.tc.features.twitter.NumberOfHashTags), FeatureTestUtil.assertFeature (org.dkpro.tc.testing.FeatureTestUtil.assertFeature), Feature (org.dkpro.tc.api.features.Feature), AnalysisEngine (org.apache.uima.analysis_engine.AnalysisEngine), Test (org.junit.Test)
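
The test above expects three hashtags in the sample tweet, but the extractor's implementation is not shown. One plausible way to get that count, sketched here with a regular expression, is to match a '#' followed by word characters. The pattern and the class name are assumptions and may differ from what NumberOfHashTags actually does.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HashTagCountSketch {

    // Assumed pattern: '#' followed by one or more word characters (letters, digits, '_').
    private static final Pattern HASHTAG = Pattern.compile("#\\w+");

    static int countHashTags(String text) {
        Matcher matcher = HASHTAG.matcher(text);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String tweet = "This is a very #emotional tweet ;-) #icouldcry #ILoveHashTags";
        System.out.println(countHashTags(tweet)); // prints 3
    }
}
```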

Example 65 with Feature

Use of org.dkpro.tc.api.features.Feature in project dkpro-tc by dkpro.

In the class LibsvmDataFormatLoadModelConnector, method createInputFile.

private File createInputFile(JCas jcas) throws Exception {
    File tempFile = FileUtil.createTempFile("libsvm", ".txt");
    tempFile.deleteOnExit();
    // try-with-resources closes the writer even if extraction or writing throws
    try (BufferedWriter bw = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(tempFile), "utf-8"))) {
        InstanceExtractor extractor = new InstanceExtractor(featureMode, featureExtractors, true);
        List<Instance> instances = extractor.getInstances(jcas, true);
        for (Instance instance : instances) {
            bw.write(OUTCOME_PLACEHOLDER);
            bw.write(injectSequenceId(instance));
            for (Feature f : instance.getFeatures()) {
                if (!sanityCheckValue(f)) {
                    continue;
                }
                bw.write("\t");
                bw.write(featureMapping.get(f.getName()) + ":" + f.getValue());
            }
            bw.write("\n");
        }
    }
    return tempFile;
}
Also used: Instance (org.dkpro.tc.api.features.Instance), FileOutputStream (java.io.FileOutputStream), OutputStreamWriter (java.io.OutputStreamWriter), File (java.io.File), InstanceExtractor (org.dkpro.tc.core.task.uima.InstanceExtractor), Feature (org.dkpro.tc.api.features.Feature), BufferedWriter (java.io.BufferedWriter)
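
Each line written by createInputFile consists of an outcome placeholder followed by tab-separated index:value pairs, where the index is looked up from the feature name. The sketch below shows that line layout in isolation. toLine and its parameters are hypothetical. Skipping unmapped features and sorting by index are additions in this sketch (the LIBSVM README asks for ascending indices) and are not behavior confirmed by the method above.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

class LibsvmLineSketch {

    // Hypothetical helper: builds "<label>\t<index>:<value>\t<index>:<value>...".
    static String toLine(String label, Map<String, Double> features,
            Map<String, Integer> featureMapping) {
        // TreeMap orders entries by feature index (an assumption of this sketch).
        TreeMap<Integer, Double> byIndex = new TreeMap<>();
        for (Map.Entry<String, Double> entry : features.entrySet()) {
            Integer index = featureMapping.get(entry.getKey());
            if (index == null) {
                continue; // assumption: silently skip features without a mapped index
            }
            byIndex.put(index, entry.getValue());
        }
        StringBuilder line = new StringBuilder(label);
        for (Map.Entry<Integer, Double> entry : byIndex.entrySet()) {
            line.append('\t').append(entry.getKey()).append(':').append(entry.getValue());
        }
        return line.toString();
    }

    public static void main(String[] args) {
        Map<String, Integer> mapping = new LinkedHashMap<>();
        mapping.put("tokenRatio", 1);
        mapping.put("hashTags", 2);
        Map<String, Double> features = new LinkedHashMap<>();
        features.put("hashTags", 3.0);
        features.put("tokenRatio", 0.5);
        // prints the label, a tab, "1:0.5", a tab, "2:3.0"
        System.out.println(toLine("0", features, mapping));
    }
}
```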

Aggregations

Feature (org.dkpro.tc.api.features.Feature): 94
Test (org.junit.Test): 48
Instance (org.dkpro.tc.api.features.Instance): 30
ArrayList (java.util.ArrayList): 29
HashSet (java.util.HashSet): 21
FeatureTestUtil.assertFeature (org.dkpro.tc.testing.FeatureTestUtil.assertFeature): 17
AnalysisEngine (org.apache.uima.analysis_engine.AnalysisEngine): 16
TextClassificationTarget (org.dkpro.tc.api.type.TextClassificationTarget): 16
JCas (org.apache.uima.jcas.JCas): 15
AnalysisEngineDescription (org.apache.uima.analysis_engine.AnalysisEngineDescription): 13
File (java.io.File): 8
Attribute (weka.core.Attribute): 8
Token (de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token): 7
Sentence (de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Sentence): 6
TextClassificationException (org.dkpro.tc.api.exception.TextClassificationException): 5
Chunk (de.tudarmstadt.ukp.dkpro.core.api.syntax.type.chunk.Chunk): 4
OpenNlpPosTagger (de.tudarmstadt.ukp.dkpro.core.opennlp.OpenNlpPosTagger): 4
BreakIteratorSegmenter (de.tudarmstadt.ukp.dkpro.core.tokit.BreakIteratorSegmenter): 4
Instances (weka.core.Instances): 4
IOException (java.io.IOException): 3