Example 1 with SitemapSet

Use of org.codelibs.fess.crawler.entity.SitemapSet in project fess-crawler by codelibs.

In class SitemapsHelperTest, method test_parseXmlSitemapsGz.

public void test_parseXmlSitemapsGz() {
    final InputStream in = ResourceUtil.getResourceAsStream("sitemaps/sitemap1.xml.gz");
    final SitemapSet sitemapSet = sitemapsHelper.parse(in);
    final Sitemap[] sitemaps = sitemapSet.getSitemaps();
    assertEquals(5, sitemaps.length);
    assertTrue(sitemapSet.isUrlSet());
    assertFalse(sitemapSet.isIndex());
    assertEquals("2005-01-01", sitemaps[0].getLastmod());
    assertEquals("http://www.example.com/", sitemaps[0].getLoc());
    assertEquals("monthly", ((SitemapUrl) sitemaps[0]).getChangefreq());
    assertEquals("0.8", ((SitemapUrl) sitemaps[0]).getPriority());
    assertNull(sitemaps[1].getLastmod());
    assertEquals("http://www.example.com/catalog?item=12&desc=vacation_hawaii", sitemaps[1].getLoc());
    assertEquals("weekly", ((SitemapUrl) sitemaps[1]).getChangefreq());
    assertNull(((SitemapUrl) sitemaps[1]).getPriority());
    assertEquals("2004-12-23", sitemaps[2].getLastmod());
    assertEquals("http://www.example.com/catalog?item=73&desc=vacation_new_zealand", sitemaps[2].getLoc());
    assertEquals("weekly", ((SitemapUrl) sitemaps[2]).getChangefreq());
    assertNull(((SitemapUrl) sitemaps[2]).getPriority());
    assertEquals("2004-12-23T18:00:15+00:00", sitemaps[3].getLastmod());
    assertEquals("http://www.example.com/catalog?item=74&desc=vacation_newfoundland", sitemaps[3].getLoc());
    assertNull(((SitemapUrl) sitemaps[3]).getChangefreq());
    assertEquals("0.3", ((SitemapUrl) sitemaps[3]).getPriority());
    assertEquals("2004-11-23", sitemaps[4].getLastmod());
    assertEquals("http://www.example.com/catalog?item=83&desc=vacation_usa", sitemaps[4].getLoc());
    assertNull(((SitemapUrl) sitemaps[4]).getChangefreq());
    assertNull(((SitemapUrl) sitemaps[4]).getPriority());
}
Also used : Sitemap(org.codelibs.fess.crawler.entity.Sitemap) ByteArrayInputStream(java.io.ByteArrayInputStream) InputStream(java.io.InputStream) SitemapSet(org.codelibs.fess.crawler.entity.SitemapSet)
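Example 1 passes a gzip-compressed resource to the same parse call that Example 2 uses for plain XML, which suggests the input is sniffed before parsing. The following is a minimal sketch of one way to do that with the JDK only, by checking for the gzip magic bytes 0x1f 0x8b. The class GzipSniffSketch and its methods maybeGunzip, gzip, and readAll are hypothetical names for illustration and are not taken from fess-crawler.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical sketch, not the fess-crawler implementation.
class GzipSniffSketch {

    // Returns a stream of decompressed bytes if the input starts with the
    // gzip magic number, otherwise a stream of the original bytes.
    static InputStream maybeGunzip(final InputStream in) {
        try {
            final BufferedInputStream bis = new BufferedInputStream(in);
            bis.mark(2); // remember the start so the two peeked bytes can be re-read
            final int b1 = bis.read();
            final int b2 = bis.read();
            bis.reset();
            if (b1 == 0x1f && b2 == 0x8b) {
                return new GZIPInputStream(bis);
            }
            return bis;
        } catch (final IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Test helper: compresses a string to gzip bytes.
    static byte[] gzip(final String text) {
        final ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (final IOException e) {
            throw new UncheckedIOException(e);
        }
        return bos.toByteArray();
    }

    // Reads the whole stream as UTF-8 text and closes it.
    static String readAll(final InputStream in) {
        try (InputStream is = in) {
            return new String(is.readAllBytes(), StandardCharsets.UTF_8);
        } catch (final IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(final String[] args) {
        final String xml = "<urlset/>";
        // Both calls print the same text: one input is compressed, one is not.
        System.out.println(readAll(maybeGunzip(new ByteArrayInputStream(gzip(xml)))));
        System.out.println(readAll(maybeGunzip(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))));
    }
}
```

With a helper like this, a caller could hand the returned stream to an XML parser without knowing whether the sitemap was served compressed.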

Example 2 with SitemapSet

Use of org.codelibs.fess.crawler.entity.SitemapSet in project fess-crawler by codelibs.

In class SitemapsHelperTest, method test_parseXmlSitemaps.

public void test_parseXmlSitemaps() {
    final InputStream in = ResourceUtil.getResourceAsStream("sitemaps/sitemap1.xml");
    final SitemapSet sitemapSet = sitemapsHelper.parse(in);
    final Sitemap[] sitemaps = sitemapSet.getSitemaps();
    assertEquals(5, sitemaps.length);
    assertTrue(sitemapSet.isUrlSet());
    assertFalse(sitemapSet.isIndex());
    assertEquals("2005-01-01", sitemaps[0].getLastmod());
    assertEquals("http://www.example.com/", sitemaps[0].getLoc());
    assertEquals("monthly", ((SitemapUrl) sitemaps[0]).getChangefreq());
    assertEquals("0.8", ((SitemapUrl) sitemaps[0]).getPriority());
    assertNull(sitemaps[1].getLastmod());
    assertEquals("http://www.example.com/catalog?item=12&desc=vacation_hawaii", sitemaps[1].getLoc());
    assertEquals("weekly", ((SitemapUrl) sitemaps[1]).getChangefreq());
    assertNull(((SitemapUrl) sitemaps[1]).getPriority());
    assertEquals("2004-12-23", sitemaps[2].getLastmod());
    assertEquals("http://www.example.com/catalog?item=73&desc=vacation_new_zealand", sitemaps[2].getLoc());
    assertEquals("weekly", ((SitemapUrl) sitemaps[2]).getChangefreq());
    assertNull(((SitemapUrl) sitemaps[2]).getPriority());
    assertEquals("2004-12-23T18:00:15+00:00", sitemaps[3].getLastmod());
    assertEquals("http://www.example.com/catalog?item=74&desc=vacation_newfoundland", sitemaps[3].getLoc());
    assertNull(((SitemapUrl) sitemaps[3]).getChangefreq());
    assertEquals("0.3", ((SitemapUrl) sitemaps[3]).getPriority());
    assertEquals("2004-11-23", sitemaps[4].getLastmod());
    assertEquals("http://www.example.com/catalog?item=83&desc=vacation_usa", sitemaps[4].getLoc());
    assertNull(((SitemapUrl) sitemaps[4]).getChangefreq());
    assertNull(((SitemapUrl) sitemaps[4]).getPriority());
}
Also used : Sitemap(org.codelibs.fess.crawler.entity.Sitemap) ByteArrayInputStream(java.io.ByteArrayInputStream) InputStream(java.io.InputStream) SitemapSet(org.codelibs.fess.crawler.entity.SitemapSet)

Example 3 with SitemapSet

Use of org.codelibs.fess.crawler.entity.SitemapSet in project fess-crawler by codelibs.

In class SitemapsHelper, method parseTextSitemaps.

protected SitemapSet parseTextSitemaps(final InputStream in) {
    // A text sitemap is a plain list of URLs, one per line, so the result is always a URL set.
    final SitemapSet sitemapSet = new SitemapSet();
    sitemapSet.setType(SitemapSet.URLSET);
    try {
        final BufferedReader br = new BufferedReader(new InputStreamReader(in, Constants.UTF_8));
        String line;
        while ((line = br.readLine()) != null) {
            final String url = line.trim();
            // Skip blank lines and anything that is not an absolute http(s) URL.
            if (StringUtil.isNotBlank(url) && (url.startsWith("http://") || url.startsWith("https://"))) {
                final SitemapUrl sitemapUrl = new SitemapUrl();
                sitemapUrl.setLoc(url);
                sitemapSet.addSitemap(sitemapUrl);
            }
        }
        return sitemapSet;
    } catch (final Exception e) {
        // Wrap any read failure in the crawler's sitemap-specific exception.
        throw new SitemapsException("Could not parse Text Sitemaps.", e);
    }
}
Also used : SitemapUrl(org.codelibs.fess.crawler.entity.SitemapUrl) InputStreamReader(java.io.InputStreamReader) SitemapSet(org.codelibs.fess.crawler.entity.SitemapSet) BufferedReader(java.io.BufferedReader) SitemapsException(org.codelibs.fess.crawler.exception.SitemapsException) CrawlingAccessException(org.codelibs.fess.crawler.exception.CrawlingAccessException)
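The line filter in Example 3 can be tried without any fess-crawler classes. The sketch below is a standalone approximation using only the JDK: it returns plain strings instead of SitemapUrl objects, throws UncheckedIOException instead of SitemapsException, and closes the reader when it is done. The class TextSitemapSketch and the method parseTextSitemap are hypothetical names chosen for illustration.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not the fess-crawler implementation.
class TextSitemapSketch {

    // Returns every trimmed line that starts with http:// or https://, in input order.
    static List<String> parseTextSitemap(final InputStream in) {
        final List<String> urls = new ArrayList<>();
        try (BufferedReader br = new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = br.readLine()) != null) {
                final String url = line.trim();
                // Blank lines fail both prefix checks, so no separate blank test is needed.
                if (url.startsWith("http://") || url.startsWith("https://")) {
                    urls.add(url);
                }
            }
        } catch (final IOException e) {
            throw new UncheckedIOException("Could not parse text sitemap.", e);
        }
        return urls;
    }

    public static void main(final String[] args) {
        final String text = "http://www.example.com/\n\n  https://www.example.com/a  \nftp://ignored.example.com/\nnot a url\n";
        final InputStream in = new ByteArrayInputStream(text.getBytes(StandardCharsets.UTF_8));
        // prints [http://www.example.com/, https://www.example.com/a]
        System.out.println(parseTextSitemap(in));
    }
}
```

Unlike the original method, this sketch closes the underlying stream, which is only appropriate when the caller does not need it afterwards.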

Example 4 with SitemapSet

Use of org.codelibs.fess.crawler.entity.SitemapSet in project fess-crawler by codelibs.

In class SitemapsResponseProcessor, method process.

@Override
public void process(final ResponseData responseData) {
    // Look up the sitemaps helper from the crawler's component container.
    final SitemapsHelper sitemapsHelper = crawlerContainer.getComponent("sitemapsHelper");
    try (final InputStream responseBody = responseData.getResponseBody()) {
        final SitemapSet sitemapSet = sitemapsHelper.parse(responseBody);
        // LinkedHashSet keeps the requests in the order the sitemap lists them.
        final Set<RequestData> requestDataSet = new LinkedHashSet<>();
        for (final Sitemap sitemap : sitemapSet.getSitemaps()) {
            if (sitemap != null) {
                requestDataSet.add(RequestDataBuilder.newRequestData().get().url(sitemap.getLoc()).build());
            }
        }
        // The collected requests are handed to the caller by throwing ChildUrlsException.
        throw new ChildUrlsException(requestDataSet, this.getClass().getName() + "#process");
    } catch (final IOException e) {
        throw new IORuntimeException(e);
    }
}
Also used : LinkedHashSet(java.util.LinkedHashSet) ChildUrlsException(org.codelibs.fess.crawler.exception.ChildUrlsException) Sitemap(org.codelibs.fess.crawler.entity.Sitemap) IORuntimeException(org.codelibs.core.exception.IORuntimeException) InputStream(java.io.InputStream) RequestData(org.codelibs.fess.crawler.entity.RequestData) SitemapSet(org.codelibs.fess.crawler.entity.SitemapSet) IOException(java.io.IOException) SitemapsHelper(org.codelibs.fess.crawler.helper.SitemapsHelper)
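Example 4 ends by throwing an exception that carries the discovered URLs rather than returning them. The sketch below illustrates that pattern in isolation, with strings in place of RequestData. ChildUrlsSketch, ChildUrlsSignal, process, and collect are hypothetical names for illustration; how the real crawler catches ChildUrlsException is not shown in the source and is not implied here.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch, not the fess-crawler implementation.
class ChildUrlsSketch {

    // Stand-in for an exception that carries the URLs found while processing.
    static final class ChildUrlsSignal extends RuntimeException {
        private static final long serialVersionUID = 1L;
        private final Set<String> urls;

        ChildUrlsSignal(final Set<String> urls, final String message) {
            super(message);
            this.urls = Collections.unmodifiableSet(urls);
        }

        Set<String> getUrls() {
            return urls;
        }
    }

    // Collects non-null locations in order, dropping duplicates, then always throws.
    static void process(final List<String> locs) {
        final Set<String> urls = new LinkedHashSet<>();
        for (final String loc : locs) {
            if (loc != null) {
                urls.add(loc);
            }
        }
        throw new ChildUrlsSignal(urls, ChildUrlsSketch.class.getName() + "#process");
    }

    // A caller recovers the URLs by catching the signal.
    static Set<String> collect(final List<String> locs) {
        try {
            process(locs);
            return Collections.emptySet(); // not reached, because process always throws
        } catch (final ChildUrlsSignal signal) {
            return signal.getUrls();
        }
    }

    public static void main(final String[] args) {
        // prints [http://www.example.com/a, http://www.example.com/b]
        System.out.println(collect(Arrays.asList(
                "http://www.example.com/a", null, "http://www.example.com/b", "http://www.example.com/a")));
    }
}
```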

Example 5 with SitemapSet

Use of org.codelibs.fess.crawler.entity.SitemapSet in project fess-crawler by codelibs.

In class SitemapsHelperTest, method test_parseXmlSitemapsIndex.

public void test_parseXmlSitemapsIndex() {
    final InputStream in = ResourceUtil.getResourceAsStream("sitemaps/sitemap2.xml");
    final SitemapSet sitemapSet = sitemapsHelper.parse(in);
    final Sitemap[] sitemaps = sitemapSet.getSitemaps();
    assertEquals(2, sitemaps.length);
    assertFalse(sitemapSet.isUrlSet());
    assertTrue(sitemapSet.isIndex());
    assertEquals("2004-10-01T18:23:17+00:00", sitemaps[0].getLastmod());
    assertEquals("http://www.example.com/sitemap1.xml.gz", sitemaps[0].getLoc());
    assertEquals("2005-01-01", sitemaps[1].getLastmod());
    assertEquals("http://www.example.com/sitemap2.xml.gz", sitemaps[1].getLoc());
}
Also used : Sitemap(org.codelibs.fess.crawler.entity.Sitemap) ByteArrayInputStream(java.io.ByteArrayInputStream) InputStream(java.io.InputStream) SitemapSet(org.codelibs.fess.crawler.entity.SitemapSet)
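The tests above distinguish a URL set (isUrlSet) from a sitemap index (isIndex). In the sitemaps.org XML format that difference shows up in the root element, urlset versus sitemapindex. The following is a minimal sketch of classifying a document by its root element with the JDK's StAX reader. SitemapRootSketch, Kind, and detect are hypothetical names, and this is not a description of how SitemapsHelper decides internally.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical sketch, not the fess-crawler implementation.
class SitemapRootSketch {

    enum Kind { URLSET, INDEX, UNKNOWN }

    // Classifies the document by the local name of its first (root) element.
    static Kind detect(final InputStream in) {
        final XMLInputFactory factory = XMLInputFactory.newInstance();
        // Defensive settings: do not process DTDs or external entities.
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, Boolean.FALSE);
        factory.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, Boolean.FALSE);
        try {
            final XMLStreamReader reader = factory.createXMLStreamReader(in);
            try {
                while (reader.hasNext()) {
                    if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                        final String name = reader.getLocalName();
                        if ("urlset".equals(name)) {
                            return Kind.URLSET;
                        }
                        if ("sitemapindex".equals(name)) {
                            return Kind.INDEX;
                        }
                        return Kind.UNKNOWN;
                    }
                }
                return Kind.UNKNOWN;
            } finally {
                reader.close();
            }
        } catch (final XMLStreamException e) {
            throw new IllegalStateException("Could not read sitemap XML.", e);
        }
    }

    public static void main(final String[] args) {
        final String index = "<sitemapindex xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">"
                + "<sitemap><loc>http://www.example.com/sitemap1.xml.gz</loc></sitemap></sitemapindex>";
        // prints INDEX
        System.out.println(detect(new ByteArrayInputStream(index.getBytes(StandardCharsets.UTF_8))));
    }
}
```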

Aggregations

SitemapSet (org.codelibs.fess.crawler.entity.SitemapSet): 7
InputStream (java.io.InputStream): 6
Sitemap (org.codelibs.fess.crawler.entity.Sitemap): 6
ByteArrayInputStream (java.io.ByteArrayInputStream): 5
BufferedReader (java.io.BufferedReader): 1
IOException (java.io.IOException): 1
InputStreamReader (java.io.InputStreamReader): 1
LinkedHashSet (java.util.LinkedHashSet): 1
IORuntimeException (org.codelibs.core.exception.IORuntimeException): 1
RequestData (org.codelibs.fess.crawler.entity.RequestData): 1
SitemapUrl (org.codelibs.fess.crawler.entity.SitemapUrl): 1
ChildUrlsException (org.codelibs.fess.crawler.exception.ChildUrlsException): 1
CrawlingAccessException (org.codelibs.fess.crawler.exception.CrawlingAccessException): 1
SitemapsException (org.codelibs.fess.crawler.exception.SitemapsException): 1
SitemapsHelper (org.codelibs.fess.crawler.helper.SitemapsHelper): 1