Example 86 with HttpRequest

Use of com.github.kevinsawicki.http.HttpRequest in project springboot by LiJinHongPassion.

In class CrawlerImageUtil, method getAllImgUrl.

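/**
 * Description: crawler; collects all image links found in a page.
 *
 * @param url     a page URL (not an image URL), e.g. http://www.baidu.com/artical=424
 * @param regx    regular expression that matches an image link
 * @param headers request headers to send with the GET request
 * @return java.util.Set<java.lang.String> of matched links, or null if the request fails
 * @author LJH-1755497577 2019/11/8 15:57
 */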
public static Set<String> getAllImgUrl(String url, String regx, Map<String, String> headers) {
    Optional<IPEntity> randomIPEntity = getRandomIPEntity();
    HttpRequest httpRequest = HttpRequest.get(url);
    // httpRequest.useProxy(randomIPEntity.get().getIp(), randomIPEntity.get().getPort());
    httpRequest.headers(headers);
    // httpRequest.trustAllCerts().trustAllHosts().ok();
    String body = "";
    try {
        body = httpRequest.body();
    } catch (HttpRequest.HttpRequestException e) {
        System.out.println("Failed to fetch image links   =====>  " + url);
        return null;
    }
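    // Strip spaces, CRLF line breaks, tabs, and backslashes before matching
    body = body.replaceAll(" ", "").replaceAll("\r\n", "").replaceAll("\t", "").replaceAll("\\\\", "");
    // Create the Pattern object
    Pattern r = Pattern.compile(regx);
    // Now create the Matcher object
    Matcher m = r.matcher(body);
    // Create a set to store the results
    Set<String> re = new HashSet<>();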
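    while (m.find()) {
        // Rewrite only the first scheme (or protocol-relative "//") to https.
        // replaceAll would also rewrite any later "//" inside the path.
        re.add(m.group().replaceFirst("((http|https|HTTP|HTTPS):)*//", "https://"));
    }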
    return re;
}
Also used: HttpRequest (com.github.kevinsawicki.http.HttpRequest), IPEntity (com.example.li.springboot_crawler_demo.utils.img.entity.IPEntity), Pattern (java.util.regex.Pattern), Matcher (java.util.regex.Matcher)
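
The matching and normalization steps of getAllImgUrl can be tried without any network call. The following is a minimal sketch, assuming the page body is already available as a string. The class name ImgUrlExtractor and the method name extract are hypothetical and not part of the project. The sketch uses replaceFirst so that only the first scheme or protocol-relative "//" is rewritten, which is a choice made here rather than something the project states.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not from the original project): the regex steps of
// getAllImgUrl applied to a body string instead of a live HTTP response.
final class ImgUrlExtractor {

    private ImgUrlExtractor() {
    }

    static Set<String> extract(String body, String regx) {
        // Same cleanup as the example above: drop spaces, CRLF line breaks,
        // tabs, and backslashes. String.replace works on literal text.
        String cleaned = body.replace(" ", "")
                .replace("\r\n", "")
                .replace("\t", "")
                .replace("\\", "");

        Matcher m = Pattern.compile(regx).matcher(cleaned);

        // LinkedHashSet removes duplicates and keeps first-seen order.
        Set<String> urls = new LinkedHashSet<>();
        while (m.find()) {
            // Force https on the first scheme or protocol-relative "//" only,
            // leaving any later "//" in the path untouched.
            urls.add(m.group().replaceFirst("((http|https|HTTP|HTTPS):)*//", "https://"));
        }
        return urls;
    }
}
```

For instance, ImgUrlExtractor.extract("<img src=\"//b.com/2.png\">", "//[^\"]+\\.png") returns a set containing https://b.com/2.png. Because spaces are removed before matching, a caller-supplied pattern should not rely on whitespace between attributes.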

Aggregations

HttpRequest (com.github.kevinsawicki.http.HttpRequest): 86
HttpRequestException (com.github.kevinsawicki.http.HttpRequest.HttpRequestException): 29
IOException (java.io.IOException): 25
JSONObject (org.json.JSONObject): 19
UnsupportedEncodingException (java.io.UnsupportedEncodingException): 13
File (java.io.File): 8
TimerTask (java.util.TimerTask): 8
AtomicLong (java.util.concurrent.atomic.AtomicLong): 8
Pair (android.util.Pair): 6
URL (java.net.URL): 5
JSONException (org.json.JSONException): 5
NameNotFoundException (android.content.pm.PackageManager.NameNotFoundException): 4
JSONObject (com.alibaba.fastjson.JSONObject): 4
ZipFile (net.lingala.zip4j.ZipFile): 4
SSLHandshakeException (javax.net.ssl.SSLHandshakeException): 3
IPEntity (com.example.li.springboot_crawler_demo.utils.img.entity.IPEntity): 2
HttpURLConnection (java.net.HttpURLConnection): 2
MalformedURLException (java.net.MalformedURLException): 2
Matcher (java.util.regex.Matcher): 2
JSONArray (org.json.JSONArray): 2