wangshaofei

(Repost) A roundup of ways to fetch web page content with PHP


①. Fetching web page content with PHP
http://hi.baidu.com/quqiufeng/blog/item/7e86fb3f40b598c67d1e7150.html
header("Content-type: text/html; charset=utf-8");
1. Using the MSXML2.XMLHTTP COM object (Windows only)
$xhr = new COM("MSXML2.XMLHTTP");
$xhr->open("GET","http://localhost/xxx.php?id=2",false);
$xhr->send();
echo $xhr->responseText;

2. Using file_get_contents()
<?php
$url="http://www.blogjava.net/pts";
echo file_get_contents( $url );
?>
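
The one-liner above gives no control over timeouts or failures. The following is a minimal sketch of how a stream context could be passed to `file_get_contents()`. The helper name `fetch_page()`, the 5-second timeout and the user agent string are assumptions for illustration, not part of the original post. The `data://` URL in the usage line is only there so the example runs without network access.

```php
<?php
// Hypothetical helper (not from the original post): fetch a URL with a
// timeout and return null instead of emitting a warning on failure.
function fetch_page(string $url, int $timeout = 5): ?string
{
    // The 'http' options are used by the http:// and https:// wrappers;
    // other wrappers ignore them.
    $context = stream_context_create([
        'http' => [
            'method'     => 'GET',
            'timeout'    => $timeout,          // seconds
            'user_agent' => 'example-fetcher', // assumed value
        ],
    ]);
    $body = @file_get_contents($url, false, $context);
    return $body === false ? null : $body;
}

// Usage: a data:// URL stands in for a real page here.
$page = fetch_page('data://text/plain,hello');
echo $page === null ? "fetch failed\n" : $page . "\n";
```

Returning `null` rather than `false` lets the caller tell a failed fetch apart from an empty page with a simple `=== null` check.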

3. Using fopen()
<?php
if ($stream = fopen('http://www.sohu.com', 'r')) {
    // print all the page starting at the offset 10
    echo stream_get_contents($stream, -1, 10);
    fclose($stream);
}

if ($stream = fopen('http://www.sohu.net', 'r')) {
    // print the first 5 bytes
    echo stream_get_contents($stream, 5);
    fclose($stream);
}
?>
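
The two calls above differ only in the length and offset arguments of `stream_get_contents()`. As a small sketch of what those arguments do, the example below uses an in-memory stream with known content instead of a remote page (an assumption made so it runs offline). HTTP streams are generally not seekable, so the offset form may not behave the same way on a real URL.

```php
<?php
// An in-memory stream holding 16 known bytes stands in for a page body.
$stream = fopen('php://memory', 'w+');
fwrite($stream, '0123456789abcdef');
rewind($stream);

// Second argument: maximum number of bytes to read from the current position.
$first_five = stream_get_contents($stream, 5);

// Third argument: seek to this offset before reading; a length of -1
// means "read everything that remains".
$from_ten = stream_get_contents($stream, -1, 10);
fclose($stream);

echo $first_five, "\n"; // prints 01234
echo $from_ten, "\n";   // prints abcdef
```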

②. Fetching web page content with PHP
http://www.blogjava.net/pts/archive/2007/08/26/99188.html
This source shows the same two approaches as items 2 and 3 above: file_get_contents() for the simple case, or fopen() combined with stream_get_contents().

③. Fetching a site's content with PHP and saving it as a TXT file (source code)
http://blog.chinaunix.net/u1/44325/showart_348444.html
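<?php
$my_book_url = 'http://book.yunxiaoge.com/files/article/html/4/4550/index.html';
// Extract the directory part of the index URL; chapter links are relative to it.
// ereg() was removed in PHP 7, so preg_match() is used instead.
if (!preg_match('#http://book\.yunxiaoge\.com/files/article/html/[0-9]+/[0-9]+/#', $my_book_url, $myBook)) {
    die("Unexpected index URL: $my_book_url");
}
$my_book_txt = $myBook[0];
$file_handle = fopen($my_book_url, "r") or die("Cannot open $my_book_url"); // open the index page
if (file_exists("test.txt")) {
    unlink("test.txt"); // start from an empty output file
}
$handle = fopen("test.txt", 'a'); // open the output file once, not once per line
while (!feof($file_handle)) { // loop until the end of the index page
    $line = fgets($file_handle); // read one line
    // look for a link to a chapter page, e.g. href="123.html
    if (!preg_match('/href="[0-9]+\.html/', (string)$line, $reg)) {
        continue;
    }
    $my_book_txt_url = str_replace('href="', '', $reg[0]); // keep only the file name
    $my_book_txt_over_url = $my_book_txt . $my_book_txt_url; // build the chapter URL
    echo "$my_book_txt_over_url</p>"; // show progress
    $file_handle_txt = fopen($my_book_txt_over_url, "r"); // open the chapter page
    if (!$file_handle_txt) {
        continue;
    }
    while (!feof($file_handle_txt)) {
        $line_txt = fgets($file_handle_txt);
        // on this site, lines of body text start with &nbsp
        if (!preg_match('/^&nbsp.+/', (string)$line_txt, $reg)) {
            continue;
        }
        $my_over_txt = $reg[0];
        $my_over_txt = str_replace("&nbsp;&nbsp;&nbsp;&nbsp;", "    ", $my_over_txt); // filter out entities and tags
        $my_over_txt = str_replace("<br />", "", $my_over_txt);
        $my_over_txt = str_replace("<script language=\"javascript\">", "", $my_over_txt);
        $my_over_txt = str_replace("&quot;", "", $my_over_txt);
        fwrite($handle, "$my_over_txt\n"); // append to the output file
    }
    fclose($file_handle_txt); // close each chapter page once it has been read
}
fclose($handle);
fclose($file_handle); // close the index page
echo "Done</p>";
?>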
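
The chain of str_replace() calls above only handles the exact entities and tags it lists. As an alternative sketch, not the original author's method, the built-in `strip_tags()` and `html_entity_decode()` can do the cleanup more generally. The helper name `clean_line()` is hypothetical, and unlike the script above it keeps quotation marks (decoded from `&quot;`) instead of deleting them.

```php
<?php
// Hypothetical helper: turn one line of scraped HTML into plain text.
function clean_line(string $html): string
{
    // Remove tags such as <br />; text between tags is kept.
    $text = strip_tags($html);
    // Decode entities: &quot; becomes ", &nbsp; becomes U+00A0.
    $text = html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');
    // Replace non-breaking spaces with ordinary spaces.
    $text = str_replace("\u{00A0}", ' ', $text);
    return rtrim($text);
}

echo clean_line('&nbsp;&nbsp;&nbsp;&nbsp;Hello<br />&quot;world&quot;'), "\n";
```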


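Below is a more heavy-duty approach.
It uses a class called Snoopy.
I first came across it here:
Snoopy for fetching web page content in PHP
http://blog.declab.com/read.php/27.htm
Then there is the Snoopy project page:
http://sourceforge.net/projects/snoopy/
Some brief notes on it:
Code collection: the Snoopy class and its basic usage
http://blog.passport86.com/?p=161
Download: http://sourceforge.net/projects/snoopy/


I only discovered this handy tool today and downloaded it right away to take a look. It is built on parse_url;
I am still more used to curl myself.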

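Snoopy is a PHP class that imitates a web browser; it can fetch web page content and submit forms.
Some of its features:
1. Easily fetch the contents of a web page
2. Easily fetch the text of a web page (HTML tags stripped)
3. Easily fetch the links in a web page
4. Supports proxy hosts
5. Supports basic user/password authentication
6. Supports setting a custom user agent, referer, cookies and header content
7. Supports browser redirects, with control over the redirect depth
8. Expands fetched links into fully qualified URLs (default)
9. Easily submit data and retrieve the results
10. Supports following HTML frames (added in v0.92)
11. Supports passing cookies on redirects

See the documentation included in the download for detailed usage.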

<?php
include "Snoopy.class.php";
$snoopy = new Snoopy;
$snoopy->fetchform("http://www.phpx.com/happy/logging.php?action=login");
print $snoopy->results;
?>
<?php
include "Snoopy.class.php";
$snoopy = new Snoopy;
$submit_url = "http://www.phpx.com/happy/logging.php?action=login";
$submit_vars["loginmode"] = "normal";
$submit_vars["styleid"] = 1;
$submit_vars["cookietime"] = 315360000;
$submit_vars["loginfield"] = "username";
$submit_vars["username"] = "********"; //your username
$submit_vars["password"] = "*******"; //your password
$submit_vars["questionid"] = 0;
$submit_vars["answer"] = "";
$submit_vars["loginsubmit"] = "提 &nbsp; 交"; //the submit button's value as it appears in the login form
$snoopy->submit($submit_url, $submit_vars);
print $snoopy->results;
?>


Below is the Snoopy README
NAME:

    Snoopy - the PHP net client v1.2.4
   
SYNOPSIS:

    include "Snoopy.class.php";
    $snoopy = new Snoopy;
   
    $snoopy->fetchtext("http://www.php.net/");
    print $snoopy->results;
   
    $snoopy->fetchlinks("http://www.phpbuilder.com/");
    print $snoopy->results;
   
    $submit_url = "http://lnk.ispi.net/texis/scripts/msearch/netsearch.html";
   
    $submit_vars["q"] = "amiga";
    $submit_vars["submit"] = "Search!";
    $submit_vars["searchhost"] = "Altavista";
       
    $snoopy->submit($submit_url,$submit_vars);
    print $snoopy->results;
   
    $snoopy->maxframes=5;
    $snoopy->fetch("http://www.ispi.net/");
    echo "<PRE>\n";
    echo htmlentities($snoopy->results[0]);
    echo htmlentities($snoopy->results[1]);
    echo htmlentities($snoopy->results[2]);
    echo "</PRE>\n";

    $snoopy->fetchform("http://www.altavista.com");
    print $snoopy->results;

DESCRIPTION:

    What is Snoopy?
   
    Snoopy is a PHP class that simulates a web browser. It automates the
    task of retrieving web page content and posting forms, for example.

    Some of Snoopy's features:
   
    * easily fetch the contents of a web page
    * easily fetch the text from a web page (strip html tags)
    * easily fetch the links from a web page
    * supports proxy hosts
    * supports basic user/pass authentication
    * supports setting user_agent, referer, cookies and header content
    * supports browser redirects, and controlled depth of redirects
    * expands fetched links to fully qualified URLs (default)
    * easily submit form data and retrieve the results
    * supports following html frames (added v0.92)
    * supports passing cookies on redirects (added v0.92)
   
   
REQUIREMENTS:

    Snoopy requires PHP with PCRE (Perl Compatible Regular Expressions),
    which should be PHP 3.0.9 and up. For read timeout support, it requires
    PHP 4 Beta 4 or later. Snoopy was developed and tested with PHP 3.0.12.

CLASS METHODS:

    fetch($URI)
    -----------
   
    This is the method used for fetching the contents of a web page.
    $URI is the fully qualified URL of the page to fetch.
    The results of the fetch are stored in $this->results.
    If you are fetching frames, then $this->results
    contains each frame fetched in an array.
       
    fetchtext($URI)
    ---------------   
   
    This behaves exactly like fetch() except that it only returns
    the text from the page, stripping out html tags and other
    irrelevant data.       

    fetchform($URI)
    ---------------   
   
    This behaves exactly like fetch() except that it only returns
    the form elements from the page, stripping out html tags and other
    irrelevant data.       

    fetchlinks($URI)
    ----------------

    This behaves exactly like fetch() except that it only returns
    the links from the page. By default, relative links are
    converted to their fully qualified URL form.

    submit($URI,$formvars)
    ----------------------
   
    This submits a form to the specified $URI. $formvars is an
    array of the form variables to pass.
       
       
    submittext($URI,$formvars)
    --------------------------

    This behaves exactly like submit() except that it only returns
    the text from the page, stripping out html tags and other
    irrelevant data.       

    submitlinks($URI)
    ----------------

    This behaves exactly like submit() except that it only returns
    the links from the page. By default, relative links are
    converted to their fully qualified URL form.


CLASS VARIABLES:    (default value in parenthesis)

    $host            the host to connect to
    $port            the port to connect to
    $proxy_host        the proxy host to use, if any
    $proxy_port        the proxy port to use, if any
    $agent            the user agent to masquerade as (Snoopy v0.1)
    $referer        referer information to pass, if any
    $cookies        cookies to pass if any
    $rawheaders        other header info to pass, if any
    $maxredirs        maximum redirects to allow. 0=none allowed. (5)
    $offsiteok        whether or not to allow redirects off-site. (true)
    $expandlinks    whether or not to expand links to fully qualified URLs (true)
    $user            authentication username, if any
    $pass            authentication password, if any
    $accept            http accept types (image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, */*)
    $error            where errors are sent, if any
    $response_code    response code returned from server
    $headers        headers returned from server
    $maxlength        max return data length
    $read_timeout    timeout on read operations (requires PHP 4 Beta 4+)
                    set to 0 to disallow timeouts
    $timed_out        true if a read operation timed out (requires PHP 4 Beta 4+)
    $maxframes        number of frames we will follow
    $status            http status of fetch
    $temp_dir        temp directory that the webserver can write to. (/tmp)
    $curl_path        system path to cURL binary, set to false if none
   

EXAMPLES:

    Example:     fetch a web page and display the return headers and
                the contents of the page (html-escaped):
   
    include "Snoopy.class.php";
    $snoopy = new Snoopy;
   
    $snoopy->user = "joe";
    $snoopy->pass = "bloe";
   
    if($snoopy->fetch("http://www.slashdot.org/"))
    {
        echo "response code: ".$snoopy->response_code."<br>\n";
        while(list($key,$val) = each($snoopy->headers))
            echo $key.": ".$val."<br>\n";
        echo "<p>\n";
       
        echo "<PRE>".htmlspecialchars($snoopy->results)."</PRE>\n";
    }
    else
        echo "error fetching document: ".$snoopy->error."\n";



    Example:    submit a form and print out the result headers
                and html-escaped page:

    include "Snoopy.class.php";
    $snoopy = new Snoopy;
   
    $submit_url = "http://lnk.ispi.net/texis/scripts/msearch/netsearch.html";
   
    $submit_vars["q"] = "amiga";
    $submit_vars["submit"] = "Search!";
    $submit_vars["searchhost"] = "Altavista";

       
    if($snoopy->submit($submit_url,$submit_vars))
    {
        while(list($key,$val) = each($snoopy->headers))
            echo $key.": ".$val."<br>\n";
        echo "<p>\n";
       
        echo "<PRE>".htmlspecialchars($snoopy->results)."</PRE>\n";
    }
    else
        echo "error fetching document: ".$snoopy->error."\n";



    Example:    showing functionality of all the variables:
   

    include "Snoopy.class.php";
    $snoopy = new Snoopy;

    $snoopy->proxy_host = "my.proxy.host";
    $snoopy->proxy_port = "8080";
   
    $snoopy->agent = "(compatible; MSIE 4.01; MSN 2.5; AOL 4.0; Windows 98)";
    $snoopy->referer = "http://www.microsnot.com/";
   
    $snoopy->cookies["SessionID"] = "238472834723489l";
    $snoopy->cookies["favoriteColor"] = "RED";
   
    $snoopy->rawheaders["Pragma"] = "no-cache";
   
    $snoopy->maxredirs = 2;
    $snoopy->offsiteok = false;
    $snoopy->expandlinks = false;
   
    $snoopy->user = "joe";
    $snoopy->pass = "bloe";
   
    if($snoopy->fetchtext("http://www.phpbuilder.com"))
    {
        while(list($key,$val) = each($snoopy->headers))
            echo $key.": ".$val."<br>\n";
        echo "<p>\n";
       
        echo "<PRE>".htmlspecialchars($snoopy->results)."</PRE>\n";
    }
    else
        echo "error fetching document: ".$snoopy->error."\n";


    Example:     fetch framed content and display the results
   
    include "Snoopy.class.php";
    $snoopy = new Snoopy;
   
    $snoopy->maxframes = 5;
   
    if($snoopy->fetch("http://www.ispi.net/"))
    {
        echo "<PRE>".htmlspecialchars($snoopy->results[0])."</PRE>\n";
        echo "<PRE>".htmlspecialchars($snoopy->results[1])."</PRE>\n";
        echo "<PRE>".htmlspecialchars($snoopy->results[2])."</PRE>\n";
    }
    else
        echo "error fetching document: ".$snoopy->error."\n";

 

 

<?php

//Collect all content URLs and save them to a file
function get_index($save_file, $prefix="index_"){
    $count = 68;
    $i = 1;
    if (file_exists($save_file)) @unlink($save_file);
    $fp = fopen($save_file, "a+") or die("Open ". $save_file ." failed");
    while($i<$count){
        $url = $prefix . $i .".htm";
        echo "Get ". $url ."...";
        //get_content_url() takes the base URL first; links are assumed to be relative to the directory of $url
        $url_str = get_content_url(dirname($url) ."/", get_url($url));
        echo " OK\n";
        fwrite($fp, $url_str);
        ++$i;
    }
    fclose($fp);
}

//Fetch the target multimedia objects
function get_object($url_file, $save_file, $split="|--:**:--|"){
    if (!file_exists($url_file)) die($url_file ." not exist");
    $file_arr = file($url_file);
    if (!is_array($file_arr) || empty($file_arr)) die($url_file ." not content");
    //file() keeps the trailing newline on each line, so trim before de-duplicating
    $url_arr = array_unique(array_map('trim', $file_arr));
    if (file_exists($save_file)) @unlink($save_file);
    $fp = fopen($save_file, "a+") or die("Open save file ". $save_file ." failed");
    foreach($url_arr as $url){
        if (empty($url)) continue;
        echo "Get ". $url ."...";
        $html_str = get_url($url);
        //leftover debugging output (echo of the page and URL, then exit) removed so the loop can run
        $obj_str = get_content_object($html_str, $split);
        echo " OK\n";
        fwrite($fp, $obj_str);
    }
    fclose($fp);
}

//Walk a directory and extract content from each file
function get_dir($save_file, $dir){
    $dp = opendir($dir) or die("Open dir ". $dir ." failed");
    if (file_exists($save_file)) @unlink($save_file);
    $fp = fopen($save_file, "a+") or die("Open save file ". $save_file ." failed");
    //strict comparison, so a file named "0" does not end the loop early
    while(($file = readdir($dp)) !== false){
        if ($file!="." && $file!=".."){
            echo "Read file ". $file ."...";
            //$dir is expected to end with a directory separator
            $file_content = file_get_contents($dir . $file);
            $obj_str = get_content_object($file_content);
            echo " OK\n";
            fwrite($fp, $obj_str);
        }
    }
    closedir($dp);
    fclose($fp);
}


//Fetch the content of the given URL
function get_url($url){
    $reg = '/^http:\/\/[^\/].+$/';
    if (!preg_match($reg, $url)) die($url ." invalid");
    $fp = fopen($url, "r") or die("Open url: ". $url ." failed.");
    $content = "";
    while(!feof($fp)){
        $content .= fread($fp, 8192);
    }
    fclose($fp);
    if (empty($content)){
        die("Get url: ". $url ." content failed.");
    }
    return $content;
}

//Fetch the given page over a raw socket
function get_content_by_socket($url, $host){
    $fp = fsockopen($host, 80) or die("Open ". $url ." failed");
    $header = "GET /".$url ." HTTP/1.1\r\n";
    $header .= "Accept: */*\r\n";
    $header .= "Accept-Language: zh-cn\r\n";
    //Accept-Encoding is not sent: this function returns the bytes as received and does not decompress gzip/deflate
    $header .= "User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; Maxthon; InfoPath.1; .NET CLR 2.0.50727)\r\n";
    $header .= "Host: ". $host ."\r\n";
    //$header .= "Cookie: cnzz02=2; rtime=1; ltime=1148456424859; cnzz_eid=56601755-\r\n\r\n";
    //only one Connection header is sent: Close lets the read loop below end when the server hangs up
    $header .= "Connection: Close\r\n\r\n";

    fwrite($fp, $header);
    $contents = "";
    while (!feof($fp)) {
        $contents .= fgets($fp, 8192);
    }
    fclose($fp);
    //the result is the raw response, status line and headers included
    return $contents;
}


//Extract URLs from the given content
function get_content_url($host_url, $file_contents){

    //$reg = '/^(#|javascript.*?|ftp:\/\/.+|http:\/\/.+|.*?href.*?|play.*?|index.*?|.*?asp)+$/i';
    //$reg = '/^(down.*?\.html|\d+_\d+\.htm.*?)$/i';
    $rex = "/([hH][rR][eE][Ff])\s*=\s*['\"]*([^>'\"\s]+)[\"'>]*\s*/i";
    $reg = '/^(down.*?\.html)$/i';
    preg_match_all ($rex, $file_contents, $r);
    $result = ""; //array();
    foreach($r as $c){
        if (is_array($c)){
            foreach($c as $d){
                if (preg_match($reg, $d)){ $result .= $host_url . $d."\n"; }
            }
        }
    }
    return $result;
}

//Extract multimedia files from the given content
function get_content_object($str, $split="|--:**:--|"){
    $regx = "/href\s*=\s*['\"]*([^>'\"\s]+)[\"'>]*\s*(<b>.*?<\/b>)/i";
    preg_match_all($regx, $str, $result);

    //return an empty string when nothing matched, so callers can always fwrite() the result
    if (empty($result[1])) return "";
    //"多媒体" ("multimedia") is the label text used on the target page
    $result[2] = str_replace("<b>多媒体: ", "", $result[2]);
    $result[2] = str_replace("</b>", "", $result[2]);
    return $result[1][0] . $split .$result[2][0] . "\n";
}

?> 
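
The socket-based function in the listing above returns the raw HTTP response, with the status line and headers still attached to the page body. A minimal sketch of separating them follows. The helper name `split_http_response()` and the sample response are assumptions for illustration, and chunked transfer encoding is not handled.

```php
<?php
// Hypothetical helper: split a raw HTTP response into status line,
// headers (names lower-cased) and body.
function split_http_response(string $raw): array
{
    // The header section ends at the first blank line (CRLF CRLF).
    $parts = explode("\r\n\r\n", $raw, 2);
    $head  = $parts[0];
    $body  = $parts[1] ?? '';

    $lines  = explode("\r\n", $head);
    $status = array_shift($lines);

    $headers = [];
    foreach ($lines as $line) {
        if (strpos($line, ':') === false) {
            continue; // skip malformed header lines
        }
        [$name, $value] = explode(':', $line, 2);
        $headers[strtolower(trim($name))] = trim($value);
    }
    return ['status' => $status, 'headers' => $headers, 'body' => $body];
}

// Usage with a made-up response string.
$raw = "HTTP/1.1 200 OK\r\nContent-Type: text/html\r\nConnection: close\r\n\r\n<html>hi</html>";
$res = split_http_response($raw);
echo $res['status'], "\n";
echo $res['headers']['content-type'], "\n";
echo $res['body'], "\n";
```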





Fetching a specific div block and images from a web page with PHP

(2009-06-05 09:56:23)
Tags: php, scraping, images, it
Category: PHP

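1. Get all images in a given page:
<?php
//Fetch the content at the given address and store it in $text
$text=file_get_contents('http://andy.diimii.com/');

//Collect every img tag into the array $match (preg_match_all is needed, since preg_match stops at the first hit)
preg_match_all('/<img[^>]*>/i', $text, $match);

//Print $match
print_r($match);
?>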

-----------------
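2. Get the first image in a given page:
<?php
//Fetch the content at the given address and store it in $text
$text=file_get_contents('http://andy.diimii.com/');

//Find the first img tag and store it in the array $match (the regex has the same meaning as the one above)
preg_match('/<img[^>]*>/Ui', $text, $match);

//Print $match
print_r($match);
?>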
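
The patterns above capture whole img tags. When only the image URLs are wanted, a capture group on the src attribute can be used instead. This is a sketch with a hypothetical helper name `extract_image_sources()` and made-up sample markup; it assumes the src value is quoted.

```php
<?php
// Hypothetical helper: return the src value of every <img> tag.
function extract_image_sources(string $html): array
{
    // Group 1 matches the opening quote, group 2 the URL up to the same quote.
    preg_match_all('/<img\b[^>]*\bsrc\s*=\s*(["\'])(.*?)\1/i', $html, $m);
    return $m[2];
}

$sample = '<p><img src="a.png" alt="A"><IMG alt=\'B\' SRC=\'img/b.jpg\'></p>';
print_r(extract_image_sources($sample));
```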

------------------------------------


3. Get a specific div block in a given page (identified by id):
<?php
//Fetch the content at the given address and store it in $text
$text=file_get_contents('http://andy.diimii.com/2009/01/seo%e5%8c%96%e7%9a%84%e9%97%9c%e9%8d%b5%e5%ad%97%e5%bb%a3%e5%91%8a%e9%80%a3%e7%b5%90/');

//Strip line breaks and whitespace (only needed when the content is to be serialized)
//$text=str_replace(array("\r","\n","\t","\s"), '', $text);   

//Extract the div whose id is PostContent and store it in the array $match
preg_match('/<div[^>]*id="PostContent"[^>]*>(.*?) <\/div>/si',$text,$match);

//Print $match[0]
print($match[0]);
?>
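
A non-greedy regex like the one above stops at the first closing div tag, so it cuts the block short when the target div contains nested divs. As an alternative sketch, not the original author's method, PHP's DOM extension can locate the element by id. The helper name `extract_div_by_id()` and the sample markup are assumptions, and the id is assumed to contain no quote characters.

```php
<?php
// Hypothetical helper: return the inner HTML of the div with the given id,
// or null when no such div exists.
function extract_div_by_id(string $html, string $id): ?string
{
    $dom = new DOMDocument();
    @$dom->loadHTML($html); // silence warnings from imperfect markup
    $xpath = new DOMXPath($dom);
    // Assumes $id contains no double quotes.
    $nodes = $xpath->query('//div[@id="' . $id . '"]');
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }
    $inner = '';
    foreach ($nodes->item(0)->childNodes as $child) {
        $inner .= $dom->saveHTML($child); // serialize each child node
    }
    return $inner;
}

$sample = '<div id="Header">x</div>'
        . '<div id="PostContent"><div class="inner">nested</div><p>text</p></div>';
echo extract_div_by_id($sample, 'PostContent'), "\n";
```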

-------------------------------------------
4. Combining 2 and 3 above:
<?php
//Fetch the content at the given address and store it in $text
$text=file_get_contents('http://andy.diimii.com/2009/01/seo%e5%8c%96%e7%9a%84%e9%97%9c%e9%8d%b5%e5%ad%97%e5%bb%a3%e5%91%8a%e9%80%a3%e7%b5%90/');    

//Extract the div whose id is PostContent and store it in the array $match
preg_match('/<div[^>]*id="PostContent"[^>]*>(.*?) <\/div>/si',$text,$match);   

//Find the first img tag and store it in the array $match2
preg_



  


  