
How to Extract URLs from a Page with Python

admin

2024-12-18

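What is a URL?

A URL is a string that describes an information resource on the Internet. It is used mainly by WWW client and server programs. A URL describes many kinds of resources in one uniform format, including files, server addresses, and directories. Its general format is as follows (items in square brackets [] are optional):

protocol://hostname[:port]/path/[;parameters][?query][#fragment]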

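A URL consists of three parts:

① The first part is the protocol (also called the service type).

② The second part is the host name or IP address of the machine that holds the resource, sometimes followed by a port number.

③ The third part is the specific location of the resource on the host, such as a directory and file name.

The first and second parts are separated by "://".

The second and third parts are separated by "/".

The first and second parts are required; the third part can sometimes be omitted.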
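The three parts described above can be inspected with Python's standard library. The following is a minimal sketch using `urllib.parse.urlsplit`; the example URL is hypothetical and was chosen only so that every component is present.

```python
from urllib.parse import urlsplit

# Hypothetical example URL, chosen only to exercise every component.
url = "http://www.example.com:8080/docs/index.html?lang=en#intro"

parts = urlsplit(url)

print(parts.scheme)    # protocol (first part)
print(parts.hostname)  # host name (second part)
print(parts.port)      # optional port as an int, or None if absent
print(parts.path)      # resource path (third part)
print(parts.query)     # optional query string
print(parts.fragment)  # optional fragment
```

`urlsplit` does not separate the optional `;parameters` component; `urllib.parse.urlparse` exposes it as `params` if that is needed.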

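Extracting URLs from a page with Python

The script below downloads a page, collects every link whose address contains the site's URL, then requests each link and prints its HTTP status code and response time.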

from bs4 import BeautifulSoup
import time
import urllib.request

# URLs already recorded for the whole site (shared across calls)
websiteurls = {}

def scanpage(url):
    """Collect same-site links on a page and report each link's HTTP status."""
    websiteurl = url
    t = time.time()
    n = 0
    html = urllib.request.urlopen(websiteurl).read()
    soup = BeautifulSoup(html, "html.parser")
    Upageurls = {}
    # Every <a> tag that carries an href attribute
    pageurls = soup.find_all("a", href=True)
    for link in pageurls:
        href = link.get("href")
        # Keep only links that contain the site URL and have not been seen yet
        if websiteurl in href and href not in Upageurls and href not in websiteurls:
            Upageurls[href] = 0
    for link in list(Upageurls):
        t2 = time.time()
        try:
            # Request each link once and store its status code
            Upageurls[link] = urllib.request.urlopen(link).getcode()
        except Exception:
            print("connect failed")
        else:
            print(n, link, Upageurls[link])
            print(time.time() - t2)  # seconds spent on this request
            n += 1
    print("total is " + repr(n) + " links")
    print(time.time() - t)  # total running time in seconds

scanpage("http://news.163.com/")
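One limitation of the substring check above is that relative links such as `/sports/index.html` never contain the site URL, so they are skipped. The sketch below shows one possible way to handle them: resolve each `href` against the page URL with `urllib.parse.urljoin`, then compare host names. It is an illustration, not the article's method. It swaps in the standard library's `html.parser.HTMLParser` for BeautifulSoup so that it runs without third-party packages, and `LinkCollector`, `extract_site_links`, and the sample HTML are hypothetical names and data introduced here.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_site_links(html, base_url):
    """Return the unique absolute links in html that share base_url's host."""
    collector = LinkCollector()
    collector.feed(html)
    base_host = urlsplit(base_url).hostname
    links = []
    for href in collector.hrefs:
        absolute = urljoin(base_url, href)  # resolve relative hrefs
        if urlsplit(absolute).hostname == base_host and absolute not in links:
            links.append(absolute)
    return links


# Hypothetical HTML snippet standing in for a downloaded page.
sample_html = """
<a href="/sports/index.html">Sports</a>
<a href="http://news.163.com/world/">World</a>
<a href="http://www.example.com/">Elsewhere</a>
<a href="/sports/index.html">Sports again</a>
<a name="top">No href</a>
"""

site_links = extract_site_links(sample_html, "http://news.163.com/")
print(site_links)
```

Comparing parsed host names rather than raw substrings also avoids accepting an external link that merely mentions the site URL somewhere in its query string.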



Copyright and reposting notice

Author: admin. URL: http://news.edns.com/post/218045.html. Published 2024-12-18.
When reposting or copying this article, please credit the source with a hyperlink.
