python

薅“疫情公益”羊毛，黑产恶意爬取各大出版社电子书上万册

python爬虫 • Magiccc 发表了文章 • 0 个评论 • 3516 次浏览 • 2020-02-26 13:17 • 来自相关话题

疫情以来，所有企业都上班延期选择在线复工，在我们居家自我隔离期间，极验观察爬虫却没有消停，反而爬虫行为更加活跃且更胜往常。本周五，我们和无糖信息一起聊聊线上爬虫的“疫情”。

爬虫发送弹幕问题

python爬虫 • naythefirst 发起了问题 • 1 人关注 • 0 个回复 • 4507 次浏览 • 2020-02-26 11:28 • 来自相关话题

jieba.posseg TypeError: cannot unpack non-iterable pair object 词性分析报错

李魔佛发表了文章 • 0 个评论 • 4785 次浏览 • 2019-11-23 10:12 • 来自相关话题

词性标注的例子出现错误 'pair' object is not iterable

例子：import jieba.posseg as pseg
seg_list = pseg.cut("我爱北京天安门")
for word,flag in seg_list:
print(word)
print(flag)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-5-f105f6980f88> in <module>()
1 import jieba.posseg as pseg
2 seg_list = pseg.cut("我爱北京天安门")
----> 3 for word,flag in seg_list:
4 print(word)
5 print(flag)

TypeError: cannot unpack non-iterable pair object原因是新版本中seg_list是一个生成器，所以只能 for win seg_list然后从word中解包出来

print(w.word)

print(w.flag)

这样问题就解决了。查看全部

词性标注的例子出现错误 'pair' object is not iterable

例子：

import jieba.posseg as pseg

seg_list = pseg.cut("我爱北京天安门")

for word,flag in seg_list:

    print(word)

    print(flag)

---------------------------------------------------------------------------

TypeError                                 Traceback (most recent call last)

<ipython-input-5-f105f6980f88> in <module>()

      1 import jieba.posseg as pseg

      2 seg_list = pseg.cut("我爱北京天安门")

----> 3 for word,flag in seg_list:

      4     print(word)

      5     print(flag)



TypeError: cannot unpack non-iterable pair object

原因是新版本中seg_list是一个生成器，所以只能 for win seg_list

然后从word中解包出来

print(w.word)

print(w.flag)

这样问题就解决了。

scrapy在settings中定义变量不能包含小写！

python爬虫 • 李魔佛发表了文章 • 0 个评论 • 3443 次浏览 • 2019-11-16 16:39 • 来自相关话题

如果变量名包含小写字母，那么你的变量会被过滤掉，在scrapy编码的其他地方都会无法被识别。
比如定义了一个叫 Redis_host = '192.168.1.1'，的值

然后在spider中，如果你调用self.settings.get('Redis_host')
那么返回值是 None。

如果用REDIS_HOST定义，那么就可以正确返回它的值。

如果你一定要用小写，也有其他方法可正常调用。
先导入settings文件
fromt xxxx import setttings # xxx为项目名

host = settings.Redis_host # 直接导入一个文件的形式来调用是可以的查看全部

如果变量名包含小写字母，那么你的变量会被过滤掉，在scrapy编码的其他地方都会无法被识别。
比如定义了一个叫 Redis_host = '192.168.1.1'，的值

然后在spider中，如果你调用self.settings.get('Redis_host')
那么返回值是 None。

如果用REDIS_HOST定义，那么就可以正确返回它的值。

如果你一定要用小写，也有其他方法可正常调用。
先导入settings文件
fromt xxxx import setttings # xxx为项目名

host = settings.Redis_host # 直接导入一个文件的形式来调用是可以的

etree.strip_tags的用法

python爬虫 • 李魔佛发表了文章 • 0 个评论 • 4775 次浏览 • 2019-10-24 11:24 • 来自相关话题

直接从官方文档那里拿过来，发现这个函数功能还挺不错的。
它把参数中的标签从源htmlelement中删除，并且把里面的标签文本给合并进来。

举个例子：from lxml.html import etree
from lxml.html import fromstring, HtmlElement

test_html = '''helloworld'''
test_element = fromstring(test_html)
etree.strip_tags(test_element,'span') # 清除span标签
etree.tostring(test_element)
因为上述操作直接应用于test_element上的，所以test_element的值已经被修改了。

所以现在test_element 的值是
b'helloworld'

原创文章，转载请注明出处
http://30daydo.com/article/553
查看全部

直接从官方文档那里拿过来，发现这个函数功能还挺不错的。
它把参数中的标签从源htmlelement中删除，并且把里面的标签文本给合并进来。

举个例子：

from lxml.html import etree

from lxml.html import fromstring, HtmlElement



test_html = '''<p><span>hello</span><span>world</span></p>'''

test_element = fromstring(test_html)

etree.strip_tags(test_element,'span') # 清除span标签

etree.tostring(test_element)

因为上述操作直接应用于test_element上的，所以test_element的值已经被修改了。

所以现在test_element 的值是
b'helloworld'

原创文章，转载请注明出处
http://30daydo.com/article/553

mumu模拟器adb无法识别

python爬虫 • 李魔佛发表了文章 • 0 个评论 • 5710 次浏览 • 2019-10-17 08:41 • 来自相关话题

因为端口号被mumu改了。

<Forwarding name="ADB_PORT" proto="1" hostip="127.0.0.1" hostport="7555" guestport="5555"/>

在mumu浏览器里面可以看到这个配置信息。

adb connect 127.0.0.1:7555
然后adb shell 就可以了。

配置文件名是：myandrovm_vbox86.nemu 查看全部

因为端口号被mumu改了。

<Forwarding name="ADB_PORT" proto="1" hostip="127.0.0.1" hostport="7555" guestport="5555"/>

在mumu浏览器里面可以看到这个配置信息。

adb connect 127.0.0.1:7555
然后adb shell 就可以了。

配置文件名是：myandrovm_vbox86.nemu

aiohttp异步下载图片

python爬虫 • 李魔佛发表了文章 • 0 个评论 • 5334 次浏览 • 2019-09-16 17:14 • 来自相关话题

保存图片的时候不能用自带的open函数打开文件，需要用到异步io库 aiofiles来打开url = 'http://xyhz.huizhou.gov.cn/static/js/common/jigsaw/images/{}.jpg'
headers={'User-Agent':'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'}
async def getPage(num):

async with aiohttp.ClientSession() as session:
async with session.get(url.format(num),headers=headers) as resp:
if resp.status==200:
f= await aiofiles.open('{}.jpg'.format(num),mode='wb')
await f.write(await resp.read())
await f.close()

loop = asyncio.get_event_loop()
tasks = [getPage(i) for i in range(5)]
loop.run_until_complete(asyncio.wait(tasks))
原创文章，
转载请注明出处：
http://30daydo.com/article/537
查看全部

保存图片的时候不能用自带的open函数打开文件，需要用到异步io库 aiofiles来打开

url = 'http://xyhz.huizhou.gov.cn/static/js/common/jigsaw/images/{}.jpg'

headers={'User-Agent':'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'}

async def getPage(num):



    async with aiohttp.ClientSession() as session:

        async with session.get(url.format(num),headers=headers) as resp:

            if resp.status==200:

                f= await aiofiles.open('{}.jpg'.format(num),mode='wb')

                await f.write(await resp.read())

                await f.close()



loop = asyncio.get_event_loop()

tasks = [getPage(i) for i in range(5)]

loop.run_until_complete(asyncio.wait(tasks))

原创文章，
转载请注明出处：
http://30daydo.com/article/537

基于文本及符号密度的网页正文提取方法 python实现

李魔佛发表了文章 • 0 个评论 • 5474 次浏览 • 2019-09-10 15:19 • 来自相关话题

基于文本及符号密度的网页正文提取方法 python实现
项目路径https://github.com/Rockyzsu/CodePool/tree/master/GeneralNewsExtractor
完成后在本文详细介绍，
请密切关注。查看全部

基于文本及符号密度的网页正文提取方法 python实现
项目路径https://github.com/Rockyzsu/CodePool/tree/master/GeneralNewsExtractor
完成后在本文详细介绍，
请密切关注。

性能对比 pypy vs python

李魔佛发表了文章 • 0 个评论 • 5262 次浏览 • 2019-09-06 17:04 • 来自相关话题

性能对比 pypy vs python
不试不知道，一试吓一跳。
如果是CPU密集型的程序，pypy3的执行速度比python要快上一百倍。
talk is cheap, show me the code!

代码很简单，运行加法运算：
执行2千万次
import time

LOOP = 2*10**8

def add(x,y):
return x+y

def cpu_pressure(loop):

for i in range(loop):
result = add(i,i+1)

if __name__ == '__main__':
start = time.time()
cpu_pressure(LOOP)
print(f'time used {time.time()-start}s')
python执行：
python main.py
返回用时：time used 21.422261476516724s

pypy执行：
pypy main.py
返回用时：time used 0.1925642490386963s

差距真的很大。查看全部

性能对比 pypy vs python
不试不知道，一试吓一跳。
如果是CPU密集型的程序，pypy3的执行速度比python要快上一百倍。
talk is cheap, show me the code!

代码很简单，运行加法运算：
执行2千万次

import time



LOOP = 2*10**8



def add(x,y):

    return x+y



def cpu_pressure(loop):

    

    for i in range(loop):

        result = add(i,i+1)





if __name__ == '__main__':

    start = time.time()

    cpu_pressure(LOOP)

    print(f'time used {time.time()-start}s')

python执行：
python main.py
返回用时：time used 21.422261476516724s

pypy执行：
pypy main.py
返回用时：time used 0.1925642490386963s

差距真的很大。

scrapy源码分析<一>：入口函数以及是如何运行

python爬虫 • 李魔佛发表了文章 • 0 个评论 • 6694 次浏览 • 2019-08-31 10:47 • 来自相关话题

运行scrapy crawl example 命令的时候，就会执行我们写的爬虫程序。
下面我们从源码分析一下scrapy执行的流程：

执行scrapy crawl 命令时，调用的是Command类class Command(ScrapyCommand):

requires_project = True

def syntax(self):
return '[options]'

def short_desc(self):
return 'Runs all of the spiders - My Defined'

def run(self,args,opts):
print('==================')
print(type(self.crawler_process))
spider_list = self.crawler_process.spiders.list() # 找到爬虫类

for name in spider_list:
print('=================')
print(name)
self.crawler_process.crawl(name,**opts.__dict__)

self.crawler_process.start()
然后我们去看看crawler_process，这个是来自ScrapyCommand，而ScrapyCommand又是CrawlerProcess的子类，而CrawlerProcess又是CrawlerRunner的子类

在CrawlerRunner构造函数里面主要作用就是这个 def __init__(self, settings=None):
if isinstance(settings, dict) or settings is None:
settings = Settings(settings)
self.settings = settings
self.spider_loader = _get_spider_loader(settings) # 构造爬虫
self._crawlers = set()
self._active = set()
self.bootstrap_failed = False
1. 加载配置文件def _get_spider_loader(settings):

cls_path = settings.get('SPIDER_LOADER_CLASS')

# settings文件没有定义SPIDER_LOADER_CLASS，所以这里获取到的是系统的默认配置文件，
# 默认配置文件在接下来的代码块A
# SPIDER_LOADER_CLASS = 'scrapy.spiderloader.SpiderLoader'

loader_cls = load_object(cls_path)
# 这个函数就是根据路径转为类对象，也就是上面crapy.spiderloader.SpiderLoader 这个
# 字符串变成一个类对象
# 具体的load_object 对象代码见下面代码块B

return loader_cls.from_settings(settings.frozencopy())
默认配置文件defautl_settting.py# 代码块A
#......省略若干
SCHEDULER = 'scrapy.core.scheduler.Scheduler'
SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleLifoDiskQueue'
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.LifoMemoryQueue'
SCHEDULER_PRIORITY_QUEUE = 'scrapy.pqueues.ScrapyPriorityQueue'

SPIDER_LOADER_CLASS = 'scrapy.spiderloader.SpiderLoader' 就是这个值
SPIDER_LOADER_WARN_ONLY = False

SPIDER_MIDDLEWARES = {}

load_object的实现# 代码块B 为了方便，我把异常处理的去除
from importlib import import_module #导入第三方库

def load_object(path):
dot = path.rindex('.')
module, name = path[:dot], path[dot+1:]
# 上面把路径分为基本路径+模块名

mod = import_module(module)
obj = getattr(mod, name)
# 获取模块里面那个值

return obj

测试代码：In [33]: mod = import_module(module)

In [34]: mod
Out[34]: <module 'scrapy.spiderloader' from '/home/xda/anaconda3/lib/python3.7/site-packages/scrapy/spiderloader.py'>

In [35]: getattr(mod,name)
Out[35]: scrapy.spiderloader.SpiderLoader

In [36]: obj = getattr(mod,name)

In [37]: obj
Out[37]: scrapy.spiderloader.SpiderLoader

In [38]: type(obj)
Out[38]: type
在代码块A中，loader_cls是SpiderLoader，最后返回的的是SpiderLoader.from_settings(settings.frozencopy())
接下来看看SpiderLoader.from_settings， def from_settings(cls, settings):
return cls(settings)
返回类对象自己，所以直接看__init__函数即可class SpiderLoader(object):
"""
SpiderLoader is a class which locates and loads spiders
in a Scrapy project.
"""
def __init__(self, settings):
self.spider_modules = settings.getlist('SPIDER_MODULES')
# 获得settting中的模块名字，创建scrapy的时候就默认帮你生成了
# 你可以看看你的settings文件里面的内容就可以找到这个值，是一个list

self.warn_only = settings.getbool('SPIDER_LOADER_WARN_ONLY')
self._spiders = {}
self._found = defaultdict(list)
self._load_all_spiders() # 加载所有爬虫

核心就是这个_load_all_spiders：
走起：def _load_all_spiders(self):
for name in self.spider_modules:

for module in walk_modules(name): # 这个遍历文件夹里面的文件，然后再转化为类对象，
# 保存到字典：self._spiders = {}
self._load_spiders(module) # 模块变成spider

self._check_name_duplicates() # 去重，如果名字一样就异常

接下来看看_load_spiders
核心就是下面的。def iter_spider_classes(module):
from scrapy.spiders import Spider

for obj in six.itervalues(vars(module)): # 找到模块里面的变量，然后迭代出来
if inspect.isclass(obj) and \
issubclass(obj, Spider) and \
obj.__module__ == module.__name__ and \
getattr(obj, 'name', None): # 有name属性，继承于Spider
yield obj
这个obj就是我们平时写的spider类了。
原来分析了这么多，才找到了我们平时写的爬虫类

待续。。。。

原创文章
转载请注明出处
http://30daydo.com/article/530
查看全部

运行scrapy crawl example 命令的时候，就会执行我们写的爬虫程序。
下面我们从源码分析一下scrapy执行的流程：

执行scrapy crawl 命令时，调用的是Command类

class Command(ScrapyCommand):



    requires_project = True



    def syntax(self):

        return '[options]'



    def short_desc(self):

        return 'Runs all of the spiders - My Defined'



    def run(self,args,opts):

        print('==================')

        print(type(self.crawler_process))

        spider_list = self.crawler_process.spiders.list() # 找到爬虫类



        for name in spider_list:

            print('=================')

            print(name)

            self.crawler_process.crawl(name,**opts.__dict__)



        self.crawler_process.start()

然后我们去看看crawler_process，这个是来自ScrapyCommand，而ScrapyCommand又是CrawlerProcess的子类，而CrawlerProcess又是CrawlerRunner的子类

在CrawlerRunner构造函数里面主要作用就是这个

      def __init__(self, settings=None):

        if isinstance(settings, dict) or settings is None:

            settings = Settings(settings)

        self.settings = settings

        self.spider_loader = _get_spider_loader(settings) # 构造爬虫

        self._crawlers = set()

        self._active = set()

        self.bootstrap_failed = False

1. 加载配置文件

def _get_spider_loader(settings):



    cls_path = settings.get('SPIDER_LOADER_CLASS')

    

    # settings文件没有定义SPIDER_LOADER_CLASS，所以这里获取到的是系统的默认配置文件，

    # 默认配置文件在接下来的代码块A

    # SPIDER_LOADER_CLASS = 'scrapy.spiderloader.SpiderLoader'    

    

    loader_cls = load_object(cls_path) 

    # 这个函数就是根据路径转为类对象，也就是上面crapy.spiderloader.SpiderLoader 这个

    # 字符串变成一个类对象

    # 具体的load_object 对象代码见下面代码块B



    return loader_cls.from_settings(settings.frozencopy())

默认配置文件defautl_settting.py

# 代码块A

#......省略若干

SCHEDULER = 'scrapy.core.scheduler.Scheduler'

SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleLifoDiskQueue'

SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.LifoMemoryQueue'

SCHEDULER_PRIORITY_QUEUE = 'scrapy.pqueues.ScrapyPriorityQueue'



SPIDER_LOADER_CLASS = 'scrapy.spiderloader.SpiderLoader' 就是这个值

SPIDER_LOADER_WARN_ONLY = False



SPIDER_MIDDLEWARES = {}

load_object的实现

# 代码块B 为了方便，我把异常处理的去除

from importlib import import_module #导入第三方库



def load_object(path):

    dot = path.rindex('.') 

    module, name = path[:dot], path[dot+1:]

    # 上面把路径分为基本路径+模块名

    

    mod = import_module(module)

    obj = getattr(mod, name)

    # 获取模块里面那个值

    

    return obj

测试代码：

In [33]: mod = import_module(module)                                                                                                                                             



In [34]: mod                                                                                                                                                                     

Out[34]: <module 'scrapy.spiderloader' from '/home/xda/anaconda3/lib/python3.7/site-packages/scrapy/spiderloader.py'>



In [35]: getattr(mod,name)                                                                                                                                                       

Out[35]: scrapy.spiderloader.SpiderLoader



In [36]: obj = getattr(mod,name)                                                                                                                                                 



In [37]: obj                                                                                                                                                                     

Out[37]: scrapy.spiderloader.SpiderLoader



In [38]: type(obj)                                                                                                                                                               

Out[38]: type

在代码块A中，loader_cls是SpiderLoader，最后返回的的是SpiderLoader.from_settings(settings.frozencopy())
接下来看看SpiderLoader.from_settings，

    def from_settings(cls, settings):

        return cls(settings)

返回类对象自己，所以直接看__init__函数即可

class SpiderLoader(object):

    """

    SpiderLoader is a class which locates and loads spiders

    in a Scrapy project.

    """

    def __init__(self, settings):

        self.spider_modules = settings.getlist('SPIDER_MODULES') 

        # 获得settting中的模块名字，创建scrapy的时候就默认帮你生成了

        # 你可以看看你的settings文件里面的内容就可以找到这个值，是一个list

        

        self.warn_only = settings.getbool('SPIDER_LOADER_WARN_ONLY')

        self._spiders = {}

        self._found = defaultdict(list)

        self._load_all_spiders() # 加载所有爬虫

核心就是这个_load_all_spiders：
走起：

def _load_all_spiders(self):

        for name in self.spider_modules:            



                for module in walk_modules(name): # 这个遍历文件夹里面的文件，然后再转化为类对象，

                    # 保存到字典：self._spiders = {}       

                    self._load_spiders(module) # 模块变成spider



        self._check_name_duplicates() # 去重，如果名字一样就异常

接下来看看_load_spiders
核心就是下面的。

def iter_spider_classes(module):

    from scrapy.spiders import Spider



    for obj in six.itervalues(vars(module)): # 找到模块里面的变量，然后迭代出来

        if inspect.isclass(obj) and \

           issubclass(obj, Spider) and \

           obj.__module__ == module.__name__ and \

           getattr(obj, 'name', None): # 有name属性，继承于Spider

           yield obj

这个obj就是我们平时写的spider类了。
原来分析了这么多，才找到了我们平时写的爬虫类

待续。。。。

原创文章
转载请注明出处
http://30daydo.com/article/530

anaconda环境下无法启动jupyter notebook

李魔佛发表了文章 • 0 个评论 • 7687 次浏览 • 2019-08-19 17:16 • 来自相关话题

运行 jupyter notebook
报错： from . import (constants, error, message, context,
ImportError: DLL load failed: 找不到指定的模块。

但是可以直接在Anaconda navigator中直接启动，所以判断是环境问题。
切换到anaconda的虚拟环境，（在菜单中进入anaconda prompt command），在当前命令行下执行 jupyter notebook就能够正常运行。

查看全部

运行 jupyter notebook
报错：

    from . import (constants, error, message, context,

ImportError: DLL load failed: 找不到指定的模块。

但是可以直接在Anaconda navigator中直接启动，所以判断是环境问题。
切换到anaconda的虚拟环境，（在菜单中进入anaconda prompt command），在当前命令行下执行 jupyter notebook就能够正常运行。

random.randint的用法

李魔佛发表了文章 • 0 个评论 • 13506 次浏览 • 2019-08-01 16:31 • 来自相关话题

random.randint的用法：
from random import randint

randint(0,1)
Out[25]: 1

randint(0,1)
Out[26]: 1

randint(0,1)
Out[27]: 1

randint(0,1)
Out[28]: 1

randint(0,1)
Out[29]: 0

randint(0,1)
Out[30]: 1
random.randint（a,b）

输出的整数范围包含a和b，和之间的整数
查看全部

random.randint的用法：

from random import randint



randint(0,1)

Out[25]: 1



randint(0,1)

Out[26]: 1



randint(0,1)

Out[27]: 1



randint(0,1)

Out[28]: 1



randint(0,1)

Out[29]: 0



randint(0,1)

Out[30]: 1

random.randint（a,b）

输出的整数范围包含a和b，和之间的整数

frontera运行link_follower.py 报错：doesn't define any object named 'FIFO'

python爬虫 • 李魔佛发表了文章 • 0 个评论 • 3972 次浏览 • 2019-07-18 11:29 • 来自相关话题

代码如下：
from __future__ import print_function

import re

import requests

from frontera.contrib.requests.manager import RequestsFrontierManager
# from frontera.contrib.requests.manager import RequestsFrontierManager
from frontera import Settings

from six.moves.urllib.parse import urljoin

SETTINGS = Settings()
SETTINGS.BACKEND = 'frontera.contrib.backends.memory.FIFO'
# SETTINGS.BACKEND = 'frontera.contrib.backends.memory.MemoryDistributedBackend'

SETTINGS.LOGGING_MANAGER_ENABLED = True
SETTINGS.LOGGING_BACKEND_ENABLED = True
SETTINGS.MAX_REQUESTS = 100
SETTINGS.MAX_NEXT_REQUESTS = 10

SEEDS = [
'http://www.imdb.com',
]

LINK_RE = re.compile(r'<a.+?href="(.*?)".?>', re.I)

def extract_page_links(response):
return [urljoin(response.url, link) for link in LINK_RE.findall(response.text)]

if __name__ == '__main__':

frontier = RequestsFrontierManager(SETTINGS)
frontier.add_seeds([requests.Request(url=url) for url in SEEDS])
while True:
next_requests = frontier.get_next_requests()
if not next_requests:
break
for request in next_requests:
try:
response = requests.get(request.url)
links = [
requests.Request(url=url)
for url in extract_page_links(response)
]
frontier.page_crawled(response)
print('Crawled', response.url, '(found', len(links), 'urls)')

if links:
frontier.links_extracted(request, links)
except requests.RequestException as e:
error_code = type(e).__name__
frontier.request_error(request, error_code)
print('Failed to process request', request.url, 'Error:', e)

无论用的py2或者py3，都会报以下的错误。raise NameError("Module '%s' doesn't define any object named '%s'" % (module, name))
NameError: Module 'frontera.contrib.backends.memory' doesn't define any object named 'FIFO' 查看全部

代码如下：

from __future__ import print_function



import re



import requests



from frontera.contrib.requests.manager import RequestsFrontierManager

# from frontera.contrib.requests.manager import RequestsFrontierManager

from frontera import Settings



from six.moves.urllib.parse import urljoin





SETTINGS = Settings()

SETTINGS.BACKEND = 'frontera.contrib.backends.memory.FIFO'

# SETTINGS.BACKEND = 'frontera.contrib.backends.memory.MemoryDistributedBackend'



SETTINGS.LOGGING_MANAGER_ENABLED = True

SETTINGS.LOGGING_BACKEND_ENABLED = True

SETTINGS.MAX_REQUESTS = 100

SETTINGS.MAX_NEXT_REQUESTS = 10



SEEDS = [

    'http://www.imdb.com',

]



LINK_RE = re.compile(r'<a.+?href="(.*?)".?>', re.I)





def extract_page_links(response):

    return [urljoin(response.url, link) for link in LINK_RE.findall(response.text)]



if __name__ == '__main__':



    frontier = RequestsFrontierManager(SETTINGS)

    frontier.add_seeds([requests.Request(url=url) for url in SEEDS])

    while True:

        next_requests = frontier.get_next_requests()

        if not next_requests:

            break

        for request in next_requests:

                try:

                    response = requests.get(request.url)

                    links = [

                        requests.Request(url=url)

                        for url in extract_page_links(response)

                    ]

                    frontier.page_crawled(response)

                    print('Crawled', response.url, '(found', len(links), 'urls)')



                    if links:

                        frontier.links_extracted(request, links)

                except requests.RequestException as e:

                    error_code = type(e).__name__

                    frontier.request_error(request, error_code)

                    print('Failed to process request', request.url, 'Error:', e)

无论用的py2或者py3，都会报以下的错误。

raise NameError("Module '%s' doesn't define any object named '%s'" % (module, name))

NameError: Module 'frontera.contrib.backends.memory' doesn't define any object named 'FIFO'

scrapy-rabbitmq 不支持python3 [修改源码使它支持]

python爬虫 • 李魔佛发表了文章 • 0 个评论 • 3707 次浏览 • 2019-07-17 17:24 • 来自相关话题

官方版本在2015年就没有更新了。
在python3上运行的收会报错。

需要修改以下地方：

待续。。

scrapy rabbitmq 分布式爬虫

python爬虫 • 李魔佛发表了文章 • 0 个评论 • 6599 次浏览 • 2019-07-17 16:59 • 来自相关话题

对于没接触过rabbitmq的同学，可以看这个文章：https://blog.csdn.net/hellozpc/article/details/81436980
rabbitmq是个不错的消息队列服务，可以配合scrapy作为消息队列.

下面是一个简单的demo：import re
import requests
import scrapy
from scrapy import Request
from rabbit_spider import settings
from scrapy.log import logger
import json
from rabbit_spider.items import RabbitSpiderItem
import datetime
from scrapy.selector import Selector
import pika

# from scrapy_rabbitmq.spiders import RabbitMQMixin
# from scrapy.contrib.spiders import CrawlSpider

class Website(scrapy.Spider):
name = "rabbit"

def start_requests(self):
headers = {'Accept': '*/*',
'Accept-Encoding': 'gzip, deflate, br',
'Accept-Language': 'en-US,en;q=0.9,zh-CN;q=0.8,zh;q=0.7',
'Host': '36kr.com',
'Referer': 'https://36kr.com/information/web_news',
'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.108 Safari/537.36'
}

url = 'https://36kr.com/information/web_news'

yield Request(url=url,
headers=headers)

def parse(self, response):

credentials = pika.PlainCredentials('admin', 'admin')
connection = pika.BlockingConnection(pika.ConnectionParameters('192.168.1.101', 5672, '/', credentials))

channel = connection.channel()
channel.exchange_declare(exchange='direct_log', exchange_type='direct')

result = channel.queue_declare(exclusive=True, queue='')

queue_name = result.method.queue

# print(queue_name)
# infos = sys.argv[1:] if len(sys.argv)>1 else ['info']
info = 'info'

# 绑定多个值

channel.queue_bind(
exchange='direct_log',
routing_key=info,
queue=queue_name
)
print('start to receive [{}]'.format(info))

channel.basic_consume(
on_message_callback=self.callback_func,
queue=queue_name,
auto_ack=True,
)

channel.start_consuming()

def callback_func(self, ch, method, properties, body):
print(body)
启动spider：from scrapy import cmdline
cmdline.execute('scrapy crawl rabbit'.split())
然后往rabbitmq里面推送数据：import pika
import settings

credentials = pika.PlainCredentials('admin','admin')
connection = pika.BlockingConnection(pika.ConnectionParameters('192.168.1.101',5672,'/',credentials))

channel = connection.channel()
channel.exchange_declare(exchange='direct_log',exchange_type='direct') # fanout 就是组播

routing_key = 'info'
message='https://36kr.com/pp/api/aggregation-entity?type=web_latest_article&b_id=59499&per_page=30'
channel.basic_publish(
exchange='direct_log',
routing_key=routing_key,
body=message
)

print('sending message {}'.format(message))
connection.close()

推送数据后，scrapy会马上接受到队里里面的数据。
注意不能在start_requests里面写等待队列的命令，因为start_requests函数需要返回一个生成器，否则程序会报错。

待续。。。
###### 2019-08-29 更新 ###################
发现一个坑，就是rabbitMQ在接受到数据后，无法在回调函数里面使用yield生成器。
查看全部

对于没接触过rabbitmq的同学，可以看这个文章：https://blog.csdn.net/hellozpc/article/details/81436980
rabbitmq是个不错的消息队列服务，可以配合scrapy作为消息队列.

下面是一个简单的demo：

import re

import requests

import scrapy

from scrapy import Request

from rabbit_spider import settings

from scrapy.log import logger

import json

from rabbit_spider.items import RabbitSpiderItem

import datetime

from scrapy.selector import Selector

import pika



# from scrapy_rabbitmq.spiders import RabbitMQMixin

# from scrapy.contrib.spiders import CrawlSpider



class Website(scrapy.Spider):

    name = "rabbit"



    def start_requests(self):

        headers = {'Accept': '*/*',

                   'Accept-Encoding': 'gzip, deflate, br',

                   'Accept-Language': 'en-US,en;q=0.9,zh-CN;q=0.8,zh;q=0.7',

                   'Host': '36kr.com',

                   'Referer': 'https://36kr.com/information/web_news',

                   'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.108 Safari/537.36'

                   }



        url = 'https://36kr.com/information/web_news'

        



        yield Request(url=url,

                      headers=headers)



    def parse(self, response):

       



        credentials = pika.PlainCredentials('admin', 'admin')

        connection = pika.BlockingConnection(pika.ConnectionParameters('192.168.1.101', 5672, '/', credentials))



        channel = connection.channel()

        channel.exchange_declare(exchange='direct_log', exchange_type='direct')



        result = channel.queue_declare(exclusive=True, queue='')



        queue_name = result.method.queue



        # print(queue_name)

        # infos = sys.argv[1:] if len(sys.argv)>1 else ['info']

        info = 'info'



        # 绑定多个值



        channel.queue_bind(

            exchange='direct_log',

            routing_key=info,

            queue=queue_name

        )

        print('start to receive [{}]'.format(info))



        channel.basic_consume(

            on_message_callback=self.callback_func,

            queue=queue_name,

            auto_ack=True,

        )



        channel.start_consuming()





    def callback_func(self, ch, method, properties, body):

        print(body)

启动spider：

from scrapy import cmdline

cmdline.execute('scrapy crawl rabbit'.split())

然后往rabbitmq里面推送数据：

import pika

import settings



credentials = pika.PlainCredentials('admin','admin')

connection = pika.BlockingConnection(pika.ConnectionParameters('192.168.1.101',5672,'/',credentials))



channel = connection.channel()

channel.exchange_declare(exchange='direct_log',exchange_type='direct') # fanout 就是组播



routing_key = 'info'

message='https://36kr.com/pp/api/aggregation-entity?type=web_latest_article&b_id=59499&per_page=30'

channel.basic_publish(

	exchange='direct_log',

	routing_key=routing_key,

	body=message

	)



print('sending message {}'.format(message))

connection.close()

推送数据后，scrapy会马上接受到队里里面的数据。
注意不能在start_requests里面写等待队列的命令，因为start_requests函数需要返回一个生成器，否则程序会报错。

待续。。。
###### 2019-08-29 更新 ###################
发现一个坑，就是rabbitMQ在接受到数据后，无法在回调函数里面使用yield生成器。

exchange_declare() got an unexpected keyword argument 'type'

李魔佛发表了文章 • 0 个评论 • 3325 次浏览 • 2019-07-16 14:40 • 来自相关话题

In new version of pika, now it is using
exchange_type instead of type

credentials = pika.PlainCredentials('admin','admin')
connection = pika.BlockingConnection(pika.ConnectionParameters('192.168.1.101',5672,'/',credentials))

channel = connection.channel()

channel.exchange_declare(exchange='logs',exchange_type='fanout') 查看全部

In new version of pika, now it is using
exchange_type instead of type

	credentials = pika.PlainCredentials('admin','admin')

	connection = pika.BlockingConnection(pika.ConnectionParameters('192.168.1.101',5672,'/',credentials))



	channel = connection.channel()



	channel.exchange_declare(exchange='logs',exchange_type='fanout')

twisted　reactor运行后，添加了addBoth函数，但是还是无法停止

李魔佛发表了文章 • 0 个评论 • 4435 次浏览 • 2019-07-11 09:43 • 来自相关话题

代码如下：
from scrapy.selector import Selector

def get_response_callback(content):
txt = str(content,encoding='utf-8')
resp = Selector(text=txt)
title = resp.xpath('//title/text()').extract_first()
print(title)

@defer.inlineCallbacks
def task():
url = 'http://www.baidu.com'
d=getPage(url.encode('utf-8'))
d.addCallback(get_response_callback)
yield d

def done():
reactor.stop()

def done1(*args,**kwargs):
reactor.stop()

task_list =
for i in range(4):
d=task()
task_list.append(d)

dd = defer.DeferredList(task_list)

dd.addBoth(done)

reactor.run()
上面的代码是无法停止的，如果使用的是　
dd.addBoth(done)

done函数的定义是没有参数的。　

而使用另一个done函数带参数的done(*args,**kwargs)
是可以正常退出的，done里面写了reactor.stop() 函数

原创文章
转载请注明出处：
http://30daydo.com/article/509
查看全部

代码如下：

	from scrapy.selector import Selector

	

	def get_response_callback(content):

		txt = str(content,encoding='utf-8')

		resp = Selector(text=txt)

		title = resp.xpath('//title/text()').extract_first()

		print(title)

	

	@defer.inlineCallbacks

	def task():

		url = 'http://www.baidu.com'

		d=getPage(url.encode('utf-8'))

		d.addCallback(get_response_callback)

		yield d



	def done():

		reactor.stop()



	def done1(*args,**kwargs):

		reactor.stop()



	task_list = 

	for i in range(4):

		d=task()

		task_list.append(d)



	dd = defer.DeferredList(task_list)



	dd.addBoth(done)



	reactor.run()

上面的代码是无法停止的，如果使用的是　
dd.addBoth(done)

done函数的定义是没有参数的。　

而使用另一个done函数带参数的done(*args,**kwargs)
是可以正常退出的，done里面写了reactor.stop() 函数

原创文章
转载请注明出处：
http://30daydo.com/article/509

cv2 distanceTransform函数的用法 python

李魔佛发表了文章 • 0 个评论 • 12042 次浏览 • 2019-07-08 15:35 • 来自相关话题

distanceTransform
Calculates the distance to the closest zero pixel for each pixel of the source image.

Python: cv2.distanceTransform(src, distanceType, maskSize[, dst]) → dst

Python: cv.DistTransform(src, dst, distance_type=CV_DIST_L2, mask_size=3, mask=None, labels=None) → None
Parameters:
src – 8-bit, single-channel (binary) source image.
dst – Output image with calculated distances. It is a 32-bit floating-point, single-channel image of the same size as src .
distanceType – Type of distance. It can be CV_DIST_L1, CV_DIST_L2 , or CV_DIST_C .
maskSize – Size of the distance transform mask. It can be 3, 5, or CV_DIST_MASK_PRECISE (the latter option is only supported by the first function). In case of the CV_DIST_L1 or CV_DIST_C distance type, the parameter is forced to 3 because a 3\times 3 mask gives the same result as 5\times 5 or any larger aperture.
labels – Optional output 2D array of labels (the discrete Voronoi diagram). It has the type CV_32SC1 and the same size as src . See the details below.
labelType – Type of the label array to build. If labelType==DIST_LABEL_CCOMP then each connected component of zeros in src (as well as all the non-zero pixels closest to the connected component) will be assigned the same label. If labelType==DIST_LABEL_PIXEL then each zero pixel (and all the non-zero pixels closest to it) gets its own label.
The functions distanceTransform calculate the approximate or precise distance from every binary image pixel to the nearest zero pixel. For zero image pixels, the distance will obviously be zero.

When maskSize == CV_DIST_MASK_PRECISE and distanceType == CV_DIST_L2 , the function runs the algorithm described in [Felzenszwalb04]. This algorithm is parallelized with the TBB library.

In other cases, the algorithm [Borgefors86] is used. This means that for a pixel the function finds the shortest path to the nearest zero pixel consisting of basic shifts: horizontal, vertical, diagonal, or knight’s move (the latest is available for a 5\times 5 mask). The overall distance is calculated as a sum of these basic distances. Since the distance function should be symmetric, all of the horizontal and vertical shifts must have the same cost (denoted as a ), all the diagonal shifts must have the same cost (denoted as b ), and all knight’s moves must have the same cost (denoted as c ). For the CV_DIST_C and CV_DIST_L1 types, the distance is calculated precisely, whereas for CV_DIST_L2 (Euclidean distance) the distance can be calculated only with a relative error (a 5\times 5 mask gives more accurate results). For a,``b`` , and c , OpenCV uses the values suggested in the original paper:

CV_DIST_C (3\times 3) a = 1, b = 1
CV_DIST_L1 (3\times 3) a = 1, b = 2
CV_DIST_L2 (3\times 3) a=0.955, b=1.3693
CV_DIST_L2 (5\times 5) a=1, b=1.4, c=2.1969
Typically, for a fast, coarse distance estimation CV_DIST_L2, a 3\times 3 mask is used. For a more accurate distance estimation CV_DIST_L2 , a 5\times 5 mask or the precise algorithm is used. Note that both the precise and the approximate algorithms are linear on the number of pixels.

The second variant of the function does not only compute the minimum distance for each pixel (x, y) but also identifies the nearest connected component consisting of zero pixels (labelType==DIST_LABEL_CCOMP) or the nearest zero pixel (labelType==DIST_LABEL_PIXEL). Index of the component/pixel is stored in \texttt{labels}(x, y) . When labelType==DIST_LABEL_CCOMP, the function automatically finds connected components of zero pixels in the input image and marks them with distinct labels. When labelType==DIST_LABEL_CCOMP, the function scans through the input image and marks all the zero pixels with distinct labels.

In this mode, the complexity is still linear. That is, the function provides a very fast way to compute the Voronoi diagram for a binary image. Currently, the second variant can use only the approximate distance transform algorithm, i.e. maskSize=CV_DIST_MASK_PRECISE is not supported yet.

Note
An example on using the distance transform can be found at opencv_source_code/samples/cpp/distrans.cpp
(Python) An example on using the distance transform can be found at opencv_source/samples/python2/distrans.py

查看全部

distanceTransform

Calculates the distance to the closest zero pixel for each pixel of the source image.





Python: cv2.distanceTransform(src, distanceType, maskSize[, dst]) → dst



Python: cv.DistTransform(src, dst, distance_type=CV_DIST_L2, mask_size=3, mask=None, labels=None) → None



Parameters:	

src – 8-bit, single-channel (binary) source image.

dst – Output image with calculated distances. It is a 32-bit floating-point, single-channel image of the same size as src .



distanceType – Type of distance. It can be CV_DIST_L1, CV_DIST_L2 , or CV_DIST_C .

maskSize – Size of the distance transform mask. It can be 3, 5, or CV_DIST_MASK_PRECISE (the latter option is only supported by the first function). In case of the CV_DIST_L1 or CV_DIST_C distance type, the parameter is forced to 3 because a  3\times 3 mask gives the same result as  5\times 5 or any larger aperture.



labels – Optional output 2D array of labels (the discrete Voronoi diagram). It has the type CV_32SC1 and the same size as src . See the details below.



labelType – Type of the label array to build. If labelType==DIST_LABEL_CCOMP then each connected component of zeros in src (as well as all the non-zero pixels closest to the connected component) will be assigned the same label. If labelType==DIST_LABEL_PIXEL then each zero pixel (and all the non-zero pixels closest to it) gets its own label.

The functions distanceTransform calculate the approximate or precise distance from every binary image pixel to the nearest zero pixel. For zero image pixels, the distance will obviously be zero.





When maskSize == CV_DIST_MASK_PRECISE and distanceType == CV_DIST_L2 , the function runs the algorithm described in [Felzenszwalb04]. This algorithm is parallelized with the TBB library.



In other cases, the algorithm [Borgefors86] is used. This means that for a pixel the function finds the shortest path to the nearest zero pixel consisting of basic shifts: horizontal, vertical, diagonal, or knight’s move (the latest is available for a 5\times 5 mask). The overall distance is calculated as a sum of these basic distances. Since the distance function should be symmetric, all of the horizontal and vertical shifts must have the same cost (denoted as a ), all the diagonal shifts must have the same cost (denoted as b ), and all knight’s moves must have the same cost (denoted as c ). For the CV_DIST_C and CV_DIST_L1 types, the distance is calculated precisely, whereas for CV_DIST_L2 (Euclidean distance) the distance can be calculated only with a relative error (a 5\times 5 mask gives more accurate results). For a,``b`` , and c , OpenCV uses the values suggested in the original paper:



CV_DIST_C	(3\times 3)	a = 1, b = 1

CV_DIST_L1	(3\times 3)	a = 1, b = 2

CV_DIST_L2	(3\times 3)	a=0.955, b=1.3693

CV_DIST_L2	(5\times 5)	a=1, b=1.4, c=2.1969

Typically, for a fast, coarse distance estimation CV_DIST_L2, a 3\times 3 mask is used. For a more accurate distance estimation CV_DIST_L2 , a 5\times 5 mask or the precise algorithm is used. Note that both the precise and the approximate algorithms are linear on the number of pixels.



The second variant of the function does not only compute the minimum distance for each pixel (x, y) but also identifies the nearest connected component consisting of zero pixels (labelType==DIST_LABEL_CCOMP) or the nearest zero pixel (labelType==DIST_LABEL_PIXEL). Index of the component/pixel is stored in \texttt{labels}(x, y) . When labelType==DIST_LABEL_CCOMP, the function automatically finds connected components of zero pixels in the input image and marks them with distinct labels. When labelType==DIST_LABEL_CCOMP, the function scans through the input image and marks all the zero pixels with distinct labels.



In this mode, the complexity is still linear. That is, the function provides a very fast way to compute the Voronoi diagram for a binary image. Currently, the second variant can use only the approximate distance transform algorithm, i.e. maskSize=CV_DIST_MASK_PRECISE is not supported yet.



Note

An example on using the distance transform can be found at opencv_source_code/samples/cpp/distrans.cpp

(Python) An example on using the distance transform can be found at opencv_source/samples/python2/distrans.py

Win10下PhantomJS无法运行【版本兼容问题】

李魔佛发表了文章 • 0 个评论 • 6053 次浏览 • 2019-07-04 09:07 • 来自相关话题

以前在win7上运行的好好的。
在win10下就报错：
selenium.common.exceptions.WebDriverException: Message: Service C:\Tool\phantomjs-2.5.0-beta2-windows\phantomjs-2.5.0-beta2-windows\bin\phantomjs.exe unexpectedly exited. Status code was: 4294967295

后来替换了一个旧的版本，发现问题就这么解决了。
旧版本：phantomjs-2.1.1-windows

原创文章
转载请注明出处
http://30daydo.com/article/505
查看全部

以前在win7上运行的好好的。
在win10下就报错：
selenium.common.exceptions.WebDriverException: Message: Service C:\Tool\phantomjs-2.5.0-beta2-windows\phantomjs-2.5.0-beta2-windows\bin\phantomjs.exe unexpectedly exited. Status code was: 4294967295

后来替换了一个旧的版本，发现问题就这么解决了。
旧版本：phantomjs-2.1.1-windows

原创文章
转载请注明出处
http://30daydo.com/article/505

喜马拉雅app 爬取音频文件

python爬虫 • 李魔佛发表了文章 • 0 个评论 • 6716 次浏览 • 2019-06-30 12:24 • 来自相关话题

============== 2019-10-28更新 =================
因为喜马拉雅的源码格式改了，所以爬虫代码也更新了一波
# -*- coding: utf-8 -*-
# website: http://30daydo.com
# @Time : 2019/6/30 12:03
# @File : main.py

import requests
import re
import os

url = 'http://180.153.255.6/mobile/v1/album/track/ts-1571294887744?albumId=23057324&device=android&isAsc=true&isQueryInvitationBrand=true&pageId={}&pageSize=20&pre_page=0'
headers = {'User-Agent': 'Xiaomi'}

def download():
for i in range(1, 3):
r = requests.get(url=url.format(i), headers=headers)
js_data = r.json()
data_list = js_data.get('data', {}).get('list', [])
for item in data_list:
trackName = item.get('title')
trackName = re.sub('[\/\\\:\*\?\"\<\>\|]', '_', trackName)
# trackName=re.sub(':','',trackName)
src_url = item.get('playUrl64')
filename = '{}.mp3'.format(trackName)
if not os.path.exists(filename):

try:
r0 = requests.get(src_url, headers=headers)
except Exception as e:
print(e)
print(trackName)
r0 = requests.get(src_url, headers=headers)

else:
with open(filename, 'wb') as f:
f.write(r0.content)

print('{} downloaded'.format(trackName))

else:
print(f'{filename}已经下载过了')

import shutil

def rename_():
for i in range(1, 3):
r = requests.get(url=url.format(i), headers=headers)
js_data = r.json()
data_list = js_data.get('data', {}).get('list', [])
for item in data_list:
trackName = item.get('title')
trackName = re.sub('[\/\\\:\*\?\"\<\>\|]', '_', trackName)
src_url = item.get('playUrl64')

orderNo=item.get('orderNo')

filename = '{}.mp3'.format(trackName)
try:

if os.path.exists(filename):
new_file='{}_{}.mp3'.format(orderNo,trackName)
shutil.move(filename,new_file)
except Exception as e:
print(e)

if __name__=='__main__':
rename_()

音频文件也更新了，详情见百度网盘。

======== 2018-10=============
爬取喜马拉雅app上杨继东的投资之道的音频文件
运行环境：python3# -*- coding: utf-8 -*-
# website: http://30daydo.com
# @Time : 2019/6/30 12:03
# @File : main.py

import requests
import re
url = 'https://www.ximalaya.com/revision/play/album?albumId=23057324&pageNum=1&sort=1&pageSize=60'
headers={'User-Agent':'Xiaomi'}

r = requests.get(url=url,headers=headers)
js_data = r.json()
data_list = js_data.get('data',{}).get('tracksAudioPlay',)
for item in data_list:
trackName=item.get('trackName')
trackName=re.sub(':','',trackName)
src_url = item.get('src')
try:
r0=requests.get(src_url,headers=headers)
except Exception as e:
print(e)
print(trackName)
else:
with open('{}.m4a'.format(trackName),'wb') as f:
f.write(r0.content)
print('{} downloaded'.format(trackName))
保存为main.py
然后运行 python main.py
稍微等几分钟就自动下载好了。

附下载好的音频文件：
链接：https://pan.baidu.com/s/1t_vJhTvSJSeFdI1IaDS6fA
提取码：e3zb

原创文章
转载请注明出处
http://30daydo.com/article/503 查看全部

============== 2019-10-28更新 =================
因为喜马拉雅的源码格式改了，所以爬虫代码也更新了一波

# -*- coding: utf-8 -*-

# website: http://30daydo.com

# @Time : 2019/6/30 12:03

# @File : main.py



import requests

import re

import os



url = 'http://180.153.255.6/mobile/v1/album/track/ts-1571294887744?albumId=23057324&device=android&isAsc=true&isQueryInvitationBrand=true&pageId={}&pageSize=20&pre_page=0'

headers = {'User-Agent': 'Xiaomi'}



def download():

    for i in range(1, 3):

        r = requests.get(url=url.format(i), headers=headers)

        js_data = r.json()

        data_list = js_data.get('data', {}).get('list', [])

        for item in data_list:

            trackName = item.get('title')

            trackName = re.sub('[\/\\\:\*\?\"\<\>\|]', '_', trackName)

            # trackName=re.sub(':','',trackName)

            src_url = item.get('playUrl64')

            filename = '{}.mp3'.format(trackName)

            if not os.path.exists(filename):



                try:

                    r0 = requests.get(src_url, headers=headers)

                except Exception as e:

                    print(e)

                    print(trackName)

                    r0 = requests.get(src_url, headers=headers)





                else:

                    with open(filename, 'wb') as f:

                        f.write(r0.content)



                    print('{} downloaded'.format(trackName))



            else:

                print(f'{filename}已经下载过了')



import shutil



def rename_():

    for i in range(1, 3):

        r = requests.get(url=url.format(i), headers=headers)

        js_data = r.json()

        data_list = js_data.get('data', {}).get('list', [])

        for item in data_list:

            trackName = item.get('title')

            trackName = re.sub('[\/\\\:\*\?\"\<\>\|]', '_', trackName)

            src_url = item.get('playUrl64')



            orderNo=item.get('orderNo')



            filename = '{}.mp3'.format(trackName)

            try:



                if os.path.exists(filename):

                    new_file='{}_{}.mp3'.format(orderNo,trackName)

                    shutil.move(filename,new_file)

            except Exception as e:

                print(e)











if __name__=='__main__':

    rename_()

音频文件也更新了，详情见百度网盘。

======== 2018-10=============
爬取喜马拉雅app上杨继东的投资之道的音频文件
运行环境：python3

# -*- coding: utf-8 -*-

# website: http://30daydo.com

# @Time : 2019/6/30 12:03

# @File : main.py



import requests

import re

url = 'https://www.ximalaya.com/revision/play/album?albumId=23057324&pageNum=1&sort=1&pageSize=60'

headers={'User-Agent':'Xiaomi'}



r = requests.get(url=url,headers=headers)

js_data = r.json()

data_list = js_data.get('data',{}).get('tracksAudioPlay',)

for item in data_list:

    trackName=item.get('trackName')

    trackName=re.sub(':','',trackName)

    src_url = item.get('src')

    try:

        r0=requests.get(src_url,headers=headers)

    except Exception as e:

        print(e)

        print(trackName)

    else:

        with open('{}.m4a'.format(trackName),'wb') as f:

            f.write(r0.content)

        print('{} downloaded'.format(trackName))

保存为main.py
然后运行 python main.py
稍微等几分钟就自动下载好了。

附下载好的音频文件：
链接：https://pan.baidu.com/s/1t_vJhTvSJSeFdI1IaDS6fA
提取码：e3zb

原创文章
转载请注明出处
http://30daydo.com/article/503

python3与python2迭代器的写法的区别

李魔佛发表了文章 • 0 个评论 • 3508 次浏览 • 2019-06-26 11:22 • 来自相关话题

大部分相同，只是python2里面需要实现在类中实现next()方法，而python3里面需要实现__next__()方法。

附一个例子：
def iter_demo():

class DefineIter(object):

def __init__(self,length):
self.length = length
self.data = range(self.length)
self.index=0

def __iter__(self):
return self

def __next__(self):

if self.index >=self.length:
# return None
raise StopIteration

d = self.data[self.index]*50
self.index =self.index + 1

return d

a = DefineIter(10)
print(type(a))
for i in a:
print(i) 查看全部

大部分相同，只是python2里面需要实现在类中实现next()方法，而python3里面需要实现__next__()方法。

附一个例子：

def iter_demo():



    class DefineIter(object):



        def __init__(self,length):

            self.length = length

            self.data = range(self.length)

            self.index=0



        def __iter__(self):

            return self





        def __next__(self):



            if self.index >=self.length:

                # return None

                raise StopIteration



            d = self.data[self.index]*50

            self.index =self.index + 1



            return d



    a = DefineIter(10)

    print(type(a))

    for i in a:

        print(i)

PyCharm 快捷键快速插入当前时间

李魔佛发表了文章 • 0 个评论 • 4346 次浏览 • 2019-06-26 09:18 • 来自相关话题

个人觉得这是一个非常常用的功能，不过需要自定义实现。

方式
通过 Live Template 快速添加时间

步骤
1、添加一个 Template Group 命名为 Common
2、添加一个 Live Template 设置如下
Abbreviation： time
Description ： current time
Template Text: $time$

Edit Variables -> Expresssion : date("yyyy-MM-dd HH:mm:ss")

3、让设置生效
Define->Everywhere

4、使用
输入 time 后按下tab键就能转换为当前时间了
查看全部

个人觉得这是一个非常常用的功能，不过需要自定义实现。

方式

通过 Live Template 快速添加时间



步骤

1、添加一个 Template Group 命名为 Common

2、添加一个 Live Template 设置如下

Abbreviation： time

Description ： current time

Template Text:  $time$



Edit Variables ->  Expresssion : date("yyyy-MM-dd HH:mm:ss")







3、让设置生效

Define->Everywhere



4、使用

输入 time 后 按下tab键 就能转换为当前时间了

conda无法在win10下用命令行切换虚拟环境

李魔佛发表了文章 • 0 个评论 • 5727 次浏览 • 2019-06-11 10:04 • 来自相关话题

虚拟环境已经安装好了
然后在PowerShell下运行activate py2，没有任何反应。（powershell是win7后面系统的增强命令行）
后来使用系统原始的cmd命令行，在运行里面敲入cmd，然后重新执行activate py2，问题得到解决了。
原因是兼容问题。查看全部

虚拟环境已经安装好了
然后在PowerShell下运行activate py2，没有任何反应。（powershell是win7后面系统的增强命令行）
后来使用系统原始的cmd命令行，在运行里面敲入cmd，然后重新执行activate py2，问题得到解决了。
原因是兼容问题。

jupyter notebook格式的文件损坏如何修复

李魔佛发表了文章 • 0 个评论 • 5371 次浏览 • 2019-06-08 13:44 • 来自相关话题

有时候用git同步时，造成了冲突后合并，jupyter notebook的文件被插入了诸如>>>>>HEAD，ORIGIN等字符，这时候再打开jupyter notebook文件（.ipynb后缀），会无法打开。修复过程：

使用下面的代码：
# 拯救损坏的jupyter 文件
import re
import codecs

pattern = re.compile('"source": \[(.*?)\]\s+\},',re.S)
filename = 'tushare_usage.ipynb'
with codecs.open(filename,encoding='utf8') as f:
content = f.read()

source = pattern.findall(content)
for s in source:
t=s.replace('\\n','')
t=re.sub('"','',t)
t=re.sub('(,$)','',t)
print(t)只要把你要修复的文件替换一下就可以了。查看全部

有时候用git同步时，造成了冲突后合并，jupyter notebook的文件被插入了诸如>>>>>HEAD，ORIGIN等字符，这时候再打开jupyter notebook文件（.ipynb后缀），会无法打开。修复过程：

使用下面的代码：

# 拯救损坏的jupyter 文件

import re

import codecs



pattern = re.compile('"source": \[(.*?)\]\s+\},',re.S)

filename = 'tushare_usage.ipynb'

with codecs.open(filename,encoding='utf8') as f:

    content = f.read()

    

source = pattern.findall(content)

for s in source:

    t=s.replace('\\n','')

    t=re.sub('"','',t)

    t=re.sub('(,$)','',t)

    print(t)

只要把你要修复的文件替换一下就可以了。

requests直接post图片文件

python爬虫 • 李魔佛发表了文章 • 0 个评论 • 4104 次浏览 • 2019-05-17 16:32 • 来自相关话题

代码如下：
file_path=r'9927_15562445086485238.png'
file=open(file_path, 'rb').read()
r=requests.post(url=code_url,data=file)
print(r.text) 查看全部

代码如下：

    file_path=r'9927_15562445086485238.png'

    file=open(file_path, 'rb').read()

    r=requests.post(url=code_url,data=file)

    print(r.text)

python的mixin类

李魔佛发表了文章 • 0 个评论 • 3242 次浏览 • 2019-05-16 16:30 • 来自相关话题

A mixin is a limited form of multiple inheritance.

maxin类似多重继承的一种限制形式：
关于Python的Mixin模式

像C或C++这类语言都支持多重继承，一个子类可以有多个父类，这样的设计常被人诟病。因为继承应该是个”is-a”关系。比如轿车类继承交通工具类，因为轿车是一个(“is-a”)交通工具。一个物品不可能是多种不同的东西，因此就不应该存在多重继承。不过有没有这种情况，一个类的确是需要继承多个类呢？

答案是有，我们还是拿交通工具来举例子，民航飞机是一种交通工具，对于土豪们来说直升机也是一种交通工具。对于这两种交通工具，它们都有一个功能是飞行，但是轿车没有。所以，我们不可能将飞行功能写在交通工具这个父类中。但是如果民航飞机和直升机都各自写自己的飞行方法，又违背了代码尽可能重用的原则（如果以后飞行工具越来越多，那会出现许多重复代码）。怎么办，那就只好让这两种飞机同时继承交通工具以及飞行器两个父类，这样就出现了多重继承。这时又违背了继承必须是”is-a”关系。这个难题该怎么破？

不同的语言给出了不同的方法，让我们先来看下Java。Java提供了接口interface功能，来实现多重继承：public abstract class Vehicle {
}

public interface Flyable {
public void fly();
}

public class FlyableImpl implements Flyable {
public void fly() {
System.out.println("I am flying");
}
}

public class Airplane extends Vehicle implements Flyable {
private flyable;

public Airplane() {
flyable = new FlyableImpl();
}

public void fly() {
flyable.fly();
}
}

现在我们的飞机同时具有了交通工具及飞行器两种属性，而且我们不需要重写飞行器中的飞行方法，同时我们没有破坏单一继承的原则。飞机就是一种交通工具，可飞行的能力是是飞机的属性，通过继承接口来获取。

回到主题，Python语言可没有接口功能，但是它可以多重继承。那Python是不是就该用多重继承来实现呢？是，也不是。说是，因为从语法上看，的确是通过多重继承实现的。说不是，因为它的继承依然遵守”is-a”关系，从含义上看依然遵循单继承的原则。这个怎么理解呢？我们还是看例子吧。class Vehicle(object):
pass

class PlaneMixin(object):
def fly(self):
print 'I am flying'

class Airplane(Vehicle, PlaneMixin):
pass

可以看到，上面的Airplane类实现了多继承，不过它继承的第二个类我们起名为PlaneMixin，而不是Plane，这个并不影响功能，但是会告诉后来读代码的人，这个类是一个Mixin类。所以从含义上理解，Airplane只是一个Vehicle，不是一个Plane。这个Mixin，表示混入(mix-in)，它告诉别人，这个类是作为功能添加到子类中，而不是作为父类，它的作用同Java中的接口。

使用Mixin类实现多重继承要非常小心
首先它必须表示某一种功能，而不是某个物品，如同Java中的Runnable，Callable等

其次它必须责任单一，如果有多个功能，那就写多个Mixin类然后，它不依赖于子类的实现最后，子类即便没有继承这个Mixin类，也照样可以工作，就是缺少了某个功能。（比如飞机照样可以载客，就是不能飞了^_^）

原创文章，转载请注明出处
http://30daydo.com/article/480
查看全部

A mixin is a limited form of multiple inheritance.

maxin类似多重继承的一种限制形式：
关于Python的Mixin模式

像C或C++这类语言都支持多重继承，一个子类可以有多个父类，这样的设计常被人诟病。因为继承应该是个”is-a”关系。比如轿车类继承交通工具类，因为轿车是一个(“is-a”)交通工具。一个物品不可能是多种不同的东西，因此就不应该存在多重继承。不过有没有这种情况，一个类的确是需要继承多个类呢？

答案是有，我们还是拿交通工具来举例子，民航飞机是一种交通工具，对于土豪们来说直升机也是一种交通工具。对于这两种交通工具，它们都有一个功能是飞行，但是轿车没有。所以，我们不可能将飞行功能写在交通工具这个父类中。但是如果民航飞机和直升机都各自写自己的飞行方法，又违背了代码尽可能重用的原则（如果以后飞行工具越来越多，那会出现许多重复代码）。怎么办，那就只好让这两种飞机同时继承交通工具以及飞行器两个父类，这样就出现了多重继承。这时又违背了继承必须是”is-a”关系。这个难题该怎么破？

不同的语言给出了不同的方法，让我们先来看下Java。Java提供了接口interface功能，来实现多重继承：

public abstract class Vehicle {

}

 

public interface Flyable {

    public void fly();

}

 

public class FlyableImpl implements Flyable {

    public void fly() {

        System.out.println("I am flying");

    }

} 

 

public class Airplane extends Vehicle implements Flyable {

    private flyable;

 

    public Airplane() {

        flyable = new FlyableImpl();

    }

 

    public void fly() {

        flyable.fly();

    }

}

现在我们的飞机同时具有了交通工具及飞行器两种属性，而且我们不需要重写飞行器中的飞行方法，同时我们没有破坏单一继承的原则。飞机就是一种交通工具，可飞行的能力是是飞机的属性，通过继承接口来获取。

回到主题，Python语言可没有接口功能，但是它可以多重继承。那Python是不是就该用多重继承来实现呢？是，也不是。说是，因为从语法上看，的确是通过多重继承实现的。说不是，因为它的继承依然遵守”is-a”关系，从含义上看依然遵循单继承的原则。这个怎么理解呢？我们还是看例子吧。

class Vehicle(object):

    pass

 

class PlaneMixin(object):

    def fly(self):

        print 'I am flying'

 

class Airplane(Vehicle, PlaneMixin):

    pass

可以看到，上面的Airplane类实现了多继承，不过它继承的第二个类我们起名为PlaneMixin，而不是Plane，这个并不影响功能，但是会告诉后来读代码的人，这个类是一个Mixin类。所以从含义上理解，Airplane只是一个Vehicle，不是一个Plane。这个Mixin，表示混入(mix-in)，它告诉别人，这个类是作为功能添加到子类中，而不是作为父类，它的作用同Java中的接口。

使用Mixin类实现多重继承要非常小心

首先它必须表示某一种功能，而不是某个物品，如同Java中的Runnable，Callable等

其次它必须责任单一，如果有多个功能，那就写多个Mixin类
然后，它不依赖于子类的实现
最后，子类即便没有继承这个Mixin类，也照样可以工作，就是缺少了某个功能。（比如飞机照样可以载客，就是不能飞了^_^）

原创文章，转载请注明出处
http://30daydo.com/article/480

正则表达式替换中文换行符【python】

python爬虫 • 李魔佛发表了文章 • 0 个评论 • 3475 次浏览 • 2019-05-13 11:02 • 来自相关话题

js里面的内容有中文的换行符。
使用正则表达式替换换行符。（也可以替换为任意字符）js=re.sub('\r\n','',js)
完毕。

js里面的内容有中文的换行符。
使用正则表达式替换换行符。（也可以替换为任意字符）

js=re.sub('\r\n','',js)

完毕。

request header显示Provisional headers are shown

python爬虫 • 李魔佛发表了文章 • 0 个评论 • 5378 次浏览 • 2019-05-13 10:07 • 来自相关话题

出现这个情况，一般是因为装了一些插件，比如屏蔽广告的插件 ad block导致的。
把插件卸载了问题就解决了。

异步爬虫aiohttp post提交数据

python爬虫 • 李魔佛发表了文章 • 0 个评论 • 8370 次浏览 • 2019-05-08 16:40 • 来自相关话题

基本的用法：async def fetch(session,url, data):
async with session.post(url=url, data=data, headers=headers) as response:
return await response.json()
完整的例子：import aiohttp
import asyncio

page = 30

post_data = {
'page': 1,
'pageSize': 10,
'keyWord': '',
'dpIds': '',
}

headers = {

"Accept-Encoding": "gzip, deflate",
"Accept-Language": "en-US,en;q=0.9",
"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.108 Safari/537.36",
"X-Requested-With": "XMLHttpRequest",
}

result=

async def fetch(session,url, data):
async with session.post(url=url, data=data, headers=headers) as response:
return await response.json()

async def parse(html):
xzcf_list = html.get('newtxzcfList')
if xzcf_list is None:
return
for i in xzcf_list:
result.append(i)

async def downlod(page):
data=post_data.copy()
data['page']=page
url = 'http://credit.chaozhou.gov.cn/tfieldTypeActionJson!initXzcfListnew.do'
async with aiohttp.ClientSession() as session:
html=await fetch(session,url,data)
await parse(html)

loop = asyncio.get_event_loop()
tasks=[asyncio.ensure_future(downlod(i)) for i in range(1,page)]
tasks=asyncio.gather(*tasks)
# print(tasks)
loop.run_until_complete(tasks)
# loop.close()
# print(result)
count=0
for i in result:
print(i.get('cfXdrMc'))
count+=1
print(f'total {count}') 查看全部

基本的用法：

async def fetch(session,url, data):

    async with session.post(url=url, data=data, headers=headers) as response:

        return await response.json()

完整的例子：

import aiohttp

import asyncio



page = 30



post_data = {

    'page': 1,

    'pageSize': 10,

    'keyWord': '',

    'dpIds': '',

}



headers = {

    

    "Accept-Encoding": "gzip, deflate",

    "Accept-Language": "en-US,en;q=0.9",

    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.108 Safari/537.36",

    "X-Requested-With": "XMLHttpRequest",

}



result=





async def fetch(session,url, data):

    async with session.post(url=url, data=data, headers=headers) as response:

        return await response.json()



async def parse(html):

    xzcf_list = html.get('newtxzcfList')

    if xzcf_list is None:

        return

    for i in xzcf_list:

        result.append(i)



async def downlod(page):

    data=post_data.copy()

    data['page']=page

    url = 'http://credit.chaozhou.gov.cn/tfieldTypeActionJson!initXzcfListnew.do'

    async with aiohttp.ClientSession() as session:

            html=await fetch(session,url,data)

            await parse(html)



loop = asyncio.get_event_loop()

tasks=[asyncio.ensure_future(downlod(i)) for i in range(1,page)]

tasks=asyncio.gather(*tasks)

# print(tasks)

loop.run_until_complete(tasks)

# loop.close()

# print(result)

count=0

for i in result:

    print(i.get('cfXdrMc'))

    count+=1

print(f'total {count}')

python异步aiohttp爬虫 - 异步爬取链家数据

python爬虫 • 李魔佛发表了文章 • 0 个评论 • 3424 次浏览 • 2019-05-08 15:52 • 来自相关话题

import requests
from lxml import etree
import asyncio
import aiohttp
import pandas
import re
import math
import time

loction_info = ''' 1→杭州
2→武汉
3→北京
按ENTER确认：'''
loction_select = input(loction_info)
loction_dic = {'1': 'hz',
'2': 'wh',
'3': 'bj'}
city_url = 'https://{}.lianjia.com/ershoufang/'.format(loction_dic[loction_select])
down = input('请输入价格下限（万）:')
up = input('请输入价格上限（万）:')

inter_list = [(int(down), int(up))]

def half_inter(inter):
lower = inter[0]
upper = inter[1]
delta = int((upper - lower) / 2)
inter_list.remove(inter)
print('已经缩小价格区间', inter)
inter_list.append((lower, lower + delta))
inter_list.append((lower + delta, upper))

pagenum = {}

def get_num(inter):
url = city_url + 'bp{}ep{}/'.format(inter[0], inter[1])
r = requests.get(url).text
print(r)
num = int(etree.HTML(r).xpath("//h2[@class='total fl']/span/text()")[0].strip())
pagenum[(inter[0], inter[1])] = num
return num

totalnum = get_num(inter_list[0])

judge = True
while judge:
a = [get_num(x) > 3000 for x in inter_list]
if True in a:
judge = True
else:
judge = False
for i in inter_list:
if get_num(i) > 3000:
half_inter(i)
print('价格区间缩小完毕！')

url_lst = []
url_lst_failed = []
url_lst_successed = []
url_lst_duplicated = []

for i in inter_list:
totalpage = math.ceil(pagenum[i] / 30)
for j in range(1, totalpage + 1):
url = city_url + 'pg{}bp{}ep{}/'.format(j, i[0], i[1])
url_lst.append(url)
print('url列表获取完毕！')

info_lst = []

async def get_info(url):
async with aiohttp.ClientSession() as session:
async with session.get(url, timeout=5) as resp:
if resp.status != 200:
url_lst_failed.append(url)
else:
url_lst_successed.append(url)
r = await resp.text()
nodelist = etree.HTML(r).xpath("//ul[@class='sellListContent']/li")
# print('-------------------------------------------------------------')
# print('开始抓取第{}个页面的数据,共计{}个页面'.format(url_lst.index(url),len(url_lst)))
# print('开始抓取第{}个页面的数据,共计{}个页面'.format(url_lst.index(url), len(url_lst)))
# print('开始抓取第{}个页面的数据,共计{}个页面'.format(url_lst.index(url), len(url_lst)))
# print('-------------------------------------------------------------')
info_dic = {}
index = 1
print('开始抓取{}'.format(resp.url))
print('开始抓取{}'.format(resp.url))
print('开始抓取{}'.format(resp.url))
for node in nodelist:
try:
info_dic['title'] = node.xpath(".//div[@class='title']/a/text()")[0]
except:
info_dic['title'] = '/'
try:
info_dic['href'] = node.xpath(".//div[@class='title']/a/@href")[0]
except:
info_dic['href'] = '/'
try:
info_dic['xiaoqu'] = \
node.xpath(".//div[@class='houseInfo']")[0].xpath('string(.)').replace(' ', '').split('|')[0]
except:
info_dic['xiaoqu'] = '/'
try:
info_dic['huxing'] = \
node.xpath(".//div[@class='houseInfo']")[0].xpath('string(.)').replace(' ', '').split('|')[1]
except:
info_dic['huxing'] = '/'
try:
info_dic['area'] = \
node.xpath(".//div[@class='houseInfo']")[0].xpath('string(.)').replace(' ', '').split('|')[2]
except:
info_dic['area'] = '/'
try:
info_dic['chaoxiang'] = \
node.xpath(".//div[@class='houseInfo']")[0].xpath('string(.)').replace(' ', '').split('|')[3]
except:
info_dic['chaoxiang'] = '/'
try:
info_dic['zhuangxiu'] = \
node.xpath(".//div[@class='houseInfo']")[0].xpath('string(.)').replace(' ', '').split('|')[4]
except:
info_dic['zhuangxiu'] = '/'
try:
info_dic['dianti'] = \
node.xpath(".//div[@class='houseInfo']")[0].xpath('string(.)').replace(' ', '').split('|')[5]
except:
info_dic['dianti'] = '/'
try:
info_dic['louceng'] = re.findall('$(.*)$', node.xpath(".//div[@class='positionInfo']/text()")[0])
except:
info_dic['louceng'] = '/'
try:
info_dic['nianxian'] = re.findall('\)(.*?)年', node.xpath(".//div[@class='positionInfo']/text()")[0])
except:
info_dic['nianxian'] = '/'
try:
info_dic['guanzhu'] = ''.join(re.findall('[0-9]', node.xpath(".//div[@class='followInfo']/text()")[
0].replace(' ', '').split('/')[0]))
except:
info_dic['guanzhu'] = '/'
try:
info_dic['daikan'] = ''.join(re.findall('[0-9]',
node.xpath(".//div[@class='followInfo']/text()")[0].replace(
' ', '').split('/')[1]))
except:
info_dic['daikan'] = '/'
try:
info_dic['fabu'] = node.xpath(".//div[@class='followInfo']/text()")[0].replace(' ', '').split('/')[
2]
except:
info_dic['fabu'] = '/'
try:
info_dic['totalprice'] = node.xpath(".//div[@class='totalPrice']/span/text()")[0]
except:
info_dic['totalprice'] = '/'
try:
info_dic['unitprice'] = node.xpath(".//div[@class='unitPrice']/span/text()")[0].replace('单价', '')
except:
info_dic['unitprice'] = '/'
if True in [info_dic['href'] in dic.values() for dic in info_lst]:
url_lst_duplicated.append(info_dic)
else:
info_lst.append(info_dic)
print('第{}条: {}→房屋信息抓取完毕！'.format(index, info_dic['title']))
index += 1
info_dic = {}

start = time.time()

# 首次抓取url_lst中的信息，部分url没有对其发起请求，不知道为什么
tasks = [asyncio.ensure_future(get_info(url)) for url in url_lst]
loop = asyncio.get_event_loop()
loop.run_until_complete(asyncio.wait(tasks))

# 将没有发起请求的url放入一个列表，对其进行循环抓取，直到所有url都被发起请求
url_lst_unrequested = []
for url in url_lst:
if url not in url_lst_successed or url_lst_failed:
url_lst_unrequested.append(url)
while len(url_lst_unrequested) > 0:
tasks_unrequested = [asyncio.ensure_future(get_info(url)) for url in url_lst_unrequested]
loop.run_until_complete(asyncio.wait(tasks_unrequested))
url_lst_unrequested = []
for url in url_lst:
if url not in url_lst_successed:
url_lst_unrequested.append(url)
end = time.time()
print('当前价格区间段内共有{}套二手房源$包含{}条重复房源$,实际获得{}条房源信息。'.format(totalnum, len(url_lst_duplicated), len(info_lst)))
print('总共耗时{}秒'.format(end - start))

df = pandas.DataFrame(info_lst)
df.to_csv("ljwh.csv", encoding='gbk') 查看全部

import requests

from lxml import etree

import asyncio

import aiohttp

import pandas

import re

import math

import time



loction_info = '''    1→杭州

    2→武汉

    3→北京

    按ENTER确认：'''

loction_select = input(loction_info)

loction_dic = {'1': 'hz',

               '2': 'wh',

               '3': 'bj'}

city_url = 'https://{}.lianjia.com/ershoufang/'.format(loction_dic[loction_select])

down = input('请输入价格下限（万）:')

up = input('请输入价格上限（万）:')



inter_list = [(int(down), int(up))]





def half_inter(inter):

    lower = inter[0]

    upper = inter[1]

    delta = int((upper - lower) / 2)

    inter_list.remove(inter)

    print('已经缩小价格区间', inter)

    inter_list.append((lower, lower + delta))

    inter_list.append((lower + delta, upper))





pagenum = {}





def get_num(inter):

    url = city_url + 'bp{}ep{}/'.format(inter[0], inter[1])

    r = requests.get(url).text

    print(r)

    num = int(etree.HTML(r).xpath("//h2[@class='total fl']/span/text()")[0].strip())

    pagenum[(inter[0], inter[1])] = num

    return num





totalnum = get_num(inter_list[0])



judge = True

while judge:

    a = [get_num(x) > 3000 for x in inter_list]

    if True in a:

        judge = True

    else:

        judge = False

    for i in inter_list:

        if get_num(i) > 3000:

            half_inter(i)

print('价格区间缩小完毕！')



url_lst = []

url_lst_failed = []

url_lst_successed = []

url_lst_duplicated = []



for i in inter_list:

    totalpage = math.ceil(pagenum[i] / 30)

    for j in range(1, totalpage + 1):

        url = city_url + 'pg{}bp{}ep{}/'.format(j, i[0], i[1])

        url_lst.append(url)

print('url列表获取完毕！')



info_lst = []





async def get_info(url):

    async with aiohttp.ClientSession() as session:

        async with session.get(url, timeout=5) as resp:

            if resp.status != 200:

                url_lst_failed.append(url)

            else:

                url_lst_successed.append(url)

            r = await resp.text()

            nodelist = etree.HTML(r).xpath("//ul[@class='sellListContent']/li")

            # print('-------------------------------------------------------------')

            # print('开始抓取第{}个页面的数据,共计{}个页面'.format(url_lst.index(url),len(url_lst)))

            # print('开始抓取第{}个页面的数据,共计{}个页面'.format(url_lst.index(url), len(url_lst)))

            # print('开始抓取第{}个页面的数据,共计{}个页面'.format(url_lst.index(url), len(url_lst)))

            # print('-------------------------------------------------------------')

            info_dic = {}

            index = 1

            print('开始抓取{}'.format(resp.url))

            print('开始抓取{}'.format(resp.url))

            print('开始抓取{}'.format(resp.url))

            for node in nodelist:

                try:

                    info_dic['title'] = node.xpath(".//div[@class='title']/a/text()")[0]

                except:

                    info_dic['title'] = '/'

                try:

                    info_dic['href'] = node.xpath(".//div[@class='title']/a/@href")[0]

                except:

                    info_dic['href'] = '/'

                try:

                    info_dic['xiaoqu'] = \

                    node.xpath(".//div[@class='houseInfo']")[0].xpath('string(.)').replace(' ', '').split('|')[0]

                except:

                    info_dic['xiaoqu'] = '/'

                try:

                    info_dic['huxing'] = \

                    node.xpath(".//div[@class='houseInfo']")[0].xpath('string(.)').replace(' ', '').split('|')[1]

                except:

                    info_dic['huxing'] = '/'

                try:

                    info_dic['area'] = \

                    node.xpath(".//div[@class='houseInfo']")[0].xpath('string(.)').replace(' ', '').split('|')[2]

                except:

                    info_dic['area'] = '/'

                try:

                    info_dic['chaoxiang'] = \

                    node.xpath(".//div[@class='houseInfo']")[0].xpath('string(.)').replace(' ', '').split('|')[3]

                except:

                    info_dic['chaoxiang'] = '/'

                try:

                    info_dic['zhuangxiu'] = \

                    node.xpath(".//div[@class='houseInfo']")[0].xpath('string(.)').replace(' ', '').split('|')[4]

                except:

                    info_dic['zhuangxiu'] = '/'

                try:

                    info_dic['dianti'] = \

                    node.xpath(".//div[@class='houseInfo']")[0].xpath('string(.)').replace(' ', '').split('|')[5]

                except:

                    info_dic['dianti'] = '/'

                try:

                    info_dic['louceng'] = re.findall('\((.*)\)', node.xpath(".//div[@class='positionInfo']/text()")[0])

                except:

                    info_dic['louceng'] = '/'

                try:

                    info_dic['nianxian'] = re.findall('\)(.*?)年', node.xpath(".//div[@class='positionInfo']/text()")[0])

                except:

                    info_dic['nianxian'] = '/'

                try:

                    info_dic['guanzhu'] = ''.join(re.findall('[0-9]', node.xpath(".//div[@class='followInfo']/text()")[

                        0].replace(' ', '').split('/')[0]))

                except:

                    info_dic['guanzhu'] = '/'

                try:

                    info_dic['daikan'] = ''.join(re.findall('[0-9]',

                                                            node.xpath(".//div[@class='followInfo']/text()")[0].replace(

                                                                ' ', '').split('/')[1]))

                except:

                    info_dic['daikan'] = '/'

                try:

                    info_dic['fabu'] = node.xpath(".//div[@class='followInfo']/text()")[0].replace(' ', '').split('/')[

                        2]

                except:

                    info_dic['fabu'] = '/'

                try:

                    info_dic['totalprice'] = node.xpath(".//div[@class='totalPrice']/span/text()")[0]

                except:

                    info_dic['totalprice'] = '/'

                try:

                    info_dic['unitprice'] = node.xpath(".//div[@class='unitPrice']/span/text()")[0].replace('单价', '')

                except:

                    info_dic['unitprice'] = '/'

                if True in [info_dic['href'] in dic.values() for dic in info_lst]:

                    url_lst_duplicated.append(info_dic)

                else:

                    info_lst.append(info_dic)

                print('第{}条:    {}→房屋信息抓取完毕！'.format(index, info_dic['title']))

                index += 1

                info_dic = {}





start = time.time()



# 首次抓取url_lst中的信息，部分url没有对其发起请求，不知道为什么

tasks = [asyncio.ensure_future(get_info(url)) for url in url_lst]

loop = asyncio.get_event_loop()

loop.run_until_complete(asyncio.wait(tasks))



# 将没有发起请求的url放入一个列表，对其进行循环抓取，直到所有url都被发起请求

url_lst_unrequested = []

for url in url_lst:

    if url not in url_lst_successed or url_lst_failed:

        url_lst_unrequested.append(url)

while len(url_lst_unrequested) > 0:

    tasks_unrequested = [asyncio.ensure_future(get_info(url)) for url in url_lst_unrequested]

    loop.run_until_complete(asyncio.wait(tasks_unrequested))

    url_lst_unrequested = []

    for url in url_lst:

        if url not in url_lst_successed:

            url_lst_unrequested.append(url)

end = time.time()

print('当前价格区间段内共有{}套二手房源\(包含{}条重复房源\),实际获得{}条房源信息。'.format(totalnum, len(url_lst_duplicated), len(info_lst)))

print('总共耗时{}秒'.format(end - start))



df = pandas.DataFrame(info_lst)

df.to_csv("ljwh.csv", encoding='gbk')

通知设置新通知

薅“疫情公益”羊毛，黑产恶意爬取各大出版社电子书上万册

爬虫发送弹幕问题

jieba.posseg TypeError: cannot unpack non-iterable pair object 词性分析报错

scrapy在settings中定义变量不能包含小写！

etree.strip_tags的用法

mumu模拟器adb无法识别

aiohttp异步下载图片

基于文本及符号密度的网页正文提取方法 python实现

性能对比 pypy vs python

scrapy源码分析<一>：入口函数以及是如何运行

anaconda环境下无法启动jupyter notebook

random.randint的用法

frontera运行link_follower.py 报错：doesn't define any object named 'FIFO'

scrapy-rabbitmq 不支持python3 [修改源码使它支持]

scrapy rabbitmq 分布式爬虫

exchange_declare() got an unexpected keyword argument 'type'

twisted　reactor运行后，添加了addBoth函数，但是还是无法停止

cv2 distanceTransform函数的用法 python

Win10下PhantomJS无法运行【版本兼容问题】

喜马拉雅app 爬取音频文件

python3与python2迭代器的写法的区别

PyCharm 快捷键快速插入当前时间

conda无法在win10下用命令行切换虚拟环境

jupyter notebook格式的文件损坏如何修复

requests直接post图片文件

python的mixin类

正则表达式替换中文换行符【python】

request header显示Provisional headers are shown

异步爬虫aiohttp post提交数据

python异步aiohttp爬虫 - 异步爬取链家数据

热门话题

热门用户

通知设置 新通知

python

热门话题

热门用户

通知设置新通知