Variable names defined in Scrapy settings must not contain lowercase letters!

python爬虫 · 李魔佛 posted an article • 0 comments • 2649 views • 2019-11-16 16:39

If a setting name contains lowercase letters, Scrapy filters it out and it cannot be read anywhere else in your Scrapy code.
For example, suppose you define a value Redis_host = '192.168.1.1' in settings.py.
 
Then in your spider, if you call self.settings.get('Redis_host'),
the return value is None.
 
If you define it as REDIS_HOST instead, its value is returned correctly.
 
If you really want to use lowercase, there is another way to read it.
Import the settings module directly:
from xxxx import settings  # xxxx is your project name
 
host = settings.Redis_host  # importing the module directly and reading the attribute works
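Putting the two cases side by side, a minimal sketch (assuming a project called myproject; all names here are only for illustration):

# settings.py
REDIS_HOST = '192.168.1.1'   # uppercase: visible through the Settings API
Redis_host = '192.168.1.1'   # contains lowercase: ignored by Scrapy's Settings

# inside a spider
import scrapy

class DemoSpider(scrapy.Spider):
    name = 'demo'

    def parse(self, response):
        print(self.settings.get('REDIS_HOST'))  # '192.168.1.1'
        print(self.settings.get('Redis_host'))  # None

        # workaround: import the settings module itself
        from myproject import settings
        print(settings.Redis_host)              # '192.168.1.1'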

Usage of etree.strip_tags

python爬虫 · 李魔佛 posted an article • 0 comments • 3835 views • 2019-10-24 11:24

Taken straight from the official documentation; the function turns out to be quite handy.
It removes the given tags from the source HtmlElement and merges the text that was inside those tags back into the parent.
 
An example:
from lxml.html import etree
from lxml.html import fromstring, HtmlElement

test_html = '''<p><span>hello</span><span>world</span></p>'''
test_element = fromstring(test_html)
etree.strip_tags(test_element, 'span')  # remove the span tags
etree.tostring(test_element)

Because the operation is applied to test_element in place, test_element itself has been modified.
 
So test_element is now 
b'<p>helloworld</p>'
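As a small illustrative addition: strip_tags accepts several tag names at once, so more than one kind of tag can be flattened in a single call:

test_element = fromstring('<p><b>hi</b> <span>there</span></p>')
etree.strip_tags(test_element, 'span', 'b')   # strip both span and b
etree.tostring(test_element)                  # b'<p>hi there</p>'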

Original article; please indicate the source when reposting
http://30daydo.com/article/553
 

adb cannot detect the mumu emulator

python爬虫 · 李魔佛 posted an article • 0 comments • 4877 views • 2019-10-17 08:41

The reason is that mumu changed the adb port number.
 
            <Forwarding name="ADB_PORT" proto="1" hostip="127.0.0.1" hostport="7555" guestport="5555"/>
 
This entry can be found in mumu's configuration.
 
adb connect 127.0.0.1:7555
then adb shell works.
 
The configuration file is: myandrovm_vbox86.nemu

Downloading images asynchronously with aiohttp

python爬虫 · 李魔佛 posted an article • 0 comments • 4430 views • 2019-09-16 17:14

When saving the images you cannot use the built-in open() to write the file; you need the async IO library aiofiles instead:
import asyncio
import aiohttp
import aiofiles

url = 'http://xyhz.huizhou.gov.cn/static/js/common/jigsaw/images/{}.jpg'
headers = {'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'}

async def getPage(num):
    async with aiohttp.ClientSession() as session:
        async with session.get(url.format(num), headers=headers) as resp:
            if resp.status == 200:
                f = await aiofiles.open('{}.jpg'.format(num), mode='wb')
                await f.write(await resp.read())
                await f.close()

loop = asyncio.get_event_loop()
tasks = [getPage(i) for i in range(5)]
loop.run_until_complete(asyncio.wait(tasks))
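A small side note (illustrative only): aiofiles also supports the async context-manager form, which closes the file for you:

async with aiofiles.open('{}.jpg'.format(num), mode='wb') as f:
    await f.write(await resp.read())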

Original article,
please indicate the source when reposting:
http://30daydo.com/article/537
 

Web page main-content extraction based on text and symbol density - a Python implementation

李魔佛 posted an article • 0 comments • 4609 views • 2019-09-10 15:19

Web page main-content extraction based on text and symbol density - a Python implementation.
Project repository: https://github.com/Rockyzsu/CodePool/tree/master/GeneralNewsExtractor
A detailed write-up will follow here once it is finished,
stay tuned.

Performance comparison: pypy vs python

李魔佛 posted an article • 0 comments • 4664 views • 2019-09-06 17:04

Performance comparison: pypy vs python
You never know until you try - and the result is startling.
For CPU-bound programs, pypy3 runs about a hundred times faster than CPython.
Talk is cheap, show me the code!
 
The code is trivial - it just runs an addition,
executed 200 million times (2*10**8):
 
import time

LOOP = 2*10**8

def add(x, y):
    return x + y

def cpu_pressure(loop):
    for i in range(loop):
        result = add(i, i+1)


if __name__ == '__main__':
    start = time.time()
    cpu_pressure(LOOP)
    print(f'time used {time.time()-start}s')

Run with CPython:
python main.py
Elapsed: time used 21.422261476516724s
 
Run with pypy:
pypy main.py
Elapsed: time used 0.1925642490386963s
 
The gap really is huge.

Scrapy source code analysis (1): the entry point and how it runs

python爬虫 · 李魔佛 posted an article • 0 comments • 5472 views • 2019-08-31 10:47

When you run the command scrapy crawl example, the spider you wrote gets executed.
Below we trace through the source code to see how Scrapy actually runs:
 

When the scrapy crawl command is executed, the Command class is what gets called:
class Command(ScrapyCommand):

    requires_project = True

    def syntax(self):
        return '[options]'

    def short_desc(self):
        return 'Runs all of the spiders - My Defined'

    def run(self, args, opts):
        print('==================')
        print(type(self.crawler_process))
        spider_list = self.crawler_process.spiders.list()  # find the spider classes

        for name in spider_list:
            print('=================')
            print(name)
            self.crawler_process.crawl(name, **opts.__dict__)

        self.crawler_process.start()

Next, look at crawler_process. It comes from ScrapyCommand (it is an attribute set on the command); the object it holds is a CrawlerProcess, and CrawlerProcess is in turn a subclass of CrawlerRunner.

The important part of the CrawlerRunner constructor is this:
    def __init__(self, settings=None):
        if isinstance(settings, dict) or settings is None:
            settings = Settings(settings)
        self.settings = settings
        self.spider_loader = _get_spider_loader(settings)  # build the spider loader
        self._crawlers = set()
        self._active = set()
        self.bootstrap_failed = False

1. Loading the settings file
def _get_spider_loader(settings):

    cls_path = settings.get('SPIDER_LOADER_CLASS')

    # SPIDER_LOADER_CLASS is not defined in the project settings file, so what we get
    # here comes from the system default settings, shown below as code block A:
    # SPIDER_LOADER_CLASS = 'scrapy.spiderloader.SpiderLoader'

    loader_cls = load_object(cls_path)
    # This function turns a dotted path into a class object, i.e. the string
    # 'scrapy.spiderloader.SpiderLoader' above becomes a class object.
    # See code block B below for the implementation of load_object.

    return loader_cls.from_settings(settings.frozencopy())

The default settings file default_settings.py:
# code block A
# ...... some entries omitted
SCHEDULER = 'scrapy.core.scheduler.Scheduler'
SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleLifoDiskQueue'
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.LifoMemoryQueue'
SCHEDULER_PRIORITY_QUEUE = 'scrapy.pqueues.ScrapyPriorityQueue'

SPIDER_LOADER_CLASS = 'scrapy.spiderloader.SpiderLoader'  # this is the value we get
SPIDER_LOADER_WARN_ONLY = False

SPIDER_MIDDLEWARES = {}


The implementation of load_object:
# code block B - exception handling removed for brevity
from importlib import import_module  # standard-library import machinery

def load_object(path):
    dot = path.rindex('.')
    module, name = path[:dot], path[dot+1:]
    # split the dotted path into module path + attribute name

    mod = import_module(module)
    obj = getattr(mod, name)
    # fetch that attribute from the module

    return obj


A quick test in IPython:
In [33]: mod = import_module(module)                                                                                                                                             

In [34]: mod
Out[34]: <module 'scrapy.spiderloader' from '/home/xda/anaconda3/lib/python3.7/site-packages/scrapy/spiderloader.py'>

In [35]: getattr(mod,name)
Out[35]: scrapy.spiderloader.SpiderLoader

In [36]: obj = getattr(mod,name)

In [37]: obj
Out[37]: scrapy.spiderloader.SpiderLoader

In [38]: type(obj)
Out[38]: type

From code block A, loader_cls is SpiderLoader, so what finally gets returned is SpiderLoader.from_settings(settings.frozencopy()).
Next, look at SpiderLoader.from_settings:
    def from_settings(cls, settings):
        return cls(settings)

It simply instantiates the class itself, so we can go straight to __init__:
class SpiderLoader(object):
    """
    SpiderLoader is a class which locates and loads spiders
    in a Scrapy project.
    """
    def __init__(self, settings):
        self.spider_modules = settings.getlist('SPIDER_MODULES')
        # the module names from settings; they are generated for you when the
        # project is created - open your settings file and you will find this
        # value, which is a list

        self.warn_only = settings.getbool('SPIDER_LOADER_WARN_ONLY')
        self._spiders = {}
        self._found = defaultdict(list)
        self._load_all_spiders()  # load all the spiders
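For reference, the value generated in a fresh project typically looks like this (using a hypothetical project called myproject):

SPIDER_MODULES = ['myproject.spiders']
NEWSPIDER_MODULE = 'myproject.spiders'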


The core is this _load_all_spiders.
Here we go:
    def _load_all_spiders(self):
        for name in self.spider_modules:

            for module in walk_modules(name):  # walk every module in the package,
                                               # turn it into class objects and store
                                               # them in the dict self._spiders
                self._load_spiders(module)     # module -> spiders

        self._check_name_duplicates()  # dedupe: identical names raise an error


Next comes _load_spiders.
The essential part is the following:
def iter_spider_classes(module):
    from scrapy.spiders import Spider

    for obj in six.itervalues(vars(module)):  # iterate over the module's attributes
        if inspect.isclass(obj) and \
                issubclass(obj, Spider) and \
                obj.__module__ == module.__name__ and \
                getattr(obj, 'name', None):  # has a name attribute and subclasses Spider
            yield obj

This obj is exactly the spider class we normally write.
So after all that digging, we finally arrive at the spider classes we write ourselves.
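To make that concrete, a minimal sketch (names made up) of a class that iter_spider_classes would yield - it subclasses Spider and defines a name:

import scrapy

class ExampleSpider(scrapy.Spider):
    name = 'example'        # required: without a name attribute it is skipped
    start_urls = ['http://www.example.com']

    def parse(self, response):
        pass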

To be continued....
 
Original article
Please indicate the source when reposting
http://30daydo.com/article/530
 

jupyter notebook fails to start in an anaconda environment

李魔佛 posted an article • 0 comments • 6998 views • 2019-08-19 17:16

Running jupyter notebook
reports:
    from . import (constants, error, message, context,
ImportError: DLL load failed: The specified module could not be found.

It can, however, be launched directly from Anaconda Navigator, so this is an environment problem.
Switch into the anaconda virtual environment (open the Anaconda Prompt from the Start menu) and run jupyter notebook from that command line - it then starts normally.
 
 

Usage of random.randint

李魔佛 posted an article • 0 comments • 12801 views • 2019-08-01 16:31

Usage of random.randint:
from random import randint

randint(0,1)
Out[25]: 1

randint(0,1)
Out[26]: 1

randint(0,1)
Out[27]: 1

randint(0,1)
Out[28]: 1

randint(0,1)
Out[29]: 0

randint(0,1)
Out[30]: 1

random.randint(a,b)
 
The returned integer ranges over a and b inclusive, plus every integer in between.
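For a quick contrast (illustrative): random.randrange(a, b) excludes the upper bound, so randrange(0, 2) produces the same values as randint(0, 1):

from random import randint, randrange

randint(0, 1)      # 0 or 1 - both ends inclusive
randrange(0, 2)    # 0 or 1 - upper bound excluded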
 

frontera: running link_follower.py reports: doesn't define any object named 'FIFO'

python爬虫 · 李魔佛 posted an article • 0 comments • 3273 views • 2019-07-18 11:29

The code is as follows:
from __future__ import print_function

import re

import requests

from frontera.contrib.requests.manager import RequestsFrontierManager
# from frontera.contrib.requests.manager import RequestsFrontierManager
from frontera import Settings

from six.moves.urllib.parse import urljoin


SETTINGS = Settings()
SETTINGS.BACKEND = 'frontera.contrib.backends.memory.FIFO'
# SETTINGS.BACKEND = 'frontera.contrib.backends.memory.MemoryDistributedBackend'

SETTINGS.LOGGING_MANAGER_ENABLED = True
SETTINGS.LOGGING_BACKEND_ENABLED = True
SETTINGS.MAX_REQUESTS = 100
SETTINGS.MAX_NEXT_REQUESTS = 10

SEEDS = [
'http://www.imdb.com',
]

LINK_RE = re.compile(r'<a.+?href="(.*?)".?>', re.I)


def extract_page_links(response):
return [urljoin(response.url, link) for link in LINK_RE.findall(response.text)]

if __name__ == '__main__':

frontier = RequestsFrontierManager(SETTINGS)
frontier.add_seeds([requests.Request(url=url) for url in SEEDS])
while True:
next_requests = frontier.get_next_requests()
if not next_requests:
break
for request in next_requests:
try:
response = requests.get(request.url)
links = [
requests.Request(url=url)
for url in extract_page_links(response)
]
frontier.page_crawled(response)
print('Crawled', response.url, '(found', len(links), 'urls)')

if links:
frontier.links_extracted(request, links)
except requests.RequestException as e:
error_code = type(e).__name__
frontier.request_error(request, error_code)
print('Failed to process request', request.url, 'Error:', e)

Whether run under py2 or py3, it fails with the following error:
raise NameError("Module '%s' doesn't define any object named '%s'" % (module, name))
NameError: Module 'frontera.contrib.backends.memory' doesn't define any object named 'FIFO'

scrapy-rabbitmq does not support python3 [patching the source to make it work]

python爬虫 · 李魔佛 posted an article • 0 comments • 2993 views • 2019-07-17 17:24

The official version has not been updated since 2015.
Running it under python3 raises errors.
 
The following places need to be changed:
 
To be continued..

A distributed scrapy crawler with rabbitmq

python爬虫 · 李魔佛 posted an article • 0 comments • 5709 views • 2019-07-17 16:59

If you have never touched rabbitmq before, this article is a good introduction: https://blog.csdn.net/hellozpc/article/details/81436980
rabbitmq is a solid message-queue service and can be paired with scrapy as its message queue.
 
Below is a simple demo:
import re
import requests
import scrapy
from scrapy import Request
from rabbit_spider import settings
from scrapy.log import logger
import json
from rabbit_spider.items import RabbitSpiderItem
import datetime
from scrapy.selector import Selector
import pika

# from scrapy_rabbitmq.spiders import RabbitMQMixin
# from scrapy.contrib.spiders import CrawlSpider

class Website(scrapy.Spider):
name = "rabbit"

def start_requests(self):
headers = {'Accept': '*/*',
'Accept-Encoding': 'gzip, deflate, br',
'Accept-Language': 'en-US,en;q=0.9,zh-CN;q=0.8,zh;q=0.7',
'Host': '36kr.com',
'Referer': 'https://36kr.com/information/web_news',
'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.108 Safari/537.36'
}

url = 'https://36kr.com/information/web_news'


yield Request(url=url,
headers=headers)

def parse(self, response):


credentials = pika.PlainCredentials('admin', 'admin')
connection = pika.BlockingConnection(pika.ConnectionParameters('192.168.1.101', 5672, '/', credentials))

channel = connection.channel()
channel.exchange_declare(exchange='direct_log', exchange_type='direct')

result = channel.queue_declare(exclusive=True, queue='')

queue_name = result.method.queue

# print(queue_name)
# infos = sys.argv[1:] if len(sys.argv)>1 else ['info']
info = 'info'

# 绑定多个值

channel.queue_bind(
exchange='direct_log',
routing_key=info,
queue=queue_name
)
print('start to receive [{}]'.format(info))

channel.basic_consume(
on_message_callback=self.callback_func,
queue=queue_name,
auto_ack=True,
)

channel.start_consuming()


def callback_func(self, ch, method, properties, body):
print(body)

Start the spider:
from scrapy import cmdline
cmdline.execute('scrapy crawl rabbit'.split())

Then push data into rabbitmq:
import pika
import settings

credentials = pika.PlainCredentials('admin','admin')
connection = pika.BlockingConnection(pika.ConnectionParameters('192.168.1.101',5672,'/',credentials))

channel = connection.channel()
channel.exchange_declare(exchange='direct_log',exchange_type='direct') # fanout 就是组播

routing_key = 'info'
message='https://36kr.com/pp/api/aggregation-entity?type=web_latest_article&b_id=59499&per_page=30'
channel.basic_publish(
exchange='direct_log',
routing_key=routing_key,
body=message
)

print('sending message {}'.format(message))
connection.close()

 
Once data is pushed, scrapy immediately receives it from the queue.
Note that you must not wait on the queue inside start_requests, because start_requests has to return a generator; otherwise the program raises an error.
 
To be continued...
###### Update 2019-08-29 ###################
One gotcha: after rabbitMQ receives the data, you cannot use yield (a generator) inside the callback function.
 

exchange_declare() got an unexpected keyword argument 'type'

李魔佛 posted an article • 0 comments • 2762 views • 2019-07-16 14:40

In the new version of pika, the parameter is now
exchange_type instead of type:
 
	credentials = pika.PlainCredentials('admin','admin')
connection = pika.BlockingConnection(pika.ConnectionParameters('192.168.1.101',5672,'/',credentials))

channel = connection.channel()

channel.exchange_declare(exchange='logs',exchange_type='fanout')

twisted: the reactor still won't stop even after adding an addBoth callback

李魔佛 posted an article • 0 comments • 3791 views • 2019-07-11 09:43

The code is as follows:
 
from twisted.internet import reactor, defer
from twisted.web.client import getPage
from scrapy.selector import Selector

def get_response_callback(content):
    txt = str(content, encoding='utf-8')
    resp = Selector(text=txt)
    title = resp.xpath('//title/text()').extract_first()
    print(title)

@defer.inlineCallbacks
def task():
    url = 'http://www.baidu.com'
    d = getPage(url.encode('utf-8'))
    d.addCallback(get_response_callback)
    yield d

def done():
    reactor.stop()

def done1(*args, **kwargs):
    reactor.stop()

task_list = []
for i in range(4):
    d = task()
    task_list.append(d)

dd = defer.DeferredList(task_list)

dd.addBoth(done)

reactor.run()

The code above never stops when you use
dd.addBoth(done)
 
because done is defined without parameters: addBoth passes the DeferredList's result into the callback, so calling done() raises a TypeError and reactor.stop() is never reached.
 
Using the other version, done1(*args, **kwargs), which accepts the arguments,
the program exits normally, since the reactor.stop() inside it actually runs.
 
Original article
Please indicate the source when reposting:
http://30daydo.com/article/509
 

Usage of cv2.distanceTransform in Python

李魔佛 posted an article • 0 comments • 11337 views • 2019-07-08 15:35

distanceTransform
Calculates the distance to the closest zero pixel for each pixel of the source image.


Python: cv2.distanceTransform(src, distanceType, maskSize[, dst]) → dst

Python: cv.DistTransform(src, dst, distance_type=CV_DIST_L2, mask_size=3, mask=None, labels=None) → None

Parameters:
src – 8-bit, single-channel (binary) source image.
dst – Output image with calculated distances. It is a 32-bit floating-point, single-channel image of the same size as src .

distanceType – Type of distance. It can be CV_DIST_L1, CV_DIST_L2 , or CV_DIST_C .
maskSize – Size of the distance transform mask. It can be 3, 5, or CV_DIST_MASK_PRECISE (the latter option is only supported by the first function). In case of the CV_DIST_L1 or CV_DIST_C distance type, the parameter is forced to 3 because a 3\times 3 mask gives the same result as 5\times 5 or any larger aperture.

labels – Optional output 2D array of labels (the discrete Voronoi diagram). It has the type CV_32SC1 and the same size as src . See the details below.

labelType – Type of the label array to build. If labelType==DIST_LABEL_CCOMP then each connected component of zeros in src (as well as all the non-zero pixels closest to the connected component) will be assigned the same label. If labelType==DIST_LABEL_PIXEL then each zero pixel (and all the non-zero pixels closest to it) gets its own label.
The functions distanceTransform calculate the approximate or precise distance from every binary image pixel to the nearest zero pixel. For zero image pixels, the distance will obviously be zero.


When maskSize == CV_DIST_MASK_PRECISE and distanceType == CV_DIST_L2 , the function runs the algorithm described in [Felzenszwalb04]. This algorithm is parallelized with the TBB library.

In other cases, the algorithm [Borgefors86] is used. This means that for a pixel the function finds the shortest path to the nearest zero pixel consisting of basic shifts: horizontal, vertical, diagonal, or knight’s move (the latest is available for a 5\times 5 mask). The overall distance is calculated as a sum of these basic distances. Since the distance function should be symmetric, all of the horizontal and vertical shifts must have the same cost (denoted as a ), all the diagonal shifts must have the same cost (denoted as b ), and all knight’s moves must have the same cost (denoted as c ). For the CV_DIST_C and CV_DIST_L1 types, the distance is calculated precisely, whereas for CV_DIST_L2 (Euclidean distance) the distance can be calculated only with a relative error (a 5\times 5 mask gives more accurate results). For a,``b`` , and c , OpenCV uses the values suggested in the original paper:

CV_DIST_C (3\times 3) a = 1, b = 1
CV_DIST_L1 (3\times 3) a = 1, b = 2
CV_DIST_L2 (3\times 3) a=0.955, b=1.3693
CV_DIST_L2 (5\times 5) a=1, b=1.4, c=2.1969
Typically, for a fast, coarse distance estimation CV_DIST_L2, a 3\times 3 mask is used. For a more accurate distance estimation CV_DIST_L2 , a 5\times 5 mask or the precise algorithm is used. Note that both the precise and the approximate algorithms are linear on the number of pixels.

The second variant of the function does not only compute the minimum distance for each pixel (x, y) but also identifies the nearest connected component consisting of zero pixels (labelType==DIST_LABEL_CCOMP) or the nearest zero pixel (labelType==DIST_LABEL_PIXEL). Index of the component/pixel is stored in \texttt{labels}(x, y) . When labelType==DIST_LABEL_CCOMP, the function automatically finds connected components of zero pixels in the input image and marks them with distinct labels. When labelType==DIST_LABEL_CCOMP, the function scans through the input image and marks all the zero pixels with distinct labels.

In this mode, the complexity is still linear. That is, the function provides a very fast way to compute the Voronoi diagram for a binary image. Currently, the second variant can use only the approximate distance transform algorithm, i.e. maskSize=CV_DIST_MASK_PRECISE is not supported yet.

Note
An example on using the distance transform can be found at opencv_source_code/samples/cpp/distrans.cpp
(Python) An example on using the distance transform can be found at opencv_source/samples/python2/distrans.py
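Since the above only quotes the documentation, here is a small usage sketch (the file name and parameters are made up; constant names can differ between OpenCV versions, e.g. older builds expose cv2.cv.CV_DIST_L2 instead of cv2.DIST_L2):

import cv2
import numpy as np

img = cv2.imread('coins.png', cv2.IMREAD_GRAYSCALE)   # hypothetical input image
_, binary = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

# distance from every non-zero pixel to the nearest zero pixel
dist = cv2.distanceTransform(binary, cv2.DIST_L2, 5)

# normalize to 0..1 so it can be saved/displayed
dist_norm = cv2.normalize(dist, None, 0, 1.0, cv2.NORM_MINMAX)
cv2.imwrite('dist.png', (dist_norm * 255).astype(np.uint8))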
 

 

PhantomJS fails to run on Win10 [version compatibility issue]

李魔佛 posted an article • 0 comments • 5219 views • 2019-07-04 09:07

It used to run fine on win7.
On win10 it fails with:
selenium.common.exceptions.WebDriverException: Message: Service C:\Tool\phantomjs-2.5.0-beta2-windows\phantomjs-2.5.0-beta2-windows\bin\phantomjs.exe unexpectedly exited. Status code was: 4294967295
 
Switching to an older version solved the problem.
Old version: phantomjs-2.1.1-windows
 
Original article
Please indicate the source when reposting 
http://30daydo.com/article/505
 

Scraping audio files from the Ximalaya app

python爬虫 · 李魔佛 posted an article • 0 comments • 5756 views • 2019-06-30 12:24

============== Update 2019-10-28 =================
Ximalaya changed its page format, so the crawler code has been updated as well:
# -*- coding: utf-8 -*-
# website: http://30daydo.com
# @Time : 2019/6/30 12:03
# @File : main.py

import requests
import re
import os

url = 'http://180.153.255.6/mobile/v1/album/track/ts-1571294887744?albumId=23057324&device=android&isAsc=true&isQueryInvitationBrand=true&pageId={}&pageSize=20&pre_page=0'
headers = {'User-Agent': 'Xiaomi'}

def download():
for i in range(1, 3):
r = requests.get(url=url.format(i), headers=headers)
js_data = r.json()
data_list = js_data.get('data', {}).get('list', [])
for item in data_list:
trackName = item.get('title')
trackName = re.sub('[\/\\\:\*\?\"\<\>\|]', '_', trackName)
# trackName=re.sub(':','',trackName)
src_url = item.get('playUrl64')
filename = '{}.mp3'.format(trackName)
if not os.path.exists(filename):

try:
r0 = requests.get(src_url, headers=headers)
except Exception as e:
print(e)
print(trackName)
r0 = requests.get(src_url, headers=headers)


else:
with open(filename, 'wb') as f:
f.write(r0.content)

print('{} downloaded'.format(trackName))

else:
print(f'{filename}已经下载过了')

import shutil

def rename_():
for i in range(1, 3):
r = requests.get(url=url.format(i), headers=headers)
js_data = r.json()
data_list = js_data.get('data', {}).get('list', [])
for item in data_list:
trackName = item.get('title')
trackName = re.sub('[\/\\\:\*\?\"\<\>\|]', '_', trackName)
src_url = item.get('playUrl64')

orderNo=item.get('orderNo')

filename = '{}.mp3'.format(trackName)
try:

if os.path.exists(filename):
new_file='{}_{}.mp3'.format(orderNo,trackName)
shutil.move(filename,new_file)
except Exception as e:
print(e)





if __name__=='__main__':
rename_()

 
The audio files have been updated too - see the Baidu netdisk link at the end.
 
 
======== 2018-10 =============
Scraping the audio files of 杨继东的投资之道 (Yang Jidong's way of investing) from the Ximalaya app.
Runtime environment: python3
# -*- coding: utf-8 -*-
# website: http://30daydo.com
# @Time : 2019/6/30 12:03
# @File : main.py

import requests
import re
url = 'https://www.ximalaya.com/revision/play/album?albumId=23057324&pageNum=1&sort=1&pageSize=60'
headers={'User-Agent':'Xiaomi'}

r = requests.get(url=url,headers=headers)
js_data = r.json()
data_list = js_data.get('data',{}).get('tracksAudioPlay',)
for item in data_list:
trackName=item.get('trackName')
trackName=re.sub(':','',trackName)
src_url = item.get('src')
try:
r0=requests.get(src_url,headers=headers)
except Exception as e:
print(e)
print(trackName)
else:
with open('{}.m4a'.format(trackName),'wb') as f:
f.write(r0.content)
print('{} downloaded'.format(trackName))

Save it as main.py
then run python main.py
and after a few minutes everything is downloaded automatically.




Attached are the downloaded audio files:
Link: https://pan.baidu.com/s/1t_vJhTvSJSeFdI1IaDS6fA 
Extraction code: e3zb
 

Original article
Please indicate the source when reposting
http://30daydo.com/article/503

Differences between writing iterators in python3 and python2

李魔佛 posted an article • 0 comments • 2884 views • 2019-06-26 11:22

They are mostly the same; the difference is that in python2 the class has to implement a next() method, while in python3 it has to implement __next__().
 
An example:
def iter_demo():

    class DefineIter(object):

        def __init__(self, length):
            self.length = length
            self.data = range(self.length)
            self.index = 0

        def __iter__(self):
            return self

        def __next__(self):

            if self.index >= self.length:
                # return None
                raise StopIteration

            d = self.data[self.index] * 50
            self.index = self.index + 1

            return d

    a = DefineIter(10)
    print(type(a))
    for i in a:
        print(i)
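For comparison, a sketch of the python2 spelling of the same class (only the method name changes; everything else works as above):

class DefineIterPy2(object):
    def __init__(self, length):
        self.length = length
        self.data = range(length)
        self.index = 0

    def __iter__(self):
        return self

    def next(self):              # python2 uses next() instead of __next__()
        if self.index >= self.length:
            raise StopIteration
        d = self.data[self.index] * 50
        self.index += 1
        return d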

PyCharm: insert the current time with a shortcut

李魔佛 posted an article • 0 comments • 3716 views • 2019-06-26 09:18

I find this a very common need, but it has to be set up manually.
 
Approach
Use a Live Template to insert the time quickly.

Steps
1. Add a Template Group named Common
2. Add a Live Template with the following settings:
Abbreviation: time
Description : current time
Template Text: $time$

Edit Variables -> Expression : date("yyyy-MM-dd HH:mm:ss")



3. Make the setting take effect
Define -> Everywhere

4. Usage
Type time and press Tab, and it expands into the current time.

 

conda cannot switch virtual environments from the command line on win10

李魔佛 posted an article • 0 comments • 5078 views • 2019-06-11 10:04

The virtual environment was already installed.
Then running activate py2 in PowerShell did nothing at all. (PowerShell is the enhanced command line shipped with systems after win7.)
Switching to the original cmd shell (type cmd in the Run dialog) and running activate py2 again solved the problem.
It is a compatibility issue.

How to repair a corrupted jupyter notebook file

李魔佛 posted an article • 0 comments • 4317 views • 2019-06-08 13:44

Sometimes, after a git sync produces a merge conflict, the jupyter notebook file gets markers such as >>>>>HEAD and ORIGIN inserted into it, and the .ipynb file can no longer be opened. The repair goes like this:
 
Use the following code:
# rescue a corrupted jupyter file
import re
import codecs

pattern = re.compile('"source": \[(.*?)\]\s+\},', re.S)
filename = 'tushare_usage.ipynb'
with codecs.open(filename, encoding='utf8') as f:
    content = f.read()

source = pattern.findall(content)
for s in source:
    t = s.replace('\\n', '')
    t = re.sub('"', '', t)
    t = re.sub('(,$)', '', t)
    print(t)
Just swap in the name of the file you want to repair.

POSTing an image file directly with requests

python爬虫 · 李魔佛 posted an article • 0 comments • 3376 views • 2019-05-17 16:32

The code is as follows:
    file_path=r'9927_15562445086485238.png'
file=open(file_path, 'rb').read()
r=requests.post(url=code_url,data=file)
print(r.text)
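As a side note (illustrative; code_url and file_path are the same variables as above): data=file sends the raw bytes as the request body, whereas if the server expects a multipart/form-data upload, requests provides the files parameter instead:

with open(file_path, 'rb') as f:
    r = requests.post(url=code_url, files={'file': f})
print(r.text)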

Python mixin classes

李魔佛 posted an article • 0 comments • 2715 views • 2019-05-16 16:30

A mixin is a limited form of multiple inheritance.
 
A mixin is something like a constrained form of multiple inheritance:
On Python's Mixin pattern

Languages like C and C++ support multiple inheritance: a subclass can have several parent classes. This design is often criticised, because inheritance should express an "is-a" relationship. A sedan class inherits from a vehicle class because a sedan is a ("is-a") vehicle. One thing cannot be many different things at once, so multiple inheritance should not exist. But is there ever a case where a class really does need to inherit from several classes?

There is. Staying with vehicles: an airliner is a vehicle, and for the very rich a helicopter is a vehicle too. Both of them can fly, but a sedan cannot, so we cannot put a fly method on the vehicle base class. Yet if the airliner and the helicopter each write their own fly method, we violate the principle of reusing code as much as possible (as flying machines multiply, so does the duplicated code). The only way out is to let both aircraft inherit from a vehicle class and a flying-machine class at the same time - multiple inheritance again, which breaks the "is-a" rule. How do we resolve this?

Different languages give different answers; let's look at Java first. Java offers the interface mechanism to achieve the effect of multiple inheritance:
public abstract class Vehicle {
}

public interface Flyable {
    public void fly();
}

public class FlyableImpl implements Flyable {
    public void fly() {
        System.out.println("I am flying");
    }
}

public class Airplane extends Vehicle implements Flyable {
    private Flyable flyable;

    public Airplane() {
        flyable = new FlyableImpl();
    }

    public void fly() {
        flyable.fly();
    }
}


Now the airplane has both the vehicle and the flying-machine behaviour, we did not have to rewrite the fly method defined for flying machines, and single inheritance is preserved: an airplane is a vehicle, and the ability to fly is a property it acquires by implementing the interface.

Back to the topic. Python has no interface mechanism, but it does allow multiple inheritance. So should Python simply use multiple inheritance here? Yes and no. Yes, because syntactically that is exactly what happens; no, because the inheritance still respects the "is-a" relationship and, in meaning, still follows the single-inheritance principle. How so? Let's look at an example.
class Vehicle(object):
    pass

class PlaneMixin(object):
    def fly(self):
        print 'I am flying'

class Airplane(Vehicle, PlaneMixin):
    pass


As you can see, the Airplane class above uses multiple inheritance, but we name its second base class PlaneMixin rather than Plane. This does not change the behaviour, but it tells whoever reads the code later that this class is a mixin. Semantically, Airplane is only a Vehicle, not a Plane. The term mixin (mix-in) says: this class is a piece of functionality added into subclasses, not a parent class in its own right - it plays the role that interfaces play in Java.

Be very careful when using mixin classes for multiple inheritance:
  • First, a mixin must represent a capability, not a thing - like Runnable or Callable in Java.
  • Second, it must have a single responsibility; if there are several capabilities, write several mixin classes.
  • Then, it must not depend on the subclass's implementation.
  • Finally, the subclass should still work even without inheriting the mixin - it just loses that capability. (The airplane can still carry passengers; it just cannot fly ^_^)

 
Original article; please indicate the source when reposting
http://30daydo.com/article/480
 

Using a regular expression to replace newlines in Chinese text [python]

python爬虫 · 李魔佛 posted an article • 0 comments • 2815 views • 2019-05-13 11:02

The js content contains newline characters inside the Chinese text.
Use a regular expression to replace the newlines (you can also substitute any other character):
js = re.sub('\r\n', '', js)

Done.

Request headers show "Provisional headers are shown"

python爬虫 · 李魔佛 posted an article • 0 comments • 4812 views • 2019-05-13 10:07

This usually happens because some browser extension is installed - for example the ad-blocking extension AdBlock.
Uninstall the extension and the problem goes away.

Async crawler: submitting POST data with aiohttp

python爬虫 · 李魔佛 posted an article • 0 comments • 7678 views • 2019-05-08 16:40

Basic usage:
async def fetch(session, url, data):
    async with session.post(url=url, data=data, headers=headers) as response:
        return await response.json()

A complete example:
import aiohttp
import asyncio

page = 30

post_data = {
    'page': 1,
    'pageSize': 10,
    'keyWord': '',
    'dpIds': '',
}

headers = {
    "Accept-Encoding": "gzip, deflate",
    "Accept-Language": "en-US,en;q=0.9",
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.108 Safari/537.36",
    "X-Requested-With": "XMLHttpRequest",
}

result = []


async def fetch(session, url, data):
    async with session.post(url=url, data=data, headers=headers) as response:
        return await response.json()

async def parse(html):
    xzcf_list = html.get('newtxzcfList')
    if xzcf_list is None:
        return
    for i in xzcf_list:
        result.append(i)

async def downlod(page):
    data = post_data.copy()
    data['page'] = page
    url = 'http://credit.chaozhou.gov.cn/tfieldTypeActionJson!initXzcfListnew.do'
    async with aiohttp.ClientSession() as session:
        html = await fetch(session, url, data)
        await parse(html)

loop = asyncio.get_event_loop()
tasks = [asyncio.ensure_future(downlod(i)) for i in range(1, page)]
tasks = asyncio.gather(*tasks)
# print(tasks)
loop.run_until_complete(tasks)
# loop.close()
# print(result)
count = 0
for i in result:
    print(i.get('cfXdrMc'))
    count += 1
print(f'total {count}')
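One extra note (illustrative): if the endpoint expects a JSON body instead of form data, aiohttp lets you pass json= instead of data=, and it sets the Content-Type header for you:

async def fetch_json(session, url, payload):
    async with session.post(url, json=payload, headers=headers) as response:
        return await response.json()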

Async aiohttp crawler in Python - scraping Lianjia data asynchronously

python爬虫 · 李魔佛 posted an article • 0 comments • 2688 views • 2019-05-08 15:52


pycharm debug scrapy 报错 twisted.internet.error.ReactorNotRestartable

python爬虫李魔佛 发表了文章 • 0 个评论 • 5979 次浏览 • 2019-04-23 11:35 • 来自相关话题

没发现哪里不妥,以前debug调试scrapy一直没问题。 
后来才发现,
scrapy run的启动文件名不能命名为cmd.py !!!!!
我把scrapy的启动写到cmd.py里面
from scrapy import cmdline
cmdline.execute('scrapy crawl xxxx'.split())
 
然后cmd.py和标准库的 cmd 模块重名了(pdb 等调试工具就依赖这个模块),本地文件把它遮蔽之后,debug 就会出问题。
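
补充一个排查思路(示意代码,文件名只是举例):在项目目录下确认 import 进来的 cmd 模块到底来自哪里,如果打印出来的是项目里的 cmd.py 而不是 Python 安装目录,就说明标准库被遮蔽了,把启动脚本换一个不冲突的名字即可:
# 在项目根目录下运行,确认 cmd 模块是否被本地文件遮蔽
import cmd
print(cmd.__file__)  # 正常应指向 Python 安装目录下的 cmd.py

# 启动脚本改名为 run_spider.py(示例名字),内容保持不变
from scrapy import cmdline
cmdline.execute('scrapy crawl xxxx'.split())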

python不支持多重继承中的重复继承

李魔佛 发表了文章 • 0 个评论 • 2287 次浏览 • 2019-04-18 16:36 • 来自相关话题

代码如下:
class First(object):
def __init__(self):
print("first")

class Second(First):
def __init__(self):
print("second")

class Third(First,Second):
def __init__(self):
print("third")
运行代码会直接报错:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-6-c90f7b77d3e0> in <module>()
7 print("second")
8
----> 9 class Third(First,Second):
10 def __init__(self):
11 print("third")

TypeError: Cannot create a consistent method resolution order (MRO) for bases First, Second
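
补充说明:Second 本身继承自 First,而按照 C3 线性化(MRO)的规则,子类必须排在它的父类前面;Third(First,Second) 却要求 First 排在 Second 前面,两条约束互相矛盾,所以无法生成一致的 MRO。把继承顺序换成 Third(Second, First) 就可以了,下面是一个可以直接运行的小例子:
class First(object):
    def __init__(self):
        print("first")

class Second(First):
    def __init__(self):
        print("second")

class Third(Second, First):  # 子类 Second 写在父类 First 前面
    def __init__(self):
        print("third")

print(Third.__mro__)  # 顺序依次是 Third、Second、First、object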

gevent异步 入门教程(入坑)

李魔佛 发表了文章 • 0 个评论 • 2936 次浏览 • 2019-04-18 11:37 • 来自相关话题

code1
import time
import gevent
import requests
def foo():
print('Running in foo')

r=requests.get('http://30daydo.com')
print(r.status_code)
print('Explicit context switch to foo again')

def bar():
print('Explicit context to bar')

r=requests.get('http://www.qq.com') #
print(r.status_code)
print('Implicit context switch back to bar')

start=time.time()
gevent.joinall([
gevent.spawn(foo),
gevent.spawn(bar),
])
print('time used {}'.format(time.time()-start))
上面的异步代码不起作用,因为requests阻塞了,所以用的时间和顺序执行的时间一样.
 
或者用以下代码替代:
import time
import gevent
import requests
def foo():
print('Running in foo')
time.sleep(2) # 这样子不起作用
print('Explicit context switch to foo again')

def bar():
print('Explicit context to bar')
time.sleep(2)
print('Implicit context switch back to bar')

start=time.time()
gevent.joinall([
gevent.spawn(foo),
gevent.spawn(bar),
])
print('time used {}'.format(time.time()-start))
把访问网络的部分用sleep替代后,最后的运行时间是2+2=4秒,并不是2秒。那么要怎样才能是2秒呢?需要改成以下的代码:
 
import time
import gevent
import requests
def foo():
print('Running in foo')

    gevent.sleep(2)  # gevent.sleep 会让出控制权,切换去执行另一个协程

print('Explicit context switch to foo again')

def bar():
print('Explicit context to bar')

gevent.sleep(2)

print('Implicit context switch back to bar')

start=time.time()
gevent.joinall([
gevent.spawn(foo),
gevent.spawn(bar),
])
print('time used {}'.format(time.time()-start))
使用gevent.sleep()
这个函数才可以达到目的。
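
补充:如果想让 code1 里 requests 这种阻塞式网络请求也能被 gevent 并发调度,常见做法是在程序一开始打 monkey patch,把标准库的阻塞 IO 换成 gevent 的实现。下面是一个示意写法(域名沿用上面的例子):
from gevent import monkey
monkey.patch_all()  # 必须在 import requests 之前执行

import time
import gevent
import requests

def foo():
    r = requests.get('http://30daydo.com')
    print('foo', r.status_code)

def bar():
    r = requests.get('http://www.qq.com')
    print('bar', r.status_code)

start = time.time()
gevent.joinall([
    gevent.spawn(foo),
    gevent.spawn(bar),
])
print('time used {}'.format(time.time() - start))  # 两个请求并发执行,总耗时约等于较慢的那一个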