python

When a captured POST field shows (unable to decode value) in the browser: how to POST it correctly with requests

Python crawlers • Posted by 李魔佛 • 0 comments • 7171 views • 2018-12-12 14:52 • From related topics

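In the browser's F12 network capture, the txtTKeyword field of the POST form shows up as (unable to decode value).

The data is submitted via POST, and the txtTKeyword field cannot be displayed. Evidently the form uses a different encoding that the browser cannot decode.
A tool such as Fiddler can show the raw value.

In Python, encode the value yourself before posting; otherwise the server cannot parse the submitted data.

Note that urllib.parse.quote(uncode_str) is not needed here; calling encode directly is enough. (Handle special cases as they come; some sites are just odd.)

s = '耐克球鞋'
s = s.encode('gb2312')
data = {
    '__VIEWSTATE': view_state,
    '__EVENTVALIDATION': event_validation,
    'txtTKeyword': s,
    'btQuery.x': 41,
    'btQuery.y': 24,
}

r = session.post(url=self.base_url, headers=headers,
                 data=data, proxies=self.get_proxy())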
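To see why plain encode is enough, here is a minimal sketch using only the standard library. It assumes, as I understand it, that requests builds form bodies with urlencode, which percent-encodes bytes values as they are. The field names are taken from the post; the rest is illustration.

```python
from urllib.parse import urlencode, unquote_plus

keyword = '耐克球鞋'

# Encode the field to GB2312 bytes ourselves, as the post does.
form_gb2312 = {'txtTKeyword': keyword.encode('gb2312'), 'btQuery.x': 41}
form_utf8 = {'txtTKeyword': keyword, 'btQuery.x': 41}

# Bytes values are percent-encoded as-is; str values are encoded as UTF-8 first.
body_gb2312 = urlencode(form_gb2312)
body_utf8 = urlencode(form_utf8)
print(body_gb2312)
print(body_utf8)

# Decoding the field with the matching codec recovers the keyword.
raw_value = body_gb2312.split('&')[0].split('=', 1)[1]
decoded = unquote_plus(raw_value, encoding='gb2312')
print(decoded)
```

The two bodies differ only in the bytes behind txtTKeyword, which is exactly what a GB2312-only server cares about.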

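Using randint in Python

Posted by 李魔佛 • 0 comments • 3486 views • 2018-12-10 14:50 • From related topics

From the official documentation:

random.randint(a, b)
Return a random integer N such that a <= N <= b.

In other words, it returns an integer between a and b, with both a and b included.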
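A quick way to convince yourself that both endpoints are included is to draw many samples and look at the extremes. This is only a sketch; the seed and the range 1 to 3 are arbitrary choices for illustration.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable
rolls = [random.randint(1, 3) for _ in range(1000)]

# Both bounds show up among the samples.
print(min(rolls), max(rolls))
print(sorted(set(rolls)))
```

By contrast, random.randrange(a, b) excludes b, so randint(a, b) behaves like randrange(a, b + 1).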

Getting the names of all collections in a MongoDB database from Python

Posted by 李魔佛 • 0 comments • 5850 views • 2018-11-27 11:41 • From related topics

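List all collections in a database. Here db is the pymongo client, so db['db_pledge'] is the database object:

db['db_pledge'].collection_names()
db['db_pledge'].list_collection_names()

collection_names() is the older spelling; newer PyMongo releases deprecate it in favor of list_collection_names(), and PyMongo 4 removes it.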

Collecting Grequests responses in bulk

Python crawlers • Posted by 李魔佛 • 0 comments • 6943 views • 2018-11-23 10:36 • From related topics

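Grequests is a wrapper library that makes requests asynchronous.
How do you collect the responses of a batch of Grequests calls?

import grequests
import requests
import bs4

def simple_request(url):
    page = requests.get(url)
    return page

urls = [
    'http://www.heroku.com',
    'http://python-tablib.org',
    'http://httpbin.org',
    'http://python-requests.org',
    'http://kennethreitz.com'
]

rs = [grequests.get(simple_request(u)) for u in urls]

grequests.map(rs)

Note that the code above is wrong!

grequests.get accepts only a URL. You cannot put a function call inside it: simple_request(u) runs synchronously and hands grequests.get a Response object instead of a URL.

The correct form:

rs = (grequests.get(u) for u in urls)
responses = grequests.map(rs)
for response in responses:
    market_watch(response.content)

Put the actual handling of each response's content inside the market_watch function. (The result list is named responses rather than requests so that it does not shadow the imported requests module.)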
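One detail worth handling: as far as I recall, grequests.map returns one entry per request, in request order, with None in place of any request that failed. The sketch below shows one way to separate the two cases. It uses fake response objects so it runs without the network; collect_contents is a hypothetical helper, not part of grequests.

```python
from types import SimpleNamespace


def collect_contents(urls, responses):
    """Pair each URL with its body, setting aside requests that failed (None)."""
    results = {}
    failed = []
    for url, response in zip(urls, responses):
        if response is None:
            failed.append(url)
        else:
            results[url] = response.content
    return results, failed


# Fake responses standing in for the list grequests.map would return.
urls = ['http://httpbin.org', 'http://kennethreitz.com']
fake_responses = [SimpleNamespace(content=b'<html>ok</html>'), None]

contents, failed = collect_contents(urls, fake_responses)
print(contents)
print(failed)
```

With real responses, the same function can be called as collect_contents(urls, grequests.map(rs)), provided rs was built from urls in the same order.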

Python 3: list comprehension vs map

Posted by 李魔佛 • 0 comments • 4796 views • 2018-11-22 11:25 • From related topics

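(This applies to Python 3, where map returns a map object, which is a lazy iterator.)
Speed:
If map is given a lambda, it is slower than a list comprehension; with an existing function, map is slightly faster.

$ python -mtimeit -s'xs=range(10)' 'map(hex, xs)'
100000 loops, best of 3: 4.86 usec per loop

$ python -mtimeit -s'xs=range(10)' '[hex(x) for x in xs]'
100000 loops, best of 3: 5.58 usec per loop

Here map is slightly faster.

With a lambda:

$ python -mtimeit -s'xs=range(10)' 'map(lambda x: x+2, xs)'
100000 loops, best of 3: 4.24 usec per loop

$ python -mtimeit -s'xs=range(10)' '[x+2 for x in xs]'
100000 loops, best of 3: 2.32 usec per loop

Here the list comprehension is faster.

These timings compare like with like only when map builds its whole result. Under Python 3, map(...) on its own merely creates the lazy object, so wrap it in list(...) when benchmarking.

Because map returns a lazy iterator, it does not exhaust memory on very large inputs.
A list comprehension does, although there is a fix: use () instead of [], which gives a generator expression.

>>> [str(n) for n in range(10**100)]

Run the line above with care; it will bring your machine to its knees.

With map there is no problem:

>>> map(str, range(10**100))
<map object at 0x2201d50>

Or:

>>> (str(n) for n in range(10**100))
<generator object <genexpr> at 0xacbdef>

That is fine too.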
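For a Python 3 comparison that does the same work on both sides, the map call has to be consumed. The following is a small sketch with the standard timeit module; absolute timings depend on the machine, so none are quoted here. The second half shows the laziness directly by taking just three items from an otherwise unmanageable range.

```python
import timeit
from itertools import islice

xs = range(10)

# Force evaluation with list(...) so both sides build a full list.
t_map = timeit.timeit(lambda: list(map(hex, xs)), number=100_000)
t_comp = timeit.timeit(lambda: [hex(x) for x in xs], number=100_000)
print(f'list(map(hex, xs)): {t_map:.3f}s   comprehension: {t_comp:.3f}s')

# Laziness: only the three items we ask for are ever computed.
first_three = list(islice(map(str, range(10**100)), 3))
print(first_three)
```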

Original article; please credit the source when reposting:
http://30daydo.com/article/378

Validating a unified social credit code

Posted by 李魔佛 • 0 comments • 8312 views • 2018-10-26 11:28 • From related topics

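First, the organization code is embedded as the subject identification code. The uniqueness of the organization code guarantees that social credit codes never collide; in other words, the unified social credit code inherits its uniqueness directly from the organization code.
Second, an administrative division code is placed in front of the organization code, and that combination is exactly the tax registration number. This makes the unified code more compatible: during the transition, tax authorities can use this nesting to move to the new code system more easily.
Third, the first two characters are reserved for the registration authority and the organization category. The first character helps registration authorities manage and search their records; the second classifies the organization precisely, which makes it easier to divide responsibilities.
Fourth, the subject identification code has a large capacity by design. Because it mixes digits and letters, the number of possible codes grows exponentially with length, so it can hold a very large number of organizations for a long time without adding characters.
Fifth, the code is 18 characters long, the same as a resident ID number. Under the future "two codes for two kinds of persons" scheme, this lets registration, lookup, and form filling be handled uniformly.
Sixth, the embedded subject identification code carries its own check character, and the 18th character of the whole code is a check character as well. Compared with an ID number this is a double check, which protects the accuracy of the number.

Characters 17 and 18 are check characters. The validation rules are implemented as follows: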

# -*- coding: utf-8 -*-
# @Time : 2018/10/30 15:23
# @File : social_code_gen2.py
'''
Created on 2017-04-05
The 18-character unified social credit code took effect on 2015-10-01.

@author: rocky
'''

# The unified social credit code does not use I, O, Z, S, V
SOCIAL_CREDIT_CHECK_CODE_DICT = {
    '0': 0, '1': 1, '2': 2, '3': 3, '4': 4, '5': 5, '6': 6, '7': 7, '8': 8, '9': 9,
    'A': 10, 'B': 11, 'C': 12, 'D': 13, 'E': 14, 'F': 15, 'G': 16, 'H': 17,
    'J': 18, 'K': 19, 'L': 20, 'M': 21, 'N': 22, 'P': 23, 'Q': 24,
    'R': 25, 'T': 26, 'U': 27, 'W': 28, 'X': 29, 'Y': 30}

# Character set from GB 11714-1997 (national organization code rules)
ORGANIZATION_CHECK_CODE_DICT = {
    '0': 0, '1': 1, '2': 2, '3': 3, '4': 4, '5': 5, '6': 6, '7': 7, '8': 8, '9': 9,
    'A': 10, 'B': 11, 'C': 12, 'D': 13, 'E': 14, 'F': 15, 'G': 16, 'H': 17, 'I': 18,
    'J': 19, 'K': 20, 'L': 21, 'M': 22, 'N': 23, 'O': 24, 'P': 25, 'Q': 26,
    'R': 27, 'S': 28, 'T': 29, 'U': 30, 'V': 31, 'W': 32, 'X': 33, 'Y': 34, 'Z': 35}


class UnifiedSocialCreditIdentifier(object):
    '''
    Unified social credit code
    '''

    def check_social_credit_code(self, code):
        '''
        Validate the check character of a unified social credit code.
        Formula:
            C18 = 31 - mod(sum(Ci*Wi), 31), where Ci is the value of the i-th
            character of the code, Wi is the weighting factor at position i,
            and C18 is the check character
        '''
        # Weighting factor at position i
        weighting_factor = [1, 3, 9, 27, 19, 26, 16, 17, 20, 29, 25, 13, 8, 24, 10, 30, 28]
        # Body code
        ontology_code = code[0:17]
        # Check character
        check_code = code[17]
        # Compute the expected check character
        tmp_check_code = self.gen_check_code(weighting_factor, ontology_code, 31, SOCIAL_CREDIT_CHECK_CODE_DICT)
        return tmp_check_code == check_code

    def check_organization_code(self, code):
        '''
        Validate the organization code, following GB 11714.
        Characters 9 to 17 of the unified social credit code are the subject
        identification code (organization code), nine characters in total.
        Formula:
            C9 = 11 - mod(sum(Ci*Wi), 11), where Ci is the value of the i-th
            character of the organization code, Wi is the weighting factor at
            position i, and C9 is the check character
        @param code: unified social credit code
        '''
        # Weighting factor at position i
        weighting_factor = [3, 7, 9, 10, 5, 8, 4, 2]
        # Characters 9 to 17: subject identification code (organization code)
        organization_code = code[8:17]
        # Body code
        ontology_code = organization_code[0:8]
        # Check character
        check_code = organization_code[8]
        # Compute the expected check character
        tmp_check_code = self.gen_check_code(weighting_factor, ontology_code, 11, ORGANIZATION_CHECK_CODE_DICT)
        return tmp_check_code == check_code

    def gen_check_code(self, weighting_factor, ontology_code, modulus, check_code_dict):
        '''
        @param weighting_factor: weighting factors
        @param ontology_code: body code
        @param modulus: modulus
        @param check_code_dict: character-to-value table
        '''
        total = 0
        for char, weight in zip(ontology_code, weighting_factor):
            total += check_code_dict[char] * weight
        diff = modulus - total % modulus
        if modulus == 11:
            # Organization code: a value of 10 is written as X and 11 as 0
            return {10: 'X', 11: '0'}.get(diff, str(diff))
        # Modulus 31: a value of 31 wraps to 0, then map the value back to its character
        diff %= modulus
        value_to_char = {v: k for k, v in check_code_dict.items()}
        return value_to_char[diff]


if __name__ == '__main__':
    u = UnifiedSocialCreditIdentifier()
    print(u.check_organization_code(code='91421126331832178C'))
    print(u.check_social_credit_code(code='91420100052045470K'))

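Update:
Quoting the actual generation rules.

The following is taken from the standard 《法人和其他组织统一社会信用代码编码规则》 (coding rules for the unified social credit code of legal persons and other organizations).

1 Scope

This standard specifies the terms, definitions, and composition of the unified social credit code for legal persons and other organizations (hereafter "unified code"). It applies to the encoding, information processing, and information sharing and exchange of unified codes.

2 Normative references

The following documents are indispensable for applying this document. For dated references, only the dated edition applies. For undated references, the latest edition (including all amendments) applies.

GB/T 2260 Administrative division codes of the People's Republic of China
GB 11714 Rules for compiling national organization codes
GB/T 17710 Information technology, security techniques, check character systems

3 Terms and definitions

The following terms and definitions apply to this document.

3.1 Organization (组织机构)

General term for enterprises, public institutions, government organs, social groups, and other lawfully established units. [GB/T 20091-2006, definition 2.2]

3.2 Legal person (法人, legal entities)

An organization that has capacity for civil rights and civil conduct, and that independently enjoys civil rights and assumes civil obligations in accordance with the law.

3.3 Other organizations (其他组织)

Organizations that are lawfully established and have a certain organizational structure and assets, but do not have legal-person status.

3.4 Organization code (组织机构代码); subject identification code (主体标识码)

A unique and permanent identifier assigned to each organization nationwide, compiled according to GB 11714.

3.5 Unified social credit code (统一社会信用代码, unified social credit identifier)

The statutory identity code of each legal person or other organization, unique nationwide and unchanged for life.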

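4 Composition of the unified code

4.1 Structure

A unified code consists of 18 Arabic digits or uppercase English letters (I, O, Z, S, and V are not used).

Character 1: registration administration department code (1 character)
Character 2: organization category code (1 character)
Characters 3 to 8: administrative division code of the registration authority (6 digits)
Characters 9 to 17: subject identification code (organization code) (9 characters)
Character 18: check character (1 character)

4.2 Codes and their meaning

Registration administration department code: an Arabic digit or an uppercase English letter.

Institutional establishment (机构编制): 1
Civil affairs (民政): 5
Industry and commerce (工商): 9
Other: Y

Organization category code: an Arabic digit or an uppercase English letter.

Institutional establishment, government organ: starts with 11
Institutional establishment, public institution: starts with 12
Institutional establishment, mass organization whose staffing is managed directly by the central establishment office (中央编办): starts with 13
Institutional establishment, other: starts with 19
Civil affairs, social group: starts with 51
Civil affairs, private non-enterprise unit: starts with 52
Civil affairs, foundation: starts with 53
Civil affairs, other: starts with 59
Industry and commerce, enterprise: starts with 91
Industry and commerce, individual business (个体工商户): starts with 92
Industry and commerce, farmers' professional cooperative: starts with 93
Other: starts with Y1

Administrative division code of the registration authority: Arabic digits only, encoded according to GB/T 2260.

Subject identification code (organization code): Arabic digits or uppercase English letters, encoded according to GB 11714.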

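Before the unified social credit code was introduced, the organization code on the old organization code certificate had nine characters in the format XXXXXXXX-Y. The first eight characters are the "body code" and the last one is the "check character". The two are joined by a hyphen (-) so that the check character is easy to spot. After the three-certificates-in-one reform, all nine characters were absorbed into characters 9 to 17 of the unified social credit code, and the hyphen is not carried over.

The check character of the old organization code is computed as follows.

Example: a company's organization code is 59467239-9. How is the final check character 9 obtained?

Step 1: Take the eight body-code characters as base values: 5 9 4 6 7 2 3 9. Note: if the body code contains uppercase letters, A has base value 10, B has 11, C has 12, and so on up to Z with 35.

Step 2: Take the weighting factors. The body code has eight characters, and their weighting factors from left to right are 3, 7, 9, 10, 5, 8, 4, 2.

Step 3: Multiply each base value by the factor at its position. 5×3=15, 9×7=63, 4×9=36, 6×10=60, 7×5=35, 2×8=16, 3×4=12, 9×2=18

Step 4: Sum the products. 15+63+36+60+35+16+12+18=255

Step 5: Divide the sum by 11 and take the remainder. 255÷11=23, remainder 2.

Step 6: Subtract the remainder from 11 to get the check value. When the check value is 10, the check character is the uppercase letter X; when it is 11, the check character is 0; otherwise the check character is the value itself. 11-2=9, so the company's full organization code is 59467239-9.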
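The six steps above can be sketched in a few lines of Python. The function names here are my own, chosen for illustration; the weights and the X and 0 special cases come from the text.

```python
ORG_WEIGHTS = [3, 7, 9, 10, 5, 8, 4, 2]


def char_value(ch):
    """Digits map to themselves; A=10, B=11, ... Z=35 (base-36 digit values)."""
    return int(ch, 36)


def org_check_char(body):
    """Check character for an 8-character organization body code."""
    total = sum(char_value(c) * w for c, w in zip(body, ORG_WEIGHTS))
    value = 11 - total % 11
    if value == 10:
        return 'X'
    if value == 11:
        return '0'
    return str(value)


print(org_check_char('59467239'))  # 9, matching the worked example
```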

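Check character: an Arabic digit or an uppercase English letter. The calculation follows GB/T 17710.

Example: a company's unified social credit code is 91512081MA62K0260E. How is the final check character E obtained?

Step 1: Take the first 17 characters of the code as base values: 9 1 5 1 2 0 8 1 21 10 6 2 19 0 2 6 0. Note: if the first 17 characters contain uppercase letters (I, O, Z, S, and V are not used), the letters have these base values: A=10, B=11, C=12, D=13, E=14, F=15, G=16, H=17, J=18, K=19, L=20, M=21, N=22, P=23, Q=24, R=25, T=26, U=27, W=28, X=29, Y=30

Step 2: Take the weighting factors. The code has 17 characters before the check character, and their weighting factors from left to right are 1, 3, 9, 27, 19, 26, 16, 17, 20, 29, 25, 13, 8, 24, 10, 30, 28

Step 3: Multiply each base value by the factor at its position. 9×1=9, 1×3=3, 5×9=45, 1×27=27, 2×19=38, 0×26=0, 8×16=128, 1×17=17, 21×20=420, 10×29=290, 6×25=150, 2×13=26, 19×8=152, 0×24=0, 2×10=20, 6×30=180, 0×28=0

Step 4: Sum the products. 9+3+45+27+38+0+128+17+420+290+150+26+152+0+20+180+0=1505

Step 5: Divide the sum by 31 and take the remainder. 1505÷31=48, remainder 17.

Step 6: Subtract the remainder from 31 to get the check value. When the check value is 0 to 9, it is used directly as the check character; when it is 10 to 30, it is converted to the corresponding uppercase letter: A=10, B=11, C=12, D=13, E=14, F=15, G=16, H=17, J=18, K=19, L=20, M=21, N=22, P=23, Q=24, R=25, T=26, U=27, W=28, X=29, Y=30

Since 31-17=14, which corresponds to E, the company's full unified social credit code is 91512081MA62K0260E.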
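As a compact sketch of the same calculation: the 31-character alphabet below is ordered so that each character's index equals its base value, which removes the need for a lookup table. The names are hypothetical, and the wrap from 31 to 0 is my reading of the modulus-31 scheme rather than something spelled out in the text above.

```python
USCC_ALPHABET = '0123456789ABCDEFGHJKLMNPQRTUWXY'  # 31 symbols, no I O Z S V
USCC_WEIGHTS = [1, 3, 9, 27, 19, 26, 16, 17, 20, 29, 25, 13, 8, 24, 10, 30, 28]


def uscc_check_char(first17):
    """Check character for the first 17 characters of a unified code."""
    total = sum(USCC_ALPHABET.index(c) * w for c, w in zip(first17, USCC_WEIGHTS))
    value = (31 - total % 31) % 31  # a result of 31 wraps to 0
    return USCC_ALPHABET[value]


def is_valid_uscc(code):
    """True if code has 18 allowed characters and a matching check character."""
    if len(code) != 18 or any(c not in USCC_ALPHABET for c in code):
        return False
    return uscc_check_char(code[:17]) == code[17]


print(uscc_check_char('91512081MA62K0260'))  # E, matching the worked example
print(is_valid_uscc('91512081MA62K0260F'))   # False: wrong check character
```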

————————————————

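How the unified social credit code relates to the old business license registration number, tax registration number, and organization code

The 18-character unified social credit code only took effect on October 1, 2015, and many systems have not yet fully switched to it. When you are asked to fill in an organization code or a tax registration number, how do you derive it from the unified social credit code?

In essence: characters 9 to 17 of the unified social credit code are the old organization code. Characters 3 to 17 are, in the vast majority of cases, the old tax registration number. (A few issuing authorities normalized the local administrative division codes, so for a small number of companies characters 3 to 8 of the new code do not exactly match the first six digits of the old tax registration number.) The unified social credit code cannot be mapped to the old business license registration number. If you must provide that number and the system does not yet accept the unified code, the only option is to read it off the old business license.

Example: the unified social credit code 91370200163562681G.

Its organization code is 16356268-1.
Its tax registration number is 370200163562681. If this differs slightly from the earlier tax registration number, the difference is usually in 370200, especially the final 00.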
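The slicing described above is simple enough to show directly. This is a sketch with a hypothetical helper name; it only cuts the string apart and does not validate check characters.

```python
def legacy_identifiers(uscc):
    """Split an 18-character unified code into the legacy numbers it embeds."""
    if len(uscc) != 18:
        raise ValueError('expected an 18-character unified social credit code')
    org = uscc[8:17]  # characters 9 to 17
    return {
        'organization_code': f'{org[:8]}-{org[8]}',  # restore the old hyphen
        'tax_registration_number': uscc[2:17],       # characters 3 to 17
        'division_code': uscc[2:8],                  # characters 3 to 8
    }


ids = legacy_identifiers('91370200163562681G')
print(ids)
```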

Original article; please credit the source when reposting
http://30daydo.com/article/364

ImportError: cannot import name patterns (a Django version compatibility problem)

Posted by 李魔佛 • 0 comments • 5406 views • 2018-10-25 11:20 • From related topics

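Answers online mostly copy one another, and none of them has been verified.
The CSDN result that Baidu returns, https://blog.csdn.net/xudailong_blog/article/details/78313568,
is simply wrong: I downgraded Django to 1.10 and still got the error.

According to the official documentation, patterns has been deprecated since 1.8 (and 1.10 removes it entirely, which is why that downgrade failed), so you need to go down to 1.8.

Downgrade command:
pip install django==1.8

That is all.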

Displaying OpenCV images in Jupyter Notebook

Posted by 李魔佛 • 0 comments • 8844 views • 2018-09-22 22:55 • From related topics

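import cv2
from matplotlib import pyplot as plt
%matplotlib inline

img = cv2.imread('forest.jpg')
# OpenCV loads images in BGR order, while matplotlib expects RGB
plt.imshow(cv2.cvtColor(img, cv2.COLOR_BGR2RGB))

The image is then rendered inline in the notebook.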

Crawling every user post on Jisilu (集思录) with Python: Scrapy writing to MongoDB

Python crawlers • Posted by 李魔佛 • 0 comments • 7135 views • 2018-09-02 21:52 • From related topics

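It has been a while since the last update, so here are some of the crawlers I wrote earlier. Otherwise the readers will all be gone. -. -

The project uses the Scrapy framework and writes the data into MongoDB. Crawling the whole site took about half an hour and produced roughly 120,000 records.

The main code of the project is as follows.

Main spider: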

# -*- coding: utf-8 -*-
import re
import scrapy
from scrapy import Request, FormRequest
from jsl.items import JslItem
from jsl import config
import logging


class AllcontentSpider(scrapy.Spider):
    name = 'allcontent'

    headers = {
        'Host': 'www.jisilu.cn', 'Connection': 'keep-alive', 'Pragma': 'no-cache',
        'Cache-Control': 'no-cache', 'Accept': 'application/json,text/javascript,*/*;q=0.01',
        'Origin': 'https://www.jisilu.cn', 'X-Requested-With': 'XMLHttpRequest',
        'User-Agent': 'Mozilla/5.0(WindowsNT6.1;WOW64)AppleWebKit/537.36(KHTML,likeGecko)Chrome/67.0.3396.99Safari/537.36',
        'Content-Type': 'application/x-www-form-urlencoded;charset=UTF-8',
        'Referer': 'https://www.jisilu.cn/login/',
        'Accept-Encoding': 'gzip,deflate,br',
        'Accept-Language': 'zh,en;q=0.9,en-US;q=0.8'
    }

    def start_requests(self):
        login_url = 'https://www.jisilu.cn/login/'
        headers = {
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
            'Accept-Encoding': 'gzip,deflate,br', 'Accept-Language': 'zh,en;q=0.9,en-US;q=0.8',
            'Cache-Control': 'no-cache', 'Connection': 'keep-alive',
            'Host': 'www.jisilu.cn', 'Pragma': 'no-cache', 'Referer': 'https://www.jisilu.cn/',
            'Upgrade-Insecure-Requests': '1',
            'User-Agent': 'Mozilla/5.0(WindowsNT6.1;WOW64)AppleWebKit/537.36(KHTML,likeGecko)Chrome/67.0.3396.99Safari/537.36'}

        yield Request(url=login_url, headers=headers, callback=self.login, dont_filter=True)

    def login(self, response):
        url = 'https://www.jisilu.cn/account/ajax/login_process/'
        data = {
            'return_url': 'https://www.jisilu.cn/',
            'user_name': config.username,
            'password': config.password,
            'net_auto_login': '1',
            '_post_type': 'ajax',
        }

        yield FormRequest(
            url=url,
            headers=self.headers,
            formdata=data,
            callback=self.parse,
            dont_filter=True
        )

    def parse(self, response):
        # Queue every listing page
        for i in range(1, 3726):
            focus_url = 'https://www.jisilu.cn/home/explore/sort_type-new__day-0__page-{}'.format(i)
            yield Request(url=focus_url, headers=self.headers, callback=self.parse_page, dont_filter=True)

    def parse_page(self, response):
        nodes = response.xpath('//div[@class="aw-question-list"]/div')
        for node in nodes:
            each_url = node.xpath('.//h4/a/@href').extract_first()
            yield Request(url=each_url, headers=self.headers, callback=self.parse_item, dont_filter=True)

    def parse_item(self, response):
        item = JslItem()
        title = response.xpath('//div[@class="aw-mod-head"]/h1/text()').extract_first()
        s = response.xpath('//div[@class="aw-question-detail-txt markitup-box"]').xpath('string(.)').extract_first()
        ret = re.findall(r'(.*?)\.donate_user_avatar', s, re.S)

        try:
            content = ret[0].strip()
        except IndexError:
            content = None

        createTime = response.xpath('//div[@class="aw-question-detail-meta"]/span/text()').extract_first()

        resp_no = response.xpath('//div[@class="aw-mod aw-question-detail-box"]//ul/h2/text()').re_first(r'\d+')

        url = response.url
        item['title'] = title.strip()
        item['content'] = content
        try:
            item['resp_no'] = int(resp_no)
        except Exception as e:
            logging.warning(e)
            item['resp_no'] = None

        item['createTime'] = createTime
        item['url'] = url.strip()
        # One entry per reply: {user_index: [agree_count, reply_text]}
        resp = []
        for index, reply in enumerate(response.xpath('//div[@class="aw-mod-body aw-dynamic-topic"]/div[@class="aw-item"]')):
            reply_user = reply.xpath('.//div[@class="pull-left aw-dynamic-topic-content"]//p/a/text()').extract_first()
            rep_content = reply.xpath(
                './/div[@class="pull-left aw-dynamic-topic-content"]//div[@class="markitup-box"]/text()').extract_first()
            agree = reply.xpath('.//em[@class="aw-border-radius-5 aw-vote-count pull-left"]/text()').extract_first()
            resp.append({reply_user.strip() + '_{}'.format(index): [int(agree), rep_content.strip()]})

        item['resp'] = resp
        yield item

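The login function simulates logging in to Jisilu; the form data to submit can be found by capturing the login request.
After that the spider crawls page by page. The logic is simple.

The pipeline then writes the items to MongoDB: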

import pymongo
from collections import OrderedDict


class JslPipeline(object):
    def __init__(self):
        self.db = pymongo.MongoClient(host='10.18.6.1', port=27017)
        # self.user = u'neo牛3'  # change to the target username, e.g. 毛之川, then look up the user's id in the source of the user's page; for example, 持有封基 is 8132
        self.collection = self.db['db_parker']['jsl']

    def process_item(self, item, spider):
        # insert_one replaces insert(), which newer PyMongo versions no longer provide
        self.collection.insert_one(OrderedDict(item))
        return item
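A small variation that makes the pipeline easier to try out is to pass the collection in rather than creating the client inside the constructor. The sketch below assumes a pymongo-style collection exposing insert_one; MongoWriterPipeline and FakeCollection are hypothetical names used only for illustration, and the fake lets the example run without a database.

```python
class MongoWriterPipeline:
    """Scrapy-style pipeline that writes each item with insert_one."""

    def __init__(self, collection):
        # `collection` is assumed to behave like a pymongo Collection.
        self.collection = collection

    def process_item(self, item, spider):
        self.collection.insert_one(dict(item))
        return item


class FakeCollection:
    """In-memory stand-in for a pymongo Collection (illustration only)."""

    def __init__(self):
        self.docs = []

    def insert_one(self, doc):
        self.docs.append(doc)


fake = FakeCollection()
pipeline = MongoWriterPipeline(fake)
returned = pipeline.process_item({'title': 'demo', 'resp_no': 3}, spider=None)
print(fake.docs)
```

In a real project the collection would come from something like the client and database names used in the post.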

The scraped data ends up stored in MongoDB.

Original article
Please credit the source when reposting: http://30daydo.com/publish/article/351

MongoDB running in Docker: mongoexport from outside cannot export the data and fails with Unrecognized field 'snapshot'

Posted by 李魔佛 • 0 comments • 11036 views • 2018-08-31 14:21 • From related topics

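## Update 2019-03-19: problem solved
Baffling. I have not found the root cause yet. (A likely explanation is a version mismatch: the 3.4 mongoexport client sends a snapshot option that a newer server no longer accepts.)

MongoDB runs inside Docker, with its data directory mounted on the host and port 27017 exposed.
On Windows, I used the mongoexport tool to export the data.

Error message:

C:\Program Files\MongoDB\Server\3.4\bin>mongoexport.exe /h 10.18.6.102 /d stock /c company /o company.json /type json
2018-08-31T14:13:47.841+0800    connected to: 10.18.6.102
2018-08-31T14:13:47.854+0800    Failed: Failed to parse: { find: "company", filter: {}, sort: {}, skip: 0, snapshot: true, $readPreference: { mode: "secondaryPreferred" }, $db: "stock" }. Unrecognized field 'snapshot'.

C:\Program Files\MongoDB\Server\3.4\bin>

The problem is now solved:
Enter the Docker container and run the export inside it, writing the output to the mounted directory. The exported file can then be picked up directly on the host.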
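One way to script that workaround is to build a docker exec command that runs mongoexport inside the container. This is a sketch under assumptions: the container name mongo and the output path /data/db/company.json are placeholders, and the output path must lie inside whatever directory is actually mounted from the host.

```python
import shlex


def build_export_command(container, db, collection, out_path):
    """Build a `docker exec` command that runs mongoexport inside the container."""
    return [
        'docker', 'exec', container,
        'mongoexport',
        '--db', db,
        '--collection', collection,
        '--type', 'json',
        '--out', out_path,
    ]


# 'mongo' and '/data/db/company.json' are assumed names for this sketch.
cmd = build_export_command('mongo', 'stock', 'company', '/data/db/company.json')
print(' '.join(shlex.quote(part) for part in cmd))
# To actually run it: subprocess.run(cmd, check=True)
```

Because the export uses the mongoexport binary shipped in the container, the client and server versions match.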

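Compatibility between Django versions is too much trouble

Posted by 李魔佛 • 0 comments • 3922 views • 2018-08-26 18:20 • From related topics

It is a real trap for newcomers: between versions, even minor ones, many functions get changed or removed outright. Painful.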

Moving a large MongoDB dataset (>3GB) to MySQL with Python

Posted by 李魔佛 • 0 comments • 5615 views • 2018-08-20 15:44 • From related topics

The data is about 5 GB. Iterating document by document with

for i in doc.find({})

makes the cursor time out, and the loop gets slower the further it goes.

So I switched to traversing the collection in chunks.

# -*-coding=utf-8-*-
import pandas as pd
import json
import pymongo
from sqlalchemy import create_engine

# Move data from MongoDB to MySQL

client = pymongo.MongoClient('xxx')
doc = client['spider']['meituan']
engine = create_engine('mysql+pymysql://xxx:xxx@xxx:/xxx?charset=utf8')


def classic_method():
    temp = []
    start = 0
    # With too much data this still exhausts memory or loses the cursor
    for i in doc.find().batch_size(500):
        start += 1
        del i['_id']
        temp.append(i)
        print(start)

    print('start to save to mysql')
    df = pd.read_json(json.dumps(temp))
    df = df.set_index('poiid', drop=True)
    df.to_sql('meituan', con=engine, if_exists='replace')
    print('done')


def chunksize_move():
    block = 10000
    # count_documents() replaces the deprecated cursor.count()
    total = doc.count_documents({})
    iter_number = total // block

    for i in range(iter_number + 1):
        small_part = doc.find({}).limit(block).skip(i * block)
        list_data = []

        for item in small_part:
            del item['_id']
            del item['crawl_time']
            item['poiid'] = int(item['poiid'])
            for k, v in item.items():
                # Nested dicts and lists are serialized to JSON text before storing
                if isinstance(v, (dict, list)):
                    item[k] = json.dumps(v, ensure_ascii=False)

            list_data.append(item)

        # The last block is empty when total is an exact multiple of block
        if not list_data:
            break

        df = pd.DataFrame(list_data)
        df = df.set_index('poiid', drop=True)

        try:
            df.to_sql('meituan', con=engine, if_exists='append')
            print('to sql {}'.format(i))
        except Exception as e:
            print(e)


chunksize_move()

This is considerably faster than writing everything in one batch.
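The chunked approach above pages with skip(), which generally has to walk past all skipped documents on every request, so later pages may still slow down. One possible alternative, sketched here under assumptions, is to read a single cursor once and cut it into fixed-size batches on the client. The helper name iter_chunks and the fake_cursor stand-in are made up for this illustration; with a real collection you would pass doc.find({}) instead, and whether a long-lived cursor times out still depends on the server settings.

```python
from itertools import islice


def iter_chunks(rows, size):
    """Yield lists of at most `size` items from any iterable, reading it only once."""
    iterator = iter(rows)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


# Stand-in for a MongoDB cursor: any iterable of dicts behaves the same way here.
fake_cursor = ({'poiid': n, 'name': 'shop-{}'.format(n)} for n in range(25))

sizes = [len(chunk) for chunk in iter_chunks(fake_cursor, 10)]
print(sizes)
```

Each yielded chunk could be turned into a DataFrame and appended with to_sql, exactly as in chunksize_move(), without issuing a new skip() query per block.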

Python: migrating MongoDB data to MySQL

李魔佛 posted an article • 0 comments • 5143 views • 2018-08-20 11:02 • from related topics

The code is below; it is short.

import json

import pandas as pd
import pymongo
from sqlalchemy import create_engine

# Move data from MongoDB to MySQL

client = pymongo.MongoClient('10.18.6.101')
doc = client['spider']['meituan']
engine = create_engine('mysql+pymysql://localhost:1234@10.18.4.211/spider?charset=utf8')
temp = []

# Load every document into memory, dropping MongoDB's _id field
for i in doc.find({}):
    del i['_id']
    temp.append(i)
print('start to save to mysql')
df = pd.read_json(json.dumps(temp))
df = df.set_index('poiid', drop=True)
df.to_sql('meituan', con=engine, if_exists='replace')
print('done')

Surprisingly, CPU usage shot up to 90%.

Python json.loads: dicts in a file must not use single quotes

李魔佛 posted an article • 0 comments • 5650 views • 2018-08-20 09:28 • from related topics

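json.loads cannot parse a dict stored in a file if it is written with single quotes.

Either change the quotes to double quotes, or read the file with eval (only for files you trust, since eval executes whatever the file contains):

with open('cookies', 'r') as f:
    # js = json.load(f)
    js = eval(f.read())  # f.read() returns the file's content as a string
# cookie = js.get('Cookie', '')
headers = js.get('headers', '')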
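If the file only holds plain Python literals (dicts, lists, strings, numbers), the standard library's ast.literal_eval is a safer stand-in for eval, because it refuses anything that is not a literal. A minimal sketch; the text variable below is an invented sample standing in for the content of the cookies file.

```python
import ast
import json

# Invented sample: a dict written with single quotes, as it might appear in the file.
text = "{'headers': {'User-Agent': 'demo'}, 'Cookie': 'a=1'}"

# json.loads rejects single-quoted keys and strings.
try:
    json.loads(text)
    json_ok = True
except json.JSONDecodeError:
    json_ok = False

# literal_eval parses Python literals without executing arbitrary code.
js = ast.literal_eval(text)
headers = js.get('headers', '')

print(json_ok)
print(headers)
```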

The current way to write logs in Scrapy

python crawler • 李魔佛 posted an article • 0 comments • 4681 views • 2018-08-15 15:01 • from related topics

The old way:

from scrapy import log
log.msg("This is a warning", level=log.WARNING)

Adding logging in a Spider

The recommended way to log from a spider used to be the Spider's log() method, which automatically fills in the spider argument when scrapy.log.start() is called.

Any other arguments are passed straight to msg().

The scrapy.log module: scrapy.log.start(logfile=None, loglevel=None, logstdout=None) starts logging. It must be called before anything is logged; messages logged before the call are lost.

But running this produces a warning:

[py.warnings] WARNING: E:\git\CrawlMan\bilibili\bilibili\spiders\bili.py:14: ScrapyDeprecationWarning: log.msg has been deprecated, create a python logger and log through it instead
  log.msg

So log.msg is no longer recommended by the Scrapy project.

The current usage:

# -*- coding: utf-8 -*-
import scrapy
from scrapy_splash import SplashRequest
import logging
# from scrapy import log


class BiliSpider(scrapy.Spider):
    name = 'ordinary'  # the spider name used when launching the crawl
    allowed_domains = ["bilibili.com"]
    start_urls = [
        "https://www.bilibili.com/"
    ]

    def parse(self, response):
        logging.info('====================================================')
        content = response.xpath("//div[@class='num-wrap']").extract_first()
        logging.info(content)
        logging.info('====================================================')
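The deprecation message asks for "a python logger", which in practice means a named logger from the standard logging module rather than the module-level logging.info calls. The sketch below uses only the standard library so it runs without Scrapy; the helper build_logger and the logger name bili_demo are invented for this illustration. Inside a spider, Scrapy also exposes a ready-made named logger as self.logger.

```python
import io
import logging


def build_logger(name, stream):
    """Create a named logger that writes INFO and above to the given stream."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter('%(name)s %(levelname)s: %(message)s'))
    logger.addHandler(handler)
    logger.propagate = False  # keep the demo output out of the root logger
    return logger


buffer = io.StringIO()
log = build_logger('bili_demo', buffer)

log.info('parsed %d items', 3)
log.debug('not shown at INFO level')

print(buffer.getvalue(), end='')
```

A named logger makes it possible to filter or redirect one spider's messages without touching the rest of the application.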

adbapi query statements -- python3

李魔佛 posted an article • 0 comments • 4453 views • 2018-08-12 19:40 • from related topics

Introduction to Twisted Enterprise
Abstract

Twisted is an asynchronous networking framework, but most database API implementations unfortunately have blocking interfaces -- for this reason, twisted.enterprise.adbapi was created. It is a non-blocking interface to the standardized DB-API 2.0 API, which allows you to access a number of different RDBMSes.

What you should already know

Python :-)
How to write a simple Twisted Server (see this tutorial to learn how)
Familiarity with using database interfaces (see the documentation for DBAPI 2.0 or this article by Andrew Kuchling)

Quick Overview

Twisted is an asynchronous framework. This means standard database modules cannot be used directly, as they typically work something like:

# Create connection...
db = dbmodule.connect('mydb', 'andrew', 'password')
# ...which blocks for an unknown amount of time

# Create a cursor
cursor = db.cursor()

# Do a query...
resultset = cursor.query('SELECT * FROM table WHERE ...')
# ...which could take a long time, perhaps even minutes.

Those delays are unacceptable when using an asynchronous framework such as Twisted. For this reason, Twisted provides twisted.enterprise.adbapi, an asynchronous wrapper for any DB-API 2.0-compliant module. It is currently best tested with the pyPgSQL module for PostgreSQL.

enterprise.adbapi will do blocking database operations in separate threads, which trigger callbacks in the originating thread when they complete. In the meantime, the original thread can continue doing normal work, like servicing other requests.

How do I use adbapi?

Rather than creating a database connection directly, use the adbapi.ConnectionPool class to manage connections for you. This allows enterprise.adbapi to use multiple connections, one per thread. This is easy:

# Using the "dbmodule" from the previous example, create a ConnectionPool
from twisted.enterprise import adbapi
dbpool = adbapi.ConnectionPool("dbmodule", 'mydb', 'andrew', 'password')

Things to note about doing this:

There is no need to import dbmodule directly. You just pass the name to adbapi.ConnectionPool's constructor.
The parameters you would pass to dbmodule.connect are passed as extra arguments to adbapi.ConnectionPool's constructor. Keyword parameters work as well.
You may also control the size of the connection pool with the keyword parameters cp_min and cp_max. The default minimum and maximum values are 3 and 5.

So, now you need to be able to dispatch queries to your ConnectionPool. We do this by subclassing adbapi.Augmentation. Here's an example:

class AgeDatabase(adbapi.Augmentation):
    """A simple example that can retrieve an age from the database"""
    def getAge(self, name):
        # Define the query
        sql = """SELECT Age FROM People WHERE name = ?"""
        # Run the query, and return a Deferred to the caller to add
        # callbacks to.
        return self.runQuery(sql, name)

def gotAge(resultlist, name):
    """Callback for handling the result of the query"""
    age = resultlist[0][0]          # First field of first record
    print("%s is %d years old" % (name, age))

db = AgeDatabase(dbpool)

# These will *not* block. Hooray!
db.getAge("Andrew").addCallbacks(gotAge, db.operationError,
                                 callbackArgs=("Andrew",))
db.getAge("Glyph").addCallbacks(gotAge, db.operationError,
                                callbackArgs=("Glyph",))

# Of course, nothing will happen until the reactor is started
from twisted.internet import reactor
reactor.run()

This is straightforward, except perhaps for the return value of getAge. It returns a twisted.internet.defer.Deferred, which allows arbitrary callbacks to be called upon completion (or upon failure). More documentation on Deferred is available here.

Also worth noting is that this example assumes that dbmodule uses the qmark paramstyle (see the DB-API specification). If your dbmodule uses a different paramstyle (e.g. pyformat) then use that. Twisted doesn't attempt to offer any sort of magic parameter munging -- runQuery(query, params, ...) maps directly onto cursor.execute(query, params, ...).

And that's it!

That's all you need to know to use a database from within Twisted. You probably should read the adbapi module's documentation to get an idea of the other functions it has, but hopefully this document presents the core ideas.

Validating an ID card number in Python

李魔佛 posted an article • 0 comments • 6696 views • 2018-08-10 13:56 • from related topics

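Given an ID card number, check whether the 18-character number is valid and look up its information (gender, date of birth, region).

How the check works

Multiply each of the first 17 digits of the ID number by its coefficient. From the first digit to the seventeenth, the coefficients are: 7 9 10 5 8 4 2 1 6 3 7 9 10 5 8 4 2
Add up the 17 products.
Divide the sum by 11 and take the remainder.
The remainder can only be one of the 11 values <0 1 2 3 4 5 6 7 8 9 10>, which map to the last character of the ID number <1 0 X 9 8 7 6 5 4 3 2> respectively.
So if the remainder is 2, the 18th character is the Roman numeral X; if the remainder is 10, the last character is 2.

Example: a man's ID number is 34052419800101001X, and we want to check whether it is valid.

First: the weighted sum of the first 17 digits is 189.

Then: 189 divided by 11 leaves a remainder of 2.

Finally: by the mapping above, remainder 2 corresponds to X, so this is a valid ID number.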
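The arithmetic of the worked example can be reproduced in a few lines. This is a small sketch of the checksum step only; the names WEIGHTS, CHECK_CHARS, weighted_sum and check_char are chosen for this illustration.

```python
# Coefficients for digits 1..17, as listed in the rule above.
WEIGHTS = [7, 9, 10, 5, 8, 4, 2, 1, 6, 3, 7, 9, 10, 5, 8, 4, 2]
# Check character indexed by remainder 0..10.
CHECK_CHARS = '10X98765432'


def weighted_sum(id17):
    """Sum of each of the first 17 digits multiplied by its coefficient."""
    return sum(int(digit) * weight for digit, weight in zip(id17, WEIGHTS))


def check_char(id17):
    """Expected 18th character for the given first 17 digits."""
    return CHECK_CHARS[weighted_sum(id17) % 11]


sample = '34052419800101001X'
total = weighted_sum(sample[:17])
print(total, total % 11, check_char(sample[:17]))  # 189 2 X
```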

The code:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import json

# data.json maps 6-digit region codes to region names
with open('data.json', 'r', encoding='utf8') as json_data:
    city = json.load(json_data)


def check_valid(idcard):
    # Region code, date of birth, region name
    city_id = idcard[:6]
    print(city_id)
    birth = idcard[6:14]

    city_name = city.get(city_id, 'Not found')

    # Verify the check character against the rule above
    idcard_tuple = [int(num) for num in idcard[:-1]]
    coefficient = [7, 9, 10, 5, 8, 4, 2, 1, 6, 3, 7, 9, 10, 5, 8, 4, 2]
    sum_value = sum([idcard_tuple[i] * coefficient[i] for i in range(17)])

    remainder = sum_value % 11

    maptable = {0: '1', 1: '0', 2: 'X', 3: '9', 4: '8', 5: '7', 6: '6', 7: '5', 8: '4', 9: '3', 10: '2'}

    # Accept a lowercase x as the check character as well
    if maptable[remainder] == idcard[17].upper():
        print('<ID number is valid>')
        # The 17th digit is odd for male, even for female
        sex = int(idcard[16]) % 2
        sex = 'male' if sex == 1 else 'female'
        print('Gender: ' + sex)
        birth_format = "{}-{}-{}".format(birth[:4], birth[4:6], birth[6:8])
        print('Date of birth: ' + birth_format)
        print('Region: ' + city_name)
        return True
    else:
        print('<ID number is invalid>')
        return False


if __name__ == '__main__':
    idcard = input('Enter the ID card number: ').strip()
    check_valid(idcard)

Source on GitHub: https://github.com/Rockyzsu/IdentityCheck
Original article; please credit the source when reposting.
http://30daydo.com/article/340

Python SQLAlchemy ORM: adding column comments

李魔佛 posted an article • 0 comments • 4409 views • 2018-06-25 16:17 • from related topics

SQLAlchemy has to be updated to a recent version; older versions do not support this.

When defining the ORM class:

class CreditRecord(Base):
    __tablename__ = 'tb_PersonPunishment'

    id = Column(Integer, primary_key=True, autoincrement=True)
    name = Column(String(180), comment='名字')

Just add a comment argument to the Column.

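Installing the MySQLdb library for Python 3 on Windows 7

李魔佛 posted an article • 0 comments • 3267 views • 2018-06-20 18:04 • from related topics

There is no MySQLdb library for Python 3; download the mysqlclient library from here as a replacement: https://www.lfd.uci.edu/~gohlke/pythonlibs/#mysqlclient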

The Python 3 way of defining an abstract class is not compatible with Python 2

李魔佛 posted an article • 0 comments • 5001 views • 2018-06-10 20:54 • from related topics

In Python 3, the new-style way to define an abstract class is:

from abc import ABCMeta, abstractmethod


class Server(metaclass=ABCMeta):

    @abstractmethod
    def __init__(self):
        pass

    def __str__(self):
        return self.name

    @abstractmethod
    def boot(self):
        pass

    @abstractmethod
    def kill(self):
        pass

But Python 2 reports a syntax error for this form.

In Python 2 an abstract class has to be defined like this:

from abc import ABCMeta, abstractmethod


class Server(object):
    __metaclass__ = ABCMeta

    @abstractmethod
    def __init__(self):
        pass

    def __str__(self):
        return self.name

    @abstractmethod
    def boot(self):
        pass

    @abstractmethod
    def kill(self):
        pass

This form works in Python 2 and also runs without error in Python 3, but Python 3 ignores the __metaclass__ attribute, so the abstract methods are not enforced there. The Python 3 form only works in Python 3 and cannot run in Python 2.
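One way to get a single definition that enforces abstract methods without the Python 3-only keyword syntax is to build the base class by calling ABCMeta directly. A minimal sketch; CompatABC and WebServer are names made up for this illustration, and only the Python 3 behavior is exercised here.

```python
from abc import ABCMeta, abstractmethod

# Calling the metaclass directly avoids both `metaclass=` and `__metaclass__`.
CompatABC = ABCMeta('CompatABC', (object,), {})


class Server(CompatABC):

    @abstractmethod
    def boot(self):
        pass

    @abstractmethod
    def kill(self):
        pass


class WebServer(Server):
    """Concrete subclass that implements every abstract method."""

    def boot(self):
        return 'booting'

    def kill(self):
        return 'killed'


# Instantiating the abstract class must fail with TypeError.
try:
    Server()
    abstract_enforced = False
except TypeError:
    abstract_enforced = True

print(abstract_enforced)
print(WebServer().boot())
```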

Original URL:
http://30daydo.com/article/326
Reposting is welcome; please credit the source.

Rounding a NumPy array

李魔佛 posted an article • 0 comments • 14425 views • 2018-05-21 09:17 • from related topics

numpy.around(nlist, number)
Takes a NumPy array and the number of decimal places to keep as arguments.

Example:

import numpy as np
x = np.arange(10)
x = x / 77.0
print(x)

Output:

[0.         0.01298701 0.02597403 0.03896104 0.05194805 0.06493506
 0.07792208 0.09090909 0.1038961  0.11688312]

np.around(x, 3)   # keep 3 decimal places

array([0. , 0.013, 0.026, 0.039, 0.052, 0.065, 0.078, 0.091, 0.104, 0.117])

Controlling the shift direction of diff in pandas: shifting upward

李魔佛 posted an article • 0 comments • 6358 views • 2018-04-25 20:39 • from related topics

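Initialize a dataframe.

Then call diff with the default periods=1.

The row index stays the same while the data is shifted down one row before being subtracted, so each row holds its difference from the row above. With periods=2 the data is shifted down two rows instead.

To shift upward, just set periods to a negative value.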

Installing mpl_finance in Python [the finance module has been split out of matplotlib 2.0.2]

李魔佛 posted an article • 0 comments • 19690 views • 2018-04-23 23:17 • from related topics

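The latest matplotlib has split out its finance library. At the time of writing it has not been published to the pip repository, so pip install mpl_finance reports that the package cannot be found.

Solution:
Download the source from the official GitHub repository and install it locally. mpl_finance is still a dev version, but it works without major problems.

git clone git@github.com:matplotlib/mpl_finance.git

Once the download finishes, enter the directory and run: sudo python setup.py install

OK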

Python: finding the rows that differ between two MySQL databases with the same table structure

李魔佛 posted an article • 0 comments • 3899 views • 2018-04-14 11:11 • from related topics

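I keep both a local database and a remote one; the local copy is for reading offline. Code changes sometimes leave the remote data and the local data out of sync, and python + pandas makes it easy to pick out the rows that differ.

df_new[~(df_new['URL'].isin(df_old['URL'].values))]

Here df_old is the dataframe read from the local database and df_new is the remote data. Rows are compared on the unique key URL, so the expression returns the rows of df_new whose URL does not appear in df_old.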

Using the urldefrag function in urlparse

李魔佛 posted an article • 0 comments • 3415 views • 2018-03-11 17:59 • from related topics

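urlparse.urldefrag(url)

If url contains a fragment identifier, returns a modified version of url with no fragment identifier, and the fragment identifier as a separate string. If there is no fragment identifier in url, returns url unmodified and an empty string.

That is the official explanation: the function strips the fragment identifier from a URL. So what is a fragment identifier?
It is the part of the URL that starts at the # sign.
For example http://www.example.com/index.html#print

The # marks a position in the page, and the characters to its right identify that position.

Here it refers to the print position of index.html. After loading this URL, the browser automatically scrolls the print position into view.

There are two ways to give a position in a page an identifier: an anchor, such as <a name="print"></a>, or an id attribute, such as <div id="print" >.

So:
url='http://www.example.com/index.html#print'
url, fragment = urlparse.urldefrag(url)
The returned url is http://www.example.com/index.html and fragment is print. Both addresses point to the same page, so a crawler can use this to filter out duplicate pages.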
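In Python 3 the same function lives in urllib.parse. The sketch below shows how a crawler might use it to collapse links that differ only by fragment; the helper name unique_pages and the sample links are invented for this illustration.

```python
from urllib.parse import urldefrag


def unique_pages(urls):
    """Return URLs with fragments removed, keeping the first occurrence of each page."""
    seen = set()
    pages = []
    for url in urls:
        page, _fragment = urldefrag(url)  # (url without fragment, fragment)
        if page not in seen:
            seen.add(page)
            pages.append(page)
    return pages


links = [
    'http://www.example.com/index.html#print',
    'http://www.example.com/index.html#top',
    'http://www.example.com/about.html',
]
print(unique_pages(links))
```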

Changing the default year in strptime (datetime.strptime defaults to 1900)

李魔佛 posted an article • 0 comments • 4557 views • 2018-03-07 08:42 • from related topics

For example
import datetime
s='03-06 18:36'
news_time_f=datetime.datetime.strptime(s,'%m-%d %H:%M')
print(news_time_f)

The result is a datetime, but the year is 1900.
1900-03-06 18:36:00

There are two fixes:
1. Prepend the year to the date string
news_time_f=datetime.datetime.strptime('2018-'+s,'%Y-%m-%d %H:%M')

2. Use the built-in replace method
s='03-06 18:36'
news_time_f=datetime.datetime.strptime(s,'%m-%d %H:%M')
news_time_f=news_time_f.replace(year=2018)

Either method turns 03-06 18:36
into the datetime 2018-03-06 18:36:00
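Both approaches can be checked side by side. One practical difference: because the default year 1900 is not a leap year, a string such as 02-29 cannot be parsed without a year, so prepending the year is the more robust of the two. A small sketch; the helper name parse_with_year is made up for this illustration.

```python
from datetime import datetime


def parse_with_year(text, year):
    """Parse an 'MM-DD HH:MM' string, supplying the year explicitly."""
    return datetime.strptime('{}-{}'.format(year, text), '%Y-%m-%d %H:%M')


s = '03-06 18:36'

# Method 1: prepend the year before parsing.
by_prefix = parse_with_year(s, 2018)

# Method 2: parse first (year defaults to 1900), then replace the year.
by_replace = datetime.strptime(s, '%m-%d %H:%M').replace(year=2018)

print(by_prefix)
print(by_prefix == by_replace)

# Only method 1 can handle a leap day, since 1900-02-29 does not exist.
leap_day = parse_with_year('02-29 08:00', 2020)
print(leap_day)
```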

Message: invalid selector: Compound class names not permitted

python crawler • 李魔佛 posted an article • 0 comments • 4274 views • 2018-01-30 00:59 • from related topics

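When using selenium, looking up an element by class name with a value that contains a space (several class names at once), for example
driver.find_element_by_class_name("login-tab login-tab-r")
raises this error:
Message: invalid selector: Compound class names not permitted

For example, on the JD.com login page: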

<div id="content">
    <div class="login-wrap">
        <div class="w">
            <div class="login-form">
                <div class="login-tab login-tab-l">
                    <a href="javascript:void(0)" clstag="pageclick|keycount|201607144|1"> 扫码登录</a>
                </div>
                <div class="login-tab login-tab-r">
                    <a href="javascript:void(0)" clstag="pageclick|keycount|201607144|2">账户登录</a>
                </div>
                <div class="login-box">
                    <div class="mt tab-h">
                    </div>
                    <div class="msg-wrap">
                        <div class="msg-error hide"><b></b></div>
                    </div>

The element I want is <div class="login-tab login-tab-r">

so a CSS selector should be used instead, chaining the class names with dots:

browser.find_element_by_css_selector('div.login-tab.login-tab-r')

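How to keep the output of different programs in PyCharm's console window

李魔佛 posted an article • 0 comments • 10909 views • 2018-01-27 20:31 • from related topics

By default, each run clears the previous result, which is inconvenient when you want to run several programs and compare their output.
Right-click in PyCharm's console and choose "Pin Tab" from the menu. The current console tab is then no longer cleared and stays open until you close it yourself.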

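Python: simulated login to vexx.pro to fetch your total assets, coin value and other account information

python crawler • 李魔佛 posted an article • 0 comments • 4457 views • 2018-01-10 03:22 • from related topics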

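Every time I log in to vexx.pro, the first captcha is rejected even when typed correctly, so the captcha has to be entered twice. To save some time I wrote a Python program that simulates the login and fetches my account information automatically.

# -*-coding=utf-8-*-

import requests

session = requests.Session()
user = ''
password = ''


def getCode():
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36'}
    # Download the captcha image so it can be read by hand
    url = 'http://vexx.pro/verify/code.html'
    s = session.get(url=url, headers=headers)

    with open('code.png', 'wb') as f:
        f.write(s.content)

    code = input('input the code: ')
    print('code is ', code)

    login_url = 'http://vexx.pro/login/up_login.html'
    post_data = {
        'moble': user,
        'mobles': '+86',
        'password': password,
        'verify': code,
        'login_token': ''}

    # Log in with the same session that fetched the captcha
    login_s = session.post(url=login_url, headers=headers, data=post_data)
    print(login_s.status_code)

    # Query the account's total assets
    zzc_url = 'http://vexx.pro/ajax/check_zzc/'
    zzc_s = session.get(url=zzc_url, headers=headers)
    print(zzc_s.text)


def main():
    getCode()


if __name__ == '__main__':
    main()

Fill in your own username and password; the captcha has to be typed once along the way.
The session can be saved locally so that the next run does not need the password again.

Postscript: a few months later this site turned out to be a cash-grab exit scam, and logging in no longer works. I hope nobody falls for it again.
Original URL: http://30daydo.com/article/263
Please credit the source when reposting.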
