python 下使用beautifulsoup还是lxml ?

刚开始接触爬虫是从beautifulsoup开始的,觉得beautifulsoup很好用。 然后后面又因为使用scrapy的缘故,接触到lxml。 到底哪一个更加好用?
 
然后看了下beautifulsoup的源码,其实现原理使用的是正则表达式,而lxml使用的节点递归的技术。
 


Don't use BeautifulSoup, use lxml.soupparser then you're sitting on top of the power of lxml and can use the good bits of BeautifulSoup which is to deal with really broken and crappy HTML.
 
 
 
9down vote
In summary, 

lxml
 is positioned as a lightning-fast production-quality html and xml parser that, by the way, also includes a 
soupparser
 module to fall back on BeautifulSoup's functionality. 
BeautifulSoup
is a one-person project, designed to save you time to quickly extract data out of poorly-formed html or xml.
lxml documentation says that both parsers have advantages and disadvantages. For this reason, 
lxml
 provides a 
soupparser
 so you can switch back and forth. Quoting,
[quote]
BeautifulSoup uses a different parsing approach. It is not a real HTML parser but uses regular expressions to dive through tag soup. It is therefore more forgiving in some cases and less good in others. It is not uncommon that lxml/libxml2 parses and fixes broken HTML better, but BeautifulSoup has superiour support for encoding detection. It very much depends on the input which parser works better.


In the end they are saying,


The downside of using this parser is that it is much slower than the HTML parser of lxml. So if performance matters, you might want to consider using soupparser only as a fallback for certain cases.


If I understand them correctly, it means that the soup parser is more robust --- it can deal with a "soup" of malformed tags by using regular expressions --- whereas 
lxml
 is more straightforward and just parses things and builds a tree as you would expect. I assume it also applies to 
BeautifulSoup
 itself, not just to the 
soupparser
 for 
lxml
.
They also show how to benefit from 
BeautifulSoup
's encoding detection, while still parsing quickly with 
lxml
:
[code]>>> from BeautifulSoup import UnicodeDammit

>>> def decode_html(html_string):
... converted = UnicodeDammit(html_string, isHTML=True)
... if not converted.unicode:
... raise UnicodeDecodeError(
... "Failed to detect encoding, tried [%s]",
... ', '.join(converted.triedEncodings))
... # print converted.originalEncoding
... return converted.unicode

>>> root = lxml.html.fromstring(decode_html(tag_soup))
[/code]
(Same source: http://lxml.de/elementsoup.html).
In words of 
BeautifulSoup
's creator,


That's it! Have fun! I wrote Beautiful Soup to save everybody time. Once you get used to it, you should be able to wrangle data out of poorly-designed websites in just a few minutes. Send me email if you have any comments, run into problems, or want me to know about your project that uses Beautiful Soup.

[code] --Leonard
[/code]


Quoted from the Beautiful Soup documentation.
I hope this is now clear. The soup is a brilliant one-person project designed to save you time to extract data out of poorly-designed websites. The goal is to save you time right now, to get the job done, not necessarily to save you time in the long term, and definitely not to optimize the performance of your software.
Also, from the lxml website,


lxml has been downloaded from the Python Package Index more than two million times and is also available directly in many package distributions, e.g. for Linux or MacOS-X.


And, from Why lxml?,


The C libraries libxml2 and libxslt have huge benefits:... Standards-compliant... Full-featured... fast. fast! FAST! ... lxml is a new Python binding for libxml2 and libxslt...


[/quote]
意思大概就是 不要用Beautifulsoup,使用lxml, lxml才能让你提要到让你体会到html节点解析的速度之快。
 
  

0 个评论

要回复文章请先登录注册