Comparison of HTML parsers

HTML parsers are software for automated Hypertext Markup Language (HTML) parsing. They have two main purposes:

  • HTML traversal: offer programmers an interface to easily access and modify the HTML source (the "HTML string code"). Canonical example: DOM parsers.
  • HTML cleaning: fix invalid HTML and improve the layout and indent style of the resulting markup. Canonical example: HTML Tidy.
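
As a small illustration of the traversal purpose, the sketch below uses Python's standard-library html.parser (one of the parsers compared in this article) to collect link targets. Note that html.parser is event-based rather than a DOM parser, so this shows callback-style traversal only; the names LinkCollector, collect_links and sample are made up for this example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> start tag seen while parsing."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased; attrs is a list of
        # (name, value) pairs, where value may be None for bare attributes.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def collect_links(html):
    """Hypothetical helper: return all link targets found in an HTML string."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


sample = '<p>See <a href="https://example.org/a">A</a> and <A HREF="/b">B</a>.</p>'
print(collect_links(sample))  # prints ['https://example.org/a', '/b']
```

A DOM-style parser would instead build a tree that can be queried and modified after parsing; the callback approach above only observes tags as they are read.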

| Parser | License | Implementation language(s) | Latest date* | HTML parsing[1] | Clean HTML** | Update HTML*** |
|---|---|---|---|---|---|---|
| html.parser | Python Software Foundation License | Python | 2015-02-25[2] | Yes | No | No |
| Html Agility Pack | Microsoft Public License | C# | 2014-09-16[3] | Yes | No | ? |
| Beautiful Soup (based on lxml and html5lib)[4] | Python Software Foundation License | Python | 2015-07-03 | Yes | Yes | Yes |
| Gumbo | Apache License 2.0 | C | 2013-08-13 | Yes | ? | ? |
| html5lib | MIT License | Python (and PHP, six years ago) | 2013-12-23[5] | Yes | Yes | No |
| HTML::Parser | Perl license | Perl | 2013-03-28 | Yes[6] | ? | ? |
| htmlPurifier | GNU Lesser GPL | PHP | 2009-03-25[7] | No | Yes | Yes |
| HTML Tidy | W3C license | ANSI C | 2015-05-24[8] | No[9] | Yes[10] | Yes[11] |
| HtmlUnit | Apache License 2.0 | Java | 2014-06-02 (version 2.15) | Yes | No | No |
| HtmlCleaner | BSD License[12] | Java | 2015-08-24 | No | Yes | ? |
| Hubbub | MIT License | C | 2013-04-19 | Yes | ? | ? |
| Jaunt API | Jaunt Beta License | Java | 2013-08-01 | Yes | Yes | No |
| Jericho HTML Parser | Eclipse Public License | Java | 2012-10-30[13] | No?? | ? | ? |
| jsdom | MIT license | JavaScript | 2013-07-21 | No | ? | ? |
| jsoup | MIT license | Java | 2016-04-16[14] | Yes | Yes | Yes |
| JTidy | JTidy License | Java | 2012-10-09[15] | No | Yes | ? |
| libxml2 HTMLparser | MIT License | C | 2012-09-11[16] | Yes | ? | ? |
| NekoHTML | Apache License 2.0 | Java | 2014-06-02[17] | No | ? | ? |
| TagSoup | Apache License 2.0 | Java | 2011-07-07 | No | ? | ? |
| Validator.nu HTML Parser | MIT License | Java | 2012-06-05 | Yes | ? | ? |
| PHP Simple HTML DOM Parser | MIT License | PHP | 2014-08-28 | Yes | No | No |
| The PHP DOMDocument-class | PHP License | PHP | 2014-10-04 | Yes | No | No |
| Nokogiri | MIT License | Ruby | 2015-01-23[18] | Yes | No | No |
| AVHTML | AGPL | C++ | 2015-07-17 | Yes | No | Yes |

* Date of the latest release with significant changes.
** Sanitizes HTML code (generates standards-compliant web pages, reduces spam, etc.) and cleans it (strips out surplus presentational tags, removes XSS code, etc.).
*** Updates HTML 4.x to XHTML or to HTML5, converting deprecated tags (e.g. CENTER) to valid ones (e.g. DIV with style="text-align:center;").
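
To make the "Update HTML" column more concrete, here is a minimal sketch of the CENTER-to-DIV conversion mentioned in the last note, written with Python's standard-library html.parser. It is not how any parser in the table implements updating; CenterUpdater and update_center are hypothetical names, and the sketch deliberately ignores any attributes on the CENTER tag, processing instructions, and other deprecated elements.

```python
from html.parser import HTMLParser


class CenterUpdater(HTMLParser):
    """Re-emit markup, replacing <center> with a styled <div>."""

    def __init__(self):
        # convert_charrefs=False keeps entities such as &amp; as separate
        # events so they can be written back unchanged.
        super().__init__(convert_charrefs=False)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag == "center":
            # Assumption for this sketch: attributes on <center> are dropped.
            self.out.append('<div style="text-align:center;">')
        else:
            # Pass every other start tag through exactly as it was written.
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.out.append("</div>" if tag == "center" else f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def update_center(html):
    """Hypothetical helper: return html with CENTER elements rewritten."""
    parser = CenterUpdater()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


print(update_center("<CENTER>Hello <b>world</b></CENTER>"))
# prints <div style="text-align:center;">Hello <b>world</b></div>
```

Dedicated tools such as HTML Tidy handle many more cases (nesting repairs, other deprecated tags, doctype changes); the sketch only shows the shape of a single tag rewrite.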

References

  1. 12.2 Parsing HTML documents — HTML Standard
  2. Python 3.4.3
  3. Nuget Html AgilityPack
  4. http://www.crummy.com/software/BeautifulSoup/
  5. Releases · html5lib/html5lib-python
  6. Bug #53300 for HTML-Parser: HTML 5
  7. HTML Tidy for Windows
  8. HTML Tidy release 4.9.30
  9. What is Tidy?
  10. What is Tidy?
  11. What is Tidy?
  12. HtmlCleaner is distributed under BSD License
  13. Jericho HTML Parser - Browse /jericho-html/3.3 at SourceForge.net
  14. jsoup release 1.9.1
  15. JTidy - Browse /JTidy at SourceForge.net
  16. libxml2 Releases
  17. NekoHTML | Change History
  18. Nokogiri release 1.6.6.2