Diffstat (limited to 'bitbake/lib/bs4/CHANGELOG')
-rw-r--r-- | bitbake/lib/bs4/CHANGELOG | 1839 |
1 files changed, 1839 insertions, 0 deletions
diff --git a/bitbake/lib/bs4/CHANGELOG b/bitbake/lib/bs4/CHANGELOG
new file mode 100644
index 0000000000..2701446a6d
--- /dev/null
+++ b/bitbake/lib/bs4/CHANGELOG
@@ -0,0 +1,1839 @@
1 | = 4.12.3 (20240117) | ||
2 | |||
3 | * The Beautiful Soup documentation now has a Spanish translation, thanks | ||
4 | to Carlos Romero. Delong Wang's Chinese translation has been updated | ||
5 | to cover Beautiful Soup 4.12.0. | ||
6 | |||
7 | * Fixed a regression such that if you set .hidden on a tag, the tag | ||
8 | becomes invisible but its contents are still visible. User manipulation | ||
9 | of .hidden is not a documented or supported feature, so don't do this, | ||
10 | but it wasn't too difficult to keep the old behavior working. | ||
11 | |||
12 | * Fixed a case found by Mengyuhan where html.parser giving up on | ||
13 | markup would result in an AssertionError instead of a | ||
14 | ParserRejectedMarkup exception. | ||
15 | |||
16 | * Added the correct stacklevel to instances of the XMLParsedAsHTMLWarning. | ||
17 | [bug=2034451] | ||
18 | |||
19 | * Corrected the syntax of the license definition in pyproject.toml. Patch | ||
20 | by Louis Maddox. [bug=2032848] | ||
21 | |||
22 | * Corrected a typo in a test that was causing test failures when run against | ||
23 | libxml2 2.12.1. [bug=2045481] | ||
24 | |||
25 | = 4.12.2 (20230407) | ||
26 | |||
27 | * Fixed an unhandled exception in BeautifulSoup.decode_contents | ||
28 | and methods that call it. [bug=2015545] | ||
29 | |||
30 | = 4.12.1 (20230405) | ||
31 | |||
32 | NOTE: the following things are likely to be dropped in the next | ||
33 | feature release of Beautiful Soup: | ||
34 | |||
35 | Official support for Python 3.6. | ||
36 | Inclusion of unit tests and test data in the wheel file. | ||
37 | Two scripts: demonstrate_parser_differences.py and test-all-versions. | ||
38 | |||
39 | Changes: | ||
40 | |||
41 | * This version of Beautiful Soup replaces setup.py and setup.cfg | ||
42 | with pyproject.toml. Beautiful Soup now uses tox as its test backend | ||
43 | and hatch to do builds. | ||
44 | |||
45 | * The main functional improvement in this version is a nonrecursive technique | ||
46 | for regenerating a tree. This technique is used to avoid situations where, | ||
47 | in previous versions, doing something to a very deeply nested tree | ||
48 | would overflow the Python interpreter stack: | ||
49 | |||
50 | 1. Outputting a tree as a string, e.g. with | ||
51 | BeautifulSoup.encode() [bug=1471755] | ||
52 | |||
53 | 2. Making copies of trees (copy.copy() and | ||
54 | copy.deepcopy() from the Python standard library). [bug=1709837] | ||
55 | |||
56 | 3. Pickling a BeautifulSoup object. (Note that pickling a Tag | ||
57 | object can still cause an overflow.) | ||
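The nonrecursive idea in the entry above can be illustrated with a small sketch. This is not Beautiful Soup's actual code: `Node` and `render` are hypothetical names, and the tree is a bare stand-in for a parse tree. The point is only that an explicit stack lets the output routine handle nesting far deeper than the interpreter's recursion limit.

```python
import sys


class Node:
    """Hypothetical stand-in for a tag: a name plus a list of children."""

    def __init__(self, name, children=None):
        self.name = name
        self.children = children if children is not None else []


def render(root):
    """Serialize a tree using an explicit stack instead of recursion."""
    pieces = []
    # Each stack entry is either a Node still to be opened,
    # or a closing-tag string waiting to be emitted.
    stack = [root]
    while stack:
        item = stack.pop()
        if isinstance(item, str):
            pieces.append(item)
            continue
        pieces.append("<%s>" % item.name)
        stack.append("</%s>" % item.name)
        # Push children in reverse so they are popped in document order.
        stack.extend(reversed(item.children))
    return "".join(pieces)


# Build a tree nested much deeper than the recursion limit.
depth = sys.getrecursionlimit() * 5
root = Node("div")
node = root
for _ in range(depth):
    child = Node("div")
    node.children.append(child)
    node = child

markup = render(root)
small = render(Node("a", [Node("b"), Node("c")]))
```

A recursive `render` would raise RecursionError on the deep tree; the stack-based version only grows a Python list.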
58 | |||
59 | * Making a copy of a BeautifulSoup object no longer parses the | ||
60 | document again, which should improve performance significantly. | ||
61 | |||
62 | * When a BeautifulSoup object is unpickled, Beautiful Soup now | ||
63 | tries to associate an appropriate TreeBuilder object with it. | ||
64 | |||
65 | * Tag.prettify() will now consistently end prettified markup with | ||
66 | a newline. | ||
67 | |||
68 | * Added unit tests for fuzz test cases created by third | ||
69 | parties. Some of these tests are skipped since they point | ||
70 | to problems outside of Beautiful Soup, but this change | ||
71 | puts them all in one convenient place. | ||
72 | |||
73 | * PageElement now implements the known_xml attribute. (This was technically | ||
74 | a bug, but it shouldn't be an issue in normal use.) [bug=2007895] | ||
75 | |||
76 | * The demonstrate_parser_differences.py script was still written in | ||
77 | Python 2. I've converted it to Python 3, but since no one has | ||
78 | mentioned this over the years, it's a sign that no one uses this | ||
79 | script and it's not serving its purpose. | ||
80 | |||
81 | = 4.12.0 (20230320) | ||
82 | |||
83 | * Introduced the .css property, which centralizes all access to | ||
84 | the Soup Sieve API. This allows Beautiful Soup to give direct | ||
85 | access to as much of Soup Sieve as makes sense, without cluttering | ||
86 | the BeautifulSoup and Tag classes with a lot of new methods. | ||
87 | |||
88 | This does mean one addition to the BeautifulSoup and Tag classes | ||
89 | (the .css property itself), so this might be a breaking change if you | ||
90 | happen to use Beautiful Soup to parse XML that includes a tag called | ||
91 | <css>. In particular, code like this will stop working in 4.12.0: | ||
92 | |||
93 | soup.css['id'] | ||
94 | |||
95 | Code like this will work just as before: | ||
96 | |||
97 | soup.find('css')['id'] | ||
98 | |||
99 | The Soup Sieve methods supported through the .css property are | ||
100 | select(), select_one(), iselect(), closest(), match(), filter(), | ||
101 | escape(), and compile(). The BeautifulSoup and Tag classes still | ||
102 | support the select() and select_one() methods; they have not been | ||
103 | deprecated, but they have been demoted to convenience methods. | ||
104 | |||
105 | [bug=2003677] | ||
106 | |||
107 | * When the html.parser parser decides it can't parse a document, Beautiful | ||
108 | Soup now consistently propagates this fact by raising a | ||
109 | ParserRejectedMarkup error. [bug=2007343] | ||
110 | |||
111 | * Removed some error checking code from diagnose(), which is redundant with | ||
112 | similar (but more Pythonic) code in the BeautifulSoup constructor. | ||
113 | [bug=2007344] | ||
114 | |||
115 | * Added intersphinx references to the documentation so that other | ||
116 | projects have a target to point to when they reference Beautiful | ||
117 | Soup classes. [bug=1453370] | ||
118 | |||
119 | = 4.11.2 (20230131) | ||
120 | |||
121 | * Fixed test failures caused by nondeterministic behavior of | ||
122 | UnicodeDammit's character detection, depending on the platform setup. | ||
123 | [bug=1973072] | ||
124 | |||
125 | * Fixed another crash when overriding multi_valued_attributes and using the | ||
126 | html5lib parser. [bug=1948488] | ||
127 | |||
128 | * The HTMLFormatter and XMLFormatter constructors no longer return a | ||
129 | value. [bug=1992693] | ||
130 | |||
131 | * Tag.interesting_string_types is now propagated when a tag is | ||
132 | copied. [bug=1990400] | ||
133 | |||
134 | * Warnings now do their best to provide an appropriate stacklevel, | ||
135 | improving the usefulness of the message. [bug=1978744] | ||
136 | |||
137 | * Passing a Tag's .contents into PageElement.extend() now works the | ||
138 | same way as passing the Tag itself. | ||
139 | |||
140 | * Soup Sieve tests will be skipped if the library is not installed. | ||
141 | |||
142 | = 4.11.1 (20220408) | ||
143 | |||
144 | This release was done to ensure that the unit tests are packaged along | ||
145 | with the released source. There are no functionality changes in this | ||
146 | release, but there are a few other packaging changes: | ||
147 | |||
148 | * The Japanese and Korean translations of the documentation are included. | ||
149 | * The changelog is now packaged as CHANGELOG, and the license file is | ||
150 | packaged as LICENSE. NEWS.txt and COPYING.txt are still present, | ||
151 | but may be removed in the future. | ||
152 | * TODO.txt is no longer packaged, since a TODO is not relevant for released | ||
153 | code. | ||
154 | |||
155 | = 4.11.0 (20220407) | ||
156 | |||
157 | * Ported unit tests to use pytest. | ||
158 | |||
159 | * Added special string classes, RubyParenthesisString and RubyTextString, | ||
160 | to make it possible to treat ruby text specially in get_text() calls. | ||
161 | [bug=1941980] | ||
162 | |||
163 | * It's now possible to customize the way output is indented by | ||
164 | providing a value for the 'indent' argument to the Formatter | ||
165 | constructor. The 'indent' argument works very similarly to the | ||
166 | argument of the same name in the Python standard library's | ||
167 | json.dump() function. [bug=1955497] | ||
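For reference, this is how the standard library's `indent` argument behaves. The sketch uses only `json.dumps()`; that the Formatter constructor accepts the same kinds of values is taken from the entry above ("works very similarly"), not verified here.

```python
import json

data = {"a": [1, 2]}

# An integer indent means "that many spaces per nesting level".
with_spaces = json.dumps(data, indent=2)

# A string indent is repeated verbatim once per nesting level.
with_tabs = json.dumps(data, indent="\t")
```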
168 | |||
169 | * If the charset-normalizer Python module | ||
170 | (https://pypi.org/project/charset-normalizer/) is installed, Beautiful | ||
171 | Soup will use it to detect the character sets of incoming documents. | ||
172 | This is also the module used by newer versions of the Requests library. | ||
173 | For the sake of backwards compatibility, chardet and cchardet both take | ||
174 | precedence if installed. [bug=1955346] | ||
175 | |||
176 | * Added a workaround for an lxml bug | ||
177 | (https://bugs.launchpad.net/lxml/+bug/1948551) that causes | ||
178 | problems when parsing a Unicode string beginning with BYTE ORDER MARK. | ||
179 | [bug=1947768] | ||
180 | |||
181 | * Issue a warning when an HTML parser is used to parse a document that | ||
182 | looks like XML but not XHTML. [bug=1939121] | ||
183 | |||
184 | * Do a better job of keeping track of namespaces as an XML document is | ||
185 | parsed, so that CSS selectors that use namespaces will do the right | ||
186 | thing more often. [bug=1946243] | ||
187 | |||
188 | * Some time ago, the misleadingly named "text" argument to find-type | ||
189 | methods was renamed to the more accurate "string." But this supposed | ||
190 | "renaming" didn't make it into important places like the method | ||
191 | signatures or the docstrings. That's corrected in this | ||
192 | version. "text" still works, but will give a DeprecationWarning. | ||
193 | [bug=1947038] | ||
194 | |||
195 | * Fixed a crash when pickling a BeautifulSoup object that has no | ||
196 | tree builder. [bug=1934003] | ||
197 | |||
198 | * Fixed a crash when overriding multi_valued_attributes and using the | ||
199 | html5lib parser. [bug=1948488] | ||
200 | |||
201 | * Standardized the wording of the MarkupResemblesLocatorWarning | ||
202 | warnings to omit untrusted input and make the warnings less | ||
203 | judgmental about what you ought to be doing. [bug=1955450] | ||
204 | |||
205 | * Removed support for the iconv_codec library, which doesn't seem | ||
206 | to exist anymore and was never put up on PyPI. (The closest | ||
207 | replacement on PyPI, iconv_codecs, is GPL-licensed, so we can't use | ||
208 | it--it's also quite old.) | ||
209 | |||
210 | = 4.10.0 (20210907) | ||
211 | |||
212 | * This is the first release of Beautiful Soup to only support Python | ||
213 | 3. I dropped Python 2 support to maintain support for newer versions | ||
214 | (58 and up) of setuptools. See: | ||
215 | https://github.com/pypa/setuptools/issues/2769 [bug=1942919] | ||
216 | |||
217 | * The behavior of methods like .get_text() and .strings now differs | ||
218 | depending on the type of tag. The change is visible with HTML tags | ||
219 | like <script>, <style>, and <template>. Starting in 4.9.0, methods | ||
220 | like get_text() returned no results on such tags, because the | ||
221 | contents of those tags are not considered 'text' within the document | ||
222 | as a whole. | ||
223 | |||
224 | But a user who calls script.get_text() is working from a different | ||
225 | definition of 'text' than a user who calls div.get_text()--otherwise | ||
226 | there would be no need to call script.get_text() at all. In 4.10.0, | ||
227 | the contents of (e.g.) a <script> tag are considered 'text' during a | ||
228 | get_text() call on the tag itself, but not considered 'text' during | ||
229 | a get_text() call on the tag's parent. | ||
230 | |||
231 | Because of this change, calling get_text() on each child of a tag | ||
232 | may now return a different result than calling get_text() on the tag | ||
233 | itself. That's because different tags now have different | ||
234 | understandings of what counts as 'text'. [bug=1906226] [bug=1868861] | ||
235 | |||
236 | * NavigableString and its subclasses now implement the get_text() | ||
237 | method, as well as the properties .strings and | ||
238 | .stripped_strings. These methods will either return the string | ||
239 | itself, or nothing, so the only reason to use them is when iterating | ||
240 | over a list of mixed Tag and NavigableString objects. [bug=1904309] | ||
241 | |||
242 | * The 'html5' formatter now treats attributes whose values are the | ||
243 | empty string as HTML boolean attributes. Previously (and in other | ||
244 | formatters), an attribute value must be set as None to be treated as | ||
245 | a boolean attribute. In a future release, I plan to also give this | ||
246 | behavior to the 'html' formatter. Patch by Isaac Muse. [bug=1915424] | ||
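The rule described above can be sketched in a few lines. `render_attribute` and its `empty_is_boolean` flag are hypothetical names used only for illustration; they are not part of the Beautiful Soup API, and real formatters also perform entity substitution, which is omitted here.

```python
def render_attribute(name, value, empty_is_boolean=False):
    """Render one attribute; a boolean attribute is output as a bare name."""
    # None always marks a boolean attribute. With empty_is_boolean set
    # (the behavior the entry describes for 'html5'), "" does too.
    if value is None or (empty_is_boolean and value == ""):
        return name
    return '%s="%s"' % (name, value)


html5_style = render_attribute("hidden", "", empty_is_boolean=True)
other_style = render_attribute("hidden", "")
explicit_none = render_attribute("hidden", None)
```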
247 | |||
248 | * The 'replace_with()' method now takes a variable number of arguments, | ||
249 | and can be used to replace a single element with a sequence of elements. | ||
250 | Patch by Bill Chandos. [rev=605] | ||
251 | |||
252 | * Corrected output when the namespace prefix associated with a | ||
253 | namespaced attribute is the empty string, as opposed to | ||
254 | None. [bug=1915583] | ||
255 | |||
256 | * Performance improvement when processing tags that speeds up overall | ||
257 | tree construction by 2%. Patch by Morotti. [bug=1899358] | ||
258 | |||
259 | * Corrected the use of special string container classes in cases when a | ||
260 | single tag may contain strings with different containers, such as | ||
261 | the <template> tag, which may contain both TemplateString objects | ||
262 | and Comment objects. [bug=1913406] | ||
263 | |||
264 | * The html.parser tree builder can now handle named entities | ||
265 | found in the HTML5 spec in much the same way that the html5lib | ||
266 | tree builder does. Note that the lxml HTML tree builder doesn't handle | ||
267 | named entities this way. [bug=1924908] | ||
268 | |||
269 | * Added a second way to specify encodings to UnicodeDammit and | ||
270 | EncodingDetector, based on the order of precedence defined in the | ||
271 | HTML5 spec, starting at: | ||
272 | https://html.spec.whatwg.org/multipage/parsing.html#parsing-with-a-known-character-encoding | ||
273 | |||
274 | Encodings in 'known_definite_encodings' are tried first, then | ||
275 | byte-order-mark sniffing is run, then encodings in 'user_encodings' | ||
276 | are tried. The old argument, 'override_encodings', is now a | ||
277 | deprecated alias for 'known_definite_encodings'. | ||
278 | |||
279 | This changes the default behavior of the html.parser and lxml tree | ||
280 | builders, in a way that may slightly improve encoding | ||
281 | detection but will probably have no effect. [bug=1889014] | ||
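The order of precedence described above can be sketched as follows. This is a simplified illustration, not the EncodingDetector implementation: `candidate_encodings` is a hypothetical function, only three byte-order marks are checked, and other detection steps (declared encodings, character-set detection libraries) are left out.

```python
import codecs

# Byte-order marks recognized by this sketch. UTF-32 marks are omitted
# for brevity; a fuller version would have to check them before UTF-16.
BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def candidate_encodings(data, known_definite_encodings=(), user_encodings=()):
    """Yield encodings to try, in the order the entry above describes."""
    # 1. Encodings the caller is certain about.
    yield from known_definite_encodings
    # 2. Byte-order-mark sniffing.
    for bom, name in BOMS:
        if data.startswith(bom):
            yield name
            break
    # 3. Encodings the caller merely suggests.
    yield from user_encodings


with_bom = codecs.BOM_UTF8 + "caf\u00e9".encode("utf-8")
order_with_bom = list(
    candidate_encodings(with_bom, ["windows-1252"], ["latin-1"])
)
order_without_bom = list(
    candidate_encodings(b"plain", ["windows-1252"], ["latin-1"])
)
```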
282 | |||
283 | * Improve the warning issued when a directory name (as opposed to | ||
284 | the name of a regular file) is passed as markup into the BeautifulSoup | ||
285 | constructor. [bug=1913628] | ||
286 | |||
287 | = 4.9.3 (20201003) | ||
288 | |||
289 | This is the final release of Beautiful Soup to support Python | ||
290 | 2. Beautiful Soup's official support for Python 2 ended on 01 January, | ||
291 | 2021. In the Launchpad Git repository, the final revision to support | ||
292 | Python 2 was revision 70f546b1e689a70e2f103795efce6d261a3dadf7; it is | ||
293 | tagged as "python2". | ||
294 | |||
295 | * Implemented a significant performance optimization to the process of | ||
296 | searching the parse tree. Patch by Morotti. [bug=1898212] | ||
297 | |||
298 | = 4.9.2 (20200926) | ||
299 | |||
300 | * Fixed a bug that caused too many tags to be popped from the tag | ||
301 | stack during tree building, when encountering a closing tag that had | ||
302 | no matching opening tag. [bug=1880420] | ||
303 | |||
304 | * Fixed a bug that inconsistently moved elements over when passing | ||
305 | a Tag, rather than a list, into Tag.extend(). [bug=1885710] | ||
306 | |||
307 | * Specify the soupsieve dependency in a way that complies with | ||
308 | PEP 508. Patch by Mike Nerone. [bug=1893696] | ||
309 | |||
310 | * Change the signatures for BeautifulSoup.insert_before and insert_after | ||
311 | (which are not implemented) to match PageElement.insert_before and | ||
312 | insert_after, quieting warnings in some IDEs. [bug=1897120] | ||
313 | |||
314 | = 4.9.1 (20200517) | ||
315 | |||
316 | * Added a keyword argument 'on_duplicate_attribute' to the | ||
317 | BeautifulSoupHTMLParser constructor (used by the html.parser tree | ||
318 | builder) which lets you customize the handling of markup that | ||
319 | contains the same attribute more than once, as in: | ||
320 | <a href="url1" href="url2"> [bug=1878209] | ||
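To show where such a hook fits, here is a sketch built directly on the standard library's `html.parser`, which reports a start tag's attributes as a list of (name, value) pairs, duplicates included. `AttributeCollector`, `parse`, and the strategy names "replace" and "ignore" are assumptions made for this illustration; consult the Beautiful Soup documentation for the values 'on_duplicate_attribute' actually accepts.

```python
from html.parser import HTMLParser


class AttributeCollector(HTMLParser):
    """Hypothetical helper that records the attributes of each start tag."""

    def __init__(self, on_duplicate="replace"):
        super().__init__()
        self.on_duplicate = on_duplicate
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of pairs, so repeated names are still visible.
        merged = {}
        for name, value in attrs:
            if name not in merged or self.on_duplicate == "replace":
                merged[name] = value  # first sighting, or last value wins
            elif callable(self.on_duplicate):
                self.on_duplicate(merged, name, value)
            # Otherwise ("ignore"): keep the first value seen.
        self.tags.append((tag, merged))


def parse(markup, on_duplicate="replace"):
    parser = AttributeCollector(on_duplicate)
    parser.feed(markup)
    parser.close()
    return parser.tags


def accumulate(attrs, name, value):
    """Custom strategy: collect every value of a repeated attribute."""
    if not isinstance(attrs[name], list):
        attrs[name] = [attrs[name]]
    attrs[name].append(value)


markup = '<a href="url1" href="url2">'
last_wins = parse(markup)
first_wins = parse(markup, on_duplicate="ignore")
keep_all = parse(markup, on_duplicate=accumulate)
```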
321 | |||
322 | * Added a distinct subclass, GuessedAtParserWarning, for the warning | ||
323 | issued when BeautifulSoup is instantiated without a parser being | ||
324 | specified. [bug=1873787] | ||
325 | |||
326 | * Added a distinct subclass, MarkupResemblesLocatorWarning, for the | ||
327 | warning issued when BeautifulSoup is instantiated with 'markup' that | ||
328 | actually seems to be a URL or the path to a file on | ||
329 | disk. [bug=1873787] | ||
330 | |||
331 | * The new NavigableString subclasses (Stylesheet, Script, and | ||
332 | TemplateString) can now be imported directly from the bs4 package. | ||
333 | |||
334 | * If you encode a document with a Python-specific encoding like | ||
335 | 'unicode_escape', that encoding is no longer mentioned in the final | ||
336 | XML or HTML document. Instead, encoding information is omitted or | ||
337 | left blank. [bug=1874955] | ||
338 | |||
339 | * Fixed test failures when run against soupsieve 2.0. Patch by Tomáš | ||
340 | Chvátal. [bug=1872279] | ||
341 | |||
342 | = 4.9.0 (20200405) | ||
343 | |||
344 | * Added PageElement.decomposed, a new property which lets you | ||
345 | check whether you've already called decompose() on a Tag or | ||
346 | NavigableString. | ||
347 | |||
348 | * Embedded CSS and Javascript is now stored in distinct Stylesheet and | ||
349 | Script tags, which are ignored by methods like get_text() since most | ||
350 | people don't consider this sort of content to be 'text'. This | ||
351 | feature is not supported by the html5lib treebuilder. [bug=1868861] | ||
352 | |||
353 | * Added a Russian translation by 'authoress' to the repository. | ||
354 | |||
355 | * Fixed an unhandled exception when formatting a Tag that had been | ||
356 | decomposed. [bug=1857767] | ||
357 | |||
358 | * Fixed a bug that happened when passing a Unicode filename containing | ||
359 | non-ASCII characters as markup into Beautiful Soup, on a system that | ||
360 | allows Unicode filenames. [bug=1866717] | ||
361 | |||
362 | * Added a performance optimization to PageElement.extract(). Patch by | ||
363 | Arthur Darcet. | ||
364 | |||
365 | = 4.8.2 (20191224) | ||
366 | |||
367 | * Added Python docstrings to all public methods of the most commonly | ||
368 | used classes. | ||
369 | |||
370 | * Added a Chinese translation by Deron Wang and a Brazilian Portuguese | ||
371 | translation by Cezar Peixeiro to the repository. | ||
372 | |||
373 | * Fixed two deprecation warnings. Patches by Colin | ||
374 | Watson and Nicholas Neumann. [bug=1847592] [bug=1855301] | ||
375 | |||
376 | * The html.parser tree builder now correctly handles DOCTYPEs that are | ||
377 | not uppercase. [bug=1848401] | ||
378 | |||
379 | * PageElement.select() now returns a ResultSet rather than a regular | ||
380 | list, making it consistent with methods like find_all(). | ||
381 | |||
382 | = 4.8.1 (20191006) | ||
383 | |||
384 | * When the html.parser or html5lib parsers are in use, Beautiful Soup | ||
385 | will, by default, record the position in the original document where | ||
386 | each tag was encountered. This includes line number (Tag.sourceline) | ||
387 | and position within a line (Tag.sourcepos). Based on code by Chris | ||
388 | Mayo. [bug=1742921] | ||
389 | |||
390 | * When instantiating a BeautifulSoup object, it's now possible to | ||
391 | provide a dictionary ('element_classes') of the classes you'd like to be | ||
392 | instantiated instead of Tag, NavigableString, etc. | ||
393 | |||
394 | * Fixed the definition of the default XML namespace when using | ||
395 | lxml 4.4. Patch by Isaac Muse. [bug=1840141] | ||
396 | |||
397 | * Fixed a crash when pretty-printing tags that were not created | ||
398 | during initial parsing. [bug=1838903] | ||
399 | |||
400 | * Copying a Tag preserves information that was originally obtained from | ||
401 | the TreeBuilder used to build the original Tag. [bug=1838903] | ||
402 | |||
403 | * Raise an explanatory exception when the underlying parser | ||
404 | completely rejects the incoming markup. [bug=1838877] | ||
405 | |||
406 | * Avoid a crash when trying to detect the declared encoding of a | ||
407 | Unicode document. [bug=1838877] | ||
408 | |||
409 | * Avoid a crash when unpickling certain parse trees generated | ||
410 | using html5lib on Python 3. [bug=1843545] | ||
411 | |||
412 | = 4.8.0 (20190720, "One Small Soup") | ||
413 | |||
414 | This release focuses on making it easier to customize Beautiful Soup's | ||
415 | input mechanism (the TreeBuilder) and output mechanism (the Formatter). | ||
416 | |||
417 | * You can customize the TreeBuilder object by passing keyword | ||
418 | arguments into the BeautifulSoup constructor. Those keyword | ||
419 | arguments will be passed along into the TreeBuilder constructor. | ||
420 | |||
421 | The main reason to do this right now is to change which | ||
422 | attributes are treated as multi-valued attributes (the way 'class' | ||
423 | is treated by default). You can do this with the | ||
424 | 'multi_valued_attributes' argument. [bug=1832978] | ||
425 | |||
426 | * The role of Formatter objects has been greatly expanded. The Formatter | ||
427 | class now controls the following: | ||
428 | |||
429 | - The function to call to perform entity substitution. (This was | ||
430 | previously Formatter's only job.) | ||
431 | - Which tags should be treated as containing CDATA and have their | ||
432 | contents exempt from entity substitution. | ||
433 | - The order in which a tag's attributes are output. [bug=1812422] | ||
434 | - Whether or not to put a '/' inside a void element, e.g. '<br/>' vs '<br>' | ||
435 | |||
436 | All preexisting code should work as before. | ||
437 | |||
438 | * Added a new method to the API, Tag.smooth(), which consolidates | ||
439 | multiple adjacent NavigableString elements. [bug=1697296] | ||
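The consolidation step can be pictured with plain Python lists. In this sketch `smooth` is a standalone hypothetical function, strings stand in for NavigableString objects, and tuples stand in for tags; the real method operates on a Tag's own children.

```python
from itertools import groupby


def smooth(children):
    """Merge runs of adjacent strings; leave every other child untouched."""
    result = []
    runs = groupby(children, key=lambda child: isinstance(child, str))
    for is_text, run in runs:
        if is_text:
            result.append("".join(run))
        else:
            result.extend(run)
    return result


children = ["Hello, ", "world", ("br",), "Good", "bye"]
smoothed = smooth(children)
```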
440 | |||
441 | * &apos; (which is valid in XML, XHTML, and HTML 5, but not HTML 4) is always | ||
442 | recognized as a named entity and converted to a single quote. [bug=1818721] | ||
443 | |||
444 | = 4.7.1 (20190106) | ||
445 | |||
446 | * Fixed a significant performance problem introduced in 4.7.0. [bug=1810617] | ||
447 | |||
448 | * Fixed an incorrectly raised exception when inserting a tag before or | ||
449 | after an identical tag. [bug=1810692] | ||
450 | |||
451 | * Beautiful Soup will no longer try to keep track of namespaces that | ||
452 | are not defined with a prefix; this can confuse soupsieve. [bug=1810680] | ||
453 | |||
454 | * Tried even harder to avoid the deprecation warning originally fixed in | ||
455 | 4.6.1. [bug=1778909] | ||
456 | |||
457 | = 4.7.0 (20181231) | ||
458 | |||
459 | * Beautiful Soup's CSS Selector implementation has been replaced by a | ||
460 | dependency on Isaac Muse's SoupSieve project (the soupsieve package | ||
461 | on PyPI). The good news is that SoupSieve has a much more robust and | ||
462 | complete implementation of CSS selectors, resolving a large number | ||
463 | of longstanding issues. The bad news is that from this point onward, | ||
464 | SoupSieve must be installed if you want to use the select() method. | ||
465 | |||
466 | You don't have to change anything if you installed Beautiful Soup | ||
467 | through pip (SoupSieve will be automatically installed when you | ||
468 | upgrade Beautiful Soup) or if you don't use CSS selectors from | ||
469 | within Beautiful Soup. | ||
470 | |||
471 | SoupSieve documentation: https://facelessuser.github.io/soupsieve/ | ||
472 | |||
473 | * Added the PageElement.extend() method, which works like list.extend(). | ||
474 | [bug=1514970] | ||
475 | |||
476 | * PageElement.insert_before() and insert_after() now take a variable | ||
477 | number of arguments. [bug=1514970] | ||
478 | |||
479 | * Fix a number of problems with the tree builder that caused | ||
480 | trees that were superficially okay, but which fell apart when bits | ||
481 | were extracted. Patch by Isaac Muse. [bug=1782928,1809910] | ||
482 | |||
483 | * Fixed a problem with the tree builder in which elements that | ||
484 | contained no content (such as empty comments and all-whitespace | ||
485 | elements) were not being treated as part of the tree. Patch by Isaac | ||
486 | Muse. [bug=1798699] | ||
487 | |||
488 | * Fixed a problem with multi-valued attributes where the value | ||
489 | contained whitespace. Thanks to Jens Svalgaard for the | ||
490 | fix. [bug=1787453] | ||
491 | |||
492 | * Clarified ambiguous license statements in the source code. Beautiful | ||
493 | Soup is released under the MIT license, and has been since 4.4.0. | ||
494 | |||
495 | * This file has been renamed from NEWS.txt to CHANGELOG. | ||
496 | |||
497 | = 4.6.3 (20180812) | ||
498 | |||
499 | * Exactly the same as 4.6.2. Re-released to make the README file | ||
500 | render properly on PyPI. | ||
501 | |||
502 | = 4.6.2 (20180812) | ||
503 | |||
504 | * Fix an exception when a custom formatter was asked to format a void | ||
505 | element. [bug=1784408] | ||
506 | |||
507 | = 4.6.1 (20180728) | ||
508 | |||
509 | * Stop data loss when encountering an empty numeric entity, and | ||
510 | possibly in other cases. Thanks to tos.kamiya for the fix. [bug=1698503] | ||
511 | |||
512 | * Preserve XML namespaces introduced inside an XML document, not just | ||
513 | the ones introduced at the top level. [bug=1718787] | ||
514 | |||
515 | * Added a new formatter, "html5", which represents void elements | ||
516 | as "<element>" rather than "<element/>". [bug=1716272] | ||
517 | |||
518 | * Fixed a problem where the html.parser tree builder interpreted | ||
519 | a string like "&foo " as the character entity "&foo;" [bug=1728706] | ||
520 | |||
521 | * Correctly handle invalid HTML numeric character entities like &#147; | ||
522 | which reference code points that are not Unicode code points. Note | ||
523 | that this is only fixed when Beautiful Soup is used with the | ||
524 | html.parser parser -- html5lib already worked and I couldn't fix it | ||
525 | with lxml. [bug=1782933] | ||
526 | |||
527 | * Improved the warning given when no parser is specified. [bug=1780571] | ||
528 | |||
529 | * When markup contains duplicate elements, a select() call that | ||
530 | includes multiple match clauses will match all relevant | ||
531 | elements. [bug=1770596] | ||
532 | |||
533 | * Fixed code that was causing deprecation warnings in recent Python 3 | ||
534 | versions. Includes a patch from Ville Skyttä. [bug=1778909] [bug=1689496] | ||
535 | |||
536 | * Fixed a Windows crash in diagnose() when checking whether a long | ||
537 | markup string is a filename. [bug=1737121] | ||
538 | |||
539 | * Stopped HTMLParser from raising an exception in very rare cases of | ||
540 | bad markup. [bug=1708831] | ||
541 | |||
542 | * Fixed a bug where find_all() was not working when asked to find a | ||
543 | tag with a namespaced name in an XML document that was parsed as | ||
544 | HTML. [bug=1723783] | ||
545 | |||
546 | * You can get finer control over formatting by subclassing | ||
547 | bs4.element.Formatter and passing a Formatter instance into (e.g.) | ||
548 | encode(). [bug=1716272] | ||
549 | |||
550 | * You can pass a dictionary of `attrs` into | ||
551 | BeautifulSoup.new_tag. This makes it possible to create a tag with | ||
552 | an attribute like 'name' that would otherwise be masked by another | ||
553 | argument of new_tag. [bug=1779276] | ||
554 | |||
555 | * Clarified the deprecation warning when accessing tag.fooTag, to cover | ||
556 | the possibility that you might really have been looking for a tag | ||
557 | called 'fooTag'. | ||
558 | |||
559 | = 4.6.0 (20170507) = | ||
560 | |||
561 | * Added the `Tag.get_attribute_list` method, which acts like `Tag.get` for | ||
562 | getting the value of an attribute, but which always returns a list, | ||
563 | whether or not the attribute is a multi-value attribute. [bug=1678589] | ||
564 | |||
565 | * It's now possible to use a tag's namespace prefix when searching, | ||
566 | e.g. soup.find('namespace:tag') [bug=1655332] | ||
567 | |||
568 | * Improved the handling of empty-element tags like <br> when using the | ||
569 | html.parser parser. [bug=1676935] | ||
570 | |||
571 | * HTML parsers treat all HTML4 and HTML5 empty element tags (aka void | ||
572 | element tags) correctly. [bug=1656909] | ||
573 | |||
574 | * Namespace prefix is preserved when an XML tag is copied. Thanks | ||
575 | to Vikas for a patch and test. [bug=1685172] | ||
576 | |||
577 | = 4.5.3 (20170102) = | ||
578 | |||
579 | * Fixed foster parenting when html5lib is the tree builder. Thanks to | ||
580 | Geoffrey Sneddon for a patch and test. | ||
581 | |||
582 | * Fixed yet another problem that caused the html5lib tree builder to | ||
583 | create a disconnected parse tree. [bug=1629825] | ||
584 | |||
585 | = 4.5.2 (20170102) = | ||
586 | |||
587 | * Apart from the version number, this release is identical to | ||
588 | 4.5.3. Due to user error, it could not be completely uploaded to | ||
589 | PyPI. Use 4.5.3 instead. | ||
590 | |||
591 | = 4.5.1 (20160802) = | ||
592 | |||
593 | * Fixed a crash when passing Unicode markup that contained a | ||
594 | processing instruction into the lxml HTML parser on Python | ||
595 | 3. [bug=1608048] | ||
596 | |||
597 | = 4.5.0 (20160719) = | ||
598 | |||
599 | * Beautiful Soup is no longer compatible with Python 2.6. This | ||
600 | actually happened a few releases ago, but it's now official. | ||
601 | |||
602 | * Beautiful Soup will now work with versions of html5lib greater than | ||
603 | 0.99999999. [bug=1603299] | ||
604 | |||
605 | * If a search against each individual value of a multi-valued | ||
606 | attribute fails, the search will be run one final time against the | ||
607 | complete attribute value considered as a single string. That is, if | ||
608 | a tag has class="foo bar" and neither "foo" nor "bar" matches, but | ||
609 | "foo bar" does, the tag is now considered a match. | ||
610 | |||
611 | This happened in previous versions, but only when the value being | ||
612 | searched for was a string. Now it also works when that value is | ||
613 | a regular expression, a list of strings, etc. [bug=1476868] | ||
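The two-stage matching rule above can be sketched like this. `matches` and `attribute_matches` are hypothetical helpers written for illustration, and they cover only three kinds of search value (a string, a compiled regular expression, and a collection of strings).

```python
import re


def matches(pattern, candidate):
    """Test one candidate string against a string, regex, or collection."""
    if isinstance(pattern, re.Pattern):
        return pattern.search(candidate) is not None
    if isinstance(pattern, (list, tuple, set)):
        return candidate in pattern
    return candidate == pattern


def attribute_matches(pattern, attribute_value):
    """Apply the search to each value, then to the whole attribute."""
    # First pass: each individual value of the multi-valued attribute.
    if any(matches(pattern, value) for value in attribute_value.split()):
        return True
    # Final pass: the complete value considered as a single string.
    return matches(pattern, attribute_value)


css_class = "foo bar"
by_string = attribute_matches("foo bar", css_class)
by_regex = attribute_matches(re.compile(r"^foo bar$"), css_class)
by_list = attribute_matches(["foo bar", "baz"], css_class)
single_value = attribute_matches("foo", css_class)
no_match = attribute_matches("quux", css_class)
```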
614 | |||
615 | * Fixed a bug that deranged the tree when a whitespace element was | ||
616 | reparented into a tag that contained an identical whitespace | ||
617 | element. [bug=1505351] | ||
618 | |||
619 | * Added support for CSS selector values that contain quoted spaces, | ||
620 | such as tag[style="display: foo"]. [bug=1540588] | ||
621 | |||
622 | * Corrected handling of XML processing instructions. [bug=1504393] | ||
623 | |||
624 | * Corrected an encoding error that happened when a BeautifulSoup | ||
625 | object was copied. [bug=1554439] | ||
626 | |||
627 | * The contents of <textarea> tags will no longer be modified when the | ||
628 | tree is prettified. [bug=1555829] | ||
629 | |||
630 | * When a BeautifulSoup object is pickled but its tree builder cannot | ||
631 | be pickled, its .builder attribute is set to None instead of being | ||
632 | destroyed. This avoids a performance problem once the object is | ||
633 | unpickled. [bug=1523629] | ||
634 | |||
635 | * Specify the file and line number when warning about a | ||
636 | BeautifulSoup object being instantiated without a parser being | ||
637 | specified. [bug=1574647] | ||
638 | |||
639 | * The `limit` argument to `select()` now works correctly, though it's | ||
640 | not implemented very efficiently. [bug=1520530] | ||
641 | |||
642 | * Fixed a Python 3 BytesWarning when a URL was passed in as though it | ||
643 | were markup. Thanks to James Salter for a patch and | ||
644 | test. [bug=1533762] | ||
645 | |||
646 | * We don't run the check for a filename passed in as markup if the | ||
647 | 'filename' contains a less-than character; the less-than character | ||
648 | indicates it's most likely a very small document. [bug=1577864] | ||
649 | |||
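The multi-valued attribute fallback described in this release (bug=1476868) can be sketched roughly as follows. This is a minimal illustration, not Beautiful Soup's actual search code: the helper names `matches` and `attribute_matches` are hypothetical, and only strings, compiled regular expressions, and lists of those are handled.

```python
import re

def matches(value, pattern):
    """Match one string against a str, a compiled regex, or a list of either."""
    if isinstance(pattern, (list, tuple)):
        return any(matches(value, p) for p in pattern)
    if hasattr(pattern, "search"):  # compiled regular expression
        return pattern.search(value) is not None
    return value == pattern

def attribute_matches(values, pattern):
    """Try each value of a multi-valued attribute, then the joined string."""
    if any(matches(v, pattern) for v in values):
        return True
    # Fallback: treat the complete attribute value as a single string.
    return matches(" ".join(values), pattern)

classes = ["foo", "bar"]  # hypothetical values parsed from class="foo bar"
print(attribute_matches(classes, "foo"))                      # an individual value matches
print(attribute_matches(classes, re.compile(r"^foo bar$")))   # only the joined string matches
print(attribute_matches(classes, ["baz", "foo bar"]))         # list of strings, joined match
print(attribute_matches(classes, "baz"))                      # nothing matches
```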
650 | = 4.4.1 (20150928) = | ||
651 | |||
652 | * Fixed a bug that deranged the tree when part of it was | ||
653 | removed. Thanks to Eric Weiser for the patch and John Wiseman for a | ||
654 | test. [bug=1481520] | ||
655 | |||
656 | * Fixed a parse bug with the html5lib tree-builder. Thanks to Roel | ||
657 | Kramer for the patch. [bug=1483781] | ||
658 | |||
659 | * Improved the implementation of CSS selector grouping. Thanks to | ||
660 | Orangain for the patch. [bug=1484543] | ||
661 | |||
662 | * Fixed the test_detect_utf8 test so that it works when chardet is | ||
663 | installed. [bug=1471359] | ||
664 | |||
665 | * Corrected the output of Declaration objects. [bug=1477847] | ||
666 | |||
667 | |||
668 | = 4.4.0 (20150703) = | ||
669 | |||
670 | Especially important changes: | ||
671 | |||
672 | * Added a warning when you instantiate a BeautifulSoup object without | ||
673 | explicitly naming a parser. [bug=1398866] | ||
674 | |||
675 | * __repr__ now returns an ASCII bytestring in Python 2, and a Unicode | ||
676 | string in Python 3, instead of a UTF8-encoded bytestring in both | ||
677 | versions. In Python 3, __str__ now returns a Unicode string instead | ||
678 | of a bytestring. [bug=1420131] | ||
679 | |||
680 | * The `text` argument to the find_* methods is now called `string`, | ||
681 | which is more accurate. `text` still works, but `string` is the | ||
682 | argument described in the documentation. `text` may eventually | ||
683 | change its meaning, but not for a very long time. [bug=1366856] | ||
684 | |||
685 | * Changed the way soup objects work under copy.copy(). Copying a | ||
686 | NavigableString or a Tag will give you a new object that's | ||
687 | equal to the old one but not connected to the parse tree. Patch by | ||
688 | Martijn Pieters. [bug=1307490] | ||
689 | |||
690 | * Started using a standard MIT license. [bug=1294662] | ||
691 | |||
692 | * Added a Chinese translation of the documentation by Delong .w. | ||
693 | |||
694 | New features: | ||
695 | |||
696 | * Introduced the select_one() method, which uses a CSS selector but | ||
697 | only returns the first match, instead of a list of | ||
698 | matches. [bug=1349367] | ||
699 | |||
700 | * You can now create a Tag object without specifying a | ||
701 | TreeBuilder. Patch by Martijn Pieters. [bug=1307471] | ||
702 | |||
703 | * You can now create a NavigableString or a subclass just by invoking | ||
704 | the constructor. [bug=1294315] | ||
705 | |||
706 | * Added an `exclude_encodings` argument to UnicodeDammit and to the | ||
707 | Beautiful Soup constructor, which lets you prohibit the detection of | ||
708 | an encoding that you know is wrong. [bug=1469408] | ||
709 | |||
710 | * The select() method now supports selector grouping. Patch by | ||
711 | Francisco Canas [bug=1191917] | ||
712 | |||
713 | Bug fixes: | ||
714 | |||
715 | * Fixed yet another problem that caused the html5lib tree builder to | ||
716 | create a disconnected parse tree. [bug=1237763] | ||
717 | |||
718 | * Force object_was_parsed() to keep the tree intact even when an element | ||
719 | from later in the document is moved into place. [bug=1430633] | ||
720 | |||
721 | * Fixed yet another bug that caused a disconnected tree when html5lib | ||
722 | copied an element from one part of the tree to another. [bug=1270611] | ||
723 | |||
724 | * Fixed a bug where Element.extract() could create an infinite loop in | ||
725 | the remaining tree. | ||
726 | |||
727 | * The select() method can now find tags whose names contain | ||
728 | dashes. Patch by Francisco Canas. [bug=1276211] | ||
729 | |||
730 | * The select() method can now find tags with attributes whose names | ||
731 | contain dashes. Patch by Marek Kapolka. [bug=1304007] | ||
732 | |||
733 | * Improved the lxml tree builder's handling of processing | ||
734 | instructions. [bug=1294645] | ||
735 | |||
736 | * Restored the helpful syntax error that happens when you try to | ||
737 | import the Python 2 edition of Beautiful Soup under Python | ||
738 | 3. [bug=1213387] | ||
739 | |||
740 | * In Python 3.4 and above, set the new convert_charrefs argument to | ||
741 | the html.parser constructor to avoid a warning and future | ||
742 | failures. Patch by Stefano Revera. [bug=1375721] | ||
743 | |||
744 | * The warning when you pass in a filename or URL as markup will now be | ||
745 | displayed correctly even if the filename or URL is a Unicode | ||
746 | string. [bug=1268888] | ||
747 | |||
748 | * If the initial <html> tag contains a CDATA list attribute such as | ||
749 | 'class', the html5lib tree builder will now turn its value into a | ||
750 | list, as it would with any other tag. [bug=1296481] | ||
751 | |||
752 | * Fixed an import error in Python 3.5 caused by the removal of the | ||
753 | HTMLParseError class. [bug=1420063] | ||
754 | |||
755 | * Improved docstring for encode_contents() and | ||
756 | decode_contents(). [bug=1441543] | ||
757 | |||
758 | * Fixed a crash in Unicode, Dammit's encoding detector when the name | ||
759 | of the encoding itself contained invalid bytes. [bug=1360913] | ||
760 | |||
761 | * Improved the exception raised when you call .unwrap() or | ||
762 | .replace_with() on an element that's not attached to a tree. | ||
763 | |||
764 | * Raise a NotImplementedError whenever an unsupported CSS pseudoclass | ||
765 | is used in select(). Previously some cases did not result in a | ||
766 | NotImplementedError. | ||
767 | |||
768 | * It's now possible to pickle a BeautifulSoup object no matter which | ||
769 | tree builder was used to create it. However, the only tree builder | ||
770 | that survives the pickling process is the HTMLParserTreeBuilder | ||
771 | ('html.parser'). If you unpickle a BeautifulSoup object created with | ||
772 | some other tree builder, soup.builder will be None. [bug=1231545] | ||
773 | |||
774 | = 4.3.2 (20131002) = | ||
775 | |||
776 | * Fixed a bug in which short Unicode input was improperly encoded to | ||
777 | ASCII when checking whether or not it was the name of a file on | ||
778 | disk. [bug=1227016] | ||
779 | |||
780 | * Fixed a crash when a short input contains data not valid in | ||
781 | filenames. [bug=1232604] | ||
782 | |||
783 | * Fixed a bug that caused Unicode data put into UnicodeDammit to | ||
784 | return None instead of the original data. [bug=1214983] | ||
785 | |||
786 | * Combined two tests to stop a spurious test failure when tests are | ||
787 | run by nosetests. [bug=1212445] | ||
788 | |||
789 | = 4.3.1 (20130815) = | ||
790 | |||
791 | * Fixed yet another problem with the html5lib tree builder, caused by | ||
792 | html5lib's tendency to rearrange the tree during | ||
793 | parsing. [bug=1189267] | ||
794 | |||
795 | * Fixed a bug that caused the optimized version of find_all() to | ||
796 | return nothing. [bug=1212655] | ||
797 | |||
798 | = 4.3.0 (20130812) = | ||
799 | |||
800 | * Instead of converting incoming data to Unicode and feeding it to the | ||
801 | lxml tree builder in chunks, Beautiful Soup now makes successive | ||
802 | guesses at the encoding of the incoming data, and tells lxml to | ||
803 | parse the data as that encoding. Giving lxml more control over the | ||
804 | parsing process improves performance and avoids a number of bugs and | ||
805 | issues with the lxml parser which had previously required elaborate | ||
806 | workarounds: | ||
807 | |||
808 | - An issue in which lxml refuses to parse Unicode strings on some | ||
809 | systems. [bug=1180527] | ||
810 | |||
811 | - A returning bug that truncated documents longer than a (very | ||
812 | small) size. [bug=963880] | ||
813 | |||
814 | - A returning bug in which extra spaces were added to a document if | ||
815 | the document defined a charset other than UTF-8. [bug=972466] | ||
816 | |||
817 | This required a major overhaul of the tree builder architecture. If | ||
818 | you wrote your own tree builder and didn't tell me, you'll need to | ||
819 | modify your prepare_markup() method. | ||
820 | |||
821 | * The UnicodeDammit code that makes guesses at encodings has been | ||
822 | split into its own class, EncodingDetector. A lot of apparently | ||
823 | redundant code has been removed from Unicode, Dammit, and some | ||
824 | undocumented features have also been removed. | ||
825 | |||
826 | * Beautiful Soup will issue a warning if instead of markup you pass it | ||
827 | a URL or the name of a file on disk (a common beginner's mistake). | ||
828 | |||
829 | * A number of optimizations improve the performance of the lxml tree | ||
830 | builder by about 33%, the html.parser tree builder by about 20%, and | ||
831 | the html5lib tree builder by about 15%. | ||
832 | |||
833 | * All find_all calls should now return a ResultSet object. Patch by | ||
834 | Aaron DeVore. [bug=1194034] | ||
835 | |||
836 | = 4.2.1 (20130531) = | ||
837 | |||
838 | * The default XML formatter will now replace ampersands even if they | ||
839 | appear to be part of entities. That is, "&lt;" will become | ||
840 | "&amp;lt;". The old code was left over from Beautiful Soup 3, which | ||
841 | didn't always turn entities into Unicode characters. | ||
842 | |||
843 | If you really want the old behavior (maybe because you add new | ||
844 | strings to the tree, those strings include entities, and you want | ||
845 | the formatter to leave them alone on output), it can be found in | ||
846 | EntitySubstitution.substitute_xml_containing_entities(). [bug=1182183] | ||
847 | |||
848 | * Gave new_string() the ability to create subclasses of | ||
849 | NavigableString. [bug=1181986] | ||
850 | |||
851 | * Fixed another bug by which the html5lib tree builder could create a | ||
852 | disconnected tree. [bug=1182089] | ||
853 | |||
854 | * The .previous_element of a BeautifulSoup object is now always None, | ||
855 | not the last element to be parsed. [bug=1182089] | ||
856 | |||
857 | * Fixed test failures when lxml is not installed. [bug=1181589] | ||
858 | |||
859 | * html5lib now supports Python 3. Fixed some Python 2-specific | ||
860 | code in the html5lib test suite. [bug=1181624] | ||
861 | |||
862 | * The html.parser treebuilder can now handle numeric attributes in | ||
863 | text when the hexadecimal name of the attribute starts with a | ||
864 | capital X. Patch by Tim Shirley. [bug=1186242] | ||
865 | |||
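The difference between the new default XML formatter behavior in this release (every ampersand is escaped, so `&lt;` becomes `&amp;lt;`) and the old behavior (ampersands that look like part of an entity are left alone) can be sketched with the standard library. This is an assumption-laden illustration, not the code in EntitySubstitution: the function names and the regular expression are hypothetical.

```python
import re

# Matches an ampersand that is NOT already the start of something that
# looks like an entity, such as "&lt;", "&#233;" or "&#xE9;".
BARE_AMPERSAND = re.compile(r"&(?!#\d+;|#x[0-9a-fA-F]+;|\w+;)")

def escape_all(text):
    """New default: escape every ampersand and angle bracket."""
    return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")

def escape_preserving_entities(text):
    """Old behavior: leave ampersands alone when they look like entities."""
    text = BARE_AMPERSAND.sub("&amp;", text)
    return text.replace("<", "&lt;").replace(">", "&gt;")

sample = "AT&T &lt; 3"
print(escape_all(sample))                  # AT&amp;T &amp;lt; 3
print(escape_preserving_entities(sample))  # AT&amp;T &lt; 3
```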
866 | = 4.2.0 (20130514) = | ||
867 | |||
868 | * The Tag.select() method now supports a much wider variety of CSS | ||
869 | selectors. | ||
870 | |||
871 | - Added support for the adjacent sibling combinator (+) and the | ||
872 | general sibling combinator (~). Tests by "liquider". [bug=1082144] | ||
873 | |||
874 | - The combinators (>, +, and ~) can now combine with any supported | ||
875 | selector, not just one that selects based on tag name. | ||
876 | |||
877 | - Added limited support for the "nth-of-type" pseudo-class. Code | ||
878 | by Sven Slootweg. [bug=1109952] | ||
879 | |||
880 | * The BeautifulSoup class is now aliased to "_s" and "_soup", making | ||
881 | it quicker to type the import statement in an interactive session: | ||
882 | |||
883 | from bs4 import _s | ||
884 | or | ||
885 | from bs4 import _soup | ||
886 | |||
887 | The alias may change in the future, so don't use this in code you're | ||
888 | going to run more than once. | ||
889 | |||
890 | * Added the 'diagnose' submodule, which includes several useful | ||
891 | functions for reporting problems and doing tech support. | ||
892 | |||
893 | - diagnose(data) tries the given markup on every installed parser, | ||
894 | reporting exceptions and displaying successes. If a parser is not | ||
895 | installed, diagnose() mentions this fact. | ||
896 | |||
897 | - lxml_trace(data, html=True) runs the given markup through lxml's | ||
898 | XML parser or HTML parser, and prints out the parser events as | ||
899 | they happen. This helps you quickly determine whether a given | ||
900 | problem occurs in lxml code or Beautiful Soup code. | ||
901 | |||
902 | - htmlparser_trace(data) is the same thing, but for Python's | ||
903 | built-in HTMLParser class. | ||
904 | |||
905 | * In an HTML document, the contents of a <script> or <style> tag will | ||
906 | no longer undergo entity substitution by default. XML documents work | ||
907 | the same way they did before. [bug=1085953] | ||
908 | |||
909 | * Methods like get_text() and properties like .strings now only give | ||
910 | you strings that are visible in the document--no comments or | ||
911 | processing instructions. [bug=1050164] | ||
912 | |||
913 | * The prettify() method now leaves the contents of <pre> tags | ||
914 | alone. [bug=1095654] | ||
915 | |||
916 | * Fix a bug in the html5lib treebuilder which sometimes created | ||
917 | disconnected trees. [bug=1039527] | ||
918 | |||
919 | * Fix a bug in the lxml treebuilder which crashed when a tag included | ||
920 | an attribute from the predefined "xml:" namespace. [bug=1065617] | ||
921 | |||
922 | * Fix a bug by which keyword arguments to find_parent() were not | ||
923 | being passed on. [bug=1126734] | ||
924 | |||
925 | * Stop a crash when unwisely messing with a tag that's been | ||
926 | decomposed. [bug=1097699] | ||
927 | |||
928 | * Now that lxml's segfault on invalid doctype has been fixed, fixed a | ||
929 | corresponding problem on the Beautiful Soup end that was previously | ||
930 | invisible. [bug=984936] | ||
931 | |||
932 | * Fixed an exception when an overspecified CSS selector didn't match | ||
933 | anything. Code by Stefaan Lippens. [bug=1168167] | ||
934 | |||
935 | = 4.1.3 (20120820) = | ||
936 | |||
937 | * Skipped a test under Python 2.6 and Python 3.1 to avoid a spurious | ||
938 | test failure caused by the lousy HTMLParser in those | ||
939 | versions. [bug=1038503] | ||
940 | |||
941 | * Raise a more specific error (FeatureNotFound) when a requested | ||
942 | parser or parser feature is not installed. Raise NotImplementedError | ||
943 | instead of ValueError when the user calls insert_before() or | ||
944 | insert_after() on the BeautifulSoup object itself. Patch by Aaron | ||
945 | DeVore. [bug=1038301] | ||
946 | |||
947 | = 4.1.2 (20120817) = | ||
948 | |||
949 | * As per PEP-8, allow searching by CSS class using the 'class_' | ||
950 | keyword argument. [bug=1037624] | ||
951 | |||
952 | * Display namespace prefixes for namespaced attribute names, instead of | ||
953 | the fully-qualified names given by the lxml parser. [bug=1037597] | ||
954 | |||
955 | * Fixed a crash on encoding when an attribute name contained | ||
956 | non-ASCII characters. | ||
957 | |||
958 | * When sniffing encodings, if the cchardet library is installed, | ||
959 | Beautiful Soup uses it instead of chardet. cchardet is much | ||
960 | faster. [bug=1020748] | ||
961 | |||
962 | * Use logging.warning() instead of warnings.warn() to notify the user | ||
963 | that characters were replaced with REPLACEMENT | ||
964 | CHARACTER. [bug=1013862] | ||
965 | |||
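The "prefer cchardet, fall back to chardet" behavior mentioned in this release (bug=1020748) follows a common optional-import pattern. The sketch below shows that pattern in isolation; `load_detector` and `sniff_encoding` are hypothetical names, and this is not necessarily how Beautiful Soup organizes the check internally. Both third-party packages are optional here, so the code also runs when neither is installed.

```python
def load_detector():
    """Return (name, detect_function) for the best available detector."""
    try:
        import cchardet  # optional third-party package, the faster choice
        return "cchardet", cchardet.detect
    except ImportError:
        pass
    try:
        import chardet  # optional third-party package
        return "chardet", chardet.detect
    except ImportError:
        return None, None

def sniff_encoding(data):
    """Guess the encoding of a bytestring, or return None if no detector exists."""
    name, detect = load_detector()
    if detect is None:
        return None
    # Both libraries return a dict with an "encoding" key.
    return detect(data)["encoding"]

print(load_detector()[0])
print(sniff_encoding(b"plain ascii text"))
```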
966 | = 4.1.1 (20120703) = | ||
967 | |||
968 | * Fixed an html5lib tree builder crash which happened when html5lib | ||
969 | moved a tag with a multivalued attribute from one part of the tree | ||
970 | to another. [bug=1019603] | ||
971 | |||
972 | * Correctly display closing tags with an XML namespace declared. Patch | ||
973 | by Andreas Kostyrka. [bug=1019635] | ||
974 | |||
975 | * Fixed a typo that made parsing significantly slower than it should | ||
976 | have been, and that also caused tags with XML namespaces to be | ||
977 | closed too late. [bug=1020268] | ||
978 | |||
979 | * get_text() now returns an empty Unicode string if there is no text, | ||
980 | rather than an empty bytestring. [bug=1020387] | ||
981 | |||
982 | = 4.1.0 (20120529) = | ||
983 | |||
984 | * Added experimental support for fixing Windows-1252 characters | ||
985 | embedded in UTF-8 documents. (UnicodeDammit.detwingle()) | ||
986 | |||
987 | * Fixed the handling of &quot; with the built-in parser. [bug=993871] | ||
988 | |||
989 | * Comments, processing instructions, document type declarations, and | ||
990 | markup declarations are now treated as preformatted strings, the way | ||
991 | CData blocks are. [bug=1001025] | ||
992 | |||
993 | * Fixed a bug with the lxml treebuilder that prevented the user from | ||
994 | adding attributes to a tag that didn't originally have | ||
995 | attributes. [bug=1002378] Thanks to Oliver Beattie for the patch. | ||
996 | |||
997 | * Fixed some edge-case bugs having to do with inserting an element | ||
998 | into a tag it's already inside, and replacing one of a tag's | ||
999 | children with another. [bug=997529] | ||
1000 | |||
1001 | * Added the ability to search for attribute values specified in UTF-8. [bug=1003974] | ||
1002 | |||
1003 | This caused a major refactoring of the search code. All the tests | ||
1004 | pass, but it's possible that some searches will behave differently. | ||
1005 | |||
1006 | = 4.0.5 (20120427) = | ||
1007 | |||
1008 | * Added a new method, wrap(), which wraps an element in a tag. | ||
1009 | |||
1010 | * Renamed replace_with_children() to unwrap(), which is easier to | ||
1011 | understand and also the jQuery name of the function. | ||
1012 | |||
1013 | * Made encoding substitution in <meta> tags completely transparent (no | ||
1014 | more %SOUP-ENCODING%). | ||
1015 | |||
1016 | * Fixed a bug in decoding data that contained a byte-order mark, such | ||
1017 | as data encoded in UTF-16LE. [bug=988980] | ||
1018 | |||
1019 | * Fixed a bug that made the HTMLParser treebuilder generate XML | ||
1020 | definitions ending with two question marks instead of | ||
1021 | one. [bug=984258] | ||
1022 | |||
1023 | * Upon document generation, CData objects are no longer run through | ||
1024 | the formatter. [bug=988905] | ||
1025 | |||
1026 | * The test suite now passes when lxml is not installed, whether or not | ||
1027 | html5lib is installed. [bug=987004] | ||
1028 | |||
1029 | * Print a warning on HTMLParseErrors to let people know they should | ||
1030 | install a better parser library. | ||
1031 | |||
1032 | = 4.0.4 (20120416) = | ||
1033 | |||
1034 | * Fixed a bug that sometimes created disconnected trees. | ||
1035 | |||
1036 | * Fixed a bug with the string setter that moved a string around the | ||
1037 | tree instead of copying it. [bug=983050] | ||
1038 | |||
1039 | * Attribute values are now run through the provided output formatter. | ||
1040 | Previously they were always run through the 'minimal' formatter. In | ||
1041 | the future I may make it possible to specify different formatters | ||
1042 | for attribute values and strings, but for now, consistent behavior | ||
1043 | is better than inconsistent behavior. [bug=980237] | ||
1044 | |||
1045 | * Added the missing renderContents method from Beautiful Soup 3. Also | ||
1046 | added an encode_contents() method to go along with decode_contents(). | ||
1047 | |||
1048 | * Give a more useful error when the user tries to run the Python 2 | ||
1049 | version of BS under Python 3. | ||
1050 | |||
1051 | * UnicodeDammit can now convert Microsoft smart quotes to ASCII with | ||
1052 | UnicodeDammit(markup, smart_quotes_to="ascii"). | ||
1053 | |||
1054 | = 4.0.3 (20120403) = | ||
1055 | |||
1056 | * Fixed a typo that caused some versions of Python 3 to convert the | ||
1057 | Beautiful Soup codebase incorrectly. | ||
1058 | |||
1059 | * Got rid of the 4.0.2 workaround for HTML documents--it was | ||
1060 | unnecessary and the workaround was triggering a (possibly different, | ||
1061 | but related) bug in lxml. [bug=972466] | ||
1062 | |||
1063 | = 4.0.2 (20120326) = | ||
1064 | |||
1065 | * Worked around a possible bug in lxml that prevents non-tiny XML | ||
1066 | documents from being parsed. [bug=963880, bug=963936] | ||
1067 | |||
1068 | * Fixed a bug where specifying `text` while also searching for a tag | ||
1069 | only worked if `text` wanted an exact string match. [bug=955942] | ||
1070 | |||
1071 | = 4.0.1 (20120314) = | ||
1072 | |||
1073 | * This is the first official release of Beautiful Soup 4. There is no | ||
1074 | 4.0.0 release, to eliminate any possibility that packaging software | ||
1075 | might treat "4.0.0" as being an earlier version than "4.0.0b10". | ||
1076 | |||
1077 | * Brought BS up to date with the latest release of soupselect, adding | ||
1078 | CSS selector support for direct descendant matches and multiple CSS | ||
1079 | class matches. | ||
1080 | |||
1081 | = 4.0.0b10 (20120302) = | ||
1082 | |||
1083 | * Added support for simple CSS selectors, taken from the soupselect project. | ||
1084 | |||
1085 | * Fixed a crash when using html5lib. [bug=943246] | ||
1086 | |||
1087 | * In HTML5-style <meta charset="foo"> tags, the value of the "charset" | ||
1088 | attribute is now replaced with the appropriate encoding on | ||
1089 | output. [bug=942714] | ||
1090 | |||
1091 | * Fixed a bug that caused calling a tag to sometimes call find_all() | ||
1092 | with the wrong arguments. [bug=944426] | ||
1093 | |||
1094 | * For backwards compatibility, brought back the BeautifulStoneSoup | ||
1095 | class as a deprecated wrapper around BeautifulSoup. | ||
1096 | |||
1097 | = 4.0.0b9 (20120228) = | ||
1098 | |||
1099 | * Fixed the string representation of DOCTYPEs that have both a public | ||
1100 | ID and a system ID. | ||
1101 | |||
1102 | * Fixed the generated XML declaration. | ||
1103 | |||
1104 | * Renamed Tag.nsprefix to Tag.prefix, for consistency with | ||
1105 | NamespacedAttribute. | ||
1106 | |||
1107 | * Fixed a test failure that occurred on Python 3.x when chardet was | ||
1108 | installed. | ||
1109 | |||
1110 | * Made prettify() return Unicode by default, so it will look nice on | ||
1111 | Python 3 when passed into print(). | ||
1112 | |||
1113 | = 4.0.0b8 (20120224) = | ||
1114 | |||
1115 | * All tree builders now preserve namespace information in the | ||
1116 | documents they parse. If you use the html5lib parser or lxml's XML | ||
1117 | parser, you can access the namespace URL for a tag as tag.namespace. | ||
1118 | |||
1119 | However, there is no special support for namespace-oriented | ||
1120 | searching or tree manipulation. When you search the tree, you need | ||
1121 | to use namespace prefixes exactly as they're used in the original | ||
1122 | document. | ||
1123 | |||
1124 | * The string representation of a DOCTYPE always ends in a newline. | ||
1125 | |||
1126 | * Issue a warning if the user tries to use a SoupStrainer in | ||
1127 | conjunction with the html5lib tree builder, which doesn't support | ||
1128 | them. | ||
1129 | |||
1130 | = 4.0.0b7 (20120223) = | ||
1131 | |||
1132 | * Upon decoding to string, any characters that can't be represented in | ||
1133 | your chosen encoding will be converted into numeric XML entity | ||
1134 | references. | ||
1135 | |||
1136 | * Issue a warning if characters were replaced with REPLACEMENT | ||
1137 | CHARACTER during Unicode conversion. | ||
1138 | |||
1139 | * Restored compatibility with Python 2.6. | ||
1140 | |||
1141 | * The install process no longer installs docs or auxiliary text files. | ||
1142 | |||
1143 | * It's now possible to deepcopy a BeautifulSoup object created with | ||
1144 | Python's built-in HTML parser. | ||
1145 | |||
1146 | * About 100 unit tests that "test" the behavior of various parsers on | ||
1147 | invalid markup have been removed. Legitimate changes to those | ||
1148 | parsers caused these tests to fail, indicating that perhaps | ||
1149 | Beautiful Soup should not test the behavior of foreign | ||
1150 | libraries. | ||
1151 | |||
1152 | The problematic unit tests have been reformulated as informational | ||
1153 | comparisons generated by the script | ||
1154 | scripts/demonstrate_parser_differences.py. | ||
1155 | |||
1156 | This makes Beautiful Soup compatible with html5lib version 0.95 and | ||
1157 | future versions of HTMLParser. | ||
1158 | |||
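The conversion of unrepresentable characters into numeric XML entity references, mentioned at the top of this release, matches what Python's standard "xmlcharrefreplace" codec error handler produces. The snippet below only demonstrates that handler on a plain string; whether Beautiful Soup uses it internally is an assumption on my part.

```python
text = "caf\u00e9 \u2603"  # "e" with acute accent, then a snowman

# Neither character exists in ASCII, so both become numeric references.
as_ascii = text.encode("ascii", errors="xmlcharrefreplace")
print(as_ascii)   # b'caf&#233; &#9731;'

# Latin-1 can represent the accented "e" but not the snowman.
as_latin1 = text.encode("latin-1", errors="xmlcharrefreplace")
print(as_latin1)  # b'caf\xe9 &#9731;'
```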
1159 | = 4.0.0b6 (20120216) = | ||
1160 | |||
1161 | * Multi-valued attributes like "class" always have a list of values, | ||
1162 | even if there's only one value in the list. | ||
1163 | |||
1164 | * Added a number of multi-valued attributes defined in HTML5. | ||
1165 | |||
1166 | * Stopped generating a space before the slash that closes an | ||
1167 | empty-element tag. This may come back if I add a special XHTML mode | ||
1168 | (http://www.w3.org/TR/xhtml1/#C_2), but right now it's pretty | ||
1169 | useless. | ||
1170 | |||
1171 | * Passing text along with tag-specific arguments to a find* method: | ||
1172 | |||
1173 | find("a", text="Click here") | ||
1174 | |||
1175 | will find tags that contain the given text as their | ||
1176 | .string. Previously, the tag-specific arguments were ignored and | ||
1177 | only strings were searched. | ||
1178 | |||
1179 | * Fixed a bug that caused the html5lib tree builder to build a | ||
1180 | partially disconnected tree. Generally cleaned up the html5lib tree | ||
1181 | builder. | ||
1182 | |||
1183 | * If you restrict a multi-valued attribute like "class" to a string | ||
1184 | that contains spaces, Beautiful Soup will only consider it a match | ||
1185 | if the values correspond to that specific string. | ||
1186 | |||
1187 | = 4.0.0b5 (20120209) = | ||
1188 | |||
1189 | * Rationalized Beautiful Soup's treatment of CSS class. A tag | ||
1190 | belonging to multiple CSS classes is treated as having a list of | ||
1191 | values for the 'class' attribute. Searching for a CSS class will | ||
1192 | match *any* of the CSS classes. | ||
1193 | |||
1194 | This actually affects all attributes that the HTML standard defines | ||
1195 | as taking multiple values (class, rel, rev, archive, accept-charset, | ||
1196 | and headers), but 'class' is by far the most common. [bug=41034] | ||
1197 | |||
1198 | * If you pass anything other than a dictionary as the second argument | ||
1199 | to one of the find* methods, it'll assume you want to use that | ||
1200 | object to search against a tag's CSS classes. Previously this only | ||
1201 | worked if you passed in a string. | ||
1202 | |||
1203 | * Fixed a bug that caused a crash when you passed a dictionary as an | ||
1204 | attribute value (possibly because you mistyped "attrs"). [bug=842419] | ||
1205 | |||
1206 | * Unicode, Dammit now detects the encoding in HTML 5-style <meta> tags | ||
1207 | like <meta charset="utf-8" />. [bug=837268] | ||
1208 | |||
1209 | * If Unicode, Dammit can't figure out a consistent encoding for a | ||
1210 | page, it will try each of its guesses again, with errors="replace" | ||
1211 | instead of errors="strict". This may mean that some data gets | ||
1212 | replaced with REPLACEMENT CHARACTER, but at least most of it will | ||
1213 | get turned into Unicode. [bug=754903] | ||
1214 | |||
1215 | * Patched over a bug in html5lib (?) that was crashing Beautiful Soup | ||
1216 | on certain kinds of markup. [bug=838800] | ||
1217 | |||
1218 | * Fixed a bug that wrecked the tree if you replaced an element with an | ||
1219 | empty string. [bug=728697] | ||
1220 | |||
1221 | * Improved Unicode, Dammit's behavior when you give it Unicode to | ||
1222 | begin with. | ||
1223 | |||
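The Unicode, Dammit retry strategy described in this release (bug=754903) can be sketched as two passes over the same list of guesses: first with errors="strict", then with errors="replace". This is a simplified illustration, not the library's implementation; `decode_with_fallback` and its return shape are hypothetical.

```python
def decode_with_fallback(data, candidates):
    """Return (text, encoding, lossy) for the first guess that decodes."""
    for errors in ("strict", "replace"):
        for encoding in candidates:
            try:
                text = data.decode(encoding, errors)
            except (UnicodeDecodeError, LookupError):
                continue  # wrong guess or unknown codec name; try the next one
            return text, encoding, errors == "replace"
    return None, None, False

clean = b"caf\xc3\xa9"        # valid UTF-8
dirty = b"caf\xc3\xa9 \xff"   # valid UTF-8 followed by a stray byte

print(decode_with_fallback(clean, ["utf-8", "ascii"]))
print(decode_with_fallback(dirty, ["utf-8", "ascii"]))
```

In the second call no guess succeeds strictly, so the first guess is retried with errors="replace" and the stray byte turns into REPLACEMENT CHARACTER (U+FFFD).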
1224 | = 4.0.0b4 (20120208) = | ||
1225 | |||
1226 | * Added BeautifulSoup.new_string() to go along with BeautifulSoup.new_tag() | ||
1227 | |||
1228 | * BeautifulSoup.new_tag() will follow the rules of whatever | ||
1229 | tree-builder was used to create the original BeautifulSoup object. A | ||
1230 | new <p> tag will look like "<p />" if the soup object was created to | ||
1231 | parse XML, but it will look like "<p></p>" if the soup object was | ||
1232 | created to parse HTML. | ||
1233 | |||
1234 | * We pass in strict=False to html.parser on Python 3, greatly | ||
1235 | improving html.parser's ability to handle bad HTML. | ||
1236 | |||
1237 | * We also monkeypatch a serious bug in html.parser that made | ||
1238 | strict=False disastrous on Python 3.2.2. | ||
1239 | |||
1240 | * Replaced the "substitute_html_entities" argument with the | ||
1241 | more general "formatter" argument. | ||
1242 | |||
1243 | * Bare ampersands and angle brackets are always converted to XML | ||
1244 | entities unless the user prevents it. | ||
1245 | |||
1246 | * Added PageElement.insert_before() and PageElement.insert_after(), | ||
1247 | which let you put an element into the parse tree with respect to | ||
1248 | some other element. | ||
1249 | |||
1250 | * Raise an exception when the user tries to do something nonsensical | ||
1251 | like insert a tag into itself. | ||
1252 | |||
1253 | |||
1254 | = 4.0.0b3 (20120203) = | ||
1255 | |||
1256 | Beautiful Soup 4 is a nearly-complete rewrite that removes Beautiful | ||
1257 | Soup's custom HTML parser in favor of a system that lets you write a | ||
1258 | little glue code and plug in any HTML or XML parser you want. | ||
1259 | |||
1260 | Beautiful Soup 4.0 comes with glue code for four parsers: | ||
1261 | |||
1262 | * Python's standard HTMLParser (html.parser in Python 3) | ||
1263 | * lxml's HTML and XML parsers | ||
1264 | * html5lib's HTML parser | ||
1265 | |||
1266 | HTMLParser is the default, but I recommend you install lxml if you | ||
1267 | can. | ||
1268 | |||
1269 | For complete documentation, see the Sphinx documentation in | ||
1270 | bs4/doc/source/. What follows is a summary of the changes from | ||
1271 | Beautiful Soup 3. | ||
1272 | |||
1273 | === The module name has changed === | ||
1274 | |||
1275 | Previously you imported the BeautifulSoup class from a module also | ||
1276 | called BeautifulSoup. To save keystrokes and make it clear which | ||
1277 | version of the API is in use, the module is now called 'bs4': | ||
1278 | |||
1279 | >>> from bs4 import BeautifulSoup | ||
1280 | |||
1281 | === It works with Python 3 === | ||
1282 | |||
1283 | Beautiful Soup 3.1.0 worked with Python 3, but the parser it used was | ||
1284 | so bad that it barely worked at all. Beautiful Soup 4 works with | ||
1285 | Python 3, and since its parser is pluggable, you don't sacrifice | ||
1286 | quality. | ||
1287 | |||
1288 | Special thanks to Thomas Kluyver and Ezio Melotti for getting Python 3 | ||
1289 | support to the finish line. Ezio Melotti is also to thank for greatly | ||
1290 | improving the HTML parser that comes with Python 3.2. | ||
1291 | |||
1292 | === CDATA sections are normal text, if they're understood at all. === | ||
1293 | |||
1294 | Currently, the lxml and html5lib HTML parsers ignore CDATA sections in | ||
1295 | markup: | ||
1296 | |||
1297 | <p><![CDATA[foo]]></p> => <p></p> | ||
1298 | |||
1299 | A future version of html5lib will turn CDATA sections into text nodes, | ||
1300 | but only within tags like <svg> and <math>: | ||
1301 | |||
1302 | <svg><![CDATA[foo]]></svg> => <svg>foo</svg> | ||
1303 | |||
1304 | The default XML parser (which uses lxml behind the scenes) turns CDATA | ||
1305 | sections into ordinary text elements: | ||
1306 | |||
1307 | <p><![CDATA[foo]]></p> => <p>foo</p> | ||
1308 | |||
1309 | In theory it's possible to preserve the CDATA sections when using the | ||
1310 | XML parser, but I don't see how to get it to work in practice. | ||
1311 | |||
1312 | === Miscellaneous other stuff === | ||
1313 | |||
1314 | If the BeautifulSoup instance has .is_xml set to True, an appropriate | ||
1315 | XML declaration will be emitted when the tree is transformed into a | ||
1316 | string: | ||
1317 | |||
1318 | <?xml version="1.0" encoding="utf-8"?> | ||
1319 | <markup> | ||
1320 | ... | ||
1321 | </markup> | ||
1322 | |||
1323 | The ['lxml', 'xml'] tree builder sets .is_xml to True; the other tree | ||
1324 | builders set it to False. If you want to parse XHTML with an HTML | ||
1325 | parser, you can set it manually. | ||
1326 | |||
1327 | |||
1328 | = 3.2.0 = | ||
1329 | |||
1330 | The 3.1 series wasn't very useful, so I renamed the 3.0 series to 3.2 | ||
1331 | to make it obvious which one you should use. | ||
1332 | |||
1333 | = 3.1.0 = | ||
1334 | |||
1335 | A hybrid version that supports 2.4 and can be automatically converted | ||
1336 | to run under Python 3.0. There are three backwards-incompatible | ||
1337 | changes you should be aware of, but no new features or deliberate | ||
1338 | behavior changes. | ||
1339 | |||
1340 | 1. str() may no longer do what you want. This is because the meaning | ||
1341 | of str() inverts between Python 2 and 3; in Python 2 it gives you a | ||
1342 | byte string, in Python 3 it gives you a Unicode string. | ||
1343 | |||
1344 | The effect of this is that you can't pass an encoding to .__str__ | ||
1345 | anymore. Use encode() to get a string and decode() to get Unicode, and | ||
1346 | you'll be ready (well, readier) for Python 3. | ||
1347 | |||
1348 | 2. Beautiful Soup is now based on HTMLParser rather than SGMLParser, | ||
1349 | which is gone in Python 3. There's some bad HTML that SGMLParser | ||
1350 | handled but HTMLParser doesn't, usually to do with attribute values | ||
1351 | that aren't closed or have brackets inside them: | ||
1352 | |||
1353 | <a href="foo</a>, </a><a href="bar">baz</a> | ||
1354 | <a b="<a>">', '<a b="<a>"></a><a>"></a> | ||
1355 | |||
1356 | A later version of Beautiful Soup will allow you to plug in different | ||
1357 | parsers to make tradeoffs between speed and the ability to handle bad | ||
1358 | HTML. | ||
1359 | |||
1360 | 3. In Python 3 (but not Python 2), HTMLParser converts entities within | ||
1361 | attributes to the corresponding Unicode characters. In Python 2 it's | ||
1362 | possible to parse this string and leave the &eacute; intact. | ||
1363 | |||
1364 | <a href="http://crummy.com?sacr&eacute;&bleu"> | ||
1365 | |||
1366 | In Python 3, the &eacute; is always converted to \xe9 during | ||
1367 | parsing. | ||
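
The Python 3 behavior described in item 3 can be observed with the standard library alone. This is a minimal sketch, not Beautiful Soup code: AttributeCollector is a hypothetical helper written for this illustration, and it only records the attribute values that html.parser hands to handle_starttag.

```python
from html.parser import HTMLParser


class AttributeCollector(HTMLParser):
    """Hypothetical helper: records every (name, value) attribute pair seen."""

    def __init__(self):
        super().__init__()
        self.attributes = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) tuples, with entity
        # references in the values already converted by the parser.
        self.attributes.extend(attrs)


collector = AttributeCollector()
collector.feed('<a href="http://crummy.com?sacr&eacute;&bleu">')
collector.close()

href = dict(collector.attributes)["href"]
print(repr(href))
```

The printed value contains the character \xe9 in place of the &eacute; reference, so code that needs the original entity text cannot recover it from the parsed attribute.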
1368 | |||
1369 | |||
1370 | = 3.0.7a = | ||
1371 | |||
1372 | Added an import that makes BS work in Python 2.3. | ||
1373 | |||
1374 | |||
1375 | = 3.0.7 = | ||
1376 | |||
1377 | Fixed a UnicodeDecodeError when unpickling documents that contain | ||
1378 | non-ASCII characters. | ||
1379 | |||
1380 | Fixed a TypeError that occurred in some circumstances when a tag | ||
1381 | contained no text. | ||
1382 | |||
1383 | Jump through hoops to avoid the use of chardet, which can be extremely | ||
1384 | slow in some circumstances. UTF-8 documents should never trigger the | ||
1385 | use of chardet. | ||
1386 | |||
1387 | Whitespace is preserved inside <pre> and <textarea> tags that contain | ||
1388 | nothing but whitespace. | ||
1389 | |||
1390 | Beautiful Soup can now parse a doctype that's scoped to an XML namespace. | ||
1391 | |||
1392 | |||
1393 | = 3.0.6 = | ||
1394 | |||
1395 | Got rid of a very old debug line that prevented chardet from working. | ||
1396 | |||
1397 | Added a Tag.decompose() method that completely disconnects a tree or a | ||
1398 | subset of a tree, breaking it up into bite-sized pieces that are | ||
1399 | easy for the garbage collector to collect. | ||
1400 | |||
1401 | Tag.extract() now returns the tag that was extracted. | ||
1402 | |||
1403 | Tag.findNext() now does something with the keyword arguments you pass | ||
1404 | it instead of dropping them on the floor. | ||
1405 | |||
1406 | Fixed a Unicode conversion bug. | ||
1407 | |||
1408 | Fixed a bug that garbled some <meta> tags when rewriting them. | ||
1409 | |||
1410 | |||
1411 | = 3.0.5 = | ||
1412 | |||
1413 | Soup objects can now be pickled, and copied with copy.deepcopy. | ||
1414 | |||
1415 | Tag.append now works properly on existing BS objects. (It wasn't | ||
1416 | originally intended for outside use, but it can be now.) (Giles | ||
1417 | Radford) | ||
1418 | |||
1419 | Passing in a nonexistent encoding will no longer crash the parser on | ||
1420 | Python 2.4 (John Nagle). | ||
1421 | |||
1422 | Fixed an underlying bug in SGMLParser that thinks ASCII has 255 | ||
1423 | characters instead of 127 (John Nagle). | ||
1424 | |||
1425 | Entities are converted more consistently to Unicode characters. | ||
1426 | |||
1427 | Entity references in attribute values are now converted to Unicode | ||
1428 | characters when appropriate. Numeric entities are always converted, | ||
1429 | because SGMLParser always converts them outside of attribute values. | ||
1430 | |||
1431 | ALL_ENTITIES happens to just be the XHTML entities, so I renamed it to | ||
1432 | XHTML_ENTITIES. | ||
1433 | |||
1434 | The regular expression for bare ampersands was too loose. In some | ||
1435 | cases ampersands were not being escaped. (Sam Ruby?) | ||
1436 | |||
1437 | Non-breaking spaces and other special Unicode space characters are no | ||
1438 | longer folded to ASCII spaces. (Robert Leftwich) | ||
1439 | |||
1440 | Information inside a TEXTAREA tag is now parsed literally, not as HTML | ||
1441 | tags. TEXTAREA now works exactly the same way as SCRIPT. (Zephyr Fang) | ||
1442 | |||
1443 | = 3.0.4 = | ||
1444 | |||
1445 | Fixed a bug that crashed Unicode conversion in some cases. | ||
1446 | |||
1447 | Fixed a bug that prevented UnicodeDammit from being used as a | ||
1448 | general-purpose data scrubber. | ||
1449 | |||
1450 | Fixed some unit test failures when running against Python 2.5. | ||
1451 | |||
1452 | When considering whether to convert smart quotes, UnicodeDammit now | ||
1453 | looks at the original encoding in a case-insensitive way. | ||
1454 | |||
1455 | = 3.0.3 (20060606) = | ||
1456 | |||
1457 | Beautiful Soup is now usable as a way to clean up invalid XML/HTML (be | ||
1458 | sure to pass in an appropriate value for convertEntities, or XML/HTML | ||
1459 | entities might stick around that aren't valid in HTML/XML). The result | ||
1460 | may not validate, but it should be good enough to not choke a | ||
1461 | real-world XML parser. Specifically, the output of a properly | ||
1462 | constructed soup object should always be valid as part of an XML | ||
1463 | document, but parts may be missing if they were missing in the | ||
1464 | original. As always, if the input is valid XML, the output will also | ||
1465 | be valid. | ||
1466 | |||
1467 | = 3.0.2 (20060602) = | ||
1468 | |||
1469 | Previously, Beautiful Soup correctly handled attribute values that | ||
1470 | contained embedded quotes (sometimes by escaping), but not other kinds | ||
1471 | of XML character. Now, it correctly handles or escapes all special XML | ||
1472 | characters in attribute values. | ||
1473 | |||
1474 | I aliased methods to the 2.x names (fetch, find, findText, etc.) for | ||
1475 | backwards compatibility purposes. Those names are deprecated and if I | ||
1476 | ever do a 4.0 I will remove them. I will, I tell you! | ||
1477 | |||
1478 | Fixed a bug where the findAll method wasn't passing along any keyword | ||
1479 | arguments. | ||
1480 | |||
1481 | When run from the command line, Beautiful Soup now acts as an HTML | ||
1482 | pretty-printer, not an XML pretty-printer. | ||
1483 | |||
1484 | = 3.0.1 (20060530) = | ||
1485 | |||
1486 | Reintroduced the "fetch by CSS class" shortcut. I thought keyword | ||
1487 | arguments would replace it, but they don't. You can't call soup('a', | ||
1488 | class='foo') because class is a Python keyword. | ||
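
The keyword clash is a property of Python's grammar rather than of Beautiful Soup. The sketch below uses a hypothetical search() function as a stand-in for any method that accepts attribute filters as keyword arguments; it is not the library's API and it is not the "fetch by CSS class" shortcut itself.

```python
def search(name, **attrs):
    """Hypothetical stand-in for a search method taking attribute filters."""
    return name, attrs


# Writing class='foo' directly is rejected before the call ever runs,
# because 'class' is a reserved word. compile() lets us observe that
# without breaking this script.
try:
    compile("search('a', class='foo')", "<example>", "eval")
    keyword_rejected = False
except SyntaxError:
    keyword_rejected = True

# Unpacking a dictionary sidesteps the grammar restriction.
result = search("a", **{"class": "foo"})
print(keyword_rejected, result)
```

Dictionary unpacking works but is clumsy to type, which is presumably why a dedicated shortcut was worth bringing back.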
1489 | |||
1490 | If Beautiful Soup encounters a meta tag that declares the encoding, | ||
1491 | but a SoupStrainer tells it not to parse that tag, Beautiful Soup will | ||
1492 | no longer try to rewrite the meta tag to mention the new | ||
1493 | encoding. Basically, this makes SoupStrainers work in real-world | ||
1494 | applications instead of crashing the parser. | ||
1495 | |||
1496 | = 3.0.0 "Who would not give all else for two p" (20060528) = | ||
1497 | |||
1498 | This release is not backward-compatible with previous releases. If | ||
1499 | you've got code written with a previous version of the library, go | ||
1500 | ahead and keep using it, unless one of the features mentioned here | ||
1501 | really makes your life easier. Since the library is self-contained, | ||
1502 | you can include an old copy of the library in your old applications, | ||
1503 | and use the new version for everything else. | ||
1504 | |||
1505 | The documentation has been rewritten and greatly expanded with many | ||
1506 | more examples. | ||
1507 | |||
1508 | Beautiful Soup autodetects the encoding of a document (or uses the one | ||
1509 | you specify), and converts it from its native encoding to | ||
1510 | Unicode. Internally, it only deals with Unicode strings. When you | ||
1511 | print out the document, it converts to UTF-8 (or another encoding you | ||
1512 | specify). [Doc reference] | ||
1513 | |||
1514 | It's now easy to make large-scale changes to the parse tree without | ||
1515 | screwing up the navigation members. The methods are extract, | ||
1516 | replaceWith, and insert. [Doc reference. See also Improving Memory | ||
1517 | Usage with extract] | ||
1518 | |||
1519 | Passing True in as an attribute value gives you tags that have any | ||
1520 | value for that attribute. You don't have to create a regular | ||
1521 | expression. Passing None for an attribute value gives you tags that | ||
1522 | don't have that attribute at all. | ||
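
The True and None rules can be sketched with a plain dictionary standing in for a tag's attributes. attribute_matches is a hypothetical function written for this illustration; it is not Beautiful Soup's implementation and it ignores the other kinds of match values the library accepts.

```python
def attribute_matches(attrs, name, wanted):
    """Hypothetical check of one attribute filter against a tag's attributes.

    attrs  -- dict of attribute names to values (stand-in for a tag)
    wanted -- True (any value), None (attribute absent), or an exact value
    """
    if wanted is True:
        return name in attrs
    if wanted is None:
        return name not in attrs
    return attrs.get(name) == wanted


link = {"href": "http://example.com/", "id": "5"}
print(attribute_matches(link, "href", True))   # has an href, any value
print(attribute_matches(link, "title", None))  # has no title attribute
print(attribute_matches(link, "id", "5"))      # exact value
```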
1523 | |||
1524 | Tag objects now know whether or not they're self-closing. This avoids | ||
1525 | the problem where Beautiful Soup thought that tags like <BR /> were | ||
1526 | self-closing even in XML documents. You can customize the self-closing | ||
1527 | tags for a parser object by passing them in as a list of | ||
1528 | selfClosingTags: you don't have to subclass anymore. | ||
1529 | |||
1530 | There's a new built-in parser, MinimalSoup, which has most of | ||
1531 | BeautifulSoup's HTML-specific rules, but no tag nesting rules. [Doc | ||
1532 | reference] | ||
1533 | |||
1534 | You can use a SoupStrainer to tell Beautiful Soup to parse only part | ||
1535 | of a document. This saves time and memory, often making Beautiful Soup | ||
1536 | about as fast as a custom-built SGMLParser subclass. [Doc reference, | ||
1537 | SoupStrainer reference] | ||
1538 | |||
1539 | You can (usually) use keyword arguments instead of passing a | ||
1540 | dictionary of attributes to a search method. That is, you can replace | ||
1541 | soup(args={"id" : "5"}) with soup(id="5"). You can still use args if | ||
1542 | (for instance) you need to find an attribute whose name clashes with | ||
1543 | the name of an argument to findAll. [Doc reference: **kwargs attrs] | ||
1544 | |||
1545 | The method names have changed to the better method names used in | ||
1546 | Rubyful Soup. Instead of find methods and fetch methods, there are | ||
1547 | only find methods. Instead of a scheme where you can't remember which | ||
1548 | method finds one element and which one finds them all, we have find | ||
1549 | and findAll. In general, if the method name mentions All or a plural | ||
1550 | noun (e.g. findNextSiblings), then it finds many | ||
1551 | elements. Otherwise, it only finds one element. [Doc reference] | ||
1552 | |||
1553 | Some of the argument names have been renamed for clarity. For instance | ||
1554 | avoidParserProblems is now parserMassage. | ||
1555 | |||
1556 | Beautiful Soup no longer implements a feed method. You need to pass a | ||
1557 | string or a filehandle into the soup constructor, not with feed after | ||
1558 | the soup has been created. There is still a feed method, but it's the | ||
1559 | feed method implemented by SGMLParser and calling it will bypass | ||
1560 | Beautiful Soup and cause problems. | ||
1561 | |||
1562 | The NavigableText class has been renamed to NavigableString. There is | ||
1563 | no NavigableUnicodeString anymore, because every string inside a | ||
1564 | Beautiful Soup parse tree is a Unicode string. | ||
1565 | |||
1566 | findText and fetchText are gone. Just pass a text argument into find | ||
1567 | or findAll. | ||
1568 | |||
1569 | Null was more trouble than it was worth, so I got rid of it. Anything | ||
1570 | that used to return Null now returns None. | ||
1571 | |||
1572 | Special XML constructs like comments and CDATA now have their own | ||
1573 | NavigableString subclasses, instead of being treated as oddly-formed | ||
1574 | data. If you parse a document that contains CDATA and write it back | ||
1575 | out, the CDATA will still be there. | ||
1576 | |||
1577 | When you're parsing a document, you can get Beautiful Soup to convert | ||
1578 | XML or HTML entities into the corresponding Unicode characters. [Doc | ||
1579 | reference] | ||
1580 | |||
1581 | = 2.1.1 (20050918) = | ||
1582 | |||
1583 | Fixed a serious performance bug in BeautifulStoneSoup which was | ||
1584 | causing parsing to be incredibly slow. | ||
1585 | |||
1586 | Corrected several entities that were previously being incorrectly | ||
1587 | translated from Microsoft smart-quote-like characters. | ||
1588 | |||
1589 | Fixed a bug that was breaking text fetch. | ||
1590 | |||
1591 | Fixed a bug that crashed the parser when text chunks that look like | ||
1592 | HTML tag names showed up within a SCRIPT tag. | ||
1593 | |||
1594 | THEAD, TBODY, and TFOOT tags are now nestable within TABLE | ||
1595 | tags. Nested tables should parse more sensibly now. | ||
1596 | |||
1597 | BASE is now considered a self-closing tag. | ||
1598 | |||
1599 | = 2.1.0 "Game, or any other dish?" (20050504) = | ||
1600 | |||
1601 | Added a wide variety of new search methods which, given a starting | ||
1602 | point inside the tree, follow a particular navigation member (like | ||
1603 | nextSibling) over and over again, looking for Tag and NavigableText | ||
1604 | objects that match certain criteria. The new methods are findNext, | ||
1605 | fetchNext, findPrevious, fetchPrevious, findNextSibling, | ||
1606 | fetchNextSiblings, findPreviousSibling, fetchPreviousSiblings, | ||
1607 | findParent, and fetchParents. All of these use the same basic code | ||
1608 | used by first and fetch, so you can pass your weird ways of matching | ||
1609 | things into these methods. | ||
1610 | |||
1611 | The fetch method and its derivatives now accept a limit argument. | ||
1612 | |||
1613 | You can now pass keyword arguments when calling a Tag object as though | ||
1614 | it were a method. | ||
1615 | |||
1616 | Fixed a bug that caused all hand-created tags to share a single set of | ||
1617 | attributes. | ||
1618 | |||
1619 | = 2.0.3 (20050501) = | ||
1620 | |||
1621 | Fixed Python 2.2 support for iterators. | ||
1622 | |||
1623 | Fixed a bug that gave the wrong representation to tags within quote | ||
1624 | tags like <script>. | ||
1625 | |||
1626 | Took some code from Mark Pilgrim that treats CDATA declarations as | ||
1627 | data instead of ignoring them. | ||
1628 | |||
1629 | Beautiful Soup's setup.py will now do an install even if the unit | ||
1630 | tests fail. It won't build a source distribution if the unit tests | ||
1631 | fail, so I can't release a new version unless they pass. | ||
1632 | |||
1633 | = 2.0.2 (20050416) = | ||
1634 | |||
1635 | Added the unit tests in a separate module, and packaged it with | ||
1636 | distutils. | ||
1637 | |||
1638 | Fixed a bug that sometimes caused renderContents() to return a Unicode | ||
1639 | string even if there was no Unicode in the original string. | ||
1640 | |||
1641 | Added the done() method, which closes all of the parser's open | ||
1642 | tags. It gets called automatically when you pass in some text to the | ||
1643 | constructor of a parser class; otherwise you must call it yourself. | ||
1644 | |||
1645 | Reinstated some backwards compatibility with 1.x versions: referencing | ||
1646 | the string member of a NavigableText object returns the NavigableText | ||
1647 | object instead of throwing an error. | ||
1648 | |||
1649 | = 2.0.1 (20050412) = | ||
1650 | |||
1651 | Fixed a bug that caused bad results when you tried to reference a tag | ||
1652 | name shorter than 3 characters as a member of a Tag, eg. tag.table.td. | ||
1653 | |||
1654 | Made sure all Tags have the 'hidden' attribute so that an attempt to | ||
1655 | access tag.hidden doesn't spawn an attempt to find a tag named | ||
1656 | 'hidden'. | ||
1657 | |||
1658 | Fixed a bug in the comparison operator. | ||
1659 | |||
1660 | = 2.0.0 "Who cares for fish?" (20050410) = | ||
1661 | |||
1662 | Beautiful Soup version 1 was very useful but also pretty stupid. I | ||
1663 | originally wrote it without noticing any of the problems inherent in | ||
1664 | trying to build a parse tree out of ambiguous HTML tags. This version | ||
1665 | solves all of those problems to my satisfaction. It also adds many new | ||
1666 | clever things to make up for the removal of the stupid things. | ||
1667 | |||
1668 | == Parsing == | ||
1669 | |||
1670 | The parser logic has been greatly improved, and the BeautifulSoup | ||
1671 | class should much more reliably yield a parse tree that looks like | ||
1672 | what the page author intended. For a particular class of odd edge | ||
1673 | cases that now causes problems, there is a new class, | ||
1674 | ICantBelieveItsBeautifulSoup. | ||
1675 | |||
1676 | By default, Beautiful Soup now performs some cleanup operations on | ||
1677 | text before parsing it. This is to avoid common problems with bad | ||
1678 | definitions and self-closing tags that crash SGMLParser. You can | ||
1679 | provide your own set of cleanup operations, or turn it off | ||
1680 | altogether. The cleanup operations include fixing self-closing tags | ||
1681 | that don't close, and replacing Microsoft smart quotes and similar | ||
1682 | characters with their HTML entity equivalents. | ||
1683 | |||
1684 | You can now get a pretty-print version of parsed HTML to get a visual | ||
1685 | picture of how Beautiful Soup parses it, with the Tag.prettify() | ||
1686 | method. | ||
1687 | |||
1688 | == Strings and Unicode == | ||
1689 | |||
1690 | There are separate NavigableText subclasses for ASCII and Unicode | ||
1691 | strings. These classes directly subclass the corresponding base data | ||
1692 | types. This means you can treat NavigableText objects as strings | ||
1693 | instead of having to call methods on them to get the strings. | ||
1694 | |||
1695 | str() on a Tag always returns a string, and unicode() always returns | ||
1696 | Unicode. Previously it was inconsistent. | ||
1697 | |||
1698 | == Tree traversal == | ||
1699 | |||
1700 | In a first() or fetch() call, the tag name or the desired value of an | ||
1701 | attribute can now be any of the following: | ||
1702 | |||
1703 | * A string (matches that specific tag or that specific attribute value) | ||
1704 | * A list of strings (matches any tag or attribute value in the list) | ||
1705 | * A compiled regular expression object (matches any tag or attribute | ||
1706 | value that matches the regular expression) | ||
1707 | * A callable object that takes the Tag object or attribute value as a | ||
1708 | string. It returns None/false/empty string if the given string | ||
1709 | doesn't match, and any other value if it does. | ||
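
The four kinds of criteria listed above can be sketched as a single dispatch function. matches() is hypothetical and written only for this illustration; in particular, using search() rather than match() for regular expressions is an assumption, and the callable here receives a plain string rather than a Tag object.

```python
import re


def matches(criterion, value):
    """Hypothetical dispatcher for the four criterion types listed above."""
    if hasattr(criterion, "search"):
        # Compiled regular expression object.
        return criterion.search(value) is not None
    if callable(criterion):
        # None, False, or an empty string means "no match".
        return bool(criterion(value))
    if isinstance(criterion, (list, tuple)):
        return value in criterion
    # Plain string: exact comparison.
    return criterion == value


heading = re.compile(r"^h[1-6]$")
print(matches("b", "b"))
print(matches(["b", "i"], "i"))
print(matches(heading, "h2"))
print(matches(lambda name: name.startswith("t"), "table"))
```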
1710 | |||
1711 | This is much easier to use than SQL-style wildcards (see, regular | ||
1712 | expressions are good for something). Because of this, I took out | ||
1713 | SQL-style wildcards. I'll put them back if someone complains, but | ||
1714 | their removal simplifies the code a lot. | ||
1715 | |||
1716 | You can use fetch() and first() to search for text in the parse tree, | ||
1717 | not just tags. There are new alias methods fetchText() and firstText() | ||
1718 | designed for this purpose. As with searching for tags, you can pass in | ||
1719 | a string, a regular expression object, or a method to match your text. | ||
1720 | |||
1721 | If you pass in something besides a map to the attrs argument of | ||
1722 | fetch() or first(), Beautiful Soup will assume you want to match that | ||
1723 | thing against the "class" attribute. When you're scraping | ||
1724 | well-structured HTML, this makes your code a lot cleaner. | ||
1725 | |||
1726 | 1.x and 2.x both let you call a Tag object as a shorthand for | ||
1727 | fetch(). For instance, foo("bar") is a shorthand for | ||
1728 | foo.fetch("bar"). In 2.x, you can also access a specially-named member | ||
1729 | of a Tag object as a shorthand for first(). For instance, foo.barTag | ||
1730 | is a shorthand for foo.first("bar"). By chaining these shortcuts you | ||
1731 | traverse a tree in very little code: for header in | ||
1732 | soup.bodyTag.pTag.tableTag('th'): | ||
1733 | |||
1734 | If an element relationship (like parent or next) doesn't apply to a | ||
1735 | tag, it'll now show up as Null instead of None. first() will also return | ||
1736 | Null if you ask it for a nonexistent tag. Null is an object that's | ||
1737 | just like None, except you can do whatever you want to it and it'll | ||
1738 | give you Null instead of throwing an error. | ||
1739 | |||
1740 | This lets you do tree traversals like soup.htmlTag.headTag.titleTag | ||
1741 | without having to worry if the intermediate stages are actually | ||
1742 | there. Previously, if there was no 'head' tag in the document, headTag | ||
1743 | in that instance would have been None, and accessing its 'titleTag' | ||
1744 | member would have thrown an AttributeError. Now, you can get what you | ||
1745 | want when it exists, and get Null when it doesn't, without having to | ||
1746 | do a lot of conditionals checking to see if every stage is None. | ||
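
The idea behind Null can be sketched as a small null-object class. This is an illustration of the pattern under stated assumptions, not the 2.x implementation: NullType and its method set are hypothetical, and only attribute access, calling, and truth testing are covered.

```python
class NullType:
    """Hypothetical null object: absorbs attribute access and calls."""

    def __getattr__(self, name):
        # Any missing attribute (for example .headTag) yields Null again,
        # so chained lookups never raise AttributeError.
        return self

    def __call__(self, *args, **kwargs):
        return self

    def __bool__(self):
        # Behaves like None in conditionals.
        return False

    def __repr__(self):
        return "Null"


Null = NullType()

# A chain of lookups on a missing element stays Null instead of failing.
title = Null.headTag.titleTag
print(title, bool(title))
```

The convenience comes at a price: a typo in an attribute name also returns Null silently, which may be part of why the 3.0.0 notes describe Null as more trouble than it was worth.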
1747 | |||
1748 | There are two new relations between page elements: previousSibling and | ||
1749 | nextSibling. They reference the previous and next element at the same | ||
1750 | level of the parse tree. For instance, if you have HTML like this: | ||
1751 | |||
1752 | <p><ul><li>Foo<br /><li>Bar</ul> | ||
1753 | |||
1754 | The first 'li' tag has a previousSibling of Null and its nextSibling | ||
1755 | is the second 'li' tag. The second 'li' tag has a nextSibling of Null | ||
1756 | and its previousSibling is the first 'li' tag. The previousSibling of | ||
1757 | the 'ul' tag is the first 'p' tag. The nextSibling of 'Foo' is the | ||
1758 | 'br' tag. | ||
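
The sibling relations described above can be checked against a hand-built tree. This sketch assumes the parse tree has the shape the text describes ('p' and 'ul' as siblings, each 'li' inside the 'ul'); Node is a hypothetical class, the tree is constructed by hand rather than parsed, and None stands in for Null.

```python
class Node:
    """Hypothetical tree node with previousSibling and nextSibling."""

    def __init__(self, name, children=()):
        self.name = name
        self.parent = None
        self.children = list(children)
        for child in self.children:
            child.parent = self

    def _sibling(self, offset):
        if self.parent is None:
            return None
        siblings = self.parent.children
        index = siblings.index(self) + offset
        if 0 <= index < len(siblings):
            return siblings[index]
        return None

    @property
    def nextSibling(self):
        return self._sibling(1)

    @property
    def previousSibling(self):
        return self._sibling(-1)


# Hand-built tree for: <p><ul><li>Foo<br /><li>Bar</ul>
foo, br = Node("Foo"), Node("br")
first_li = Node("li", [foo, br])
second_li = Node("li", [Node("Bar")])
ul = Node("ul", [first_li, second_li])
p = Node("p")
root = Node("[document]", [p, ul])

print(first_li.nextSibling.name, ul.previousSibling.name, foo.nextSibling.name)
```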
1759 | |||
1760 | I took out the ability to use fetch() to find tags that have a | ||
1761 | specific list of contents. See, I can't even explain it well. It was | ||
1762 | really difficult to use, I never used it, and I don't think anyone | ||
1763 | else ever used it. To the extent anyone did, they can probably use | ||
1764 | fetchText() instead. If it turns out someone needs it I'll think of | ||
1765 | another solution. | ||
1766 | |||
1767 | == Tree manipulation == | ||
1768 | |||
1769 | You can add new attributes to a tag, and delete attributes from a | ||
1770 | tag. In 1.x you could only change a tag's existing attributes. | ||
1771 | |||
1772 | == Porting Considerations == | ||
1773 | |||
1774 | There are three changes in 2.0 that break old code: | ||
1775 | |||
1776 | In the post-1.2 release you could pass a function into fetch(). The | ||
1777 | function took a string, the tag name. In 2.0, the function takes the | ||
1778 | actual Tag object. | ||
1779 | |||
1780 | It's no longer possible to pass in SQL-style wildcards to fetch(). Use a | ||
1781 | regular expression instead. | ||
1782 | |||
1783 | The different parsing algorithm means the parse tree may not be shaped | ||
1784 | like you expect. This will only actually affect you if your code uses | ||
1785 | one of the affected parts. I haven't run into this problem yet while | ||
1786 | porting my code. | ||
1787 | |||
1788 | = Between 1.2 and 2.0 = | ||
1789 | |||
1790 | This is the release to get if you want Python 1.5 compatibility. | ||
1791 | |||
1792 | The desired value of an attribute can now be any of the following: | ||
1793 | |||
1794 | * A string | ||
1795 | * A string with SQL-style wildcards | ||
1796 | * A compiled RE object | ||
1797 | * A callable that returns None/false/empty string if the given value | ||
1798 | doesn't match, and any other value otherwise. | ||
1799 | |||
1800 | This is much easier to use than SQL-style wildcards (see, regular | ||
1801 | expressions are good for something). Because of this, I no longer | ||
1802 | recommend you use SQL-style wildcards. They may go away in a future | ||
1803 | release to clean up the code. | ||
1804 | |||
1805 | Made Beautiful Soup handle processing instructions as text instead of | ||
1806 | ignoring them. | ||
1807 | |||
1808 | Applied patch from Richie Hindle (richie at entrian dot com) that | ||
1809 | makes tag.string a shorthand for tag.contents[0].string when the tag | ||
1810 | has only one string-owning child. | ||
1811 | |||
1812 | Added still more nestable tags. The nestable tags thing won't work in | ||
1813 | a lot of cases and needs to be rethought. | ||
1814 | |||
1815 | Fixed an edge case where searching for "%foo" would match any string | ||
1816 | shorter than "foo". | ||
1817 | |||
1818 | = 1.2 "Who for such dainties would not stoop?" (20040708) = | ||
1819 | |||
1820 | Applied patch from Ben Last (ben at benlast dot com) that made | ||
1821 | Tag.renderContents() correctly handle Unicode. | ||
1822 | |||
1823 | Made BeautifulStoneSoup even dumber by making it not implicitly close | ||
1824 | a tag when another tag of the same type is encountered; only when an | ||
1825 | actual closing tag is encountered. This change courtesy of Fuzzy (mike | ||
1826 | at pcblokes dot com). BeautifulSoup still works as before. | ||
1827 | |||
1828 | = 1.1 "Swimming in a hot tureen" = | ||
1829 | |||
1830 | Added more 'nestable' tags. Changed popping semantics so that when a | ||
1831 | nestable tag is encountered, tags are popped up to the previously | ||
1832 | encountered nestable tag (of whatever kind). I will revert this if | ||
1833 | enough people complain, but it should make more people's lives easier | ||
1834 | than harder. This enhancement was suggested by Anthony Baxter (anthony | ||
1835 | at interlink dot com dot au). | ||
1836 | |||
1837 | = 1.0 "So rich and green" (20040420) = | ||
1838 | |||
1839 | Initial release. | ||