path: root/bitbake/lib/bs4
author    Richard Purdie <richard.purdie@linuxfoundation.org>  2025-11-07 13:31:53 +0000
committer Richard Purdie <richard.purdie@linuxfoundation.org>  2025-11-07 13:31:53 +0000
commit    8c22ff0d8b70d9b12f0487ef696a7e915b9e3173 (patch)
tree      efdc32587159d0050a69009bdf2330a531727d95 /bitbake/lib/bs4
parent    d412d2747595c1cc4a5e3ca975e3adc31b2f7891 (diff)
download  poky-8c22ff0d8b70d9b12f0487ef696a7e915b9e3173.tar.gz
The poky repository master branch is no longer being updated.
You can either:

  a) switch to individual clones of bitbake, openembedded-core, meta-yocto
     and yocto-docs
  b) use the new bitbake-setup

You can find information about either approach in our documentation:
https://docs.yoctoproject.org/

Note that "poky" the distro setting is still available in meta-yocto as
before and we continue to use and maintain that. Long live Poky!

Some further information on the background of this change can be found in:
https://lists.openembedded.org/g/openembedded-architecture/message/2179

Signed-off-by: Richard Purdie <richard.purdie@linuxfoundation.org>
Diffstat (limited to 'bitbake/lib/bs4')
-rw-r--r--  bitbake/lib/bs4/AUTHORS                  |   49
-rw-r--r--  bitbake/lib/bs4/CHANGELOG                | 1839
-rw-r--r--  bitbake/lib/bs4/LICENSE                  |   31
-rw-r--r--  bitbake/lib/bs4/__init__.py              |  839
-rw-r--r--  bitbake/lib/bs4/builder/__init__.py      |  636
-rw-r--r--  bitbake/lib/bs4/builder/_html5lib.py     |  481
-rw-r--r--  bitbake/lib/bs4/builder/_htmlparser.py   |  387
-rw-r--r--  bitbake/lib/bs4/builder/_lxml.py         |  388
-rw-r--r--  bitbake/lib/bs4/css.py                   |  274
-rw-r--r--  bitbake/lib/bs4/dammit.py                | 1095
-rw-r--r--  bitbake/lib/bs4/diagnose.py              |  232
-rw-r--r--  bitbake/lib/bs4/element.py               | 2435
-rw-r--r--  bitbake/lib/bs4/formatter.py             |  185
13 files changed, 0 insertions, 8871 deletions
diff --git a/bitbake/lib/bs4/AUTHORS b/bitbake/lib/bs4/AUTHORS
deleted file mode 100644
index 1f14fe07de..0000000000
--- a/bitbake/lib/bs4/AUTHORS
+++ /dev/null
@@ -1,49 +0,0 @@
Behold, mortal, the origins of Beautiful Soup...
================================================

Leonard Richardson is the primary maintainer.

Aaron DeVore and Isaac Muse have made significant contributions to the
code base.

Mark Pilgrim provided the encoding detection code that forms the base
of UnicodeDammit.

Thomas Kluyver and Ezio Melotti finished the work of getting Beautiful
Soup 4 working under Python 3.

Simon Willison wrote soupselect, which was used to make Beautiful Soup
support CSS selectors. Isaac Muse wrote SoupSieve, which made it
possible to _remove_ the CSS selector code from Beautiful Soup.

Sam Ruby helped with a lot of edge cases.

Jonathan Ellis was awarded the prestigious Beau Potage D'Or for his
work in solving the nestable tags conundrum.

An incomplete list of people who have contributed patches to Beautiful
Soup:

 Istvan Albert, Andrew Lin, Anthony Baxter, Oliver Beattie, Andrew
Boyko, Tony Chang, Francisco Canas, "Delong", Zephyr Fang, Fuzzy,
Roman Gaufman, Yoni Gilad, Richie Hindle, Toshihiro Kamiya, Peteris
Krumins, Kent Johnson, Marek Kapolka, Andreas Kostyrka, Roel Kramer,
Ben Last, Robert Leftwich, Stefaan Lippens, "liquider", Staffan
Malmgren, Ksenia Marasanova, JP Moins, Adam Monsen, John Nagle, "Jon",
Ed Oskiewicz, Martijn Peters, Greg Phillips, Giles Radford, Stefano
Revera, Arthur Rudolph, Marko Samastur, James Salter, Jouni Seppänen,
Alexander Schmolck, Tim Shirley, Geoffrey Sneddon, Ville Skyttä,
"Vikas", Jens Svalgaard, Andy Theyers, Eric Weiser, Glyn Webster, John
Wiseman, Paul Wright, Danny Yoo

An incomplete list of people who made suggestions or found bugs or
found ways to break Beautiful Soup:

 Hanno Böck, Matteo Bertini, Chris Curvey, Simon Cusack, Bruce Eckel,
 Matt Ernst, Michael Foord, Tom Harris, Bill de hOra, Donald Howes,
 Matt Patterson, Scott Roberts, Steve Strassmann, Mike Williams,
 warchild at redho dot com, Sami Kuisma, Carlos Rocha, Bob Hutchison,
 Joren Mc, Michal Migurski, John Kleven, Tim Heaney, Tripp Lilley, Ed
 Summers, Dennis Sutch, Chris Smith, Aaron Swartz, Stuart
 Turner, Greg Edwards, Kevin J Kalupson, Nikos Kouremenos, Artur de
 Sousa Rocha, Yichun Wei, Per Vognsen
diff --git a/bitbake/lib/bs4/CHANGELOG b/bitbake/lib/bs4/CHANGELOG
deleted file mode 100644
index 2701446a6d..0000000000
--- a/bitbake/lib/bs4/CHANGELOG
+++ /dev/null
@@ -1,1839 +0,0 @@
= 4.12.3 (20240117)

* The Beautiful Soup documentation now has a Spanish translation, thanks
  to Carlos Romero. Delong Wang's Chinese translation has been updated
  to cover Beautiful Soup 4.12.0.

* Fixed a regression such that if you set .hidden on a tag, the tag
  becomes invisible but its contents are still visible. User manipulation
  of .hidden is not a documented or supported feature, so don't do this,
  but it wasn't too difficult to keep the old behavior working.

* Fixed a case found by Mengyuhan where html.parser giving up on
  markup would result in an AssertionError instead of a
  ParserRejectedMarkup exception.

* Added the correct stacklevel to instances of the XMLParsedAsHTMLWarning.
  [bug=2034451]

* Corrected the syntax of the license definition in pyproject.toml. Patch
  by Louis Maddox. [bug=2032848]

* Corrected a typo in a test that was causing test failures when run against
  libxml2 2.12.1. [bug=2045481]

= 4.12.2 (20230407)

* Fixed an unhandled exception in BeautifulSoup.decode_contents
  and methods that call it. [bug=2015545]

= 4.12.1 (20230405)

NOTE: the following things are likely to be dropped in the next
feature release of Beautiful Soup:

  Official support for Python 3.6.
  Inclusion of unit tests and test data in the wheel file.
  Two scripts: demonstrate_parser_differences.py and test-all-versions.

Changes:

* This version of Beautiful Soup replaces setup.py and setup.cfg
  with pyproject.toml. Beautiful Soup now uses tox as its test backend
  and hatch to do builds.

* The main functional improvement in this version is a nonrecursive technique
  for regenerating a tree. This technique is used to avoid situations where,
  in previous versions, doing something to a very deeply nested tree
  would overflow the Python interpreter stack:

  1. Outputting a tree as a string, e.g. with
     BeautifulSoup.encode() [bug=1471755]

  2. Making copies of trees (copy.copy() and
     copy.deepcopy() from the Python standard library). [bug=1709837]

  3. Pickling a BeautifulSoup object. (Note that pickling a Tag
     object can still cause an overflow.)

* Making a copy of a BeautifulSoup object no longer parses the
  document again, which should improve performance significantly.

* When a BeautifulSoup object is unpickled, Beautiful Soup now
  tries to associate an appropriate TreeBuilder object with it.

* Tag.prettify() will now consistently end prettified markup with
  a newline.

* Added unit tests for fuzz test cases created by third
  parties. Some of these tests are skipped since they point
  to problems outside of Beautiful Soup, but this change
  puts them all in one convenient place.

* PageElement now implements the known_xml attribute. (This was technically
  a bug, but it shouldn't be an issue in normal use.) [bug=2007895]

* The demonstrate_parser_differences.py script was still written in
  Python 2. I've converted it to Python 3, but since no one has
  mentioned this over the years, it's a sign that no one uses this
  script and it's not serving its purpose.

= 4.12.0 (20230320)

* Introduced the .css property, which centralizes all access to
  the Soup Sieve API. This allows Beautiful Soup to give direct
  access to as much of Soup Sieve as makes sense, without cluttering
  the BeautifulSoup and Tag classes with a lot of new methods.

  This does mean one addition to the BeautifulSoup and Tag classes
  (the .css property itself), so this might be a breaking change if you
  happen to use Beautiful Soup to parse XML that includes a tag called
  <css>. In particular, code like this will stop working in 4.12.0:

    soup.css['id']

  Code like this will work just as before:

    soup.find('css')['id']

  The Soup Sieve methods supported through the .css property are
  select(), select_one(), iselect(), closest(), match(), filter(),
  escape(), and compile(). The BeautifulSoup and Tag classes still
  support the select() and select_one() methods; they have not been
  deprecated, but they have been demoted to convenience methods.

  [bug=2003677]

* When the html.parser parser decides it can't parse a document, Beautiful
  Soup now consistently propagates this fact by raising a
  ParserRejectedMarkup error. [bug=2007343]

* Removed some error checking code from diagnose(), which is redundant with
  similar (but more Pythonic) code in the BeautifulSoup constructor.
  [bug=2007344]

* Added intersphinx references to the documentation so that other
  projects have a target to point to when they reference Beautiful
  Soup classes. [bug=1453370]

= 4.11.2 (20230131)

* Fixed test failures caused by nondeterministic behavior of
  UnicodeDammit's character detection, depending on the platform setup.
  [bug=1973072]

* Fixed another crash when overriding multi_valued_attributes and using the
  html5lib parser. [bug=1948488]

* The HTMLFormatter and XMLFormatter constructors no longer return a
  value. [bug=1992693]

* Tag.interesting_string_types is now propagated when a tag is
  copied. [bug=1990400]

* Warnings now do their best to provide an appropriate stacklevel,
  improving the usefulness of the message. [bug=1978744]

* Passing a Tag's .contents into PageElement.extend() now works the
  same way as passing the Tag itself.

* Soup Sieve tests will be skipped if the library is not installed.

= 4.11.1 (20220408)

This release was done to ensure that the unit tests are packaged along
with the released source. There are no functionality changes in this
release, but there are a few other packaging changes:

* The Japanese and Korean translations of the documentation are included.
* The changelog is now packaged as CHANGELOG, and the license file is
  packaged as LICENSE. NEWS.txt and COPYING.txt are still present,
  but may be removed in the future.
* TODO.txt is no longer packaged, since a TODO is not relevant for released
  code.

= 4.11.0 (20220407)

* Ported unit tests to use pytest.

* Added special string classes, RubyParenthesisString and RubyTextString,
  to make it possible to treat ruby text specially in get_text() calls.
  [bug=1941980]

* It's now possible to customize the way output is indented by
  providing a value for the 'indent' argument to the Formatter
  constructor. The 'indent' argument works very similarly to the
  argument of the same name in the Python standard library's
  json.dump() function. [bug=1955497]

* If the charset-normalizer Python module
  (https://pypi.org/project/charset-normalizer/) is installed, Beautiful
  Soup will use it to detect the character sets of incoming documents.
  This is also the module used by newer versions of the Requests library.
  For the sake of backwards compatibility, chardet and cchardet both take
  precedence if installed. [bug=1955346]

* Added a workaround for an lxml bug
  (https://bugs.launchpad.net/lxml/+bug/1948551) that causes
  problems when parsing a Unicode string beginning with BYTE ORDER MARK.
  [bug=1947768]

* Issue a warning when an HTML parser is used to parse a document that
  looks like XML but not XHTML. [bug=1939121]

* Do a better job of keeping track of namespaces as an XML document is
  parsed, so that CSS selectors that use namespaces will do the right
  thing more often. [bug=1946243]

* Some time ago, the misleadingly named "text" argument to find-type
  methods was renamed to the more accurate "string." But this supposed
  "renaming" didn't make it into important places like the method
  signatures or the docstrings. That's corrected in this
  version. "text" still works, but will give a DeprecationWarning.
  [bug=1947038]

* Fixed a crash when pickling a BeautifulSoup object that has no
  tree builder. [bug=1934003]

* Fixed a crash when overriding multi_valued_attributes and using the
  html5lib parser. [bug=1948488]

* Standardized the wording of the MarkupResemblesLocatorWarning
  warnings to omit untrusted input and make the warnings less
  judgmental about what you ought to be doing. [bug=1955450]

* Removed support for the iconv_codec library, which doesn't seem
  to exist anymore and was never put up on PyPI. (The closest
  replacement on PyPI, iconv_codecs, is GPL-licensed, so we can't use
  it--it's also quite old.)

= 4.10.0 (20210907)

* This is the first release of Beautiful Soup to only support Python
  3. I dropped Python 2 support to maintain support for newer versions
  (58 and up) of setuptools. See:
  https://github.com/pypa/setuptools/issues/2769 [bug=1942919]

* The behavior of methods like .get_text() and .strings now differs
  depending on the type of tag. The change is visible with HTML tags
  like <script>, <style>, and <template>. Starting in 4.9.0, methods
  like get_text() returned no results on such tags, because the
  contents of those tags are not considered 'text' within the document
  as a whole.

  But a user who calls script.get_text() is working from a different
  definition of 'text' than a user who calls div.get_text()--otherwise
  there would be no need to call script.get_text() at all. In 4.10.0,
  the contents of (e.g.) a <script> tag are considered 'text' during a
  get_text() call on the tag itself, but not considered 'text' during
  a get_text() call on the tag's parent.

  Because of this change, calling get_text() on each child of a tag
  may now return a different result than calling get_text() on the tag
  itself. That's because different tags now have different
  understandings of what counts as 'text'. [bug=1906226] [bug=1868861]

* NavigableString and its subclasses now implement the get_text()
  method, as well as the properties .strings and
  .stripped_strings. These methods will either return the string
  itself, or nothing, so the only reason to use this is when iterating
  over a list of mixed Tag and NavigableString objects. [bug=1904309]

* The 'html5' formatter now treats attributes whose values are the
  empty string as HTML boolean attributes. Previously (and in other
  formatters), an attribute value must be set as None to be treated as
  a boolean attribute. In a future release, I plan to also give this
  behavior to the 'html' formatter. Patch by Isaac Muse. [bug=1915424]

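As an illustration of the boolean-attribute rule above, here is a minimal, hypothetical sketch; render_attrs is an invented name and this is not Beautiful Soup's actual Formatter code:

```python
def render_attrs(attrs, html5=True):
    """Serialize an attribute dict, treating empty-string values as
    boolean attributes under the 'html5' style (None is always boolean)."""
    parts = []
    for name, value in attrs.items():
        if value is None or (html5 and value == ""):
            parts.append(name)          # bare boolean attribute
        else:
            parts.append(f'{name}="{value}"')
    return " ".join(parts)

render_attrs({"selected": ""})               # html5 style: selected
render_attrs({"selected": ""}, html5=False)  # older style: selected=""
```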
* The 'replace_with()' method now takes a variable number of arguments,
  and can be used to replace a single element with a sequence of elements.
  Patch by Bill Chandos. [rev=605]

* Corrected output when the namespace prefix associated with a
  namespaced attribute is the empty string, as opposed to
  None. [bug=1915583]

* Performance improvement when processing tags that speeds up overall
  tree construction by 2%. Patch by Morotti. [bug=1899358]

* Corrected the use of special string container classes in cases when a
  single tag may contain strings with different containers, such as
  the <template> tag, which may contain both TemplateString objects
  and Comment objects. [bug=1913406]

* The html.parser tree builder can now handle named entities
  found in the HTML5 spec in much the same way that the html5lib
  tree builder does. Note that the lxml HTML tree builder doesn't handle
  named entities this way. [bug=1924908]

* Added a second way to specify encodings to UnicodeDammit and
  EncodingDetector, based on the order of precedence defined in the
  HTML5 spec, starting at:
  https://html.spec.whatwg.org/multipage/parsing.html#parsing-with-a-known-character-encoding

  Encodings in 'known_definite_encodings' are tried first, then
  byte-order-mark sniffing is run, then encodings in 'user_encodings'
  are tried. The old argument, 'override_encodings', is now a
  deprecated alias for 'known_definite_encodings'.

  This changes the default behavior of the html.parser and lxml tree
  builders, in a way that may slightly improve encoding
  detection but will probably have no effect. [bug=1889014]

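The precedence described above can be sketched in plain Python. This is an illustrative stand-in, not the real EncodingDetector API; detect and its parameter names are invented for the example:

```python
import codecs

def detect(data, known_definite_encodings=(), user_encodings=()):
    """Sketch of the precedence: definite encodings first, then
    byte-order-mark sniffing, then user-supplied fallbacks."""
    boms = [(codecs.BOM_UTF8, "utf-8"),
            (codecs.BOM_UTF16_LE, "utf-16-le"),
            (codecs.BOM_UTF16_BE, "utf-16-be")]

    def decodes_as(enc):
        try:
            data.decode(enc)
            return True
        except UnicodeDecodeError:
            return False

    for enc in known_definite_encodings:   # tried first
        if decodes_as(enc):
            return enc
    for bom, enc in boms:                  # then BOM sniffing
        if data.startswith(bom):
            return enc
    for enc in user_encodings:             # then user fallbacks
        if decodes_as(enc):
            return enc
    return None
```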
* Improve the warning issued when a directory name (as opposed to
  the name of a regular file) is passed as markup into the BeautifulSoup
  constructor. [bug=1913628]

= 4.9.3 (20201003)

This is the final release of Beautiful Soup to support Python
2. Beautiful Soup's official support for Python 2 ended on 01 January,
2021. In the Launchpad Git repository, the final revision to support
Python 2 was revision 70f546b1e689a70e2f103795efce6d261a3dadf7; it is
tagged as "python2".

* Implemented a significant performance optimization to the process of
  searching the parse tree. Patch by Morotti. [bug=1898212]

= 4.9.2 (20200926)

* Fixed a bug that caused too many tags to be popped from the tag
  stack during tree building, when encountering a closing tag that had
  no matching opening tag. [bug=1880420]

* Fixed a bug that inconsistently moved elements over when passing
  a Tag, rather than a list, into Tag.extend(). [bug=1885710]

* Specify the soupsieve dependency in a way that complies with
  PEP 508. Patch by Mike Nerone. [bug=1893696]

* Change the signatures for BeautifulSoup.insert_before and insert_after
  (which are not implemented) to match PageElement.insert_before and
  insert_after, quieting warnings in some IDEs. [bug=1897120]

= 4.9.1 (20200517)

* Added a keyword argument 'on_duplicate_attribute' to the
  BeautifulSoupHTMLParser constructor (used by the html.parser tree
  builder) which lets you customize the handling of markup that
  contains the same attribute more than once, as in:
  <a href="url1" href="url2"> [bug=1878209]

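For illustration, the same idea can be sketched on top of the standard library's html.parser, which reports duplicate attributes as repeated entries in the attrs list. DedupingParser and its 'policy' argument are invented names, not Beautiful Soup's actual implementation or its defaults:

```python
from html.parser import HTMLParser

class DedupingParser(HTMLParser):
    """Collects start-tag attributes, resolving duplicates by policy:
    'replace' keeps the last value seen, 'ignore' keeps the first."""
    def __init__(self, policy="replace"):
        super().__init__()
        self.policy = policy
        self.attrs = {}

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in self.attrs and self.policy == "ignore":
                continue                # keep the first occurrence
            self.attrs[name] = value    # 'replace': last one wins

p = DedupingParser(policy="ignore")
p.feed('<a href="url1" href="url2">')
```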
* Added a distinct subclass, GuessedAtParserWarning, for the warning
  issued when BeautifulSoup is instantiated without a parser being
  specified. [bug=1873787]

* Added a distinct subclass, MarkupResemblesLocatorWarning, for the
  warning issued when BeautifulSoup is instantiated with 'markup' that
  actually seems to be a URL or the path to a file on
  disk. [bug=1873787]

* The new NavigableString subclasses (Stylesheet, Script, and
  TemplateString) can now be imported directly from the bs4 package.

* If you encode a document with a Python-specific encoding like
  'unicode_escape', that encoding is no longer mentioned in the final
  XML or HTML document. Instead, encoding information is omitted or
  left blank. [bug=1874955]

* Fixed test failures when run against soupsieve 2.0. Patch by Tomáš
  Chvátal. [bug=1872279]

= 4.9.0 (20200405)

* Added PageElement.decomposed, a new property which lets you
  check whether you've already called decompose() on a Tag or
  NavigableString.

* Embedded CSS and Javascript is now stored in distinct Stylesheet and
  Script tags, which are ignored by methods like get_text() since most
  people don't consider this sort of content to be 'text'. This
  feature is not supported by the html5lib treebuilder. [bug=1868861]

* Added a Russian translation by 'authoress' to the repository.

* Fixed an unhandled exception when formatting a Tag that had been
  decomposed. [bug=1857767]

* Fixed a bug that happened when passing a Unicode filename containing
  non-ASCII characters as markup into Beautiful Soup, on a system that
  allows Unicode filenames. [bug=1866717]

* Added a performance optimization to PageElement.extract(). Patch by
  Arthur Darcet.

= 4.8.2 (20191224)

* Added Python docstrings to all public methods of the most commonly
  used classes.

* Added a Chinese translation by Deron Wang and a Brazilian Portuguese
  translation by Cezar Peixeiro to the repository.

* Fixed two deprecation warnings. Patches by Colin
  Watson and Nicholas Neumann. [bug=1847592] [bug=1855301]

* The html.parser tree builder now correctly handles DOCTYPEs that are
  not uppercase. [bug=1848401]

* PageElement.select() now returns a ResultSet rather than a regular
  list, making it consistent with methods like find_all().

= 4.8.1 (20191006)

* When the html.parser or html5lib parsers are in use, Beautiful Soup
  will, by default, record the position in the original document where
  each tag was encountered. This includes line number (Tag.sourceline)
  and position within a line (Tag.sourcepos). Based on code by Chris
  Mayo. [bug=1742921]

* When instantiating a BeautifulSoup object, it's now possible to
  provide a dictionary ('element_classes') of the classes you'd like to be
  instantiated instead of Tag, NavigableString, etc.

* Fixed the definition of the default XML namespace when using
  lxml 4.4. Patch by Isaac Muse. [bug=1840141]

* Fixed a crash when pretty-printing tags that were not created
  during initial parsing. [bug=1838903]

* Copying a Tag preserves information that was originally obtained from
  the TreeBuilder used to build the original Tag. [bug=1838903]

* Raise an explanatory exception when the underlying parser
  completely rejects the incoming markup. [bug=1838877]

* Avoid a crash when trying to detect the declared encoding of a
  Unicode document. [bug=1838877]

* Avoid a crash when unpickling certain parse trees generated
  using html5lib on Python 3. [bug=1843545]

= 4.8.0 (20190720, "One Small Soup")

This release focuses on making it easier to customize Beautiful Soup's
input mechanism (the TreeBuilder) and output mechanism (the Formatter).

* You can customize the TreeBuilder object by passing keyword
  arguments into the BeautifulSoup constructor. Those keyword
  arguments will be passed along into the TreeBuilder constructor.

  The main reason to do this right now is to change which
  attributes are treated as multi-valued attributes (the way 'class'
  is treated by default). You can do this with the
  'multi_valued_attributes' argument. [bug=1832978]

* The role of Formatter objects has been greatly expanded. The Formatter
  class now controls the following:

  - The function to call to perform entity substitution. (This was
    previously Formatter's only job.)
  - Which tags should be treated as containing CDATA and have their
    contents exempt from entity substitution.
  - The order in which a tag's attributes are output. [bug=1812422]
  - Whether or not to put a '/' inside a void element, e.g. '<br/>' vs '<br>'

  All preexisting code should work as before.

* Added a new method to the API, Tag.smooth(), which consolidates
  multiple adjacent NavigableString elements. [bug=1697296]

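The consolidation Tag.smooth() performs can be sketched with plain strings standing in for NavigableString objects; smooth here is an invented stand-alone function, not the real method:

```python
def smooth(nodes):
    """Merge runs of adjacent strings in a mixed list of nodes;
    non-string objects pass through untouched."""
    out = []
    for node in nodes:
        if out and isinstance(node, str) and isinstance(out[-1], str):
            out[-1] = out[-1] + node    # extend the current run
        else:
            out.append(node)
    return out

smooth(["Hello, ", "world", 42, "!"])  # ['Hello, world', 42, '!']
```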
* &apos; (which is valid in XML, XHTML, and HTML 5, but not HTML 4) is always
  recognized as a named entity and converted to a single quote. [bug=1818721]

= 4.7.1 (20190106)

* Fixed a significant performance problem introduced in 4.7.0. [bug=1810617]

* Fixed an incorrectly raised exception when inserting a tag before or
  after an identical tag. [bug=1810692]

* Beautiful Soup will no longer try to keep track of namespaces that
  are not defined with a prefix; this can confuse soupsieve. [bug=1810680]

* Tried even harder to avoid the deprecation warning originally fixed in
  4.6.1. [bug=1778909]

= 4.7.0 (20181231)

* Beautiful Soup's CSS Selector implementation has been replaced by a
  dependency on Isaac Muse's SoupSieve project (the soupsieve package
  on PyPI). The good news is that SoupSieve has a much more robust and
  complete implementation of CSS selectors, resolving a large number
  of longstanding issues. The bad news is that from this point onward,
  SoupSieve must be installed if you want to use the select() method.

  You don't have to change anything if you installed Beautiful Soup
  through pip (SoupSieve will be automatically installed when you
  upgrade Beautiful Soup) or if you don't use CSS selectors from
  within Beautiful Soup.

  SoupSieve documentation: https://facelessuser.github.io/soupsieve/

* Added the PageElement.extend() method, which works like list.append().
  [bug=1514970]

* PageElement.insert_before() and insert_after() now take a variable
  number of arguments. [bug=1514970]

* Fix a number of problems with the tree builder that caused
  trees that were superficially okay, but which fell apart when bits
  were extracted. Patch by Isaac Muse. [bug=1782928,1809910]

* Fixed a problem with the tree builder in which elements that
  contained no content (such as empty comments and all-whitespace
  elements) were not being treated as part of the tree. Patch by Isaac
  Muse. [bug=1798699]

* Fixed a problem with multi-valued attributes where the value
  contained whitespace. Thanks to Jens Svalgaard for the
  fix. [bug=1787453]

* Clarified ambiguous license statements in the source code. Beautiful
  Soup is released under the MIT license, and has been since 4.4.0.

* This file has been renamed from NEWS.txt to CHANGELOG.

= 4.6.3 (20180812)

* Exactly the same as 4.6.2. Re-released to make the README file
  render properly on PyPI.

= 4.6.2 (20180812)

* Fix an exception when a custom formatter was asked to format a void
  element. [bug=1784408]

= 4.6.1 (20180728)

* Stop data loss when encountering an empty numeric entity, and
  possibly in other cases. Thanks to tos.kamiya for the fix. [bug=1698503]

* Preserve XML namespaces introduced inside an XML document, not just
  the ones introduced at the top level. [bug=1718787]

* Added a new formatter, "html5", which represents void elements
  as "<element>" rather than "<element/>". [bug=1716272]

* Fixed a problem where the html.parser tree builder interpreted
  a string like "&foo " as the character entity "&foo;" [bug=1728706]

* Correctly handle invalid HTML numeric character entities like &#147;
  which reference code points that are not Unicode code points. Note
  that this is only fixed when Beautiful Soup is used with the
  html.parser parser -- html5lib already worked and I couldn't fix it
  with lxml. [bug=1782933]

* Improved the warning given when no parser is specified. [bug=1780571]

* When markup contains duplicate elements, a select() call that
  includes multiple match clauses will match all relevant
  elements. [bug=1770596]

* Fixed code that was causing deprecation warnings in recent Python 3
  versions. Includes a patch from Ville Skyttä. [bug=1778909] [bug=1689496]

* Fixed a Windows crash in diagnose() when checking whether a long
  markup string is a filename. [bug=1737121]

* Stopped HTMLParser from raising an exception in very rare cases of
  bad markup. [bug=1708831]

* Fixed a bug where find_all() was not working when asked to find a
  tag with a namespaced name in an XML document that was parsed as
  HTML. [bug=1723783]

* You can get finer control over formatting by subclassing
  bs4.element.Formatter and passing a Formatter instance into (e.g.)
  encode(). [bug=1716272]

* You can pass a dictionary of `attrs` into
  BeautifulSoup.new_tag. This makes it possible to create a tag with
  an attribute like 'name' that would otherwise be masked by another
  argument of new_tag. [bug=1779276]

* Clarified the deprecation warning when accessing tag.fooTag, to cover
  the possibility that you might really have been looking for a tag
  called 'fooTag'.

= 4.6.0 (20170507) =

* Added the `Tag.get_attribute_list` method, which acts like `Tag.get` for
  getting the value of an attribute, but which always returns a list,
  whether or not the attribute is a multi-value attribute. [bug=1678589]

* It's now possible to use a tag's namespace prefix when searching,
  e.g. soup.find('namespace:tag') [bug=1655332]

* Improved the handling of empty-element tags like <br> when using the
  html.parser parser. [bug=1676935]

* HTML parsers treat all HTML4 and HTML5 empty element tags (aka void
  element tags) correctly. [bug=1656909]

* Namespace prefix is preserved when an XML tag is copied. Thanks
  to Vikas for a patch and test. [bug=1685172]

= 4.5.3 (20170102) =

* Fixed foster parenting when html5lib is the tree builder. Thanks to
  Geoffrey Sneddon for a patch and test.

* Fixed yet another problem that caused the html5lib tree builder to
  create a disconnected parse tree. [bug=1629825]

= 4.5.2 (20170102) =

* Apart from the version number, this release is identical to
  4.5.3. Due to user error, it could not be completely uploaded to
  PyPI. Use 4.5.3 instead.

= 4.5.1 (20160802) =

* Fixed a crash when passing Unicode markup that contained a
  processing instruction into the lxml HTML parser on Python
  3. [bug=1608048]

= 4.5.0 (20160719) =

* Beautiful Soup is no longer compatible with Python 2.6. This
  actually happened a few releases ago, but it's now official.

* Beautiful Soup will now work with versions of html5lib greater than
  0.99999999. [bug=1603299]

* If a search against each individual value of a multi-valued
  attribute fails, the search will be run one final time against the
  complete attribute value considered as a single string. That is, if
  a tag has class="foo bar" and neither "foo" nor "bar" matches, but
  "foo bar" does, the tag is now considered a match.

  This happened in previous versions, but only when the value being
  searched for was a string. Now it also works when that value is
  a regular expression, a list of strings, etc. [bug=1476868]

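The fallback rule above can be sketched as a small stand-alone matcher; matches_class is an invented name and the real search machinery is more general:

```python
import re

def matches_class(class_values, want):
    """Try each value of a multi-valued attribute individually, then
    fall back to the whole attribute joined as a single string.
    'want' may be a plain string or a compiled regex (only these two
    kinds are handled in this sketch)."""
    def hit(value):
        if hasattr(want, "search"):        # compiled regex
            return want.search(value) is not None
        return value == want               # plain string

    if any(hit(v) for v in class_values):  # per-value pass
        return True
    return hit(" ".join(class_values))     # joined-string fallback
```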
* Fixed a bug that deranged the tree when a whitespace element was
  reparented into a tag that contained an identical whitespace
  element. [bug=1505351]

* Added support for CSS selector values that contain quoted spaces,
  such as tag[style="display: foo"]. [bug=1540588]

* Corrected handling of XML processing instructions. [bug=1504393]

* Corrected an encoding error that happened when a BeautifulSoup
  object was copied. [bug=1554439]

* The contents of <textarea> tags will no longer be modified when the
  tree is prettified. [bug=1555829]

* When a BeautifulSoup object is pickled but its tree builder cannot
  be pickled, its .builder attribute is set to None instead of being
  destroyed. This avoids a performance problem once the object is
  unpickled. [bug=1523629]

* Specify the file and line number when warning about a
  BeautifulSoup object being instantiated without a parser being
  specified. [bug=1574647]

* The `limit` argument to `select()` now works correctly, though it's
  not implemented very efficiently. [bug=1520530]

* Fixed a Python 3 ByteWarning when a URL was passed in as though it
  were markup. Thanks to James Salter for a patch and
  test. [bug=1533762]

* We don't run the check for a filename passed in as markup if the
  'filename' contains a less-than character; the less-than character
  indicates it's most likely a very small document. [bug=1577864]

650= 4.4.1 (20150928) =
651
652* Fixed a bug that deranged the tree when part of it was
653 removed. Thanks to Eric Weiser for the patch and John Wiseman for a
654 test. [bug=1481520]
655
656* Fixed a parse bug with the html5lib tree-builder. Thanks to Roel
657 Kramer for the patch. [bug=1483781]
658
659* Improved the implementation of CSS selector grouping. Thanks to
660 Orangain for the patch. [bug=1484543]
661
662* Fixed the test_detect_utf8 test so that it works when chardet is
663 installed. [bug=1471359]
664
665* Corrected the output of Declaration objects. [bug=1477847]
666
667
668= 4.4.0 (20150703) =
669
670Especially important changes:
671
672* Added a warning when you instantiate a BeautifulSoup object without
673 explicitly naming a parser. [bug=1398866]
674
675* __repr__ now returns an ASCII bytestring in Python 2, and a Unicode
676 string in Python 3, instead of a UTF8-encoded bytestring in both
677 versions. In Python 3, __str__ now returns a Unicode string instead
678 of a bytestring. [bug=1420131]
679
680* The `text` argument to the find_* methods is now called `string`,
681 which is more accurate. `text` still works, but `string` is the
682 argument described in the documentation. `text` may eventually
683 change its meaning, but not for a very long time. [bug=1366856]
684
685* Changed the way soup objects work under copy.copy(). Copying a
686 NavigableString or a Tag will give you a new NavigableString that's
687 equal to the old one but not connected to the parse tree. Patch by
688 Martijn Pieters. [bug=1307490]
689
690* Started using a standard MIT license. [bug=1294662]
691
692* Added a Chinese translation of the documentation by Delong .w.
693
694New features:
695
696* Introduced the select_one() method, which uses a CSS selector but
697 only returns the first match, instead of a list of
698 matches. [bug=1349367]
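 A sketch of the difference (assuming a current bs4 and 'html.parser'):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<a id="a1">one</a><a id="a2">two</a>', 'html.parser')

first = soup.select_one('a')           # first match only, not a list
assert first['id'] == 'a1'
assert soup.select_one('div') is None  # no match returns None, not []
```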
699
700* You can now create a Tag object without specifying a
701 TreeBuilder. Patch by Martijn Pieters. [bug=1307471]
702
703* You can now create a NavigableString or a subclass just by invoking
704 the constructor. [bug=1294315]
705
706* Added an `exclude_encodings` argument to UnicodeDammit and to the
707 Beautiful Soup constructor, which lets you prohibit the detection of
708 an encoding that you know is wrong. [bug=1469408]
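 A hedged sketch with UnicodeDammit directly; the exact encoding chosen
 depends on which detector libraries are installed, but the excluded
 guess is never used:

```python
from bs4 import UnicodeDammit

# b"Sacr\xe9 bleu!" decodes cleanly as ISO-8859-1, but suppose we
# know that guess is wrong:
dammit = UnicodeDammit(b'Sacr\xe9 bleu!', exclude_encodings=['iso-8859-1'])
assert dammit.original_encoding.lower() != 'iso-8859-1'
assert 'Sacr' in dammit.unicode_markup
```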
709
710* The select() method now supports selector grouping. Patch by
711 Francisco Canas. [bug=1191917]
712
713Bug fixes:
714
715* Fixed yet another problem that caused the html5lib tree builder to
716 create a disconnected parse tree. [bug=1237763]
717
718* Force object_was_parsed() to keep the tree intact even when an element
719 from later in the document is moved into place. [bug=1430633]
720
721* Fixed yet another bug that caused a disconnected tree when html5lib
722 copied an element from one part of the tree to another. [bug=1270611]
723
724* Fixed a bug where Element.extract() could create an infinite loop in
725 the remaining tree.
726
727* The select() method can now find tags whose names contain
728 dashes. Patch by Francisco Canas. [bug=1276211]
729
730* The select() method can now find tags with attributes whose names
731 contain dashes. Patch by Marek Kapolka. [bug=1304007]
732
733* Improved the lxml tree builder's handling of processing
734 instructions. [bug=1294645]
735
736* Restored the helpful syntax error that happens when you try to
737 import the Python 2 edition of Beautiful Soup under Python
738 3. [bug=1213387]
739
740* In Python 3.4 and above, set the new convert_charrefs argument to
741 the html.parser constructor to avoid a warning and future
742 failures. Patch by Stefano Revera. [bug=1375721]
743
744* The warning when you pass in a filename or URL as markup will now be
745 displayed correctly even if the filename or URL is a Unicode
746 string. [bug=1268888]
747
748* If the initial <html> tag contains a CDATA list attribute such as
749 'class', the html5lib tree builder will now turn its value into a
750 list, as it would with any other tag. [bug=1296481]
751
752* Fixed an import error in Python 3.5 caused by the removal of the
753 HTMLParseError class. [bug=1420063]
754
755* Improved docstring for encode_contents() and
756 decode_contents(). [bug=1441543]
757
758* Fixed a crash in Unicode, Dammit's encoding detector when the name
759 of the encoding itself contained invalid bytes. [bug=1360913]
760
761* Improved the exception raised when you call .unwrap() or
762 .replace_with() on an element that's not attached to a tree.
763
764* Raise a NotImplementedError whenever an unsupported CSS pseudoclass
765 is used in select(). Previously some cases did not result in a
766 NotImplementedError.
767
768* It's now possible to pickle a BeautifulSoup object no matter which
769 tree builder was used to create it. However, the only tree builder
770 that survives the pickling process is the HTMLParserTreeBuilder
771 ('html.parser'). If you unpickle a BeautifulSoup object created with
772 some other tree builder, soup.builder will be None. [bug=1231545]
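 A sketch of the round trip with the one builder that survives pickling
 (assuming a current bs4):

```python
import pickle

from bs4 import BeautifulSoup

soup = BeautifulSoup('<p>hi</p>', 'html.parser')
restored = pickle.loads(pickle.dumps(soup))
# The tree itself survives intact:
assert restored.p.get_text() == 'hi'
```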
773
774= 4.3.2 (20131002) =
775
776* Fixed a bug in which short Unicode input was improperly encoded to
777 ASCII when checking whether or not it was the name of a file on
778 disk. [bug=1227016]
779
780* Fixed a crash when a short input contains data not valid in
781 filenames. [bug=1232604]
782
783* Fixed a bug that caused Unicode data put into UnicodeDammit to
784 return None instead of the original data. [bug=1214983]
785
786* Combined two tests to stop a spurious test failure when tests are
787 run by nosetests. [bug=1212445]
788
789= 4.3.1 (20130815) =
790
791* Fixed yet another problem with the html5lib tree builder, caused by
792 html5lib's tendency to rearrange the tree during
793 parsing. [bug=1189267]
794
795* Fixed a bug that caused the optimized version of find_all() to
796 return nothing. [bug=1212655]
797
798= 4.3.0 (20130812) =
799
800* Instead of converting incoming data to Unicode and feeding it to the
801 lxml tree builder in chunks, Beautiful Soup now makes successive
802 guesses at the encoding of the incoming data, and tells lxml to
803 parse the data as that encoding. Giving lxml more control over the
804 parsing process improves performance and avoids a number of bugs and
805 issues with the lxml parser which had previously required elaborate
806 workarounds:
807
808 - An issue in which lxml refuses to parse Unicode strings on some
809 systems. [bug=1180527]
810
811 - A recurring bug that truncated documents longer than a (very
812 small) size. [bug=963880]
813
814 - A recurring bug in which extra spaces were added to a document if
815 the document defined a charset other than UTF-8. [bug=972466]
816
817 This required a major overhaul of the tree builder architecture. If
818 you wrote your own tree builder and didn't tell me, you'll need to
819 modify your prepare_markup() method.
820
821* The UnicodeDammit code that makes guesses at encodings has been
822 split into its own class, EncodingDetector. A lot of apparently
823 redundant code has been removed from Unicode, Dammit, and some
824 undocumented features have also been removed.
825
826* Beautiful Soup will issue a warning if instead of markup you pass it
827 a URL or the name of a file on disk (a common beginner's mistake).
828
829* A number of optimizations improve the performance of the lxml tree
830 builder by about 33%, the html.parser tree builder by about 20%, and
831 the html5lib tree builder by about 15%.
832
833* All find_all calls should now return a ResultSet object. Patch by
834 Aaron DeVore. [bug=1194034]
835
836= 4.2.1 (20130531) =
837
838* The default XML formatter will now replace ampersands even if they
839 appear to be part of entities. That is, "&lt;" will become
840 "&amp;lt;". The old code was left over from Beautiful Soup 3, which
841 didn't always turn entities into Unicode characters.
842
843 If you really want the old behavior (maybe because you add new
844 strings to the tree, those strings include entities, and you want
845 the formatter to leave them alone on output), it can be found in
846 EntitySubstitution.substitute_xml_containing_entities(). [bug=1182183]
847
848* Gave new_string() the ability to create subclasses of
849 NavigableString. [bug=1181986]
850
851* Fixed another bug by which the html5lib tree builder could create a
852 disconnected tree. [bug=1182089]
853
854* The .previous_element of a BeautifulSoup object is now always None,
855 not the last element to be parsed. [bug=1182089]
856
857* Fixed test failures when lxml is not installed. [bug=1181589]
858
859* html5lib now supports Python 3. Fixed some Python 2-specific
860 code in the html5lib test suite. [bug=1181624]
861
862* The html.parser treebuilder can now handle numeric attributes in
863 text when the hexadecimal name of the attribute starts with a
864 capital X. Patch by Tim Shirley. [bug=1186242]
865
866= 4.2.0 (20130514) =
867
868* The Tag.select() method now supports a much wider variety of CSS
869 selectors.
870
871 - Added support for the adjacent sibling combinator (+) and the
872 general sibling combinator (~). Tests by "liquider". [bug=1082144]
873
874 - The combinators (>, +, and ~) can now combine with any supported
875 selector, not just one that selects based on tag name.
876
877 - Added limited support for the "nth-of-type" pseudo-class. Code
878 by Sven Slootweg. [bug=1109952]
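 The new combinators and pseudo-class in action (illustration only,
 assuming a current bs4 and 'html.parser'):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<div><p>a</p><p class="x">b</p><p>c</p><p>d</p></div>', 'html.parser')

def texts(selector):
    return [tag.get_text() for tag in soup.select(selector)]

assert texts('p.x + p') == ['c']           # adjacent sibling combinator
assert texts('p.x ~ p') == ['c', 'd']      # general sibling combinator
assert texts('p:nth-of-type(2)') == ['b']  # nth-of-type pseudo-class
```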
879
880* The BeautifulSoup class is now aliased to "_s" and "_soup", making
881 it quicker to type the import statement in an interactive session:
882
883 from bs4 import _s
884 or
885 from bs4 import _soup
886
887 The alias may change in the future, so don't use this in code you're
888 going to run more than once.
889
890* Added the 'diagnose' submodule, which includes several useful
891 functions for reporting problems and doing tech support.
892
893 - diagnose(data) tries the given markup on every installed parser,
894 reporting exceptions and displaying successes. If a parser is not
895 installed, diagnose() mentions this fact.
896
897 - lxml_trace(data, html=True) runs the given markup through lxml's
898 XML parser or HTML parser, and prints out the parser events as
899 they happen. This helps you quickly determine whether a given
900 problem occurs in lxml code or Beautiful Soup code.
901
902 - htmlparser_trace(data) is the same thing, but for Python's
903 built-in HTMLParser class.
904
905* In an HTML document, the contents of a <script> or <style> tag will
906 no longer undergo entity substitution by default. XML documents work
907 the same way they did before. [bug=1085953]
908
909* Methods like get_text() and properties like .strings now only give
910 you strings that are visible in the document--no comments or
911 processing commands. [bug=1050164]
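 For example (a sketch against a current bs4), a comment no longer
 leaks into the extracted text:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p>hi<!-- a comment --> there</p>', 'html.parser')
# Only visible strings are returned; the comment is skipped:
assert soup.get_text() == 'hi there'
```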
912
913* The prettify() method now leaves the contents of <pre> tags
914 alone. [bug=1095654]
915
916* Fix a bug in the html5lib treebuilder which sometimes created
917 disconnected trees. [bug=1039527]
918
919* Fix a bug in the lxml treebuilder which crashed when a tag included
920 an attribute from the predefined "xml:" namespace. [bug=1065617]
921
922* Fix a bug by which keyword arguments to find_parent() were not
923 being passed on. [bug=1126734]
924
925* Stop a crash when unwisely messing with a tag that's been
926 decomposed. [bug=1097699]
927
928* Now that lxml's segfault on invalid doctype has been fixed, fixed a
929 corresponding problem on the Beautiful Soup end that was previously
930 invisible. [bug=984936]
931
932* Fixed an exception when an overspecified CSS selector didn't match
933 anything. Code by Stefaan Lippens. [bug=1168167]
934
935= 4.1.3 (20120820) =
936
937* Skipped a test under Python 2.6 and Python 3.1 to avoid a spurious
938 test failure caused by the lousy HTMLParser in those
939 versions. [bug=1038503]
940
941* Raise a more specific error (FeatureNotFound) when a requested
942 parser or parser feature is not installed. Raise NotImplementedError
943 instead of ValueError when the user calls insert_before() or
944 insert_after() on the BeautifulSoup object itself. Patch by Aaron
945 Devore. [bug=1038301]
946
947= 4.1.2 (20120817) =
948
949* As per PEP-8, allow searching by CSS class using the 'class_'
950 keyword argument. [bug=1037624]
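 A one-liner sketch (current bs4, 'html.parser'); the trailing
 underscore avoids the clash with the 'class' keyword:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="note">n</p><p>m</p>', 'html.parser')
# class_='note' is shorthand for attrs={'class': 'note'}:
assert len(soup.find_all('p', class_='note')) == 1
```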
951
952* Display namespace prefixes for namespaced attribute names, instead of
953 the fully-qualified names given by the lxml parser. [bug=1037597]
954
955* Fixed a crash on encoding when an attribute name contained
956 non-ASCII characters.
957
958* When sniffing encodings, if the cchardet library is installed,
959 Beautiful Soup uses it instead of chardet. cchardet is much
960 faster. [bug=1020748]
961
962 Use logging.warning() instead of warnings.warn() to notify the user
963 that characters were replaced with REPLACEMENT
964 CHARACTER. [bug=1013862]
965
966= 4.1.1 (20120703) =
967
968* Fixed an html5lib tree builder crash which happened when html5lib
969 moved a tag with a multivalued attribute from one part of the tree
970 to another. [bug=1019603]
971
972* Correctly display closing tags with an XML namespace declared. Patch
973 by Andreas Kostyrka. [bug=1019635]
974
975* Fixed a typo that made parsing significantly slower than it should
976 have been, and also waited too long to close tags with XML
977 namespaces. [bug=1020268]
978
979* get_text() now returns an empty Unicode string if there is no text,
980 rather than an empty bytestring. [bug=1020387]
981
982= 4.1.0 (20120529) =
983
984* Added experimental support for fixing Windows-1252 characters
985 embedded in UTF-8 documents. (UnicodeDammit.detwingle())
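 A sketch of detwingle() on a mixed document (assuming a current bs4):

```python
from bs4 import UnicodeDammit

snowmen = '\N{SNOWMAN}' * 3
quote = '\N{LEFT DOUBLE QUOTATION MARK}Hi!\N{RIGHT DOUBLE QUOTATION MARK}'
# UTF-8 data with Windows-1252 smart quotes embedded in it:
doc = snowmen.encode('utf8') + quote.encode('windows_1252')
# detwingle() converts the Windows-1252 bytes to UTF-8, so the whole
# document can now be decoded as UTF-8:
fixed = UnicodeDammit.detwingle(doc)
assert fixed.decode('utf8') == snowmen + quote
```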
986
987* Fixed the handling of &quot; with the built-in parser. [bug=993871]
988
989* Comments, processing instructions, document type declarations, and
990 markup declarations are now treated as preformatted strings, the way
991 CData blocks are. [bug=1001025]
992
993* Fixed a bug with the lxml treebuilder that prevented the user from
994 adding attributes to a tag that didn't originally have
995 attributes. [bug=1002378] Thanks to Oliver Beattie for the patch.
996
997* Fixed some edge-case bugs having to do with inserting an element
998 into a tag it's already inside, and replacing one of a tag's
999 children with another. [bug=997529]
1000
1001* Added the ability to search for attribute values specified in UTF-8. [bug=1003974]
1002
1003 This caused a major refactoring of the search code. All the tests
1004 pass, but it's possible that some searches will behave differently.
1005
1006= 4.0.5 (20120427) =
1007
1008* Added a new method, wrap(), which wraps an element in a tag.
1009
1010* Renamed replace_with_children() to unwrap(), which is easier to
1011 understand and also the jQuery name of the function.
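 A sketch of the pair (current bs4, 'html.parser'):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p>hello</p>', 'html.parser')
soup.p.string.wrap(soup.new_tag('b'))  # wrap the string in a new <b>
assert str(soup) == '<p><b>hello</b></p>'
soup.b.unwrap()                        # remove <b>, keeping its contents
assert str(soup) == '<p>hello</p>'
```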
1012
1013* Made encoding substitution in <meta> tags completely transparent (no
1014 more %SOUP-ENCODING%).
1015
1016* Fixed a bug in decoding data that contained a byte-order mark, such
1017 as data encoded in UTF-16LE. [bug=988980]
1018
1019* Fixed a bug that made the HTMLParser treebuilder generate XML
1020 definitions ending with two question marks instead of
1021 one. [bug=984258]
1022
1023* Upon document generation, CData objects are no longer run through
1024 the formatter. [bug=988905]
1025
1026* The test suite now passes when lxml is not installed, whether or not
1027 html5lib is installed. [bug=987004]
1028
1029* Print a warning on HTMLParseErrors to let people know they should
1030 install a better parser library.
1031
1032= 4.0.4 (20120416) =
1033
1034* Fixed a bug that sometimes created disconnected trees.
1035
1036* Fixed a bug with the string setter that moved a string around the
1037 tree instead of copying it. [bug=983050]
1038
1039* Attribute values are now run through the provided output formatter.
1040 Previously they were always run through the 'minimal' formatter. In
1041 the future I may make it possible to specify different formatters
1042 for attribute values and strings, but for now, consistent behavior
1043 is better than inconsistent behavior. [bug=980237]
1044
1045* Added the missing renderContents method from Beautiful Soup 3. Also
1046 added an encode_contents() method to go along with decode_contents().
1047
1048* Give a more useful error when the user tries to run the Python 2
1049 version of BS under Python 3.
1050
1051* UnicodeDammit can now convert Microsoft smart quotes to ASCII with
1052 UnicodeDammit(markup, smart_quotes_to="ascii").
1053
1054= 4.0.3 (20120403) =
1055
1056* Fixed a typo that caused some versions of Python 3 to convert the
1057 Beautiful Soup codebase incorrectly.
1058
1059* Got rid of the 4.0.2 workaround for HTML documents--it was
1060 unnecessary and the workaround was triggering a (possibly different,
1061 but related) bug in lxml. [bug=972466]
1062
1063= 4.0.2 (20120326) =
1064
1065* Worked around a possible bug in lxml that prevents non-tiny XML
1066 documents from being parsed. [bug=963880, bug=963936]
1067
1068* Fixed a bug where specifying `text` while also searching for a tag
1069 only worked if `text` wanted an exact string match. [bug=955942]
1070
1071= 4.0.1 (20120314) =
1072
1073* This is the first official release of Beautiful Soup 4. There is no
1074 4.0.0 release, to eliminate any possibility that packaging software
1075 might treat "4.0.0" as being an earlier version than "4.0.0b10".
1076
1077* Brought BS up to date with the latest release of soupselect, adding
1078 CSS selector support for direct descendant matches and multiple CSS
1079 class matches.
1080
1081= 4.0.0b10 (20120302) =
1082
1083* Added support for simple CSS selectors, taken from the soupselect project.
1084
1085* Fixed a crash when using html5lib. [bug=943246]
1086
1087* In HTML5-style <meta charset="foo"> tags, the value of the "charset"
1088 attribute is now replaced with the appropriate encoding on
1089 output. [bug=942714]
1090
1091* Fixed a bug that caused calling a tag to sometimes call find_all()
1092 with the wrong arguments. [bug=944426]
1093
1094* For backwards compatibility, brought back the BeautifulStoneSoup
1095 class as a deprecated wrapper around BeautifulSoup.
1096
1097= 4.0.0b9 (20120228) =
1098
1099* Fixed the string representation of DOCTYPEs that have both a public
1100 ID and a system ID.
1101
1102* Fixed the generated XML declaration.
1103
1104* Renamed Tag.nsprefix to Tag.prefix, for consistency with
1105 NamespacedAttribute.
1106
1107* Fixed a test failure that occurred on Python 3.x when chardet was
1108 installed.
1109
1110* Made prettify() return Unicode by default, so it will look nice on
1111 Python 3 when passed into print().
1112
1113= 4.0.0b8 (20120224) =
1114
1115* All tree builders now preserve namespace information in the
1116 documents they parse. If you use the html5lib parser or lxml's XML
1117 parser, you can access the namespace URL for a tag as tag.namespace.
1118
1119 However, there is no special support for namespace-oriented
1120 searching or tree manipulation. When you search the tree, you need
1121 to use namespace prefixes exactly as they're used in the original
1122 document.
1123
1124* The string representation of a DOCTYPE always ends in a newline.
1125
1126* Issue a warning if the user tries to use a SoupStrainer in
1127 conjunction with the html5lib tree builder, which doesn't support
1128 them.
1129
1130= 4.0.0b7 (20120223) =
1131
1132* Upon decoding to string, any characters that can't be represented in
1133 your chosen encoding will be converted into numeric XML entity
1134 references.
1135
1136* Issue a warning if characters were replaced with REPLACEMENT
1137 CHARACTER during Unicode conversion.
1138
1139* Restored compatibility with Python 2.6.
1140
1141* The install process no longer installs docs or auxiliary text files.
1142
1143* It's now possible to deepcopy a BeautifulSoup object created with
1144 Python's built-in HTML parser.
1145
1146* About 100 unit tests that "test" the behavior of various parsers on
1147 invalid markup have been removed. Legitimate changes to those
1148 parsers caused these tests to fail, indicating that perhaps
1149 Beautiful Soup should not test the behavior of foreign
1150 libraries.
1151
1152 The problematic unit tests have been reformulated as informational
1153 comparisons generated by the script
1154 scripts/demonstrate_parser_differences.py.
1155
1156 This makes Beautiful Soup compatible with html5lib version 0.95 and
1157 future versions of HTMLParser.
1158
1159= 4.0.0b6 (20120216) =
1160
1161* Multi-valued attributes like "class" always have a list of values,
1162 even if there's only one value in the list.
1163
1164* Added a number of multi-valued attributes defined in HTML5.
1165
1166* Stopped generating a space before the slash that closes an
1167 empty-element tag. This may come back if I add a special XHTML mode
1168 (http://www.w3.org/TR/xhtml1/#C_2), but right now it's pretty
1169 useless.
1170
1171* Passing text along with tag-specific arguments to a find* method:
1172
1173 find("a", text="Click here")
1174
1175 will find tags that contain the given text as their
1176 .string. Previously, the tag-specific arguments were ignored and
1177 only strings were searched.
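 A sketch with a current bs4, where the `text` argument described above
 is now spelled `string`:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p>Click here</p><a>Click here</a>', 'html.parser')
# Both the tag name and the text restriction apply, so the <p> with the
# same text is not returned:
tag = soup.find('a', string='Click here')
assert tag.name == 'a'
```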
1178
1179* Fixed a bug that caused the html5lib tree builder to build a
1180 partially disconnected tree. Generally cleaned up the html5lib tree
1181 builder.
1182
1183* If you restrict a multi-valued attribute like "class" to a string
1184 that contains spaces, Beautiful Soup will only consider it a match
1185 if the values correspond to that specific string.
1186
1187= 4.0.0b5 (20120209) =
1188
1189* Rationalized Beautiful Soup's treatment of CSS class. A tag
1190 belonging to multiple CSS classes is treated as having a list of
1191 values for the 'class' attribute. Searching for a CSS class will
1192 match *any* of the CSS classes.
1193
1194 This actually affects all attributes that the HTML standard defines
1195 as taking multiple values (class, rel, rev, archive, accept-charset,
1196 and headers), but 'class' is by far the most common. [bug=41034]
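 A sketch of the list treatment (current bs4, 'html.parser'):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="foo bar">x</p>', 'html.parser')
# 'class' is multi-valued, so its value is a list:
assert soup.p['class'] == ['foo', 'bar']
# Searching for any one of the classes matches:
assert len(soup.find_all('p', class_='bar')) == 1
```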
1197
1198* If you pass anything other than a dictionary as the second argument
1199 to one of the find* methods, it'll assume you want to use that
1200 object to search against a tag's CSS classes. Previously this only
1201 worked if you passed in a string.
1202
1203* Fixed a bug that caused a crash when you passed a dictionary as an
1204 attribute value (possibly because you mistyped "attrs"). [bug=842419]
1205
1206* Unicode, Dammit now detects the encoding in HTML 5-style <meta> tags
1207 like <meta charset="utf-8" />. [bug=837268]
1208
1209* If Unicode, Dammit can't figure out a consistent encoding for a
1210 page, it will try each of its guesses again, with errors="replace"
1211 instead of errors="strict". This may mean that some data gets
1212 replaced with REPLACEMENT CHARACTER, but at least most of it will
1213 get turned into Unicode. [bug=754903]
1214
1215* Patched over a bug in html5lib (?) that was crashing Beautiful Soup
1216 on certain kinds of markup. [bug=838800]
1217
1218* Fixed a bug that wrecked the tree if you replaced an element with an
1219 empty string. [bug=728697]
1220
1221* Improved Unicode, Dammit's behavior when you give it Unicode to
1222 begin with.
1223
1224= 4.0.0b4 (20120208) =
1225
1226* Added BeautifulSoup.new_string() to go along with BeautifulSoup.new_tag()
1227
1228* BeautifulSoup.new_tag() will follow the rules of whatever
1229 tree-builder was used to create the original BeautifulSoup object. A
1230 new <p> tag will look like "<p />" if the soup object was created to
1231 parse XML, but it will look like "<p></p>" if the soup object was
1232 created to parse HTML.
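 A sketch of the HTML side of this (current bs4; the XML side needs
 lxml, so it's omitted here):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('', 'html.parser')  # soup built to parse HTML
p = soup.new_tag('p')
assert str(p) == '<p></p>'               # HTML rules: no self-closing <p/>
p.string = soup.new_string('hi')
assert str(p) == '<p>hi</p>'
```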
1233
1234* We pass in strict=False to html.parser on Python 3, greatly
1235 improving html.parser's ability to handle bad HTML.
1236
1237* We also monkeypatch a serious bug in html.parser that made
1238 strict=False disastrous on Python 3.2.2.
1239
1240* Replaced the "substitute_html_entities" argument with the
1241 more general "formatter" argument.
1242
1243* Bare ampersands and angle brackets are always converted to XML
1244 entities unless the user prevents it.
1245
1246* Added PageElement.insert_before() and PageElement.insert_after(),
1247 which let you put an element into the parse tree with respect to
1248 some other element.
1249
1250* Raise an exception when the user tries to do something nonsensical
1251 like insert a tag into itself.
1252
1253
1254= 4.0.0b3 (20120203) =
1255
1256Beautiful Soup 4 is a nearly-complete rewrite that removes Beautiful
1257Soup's custom HTML parser in favor of a system that lets you write a
1258little glue code and plug in any HTML or XML parser you want.
1259
1260Beautiful Soup 4.0 comes with glue code for four parsers:
1261
1262 * Python's standard HTMLParser (html.parser in Python 3)
1263 * lxml's HTML and XML parsers
1264 * html5lib's HTML parser
1265
1266HTMLParser is the default, but I recommend you install lxml if you
1267can.
1268
1269For complete documentation, see the Sphinx documentation in
1270bs4/doc/source/. What follows is a summary of the changes from
1271Beautiful Soup 3.
1272
1273=== The module name has changed ===
1274
1275Previously you imported the BeautifulSoup class from a module also
1276called BeautifulSoup. To save keystrokes and make it clear which
1277version of the API is in use, the module is now called 'bs4':
1278
1279 >>> from bs4 import BeautifulSoup
1280
1281=== It works with Python 3 ===
1282
1283Beautiful Soup 3.1.0 worked with Python 3, but the parser it used was
1284so bad that it barely worked at all. Beautiful Soup 4 works with
1285Python 3, and since its parser is pluggable, you don't sacrifice
1286quality.
1287
1288Special thanks to Thomas Kluyver and Ezio Melotti for getting Python 3
1289support to the finish line. Ezio Melotti is also to thank for greatly
1290improving the HTML parser that comes with Python 3.2.
1291
1292=== CDATA sections are normal text, if they're understood at all. ===
1293
1294Currently, the lxml and html5lib HTML parsers ignore CDATA sections in
1295markup:
1296
1297 <p><![CDATA[foo]]></p> => <p></p>
1298
1299A future version of html5lib will turn CDATA sections into text nodes,
1300but only within tags like <svg> and <math>:
1301
1302 <svg><![CDATA[foo]]></svg> => <svg>foo</svg>
1303
1304The default XML parser (which uses lxml behind the scenes) turns CDATA
1305sections into ordinary text elements:
1306
1307 <p><![CDATA[foo]]></p> => <p>foo</p>
1308
1309In theory it's possible to preserve the CDATA sections when using the
1310XML parser, but I don't see how to get it to work in practice.
1311
1312=== Miscellaneous other stuff ===
1313
1314If the BeautifulSoup instance has .is_xml set to True, an appropriate
1315XML declaration will be emitted when the tree is transformed into a
1316string:
1317
1318 <?xml version="1.0" encoding="utf-8"?>
1319 <markup>
1320 ...
1321 </markup>
1322
1323The ['lxml', 'xml'] tree builder sets .is_xml to True; the other tree
1324builders set it to False. If you want to parse XHTML with an HTML
1325parser, you can set it manually.
1326
1327
1328= 3.2.0 =
1329
1330The 3.1 series wasn't very useful, so I renamed the 3.0 series to 3.2
1331to make it obvious which one you should use.
1332
1333= 3.1.0 =
1334
1335A hybrid version that supports 2.4 and can be automatically converted
1336to run under Python 3.0. There are three backwards-incompatible
1337changes you should be aware of, but no new features or deliberate
1338behavior changes.
1339
13401. str() may no longer do what you want. This is because the meaning
1341of str() inverts between Python 2 and 3; in Python 2 it gives you a
1342byte string, in Python 3 it gives you a Unicode string.
1343
1344The effect of this is that you can't pass an encoding to .__str__
1345anymore. Use encode() to get a string and decode() to get Unicode, and
1346you'll be ready (well, readier) for Python 3.
1347
13482. Beautiful Soup is now based on HTMLParser rather than SGMLParser,
1349which is gone in Python 3. There's some bad HTML that SGMLParser
1350handled but HTMLParser doesn't, usually to do with attribute values
1351that aren't closed or have brackets inside them:
1352
1353 <a href="foo</a>, </a><a href="bar">baz</a>
1354 <a b="<a>">', '<a b="&lt;a&gt;"></a><a>"></a>
1355
1356A later version of Beautiful Soup will allow you to plug in different
1357parsers to make tradeoffs between speed and the ability to handle bad
1358HTML.
1359
13603. In Python 3 (but not Python 2), HTMLParser converts entities within
1361attributes to the corresponding Unicode characters. In Python 2 it's
1362possible to parse this string and leave the &eacute; intact.
1363
1364 <a href="http://crummy.com?sacr&eacute;&bleu">
1365
1366In Python 3, the &eacute; is always converted to \xe9 during
1367parsing.
1368
1369
1370= 3.0.7a =
1371
1372Added an import that makes BS work in Python 2.3.
1373
1374
1375= 3.0.7 =
1376
1377Fixed a UnicodeDecodeError when unpickling documents that contain
1378non-ASCII characters.
1379
1380Fixed a TypeError that occurred in some circumstances when a tag
1381contained no text.
1382
1383Jump through hoops to avoid the use of chardet, which can be extremely
1384slow in some circumstances. UTF-8 documents should never trigger the
1385use of chardet.
1386
1387Whitespace is preserved inside <pre> and <textarea> tags that contain
1388nothing but whitespace.
1389
1390Beautiful Soup can now parse a doctype that's scoped to an XML namespace.
1391
1392
1393= 3.0.6 =
1394
1395Got rid of a very old debug line that prevented chardet from working.
1396
1397Added a Tag.decompose() method that completely disconnects a tree or a
1398subset of a tree, breaking it up into bite-sized pieces that are
1399 easy for the garbage collector to collect.
1400
1401Tag.extract() now returns the tag that was extracted.
1402
1403Tag.findNext() now does something with the keyword arguments you pass
1404it instead of dropping them on the floor.
1405
1406Fixed a Unicode conversion bug.
1407
1408Fixed a bug that garbled some <meta> tags when rewriting them.
1409
1410
1411= 3.0.5 =
1412
1413Soup objects can now be pickled, and copied with copy.deepcopy.
1414
1415Tag.append now works properly on existing BS objects. (It wasn't
1416originally intended for outside use, but it can be now.) (Giles
1417Radford)
1418
1419Passing in a nonexistent encoding will no longer crash the parser on
1420Python 2.4 (John Nagle).
1421
1422Fixed an underlying bug in SGMLParser that thinks ASCII has 255
1423characters instead of 127 (John Nagle).
1424
1425Entities are converted more consistently to Unicode characters.
1426
1427Entity references in attribute values are now converted to Unicode
1428characters when appropriate. Numeric entities are always converted,
1429because SGMLParser always converts them outside of attribute values.
1430
1431ALL_ENTITIES happens to just be the XHTML entities, so I renamed it to
1432XHTML_ENTITIES.
1433
1434The regular expression for bare ampersands was too loose. In some
1435cases ampersands were not being escaped. (Sam Ruby?)
1436
1437Non-breaking spaces and other special Unicode space characters are no
1438longer folded to ASCII spaces. (Robert Leftwich)
1439
1440Information inside a TEXTAREA tag is now parsed literally, not as HTML
1441tags. TEXTAREA now works exactly the same way as SCRIPT. (Zephyr Fang)
1442
1443= 3.0.4 =
1444
1445Fixed a bug that crashed Unicode conversion in some cases.
1446
1447Fixed a bug that prevented UnicodeDammit from being used as a
1448general-purpose data scrubber.
1449
1450Fixed some unit test failures when running against Python 2.5.
1451
1452When considering whether to convert smart quotes, UnicodeDammit now
1453looks at the original encoding in a case-insensitive way.
1454
1455= 3.0.3 (20060606) =
1456
1457Beautiful Soup is now usable as a way to clean up invalid XML/HTML (be
1458sure to pass in an appropriate value for convertEntities, or XML/HTML
1459entities might stick around that aren't valid in HTML/XML). The result
1460may not validate, but it should be good enough to not choke a
1461real-world XML parser. Specifically, the output of a properly
1462constructed soup object should always be valid as part of an XML
1463document, but parts may be missing if they were missing in the
1464original. As always, if the input is valid XML, the output will also
1465be valid.
1466
1467= 3.0.2 (20060602) =
1468
1469Previously, Beautiful Soup correctly handled attribute values that
1470contained embedded quotes (sometimes by escaping), but not other kinds
1471of XML character. Now, it correctly handles or escapes all special XML
1472characters in attribute values.
1473
1474I aliased methods to the 2.x names (fetch, find, findText, etc.) for
1475backwards compatibility purposes. Those names are deprecated and if I
1476ever do a 4.0 I will remove them. I will, I tell you!
1477
1478Fixed a bug where the findAll method wasn't passing along any keyword
1479arguments.
1480
1481When run from the command line, Beautiful Soup now acts as an HTML
1482pretty-printer, not an XML pretty-printer.
1483
1484= 3.0.1 (20060530) =
1485
1486Reintroduced the "fetch by CSS class" shortcut. I thought keyword
1487arguments would replace it, but they don't. You can't call soup('a',
1488class='foo') because class is a Python keyword.
1489
1490If Beautiful Soup encounters a meta tag that declares the encoding,
1491but a SoupStrainer tells it not to parse that tag, Beautiful Soup will
1492no longer try to rewrite the meta tag to mention the new
1493encoding. Basically, this makes SoupStrainers work in real-world
1494applications instead of crashing the parser.
1495
1496= 3.0.0 "Who would not give all else for two p" (20060528) =
1497
1498This release is not backward-compatible with previous releases. If
1499you've got code written with a previous version of the library, go
1500ahead and keep using it, unless one of the features mentioned here
1501really makes your life easier. Since the library is self-contained,
1502you can include an old copy of the library in your old applications,
1503and use the new version for everything else.
1504
1505The documentation has been rewritten and greatly expanded with many
1506more examples.
1507
1508Beautiful Soup autodetects the encoding of a document (or uses the one
1509you specify), and converts it from its native encoding to
1510Unicode. Internally, it only deals with Unicode strings. When you
1511print out the document, it converts to UTF-8 (or another encoding you
1512specify). [Doc reference]
1513
1514It's now easy to make large-scale changes to the parse tree without
1515screwing up the navigation members. The methods are extract,
1516replaceWith, and insert. [Doc reference. See also Improving Memory
1517Usage with extract]
1518
1519Passing True in as an attribute value gives you tags that have any
1520value for that attribute. You don't have to create a regular
1521expression. Passing None for an attribute value gives you tags that
1522don't have that attribute at all.
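The True/None matching rules above can be sketched in plain Python. This is a hypothetical `attr_matches` helper illustrating the rule, not bs4's actual implementation:

```python
def attr_matches(spec, actual):
    """Decide whether a tag's attribute satisfies a search spec.

    spec=True  -> the attribute must exist, with any value.
    spec=None  -> the attribute must be absent.
    otherwise  -> the attribute must equal the given string.
    `actual` is the attribute's value, or None if the tag lacks it.
    """
    if spec is True:
        return actual is not None
    if spec is None:
        return actual is None
    return actual == spec

# A tag with class="header" matches class=True but not class=None:
print(attr_matches(True, "header"))   # True
print(attr_matches(None, "header"))   # False
print(attr_matches(None, None))       # True
```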
1523
1524Tag objects now know whether or not they're self-closing. This avoids
1525the problem where Beautiful Soup thought that tags like <BR /> were
1526self-closing even in XML documents. You can customize the self-closing
1527tags for a parser object by passing them in as a list of
1528selfClosingTags: you don't have to subclass anymore.
1529
1530There's a new built-in parser, MinimalSoup, which has most of
1531BeautifulSoup's HTML-specific rules, but no tag nesting rules. [Doc
1532reference]
1533
1534You can use a SoupStrainer to tell Beautiful Soup to parse only part
1535of a document. This saves time and memory, often making Beautiful Soup
1536about as fast as a custom-built SGMLParser subclass. [Doc reference,
1537SoupStrainer reference]
1538
1539You can (usually) use keyword arguments instead of passing a
1540dictionary of attributes to a search method. That is, you can replace
1541soup(args={"id" : "5"}) with soup(id="5"). You can still use args if
1542(for instance) you need to find an attribute whose name clashes with
1543the name of an argument to findAll. [Doc reference: **kwargs attrs]
1544
1545The method names have changed to the better method names used in
1546Rubyful Soup. Instead of find methods and fetch methods, there are
1547only find methods. Instead of a scheme where you can't remember which
1548method finds one element and which one finds them all, we have find
1549and findAll. In general, if the method name mentions All or a plural
1550noun (eg. findNextSiblings), then it finds many elements.
1551Otherwise, it only finds one element. [Doc reference]
1552
1553Some of the argument names have been renamed for clarity. For instance
1554avoidParserProblems is now markupMassage.
1555
1556Beautiful Soup no longer implements a feed method. You need to pass a
1557string or a filehandle into the soup constructor, rather than calling
1558feed after the soup has been created. There is still a feed method,
1559but it's the feed method implemented by SGMLParser, and calling it
1560will bypass Beautiful Soup and cause problems.
1561
1562The NavigableText class has been renamed to NavigableString. There is
1563no NavigableUnicodeString anymore, because every string inside a
1564Beautiful Soup parse tree is a Unicode string.
1565
1566findText and fetchText are gone. Just pass a text argument into find
1567or findAll.
1568
1569Null was more trouble than it was worth, so I got rid of it. Anything
1570that used to return Null now returns None.
1571
1572Special XML constructs like comments and CDATA now have their own
1573NavigableString subclasses, instead of being treated as oddly-formed
1574data. If you parse a document that contains CDATA and write it back
1575out, the CDATA will still be there.
1576
1577When you're parsing a document, you can get Beautiful Soup to convert
1578XML or HTML entities into the corresponding Unicode characters. [Doc
1579reference]
1580
1581= 2.1.1 (20050918) =
1582
1583Fixed a serious performance bug in BeautifulStoneSoup which was
1584causing parsing to be incredibly slow.
1585
1586Corrected several entities that were previously being incorrectly
1587translated from Microsoft smart-quote-like characters.
1588
1589Fixed a bug that was breaking text fetch.
1590
1591Fixed a bug that crashed the parser when text chunks that look like
1592HTML tag names showed up within a SCRIPT tag.
1593
1594THEAD, TBODY, and TFOOT tags are now nestable within TABLE
1595tags. Nested tables should parse more sensibly now.
1596
1597BASE is now considered a self-closing tag.
1598
1599= 2.1.0 "Game, or any other dish?" (20050504) =
1600
1601Added a wide variety of new search methods which, given a starting
1602point inside the tree, follow a particular navigation member (like
1603nextSibling) over and over again, looking for Tag and NavigableText
1604objects that match certain criteria. The new methods are findNext,
1605fetchNext, findPrevious, fetchPrevious, findNextSibling,
1606fetchNextSiblings, findPreviousSibling, fetchPreviousSiblings,
1607findParent, and fetchParents. All of these use the same basic code
1608used by first and fetch, so you can pass your weird ways of matching
1609things into these methods.
1610
1611The fetch method and its derivatives now accept a limit argument.
1612
1613You can now pass keyword arguments when calling a Tag object as though
1614it were a method.
1615
1616Fixed a bug that caused all hand-created tags to share a single set of
1617attributes.
1618
1619= 2.0.3 (20050501) =
1620
1621Fixed Python 2.2 support for iterators.
1622
1623Fixed a bug that gave the wrong representation to tags within quote
1624tags like <script>.
1625
1626Took some code from Mark Pilgrim that treats CDATA declarations as
1627data instead of ignoring them.
1628
1629Beautiful Soup's setup.py will now do an install even if the unit
1630tests fail. It won't build a source distribution if the unit tests
1631fail, so I can't release a new version unless they pass.
1632
1633= 2.0.2 (20050416) =
1634
1635Added the unit tests in a separate module, and packaged it with
1636distutils.
1637
1638Fixed a bug that sometimes caused renderContents() to return a Unicode
1639string even if there was no Unicode in the original string.
1640
1641Added the done() method, which closes all of the parser's open
1642tags. It gets called automatically when you pass in some text to the
1643constructor of a parser class; otherwise you must call it yourself.
1644
1645Reinstated some backwards compatibility with 1.x versions: referencing
1646the string member of a NavigableText object returns the NavigableText
1647object instead of throwing an error.
1648
1649= 2.0.1 (20050412) =
1650
1651Fixed a bug that caused bad results when you tried to reference a tag
1652name shorter than 3 characters as a member of a Tag, eg. tag.table.td.
1653
1654Made sure all Tags have the 'hidden' attribute so that an attempt to
1655access tag.hidden doesn't spawn an attempt to find a tag named
1656'hidden'.
1657
1658Fixed a bug in the comparison operator.
1659
1660= 2.0.0 "Who cares for fish?" (20050410) =
1661
1662Beautiful Soup version 1 was very useful but also pretty stupid. I
1663originally wrote it without noticing any of the problems inherent in
1664trying to build a parse tree out of ambiguous HTML tags. This version
1665solves all of those problems to my satisfaction. It also adds many new
1666clever things to make up for the removal of the stupid things.
1667
1668== Parsing ==
1669
1670The parser logic has been greatly improved, and the BeautifulSoup
1671class should much more reliably yield a parse tree that looks like
1672what the page author intended. For a particular class of odd edge
1673cases that now causes problems, there is a new class,
1674ICantBelieveItsBeautifulSoup.
1675
1676By default, Beautiful Soup now performs some cleanup operations on
1677text before parsing it. This is to avoid common problems with bad
1678definitions and self-closing tags that crash SGMLParser. You can
1679provide your own set of cleanup operations, or turn it off
1680altogether. The cleanup operations include fixing self-closing tags
1681that don't close, and replacing Microsoft smart quotes and similar
1682characters with their HTML entity equivalents.
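The smart-quote cleanup described above boils down to mapping Windows-1252 punctuation bytes to their HTML entity equivalents before parsing. A minimal stand-alone sketch under that assumption (hypothetical helper, not the library's code):

```python
# Map common Windows-1252 "smart" punctuation bytes to HTML entities.
SMART_PUNCTUATION = {
    b'\x91': b'&lsquo;',   # left single quote
    b'\x92': b'&rsquo;',   # right single quote
    b'\x93': b'&ldquo;',   # left double quote
    b'\x94': b'&rdquo;',   # right double quote
    b'\x96': b'&ndash;',   # en dash
    b'\x97': b'&mdash;',   # em dash
}

def replace_smart_punctuation(data: bytes) -> bytes:
    """Replace each smart-punctuation byte with its entity equivalent."""
    for byte, entity in SMART_PUNCTUATION.items():
        data = data.replace(byte, entity)
    return data

print(replace_smart_punctuation(b'\x93Hello\x94'))  # b'&ldquo;Hello&rdquo;'
```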
1683
1684You can now get a pretty-print version of parsed HTML to get a visual
1685picture of how Beautiful Soup parses it, with the Tag.prettify()
1686method.
1687
1688== Strings and Unicode ==
1689
1690There are separate NavigableText subclasses for ASCII and Unicode
1691strings. These classes directly subclass the corresponding base data
1692types. This means you can treat NavigableText objects as strings
1693instead of having to call methods on them to get the strings.
1694
1695str() on a Tag always returns a string, and unicode() always returns
1696Unicode. Previously it was inconsistent.
1697
1698== Tree traversal ==
1699
1700In a first() or fetch() call, the tag name or the desired value of an
1701attribute can now be any of the following:
1702
1703 * A string (matches that specific tag or that specific attribute value)
1704 * A list of strings (matches any tag or attribute value in the list)
1705 * A compiled regular expression object (matches any tag or attribute
1706 value that matches the regular expression)
1707 * A callable object that takes the Tag object or attribute value as a
1708 string. It returns None/false/empty string if the given string
1709 doesn't match, and any other value if it does.
1710
1711This is much easier to use than SQL-style wildcards (see, regular
1712expressions are good for something). Because of this, I took out
1713SQL-style wildcards. I'll put them back if someone complains, but
1714their removal simplifies the code a lot.
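The four matcher types listed above amount to a small dispatch routine. A minimal stand-alone sketch of that dispatch (hypothetical `matches` helper, not the library's code):

```python
import re

def matches(candidate, matcher):
    """Return True if `candidate` (a tag name or attribute value
    string) satisfies `matcher`, which may be a string, a list of
    strings, a compiled regular expression, or a callable."""
    if isinstance(matcher, str):
        return candidate == matcher
    if isinstance(matcher, (list, tuple)):
        return candidate in matcher
    if hasattr(matcher, 'search'):          # compiled re pattern
        return matcher.search(candidate) is not None
    if callable(matcher):
        return bool(matcher(candidate))
    raise TypeError("unsupported matcher: %r" % (matcher,))

print(matches("td", "td"))                            # True
print(matches("td", ["th", "td"]))                    # True
print(matches("h2", re.compile(r"^h\d$")))            # True
print(matches("script", lambda name: len(name) > 3))  # True
```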
1715
1716You can use fetch() and first() to search for text in the parse tree,
1717not just tags. There are new alias methods fetchText() and firstText()
1718designed for this purpose. As with searching for tags, you can pass in
1719a string, a regular expression object, or a method to match your text.
1720
1721If you pass in something besides a map to the attrs argument of
1722fetch() or first(), Beautiful Soup will assume you want to match that
1723thing against the "class" attribute. When you're scraping
1724well-structured HTML, this makes your code a lot cleaner.
1725
17261.x and 2.x both let you call a Tag object as a shorthand for
1727fetch(). For instance, foo("bar") is a shorthand for
1728foo.fetch("bar"). In 2.x, you can also access a specially-named member
1729of a Tag object as a shorthand for first(). For instance, foo.barTag
1730is a shorthand for foo.first("bar"). By chaining these shortcuts you
1731traverse a tree in very little code: for header in
1732soup.bodyTag.pTag.tableTag('th'):
1733
1734If an element relationship (like parent or next) doesn't apply to a
1735tag, it'll now show up as Null instead of None. first() will also return
1736Null if you ask it for a nonexistent tag. Null is an object that's
1737just like None, except you can do whatever you want to it and it'll
1738give you Null instead of throwing an error.
1739
1740This lets you do tree traversals like soup.htmlTag.headTag.titleTag
1741without having to worry if the intermediate stages are actually
1742there. Previously, if there was no 'head' tag in the document, headTag
1743in that instance would have been None, and accessing its 'titleTag'
1744member would have thrown an AttributeError. Now, you can get what you
1745want when it exists, and get Null when it doesn't, without having to
1746do a lot of conditional checks to see if every stage is None.
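The Null object described above is the classic null-object pattern: an object that absorbs any operation and yields itself instead of raising. A minimal sketch (not the original implementation):

```python
class Null:
    """Absorb any attribute access, call, or iteration, returning
    itself (or doing nothing) instead of raising an error."""
    def __getattr__(self, name):
        return self
    def __call__(self, *args, **kwargs):
        return self
    def __iter__(self):
        return iter(())
    def __bool__(self):
        return False
    def __repr__(self):
        return "Null"

NULL = Null()
# Chained traversal never raises, even when a stage is "missing":
result = NULL.headTag.titleTag
print(result)        # Null
print(bool(result))  # False
```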
1747
1748There are two new relations between page elements: previousSibling and
1749nextSibling. They reference the previous and next element at the same
1750level of the parse tree. For instance, if you have HTML like this:
1751
1752 <p><ul><li>Foo<br /><li>Bar</ul>
1753
1754The first 'li' tag has a previousSibling of Null and its nextSibling
1755is the second 'li' tag. The second 'li' tag has a nextSibling of Null
1756and its previousSibling is the first 'li' tag. The previousSibling of
1757the 'ul' tag is the first 'p' tag. The nextSibling of 'Foo' is the
1758'br' tag.
1759
1760I took out the ability to use fetch() to find tags that have a
1761specific list of contents. See, I can't even explain it well. It was
1762really difficult to use, I never used it, and I don't think anyone
1763else ever used it. To the extent anyone did, they can probably use
1764fetchText() instead. If it turns out someone needs it I'll think of
1765another solution.
1766
1767== Tree manipulation ==
1768
1769You can add new attributes to a tag, and delete attributes from a
1770tag. In 1.x you could only change a tag's existing attributes.
1771
1772== Porting Considerations ==
1773
1774There are three changes in 2.0 that break old code:
1775
1776In the post-1.2 release you could pass a function into fetch(). The
1777function took a string, the tag name. In 2.0, the function takes the
1778actual Tag object.
1779
1780It's no longer possible to pass in SQL-style wildcards to fetch(). Use a
1781regular expression instead.
1782
1783The different parsing algorithm means the parse tree may not be shaped
1784like you expect. This will only actually affect you if your code uses
1785one of the affected parts. I haven't run into this problem yet while
1786porting my code.
1787
1788= Between 1.2 and 2.0 =
1789
1790This is the release to get if you want Python 1.5 compatibility.
1791
1792The desired value of an attribute can now be any of the following:
1793
1794 * A string
1795 * A string with SQL-style wildcards
1796 * A compiled RE object
1797 * A callable that returns None/false/empty string if the given value
1798 doesn't match, and any other value otherwise.
1799
1800This is much easier to use than SQL-style wildcards (see, regular
1801expressions are good for something). Because of this, I no longer
1802recommend you use SQL-style wildcards. They may go away in a future
1803release to clean up the code.
1804
1805Made Beautiful Soup handle processing instructions as text instead of
1806ignoring them.
1807
1808Applied patch from Richie Hindle (richie at entrian dot com) that
1809makes tag.string a shorthand for tag.contents[0].string when the tag
1810has only one string-owning child.
1811
1812Added still more nestable tags. The nestable tags thing won't work in
1813a lot of cases and needs to be rethought.
1814
1815Fixed an edge case where searching for "%foo" would match any string
1816shorter than "foo".
1817
1818= 1.2 "Who for such dainties would not stoop?" (20040708) =
1819
1820Applied patch from Ben Last (ben at benlast dot com) that made
1821Tag.renderContents() correctly handle Unicode.
1822
1823Made BeautifulStoneSoup even dumber by making it not implicitly close
1824a tag when another tag of the same type is encountered; only when an
1825actual closing tag is encountered. This change courtesy of Fuzzy (mike
1826at pcblokes dot com). BeautifulSoup still works as before.
1827
1828= 1.1 "Swimming in a hot tureen" =
1829
1830Added more 'nestable' tags. Changed popping semantics so that when a
1831nestable tag is encountered, tags are popped up to the previously
1832encountered nestable tag (of whatever kind). I will revert this if
1833enough people complain, but it should make more people's lives easier
1834than harder. This enhancement was suggested by Anthony Baxter (anthony
1835at interlink dot com dot au).
1836
1837= 1.0 "So rich and green" (20040420) =
1838
1839Initial release.
diff --git a/bitbake/lib/bs4/LICENSE b/bitbake/lib/bs4/LICENSE
deleted file mode 100644
index 08e3a9cf8c..0000000000
--- a/bitbake/lib/bs4/LICENSE
+++ /dev/null
@@ -1,31 +0,0 @@
1Beautiful Soup is made available under the MIT license:
2
3 Copyright (c) Leonard Richardson
4
5 Permission is hereby granted, free of charge, to any person obtaining
6 a copy of this software and associated documentation files (the
7 "Software"), to deal in the Software without restriction, including
8 without limitation the rights to use, copy, modify, merge, publish,
9 distribute, sublicense, and/or sell copies of the Software, and to
10 permit persons to whom the Software is furnished to do so, subject to
11 the following conditions:
12
13 The above copyright notice and this permission notice shall be
14 included in all copies or substantial portions of the Software.
15
16 THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
17 EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
18 MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
19 NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
20 BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
21 ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
22 CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
23 SOFTWARE.
24
25Beautiful Soup incorporates code from the html5lib library, which is
26also made available under the MIT license. Copyright (c) James Graham
27and other contributors
28
29Beautiful Soup has an optional dependency on the soupsieve library,
30which is also made available under the MIT license. Copyright (c)
31Isaac Muse
diff --git a/bitbake/lib/bs4/__init__.py b/bitbake/lib/bs4/__init__.py
deleted file mode 100644
index 725203d94a..0000000000
--- a/bitbake/lib/bs4/__init__.py
+++ /dev/null
@@ -1,839 +0,0 @@
1"""Beautiful Soup Elixir and Tonic - "The Screen-Scraper's Friend".
2
3http://www.crummy.com/software/BeautifulSoup/
4
5Beautiful Soup uses a pluggable XML or HTML parser to parse a
6(possibly invalid) document into a tree representation. Beautiful Soup
7provides methods and Pythonic idioms that make it easy to navigate,
8search, and modify the parse tree.
9
10Beautiful Soup works with Python 3.6 and up. It works better if lxml
11and/or html5lib is installed.
12
13For more than you ever wanted to know about Beautiful Soup, see the
14documentation: http://www.crummy.com/software/BeautifulSoup/bs4/doc/
15"""
16
17__author__ = "Leonard Richardson (leonardr@segfault.org)"
18__version__ = "4.12.3"
19__copyright__ = "Copyright (c) 2004-2024 Leonard Richardson"
20# Use of this source code is governed by the MIT license.
21__license__ = "MIT"
22
23__all__ = ['BeautifulSoup']
24
25from collections import Counter
26import os
27import re
28import sys
29import traceback
30import warnings
31
32# The very first thing we do is give a useful error if someone is
33# running this code under Python 2.
34if sys.version_info.major < 3:
35 raise ImportError('You are trying to use a Python 3-specific version of Beautiful Soup under Python 2. This will not work. The final version of Beautiful Soup to support Python 2 was 4.9.3.')
36
37from .builder import (
38 builder_registry,
39 ParserRejectedMarkup,
40 XMLParsedAsHTMLWarning,
41 HTMLParserTreeBuilder
42)
43from .dammit import UnicodeDammit
44from .element import (
45 CData,
46 Comment,
47 CSS,
48 DEFAULT_OUTPUT_ENCODING,
49 Declaration,
50 Doctype,
51 NavigableString,
52 PageElement,
53 ProcessingInstruction,
54 PYTHON_SPECIFIC_ENCODINGS,
55 ResultSet,
56 Script,
57 Stylesheet,
58 SoupStrainer,
59 Tag,
60 TemplateString,
61 )
62
63# Define some custom warnings.
64class GuessedAtParserWarning(UserWarning):
65 """The warning issued when BeautifulSoup has to guess what parser to
66 use -- probably because no parser was specified in the constructor.
67 """
68
69class MarkupResemblesLocatorWarning(UserWarning):
70 """The warning issued when BeautifulSoup is given 'markup' that
71 actually looks like a resource locator -- a URL or a path to a file
72 on disk.
73 """
74
75
76class BeautifulSoup(Tag):
77 """A data structure representing a parsed HTML or XML document.
78
79 Most of the methods you'll call on a BeautifulSoup object are inherited from
80 PageElement or Tag.
81
82 Internally, this class defines the basic interface called by the
83 tree builders when converting an HTML/XML document into a data
84 structure. The interface abstracts away the differences between
85 parsers. To write a new tree builder, you'll need to understand
86 these methods as a whole.
87
88 These methods will be called by the BeautifulSoup constructor:
89 * reset()
90 * feed(markup)
91
92 The tree builder may call these methods from its feed() implementation:
93 * handle_starttag(name, attrs) # See note about return value
94 * handle_endtag(name)
95 * handle_data(data) # Appends to the current data node
96 * endData(containerClass) # Ends the current data node
97
98 No matter how complicated the underlying parser is, you should be
99 able to build a tree using 'start tag' events, 'end tag' events,
100 'data' events, and "done with data" events.
101
102 If you encounter an empty-element tag (aka a self-closing tag,
103 like HTML's <br> tag), call handle_starttag and then
104 handle_endtag.
105 """
106
107 # Since BeautifulSoup subclasses Tag, it's possible to treat it as
108 # a Tag with a .name. This name makes it clear the BeautifulSoup
109 # object isn't a real markup tag.
110 ROOT_TAG_NAME = '[document]'
111
112 # If the end-user gives no indication which tree builder they
113 # want, look for one with these features.
114 DEFAULT_BUILDER_FEATURES = ['html', 'fast']
115
116 # A string containing all ASCII whitespace characters, used in
117 # endData() to detect data chunks that seem 'empty'.
118 ASCII_SPACES = '\x20\x0a\x09\x0c\x0d'
119
120 NO_PARSER_SPECIFIED_WARNING = "No parser was explicitly specified, so I'm using the best available %(markup_type)s parser for this system (\"%(parser)s\"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.\n\nThe code that caused this warning is on line %(line_number)s of the file %(filename)s. To get rid of this warning, pass the additional argument 'features=\"%(parser)s\"' to the BeautifulSoup constructor.\n"
121
122 def __init__(self, markup="", features=None, builder=None,
123 parse_only=None, from_encoding=None, exclude_encodings=None,
124 element_classes=None, **kwargs):
125 """Constructor.
126
127 :param markup: A string or a file-like object representing
128 markup to be parsed.
129
130 :param features: Desirable features of the parser to be
131 used. This may be the name of a specific parser ("lxml",
132 "lxml-xml", "html.parser", or "html5lib") or it may be the
133 type of markup to be used ("html", "html5", "xml"). It's
134 recommended that you name a specific parser, so that
135 Beautiful Soup gives you the same results across platforms
136 and virtual environments.
137
138 :param builder: A TreeBuilder subclass to instantiate (or
139 instance to use) instead of looking one up based on
140 `features`. You only need to use this if you've implemented a
141 custom TreeBuilder.
142
143 :param parse_only: A SoupStrainer. Only parts of the document
144 matching the SoupStrainer will be considered. This is useful
145 when parsing part of a document that would otherwise be too
146 large to fit into memory.
147
148 :param from_encoding: A string indicating the encoding of the
149 document to be parsed. Pass this in if Beautiful Soup is
150 guessing wrongly about the document's encoding.
151
152 :param exclude_encodings: A list of strings indicating
153 encodings known to be wrong. Pass this in if you don't know
154 the document's encoding but you know Beautiful Soup's guess is
155 wrong.
156
157 :param element_classes: A dictionary mapping BeautifulSoup
158 classes like Tag and NavigableString, to other classes you'd
159 like to be instantiated instead as the parse tree is
160 built. This is useful for subclassing Tag or NavigableString
161 to modify default behavior.
162
163 :param kwargs: For backwards compatibility purposes, the
164 constructor accepts certain keyword arguments used in
165 Beautiful Soup 3. None of these arguments do anything in
166 Beautiful Soup 4; they will result in a warning and then be
167 ignored.
168
169 Apart from this, any keyword arguments passed into the
170 BeautifulSoup constructor are propagated to the TreeBuilder
171 constructor. This makes it possible to configure a
172 TreeBuilder by passing in arguments, not just by saying which
173 one to use.
174 """
175 if 'convertEntities' in kwargs:
176 del kwargs['convertEntities']
177 warnings.warn(
178 "BS4 does not respect the convertEntities argument to the "
179 "BeautifulSoup constructor. Entities are always converted "
180 "to Unicode characters.")
181
182 if 'markupMassage' in kwargs:
183 del kwargs['markupMassage']
184 warnings.warn(
185 "BS4 does not respect the markupMassage argument to the "
186 "BeautifulSoup constructor. The tree builder is responsible "
187 "for any necessary markup massage.")
188
189 if 'smartQuotesTo' in kwargs:
190 del kwargs['smartQuotesTo']
191 warnings.warn(
192 "BS4 does not respect the smartQuotesTo argument to the "
193 "BeautifulSoup constructor. Smart quotes are always converted "
194 "to Unicode characters.")
195
196 if 'selfClosingTags' in kwargs:
197 del kwargs['selfClosingTags']
198 warnings.warn(
199 "BS4 does not respect the selfClosingTags argument to the "
200 "BeautifulSoup constructor. The tree builder is responsible "
201 "for understanding self-closing tags.")
202
203 if 'isHTML' in kwargs:
204 del kwargs['isHTML']
205 warnings.warn(
206 "BS4 does not respect the isHTML argument to the "
207 "BeautifulSoup constructor. Suggest you use "
208 "features='lxml' for HTML and features='lxml-xml' for "
209 "XML.")
210
211 def deprecated_argument(old_name, new_name):
212 if old_name in kwargs:
213 warnings.warn(
214 'The "%s" argument to the BeautifulSoup constructor '
215 'has been renamed to "%s."' % (old_name, new_name),
216 DeprecationWarning, stacklevel=3
217 )
218 return kwargs.pop(old_name)
219 return None
220
221 parse_only = parse_only or deprecated_argument(
222 "parseOnlyThese", "parse_only")
223
224 from_encoding = from_encoding or deprecated_argument(
225 "fromEncoding", "from_encoding")
226
227 if from_encoding and isinstance(markup, str):
228 warnings.warn("You provided Unicode markup but also provided a value for from_encoding. Your from_encoding will be ignored.")
229 from_encoding = None
230
231 self.element_classes = element_classes or dict()
232
233 # We need this information to track whether or not the builder
234 # was specified well enough that we can omit the 'you need to
235 # specify a parser' warning.
236 original_builder = builder
237 original_features = features
238
239 if isinstance(builder, type):
240 # A builder class was passed in; it needs to be instantiated.
241 builder_class = builder
242 builder = None
243 elif builder is None:
244 if isinstance(features, str):
245 features = [features]
246 if features is None or len(features) == 0:
247 features = self.DEFAULT_BUILDER_FEATURES
248 builder_class = builder_registry.lookup(*features)
249 if builder_class is None:
250 raise FeatureNotFound(
251 "Couldn't find a tree builder with the features you "
252 "requested: %s. Do you need to install a parser library?"
253 % ",".join(features))
254
255 # At this point either we have a TreeBuilder instance in
256 # builder, or we have a builder_class that we can instantiate
257 # with the remaining **kwargs.
258 if builder is None:
259 builder = builder_class(**kwargs)
260 if not original_builder and not (
261 original_features == builder.NAME or
262 original_features in builder.ALTERNATE_NAMES
263 ) and markup:
264 # The user did not tell us which TreeBuilder to use,
265 # and we had to guess. Issue a warning.
266 if builder.is_xml:
267 markup_type = "XML"
268 else:
269 markup_type = "HTML"
270
271 # This code adapted from warnings.py so that we get the same line
272 # of code as our warnings.warn() call gets, even if the answer is wrong
273 # (as it may be in a multithreading situation).
274 caller = None
275 try:
276 caller = sys._getframe(1)
277 except ValueError:
278 pass
279 if caller:
280 globals = caller.f_globals
281 line_number = caller.f_lineno
282 else:
283 globals = sys.__dict__
284 line_number= 1
285 filename = globals.get('__file__')
286 if filename:
287 fnl = filename.lower()
288 if fnl.endswith((".pyc", ".pyo")):
289 filename = filename[:-1]
290 if filename:
291 # If there is no filename at all, the user is most likely in a REPL,
292 # and the warning is not necessary.
293 values = dict(
294 filename=filename,
295 line_number=line_number,
296 parser=builder.NAME,
297 markup_type=markup_type
298 )
299 warnings.warn(
300 self.NO_PARSER_SPECIFIED_WARNING % values,
301 GuessedAtParserWarning, stacklevel=2
302 )
303 else:
304 if kwargs:
305 warnings.warn("Keyword arguments to the BeautifulSoup constructor will be ignored. These would normally be passed into the TreeBuilder constructor, but a TreeBuilder instance was passed in as `builder`.")
306
307 self.builder = builder
308 self.is_xml = builder.is_xml
309 self.known_xml = self.is_xml
310 self._namespaces = dict()
311 self.parse_only = parse_only
312
313 if hasattr(markup, 'read'): # It's a file-type object.
314 markup = markup.read()
315 elif len(markup) <= 256 and (
316 (isinstance(markup, bytes) and not b'<' in markup)
317 or (isinstance(markup, str) and not '<' in markup)
318 ):
319 # Issue warnings for a couple beginner problems
320 # involving passing non-markup to Beautiful Soup.
321 # Beautiful Soup will still parse the input as markup,
322 # since that is sometimes the intended behavior.
323 if not self._markup_is_url(markup):
324 self._markup_resembles_filename(markup)
325
326 rejections = []
327 success = False
328 for (self.markup, self.original_encoding, self.declared_html_encoding,
329 self.contains_replacement_characters) in (
330 self.builder.prepare_markup(
331 markup, from_encoding, exclude_encodings=exclude_encodings)):
332 self.reset()
333 self.builder.initialize_soup(self)
334 try:
335 self._feed()
336 success = True
337 break
338 except ParserRejectedMarkup as e:
339 rejections.append(e)
341
342 if not success:
343 other_exceptions = [str(e) for e in rejections]
344 raise ParserRejectedMarkup(
345 "The markup you provided was rejected by the parser. Trying a different parser or a different encoding may help.\n\nOriginal exception(s) from parser:\n " + "\n ".join(other_exceptions)
346 )
347
348 # Clear out the markup and remove the builder's circular
349 # reference to this object.
350 self.markup = None
351 self.builder.soup = None
352
353 def _clone(self):
354 """Create a new BeautifulSoup object with the same TreeBuilder,
355 but not associated with any markup.
356
357 This is the first step of the deepcopy process.
358 """
359 clone = type(self)("", None, self.builder)
360
361 # Keep track of the encoding of the original document,
362 # since we won't be parsing it again.
363 clone.original_encoding = self.original_encoding
364 return clone
365
366 def __getstate__(self):
367 # Frequently a tree builder can't be pickled.
368 d = dict(self.__dict__)
369 if 'builder' in d and d['builder'] is not None and not self.builder.picklable:
370 d['builder'] = type(self.builder)
371 # Store the contents as a Unicode string.
372 d['contents'] = []
373 d['markup'] = self.decode()
374
375 # If _most_recent_element is present, it's a Tag object left
376 # over from initial parse. It might not be picklable and we
377 # don't need it.
378 if '_most_recent_element' in d:
379 del d['_most_recent_element']
380 return d
381
382 def __setstate__(self, state):
383 # If necessary, restore the TreeBuilder by looking it up.
384 self.__dict__ = state
385 if isinstance(self.builder, type):
386 self.builder = self.builder()
387 elif not self.builder:
388 # We don't know which builder was used to build this
389 # parse tree, so use a default we know is always available.
390 self.builder = HTMLParserTreeBuilder()
391 self.builder.soup = self
392 self.reset()
393 self._feed()
394 return state
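The `__getstate__`/`__setstate__` pair above lets a BeautifulSoup tree survive a round trip through pickle even when the tree builder itself can't be pickled: the markup is stored as a string and re-parsed on unpickling. A minimal sketch, assuming the bs4 package is installed:

```python
import pickle

from bs4 import BeautifulSoup

# Parse a small document, pickle it, then restore it.
soup = BeautifulSoup("<p>hello</p>", "html.parser")
restored = pickle.loads(pickle.dumps(soup))

# The restored tree is re-parsed from the stored markup and is
# equivalent to the original.
assert restored.p.string == "hello"
assert restored.decode() == soup.decode()
```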
395
396
397 @classmethod
398 def _decode_markup(cls, markup):
399 """Ensure `markup` is a Unicode string so it's safe to send into warnings.warn.
400
401 TODO: warnings.warn had this problem back in 2010 but it might not
402 anymore.
403 """
404 if isinstance(markup, bytes):
405 decoded = markup.decode('utf-8', 'replace')
406 else:
407 decoded = markup
408 return decoded
409
410 @classmethod
411 def _markup_is_url(cls, markup):
412 """Error-handling method to issue a warning if incoming markup looks
413 like a URL.
414
415 :param markup: A string.
416 :return: Whether or not the markup resembles a URL
417 closely enough to justify a warning.
418 """
419 if isinstance(markup, bytes):
420 space = b' '
421 cant_start_with = (b"http:", b"https:")
422 elif isinstance(markup, str):
423 space = ' '
424 cant_start_with = ("http:", "https:")
425 else:
426 return False
427
428 if any(markup.startswith(prefix) for prefix in cant_start_with):
429 if space not in markup:
430 warnings.warn(
431 'The input looks more like a URL than markup. You may want to use'
432 ' an HTTP client like requests to get the document behind'
433 ' the URL, and feed that document to Beautiful Soup.',
434 MarkupResemblesLocatorWarning,
435 stacklevel=3
436 )
437 return True
438 return False
439
440 @classmethod
441 def _markup_resembles_filename(cls, markup):
442 """Error-handling method to issue a warning if incoming markup
443 resembles a filename.
444
445 :param markup: A bytestring or string.
446 :return: Whether or not the markup resembles a filename
447 closely enough to justify a warning.
448 """
449 path_characters = '/\\'
450 extensions = ['.html', '.htm', '.xml', '.xhtml', '.txt']
451 if isinstance(markup, bytes):
452 path_characters = path_characters.encode("utf8")
453 extensions = [x.encode('utf8') for x in extensions]
454 filelike = False
455 if any(x in markup for x in path_characters):
456 filelike = True
457 else:
458 lower = markup.lower()
459 if any(lower.endswith(ext) for ext in extensions):
460 filelike = True
461 if filelike:
462 warnings.warn(
463 'The input looks more like a filename than markup. You may'
464 ' want to open this file and pass the filehandle into'
465 ' Beautiful Soup.',
466 MarkupResemblesLocatorWarning, stacklevel=3
467 )
468 return True
469 return False
470
471 def _feed(self):
472 """Internal method that parses previously set markup, creating a large
473 number of Tag and NavigableString objects.
474 """
475 # Convert the document to Unicode.
476 self.builder.reset()
477
478 self.builder.feed(self.markup)
479 # Close out any unfinished strings and close all the open tags.
480 self.endData()
481 while self.currentTag.name != self.ROOT_TAG_NAME:
482 self.popTag()
483
484 def reset(self):
485 """Reset this object to a state as though it had never parsed any
486 markup.
487 """
488 Tag.__init__(self, self, self.builder, self.ROOT_TAG_NAME)
489 self.hidden = 1
490 self.builder.reset()
491 self.current_data = []
492 self.currentTag = None
493 self.tagStack = []
494 self.open_tag_counter = Counter()
495 self.preserve_whitespace_tag_stack = []
496 self.string_container_stack = []
497 self._most_recent_element = None
498 self.pushTag(self)
499
500 def new_tag(self, name, namespace=None, nsprefix=None, attrs={},
501 sourceline=None, sourcepos=None, **kwattrs):
502 """Create a new Tag associated with this BeautifulSoup object.
503
504 :param name: The name of the new Tag.
505 :param namespace: The URI of the new Tag's XML namespace, if any.
506 :param prefix: The prefix for the new Tag's XML namespace, if any.
507 :param attrs: A dictionary of this Tag's attribute values; can
508 be used instead of `kwattrs` for attributes like 'class'
509 that are reserved words in Python.
510 :param sourceline: The line number where this tag was
511 (purportedly) found in its source document.
512 :param sourcepos: The character position within `sourceline` where this
513 tag was (purportedly) found.
514 :param kwattrs: Keyword arguments for the new Tag's attribute values.
515
516 """
517 kwattrs.update(attrs)
518 return self.element_classes.get(Tag, Tag)(
519 None, self.builder, name, namespace, nsprefix, kwattrs,
520 sourceline=sourceline, sourcepos=sourcepos
521 )
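The `attrs` dictionary is the escape hatch for attribute names that are reserved words in Python; everything else can go through keyword arguments. A usage sketch, assuming the bs4 package is installed:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<body></body>", "html.parser")

# 'class' is a Python reserved word, so it is passed via `attrs`;
# ordinary attributes like 'href' can be keyword arguments.
link = soup.new_tag("a", href="http://example.com", attrs={"class": "external"})
link.string = "Example"
soup.body.append(link)

assert link.name == "a"
assert soup.a["href"] == "http://example.com"
assert soup.a.get_text() == "Example"
```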
522
523 def string_container(self, base_class=None):
524 container = base_class or NavigableString
525
526 # There may be a general override of NavigableString.
527 container = self.element_classes.get(
528 container, container
529 )
530
531 # On top of that, we may be inside a tag that needs a special
532 # container class.
533 if self.string_container_stack and container is NavigableString:
534 container = self.builder.string_containers.get(
535 self.string_container_stack[-1].name, container
536 )
537 return container
538
539 def new_string(self, s, subclass=None):
540 """Create a new NavigableString associated with this BeautifulSoup
541 object.
542 """
543 container = self.string_container(subclass)
544 return container(s)
545
546 def insert_before(self, *args):
547 """This method is part of the PageElement API, but `BeautifulSoup` doesn't implement
548 it because there is nothing before or after it in the parse tree.
549 """
550 raise NotImplementedError("BeautifulSoup objects don't support insert_before().")
551
552 def insert_after(self, *args):
553 """This method is part of the PageElement API, but `BeautifulSoup` doesn't implement
554 it because there is nothing before or after it in the parse tree.
555 """
556 raise NotImplementedError("BeautifulSoup objects don't support insert_after().")
557
558 def popTag(self):
559 """Internal method called by _popToTag when a tag is closed."""
560 tag = self.tagStack.pop()
561 if tag.name in self.open_tag_counter:
562 self.open_tag_counter[tag.name] -= 1
563 if self.preserve_whitespace_tag_stack and tag == self.preserve_whitespace_tag_stack[-1]:
564 self.preserve_whitespace_tag_stack.pop()
565 if self.string_container_stack and tag == self.string_container_stack[-1]:
566 self.string_container_stack.pop()
567 #print("Pop", tag.name)
568 if self.tagStack:
569 self.currentTag = self.tagStack[-1]
570 return self.currentTag
571
572 def pushTag(self, tag):
573 """Internal method called by handle_starttag when a tag is opened."""
574 #print("Push", tag.name)
575 if self.currentTag is not None:
576 self.currentTag.contents.append(tag)
577 self.tagStack.append(tag)
578 self.currentTag = self.tagStack[-1]
579 if tag.name != self.ROOT_TAG_NAME:
580 self.open_tag_counter[tag.name] += 1
581 if tag.name in self.builder.preserve_whitespace_tags:
582 self.preserve_whitespace_tag_stack.append(tag)
583 if tag.name in self.builder.string_containers:
584 self.string_container_stack.append(tag)
585
586 def endData(self, containerClass=None):
587 """Method called by the TreeBuilder when the end of a data segment
588 occurs.
589 """
590 if self.current_data:
591 current_data = ''.join(self.current_data)
592 # If whitespace is not preserved, and this string contains
593 # nothing but ASCII spaces, replace it with a single space
594 # or newline.
595 if not self.preserve_whitespace_tag_stack:
596 strippable = True
597 for i in current_data:
598 if i not in self.ASCII_SPACES:
599 strippable = False
600 break
601 if strippable:
602 if '\n' in current_data:
603 current_data = '\n'
604 else:
605 current_data = ' '
606
607 # Reset the data collector.
608 self.current_data = []
609
610 # Should we add this string to the tree at all?
611 if self.parse_only and len(self.tagStack) <= 1 and \
612 (not self.parse_only.text or \
613 not self.parse_only.search(current_data)):
614 return
615
616 containerClass = self.string_container(containerClass)
617 o = containerClass(current_data)
618 self.object_was_parsed(o)
619
620 def object_was_parsed(self, o, parent=None, most_recent_element=None):
621 """Method called by the TreeBuilder to integrate an object into the parse tree."""
622 if parent is None:
623 parent = self.currentTag
624 if most_recent_element is not None:
625 previous_element = most_recent_element
626 else:
627 previous_element = self._most_recent_element
628
629 next_element = previous_sibling = next_sibling = None
630 if isinstance(o, Tag):
631 next_element = o.next_element
632 next_sibling = o.next_sibling
633 previous_sibling = o.previous_sibling
634 if previous_element is None:
635 previous_element = o.previous_element
636
637 fix = parent.next_element is not None
638
639 o.setup(parent, previous_element, next_element, previous_sibling, next_sibling)
640
641 self._most_recent_element = o
642 parent.contents.append(o)
643
644 # Check if we are inserting into an already parsed node.
645 if fix:
646 self._linkage_fixer(parent)
647
648 def _linkage_fixer(self, el):
649 """Make sure linkage of this fragment is sound."""
650
651 first = el.contents[0]
652 child = el.contents[-1]
653 descendant = child
654
655 if child is first and el.parent is not None:
656 # Parent should be linked to first child
657 el.next_element = child
658 # We are no longer linked to whatever this element is
659 prev_el = child.previous_element
660 if prev_el is not None and prev_el is not el:
661 prev_el.next_element = None
662 # First child should be linked to the parent, and no previous siblings.
663 child.previous_element = el
664 child.previous_sibling = None
665
666 # We have no sibling as we've been appended as the last.
667 child.next_sibling = None
668
669 # This index is a tag, dig deeper for a "last descendant"
670 if isinstance(child, Tag) and child.contents:
671 descendant = child._last_descendant(False)
672
673 # As the final step, link last descendant. It should be linked
674 # to the parent's next sibling (if found), else walk up the chain
675 # and find a parent with a sibling. It should have no next sibling.
676 descendant.next_element = None
677 descendant.next_sibling = None
678 target = el
679 while True:
680 if target is None:
681 break
682 elif target.next_sibling is not None:
683 descendant.next_element = target.next_sibling
684 target.next_sibling.previous_element = child
685 break
686 target = target.parent
687
688 def _popToTag(self, name, nsprefix=None, inclusivePop=True):
689 """Pops the tag stack up to and including the most recent
690 instance of the given tag.
691
692 If there are no open tags with the given name, nothing will be
693 popped.
694
695 :param name: Pop up to the most recent tag with this name.
696 :param nsprefix: The namespace prefix that goes with `name`.
697 :param inclusivePop: If this is false, pops the tag stack up
698 to but *not* including the most recent instance of the
699 given tag.
700
701 """
702 #print("Popping to %s" % name)
703 if name == self.ROOT_TAG_NAME:
704 # The BeautifulSoup object itself can never be popped.
705 return
706
707 most_recently_popped = None
708
709 stack_size = len(self.tagStack)
710 for i in range(stack_size - 1, 0, -1):
711 if not self.open_tag_counter.get(name):
712 break
713 t = self.tagStack[i]
714 if (name == t.name and nsprefix == t.prefix):
715 if inclusivePop:
716 most_recently_popped = self.popTag()
717 break
718 most_recently_popped = self.popTag()
719
720 return most_recently_popped
721
722 def handle_starttag(self, name, namespace, nsprefix, attrs, sourceline=None,
723 sourcepos=None, namespaces=None):
724 """Called by the tree builder when a new tag is encountered.
725
726 :param name: Name of the tag.
727 :param nsprefix: Namespace prefix for the tag.
728 :param attrs: A dictionary of attribute values.
729 :param sourceline: The line number where this tag was found in its
730 source document.
731 :param sourcepos: The character position within `sourceline` where this
732 tag was found.
733 :param namespaces: A dictionary of all namespace prefix mappings
734 currently in scope in the document.
735
736 If this method returns None, the tag was rejected by an active
737 SoupStrainer. You should proceed as if the tag had not occurred
738 in the document. For instance, if this was a self-closing tag,
739 don't call handle_endtag.
740 """
741 # print("Start tag %s: %s" % (name, attrs))
742 self.endData()
743
744 if (self.parse_only and len(self.tagStack) <= 1
745 and (self.parse_only.text
746 or not self.parse_only.search_tag(name, attrs))):
747 return None
748
749 tag = self.element_classes.get(Tag, Tag)(
750 self, self.builder, name, namespace, nsprefix, attrs,
751 self.currentTag, self._most_recent_element,
752 sourceline=sourceline, sourcepos=sourcepos,
753 namespaces=namespaces
754 )
755 if tag is None:
756 return tag
757 if self._most_recent_element is not None:
758 self._most_recent_element.next_element = tag
759 self._most_recent_element = tag
760 self.pushTag(tag)
761 return tag
762
763 def handle_endtag(self, name, nsprefix=None):
764 """Called by the tree builder when an ending tag is encountered.
765
766 :param name: Name of the tag.
767 :param nsprefix: Namespace prefix for the tag.
768 """
769 #print("End tag: " + name)
770 self.endData()
771 self._popToTag(name, nsprefix)
772
773 def handle_data(self, data):
774 """Called by the tree builder when a chunk of textual data is encountered."""
775 self.current_data.append(data)
776
777 def decode(self, pretty_print=False,
778 eventual_encoding=DEFAULT_OUTPUT_ENCODING,
779 formatter="minimal", iterator=None):
780 """Returns a Unicode string representation of the parse tree
781 as an HTML or XML document.
782
783 :param pretty_print: If this is True, indentation will be used to
784 make the document more readable.
785 :param eventual_encoding: The encoding of the final document.
786 If this is None, the document will be a Unicode string.
787 """
788 if self.is_xml:
789 # Print the XML declaration
790 encoding_part = ''
791 if eventual_encoding in PYTHON_SPECIFIC_ENCODINGS:
792 # This is a special Python encoding; it can't actually
793 # go into an XML document because it means nothing
794 # outside of Python.
795 eventual_encoding = None
796 if eventual_encoding is not None:
797 encoding_part = ' encoding="%s"' % eventual_encoding
798 prefix = '<?xml version="1.0"%s?>\n' % encoding_part
799 else:
800 prefix = ''
801 if not pretty_print:
802 indent_level = None
803 else:
804 indent_level = 0
805 return prefix + super(BeautifulSoup, self).decode(
806 indent_level, eventual_encoding, formatter, iterator)
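`decode()` is what `str(soup)` and `prettify()` build on: `pretty_print` controls indentation, and for XML trees the declaration is prepended. A small sketch of the HTML case, assuming the bs4 package is installed:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div><p>hi</p></div>", "html.parser")

flat = soup.decode()                     # single-line output
pretty = soup.decode(pretty_print=True)  # indented, one tag per line

assert flat == "<div><p>hi</p></div>"
assert pretty.count("\n") > flat.count("\n")
```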
807
808# Aliases to make it easier to get started quickly, e.g. 'from bs4 import _soup'
809_s = BeautifulSoup
810_soup = BeautifulSoup
811
812class BeautifulStoneSoup(BeautifulSoup):
813 """Deprecated interface to an XML parser."""
814
815 def __init__(self, *args, **kwargs):
816 kwargs['features'] = 'xml'
817 warnings.warn(
818 'The BeautifulStoneSoup class is deprecated. Instead of using '
819 'it, pass features="xml" into the BeautifulSoup constructor.',
820 DeprecationWarning, stacklevel=2
821 )
822 super(BeautifulStoneSoup, self).__init__(*args, **kwargs)
823
824
825class StopParsing(Exception):
826 """Exception raised by a TreeBuilder if it's unable to continue parsing."""
827 pass
828
829class FeatureNotFound(ValueError):
830 """Exception raised by the BeautifulSoup constructor if no parser with the
831 requested features is found.
832 """
833 pass
834
835
836 # If this file is run as a script, act as an HTML pretty-printer.
837if __name__ == '__main__':
838 soup = BeautifulSoup(sys.stdin)
839 print((soup.prettify()))
diff --git a/bitbake/lib/bs4/builder/__init__.py b/bitbake/lib/bs4/builder/__init__.py
deleted file mode 100644
index ffb31fc25e..0000000000
--- a/bitbake/lib/bs4/builder/__init__.py
+++ /dev/null
@@ -1,636 +0,0 @@
1# Use of this source code is governed by the MIT license.
2__license__ = "MIT"
3
4from collections import defaultdict
5import itertools
6import re
7import warnings
8import sys
9from bs4.element import (
10 CharsetMetaAttributeValue,
11 ContentMetaAttributeValue,
12 RubyParenthesisString,
13 RubyTextString,
14 Stylesheet,
15 Script,
16 TemplateString,
17 nonwhitespace_re
18)
19
20__all__ = [
21 'HTMLTreeBuilder',
22 'SAXTreeBuilder',
23 'TreeBuilder',
24 'TreeBuilderRegistry',
25 ]
26
27# Some useful features for a TreeBuilder to have.
28FAST = 'fast'
29PERMISSIVE = 'permissive'
30STRICT = 'strict'
31XML = 'xml'
32HTML = 'html'
33HTML_5 = 'html5'
34
35class XMLParsedAsHTMLWarning(UserWarning):
36 """The warning issued when an HTML parser is used to parse
37 XML that is not XHTML.
38 """
39 MESSAGE = """It looks like you're parsing an XML document using an HTML parser. If this really is an HTML document (maybe it's XHTML?), you can ignore or filter this warning. If it's XML, you should know that using an XML parser will be more reliable. To parse this document as XML, make sure you have the lxml package installed, and pass the keyword argument `features="xml"` into the BeautifulSoup constructor."""
40
41
42class TreeBuilderRegistry(object):
43 """A way of looking up TreeBuilder subclasses by their name or by desired
44 features.
45 """
46
47 def __init__(self):
48 self.builders_for_feature = defaultdict(list)
49 self.builders = []
50
51 def register(self, treebuilder_class):
52 """Register a treebuilder based on its advertised features.
53
54 :param treebuilder_class: A subclass of TreeBuilder. Its .features
55 attribute should list its features.
56 """
57 for feature in treebuilder_class.features:
58 self.builders_for_feature[feature].insert(0, treebuilder_class)
59 self.builders.insert(0, treebuilder_class)
60
61 def lookup(self, *features):
62 """Look up a TreeBuilder subclass with the desired features.
63
64 :param features: A list of features to look for. If none are
65 provided, the most recently registered TreeBuilder subclass
66 will be used.
67 :return: A TreeBuilder subclass, or None if there's no
68 registered subclass with all the requested features.
69 """
70 if len(self.builders) == 0:
71 # There are no builders at all.
72 return None
73
74 if len(features) == 0:
75 # They didn't ask for any features. Give them the most
76 # recently registered builder.
77 return self.builders[0]
78
79 # Go down the list of features in order, and eliminate any builders
80 # that don't match every feature.
81 features = list(features)
82 features.reverse()
83 candidates = None
84 candidate_set = None
85 while len(features) > 0:
86 feature = features.pop()
87 we_have_the_feature = self.builders_for_feature.get(feature, [])
88 if len(we_have_the_feature) > 0:
89 if candidates is None:
90 candidates = we_have_the_feature
91 candidate_set = set(candidates)
92 else:
93 # Eliminate any candidates that don't have this feature.
94 candidate_set = candidate_set.intersection(
95 set(we_have_the_feature))
96
97 # The only valid candidates are the ones in candidate_set.
98 # Go through the original list of candidates and pick the first one
99 # that's in candidate_set.
100 if candidate_set is None:
101 return None
102 for candidate in candidates:
103 if candidate in candidate_set:
104 return candidate
105 return None
106
107# The BeautifulSoup class will take feature lists from developers and use them
108# to look up builders in this registry.
109builder_registry = TreeBuilderRegistry()
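A lookup with no features returns the most recently registered builder, and an unsatisfiable feature set returns None rather than raising. A sketch against the module-level registry, assuming the bs4 package is installed:

```python
from bs4.builder import builder_registry

# The stdlib-based html.parser builder is always registered.
cls = builder_registry.lookup("html.parser")
assert cls is not None
assert "html" in cls.features

# A feature no builder advertises yields None rather than raising.
assert builder_registry.lookup("no-such-feature") is None
```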
110
111class TreeBuilder(object):
112 """Turn a textual document into a Beautiful Soup object tree."""
113
114 NAME = "[Unknown tree builder]"
115 ALTERNATE_NAMES = []
116 features = []
117
118 is_xml = False
119 picklable = False
120 empty_element_tags = None # A tag will be considered an empty-element
121 # tag if and only if it has no contents.
122
123 # A value for these tag/attribute combinations is a space- or
124 # comma-separated list of CDATA, rather than a single CDATA.
125 DEFAULT_CDATA_LIST_ATTRIBUTES = defaultdict(list)
126
127 # Whitespace should be preserved inside these tags.
128 DEFAULT_PRESERVE_WHITESPACE_TAGS = set()
129
130 # The textual contents of tags with these names should be
131 # instantiated with some class other than NavigableString.
132 DEFAULT_STRING_CONTAINERS = {}
133
134 USE_DEFAULT = object()
135
136 # Most parsers don't keep track of line numbers.
137 TRACKS_LINE_NUMBERS = False
138
139 def __init__(self, multi_valued_attributes=USE_DEFAULT,
140 preserve_whitespace_tags=USE_DEFAULT,
141 store_line_numbers=USE_DEFAULT,
142 string_containers=USE_DEFAULT,
143 ):
144 """Constructor.
145
146 :param multi_valued_attributes: If this is set to None, the
147 TreeBuilder will not turn any values for attributes like
148 'class' into lists. Setting this to a dictionary will
149 customize this behavior; look at DEFAULT_CDATA_LIST_ATTRIBUTES
150 for an example.
151
152 Internally, these are called "CDATA list attributes", but that
153 probably doesn't make sense to an end-user, so the argument name
154 is `multi_valued_attributes`.
155
156 :param preserve_whitespace_tags: A list of tags to treat
157 the way <pre> tags are treated in HTML. Tags in this list
158 are immune from pretty-printing; their contents will always be
159 output as-is.
160
161 :param string_containers: A dictionary mapping tag names to
162 the classes that should be instantiated to contain the textual
163 contents of those tags. The default is to use NavigableString
164 for every tag, no matter what the name. You can override the
165 default by changing DEFAULT_STRING_CONTAINERS.
166
167 :param store_line_numbers: If the parser keeps track of the
168 line numbers and positions of the original markup, that
169 information will, by default, be stored in each corresponding
170 `Tag` object. You can turn this off by passing
171 store_line_numbers=False. If the parser you're using doesn't
172 keep track of this information, then setting store_line_numbers=True
173 will do nothing.
174 """
175 self.soup = None
176 if multi_valued_attributes is self.USE_DEFAULT:
177 multi_valued_attributes = self.DEFAULT_CDATA_LIST_ATTRIBUTES
178 self.cdata_list_attributes = multi_valued_attributes
179 if preserve_whitespace_tags is self.USE_DEFAULT:
180 preserve_whitespace_tags = self.DEFAULT_PRESERVE_WHITESPACE_TAGS
181 self.preserve_whitespace_tags = preserve_whitespace_tags
182 if store_line_numbers == self.USE_DEFAULT:
183 store_line_numbers = self.TRACKS_LINE_NUMBERS
184 self.store_line_numbers = store_line_numbers
185 if string_containers == self.USE_DEFAULT:
186 string_containers = self.DEFAULT_STRING_CONTAINERS
187 self.string_containers = string_containers
188
189 def initialize_soup(self, soup):
190 """The BeautifulSoup object has been initialized and is now
191 being associated with the TreeBuilder.
192
193 :param soup: A BeautifulSoup object.
194 """
195 self.soup = soup
196
197 def reset(self):
198 """Do any work necessary to reset the underlying parser
199 for a new document.
200
201 By default, this does nothing.
202 """
203 pass
204
205 def can_be_empty_element(self, tag_name):
206 """Might a tag with this name be an empty-element tag?
207
208 The final markup may or may not actually present this tag as
209 self-closing.
210
211 For instance: an HTMLBuilder does not consider a <p> tag to be
212 an empty-element tag (it's not in
213 HTMLBuilder.empty_element_tags). This means an empty <p> tag
214 will be presented as "<p></p>", not "<p/>" or "<p>".
215
216 The default implementation has no opinion about which tags are
217 empty-element tags, so a tag will be presented as an
218 empty-element tag if and only if it has no children.
219 "<foo></foo>" will become "<foo/>", and "<foo>bar</foo>" will
220 be left alone.
221
222 :param tag_name: The name of a markup tag.
223 """
224 if self.empty_element_tags is None:
225 return True
226 return tag_name in self.empty_element_tags
227
228 def feed(self, markup):
229 """Run some incoming markup through some parsing process,
230 populating the `BeautifulSoup` object in self.soup.
231
232 This method is not implemented in TreeBuilder; it must be
233 implemented in subclasses.
234
235 :return: None.
236 """
237 raise NotImplementedError()
238
239 def prepare_markup(self, markup, user_specified_encoding=None,
240 document_declared_encoding=None, exclude_encodings=None):
241 """Run any preliminary steps necessary to make incoming markup
242 acceptable to the parser.
243
244 :param markup: Some markup -- probably a bytestring.
245 :param user_specified_encoding: The user asked to try this encoding.
246 :param document_declared_encoding: The markup itself claims to be
247 in this encoding. NOTE: This argument is not used by the
248 calling code and can probably be removed.
249 :param exclude_encodings: The user asked _not_ to try any of
250 these encodings.
251
252 :yield: A series of 4-tuples:
253 (markup, encoding, declared encoding,
254 has undergone character replacement)
255
256 Each 4-tuple represents a strategy for converting the
257 document to Unicode and parsing it. Each strategy will be tried
258 in turn.
259
260 By default, the only strategy is to parse the markup
261 as-is. See `LXMLTreeBuilderForXML` and
262 `HTMLParserTreeBuilder` for implementations that take into
263 account the quirks of particular parsers.
264 """
265 yield markup, None, None, False
266
267 def test_fragment_to_document(self, fragment):
268 """Wrap an HTML fragment to make it look like a document.
269
270 Different parsers do this differently. For instance, lxml
271 introduces an empty <head> tag, and html5lib
272 doesn't. Abstracting this away lets us write simple tests
273 which run HTML fragments through the parser and compare the
274 results against other HTML fragments.
275
276 This method should not be used outside of tests.
277
278 :param fragment: A string -- fragment of HTML.
279 :return: A string -- a full HTML document.
280 """
281 return fragment
282
283 def set_up_substitutions(self, tag):
284 """Set up any substitutions that will need to be performed on
285 a `Tag` when it's output as a string.
286
287 By default, this does nothing. See `HTMLTreeBuilder` for a
288 case where this is used.
289
290 :param tag: A `Tag`
291 :return: Whether or not a substitution was performed.
292 """
293 return False
294
295 def _replace_cdata_list_attribute_values(self, tag_name, attrs):
296 """When an attribute value is associated with a tag that can
297 have multiple values for that attribute, convert the string
298 value to a list of strings.
299
300 Basically, replaces class="foo bar" with class=["foo", "bar"]
301
302 NOTE: This method modifies its input in place.
303
304 :param tag_name: The name of a tag.
305 :param attrs: A dictionary containing the tag's attributes.
306 Any appropriate attribute values will be modified in place.
307 """
308 if not attrs:
309 return attrs
310 if self.cdata_list_attributes:
311 universal = self.cdata_list_attributes.get('*', [])
312 tag_specific = self.cdata_list_attributes.get(
313 tag_name.lower(), None)
314 for attr in list(attrs.keys()):
315 if attr in universal or (tag_specific and attr in tag_specific):
316 # We have a "class"-type attribute whose string
317 # value is a whitespace-separated list of
318 # values. Split it into a list.
319 value = attrs[attr]
320 if isinstance(value, str):
321 values = nonwhitespace_re.findall(value)
322 else:
323 # html5lib sometimes calls setAttributes twice
324 # for the same tag when rearranging the parse
325 # tree. On the second call the attribute value
326 # here is already a list. If this happens,
327 # leave the value alone rather than trying to
328 # split it again.
329 values = value
330 attrs[attr] = values
331 return attrs
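The effect is visible on any parsed tree: multi-valued attributes come back as lists, while single-valued attributes stay plain strings. A sketch, assuming the bs4 package is installed:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p id="intro" class="foo bar">text</p>', "html.parser")

# 'class' is a CDATA-list attribute, so its value is split into a list...
assert soup.p["class"] == ["foo", "bar"]
# ...while 'id' is single-valued and stays a string.
assert soup.p["id"] == "intro"
```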
332
333class SAXTreeBuilder(TreeBuilder):
334 """A Beautiful Soup treebuilder that listens for SAX events.
335
336 This is not currently used for anything, but it demonstrates
337 how a simple TreeBuilder would work.
338 """
339
340 def feed(self, markup):
341 raise NotImplementedError()
342
343 def close(self):
344 pass
345
346 def startElement(self, name, attrs):
347 attrs = dict((key[1], value) for key, value in list(attrs.items()))
348 #print("Start %s, %r" % (name, attrs))
349 self.soup.handle_starttag(name, attrs)
350
351 def endElement(self, name):
352 #print("End %s" % name)
353 self.soup.handle_endtag(name)
354
355 def startElementNS(self, nsTuple, nodeName, attrs):
356 # Throw away (ns, nodeName) for now.
357 self.startElement(nodeName, attrs)
358
359 def endElementNS(self, nsTuple, nodeName):
360 # Throw away (ns, nodeName) for now.
361 self.endElement(nodeName)
362 #handler.endElementNS((ns, node.nodeName), node.nodeName)
363
364 def startPrefixMapping(self, prefix, nodeValue):
365 # Ignore the prefix for now.
366 pass
367
368 def endPrefixMapping(self, prefix):
369 # Ignore the prefix for now.
370 # handler.endPrefixMapping(prefix)
371 pass
372
373 def characters(self, content):
374 self.soup.handle_data(content)
375
376 def startDocument(self):
377 pass
378
379 def endDocument(self):
380 pass
381
382
383class HTMLTreeBuilder(TreeBuilder):
384 """This TreeBuilder knows facts about HTML.
385
386 Such as which tags are empty-element tags.
387 """
388
389 empty_element_tags = set([
390 # These are from HTML5.
391 'area', 'base', 'br', 'col', 'embed', 'hr', 'img', 'input', 'keygen', 'link', 'menuitem', 'meta', 'param', 'source', 'track', 'wbr',
392
393 # These are from earlier versions of HTML and are removed in HTML5.
394 'basefont', 'bgsound', 'command', 'frame', 'image', 'isindex', 'nextid', 'spacer'
395 ])
396
397 # The HTML standard defines these as block-level elements. Beautiful
398 # Soup does not treat these elements differently from other elements,
399 # but it may do so eventually, and this information is available if
400 # you need to use it.
401 block_elements = set(["address", "article", "aside", "blockquote", "canvas", "dd", "div", "dl", "dt", "fieldset", "figcaption", "figure", "footer", "form", "h1", "h2", "h3", "h4", "h5", "h6", "header", "hr", "li", "main", "nav", "noscript", "ol", "output", "p", "pre", "section", "table", "tfoot", "ul", "video"])
402
403 # These HTML tags need special treatment so they can be
404 # represented by a string class other than NavigableString.
405 #
406 # For some of these tags, it's because the HTML standard defines
407 # an unusual content model for them. I made this list by going
408 # through the HTML spec
409 # (https://html.spec.whatwg.org/#metadata-content) and looking for
410 # "metadata content" elements that can contain strings.
411 #
412 # The Ruby tags (<rt> and <rp>) are here despite being normal
413 # "phrasing content" tags, because the content they contain is
414 # qualitatively different from other text in the document, and it
415 # can be useful to be able to distinguish it.
416 #
417 # TODO: Arguably <noscript> could go here but it seems
418 # qualitatively different from the other tags.
419 DEFAULT_STRING_CONTAINERS = {
420 'rt' : RubyTextString,
421 'rp' : RubyParenthesisString,
422 'style': Stylesheet,
423 'script': Script,
424 'template': TemplateString,
425 }
426
427 # The HTML standard defines these attributes as containing a
428 # space-separated list of values, not a single value. That is,
429 # class="foo bar" means that the 'class' attribute has two values,
430 # 'foo' and 'bar', not the single value 'foo bar'. When we
431 # encounter one of these attributes, we will parse its value into
432 # a list of values if possible. Upon output, the list will be
433 # converted back into a string.
434 DEFAULT_CDATA_LIST_ATTRIBUTES = {
435 "*" : ['class', 'accesskey', 'dropzone'],
436 "a" : ['rel', 'rev'],
437 "link" : ['rel', 'rev'],
438 "td" : ["headers"],
439 "th" : ["headers"],
441 "form" : ["accept-charset"],
442 "object" : ["archive"],
443
444 # These are HTML5 specific, as are *.accesskey and *.dropzone above.
445 "area" : ["rel"],
446 "icon" : ["sizes"],
447 "iframe" : ["sandbox"],
448 "output" : ["for"],
449 }
450
451 DEFAULT_PRESERVE_WHITESPACE_TAGS = set(['pre', 'textarea'])
452
453 def set_up_substitutions(self, tag):
454 """Replace the declared encoding in a <meta> tag with a placeholder,
455 to be substituted when the tag is output to a string.
456
457        An HTML document may come into Beautiful Soup in one
458 encoding, but exit in a different encoding, and the <meta> tag
459 needs to be changed to reflect this.
460
461 :param tag: A `Tag`
462        :return: Whether a declared encoding was detected. (An HTML-4-style "content" attribute is substituted as well, but does not affect this return value.)
463 """
464 # We are only interested in <meta> tags
465 if tag.name != 'meta':
466 return False
467
468 http_equiv = tag.get('http-equiv')
469 content = tag.get('content')
470 charset = tag.get('charset')
471
472 # We are interested in <meta> tags that say what encoding the
473 # document was originally in. This means HTML 5-style <meta>
474 # tags that provide the "charset" attribute. It also means
475 # HTML 4-style <meta> tags that provide the "content"
476 # attribute and have "http-equiv" set to "content-type".
477 #
478 # In both cases we will replace the value of the appropriate
479 # attribute with a standin object that can take on any
480 # encoding.
481 meta_encoding = None
482 if charset is not None:
483 # HTML 5 style:
484 # <meta charset="utf8">
485 meta_encoding = charset
486 tag['charset'] = CharsetMetaAttributeValue(charset)
487
488 elif (content is not None and http_equiv is not None
489 and http_equiv.lower() == 'content-type'):
490 # HTML 4 style:
491 # <meta http-equiv="content-type" content="text/html; charset=utf8">
492 tag['content'] = ContentMetaAttributeValue(content)
493
494 return (meta_encoding is not None)
495
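The `DEFAULT_CDATA_LIST_ATTRIBUTES` table above drives how multi-valued attributes are split. A minimal standalone sketch of that lookup-and-split logic (a simplified re-implementation for illustration, not the actual bs4 code path; the table here is a shortened copy):

```python
import re

# Simplified version of the whitespace-splitting rule used for
# multi-valued attributes; bs4 uses a similar nonwhitespace regex.
nonwhitespace_re = re.compile(r"\S+")

# Shortened copy of the table above, for illustration only.
CDATA_LIST_ATTRIBUTES = {
    "*": ["class", "accesskey", "dropzone"],
    "a": ["rel", "rev"],
    "td": ["headers"],
}

def replace_cdata_list_values(tag_name, attrs):
    """Split space-separated attribute values into lists, mirroring
    the table lookup: '*' rules apply to every tag, plus any rules
    registered for this specific tag name."""
    multi_valued = set(CDATA_LIST_ATTRIBUTES.get("*", []))
    multi_valued.update(CDATA_LIST_ATTRIBUTES.get(tag_name, []))
    converted = {}
    for name, value in attrs.items():
        if name in multi_valued and isinstance(value, str):
            converted[name] = nonwhitespace_re.findall(value)
        else:
            converted[name] = value
    return converted

print(replace_cdata_list_values("a", {"class": "foo  bar", "href": "/x"}))
# {'class': ['foo', 'bar'], 'href': '/x'}
```

On output, bs4 joins such lists back into a single space-separated string, so the round trip is lossless for well-formed input.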
496class DetectsXMLParsedAsHTML(object):
497 """A mixin class for any class (a TreeBuilder, or some class used by a
498 TreeBuilder) that's in a position to detect whether an XML
499 document is being incorrectly parsed as HTML, and issue an
500 appropriate warning.
501
502 This requires being able to observe an incoming processing
503    instruction that might be an XML declaration, and also being able to
504 observe tags as they're opened. If you can't do that for a given
505 TreeBuilder, there's a less reliable implementation based on
506 examining the raw markup.
507 """
508
509 # Regular expression for seeing if markup has an <html> tag.
510    LOOKS_LIKE_HTML = re.compile("<[^ +]*html", re.I)
511    LOOKS_LIKE_HTML_B = re.compile(b"<[^ +]*html", re.I)
512
513 XML_PREFIX = '<?xml'
514 XML_PREFIX_B = b'<?xml'
515
516 @classmethod
517 def warn_if_markup_looks_like_xml(cls, markup, stacklevel=3):
518 """Perform a check on some markup to see if it looks like XML
519 that's not XHTML. If so, issue a warning.
520
521 This is much less reliable than doing the check while parsing,
522 but some of the tree builders can't do that.
523
524 :param stacklevel: The stacklevel of the code calling this
525 function.
526
527 :return: True if the markup looks like non-XHTML XML, False
528 otherwise.
529
530 """
531 if isinstance(markup, bytes):
532 prefix = cls.XML_PREFIX_B
533 looks_like_html = cls.LOOKS_LIKE_HTML_B
534 else:
535 prefix = cls.XML_PREFIX
536 looks_like_html = cls.LOOKS_LIKE_HTML
537
538 if (markup is not None
539 and markup.startswith(prefix)
540 and not looks_like_html.search(markup[:500])
541 ):
542 cls._warn(stacklevel=stacklevel+2)
543 return True
544 return False
545
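The raw-markup check above reduces to two conditions: the document starts with an XML prolog, and no `<html>`-ish tag appears near the start. A simplified standalone sketch of that heuristic (illustrative, not the real classmethod, which also issues the warning):

```python
import re

# A pattern that matches an <html> tag, optionally with a prefix
# (e.g. </html), mirroring the class-level regex above.
LOOKS_LIKE_HTML = re.compile(r"<[^ +]*html", re.I)

def markup_looks_like_xml(markup):
    """Return True if markup starts with an XML declaration and no
    <html>-like tag appears in the first 500 characters -- i.e. it is
    probably XML that is not XHTML."""
    if markup is None:
        return False
    return bool(
        markup.startswith("<?xml")
        and not LOOKS_LIKE_HTML.search(markup[:500])
    )

print(markup_looks_like_xml('<?xml version="1.0"?><data/>'))        # True
print(markup_looks_like_xml('<?xml version="1.0"?><html></html>'))  # False
print(markup_looks_like_xml("<html></html>"))                       # False
```

As the docstring above notes, checking only the serialized markup is less reliable than observing the parse events, which is why the streaming detector below exists.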
546 @classmethod
547 def _warn(cls, stacklevel=5):
548 """Issue a warning about XML being parsed as HTML."""
549 warnings.warn(
550 XMLParsedAsHTMLWarning.MESSAGE, XMLParsedAsHTMLWarning,
551 stacklevel=stacklevel
552 )
553
554 def _initialize_xml_detector(self):
555 """Call this method before parsing a document."""
556 self._first_processing_instruction = None
557 self._root_tag = None
558
559 def _document_might_be_xml(self, processing_instruction):
560 """Call this method when encountering an XML declaration, or a
561 "processing instruction" that might be an XML declaration.
562 """
563 if (self._first_processing_instruction is not None
564 or self._root_tag is not None):
565 # The document has already started. Don't bother checking
566 # anymore.
567 return
568
569 self._first_processing_instruction = processing_instruction
570
571 # We won't know until we encounter the first tag whether or
572 # not this is actually a problem.
573
574 def _root_tag_encountered(self, name):
575 """Call this when you encounter the document's root tag.
576
577 This is where we actually check whether an XML document is
578 being incorrectly parsed as HTML, and issue the warning.
579 """
580 if self._root_tag is not None:
581 # This method was incorrectly called multiple times. Do
582 # nothing.
583 return
584
585 self._root_tag = name
586 if (name != 'html' and self._first_processing_instruction is not None
587 and self._first_processing_instruction.lower().startswith('xml ')):
588 # We encountered an XML declaration and then a tag other
589 # than 'html'. This is a reliable indicator that a
590            # non-XHTML XML document is being parsed as HTML.
591 self._warn()
592
593
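The parse-time side of the mixin above is a small state machine: remember the first processing instruction, then decide once the root tag is seen. A compact sketch of just that mechanism (illustrative, not the bs4 mixin; it records a flag where the real code issues `XMLParsedAsHTMLWarning`):

```python
class XMLAsHTMLDetector:
    """Minimal sketch of the parse-time detection: the first
    processing instruction is stored, and the verdict is reached when
    the document's root tag is encountered."""
    def __init__(self):
        self.first_pi = None
        self.root_tag = None
        self.warned = False

    def processing_instruction(self, pi):
        # Only the first PI before any tag matters.
        if self.first_pi is None and self.root_tag is None:
            self.first_pi = pi

    def root_tag_encountered(self, name):
        if self.root_tag is not None:
            return  # called twice by mistake; do nothing
        self.root_tag = name
        if (name != "html" and self.first_pi is not None
                and self.first_pi.lower().startswith("xml ")):
            # XML declaration followed by a non-html root tag:
            # the real code warns here.
            self.warned = True

d = XMLAsHTMLDetector()
d.processing_instruction('xml version="1.0"')
d.root_tag_encountered("data")
print(d.warned)  # True
```

An XHTML document triggers no warning because its root tag is `html`, which is exactly the exemption the real `_root_tag_encountered` implements.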
594def register_treebuilders_from(module):
595 """Copy TreeBuilders from the given module into this module."""
596 this_module = sys.modules[__name__]
597 for name in module.__all__:
598 obj = getattr(module, name)
599
600 if issubclass(obj, TreeBuilder):
601 setattr(this_module, name, obj)
602 this_module.__all__.append(name)
603 # Register the builder while we're at it.
604 this_module.builder_registry.register(obj)
605
606class ParserRejectedMarkup(Exception):
607 """An Exception to be raised when the underlying parser simply
608 refuses to parse the given markup.
609 """
610 def __init__(self, message_or_exception):
611 """Explain why the parser rejected the given markup, either
612 with a textual explanation or another exception.
613 """
614 if isinstance(message_or_exception, Exception):
615 e = message_or_exception
616 message_or_exception = "%s: %s" % (e.__class__.__name__, str(e))
617 super(ParserRejectedMarkup, self).__init__(message_or_exception)
618
619# Builders are registered in reverse order of priority, so that custom
620# builder registrations will take precedence. In general, we want lxml
621# to take precedence over html5lib, because it's faster. And we only
622# want to use HTMLParser as a last resort.
623from . import _htmlparser
624register_treebuilders_from(_htmlparser)
625try:
626 from . import _html5lib
627 register_treebuilders_from(_html5lib)
628except ImportError:
629 # They don't have html5lib installed.
630 pass
631try:
632 from . import _lxml
633 register_treebuilders_from(_lxml)
634except ImportError:
635 # They don't have lxml installed.
636 pass
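The registration order above matters because later registrations win on lookup. A toy registry (an illustrative sketch, not bs4's real `TreeBuilderRegistry`) showing why html.parser is registered first and lxml last:

```python
class ToyRegistry:
    """Later registrations take precedence on lookup, so registering
    in reverse order of priority makes the best available builder the
    default."""
    def __init__(self):
        self.builders = []

    def register(self, builder):
        # Most recently registered builder is consulted first.
        self.builders.insert(0, builder)

    def lookup(self, *features):
        # Return the first (i.e. most recently registered) builder
        # advertising every requested feature.
        for builder in self.builders:
            if all(f in builder["features"] for f in features):
                return builder
        return None

registry = ToyRegistry()
registry.register({"name": "html.parser", "features": {"html", "strict"}})
registry.register({"name": "html5lib", "features": {"html", "permissive", "html5"}})
registry.register({"name": "lxml", "features": {"html", "fast", "permissive"}})

print(registry.lookup()["name"])         # lxml -- last registered wins
print(registry.lookup("html5")["name"])  # html5lib
```

The feature sets here are invented for the sketch; the real builders advertise constants like `PERMISSIVE`, `HTML_5`, and `STRICT` from `bs4.builder`.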
diff --git a/bitbake/lib/bs4/builder/_html5lib.py b/bitbake/lib/bs4/builder/_html5lib.py
deleted file mode 100644
index 7c46a85118..0000000000
--- a/bitbake/lib/bs4/builder/_html5lib.py
+++ /dev/null
@@ -1,481 +0,0 @@
1# Use of this source code is governed by the MIT license.
2__license__ = "MIT"
3
4__all__ = [
5 'HTML5TreeBuilder',
6 ]
7
8import warnings
9import re
10from bs4.builder import (
11 DetectsXMLParsedAsHTML,
12 PERMISSIVE,
13 HTML,
14 HTML_5,
15 HTMLTreeBuilder,
16 )
17from bs4.element import (
18 NamespacedAttribute,
19 nonwhitespace_re,
20)
21import html5lib
22from html5lib.constants import (
23 namespaces,
24 prefixes,
25 )
26from bs4.element import (
27 Comment,
28 Doctype,
29 NavigableString,
30 Tag,
31 )
32
33try:
34 # Pre-0.99999999
35 from html5lib.treebuilders import _base as treebuilder_base
36 new_html5lib = False
37except ImportError as e:
38 # 0.99999999 and up
39 from html5lib.treebuilders import base as treebuilder_base
40 new_html5lib = True
41
42class HTML5TreeBuilder(HTMLTreeBuilder):
43 """Use html5lib to build a tree.
44
45 Note that this TreeBuilder does not support some features common
46 to HTML TreeBuilders. Some of these features could theoretically
47 be implemented, but at the very least it's quite difficult,
48 because html5lib moves the parse tree around as it's being built.
49
50 * This TreeBuilder doesn't use different subclasses of NavigableString
51 based on the name of the tag in which the string was found.
52
53 * You can't use a SoupStrainer to parse only part of a document.
54 """
55
56 NAME = "html5lib"
57
58 features = [NAME, PERMISSIVE, HTML_5, HTML]
59
60 # html5lib can tell us which line number and position in the
61 # original file is the source of an element.
62 TRACKS_LINE_NUMBERS = True
63
64 def prepare_markup(self, markup, user_specified_encoding,
65 document_declared_encoding=None, exclude_encodings=None):
66 # Store the user-specified encoding for use later on.
67 self.user_specified_encoding = user_specified_encoding
68
69 # document_declared_encoding and exclude_encodings aren't used
70 # ATM because the html5lib TreeBuilder doesn't use
71 # UnicodeDammit.
72 if exclude_encodings:
73 warnings.warn(
74                "You provided a value for exclude_encodings, but the html5lib tree builder doesn't support exclude_encodings.",
75 stacklevel=3
76 )
77
78 # html5lib only parses HTML, so if it's given XML that's worth
79 # noting.
80 DetectsXMLParsedAsHTML.warn_if_markup_looks_like_xml(
81 markup, stacklevel=3
82 )
83
84 yield (markup, None, None, False)
85
86 # These methods are defined by Beautiful Soup.
87 def feed(self, markup):
88 if self.soup.parse_only is not None:
89 warnings.warn(
90 "You provided a value for parse_only, but the html5lib tree builder doesn't support parse_only. The entire document will be parsed.",
91 stacklevel=4
92 )
93 parser = html5lib.HTMLParser(tree=self.create_treebuilder)
94 self.underlying_builder.parser = parser
95 extra_kwargs = dict()
96 if not isinstance(markup, str):
97 if new_html5lib:
98 extra_kwargs['override_encoding'] = self.user_specified_encoding
99 else:
100 extra_kwargs['encoding'] = self.user_specified_encoding
101 doc = parser.parse(markup, **extra_kwargs)
102
103 # Set the character encoding detected by the tokenizer.
104 if isinstance(markup, str):
105 # We need to special-case this because html5lib sets
106 # charEncoding to UTF-8 if it gets Unicode input.
107 doc.original_encoding = None
108 else:
109 original_encoding = parser.tokenizer.stream.charEncoding[0]
110 if not isinstance(original_encoding, str):
111 # In 0.99999999 and up, the encoding is an html5lib
112 # Encoding object. We want to use a string for compatibility
113 # with other tree builders.
114 original_encoding = original_encoding.name
115 doc.original_encoding = original_encoding
116 self.underlying_builder.parser = None
117
118 def create_treebuilder(self, namespaceHTMLElements):
119 self.underlying_builder = TreeBuilderForHtml5lib(
120 namespaceHTMLElements, self.soup,
121 store_line_numbers=self.store_line_numbers
122 )
123 return self.underlying_builder
124
125 def test_fragment_to_document(self, fragment):
126 """See `TreeBuilder`."""
127 return '<html><head></head><body>%s</body></html>' % fragment
128
129
130class TreeBuilderForHtml5lib(treebuilder_base.TreeBuilder):
131
132 def __init__(self, namespaceHTMLElements, soup=None,
133 store_line_numbers=True, **kwargs):
134 if soup:
135 self.soup = soup
136 else:
137 from bs4 import BeautifulSoup
138 # TODO: Why is the parser 'html.parser' here? To avoid an
139 # infinite loop?
140 self.soup = BeautifulSoup(
141 "", "html.parser", store_line_numbers=store_line_numbers,
142 **kwargs
143 )
144 # TODO: What are **kwargs exactly? Should they be passed in
145 # here in addition to/instead of being passed to the BeautifulSoup
146 # constructor?
147 super(TreeBuilderForHtml5lib, self).__init__(namespaceHTMLElements)
148
149 # This will be set later to an html5lib.html5parser.HTMLParser
150 # object, which we can use to track the current line number.
151 self.parser = None
152 self.store_line_numbers = store_line_numbers
153
154 def documentClass(self):
155 self.soup.reset()
156 return Element(self.soup, self.soup, None)
157
158 def insertDoctype(self, token):
159 name = token["name"]
160 publicId = token["publicId"]
161 systemId = token["systemId"]
162
163 doctype = Doctype.for_name_and_ids(name, publicId, systemId)
164 self.soup.object_was_parsed(doctype)
165
166 def elementClass(self, name, namespace):
167 kwargs = {}
168 if self.parser and self.store_line_numbers:
169 # This represents the point immediately after the end of the
170 # tag. We don't know when the tag started, but we do know
171 # where it ended -- the character just before this one.
172 sourceline, sourcepos = self.parser.tokenizer.stream.position()
173 kwargs['sourceline'] = sourceline
174 kwargs['sourcepos'] = sourcepos-1
175 tag = self.soup.new_tag(name, namespace, **kwargs)
176
177 return Element(tag, self.soup, namespace)
178
179 def commentClass(self, data):
180 return TextNode(Comment(data), self.soup)
181
182 def fragmentClass(self):
183 from bs4 import BeautifulSoup
184 # TODO: Why is the parser 'html.parser' here? To avoid an
185 # infinite loop?
186 self.soup = BeautifulSoup("", "html.parser")
187 self.soup.name = "[document_fragment]"
188 return Element(self.soup, self.soup, None)
189
190 def appendChild(self, node):
191 # XXX This code is not covered by the BS4 tests.
192 self.soup.append(node.element)
193
194 def getDocument(self):
195 return self.soup
196
197 def getFragment(self):
198 return treebuilder_base.TreeBuilder.getFragment(self).element
199
200 def testSerializer(self, element):
201 from bs4 import BeautifulSoup
202 rv = []
203 doctype_re = re.compile(r'^(.*?)(?: PUBLIC "(.*?)"(?: "(.*?)")?| SYSTEM "(.*?)")?$')
204
205 def serializeElement(element, indent=0):
206 if isinstance(element, BeautifulSoup):
207 pass
208 if isinstance(element, Doctype):
209 m = doctype_re.match(element)
210 if m:
211 name = m.group(1)
212 if m.lastindex > 1:
213 publicId = m.group(2) or ""
214 systemId = m.group(3) or m.group(4) or ""
215 rv.append("""|%s<!DOCTYPE %s "%s" "%s">""" %
216 (' ' * indent, name, publicId, systemId))
217 else:
218 rv.append("|%s<!DOCTYPE %s>" % (' ' * indent, name))
219 else:
220 rv.append("|%s<!DOCTYPE >" % (' ' * indent,))
221 elif isinstance(element, Comment):
222 rv.append("|%s<!-- %s -->" % (' ' * indent, element))
223 elif isinstance(element, NavigableString):
224 rv.append("|%s\"%s\"" % (' ' * indent, element))
225 else:
226 if element.namespace:
227 name = "%s %s" % (prefixes[element.namespace],
228 element.name)
229 else:
230 name = element.name
231 rv.append("|%s<%s>" % (' ' * indent, name))
232 if element.attrs:
233 attributes = []
234 for name, value in list(element.attrs.items()):
235 if isinstance(name, NamespacedAttribute):
236 name = "%s %s" % (prefixes[name.namespace], name.name)
237 if isinstance(value, list):
238 value = " ".join(value)
239 attributes.append((name, value))
240
241 for name, value in sorted(attributes):
242 rv.append('|%s%s="%s"' % (' ' * (indent + 2), name, value))
243 indent += 2
244 for child in element.children:
245 serializeElement(child, indent)
246 serializeElement(element, 0)
247
248 return "\n".join(rv)
249
250class AttrList(object):
251 def __init__(self, element):
252 self.element = element
253 self.attrs = dict(self.element.attrs)
254 def __iter__(self):
255 return list(self.attrs.items()).__iter__()
256 def __setitem__(self, name, value):
257 # If this attribute is a multi-valued attribute for this element,
258 # turn its value into a list.
259 list_attr = self.element.cdata_list_attributes or {}
260 if (name in list_attr.get('*', [])
261 or (self.element.name in list_attr
262 and name in list_attr.get(self.element.name, []))):
263 # A node that is being cloned may have already undergone
264 # this procedure.
265 if not isinstance(value, list):
266 value = nonwhitespace_re.findall(value)
267 self.element[name] = value
268 def items(self):
269 return list(self.attrs.items())
270 def keys(self):
271 return list(self.attrs.keys())
272 def __len__(self):
273 return len(self.attrs)
274 def __getitem__(self, name):
275 return self.attrs[name]
276 def __contains__(self, name):
277 return name in list(self.attrs.keys())
278
279
280class Element(treebuilder_base.Node):
281 def __init__(self, element, soup, namespace):
282 treebuilder_base.Node.__init__(self, element.name)
283 self.element = element
284 self.soup = soup
285 self.namespace = namespace
286
287 def appendChild(self, node):
288 string_child = child = None
289 if isinstance(node, str):
290 # Some other piece of code decided to pass in a string
291 # instead of creating a TextElement object to contain the
292 # string.
293 string_child = child = node
294 elif isinstance(node, Tag):
295 # Some other piece of code decided to pass in a Tag
296 # instead of creating an Element object to contain the
297 # Tag.
298 child = node
299 elif node.element.__class__ == NavigableString:
300 string_child = child = node.element
301 node.parent = self
302 else:
303 child = node.element
304 node.parent = self
305
306 if not isinstance(child, str) and child.parent is not None:
307 node.element.extract()
308
309 if (string_child is not None and self.element.contents
310 and self.element.contents[-1].__class__ == NavigableString):
311 # We are appending a string onto another string.
312 # TODO This has O(n^2) performance, for input like
313 # "a</a>a</a>a</a>..."
314 old_element = self.element.contents[-1]
315 new_element = self.soup.new_string(old_element + string_child)
316 old_element.replace_with(new_element)
317 self.soup._most_recent_element = new_element
318 else:
319 if isinstance(node, str):
320 # Create a brand new NavigableString from this string.
321 child = self.soup.new_string(node)
322
323 # Tell Beautiful Soup to act as if it parsed this element
324 # immediately after the parent's last descendant. (Or
325 # immediately after the parent, if it has no children.)
326 if self.element.contents:
327 most_recent_element = self.element._last_descendant(False)
328 elif self.element.next_element is not None:
329 # Something from further ahead in the parse tree is
330 # being inserted into this earlier element. This is
331 # very annoying because it means an expensive search
332 # for the last element in the tree.
333 most_recent_element = self.soup._last_descendant()
334 else:
335 most_recent_element = self.element
336
337 self.soup.object_was_parsed(
338 child, parent=self.element,
339 most_recent_element=most_recent_element)
340
341 def getAttributes(self):
342 if isinstance(self.element, Comment):
343 return {}
344 return AttrList(self.element)
345
346 def setAttributes(self, attributes):
347 if attributes is not None and len(attributes) > 0:
349 for name, value in list(attributes.items()):
350 if isinstance(name, tuple):
351 new_name = NamespacedAttribute(*name)
352 del attributes[name]
353 attributes[new_name] = value
354
355 self.soup.builder._replace_cdata_list_attribute_values(
356 self.name, attributes)
357 for name, value in list(attributes.items()):
358 self.element[name] = value
359
360 # The attributes may contain variables that need substitution.
361 # Call set_up_substitutions manually.
362 #
363 # The Tag constructor called this method when the Tag was created,
364 # but we just set/changed the attributes, so call it again.
365 self.soup.builder.set_up_substitutions(self.element)
366 attributes = property(getAttributes, setAttributes)
367
368 def insertText(self, data, insertBefore=None):
369 text = TextNode(self.soup.new_string(data), self.soup)
370 if insertBefore:
371 self.insertBefore(text, insertBefore)
372 else:
373 self.appendChild(text)
374
375 def insertBefore(self, node, refNode):
376 index = self.element.index(refNode.element)
377 if (node.element.__class__ == NavigableString and self.element.contents
378 and self.element.contents[index-1].__class__ == NavigableString):
379 # (See comments in appendChild)
380 old_node = self.element.contents[index-1]
381 new_str = self.soup.new_string(old_node + node.element)
382 old_node.replace_with(new_str)
383 else:
384 self.element.insert(index, node.element)
385 node.parent = self
386
387 def removeChild(self, node):
388 node.element.extract()
389
390 def reparentChildren(self, new_parent):
391 """Move all of this tag's children into another tag."""
392 # print("MOVE", self.element.contents)
393 # print("FROM", self.element)
394 # print("TO", new_parent.element)
395
396 element = self.element
397 new_parent_element = new_parent.element
398 # Determine what this tag's next_element will be once all the children
399 # are removed.
400 final_next_element = element.next_sibling
401
402 new_parents_last_descendant = new_parent_element._last_descendant(False, False)
403 if len(new_parent_element.contents) > 0:
404 # The new parent already contains children. We will be
405 # appending this tag's children to the end.
406 new_parents_last_child = new_parent_element.contents[-1]
407 new_parents_last_descendant_next_element = new_parents_last_descendant.next_element
408 else:
409 # The new parent contains no children.
410 new_parents_last_child = None
411 new_parents_last_descendant_next_element = new_parent_element.next_element
412
413 to_append = element.contents
414 if len(to_append) > 0:
415 # Set the first child's previous_element and previous_sibling
416 # to elements within the new parent
417 first_child = to_append[0]
418 if new_parents_last_descendant is not None:
419 first_child.previous_element = new_parents_last_descendant
420 else:
421 first_child.previous_element = new_parent_element
422 first_child.previous_sibling = new_parents_last_child
423 if new_parents_last_descendant is not None:
424 new_parents_last_descendant.next_element = first_child
425 else:
426 new_parent_element.next_element = first_child
427 if new_parents_last_child is not None:
428 new_parents_last_child.next_sibling = first_child
429
430 # Find the very last element being moved. It is now the
431 # parent's last descendant. It has no .next_sibling and
432 # its .next_element is whatever the previous last
433 # descendant had.
434 last_childs_last_descendant = to_append[-1]._last_descendant(False, True)
435
436 last_childs_last_descendant.next_element = new_parents_last_descendant_next_element
437 if new_parents_last_descendant_next_element is not None:
438 # TODO: This code has no test coverage and I'm not sure
439 # how to get html5lib to go through this path, but it's
440 # just the other side of the previous line.
441 new_parents_last_descendant_next_element.previous_element = last_childs_last_descendant
442 last_childs_last_descendant.next_sibling = None
443
444 for child in to_append:
445 child.parent = new_parent_element
446 new_parent_element.contents.append(child)
447
448 # Now that this element has no children, change its .next_element.
449 element.contents = []
450 element.next_element = final_next_element
451
452 # print("DONE WITH MOVE")
453 # print("FROM", self.element)
454 # print("TO", new_parent_element)
455
456 def cloneNode(self):
457 tag = self.soup.new_tag(self.element.name, self.namespace)
458 node = Element(tag, self.soup, self.namespace)
459 for key,value in self.attributes:
460 node.attributes[key] = value
461 return node
462
463 def hasContent(self):
464 return self.element.contents
465
466 def getNameTuple(self):
467        if self.namespace is None:
468 return namespaces["html"], self.name
469 else:
470 return self.namespace, self.name
471
472 nameTuple = property(getNameTuple)
473
474class TextNode(Element):
475 def __init__(self, element, soup):
476 treebuilder_base.Node.__init__(self, None)
477 self.element = element
478 self.soup = soup
479
480 def cloneNode(self):
481 raise NotImplementedError
diff --git a/bitbake/lib/bs4/builder/_htmlparser.py b/bitbake/lib/bs4/builder/_htmlparser.py
deleted file mode 100644
index 3cc187f892..0000000000
--- a/bitbake/lib/bs4/builder/_htmlparser.py
+++ /dev/null
@@ -1,387 +0,0 @@
1# encoding: utf-8
2"""Use the HTMLParser library to parse HTML files that aren't too bad."""
3
4# Use of this source code is governed by the MIT license.
5__license__ = "MIT"
6
7__all__ = [
8 'HTMLParserTreeBuilder',
9 ]
10
11from html.parser import HTMLParser
12
13import sys
14import warnings
15
16from bs4.element import (
17 CData,
18 Comment,
19 Declaration,
20 Doctype,
21 ProcessingInstruction,
22 )
23from bs4.dammit import EntitySubstitution, UnicodeDammit
24
25from bs4.builder import (
26 DetectsXMLParsedAsHTML,
27 ParserRejectedMarkup,
28 HTML,
29 HTMLTreeBuilder,
30 STRICT,
31 )
32
33
34HTMLPARSER = 'html.parser'
35
36class BeautifulSoupHTMLParser(HTMLParser, DetectsXMLParsedAsHTML):
37 """A subclass of the Python standard library's HTMLParser class, which
38 listens for HTMLParser events and translates them into calls
39 to Beautiful Soup's tree construction API.
40 """
41
42 # Strategies for handling duplicate attributes
43 IGNORE = 'ignore'
44 REPLACE = 'replace'
45
46 def __init__(self, *args, **kwargs):
47 """Constructor.
48
49 :param on_duplicate_attribute: A strategy for what to do if a
50 tag includes the same attribute more than once. Accepted
51 values are: REPLACE (replace earlier values with later
52 ones, the default), IGNORE (keep the earliest value
53 encountered), or a callable. A callable must take three
54 arguments: the dictionary of attributes already processed,
55 the name of the duplicate attribute, and the most recent value
56 encountered.
57 """
58 self.on_duplicate_attribute = kwargs.pop(
59 'on_duplicate_attribute', self.REPLACE
60 )
61 HTMLParser.__init__(self, *args, **kwargs)
62
63 # Keep a list of empty-element tags that were encountered
64 # without an explicit closing tag. If we encounter a closing tag
65 # of this type, we'll associate it with one of those entries.
66 #
67 # This isn't a stack because we don't care about the
68 # order. It's a list of closing tags we've already handled and
69        # will ignore, if they ever show up.
70 self.already_closed_empty_element = []
71
72 self._initialize_xml_detector()
73
74 def error(self, message):
75 # NOTE: This method is required so long as Python 3.9 is
76 # supported. The corresponding code is removed from HTMLParser
77 # in 3.5, but not removed from ParserBase until 3.10.
78 # https://github.com/python/cpython/issues/76025
79 #
80 # The original implementation turned the error into a warning,
81 # but in every case I discovered, this made HTMLParser
82 # immediately crash with an error message that was less
83 # helpful than the warning. The new implementation makes it
84 # more clear that html.parser just can't parse this
85 # markup. The 3.10 implementation does the same, though it
86 # raises AssertionError rather than calling a method. (We
87 # catch this error and wrap it in a ParserRejectedMarkup.)
88 raise ParserRejectedMarkup(message)
89
90 def handle_startendtag(self, name, attrs):
91 """Handle an incoming empty-element tag.
92
93 This is only called when the markup looks like <tag/>.
94
95 :param name: Name of the tag.
96 :param attrs: Dictionary of the tag's attributes.
97 """
98        # Passing handle_empty_element=False tells handle_starttag not to close
99 # just because its name matches a known empty-element tag. We
100 # know that this is an empty-element tag and we want to call
101 # handle_endtag ourselves.
102 tag = self.handle_starttag(name, attrs, handle_empty_element=False)
103 self.handle_endtag(name)
104
105 def handle_starttag(self, name, attrs, handle_empty_element=True):
106 """Handle an opening tag, e.g. '<tag>'
107
108 :param name: Name of the tag.
109 :param attrs: Dictionary of the tag's attributes.
110 :param handle_empty_element: True if this tag is known to be
111 an empty-element tag (i.e. there is not expected to be any
112 closing tag).
113 """
114 # XXX namespace
115 attr_dict = {}
116 for key, value in attrs:
117 # Change None attribute values to the empty string
118 # for consistency with the other tree builders.
119 if value is None:
120 value = ''
121 if key in attr_dict:
122 # A single attribute shows up multiple times in this
123 # tag. How to handle it depends on the
124 # on_duplicate_attribute setting.
125 on_dupe = self.on_duplicate_attribute
126 if on_dupe == self.IGNORE:
127 pass
128 elif on_dupe in (None, self.REPLACE):
129 attr_dict[key] = value
130 else:
131 on_dupe(attr_dict, key, value)
132 else:
133 attr_dict[key] = value
135 #print("START", name)
136 sourceline, sourcepos = self.getpos()
137 tag = self.soup.handle_starttag(
138 name, None, None, attr_dict, sourceline=sourceline,
139 sourcepos=sourcepos
140 )
141 if tag and tag.is_empty_element and handle_empty_element:
142 # Unlike other parsers, html.parser doesn't send separate end tag
143 # events for empty-element tags. (It's handled in
144 # handle_startendtag, but only if the original markup looked like
145 # <tag/>.)
146 #
147 # So we need to call handle_endtag() ourselves. Since we
148 # know the start event is identical to the end event, we
149 # don't want handle_endtag() to cross off any previous end
150 # events for tags of this name.
151 self.handle_endtag(name, check_already_closed=False)
152
153 # But we might encounter an explicit closing tag for this tag
154 # later on. If so, we want to ignore it.
155 self.already_closed_empty_element.append(name)
156
157 if self._root_tag is None:
158 self._root_tag_encountered(name)
159
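The duplicate-attribute handling in `handle_starttag` above reduces to a small dictionary-building loop. A standalone sketch using the same REPLACE/IGNORE/callable contract (illustrative, not the bs4 code path):

```python
IGNORE = "ignore"
REPLACE = "replace"

def build_attr_dict(attrs, on_duplicate=REPLACE):
    """attrs is a list of (name, value) pairs, as html.parser
    delivers them; duplicates are resolved per the strategy."""
    attr_dict = {}
    for key, value in attrs:
        if value is None:
            value = ""  # normalize, matching the other tree builders
        if key in attr_dict:
            if on_duplicate == IGNORE:
                pass  # keep the earliest value
            elif on_duplicate in (None, REPLACE):
                attr_dict[key] = value  # later value wins (the default)
            else:
                on_duplicate(attr_dict, key, value)  # custom callable
        else:
            attr_dict[key] = value
    return attr_dict

attrs = [("class", "a"), ("class", "b")]
print(build_attr_dict(attrs))                       # {'class': 'b'}
print(build_attr_dict(attrs, on_duplicate=IGNORE))  # {'class': 'a'}
```

A callable strategy receives the attributes seen so far, the duplicate name, and the newest value, so it can merge rather than pick one.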
160 def handle_endtag(self, name, check_already_closed=True):
161 """Handle a closing tag, e.g. '</tag>'
162
163 :param name: A tag name.
164 :param check_already_closed: True if this tag is expected to
165 be the closing portion of an empty-element tag,
166 e.g. '<tag></tag>'.
167 """
168 #print("END", name)
169 if check_already_closed and name in self.already_closed_empty_element:
170 # This is a redundant end tag for an empty-element tag.
171 # We've already called handle_endtag() for it, so just
172 # check it off the list.
173 #print("ALREADY CLOSED", name)
174 self.already_closed_empty_element.remove(name)
175 else:
176 self.soup.handle_endtag(name)
177
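The `already_closed_empty_element` bookkeeping spans the two methods above: a synthesized end event is emitted at start-tag time, and a later explicit end tag for the same name is swallowed. A compact sketch of just that mechanism (illustrative; tag names and the event list are stand-ins for the real soup calls):

```python
class EmptyElementTracker:
    """When a start tag for a known empty element is seen, close it
    immediately and remember the name; a later explicit end tag for it
    is then dropped instead of being sent to the tree again."""
    EMPTY = {"br", "img", "hr", "meta", "link", "input"}  # abbreviated list

    def __init__(self):
        self.already_closed = []
        self.events = []

    def start(self, name):
        self.events.append(("start", name))
        if name in self.EMPTY:
            self.events.append(("end", name))  # close it ourselves now
            self.already_closed.append(name)   # and ignore a later </name>

    def end(self, name):
        if name in self.already_closed:
            self.already_closed.remove(name)   # redundant end tag: drop it
        else:
            self.events.append(("end", name))

t = EmptyElementTracker()
t.start("p"); t.start("br"); t.end("br"); t.end("p")
print(t.events)
# [('start', 'p'), ('start', 'br'), ('end', 'br'), ('end', 'p')]
```

As the comment above notes, a plain list suffices because only membership matters, not nesting order.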
178 def handle_data(self, data):
179 """Handle some textual data that shows up between tags."""
180 self.soup.handle_data(data)
181
182 def handle_charref(self, name):
183 """Handle a numeric character reference by converting it to the
184 corresponding Unicode character and treating it as textual
185 data.
186
187 :param name: Character number, possibly in hexadecimal.
188 """
189 # TODO: This was originally a workaround for a bug in
190 # HTMLParser. (http://bugs.python.org/issue13633) The bug has
191 # been fixed, but removing this code still makes some
192 # Beautiful Soup tests fail. This needs investigation.
193 if name.startswith('x'):
194 real_name = int(name.lstrip('x'), 16)
195 elif name.startswith('X'):
196 real_name = int(name.lstrip('X'), 16)
197 else:
198 real_name = int(name)
199
200 data = None
201 if real_name < 256:
202 # HTML numeric entities are supposed to reference Unicode
203 # code points, but sometimes they reference code points in
204 # some other encoding (ahem, Windows-1252). E.g. &#147;
205 # instead of &#8220; for LEFT DOUBLE QUOTATION MARK. This
206 # code tries to detect this situation and compensate.
207 for encoding in (self.soup.original_encoding, 'windows-1252'):
208 if not encoding:
209 continue
210 try:
211 data = bytearray([real_name]).decode(encoding)
212 except UnicodeDecodeError as e:
213 pass
214 if not data:
215 try:
216 data = chr(real_name)
217 except (ValueError, OverflowError) as e:
218 pass
219 data = data or "\N{REPLACEMENT CHARACTER}"
220 self.handle_data(data)
221
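The charref-resolution logic above can be sketched as a standalone function, including the windows-1252 compensation for low code points. The `resolve_charref` helper below is hypothetical (stdlib only) and mirrors the method's control flow:

```python
def resolve_charref(name, original_encoding=None):
    """Resolve a numeric character reference. Code points below 256 are
    first tried as single bytes in the document's encoding (or
    windows-1252), compensating for mislabeled entities like &#147;."""
    if name.lower().startswith('x'):
        real_name = int(name[1:], 16)
    else:
        real_name = int(name)
    data = None
    if real_name < 256:
        for encoding in (original_encoding, 'windows-1252'):
            if not encoding:
                continue
            try:
                data = bytearray([real_name]).decode(encoding)
            except UnicodeDecodeError:
                pass
    if not data:
        try:
            data = chr(real_name)
        except (ValueError, OverflowError):
            pass
    return data or "\N{REPLACEMENT CHARACTER}"

# '&#147;' is windows-1252 for U+201C LEFT DOUBLE QUOTATION MARK:
left_quote = resolve_charref("147")   # '\u201c'
```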
222 def handle_entityref(self, name):
223 """Handle a named entity reference by converting it to the
224 corresponding Unicode character(s) and treating it as textual
225 data.
226
227 :param name: Name of the entity reference.
228 """
229 character = EntitySubstitution.HTML_ENTITY_TO_CHARACTER.get(name)
230 if character is not None:
231 data = character
232 else:
233 # If this were XML, it would be ambiguous whether "&foo"
234 # was a character entity reference with a missing
235 # semicolon or the literal string "&foo". Since this is
236 # HTML, we have a complete list of all character entity references,
237 # and this one wasn't found, so assume it's the literal string "&foo".
238 data = "&%s" % name
239 self.handle_data(data)
240
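The named-entity fallback above uses Beautiful Soup's own entity table; the same behavior can be sketched with the standard library's `html.entities.html5` mapping, which keys entity names *with* their trailing semicolons. The `resolve_entityref` helper below is hypothetical:

```python
from html.entities import html5

def resolve_entityref(name):
    """Resolve a named entity reference. Unknown names fall back to the
    literal '&name' text, as the HTML-mode handler above does."""
    character = html5.get(name + ';')
    if character is not None:
        return character
    return "&%s" % name
```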
241 def handle_comment(self, data):
242 """Handle an HTML comment.
243
244 :param data: The text of the comment.
245 """
246 self.soup.endData()
247 self.soup.handle_data(data)
248 self.soup.endData(Comment)
249
250 def handle_decl(self, data):
251 """Handle a DOCTYPE declaration.
252
253 :param data: The text of the declaration.
254 """
255 self.soup.endData()
256 data = data[len("DOCTYPE "):]
257 self.soup.handle_data(data)
258 self.soup.endData(Doctype)
259
260 def unknown_decl(self, data):
261 """Handle a declaration of unknown type -- probably a CDATA block.
262
263 :param data: The text of the declaration.
264 """
265 if data.upper().startswith('CDATA['):
266 cls = CData
267 data = data[len('CDATA['):]
268 else:
269 cls = Declaration
270 self.soup.endData()
271 self.soup.handle_data(data)
272 self.soup.endData(cls)
273
274 def handle_pi(self, data):
275 """Handle a processing instruction.
276
277 :param data: The text of the instruction.
278 """
279 self.soup.endData()
280 self.soup.handle_data(data)
281 self._document_might_be_xml(data)
282 self.soup.endData(ProcessingInstruction)
283
284
285class HTMLParserTreeBuilder(HTMLTreeBuilder):
286 """A Beautiful soup `TreeBuilder` that uses the `HTMLParser` parser,
287 found in the Python standard library.
288 """
289 is_xml = False
290 picklable = True
291 NAME = HTMLPARSER
292 features = [NAME, HTML, STRICT]
293
294 # The html.parser knows which line number and position in the
295 # original file is the source of an element.
296 TRACKS_LINE_NUMBERS = True
297
298 def __init__(self, parser_args=None, parser_kwargs=None, **kwargs):
299 """Constructor.
300
301 :param parser_args: Positional arguments to pass into
302 the BeautifulSoupHTMLParser constructor, once it's
303 invoked.
304 :param parser_kwargs: Keyword arguments to pass into
305 the BeautifulSoupHTMLParser constructor, once it's
306 invoked.
307 :param kwargs: Keyword arguments for the superclass constructor.
308 """
309 # Some keyword arguments will be pulled out of kwargs and placed
310 # into parser_kwargs.
311 extra_parser_kwargs = dict()
312 for arg in ('on_duplicate_attribute',):
313 if arg in kwargs:
314 value = kwargs.pop(arg)
315 extra_parser_kwargs[arg] = value
316 super(HTMLParserTreeBuilder, self).__init__(**kwargs)
317 parser_args = parser_args or []
318 parser_kwargs = parser_kwargs or {}
319 parser_kwargs.update(extra_parser_kwargs)
320 parser_kwargs['convert_charrefs'] = False
321 self.parser_args = (parser_args, parser_kwargs)
322
323 def prepare_markup(self, markup, user_specified_encoding=None,
324 document_declared_encoding=None, exclude_encodings=None):
325
326 """Run any preliminary steps necessary to make incoming markup
327 acceptable to the parser.
328
329 :param markup: Some markup -- probably a bytestring.
330 :param user_specified_encoding: The user asked to try this encoding.
331 :param document_declared_encoding: The markup itself claims to be
332 in this encoding.
333 :param exclude_encodings: The user asked _not_ to try any of
334 these encodings.
335
336 :yield: A series of 4-tuples:
337 (markup, encoding, declared encoding,
338 has undergone character replacement)
339
340 Each 4-tuple represents a strategy for converting the
341 document to Unicode and parsing it. Each strategy will be tried
342 in turn.
343 """
344 if isinstance(markup, str):
345 # Parse Unicode as-is.
346 yield (markup, None, None, False)
347 return
348
349 # Ask UnicodeDammit to sniff the most likely encoding.
350
351 # This was provided by the end-user; treat it as a known
352 # definite encoding per the algorithm laid out in the HTML5
353 # spec. (See the EncodingDetector class for details.)
354 known_definite_encodings = [user_specified_encoding]
355
356 # This was found in the document; treat it as a slightly lower-priority
357 # user encoding.
358 user_encodings = [document_declared_encoding]
359
360 try_encodings = [user_specified_encoding, document_declared_encoding]
361 dammit = UnicodeDammit(
362 markup,
363 known_definite_encodings=known_definite_encodings,
364 user_encodings=user_encodings,
365 is_html=True,
366 exclude_encodings=exclude_encodings
367 )
368 yield (dammit.markup, dammit.original_encoding,
369 dammit.declared_html_encoding,
370 dammit.contains_replacement_characters)
371
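The strategy tuples yielded by `prepare_markup()` can be illustrated without UnicodeDammit. The generator below is a simplified, hypothetical stand-in: Unicode input is passed through untouched, and byte input is tried against a short list of candidate encodings in priority order:

```python
def prepare_markup_sketch(markup, user_encoding=None, declared_encoding=None):
    """Yield (markup, encoding, declared_encoding, replaced) strategies
    in priority order, loosely mimicking prepare_markup() above."""
    if isinstance(markup, str):
        # Unicode is parsed as-is.
        yield (markup, None, None, False)
        return
    # Bytes: try the user's encoding, the declared one, then UTF-8.
    tried = []
    for encoding in (user_encoding, declared_encoding, "utf-8"):
        if not encoding or encoding in tried:
            continue
        tried.append(encoding)
        try:
            yield (markup.decode(encoding), encoding, declared_encoding, False)
        except UnicodeDecodeError:
            continue
```

The real implementation delegates the candidate ordering to UnicodeDammit/EncodingDetector, which also handles byte-order marks and chardet-style sniffing.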
372 def feed(self, markup):
373 """Run some incoming markup through some parsing process,
374 populating the `BeautifulSoup` object in self.soup.
375 """
376 args, kwargs = self.parser_args
377 parser = BeautifulSoupHTMLParser(*args, **kwargs)
378 parser.soup = self.soup
379 try:
380 parser.feed(markup)
381 parser.close()
382 except AssertionError as e:
383 # html.parser raises AssertionError in rare cases to
384 # indicate a fatal problem with the markup, especially
385 # when there's an error in the doctype declaration.
386 raise ParserRejectedMarkup(e)
387 parser.already_closed_empty_element = []
diff --git a/bitbake/lib/bs4/builder/_lxml.py b/bitbake/lib/bs4/builder/_lxml.py
deleted file mode 100644
index 4f7cf74681..0000000000
--- a/bitbake/lib/bs4/builder/_lxml.py
+++ /dev/null
@@ -1,388 +0,0 @@
1# Use of this source code is governed by the MIT license.
2__license__ = "MIT"
3
4__all__ = [
5 'LXMLTreeBuilderForXML',
6 'LXMLTreeBuilder',
7 ]
8
9try:
10 from collections.abc import Callable # Python 3.6
11except ImportError as e:
12 from collections import Callable
13
14from io import BytesIO
15from io import StringIO
16from lxml import etree
17from bs4.element import (
18 Comment,
19 Doctype,
20 NamespacedAttribute,
21 ProcessingInstruction,
22 XMLProcessingInstruction,
23)
24from bs4.builder import (
25 DetectsXMLParsedAsHTML,
26 FAST,
27 HTML,
28 HTMLTreeBuilder,
29 PERMISSIVE,
30 ParserRejectedMarkup,
31 TreeBuilder,
32 XML)
33from bs4.dammit import EncodingDetector
34
35LXML = 'lxml'
36
37def _invert(d):
38 "Invert a dictionary."
39 return dict((v,k) for k, v in list(d.items()))
40
41class LXMLTreeBuilderForXML(TreeBuilder):
42 DEFAULT_PARSER_CLASS = etree.XMLParser
43
44 is_xml = True
45 processing_instruction_class = XMLProcessingInstruction
46
47 NAME = "lxml-xml"
48 ALTERNATE_NAMES = ["xml"]
49
50 # Well, it's permissive by XML parser standards.
51 features = [NAME, LXML, XML, FAST, PERMISSIVE]
52
53 CHUNK_SIZE = 512
54
55 # This namespace mapping is specified in the XML Namespace
56 # standard.
57 DEFAULT_NSMAPS = dict(xml='http://www.w3.org/XML/1998/namespace')
58
59 DEFAULT_NSMAPS_INVERTED = _invert(DEFAULT_NSMAPS)
60
61 # NOTE: If we parsed Element objects and looked at .sourceline,
62 # we'd be able to see the line numbers from the original document.
63 # But instead we build an XMLParser or HTMLParser object to serve
64 # as the target of parse messages, and those messages don't include
65 # line numbers.
66 # See: https://bugs.launchpad.net/lxml/+bug/1846906
67
68 def initialize_soup(self, soup):
69 """Let the BeautifulSoup object know about the standard namespace
70 mapping.
71
72 :param soup: A `BeautifulSoup`.
73 """
74 super(LXMLTreeBuilderForXML, self).initialize_soup(soup)
75 self._register_namespaces(self.DEFAULT_NSMAPS)
76
77 def _register_namespaces(self, mapping):
78 """Let the BeautifulSoup object know about namespaces encountered
79 while parsing the document.
80
81 This might be useful later on when creating CSS selectors.
82
83 This will track (almost) all namespaces, even ones that were
84 only in scope for part of the document. If two namespaces have
85 the same prefix, only the first one encountered will be
86 tracked. Un-prefixed namespaces are not tracked.
87
88 :param mapping: A dictionary mapping namespace prefixes to URIs.
89 """
90 for key, value in list(mapping.items()):
91 # This is 'if key' and not 'if key is not None' because we
92 # don't track un-prefixed namespaces. soupsieve will
93 # treat an un-prefixed namespace as the default, which
94 # causes confusion in some cases.
95 if key and key not in self.soup._namespaces:
96 # Let the BeautifulSoup object know about a new namespace.
97 # If there are multiple namespaces defined with the same
98 # prefix, the first one in the document takes precedence.
99 self.soup._namespaces[key] = value
100
101 def default_parser(self, encoding):
102 """Find the default parser for the given encoding.
103
104 :param encoding: A string.
105 :return: Either a parser object or a class, which
106 will be instantiated with default arguments.
107 """
108 if self._default_parser is not None:
109 return self._default_parser
110 return etree.XMLParser(
111 target=self, strip_cdata=False, recover=True, encoding=encoding)
112
113 def parser_for(self, encoding):
114 """Instantiate an appropriate parser for the given encoding.
115
116 :param encoding: A string.
117 :return: A parser object such as an `etree.XMLParser`.
118 """
119 # Use the default parser.
120 parser = self.default_parser(encoding)
121
122 if isinstance(parser, Callable):
123 # Instantiate the parser with default arguments
124 parser = parser(
125 target=self, strip_cdata=False, recover=True, encoding=encoding
126 )
127 return parser
128
129 def __init__(self, parser=None, empty_element_tags=None, **kwargs):
130 # TODO: Issue a warning if parser is present but not a
131 # callable, since that means there's no way to create new
132 # parsers for different encodings.
133 self._default_parser = parser
134 if empty_element_tags is not None:
135 self.empty_element_tags = set(empty_element_tags)
136 self.soup = None
137 self.nsmaps = [self.DEFAULT_NSMAPS_INVERTED]
138 self.active_namespace_prefixes = [dict(self.DEFAULT_NSMAPS)]
139 super(LXMLTreeBuilderForXML, self).__init__(**kwargs)
140
141 def _getNsTag(self, tag):
142 # Split the namespace URL out of a fully-qualified lxml tag
143 # name. Copied from lxml's src/lxml/sax.py.
144 if tag[0] == '{':
145 return tuple(tag[1:].split('}', 1))
146 else:
147 return (None, tag)
148
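The Clark-notation splitting done by `_getNsTag()` above can be exercised directly; the standalone `get_ns_tag` below is a hypothetical copy for illustration:

```python
def get_ns_tag(tag):
    """Split lxml's Clark notation '{uri}local' into (uri, local)."""
    if tag[0] == '{':
        return tuple(tag[1:].split('}', 1))
    return (None, tag)

# lxml reports namespaced names like '{http://www.w3.org/2000/svg}rect'
uri, local = get_ns_tag('{http://www.w3.org/2000/svg}rect')
```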
149 def prepare_markup(self, markup, user_specified_encoding=None,
150 exclude_encodings=None,
151 document_declared_encoding=None):
152 """Run any preliminary steps necessary to make incoming markup
153 acceptable to the parser.
154
155 lxml really wants to get a bytestring and convert it to
156 Unicode itself. So instead of using UnicodeDammit to convert
157 the bytestring to Unicode using different encodings, this
158 implementation uses EncodingDetector to iterate over the
159 encodings, and tell lxml to try to parse the document as each
160 one in turn.
161
162 :param markup: Some markup -- hopefully a bytestring.
163 :param user_specified_encoding: The user asked to try this encoding.
164 :param document_declared_encoding: The markup itself claims to be
165 in this encoding.
166 :param exclude_encodings: The user asked _not_ to try any of
167 these encodings.
168
169 :yield: A series of 4-tuples:
170 (markup, encoding, declared encoding,
171 has undergone character replacement)
172
173 Each 4-tuple represents a strategy for converting the
174 document to Unicode and parsing it. Each strategy will be tried
175 in turn.
176 """
177 is_html = not self.is_xml
178 if is_html:
179 self.processing_instruction_class = ProcessingInstruction
180 # We're in HTML mode, so if we're given XML, that's worth
181 # noting.
182 DetectsXMLParsedAsHTML.warn_if_markup_looks_like_xml(
183 markup, stacklevel=3
184 )
185 else:
186 self.processing_instruction_class = XMLProcessingInstruction
187
188 if isinstance(markup, str):
189 # We were given Unicode. Maybe lxml can parse Unicode on
190 # this system?
191
192 # TODO: This is a workaround for
193 # https://bugs.launchpad.net/lxml/+bug/1948551.
194 # We can remove it once the upstream issue is fixed.
195 if len(markup) > 0 and markup[0] == u'\N{BYTE ORDER MARK}':
196 markup = markup[1:]
197 yield markup, None, document_declared_encoding, False
198
199 if isinstance(markup, str):
200 # No, apparently not. Convert the Unicode to UTF-8 and
201 # tell lxml to parse it as UTF-8.
202 yield (markup.encode("utf8"), "utf8",
203 document_declared_encoding, False)
204
205 # This was provided by the end-user; treat it as a known
206 # definite encoding per the algorithm laid out in the HTML5
207 # spec. (See the EncodingDetector class for details.)
208 known_definite_encodings = [user_specified_encoding]
209
210 # This was found in the document; treat it as a slightly lower-priority
211 # user encoding.
212 user_encodings = [document_declared_encoding]
213 detector = EncodingDetector(
214 markup, known_definite_encodings=known_definite_encodings,
215 user_encodings=user_encodings, is_html=is_html,
216 exclude_encodings=exclude_encodings
217 )
218 for encoding in detector.encodings:
219 yield (detector.markup, encoding, document_declared_encoding, False)
220
221 def feed(self, markup):
222 if isinstance(markup, bytes):
223 markup = BytesIO(markup)
224 elif isinstance(markup, str):
225 markup = StringIO(markup)
226
227 # Call feed() at least once, even if the markup is empty,
228 # or the parser won't be initialized.
229 data = markup.read(self.CHUNK_SIZE)
230 try:
231 self.parser = self.parser_for(self.soup.original_encoding)
232 self.parser.feed(data)
233 while len(data) != 0:
234 # Now call feed() on the rest of the data, chunk by chunk.
235 data = markup.read(self.CHUNK_SIZE)
236 if len(data) != 0:
237 self.parser.feed(data)
238 self.parser.close()
239 except (UnicodeDecodeError, LookupError, etree.ParserError) as e:
240 raise ParserRejectedMarkup(e)
241
242 def close(self):
243 self.nsmaps = [self.DEFAULT_NSMAPS_INVERTED]
244
245 def start(self, name, attrs, nsmap={}):
246 # Make sure attrs is a mutable dict--lxml may send an immutable dictproxy.
247 attrs = dict(attrs)
248 nsprefix = None
249 # Invert each namespace map as it comes in.
250 if len(nsmap) == 0 and len(self.nsmaps) > 1:
251 # There are no new namespaces for this tag, but
252 # non-default namespaces are in play, so we need a
253 # separate tag stack to know when they end.
254 self.nsmaps.append(None)
255 elif len(nsmap) > 0:
256 # A new namespace mapping has come into play.
257
258 # First, let the BeautifulSoup object know about it.
259 self._register_namespaces(nsmap)
260
261 # Then, add it to our running list of inverted namespace
262 # mappings.
263 self.nsmaps.append(_invert(nsmap))
264
265 # The currently active namespace prefixes have
266 # changed. Calculate the new mapping so it can be stored
267 # with all Tag objects created while these prefixes are in
268 # scope.
269 current_mapping = dict(self.active_namespace_prefixes[-1])
270 current_mapping.update(nsmap)
271
272 # We should not track un-prefixed namespaces as we can only hold one
273 # and it will be recognized as the default namespace by soupsieve,
274 # which may be confusing in some situations.
275 if '' in current_mapping:
276 del current_mapping['']
277 self.active_namespace_prefixes.append(current_mapping)
278
279 # Also treat the namespace mapping as a set of attributes on the
280 # tag, so we can recreate it later.
281 attrs = attrs.copy()
282 for prefix, namespace in list(nsmap.items()):
283 attribute = NamespacedAttribute(
284 "xmlns", prefix, "http://www.w3.org/2000/xmlns/")
285 attrs[attribute] = namespace
286
287 # Namespaces are in play. Find any attributes that came in
288 # from lxml with namespaces attached to their names, and
289 # turn them into NamespacedAttribute objects.
290 new_attrs = {}
291 for attr, value in list(attrs.items()):
292 namespace, attr = self._getNsTag(attr)
293 if namespace is None:
294 new_attrs[attr] = value
295 else:
296 nsprefix = self._prefix_for_namespace(namespace)
297 attr = NamespacedAttribute(nsprefix, attr, namespace)
298 new_attrs[attr] = value
299 attrs = new_attrs
300
301 namespace, name = self._getNsTag(name)
302 nsprefix = self._prefix_for_namespace(namespace)
303 self.soup.handle_starttag(
304 name, namespace, nsprefix, attrs,
305 namespaces=self.active_namespace_prefixes[-1]
306 )
307
308 def _prefix_for_namespace(self, namespace):
309 """Find the currently active prefix for the given namespace."""
310 if namespace is None:
311 return None
312 for inverted_nsmap in reversed(self.nsmaps):
313 if inverted_nsmap is not None and namespace in inverted_nsmap:
314 return inverted_nsmap[namespace]
315 return None
316
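The namespace-scope bookkeeping in `start()`, `end()`, and `_prefix_for_namespace()` amounts to searching a stack of inverted prefix maps from the innermost scope outward, with `None` entries marking tags that introduced no new namespaces. A minimal sketch (the names here are hypothetical):

```python
def _invert(d):
    """Invert a prefix->URI mapping into URI->prefix."""
    return {v: k for k, v in d.items()}

# Innermost scope last; None marks a tag with no new namespaces.
nsmaps = [_invert({"xml": "http://www.w3.org/XML/1998/namespace"})]
nsmaps.append(_invert({"svg": "http://www.w3.org/2000/svg"}))
nsmaps.append(None)

def prefix_for_namespace(nsmaps, namespace):
    """Return the innermost active prefix for a namespace URI, or None."""
    if namespace is None:
        return None
    for inverted in reversed(nsmaps):
        if inverted is not None and namespace in inverted:
            return inverted[namespace]
    return None
```

When a tag closes, its entry is popped off the stack, so prefixes automatically go out of scope in document order.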
317 def end(self, name):
318 self.soup.endData()
319 completed_tag = self.soup.tagStack[-1]
320 namespace, name = self._getNsTag(name)
321 nsprefix = None
322 if namespace is not None:
323 for inverted_nsmap in reversed(self.nsmaps):
324 if inverted_nsmap is not None and namespace in inverted_nsmap:
325 nsprefix = inverted_nsmap[namespace]
326 break
327 self.soup.handle_endtag(name, nsprefix)
328 if len(self.nsmaps) > 1:
329 # This tag, or one of its parents, introduced a namespace
330 # mapping, so pop it off the stack.
331 out_of_scope_nsmap = self.nsmaps.pop()
332
333 if out_of_scope_nsmap is not None:
334 # This tag introduced a namespace mapping which is no
335 # longer in scope. Recalculate the currently active
336 # namespace prefixes.
337 self.active_namespace_prefixes.pop()
338
339 def pi(self, target, data):
340 self.soup.endData()
341 data = target + ' ' + data
342 self.soup.handle_data(data)
343 self.soup.endData(self.processing_instruction_class)
344
345 def data(self, content):
346 self.soup.handle_data(content)
347
348 def doctype(self, name, pubid, system):
349 self.soup.endData()
350 doctype = Doctype.for_name_and_ids(name, pubid, system)
351 self.soup.object_was_parsed(doctype)
352
353 def comment(self, content):
354 "Handle comments as Comment objects."
355 self.soup.endData()
356 self.soup.handle_data(content)
357 self.soup.endData(Comment)
358
359 def test_fragment_to_document(self, fragment):
360 """See `TreeBuilder`."""
361 return '<?xml version="1.0" encoding="utf-8"?>\n%s' % fragment
362
363
364class LXMLTreeBuilder(HTMLTreeBuilder, LXMLTreeBuilderForXML):
365
366 NAME = LXML
367 ALTERNATE_NAMES = ["lxml-html"]
368
369 features = ALTERNATE_NAMES + [NAME, HTML, FAST, PERMISSIVE]
370 is_xml = False
371 processing_instruction_class = ProcessingInstruction
372
373 def default_parser(self, encoding):
374 return etree.HTMLParser
375
376 def feed(self, markup):
377 encoding = self.soup.original_encoding
378 try:
379 self.parser = self.parser_for(encoding)
380 self.parser.feed(markup)
381 self.parser.close()
382 except (UnicodeDecodeError, LookupError, etree.ParserError) as e:
383 raise ParserRejectedMarkup(e)
384
385
386 def test_fragment_to_document(self, fragment):
387 """See `TreeBuilder`."""
388 return '<html><body>%s</body></html>' % fragment
diff --git a/bitbake/lib/bs4/css.py b/bitbake/lib/bs4/css.py
deleted file mode 100644
index cd1fd2df88..0000000000
--- a/bitbake/lib/bs4/css.py
+++ /dev/null
@@ -1,274 +0,0 @@
1"""Integration code for CSS selectors using Soup Sieve (pypi: soupsieve)."""
2
3# We don't use soupsieve
4soupsieve = None
5
6
7class CSS(object):
8 """A proxy object against the soupsieve library, to simplify its
9 CSS selector API.
10
11 Acquire this object through the .css attribute on the
12 BeautifulSoup object, or on the Tag you want to use as the
13 starting point for a CSS selector.
14
15 The main advantage of doing this is that the tag to be selected
16 against doesn't need to be explicitly specified in the function
17 calls, since it's already scoped to a tag.
18 """
19
20 def __init__(self, tag, api=soupsieve):
21 """Constructor.
22
23 You don't need to instantiate this class yourself; instead,
24 access the .css attribute on the BeautifulSoup object, or on
25 the Tag you want to use as the starting point for your CSS
26 selector.
27
28 :param tag: All CSS selectors will use this as their starting
29 point.
30
31 :param api: A plug-in replacement for the soupsieve module,
32 designed mainly for use in tests.
33 """
34 if api is None:
35 raise NotImplementedError(
36 "Cannot execute CSS selectors because the soupsieve package is not installed."
37 )
38 self.api = api
39 self.tag = tag
40
41 def escape(self, ident):
42 """Escape a CSS identifier.
43
44 This is a simple wrapper around soupsieve.escape(). See the
45 documentation for that function for more information.
46 """
47 if soupsieve is None:
48 raise NotImplementedError(
49 "Cannot escape CSS identifiers because the soupsieve package is not installed."
50 )
51 return self.api.escape(ident)
52
53 def _ns(self, ns, select):
54 """Normalize a dictionary of namespaces."""
55 if not isinstance(select, self.api.SoupSieve) and ns is None:
56 # If the selector is a precompiled pattern, it already has
57 # a namespace context compiled in, which cannot be
58 # replaced.
59 ns = self.tag._namespaces
60 return ns
61
62 def _rs(self, results):
63 """Normalize a list of results to a Resultset.
64
65 A ResultSet is more consistent with the rest of Beautiful
66 Soup's API, and ResultSet.__getattr__ has a helpful error
67 message if you try to treat a list of results as a single
68 result (a common mistake).
69 """
70 # Import here to avoid circular import
71 from bs4.element import ResultSet
72 return ResultSet(None, results)
73
74 def compile(self, select, namespaces=None, flags=0, **kwargs):
75 """Pre-compile a selector and return the compiled object.
76
77 :param select: A CSS selector.
78
79 :param namespaces: A dictionary mapping namespace prefixes
80 used in the CSS selector to namespace URIs. By default,
81 Beautiful Soup will use the prefixes it encountered while
82 parsing the document.
83
84 :param flags: Flags to be passed into Soup Sieve's
85 soupsieve.compile() method.
86
87 :param kwargs: Keyword arguments to be passed into SoupSieve's
88 soupsieve.compile() method.
89
90 :return: A precompiled selector object.
91 :rtype: soupsieve.SoupSieve
92 """
93 return self.api.compile(
94 select, self._ns(namespaces, select), flags, **kwargs
95 )
96
97 def select_one(self, select, namespaces=None, flags=0, **kwargs):
98 """Perform a CSS selection operation on the current Tag and return the
99 first result.
100
101 This uses the Soup Sieve library. For more information, see
102 that library's documentation for the soupsieve.select_one()
103 method.
104
105 :param select: A CSS selector.
106
107 :param namespaces: A dictionary mapping namespace prefixes
108 used in the CSS selector to namespace URIs. By default,
109 Beautiful Soup will use the prefixes it encountered while
110 parsing the document.
111
112 :param flags: Flags to be passed into Soup Sieve's
113 soupsieve.select_one() method.
114
115 :param kwargs: Keyword arguments to be passed into SoupSieve's
116 soupsieve.select_one() method.
117
118 :return: A Tag, or None if the selector has no match.
119 :rtype: bs4.element.Tag
120
121 """
122 return self.api.select_one(
123 select, self.tag, self._ns(namespaces, select), flags, **kwargs
124 )
125
126 def select(self, select, namespaces=None, limit=0, flags=0, **kwargs):
127 """Perform a CSS selection operation on the current Tag.
128
129 This uses the Soup Sieve library. For more information, see
130 that library's documentation for the soupsieve.select()
131 method.
132
133 :param select: A string containing a CSS selector.
134
135 :param namespaces: A dictionary mapping namespace prefixes
136 used in the CSS selector to namespace URIs. By default,
137 Beautiful Soup will pass in the prefixes it encountered while
138 parsing the document.
139
140 :param limit: After finding this number of results, stop looking.
141
142 :param flags: Flags to be passed into Soup Sieve's
143 soupsieve.select() method.
144
145 :param kwargs: Keyword arguments to be passed into SoupSieve's
146 soupsieve.select() method.
147
148 :return: A ResultSet of Tag objects.
149 :rtype: bs4.element.ResultSet
150
151 """
152 if limit is None:
153 limit = 0
154
155 return self._rs(
156 self.api.select(
157 select, self.tag, self._ns(namespaces, select), limit, flags,
158 **kwargs
159 )
160 )
161
162 def iselect(self, select, namespaces=None, limit=0, flags=0, **kwargs):
163 """Perform a CSS selection operation on the current Tag.
164
165 This uses the Soup Sieve library. For more information, see
166 that library's documentation for the soupsieve.iselect()
167 method. It is the same as select(), but it returns a generator
168 instead of a list.
169
170 :param select: A string containing a CSS selector.
171
172 :param namespaces: A dictionary mapping namespace prefixes
173 used in the CSS selector to namespace URIs. By default,
174 Beautiful Soup will pass in the prefixes it encountered while
175 parsing the document.
176
177 :param limit: After finding this number of results, stop looking.
178
179 :param flags: Flags to be passed into Soup Sieve's
180 soupsieve.iselect() method.
181
182 :param kwargs: Keyword arguments to be passed into SoupSieve's
183 soupsieve.iselect() method.
184
185 :return: A generator
186 :rtype: types.GeneratorType
187 """
188 return self.api.iselect(
189 select, self.tag, self._ns(namespaces, select), limit, flags, **kwargs
190 )
191
192 def closest(self, select, namespaces=None, flags=0, **kwargs):
193 """Find the Tag closest to this one that matches the given selector.
194
195 This uses the Soup Sieve library. For more information, see
196 that library's documentation for the soupsieve.closest()
197 method.
198
199 :param select: A string containing a CSS selector.
200
201 :param namespaces: A dictionary mapping namespace prefixes
202 used in the CSS selector to namespace URIs. By default,
203 Beautiful Soup will pass in the prefixes it encountered while
204 parsing the document.
205
206 :param flags: Flags to be passed into Soup Sieve's
207 soupsieve.closest() method.
208
209 :param kwargs: Keyword arguments to be passed into SoupSieve's
210 soupsieve.closest() method.
211
212 :return: A Tag, or None if there is no match.
213 :rtype: bs4.Tag
214
215 """
216 return self.api.closest(
217 select, self.tag, self._ns(namespaces, select), flags, **kwargs
218 )
219
220 def match(self, select, namespaces=None, flags=0, **kwargs):
221 """Check whether this Tag matches the given CSS selector.
222
223 This uses the Soup Sieve library. For more information, see
224 that library's documentation for the soupsieve.match()
225 method.
226
227 :param select: A CSS selector.
228
229 :param namespaces: A dictionary mapping namespace prefixes
230 used in the CSS selector to namespace URIs. By default,
231 Beautiful Soup will pass in the prefixes it encountered while
232 parsing the document.
233
234 :param flags: Flags to be passed into Soup Sieve's
235 soupsieve.match() method.
236
237 :param kwargs: Keyword arguments to be passed into SoupSieve's
238 soupsieve.match() method.
239
240 :return: True if this Tag matches the selector; False otherwise.
241 :rtype: bool
242 """
243 return self.api.match(
244 select, self.tag, self._ns(namespaces, select), flags, **kwargs
245 )
246
247 def filter(self, select, namespaces=None, flags=0, **kwargs):
248 """Filter this Tag's direct children based on the given CSS selector.
249
250 This uses the Soup Sieve library. It works the same way as
251 passing this Tag into that library's soupsieve.filter()
252 method. For more information, see the
253 documentation for soupsieve.filter().
254
255 :param namespaces: A dictionary mapping namespace prefixes
256 used in the CSS selector to namespace URIs. By default,
257 Beautiful Soup will pass in the prefixes it encountered while
258 parsing the document.
259
260 :param flags: Flags to be passed into Soup Sieve's
261 soupsieve.filter() method.
262
263 :param kwargs: Keyword arguments to be passed into SoupSieve's
264 soupsieve.filter() method.
265
266 :return: A ResultSet of Tag objects.
267 :rtype: bs4.element.ResultSet
268
269 """
270 return self._rs(
271 self.api.filter(
272 select, self.tag, self._ns(namespaces, select), flags, **kwargs
273 )
274 )
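The `CSS` constructor above illustrates a dependency-injection proxy: the optional library is an injectable argument, and its absence produces one clear error at acquisition time instead of scattered ImportErrors later. A minimal sketch of the same pattern (the `CSSProxy` name is hypothetical):

```python
class CSSProxy:
    """Minimal sketch of the proxy pattern above: the heavy dependency
    is injected, and a clear error is raised when it is absent."""
    def __init__(self, tag, api=None):
        if api is None:
            raise NotImplementedError(
                "Cannot execute CSS selectors because the soupsieve "
                "package is not installed.")
        self.api = api
        self.tag = tag
```

Injecting `api` also makes the proxy easy to test: a stub object can stand in for soupsieve without installing it.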
diff --git a/bitbake/lib/bs4/dammit.py b/bitbake/lib/bs4/dammit.py
deleted file mode 100644
index 692433c57a..0000000000
--- a/bitbake/lib/bs4/dammit.py
+++ /dev/null
@@ -1,1095 +0,0 @@
1# -*- coding: utf-8 -*-
2"""Beautiful Soup bonus library: Unicode, Dammit
3
4This library converts a bytestream to Unicode through any means
5necessary. It is heavily based on code from Mark Pilgrim's Universal
6Feed Parser. It works best on XML and HTML, but it does not rewrite the
7XML or HTML to reflect a new encoding; that's the tree builder's job.
8"""
9# Use of this source code is governed by the MIT license.
10__license__ = "MIT"
11
12from html.entities import codepoint2name
13from collections import defaultdict
14import codecs
15import re
16import logging
17import string
18
19# Import a library to autodetect character encodings. We'll support
20# any of a number of libraries that all support the same API:
21#
22# * cchardet
23# * chardet
24# * charset-normalizer
25chardet_module = None
26try:
27 # PyPI package: cchardet
28 import cchardet as chardet_module
29except ImportError:
30 try:
31 # Debian package: python-chardet
32 # PyPI package: chardet
33 import chardet as chardet_module
34 except ImportError:
35 try:
36 # PyPI package: charset-normalizer
37 import charset_normalizer as chardet_module
38 except ImportError:
39 # No chardet available.
40 chardet_module = None
41
42if chardet_module:
43 def chardet_dammit(s):
44 if isinstance(s, str):
45 return None
46 return chardet_module.detect(s)['encoding']
47else:
48 def chardet_dammit(s):
49 return None
50
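The cascading try/except imports above can be expressed more generally. A small hypothetical helper using `importlib` that returns the first importable module from a candidate list:

```python
import importlib

def first_available(*names):
    """Return the first importable module from names, or None."""
    for name in names:
        try:
            return importlib.import_module(name)
        except ImportError:
            continue
    return None

# Same preference order as the chardet fallback chain above.
detector = first_available("cchardet", "chardet", "charset_normalizer")
# detector may be None if no encoding-detection library is installed.
```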
51# Build bytestring and Unicode versions of regular expressions for finding
52# a declared encoding inside an XML or HTML document.
53xml_encoding = '^\\s*<\\?.*encoding=[\'"](.*?)[\'"].*\\?>'
54html_meta = '<\\s*meta[^>]+charset\\s*=\\s*["\']?([^>]*?)[ /;\'">]'
55encoding_res = dict()
56encoding_res[bytes] = {
57 'html' : re.compile(html_meta.encode("ascii"), re.I),
58 'xml' : re.compile(xml_encoding.encode("ascii"), re.I),
59}
60encoding_res[str] = {
61 'html' : re.compile(html_meta, re.I),
62 'xml' : re.compile(xml_encoding, re.I)
63}
64
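The two regular expressions above can be used directly to sniff a declared encoding out of raw bytes; a short usage sketch (patterns copied verbatim from the code above):

```python
import re

xml_encoding = '^\\s*<\\?.*encoding=[\'"](.*?)[\'"].*\\?>'
html_meta = '<\\s*meta[^>]+charset\\s*=\\s*["\']?([^>]*?)[ /;\'">]'

html_re = re.compile(html_meta.encode("ascii"), re.I)
xml_re = re.compile(xml_encoding.encode("ascii"), re.I)

m = html_re.search(b'<html><head><meta charset="utf-8"></head>')
charset = m.group(1).decode("ascii") if m else None   # 'utf-8'

m2 = xml_re.search(b"<?xml version='1.0' encoding='ISO-8859-1'?>")
decl = m2.group(1).decode("ascii") if m2 else None    # 'ISO-8859-1'
```

The bytestring variants matter because this sniffing runs before the document has been decoded to Unicode.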
65from html.entities import html5
66
67class EntitySubstitution(object):
68 """The ability to substitute XML or HTML entities for certain characters."""
69
70 def _populate_class_variables():
71 """Initialize variables used by this class to manage the plethora of
72 HTML5 named entities.
73
74 This function returns a 3-tuple containing two dictionaries
75 and a regular expression:
76
77 unicode_to_name - A mapping of Unicode strings like "⦨" to
78 entity names like "angmsdaa". When a single Unicode string has
79 multiple entity names, we try to choose the most commonly-used
80 name.
81
82 name_to_unicode: A mapping of entity names like "angmsdaa" to
83 Unicode strings like "⦨".
84
85 named_entity_re: A regular expression matching (almost) any
86 Unicode string that corresponds to an HTML5 named entity.
87 """
88 unicode_to_name = {}
89 name_to_unicode = {}
90
91 short_entities = set()
92 long_entities_by_first_character = defaultdict(set)
93
94 for name_with_semicolon, character in sorted(html5.items()):
95 # "It is intentional, for legacy compatibility, that many
96 # code points have multiple character reference names. For
97 # example, some appear both with and without the trailing
98 # semicolon, or with different capitalizations."
99 # - https://html.spec.whatwg.org/multipage/named-characters.html#named-character-references
100 #
101 # The parsers are in charge of handling (or not) character
102 # references with no trailing semicolon, so we remove the
103 # semicolon whenever it appears.
104 if name_with_semicolon.endswith(';'):
105 name = name_with_semicolon[:-1]
106 else:
107 name = name_with_semicolon
108
109 # When parsing HTML, we want to recognize any known named
110 # entity and convert it to a sequence of Unicode
111 # characters.
112 if name not in name_to_unicode:
113 name_to_unicode[name] = character
114
115 # When _generating_ HTML, we want to recognize special
116 # character sequences that _could_ be converted to named
117 # entities.
118 unicode_to_name[character] = name
119
120 # We also need to build a regular expression that lets us
121 # _find_ those characters in output strings so we can
122 # replace them.
123 #
124 # This is tricky, for two reasons.
125
126 if (len(character) == 1 and ord(character) < 128
127 and character not in '<>&'):
128 # First, it would be annoying to turn single ASCII
129 # characters like | into named entities like
130 # &verbar;. The exceptions are <>&, which we _must_
131 # turn into named entities to produce valid HTML.
132 continue
133
134 if len(character) > 1 and all(ord(x) < 128 for x in character):
135 # We also do not want to turn _combinations_ of ASCII
136 # characters like 'fj' into named entities like '&fjlig;',
137                # though that's more debatable.
138 continue
139
140 # Second, some named entities have a Unicode value that's
141 # a subset of the Unicode value for some _other_ named
142            # entity. As an example, '\u2267' is &GreaterFullEqual;,
143 # but '\u2267\u0338' is &NotGreaterFullEqual;. Our regular
144 # expression needs to match the first two characters of
145 # "\u2267\u0338foo", but only the first character of
146 # "\u2267foo".
147 #
148 # In this step, we build two sets of characters that
149 # _eventually_ need to go into the regular expression. But
150 # we won't know exactly what the regular expression needs
151 # to look like until we've gone through the entire list of
152 # named entities.
153 if len(character) == 1:
154 short_entities.add(character)
155 else:
156 long_entities_by_first_character[character[0]].add(character)
157
158 # Now that we've been through the entire list of entities, we
159 # can create a regular expression that matches any of them.
160 particles = set()
161 for short in short_entities:
162 long_versions = long_entities_by_first_character[short]
163 if not long_versions:
164 particles.add(short)
165 else:
166 ignore = "".join([x[1] for x in long_versions])
167 # This finds, e.g. \u2267 but only if it is _not_
168 # followed by \u0338.
169 particles.add("%s(?![%s])" % (short, ignore))
170
171 for long_entities in list(long_entities_by_first_character.values()):
172 for long_entity in long_entities:
173 particles.add(long_entity)
174
175 re_definition = "(%s)" % "|".join(particles)
176
177 # If an entity shows up in both html5 and codepoint2name, it's
178 # likely that HTML5 gives it several different names, such as
179 # 'rsquo' and 'rsquor'. When converting Unicode characters to
180 # named entities, the codepoint2name name should take
181 # precedence where possible, since that's the more easily
182 # recognizable one.
183 for codepoint, name in list(codepoint2name.items()):
184 character = chr(codepoint)
185 unicode_to_name[character] = name
186
187 return unicode_to_name, name_to_unicode, re.compile(re_definition)
188 (CHARACTER_TO_HTML_ENTITY, HTML_ENTITY_TO_CHARACTER,
189 CHARACTER_TO_HTML_ENTITY_RE) = _populate_class_variables()
190
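The lookahead trick described in the comments above can be shown in isolation. This is a minimal sketch, not the generated class regex: it hand-writes the two particles for '\u2267' (&GreaterFullEqual;) and '\u2267\u0338' (&NotGreaterFullEqual;):

```python
import re

# The long sequence matches as a unit; the short one carries a negative
# lookahead so it only matches when NOT followed by the combining
# character. This makes the alternation order-independent.
pattern = re.compile('(\u2267\u0338|\u2267(?!\u0338))')
```

Because of the lookahead, it does not matter where in the `"|".join(particles)` alternation each particle lands, which is why the code can build `particles` as an unordered set.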
191 CHARACTER_TO_XML_ENTITY = {
192 "'": "apos",
193 '"': "quot",
194 "&": "amp",
195 "<": "lt",
196 ">": "gt",
197 }
198
199 BARE_AMPERSAND_OR_BRACKET = re.compile("([<>]|"
200 "&(?!#\\d+;|#x[0-9a-fA-F]+;|\\w+;)"
201 ")")
202
203 AMPERSAND_OR_BRACKET = re.compile("([<>&])")
204
205 @classmethod
206 def _substitute_html_entity(cls, matchobj):
207 """Used with a regular expression to substitute the
208 appropriate HTML entity for a special character string."""
209 entity = cls.CHARACTER_TO_HTML_ENTITY.get(matchobj.group(0))
210 return "&%s;" % entity
211
212 @classmethod
213 def _substitute_xml_entity(cls, matchobj):
214 """Used with a regular expression to substitute the
215 appropriate XML entity for a special character string."""
216 entity = cls.CHARACTER_TO_XML_ENTITY[matchobj.group(0)]
217 return "&%s;" % entity
218
219 @classmethod
220    def quoted_attribute_value(cls, value):
221 """Make a value into a quoted XML attribute, possibly escaping it.
222
223 Most strings will be quoted using double quotes.
224
225 Bob's Bar -> "Bob's Bar"
226
227 If a string contains double quotes, it will be quoted using
228 single quotes.
229
230 Welcome to "my bar" -> 'Welcome to "my bar"'
231
232 If a string contains both single and double quotes, the
233 double quotes will be escaped, and the string will be quoted
234 using double quotes.
235
236        Welcome to "Bob's Bar" -> "Welcome to &quot;Bob's Bar&quot;"
237 """
238 quote_with = '"'
239 if '"' in value:
240 if "'" in value:
241 # The string contains both single and double
242 # quotes. Turn the double quotes into
243 # entities. We quote the double quotes rather than
244 # the single quotes because the entity name is
245 # "&quot;" whether this is HTML or XML. If we
246 # quoted the single quotes, we'd have to decide
247 # between &apos; and &squot;.
248 replace_with = "&quot;"
249 value = value.replace('"', replace_with)
250 else:
251 # There are double quotes but no single quotes.
252 # We can use single quotes to quote the attribute.
253 quote_with = "'"
254 return quote_with + value + quote_with
255
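The quoting rules above can be sketched as a standalone function covering the three docstring cases:

```python
def quoted_attribute_value(value):
    # Prefer double quotes; fall back to single quotes when the value
    # contains double quotes; escape the double quotes as &quot; when
    # both kinds of quote appear.
    quote_with = '"'
    if '"' in value:
        if "'" in value:
            value = value.replace('"', "&quot;")
        else:
            quote_with = "'"
    return quote_with + value + quote_with
```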
256 @classmethod
257 def substitute_xml(cls, value, make_quoted_attribute=False):
258 """Substitute XML entities for special XML characters.
259
260 :param value: A string to be substituted. The less-than sign
261 will become &lt;, the greater-than sign will become &gt;,
262 and any ampersands will become &amp;. If you want ampersands
263 that appear to be part of an entity definition to be left
264 alone, use substitute_xml_containing_entities() instead.
265
266 :param make_quoted_attribute: If True, then the string will be
267 quoted, as befits an attribute value.
268 """
269 # Escape angle brackets and ampersands.
270 value = cls.AMPERSAND_OR_BRACKET.sub(
271 cls._substitute_xml_entity, value)
272
273 if make_quoted_attribute:
274 value = cls.quoted_attribute_value(value)
275 return value
276
277 @classmethod
278 def substitute_xml_containing_entities(
279 cls, value, make_quoted_attribute=False):
280 """Substitute XML entities for special XML characters.
281
282 :param value: A string to be substituted. The less-than sign will
283 become &lt;, the greater-than sign will become &gt;, and any
284          ampersands that are not part of an entity definition will
285 become &amp;.
286
287 :param make_quoted_attribute: If True, then the string will be
288 quoted, as befits an attribute value.
289 """
290 # Escape angle brackets, and ampersands that aren't part of
291 # entities.
292 value = cls.BARE_AMPERSAND_OR_BRACKET.sub(
293 cls._substitute_xml_entity, value)
294
295 if make_quoted_attribute:
296 value = cls.quoted_attribute_value(value)
297 return value
298
299 @classmethod
300 def substitute_html(cls, s):
301 """Replace certain Unicode characters with named HTML entities.
302
303 This differs from data.encode(encoding, 'xmlcharrefreplace')
304 in that the goal is to make the result more readable (to those
305 with ASCII displays) rather than to recover from
306 errors. There's absolutely nothing wrong with a UTF-8 string
307        containing a LATIN SMALL LETTER E WITH ACUTE, but replacing that
308 character with "&eacute;" will make it more readable to some
309 people.
310
311 :param s: A Unicode string.
312 """
313 return cls.CHARACTER_TO_HTML_ENTITY_RE.sub(
314 cls._substitute_html_entity, s)
315
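The regex-plus-callback substitution used by `substitute_xml` above can be reproduced in a few lines (a standalone sketch using the same table and pattern):

```python
import re

CHARACTER_TO_XML_ENTITY = {"'": "apos", '"': "quot", "&": "amp",
                           "<": "lt", ">": "gt"}
AMPERSAND_OR_BRACKET = re.compile("([<>&])")

def substitute_xml(value):
    # Replace angle brackets and every ampersand with XML entity
    # references, via a substitution callback on each match.
    return AMPERSAND_OR_BRACKET.sub(
        lambda m: "&%s;" % CHARACTER_TO_XML_ENTITY[m.group(0)], value)
```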
316
317class EncodingDetector:
318 """Suggests a number of possible encodings for a bytestring.
319
320 Order of precedence:
321
322 1. Encodings you specifically tell EncodingDetector to try first
323 (the known_definite_encodings argument to the constructor).
324
325 2. An encoding determined by sniffing the document's byte-order mark.
326
327 3. Encodings you specifically tell EncodingDetector to try if
328 byte-order mark sniffing fails (the user_encodings argument to the
329 constructor).
330
331 4. An encoding declared within the bytestring itself, either in an
332 XML declaration (if the bytestring is to be interpreted as an XML
333 document), or in a <meta> tag (if the bytestring is to be
334 interpreted as an HTML document.)
335
336 5. An encoding detected through textual analysis by chardet,
337 cchardet, or a similar external library.
338
339    6. UTF-8.
340
341    7. Windows-1252.
342
343 """
344 def __init__(self, markup, known_definite_encodings=None,
345 is_html=False, exclude_encodings=None,
346 user_encodings=None, override_encodings=None):
347 """Constructor.
348
349 :param markup: Some markup in an unknown encoding.
350
351 :param known_definite_encodings: When determining the encoding
352 of `markup`, these encodings will be tried first, in
353 order. In HTML terms, this corresponds to the "known
354 definite encoding" step defined here:
355 https://html.spec.whatwg.org/multipage/parsing.html#parsing-with-a-known-character-encoding
356
357 :param user_encodings: These encodings will be tried after the
358 `known_definite_encodings` have been tried and failed, and
359 after an attempt to sniff the encoding by looking at a
360 byte order mark has failed. In HTML terms, this
361 corresponds to the step "user has explicitly instructed
362 the user agent to override the document's character
363 encoding", defined here:
364 https://html.spec.whatwg.org/multipage/parsing.html#determining-the-character-encoding
365
366 :param override_encodings: A deprecated alias for
367 known_definite_encodings. Any encodings here will be tried
368 immediately after the encodings in
369 known_definite_encodings.
370
371 :param is_html: If True, this markup is considered to be
372 HTML. Otherwise it's assumed to be XML.
373
374 :param exclude_encodings: These encodings will not be tried,
375 even if they otherwise would be.
376
377 """
378 self.known_definite_encodings = list(known_definite_encodings or [])
379 if override_encodings:
380 self.known_definite_encodings += override_encodings
381 self.user_encodings = user_encodings or []
382 exclude_encodings = exclude_encodings or []
383 self.exclude_encodings = set([x.lower() for x in exclude_encodings])
384 self.chardet_encoding = None
385 self.is_html = is_html
386 self.declared_encoding = None
387
388 # First order of business: strip a byte-order mark.
389 self.markup, self.sniffed_encoding = self.strip_byte_order_mark(markup)
390
391 def _usable(self, encoding, tried):
392 """Should we even bother to try this encoding?
393
394 :param encoding: Name of an encoding.
395 :param tried: Encodings that have already been tried. This will be modified
396 as a side effect.
397 """
398 if encoding is not None:
399 encoding = encoding.lower()
400 if encoding in self.exclude_encodings:
401 return False
402 if encoding not in tried:
403 tried.add(encoding)
404 return True
405 return False
406
407 @property
408 def encodings(self):
409 """Yield a number of encodings that might work for this markup.
410
411 :yield: A sequence of strings.
412 """
413 tried = set()
414
415 # First, try the known definite encodings
416 for e in self.known_definite_encodings:
417 if self._usable(e, tried):
418 yield e
419
420 # Did the document originally start with a byte-order mark
421 # that indicated its encoding?
422 if self._usable(self.sniffed_encoding, tried):
423 yield self.sniffed_encoding
424
425 # Sniffing the byte-order mark did nothing; try the user
426 # encodings.
427 for e in self.user_encodings:
428 if self._usable(e, tried):
429 yield e
430
431 # Look within the document for an XML or HTML encoding
432 # declaration.
433 if self.declared_encoding is None:
434 self.declared_encoding = self.find_declared_encoding(
435 self.markup, self.is_html)
436 if self._usable(self.declared_encoding, tried):
437 yield self.declared_encoding
438
439 # Use third-party character set detection to guess at the
440 # encoding.
441 if self.chardet_encoding is None:
442 self.chardet_encoding = chardet_dammit(self.markup)
443 if self._usable(self.chardet_encoding, tried):
444 yield self.chardet_encoding
445
446 # As a last-ditch effort, try utf-8 and windows-1252.
447 for e in ('utf-8', 'windows-1252'):
448 if self._usable(e, tried):
449 yield e
450
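The precedence implemented by this generator can be condensed into a standalone sketch (simplified: it drops the `exclude_encodings` check and dedupes case-insensitively):

```python
def candidate_encodings(known_definite, sniffed, user, declared, chardet_guess):
    # Yield encodings in the documented precedence order, skipping
    # None entries and duplicates, ending with the utf-8 and
    # windows-1252 fallbacks.
    tried = set()
    candidates = (list(known_definite) + [sniffed] + list(user)
                  + [declared, chardet_guess, 'utf-8', 'windows-1252'])
    for e in candidates:
        if e is not None and e.lower() not in tried:
            tried.add(e.lower())
            yield e
```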
451 @classmethod
452 def strip_byte_order_mark(cls, data):
453 """If a byte-order mark is present, strip it and return the encoding it implies.
454
455 :param data: Some markup.
456 :return: A 2-tuple (modified data, implied encoding)
457 """
458 encoding = None
459 if isinstance(data, str):
460 # Unicode data cannot have a byte-order mark.
461 return data, encoding
462 if (len(data) >= 4) and (data[:2] == b'\xfe\xff') \
463                and (data[2:4] != b'\x00\x00'):
464 encoding = 'utf-16be'
465 data = data[2:]
466 elif (len(data) >= 4) and (data[:2] == b'\xff\xfe') \
467                and (data[2:4] != b'\x00\x00'):
468 encoding = 'utf-16le'
469 data = data[2:]
470 elif data[:3] == b'\xef\xbb\xbf':
471 encoding = 'utf-8'
472 data = data[3:]
473 elif data[:4] == b'\x00\x00\xfe\xff':
474 encoding = 'utf-32be'
475 data = data[4:]
476 elif data[:4] == b'\xff\xfe\x00\x00':
477 encoding = 'utf-32le'
478 data = data[4:]
479 return data, encoding
480
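The same sniffing logic can be written with the four-byte checks first, which makes the ordering constraint explicit. A standalone sketch:

```python
def strip_bom(data):
    # Check the four-byte UTF-32 BOMs before the two-byte UTF-16 ones,
    # because b'\xff\xfe' is a prefix of the UTF-32LE BOM -- the same
    # reason the method above tests data[2:4] against b'\x00\x00'.
    if isinstance(data, str):
        return data, None          # Unicode data cannot carry a BOM.
    if data[:4] == b'\x00\x00\xfe\xff':
        return data[4:], 'utf-32be'
    if data[:4] == b'\xff\xfe\x00\x00':
        return data[4:], 'utf-32le'
    if data[:2] == b'\xfe\xff':
        return data[2:], 'utf-16be'
    if data[:2] == b'\xff\xfe':
        return data[2:], 'utf-16le'
    if data[:3] == b'\xef\xbb\xbf':
        return data[3:], 'utf-8'
    return data, None
```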
481 @classmethod
482 def find_declared_encoding(cls, markup, is_html=False, search_entire_document=False):
483 """Given a document, tries to find its declared encoding.
484
485 An XML encoding is declared at the beginning of the document.
486
487 An HTML encoding is declared in a <meta> tag, hopefully near the
488 beginning of the document.
489
490 :param markup: Some markup.
491 :param is_html: If True, this markup is considered to be HTML. Otherwise
492 it's assumed to be XML.
493        :param search_entire_document: Since an encoding is supposed to be declared near the beginning
494 of the document, most of the time it's only necessary to search a few kilobytes of data.
495 Set this to True to force this method to search the entire document.
496 """
497 if search_entire_document:
498 xml_endpos = html_endpos = len(markup)
499 else:
500 xml_endpos = 1024
501 html_endpos = max(2048, int(len(markup) * 0.05))
502
503 if isinstance(markup, bytes):
504 res = encoding_res[bytes]
505 else:
506 res = encoding_res[str]
507
508 xml_re = res['xml']
509 html_re = res['html']
510 declared_encoding = None
511 declared_encoding_match = xml_re.search(markup, endpos=xml_endpos)
512 if not declared_encoding_match and is_html:
513 declared_encoding_match = html_re.search(markup, endpos=html_endpos)
514 if declared_encoding_match is not None:
515 declared_encoding = declared_encoding_match.groups()[0]
516 if declared_encoding:
517 if isinstance(declared_encoding, bytes):
518 declared_encoding = declared_encoding.decode('ascii', 'replace')
519 return declared_encoding.lower()
520 return None
521
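A condensed sketch of the declaration search, reusing the module-level regex strings defined at the top of the file and byte slicing in place of `endpos`:

```python
import re

xml_encoding = '^\\s*<\\?.*encoding=[\'"](.*?)[\'"].*\\?>'
html_meta = '<\\s*meta[^>]+charset\\s*=\\s*["\']?([^>]*?)[ /;\'">]'

def find_declared_encoding(markup, is_html=False):
    # Look for an XML declaration in the first 1024 bytes; for HTML,
    # fall back to a <meta charset> search over a slightly larger window.
    m = re.search(xml_encoding.encode('ascii'), markup[:1024], re.I)
    if not m and is_html:
        m = re.search(html_meta.encode('ascii'), markup[:2048], re.I)
    if m:
        return m.group(1).decode('ascii', 'replace').lower()
    return None
```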
522class UnicodeDammit:
523 """A class for detecting the encoding of a *ML document and
524 converting it to a Unicode string. If the source encoding is
525 windows-1252, can replace MS smart quotes with their HTML or XML
526 equivalents."""
527
528 # This dictionary maps commonly seen values for "charset" in HTML
529 # meta tags to the corresponding Python codec names. It only covers
530 # values that aren't in Python's aliases and can't be determined
531 # by the heuristics in find_codec.
532 CHARSET_ALIASES = {"macintosh": "mac-roman",
533 "x-sjis": "shift-jis"}
534
535 ENCODINGS_WITH_SMART_QUOTES = [
536 "windows-1252",
537 "iso-8859-1",
538 "iso-8859-2",
539 ]
540
541 def __init__(self, markup, known_definite_encodings=[],
542 smart_quotes_to=None, is_html=False, exclude_encodings=[],
543 user_encodings=None, override_encodings=None
544 ):
545 """Constructor.
546
547 :param markup: A bytestring representing markup in an unknown encoding.
548
549 :param known_definite_encodings: When determining the encoding
550 of `markup`, these encodings will be tried first, in
551 order. In HTML terms, this corresponds to the "known
552 definite encoding" step defined here:
553 https://html.spec.whatwg.org/multipage/parsing.html#parsing-with-a-known-character-encoding
554
555 :param user_encodings: These encodings will be tried after the
556 `known_definite_encodings` have been tried and failed, and
557 after an attempt to sniff the encoding by looking at a
558 byte order mark has failed. In HTML terms, this
559 corresponds to the step "user has explicitly instructed
560 the user agent to override the document's character
561 encoding", defined here:
562 https://html.spec.whatwg.org/multipage/parsing.html#determining-the-character-encoding
563
564 :param override_encodings: A deprecated alias for
565 known_definite_encodings. Any encodings here will be tried
566 immediately after the encodings in
567 known_definite_encodings.
568
569 :param smart_quotes_to: By default, Microsoft smart quotes will, like all other characters, be converted
570 to Unicode characters. Setting this to 'ascii' will convert them to ASCII quotes instead.
571 Setting it to 'xml' will convert them to XML entity references, and setting it to 'html'
572 will convert them to HTML entity references.
573 :param is_html: If True, this markup is considered to be HTML. Otherwise
574 it's assumed to be XML.
575 :param exclude_encodings: These encodings will not be considered, even
576 if the sniffing code thinks they might make sense.
577
578 """
579 self.smart_quotes_to = smart_quotes_to
580 self.tried_encodings = []
581 self.contains_replacement_characters = False
582 self.is_html = is_html
583 self.log = logging.getLogger(__name__)
584 self.detector = EncodingDetector(
585 markup, known_definite_encodings, is_html, exclude_encodings,
586 user_encodings, override_encodings
587 )
588
589 # Short-circuit if the data is in Unicode to begin with.
590 if isinstance(markup, str) or markup == '':
591 self.markup = markup
592 self.unicode_markup = str(markup)
593 self.original_encoding = None
594 return
595
596 # The encoding detector may have stripped a byte-order mark.
597 # Use the stripped markup from this point on.
598 self.markup = self.detector.markup
599
600 u = None
601 for encoding in self.detector.encodings:
602 markup = self.detector.markup
603 u = self._convert_from(encoding)
604 if u is not None:
605 break
606
607 if not u:
608 # None of the encodings worked. As an absolute last resort,
609 # try them again with character replacement.
610
611 for encoding in self.detector.encodings:
612 if encoding != "ascii":
613 u = self._convert_from(encoding, "replace")
614 if u is not None:
615 self.log.warning(
616 "Some characters could not be decoded, and were "
617 "replaced with REPLACEMENT CHARACTER."
618 )
619 self.contains_replacement_characters = True
620 break
621
622 # If none of that worked, we could at this point force it to
623 # ASCII, but that would destroy so much data that I think
624 # giving up is better.
625 self.unicode_markup = u
626 if not u:
627 self.original_encoding = None
628
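The two-pass decoding strategy in the constructor above (strict first, then retry everything with replacement characters) can be sketched without the surrounding class:

```python
def try_decode(markup, encodings):
    # First pass uses strict decoding; if every encoding fails, a
    # second pass retries with errors="replace" as a last resort.
    # LookupError covers unknown codec names.
    for errors in ("strict", "replace"):
        for encoding in encodings:
            try:
                return markup.decode(encoding, errors), encoding, errors
            except (UnicodeDecodeError, LookupError):
                continue
    return None, None, None
```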
629 def _sub_ms_char(self, match):
630 """Changes a MS smart quote character to an XML or HTML
631 entity, or an ASCII character."""
632 orig = match.group(1)
633 if self.smart_quotes_to == 'ascii':
634 sub = self.MS_CHARS_TO_ASCII.get(orig).encode()
635 else:
636 sub = self.MS_CHARS.get(orig)
637 if type(sub) == tuple:
638 if self.smart_quotes_to == 'xml':
639 sub = '&#x'.encode() + sub[1].encode() + ';'.encode()
640 else:
641 sub = '&'.encode() + sub[0].encode() + ';'.encode()
642 else:
643 sub = sub.encode()
644 return sub
645
646 def _convert_from(self, proposed, errors="strict"):
647 """Attempt to convert the markup to the proposed encoding.
648
649 :param proposed: The name of a character encoding.
650 """
651 proposed = self.find_codec(proposed)
652 if not proposed or (proposed, errors) in self.tried_encodings:
653 return None
654 self.tried_encodings.append((proposed, errors))
655 markup = self.markup
656 # Convert smart quotes to HTML if coming from an encoding
657 # that might have them.
658 if (self.smart_quotes_to is not None
659 and proposed in self.ENCODINGS_WITH_SMART_QUOTES):
660 smart_quotes_re = b"([\x80-\x9f])"
661 smart_quotes_compiled = re.compile(smart_quotes_re)
662 markup = smart_quotes_compiled.sub(self._sub_ms_char, markup)
663
664 try:
665 #print("Trying to convert document to %s (errors=%s)" % (
666 # proposed, errors))
667 u = self._to_unicode(markup, proposed, errors)
668 self.markup = u
669 self.original_encoding = proposed
670 except Exception as e:
671 #print("That didn't work!")
672 #print(e)
673 return None
674 #print("Correct encoding: %s" % proposed)
675 return self.markup
676
677 def _to_unicode(self, data, encoding, errors="strict"):
678 """Given a string and its encoding, decodes the string into Unicode.
679
680 :param encoding: The name of an encoding.
681 """
682 return str(data, encoding, errors)
683
684 @property
685 def declared_html_encoding(self):
686 """If the markup is an HTML document, returns the encoding declared _within_
687 the document.
688 """
689 if not self.is_html:
690 return None
691 return self.detector.declared_encoding
692
693 def find_codec(self, charset):
694 """Convert the name of a character set to a codec name.
695
696 :param charset: The name of a character set.
697 :return: The name of a codec.
698 """
699 value = (self._codec(self.CHARSET_ALIASES.get(charset, charset))
700 or (charset and self._codec(charset.replace("-", "")))
701 or (charset and self._codec(charset.replace("-", "_")))
702 or (charset and charset.lower())
703 or charset
704 )
705 if value:
706 return value.lower()
707 return None
708
709 def _codec(self, charset):
710 if not charset:
711 return charset
712 codec = None
713 try:
714 codecs.lookup(charset)
715 codec = charset
716 except (LookupError, ValueError):
717 pass
718 return codec
719
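The alias-then-rename resolution performed by `find_codec` and `_codec` above can be flattened into one loop (a sketch that keeps the same candidate order):

```python
import codecs

CHARSET_ALIASES = {"macintosh": "mac-roman", "x-sjis": "shift-jis"}

def find_codec(charset):
    # Resolve a declared charset name to something the codecs module
    # accepts: try the alias table, then the name with hyphens removed,
    # then with hyphens turned into underscores.
    for candidate in (CHARSET_ALIASES.get(charset, charset),
                      charset.replace("-", ""),
                      charset.replace("-", "_")):
        try:
            codecs.lookup(candidate)
            return candidate.lower()
        except (LookupError, ValueError):
            continue
    return charset.lower()
```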
720
721 # A partial mapping of ISO-Latin-1 to HTML entities/XML numeric entities.
722 MS_CHARS = {b'\x80': ('euro', '20AC'),
723 b'\x81': ' ',
724 b'\x82': ('sbquo', '201A'),
725 b'\x83': ('fnof', '192'),
726 b'\x84': ('bdquo', '201E'),
727 b'\x85': ('hellip', '2026'),
728 b'\x86': ('dagger', '2020'),
729 b'\x87': ('Dagger', '2021'),
730 b'\x88': ('circ', '2C6'),
731 b'\x89': ('permil', '2030'),
732 b'\x8A': ('Scaron', '160'),
733 b'\x8B': ('lsaquo', '2039'),
734 b'\x8C': ('OElig', '152'),
735 b'\x8D': '?',
736 b'\x8E': ('#x17D', '17D'),
737 b'\x8F': '?',
738 b'\x90': '?',
739 b'\x91': ('lsquo', '2018'),
740 b'\x92': ('rsquo', '2019'),
741 b'\x93': ('ldquo', '201C'),
742 b'\x94': ('rdquo', '201D'),
743 b'\x95': ('bull', '2022'),
744 b'\x96': ('ndash', '2013'),
745 b'\x97': ('mdash', '2014'),
746 b'\x98': ('tilde', '2DC'),
747 b'\x99': ('trade', '2122'),
748 b'\x9a': ('scaron', '161'),
749 b'\x9b': ('rsaquo', '203A'),
750 b'\x9c': ('oelig', '153'),
751 b'\x9d': '?',
752 b'\x9e': ('#x17E', '17E'),
753                b'\x9f': ('Yuml', '178'),}
754
755 # A parochial partial mapping of ISO-Latin-1 to ASCII. Contains
756 # horrors like stripping diacritical marks to turn á into a, but also
757    # contains non-horrors like turning “ into ".
758 MS_CHARS_TO_ASCII = {
759 b'\x80' : 'EUR',
760 b'\x81' : ' ',
761 b'\x82' : ',',
762 b'\x83' : 'f',
763 b'\x84' : ',,',
764 b'\x85' : '...',
765 b'\x86' : '+',
766 b'\x87' : '++',
767 b'\x88' : '^',
768 b'\x89' : '%',
769 b'\x8a' : 'S',
770 b'\x8b' : '<',
771 b'\x8c' : 'OE',
772 b'\x8d' : '?',
773 b'\x8e' : 'Z',
774 b'\x8f' : '?',
775 b'\x90' : '?',
776 b'\x91' : "'",
777 b'\x92' : "'",
778 b'\x93' : '"',
779 b'\x94' : '"',
780 b'\x95' : '*',
781 b'\x96' : '-',
782 b'\x97' : '--',
783 b'\x98' : '~',
784 b'\x99' : '(TM)',
785 b'\x9a' : 's',
786 b'\x9b' : '>',
787 b'\x9c' : 'oe',
788 b'\x9d' : '?',
789 b'\x9e' : 'z',
790 b'\x9f' : 'Y',
791 b'\xa0' : ' ',
792 b'\xa1' : '!',
793 b'\xa2' : 'c',
794 b'\xa3' : 'GBP',
795 b'\xa4' : '$', #This approximation is especially parochial--this is the
796 #generic currency symbol.
797 b'\xa5' : 'YEN',
798 b'\xa6' : '|',
799 b'\xa7' : 'S',
800 b'\xa8' : '..',
801 b'\xa9' : '',
802 b'\xaa' : '(th)',
803 b'\xab' : '<<',
804 b'\xac' : '!',
805 b'\xad' : ' ',
806 b'\xae' : '(R)',
807 b'\xaf' : '-',
808 b'\xb0' : 'o',
809 b'\xb1' : '+-',
810 b'\xb2' : '2',
811 b'\xb3' : '3',
812        b'\xb4' : "'",
813 b'\xb5' : 'u',
814 b'\xb6' : 'P',
815 b'\xb7' : '*',
816 b'\xb8' : ',',
817 b'\xb9' : '1',
818 b'\xba' : '(th)',
819 b'\xbb' : '>>',
820 b'\xbc' : '1/4',
821 b'\xbd' : '1/2',
822 b'\xbe' : '3/4',
823 b'\xbf' : '?',
824 b'\xc0' : 'A',
825 b'\xc1' : 'A',
826 b'\xc2' : 'A',
827 b'\xc3' : 'A',
828 b'\xc4' : 'A',
829 b'\xc5' : 'A',
830 b'\xc6' : 'AE',
831 b'\xc7' : 'C',
832 b'\xc8' : 'E',
833 b'\xc9' : 'E',
834 b'\xca' : 'E',
835 b'\xcb' : 'E',
836 b'\xcc' : 'I',
837 b'\xcd' : 'I',
838 b'\xce' : 'I',
839 b'\xcf' : 'I',
840 b'\xd0' : 'D',
841 b'\xd1' : 'N',
842 b'\xd2' : 'O',
843 b'\xd3' : 'O',
844 b'\xd4' : 'O',
845 b'\xd5' : 'O',
846 b'\xd6' : 'O',
847 b'\xd7' : '*',
848 b'\xd8' : 'O',
849 b'\xd9' : 'U',
850 b'\xda' : 'U',
851 b'\xdb' : 'U',
852 b'\xdc' : 'U',
853 b'\xdd' : 'Y',
854 b'\xde' : 'b',
855 b'\xdf' : 'B',
856 b'\xe0' : 'a',
857 b'\xe1' : 'a',
858 b'\xe2' : 'a',
859 b'\xe3' : 'a',
860 b'\xe4' : 'a',
861 b'\xe5' : 'a',
862 b'\xe6' : 'ae',
863 b'\xe7' : 'c',
864 b'\xe8' : 'e',
865 b'\xe9' : 'e',
866 b'\xea' : 'e',
867 b'\xeb' : 'e',
868 b'\xec' : 'i',
869 b'\xed' : 'i',
870 b'\xee' : 'i',
871 b'\xef' : 'i',
872 b'\xf0' : 'o',
873 b'\xf1' : 'n',
874 b'\xf2' : 'o',
875 b'\xf3' : 'o',
876 b'\xf4' : 'o',
877 b'\xf5' : 'o',
878 b'\xf6' : 'o',
879 b'\xf7' : '/',
880 b'\xf8' : 'o',
881 b'\xf9' : 'u',
882 b'\xfa' : 'u',
883 b'\xfb' : 'u',
884 b'\xfc' : 'u',
885 b'\xfd' : 'y',
886 b'\xfe' : 'b',
887 b'\xff' : 'y',
888 }
889
890 # A map used when removing rogue Windows-1252/ISO-8859-1
891 # characters in otherwise UTF-8 documents.
892 #
893 # Note that \x81, \x8d, \x8f, \x90, and \x9d are undefined in
894 # Windows-1252.
895 WINDOWS_1252_TO_UTF8 = {
896        0x80 : b'\xe2\x82\xac', # €
897        0x82 : b'\xe2\x80\x9a', # ‚
898        0x83 : b'\xc6\x92',     # ƒ
899        0x84 : b'\xe2\x80\x9e', # „
900        0x85 : b'\xe2\x80\xa6', # …
901        0x86 : b'\xe2\x80\xa0', # †
902        0x87 : b'\xe2\x80\xa1', # ‡
903        0x88 : b'\xcb\x86',     # ˆ
904        0x89 : b'\xe2\x80\xb0', # ‰
905        0x8a : b'\xc5\xa0',     # Š
906        0x8b : b'\xe2\x80\xb9', # ‹
907        0x8c : b'\xc5\x92',     # Œ
908        0x8e : b'\xc5\xbd',     # Ž
909        0x91 : b'\xe2\x80\x98', # ‘
910        0x92 : b'\xe2\x80\x99', # ’
911        0x93 : b'\xe2\x80\x9c', # “
912        0x94 : b'\xe2\x80\x9d', # ”
913        0x95 : b'\xe2\x80\xa2', # •
914        0x96 : b'\xe2\x80\x93', # –
915        0x97 : b'\xe2\x80\x94', # —
916        0x98 : b'\xcb\x9c',     # ˜
917        0x99 : b'\xe2\x84\xa2', # ™
918        0x9a : b'\xc5\xa1',     # š
919        0x9b : b'\xe2\x80\xba', # ›
920        0x9c : b'\xc5\x93',     # œ
921        0x9e : b'\xc5\xbe',     # ž
922        0x9f : b'\xc5\xb8',     # Ÿ
923        0xa0 : b'\xc2\xa0',     # (non-breaking space)
924        0xa1 : b'\xc2\xa1',     # ¡
925        0xa2 : b'\xc2\xa2',     # ¢
926        0xa3 : b'\xc2\xa3',     # £
927        0xa4 : b'\xc2\xa4',     # ¤
928        0xa5 : b'\xc2\xa5',     # ¥
929        0xa6 : b'\xc2\xa6',     # ¦
930        0xa7 : b'\xc2\xa7',     # §
931        0xa8 : b'\xc2\xa8',     # ¨
932        0xa9 : b'\xc2\xa9',     # ©
933        0xaa : b'\xc2\xaa',     # ª
934        0xab : b'\xc2\xab',     # «
935        0xac : b'\xc2\xac',     # ¬
936        0xad : b'\xc2\xad',     # (soft hyphen)
937        0xae : b'\xc2\xae',     # ®
938        0xaf : b'\xc2\xaf',     # ¯
939        0xb0 : b'\xc2\xb0',     # °
940        0xb1 : b'\xc2\xb1',     # ±
941        0xb2 : b'\xc2\xb2',     # ²
942        0xb3 : b'\xc2\xb3',     # ³
943        0xb4 : b'\xc2\xb4',     # ´
944        0xb5 : b'\xc2\xb5',     # µ
945        0xb6 : b'\xc2\xb6',     # ¶
946        0xb7 : b'\xc2\xb7',     # ·
947        0xb8 : b'\xc2\xb8',     # ¸
948        0xb9 : b'\xc2\xb9',     # ¹
949        0xba : b'\xc2\xba',     # º
950        0xbb : b'\xc2\xbb',     # »
951        0xbc : b'\xc2\xbc',     # ¼
952        0xbd : b'\xc2\xbd',     # ½
953        0xbe : b'\xc2\xbe',     # ¾
954        0xbf : b'\xc2\xbf',     # ¿
955        0xc0 : b'\xc3\x80',     # À
956        0xc1 : b'\xc3\x81',     # Á
957        0xc2 : b'\xc3\x82',     # Â
958        0xc3 : b'\xc3\x83',     # Ã
959        0xc4 : b'\xc3\x84',     # Ä
960        0xc5 : b'\xc3\x85',     # Å
961        0xc6 : b'\xc3\x86',     # Æ
962        0xc7 : b'\xc3\x87',     # Ç
963        0xc8 : b'\xc3\x88',     # È
964        0xc9 : b'\xc3\x89',     # É
965        0xca : b'\xc3\x8a',     # Ê
966        0xcb : b'\xc3\x8b',     # Ë
967        0xcc : b'\xc3\x8c',     # Ì
968        0xcd : b'\xc3\x8d',     # Í
969        0xce : b'\xc3\x8e',     # Î
970        0xcf : b'\xc3\x8f',     # Ï
971        0xd0 : b'\xc3\x90',     # Ð
972        0xd1 : b'\xc3\x91',     # Ñ
973        0xd2 : b'\xc3\x92',     # Ò
974        0xd3 : b'\xc3\x93',     # Ó
975        0xd4 : b'\xc3\x94',     # Ô
976        0xd5 : b'\xc3\x95',     # Õ
977        0xd6 : b'\xc3\x96',     # Ö
978        0xd7 : b'\xc3\x97',     # ×
979        0xd8 : b'\xc3\x98',     # Ø
980        0xd9 : b'\xc3\x99',     # Ù
981        0xda : b'\xc3\x9a',     # Ú
982        0xdb : b'\xc3\x9b',     # Û
983        0xdc : b'\xc3\x9c',     # Ü
984        0xdd : b'\xc3\x9d',     # Ý
985        0xde : b'\xc3\x9e',     # Þ
986        0xdf : b'\xc3\x9f',     # ß
987        0xe0 : b'\xc3\xa0',     # à
988        0xe1 : b'\xc3\xa1',     # á
989        0xe2 : b'\xc3\xa2',     # â
990        0xe3 : b'\xc3\xa3',     # ã
991        0xe4 : b'\xc3\xa4',     # ä
992        0xe5 : b'\xc3\xa5',     # å
993        0xe6 : b'\xc3\xa6',     # æ
994        0xe7 : b'\xc3\xa7',     # ç
995        0xe8 : b'\xc3\xa8',     # è
996        0xe9 : b'\xc3\xa9',     # é
997        0xea : b'\xc3\xaa',     # ê
998        0xeb : b'\xc3\xab',     # ë
999        0xec : b'\xc3\xac',     # ì
1000        0xed : b'\xc3\xad',     # í
1001        0xee : b'\xc3\xae',     # î
1002        0xef : b'\xc3\xaf',     # ï
1003        0xf0 : b'\xc3\xb0',     # ð
1004        0xf1 : b'\xc3\xb1',     # ñ
1005        0xf2 : b'\xc3\xb2',     # ò
1006        0xf3 : b'\xc3\xb3',     # ó
1007        0xf4 : b'\xc3\xb4',     # ô
1008        0xf5 : b'\xc3\xb5',     # õ
1009        0xf6 : b'\xc3\xb6',     # ö
1010        0xf7 : b'\xc3\xb7',     # ÷
1011        0xf8 : b'\xc3\xb8',     # ø
1012        0xf9 : b'\xc3\xb9',     # ù
1013        0xfa : b'\xc3\xba',     # ú
1014        0xfb : b'\xc3\xbb',     # û
1015        0xfc : b'\xc3\xbc',     # ü
1016        0xfd : b'\xc3\xbd',     # ý
1017        0xfe : b'\xc3\xbe',     # þ
1018        }
1019
1020 MULTIBYTE_MARKERS_AND_SIZES = [
1021 (0xc2, 0xdf, 2), # 2-byte characters start with a byte C2-DF
1022 (0xe0, 0xef, 3), # 3-byte characters start with E0-EF
1023 (0xf0, 0xf4, 4), # 4-byte characters start with F0-F4
1024 ]
1025
1026 FIRST_MULTIBYTE_MARKER = MULTIBYTE_MARKERS_AND_SIZES[0][0]
1027 LAST_MULTIBYTE_MARKER = MULTIBYTE_MARKERS_AND_SIZES[-1][1]
1028
1029 @classmethod
1030 def detwingle(cls, in_bytes, main_encoding="utf8",
1031 embedded_encoding="windows-1252"):
1032 """Fix characters from one encoding embedded in some other encoding.
1033
1034 Currently the only situation supported is Windows-1252 (or its
1035 subset ISO-8859-1), embedded in UTF-8.
1036
1037 :param in_bytes: A bytestring that you suspect contains
1038 characters from multiple encodings. Note that this _must_
1039 be a bytestring. If you've already converted the document
1040 to Unicode, you're too late.
1041 :param main_encoding: The primary encoding of `in_bytes`.
1042 :param embedded_encoding: The encoding that was used to embed characters
1043 in the main document.
1044 :return: A bytestring in which `embedded_encoding`
1045 characters have been converted to their `main_encoding`
1046 equivalents.
1047 """
1048 if embedded_encoding.replace('_', '-').lower() not in (
1049 'windows-1252', 'windows_1252'):
1050 raise NotImplementedError(
1051 "Windows-1252 and ISO-8859-1 are the only currently supported "
1052 "embedded encodings.")
1053
1054 if main_encoding.lower() not in ('utf8', 'utf-8'):
1055 raise NotImplementedError(
1056 "UTF-8 is the only currently supported main encoding.")
1057
1058 byte_chunks = []
1059
1060 chunk_start = 0
1061 pos = 0
1062 while pos < len(in_bytes):
1063 byte = in_bytes[pos]
1067 if (byte >= cls.FIRST_MULTIBYTE_MARKER
1068 and byte <= cls.LAST_MULTIBYTE_MARKER):
1069 # This is the start of a UTF-8 multibyte character. Skip
1070 # to the end.
1071 for start, end, size in cls.MULTIBYTE_MARKERS_AND_SIZES:
1072 if byte >= start and byte <= end:
1073 pos += size
1074 break
1075 elif byte >= 0x80 and byte in cls.WINDOWS_1252_TO_UTF8:
1076 # We found a Windows-1252 character!
1077 # Save the string up to this point as a chunk.
1078 byte_chunks.append(in_bytes[chunk_start:pos])
1079
1080 # Now translate the Windows-1252 character into UTF-8
1081 # and add it as another, one-byte chunk.
1082 byte_chunks.append(cls.WINDOWS_1252_TO_UTF8[byte])
1083 pos += 1
1084 chunk_start = pos
1085 else:
1086 # Go on to the next character.
1087 pos += 1
1088 if chunk_start == 0:
1089 # The string is unchanged.
1090 return in_bytes
1091 else:
1092 # Store the final chunk.
1093 byte_chunks.append(in_bytes[chunk_start:])
1094 return b''.join(byte_chunks)
1095
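The detwingle() loop above can be illustrated with a minimal, self-contained sketch: walk the byte string, skip over well-formed UTF-8 multibyte sequences, and replace stray Windows-1252 high bytes with their UTF-8 equivalents. The two-entry mapping below is a tiny illustrative subset, not the full WINDOWS_1252_TO_UTF8 table, and `detwingle_sketch` is a hypothetical name.

```python
# Illustrative subset of the Windows-1252 -> UTF-8 table.
WINDOWS_1252_SUBSET = {
    0x93: b'\xe2\x80\x9c',  # left curly quote
    0x94: b'\xe2\x80\x9d',  # right curly quote
}

MULTIBYTE_MARKERS_AND_SIZES = [
    (0xc2, 0xdf, 2),  # 2-byte UTF-8 sequences start with C2-DF
    (0xe0, 0xef, 3),  # 3-byte sequences start with E0-EF
    (0xf0, 0xf4, 4),  # 4-byte sequences start with F0-F4
]

def detwingle_sketch(in_bytes):
    chunks = []
    chunk_start = pos = 0
    while pos < len(in_bytes):
        byte = in_bytes[pos]
        for start, end, size in MULTIBYTE_MARKERS_AND_SIZES:
            if start <= byte <= end:
                pos += size  # skip a real UTF-8 sequence
                break
        else:
            if byte >= 0x80 and byte in WINDOWS_1252_SUBSET:
                # Save the clean run so far, then emit the translation.
                chunks.append(in_bytes[chunk_start:pos])
                chunks.append(WINDOWS_1252_SUBSET[byte])
                pos += 1
                chunk_start = pos
            else:
                pos += 1
    if chunk_start == 0:
        return in_bytes  # nothing needed fixing
    chunks.append(in_bytes[chunk_start:])
    return b''.join(chunks)
```

So `detwingle_sketch(b'Hi \x93ok\x94')` yields valid UTF-8 curly quotes, while an already-clean UTF-8 string passes through untouched.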
diff --git a/bitbake/lib/bs4/diagnose.py b/bitbake/lib/bs4/diagnose.py
deleted file mode 100644
index 4692795340..0000000000
--- a/bitbake/lib/bs4/diagnose.py
+++ /dev/null
@@ -1,232 +0,0 @@
1"""Diagnostic functions, mainly for use when doing tech support."""
2
3# Use of this source code is governed by the MIT license.
4__license__ = "MIT"
5
6import cProfile
7from io import BytesIO
8from html.parser import HTMLParser
9import bs4
10from bs4 import BeautifulSoup, __version__
11from bs4.builder import builder_registry
12
13import os
14import pstats
15import random
16import tempfile
17import time
18import traceback
19import sys
20
21def diagnose(data):
22 """Diagnostic suite for isolating common problems.
23
24 :param data: A string containing markup that needs to be explained.
25 :return: None; diagnostics are printed to standard output.
26 """
27 print(("Diagnostic running on Beautiful Soup %s" % __version__))
28 print(("Python version %s" % sys.version))
29
30 basic_parsers = ["html.parser", "html5lib", "lxml"]
31	    for name in list(basic_parsers):  # iterate over a copy; we may remove entries
32 for builder in builder_registry.builders:
33 if name in builder.features:
34 break
35 else:
36 basic_parsers.remove(name)
37 print((
38 "I noticed that %s is not installed. Installing it may help." %
39 name))
40
41 if 'lxml' in basic_parsers:
42 basic_parsers.append("lxml-xml")
43 try:
44 from lxml import etree
45 print(("Found lxml version %s" % ".".join(map(str,etree.LXML_VERSION))))
46 except ImportError as e:
47 print(
48 "lxml is not installed or couldn't be imported.")
49
50
51 if 'html5lib' in basic_parsers:
52 try:
53 import html5lib
54 print(("Found html5lib version %s" % html5lib.__version__))
55 except ImportError as e:
56 print(
57 "html5lib is not installed or couldn't be imported.")
58
59 if hasattr(data, 'read'):
60 data = data.read()
61
62 for parser in basic_parsers:
63 print(("Trying to parse your markup with %s" % parser))
64 success = False
65 try:
66 soup = BeautifulSoup(data, features=parser)
67 success = True
68 except Exception as e:
69 print(("%s could not parse the markup." % parser))
70 traceback.print_exc()
71 if success:
72 print(("Here's what %s did with the markup:" % parser))
73 print((soup.prettify()))
74
75 print(("-" * 80))
76
77def lxml_trace(data, html=True, **kwargs):
78 """Print out the lxml events that occur during parsing.
79
80 This lets you see how lxml parses a document when no Beautiful
81 Soup code is running. You can use this to determine whether
82 an lxml-specific problem is in Beautiful Soup's lxml tree builders
83 or in lxml itself.
84
85 :param data: Some markup.
86 :param html: If True, markup will be parsed with lxml's HTML parser.
87	    If False, lxml's XML parser will be used.
88 """
89 from lxml import etree
90 recover = kwargs.pop('recover', True)
91 if isinstance(data, str):
92 data = data.encode("utf8")
93 reader = BytesIO(data)
94 for event, element in etree.iterparse(
95 reader, html=html, recover=recover, **kwargs
96 ):
97 print(("%s, %4s, %s" % (event, element.tag, element.text)))
98
99class AnnouncingParser(HTMLParser):
100 """Subclass of HTMLParser that announces parse events, without doing
101 anything else.
102
103 You can use this to get a picture of how html.parser sees a given
104 document. The easiest way to do this is to call `htmlparser_trace`.
105 """
106
107 def _p(self, s):
108 print(s)
109
110 def handle_starttag(self, name, attrs):
111 self._p("%s START" % name)
112
113 def handle_endtag(self, name):
114 self._p("%s END" % name)
115
116 def handle_data(self, data):
117 self._p("%s DATA" % data)
118
119 def handle_charref(self, name):
120 self._p("%s CHARREF" % name)
121
122 def handle_entityref(self, name):
123 self._p("%s ENTITYREF" % name)
124
125 def handle_comment(self, data):
126 self._p("%s COMMENT" % data)
127
128 def handle_decl(self, data):
129 self._p("%s DECL" % data)
130
131 def unknown_decl(self, data):
132 self._p("%s UNKNOWN-DECL" % data)
133
134 def handle_pi(self, data):
135 self._p("%s PI" % data)
136
137def htmlparser_trace(data):
138 """Print out the HTMLParser events that occur during parsing.
139
140 This lets you see how HTMLParser parses a document when no
141 Beautiful Soup code is running.
142
143 :param data: Some markup.
144 """
145 parser = AnnouncingParser()
146 parser.feed(data)
147
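The AnnouncingParser/htmlparser_trace pair above can be recast as a self-contained, stdlib-only variant that collects html.parser events into a list instead of printing them, which makes the event stream easy to inspect programmatically. `CollectingParser` and `trace` are illustrative names.

```python
from html.parser import HTMLParser

class CollectingParser(HTMLParser):
    """Records html.parser events instead of printing them."""
    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, name, attrs):
        self.events.append(("START", name))

    def handle_endtag(self, name):
        self.events.append(("END", name))

    def handle_data(self, data):
        self.events.append(("DATA", data))

def trace(markup):
    parser = CollectingParser()
    parser.feed(markup)
    parser.close()
    return parser.events
```

For example, `trace("<p>hi</p>")` returns the start/data/end events in document order, showing exactly how html.parser tokenizes the markup with no Beautiful Soup code involved.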
148_vowels = "aeiou"
149_consonants = "bcdfghjklmnpqrstvwxyz"
150
151def rword(length=5):
152 "Generate a random word-like string."
153 s = ''
154 for i in range(length):
155 if i % 2 == 0:
156 t = _consonants
157 else:
158 t = _vowels
159 s += random.choice(t)
160 return s
161
162def rsentence(length=4):
163 "Generate a random sentence-like string."
164 return " ".join(rword(random.randint(4,9)) for i in range(length))
165
166def rdoc(num_elements=1000):
167 """Randomly generate an invalid HTML document."""
168 tag_names = ['p', 'div', 'span', 'i', 'b', 'script', 'table']
169 elements = []
170 for i in range(num_elements):
171	        choice = random.randint(0,3)  # 3: deliberately add nothing this round
172 if choice == 0:
173 # New tag.
174 tag_name = random.choice(tag_names)
175 elements.append("<%s>" % tag_name)
176 elif choice == 1:
177 elements.append(rsentence(random.randint(1,4)))
178 elif choice == 2:
179 # Close a tag.
180 tag_name = random.choice(tag_names)
181 elements.append("</%s>" % tag_name)
182 return "<html>" + "\n".join(elements) + "</html>"
183
184def benchmark_parsers(num_elements=100000):
185 """Very basic head-to-head performance benchmark."""
186 print(("Comparative parser benchmark on Beautiful Soup %s" % __version__))
187 data = rdoc(num_elements)
188 print(("Generated a large invalid HTML document (%d bytes)." % len(data)))
189
190 for parser in ["lxml", ["lxml", "html"], "html5lib", "html.parser"]:
191 success = False
192 try:
193 a = time.time()
194 soup = BeautifulSoup(data, parser)
195 b = time.time()
196 success = True
197 except Exception as e:
198 print(("%s could not parse the markup." % parser))
199 traceback.print_exc()
200 if success:
201 print(("BS4+%s parsed the markup in %.2fs." % (parser, b-a)))
202
203 from lxml import etree
204 a = time.time()
205 etree.HTML(data)
206 b = time.time()
207 print(("Raw lxml parsed the markup in %.2fs." % (b-a)))
208
209 import html5lib
210 parser = html5lib.HTMLParser()
211 a = time.time()
212 parser.parse(data)
213 b = time.time()
214 print(("Raw html5lib parsed the markup in %.2fs." % (b-a)))
215
216def profile(num_elements=100000, parser="lxml"):
217 """Use Python's profiler on a randomly generated document."""
218 filehandle = tempfile.NamedTemporaryFile()
219 filename = filehandle.name
220
221 data = rdoc(num_elements)
222 vars = dict(bs4=bs4, data=data, parser=parser)
223	    cProfile.runctx('bs4.BeautifulSoup(data, parser)', vars, vars, filename)
224
225 stats = pstats.Stats(filename)
226 # stats.strip_dirs()
227 stats.sort_stats("cumulative")
228 stats.print_stats('_html5lib|bs4', 50)
229
230# If this file is run as a script, standard input is diagnosed.
231if __name__ == '__main__':
232 diagnose(sys.stdin.read())
diff --git a/bitbake/lib/bs4/element.py b/bitbake/lib/bs4/element.py
deleted file mode 100644
index 0aefe734b2..0000000000
--- a/bitbake/lib/bs4/element.py
+++ /dev/null
@@ -1,2435 +0,0 @@
1# Use of this source code is governed by the MIT license.
2__license__ = "MIT"
3
4try:
5 from collections.abc import Callable # Python 3.6
6except ImportError as e:
7 from collections import Callable
8import re
9import sys
10import warnings
11
12from bs4.css import CSS
13from bs4.formatter import (
14 Formatter,
15 HTMLFormatter,
16 XMLFormatter,
17)
18
19DEFAULT_OUTPUT_ENCODING = "utf-8"
20
21nonwhitespace_re = re.compile(r"\S+")
22
23# NOTE: This isn't used as of 4.7.0. I'm leaving it for a little bit on
24# the off chance someone imported it for their own use.
25whitespace_re = re.compile(r"\s+")
26
27def _alias(attr):
28 """Alias one attribute name to another for backward compatibility"""
29 @property
30 def alias(self):
31 return getattr(self, attr)
32
33	    @alias.setter
34	    def alias(self, value):
35	        return setattr(self, attr, value)
36 return alias
37
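The _alias() helper above is a general property-forwarding pattern; a standalone sketch follows, with the setter taking the new value as a second argument so writes are forwarded too. `make_alias` and `Node` are illustrative names, not bs4 API.

```python
def make_alias(attr):
    """Build a property that forwards reads and writes to `attr`."""
    @property
    def alias(self):
        return getattr(self, attr)

    @alias.setter
    def alias(self, value):
        setattr(self, attr, value)
    return alias

class Node:
    def __init__(self):
        self.next_sibling = None

    # Keep a legacy camelCase spelling working.
    nextSibling = make_alias("next_sibling")
```

Writing through either name updates the same underlying attribute, which is what keeps the BS3-era names in sync with the modern ones.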
38
39# These encodings are recognized by Python (so PageElement.encode
40# could theoretically support them) but XML and HTML don't recognize
41# them (so they should not show up in an XML or HTML document as that
42# document's encoding).
43#
44# If an XML document is encoded in one of these encodings, no encoding
45# will be mentioned in the XML declaration. If an HTML document is
46# encoded in one of these encodings, and the HTML document has a
47# <meta> tag that mentions an encoding, the encoding will be given as
48# the empty string.
49#
50# Source:
51# https://docs.python.org/3/library/codecs.html#python-specific-encodings
52PYTHON_SPECIFIC_ENCODINGS = set([
53 "idna",
54 "mbcs",
55 "oem",
56 "palmos",
57 "punycode",
58 "raw_unicode_escape",
59 "undefined",
60 "unicode_escape",
61 "raw-unicode-escape",
62 "unicode-escape",
63 "string-escape",
64 "string_escape",
65])
66
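A small illustration of why the encodings above are excluded: they are Python-side transformations, not character sets a browser could be told about. 'unicode_escape', for example, turns non-ASCII characters into backslash escapes rather than encoding them, whereas a real charset like UTF-8 produces the character's bytes.

```python
# A Python-specific codec: the output is ASCII escape text, not a charset.
encoded = "\u00e9".encode("unicode_escape")   # b'\\xe9'

# A real charset encodes the character itself.
assert "\u00e9".encode("utf-8") == b'\xc3\xa9'
```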
67
68class NamespacedAttribute(str):
69 """A namespaced string (e.g. 'xml:lang') that remembers the namespace
70 ('xml') and the name ('lang') that were used to create it.
71 """
72
73 def __new__(cls, prefix, name=None, namespace=None):
74 if not name:
75 # This is the default namespace. Its name "has no value"
76 # per https://www.w3.org/TR/xml-names/#defaulting
77 name = None
78
79 if not name:
80 obj = str.__new__(cls, prefix)
81 elif not prefix:
82 # Not really namespaced.
83 obj = str.__new__(cls, name)
84 else:
85 obj = str.__new__(cls, prefix + ":" + name)
86 obj.prefix = prefix
87 obj.name = name
88 obj.namespace = namespace
89 return obj
90
91class AttributeValueWithCharsetSubstitution(str):
92 """A stand-in object for a character encoding specified in HTML."""
93
94class CharsetMetaAttributeValue(AttributeValueWithCharsetSubstitution):
95 """A generic stand-in for the value of a meta tag's 'charset' attribute.
96
97 When Beautiful Soup parses the markup '<meta charset="utf8">', the
98 value of the 'charset' attribute will be one of these objects.
99 """
100
101 def __new__(cls, original_value):
102 obj = str.__new__(cls, original_value)
103 obj.original_value = original_value
104 return obj
105
106 def encode(self, encoding):
107 """When an HTML document is being encoded to a given encoding, the
108 value of a meta tag's 'charset' is the name of the encoding.
109 """
110 if encoding in PYTHON_SPECIFIC_ENCODINGS:
111 return ''
112 return encoding
113
114
115class ContentMetaAttributeValue(AttributeValueWithCharsetSubstitution):
116 """A generic stand-in for the value of a meta tag's 'content' attribute.
117
118 When Beautiful Soup parses the markup:
119 <meta http-equiv="content-type" content="text/html; charset=utf8">
120
121 The value of the 'content' attribute will be one of these objects.
122 """
123
124 CHARSET_RE = re.compile(r"((^|;)\s*charset=)([^;]*)", re.M)
125
126 def __new__(cls, original_value):
127 match = cls.CHARSET_RE.search(original_value)
128 if match is None:
129 # No substitution necessary.
130 return str.__new__(str, original_value)
131
132 obj = str.__new__(cls, original_value)
133 obj.original_value = original_value
134 return obj
135
136 def encode(self, encoding):
137 if encoding in PYTHON_SPECIFIC_ENCODINGS:
138 return ''
139 def rewrite(match):
140 return match.group(1) + encoding
141 return self.CHARSET_RE.sub(rewrite, self.original_value)
142
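The CHARSET_RE rewrite used above can be checked standalone: the regex captures everything up to the charset value in group 1, so substituting group(1) plus the new encoding rewrites only the charset token and leaves the rest of the content attribute intact. `rewrite_charset` is an illustrative wrapper, not bs4 API.

```python
import re

CHARSET_RE = re.compile(r"((^|;)\s*charset=)([^;]*)", re.M)

def rewrite_charset(content_value, encoding):
    # group(1) is e.g. '; charset='; append the new encoding after it.
    return CHARSET_RE.sub(lambda m: m.group(1) + encoding, content_value)
```

For example, rewriting `"text/html; charset=euc-jp"` to utf8 touches only the charset value.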
143
144class PageElement(object):
145 """Contains the navigational information for some part of the page:
146 that is, its current location in the parse tree.
147
148 NavigableString, Tag, etc. are all subclasses of PageElement.
149 """
150
151 # In general, we can't tell just by looking at an element whether
152 # it's contained in an XML document or an HTML document. But for
153 # Tags (q.v.) we can store this information at parse time.
154 known_xml = None
155
156 def setup(self, parent=None, previous_element=None, next_element=None,
157 previous_sibling=None, next_sibling=None):
158 """Sets up the initial relations between this element and
159 other elements.
160
161 :param parent: The parent of this element.
162
163 :param previous_element: The element parsed immediately before
164 this one.
165
166	        :param next_element: The element parsed immediately after
167 this one.
168
169 :param previous_sibling: The most recently encountered element
170 on the same level of the parse tree as this one.
171
172	        :param next_sibling: The next element to be encountered
173 on the same level of the parse tree as this one.
174 """
175 self.parent = parent
176
177 self.previous_element = previous_element
178 if previous_element is not None:
179 self.previous_element.next_element = self
180
181 self.next_element = next_element
182 if self.next_element is not None:
183 self.next_element.previous_element = self
184
185 self.next_sibling = next_sibling
186 if self.next_sibling is not None:
187 self.next_sibling.previous_sibling = self
188
189 if (previous_sibling is None
190 and self.parent is not None and self.parent.contents):
191 previous_sibling = self.parent.contents[-1]
192
193 self.previous_sibling = previous_sibling
194 if previous_sibling is not None:
195 self.previous_sibling.next_sibling = self
196
197 def format_string(self, s, formatter):
198 """Format the given string using the given formatter.
199
200 :param s: A string.
201 :param formatter: A Formatter object, or a string naming one of the standard formatters.
202 """
203 if formatter is None:
204 return s
205 if not isinstance(formatter, Formatter):
206 formatter = self.formatter_for_name(formatter)
207 output = formatter.substitute(s)
208 return output
209
210 def formatter_for_name(self, formatter):
211 """Look up or create a Formatter for the given identifier,
212 if necessary.
213
214 :param formatter: Can be a Formatter object (used as-is), a
215 function (used as the entity substitution hook for an
216 XMLFormatter or HTMLFormatter), or a string (used to look
217 up an XMLFormatter or HTMLFormatter in the appropriate
218 registry.
219 """
220 if isinstance(formatter, Formatter):
221 return formatter
222 if self._is_xml:
223 c = XMLFormatter
224 else:
225 c = HTMLFormatter
226 if isinstance(formatter, Callable):
227 return c(entity_substitution=formatter)
228 return c.REGISTRY[formatter]
229
230 @property
231 def _is_xml(self):
232 """Is this element part of an XML tree or an HTML tree?
233
234 This is used in formatter_for_name, when deciding whether an
235 XMLFormatter or HTMLFormatter is more appropriate. It can be
236 inefficient, but it should be called very rarely.
237 """
238 if self.known_xml is not None:
239 # Most of the time we will have determined this when the
240 # document is parsed.
241 return self.known_xml
242
243 # Otherwise, it's likely that this element was created by
244 # direct invocation of the constructor from within the user's
245 # Python code.
246 if self.parent is None:
247 # This is the top-level object. It should have .known_xml set
248 # from tree creation. If not, take a guess--BS is usually
249 # used on HTML markup.
250 return getattr(self, 'is_xml', False)
251 return self.parent._is_xml
252
253 nextSibling = _alias("next_sibling") # BS3
254 previousSibling = _alias("previous_sibling") # BS3
255
256 default = object()
257 def _all_strings(self, strip=False, types=default):
258 """Yield all strings of certain classes, possibly stripping them.
259
260 This is implemented differently in Tag and NavigableString.
261 """
262 raise NotImplementedError()
263
264 @property
265 def stripped_strings(self):
266 """Yield all strings in this PageElement, stripping them first.
267
268 :yield: A sequence of stripped strings.
269 """
270 for string in self._all_strings(True):
271 yield string
272
273 def get_text(self, separator="", strip=False,
274 types=default):
275 """Get all child strings of this PageElement, concatenated using the
276 given separator.
277
278 :param separator: Strings will be concatenated using this separator.
279
280 :param strip: If True, strings will be stripped before being
281 concatenated.
282
283 :param types: A tuple of NavigableString subclasses. Any
284 strings of a subclass not found in this list will be
285 ignored. Although there are exceptions, the default
286 behavior in most cases is to consider only NavigableString
287 and CData objects. That means no comments, processing
288 instructions, etc.
289
290 :return: A string.
291 """
292 return separator.join([s for s in self._all_strings(
293 strip, types=types)])
294 getText = get_text
295 text = property(get_text)
296
297 def replace_with(self, *args):
298 """Replace this PageElement with one or more PageElements, keeping the
299 rest of the tree the same.
300
301 :param args: One or more PageElements.
302 :return: `self`, no longer part of the tree.
303 """
304 if self.parent is None:
305 raise ValueError(
306 "Cannot replace one element with another when the "
307 "element to be replaced is not part of a tree.")
308 if len(args) == 1 and args[0] is self:
309 return
310 if any(x is self.parent for x in args):
311 raise ValueError("Cannot replace a Tag with its parent.")
312 old_parent = self.parent
313 my_index = self.parent.index(self)
314 self.extract(_self_index=my_index)
315 for idx, replace_with in enumerate(args, start=my_index):
316 old_parent.insert(idx, replace_with)
317 return self
318 replaceWith = replace_with # BS3
319
320 def unwrap(self):
321 """Replace this PageElement with its contents.
322
323 :return: `self`, no longer part of the tree.
324 """
325 my_parent = self.parent
326 if self.parent is None:
327 raise ValueError(
328	                "Cannot replace an element with its contents when that "
329 "element is not part of a tree.")
330 my_index = self.parent.index(self)
331 self.extract(_self_index=my_index)
332 for child in reversed(self.contents[:]):
333 my_parent.insert(my_index, child)
334 return self
335 replace_with_children = unwrap
336 replaceWithChildren = unwrap # BS3
337
338 def wrap(self, wrap_inside):
339 """Wrap this PageElement inside another one.
340
341 :param wrap_inside: A PageElement.
342 :return: `wrap_inside`, occupying the position in the tree that used
343 to be occupied by `self`, and with `self` inside it.
344 """
345 me = self.replace_with(wrap_inside)
346 wrap_inside.append(me)
347 return wrap_inside
348
349 def extract(self, _self_index=None):
350 """Destructively rips this element out of the tree.
351
352 :param _self_index: The location of this element in its parent's
353 .contents, if known. Passing this in allows for a performance
354 optimization.
355
356 :return: `self`, no longer part of the tree.
357 """
358 if self.parent is not None:
359 if _self_index is None:
360 _self_index = self.parent.index(self)
361 del self.parent.contents[_self_index]
362
363 #Find the two elements that would be next to each other if
364 #this element (and any children) hadn't been parsed. Connect
365 #the two.
366 last_child = self._last_descendant()
367 next_element = last_child.next_element
368
369 if (self.previous_element is not None and
370 self.previous_element is not next_element):
371 self.previous_element.next_element = next_element
372 if next_element is not None and next_element is not self.previous_element:
373 next_element.previous_element = self.previous_element
374 self.previous_element = None
375 last_child.next_element = None
376
377 self.parent = None
378 if (self.previous_sibling is not None
379 and self.previous_sibling is not self.next_sibling):
380 self.previous_sibling.next_sibling = self.next_sibling
381 if (self.next_sibling is not None
382 and self.next_sibling is not self.previous_sibling):
383 self.next_sibling.previous_sibling = self.previous_sibling
384 self.previous_sibling = self.next_sibling = None
385 return self
386
387 def _last_descendant(self, is_initialized=True, accept_self=True):
388 """Finds the last element beneath this object to be parsed.
389
390 :param is_initialized: Has `setup` been called on this PageElement
391 yet?
392 :param accept_self: Is `self` an acceptable answer to the question?
393 """
394 if is_initialized and self.next_sibling is not None:
395 last_child = self.next_sibling.previous_element
396 else:
397 last_child = self
398 while isinstance(last_child, Tag) and last_child.contents:
399 last_child = last_child.contents[-1]
400 if not accept_self and last_child is self:
401 last_child = None
402 return last_child
403 # BS3: Not part of the API!
404 _lastRecursiveChild = _last_descendant
405
406 def insert(self, position, new_child):
407 """Insert a new PageElement in the list of this PageElement's children.
408
409 This works the same way as `list.insert`.
410
411 :param position: The numeric position that should be occupied
412 in `self.children` by the new PageElement.
413 :param new_child: A PageElement.
414 """
415 if new_child is None:
416 raise ValueError("Cannot insert None into a tag.")
417 if new_child is self:
418 raise ValueError("Cannot insert a tag into itself.")
419 if (isinstance(new_child, str)
420 and not isinstance(new_child, NavigableString)):
421 new_child = NavigableString(new_child)
422
423 from bs4 import BeautifulSoup
424 if isinstance(new_child, BeautifulSoup):
425 # We don't want to end up with a situation where one BeautifulSoup
426 # object contains another. Insert the children one at a time.
427 for subchild in list(new_child.contents):
428 self.insert(position, subchild)
429 position += 1
430 return
431 position = min(position, len(self.contents))
432 if hasattr(new_child, 'parent') and new_child.parent is not None:
433 # We're 'inserting' an element that's already one
434 # of this object's children.
435 if new_child.parent is self:
436 current_index = self.index(new_child)
437 if current_index < position:
438 # We're moving this element further down the list
439 # of this object's children. That means that when
440 # we extract this element, our target index will
441 # jump down one.
442 position -= 1
443 new_child.extract()
444
445 new_child.parent = self
446 previous_child = None
447 if position == 0:
448 new_child.previous_sibling = None
449 new_child.previous_element = self
450 else:
451 previous_child = self.contents[position - 1]
452 new_child.previous_sibling = previous_child
453 new_child.previous_sibling.next_sibling = new_child
454 new_child.previous_element = previous_child._last_descendant(False)
455 if new_child.previous_element is not None:
456 new_child.previous_element.next_element = new_child
457
458 new_childs_last_element = new_child._last_descendant(False)
459
460 if position >= len(self.contents):
461 new_child.next_sibling = None
462
463 parent = self
464 parents_next_sibling = None
465 while parents_next_sibling is None and parent is not None:
466 parents_next_sibling = parent.next_sibling
467 parent = parent.parent
468 if parents_next_sibling is not None:
469 # We found the element that comes next in the document.
470 break
471 if parents_next_sibling is not None:
472 new_childs_last_element.next_element = parents_next_sibling
473 else:
474 # The last element of this tag is the last element in
475 # the document.
476 new_childs_last_element.next_element = None
477 else:
478 next_child = self.contents[position]
479 new_child.next_sibling = next_child
480 if new_child.next_sibling is not None:
481 new_child.next_sibling.previous_sibling = new_child
482 new_childs_last_element.next_element = next_child
483
484 if new_childs_last_element.next_element is not None:
485 new_childs_last_element.next_element.previous_element = new_childs_last_element
486 self.contents.insert(position, new_child)
487
488 def append(self, tag):
489 """Appends the given PageElement to the contents of this one.
490
491 :param tag: A PageElement.
492 """
493 self.insert(len(self.contents), tag)
494
495 def extend(self, tags):
496 """Appends the given PageElements to this one's contents.
497
498 :param tags: A list of PageElements. If a single Tag is
499 provided instead, this PageElement's contents will be extended
500 with that Tag's contents.
501 """
502 if isinstance(tags, Tag):
503 tags = tags.contents
504 if isinstance(tags, list):
505 # Moving items around the tree may change their position in
506 # the original list. Make a list that won't change.
507 tags = list(tags)
508 for tag in tags:
509 self.append(tag)
510
511 def insert_before(self, *args):
512 """Makes the given element(s) the immediate predecessor of this one.
513
514 All the elements will have the same parent, and the given elements
515 will be immediately before this one.
516
517 :param args: One or more PageElements.
518 """
519 parent = self.parent
520 if parent is None:
521 raise ValueError(
522 "Element has no parent, so 'before' has no meaning.")
523 if any(x is self for x in args):
524 raise ValueError("Can't insert an element before itself.")
525 for predecessor in args:
526 # Extract first so that the index won't be screwed up if they
527 # are siblings.
528 if isinstance(predecessor, PageElement):
529 predecessor.extract()
530 index = parent.index(self)
531 parent.insert(index, predecessor)
532
533 def insert_after(self, *args):
534 """Makes the given element(s) the immediate successor of this one.
535
536 The elements will have the same parent, and the given elements
537 will be immediately after this one.
538
539 :param args: One or more PageElements.
540 """
541 # Do all error checking before modifying the tree.
542 parent = self.parent
543 if parent is None:
544 raise ValueError(
545 "Element has no parent, so 'after' has no meaning.")
546 if any(x is self for x in args):
547 raise ValueError("Can't insert an element after itself.")
548
549 offset = 0
550 for successor in args:
551 # Extract first so that the index won't be screwed up if they
552 # are siblings.
553 if isinstance(successor, PageElement):
554 successor.extract()
555 index = parent.index(self)
556 parent.insert(index+1+offset, successor)
557 offset += 1
558
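The offset bookkeeping in insert_after() above can be seen with a plain list: each successor is placed at `index + 1 + offset`, so multiple arguments land in the order given, immediately after the anchor. The sketch below mimics `.extract()` by removing a successor that is already present before re-inserting it; `insert_after_sketch` is an illustrative name.

```python
def insert_after_sketch(items, anchor, *successors):
    """Insert each successor just after `anchor`, preserving argument order."""
    offset = 0
    for successor in successors:
        if successor in items:       # mimic .extract(): a listed item moves
            items.remove(successor)
        index = items.index(anchor)  # recompute, as the list may have shifted
        items.insert(index + 1 + offset, successor)
        offset += 1
    return items
```

Without the growing offset, every successor would be inserted at the same slot and come out reversed; recomputing the anchor's index each pass handles the case where extraction shifted it.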
559 def find_next(self, name=None, attrs={}, string=None, **kwargs):
560 """Find the first PageElement that matches the given criteria and
561 appears later in the document than this PageElement.
562
563 All find_* methods take a common set of arguments. See the online
564 documentation for detailed explanations.
565
566 :param name: A filter on tag name.
567 :param attrs: A dictionary of filters on attribute values.
568 :param string: A filter for a NavigableString with specific text.
569 :kwargs: A dictionary of filters on attribute values.
570 :return: A PageElement.
571 :rtype: bs4.element.Tag | bs4.element.NavigableString
572 """
573 return self._find_one(self.find_all_next, name, attrs, string, **kwargs)
574 findNext = find_next # BS3
575
576 def find_all_next(self, name=None, attrs={}, string=None, limit=None,
577 **kwargs):
578 """Find all PageElements that match the given criteria and appear
579 later in the document than this PageElement.
580
581 All find_* methods take a common set of arguments. See the online
582 documentation for detailed explanations.
583
584 :param name: A filter on tag name.
585 :param attrs: A dictionary of filters on attribute values.
586 :param string: A filter for a NavigableString with specific text.
587 :param limit: Stop looking after finding this many results.
588 :kwargs: A dictionary of filters on attribute values.
589 :return: A ResultSet containing PageElements.
590 """
591 _stacklevel = kwargs.pop('_stacklevel', 2)
592 return self._find_all(name, attrs, string, limit, self.next_elements,
593 _stacklevel=_stacklevel+1, **kwargs)
594 findAllNext = find_all_next # BS3
595
596 def find_next_sibling(self, name=None, attrs={}, string=None, **kwargs):
597 """Find the closest sibling to this PageElement that matches the
598 given criteria and appears later in the document.
599
600 All find_* methods take a common set of arguments. See the
601 online documentation for detailed explanations.
602
603 :param name: A filter on tag name.
604 :param attrs: A dictionary of filters on attribute values.
605 :param string: A filter for a NavigableString with specific text.
606 :kwargs: A dictionary of filters on attribute values.
607 :return: A PageElement.
608 :rtype: bs4.element.Tag | bs4.element.NavigableString
609 """
610 return self._find_one(self.find_next_siblings, name, attrs, string,
611 **kwargs)
612 findNextSibling = find_next_sibling # BS3
613
614 def find_next_siblings(self, name=None, attrs={}, string=None, limit=None,
615 **kwargs):
616 """Find all siblings of this PageElement that match the given criteria
617 and appear later in the document.
618
619 All find_* methods take a common set of arguments. See the online
620 documentation for detailed explanations.
621
622 :param name: A filter on tag name.
623 :param attrs: A dictionary of filters on attribute values.
624 :param string: A filter for a NavigableString with specific text.
625 :param limit: Stop looking after finding this many results.
626 :kwargs: A dictionary of filters on attribute values.
627 :return: A ResultSet of PageElements.
628 :rtype: bs4.element.ResultSet
629 """
630 _stacklevel = kwargs.pop('_stacklevel', 2)
631 return self._find_all(
632 name, attrs, string, limit,
633 self.next_siblings, _stacklevel=_stacklevel+1, **kwargs
634 )
635 findNextSiblings = find_next_siblings # BS3
636 fetchNextSiblings = find_next_siblings # BS2
637
638 def find_previous(self, name=None, attrs={}, string=None, **kwargs):
639 """Look backwards in the document from this PageElement and find the
640 first PageElement that matches the given criteria.
641
642 All find_* methods take a common set of arguments. See the online
643 documentation for detailed explanations.
644
645 :param name: A filter on tag name.
646 :param attrs: A dictionary of filters on attribute values.
647 :param string: A filter for a NavigableString with specific text.
648 :kwargs: A dictionary of filters on attribute values.
649 :return: A PageElement.
650 :rtype: bs4.element.Tag | bs4.element.NavigableString
651 """
652 return self._find_one(
653 self.find_all_previous, name, attrs, string, **kwargs)
654 findPrevious = find_previous # BS3
655
656 def find_all_previous(self, name=None, attrs={}, string=None, limit=None,
657 **kwargs):
658 """Look backwards in the document from this PageElement and find all
659 PageElements that match the given criteria.
660
661 All find_* methods take a common set of arguments. See the online
662 documentation for detailed explanations.
663
664 :param name: A filter on tag name.
665 :param attrs: A dictionary of filters on attribute values.
666 :param string: A filter for a NavigableString with specific text.
667 :param limit: Stop looking after finding this many results.
668 :kwargs: A dictionary of filters on attribute values.
669 :return: A ResultSet of PageElements.
670 :rtype: bs4.element.ResultSet
671 """
672 _stacklevel = kwargs.pop('_stacklevel', 2)
673 return self._find_all(
674 name, attrs, string, limit, self.previous_elements,
675 _stacklevel=_stacklevel+1, **kwargs
676 )
677 findAllPrevious = find_all_previous # BS3
678 fetchPrevious = find_all_previous # BS2
679
680 def find_previous_sibling(self, name=None, attrs={}, string=None, **kwargs):
681 """Returns the closest sibling to this PageElement that matches the
682 given criteria and appears earlier in the document.
683
684 All find_* methods take a common set of arguments. See the online
685 documentation for detailed explanations.
686
687 :param name: A filter on tag name.
688 :param attrs: A dictionary of filters on attribute values.
689 :param string: A filter for a NavigableString with specific text.
690 :kwargs: A dictionary of filters on attribute values.
691 :return: A PageElement.
692 :rtype: bs4.element.Tag | bs4.element.NavigableString
693 """
694 return self._find_one(self.find_previous_siblings, name, attrs, string,
695 **kwargs)
696 findPreviousSibling = find_previous_sibling # BS3
697
698 def find_previous_siblings(self, name=None, attrs={}, string=None,
699 limit=None, **kwargs):
700 """Returns all siblings to this PageElement that match the
701 given criteria and appear earlier in the document.
702
703 All find_* methods take a common set of arguments. See the online
704 documentation for detailed explanations.
705
706 :param name: A filter on tag name.
707 :param attrs: A dictionary of filters on attribute values.
708 :param string: A filter for a NavigableString with specific text.
709 :param limit: Stop looking after finding this many results.
710 :kwargs: A dictionary of filters on attribute values.
711 :return: A ResultSet of PageElements.
712 :rtype: bs4.element.ResultSet
713 """
714 _stacklevel = kwargs.pop('_stacklevel', 2)
715 return self._find_all(
716 name, attrs, string, limit,
717 self.previous_siblings, _stacklevel=_stacklevel+1, **kwargs
718 )
719 findPreviousSiblings = find_previous_siblings # BS3
720 fetchPreviousSiblings = find_previous_siblings # BS2
721
722 def find_parent(self, name=None, attrs={}, **kwargs):
723 """Find the closest parent of this PageElement that matches the given
724 criteria.
725
726 All find_* methods take a common set of arguments. See the online
727 documentation for detailed explanations.
728
729 :param name: A filter on tag name.
730 :param attrs: A dictionary of filters on attribute values.
731 :kwargs: A dictionary of filters on attribute values.
732
733 :return: A PageElement.
734 :rtype: bs4.element.Tag | bs4.element.NavigableString
735 """
736 # NOTE: We can't use _find_one because findParents takes a different
737 # set of arguments.
738        # set of arguments.
739        r = None
740        found = self.find_parents(name, attrs, 1, _stacklevel=3, **kwargs)
741        if found:
742 return r
743 findParent = find_parent # BS3
744
745 def find_parents(self, name=None, attrs={}, limit=None, **kwargs):
746 """Find all parents of this PageElement that match the given criteria.
747
748 All find_* methods take a common set of arguments. See the online
749 documentation for detailed explanations.
750
751 :param name: A filter on tag name.
752 :param attrs: A dictionary of filters on attribute values.
753 :param limit: Stop looking after finding this many results.
754 :kwargs: A dictionary of filters on attribute values.
755
756        :return: A ResultSet of PageElements.
757        :rtype: bs4.element.ResultSet
758 """
759 _stacklevel = kwargs.pop('_stacklevel', 2)
760 return self._find_all(name, attrs, None, limit, self.parents,
761 _stacklevel=_stacklevel+1, **kwargs)
762 findParents = find_parents # BS3
763 fetchParents = find_parents # BS2
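The ancestor-search methods above are easiest to see in action. A short usage sketch, assuming the standard `bs4` package and the stdlib `html.parser` builder:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<div id="outer"><div id="inner"><b>text</b></div></div>',
    "html.parser",
)
b = soup.find("b")

# find_parent returns the closest matching ancestor...
closest = b.find_parent("div")

# ...while find_parents returns all matching ancestors, closest first.
all_divs = b.find_parents("div")
```

Because `find_parents` walks the `parents` generator, results are ordered innermost to outermost.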
764
765 @property
766 def next(self):
767 """The PageElement, if any, that was parsed just after this one.
768
769 :return: A PageElement.
770 :rtype: bs4.element.Tag | bs4.element.NavigableString
771 """
772 return self.next_element
773
774 @property
775 def previous(self):
776 """The PageElement, if any, that was parsed just before this one.
777
778 :return: A PageElement.
779 :rtype: bs4.element.Tag | bs4.element.NavigableString
780 """
781 return self.previous_element
782
783    # These methods do the real heavy lifting.
784
785 def _find_one(self, method, name, attrs, string, **kwargs):
786        r = None
787        found = method(name, attrs, string, 1, _stacklevel=4, **kwargs)
788        if found:
789            r = found[0]
790 return r
791
792 def _find_all(self, name, attrs, string, limit, generator, **kwargs):
793 "Iterates over a generator looking for things that match."
794 _stacklevel = kwargs.pop('_stacklevel', 3)
795
796 if string is None and 'text' in kwargs:
797 string = kwargs.pop('text')
798 warnings.warn(
799 "The 'text' argument to find()-type methods is deprecated. Use 'string' instead.",
800 DeprecationWarning, stacklevel=_stacklevel
801 )
802
803 if isinstance(name, SoupStrainer):
804 strainer = name
805 else:
806 strainer = SoupStrainer(name, attrs, string, **kwargs)
807
808 if string is None and not limit and not attrs and not kwargs:
809 if name is True or name is None:
810 # Optimization to find all tags.
811 result = (element for element in generator
812 if isinstance(element, Tag))
813 return ResultSet(strainer, result)
814 elif isinstance(name, str):
815 # Optimization to find all tags with a given name.
816 if name.count(':') == 1:
817 # This is a name with a prefix. If this is a namespace-aware document,
818 # we need to match the local name against tag.name. If not,
819 # we need to match the fully-qualified name against tag.name.
820 prefix, local_name = name.split(':', 1)
821 else:
822 prefix = None
823 local_name = name
824                result = (element for element in generator
825                          if isinstance(element, Tag)
826                          and (
827                              element.name == name
828                              or (
829                                  element.name == local_name
830                                  and (prefix is None or element.prefix == prefix)
831                              )
832                          ))
833 return ResultSet(strainer, result)
834 results = ResultSet(strainer)
835 while True:
836 try:
837 i = next(generator)
838 except StopIteration:
839 break
840 if i:
841 found = strainer.search(i)
842 if found:
843 results.append(found)
844 if limit and len(results) >= limit:
845 break
846 return results
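The fallback loop at the end of `_find_all` amounts to filtering a lazy generator and stopping early once `limit` matches have been collected. A minimal standalone sketch of that pattern (the `find_matching` helper is hypothetical, not part of bs4):

```python
def find_matching(generator, matches, limit=None):
    """Collect items from `generator` that satisfy `matches`,
    stopping once `limit` results have been found."""
    results = []
    for item in generator:
        if matches(item):
            results.append(item)
            if limit and len(results) >= limit:
                # Early exit: the rest of the generator is never consumed.
                break
    return results

evens = find_matching(iter(range(100)), lambda n: n % 2 == 0, limit=3)
# evens == [0, 2, 4]
```

Like the real method, the sketch never materializes the whole generator, which matters when searching a large parse tree for only a few elements.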
847
848    # These generators can be used to navigate starting from both
849    # NavigableStrings and Tags.
850 @property
851 def next_elements(self):
852 """All PageElements that were parsed after this one.
853
854 :yield: A sequence of PageElements.
855 """
856 i = self.next_element
857 while i is not None:
858 yield i
859 i = i.next_element
860
861 @property
862 def next_siblings(self):
863 """All PageElements that are siblings of this one but were parsed
864 later.
865
866 :yield: A sequence of PageElements.
867 """
868 i = self.next_sibling
869 while i is not None:
870 yield i
871 i = i.next_sibling
872
873 @property
874 def previous_elements(self):
875 """All PageElements that were parsed before this one.
876
877 :yield: A sequence of PageElements.
878 """
879 i = self.previous_element
880 while i is not None:
881 yield i
882 i = i.previous_element
883
884 @property
885 def previous_siblings(self):
886 """All PageElements that are siblings of this one but were parsed
887 earlier.
888
889 :yield: A sequence of PageElements.
890 """
891 i = self.previous_sibling
892 while i is not None:
893 yield i
894 i = i.previous_sibling
895
896 @property
897 def parents(self):
898 """All PageElements that are parents of this PageElement.
899
900 :yield: A sequence of PageElements.
901 """
902 i = self.parent
903 while i is not None:
904 yield i
905 i = i.parent
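The navigation generators above simply follow the tree's sibling and parent links. A quick usage sketch (assuming the standard `bs4` package and `html.parser`):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<a>1</a><b>2</b><c>3</c>", "html.parser")

# next_siblings walks forward from an element; previous_siblings walks
# backward, so its results come out in reverse document order.
after_a = [tag.name for tag in soup.a.next_siblings]
before_c = [tag.name for tag in soup.c.previous_siblings]
```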
906
907 @property
908 def decomposed(self):
909 """Check whether a PageElement has been decomposed.
910
911 :rtype: bool
912 """
913 return getattr(self, '_decomposed', False) or False
914
915 # Old non-property versions of the generators, for backwards
916 # compatibility with BS3.
917 def nextGenerator(self):
918 return self.next_elements
919
920 def nextSiblingGenerator(self):
921 return self.next_siblings
922
923 def previousGenerator(self):
924 return self.previous_elements
925
926 def previousSiblingGenerator(self):
927 return self.previous_siblings
928
929 def parentGenerator(self):
930 return self.parents
931
932
933class NavigableString(str, PageElement):
934 """A Python Unicode string that is part of a parse tree.
935
936 When Beautiful Soup parses the markup <b>penguin</b>, it will
937 create a NavigableString for the string "penguin".
938 """
939
940 PREFIX = ''
941 SUFFIX = ''
942
943 def __new__(cls, value):
944 """Create a new NavigableString.
945
946 When unpickling a NavigableString, this method is called with
947 the string in DEFAULT_OUTPUT_ENCODING. That encoding needs to be
948 passed in to the superclass's __new__ or the superclass won't know
949 how to handle non-ASCII characters.
950 """
951 if isinstance(value, str):
952 u = str.__new__(cls, value)
953 else:
954 u = str.__new__(cls, value, DEFAULT_OUTPUT_ENCODING)
955 u.setup()
956 return u
957
958 def __deepcopy__(self, memo, recursive=False):
959 """A copy of a NavigableString has the same contents and class
960 as the original, but it is not connected to the parse tree.
961
962 :param recursive: This parameter is ignored; it's only defined
963 so that NavigableString.__deepcopy__ implements the same
964 signature as Tag.__deepcopy__.
965 """
966 return type(self)(self)
967
968 def __copy__(self):
969 """A copy of a NavigableString can only be a deep copy, because
970 only one PageElement can occupy a given place in a parse tree.
971 """
972 return self.__deepcopy__({})
973
974 def __getnewargs__(self):
975 return (str(self),)
976
977 def __getattr__(self, attr):
978 """text.string gives you text. This is for backwards
979 compatibility for Navigable*String, but for CData* it lets you
980 get the string without the CData wrapper."""
981 if attr == 'string':
982 return self
983 else:
984 raise AttributeError(
985 "'%s' object has no attribute '%s'" % (
986 self.__class__.__name__, attr))
987
988 def output_ready(self, formatter="minimal"):
989 """Run the string through the provided formatter.
990
991 :param formatter: A Formatter object, or a string naming one of the standard formatters.
992 """
993 output = self.format_string(self, formatter)
994 return self.PREFIX + output + self.SUFFIX
995
996 @property
997 def name(self):
998 """Since a NavigableString is not a Tag, it has no .name.
999
1000 This property is implemented so that code like this doesn't crash
1001 when run on a mixture of Tag and NavigableString objects:
1002 [x.name for x in tag.children]
1003 """
1004 return None
1005
1006 @name.setter
1007 def name(self, name):
1008 """Prevent NavigableString.name from ever being set."""
1009 raise AttributeError("A NavigableString cannot be given a name.")
1010
1011 def _all_strings(self, strip=False, types=PageElement.default):
1012 """Yield all strings of certain classes, possibly stripping them.
1013
1014 This makes it easy for NavigableString to implement methods
1015 like get_text() as conveniences, creating a consistent
1016 text-extraction API across all PageElements.
1017
1018 :param strip: If True, all strings will be stripped before being
1019 yielded.
1020
1021 :param types: A tuple of NavigableString subclasses. If this
1022 NavigableString isn't one of those subclasses, the
1023 sequence will be empty. By default, the subclasses
1024 considered are NavigableString and CData objects. That
1025 means no comments, processing instructions, etc.
1026
1027 :yield: A sequence that either contains this string, or is empty.
1028
1029 """
1030 if types is self.default:
1031 # This is kept in Tag because it's full of subclasses of
1032 # this class, which aren't defined until later in the file.
1033 types = Tag.DEFAULT_INTERESTING_STRING_TYPES
1034
1035 # Do nothing if the caller is looking for specific types of
1036 # string, and we're of a different type.
1037 #
1038 # We check specific types instead of using isinstance(self,
1039 # types) because all of these classes subclass
1040 # NavigableString. Anyone who's using this feature probably
1041 # wants generic NavigableStrings but not other stuff.
1042 my_type = type(self)
1043 if types is not None:
1044 if isinstance(types, type):
1045 # Looking for a single type.
1046 if my_type is not types:
1047 return
1048 elif my_type not in types:
1049 # Looking for one of a list of types.
1050 return
1051
1052 value = self
1053 if strip:
1054 value = value.strip()
1055 if len(value) > 0:
1056 yield value
1057 strings = property(_all_strings)
1058
1059class PreformattedString(NavigableString):
1060 """A NavigableString not subject to the normal formatting rules.
1061
1062 This is an abstract class used for special kinds of strings such
1063 as comments (the Comment class) and CDATA blocks (the CData
1064 class).
1065 """
1066
1067 PREFIX = ''
1068 SUFFIX = ''
1069
1070 def output_ready(self, formatter=None):
1071 """Make this string ready for output by adding any subclass-specific
1072 prefix or suffix.
1073
1074 :param formatter: A Formatter object, or a string naming one
1075 of the standard formatters. The string will be passed into the
1076 Formatter, but only to trigger any side effects: the return
1077 value is ignored.
1078
1079 :return: The string, with any subclass-specific prefix and
1080 suffix added on.
1081 """
1082 if formatter is not None:
1083 ignore = self.format_string(self, formatter)
1084 return self.PREFIX + self + self.SUFFIX
1085
1086class CData(PreformattedString):
1087 """A CDATA block."""
1088 PREFIX = '<![CDATA['
1089 SUFFIX = ']]>'
1090
1091class ProcessingInstruction(PreformattedString):
1092    """An SGML processing instruction."""
1093
1094 PREFIX = '<?'
1095 SUFFIX = '>'
1096
1097class XMLProcessingInstruction(ProcessingInstruction):
1098 """An XML processing instruction."""
1099 PREFIX = '<?'
1100 SUFFIX = '?>'
1101
1102class Comment(PreformattedString):
1103 """An HTML or XML comment."""
1104 PREFIX = '<!--'
1105 SUFFIX = '-->'
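Each `PreformattedString` subclass simply brackets its raw text with a class-level `PREFIX` and `SUFFIX` at output time, skipping entity substitution. A stripped-down sketch of that mechanism (standalone toy classes, not the real bs4 ones):

```python
class Wrapped(str):
    PREFIX = ""
    SUFFIX = ""

    def output_ready(self):
        # Bracket the raw text; no entity substitution is applied.
        return self.PREFIX + self + self.SUFFIX

class ToyComment(Wrapped):
    PREFIX = "<!--"
    SUFFIX = "-->"

class ToyCData(Wrapped):
    PREFIX = "<![CDATA["
    SUFFIX = "]]>"

print(ToyComment("hi").output_ready())   # <!--hi-->
print(ToyCData("x < y").output_ready())  # <![CDATA[x < y]]>
```

Note that the `<` in the CData body passes through untouched, which is exactly why these strings bypass the normal formatters.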
1106
1107
1108class Declaration(PreformattedString):
1109 """An XML declaration."""
1110 PREFIX = '<?'
1111 SUFFIX = '?>'
1112
1113
1114class Doctype(PreformattedString):
1115 """A document type declaration."""
1116 @classmethod
1117 def for_name_and_ids(cls, name, pub_id, system_id):
1118 """Generate an appropriate document type declaration for a given
1119 public ID and system ID.
1120
1121 :param name: The name of the document's root element, e.g. 'html'.
1122 :param pub_id: The Formal Public Identifier for this document type,
1123 e.g. '-//W3C//DTD XHTML 1.1//EN'
1124 :param system_id: The system identifier for this document type,
1125 e.g. 'http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd'
1126
1127 :return: A Doctype.
1128 """
1129 value = name or ''
1130 if pub_id is not None:
1131 value += ' PUBLIC "%s"' % pub_id
1132 if system_id is not None:
1133 value += ' "%s"' % system_id
1134 elif system_id is not None:
1135 value += ' SYSTEM "%s"' % system_id
1136
1137 return Doctype(value)
1138
1139 PREFIX = '<!DOCTYPE '
1140 SUFFIX = '>\n'
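The string built by `for_name_and_ids` follows the standard DOCTYPE grammar: a PUBLIC declaration may carry both identifiers, while SYSTEM carries only the system identifier. A standalone re-creation of that logic (a hypothetical `doctype_value` helper mirroring the method above):

```python
def doctype_value(name, pub_id=None, system_id=None):
    # PUBLIC takes the public ID, optionally followed by the system ID;
    # SYSTEM takes the system ID alone.
    value = name or ""
    if pub_id is not None:
        value += ' PUBLIC "%s"' % pub_id
        if system_id is not None:
            value += ' "%s"' % system_id
    elif system_id is not None:
        value += ' SYSTEM "%s"' % system_id
    return value

print(doctype_value("html"))
# html
print(doctype_value("html", "-//W3C//DTD XHTML 1.1//EN"))
# html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
```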
1141
1142
1143class Stylesheet(NavigableString):
1144    """A NavigableString representing a stylesheet (probably
1145 CSS).
1146
1147 Used to distinguish embedded stylesheets from textual content.
1148 """
1149 pass
1150
1151
1152class Script(NavigableString):
1153 """A NavigableString representing an executable script (probably
1154    JavaScript).
1155
1156 Used to distinguish executable code from textual content.
1157 """
1158 pass
1159
1160
1161class TemplateString(NavigableString):
1162 """A NavigableString representing a string found inside an HTML
1163 template embedded in a larger document.
1164
1165 Used to distinguish such strings from the main body of the document.
1166 """
1167 pass
1168
1169
1170class RubyTextString(NavigableString):
1171 """A NavigableString representing the contents of the <rt> HTML
1172 element.
1173
1174 https://dev.w3.org/html5/spec-LC/text-level-semantics.html#the-rt-element
1175
1176 Can be used to distinguish such strings from the strings they're
1177 annotating.
1178 """
1179 pass
1180
1181
1182class RubyParenthesisString(NavigableString):
1183 """A NavigableString representing the contents of the <rp> HTML
1184 element.
1185
1186 https://dev.w3.org/html5/spec-LC/text-level-semantics.html#the-rp-element
1187 """
1188 pass
1189
1190
1191class Tag(PageElement):
1192 """Represents an HTML or XML tag that is part of a parse tree, along
1193 with its attributes and contents.
1194
1195 When Beautiful Soup parses the markup <b>penguin</b>, it will
1196 create a Tag object representing the <b> tag.
1197 """
1198
1199 def __init__(self, parser=None, builder=None, name=None, namespace=None,
1200 prefix=None, attrs=None, parent=None, previous=None,
1201 is_xml=None, sourceline=None, sourcepos=None,
1202 can_be_empty_element=None, cdata_list_attributes=None,
1203 preserve_whitespace_tags=None,
1204 interesting_string_types=None,
1205 namespaces=None
1206 ):
1207 """Basic constructor.
1208
1209 :param parser: A BeautifulSoup object.
1210 :param builder: A TreeBuilder.
1211 :param name: The name of the tag.
1212 :param namespace: The URI of this Tag's XML namespace, if any.
1213 :param prefix: The prefix for this Tag's XML namespace, if any.
1214 :param attrs: A dictionary of this Tag's attribute values.
1215 :param parent: The PageElement to use as this Tag's parent.
1216 :param previous: The PageElement that was parsed immediately before
1217 this tag.
1218 :param is_xml: If True, this is an XML tag. Otherwise, this is an
1219 HTML tag.
1220 :param sourceline: The line number where this tag was found in its
1221 source document.
1222 :param sourcepos: The character position within `sourceline` where this
1223 tag was found.
1224 :param can_be_empty_element: If True, this tag should be
1225 represented as <tag/>. If False, this tag should be represented
1226 as <tag></tag>.
1227 :param cdata_list_attributes: A list of attributes whose values should
1228 be treated as CDATA if they ever show up on this tag.
1229 :param preserve_whitespace_tags: A list of tag names whose contents
1230 should have their whitespace preserved.
1231 :param interesting_string_types: This is a NavigableString
1232 subclass or a tuple of them. When iterating over this
1233 Tag's strings in methods like Tag.strings or Tag.get_text,
1234 these are the types of strings that are interesting enough
1235 to be considered. The default is to consider
1236 NavigableString and CData the only interesting string
1237 subtypes.
1238 :param namespaces: A dictionary mapping currently active
1239 namespace prefixes to URIs. This can be used later to
1240 construct CSS selectors.
1241 """
1242 if parser is None:
1243 self.parser_class = None
1244 else:
1245 # We don't actually store the parser object: that lets extracted
1246 # chunks be garbage-collected.
1247 self.parser_class = parser.__class__
1248 if name is None:
1249 raise ValueError("No value provided for new tag's name.")
1250 self.name = name
1251 self.namespace = namespace
1252 self._namespaces = namespaces or {}
1253 self.prefix = prefix
1254 if ((not builder or builder.store_line_numbers)
1255 and (sourceline is not None or sourcepos is not None)):
1256 self.sourceline = sourceline
1257 self.sourcepos = sourcepos
1258 if attrs is None:
1259 attrs = {}
1260 elif attrs:
1261 if builder is not None and builder.cdata_list_attributes:
1262 attrs = builder._replace_cdata_list_attribute_values(
1263 self.name, attrs)
1264 else:
1265 attrs = dict(attrs)
1266 else:
1267 attrs = dict(attrs)
1268
1269 # If possible, determine ahead of time whether this tag is an
1270 # XML tag.
1271 if builder:
1272 self.known_xml = builder.is_xml
1273 else:
1274 self.known_xml = is_xml
1275 self.attrs = attrs
1276 self.contents = []
1277 self.setup(parent, previous)
1278 self.hidden = False
1279
1280 if builder is None:
1281 # In the absence of a TreeBuilder, use whatever values were
1282 # passed in here. They're probably None, unless this is a copy of some
1283 # other tag.
1284 self.can_be_empty_element = can_be_empty_element
1285 self.cdata_list_attributes = cdata_list_attributes
1286 self.preserve_whitespace_tags = preserve_whitespace_tags
1287 self.interesting_string_types = interesting_string_types
1288 else:
1289 # Set up any substitutions for this tag, such as the charset in a META tag.
1290 builder.set_up_substitutions(self)
1291
1292 # Ask the TreeBuilder whether this tag might be an empty-element tag.
1293 self.can_be_empty_element = builder.can_be_empty_element(name)
1294
1295 # Keep track of the list of attributes of this tag that
1296 # might need to be treated as a list.
1297 #
1298 # For performance reasons, we store the whole data structure
1299 # rather than asking the question of every tag. Asking would
1300 # require building a new data structure every time, and
1301 # (unlike can_be_empty_element), we almost never need
1302 # to check this.
1303 self.cdata_list_attributes = builder.cdata_list_attributes
1304
1305 # Keep track of the names that might cause this tag to be treated as a
1306 # whitespace-preserved tag.
1307 self.preserve_whitespace_tags = builder.preserve_whitespace_tags
1308
1309 if self.name in builder.string_containers:
1310 # This sort of tag uses a special string container
1311                # subclass for most of its strings.
1312 self.interesting_string_types = builder.string_containers[self.name]
1313 else:
1314 self.interesting_string_types = self.DEFAULT_INTERESTING_STRING_TYPES
1315
1316 parserClass = _alias("parser_class") # BS3
1317
1318 def __deepcopy__(self, memo, recursive=True):
1319 """A deepcopy of a Tag is a new Tag, unconnected to the parse tree.
1320 Its contents are a copy of the old Tag's contents.
1321 """
1322 clone = self._clone()
1323
1324 if recursive:
1325 # Clone this tag's descendants recursively, but without
1326 # making any recursive function calls.
1327 tag_stack = [clone]
1328 for event, element in self._event_stream(self.descendants):
1329 if event is Tag.END_ELEMENT_EVENT:
1330 # Stop appending incoming Tags to the Tag that was
1331 # just closed.
1332 tag_stack.pop()
1333 else:
1334 descendant_clone = element.__deepcopy__(
1335 memo, recursive=False
1336 )
1337 # Add to its parent's .contents
1338 tag_stack[-1].append(descendant_clone)
1339
1340 if event is Tag.START_ELEMENT_EVENT:
1341 # Add the Tag itself to the stack so that its
1342 # children will be .appended to it.
1343 tag_stack.append(descendant_clone)
1344 return clone
1345
1346 def __copy__(self):
1347 """A copy of a Tag must always be a deep copy, because a Tag's
1348 children can only have one parent at a time.
1349 """
1350 return self.__deepcopy__({})
1351
1352 def _clone(self):
1353 """Create a new Tag just like this one, but with no
1354 contents and unattached to any parse tree.
1355
1356 This is the first step in the deepcopy process.
1357 """
1358 clone = type(self)(
1359 None, None, self.name, self.namespace,
1360 self.prefix, self.attrs, is_xml=self._is_xml,
1361 sourceline=self.sourceline, sourcepos=self.sourcepos,
1362 can_be_empty_element=self.can_be_empty_element,
1363 cdata_list_attributes=self.cdata_list_attributes,
1364 preserve_whitespace_tags=self.preserve_whitespace_tags,
1365 interesting_string_types=self.interesting_string_types
1366 )
1367 for attr in ('can_be_empty_element', 'hidden'):
1368 setattr(clone, attr, getattr(self, attr))
1369 return clone
1370
1371 @property
1372 def is_empty_element(self):
1373 """Is this tag an empty-element tag? (aka a self-closing tag)
1374
1375 A tag that has contents is never an empty-element tag.
1376
1377 A tag that has no contents may or may not be an empty-element
1378 tag. It depends on the builder used to create the tag. If the
1379 builder has a designated list of empty-element tags, then only
1380 a tag whose name shows up in that list is considered an
1381 empty-element tag.
1382
1383 If the builder has no designated list of empty-element tags,
1384 then any tag with no contents is an empty-element tag.
1385 """
1386 return len(self.contents) == 0 and self.can_be_empty_element
1387 isSelfClosing = is_empty_element # BS3
1388
1389 @property
1390 def string(self):
1391 """Convenience property to get the single string within this
1392 PageElement.
1393
1394 TODO It might make sense to have NavigableString.string return
1395 itself.
1396
1397 :return: If this element has a single string child, return
1398 value is that string. If this element has one child tag,
1399 return value is the 'string' attribute of the child tag,
1400 recursively. If this element is itself a string, has no
1401 children, or has more than one child, return value is None.
1402 """
1403 if len(self.contents) != 1:
1404 return None
1405 child = self.contents[0]
1406 if isinstance(child, NavigableString):
1407 return child
1408 return child.string
1409
1410 @string.setter
1411 def string(self, string):
1412 """Replace this PageElement's contents with `string`."""
1413 self.clear()
1414 self.append(string.__class__(string))
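The one-child rule implemented by the `string` property is easiest to see by example (a usage sketch, assuming the standard `bs4` package and `html.parser`):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<b>one</b><i><u>two</u></i><p>x<q>y</q></p>", "html.parser")

only = soup.b.string        # single string child: returned directly
nested = soup.i.string      # one child tag: recurses into it
ambiguous = soup.p.string   # two children: None
```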
1415
1416 DEFAULT_INTERESTING_STRING_TYPES = (NavigableString, CData)
1417 def _all_strings(self, strip=False, types=PageElement.default):
1418 """Yield all strings of certain classes, possibly stripping them.
1419
1420 :param strip: If True, all strings will be stripped before being
1421 yielded.
1422
1423 :param types: A tuple of NavigableString subclasses. Any strings of
1424 a subclass not found in this list will be ignored. By
1425 default, the subclasses considered are the ones found in
1426 self.interesting_string_types. If that's not specified,
1427 only NavigableString and CData objects will be
1428 considered. That means no comments, processing
1429 instructions, etc.
1430
1431 :yield: A sequence of strings.
1432
1433 """
1434 if types is self.default:
1435 types = self.interesting_string_types
1436
1437 for descendant in self.descendants:
1438 if (types is None and not isinstance(descendant, NavigableString)):
1439 continue
1440 descendant_type = type(descendant)
1441 if isinstance(types, type):
1442 if descendant_type is not types:
1443 # We're not interested in strings of this type.
1444 continue
1445 elif types is not None and descendant_type not in types:
1446 # We're not interested in strings of this type.
1447 continue
1448 if strip:
1449 descendant = descendant.strip()
1450 if len(descendant) == 0:
1451 continue
1452 yield descendant
1453 strings = property(_all_strings)
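Because the default string types are `NavigableString` and `CData`, iterating a Tag's strings silently skips comments, processing instructions, and the like. A usage sketch (assuming the standard `bs4` package and `html.parser`):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div> a <!--note--><b>b</b></div>", "html.parser")

# The comment is not in DEFAULT_INTERESTING_STRING_TYPES, so it's skipped.
texts = list(soup.div.strings)
stripped = list(soup.div.stripped_strings)
```

`stripped_strings` applies the `strip=True` path shown above, dropping strings that become empty after stripping.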
1454
1455 def decompose(self):
1456 """Recursively destroys this PageElement and its children.
1457
1458 This element will be removed from the tree and wiped out; so
1459 will everything beneath it.
1460
1461 The behavior of a decomposed PageElement is undefined and you
1462 should never use one for anything, but if you need to _check_
1463 whether an element has been decomposed, you can use the
1464 `decomposed` property.
1465 """
1466 self.extract()
1467 i = self
1468 while i is not None:
1469 n = i.next_element
1470 i.__dict__.clear()
1471 i.contents = []
1472 i._decomposed = True
1473 i = n
1474
1475 def clear(self, decompose=False):
1476 """Wipe out all children of this PageElement by calling extract()
1477 on them.
1478
1479 :param decompose: If this is True, decompose() (a more
1480 destructive method) will be called instead of extract().
1481 """
1482 if decompose:
1483 for element in self.contents[:]:
1484 if isinstance(element, Tag):
1485 element.decompose()
1486 else:
1487 element.extract()
1488 else:
1489 for element in self.contents[:]:
1490 element.extract()
1491
1492 def smooth(self):
1493 """Smooth out this element's children by consolidating consecutive
1494 strings.
1495
1496 This makes pretty-printed output look more natural following a
1497 lot of operations that modified the tree.
1498 """
1499 # Mark the first position of every pair of children that need
1500 # to be consolidated. Do this rather than making a copy of
1501 # self.contents, since in most cases very few strings will be
1502 # affected.
1503 marked = []
1504 for i, a in enumerate(self.contents):
1505 if isinstance(a, Tag):
1506 # Recursively smooth children.
1507 a.smooth()
1508 if i == len(self.contents)-1:
1509                # This is the last item in .contents; there is
1510                # nothing after it to consolidate with.
1511 continue
1512 b = self.contents[i+1]
1513 if (isinstance(a, NavigableString)
1514 and isinstance(b, NavigableString)
1515 and not isinstance(a, PreformattedString)
1516 and not isinstance(b, PreformattedString)
1517 ):
1518 marked.append(i)
1519
1520 # Go over the marked positions in reverse order, so that
1521 # removing items from .contents won't affect the remaining
1522 # positions.
1523 for i in reversed(marked):
1524 a = self.contents[i]
1525 b = self.contents[i+1]
1526 b.extract()
1527 n = NavigableString(a+b)
1528 a.replace_with(n)
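Tree edits often leave adjacent string nodes behind; `smooth()` merges them back together. A usage sketch (assuming the standard `bs4` package and `html.parser`):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>a</p>", "html.parser")
soup.p.append("b")            # now two adjacent NavigableStrings
before = len(soup.p.contents)

soup.p.smooth()               # consolidates the pair into one string
after = len(soup.p.contents)
```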
1529
1530 def index(self, element):
1531 """Find the index of a child by identity, not value.
1532
1533 Avoids issues with tag.contents.index(element) getting the
1534 index of equal elements.
1535
1536 :param element: Look for this PageElement in `self.contents`.
1537 """
1538 for i, child in enumerate(self.contents):
1539 if child is element:
1540 return i
1541 raise ValueError("Tag.index: element not in tag")
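Searching by identity matters because `list.index` uses `==`, which can match an equal-but-different element earlier in the list. A standalone sketch of the distinction (the `index_by_identity` helper is a toy mirroring `Tag.index`):

```python
def index_by_identity(items, element):
    # Mirrors Tag.index: match on `is`, not on `==`.
    for i, child in enumerate(items):
        if child is element:
            return i
    raise ValueError("element not in list")

x, y = [1], [1]        # equal values, distinct objects
items = [x, y]

print(items.index(y))               # 0 -- equality finds x first
print(index_by_identity(items, y))  # 1 -- identity finds y itself
```

In a parse tree, two sibling strings can easily compare equal, so an equality-based index could point `insert`-style operations at the wrong node.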
1542
1543 def get(self, key, default=None):
1544 """Returns the value of the 'key' attribute for the tag, or
1545 the value given for 'default' if it doesn't have that
1546 attribute."""
1547 return self.attrs.get(key, default)
1548
1549 def get_attribute_list(self, key, default=None):
1550 """The same as get(), but always returns a list.
1551
1552 :param key: The attribute to look for.
1553 :param default: Use this value if the attribute is not present
1554 on this PageElement.
1555 :return: A list of values, probably containing only a single
1556 value.
1557 """
1558 value = self.get(key, default)
1559 if not isinstance(value, list):
1560 value = [value]
1561 return value
1562
1563 def has_attr(self, key):
1564 """Does this PageElement have an attribute with the given name?"""
1565 return key in self.attrs
1566
1567 def __hash__(self):
1568 return str(self).__hash__()
1569
1570 def __getitem__(self, key):
1571 """tag[key] returns the value of the 'key' attribute for the Tag,
1572 and throws an exception if it's not there."""
1573 return self.attrs[key]
1574
1575 def __iter__(self):
1576 "Iterating over a Tag iterates over its contents."
1577 return iter(self.contents)
1578
1579 def __len__(self):
1580 "The length of a Tag is the length of its list of contents."
1581 return len(self.contents)
1582
1583 def __contains__(self, x):
1584 return x in self.contents
1585
1586 def __bool__(self):
1587 "A tag is non-None even if it has no contents."
1588 return True
1589
1590 def __setitem__(self, key, value):
1591 """Setting tag[key] sets the value of the 'key' attribute for the
1592 tag."""
1593 self.attrs[key] = value
1594
1595 def __delitem__(self, key):
1596 "Deleting tag[key] deletes all 'key' attributes for the tag."
1597 self.attrs.pop(key, None)
1598
1599 def __call__(self, *args, **kwargs):
1600 """Calling a Tag like a function is the same as calling its
1601 find_all() method. Eg. tag('a') returns a list of all the A tags
1602 found within this tag."""
1603 return self.find_all(*args, **kwargs)
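A quick sketch of the call shorthand (assuming the standard `bs4` package and `html.parser`):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<div><a href="1">x</a><a href="2">y</a></div>', "html.parser")

# tag(...) is shorthand for tag.find_all(...)
links = soup.div("a")
hrefs = [a["href"] for a in links]
```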
1604
1605 def __getattr__(self, tag):
1606 """Calling tag.subtag is the same as calling tag.find(name="subtag")"""
1607 #print("Getattr %s.%s" % (self.__class__, tag))
1608 if len(tag) > 3 and tag.endswith('Tag'):
1609 # BS3: soup.aTag -> "soup.find("a")
1610 tag_name = tag[:-3]
1611 warnings.warn(
1612 '.%(name)sTag is deprecated, use .find("%(name)s") instead. If you really were looking for a tag called %(name)sTag, use .find("%(name)sTag")' % dict(
1613 name=tag_name
1614 ),
1615 DeprecationWarning, stacklevel=2
1616 )
1617 return self.find(tag_name)
1618 # We special case contents to avoid recursion.
1619 elif not tag.startswith("__") and not tag == "contents":
1620 return self.find(tag)
1621 raise AttributeError(
1622 "'%s' object has no attribute '%s'" % (self.__class__, tag))
1623
1624 def __eq__(self, other):
1625 """Returns true iff this Tag has the same name, the same attributes,
1626 and the same contents (recursively) as `other`."""
1627 if self is other:
1628 return True
1629 if (not hasattr(other, 'name') or
1630 not hasattr(other, 'attrs') or
1631 not hasattr(other, 'contents') or
1632 self.name != other.name or
1633 self.attrs != other.attrs or
1634 len(self) != len(other)):
1635 return False
1636 for i, my_child in enumerate(self.contents):
1637 if my_child != other.contents[i]:
1638 return False
1639 return True
1640
1641 def __ne__(self, other):
1642 """Returns true iff this Tag is not equal to `other`,
1643 as defined in __eq__."""
1644 return not self == other
1645
1646 def __repr__(self, encoding="unicode-escape"):
1647 """Renders this PageElement as a string.
1648
1649 :param encoding: The encoding to use (Python 2 only).
1650 TODO: This is now ignored and a warning should be issued
1651 if a value is provided.
1652 :return: A (Unicode) string.
1653 """
1654 # "The return value must be a string object", i.e. Unicode
1655 return self.decode()
1656
1657 def __unicode__(self):
1658 """Renders this PageElement as a Unicode string."""
1659 return self.decode()
1660
1661 __str__ = __repr__ = __unicode__
1662
1663 def encode(self, encoding=DEFAULT_OUTPUT_ENCODING,
1664 indent_level=None, formatter="minimal",
1665 errors="xmlcharrefreplace"):
1666 """Render a bytestring representation of this PageElement and its
1667 contents.
1668
1669 :param encoding: The destination encoding.
1670 :param indent_level: Each line of the rendering will be
1671 indented this many levels. (The formatter decides what a
1672 'level' means in terms of spaces or other characters
1673 output.) Used internally in recursive calls while
1674 pretty-printing.
1675 :param formatter: A Formatter object, or a string naming one of
1676 the standard formatters.
1677 :param errors: An error handling strategy such as
1678 'xmlcharrefreplace'. This value is passed along into
1679 encode() and its value should be one of the constants
1680 defined by Python.
1681 :return: A bytestring.
1682
1683 """
1684 # Turn the data structure into Unicode, then encode the
1685 # Unicode.
1686 u = self.decode(indent_level, encoding, formatter)
1687 return u.encode(encoding, errors)
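As the method above shows, encode() is decode() followed by a bytes conversion; a minimal round-trip sketch (markup is illustrative):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>caf\u00e9</p>", "html.parser")
text = soup.p.decode()            # str
data = soup.p.encode("utf-8")     # bytes, via decode() then str.encode()
assert isinstance(text, str)
assert isinstance(data, bytes)
assert data.decode("utf-8") == text
```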
1688
1689 def decode(self, indent_level=None,
1690 eventual_encoding=DEFAULT_OUTPUT_ENCODING,
1691 formatter="minimal",
1692 iterator=None):
1693 pieces = []
1694 # First off, turn a non-Formatter `formatter` into a Formatter
1695 # object. This will stop the lookup from happening over and
1696 # over again.
1697 if not isinstance(formatter, Formatter):
1698 formatter = self.formatter_for_name(formatter)
1699
1700 if indent_level is True:
1701 indent_level = 0
1702
1703 # The currently active tag that put us into string literal
1704 # mode. Until this element is closed, children will be treated
1705 # as string literals and not pretty-printed. String literal
1706 # mode is turned on immediately after this tag begins, and
1707 # turned off immediately before it's closed. This means there
1708 # will be whitespace before and after the tag itself.
1709 string_literal_tag = None
1710
1711 for event, element in self._event_stream(iterator):
1712 if event in (Tag.START_ELEMENT_EVENT, Tag.EMPTY_ELEMENT_EVENT):
1713 piece = element._format_tag(
1714 eventual_encoding, formatter, opening=True
1715 )
1716 elif event is Tag.END_ELEMENT_EVENT:
1717 piece = element._format_tag(
1718 eventual_encoding, formatter, opening=False
1719 )
1720 if indent_level is not None:
1721 indent_level -= 1
1722 else:
1723 piece = element.output_ready(formatter)
1724
1725 # Now we need to apply the 'prettiness' -- extra
1726 # whitespace before and/or after this tag. This can get
1727 # complicated because certain tags, like <pre> and
1728 # <script>, can't be prettified, since adding whitespace would
1729 # change the meaning of the content.
1730
1731 # The default behavior is to add whitespace before and
1732 # after an element when string literal mode is off, and to
1733 # leave things as they are when string literal mode is on.
1734 if string_literal_tag:
1735 indent_before = indent_after = False
1736 else:
1737 indent_before = indent_after = True
1738
1739 # The only time the behavior is more complex than that is
1740 # when we encounter an opening or closing tag that might
1741 # put us into or out of string literal mode.
1742 if (event is Tag.START_ELEMENT_EVENT
1743 and not string_literal_tag
1744 and not element._should_pretty_print()):
1745 # We are about to enter string literal mode. Add
1746 # whitespace before this tag, but not after. We
1747 # will stay in string literal mode until this tag
1748 # is closed.
1749 indent_before = True
1750 indent_after = False
1751 string_literal_tag = element
1752 elif (event is Tag.END_ELEMENT_EVENT
1753 and element is string_literal_tag):
1754 # We are about to exit string literal mode by closing
1755 # the tag that sent us into that mode. Add whitespace
1756 # after this tag, but not before.
1757 indent_before = False
1758 indent_after = True
1759 string_literal_tag = None
1760
1761 # Now we know whether to add whitespace before and/or
1762 # after this element.
1763 if indent_level is not None:
1764 if (indent_before or indent_after):
1765 if isinstance(element, NavigableString):
1766 piece = piece.strip()
1767 if piece:
1768 piece = self._indent_string(
1769 piece, indent_level, formatter,
1770 indent_before, indent_after
1771 )
1772 if event == Tag.START_ELEMENT_EVENT:
1773 indent_level += 1
1774 pieces.append(piece)
1775 return "".join(pieces)
1776
1777 # Names for the different events yielded by _event_stream
1778 START_ELEMENT_EVENT = object()
1779 END_ELEMENT_EVENT = object()
1780 EMPTY_ELEMENT_EVENT = object()
1781 STRING_ELEMENT_EVENT = object()
1782
1783 def _event_stream(self, iterator=None):
1784 """Yield a sequence of events that can be used to reconstruct the DOM
1785 for this element.
1786
1787 This lets us recreate the nested structure of this element
1788 (e.g. when formatting it as a string) without using recursive
1789 method calls.
1790
1791 This is similar in concept to the SAX API, but it's a simpler
1792 interface designed for internal use. The events are different
1793 from SAX and the arguments associated with the events are Tags
1794 and other Beautiful Soup objects.
1795
1796 :param iterator: An alternate iterator to use when traversing
1797 the tree.
1798 """
1799 tag_stack = []
1800
1801 iterator = iterator or self.self_and_descendants
1802
1803 for c in iterator:
1804 # If the parent of the element we're about to yield is not
1805 # the tag currently on the stack, it means that the tag on
1806 # the stack closed before this element appeared.
1807 while tag_stack and c.parent != tag_stack[-1]:
1808 now_closed_tag = tag_stack.pop()
1809 yield Tag.END_ELEMENT_EVENT, now_closed_tag
1810
1811 if isinstance(c, Tag):
1812 if c.is_empty_element:
1813 yield Tag.EMPTY_ELEMENT_EVENT, c
1814 else:
1815 yield Tag.START_ELEMENT_EVENT, c
1816 tag_stack.append(c)
1817 continue
1818 else:
1819 yield Tag.STRING_ELEMENT_EVENT, c
1820
1821 while tag_stack:
1822 now_closed_tag = tag_stack.pop()
1823 yield Tag.END_ELEMENT_EVENT, now_closed_tag
1824
1825 def _indent_string(self, s, indent_level, formatter,
1826 indent_before, indent_after):
1827 """Add indentation whitespace before and/or after a string.
1828
1829 :param s: The string to amend with whitespace.
1830 :param indent_level: The indentation level; affects how much
1831 whitespace goes before the string.
1832 :param indent_before: Whether or not to add whitespace
1833 before the string.
1834 :param indent_after: Whether or not to add whitespace
1835 (a newline) after the string.
1836 """
1837 space_before = ''
1838 if indent_before and indent_level:
1839 space_before = (formatter.indent * indent_level)
1840
1841 space_after = ''
1842 if indent_after:
1843 space_after = "\n"
1844
1845 return space_before + s + space_after
1846
1847 def _format_tag(self, eventual_encoding, formatter, opening):
1848 if self.hidden:
1849 # A hidden tag is invisible, although its contents
1850 # are visible.
1851 return ''
1852
1853 # A tag starts with the < character (see below).
1854
1855 # Then the / character, if this is a closing tag.
1856 closing_slash = ''
1857 if not opening:
1858 closing_slash = '/'
1859
1860 # Then an optional namespace prefix.
1861 prefix = ''
1862 if self.prefix:
1863 prefix = self.prefix + ":"
1864
1865 # Then a list of attribute values, if this is an opening tag.
1866 attribute_string = ''
1867 if opening:
1868 attributes = formatter.attributes(self)
1869 attrs = []
1870 for key, val in attributes:
1871 if val is None:
1872 decoded = key
1873 else:
1874 if isinstance(val, list) or isinstance(val, tuple):
1875 val = ' '.join(val)
1876 elif not isinstance(val, str):
1877 val = str(val)
1878 elif (
1879 isinstance(val, AttributeValueWithCharsetSubstitution)
1880 and eventual_encoding is not None
1881 ):
1882 val = val.encode(eventual_encoding)
1883
1884 text = formatter.attribute_value(val)
1885 decoded = (
1886 str(key) + '='
1887 + formatter.quoted_attribute_value(text))
1888 attrs.append(decoded)
1889 if attrs:
1890 attribute_string = ' ' + ' '.join(attrs)
1891
1892 # Then an optional closing slash (for a void element in an
1893 # XML document).
1894 void_element_closing_slash = ''
1895 if self.is_empty_element:
1896 void_element_closing_slash = formatter.void_element_close_prefix or ''
1897
1898 # Put it all together.
1899 return '<' + closing_slash + prefix + self.name + attribute_string + void_element_closing_slash + '>'
1900
1901 def _should_pretty_print(self, indent_level=1):
1902 """Should this tag be pretty-printed?
1903
1904 Most of them should, but some (such as <pre> in HTML
1905 documents) should not.
1906 """
1907 return (
1908 indent_level is not None
1909 and (
1910 not self.preserve_whitespace_tags
1911 or self.name not in self.preserve_whitespace_tags
1912 )
1913 )
1914
1915 def prettify(self, encoding=None, formatter="minimal"):
1916 """Pretty-print this PageElement as a string.
1917
1918 :param encoding: The eventual encoding of the string. If this is None,
1919 a Unicode string will be returned.
1920 :param formatter: A Formatter object, or a string naming one of
1921 the standard formatters.
1922 :return: A Unicode string (if encoding==None) or a bytestring
1923 (otherwise).
1924 """
1925 if encoding is None:
1926 return self.decode(True, formatter=formatter)
1927 else:
1928 return self.encode(encoding, True, formatter=formatter)
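A short sketch of the str-versus-bytes behavior described in the prettify() docstring (markup is illustrative):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div><p>Hi</p></div>", "html.parser")
pretty = soup.prettify()          # str, because encoding is None
raw = soup.prettify("utf-8")      # bytes, because an encoding was given
assert isinstance(pretty, str)
assert isinstance(raw, bytes)
assert "\n" in pretty             # one tag per line when pretty-printing
```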
1929
1930 def decode_contents(self, indent_level=None,
1931 eventual_encoding=DEFAULT_OUTPUT_ENCODING,
1932 formatter="minimal"):
1933 """Renders the contents of this tag as a Unicode string.
1934
1935 :param indent_level: Each line of the rendering will be
1936 indented this many levels. (The formatter decides what a
1937 'level' means in terms of spaces or other characters
1938 output.) Used internally in recursive calls while
1939 pretty-printing.
1940
1941 :param eventual_encoding: The tag is destined to be
1942 encoded into this encoding. decode_contents() is _not_
1943 responsible for performing that encoding. This information
1944 is passed in so that it can be substituted in if the
1945 document contains a <META> tag that mentions the document's
1946 encoding.
1947
1948 :param formatter: A Formatter object, or a string naming one of
1949 the standard Formatters.
1950
1951 """
1952 return self.decode(indent_level, eventual_encoding, formatter,
1953 iterator=self.descendants)
1954
1955 def encode_contents(
1956 self, indent_level=None, encoding=DEFAULT_OUTPUT_ENCODING,
1957 formatter="minimal"):
1958 """Renders the contents of this PageElement as a bytestring.
1959
1960 :param indent_level: Each line of the rendering will be
1961 indented this many levels. (The formatter decides what a
1962 'level' means in terms of spaces or other characters
1963 output.) Used internally in recursive calls while
1964 pretty-printing.
1965
1966 :param encoding: The bytestring will be in this encoding.
1967
1968 :param formatter: A Formatter object, or a string naming one of
1969 the standard Formatters.
1970
1971 :return: A bytestring.
1972 """
1973 contents = self.decode_contents(indent_level, encoding, formatter)
1974 return contents.encode(encoding)
1975
1976 # Old method for BS3 compatibility
1977 def renderContents(self, encoding=DEFAULT_OUTPUT_ENCODING,
1978 prettyPrint=False, indentLevel=0):
1979 """Deprecated method for BS3 compatibility."""
1980 if not prettyPrint:
1981 indentLevel = None
1982 return self.encode_contents(
1983 indent_level=indentLevel, encoding=encoding)
1984
1985 #Soup methods
1986
1987 def find(self, name=None, attrs={}, recursive=True, string=None,
1988 **kwargs):
1989 """Look in the children of this PageElement and find the first
1990 PageElement that matches the given criteria.
1991
1992 All find_* methods take a common set of arguments. See the online
1993 documentation for detailed explanations.
1994
1995 :param name: A filter on tag name.
1996 :param attrs: A dictionary of filters on attribute values.
1997 :param recursive: If this is True, find() will perform a
1998 recursive search of this PageElement's children. Otherwise,
1999 only the direct children will be considered.
2000 :param string: A filter for a NavigableString with specific text.
2001 :param kwargs: A dictionary of filters on attribute values.
2002 :return: A PageElement.
2003 :rtype: bs4.element.Tag | bs4.element.NavigableString
2004 """
2005 r = None
2006 l = self.find_all(name, attrs, recursive, string, 1, _stacklevel=3,
2007 **kwargs)
2008 if l:
2009 r = l[0]
2010 return r
2011 findChild = find #BS2
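The find()/find_all() contrast described above, as a minimal sketch (markup is illustrative):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<ul><li id="a">1</li><li id="b">2</li></ul>',
                     "html.parser")
first = soup.find("li")               # first match only
assert first["id"] == "a"
assert soup.find("li", id="b").string == "2"
assert soup.find("table") is None     # no match returns None, not an error
```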
2012
2013 def find_all(self, name=None, attrs={}, recursive=True, string=None,
2014 limit=None, **kwargs):
2015 """Look in the children of this PageElement and find all
2016 PageElements that match the given criteria.
2017
2018 All find_* methods take a common set of arguments. See the online
2019 documentation for detailed explanations.
2020
2021 :param name: A filter on tag name.
2022 :param attrs: A dictionary of filters on attribute values.
2023 :param recursive: If this is True, find_all() will perform a
2024 recursive search of this PageElement's children. Otherwise,
2025 only the direct children will be considered.
2026 :param limit: Stop looking after finding this many results.
2027 :param kwargs: A dictionary of filters on attribute values.
2028 :return: A ResultSet of PageElements.
2029 :rtype: bs4.element.ResultSet
2030 """
2031 generator = self.descendants
2032 if not recursive:
2033 generator = self.children
2034 _stacklevel = kwargs.pop('_stacklevel', 2)
2035 return self._find_all(name, attrs, string, limit, generator,
2036 _stacklevel=_stacklevel+1, **kwargs)
2037 findAll = find_all # BS3
2038 findChildren = find_all # BS2
2039
2040 #Generator methods
2041 @property
2042 def children(self):
2043 """Iterate over all direct children of this PageElement.
2044
2045 :yield: A sequence of PageElements.
2046 """
2047 # return iter() to make the purpose of the method clear
2048 return iter(self.contents) # XXX This seems to be untested.
2049
2050 @property
2051 def self_and_descendants(self):
2052 """Iterate over this PageElement and its descendants in
2053 document order.
2054
2055 :yield: A sequence of PageElements.
2056 """
2057 if not self.hidden:
2058 yield self
2059 for i in self.descendants:
2060 yield i
2061
2062 @property
2063 def descendants(self):
2064 """Iterate over all descendants of this PageElement in
2065 document order.
2066
2067 :yield: A sequence of PageElements.
2068 """
2069 if not len(self.contents):
2070 return
2071 stopNode = self._last_descendant().next_element
2072 current = self.contents[0]
2073 while current is not stopNode:
2074 yield current
2075 current = current.next_element
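The children/descendants distinction can be seen in a small sketch (markup is illustrative): children yields only direct children, while descendants walks the next_element chain, yielding tags and strings alike.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div><p>one</p><p>two</p></div>", "html.parser")
div = soup.div
# children: direct children only (the two <p> Tags)
assert [c.name for c in div.children] == ["p", "p"]
# descendants: tags and strings, in document order
assert len(list(div.descendants)) == 4   # <p>, "one", <p>, "two"
```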
2076
2077 # CSS selector code
2078 def select_one(self, selector, namespaces=None, **kwargs):
2079 """Perform a CSS selection operation on the current element.
2080
2081 :param selector: A CSS selector.
2082
2083 :param namespaces: A dictionary mapping namespace prefixes
2084 used in the CSS selector to namespace URIs. By default,
2085 Beautiful Soup will use the prefixes it encountered while
2086 parsing the document.
2087
2088 :param kwargs: Keyword arguments to be passed into Soup Sieve's
2089 soupsieve.select() method.
2090
2091 :return: A Tag.
2092 :rtype: bs4.element.Tag
2093 """
2094 return self.css.select_one(selector, namespaces, **kwargs)
2095
2096 def select(self, selector, namespaces=None, limit=None, **kwargs):
2097 """Perform a CSS selection operation on the current element.
2098
2099 This uses the Soup Sieve library.
2100
2101 :param selector: A string containing a CSS selector.
2102
2103 :param namespaces: A dictionary mapping namespace prefixes
2104 used in the CSS selector to namespace URIs. By default,
2105 Beautiful Soup will use the prefixes it encountered while
2106 parsing the document.
2107
2108 :param limit: After finding this number of results, stop looking.
2109
2110 :param kwargs: Keyword arguments to be passed into Soup Sieve's
2111 soupsieve.select() method.
2112
2113 :return: A ResultSet of Tags.
2114 :rtype: bs4.element.ResultSet
2115 """
2116 return self.css.select(selector, namespaces, limit, **kwargs)
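A minimal sketch of the two CSS entry points above, assuming the soupsieve package is installed alongside bs4 (markup and selector are illustrative):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<div class="nav"><a href="/home">Home</a></div>'
    '<div><a href="/other">Other</a></div>',
    "html.parser")
hits = soup.select("div.nav a")           # all matches, as a ResultSet
assert [a["href"] for a in hits] == ["/home"]
assert soup.select_one("div.nav a").string == "Home"
```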
2117
2118 @property
2119 def css(self):
2120 """Return an interface to the CSS selector API."""
2121 return CSS(self)
2122
2123 # Old names for backwards compatibility
2124 def childGenerator(self):
2125 """Deprecated generator."""
2126 return self.children
2127
2128 def recursiveChildGenerator(self):
2129 """Deprecated generator."""
2130 return self.descendants
2131
2132 def has_key(self, key):
2133 """Deprecated method. This was kind of misleading because has_key()
2134 (attributes) was different from __in__ (contents).
2135
2136 has_key() is gone in Python 3, anyway.
2137 """
2138 warnings.warn(
2139 'has_key is deprecated. Use has_attr(key) instead.',
2140 DeprecationWarning, stacklevel=2
2141 )
2142 return self.has_attr(key)
2143
2144# Next, a couple classes to represent queries and their results.
2145class SoupStrainer(object):
2146 """Encapsulates a number of ways of matching a markup element (tag or
2147 string).
2148
2149 This is primarily used to underpin the find_* methods, but you can
2150 create one yourself and pass it in as `parse_only` to the
2151 `BeautifulSoup` constructor, to parse a subset of a large
2152 document.
2153 """
2154
2155 def __init__(self, name=None, attrs={}, string=None, **kwargs):
2156 """Constructor.
2157
2158 The SoupStrainer constructor takes the same arguments passed
2159 into the find_* methods. See the online documentation for
2160 detailed explanations.
2161
2162 :param name: A filter on tag name.
2163 :param attrs: A dictionary of filters on attribute values.
2164 :param string: A filter for a NavigableString with specific text.
2165 :kwargs: A dictionary of filters on attribute values.
2166 """
2167 if string is None and 'text' in kwargs:
2168 string = kwargs.pop('text')
2169 warnings.warn(
2170 "The 'text' argument to the SoupStrainer constructor is deprecated. Use 'string' instead.",
2171 DeprecationWarning, stacklevel=2
2172 )
2173
2174 self.name = self._normalize_search_value(name)
2175 if not isinstance(attrs, dict):
2176 # Treat a non-dict value for attrs as a search for the 'class'
2177 # attribute.
2178 kwargs['class'] = attrs
2179 attrs = None
2180
2181 if 'class_' in kwargs:
2182 # Treat class_="foo" as a search for the 'class'
2183 # attribute, overriding any non-dict value for attrs.
2184 kwargs['class'] = kwargs['class_']
2185 del kwargs['class_']
2186
2187 if kwargs:
2188 if attrs:
2189 attrs = attrs.copy()
2190 attrs.update(kwargs)
2191 else:
2192 attrs = kwargs
2193 normalized_attrs = {}
2194 for key, value in list(attrs.items()):
2195 normalized_attrs[key] = self._normalize_search_value(value)
2196
2197 self.attrs = normalized_attrs
2198 self.string = self._normalize_search_value(string)
2199
2200 # DEPRECATED but just in case someone is checking this.
2201 self.text = self.string
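The `parse_only` use case mentioned in the class docstring, sketched minimally (markup is illustrative):

```python
from bs4 import BeautifulSoup, SoupStrainer

# Build only the <a> tags from a larger document
only_links = SoupStrainer("a")
soup = BeautifulSoup('<p>intro</p><a href="/1">one</a><p>outro</p>',
                     "html.parser", parse_only=only_links)
assert soup.find("p") is None             # <p> tags were never built
assert soup.a["href"] == "/1"
```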
2202
2203 def _normalize_search_value(self, value):
2204 # Leave it alone if it's a Unicode string, a callable, a
2205 # regular expression, a boolean, or None.
2206 if (isinstance(value, str) or isinstance(value, Callable) or hasattr(value, 'match')
2207 or isinstance(value, bool) or value is None):
2208 return value
2209
2210 # If it's a bytestring, convert it to Unicode, treating it as UTF-8.
2211 if isinstance(value, bytes):
2212 return value.decode("utf8")
2213
2214 # If it's listlike, convert it into a list of strings.
2215 if hasattr(value, '__iter__'):
2216 new_value = []
2217 for v in value:
2218 if (hasattr(v, '__iter__') and not isinstance(v, bytes)
2219 and not isinstance(v, str)):
2220 # This is almost certainly the user's mistake. In the
2221 # interests of avoiding infinite loops, we'll let
2222 # it through as-is rather than doing a recursive call.
2223 new_value.append(v)
2224 else:
2225 new_value.append(self._normalize_search_value(v))
2226 return new_value
2227
2228 # Otherwise, convert it into a Unicode string.
2229 # The double str() call is a holdover from when this code had to behave
2230 # the same way on Python 2 and Python 3; it is kept for compatibility.
2231 return str(str(value))
2232
2233 def __str__(self):
2234 """A human-readable representation of this SoupStrainer."""
2235 if self.string:
2236 return self.string
2237 else:
2238 return "%s|%s" % (self.name, self.attrs)
2239
2240 def search_tag(self, markup_name=None, markup_attrs={}):
2241 """Check whether a Tag with the given name and attributes would
2242 match this SoupStrainer.
2243
2244 Used prospectively to decide whether to even bother creating a Tag
2245 object.
2246
2247 :param markup_name: A tag name as found in some markup.
2248 :param markup_attrs: A dictionary of attributes as found in some markup.
2249
2250 :return: True if the prospective tag would match this SoupStrainer;
2251 False otherwise.
2252 """
2253 found = None
2254 markup = None
2255 if isinstance(markup_name, Tag):
2256 markup = markup_name
2257 markup_attrs = markup
2258
2259 if isinstance(self.name, str):
2260 # Optimization for a very common case where the user is
2261 # searching for a tag with one specific name, and we're
2262 # looking at a tag with a different name.
2263 if markup and not markup.prefix and self.name != markup.name:
2264 return False
2265
2266 call_function_with_tag_data = (
2267 isinstance(self.name, Callable)
2268 and not isinstance(markup_name, Tag))
2269
2270 if ((not self.name)
2271 or call_function_with_tag_data
2272 or (markup and self._matches(markup, self.name))
2273 or (not markup and self._matches(markup_name, self.name))):
2274 if call_function_with_tag_data:
2275 match = self.name(markup_name, markup_attrs)
2276 else:
2277 match = True
2278 markup_attr_map = None
2279 for attr, match_against in list(self.attrs.items()):
2280 if not markup_attr_map:
2281 if hasattr(markup_attrs, 'get'):
2282 markup_attr_map = markup_attrs
2283 else:
2284 markup_attr_map = {}
2285 for k, v in markup_attrs:
2286 markup_attr_map[k] = v
2287 attr_value = markup_attr_map.get(attr)
2288 if not self._matches(attr_value, match_against):
2289 match = False
2290 break
2291 if match:
2292 if markup:
2293 found = markup
2294 else:
2295 found = markup_name
2296 if found and self.string and not self._matches(found.string, self.string):
2297 found = None
2298 return found
2299
2300 # For BS3 compatibility.
2301 searchTag = search_tag
2302
2303 def search(self, markup):
2304 """Find all items in `markup` that match this SoupStrainer.
2305
2306 Used by the core _find_all() method, which is ultimately
2307 called by all find_* methods.
2308
2309 :param markup: A PageElement or a list of them.
2310 """
2311 # print('looking for %s in %s' % (self, markup))
2312 found = None
2313 # If given a list of items, scan it for a text element that
2314 # matches.
2315 if hasattr(markup, '__iter__') and not isinstance(markup, (Tag, str)):
2316 for element in markup:
2317 if isinstance(element, NavigableString) \
2318 and self.search(element):
2319 found = element
2320 break
2321 # If it's a Tag, make sure its name or attributes match.
2322 # Don't bother with Tags if we're searching for text.
2323 elif isinstance(markup, Tag):
2324 if not self.string or self.name or self.attrs:
2325 found = self.search_tag(markup)
2326 # If it's text, make sure the text matches.
2327 elif isinstance(markup, NavigableString) or \
2328 isinstance(markup, str):
2329 if not self.name and not self.attrs and self._matches(markup, self.string):
2330 found = markup
2331 else:
2332 raise Exception(
2333 "I don't know how to match against a %s" % markup.__class__)
2334 return found
2335
2336 def _matches(self, markup, match_against, already_tried=None):
2337 # print(u"Matching %s against %s" % (markup, match_against))
2338 result = False
2339 if isinstance(markup, list) or isinstance(markup, tuple):
2340 # This should only happen when searching a multi-valued attribute
2341 # like 'class'.
2342 for item in markup:
2343 if self._matches(item, match_against):
2344 return True
2345 # We didn't match any particular value of the multivalue
2346 # attribute, but maybe we match the attribute value when
2347 # considered as a string.
2348 if self._matches(' '.join(markup), match_against):
2349 return True
2350 return False
2351
2352 if match_against is True:
2353 # True matches any non-None value.
2354 return markup is not None
2355
2356 if isinstance(match_against, Callable):
2357 return match_against(markup)
2358
2359 # Custom callables take the tag as an argument, but all
2360 # other ways of matching match the tag name as a string.
2361 original_markup = markup
2362 if isinstance(markup, Tag):
2363 markup = markup.name
2364
2365 # Ensure that `markup` is either a Unicode string, or None.
2366 markup = self._normalize_search_value(markup)
2367
2368 if markup is None:
2369 # None matches None, False, an empty string, an empty list, and so on.
2370 return not match_against
2371
2372 if (hasattr(match_against, '__iter__')
2373 and not isinstance(match_against, str)):
2374 # We're asked to match against an iterable of items.
2375 # The markup must match at least one item in the
2376 # iterable. We'll try each one in turn.
2377 #
2378 # To avoid infinite recursion we need to keep track of
2379 # items we've already seen.
2380 if not already_tried:
2381 already_tried = set()
2382 for item in match_against:
2383 if item.__hash__:
2384 key = item
2385 else:
2386 key = id(item)
2387 if key in already_tried:
2388 continue
2389 else:
2390 already_tried.add(key)
2391 if self._matches(original_markup, item, already_tried):
2392 return True
2393 else:
2394 return False
2395
2396 # Beyond this point we might need to run the test twice: once against
2397 # the tag's name and once against its prefixed name.
2398 match = False
2399
2400 if not match and isinstance(match_against, str):
2401 # Exact string match
2402 match = markup == match_against
2403
2404 if not match and hasattr(match_against, 'search'):
2405 # Regexp match
2406 return match_against.search(markup)
2407
2408 if (not match
2409 and isinstance(original_markup, Tag)
2410 and original_markup.prefix):
2411 # Try the whole thing again with the prefixed tag name.
2412 return self._matches(
2413 original_markup.prefix + ':' + original_markup.name, match_against
2414 )
2415
2416 return match
2417
2418
2419class ResultSet(list):
2420 """A ResultSet is just a list that keeps track of the SoupStrainer
2421 that created it."""
2422 def __init__(self, source, result=()):
2423 """Constructor.
2424
2425 :param source: A SoupStrainer.
2426 :param result: A list of PageElements.
2427 """
2428 super(ResultSet, self).__init__(result)
2429 self.source = source
2430
2431 def __getattr__(self, key):
2432 """Raise a helpful exception to explain a common code fix."""
2433 raise AttributeError(
2434 "ResultSet object has no attribute '%s'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?" % key
2435 )
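The mistake that ResultSet.__getattr__ guards against, sketched minimally (markup is illustrative):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<a href="/1">one</a>', "html.parser")
results = soup.find_all("a")              # a ResultSet (a list subclass)
try:
    results.get("href")                   # list-vs-element mistake
except AttributeError as e:
    message = str(e)
assert "find_all()" in message            # the helpful hint fires
assert results[0].get("href") == "/1"     # index first, then use the Tag API
```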
diff --git a/bitbake/lib/bs4/formatter.py b/bitbake/lib/bs4/formatter.py
deleted file mode 100644
index 9fa1b57cb6..0000000000
--- a/bitbake/lib/bs4/formatter.py
+++ /dev/null
@@ -1,185 +0,0 @@
1from bs4.dammit import EntitySubstitution
2
3class Formatter(EntitySubstitution):
4 """Describes a strategy to use when outputting a parse tree to a string.
5
6 Some parts of this strategy come from the distinction between
7 HTML4, HTML5, and XML. Others are configurable by the user.
8
9 Formatters are passed in as the `formatter` argument to methods
10 like `PageElement.encode`. Most people won't need to think about
11 formatters, and most people who need to think about them can pass
12 in one of these predefined strings as `formatter` rather than
13 making a new Formatter object:
14
15 For HTML documents:
16 * 'html' - HTML entity substitution for generic HTML documents. (default)
17 * 'html5' - HTML entity substitution for HTML5 documents, as
18 well as some optimizations in the way tags are rendered.
19 * 'minimal' - Only make the substitutions necessary to guarantee
20 valid HTML.
21 * None - Do not perform any substitution. This will be faster
22 but may result in invalid markup.
23
24 For XML documents:
25 * 'html' - Entity substitution for XHTML documents.
26 * 'minimal' - Only make the substitutions necessary to guarantee
27 valid XML. (default)
28 * None - Do not perform any substitution. This will be faster
29 but may result in invalid markup.
30 """
31 # Registries of XML and HTML formatters.
32 XML_FORMATTERS = {}
33 HTML_FORMATTERS = {}
34
35 HTML = 'html'
36 XML = 'xml'
37
38 HTML_DEFAULTS = dict(
39 cdata_containing_tags=set(["script", "style"]),
40 )
41
42 def _default(self, language, value, kwarg):
43 if value is not None:
44 return value
45 if language == self.XML:
46 return set()
47 return self.HTML_DEFAULTS[kwarg]
48
49 def __init__(
50 self, language=None, entity_substitution=None,
51 void_element_close_prefix='/', cdata_containing_tags=None,
52 empty_attributes_are_booleans=False, indent=1,
53 ):
54 r"""Constructor.
55
56 :param language: This should be Formatter.XML if you are formatting
57 XML markup and Formatter.HTML if you are formatting HTML markup.
58
59 :param entity_substitution: A function to call to replace special
60 characters with XML/HTML entities. For examples, see
61 bs4.dammit.EntitySubstitution.substitute_html and substitute_xml.
62 :param void_element_close_prefix: By default, void elements
63 are represented as <tag/> (XML rules) rather than <tag>
64 (HTML rules). To get <tag>, pass in the empty string.
65 :param cdata_containing_tags: The list of tags that are defined
66 as containing CDATA in this dialect. For example, in HTML,
67 <script> and <style> tags are defined as containing CDATA,
68 and their contents should not be formatted.
69 :param empty_attributes_are_booleans: Render attributes whose value
70 is the empty string as HTML-style boolean attributes.
71 (Attributes whose value is None are always rendered this way.)
72
73 :param indent: If indent is a non-negative integer or string,
74 then the contents of elements will be indented
75 appropriately when pretty-printing. An indent level of 0,
76 negative, or "" will only insert newlines. Using a
77 positive integer indent indents that many spaces per
78 level. If indent is a string (such as "\t"), that string
79 is used to indent each level. The default behavior is to
80 indent one space per level.
81 """
82 self.language = language
83 self.entity_substitution = entity_substitution
84 self.void_element_close_prefix = void_element_close_prefix
85 self.cdata_containing_tags = self._default(
86 language, cdata_containing_tags, 'cdata_containing_tags'
87 )
 88        self.empty_attributes_are_booleans = empty_attributes_are_booleans
89 if indent is None:
90 indent = 0
91 if isinstance(indent, int):
92 if indent < 0:
93 indent = 0
94 indent = ' ' * indent
 95        elif not isinstance(indent, str):
 96            # A str (such as "\t") is used as-is; anything else
 97            # falls back to one space per level.
 98            indent = ' '
99 self.indent = indent
100
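The indent normalization performed by the constructor above can be sketched as a standalone helper (a hypothetical function for illustration, not part of the bs4 API):

```python
def normalize_indent(indent):
    """Mirror Formatter's indent handling: None or a negative int
    become no indentation (newlines only), a non-negative int becomes
    that many spaces per level, a str is used verbatim, and anything
    else falls back to a single space per level."""
    if indent is None:
        indent = 0
    if isinstance(indent, int):
        indent = ' ' * max(indent, 0)
    elif not isinstance(indent, str):
        indent = ' '
    return indent
```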
101 def substitute(self, ns):
102 """Process a string that needs to undergo entity substitution.
103 This may be a string encountered in an attribute value or as
104 text.
105
106 :param ns: A string.
107 :return: A string with certain characters replaced by named
108 or numeric entities.
109 """
110 if not self.entity_substitution:
111 return ns
112 from .element import NavigableString
113 if (isinstance(ns, NavigableString)
114 and ns.parent is not None
115 and ns.parent.name in self.cdata_containing_tags):
116 # Do nothing.
117 return ns
118 # Substitute.
119 return self.entity_substitution(ns)
120
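The CDATA gate in substitute() can be illustrated without bs4 objects; the sketch below uses a plain parent-name argument as a hypothetical stand-in for a NavigableString's parent tag, and stdlib html.escape as the substitution function:

```python
import html

# HTML's CDATA-containing tags, per the docstring above.
CDATA_TAGS = {'script', 'style'}

def substitute(ns, parent_name=None, entity_substitution=html.escape):
    """Mimic Formatter.substitute: return the string untouched when
    there is no substitution function, or when the string sits inside
    a CDATA-containing tag such as <script>; otherwise substitute."""
    if not entity_substitution:
        return ns
    if parent_name in CDATA_TAGS:
        return ns
    return entity_substitution(ns)
```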
121 def attribute_value(self, value):
122 """Process the value of an attribute.
123
124        :param value: A string.
125 :return: A string with certain characters replaced by named
126 or numeric entities.
127 """
128 return self.substitute(value)
129
130 def attributes(self, tag):
131 """Reorder a tag's attributes however you want.
132
133 By default, attributes are sorted alphabetically. This makes
134 behavior consistent between Python 2 and Python 3, and preserves
135 backwards compatibility with older versions of Beautiful Soup.
136
137        If `empty_attributes_are_booleans` is True, then attributes
138        whose values are set to the empty string will be treated as
139        boolean attributes.
140 """
141 if tag.attrs is None:
142 return []
143 return sorted(
144 (k, (None if self.empty_attributes_are_booleans and v == '' else v))
145 for k, v in list(tag.attrs.items())
146 )
147
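The sorting and boolean-attribute behavior of attributes() can be sketched on a plain dict (a hypothetical helper mirroring the method above, not part of the library):

```python
def format_attributes(attrs, empty_attributes_are_booleans=False):
    """Mimic Formatter.attributes: sort attribute pairs alphabetically
    by name and, when the boolean option is on, map empty-string
    values to None so they render as bare attribute names."""
    if attrs is None:
        return []
    return sorted(
        (k, None if empty_attributes_are_booleans and v == '' else v)
        for k, v in attrs.items()
    )
```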
148class HTMLFormatter(Formatter):
149 """A generic Formatter for HTML."""
150 REGISTRY = {}
151 def __init__(self, *args, **kwargs):
152 super(HTMLFormatter, self).__init__(self.HTML, *args, **kwargs)
153
154
155class XMLFormatter(Formatter):
156 """A generic Formatter for XML."""
157 REGISTRY = {}
158 def __init__(self, *args, **kwargs):
159 super(XMLFormatter, self).__init__(self.XML, *args, **kwargs)
160
161
162# Set up aliases for the default formatters.
163HTMLFormatter.REGISTRY['html'] = HTMLFormatter(
164 entity_substitution=EntitySubstitution.substitute_html
165)
166HTMLFormatter.REGISTRY["html5"] = HTMLFormatter(
167 entity_substitution=EntitySubstitution.substitute_html,
168 void_element_close_prefix=None,
169 empty_attributes_are_booleans=True,
170)
171HTMLFormatter.REGISTRY["minimal"] = HTMLFormatter(
172 entity_substitution=EntitySubstitution.substitute_xml
173)
174HTMLFormatter.REGISTRY[None] = HTMLFormatter(
175 entity_substitution=None
176)
177XMLFormatter.REGISTRY["html"] = XMLFormatter(
178 entity_substitution=EntitySubstitution.substitute_html
179)
180XMLFormatter.REGISTRY["minimal"] = XMLFormatter(
181 entity_substitution=EntitySubstitution.substitute_xml
182)
183XMLFormatter.REGISTRY[None] = Formatter(
184    Formatter.XML, entity_substitution=None
185)