Create LXML from raw_html #160

Merged: 2 commits merged on Sep 18, 2018
Conversation

@angaz
Contributor

angaz commented Apr 5, 2018

Create LXML from self.raw_html instead of self.html to allow LXML to process plain XML pages, as per @beda42's findings in issue https://github.com/kennethreitz/requests-html/issues/145

I have tested this change against 200 sites and it seems to fix the issue. HTML pages all seem to work as expected; I haven't run into a problem with any that I've tested.
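
A minimal sketch of the failure mode this change targets, assuming only that `lxml` is installed. The sample document and the names `raw` and `text` are made up for illustration; they stand in for the bytes held in `raw_html` and the decoded string held in `html`. lxml refuses a decoded `str` that still carries an XML encoding declaration, but parses the original bytes and honours the declared encoding.

```python
from lxml import etree

# Hypothetical XML page as raw bytes, including an encoding declaration.
raw = (
    '<?xml version="1.0" encoding="ISO-8859-1"?>'
    '<root><item>caf\u00e9</item></root>'
).encode("ISO-8859-1")

# The decoded text, analogous to what a decoded `html` string would hold.
text = raw.decode("ISO-8859-1")

# Parsing the decoded str fails because of the encoding declaration.
try:
    etree.fromstring(text)
    str_error = None
except ValueError as exc:
    str_error = str(exc)

# Parsing the raw bytes works; lxml applies the declared encoding itself.
tree = etree.fromstring(raw)
item_text = tree.findtext("item")

print(str_error)
print(item_text)
```

This is only an illustration of the behaviour, not the patch itself.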

@sentriz

sentriz commented Apr 14, 2018

I tried this fix for this site, but it didn't seem to fix it

@angaz
Contributor Author

angaz commented Apr 14, 2018

Hi @sentriz, I am unable to see any problems with this site. This is my attempt to reproduce:

pip install the branch:

sudo -H pip3 install git+https://github.com/SN9NV/requests-html.git@patch-1

Make sure that lxml loads the page by querying an xpath for all the text:

>>> from requests_html import HTMLSession
>>> session = HTMLSession()
>>> r = session.get('https://www.365online.com/online365/spring/authentication?execution=e1s1')
>>> r.html.xpath('//text()')
['\n        \n        ', '\n            ', '\n\n            ', '\n            ', '\n            ', '\n            ', ' \n            ', '\n            ', '\n            \n            ', '\n            ', '\n\n            ', 'window.RICH_FACES_EXTENDED_SKINNING_ON=true;', '\n            ', '\n            ', '\n            ', '\n\n            ', '\n            ', '\n        ', '\n        ', '\n        \n\t\t', 'Bank of Ireland 365 Online | Login - Step 1 of 2', '\n        \n        ', '\n\n            ', '\n            ', '\n            ', '\n            ', '\n            ', '  \n            ', " <!-- REMOVE DIV ONCLICK FUNCTION -->\n                function hide_element(element_name) {\n                    element = document.getElementById(element_name);  \n                    element.style.display = 'none';\n                }\n                var backgroundPositionDefault = '0px 0px';\n            \tvar backgroundPositionUpdated = '0px 80px';\n            ", '\n            ', '\n            \tvar $j = jQuery.noConflict();\n\t\t\t\tfunction closeSmartBanners(element) {\n\t\t\t\t\t$j("#smartBannerSection").css(\'display\',\'none\');\n\t\t\t\t\tupdateBackgroundPosition(backgroundPositionDefault);\n\t\t\t\t}\n\t\t\t\tfunction updateBackgroundPosition(pos) {\n\t\t\t\t\t$j("body").css(\'background-position\',pos);\n\t\t\t\t}\n\t\t\t\t$j(window).load(function(){\n\t\t\t\t\tupdateBackgroundPosition(backgroundPositionDefault);\n\t\t\t\t\tif(device.isAndroid()) {\t\t\t\t\t\n\t\t\t\t\t\t$j("#smartBannerSection").css(\'display\',\'block\');\n\t\t\t\t\t\tupdateBackgroundPosition(backgroundPositionUpdated);\n\t\t\t\t\t\tif(device.isMobile()) {\n\t\t\t\t\t\t\tvar link = \'http://play.google.com/store/apps/details?id=com.bankofireland.mobilebanking\';\n\t\t\t\t\t\t\tvar appName = \'Bank of Ireland Mobile Banking\';\n\t\t\t\t\t\t} else {\n\t\t\t\t\t\t\tvar link = \'https://play.google.com/store/apps/details?id=com.boi.tablet365\';\n\t\t\t\t\t\t\tvar appName = \'Bank of Ireland 
Tablet Banking\';\n\t\t\t\t\t\t}\n\t\t\t\t\t\t$j("#smartBnrUrl").attr(\'href\',link);\n\t\t\t\t\t\t$j("#smartBnrAppName").html(appName);\n\t\t\t\t\t}\n\t\t\t\t});\n\t\t\t', '\n\t\t', "\n\t\t  (function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){\n\t\t  (i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),\n\t\t  m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)\n\t\t  })(window,document,'script','//www.google-analytics.com/analytics.js','ga');\n\t\t  ga('create', 'UA-55288034-3', 'auto');\n\t\t", '\n\t\t', "\n\t\t\tga('send', 'pageview');\n\t\t", '\n            \n        ', '\n \t\t', '\n\t\t\t', '\n\t\t\t\t', '\n\t\t\t\t\t', '\n\t\t\t\t\t\t', 'X', '\n\t\t\t\t\t\t', '\n\t\t\t\t\t', '\n\t\t\t\t\t', '\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t', 'Bank of Ireland Mobile Banking', '\n\t\t\t\t\t\t', 'Bank of Ireland', '\n\t\t\t\t\t\t', 'GET - On the Play Store', '\n\t\t\t\t\t', '\n\t\t\t\t', '\n\t\t\t\t', '\n\t\t\t\t\t', 'View', '\n\t\t\t\t', '\n\t\t\t', '\t\n\t\t', '\n\t\t', '\t\n        \n\t\t', '\n\t\t', '\n\t\t', "/*<![CDATA[*/(function(w,d,s,l,i){w[l]=w[l]||[];w[l].push({'gtm.start':\n\t\tnew Date().getTime(),event:'gtm.js'});var f=d.getElementsByTagName(s)[0],\n\t\tj=d.createElement(s),dl=l!='dataLayer'?'&l='+l:'';j.async=true;j.src=\n\t\t'//www.googletagmanager.com/gtm.js?id='+i+dl;f.parentNode.insertBefore(j,f);\n\t\t})(window,document,'script','dataLayer','GTM-PWLHXQ');/*]]>*/", '\n\t\t', '\n\t            ', '\n\t                ', 'skip navigation', '\n\t                ', 'Accessibility', '\n\t            ', '\n   \n        ', '\n\t        ', '\n            \t', 'Need help using this site?', 'Get Help', '\n\t\t\t\t', ' \n\t\t\t', '\n        ', '\n\n    ', '  \n\t    ', '\n\t\t\t', ' \n\t\t\t    ', 'Welcome to 365 online\n\t\t\t    ', '\n\t\t\t    ', ' \n\t\t\t    ', ' \n\t\t\t        \n\t\t\t\t\t', '*', ' = mandatory', '\n\t\t\t\t\t\n\t\t\t\t\t', 'Secure Login', '\n\t\t\t\n\t\t\t\t\t', ' 
\n\t\t\t\t\t\n\t\t\t\t\t', '\n\t\t\t\t\t    ', '\n\t\t\t\t\t\t', '\n\t\t\t\t\t\t\t', 'Please enable javascript on your browser.', '\n\t\t\t\t\t\t\t', 'You do not currently have javascript enabled please enable it.\n\t\t\t\t\t\t\t', '\n\t\t\t\t\t\t\t', 'To use this application correctly you must have javascript enabled.\n\t\t\t\t\t\t\t', '\n\t\t\t\t\t\t', '\n\t\t\t\t\t', '\n\t\t\t        \n\t\t\t\t\t', 'Secure Login - Step 1 of 2', ' ', '\n\t\t\t\t\t\t', ' \n                               ', 'Please enter your ', ' User ID', '\n                               ', '*', ' \n                           ', '\n\t\t\t\n\t\t\t                ', '\n\t\t\t                    ', '\n\t\t\t                        ', 'Date of birth', '\n\t\t\t                        ', 'Please enter your ', 'Date of Birth', '\n\t\t\t\n\t\t\t                        ', 'DD', '\n\t\t\t                        ', 'MM', '\n\t\t\t                        ', 'YYYY', '\n\t\t\t                        ', '*', '\xa0/ ', '\xa0/ ', '\n\t\t\t                    ', '\n\t\t\t                ', ' ', ' \n\t\t\t\t\t', '\n\t\t\t\t\t', '\n\t\t\t\t\t\t', ' ', 'Forgot details', '\n\t\t\t\t\t\t', '\n\t\t\t\t\t\t\t', ' \n\t\t\t\t\t\t\t', '\n\t\t\t\t\t\t\t\tRegister\n\t\t\t\t\t\t\t', '\n\t\t\t\t\t\t', '\n', '\n', 'Continue', '\n\t\t\t\t\t', '\n\t\t\t\t\t', '\n\t\t\t\t', ' \n\t\t\t\t', ' \n\t\t\t\t', '\n      ', '\n          ', 'Stay safe and secure', '\n          ', 'We will never email you requesting your online login details', ' - please report any suspicious emails to ', '365security@boi.com', '\n      ', '\n\t\t\t', '\t\n\t    ', '\n    ', '\n\t\n\t', '\n\t      ', '\n\t       ', '\n        \n    ', '\n    \t', '\n\t\t', 'Looking for your IBAN?', '\n\t    ', 'Your IBAN is displayed on 365 online and in the Mobile and Tablet Apps. ', 'Find another IBAN using our calculator. 
', '\n\t    ', 'More Info', '\n    ', '\n     \n    ', '\n    \t', '\n\t\t', 'Stay safe online', '\n        ', 'Watch our short video for tips on how to keep yourself safe online.\n', '\n        ', 'More Info', '\n    ', ' \n\t       ', '\n\n    ', '\n\n    ', '\n        ', '365 Online Demo', '\n        ', 'Need help with your online banking?', '\n        ', 'View our helpful demo for a step-by-step guide to the most popular services available on 365 online.', '\n        ', 'See Demo', '\n    ', '\n\n\t', '\n\t', ' \n\t', '\n\n\t', '\n\t', 'J1\n\t', '\n\t', '\n\t\t', ' ', 'About\n\t\tUs', ' |', '\n\t\t', ' ', 'Security', '\n\t\t|', '\n\t\t', ' ', 'Cookie and Privacy Policy', ' |', '\n\t\t', ' ', 'Terms and Conditions', ' |', '\n\t\t', ' ', 'FAQs', '\n\t\t|', '\n\t\t', ' ', 'Accessibility', '\n\n\t', '\n\t\n\t', '\n\t', '\n\t     ', '\n\t            \n\t\t\t\t', '\n\t\t\t\t\t', 'For details of NI/GB products & services, please see ', 'www.bankofireland.co.uk', ' ', '\t\t\t\t\n\t\t\t\t\t', 'Bank of Ireland is regulated by the Central Bank of Ireland. Bank of Ireland (UK) plc is authorised by the Prudential Regulation Authority and regulated by the Financial Conduct Authority and the Prudential Regulation Authority. Bank of Ireland Life is a trading name of New Ireland Assurance Company plc. New Ireland Assurance Company plc trading as Bank of Ireland Life is regulated by the Central Bank of Ireland. Life assurance and pension products are provided by New Ireland Assurance Company PLC trading as Bank of Ireland Life. ', '\n\t\t\t\t', "function clear_form() {\n_clearJSFFormParameters('form','',['form:j_idcl','form:_link_hidden_']);\n}\nfunction clearFormHiddenParams_form(){clear_form();}\nfunction clearFormHiddenParams_form(){clear_form();}\nclear_form();", '\n        \n']

Can you please provide a code snippet that triggers your issue so that I can investigate?

@sentriz

sentriz commented Apr 14, 2018

try r.html.find("#form\:phoneNumber")

here is a picture:

the first command shows that your changes are in there

EDIT: It seems to work if I pass self.html.encode("ISO-8859-1") to PyQuery on line 149, but that isn't a real fix; I just took the encoding from the declaration at the top of the page I'm having trouble with:
<?xml version="1.0" encoding="ISO-8859-1"?>
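
A rough sketch of how that workaround could be generalised, using only the standard library. The helper names `declared_encoding` and `encode_as_declared` are hypothetical and not part of requests-html. The idea is to read the encoding from the XML declaration instead of hard-coding it, then encode the text back to bytes. As noted above, this is a workaround rather than a proper fix.

```python
import re

# Matches the encoding attribute of a leading XML declaration.
XML_DECL = re.compile(
    r'^\s*<\?xml[^>]*?encoding\s*=\s*["\']([\w.-]+)["\']',
    re.IGNORECASE,
)


def declared_encoding(text, default="utf-8"):
    """Return the encoding named in the XML declaration, or `default`."""
    match = XML_DECL.match(text)
    return match.group(1) if match else default


def encode_as_declared(text):
    """Encode `text` to bytes using the encoding its declaration names."""
    return text.encode(declared_encoding(text))


# Hypothetical page content, already decoded to str.
page = '<?xml version="1.0" encoding="ISO-8859-1"?><root>caf\u00e9</root>'

print(declared_encoding(page))
print(encode_as_declared(page))
```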

PyQuery has the same issue with XML sites that LXML has with Unicode strings, because it uses LXML to parse the page.
The fix has already been applied to the LXML tree, so we can fix the issue in PyQuery by passing it the already-parsed LXML tree.

@angaz
Contributor Author

angaz commented Apr 14, 2018

Thanks for your discovery!

The issue is that LXML cannot parse a Unicode string that carries an XML encoding declaration, so your idea of encoding the string back to bytes is a valid solution.

BaseParser.find uses PyQuery to evaluate CSS selectors. I had not thought of testing the PyQuery part because I only use the xpath query. PyQuery uses LXML to parse the tree, so it has the same problem with the Unicode string. Initialising PyQuery with raw_html would therefore also fix the problem here, but there is an even better solution.

The PyQuery documentation says that you can initialise PyQuery with an LXML etree, so we can just use the LXML tree we already have, BaseParser.lxml. That solves the issue and cuts down on processing, because we create the LXML tree only once. Kill two birds with one stone.

Here is your code with the new patch.

>>> from requests_html import HTMLSession
>>> session = HTMLSession()
>>> session.get('https://www.365online.com/online365/spring/authentication?execution=e1s1')
<Response [200]>
>>> _.html.find('#form\:phoneNumber')
[<Element 'input' id='form:phoneNumber' type='text' name='form:phoneNumber' autocomplete='off' class=('inputbox', 'accountID') maxlength='4' onkeyup="autoTabMaxLength(event, this,'form:continue')" size='18' tabindex='2' title='Please enter the last four digits of your contact number'>]
>>> 

P.S. I didn't know that you could use an underscore to access the return from the previous command in the Python interactive console. Thanks! I learnt something.

@sentriz

sentriz commented Apr 14, 2018

very nice.
good work @SN9NV
I hope this gets merged

@rverton

rverton commented Jul 4, 2018

This also fixed the Unicode bug for me. Thanks!

@kennethreitz
Collaborator

skeptical, but i'll give it a shot. will tag appropriately for re-review later.

@phith0n

phith0n commented Mar 23, 2019

@SN9NV @kenneth-reitz
Please take a look at issue https://github.com/kennethreitz/requests-html/issues/279; I think it was caused by #160.
