Create LXML from raw_html #160
Create LXML from `self.raw_html` instead of `self.html` so that LXML can process plain XML pages, as per beda42's findings in issue https://github.com/kennethreitz/requests-html/issues/145. I have tested this change with 200 sites and it seems to fix the issue. HTML pages all seem to work as expected; I haven't run into an issue with any that I've tested.
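For anyone reading along, here is a minimal sketch of the behaviour behind this change. It assumes the failure in the linked issue comes from handing lxml a decoded `str` that still carries an XML encoding declaration. The sample document and the `parse_titles` helper are made up for illustration and are not part of requests-html.

```python
from lxml import etree

# Hypothetical sample document; any XML page with an encoding declaration will do.
XML_TEXT = '<?xml version="1.0" encoding="UTF-8"?><feed><title>Example</title></feed>'


def parse_titles(document):
    """Parse the document with lxml and return the text of every <title>."""
    root = etree.fromstring(document)
    return root.xpath('//title/text()')


# Decoded text (str) plus an encoding declaration: lxml is expected to
# refuse this with a ValueError, so the error is captured instead of crashing.
try:
    parse_titles(XML_TEXT)
    str_error = None
except ValueError as exc:
    str_error = str(exc)

# The same document as raw bytes parses without trouble, which mirrors
# building the tree from the raw response body rather than the decoded text.
bytes_titles = parse_titles(XML_TEXT.encode('utf-8'))

print('str input error:', str_error)
print('bytes input titles:', bytes_titles)
```

The point is only that the bytes form sidesteps the encoding-declaration check; how the raw bytes are obtained inside the library is left to the patch itself.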
I tried this fix for this site, but it didn't seem to fix it.
Hi @sentriz, I am unable to see any problems with this site. This is my attempt to reproduce: pip install the branch:
Make sure that lxml loads the page by querying an XPath for all the text:
>>> from requests_html import HTMLSession
>>> session = HTMLSession()
>>> r = session.get('https://www.365online.com/online365/spring/authentication?execution=e1s1')
>>> r.html.xpath('//text()')
['\n \n ', '\n ', '\n\n ', '\n ', '\n ', '\n ', ' \n ', '\n ', '\n \n ', '\n ', '\n\n ', 'window.RICH_FACES_EXTENDED_SKINNING_ON=true;', '\n ', '\n ', '\n ', '\n\n ', '\n ', '\n ', '\n ', '\n \n\t\t', 'Bank of Ireland 365 Online | Login - Step 1 of 2', '\n \n ', '\n\n ', '\n ', '\n ', '\n ', '\n ', ' \n ', " <!-- REMOVE DIV ONCLICK FUNCTION -->\n function hide_element(element_name) {\n element = document.getElementById(element_name); \n element.style.display = 'none';\n }\n var backgroundPositionDefault = '0px 0px';\n \tvar backgroundPositionUpdated = '0px 80px';\n ", '\n ', '\n \tvar $j = jQuery.noConflict();\n\t\t\t\tfunction closeSmartBanners(element) {\n\t\t\t\t\t$j("#smartBannerSection").css(\'display\',\'none\');\n\t\t\t\t\tupdateBackgroundPosition(backgroundPositionDefault);\n\t\t\t\t}\n\t\t\t\tfunction updateBackgroundPosition(pos) {\n\t\t\t\t\t$j("body").css(\'background-position\',pos);\n\t\t\t\t}\n\t\t\t\t$j(window).load(function(){\n\t\t\t\t\tupdateBackgroundPosition(backgroundPositionDefault);\n\t\t\t\t\tif(device.isAndroid()) {\t\t\t\t\t\n\t\t\t\t\t\t$j("#smartBannerSection").css(\'display\',\'block\');\n\t\t\t\t\t\tupdateBackgroundPosition(backgroundPositionUpdated);\n\t\t\t\t\t\tif(device.isMobile()) {\n\t\t\t\t\t\t\tvar link = \'http://play.google.com/store/apps/details?id=com.bankofireland.mobilebanking\';\n\t\t\t\t\t\t\tvar appName = \'Bank of Ireland Mobile Banking\';\n\t\t\t\t\t\t} else {\n\t\t\t\t\t\t\tvar link = \'https://play.google.com/store/apps/details?id=com.boi.tablet365\';\n\t\t\t\t\t\t\tvar appName = \'Bank of Ireland Tablet Banking\';\n\t\t\t\t\t\t}\n\t\t\t\t\t\t$j("#smartBnrUrl").attr(\'href\',link);\n\t\t\t\t\t\t$j("#smartBnrAppName").html(appName);\n\t\t\t\t\t}\n\t\t\t\t});\n\t\t\t', '\n\t\t', "\n\t\t (function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){\n\t\t (i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),\n\t\t 
m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)\n\t\t })(window,document,'script','//www.google-analytics.com/analytics.js','ga');\n\t\t ga('create', 'UA-55288034-3', 'auto');\n\t\t", '\n\t\t', "\n\t\t\tga('send', 'pageview');\n\t\t", '\n \n ', '\n \t\t', '\n\t\t\t', '\n\t\t\t\t', '\n\t\t\t\t\t', '\n\t\t\t\t\t\t', 'X', '\n\t\t\t\t\t\t', '\n\t\t\t\t\t', '\n\t\t\t\t\t', '\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t', 'Bank of Ireland Mobile Banking', '\n\t\t\t\t\t\t', 'Bank of Ireland', '\n\t\t\t\t\t\t', 'GET - On the Play Store', '\n\t\t\t\t\t', '\n\t\t\t\t', '\n\t\t\t\t', '\n\t\t\t\t\t', 'View', '\n\t\t\t\t', '\n\t\t\t', '\t\n\t\t', '\n\t\t', '\t\n \n\t\t', '\n\t\t', '\n\t\t', "/*<![CDATA[*/(function(w,d,s,l,i){w[l]=w[l]||[];w[l].push({'gtm.start':\n\t\tnew Date().getTime(),event:'gtm.js'});var f=d.getElementsByTagName(s)[0],\n\t\tj=d.createElement(s),dl=l!='dataLayer'?'&l='+l:'';j.async=true;j.src=\n\t\t'//www.googletagmanager.com/gtm.js?id='+i+dl;f.parentNode.insertBefore(j,f);\n\t\t})(window,document,'script','dataLayer','GTM-PWLHXQ');/*]]>*/", '\n\t\t', '\n\t ', '\n\t ', 'skip navigation', '\n\t ', 'Accessibility', '\n\t ', '\n \n ', '\n\t ', '\n \t', 'Need help using this site?', 'Get Help', '\n\t\t\t\t', ' \n\t\t\t', '\n ', '\n\n ', ' \n\t ', '\n\t\t\t', ' \n\t\t\t ', 'Welcome to 365 online\n\t\t\t ', '\n\t\t\t ', ' \n\t\t\t ', ' \n\t\t\t \n\t\t\t\t\t', '*', ' = mandatory', '\n\t\t\t\t\t\n\t\t\t\t\t', 'Secure Login', '\n\t\t\t\n\t\t\t\t\t', ' \n\t\t\t\t\t\n\t\t\t\t\t', '\n\t\t\t\t\t ', '\n\t\t\t\t\t\t', '\n\t\t\t\t\t\t\t', 'Please enable javascript on your browser.', '\n\t\t\t\t\t\t\t', 'You do not currently have javascript enabled please enable it.\n\t\t\t\t\t\t\t', '\n\t\t\t\t\t\t\t', 'To use this application correctly you must have javascript enabled.\n\t\t\t\t\t\t\t', '\n\t\t\t\t\t\t', '\n\t\t\t\t\t', '\n\t\t\t \n\t\t\t\t\t', 'Secure Login - Step 1 of 2', ' ', '\n\t\t\t\t\t\t', ' \n ', 'Please enter your ', ' User ID', '\n ', '*', ' 
\n ', '\n\t\t\t\n\t\t\t ', '\n\t\t\t ', '\n\t\t\t ', 'Date of birth', '\n\t\t\t ', 'Please enter your ', 'Date of Birth', '\n\t\t\t\n\t\t\t ', 'DD', '\n\t\t\t ', 'MM', '\n\t\t\t ', 'YYYY', '\n\t\t\t ', '*', '\xa0/ ', '\xa0/ ', '\n\t\t\t ', '\n\t\t\t ', ' ', ' \n\t\t\t\t\t', '\n\t\t\t\t\t', '\n\t\t\t\t\t\t', ' ', 'Forgot details', '\n\t\t\t\t\t\t', '\n\t\t\t\t\t\t\t', ' \n\t\t\t\t\t\t\t', '\n\t\t\t\t\t\t\t\tRegister\n\t\t\t\t\t\t\t', '\n\t\t\t\t\t\t', '\n', '\n', 'Continue', '\n\t\t\t\t\t', '\n\t\t\t\t\t', '\n\t\t\t\t', ' \n\t\t\t\t', ' \n\t\t\t\t', '\n ', '\n ', 'Stay safe and secure', '\n ', 'We will never email you requesting your online login details', ' - please report any suspicious emails to ', '365security@boi.com', '\n ', '\n\t\t\t', '\t\n\t ', '\n ', '\n\t\n\t', '\n\t ', '\n\t ', '\n \n ', '\n \t', '\n\t\t', 'Looking for your IBAN?', '\n\t ', 'Your IBAN is displayed on 365 online and in the Mobile and Tablet Apps. ', 'Find another IBAN using our calculator. ', '\n\t ', 'More Info', '\n ', '\n \n ', '\n \t', '\n\t\t', 'Stay safe online', '\n ', 'Watch our short video for tips on how to keep yourself safe online.\n', '\n ', 'More Info', '\n ', ' \n\t ', '\n\n ', '\n\n ', '\n ', '365 Online Demo', '\n ', 'Need help with your online banking?', '\n ', 'View our helpful demo for a step-by-step guide to the most popular services available on 365 online.', '\n ', 'See Demo', '\n ', '\n\n\t', '\n\t', ' \n\t', '\n\n\t', '\n\t', 'J1\n\t', '\n\t', '\n\t\t', ' ', 'About\n\t\tUs', ' |', '\n\t\t', ' ', 'Security', '\n\t\t|', '\n\t\t', ' ', 'Cookie and Privacy Policy', ' |', '\n\t\t', ' ', 'Terms and Conditions', ' |', '\n\t\t', ' ', 'FAQs', '\n\t\t|', '\n\t\t', ' ', 'Accessibility', '\n\n\t', '\n\t\n\t', '\n\t', '\n\t ', '\n\t \n\t\t\t\t', '\n\t\t\t\t\t', 'For details of NI/GB products & services, please see ', 'www.bankofireland.co.uk', ' ', '\t\t\t\t\n\t\t\t\t\t', 'Bank of Ireland is regulated by the Central Bank of Ireland. 
Bank of Ireland (UK) plc is authorised by the Prudential Regulation Authority and regulated by the Financial Conduct Authority and the Prudential Regulation Authority. Bank of Ireland Life is a trading name of New Ireland Assurance Company plc. New Ireland Assurance Company plc trading as Bank of Ireland Life is regulated by the Central Bank of Ireland. Life assurance and pension products are provided by New Ireland Assurance Company PLC trading as Bank of Ireland Life. ', '\n\t\t\t\t', "function clear_form() {\n_clearJSFFormParameters('form','',['form:j_idcl','form:_link_hidden_']);\n}\nfunction clearFormHiddenParams_form(){clear_form();}\nfunction clearFormHiddenParams_form(){clear_form();}\nclear_form();", '\n \n']
Can you please provide a code snippet that triggers your issue so that I can investigate?
try the first command shows that your changes are in there EDIT: It seems to work if I pass |
PyQuery has the same issue with XML sites that LXML has with Unicode strings, because it uses LXML to parse the page. The fix has already been applied to LXML, so we can fix the issue in PyQuery by passing the already parsed LXML tree into PyQuery.
Thanks for your discovery! The issue is with LXML being given a Unicode string, so your idea of encoding the string is a solution.
The PyQuery documentation says that you can initialise PyQuery with an LXML etree, so we can just use the LXML tree we already have. Here is your code with the new patch:
>>> from requests_html import HTMLSession
>>> session = HTMLSession()
>>> session.get('https://www.365online.com/online365/spring/authentication?execution=e1s1')
<Response [200]>
>>> _.html.find(r'#form\:phoneNumber')
[<Element 'input' id='form:phoneNumber' type='text' name='form:phoneNumber' autocomplete='off' class=('inputbox', 'accountID') maxlength='4' onkeyup="autoTabMaxLength(event, this,'form:continue')" size='18' tabindex='2' title='Please enter the last four digits of your contact number'>]
>>>
P.S. I didn't know that you could use an underscore to access the return value of the previous command in the Python interactive console. Thanks! I learnt something.
very nice.
Also fixed this unicode bug for me. Thanks!
skeptical, but i'll give it a shot. will tag appropriately for re-review later.
@SN9NV @kenneth-reitz