Wikipedia:Bots/Requests for approval/BaranBOT 2
- The following discussion is an archived debate. Please do not modify it. To request review of this BRFA, please start a new section at Wikipedia:Bots/Noticeboard. The result of the discussion was Approved.
Operator: DreamRimmer (talk · contribs · SUL · edit count · logs · page moves · block log · rights log · ANI search)
Time filed: 14:01, Monday, May 27, 2024 (UTC)
Automatic, Supervised, or Manual: automatic
Programming language(s): Python
Source code available:
Function overview: Fix the URLs for the ECI election database.
Links to relevant discussions (where appropriate):
Edit period(s): Every six months
Estimated number of pages affected: 5050
Exclusion compliant (Yes/No): No
Already has a bot flag (Yes/No): No
Function details: The Election Commission of India has moved all of its data (except for very recent elections) to a subdomain. As a result, URLs in more than 5000 pages are now invalid and are giving a 404 error. This bot will replace URLs like https://eci.gov.in/files/file/11699-maharashtra-legislative-assembly-election-2019 with the new URL https://old.eci.gov.in/files/file/11699-maharashtra-legislative-assembly-election-2019. Simply replace https://eci.gov.in/ with https://old.eci.gov.in/.
Discussion
Why every six months? Primefac (talk) 18:28, 27 May 2024 (UTC)[reply]
- In India, elections are held in 5-6 states every year. As the elections approach or conclude, the ECI moves data from previous elections to this subdomain. This means that many URLs will become invalid after each year's elections. – DreamRimmer (talk) 22:19, 27 May 2024 (UTC)[reply]
- Apologies if this is coming across as dense, just want to make sure I'm on the same page. Let's arbitrarily say that there's an election in July 2024, and the URL for those pages starts with https://eci.gov.in/ since it's a "recent election". At what point will that URL get archived to the https://old.eci.gov.in/ prefix? If it is archived after the subsequent election, why not just update the URL with the new election information along with the data it represents? Primefac (talk) 15:00, 6 June 2024 (UTC)[reply]
- The problem is that I don't know when ECI moves older election results to the old.eci URL. The recent elections, held in November 2023 in six states, were six months ago. So far, the ECI has moved three sets of election data to the old.eci domain. This suggests that they archive election data within six to ten months. For now, we can fix all these broken links, but we might need to do this again for future elections. If the BRFA folks think it's unnecessary to do this regularly (every six months), it's fine to handle it once. I'll try to submit a new BRFA in the future, and we can continue regularly if needed. – DreamRimmer (talk) 14:01, 7 June 2024 (UTC)[reply]
- Previous discussion Wikipedia:Link_rot/URL_change_requests#ECI_-_Election_Commission_of_India. Geoblocking is preventing outside-India bots and DreamRimmer has India IP access. DreamRimmer, to caution, there are many non-obvious problems that can arise when operating on URLs. Probably the biggest is archive URLs you don't want to modify. This PCRE regex should capture only non-archive URLs (untested):
(?<!/)(?<!\\?url=)https?://eci[.]gov[.]in/[^\\s\\]|}{<]*[^\\s\\]|}{<]*
- Also verify the new URL is working before switching, do a header check, don't assume, websites always have error rates some higher than others. Other issues might arise, most problems will show up during the first 100 or so edits. Common trouble points are |url-status=, {{webarchive}} and {{dead link}}. Also links that are square and bare. It might be too difficult to get all these exactly right, if you can change the main |url= and square URLs and verify the new URL works, that will go a long way! -- GreenC 15:51, 8 June 2024 (UTC)[reply]
- I would definitely be cautious to avoid any potential mistakes. – DreamRimmer (talk) 16:57, 14 June 2024 (UTC)[reply]
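For illustration only, the two precautions suggested above (skipping URLs embedded in archive links, and doing a header check before switching) might be sketched in Python roughly as follows. This is not the bot's actual code, which is not linked in this request. The names `LIVE_ECI_URL`, `rewrite_wikitext` and `responds_ok` are hypothetical, and the pattern is a simplified adaptation of the untested regex above that rewrites only the host prefix and keeps the original scheme.

```python
import re
import urllib.request

# Hypothetical pattern adapted from the regex suggested in the discussion.
# The fixed-width lookbehinds skip ECI URLs that are embedded in another URL,
# e.g. after the "/" of an archive link or after a "?url=" query parameter.
LIVE_ECI_URL = re.compile(r"(?<!/)(?<!\?url=)(https?)://eci\.gov\.in/")


def rewrite_wikitext(text: str) -> str:
    """Point non-archive eci.gov.in links at old.eci.gov.in, keeping the scheme."""
    return LIVE_ECI_URL.sub(r"\1://old.eci.gov.in/", text)


def responds_ok(url: str, timeout: float = 10.0) -> bool:
    """Header check: True only if a HEAD request returns a 2xx status.

    Assumption: the caller has network access that is not geoblocked.
    """
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return 200 <= response.status < 300
    except OSError:
        # URLError, HTTPError and timeouts are all OSError subclasses.
        return False
```

A bot along these lines would presumably call `responds_ok` on each rewritten URL and leave the original link alone when the check fails. Handling of |url-status=, {{webarchive}} and {{dead link}} is not covered by this sketch.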
- Approved for trial (100 edits). Please provide a link to the relevant contributions and/or diffs when the trial is complete. Let's see how things get on. Primefac (talk) 15:25, 27 June 2024 (UTC)[reply]
- Trial complete. Edits. Everything worked as intended, and all the new URLs are working fine. I want to note that initially, the bot did not change the url-status for Mangolpuri Assembly constituency, Chandni Chowk Assembly constituency, and Bawana Assembly constituency, but this issue has since been resolved and now functions correctly. Pinging User:GreenC if they want to take a look. – DreamRimmer (talk) 13:41, 3 July 2024 (UTC)[reply]
- I spot checked, don't see any problems. Can you confirm if it also modifies these types:
- ie. square (1) and bare (2) links. -- GreenC 17:56, 5 July 2024 (UTC)[reply]
Note: these links are georestricted to India IPs and can't be archived, or archived very well. I found an article in The Hindu that talks about it. The article quotes one of our most technically knowledgeable editors, User:Nemo_bis, who said: "Nemo has studied 'geofencing' of Indian government websites in the past, and in 2020 created a proxy service for users located abroad to access Indian government websites". This might be our solution. I hope Nemo has a working proxy for the Election Commission website? -- GreenC 17:58, 5 July 2024 (UTC)[reply]
- @GreenC, I am fixing all the links that start with https://eci.gov.in/files/file/, https://eci.gov.in/category, and https://eci.gov.in/ByeElection/. All these links are archived in a subdomain. The links for the 2023 elections of Chhattisgarh, Telangana, Rajasthan, Mizoram, and Madhya Pradesh are still working and have not been moved to the old subdomain, so I will not touch them.
- The working links are formatted as follows: (eg.)
- https://www.eci.gov.in/chhattisgarh-legislative-election-2023-statistical-report
- https://www.eci.gov.in/mp-legislative-election-2023-statistical-report
- https://www.eci.gov.in/mizoram-legislative-election-2023-statistical-report
- The old election links are formatted as follows: (eg.)
- https://eci.gov.in/files/file/9643-statistical-data-of-general-election-to-chhatisgarh-assembly-2018/ (now https://old.eci.gov.in/files/file/9643-statistical-data-of-general-election-to-chhatisgarh-assembly-2018/)
- https://eci.gov.in/files/file/9685-madhya-pradesh-legislative-election-2018-statistical-report/ (now https://old.eci.gov.in/files/file/9685-madhya-pradesh-legislative-election-2018-statistical-report/)
- https://eci.gov.in/files/file/9687-mizoram-legislative-election-2018-statistical-report/ (now https://old.eci.gov.in/files/file/9687-mizoram-legislative-election-2018-statistical-report/)
- Other links that start with https://eci.gov.in/category and https://eci.gov.in/ByeElection/ have all been moved to the subdomain, so I will need to fix them. – DreamRimmer (talk) 14:03, 6 July 2024 (UTC)[reply]
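The scope described in this reply (only the three moved path prefixes, with the still-working www.eci.gov.in links left untouched) could be sketched as below. This is an illustrative assumption about how such a filter might look, not the operator's implementation. `MOVED_PREFIXES`, `OLD_HOST`, `NEW_HOST` and `rewrite_if_moved` are hypothetical names.

```python
# Path prefixes that, per the discussion, have been moved to the old subdomain.
MOVED_PREFIXES = (
    "https://eci.gov.in/files/file/",
    "https://eci.gov.in/category",
    "https://eci.gov.in/ByeElection/",
)
OLD_HOST = "https://eci.gov.in/"
NEW_HOST = "https://old.eci.gov.in/"


def rewrite_if_moved(url: str) -> str:
    """Rewrite a URL only if it starts with one of the moved prefixes."""
    if url.startswith(MOVED_PREFIXES):  # str.startswith accepts a tuple
        return NEW_HOST + url[len(OLD_HOST):]
    # Anything else, including www.eci.gov.in links, is returned unchanged.
    return url
```

Restricting the rewrite to an explicit list of prefixes means links for recent elections are never touched, at the cost of having to extend the list if ECI moves other paths later.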
- My read of the above is that this is ready for approval, but I just want to double-check given the discussion above. Primefac (talk) 23:34, 4 August 2024 (UTC)[reply]
- {{Operator assistance needed}} Primefac (talk) 13:12, 10 August 2024 (UTC)[reply]
- Primefac, please let me know if you need me to explain this again. – DreamRimmer (talk) 14:05, 10 August 2024 (UTC)[reply]
- I don't necessarily need any explanation, just wanted to make sure your last reply to GreenC was clarification that the issues can be resolved, and not further complications; the first time I read it about two weeks ago it sounded like the latter, but when I made my comment last week it was after re-reading it and it sounded more like the former. Primefac (talk) 16:51, 10 August 2024 (UTC)[reply]
- @Primefac, I was just noting that in addition to correcting https://eci.gov.in/files/file/* urls, this task will also be correcting https://eci.gov.in/category/*, and https://eci.gov.in/ByeElection/* urls too. This is already done and the bot is ready to start its work once this BRFA gets approval. – DreamRimmer (talk) 11:35, 11 August 2024 (UTC)[reply]
- Savvy. Approved. Primefac (talk) 11:58, 11 August 2024 (UTC)[reply]
- The above discussion is preserved as an archive of the debate. Please do not modify it. To request review of this BRFA, please start a new section at Wikipedia:Bots/Noticeboard.