Wikipedia:Bots/Requests for approval/PockBot
- The following discussion is an archived debate. Please do not modify it. Subsequent comments should be made in a new section. The result of the discussion was Approved.
Operator: PocklingtonDan
Automatic or Manually Assisted: Manually assisted
Programming Language(s): Perl
Function Summary: Producing a list of the status of every article in a category.
Edit period(s) (e.g. Continuous, daily, one time run): One-time-run per manual run. Would be run manually approximately daily.
Edit rate requested: Maximum of 1 read request per second during operation. 1 edit request per instance.
Already has a bot flag (Y/N):
Function Details: For a given category, to produce a list of all articles within that category and all sub-categories, as well as the status of each. To provide this information in tabular format to enable an automated "hit-list" for editors of the category, showing clearly an overview of all stub, start, B, A, and FA class articles. Preview of output is shown on PockBot user page.
Discussion
Seems like an interesting idea. This seems to mainly use GET requests, which are much easier on the server than POST. Given its low edit rate, it will not even need a flag. Voice-of-All 17:26, 19 November 2006 (UTC)[reply]
- Well, if it's reading only RFBA isn't commonly used. Only if it needs write access do we need to worry re bot flags and the likes -- Tawker 19:55, 19 November 2006 (UTC)[reply]
- It's effectively a remotely-hosted web spider, yes, that will do a lot of read requests to get data from pages. But I will in version 2 be wanting the bot to log in and edit the status page at "/Status" for a given category provided at runtime with the processed results of its crawling - it was this that I wanted to get approval for. The idea is that it is given a category and then builds its status list of all articles in the category and then writes the data to /Status. Since it will be able to be run on any category, potentially it could be reading and writing to a lot of category pages. I'm only going to use it on the "Military of ancient Rome" category for testing purposes etc, but it will ultimately be able to be run on any category at any time. Can I skip approval process for this then? Cheers - PocklingtonDan 09:56, 20 November 2006 (UTC)[reply]
- Will you be writing to one status page, or a page per article? — xaosflux Talk 06:05, 21 November 2006 (UTC)[reply]
- The bot would write to a fixed subpage of the category it was run for at "/Status". - PocklingtonDan 08:28, 21 November 2006 (UTC)[reply]
- Will you be using a unique or bot-like User-Agent string, and obey robots.txt? — xaosflux Talk 06:05, 21 November 2006 (UTC)[reply]
- I was going to have the bot login under its own ID. I am not familiar with robots.txt but if you could point me towards this I don't think obeying its conventions would be a problem - PocklingtonDan 08:28, 21 November 2006 (UTC)[reply]
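For readers unfamiliar with the robots.txt convention raised above, here is a minimal sketch of how a crawler might check a URL against a site's rules before fetching it. It uses Python's standard `urllib.robotparser` rather than the bot's Perl; the user-agent string, the sample rules, and the `allowed` helper are hypothetical and not taken from PockBot.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical identifier; a real bot would use its own descriptive string.
USER_AGENT = "ExampleStatusBot/0.1"


def allowed(robots_txt, url, user_agent=USER_AGENT):
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Made-up rules for illustration only.
SAMPLE_RULES = "User-agent: *\nDisallow: /private/\n"
```

In practice the rules would be downloaded once per run and consulted before every read request.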
- Approval is not a priority since I have suspended development of PockBot until I can devote some more time to it. - PocklingtonDan 10:44, 21 November 2006 (UTC)[reply]
Withdrawn by operator. The operator has suspended development on this bot. This request can be restarted at any time if the operator desires. It is not clear whether a bot flag or approval is even required for this bot. It will depend on what the bot intends to do if it is ever reconsidered. I would recommend that if this bot is to be reconsidered that you provide a few example diffs showing exactly what the bot would do. -- RM 15:53, 28 November 2006 (UTC)[reply]
- I have now finished coding version 1.00 of the bot and it is ready for approval request if such is necessary. Could I get this un-archived please? - PocklingtonDan 17:51, 6 December 2006 (UTC)[reply]
- I've reactivated your request. Can you provide us with information on exactly what has been done for version 1.0 and any examples (such as an example /Status page)? -- RM
- Thank you. The full code plus description is now given on the bot's user page. The bot now writes to the category talk page directly rather than creating a separate page. Examples of ones it has done already are Category talk:Wars_of_the_Byzantine_Empire and Category talk:Roman frontiers. It sometimes times out on large categories with 100s of articles in sub-cats and sub-sub-cats due to timeouts of requests to wikipedia but does so without any problems, it just dies and doesn't continue further, with no ill effects for wikipedia. The bot's user page does a better job of describing its operation. It does not exceed 1 read request per second (actually about 2-3 seconds between read requests in practice). It only makes one edit request per instance. Any other questions, let me know PocklingtonDan 18:33, 6 December 2006 (UTC)[reply]
Approved for trial. Please provide a link to the relevant contributions and/or diffs when the trial is complete. This bot has been around for some time as described above and there were no major objections. I've looked over everything including the examples and this looks quite useful. I hope that your bot, since it is available for use by others, has sufficient controls to prevent it from being operated too fast or being abused. As a result I'll approve the trial with the following parameters: No more than 30 page reads and 1 page write per minute. Please run a trial of up to 20 - 30 categories. Ideally some of those articles will be verbally approved and/or generated by other users who want to use this feature. Please post your results here when you are completed. -- RM 19:29, 6 December 2006 (UTC)[reply]
- Thank you, I will start the trial now and post up results shortly. - PocklingtonDan 19:45, 6 December 2006 (UTC)[reply]
- I still have another 15 or so cats to run, but I have had feedback that the bot is a good idea and had extra feature requests, and I notice from the contributions log that the bot has been run by other users too. Will continue running categories through the bot. - PocklingtonDan 08:37, 7 December 2006 (UTC)[reply]
- I've looked over some of the stuff above in a cursory fashion and so far things look good. I was thinking that a read throttle in the final version of the bot would perhaps be unnecessary and only slow things down. There should obviously be a write throttle, but that shouldn't be an issue. In addition, I'd like to see the following features/improvements:
- Have the bot keep a status log of requests. Don't allow the same category (or sub-category) to be updated more than once within, say, a 24 hour period.
- Have a maximum number of articles per table. If the maximum is reached, quit out early and put a note at the top of the generated table. If the table gets too large, it isn't useful anyway.
- It would be nice if on the bot request page you provide a box for a username. Then when the bot is performing its task it can say "PockBot (requested by User:FooBar) - Category...", and the final generated table can say who started it. Obviously this would be subject to potential abuse, but when someone is honest, it can give a good idea who is using the bot. Hopefully vandalism won't be an issue here.
- I'd like to see two status pages: One is a page showing all of the categories in which these tables have been added. The reason for this is that old tables (maybe a month or more old) either need to be updated or removed. Two is a page showing a list of articles that have tables that are managed by the bot automatically. The list of articles should also include the interval. Perhaps the bot could automatically parse the list.
- These tables are cluttered. Could you use transclusion? See this example.
- If this bot becomes popular and the waiting list gets long, we should consider other options, such as hosting a similar bot on the toolserver or running multiple bots simultaneously. But we can address those issues when the time comes. -- RM 13:16, 7 December 2006 (UTC)[reply]
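The first suggestion in the list above (refuse to update the same category more than once in 24 hours) could be sketched roughly as follows. This is an illustration in Python, not the bot's Perl code; the `RunLog` class is hypothetical, and it keeps its log in memory, whereas a real bot would need to persist it between runs.

```python
import time

COOLDOWN_SECONDS = 24 * 60 * 60  # the 24-hour window suggested above


class RunLog:
    """In-memory record of when each category was last processed (hypothetical helper)."""

    def __init__(self):
        self._last_run = {}  # category name -> Unix timestamp of the last run

    def may_run(self, category, now=None):
        """Return True if the category has not been run within the cooldown window."""
        now = time.time() if now is None else now
        last = self._last_run.get(category)
        return last is None or now - last >= COOLDOWN_SECONDS

    def record(self, category, now=None):
        """Note that the category was processed at time `now`."""
        self._last_run[category] = time.time() if now is None else now
```

A run would call `may_run` before crawling and `record` after a successful write; the `now` parameter exists only so the behaviour can be checked without waiting a day.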
- Thanks for a really comprehensive response. A removal of the throttle on read requests would help enormously on larger categories if that would be allowed. I'll cover each of your suggestions in turn:
- A status log preventing same category being re-run within 24 hours is a good idea. I shall implement this.
- I had already considered cutting the bot out after say 100 articles but hadn't implemented it since the div layout allows even large lists to be added to a page without disruption. Since you suggested this feature also, however, maybe it would be useful. Perhaps, if I can improve on your suggestion, it would be best to have this as a user-configurable option at runtime (checkboxes) for quitting at 50, 100, 1000 articles for example? What do you think? The only problem I see implementation-wise is that the bot currently dumb-adds every article to a huge list (not checking if it already exists) and only purges duplicates after the entire job has been run. It's more efficient this way but it means setting a limit would be more difficult. I might have to change the code to check for duplicates at every addition.
- With regard to a box asking for wikipedia username for those running the bot, I had considered something similar already in that I was going to start logging the IP of each user who ran it. Logging the username rather than IP would obviously be more user-friendly when reading the results ("Pockbot on behalf of RM" or similar) but as you say would have to rely on user providing their real username and not something else. I can't see anything malicious in people providing other user's names but what if they entered profanity or similar?
- You mentioned that you would like to see two status pages, one showing all of the categories in which PockBot had inserted tables in order to allow old ones to be removed. I don't think this is necessary for this purpose (it might be interesting for it to store a list, although this would essentially be the same as its contributions page) since a) the summary tables it writes to category talk pages are time-stamped and b) if a category caretaker has run it for a reason they would be likely to run it themselves at a later date for the same category in any case on a monthly basis or whatever. I see your point about old data being stale but the same is true for any data on talk pages, much of it is irrelevant a month after it has been written due to article drift. The correct way to handle this is to archive old talk contents etc. I'm not convinced about this one yet, feel free to convince me further!
- You mentioned that it might be an idea to have the bot keep a "job list" of articles to run monthly or similar (if I understand you correctly). I'm not sure how this would work, since the bot only runs when prompted to do so by a user, it has no facility to be run via a cron job or similar on a periodic basis. Perhaps you could explain further what you meant on this one, don't think I'm quite with you.
- You mentioned that the tables were cluttered and could use transclusion. I didn't even know what transclusion meant and had to look it up(!). I see it means "capability for documents to include sections of other documents by reference", but I'm still not sure what you are proposing exactly, even after looking at the example.
- Thank you immensely for all your input on the bot, it is really helpful, and apologies for appearing a bit thick on a few of the items above!. Cheers - PocklingtonDan 14:32, 7 December 2006 (UTC)[reply]
- First of all, no apologies are necessary. The whole point of this approvals process is to work out the fine details. My ideas are just that: ideas. So let me address your points one by one.
- With regards to an article maximum, I was just thinking that a vandal (or well-meaning user) might put a large category into your bot so that it takes ages to finish. If the bot during its run kept an internal counter of "number of articles processed so far", and just stopped generating the list after that maximum was reached, that would be a decent safeguard. A reasonably large number would be fine, as this would only be a safeguard. No need to make it configurable by users. Some categories have many hundreds or even thousands of articles listed, so we don't want it running forever. Still though, this seems a minor point on the whole thing. If it can't be done easily, don't worry about it.
- The IP Address of the request would be good. I can understand the problem with using the username. Perhaps use the IP address on the edit summaries but use the username when posting the final results. Easier to correct an edit containing vandalism than changing edit summaries with profanity!
- Contributions pages will be fine for bot activity. As for the bot performing updates on some interval based on a list stored on some page, that seems like a useful future feature, not critical for initial usage. I was thinking about some sort of Wikipedia page that people register for work, such as the Wikipedia Signpost Spamlist.
- As for eliminating the old lists, many of the categories that you've worked with didn't even have a talk page before the list was added. Dozens or even hundreds of stale lists could become counterproductive and might generate complaints from others. I like this idea, but I don't want it to run into problems months from now because people find these old lists laying around. A page with one single table doesn't need archiving, it just needs cleaning and/or updating. I can't see the advantage of archiving a table anyway, as the data is stale. We have edit history if anyone is *that* interested. Again this doesn't have to be done for bot approval, but it would be useful if the bot cleaned up after itself.
- Regarding transclusion, this is more complicated perhaps and I am available to help out in its development. More specifically this is template transclusion. If you look at the source of my example, you can see that instead of using the HTML:
- <tr><td>[[Anthracite]]</td><td style="background: #ff6666; text-align: center;">Stub</td></tr>
- I am using
- {{PockBotData|Anthracite|Stub|#ff6666}}
- Eliminating the HTML simplifies the details and makes the list readable if you edit the page. I can help you more with this if you'd like, but the parameters of the template contain each line's unique data that are then used to generate the HTML table in the final viewed page.
- Oh, and could the list be sorted?
- That should be it for now. -- RM 16:03, 7 December 2006 (UTC)[reply]
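The article-maximum safeguard discussed in this exchange (an internal counter, with duplicates checked at each addition rather than purged at the end) might look something like the sketch below. It is written in Python for illustration and is not the bot's Perl code; `collect_articles` and the cap value are assumptions.

```python
MAX_ARTICLES = 500  # arbitrary value chosen for this illustration


def collect_articles(titles, limit=MAX_ARTICLES):
    """Collect unique titles in first-seen order, stopping once `limit` is reached.

    Returns (articles, truncated); `truncated` is True when the cap cut the run short.
    """
    seen = set()
    articles = []
    for title in titles:
        if title in seen:
            continue  # duplicate reached via another sub-category
        if len(articles) >= limit:
            return articles, True  # more unique articles exist than the cap allows
        seen.add(title)
        articles.append(title)
    return articles, False
```

Checking membership in a set at each addition keeps the count exact, so the cap applies to unique articles rather than to raw entries including duplicates. The `truncated` flag is what would trigger a note at the top of the generated table.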
- Thanks again for your help here, you've been fantastic. OK:
- I have just implemented the username field as per your suggestion. I have not put it in the edit summary field to allow easy removal of profanity if bot is used for (very roundabout) vandalism
- With regards to the article maximum, I see what you mean. I will hardcode a 500-article limit into the bot. Shouldn't be too tricky, if the 500 number doesn't need to be too exact, i.e. as long as we don't mind it capping out at 420 uniques (and 80 duplicates) sometimes and 500 articles proper sometimes. Can't see a problem with this, will do this this afternoon.
- I will add the IP address too. Username is now embedded in page as part of bot's signature. I will put user IP address in edit summary. Good idea.
- I think if it is OK I will leave performing updates on some interval for a future build sometime after approval. I think it is a good idea (in fact has already been suggested to me elsewhere that it could be fed a "list" in addition to individual usage per-category), but it's a major new functionality to implement at this point.
- In relation to eliminating old content once it is stale, I do see your point. What I would like to propose is to not include this in version 1 either. It seems like a lot of your ideas would work well together in a version 2 or a separate "User:PockBot2". It would keep 2 databases: a) a list of run categories, along with timestamps and b) a list of requested categories to run for. This latter could be a standard wikipedia page that people could add requests to and that it could harvest from. This would work in several of your ideas. Having it run in this way, it could be run periodically, wholly automatically. As I say, this is such a major revision that I would consider it a version 2 or (since I think current scope of quick 1-off article runs has merit too) possibly a separate "User:PockBot2" or similar.
- Regarding transclusion, I think I see what you mean, you want the bot to write "{{PockBotData|Anthracite|Stub|#ff6666}}" to the page for each article instead of the HTML code. Is it as simple to do this as it looks? This should be fairly easy to do I think if it's all set up and all I need to do is print that code instead of HTML. I'll have a go at it and see if I can figure it out!
- You asked about list sorts - it already has this feature - there are little arrows next to the headers of each column that allow dynamic sorting by whatever header you choose. Do I need to make the arrows more obvious or change it to text "(sort)" or something do you think?
- Cheers - PocklingtonDan 16:50, 7 December 2006 (UTC)[reply]
- Ah, no, actually there's a problem with these last two - transclusion and list sorting are mutually exclusive, the latter relies on HTML table tags for sorting. I think I prefer to keep the dynamic sorting if possible but what do you think? 16:53, 7 December 2006 (UTC)
- IP address logging added now too. I want to a) hardcode the 500 article limit and b) overcome a browser timeout issue for larger categories and then I will be asking for final approval - already done 30+ edits including some for fairly large categories. - PocklingtonDan 20:04, 7 December 2006 (UTC)[reply]
- I have now added a hardcoded limit to quit when reaching 100 subcats (or sub-sub cats etc). If it reaches this it will stop running and post up the results it got up to that point along with a notice that it quit at 50 subcats. I can change this value if it is thought to be too great/small. It ran for a fairly large cat (Category:Vietnam War) fine after this change, taking about 5 minutes, so it looks like this is high enough a limit not to interfere with genuine usage, but still capping it if someone tries running it on the root category or something. There's nothing much I can do to ensure the user browser connection stays live for 5 minutes while running large cats but other than that, the bot has now done 40+ cats, and incorporated most of the suggested functionality. Can I get approval for this now? Cheers - PocklingtonDan 10:01, 8 December 2006 (UTC)[reply]
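A capped crawl of the kind described in the comment above could be sketched as a breadth-first walk that refuses to queue further sub-categories once a limit is reached. This is a Python illustration under stated assumptions, not the bot's Perl implementation: `get_members` is a hypothetical callback standing in for the bot's read requests, and the cap value is only illustrative. Unlike the behaviour described above, this sketch finishes the sub-categories already queued instead of stopping at once.

```python
from collections import deque

MAX_SUBCATS = 100  # illustrative cap; the thread does not settle on a final value


def crawl(root, get_members, max_subcats=MAX_SUBCATS):
    """Walk a category tree breadth-first with a cap on sub-categories.

    get_members(category) must return (articles, subcategories).
    Returns (articles, hit_limit); hit_limit is True if any sub-category was skipped.
    """
    visited = {root}
    queue = deque([root])
    articles = []
    hit_limit = False
    while queue:
        category = queue.popleft()
        pages, subcats = get_members(category)
        articles.extend(pages)
        for sub in subcats:
            if sub in visited:
                continue  # already queued; also guards against category loops
            if len(visited) - 1 >= max_subcats:
                hit_limit = True  # cap reached: skip this sub-category
                continue
            visited.add(sub)
            queue.append(sub)
    return articles, hit_limit
```

When `hit_limit` comes back True, the caller would post the partial results together with a notice that the run stopped early.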
- Using templates is that easy. Just do the following type of thing and you won't need any HTML at all. Just edit the template itself to change how the tables are globally displayed:
- {{PockBotHeader}}
- {{PockBotData|Anthracite|Stub|#ff6666}}
- ...
- {{PockBotData|Coal|''not yet classified''|white}}
- {{PockBotFooter|~~~~}}
- I added links in the above example so you can go to the template page itself and look at the code that generates the final version. Changes to the template are global. If you change the header, it changes the header for every existing and future list using the "PockBotHeader" template. Just run a few examples and you'll get the hang of it really quickly. I also verified that the sorting function you are currently using will work fine with transclusion. UPDATE: I changed the signature in the example above. -- RM 13:11, 8 December 2006 (UTC)[reply]
- That's brilliant, thanks! I'll add these to the bot now instead of the HTML. Apologies in advance for any horrible output from the next few edits if I get it wrong the first time! - PocklingtonDan 13:19, 8 December 2006 (UTC)[reply]
- Rightyho, transclusion now in operation (see output for Category talk:Roman_military_units for example) and seems to be working fine. Added a progress bar while running too so that the browser shouldn't time out even on the largest categories. As far as I can see the bot is now good and ready for general work. I will at some point in the next couple of months do a tear-down and rebuild to make User:PockBot_II or similar that does all the work with category lists etc but I'm happy that PockBot now does everything originally intended, plus all features suggested by you and others that are compatible with this version 1. I'll run a few more cats just to make sure but PockBot hasn't done anything untoward yet. If you and a few others could run it a time or two to test it that might help too since most runs have still been by me. Cheers - PocklingtonDan 14:05, 8 December 2006 (UTC)[reply]
- (deindent)You know, you don't have to create a new bot account for new functions. You can use the same one. I have one picky note: Can you put a newline/linefeed after each transcluded data entry so that each template/data point is on its own line? It is easier to read when editing the page, since it is likely that someone using the page as a TODO list will want to update the list when they reclassify the articles. I've partially updated Category talk:Roman_military_units. -- RM 14:11, 8 December 2006 (UTC)[reply]
- With regards to creating new bots for new functions, I have had a closer look at the bot request page - I see that to add new functions I could use the same bot account but get approval for new functions? I.e. have two bots running under the same bot account performing two different functions? I had a different model in my mind where each bot account performed one function, I see now that this is not the Wikipedia bot model. No problem, I will make sure to use the same bot account when I make version 2 in a month or two with the features you requested.
- Newline feed not a problem, it would make editing a lot easier wouldn't it. I'll get this added now. - PocklingtonDan 14:48, 8 December 2006 (UTC)[reply]
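Putting the transclusion format and the one-entry-per-line request together, the output step might be sketched as below. The template names (PockBotHeader, PockBotData, PockBotFooter) and parameter order come from the examples in this discussion; the Python function itself is a hypothetical illustration, not the bot's Perl code, and it does not escape titles containing characters such as `|` that have special meaning inside template calls.

```python
def render_status_table(rows, signature="~~~~"):
    """Build the wikitext for a status table, one template call per line.

    `rows` is an iterable of (article, status, colour) tuples.
    """
    lines = ["{{PockBotHeader}}"]
    for article, status, colour in rows:
        lines.append("{{PockBotData|%s|%s|%s}}" % (article, status, colour))
    lines.append("{{PockBotFooter|%s}}" % signature)
    # Joining with newlines keeps each entry on its own line for easy hand-editing.
    return "\n".join(lines) + "\n"
```

Because each entry sits on its own line, an editor who reclassifies an article only has to change one short line of the talk page.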
Approved. I've looked over all of the edits this bot has done so far and things look great. I'm going to remove the read throttle restriction, however just to be clear, the bot is not authorized to run multiple bot instances or multiple reads in parallel. The bot should not perform a write faster than 1 per minute, due to the nature of what this bot does. This shouldn't ever become a problem anyway. If speed restrictions ever become an issue, just come back and seek a quick amendment to the approval. I'm going to grant a bot flag as well, so that this bot can scale without issue. There is the potential for this bot to be used frequently and it seems to be harmless enough. -- RM 20:39, 8 December 2006 (UTC)[reply]
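The approved write limit (no more than one write per minute, with no parallel instances) could be enforced with a small throttle like the sketch below. It is a Python illustration, not the bot's Perl code; the `WriteThrottle` class is hypothetical, and the injectable `clock` and `sleep` arguments are there only so the logic can be exercised without real waiting.

```python
import time


class WriteThrottle:
    """Enforce a minimum interval between writes (hypothetical helper)."""

    def __init__(self, min_interval=60.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval  # seconds between writes
        self._clock = clock
        self._sleep = sleep
        self._last_write = None  # time of the previous write, if any

    def wait(self):
        """Block until a write is allowed, record it, and return the seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._last_write is not None:
            delay = max(0.0, self.min_interval - (now - self._last_write))
        if delay > 0:
            self._sleep(delay)
        self._last_write = now + delay
        return delay
```

Calling `wait()` immediately before each edit request, from a single process, keeps the bot within the stated limit however quickly the reads complete.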
- Many thanks for the approval. I will probably keep the read throttle in place at some lower rate in any case just to prevent it really hammering the server with read requests. I might tie this into the bot's knowledge of wikipedia's busier times - if wikipedia is busier, make read rate lower. Is there anything I need to do with regard to the bot flag? And are there procedures for advertising this bot's existence to those who might benefit from using it? Many thanks - PocklingtonDan 21:07, 8 December 2006 (UTC)[reply]
- I'd post a little message to the talk pages of various Wikiprojects notifying people of its existence. You can try a few of the biggest ones and see what happens. You could also consider posting a notice to the appropriate village pump. -- RM 17:54, 18 December 2006 (UTC)[reply]
- The above discussion is preserved as an archive of the debate. Please do not modify it. Subsequent comments should be made in a new section.