Description
In order to purge old IP data from the AbuseLog, I'd like to run purgeOldLogIPData.php. I can do this myself, but does the extension now know which data should be purged? (cf. parent task).
Details
Subject | Repo | Branch | Lines +/-
---|---|---|---
mediawiki: log next run of purge_abusefilter.pp | operations/puppet | production | +1 -1
Status | Subtype | Assigned | Task
---|---|---|---
Resolved | | Jalexander | T160357 Allow those with CheckUser right to access AbuseLog private information on WMF projects
Resolved | | Reedy | T179131 AbuseFilter should actively prune old IP data
Resolved | | MarcoAurelio | T186870 Purge old IP data from AbuseFilter on the Beta Cluster
Event Timeline
I'll test on https://es.wikipedia.beta.wmflabs.org/wiki/Especial:RegistroAbusos, which is closed.
Script can't be run until T186928: Correctly reference the "Abuse Filter" extension in maintenance scripts is resolved.
Mentioned in SAL (#wikimedia-releng) [2018-02-10T18:58:24Z] <Hauskatze> maurelio@deployment-tin:~$ mwscript extensions/AbuseFilter/maintenance/purgeOldLogIPData.php --wiki=eswiki (1695 rows purged - T186870)
I chose beta eswiki since it is closed, so I figured I could test there more safely. The test plan consisted of:
First: query how many rows we have with private data:
```
wikiadmin@deployment-db04[eswiki]> select count(afl_ip) from abuse_filter_log;
+---------------+
| count(afl_ip) |
+---------------+
|          1695 |
+---------------+
1 row in set (0.00 sec)
```
Second: check the oldest and newest abuse filter log entries with:
```
wikiadmin@deployment-db04[eswiki]> select afl_id, afl_timestamp, afl_ip from abuse_filter_log order by afl_timestamp desc;
```
The oldest is from 20140718015832 and the newest is from 20170425131421. That means all data is older than 90 days, so all afl_ip data should go.
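For a closer match to what the maintenance script should target, one could count only the rows that still carry IP data and are older than the 90-day cutoff mentioned above (configured via $wgAbuseFilterLogIPMaxAge, if I recall correctly). A minimal sketch, runnable from the same wikiadmin mysql prompt; the cutoff value below is only an illustrative timestamp roughly 90 days before the run date:
```sql
-- Count rows that still carry IP data and fall outside the (illustrative) 90-day window.
SELECT COUNT(*)
FROM abuse_filter_log
WHERE afl_ip != ''
  AND afl_timestamp < '20171112000000';
```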
Third: run the script:
```
maurelio@deployment-tin:~$ mwscript extensions/AbuseFilter/maintenance/purgeOldLogIPData.php --wiki=eswiki
Purging old IP Address data from abuse_filter_log...
200
400
600
800
1000
1200
1400
1600
1695
1695 rows.
Done.
```
Fourth: check that the data is really gone:
```
wikiadmin@deployment-db04[eswiki]> select afl_id, afl_timestamp, afl_ip from abuse_filter_log order by afl_timestamp desc;
```
shows no data in the afl_ip field.
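A more direct verification, instead of scrolling through every row, would be to count the entries that still have IP data. This sketch assumes the script blanks afl_ip to an empty string rather than setting it to NULL (the query covers both cases):
```sql
-- Should return 0 after a successful purge, whether afl_ip was blanked or nulled.
SELECT COUNT(*)
FROM abuse_filter_log
WHERE afl_ip IS NOT NULL
  AND afl_ip != '';
```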
So I guess the script works as expected.
Is that method fine?
Also, I'm not sure if it is possible, but running this manually on all Beta Cluster wikis is somewhat affordable, while doing so on all WMF production wikis would be a pain. Apparently foreachwikiindblist 'all-labs.dblist' <script here> expects a maintenance script from the MediaWiki core maintenance folder, not from an extension...
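If foreachwikiindblist really did not accept an extension path (the later log entries on this task suggest it does), a plain loop over the dblist would be a workaround. A minimal sketch, assuming the dblist lives at /srv/mediawiki/dblists/all-labs.dblist and that mwscript is on the PATH; the dblist path is an assumption, not a verified location:
```bash
#!/bin/bash
# Run the AbuseFilter IP purge on every wiki listed in a dblist.
# NOTE: the dblist path below is an assumption; adjust to the host's layout.
DBLIST=/srv/mediawiki/dblists/all-labs.dblist

while read -r wiki; do
    # Skip blank lines and comments in the dblist.
    [[ -z "$wiki" || "$wiki" == \#* ]] && continue
    echo "== $wiki =="
    mwscript extensions/AbuseFilter/maintenance/purgeOldLogIPData.php --wiki="$wiki"
done < "$DBLIST"
```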
@MarcoAurelio we should make sure it can be run automatically. What was the execution time on beta eswiki?
That should be fine, I guess? In the production version of enwiki, there are about 8000 abuse logs every day (see graph below). That would translate to a minute or two, I suppose.
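If an actual number is wanted, wrapping a single-wiki run in time would give a concrete baseline before extrapolating to the larger projects. A minimal sketch; the wiki name is just an example:
```bash
# Measure wall-clock time for one wiki before extrapolating to larger projects.
time mwscript extensions/AbuseFilter/maintenance/purgeOldLogIPData.php --wiki=eswiki
```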
I'm not a Wikimedia sysadmin, so I might be wrong. My idea for production wikis is that the first run should be scheduled with @greg and be run in batches of wikis (start with small.dblist and so on, or another method). In the first run we'll have tens of thousands of entries to clear Wikimedia-wide (note that AbuseFilter has been logging there since 2012?), and that will take some time, or maybe disrupt the DBs. Further runs should be scheduled with a cron in puppet IMHO, and they won't be that heavy even if we run it daily. To be safe, before running the script in production I'd ask for a DBA review of the script and co-schedule a deployment window so we don't get our fingers caught.
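A rough sketch of what such a batched first run could look like, assuming foreachwikiindblist accepts the extension path (as the later Beta Cluster run suggests) and that group dblists like small.dblist, medium.dblist and large.dblist exist; the exact dblist names beyond small.dblist are assumptions:
```bash
#!/bin/bash
# First production run in batches, smallest wikis first, logging each batch.
# NOTE: dblist names are assumptions; check the dblists shipped with mediawiki-config.
for dblist in small.dblist medium.dblist large.dblist; do
    echo "=== Purging old afl_ip data for wikis in $dblist ==="
    foreachwikiindblist "$dblist" extensions/AbuseFilter/maintenance/purgeOldLogIPData.php \
        | tee -a "$HOME/purge_abusefilter_$(date +%Y%m%d).log"
done
```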
Tens of thousands? The number of abuse log entries just on enwiki is around 18 million. I think we are dealing with tens of millions, if not hundreds of millions, of rows with IP data across the board.
T186973 is of utmost importance.
Mentioned in SAL (#wikimedia-releng) [2018-02-12T18:43:58Z] <Hauskatze> Starting to purge old afl_ip data from abuse_filter_log on Beta Cluster - T186870
Mentioned in SAL (#wikimedia-releng) [2018-02-12T18:45:24Z] <Hauskatze> maurelio@deployment-tin:~$ mwscript extensions/AbuseFilter/maintenance/purgeOldLogIPData.php --wiki=aawiki (0 rows purged - T186870)
Mentioned in SAL (#wikimedia-releng) [2018-02-12T18:46:15Z] <Hauskatze> maurelio@deployment-tin:~$ mwscript extensions/AbuseFilter/maintenance/purgeOldLogIPData.php --wiki=arwiki (37 rows purged - T186870)
Mentioned in SAL (#wikimedia-releng) [2018-02-12T18:55:06Z] <Hauskatze> Running maurelio@deployment-tin:~$ foreachwikiindblist all-labs.dblist extensions/AbuseFilter/maintenance/purgeOldIPLogData.php for T186870
Done:
However, we should either disable afl_ip logging on the Beta Cluster, or restrict who can access that info, or both. In any case, a puppet cron should be set up there to do this regularly.
That said, on WMF production it has two long years of cleaning to do after we fixed the $this->requireExtension thing. If we were to run this manually, I'd say to run it on terbium since it'll be a long-running script. However, the cron job should now be fixed, so it should run today at 01:15 UTC. I'll ask around and see if, at least for this round, they could keep the logs for double-checking.
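For reference, a system-crontab-style sketch of what the scheduled job could look like, given the 01:15 UTC time mentioned above. The actual definition lives in puppet (purge_abusefilter.pp); the user, the foreachwiki wrapper and the log path here are assumptions, not the real production values:
```
# Hypothetical crontab entry; the real job is managed by puppet (purge_abusefilter.pp).
# Runs daily at 01:15 UTC on every wiki and keeps the output for double-checking.
15 1 * * * www-data foreachwiki extensions/AbuseFilter/maintenance/purgeOldLogIPData.php >> /var/log/mediawiki/purge_abusefilter.log 2>&1
```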
Change 410072 had a related patch set uploaded (by MarcoAurelio; owner: MarcoAurelio):
[operations/puppet@production] mediawiki: log next run of purge_abusefilter.pp
Change 410072 merged by Dzahn:
[operations/puppet@production] mediawiki: log next run of purge_abusefilter.pp
Well, the data is purged. I'll create another task so this script can be run via cron on the Beta Cluster puppet.