
Automate WMF wiki creation
Open, Needs Triage, Public

Description

Wiki creation is quite an involved process, documented on wikitech. I think, at least for certain common cases, the task could be almost completely automated.

For uncomplicated creation of new language editions under existing projects, with default configuration, the following tasks need to be done, none of which require complex human decision-making:

  • Reconfigure many services by pushing configuration changes to Gerrit and deploying those commits
    • mediawiki-config: wikiversions, *.dblist
    • WikimediaMessages
    • DNS
    • RESTBase
    • Parsoid
    • Analytics refinery
    • cxserver
    • Labs dnsrecursor
  • Run addWiki.php. This script aims to automate all tasks which can be executed with the privileges of a MW maintenance script.
  • Run Wikidata's populateSitesTable.php. It should probably be incorporated into addWiki.php.
  • Run labsdb maintain-views
  • Update wikistats labs

So at a minimum, you need to write and deploy commits to 8 different projects, run three scripts, and manually insert some rows into a DB in a labs instance.
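For illustration, here is roughly what the mediawiki-config part of that list amounts to, as a minimal sketch. It assumes the usual dblists/*.dblist and wikiversions.json layout; the wiki name and version string are placeholders, and in reality these edits go through Gerrit review and scap rather than a script like this:

```
import json
from pathlib import Path

def add_wiki_to_config(repo, db_name, dblists, version):
    """Append a new wiki ID to some plain dblists and record its MW version.

    repo is a checkout of operations/mediawiki-config; dblists names the
    simple (non-expression) lists the wiki should join.
    """
    repo = Path(repo)
    for name in dblists:
        path = repo / "dblists" / f"{name}.dblist"
        lines = path.read_text().splitlines()
        if db_name not in lines:
            path.write_text("\n".join(lines + [db_name]) + "\n")

    versions_path = repo / "wikiversions.json"
    versions = json.loads(versions_path.read_text())
    versions[db_name] = version  # e.g. "php-1.35.0-wmf.30" (illustrative)
    versions_path.write_text(json.dumps(versions, indent=4, sort_keys=True) + "\n")

# add_wiki_to_config("/srv/mediawiki-config", "smnwiki",
#                    ["all", "wikipedia", "s3"], "php-1.35.0-wmf.30")
```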

Despite there being no human decision-making in this process, the documentation requires that you involve people from approximately four different teams (services, ops, wikidata, analytics).

In my opinion, something is going wrong here in terms of development policy. The problem is getting progressively worse. In July 2004, I fully automated wiki creation and provided a web interface allowing people to create wikis. Now, it is unthinkable.

Obviously services are the main culprits. Is it possible for in-house services to follow pybal's example, by polling a central HTTP configuration service for their wiki lists? As with pybal, the service could just be a collection of static files on a webserver, or etcd. Even MediaWiki could profitably use such a central service for its dblists, with APC caching.
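As a rough sketch of that pybal-style approach (the endpoint URL and polling interval are invented for the example; a real service would presumably hook this into its existing config-reload path):

```
import time
import urllib.request

CONFIG_URL = "https://config.example.wmnet/dblists/all.dblist"  # hypothetical endpoint

def poll_wiki_list(on_change, interval=60):
    """Poll a central HTTP config service and call on_change when the wiki list changes."""
    current = None
    while True:
        with urllib.request.urlopen(CONFIG_URL) as resp:
            wikis = set(resp.read().decode("utf-8").split())
        if wikis != current:
            current = wikis
            on_change(wikis)  # e.g. rebuild routing tables or reload workers
        time.sleep(interval)
```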

So let's suppose we could get the procedure down to:

  1. Commit/review/deploy the DNS update
  2. Commit/review/deploy a configuration change to the new central config service.
  3. Run addWiki.php

Labs instances needing to know about the change would either poll the config service, or be notified by addWiki.php. WikimediaMessages could be updated in advance via translatewiki.net.
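If it really did come down to those three steps, the whole procedure could plausibly be one small wrapper. A sketch under that assumption: the deploy commands are placeholders, mwscript is the usual maintenance-script wrapper, and addWiki.php's real arguments are whatever the wikitech documentation says:

```
import subprocess

def create_wiki(db_name, domain):
    """Hypothetical wrapper around the three-step procedure sketched above."""
    # Steps 1 and 2: assume the DNS and central-config commits are already
    # reviewed and merged, and only need deploying (placeholder commands).
    subprocess.run(["deploy-dns-change"], check=True)
    subprocess.run(["deploy-central-config", db_name, domain], check=True)

    # Step 3: addWiki.php does the MediaWiki-side work and, in this proposal,
    # notifies any consumers that do not poll the config service themselves.
    # Further arguments (language, site, etc.) are elided here.
    subprocess.run(["mwscript", "extensions/WikimediaMaintenance/addWiki.php",
                    "--wiki=aawiki", db_name, domain], check=True)
```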

(Thanks to Milos Rancic for raising this issue with me.)

Event Timeline

Change 339144 had a related patch set uploaded (by Reedy):
Run populateSitesTable.php on other wikidata client wikis

https://gerrit.wikimedia.org/r/339144

Much of this should be doable with our regular config management system. With puppet, however, the deployment part is not that easy to control and automate. Kubernetes might improve the situation in that regard.

Probably better to use etcd than a separate web host, since that seems to be the current standard solution. See above related task and also T149617: Integrating MediaWiki (and other services) with dynamic configuration

@tstarling I agree, dblists are one of the things that could be stored in etcd and read from there. On the other hand, it's such a simple and relatively stable list that we could also decide to maintain it as a simple configuration file that we distribute across the cluster in a standard format, and expect every application to read it from disk.

Say we create /etc/wmf/dblists.yaml on every node (just a random name/format), containing all the info that we need for each and every application; all apps then read it and autoconfigure themselves based on those values.
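A minimal sketch of the consumer side of that idea; the path follows the example above, and the "wikis"/"tags" keys are invented for illustration:

```
import yaml  # PyYAML, assumed to be available

def load_wiki_config(path="/etc/wmf/dblists.yaml"):
    """Read the hypothetical per-wiki config file and return it as a dict."""
    with open(path) as f:
        return yaml.safe_load(f)

def wikis_with_tag(config, tag):
    """Return the wiki IDs carrying a given tag, e.g. 'wikipedia' or 'closed'."""
    return [db for db, info in config.get("wikis", {}).items()
            if tag in info.get("tags", [])]

# config = load_wiki_config()
# my_targets = wikis_with_tag(config, "wikipedia")
```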

I think that a "rolling restart of applications to pick up the new config" is an acceptable step here (ops need to be involved anyways).

Aren't dblists already in a standard format (newline delimited plain text) that we distribute across the cluster via scap?

> Aren't dblists already in a standard format (newline delimited plain text) that we distribute across the cluster via scap?

Yes, but scap only sends them to MW-related hosts. If we moved them to something like etcd or /etc/wmf/dblists.yaml as suggested above, every application would have this data. This could be useful for services that don't need to know or speak MediaWiki, or have its code, but want to know the list of all wikis they need to care about.

>> Aren't dblists already in a standard format (newline delimited plain text) that we distribute across the cluster via scap?

> Yes, but scap only sends them to MW-related hosts. If we moved them to something like etcd or /etc/wmf/dblists.yaml as suggested above, every application would have this data. This could be useful for services that don't need to know or speak MediaWiki, or have its code, but want to know the list of all wikis they need to care about.

Also: unless you have scap/multiversion on your system as well, the format for doing dblist math (all - something, etc) isn't available to you and you have to replicate the logic. A standard distribution/format for this avoids that issue.
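To illustrate the "dblist math" point: expression dblists define one list in terms of others, so any consumer outside scap/multiversion either reuses that resolver or reimplements something like the following (the expression syntax here is a simplified stand-in, not the exact mediawiki-config grammar):

```
def resolve_dblist(expression, lists):
    """Resolve a simplified dblist expression such as "all - closed - private".

    lists maps a list name to a set of wiki IDs. Only +/- set operations are
    modelled; the real format has more to it.
    """
    tokens = expression.split()
    result = set(lists[tokens[0]])
    for op, name in zip(tokens[1::2], tokens[2::2]):
        if op == "-":
            result -= lists[name]
        elif op == "+":
            result |= lists[name]
        else:
            raise ValueError(f"unknown operator {op!r}")
    return result

# open_wikis = resolve_dblist("all - closed - private",
#                             {"all": all_set, "closed": closed_set, "private": private_set})
```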

The point is that updating dblists via gerrit and running scap is one of the avoidable steps in the task description. I imagine etcd would have structured data about each wiki, and the canonical map from domain name to wiki ID. To figure out exactly what structured data should be in there, we need to survey all the services in my list above, but for mediawiki-config it is dblist membership (e.g. $wikiTags in CommonSettings.php line 165) and wikiversions.json.
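As a strawman for what that structured data could look like (the field names are invented; the survey of consumers would determine the real schema):

```
# Hypothetical per-wiki record, keyed by wiki ID.
WIKI_RECORDS = {
    "smnwiki": {
        "domain": "smn.wikipedia.org",       # canonical domain -> wiki ID mapping
        "tags": ["all", "wikipedia", "s3"],  # replaces membership in *.dblist
        "version": "php-1.35.0-wmf.30",      # replaces the wikiversions.json entry
    },
}

def wiki_id_for_domain(domain, records=WIKI_RECORDS):
    """Reverse lookup from canonical domain to wiki ID."""
    for db, info in records.items():
        if info["domain"] == domain:
            return db
    return None
```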

> I think that a "rolling restart of applications to pick up the new config" is an acceptable step here (ops need to be involved anyways).

I don't think there should be any intelligence involved in the technical process of creating a wiki. I'm not sure what you mean by "a rolling restart of all services" -- if you mean stopping each service and starting it again, then I suspect that would require a human to consider the consequences.

Looking at Parsoid for a case study, I see that it re-reads sitematrix.json on worker startup, and service-runner responds to SIGHUP by doing a rolling restart of local workers. So all we need is a way to replace sitematrix.json and send SIGHUP. Other service-runner users could be reconfigured similarly. If we can have a button labelled "send SIGHUP all services", and a brainless server monkey is allowed to press it at any time, then I guess that would be a solution. But ideally, the brainless server monkey would be replaced by a line in a script.
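A sketch of what replacing that button with "a line in a script" could look like on a single host; the config path is illustrative, and the PID lookup assumes a systemd unit named after the service:

```
import os
import signal
import subprocess
import tempfile

def replace_config_and_sighup(new_content, config_path, service):
    """Atomically replace a config file, then SIGHUP the service so that
    service-runner does a rolling restart of its workers."""
    # Write to a temp file in the same directory and rename over the original,
    # so readers never see a half-written file.
    fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(config_path))
    with os.fdopen(fd, "w") as f:
        f.write(new_content)
    os.chmod(tmp_path, 0o644)  # mkstemp creates the file as 0600
    os.rename(tmp_path, config_path)

    # Find the service's main PID via systemd and send SIGHUP.
    pid = int(subprocess.check_output(
        ["systemctl", "show", "--property=MainPID", "--value", service]
    ).decode().strip())
    os.kill(pid, signal.SIGHUP)

# replace_config_and_sighup(new_sitematrix_json, "/etc/parsoid/sitematrix.json", "parsoid")
```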

> The point is that updating dblists via gerrit and running scap is one of the avoidable steps in the task description. I imagine etcd would have structured data about each wiki, and the canonical map from domain name to wiki ID. To figure out exactly what structured data should be in there, we need to survey all the services in my list above, but for mediawiki-config it is dblist membership (e.g. $wikiTags in CommonSettings.php line 165) and wikiversions.json.

So my idea was to transform the wiki list into this structured data, and store it on disk, via puppet, on every machine that might need it. As I said before, it should be easy enough to distribute a changed list that way.

My point about not storing this info in etcd is that we try to use etcd to manage dynamic state, not static configurations that will change a few times a year at most.

But either that or a file distributed via puppet to every relevant machine is ok anyways.

>> I think that a "rolling restart of applications to pick up the new config" is an acceptable step here (ops need to be involved anyways).

> I don't think there should be any intelligence involved in the technical process of creating a wiki. I'm not sure what you mean by "a rolling restart of all services" -- if you mean stopping each service and starting it again, then I suspect that would require a human to consider the consequences.

Well human supervision is useful, but I'd expect the process to be as simple as doing a scap deploy. Ops are building a distributed execution framework (https://github.com/wikimedia/cumin) that seems like a perfect candidate for this role.

> Looking at Parsoid for a case study, I see that it re-reads sitematrix.json on worker startup, and service-runner responds to SIGHUP by doing a rolling restart of local workers. So all we need is a way to replace sitematrix.json and send SIGHUP. Other service-runner users could be reconfigured similarly. If we can have a button labelled "send SIGHUP all services", and a brainless server monkey is allowed to press it at any time, then I guess that would be a solution. But ideally, the brainless server monkey would be replaced by a line in a script.

The idea would be "do a controlled rolling restart (or send a SIGHUP, depending on the software) of these services", and yes it should be a line in a script.
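For what it's worth, a sketch of that single line, assuming a cumin-style remote executor; the host alias, batching and command are placeholders rather than a tested invocation:

```
import subprocess

def rolling_reload(target="A:parsoid", command="systemctl reload parsoid"):
    """Ask cumin (or any comparable remote executor) to run a reload/SIGHUP
    command across a set of hosts, a couple at a time."""
    subprocess.run(["cumin", "--batch-size", "2", target, command], check=True)
```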

I have created around ten wikis so far; the process of creating wikis is extremely fragile, complicated and stressful. That has to be fixed before automating anything. Let me give you some examples:

  • The first time I created a wiki, someone had removed --wiki=aawiki from the documentation on wikitech, so when I ran addWiki.php it didn't work, because it needs an existing "dummy" wiki to run against. Since it was my first time, I went with my home wiki (fawiki), thinking it didn't matter (it shouldn't, right?). In the middle of creation it exploded, because it had created the database on s7 (fawiki's section) but tried to read it from s3 (where the dblists and config put the new wiki). Someone told me it has to be aawiki, so I ran it again and it worked fine-ish, until replication to labs broke because we now had two hywwikis, one on s3 and one on s7.
  • The second time, I thought "okay, I'll go with a wiki that's on s3: mediawikiwiki" (I needed mediawikiwiki because it was in group0 and I didn't want to backport a change to addWiki.php). The creation exploded because mediawikiwiki has something special going on with OAuth or CentralAuth; it took me hours to find out what was wrong.
  • For the next four or five wikis I created, I ran into T212881: addWiki.php broken creating ES tables and had to manually change the pointer of the text table to some random thing so it wouldn't fatal on the main page. You can't do anything there: you can't edit the main page, you can't delete it (I made myself admin and then had to ask stewards to demote me). There's nothing you can do.
  • The only time I created a wiki after the fix was deployed, it broke at the very end, on updating the interwiki cache, because the new wiki was an RTL wiki and we hadn't added it to rtl.dblist (1: it wasn't documented that you need to do this or things explode; 2: I didn't even know that the wiki I was creating was an RTL wiki). Thankfully fixing this wasn't hard.
  • Every time wiki creation breaks, it halts in the middle of the long list of actions; you need to fix the broken bit and then copy-paste everything remaining into eval.php or shell.php (a sketch of a more resumable approach follows these notes). It's extremely dangerous to run arbitrary code in production unless you know exactly what you're doing, which brings me to the next point:
  • In order to handle these issues in real time, in production, you need very good knowledge of the infrastructure, and everyone has their strengths and weaknesses. If something breaks with Elasticsearch, which addWiki.php also handles, I have no idea how to fix it and proceed.

Hope these notes help.
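For the copy-paste-into-eval.php failure mode mentioned above, one mitigation would be to structure the creation work as named, resumable steps. A hedged sketch of the idea (this is not how addWiki.php works today):

```
import json
from pathlib import Path

def run_steps(steps, state_file):
    """Run a list of (name, callable) steps, recording progress so that a
    failed run can be resumed from the broken step instead of re-pasting the
    remaining code into eval.php or shell.php."""
    state_path = Path(state_file)
    done = set(json.loads(state_path.read_text())) if state_path.exists() else set()
    for name, step in steps:
        if name in done:
            continue
        step()  # raises on failure, leaving the completed steps recorded
        done.add(name)
        state_path.write_text(json.dumps(sorted(done)))

# run_steps([("create-db", create_db), ("es-tables", create_es_tables),
#            ("main-page", create_main_page)], "/tmp/addwiki-smnwiki.state")
```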

@Amire80 I suppose it would be useful to summarise the outcome of the T238255 session on this task :)

An update from the subtask which might be of interest here:

So with some recent changes the bot creates sub-tickets for data storage, and parent tickets for RESTBase, pywikibot and wikidata (it doesn't automatically close them though). The bot also creates patches automatically for analytics refinery, DNS, WikimediaMessages and CX server. You can see the artwork of the bot in T264859: Create Inari Sámi Wikipedia. The only thing that's not done yet is automating the initial config patch, which I want to avoid as it's extremely complicated; we can pick it up once wiki configs are moved to YAML files (T223602: Define variant Wikimedia production config in compiled, static files). I created a follow-up for that one.

Change 339144 abandoned by Reedy:

[mediawiki/extensions/WikimediaMaintenance@master] Run populateSitesTable.php on other wikidata client wikis

Reason:

https://gerrit.wikimedia.org/r/339144