Popis: Prilis aktivity na jedno jadro
Ukazuje se, ze cil 600ti VPS na jeden stroj byl prilis optimisticky - HW pod nim je dostatecne schopny takovou zatez utahnout, ale linuxove jadro neni pripravene na to, co od nej chceme; system potom trpi na hadani se o zamky (lock contention) mezi jednotlivymi procesy, kdyz se pusti vic jak 350 VPS nad jednim jadrem.
Budeme tedy muset v nejblizsich tydnech node23/node24 rozdelit na mensi stroje.
Omlouvame se za to mnozstvi problemu, co tahle situace zpusobila, meli jsme minimalne 3 pokusy s rebooty, neco s tim udelat, ale nic nepomahalo (nejslibneji vypadala redukce zavodu o cgroup_mutex, ale nestacilo to zdaleka). Rozkladame zatez zpatky na zbyle servery, situace uz by se ted mela uklidnovat.
Nahlásil: Pavel Šnajdr
ENGLISH:
Summary: Too much activity for a single kernel
As it turns out, the target of 600 VPS per node was too optimistic. The bare hardware is more than capable of handling the load, but the Linux kernel is not ready for what we need from it: the system suffers heavy lock contention between processes when more than about 350 VPS run on a single kernel.
We will therefore have to split node23/node24 into smaller servers over the course of the next few weeks.
We apologize for all the grief this has caused. There were at least three reboots with various mitigations being tried, none of which was successful (the most promising one was reducing cgroup_mutex contention, but it was still far from enough). We are now spreading the load back onto the remaining servers, so the situation should already be calming down.
Reported by: Pavel Šnajdr
-----BEGIN BASE64 ENCODED PARSEABLE JSON----- eyJpZCI6MjUxOCwiY2hhbmdlcyI6e30sInRyYW5zbGF0aW9ucyI6eyJlbiI6 eyJzdW1tYXJ5IjoiVG9vIG11Y2ggYWN0aXZpdHkgZm9yIHNpbmdsZSBrZXJu ZWwiLCJkZXNjcmlwdGlvbiI6IkFzIGl0IHR1cm5zIG91dCwgdGFyZ2V0IG9m IDYwMCBWUFMgcGVyIG5vZGUgd2FzIHRvbyBvcHRpbWlzdGljOyBiYXJlIGhh cmR3YXJlIGlzIG1vcmUsIHRoYW4gY2FwYWJsZSBvZiBoYW5kbGluZyB0aGUg bG9hZCwgYnV0IHRoZSBMaW51eCBrZXJuZWwgaXMgbm90IHJlYWR5IGZvciB3 aGF0IHdlIG5lZWQgdG8gZ2V0IG91dCBvZiBpdDtcclxudGhlIHN5c3RlbSBz dWZmZXJzIGhlYXZ5IGxvY2sgY29udGVudGlvbiB3aGVuIHJ1bm5pbmcgbW9y ZSB0aGFuIGNjYSAzNTAgVlBTIHBlciBzaW5nbGUga2VybmVsLlxyXG5cclxu VGh1cyB3ZSB3aWxsIGhhdmUgdG8gc3BsaXQgbm9kZTIzL25vZGUyNCBpbnRv IGZvciBzbWFsbGVyIHNlcnZlcnMgb3ZlciB0aGUgY291cnNlIG9mIG5leHQg ZmV3IHdlZWtzLlxyXG5cclxuV2UgYXBvbG9naXplIGZvciBhbGwgdGhlIGdy aWVmIHRoaXMgaGFzIGNhdXNlZCwgdGhlcmUgd2VyZSBhdCBsZWFzdCB0aHJl ZSByZWJvb3RzIHdpdGggdmFyaW91cyBtaXRpZ2F0aW9ucyBiZWluZyB0cmll ZCwgbm9uZSBvZiB3aGljaCB3ZXJlIHN1Y2Nlc3NmdWwgKG1vc3QgcHJvbWlz aW5nIG9uZSB3YXMgdG8gcmVkdWNlIGNncm91cF9tdXRleCBjb250ZW50aW9u LCBzdGlsbCBub3QgZW5vdWdoKS4gV2UncmUgbm93IHNwcmVhZGluZyB0aGUg bG9hZCBiYWNrIG9udG8gdGhlIHJlbWFpbmluZyBzZXJ2ZXJzLCBzaXR1YXRp b24gc2hvdWxkIGJlIGNhbG1pbmcgZG93biBhbHJlYWR5LiJ9LCJjcyI6eyJz dW1tYXJ5IjoiUHJpbGlzIGFrdGl2aXR5IG5hIGplZG5vIGphZHJvIiwiZGVz Y3JpcHRpb24iOiJVa2F6dWplIHNlLCB6ZSBjaWwgNjAwdGkgVlBTIG5hIGpl ZGVuIHN0cm9qIGJ5bCBwcmlsaXMgb3B0aW1pc3RpY2t5IC0gSFcgcG9kIG5p bSBqZSBkb3N0YXRlY25lIHNjaG9wbnkgdGFrb3ZvdSB6YXRleiB1dGFobm91 dCwgYWxlIGxpbnV4b3ZlIGphZHJvIG5lbmkgcHJpcHJhdmVuZSBuYSB0bywg Y28gb2QgbmVqIGNoY2VtZTtcclxuc3lzdGVtIHBvdG9tIHRycGkgbmEgaGFk YW5pIHNlIG8gemFta3kgKGxvY2sgY29udGVudGlvbikgbWV6aSBqZWRub3Rs aXZ5bWkgcHJvY2VzeSwga2R5eiBzZSBwdXN0aSB2aWMgamFrIDM1MCBWUFMg bmFkIGplZG5pbSBqYWRyZW0uXHJcblxyXG5CdWRlbWUgdGVkeSBtdXNldCB2 IG5lamJsaXpzaWNoIHR5ZG5lY2ggbm9kZTIzL25vZGUyNCByb3pkZWxpdCBu YSBtZW5zaSBzdHJvamUuXHJcblxyXG5PbWxvdXZhbWUgc2UgemEgdG8gbW5v enN0dmkgcHJvYmxlbXUsIGNvIHRhaGxlIHNpdHVhY2UgenB1c29iaWxhLCBt ZWxpIGpzbWUgbWluaW1hbG5lIDMgcG9rdXN5IHMgcmVib290eSwgbmVjbyBz 
IHRpbSB1ZGVsYXQsIGFsZSBuaWMgbmVwb21haGFsbyAobmVqc2xpYm5lamkg dnlwYWRhbGEgcmVkdWtjZSB6YXZvZHUgbyBjZ3JvdXBfbXV0ZXgsIGFsZSBu ZXN0YWNpbG8gdG8gemRhbGVrYSkuIFJvemtsYWRhbWUgemF0ZXogenBhdGt5 IG5hIHpieWxlIHNlcnZlcnksIHNpdHVhY2UgdXogYnkgc2UgdGVkIG1lbGEg dWtsaWRub3ZhdC4ifX19 -----END BASE64 ENCODED PARSEABLE JSON-----
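
The trailer above is meant to be machine-readable, so here is a minimal sketch of how a client might extract it. It assumes only what is visible in this message: the payload sits between the BEGIN/END marker lines, is standard base64 wrapped with whitespace, and decodes to a UTF-8 JSON object (the start of the payload decodes to the keys `id`, `changes`, and `translations`). The function name `extract_payload` and the sample message are hypothetical and exist only for illustration.

```python
import base64
import json
import re

BEGIN = "-----BEGIN BASE64 ENCODED PARSEABLE JSON-----"
END = "-----END BASE64 ENCODED PARSEABLE JSON-----"


def extract_payload(message: str) -> dict:
    """Return the JSON object embedded between the BEGIN/END markers.

    Raises ValueError if either marker is missing.
    """
    start = message.index(BEGIN) + len(BEGIN)
    end = message.index(END, start)
    # The encoded text is wrapped with spaces and newlines; drop them all.
    encoded = re.sub(r"\s+", "", message[start:end])
    return json.loads(base64.b64decode(encoded).decode("utf-8"))


if __name__ == "__main__":
    # Build a tiny stand-in message (not the real report) to show the round trip.
    sample = {"id": 1, "changes": {}, "translations": {"en": {"summary": "Example"}}}
    blob = base64.b64encode(json.dumps(sample).encode("utf-8")).decode("ascii")
    wrapped = " ".join(blob[i:i + 60] for i in range(0, len(blob), 60))
    message = f"Reported by: someone\n{BEGIN} {wrapped} {END}\n"
    print(extract_payload(message)["translations"]["en"]["summary"])
```

Applied to the full text of this report, the same function should return the report's structured form, with the per-language `summary` and `description` fields nested under `translations`.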