Popis: Prilis aktivity na jedno jadro
Ukazuje se, ze cil 600ti VPS na jeden stroj byl prilis optimisticky - HW pod nim je
dostatecne schopny takovou zatez utahnout, ale linuxove jadro neni pripravene na to, co od
nej chceme;
system potom trpi na hadani se o zamky (lock contention) mezi jednotlivymi procesy, kdyz
se pusti vic jak 350 VPS nad jednim jadrem.
Budeme tedy muset v nejblizsich tydnech node23/node24 rozdelit na mensi stroje.
Omlouvame se za to mnozstvi problemu, co tahle situace zpusobila, meli jsme minimalne 3
pokusy s rebooty, neco s tim udelat, ale nic nepomahalo (nejslibneji vypadala redukce
zavodu o cgroup_mutex, ale nestacilo to zdaleka). Rozkladame zatez zpatky na zbyle
servery, situace uz by se ted mela uklidnovat.
Nahlásil: Pavel Šnajdr
ENGLISH:
Summary: Too much activity for a single kernel
As it turns out, the target of 600 VPS per node was too optimistic; the bare hardware is
more than capable of handling the load, but the Linux kernel is not ready for what we need
to get out of it;
the system suffers from heavy lock contention between processes when running more than
approximately 350 VPS on a single kernel.
Thus we will have to split node23/node24 into smaller servers over the course of the next
few weeks.
We apologize for all the grief this has caused. There were at least three reboots with
various mitigations being tried, none of which was successful (the most promising one was
reducing cgroup_mutex contention, but that was still far from enough). We are now spreading
the load back onto the remaining servers, so the situation should already be calming down.
Reported by: Pavel Šnajdr
-----BEGIN BASE64 ENCODED PARSEABLE JSON-----
eyJpZCI6MjUxOCwiY2hhbmdlcyI6e30sInRyYW5zbGF0aW9ucyI6eyJlbiI6
eyJzdW1tYXJ5IjoiVG9vIG11Y2ggYWN0aXZpdHkgZm9yIHNpbmdsZSBrZXJu
ZWwiLCJkZXNjcmlwdGlvbiI6IkFzIGl0IHR1cm5zIG91dCwgdGFyZ2V0IG9m
IDYwMCBWUFMgcGVyIG5vZGUgd2FzIHRvbyBvcHRpbWlzdGljOyBiYXJlIGhh
cmR3YXJlIGlzIG1vcmUsIHRoYW4gY2FwYWJsZSBvZiBoYW5kbGluZyB0aGUg
bG9hZCwgYnV0IHRoZSBMaW51eCBrZXJuZWwgaXMgbm90IHJlYWR5IGZvciB3
aGF0IHdlIG5lZWQgdG8gZ2V0IG91dCBvZiBpdDtcclxudGhlIHN5c3RlbSBz
dWZmZXJzIGhlYXZ5IGxvY2sgY29udGVudGlvbiB3aGVuIHJ1bm5pbmcgbW9y
ZSB0aGFuIGNjYSAzNTAgVlBTIHBlciBzaW5nbGUga2VybmVsLlxyXG5cclxu
VGh1cyB3ZSB3aWxsIGhhdmUgdG8gc3BsaXQgbm9kZTIzL25vZGUyNCBpbnRv
IGZvciBzbWFsbGVyIHNlcnZlcnMgb3ZlciB0aGUgY291cnNlIG9mIG5leHQg
ZmV3IHdlZWtzLlxyXG5cclxuV2UgYXBvbG9naXplIGZvciBhbGwgdGhlIGdy
aWVmIHRoaXMgaGFzIGNhdXNlZCwgdGhlcmUgd2VyZSBhdCBsZWFzdCB0aHJl
ZSByZWJvb3RzIHdpdGggdmFyaW91cyBtaXRpZ2F0aW9ucyBiZWluZyB0cmll
ZCwgbm9uZSBvZiB3aGljaCB3ZXJlIHN1Y2Nlc3NmdWwgKG1vc3QgcHJvbWlz
aW5nIG9uZSB3YXMgdG8gcmVkdWNlIGNncm91cF9tdXRleCBjb250ZW50aW9u
LCBzdGlsbCBub3QgZW5vdWdoKS4gV2UncmUgbm93IHNwcmVhZGluZyB0aGUg
bG9hZCBiYWNrIG9udG8gdGhlIHJlbWFpbmluZyBzZXJ2ZXJzLCBzaXR1YXRp
b24gc2hvdWxkIGJlIGNhbG1pbmcgZG93biBhbHJlYWR5LiJ9LCJjcyI6eyJz
dW1tYXJ5IjoiUHJpbGlzIGFrdGl2aXR5IG5hIGplZG5vIGphZHJvIiwiZGVz
Y3JpcHRpb24iOiJVa2F6dWplIHNlLCB6ZSBjaWwgNjAwdGkgVlBTIG5hIGpl
ZGVuIHN0cm9qIGJ5bCBwcmlsaXMgb3B0aW1pc3RpY2t5IC0gSFcgcG9kIG5p
bSBqZSBkb3N0YXRlY25lIHNjaG9wbnkgdGFrb3ZvdSB6YXRleiB1dGFobm91
dCwgYWxlIGxpbnV4b3ZlIGphZHJvIG5lbmkgcHJpcHJhdmVuZSBuYSB0bywg
Y28gb2QgbmVqIGNoY2VtZTtcclxuc3lzdGVtIHBvdG9tIHRycGkgbmEgaGFk
YW5pIHNlIG8gemFta3kgKGxvY2sgY29udGVudGlvbikgbWV6aSBqZWRub3Rs
aXZ5bWkgcHJvY2VzeSwga2R5eiBzZSBwdXN0aSB2aWMgamFrIDM1MCBWUFMg
bmFkIGplZG5pbSBqYWRyZW0uXHJcblxyXG5CdWRlbWUgdGVkeSBtdXNldCB2
IG5lamJsaXpzaWNoIHR5ZG5lY2ggbm9kZTIzL25vZGUyNCByb3pkZWxpdCBu
YSBtZW5zaSBzdHJvamUuXHJcblxyXG5PbWxvdXZhbWUgc2UgemEgdG8gbW5v
enN0dmkgcHJvYmxlbXUsIGNvIHRhaGxlIHNpdHVhY2UgenB1c29iaWxhLCBt
ZWxpIGpzbWUgbWluaW1hbG5lIDMgcG9rdXN5IHMgcmVib290eSwgbmVjbyBz
IHRpbSB1ZGVsYXQsIGFsZSBuaWMgbmVwb21haGFsbyAobmVqc2xpYm5lamkg
dnlwYWRhbGEgcmVkdWtjZSB6YXZvZHUgbyBjZ3JvdXBfbXV0ZXgsIGFsZSBu
ZXN0YWNpbG8gdG8gemRhbGVrYSkuIFJvemtsYWRhbWUgemF0ZXogenBhdGt5
IG5hIHpieWxlIHNlcnZlcnksIHNpdHVhY2UgdXogYnkgc2UgdGVkIG1lbGEg
dWtsaWRub3ZhdC4ifX19
-----END BASE64 ENCODED PARSEABLE JSON-----