Datum a čas: 2022-10-14 02:15 CEST
Očekávaná délka: 150 minut
Oznámení se týká serverů: node24.prg
Typ výpadku: vps_reset
Důvod: pretizeni vedouci k padu
Výpadek řeší: Pavel Šnajdr, Jakub Skokan
kvuli spatnemu nastaveni BIOSu
ENGLISH:
Date and time: 2022-10-14 02:15 CEST
Expected duration: 150 minutes
Affected systems: node24.prg
Outage type: vps_reset
Reason: overload leading to crash
Handled by: Pavel Šnajdr, Jakub Skokan
due to bad BIOS settings
-----BEGIN BASE64 ENCODED PARSEABLE JSON-----
eyJpZCI6OTUyLCJwbGFubmVkIjpmYWxzZSwiYmVnaW5zX2F0IjoiMjAyMi0x
MC0xNFQwMjoxNTowMCswMjowMCIsImR1cmF0aW9uIjoxNTAsInR5cGUiOiJ2
cHNfcmVzZXQiLCJlbnRpdGllcyI6W3sibmFtZSI6Ik5vZGUiLCJpZCI6MTI1
LCJsYWJlbCI6Im5vZGUyNC5wcmcifV0sImhhbmRsZXJzIjpbIlBhdmVsIMWg
bmFqZHIiLCJKYWt1YiBTa29rYW4iXSwidHJhbnNsYXRpb25zIjp7ImVuIjp7
InN1bW1hcnkiOiJvdmVybG9hZCBsZWFkaW5nIHRvIGNyYXNoIiwiZGVzY3Jp
cHRpb24iOiJkdWUgdG8gYmFkIEJJT1Mgc2V0dGluZ3MifSwiY3MiOnsic3Vt
bWFyeSI6InByZXRpemVuaSB2ZWRvdWNpIGsgcGFkdSIsImRlc2NyaXB0aW9u
Ijoia3Z1bGkgc3BhdG5lbXUgbmFzdGF2ZW5pIEJJT1N1In19fQ==
-----END BASE64 ENCODED PARSEABLE JSON-----
Popis: Omlouvam se, jeste jeden restart
Nahlásil: Jakub Skokan
ENGLISH:
Summary: Apologies, one more reboot
Reported by: Jakub Skokan
-----BEGIN BASE64 ENCODED PARSEABLE JSON-----
eyJpZCI6MjUxNywiY2hhbmdlcyI6e30sInRyYW5zbGF0aW9ucyI6eyJlbiI6
eyJzdW1tYXJ5IjoiQXBvbG9naWVzLCBvbmUgbW9yZSByZWJvb3QiLCJkZXNj
cmlwdGlvbiI6bnVsbH0sImNzIjp7InN1bW1hcnkiOiJPbWxvdXZhbSBzZSwg
amVzdGUgamVkZW4gcmVzdGFydCIsImRlc2NyaXB0aW9uIjpudWxsfX19
-----END BASE64 ENCODED PARSEABLE JSON-----
Popis: Prilis aktivity na jedno jadro
Ukazuje se, ze cil 600ti VPS na jeden stroj byl prilis optimisticky - HW pod nim je
dostatecne schopny takovou zatez utahnout, ale linuxove jadro neni pripravene na to, co od
nej chceme;
system potom trpi na hadani se o zamky (lock contention) mezi jednotlivymi procesy, kdyz
se pusti vic jak 350 VPS nad jednim jadrem.
Budeme tedy muset v nejblizsich tydnech node23/node24 rozdelit na mensi stroje.
Omlouvame se za to mnozstvi problemu, co tahle situace zpusobila, meli jsme minimalne 3
pokusy s rebooty, neco s tim udelat, ale nic nepomahalo (nejslibneji vypadala redukce
zavodu o cgroup_mutex, ale nestacilo to zdaleka). Rozkladame zatez zpatky na zbyle
servery, situace uz by se ted mela uklidnovat.
Nahlásil: Pavel Šnajdr
ENGLISH:
Summary: Too much activity for a single kernel
As it turns out, the target of 600 VPS per node was too optimistic. The bare hardware is
more than capable of handling the load, but the Linux kernel is not ready for what we need
to get out of it:
the system suffers heavy lock contention when running more than about 350 VPS on a single
kernel.
We will therefore have to split node23/node24 into smaller servers over the course of the
next few weeks.
We apologize for all the grief this has caused. There were at least three reboots with
various mitigations being tried, none of which was successful (the most promising one was
reducing cgroup_mutex contention, but it was still far from enough). We are now spreading
the load back onto the remaining servers, so the situation should already be calming down.
Reported by: Pavel Šnajdr
-----BEGIN BASE64 ENCODED PARSEABLE JSON-----
eyJpZCI6MjUxOCwiY2hhbmdlcyI6e30sInRyYW5zbGF0aW9ucyI6eyJlbiI6
eyJzdW1tYXJ5IjoiVG9vIG11Y2ggYWN0aXZpdHkgZm9yIHNpbmdsZSBrZXJu
ZWwiLCJkZXNjcmlwdGlvbiI6IkFzIGl0IHR1cm5zIG91dCwgdGFyZ2V0IG9m
IDYwMCBWUFMgcGVyIG5vZGUgd2FzIHRvbyBvcHRpbWlzdGljOyBiYXJlIGhh
cmR3YXJlIGlzIG1vcmUsIHRoYW4gY2FwYWJsZSBvZiBoYW5kbGluZyB0aGUg
bG9hZCwgYnV0IHRoZSBMaW51eCBrZXJuZWwgaXMgbm90IHJlYWR5IGZvciB3
aGF0IHdlIG5lZWQgdG8gZ2V0IG91dCBvZiBpdDtcclxudGhlIHN5c3RlbSBz
dWZmZXJzIGhlYXZ5IGxvY2sgY29udGVudGlvbiB3aGVuIHJ1bm5pbmcgbW9y
ZSB0aGFuIGNjYSAzNTAgVlBTIHBlciBzaW5nbGUga2VybmVsLlxyXG5cclxu
VGh1cyB3ZSB3aWxsIGhhdmUgdG8gc3BsaXQgbm9kZTIzL25vZGUyNCBpbnRv
IGZvciBzbWFsbGVyIHNlcnZlcnMgb3ZlciB0aGUgY291cnNlIG9mIG5leHQg
ZmV3IHdlZWtzLlxyXG5cclxuV2UgYXBvbG9naXplIGZvciBhbGwgdGhlIGdy
aWVmIHRoaXMgaGFzIGNhdXNlZCwgdGhlcmUgd2VyZSBhdCBsZWFzdCB0aHJl
ZSByZWJvb3RzIHdpdGggdmFyaW91cyBtaXRpZ2F0aW9ucyBiZWluZyB0cmll
ZCwgbm9uZSBvZiB3aGljaCB3ZXJlIHN1Y2Nlc3NmdWwgKG1vc3QgcHJvbWlz
aW5nIG9uZSB3YXMgdG8gcmVkdWNlIGNncm91cF9tdXRleCBjb250ZW50aW9u
LCBzdGlsbCBub3QgZW5vdWdoKS4gV2UncmUgbm93IHNwcmVhZGluZyB0aGUg
bG9hZCBiYWNrIG9udG8gdGhlIHJlbWFpbmluZyBzZXJ2ZXJzLCBzaXR1YXRp
b24gc2hvdWxkIGJlIGNhbG1pbmcgZG93biBhbHJlYWR5LiJ9LCJjcyI6eyJz
dW1tYXJ5IjoiUHJpbGlzIGFrdGl2aXR5IG5hIGplZG5vIGphZHJvIiwiZGVz
Y3JpcHRpb24iOiJVa2F6dWplIHNlLCB6ZSBjaWwgNjAwdGkgVlBTIG5hIGpl
ZGVuIHN0cm9qIGJ5bCBwcmlsaXMgb3B0aW1pc3RpY2t5IC0gSFcgcG9kIG5p
bSBqZSBkb3N0YXRlY25lIHNjaG9wbnkgdGFrb3ZvdSB6YXRleiB1dGFobm91
dCwgYWxlIGxpbnV4b3ZlIGphZHJvIG5lbmkgcHJpcHJhdmVuZSBuYSB0bywg
Y28gb2QgbmVqIGNoY2VtZTtcclxuc3lzdGVtIHBvdG9tIHRycGkgbmEgaGFk
YW5pIHNlIG8gemFta3kgKGxvY2sgY29udGVudGlvbikgbWV6aSBqZWRub3Rs
aXZ5bWkgcHJvY2VzeSwga2R5eiBzZSBwdXN0aSB2aWMgamFrIDM1MCBWUFMg
bmFkIGplZG5pbSBqYWRyZW0uXHJcblxyXG5CdWRlbWUgdGVkeSBtdXNldCB2
IG5lamJsaXpzaWNoIHR5ZG5lY2ggbm9kZTIzL25vZGUyNCByb3pkZWxpdCBu
YSBtZW5zaSBzdHJvamUuXHJcblxyXG5PbWxvdXZhbWUgc2UgemEgdG8gbW5v
enN0dmkgcHJvYmxlbXUsIGNvIHRhaGxlIHNpdHVhY2UgenB1c29iaWxhLCBt
ZWxpIGpzbWUgbWluaW1hbG5lIDMgcG9rdXN5IHMgcmVib290eSwgbmVjbyBz
IHRpbSB1ZGVsYXQsIGFsZSBuaWMgbmVwb21haGFsbyAobmVqc2xpYm5lamkg
dnlwYWRhbGEgcmVkdWtjZSB6YXZvZHUgbyBjZ3JvdXBfbXV0ZXgsIGFsZSBu
ZXN0YWNpbG8gdG8gemRhbGVrYSkuIFJvemtsYWRhbWUgemF0ZXogenBhdGt5
IG5hIHpieWxlIHNlcnZlcnksIHNpdHVhY2UgdXogYnkgc2UgdGVkIG1lbGEg
dWtsaWRub3ZhdC4ifX19
-----END BASE64 ENCODED PARSEABLE JSON-----