Datum a čas: 2022-10-14 02:15 CEST
Očekávaná délka: 150 minut
Oznámení se týká serverů: node24.prg
Typ výpadku: vps_reset
Důvod: pretizeni vedouci k padu
Výpadek řeší: Pavel Šnajdr, Jakub Skokan
kvuli spatnemu nastaveni BIOSu
ENGLISH:
Date and time: 2022-10-14 02:15 CEST
Expected duration: 150 minutes
Affected systems: node24.prg
Outage type: vps_reset
Reason: overload leading to crash
Handled by: Pavel Šnajdr, Jakub Skokan
due to bad BIOS settings
-----BEGIN BASE64 ENCODED PARSEABLE JSON----- eyJpZCI6OTUyLCJwbGFubmVkIjpmYWxzZSwiYmVnaW5zX2F0IjoiMjAyMi0x MC0xNFQwMjoxNTowMCswMjowMCIsImR1cmF0aW9uIjoxNTAsInR5cGUiOiJ2 cHNfcmVzZXQiLCJlbnRpdGllcyI6W3sibmFtZSI6Ik5vZGUiLCJpZCI6MTI1 LCJsYWJlbCI6Im5vZGUyNC5wcmcifV0sImhhbmRsZXJzIjpbIlBhdmVsIMWg bmFqZHIiLCJKYWt1YiBTa29rYW4iXSwidHJhbnNsYXRpb25zIjp7ImVuIjp7 InN1bW1hcnkiOiJvdmVybG9hZCBsZWFkaW5nIHRvIGNyYXNoIiwiZGVzY3Jp cHRpb24iOiJkdWUgdG8gYmFkIEJJT1Mgc2V0dGluZ3MifSwiY3MiOnsic3Vt bWFyeSI6InByZXRpemVuaSB2ZWRvdWNpIGsgcGFkdSIsImRlc2NyaXB0aW9u Ijoia3Z1bGkgc3BhdG5lbXUgbmFzdGF2ZW5pIEJJT1N1In19fQ== -----END BASE64 ENCODED PARSEABLE JSON-----
Popis: Omlouvam se, jeste jeden restart
Nahlásil: Jakub Skokan
ENGLISH:
Summary: Apologies, one more reboot
Reported by: Jakub Skokan
-----BEGIN BASE64 ENCODED PARSEABLE JSON----- eyJpZCI6MjUxNywiY2hhbmdlcyI6e30sInRyYW5zbGF0aW9ucyI6eyJlbiI6 eyJzdW1tYXJ5IjoiQXBvbG9naWVzLCBvbmUgbW9yZSByZWJvb3QiLCJkZXNj cmlwdGlvbiI6bnVsbH0sImNzIjp7InN1bW1hcnkiOiJPbWxvdXZhbSBzZSwg amVzdGUgamVkZW4gcmVzdGFydCIsImRlc2NyaXB0aW9uIjpudWxsfX19 -----END BASE64 ENCODED PARSEABLE JSON-----
Popis: Prilis aktivity na jedno jadro
Ukazuje se, ze cil 600ti VPS na jeden stroj byl prilis optimisticky - HW pod nim je dostatecne schopny takovou zatez utahnout, ale linuxove jadro neni pripravene na to, co od nej chceme; system potom trpi na hadani se o zamky (lock contention) mezi jednotlivymi procesy, kdyz se pusti vic jak 350 VPS nad jednim jadrem.
Budeme tedy muset v nejblizsich tydnech node23/node24 rozdelit na mensi stroje.
Omlouvame se za to mnozstvi problemu, co tahle situace zpusobila, meli jsme minimalne 3 pokusy s rebooty, neco s tim udelat, ale nic nepomahalo (nejslibneji vypadala redukce zavodu o cgroup_mutex, ale nestacilo to zdaleka). Rozkladame zatez zpatky na zbyle servery, situace uz by se ted mela uklidnovat.
Nahlásil: Pavel Šnajdr
ENGLISH:
Summary: Too much activity for a single kernel
As it turns out, the target of 600 VPS per node was too optimistic. The bare hardware is more than capable of handling the load, but the Linux kernel is not ready for what we need from it: the system suffers heavy lock contention between processes when running more than approximately 350 VPS on a single kernel.
We will therefore have to split node23/node24 into smaller servers over the next few weeks.
We apologize for all the trouble this has caused. There were at least three attempts, each involving a reboot, to mitigate the problem, and none of them worked (the most promising was reducing cgroup_mutex contention, but it was still far from enough). We are now spreading the load back onto the remaining servers, so the situation should already be calming down.
Reported by: Pavel Šnajdr
-----BEGIN BASE64 ENCODED PARSEABLE JSON----- eyJpZCI6MjUxOCwiY2hhbmdlcyI6e30sInRyYW5zbGF0aW9ucyI6eyJlbiI6 eyJzdW1tYXJ5IjoiVG9vIG11Y2ggYWN0aXZpdHkgZm9yIHNpbmdsZSBrZXJu ZWwiLCJkZXNjcmlwdGlvbiI6IkFzIGl0IHR1cm5zIG91dCwgdGFyZ2V0IG9m IDYwMCBWUFMgcGVyIG5vZGUgd2FzIHRvbyBvcHRpbWlzdGljOyBiYXJlIGhh cmR3YXJlIGlzIG1vcmUsIHRoYW4gY2FwYWJsZSBvZiBoYW5kbGluZyB0aGUg bG9hZCwgYnV0IHRoZSBMaW51eCBrZXJuZWwgaXMgbm90IHJlYWR5IGZvciB3 aGF0IHdlIG5lZWQgdG8gZ2V0IG91dCBvZiBpdDtcclxudGhlIHN5c3RlbSBz dWZmZXJzIGhlYXZ5IGxvY2sgY29udGVudGlvbiB3aGVuIHJ1bm5pbmcgbW9y ZSB0aGFuIGNjYSAzNTAgVlBTIHBlciBzaW5nbGUga2VybmVsLlxyXG5cclxu VGh1cyB3ZSB3aWxsIGhhdmUgdG8gc3BsaXQgbm9kZTIzL25vZGUyNCBpbnRv IGZvciBzbWFsbGVyIHNlcnZlcnMgb3ZlciB0aGUgY291cnNlIG9mIG5leHQg ZmV3IHdlZWtzLlxyXG5cclxuV2UgYXBvbG9naXplIGZvciBhbGwgdGhlIGdy aWVmIHRoaXMgaGFzIGNhdXNlZCwgdGhlcmUgd2VyZSBhdCBsZWFzdCB0aHJl ZSByZWJvb3RzIHdpdGggdmFyaW91cyBtaXRpZ2F0aW9ucyBiZWluZyB0cmll ZCwgbm9uZSBvZiB3aGljaCB3ZXJlIHN1Y2Nlc3NmdWwgKG1vc3QgcHJvbWlz aW5nIG9uZSB3YXMgdG8gcmVkdWNlIGNncm91cF9tdXRleCBjb250ZW50aW9u LCBzdGlsbCBub3QgZW5vdWdoKS4gV2UncmUgbm93IHNwcmVhZGluZyB0aGUg bG9hZCBiYWNrIG9udG8gdGhlIHJlbWFpbmluZyBzZXJ2ZXJzLCBzaXR1YXRp b24gc2hvdWxkIGJlIGNhbG1pbmcgZG93biBhbHJlYWR5LiJ9LCJjcyI6eyJz dW1tYXJ5IjoiUHJpbGlzIGFrdGl2aXR5IG5hIGplZG5vIGphZHJvIiwiZGVz Y3JpcHRpb24iOiJVa2F6dWplIHNlLCB6ZSBjaWwgNjAwdGkgVlBTIG5hIGpl ZGVuIHN0cm9qIGJ5bCBwcmlsaXMgb3B0aW1pc3RpY2t5IC0gSFcgcG9kIG5p bSBqZSBkb3N0YXRlY25lIHNjaG9wbnkgdGFrb3ZvdSB6YXRleiB1dGFobm91 dCwgYWxlIGxpbnV4b3ZlIGphZHJvIG5lbmkgcHJpcHJhdmVuZSBuYSB0bywg Y28gb2QgbmVqIGNoY2VtZTtcclxuc3lzdGVtIHBvdG9tIHRycGkgbmEgaGFk YW5pIHNlIG8gemFta3kgKGxvY2sgY29udGVudGlvbikgbWV6aSBqZWRub3Rs aXZ5bWkgcHJvY2VzeSwga2R5eiBzZSBwdXN0aSB2aWMgamFrIDM1MCBWUFMg bmFkIGplZG5pbSBqYWRyZW0uXHJcblxyXG5CdWRlbWUgdGVkeSBtdXNldCB2 IG5lamJsaXpzaWNoIHR5ZG5lY2ggbm9kZTIzL25vZGUyNCByb3pkZWxpdCBu YSBtZW5zaSBzdHJvamUuXHJcblxyXG5PbWxvdXZhbWUgc2UgemEgdG8gbW5v enN0dmkgcHJvYmxlbXUsIGNvIHRhaGxlIHNpdHVhY2UgenB1c29iaWxhLCBt ZWxpIGpzbWUgbWluaW1hbG5lIDMgcG9rdXN5IHMgcmVib290eSwgbmVjbyBz 
IHRpbSB1ZGVsYXQsIGFsZSBuaWMgbmVwb21haGFsbyAobmVqc2xpYm5lamkg dnlwYWRhbGEgcmVkdWtjZSB6YXZvZHUgbyBjZ3JvdXBfbXV0ZXgsIGFsZSBu ZXN0YWNpbG8gdG8gemRhbGVrYSkuIFJvemtsYWRhbWUgemF0ZXogenBhdGt5 IG5hIHpieWxlIHNlcnZlcnksIHNpdHVhY2UgdXogYnkgc2UgdGVkIG1lbGEg dWtsaWRub3ZhdC4ifX19 -----END BASE64 ENCODED PARSEABLE JSON-----
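The delimited blocks in this notice carry the same information as the text in machine-readable form. A minimal sketch of how such a block might be extracted and decoded is shown here. It assumes the payload is UTF-8 JSON, base64-encoded and wrapped with spaces or newlines between the BEGIN and END markers, as it appears above. The function name `decode_payloads` and the sample record are hypothetical and only illustrate the approach.

```python
import base64
import json
import re

# Marker strings copied from the notice.
BEGIN = "-----BEGIN BASE64 ENCODED PARSEABLE JSON-----"
END = "-----END BASE64 ENCODED PARSEABLE JSON-----"


def decode_payloads(text):
    """Return a list of decoded JSON objects, one per delimited block in text."""
    pattern = re.escape(BEGIN) + r"(.*?)" + re.escape(END)
    payloads = []
    for match in re.finditer(pattern, text, flags=re.DOTALL):
        # The base64 body may be wrapped with spaces or newlines; drop all whitespace.
        b64 = "".join(match.group(1).split())
        raw = base64.b64decode(b64)
        payloads.append(json.loads(raw.decode("utf-8")))
    return payloads


if __name__ == "__main__":
    # Round-trip demo with a made-up record (not taken from the notice).
    record = {"id": 1, "type": "vps_reset", "duration": 150}
    encoded = base64.b64encode(json.dumps(record).encode("utf-8")).decode("ascii")
    # Wrap the base64 text into 60-character chunks separated by spaces.
    wrapped = " ".join(encoded[i:i + 60] for i in range(0, len(encoded), 60))
    message = "Some text\n" + BEGIN + " " + wrapped + " " + END + "\n"
    for payload in decode_payloads(message):
        print(payload)
```

Removing whitespace before decoding keeps the sketch independent of how a mail client or archive rewraps the block, which matters here because the longest payload is split across two lines.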