Hi everyone,

This started out as an issue I brought to the IRC channel yesterday, but things have escalated since then. Apologies for the longish email, but I'm going to recap for context.

All nodes are CentOS 7, installing from http://download.ceph.com/rpm-nautilus/el7/$basearch for Nautilus packages.

yum autoupdate had been left enabled on our cluster. This normally isn't a problem, since the ceph services don't pick up a new version until you restart them. However, for some reason, when 14.2.17 was installed on our MGR host [which also hosts one of our MONs], the MGR process was restarted automatically - and immediately wigged out due to the regressions in 14.2.17. We noticed this when we saw our monitoring vanish... by which time 14.2.18 had also dropped, and the MGR had autorestarted again.

[So, first aside: can someone please fix the RPM packaging for MGRs so this doesn't happen? It doesn't do this to any other service, and it was the start of all our problems.]

At this point, the cluster itself was operating fine - many services were accessing it successfully, both reading and writing data. Wanting to get our monitoring back up and running, we went on the IRC channel and asked for advice. Since just bouncing the MGR service didn't help, the next suggestion was that - since only the MGR was on 14.2.18, with everything else on 14.2.16 - we should just bite the bullet and start transitioning the cluster across to 14.2.18 entirely.

Updating the MONs went well - they picked up 14.2.18 happily. At this point, we believe the cluster was still relatively happy.

Next, we tried updating a host's worth of OSDs. (Yes, in retrospect, we probably should have tried one OSD to start with, but we've never had OSDs go into a fail state after a version update before.) All of the OSDs on that host immediately started flapping. We have a process on the OSD nodes to monitor them and keep them up if they die, as a workaround for the extremely unstable behaviour of 14.2.10 OSDs for us [a random selection of which would just fall over a few times a week - different ones each time], which probably didn't help. We see the same problem in the logs as another user in IRC reported:

osd.0 247222 _committed_osd_maps marked down 6 > osd_max_markdown_count 5 in last 600.000000 seconds, shutting down
osd.0 247222 start_waiting_for_healthy
-1 received signal: Interrupt from Kernel ( Could be generated by pthread_kill(), raise(), abort(), alarm() ) UID: 0
-1 osd.0 247222 *** Got signal Interrupt ***
-1 osd.0 247222 *** Immediate shutdown (osd_fast_shutdown=true) ***

For them, it was associated with network links needing to be restarted; we investigated that, but it didn't help in our case. (In the background, we assume the MONs all started rebalancing data, as a host's worth of OSDs had effectively vanished.)

Eventually - after an hour or two - we got the OSDs to stay up by setting

ceph osd set noout
ceph osd set nodown

and increasing osd_max_markdown_count in ceph.conf to 20 [a big enough number].
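For reference, that change amounted to the following in the [osd] section of ceph.conf on the affected host. The value of 20 was just picked as "big enough"; I believe the same setting could also have been applied at runtime via the config database (ceph config set), though we went via the file and a restart:

[osd]
osd_max_markdown_count = 20

# or, I believe, equivalently at runtime:
ceph config set osd osd_max_markdown_count 20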
---

However, this then caused things to get far worse... leading to the problem we're now in.

With the host's OSDs back up, the MONs appropriately decided to rebalance the PGs by moving stuff back to that host. 800 of our 2048 PGs were marked for rebalancing... and initially this seemed to be going okay. However, all of the MONs rapidly collected slow ops, and both they and the MGR started spamming a lot of slow-operation logs to both the ceph.log and the store.db on their hosts. (The file space consumed by those was growing at >1 GB/s by the point at which we took more drastic action, despite setting more aggressive logrotate settings to deal with the ceph.log side of things.)

After ~6 hours, the rebalancing seemed to get stuck at about 544 PGs left to rebalance. As the MGR still wasn't showing full info, it was hard to tell precisely what was going on. Given that the MON hosts were in danger of actually running out of disk space at this point, with gigantic store.db directories (especially on the quorum head), we took the decision to halt the cluster:

ceph osd set nobackfill
ceph osd set norecover
ceph osd set norebalance
ceph osd set pause

(in addition to the noout and nodown already set). We then stopped the MON and MGR processes on their hosts, as they continued to grow the store.db even after this.

Adding to the difficulty - and perhaps the reason why things escalated - during the period of high load on the MONs, more of the OSDs on other hosts also seem to have died [I think they ran out of memory and restarted], and of course came back up on 14.2.18. At present perhaps 40% of the OSDs in the cluster have restarted and migrated themselves from 14.2.16 to 14.2.18... but the MON logs suggest that some of them are having problems talking to the others. (It's not clear that this is simply 14.2.18 OSDs having problems talking to 14.2.16 OSDs, since rebalancing had initially been progressing when we first got a set of 14.2.18 OSDs back into the cluster.)

So, this is the state we're currently in: cluster paused, MONs off, MGR off, OSDs in a mix of 14.2.18 and 14.2.16, and giant store.dbs on the MON hosts. We're currently doing work to add a bigger additional SSD to the MON hosts so we can give them more space for the store.db, but we also need a way forward out of the current cluster state as well.

I'm *tempted* to just stop all of the OSDs at this point, and then slowly bring the cluster up again with everything having picked up 14.2.18 (roughly along the lines of the sketch at the bottom of this mail), but I'm not sure how safe this is.

Any advice would be appreciated - especially as things seemed to snowball rapidly from yesterday just from taking what seemed to be reasonable actions - as we're understandably reluctant to prod things further in case of more permanent issues.
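PS: for concreteness, the plan I'm contemplating looks roughly like the below. This is very much a sketch of intent rather than something we've tested, and the ordering (MONs, then MGR, then OSDs host by host) is just our usual practice:

# on each OSD host: stop the remaining OSDs cleanly
systemctl stop ceph-osd.target

# ensure packages everywhere are on 14.2.18, then bring daemons back:
# MONs first, then the MGR, then OSDs host by host
systemctl start ceph-mon.target
systemctl start ceph-mgr.target
systemctl start ceph-osd.target

# confirm everything reports the same version before touching the flags
ceph versions

# then unwind the flags we set, roughly in reverse order
ceph osd unset pause
ceph osd unset norebalance
ceph osd unset norecover
ceph osd unset nobackfill
ceph osd unset nodown
ceph osd unset noout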