Discussion:
[dm-devel] Disconcerting observation with multipathd's dmevent polling code
Martin Wilck
2018-11-21 10:50:38 UTC
Permalink
Hi Ben,

I recently encountered a strange and disturbing phenomenon with the
latest upstream code.

The system in question had two iSCSI connections on separate networks.
On one of them, I tested iSCSI disconnect and re-connects, and broken
storage configurations (different LUNs diffenent sizes and with the
same WWID). The other one was idle.

During one test, I observed that multipathd's internal representation
of the map representing the *idle* connection had vanished. The map was
still present on the device-mapper level, but multipathd seemed to have
"forgotten" it completely. The respective paths were tracked as
orphans.

Here is a log excerpt. multipathd is working on path additions for the
map I was testing with (149455...), when a DM event is received. Right
after ev_add_path() releases the lock, the idle map
(36001405...) is removed (not flushed, as the messages suggest).

Nov 15 22:05:46 sles15-gm-iscsi-03 multipathd[568]: sdg: add path (uevent)
# uev_add_path->pathinfo()
Nov 15 22:05:46 sles15-gm-iscsi-03 multipathd[568]: sdg: udev property SCSI_IDENT_LUN_T10 whitelisted
# ev_add_path()->adopt_paths()
Nov 15 22:05:46 sles15-gm-iscsi-03 multipathd[568]: sde: ownership set to 149455400000000006d706174683100000000000000000000
...
# ev_add_path->setup_map()
Nov 15 22:05:46 sles15-gm-iscsi-03 multipathd[568]: 149455400000000006d706174683100000000000000000000: failback = "manual" (setting: multipath internal)
# ev_add_path->domap()
Nov 15 22:05:46 sles15-gm-iscsi-03 multipathd[568]: 149455400000000006d706174683100000000000000000000: load table [0 2097152 multipath 0 0 3 1 service-time 0 1 1 8:64 ...
# ev_add_path->setup_multipath()->update-multipath_strings
Nov 15 22:05:46 sles15-gm-iscsi-03 multipathd[568]: 149455400000000006d706174683100000000000000000000: disassemble map [0 0 3 1 service-time 0 1 2 8:64 1 1 service-time ...
# (!!!!) devmap event for LIO map
Nov 15 22:05:46 sles15-gm-iscsi-03 multipathd[568]: 36001405d08f748dd4c5481f9ec22b888: devmap event #20
Nov 15 22:05:46 sles15-gm-iscsi-03 multipathd[568]: sdg [8:96]: path added to devmap 149455400000000006d706174683100000000000000000000
# ev_add_path() returns, vecs->lock is released
# uev_add_path() goes on to next path, lock currently not held
Nov 15 22:05:46 sles15-gm-iscsi-03 multipathd[568]: sdh: add path (uevent)
# (!!!!) The LIO map is flushed (36001405d08f748dd4c5481f9ec22b888)
# this looks like remove_map() or remove_map_and_stop_waiter()
Nov 15 22:05:46 sles15-gm-iscsi-03 multipathd[568]: sdd: orphan path, map flushed
Nov 15 22:05:46 sles15-gm-iscsi-03 multipathd[568]: sysfs prioritizer refcount 2
Nov 15 22:05:46 sles15-gm-iscsi-03 multipathd[568]: sdc: orphan path, map flushed
To me this looks like the map was removed by the DM event handler.
More detail can be found on
https://bugzilla.opensuse.org/show_bug.cgi?id=1116272.
I've also tried to reproduce the problem, with no success so far
(actually I haven't observed any dmevents in those reproduction
attempts).

After encountering this problem, I've scrutinized your polling dmevent
handler code over and over last week, but I couldn't spot any possible
problems :-)

The concern that I have is with the design of the polling dmevent API,
in particular the fact that the default action is EVENT_REMOVE. If, for
whatever reason, one map is missing from the return value of the
DM_DEVICE_LIST ioctl, multipath removes the map immediately, and
there's nothing short of a "reconfigure" or "add map" CLI command that
would reinstate the map. IOW, we remove the map not on a kernel event
saying "this map has been removed", but on receiving a list where this
element happens to be missing. We handle this differently for paths,
where we wait for a "remove" uevent before we really delete the path
from our data structures. (Note that the messages file I saved from the
event above shows no sign of such an uevent ever beeing sent - as I
said, the dm map was still present after the above occured).

What do you think about this? Could you maybe inspect those logs I
took, to make sure I didn't get on a totally wrong track here?

Regards,
Martin
--
Dr. Martin Wilck <***@suse.com>, Tel. +49 (0)911 74053 2107
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Hannes Reinecke
2018-11-21 11:20:09 UTC
Permalink
Post by Martin Wilck
Hi Ben,
I recently encountered a strange and disturbing phenomenon with the
latest upstream code.
The system in question had two iSCSI connections on separate networks.
On one of them, I tested iSCSI disconnect and re-connects, and broken
storage configurations (different LUNs diffenent sizes and with the
same WWID). The other one was idle.
During one test, I observed that multipathd's internal representation
of the map representing the *idle* connection had vanished. The map was
still present on the device-mapper level, but multipathd seemed to have
"forgotten" it completely. The respective paths were tracked as
orphans.
Here is a log excerpt. multipathd is working on path additions for the
map I was testing with (149455...), when a DM event is received. Right
after ev_add_path() releases the lock, the idle map
(36001405...) is removed (not flushed, as the messages suggest).
Nov 15 22:05:46 sles15-gm-iscsi-03 multipathd[568]: sdg: add path (uevent)
# uev_add_path->pathinfo()
Nov 15 22:05:46 sles15-gm-iscsi-03 multipathd[568]: sdg: udev property SCSI_IDENT_LUN_T10 whitelisted
# ev_add_path()->adopt_paths()
Nov 15 22:05:46 sles15-gm-iscsi-03 multipathd[568]: sde: ownership set to 149455400000000006d706174683100000000000000000000
...
# ev_add_path->setup_map()
Nov 15 22:05:46 sles15-gm-iscsi-03 multipathd[568]: 149455400000000006d706174683100000000000000000000: failback = "manual" (setting: multipath internal)
# ev_add_path->domap()
Nov 15 22:05:46 sles15-gm-iscsi-03 multipathd[568]: 149455400000000006d706174683100000000000000000000: load table [0 2097152 multipath 0 0 3 1 service-time 0 1 1 8:64 ...
# ev_add_path->setup_multipath()->update-multipath_strings
Nov 15 22:05:46 sles15-gm-iscsi-03 multipathd[568]: 149455400000000006d706174683100000000000000000000: disassemble map [0 0 3 1 service-time 0 1 2 8:64 1 1 service-time ...
# (!!!!) devmap event for LIO map
Nov 15 22:05:46 sles15-gm-iscsi-03 multipathd[568]: 36001405d08f748dd4c5481f9ec22b888: devmap event #20
Nov 15 22:05:46 sles15-gm-iscsi-03 multipathd[568]: sdg [8:96]: path added to devmap 149455400000000006d706174683100000000000000000000
# ev_add_path() returns, vecs->lock is released
# uev_add_path() goes on to next path, lock currently not held
Nov 15 22:05:46 sles15-gm-iscsi-03 multipathd[568]: sdh: add path (uevent)
# (!!!!) The LIO map is flushed (36001405d08f748dd4c5481f9ec22b888)
# this looks like remove_map() or remove_map_and_stop_waiter()
Nov 15 22:05:46 sles15-gm-iscsi-03 multipathd[568]: sdd: orphan path, map flushed
Nov 15 22:05:46 sles15-gm-iscsi-03 multipathd[568]: sysfs prioritizer refcount 2
Nov 15 22:05:46 sles15-gm-iscsi-03 multipathd[568]: sdc: orphan path, map flushed
To me this looks like the map was removed by the DM event handler.
More detail can be found on
https://bugzilla.opensuse.org/show_bug.cgi?id=1116272.
I've also tried to reproduce the problem, with no success so far
(actually I haven't observed any dmevents in those reproduction
attempts).
After encountering this problem, I've scrutinized your polling dmevent
handler code over and over last week, but I couldn't spot any possible
problems :-)
The concern that I have is with the design of the polling dmevent API,
in particular the fact that the default action is EVENT_REMOVE. If, for
whatever reason, one map is missing from the return value of the
DM_DEVICE_LIST ioctl, multipath removes the map immediately, and
there's nothing short of a "reconfigure" or "add map" CLI command that
would reinstate the map. IOW, we remove the map not on a kernel event
saying "this map has been removed", but on receiving a list where this
element happens to be missing. We handle this differently for paths,
where we wait for a "remove" uevent before we really delete the path
from our data structures. (Note that the messages file I saved from the
event above shows no sign of such an uevent ever beeing sent - as I
said, the dm map was still present after the above occured).
What do you think about this? Could you maybe inspect those logs I
took, to make sure I didn't get on a totally wrong track here?
There's an easy solution to it: drop the dmevent handling code in
multipath-tools completely.
It's original design was to track _external_ map reloads, as originally
the maps had been setup by 'multipath', and 'multipahtd' was just
tracking state changes. As such multipathd needed to track those
changes, and it had been using dmevents for this.
However, with the advent of udev and updates to the multipathd daemon
external map reloads rarely happens, and even if they happen we still
would be notified by uevents.
So we might as well drop the dmevent code and rely on udev completely.
Which is what we do nowadays anyway.

Cheers,

Hannes
Martin Wilck
2018-11-21 13:33:04 UTC
Permalink
Post by Hannes Reinecke
Post by Martin Wilck
The concern that I have is with the design of the polling dmevent API,
in particular the fact that the default action is EVENT_REMOVE. If, for
whatever reason, one map is missing from the return value of the
DM_DEVICE_LIST ioctl, multipath removes the map immediately, and
there's nothing short of a "reconfigure" or "add map" CLI command that
would reinstate the map. IOW, we remove the map not on a kernel event
saying "this map has been removed", but on receiving a list where this
element happens to be missing. We handle this differently for paths,
where we wait for a "remove" uevent before we really delete the path
from our data structures. (Note that the messages file I saved from the
event above shows no sign of such an uevent ever beeing sent - as I
said, the dm map was still present after the above occured).
What do you think about this? Could you maybe inspect those logs I
took, to make sure I didn't get on a totally wrong track here?
There's an easy solution to it: drop the dmevent handling code in
multipath-tools completely.
It's original design was to track _external_ map reloads, as
originally
the maps had been setup by 'multipath', and 'multipahtd' was just
tracking state changes. As such multipathd needed to track those
changes, and it had been using dmevents for this.
However, with the advent of udev and updates to the multipathd
daemon
external map reloads rarely happens, and even if they happen we still
would be notified by uevents.
So we might as well drop the dmevent code and rely on udev
completely.
Which is what we do nowadays anyway.
I recall Ben arguing that the dmevent API was faster, and more
reliable, than udev. It doesn't depend on udevd being healthy and
having enough workers available, which is a plus. The reliability seems
to be debatable in view of the API design, that was the point of my
mail.

Yet we depend so heavily on udev nowadays that multipathd is pretty
much lost without it anyway.

Let's see what Ben says.

Thanks,
Martin
Martin Wilck
2018-11-27 09:19:20 UTC
Permalink
Hi Ben,
Post by Hannes Reinecke
Post by Martin Wilck
Hi Ben,
I recently encountered a strange and disturbing phenomenon with the
latest upstream code.
The system in question had two iSCSI connections on separate
networks.
On one of them, I tested iSCSI disconnect and re-connects, and broken
storage configurations (different LUNs diffenent sizes and with the
same WWID). The other one was idle.
During one test, I observed that multipathd's internal
representation
of the map representing the *idle* connection had vanished. The map was
still present on the device-mapper level, but multipathd seemed to have
"forgotten" it completely. The respective paths were tracked as
orphans.
Here is a log excerpt. multipathd is working on path additions for the
map I was testing with (149455...), when a DM event is received. Right
after ev_add_path() releases the lock, the idle map
(36001405...) is removed (not flushed, as the messages suggest).
Nov 15 22:05:46 sles15-gm-iscsi-03 multipathd[568]: sdg: add path (uevent)
# uev_add_path->pathinfo()
Nov 15 22:05:46 sles15-gm-iscsi-03 multipathd[568]: sdg: udev
property SCSI_IDENT_LUN_T10 whitelisted
# ev_add_path()->adopt_paths()
Nov 15 22:05:46 sles15-gm-iscsi-03 multipathd[568]: sde: ownership
set to 149455400000000006d706174683100000000000000000000
...
# ev_add_path->setup_map()
149455400000000006d706174683100000000000000000000: failback =
"manual" (setting: multipath internal)
# ev_add_path->domap()
149455400000000006d706174683100000000000000000000: load table [0
2097152 multipath 0 0 3 1 service-time 0 1 1 8:64 ...
# ev_add_path->setup_multipath()->update-multipath_strings
149455400000000006d706174683100000000000000000000: disassemble map
[0 0 3 1 service-time 0 1 2 8:64 1 1 service-time ...
# (!!!!) devmap event for LIO map
36001405d08f748dd4c5481f9ec22b888: devmap event #20
path added to devmap
149455400000000006d706174683100000000000000000000
# ev_add_path() returns, vecs->lock is released
# uev_add_path() goes on to next path, lock currently not held
Nov 15 22:05:46 sles15-gm-iscsi-03 multipathd[568]: sdh: add path (uevent)
# (!!!!) The LIO map is flushed (36001405d08f748dd4c5481f9ec22b888)
# this looks like remove_map() or remove_map_and_stop_waiter()
Nov 15 22:05:46 sles15-gm-iscsi-03 multipathd[568]: sdd: orphan path, map flushed
Nov 15 22:05:46 sles15-gm-iscsi-03 multipathd[568]: sysfs
prioritizer refcount 2
Nov 15 22:05:46 sles15-gm-iscsi-03 multipathd[568]: sdc: orphan path, map flushed
To me this looks like the map was removed by the DM event handler.
More detail can be found on
https://bugzilla.opensuse.org/show_bug.cgi?id=1116272.
I've also tried to reproduce the problem, with no success so far
(actually I haven't observed any dmevents in those reproduction
attempts).
After encountering this problem, I've scrutinized your polling dmevent
handler code over and over last week, but I couldn't spot any possible
problems :-)
The concern that I have is with the design of the polling dmevent API,
in particular the fact that the default action is EVENT_REMOVE. If, for
whatever reason, one map is missing from the return value of the
DM_DEVICE_LIST ioctl, multipath removes the map immediately, and
there's nothing short of a "reconfigure" or "add map" CLI command that
would reinstate the map. IOW, we remove the map not on a kernel event
saying "this map has been removed", but on receiving a list where this
element happens to be missing. We handle this differently for paths,
where we wait for a "remove" uevent before we really delete the path
from our data structures. (Note that the messages file I saved from the
event above shows no sign of such an uevent ever beeing sent - as I
said, the dm map was still present after the above occured).
What do you think about this? Could you maybe inspect those logs I
took, to make sure I didn't get on a totally wrong track here?
There's an easy solution to it: drop the dmevent handling code in
multipath-tools completely.
It's original design was to track _external_ map reloads, as
originally
the maps had been setup by 'multipath', and 'multipahtd' was just
tracking state changes. As such multipathd needed to track those
changes, and it had been using dmevents for this.
However, with the advent of udev and updates to the multipathd
daemon
external map reloads rarely happens, and even if they happen we still
would be notified by uevents.
So we might as well drop the dmevent code and rely on udev
completely.
Which is what we do nowadays anyway.
I like this idea, in particular for map removal.
Ben, what do you think?

Martin
Benjamin Marzinski
2018-11-27 20:13:33 UTC
Permalink
Post by Martin Wilck
Hi Ben,
Post by Hannes Reinecke
Post by Martin Wilck
Hi Ben,
I recently encountered a strange and disturbing phenomenon with the
latest upstream code.
The system in question had two iSCSI connections on separate networks.
On one of them, I tested iSCSI disconnect and re-connects, and broken
storage configurations (different LUNs diffenent sizes and with the
same WWID). The other one was idle.
During one test, I observed that multipathd's internal
representation
of the map representing the *idle* connection had vanished. The map was
still present on the device-mapper level, but multipathd seemed to have
"forgotten" it completely. The respective paths were tracked as
orphans.
Here is a log excerpt. multipathd is working on path additions for the
map I was testing with (149455...), when a DM event is received. Right
after ev_add_path() releases the lock, the idle map
(36001405...) is removed (not flushed, as the messages suggest).
Nov 15 22:05:46 sles15-gm-iscsi-03 multipathd[568]: sdg: add path (uevent)
# uev_add_path->pathinfo()
Nov 15 22:05:46 sles15-gm-iscsi-03 multipathd[568]: sdg: udev
property SCSI_IDENT_LUN_T10 whitelisted
# ev_add_path()->adopt_paths()
Nov 15 22:05:46 sles15-gm-iscsi-03 multipathd[568]: sde: ownership
set to 149455400000000006d706174683100000000000000000000
...
# ev_add_path->setup_map()
149455400000000006d706174683100000000000000000000: failback =
"manual" (setting: multipath internal)
# ev_add_path->domap()
149455400000000006d706174683100000000000000000000: load table [0
2097152 multipath 0 0 3 1 service-time 0 1 1 8:64 ...
# ev_add_path->setup_multipath()->update-multipath_strings
149455400000000006d706174683100000000000000000000: disassemble map
[0 0 3 1 service-time 0 1 2 8:64 1 1 service-time ...
# (!!!!) devmap event for LIO map
36001405d08f748dd4c5481f9ec22b888: devmap event #20
path added to devmap
149455400000000006d706174683100000000000000000000
# ev_add_path() returns, vecs->lock is released
# uev_add_path() goes on to next path, lock currently not held
Nov 15 22:05:46 sles15-gm-iscsi-03 multipathd[568]: sdh: add path (uevent)
# (!!!!) The LIO map is flushed (36001405d08f748dd4c5481f9ec22b888)
# this looks like remove_map() or remove_map_and_stop_waiter()
Nov 15 22:05:46 sles15-gm-iscsi-03 multipathd[568]: sdd: orphan
path, map flushed
Nov 15 22:05:46 sles15-gm-iscsi-03 multipathd[568]: sysfs
prioritizer refcount 2
Nov 15 22:05:46 sles15-gm-iscsi-03 multipathd[568]: sdc: orphan
path, map flushed
To me this looks like the map was removed by the DM event handler.
More detail can be found on
https://bugzilla.opensuse.org/show_bug.cgi?id=1116272.
I've also tried to reproduce the problem, with no success so far
(actually I haven't observed any dmevents in those reproduction
attempts).
After encountering this problem, I've scrutinized your polling dmevent
handler code over and over last week, but I couldn't spot any possible
problems :-)
The concern that I have is with the design of the polling dmevent API,
in particular the fact that the default action is EVENT_REMOVE. If, for
whatever reason, one map is missing from the return value of the
DM_DEVICE_LIST ioctl, multipath removes the map immediately, and
there's nothing short of a "reconfigure" or "add map" CLI command that
would reinstate the map. IOW, we remove the map not on a kernel event
saying "this map has been removed", but on receiving a list where this
element happens to be missing. We handle this differently for paths,
where we wait for a "remove" uevent before we really delete the path
from our data structures. (Note that the messages file I saved from the
event above shows no sign of such an uevent ever beeing sent - as I
said, the dm map was still present after the above occured).
What do you think about this? Could you maybe inspect those logs I
took, to make sure I didn't get on a totally wrong track here?
There's an easy solution to it: drop the dmevent handling code in
multipath-tools completely.
It's original design was to track _external_ map reloads, as
originally
the maps had been setup by 'multipath', and 'multipahtd' was just
tracking state changes. As such multipathd needed to track those
changes, and it had been using dmevents for this.
However, with the advent of udev and updates to the multipathd daemon
external map reloads rarely happens, and even if they happen we still
would be notified by uevents.
So we might as well drop the dmevent code and rely on udev
completely.
Which is what we do nowadays anyway.
I like this idea, in particular for map removal.
Ben, what do you think?
That was my original goal when I ended up with the dmevents code. I
kept running into new corner cases. The issue is that udev, by design,
does not guarantee that you will get all the uevents, and in fact, we do
see this happen occasionally, so it's not a theoretical problem.

Because of that, any system that relies on ueventis has to handle things
like

1. Someone removes a multipath device
2. mutipathd misses the remove uevent
3. Someone creates a differnt multipath device with the same alias
4. multipathd misses the add uevent.
5. multipathd notices that your multipath device has all the wrong
paths, and switches them. Well, fortunately, disable_changed_wwids
will catch this and simply disable the device

Someone decides to switch the names of two (or more) multipath devices:
1. Someone disables user_friendly_names, renaming the devices to
their wwid names.
2. mutipathd misses the uevents.
3. The person switches the aliases in mutipath.conf and reloads
4. Now multipathd needs to figure out how to swap two (or more) device
aliases

I kept trying to figure out methods around issues like these, and kept
thinking up new ones. One observation I hit on was that it was really
important to not miss removing a device, because that was the first step
in having your multipath devices pointing to the wrong place. I'm not
saying it is impossible, I'm just saying that I gave up on it. I would
happily review code that dealt with all of these issues.

However, before we start down that road (because I went down it once
already, and turned back), lets look at fixing what we have. I agree
that what happened here is that the dmevents code didn't see the device,
and thus removed it. There seem to be two possible ways for this to
happen

1. device-mapper told us that it successfully completed our command, but
gave us incomplete information. This would be a really big issue.
Unlike udev, device mapper is supposed to guarantee that if it
returns success, it has the information that you asked for. It's not
just multipath that depends on this. Many tools use libdevmapper.
If there is a bug like this in it, that needs to get fixed, not
worked around on a tool-by-tool basis. But, I don't think that this
is what happened.

2. device-mapper told us that it failed to complete our command, and we
didn't differentiate that from a successful completion with a certain
result. I think that this is more likely the cause, largely because
I can see where we do this. In dm_get_events() once we get the list
of names from device-mapper we call dm_is_mpath() on each name. I'm
not sure why I did this. Probably, I was worried about searching
through a large list of non-multipath names with a lock held, and
thought that culling devices that obviously weren't correct would
speed this up. At any rate, if the device-mapper command fails, we
assume that the device isn't a multipath device, even though
device-mapper never told us that. This is clearly a bug, and it
clearly would cause exactly the result you saw. That check should be
removed, or at least we should differentiate between failure of the
command and success of the command, where the device does not have a
multipath table.

thoughts?

-Ben
Post by Martin Wilck
Martin
Martin Wilck
2018-11-28 09:31:02 UTC
Permalink
Post by Benjamin Marzinski
Post by Martin Wilck
I like this idea, in particular for map removal.
Ben, what do you think?
That was my original goal when I ended up with the dmevents code. I
kept running into new corner cases. The issue is that udev, by design,
does not guarantee that you will get all the uevents, and in fact, we do
see this happen occasionally, so it's not a theoretical problem.
Because of that, any system that relies on ueventis has to handle things
like
1. Someone removes a multipath device
2. mutipathd misses the remove uevent
3. Someone creates a differnt multipath device with the same alias
4. multipathd misses the add uevent.
5. multipathd notices that your multipath device has all the wrong
paths, and switches them. Well, fortunately, disable_changed_wwids
will catch this and simply disable the device
Maybe we should put more work in the "delegate dangerous commands to
multipathd" approach.
Post by Benjamin Marzinski
Someone decides to switch the names of two (or more) multipath
1. Someone disables user_friendly_names, renaming the devices to
their wwid names.
2. mutipathd misses the uevents.
3. The person switches the aliases in mutipath.conf and reloads
4. Now multipathd needs to figure out how to swap two (or more) device
aliases
IMO this works well with the current code (unless names are really
*swapped*, e.g. mpatha<->mpathb). But maybe I'm overlooking something.
Post by Benjamin Marzinski
I kept trying to figure out methods around issues like these, and kept
thinking up new ones. One observation I hit on was that it was really
important to not miss removing a device, because that was the first step
in having your multipath devices pointing to the wrong place. I'm not
saying it is impossible, I'm just saying that I gave up on it. I would
happily review code that dealt with all of these issues.
If "remove" uevents is most important, what if we created an additional
monitor for "kernel" uevents in multipathd? Those should be just as
reliable as dmevents, and for removal, we don't need the udev rule
processing. Having ACTION==remove and the KERNEL name should be
sufficient. This would obviously not work for "change" or "add" events.

Also, I'd like to understand in what (test? production?) scenarios
you've observed lost uevents. Lack of reliablity of udev is a problem
not only for multipathd, and udev should be fixed. I can mainly think
of:

1) udev rule processing gets stuck on slow IO or IO timeouts, causing
the worker to be killed,
2) resource contention between udev workers, in particular on large
systems during "coldplug".

Do you have other failure scenarios in mind?
Post by Benjamin Marzinski
However, before we start down that road (because I went down it once
already, and turned back), lets look at fixing what we have. I agree
that what happened here is that the dmevents code didn't see the device,
and thus removed it. There seem to be two possible ways for this to
happen
1. device-mapper told us that it successfully completed our command, but
gave us incomplete information. This would be a really big issue.
Unlike udev, device mapper is supposed to guarantee that if it
returns success, it has the information that you asked for. It's not
just multipath that depends on this. Many tools use libdevmapper.
If there is a bug like this in it, that needs to get fixed, not
worked around on a tool-by-tool basis. But, I don't think that this
is what happened.
2. device-mapper told us that it failed to complete our command, and we
didn't differentiate that from a successful completion with a certain
result. I think that this is more likely the cause, largely because
I can see where we do this. In dm_get_events() once we get the list
of names from device-mapper we call dm_is_mpath() on each
name. I'm
not sure why I did this. Probably, I was worried about searching
through a large list of non-multipath names with a lock held, and
thought that culling devices that obviously weren't correct would
speed this up. At any rate, if the device-mapper command fails, we
assume that the device isn't a multipath device, even though
device-mapper never told us that. This is clearly a bug, and it
clearly would cause exactly the result you saw. That check should be
removed, or at least we should differentiate between failure of the
command and success of the command, where the device does not have a
multipath table.
Good thought. I'd stared at your code in length, but that didn't occur
to me. I feel relieved that you found an explanation short of
DM_DEVICE_LIST being unreliable.

Martin
Benjamin Marzinski
2018-11-28 16:57:14 UTC
Permalink
Post by Martin Wilck
Post by Benjamin Marzinski
Post by Martin Wilck
I like this idea, in particular for map removal.
Ben, what do you think?
That was my original goal when I ended up with the dmevents code. I
kept running into new corner cases. The issue is that udev, by design,
does not guarantee that you will get all the uevents, and in fact, we do
see this happen occasionally, so it's not a theoretical problem.
Because of that, any system that relies on ueventis has to handle things
like
1. Someone removes a multipath device
2. mutipathd misses the remove uevent
3. Someone creates a differnt multipath device with the same alias
4. multipathd misses the add uevent.
5. multipathd notices that your multipath device has all the wrong
paths, and switches them. Well, fortunately, disable_changed_wwids
will catch this and simply disable the device
Maybe we should put more work in the "delegate dangerous commands to
multipathd" approach.
Yeah. If multipath was more of a front-end to multipathd, these
divergent view issues would be much less likely.
Post by Martin Wilck
Post by Benjamin Marzinski
1. Someone disables user_friendly_names, renaming the devices to
their wwid names.
2. mutipathd misses the uevents.
3. The person switches the aliases in mutipath.conf and reloads
4. Now multipathd needs to figure out how to swap two (or more) device
aliases
IMO this works well with the current code (unless names are really
*swapped*, e.g. mpatha<->mpathb). But maybe I'm overlooking something.
It does work correctly in the current code, but that's because we never
miss remove events (renames are basically a remove followed by an add).
If we relied on uevents and missed them, when we checked the device in
checkerloop we would find that it suddenly is all wrong.
Post by Martin Wilck
Post by Benjamin Marzinski
I kept trying to figure out methods around issues like these, and kept
thinking up new ones. One observation I hit on was that it was really
important to not miss removing a device, because that was the first step
in having your multipath devices pointing to the wrong place. I'm not
saying it is impossible, I'm just saying that I gave up on it. I would
happily review code that dealt with all of these issues.
If "remove" uevents is most important, what if we created an additional
monitor for "kernel" uevents in multipathd? Those should be just as
reliable as dmevents, and for removal, we don't need the udev rule
processing. Having ACTION==remove and the KERNEL name should be
sufficient. This would obviously not work for "change" or "add" events.
That would certainly help, but again, I would contend that it's not as
reliable as devmapper. From the netlink man page

"However, reliable transmissions from kernel to user are impossible in
any case. The kernel can't send a netlink message if the socket buffer
is full: the message will be dropped and the kernel and the user-space
process will no longer have the same view of kernel state. It is up to
the application to detect when this happens (via the ENOBUFS error
returned by recvmsg(2)) and resynchronize."

In this case, we would need to rescan everything on ENOBUFS, but at
least we would get a notification.

dmevents simply has one value per device that tracks the current event
number. Space for this value is allocated when the device is created,
and it doesn't take any more space when there is 5000 outstanding events
than when there is 1.
Post by Martin Wilck
Also, I'd like to understand in what (test? production?) scenarios
you've observed lost uevents. Lack of reliablity of udev is a problem
not only for multipathd, and udev should be fixed. I can mainly think
1) udev rule processing gets stuck on slow IO or IO timeouts, causing
the worker to be killed,
2) resource contention between udev workers, in particular on large
systems during "coldplug".
Do you have other failure scenarios in mind?
I don't know if this ever being confirmed in the field, but udev and
won't store an unlimited number of events. IIRC, after udev finished
processing and event, processes looking for that event may have a
limited time to receive that event before it is dropped, if memory is
tight. To deal with this, uevent_listen needed to work quickly to make
sure it didn't lose events. Or at least that was the case in the early
days of multipath's udev support. It might be better now.

But we've certainly seen workers killed from timeouts.
Post by Martin Wilck
Post by Benjamin Marzinski
However, before we start down that road (because I went down it once
already, and turned back), lets look at fixing what we have. I agree
that what happened here is that the dmevents code didn't see the device,
and thus removed it. There seem to be two possible ways for this to
happen
1. device-mapper told us that it successfully completed our command, but
gave us incomplete information. This would be a really big issue.
Unlike udev, device mapper is supposed to guarantee that if it
returns success, it has the information that you asked for. It's not
just multipath that depends on this. Many tools use libdevmapper.
If there is a bug like this in it, that needs to get fixed, not
worked around on a tool-by-tool basis. But, I don't think that this
is what happened.
2. device-mapper told us that it failed to complete our command, and we
didn't differentiate that from a successful completion with a certain
result. I think that this is more likely the cause, largely because
I can see where we do this. In dm_get_events() once we get the list
of names from device-mapper we call dm_is_mpath() on each
name. I'm
not sure why I did this. Probably, I was worried about searching
through a large list of non-multipath names with a lock held, and
thought that culling devices that obviously weren't correct would
speed this up. At any rate, if the device-mapper command fails, we
assume that the device isn't a multipath device, even though
device-mapper never told us that. This is clearly a bug, and it
clearly would cause exactly the result you saw. That check should be
removed, or at least we should differentiate between failure of the
command and success of the command, where the device does not have a
multipath table.
Good thought. I'd stared at your code in length, but that didn't occur
to me. I feel relieved that you found an explanation short of
DM_DEVICE_LIST being unreliable.
Martin
Martin Wilck
2018-11-28 19:35:05 UTC
Permalink
Post by Benjamin Marzinski
Post by Martin Wilck
If "remove" uevents is most important, what if we created an
additional
monitor for "kernel" uevents in multipathd? Those should be just as
reliable as dmevents, and for removal, we don't need the udev rule
processing. Having ACTION==remove and the KERNEL name should be
sufficient. This would obviously not work for "change" or "add" events.
That would certainly help, but again, I would contend that it's not as
reliable as devmapper. From the netlink man page
"However, reliable transmissions from kernel to user are impossible in
any case. The kernel can't send a netlink message if the socket buffer
is full: the message will be dropped and the kernel and the user-
space
process will no longer have the same view of kernel state. It is up to
the application to detect when this happens (via the ENOBUFS error
returned by recvmsg(2)) and resynchronize."
In this case, we would need to rescan everything on ENOBUFS, but at
least we would get a notification.
By running the uevent listener with RT priority and using a suitably
large receive buffer, we should be able to reduce the likelihood of
this happening such that it doesn't really matter anymore. In real OOM
situations, multipathd is pretty much hosed anyway. multipathd hasn't
been written with resilience against memory pressure in mind.

But for the time being, I'm fine with your dm_is_mpath patch.

Thanks
Martin

Loading...