History log of /fs/dlm/recoverd.c
Revision Date Author Comments
60f98d1839376d30e13f3e452dce2433fad3060e 02-Nov-2011 David Teigland <teigland@redhat.com> dlm: add recovery callbacks

These new callbacks notify the dlm user about lock recovery.
GFS2, and possibly others, need to be aware of when the dlm
will be doing lock recovery for a failed lockspace member.

In the past, this coordination has been done between dlm and
file system daemons in userspace, which then direct their
kernel counterparts. These callbacks allow the same
coordination directly, and more simply.

Signed-off-by: David Teigland <teigland@redhat.com>
/fs/dlm/recoverd.c
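
A minimal sketch of how a dlm user such as GFS2 might hook these
notifications. The ops structure and callback signatures follow the
interface this series describes; the my_* names and empty bodies are
illustrative, not GFS2's actual handlers.

    #include <linux/dlm.h>

    /* Called before the dlm begins recovery for a failed member: the
     * filesystem should pause activity that depends on its locks. */
    static void my_recover_prep(void *ops_arg)
    {
    }

    /* Called when recovery completes: normal locking may resume. */
    static void my_recover_done(void *ops_arg, struct dlm_slot *slots,
                                int num_slots, int our_slot,
                                uint32_t generation)
    {
    }

    static const struct dlm_lockspace_ops my_ops = {
            .recover_prep = my_recover_prep,
            .recover_done = my_recover_done,
    };
    /* ... passed to dlm_new_lockspace() along with an ops_arg pointer */
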
f95a34c66554235b70a681fcd9feebc195f7ec0e 14-Oct-2011 David Teigland <teigland@redhat.com> dlm: move recovery barrier calls

Put all the calls to recovery barriers in the same function
to clarify where they each happen. Should not change any behavior.
Also modify some recovery debug lines to make them consistent.

Signed-off-by: David Teigland <teigland@redhat.com>
/fs/dlm/recoverd.c
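
A sketch of the shape of this refactor, with hypothetical helper names:
each recovery stage announces its progress and waits for the other
members through a single routine, so the sequence of barriers reads in
one place.

    /* Hypothetical consolidation: set our status bit, then wait for
     * every lockspace member to report the same recovery status. */
    static int recover_barrier(struct dlm_ls *ls, uint32_t status)
    {
            dlm_set_recover_status(ls, status);
            return dlm_recover_status_wait(ls, status);
    }
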
23e8e1aaacb10d9f05e44a93e10ea4ee5b3838a5 05-Apr-2011 David Teigland <teigland@redhat.com> dlm: use workqueue for callbacks

Instead of creating our own kthread (dlm_astd) to deliver
callbacks for all lockspaces, use a per-lockspace workqueue
to deliver the callbacks. This eliminates complications and
slowdowns from many lockspaces sharing the same thread.

Signed-off-by: David Teigland <teigland@redhat.com>
/fs/dlm/recoverd.c
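
A sketch of the per-lockspace workqueue pattern described above; the
field and function names mirror the idea but are illustrative.

    #include <linux/workqueue.h>

    /* Each lockspace owns its own callback workqueue, so a slow or
     * busy lockspace no longer stalls callback delivery for all the
     * others the way the shared dlm_astd kthread did. */
    static int setup_callback_workqueue(struct dlm_ls *ls)
    {
            ls->ls_callback_wq = alloc_workqueue("dlm_callback",
                                        WQ_UNBOUND | WQ_MEM_RECLAIM, 0);
            if (!ls->ls_callback_wq)
                    return -ENOMEM;
            return 0;
    }

    /* Queue a lock's completion/blocking callback on this lockspace's
     * queue instead of waking a shared kthread. */
    static void queue_lkb_callback(struct dlm_ls *ls, struct dlm_lkb *lkb)
    {
            queue_work(ls->ls_callback_wq, &lkb->lkb_cb_work);
    }
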
d44e0fc704143624b3e88fbf8fbcfda7a83fd299 18-Mar-2008 David Teigland <teigland@redhat.com> dlm: recover nodes that are removed and re-added

If a node is removed from a lockspace, and then added back before the
dlm is notified of the removal, the dlm will not detect the removal
and won't clear the old state from the node. This is fixed by using a
list of added nodes so the membership recovery can detect when a newly
added node is already in the member list.

Signed-off-by: David Teigland <teigland@redhat.com>
/fs/dlm/recoverd.c
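
A sketch of the detection this fix adds, with hypothetical list and
field names: each newly added node is checked against the current
member list, and a match means the node left and rejoined without the
dlm seeing the removal.

    /* If an "added" node is already a member, its old state is stale:
     * it must be treated as removed before being added back. */
    static void detect_readded_nodes(struct dlm_ls *ls,
                                     struct list_head *added_list)
    {
            struct dlm_member *new, *cur;

            list_for_each_entry(new, added_list, list) {
                    list_for_each_entry(cur, &ls->ls_nodes, list) {
                            if (new->nodeid == cur->nodeid)
                                    cur->gone_and_back = 1;  /* hypothetical flag */
                    }
            }
    }
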
85f0379aa0f9366bb6918e2e898a915231176fbd 16-Jan-2008 David Teigland <teigland@redhat.com> dlm: keep cached master rsbs during recovery

To prevent the master of an rsb from changing rapidly, an unused rsb is kept
on the "toss list" for a period of time to be reused. The toss list was
being cleared completely for each recovery, which is unnecessary. Much of
the benefit of the toss list can be maintained if nodes keep rsbs in their
toss list that they are the master of. These rsbs need to be included
when the resource directory is rebuilt during recovery.

Signed-off-by: David Teigland <teigland@redhat.com>
/fs/dlm/recoverd.c
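
A sketch of the selective purge described above; the helper names are
illustrative, and in the dlm the toss list is actually kept per hash
bucket.

    /* During recovery, drop only the toss-list rsbs this node does not
     * master; mastered rsbs stay cached and are fed back into the
     * resource directory when it is rebuilt. */
    static void purge_toss_list(struct dlm_ls *ls, struct list_head *toss)
    {
            struct dlm_rsb *r, *safe;

            list_for_each_entry_safe(r, safe, toss, res_hashchain) {
                    if (!is_master(r)) {
                            list_del(&r->res_hashchain);
                            free_rsb(r);
                    }
            }
    }
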
c36258b5925e6cf6bf72904635100593573bfcff 27-Sep-2007 David Teigland <teigland@redhat.com> [DLM] block dlm_recv in recovery transition

Introduce a per-lockspace rwsem that's held in read mode by dlm_recv
threads while working in the dlm. This allows dlm_recv activity to be
suspended when the lockspace transitions to, from and between recovery
cycles.

The specific bug prompting this change is one where an in-progress
recovery cycle is aborted by a new recovery cycle. While dlm_recv was
processing a recovery message, the recovery cycle was aborted and
dlm_recoverd began cleaning up. dlm_recv decremented recover_locks_count
on an rsb after dlm_recoverd had reset it to zero. This is fixed by
suspending dlm_recv (taking write lock on the rwsem) before aborting the
current recovery.

The transitions to/from normal and recovery modes are simplified by using
this new ability to block dlm_recv. The switch from normal to recovery
mode means dlm_recv goes from processing locking messages, to saving them
for later, and vice versa. Races are avoided by blocking dlm_recv when
setting the flag that switches between modes.

Signed-off-by: David Teigland <teigland@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
/fs/dlm/recoverd.c
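
A sketch of the rwsem pattern this patch introduces. ls_recv_active is
the semaphore described above; save_for_later() and process_message()
are hypothetical stand-ins for the requestqueue and normal processing
paths.

    /* dlm_recv side: hold the rwsem for read while working in the dlm,
     * and let the mode flag decide whether to process or save. */
    static void receive_message(struct dlm_ls *ls, struct dlm_message *ms)
    {
            down_read(&ls->ls_recv_active);
            if (dlm_locking_stopped(ls))
                    save_for_later(ls, ms);   /* recovery mode: queue it */
            else
                    process_message(ls, ms);  /* normal mode */
            up_read(&ls->ls_recv_active);
    }

    /* dlm_recoverd side: taking the rwsem for write drains all
     * in-flight dlm_recv activity before the mode flips. */
    static void stop_locking(struct dlm_ls *ls)
    {
            down_write(&ls->ls_recv_active);
            clear_bit(LSFL_RUNNING, &ls->ls_flags);  /* enter recovery mode */
            up_write(&ls->ls_recv_active);
    }
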
3ae1acf93a21512512f8a78430fcde5992dd208e 18-May-2007 David Teigland <teigland@redhat.com> [DLM] add lock timeouts and warnings [2/6]

New features: lock timeouts and time warnings. If the DLM_LKF_TIMEOUT
flag is set, then the request/conversion will be canceled after waiting
the specified number of centiseconds (specified per lock). This feature
is only available for locks requested through libdlm (can be enabled for
kernel dlm users if there's a use for it.)

If the new DLM_LSFL_TIMEWARN flag is set when creating the lockspace, then
a warning message will be sent to userspace (using genetlink) after a
request/conversion has been waiting for a given number of centiseconds
(configurable per node). The time warnings will be used in the future
to do deadlock detection in userspace.

Signed-off-by: David Teigland <teigland@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
/fs/dlm/recoverd.c
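
A sketch of the timeout scan, assuming a per-lockspace list of waiting
lkbs and a cancel helper (cancel_lock_request() here is hypothetical);
DLM_LKF_TIMEOUT and the centisecond units are as described above.

    /* Periodically scan lkbs blocked waiting for a grant or reply: any
     * request that asked for a timeout and has exceeded it is canceled. */
    static void scan_timeout(struct dlm_ls *ls)
    {
            struct dlm_lkb *lkb, *safe;
            unsigned long deadline;

            list_for_each_entry_safe(lkb, safe, &ls->ls_timeout,
                                     lkb_time_list) {
                    if (!(lkb->lkb_exflags & DLM_LKF_TIMEOUT))
                            continue;
                    /* lkb_timeout_cs is in centiseconds, per lock */
                    deadline = lkb->lkb_timestamp +
                               lkb->lkb_timeout_cs * HZ / 100;
                    if (time_after(jiffies, deadline))
                            cancel_lock_request(ls, lkb);
            }
    }
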
8ec6886748443bec53ce9b9bf50cec92bc417a1b 09-Jan-2007 David Teigland <teigland@redhat.com> [DLM] change some log_error to log_debug

Some common, non-error messages should use log_debug instead of log_error
so they can be turned off.

Signed-off-by: David Teigland <teigland@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
/fs/dlm/recoverd.c
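
A sketch of the distinction, assuming a configuration knob along the
lines of dlm_config.ci_log_debug; the macro bodies are illustrative.

    /* log_error() always prints; log_debug() is gated on a config
     * option so routine recovery chatter can be switched off. */
    #define log_error(ls, fmt, args...) \
            printk(KERN_ERR "dlm: %s: " fmt "\n", (ls)->ls_name, ##args)

    #define log_debug(ls, fmt, args...) \
    do { \
            if (dlm_config.ci_log_debug) \
                    printk(KERN_DEBUG "dlm: %s: " fmt "\n", \
                           (ls)->ls_name, ##args); \
    } while (0)
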
57adf7eede38d315e0e328c52484d6a596e9a238 29-Nov-2006 Ryusuke Konishi <ryusuke@osrg.net> [DLM] fix format warnings in rcom.c and recoverd.c

This fixes the following gcc warnings generated on
the architectures where uint64_t != unsigned long long (e.g. ppc64).

fs/dlm/rcom.c:154: warning: format '%llx' expects type 'long long unsigned int', but argument 4 has type 'uint64_t'
fs/dlm/rcom.c:154: warning: format '%llx' expects type 'long long unsigned int', but argument 5 has type 'uint64_t'
fs/dlm/recoverd.c:48: warning: format '%llx' expects type 'long long unsigned int', but argument 3 has type 'uint64_t'
fs/dlm/recoverd.c:202: warning: format '%llx' expects type 'long long unsigned int', but argument 3 has type 'uint64_t'
fs/dlm/recoverd.c:210: warning: format '%llx' expects type 'long long unsigned int', but argument 3 has type 'uint64_t'

Signed-off-by: Ryusuke Konishi <ryusuke@osrg.net>
Signed-off-by: Patrick Caulfield <pcaulfie@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
/fs/dlm/recoverd.c
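
The portable fix for warnings like these: uint64_t is 'unsigned long'
rather than 'unsigned long long' on some 64-bit architectures such as
ppc64, so an argument printed with %llx needs an explicit cast. The
variable below is illustrative.

    uint64_t seq = ls->ls_recover_seq;

    log_debug(ls, "recover event %llx", (unsigned long long)seq);
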
2896ee37ccc1f9acb244c9b02becb74a43661009 27-Nov-2006 David Teigland <teigland@redhat.com> [DLM] fix add_requestqueue checking nodes list

Requests that arrive after recovery has started are saved in the
requestqueue and processed after recovery is done. Some of these requests
are purged during recovery if they are from nodes that have been removed.
We move the purging of the requests (dlm_purge_requestqueue) to later in
the recovery sequence which allows the routine saving requests
(dlm_add_requestqueue) to avoid filtering out requests by nodeid since the
same will be done by the purge. The current code has add_requestqueue
filtering by nodeid but doesn't hold any locks when accessing the list of
current nodes. This also means that we need to call the purge routine
when the lockspace is being shut down since the add routine will not be
rejecting requests itself any more.

Signed-off-by: David Teigland <teigland@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
/fs/dlm/recoverd.c
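
A sketch of the reordering, using the rq_entry/requestqueue naming from
the dlm; dlm_is_member() here stands for a membership check done under
the proper lock.

    /* Saving requests is now unconditional; filtering by nodeid happens
     * only here, later in recovery, with the members list protected. */
    static void purge_requestqueue(struct dlm_ls *ls)
    {
            struct rq_entry *e, *safe;

            mutex_lock(&ls->ls_requestqueue_mutex);
            list_for_each_entry_safe(e, safe, &ls->ls_requestqueue, list) {
                    if (!dlm_is_member(ls, e->nodeid)) {
                            list_del(&e->list);
                            kfree(e);
                    }
            }
            mutex_unlock(&ls->ls_requestqueue_mutex);
    }
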
4b77f2c93d052adca8cc8690b9b5e7f8798f4ddd 01-Nov-2006 David Teigland <teigland@redhat.com> [DLM] do full recover_locks barrier

Red Hat BZ 211914

The previous patch "[DLM] fix aborted recovery during
node removal" was incomplete as discovered with further testing. It set
the bit for the RS_LOCKS barrier but did not then wait for the barrier.
This is often ok, but sometimes it will cause yet another recovery hang.
If the node that skips the barrier wait is a new node that also has the
lowest nodeid, it misses the important step of collecting and reporting the
barrier status from the other nodes (which is the job of the low nodeid in
the barrier wait routine).

Signed-off-by: David Teigland <teigland@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
/fs/dlm/recoverd.c
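
A sketch of the completed barrier, pairing the status bit with the
wait; DLM_RS_LOCKS is the barrier named above, and the wait helper name
is illustrative.

    /* Setting the bit alone is not enough: inside the wait routine, the
     * node with the lowest nodeid collects and reports barrier status
     * for everyone, so skipping the wait breaks the barrier. */
    static int recover_locks_barrier(struct dlm_ls *ls)
    {
            dlm_set_recover_status(ls, DLM_RS_LOCKS);
            return dlm_recover_locks_wait(ls);
    }
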
2cdc98aaf072d573df10c503d3b3b0b74e2a6d06 31-Oct-2006 David Teigland <teigland@redhat.com> [DLM] fix stopping unstarted recovery

Red Hat BZ 211914

When many nodes are joining a lockspace simultaneously, the dlm gets a
quick sequence of stop/start events, a pair for adding each node.
dlm_controld in user space sends dlm_recoverd in the kernel each stop and
start event. dlm_controld will sometimes send the stop before
dlm_recoverd has had a chance to take up the previously queued start. The
stop aborts the processing of the previous start by setting the
RECOVERY_STOP flag. dlm_recoverd is erroneously clearing this flag and
ignoring the stop/abort if it happens to take up the start after the stop
meant to abort it. The fix is to check the sequence number that's
incremented for each stop/start before clearing the flag.

Signed-off-by: David Teigland <teigland@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
/fs/dlm/recoverd.c
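
A sketch of the fix, assuming the per-lockspace event sequence counter
described above (ls_recover_seq) and the RECOVERY_STOP flag; the
function name is illustrative.

    /* Only clear the stop flag if it belongs to the start event we are
     * about to run; a stop aimed at a later start must stay set. */
    static int check_recovery_stop(struct dlm_ls *ls, uint64_t start_seq)
    {
            int stop;

            spin_lock(&ls->ls_recover_lock);
            if (ls->ls_recover_seq == start_seq)
                    stop = test_and_clear_bit(LSFL_RECOVERY_STOP,
                                              &ls->ls_flags);
            else
                    stop = test_bit(LSFL_RECOVERY_STOP, &ls->ls_flags);
            spin_unlock(&ls->ls_recover_lock);

            return stop;
    }
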
91c0dc93a1a6bbdd79707ed311e48b4397df177f 31-Oct-2006 David Teigland <teigland@redhat.com> [DLM] fix aborted recovery during node removal

Red Hat BZ 211914

With the new cluster infrastructure, dlm recovery for a node removal can
be aborted and restarted for a node addition. When this happens, the
restarted recovery isn't aware that it's doing recovery for the earlier
removal as well as the addition. So, it then skips the recovery steps
only required when nodes are removed. This can result in locks not being
purged for failed/removed nodes. The fix is to check for removed nodes
for which recovery has not been completed at the start of a new recovery
sequence.

Signed-off-by: David Teigland <teigland@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
/fs/dlm/recoverd.c
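
A sketch of the check added at the start of a recovery sequence;
ls_nodes_gone is the dlm's list of departed members, and the
needs_recovery flag here is a hypothetical marker for "removal steps
not yet completed".

    /* If an aborted cycle left a removed node unprocessed, the new
     * cycle must run the removal steps (lock purging) as well. */
    static int removal_recovery_pending(struct dlm_ls *ls)
    {
            struct dlm_member *memb;

            list_for_each_entry(memb, &ls->ls_nodes_gone, list) {
                    if (memb->needs_recovery)
                            return 1;
            }
            return 0;
    }
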
5f88f1ea16a2fb5f125505053d1bfb7901a88c64 24-Aug-2006 David Teigland <teigland@redhat.com> [DLM] add new lockspace to list earlier

When a new lockspace was being created, the recoverd thread was being
started for it before the lockspace was added to the global list of
lockspaces. The new thread was looking up the lockspace in the global
list and sometimes not finding it due to the race with the original thread
adding it to the list. We need to add the lockspace to the global list
before starting the thread instead of after, and if the new thread can't
find the lockspace for some reason, it should return an error.

Signed-off-by: David Teigland <teigland@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
/fs/dlm/recoverd.c
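
A sketch of the corrected ordering, using the dlm's global lslist; the
error handling shown is illustrative.

    static int register_and_start(struct dlm_ls *ls)
    {
            int error;

            /* publish the lockspace first ... */
            spin_lock(&lslist_lock);
            list_add(&ls->ls_list, &lslist);
            spin_unlock(&lslist_lock);

            /* ... then start the thread that will look it up */
            error = dlm_recoverd_start(ls);
            if (error) {
                    spin_lock(&lslist_lock);
                    list_del(&ls->ls_list);   /* undo on failure */
                    spin_unlock(&lslist_lock);
            }
            return error;
    }
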
f6db1b8e724b071d144055b48da3827ce26dba1c 09-Aug-2006 David Teigland <teigland@redhat.com> [DLM] abort recovery more quickly

When we abort one recovery to do another, break out of the ping_members()
routine more quickly, and wake up the dlm_recoverd thread more quickly
instead of waiting for it to time out.

Signed-off-by: David Teigland <teigland@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
/fs/dlm/recoverd.c
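
A sketch of both changes, close to the shape described: the member
ping loop re-checks for an abort on every iteration, and the stop path
wakes the recovery thread directly.

    /* Bail out of the ping loop as soon as a newer recovery event has
     * stopped this one, instead of finishing the full pass. */
    static int ping_members(struct dlm_ls *ls)
    {
            struct dlm_member *memb;
            int error = 0;

            list_for_each_entry(memb, &ls->ls_nodes, list) {
                    if (dlm_recovery_stopped(ls)) {
                            error = -EINTR;
                            break;
                    }
                    error = dlm_rcom_status(ls, memb->nodeid);
                    if (error)
                            break;
            }
            return error;
    }

    /* On stop, wake dlm_recoverd immediately rather than letting its
     * wait time out. */
    static void stop_and_wake(struct dlm_ls *ls)
    {
            set_bit(LSFL_RECOVERY_STOP, &ls->ls_flags);
            wake_up_process(ls->ls_recoverd_task);
    }
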
901359256b2666f52a3a7d3f31927677e91b3a2a 20-Jan-2006 David Teigland <teigland@redhat.com> [DLM] Update DLM to the latest patch level

Signed-off-by: David Teigland <teigland@redhat.com>
Signed-off-by: Steve Whitehouse <swhiteho@redhat.com>
/fs/dlm/recoverd.c
e7fd41792fc0ee52a05fcaac87511f118328d147 18-Jan-2006 David Teigland <teigland@redhat.com> [DLM] The core of the DLM for GFS2/CLVM

This is the core of the distributed lock manager which is required
to use GFS2 as a cluster filesystem. It is also used by CLVM and
can be used as a standalone lock manager independently of either
of these two projects.

It implements VAX-style locking modes.

Signed-off-by: David Teigland <teigland@redhat.com>
Signed-off-by: Steve Whitehouse <swhiteho@redhat.com>
/fs/dlm/recoverd.c
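
For reference, the six VAX-style modes the dlm implements, from least
to most restrictive, with the values they carry in the dlm's public
header (linux/dlmconstants.h):

    #define DLM_LOCK_NL  0   /* null: no access, placeholder lock */
    #define DLM_LOCK_CR  1   /* concurrent read */
    #define DLM_LOCK_CW  2   /* concurrent write */
    #define DLM_LOCK_PR  3   /* protected read (shared) */
    #define DLM_LOCK_PW  4   /* protected write */
    #define DLM_LOCK_EX  5   /* exclusive */
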