History log of /include/linux/crush/crush.h
Revision Date Author Comments
d83ed858f144e3cfbef883d4bc499113cdddabeb 19-Mar-2014 Ilya Dryomov <ilya.dryomov@inktank.com> crush: add SET_CHOOSELEAF_VARY_R step

This lets you adjust the vary_r tunable on a per-rule basis.

Reflects ceph.git commit f944ccc20aee60a7d8da7e405ec75ad1cd449fac.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
e2b149cc4ba00766aceb87950c6de72ea7fc8b2e 19-Mar-2014 Ilya Dryomov <ilya.dryomov@inktank.com> crush: add chooseleaf_vary_r tunable

The current crush_choose_firstn code will re-use the same 'r' value for
the recursive call. That means that if we are hitting a collision or
rejection for some reason (say, an OSD that is marked out) and need to
retry, we will keep making the same (bad) choice in that recursive
selection.

Introduce a tunable that fixes that behavior by incorporating the parent
'r' value into the recursive starting point, so that a different path
will be taken in subsequent placement attempts.

Note that this was done from the get-go for the new crush_choose_indep
algorithm.

This was exposed by a user who was seeing PGs stuck in active+remapped
after reweight-by-utilization because the up set mapped to a single OSD.

Reflects ceph.git commit a8e6c9fbf88bad056dd05d3eb790e98a5e43451a.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
f046bf92080cbdc4a94c6e86698c5a3f10716445 24-Dec-2013 Ilya Dryomov <ilya.dryomov@inktank.com> crush: add set_choose_local_[fallback_]tries steps

This allows all of the tunables to be overridden by a specific rule.

Reflects ceph.git commits d129e09e57fbc61cfd4f492e3ee77d0750c9d292,
0497db49e5973b50df26251ed0e3f4ac7578e66e.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
917edad5d1d62070436b74ecbf5ea019b651ff69 24-Dec-2013 Ilya Dryomov <ilya.dryomov@inktank.com> crush: CHOOSE_LEAF -> CHOOSELEAF throughout

This aligns the internal identifier names with the user-visible names in
the decompiled crush map language.

Reflects ceph.git commit caa0e22e15e4226c3671318ba1f61314bf6da2a6.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
cc10df4a3a5c34cb1d5b635ac70dd1fc406153ce 24-Dec-2013 Ilya Dryomov <ilya.dryomov@inktank.com> crush: add SET_CHOOSE_TRIES rule step

Since we can specify the recursive retries in a rule, we may as well also
specify the non-recursive tries too for completeness.

Reflects ceph.git commit d1b97462cffccc871914859eaee562f2786abfd1.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
f18650ace38ef200dd1578257c75e9407297953c 24-Dec-2013 Ilya Dryomov <ilya.dryomov@inktank.com> crush: apply chooseleaf_tries to firstn mode too

Parameterize the attempts for the _firstn choose method, and apply the
rule-specified tries count to firstn mode as well. Note that we have
slightly different behavior here than with indep:

If the firstn value is not specified for firstn, we pass through the
normal attempt count. This maintains compatibility with legacy behavior.
Note that this is usually *not* actually N^2 work, though, because of the
descend_once tunable. However, descend_once is unfortunately *not* the
same thing as 1 chooseleaf try because it is only checked on a reject but
not on a collision. Sigh.

In contrast, for indep, if tries is not specified we default to 1
recursive attempt, because that is simply more sane, and we have the
option to do so. The descend_once tunable has no effect for indep.

Reflects ceph.git commit 64aeded50d80942d66a5ec7b604ff2fcbf5d7b63.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
be3226acc5544bcc91e756eb3ee6ca7b74f6f0a8 24-Dec-2013 Ilya Dryomov <ilya.dryomov@inktank.com> crush: new SET_CHOOSE_LEAF_TRIES command

Explicitly control the number of sample attempts, and allow the number of
tries in the recursive call to be explicitly controlled via the rule. This
is important because the amount of time we want to spend looking for a
solution may be rule dependent (e.g., higher for the wide indep pool than
the rep pools).

(We should do the same for the other tunables, by the way!)

Reflects ceph.git commit c43c893be872f709c787bc57f46c0e97876ff681.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
9a3b490a20e06368c81d7a81506e99388e733379 24-Dec-2013 Ilya Dryomov <ilya.dryomov@inktank.com> crush: use breadth-first search for indep mode

Reflects ceph.git commit 86e978036a4ecbac4c875e7c00f6c5bbe37282d3.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
c6d98a603a02594f6ecf16d0a0af989ae9fa7abd 24-Dec-2013 Ilya Dryomov <ilya.dryomov@inktank.com> crush: return CRUSH_ITEM_UNDEF for failed placements with indep

For firstn mode, if we fail to make a valid placement choice, we just
continue and return a short result to the caller. For indep mode, however,
we need to make the position stable, and return an undefined value on
failed placements to avoid shifting later results to the left.

Reflects ceph.git commit b1d4dd4eb044875874a1d01c01c7d766db5d0a80.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
e8ef19c4ad161768e1d8309d5ae18481c098eb81 24-Dec-2013 Ilya Dryomov <ilya.dryomov@inktank.com> crush: eliminate CRUSH_MAX_SET result size limitation

This is only present to size the temporary scratch arrays that we put on
the stack. Let the caller allocate them as they wish and remove the
limitation.

Reflects ceph.git commit 1cfe140bf2dab99517589a82a916f4c75b9492d1.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
bfb16d7d69f0272451ad85a6e50aab3c4262fbc0 24-Dec-2013 Ilya Dryomov <ilya.dryomov@inktank.com> crush: factor out (trivial) crush_destroy_rule()

Reflects ceph.git commit 43a01c9973c4b83f2eaa98be87429941a227ddde.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
1604f488ac2dcce33c8218e75a000e8c5fb57e61 30-Nov-2012 Jim Schutt <jaschut@sandia.gov> libceph: for chooseleaf rules, retry CRUSH map descent from root if leaf is failed

Add libceph support for a new CRUSH tunable recently added to Ceph servers.

Consider the CRUSH rule
step chooseleaf firstn 0 type <node_type>

This rule means that <n> replicas will be chosen in a manner such that
each chosen leaf's branch will contain a unique instance of <node_type>.

When an object is re-replicated after a leaf failure, if the CRUSH map uses
a chooseleaf rule the remapped replica ends up under the <node_type> bucket
that held the failed leaf. This causes uneven data distribution across the
storage cluster, to the point that when all the leaves but one fail under a
particular <node_type> bucket, that remaining leaf holds all the data from
its failed peers.

This behavior also limits the number of peers that can participate in the
re-replication of the data held by the failed leaf, which increases the
time required to re-replicate after a failure.

For a chooseleaf CRUSH rule, the tree descent has two steps: call them the
inner and outer descents.

If the tree descent down to <node_type> is the outer descent, and the descent
from <node_type> down to a leaf is the inner descent, the issue is that a
down leaf is detected on the inner descent, so only the inner descent is
retried.

In order to disperse re-replicated data as widely as possible across a
storage cluster after a failure, we want to retry the outer descent. So,
fix up crush_choose() to allow the inner descent to return immediately on
choosing a failed leaf. Wire this up as a new CRUSH tunable.

Note that after this change, for a chooseleaf rule, if the primary OSD
in a placement group has failed, choosing a replacement may result in
one of the other OSDs in the PG colliding with the new primary. This
requires that OSD's data for that PG to need moving as well. This
seems unavoidable but should be relatively rare.

This corresponds to ceph.git commit 88f218181a9e6d2292e2697fc93797d0f6d6e5dc.

Signed-off-by: Jim Schutt <jaschut@sandia.gov>
Reviewed-by: Sage Weil <sage@inktank.com>
546f04ef716dd49521774653d8b032a7d64c05d9 31-Jul-2012 Sage Weil <sage@inktank.com> libceph: support crush tunables

The server side recently added support for tuning some magic
crush variables. Decode these variables if they are present, or use the
default values if they are not present.

Corresponds to ceph.git commit 89af369c25f274fe62ef730e5e8aad0c54f1e5a5.

Signed-off-by: caleb miles <caleb.miles@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
f671d4cd9b36691ac4ef42cde44c1b7a84e13631 08-May-2012 Sage Weil <sage@inktank.com> crush: fix tree node weight lookup

Fix the node weight lookup for tree buckets by using a correct accessor.

Reflects ceph.git commit d287ade5bcbdca82a3aef145b92924cf1e856733.

Reviewed-by: Alex Elder <elder@inktank.com>
Signed-off-by: Sage Weil <sage@inktank.com>
fc7c3ae5ab9246ad96aab4d0d57f67e9255cfb56 08-May-2012 Sage Weil <sage@inktank.com> crush: remove parent maps

These were used for the ill-fated forcefeed feature. Remove them.

Reflects ceph.git commit ebdf80edfecfbd5a842b71fbe5732857994380c1.

Reviewed-by: Alex Elder <elder@inktank.com>
Signed-off-by: Sage Weil <sage@inktank.com>
8b12d47b80c7a34dffdd98244d99316db490ec58 08-May-2012 Sage Weil <sage@inktank.com> crush: clean up types, const-ness

Move various types from int -> __u32 (or similar), and add const as
appropriate.

This reflects changes that have been present in the userland implementation
for some time.

Reviewed-by: Alex Elder <elder@inktank.com>
Signed-off-by: Sage Weil <sage@inktank.com>
3d14c5d2b6e15c21d8e5467dc62d33127c23a644 07-Apr-2010 Yehuda Sadeh <yehuda@hq.newdream.net> ceph: factor out libceph from Ceph file system

This factors out protocol and low-level storage parts of ceph into a
separate libceph module living in net/ceph and include/linux/ceph. This
is mostly a matter of moving files around. However, a few key pieces
of the interface change as well:

- ceph_client becomes ceph_fs_client and ceph_client, where the latter
captures the mon and osd clients, and the fs_client gets the mds client
and file system specific pieces.
- Mount option parsing and debugfs setup is correspondingly broken into
two pieces.
- The mon client gets a generic handler callback for otherwise unknown
messages (mds map, in this case).
- The basic supported/required feature bits can be expanded (and are by
ceph_fs_client).

No functional change, aside from some subtle error handling cases that got
cleaned up in the refactoring process.

Signed-off-by: Sage Weil <sage@newdream.net>