Discussion:
[dm-devel] [PATCH 0/3] per-cpu in_flight counters for bio-based drivers
Mikulas Patocka
2018-11-28 00:42:11 UTC
Permalink
These are the patches for per-cpu in_flight counters.

Mikulas
Mike Snitzer
2018-11-30 14:43:43 UTC
Permalink
On Tue, Nov 27 2018 at 7:42pm -0500,
Post by Mikulas Patocka
These are the patches for per-cpu in_flight counters.
Do you have updated before vs after performance results for these
changes?

I'd imagine they are comparable to your previous run (though that run
included some other DM changes that I already staged).

Thanks,
Mike
Mike Snitzer
2018-11-30 15:50:54 UTC
Permalink
On Fri, Nov 30 2018 at 9:43am -0500,
Post by Mike Snitzer
On Tue, Nov 27 2018 at 7:42pm -0500,
Post by Mikulas Patocka
These are the patches for per-cpu in_flight counters.
Do you have updated before vs after performance results for these
changes?
I'd imagine they are comparable to your previous run (though that run
included some other DM changes that I already staged).
Would like to see before vs after with:

http://git.kernel.dk/cgit/linux-block/log/?h=for-4.21/block
vs
https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/log/?h=block-dm-4.21-inflight

(block-dm-4.21-inflight is based on latest for-4.21/block -- and it
contains DM changes in front of the block changes from this thread; I'd
prefer Jens take the changes like this rather than leave DM in a bit of
a mess for me/us to have to cleanup later)
Mike Snitzer
2018-11-30 19:57:31 UTC
Permalink
On Fri, Nov 30 2018 at 10:50am -0500,
Post by Mike Snitzer
On Fri, Nov 30 2018 at 9:43am -0500,
Post by Mike Snitzer
On Tue, Nov 27 2018 at 7:42pm -0500,
Post by Mikulas Patocka
These are the patches for per-cpu in_flight counters.
Do you have updated before vs after performance results for these
changes?
I'd imagine they are comparable to your previous run (though that run
included some other DM changes that I already staged).
http://git.kernel.dk/cgit/linux-block/log/?h=for-4.21/block
vs
https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/log/?h=block-dm-4.21-inflight
(block-dm-4.21-inflight is based on latest for-4.21/block -- and it
contains DM changes in front of the block changes from this thread; I'd
prefer Jens take the changes like this rather than leave DM in a bit of
a mess for me/us to have to cleanup later)
I ran the same fio test you did (in the previous thread where you did
the switch to percpu local to DM, rather than in block) _except_ I used
a ramdisk-based pmem device rather than a pure ramdisk:

fio --ioengine=psync --iodepth=1 --rw=read --bs=512 --direct=1 --numjobs=12 --time_based --runtime=10 --group_reporting --name=/dev/pmem0

2 6-core processors (w/ HT, so 24 logical cpus):

/dev/pmem0 14.6M
/dev/pmem0 with percpu counters 14.8M
/dev/mapper/linear 4736k
/dev/mapper/linear with percpu counters 4595k

It is only after I apply this commit that I can realize a big
performance win:
https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=block-dm-4.21-inflight&id=335b41c7513110b1519d8a93d412c138bf671263

/dev/mapper/linear with percpu + pending removed 11.2M
Mike Snitzer
2018-11-30 22:22:20 UTC
Permalink
Hi,

This v2 the product of more thorough review and testing (on my part)
of Mikulas' original patchset.

Not seeing a major performance win in general but no loss either. DM
devices do see a huge boost in IOPS thanks to being able to eliminate
the inefficient md->pending IO accounting that it was doing, see:
https://www.redhat.com/archives/dm-devel/2018-November/msg00415.html

Happy to iterate on this patchset further as needed, all
review/suggestions are very much appreciated.

Thanks,
Mike

Mike Snitzer (1):
dm rq: leverage blk_mq_queue_busy() to check for outstanding IO

Mikulas Patocka (5):
dm: dont rewrite dm_disk(md)->part0.in_flight
block: delete part_round_stats and switch to less precise counting
block: switch to per-cpu in-flight counters
block: return just one value from part_in_flight
dm: remove the pending IO accounting

block/bio.c | 28 ++++++++++++++++----
block/blk-core.c | 67 +++++------------------------------------------
block/blk-merge.c | 3 +--
block/blk-mq.c | 12 ++++-----
block/blk-mq.h | 3 +--
block/genhd.c | 59 +++++++++++++++++++++++------------------
block/partition-generic.c | 10 +++----
drivers/md/dm-core.h | 2 --
drivers/md/dm-rq.c | 9 +++----
drivers/md/dm.c | 36 +++++++++++--------------
include/linux/genhd.h | 13 +++++----
11 files changed, 98 insertions(+), 144 deletions(-)
--
2.15.0
Mike Snitzer
2018-11-30 22:22:21 UTC
Permalink
From: Mikulas Patocka <***@redhat.com>

generic_start_io_acct and generic_end_io_acct already update the variable
in_flight using atomic operations, so we don't have to overwrite them
again.

Signed-off-by: Mikulas Patocka <***@redhat.com>
Signed-off-by: Mike Snitzer <***@redhat.com>
---
drivers/md/dm.c | 4 +---
1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index a733e4c920af..a8ae7931bce7 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -663,8 +663,7 @@ static void start_io_acct(struct dm_io *io)
generic_start_io_acct(md->queue, bio_op(bio), bio_sectors(bio),
&dm_disk(md)->part0);

- atomic_set(&dm_disk(md)->part0.in_flight[rw],
- atomic_inc_return(&md->pending[rw]));
+ atomic_inc(&md->pending[rw]);

if (unlikely(dm_stats_used(&md->stats)))
dm_stats_account_io(&md->stats, bio_data_dir(bio),
@@ -693,7 +692,6 @@ static void end_io_acct(struct dm_io *io)
* a flush.
*/
pending = atomic_dec_return(&md->pending[rw]);
- atomic_set(&dm_disk(md)->part0.in_flight[rw], pending);
pending += atomic_read(&md->pending[rw^0x1]);

/* nudge anyone waiting on suspend queue */
--
2.15.0
Mike Snitzer
2018-11-30 22:22:23 UTC
Permalink
From: Mikulas Patocka <***@redhat.com>

We want to convert to per-cpu in_flight counters.

The function part_round_stats needs the in_flight counter every jiffy, it
would be too costly to sum all the percpu variables every jiffy, so it
must be deleted. part_round_stats is used to calculate two counters -
time_in_queue and io_ticks.

time_in_queue can be calculated without part_round_stats, by adding the
duration of the I/O when the I/O ends (the value is almost as exact as the
previously calculated value, except that time for in-progress I/Os is not
counted).

io_ticks can be approximated by increasing the value when I/O is started
or ended and the jiffies value has changed. If the I/Os take less than a
jiffy, the value is as exact as the previously calculated value. If the
I/Os take more than a jiffy, io_ticks can drift behind the previously
calculated value.

Signed-off-by: Mikulas Patocka <***@redhat.com>
Signed-off-by: Mike Snitzer <***@redhat.com>
---
block/bio.c | 24 +++++++++++++++---
block/blk-core.c | 63 +++--------------------------------------------
block/blk-merge.c | 1 -
block/genhd.c | 4 ---
block/partition-generic.c | 4 ---
include/linux/genhd.h | 3 +--
6 files changed, 26 insertions(+), 73 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 03895cc0d74a..d5ef043a97aa 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1663,13 +1663,29 @@ void bio_check_pages_dirty(struct bio *bio)
}
EXPORT_SYMBOL_GPL(bio_check_pages_dirty);

+void update_io_ticks(int cpu, struct hd_struct *part, unsigned long now)
+{
+ unsigned long stamp;
+again:
+ stamp = READ_ONCE(part->stamp);
+ if (unlikely(stamp != now)) {
+ if (likely(cmpxchg(&part->stamp, stamp, now) == stamp)) {
+ __part_stat_add(cpu, part, io_ticks, 1);
+ }
+ }
+ if (part->partno) {
+ part = &part_to_disk(part)->part0;
+ goto again;
+ }
+}
+
void generic_start_io_acct(struct request_queue *q, int op,
unsigned long sectors, struct hd_struct *part)
{
const int sgrp = op_stat_group(op);
int cpu = part_stat_lock();

- part_round_stats(q, cpu, part);
+ update_io_ticks(cpu, part, jiffies);
part_stat_inc(cpu, part, ios[sgrp]);
part_stat_add(cpu, part, sectors[sgrp], sectors);
part_inc_in_flight(q, part, op_is_write(op));
@@ -1681,12 +1697,14 @@ EXPORT_SYMBOL(generic_start_io_acct);
void generic_end_io_acct(struct request_queue *q, int req_op,
struct hd_struct *part, unsigned long start_time)
{
- unsigned long duration = jiffies - start_time;
+ unsigned long now = jiffies;
+ unsigned long duration = now - start_time;
const int sgrp = op_stat_group(req_op);
int cpu = part_stat_lock();

+ update_io_ticks(cpu, part, now);
part_stat_add(cpu, part, nsecs[sgrp], jiffies_to_nsecs(duration));
- part_round_stats(q, cpu, part);
+ part_stat_add(cpu, part, time_in_queue, duration);
part_dec_in_flight(q, part, op_is_write(req_op));

part_stat_unlock();
diff --git a/block/blk-core.c b/block/blk-core.c
index 3f6f5e6c2fe4..6bd4669f05fd 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -583,63 +583,6 @@ struct request *blk_get_request(struct request_queue *q, unsigned int op,
}
EXPORT_SYMBOL(blk_get_request);

-static void part_round_stats_single(struct request_queue *q, int cpu,
- struct hd_struct *part, unsigned long now,
- unsigned int inflight)
-{
- if (inflight) {
- __part_stat_add(cpu, part, time_in_queue,
- inflight * (now - part->stamp));
- __part_stat_add(cpu, part, io_ticks, (now - part->stamp));
- }
- part->stamp = now;
-}
-
-/**
- * part_round_stats() - Round off the performance stats on a struct disk_stats.
- * @q: target block queue
- * @cpu: cpu number for stats access
- * @part: target partition
- *
- * The average IO queue length and utilisation statistics are maintained
- * by observing the current state of the queue length and the amount of
- * time it has been in this state for.
- *
- * Normally, that accounting is done on IO completion, but that can result
- * in more than a second's worth of IO being accounted for within any one
- * second, leading to >100% utilisation. To deal with that, we call this
- * function to do a round-off before returning the results when reading
- * /proc/diskstats. This accounts immediately for all queue usage up to
- * the current jiffies and restarts the counters again.
- */
-void part_round_stats(struct request_queue *q, int cpu, struct hd_struct *part)
-{
- struct hd_struct *part2 = NULL;
- unsigned long now = jiffies;
- unsigned int inflight[2];
- int stats = 0;
-
- if (part->stamp != now)
- stats |= 1;
-
- if (part->partno) {
- part2 = &part_to_disk(part)->part0;
- if (part2->stamp != now)
- stats |= 2;
- }
-
- if (!stats)
- return;
-
- part_in_flight(q, part, inflight);
-
- if (stats & 2)
- part_round_stats_single(q, cpu, part2, now, inflight[1]);
- if (stats & 1)
- part_round_stats_single(q, cpu, part, now, inflight[0]);
-}
-EXPORT_SYMBOL_GPL(part_round_stats);
-
void blk_put_request(struct request *req)
{
blk_mq_free_request(req);
@@ -1408,9 +1351,10 @@ void blk_account_io_done(struct request *req, u64 now)
cpu = part_stat_lock();
part = req->part;

+ update_io_ticks(cpu, part, jiffies);
part_stat_inc(cpu, part, ios[sgrp]);
part_stat_add(cpu, part, nsecs[sgrp], now - req->start_time_ns);
- part_round_stats(req->q, cpu, part);
+ part_stat_add(cpu, part, time_in_queue, nsecs_to_jiffies64(now - req->start_time_ns));
part_dec_in_flight(req->q, part, rq_data_dir(req));

hd_struct_put(part);
@@ -1446,11 +1390,12 @@ void blk_account_io_start(struct request *rq, bool new_io)
part = &rq->rq_disk->part0;
hd_struct_get(part);
}
- part_round_stats(rq->q, cpu, part);
part_inc_in_flight(rq->q, part, rw);
rq->part = part;
}

+ update_io_ticks(cpu, part, jiffies);
+
part_stat_unlock();
}

diff --git a/block/blk-merge.c b/block/blk-merge.c
index 6be04ef8da5b..c278b6d18a24 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -690,7 +690,6 @@ static void blk_account_io_merge(struct request *req)
cpu = part_stat_lock();
part = req->part;

- part_round_stats(req->q, cpu, part);
part_dec_in_flight(req->q, part, rq_data_dir(req));

hd_struct_put(part);
diff --git a/block/genhd.c b/block/genhd.c
index 0145bcb0cc76..cdf174d7d329 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -1326,7 +1326,6 @@ static int diskstats_show(struct seq_file *seqf, void *v)
struct hd_struct *hd;
char buf[BDEVNAME_SIZE];
unsigned int inflight[2];
- int cpu;

/*
if (&disk_to_dev(gp)->kobj.entry == block_class.devices.next)
@@ -1338,9 +1337,6 @@ static int diskstats_show(struct seq_file *seqf, void *v)

disk_part_iter_init(&piter, gp, DISK_PITER_INCL_EMPTY_PART0);
while ((hd = disk_part_iter_next(&piter))) {
- cpu = part_stat_lock();
- part_round_stats(gp->queue, cpu, hd);
- part_stat_unlock();
part_in_flight(gp->queue, hd, inflight);
seq_printf(seqf, "%4d %7d %s "
"%lu %lu %lu %u "
diff --git a/block/partition-generic.c b/block/partition-generic.c
index 5f8db5c5140f..42d6138ac876 100644
--- a/block/partition-generic.c
+++ b/block/partition-generic.c
@@ -121,11 +121,7 @@ ssize_t part_stat_show(struct device *dev,
struct hd_struct *p = dev_to_part(dev);
struct request_queue *q = part_to_disk(p)->queue;
unsigned int inflight[2];
- int cpu;

- cpu = part_stat_lock();
- part_round_stats(q, cpu, p);
- part_stat_unlock();
part_in_flight(q, p, inflight);
return sprintf(buf,
"%8lu %8lu %8llu %8u "
diff --git a/include/linux/genhd.h b/include/linux/genhd.h
index 0c5ee17b4d88..f2a0a52c874f 100644
--- a/include/linux/genhd.h
+++ b/include/linux/genhd.h
@@ -398,8 +398,7 @@ static inline void free_part_info(struct hd_struct *part)
kfree(part->info);
}

-/* block/blk-core.c */
-extern void part_round_stats(struct request_queue *q, int cpu, struct hd_struct *part);
+void update_io_ticks(int cpu, struct hd_struct *part, unsigned long now);

/* block/genhd.c */
extern void device_add_disk(struct device *parent, struct gendisk *disk,
--
2.15.0
Mike Snitzer
2018-11-30 22:22:24 UTC
Permalink
From: Mikulas Patocka <***@redhat.com>

Now when part_round_stats is gone, we can switch to per-cpu in-flight
counters.

We use the local-atomic type local_t, so that if part_inc_in_flight or
part_dec_in_flight is reentrantly called from an interrupt, the value will
be correct.

The other counters could be corrupted due to reentrant interrupt, but the
corruption only results in slight counter skew - the in_flight counter
must be exact, so it needs local_t.

Signed-off-by: Mikulas Patocka <***@redhat.com>
Signed-off-by: Mike Snitzer <***@redhat.com>
---
block/bio.c | 4 ++--
block/blk-core.c | 4 ++--
block/blk-merge.c | 2 +-
block/genhd.c | 47 +++++++++++++++++++++++++++++++++++------------
include/linux/genhd.h | 7 ++++---
5 files changed, 44 insertions(+), 20 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index d5ef043a97aa..b25b4fef9900 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1688,7 +1688,7 @@ void generic_start_io_acct(struct request_queue *q, int op,
update_io_ticks(cpu, part, jiffies);
part_stat_inc(cpu, part, ios[sgrp]);
part_stat_add(cpu, part, sectors[sgrp], sectors);
- part_inc_in_flight(q, part, op_is_write(op));
+ part_inc_in_flight(q, cpu, part, op_is_write(op));

part_stat_unlock();
}
@@ -1705,7 +1705,7 @@ void generic_end_io_acct(struct request_queue *q, int req_op,
update_io_ticks(cpu, part, now);
part_stat_add(cpu, part, nsecs[sgrp], jiffies_to_nsecs(duration));
part_stat_add(cpu, part, time_in_queue, duration);
- part_dec_in_flight(q, part, op_is_write(req_op));
+ part_dec_in_flight(q, cpu, part, op_is_write(req_op));

part_stat_unlock();
}
diff --git a/block/blk-core.c b/block/blk-core.c
index 6bd4669f05fd..87f06672d9a7 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1355,7 +1355,7 @@ void blk_account_io_done(struct request *req, u64 now)
part_stat_inc(cpu, part, ios[sgrp]);
part_stat_add(cpu, part, nsecs[sgrp], now - req->start_time_ns);
part_stat_add(cpu, part, time_in_queue, nsecs_to_jiffies64(now - req->start_time_ns));
- part_dec_in_flight(req->q, part, rq_data_dir(req));
+ part_dec_in_flight(req->q, cpu, part, rq_data_dir(req));

hd_struct_put(part);
part_stat_unlock();
@@ -1390,7 +1390,7 @@ void blk_account_io_start(struct request *rq, bool new_io)
part = &rq->rq_disk->part0;
hd_struct_get(part);
}
- part_inc_in_flight(rq->q, part, rw);
+ part_inc_in_flight(rq->q, cpu, part, rw);
rq->part = part;
}

diff --git a/block/blk-merge.c b/block/blk-merge.c
index c278b6d18a24..c02386cdf0ca 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -690,7 +690,7 @@ static void blk_account_io_merge(struct request *req)
cpu = part_stat_lock();
part = req->part;

- part_dec_in_flight(req->q, part, rq_data_dir(req));
+ part_dec_in_flight(req->q, cpu, part, rq_data_dir(req));

hd_struct_put(part);
part_stat_unlock();
diff --git a/block/genhd.c b/block/genhd.c
index cdf174d7d329..d4c9dd65def6 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -45,53 +45,76 @@ static void disk_add_events(struct gendisk *disk);
static void disk_del_events(struct gendisk *disk);
static void disk_release_events(struct gendisk *disk);

-void part_inc_in_flight(struct request_queue *q, struct hd_struct *part, int rw)
+void part_inc_in_flight(struct request_queue *q, int cpu, struct hd_struct *part, int rw)
{
if (queue_is_mq(q))
return;

- atomic_inc(&part->in_flight[rw]);
+ local_inc(&per_cpu_ptr(part->dkstats, cpu)->in_flight[rw]);
if (part->partno)
- atomic_inc(&part_to_disk(part)->part0.in_flight[rw]);
+ local_inc(&per_cpu_ptr(part_to_disk(part)->part0.dkstats, cpu)->in_flight[rw]);
}

-void part_dec_in_flight(struct request_queue *q, struct hd_struct *part, int rw)
+void part_dec_in_flight(struct request_queue *q, int cpu, struct hd_struct *part, int rw)
{
if (queue_is_mq(q))
return;

- atomic_dec(&part->in_flight[rw]);
+ local_dec(&per_cpu_ptr(part->dkstats, cpu)->in_flight[rw]);
if (part->partno)
- atomic_dec(&part_to_disk(part)->part0.in_flight[rw]);
+ local_dec(&per_cpu_ptr(part_to_disk(part)->part0.dkstats, cpu)->in_flight[rw]);
}

void part_in_flight(struct request_queue *q, struct hd_struct *part,
unsigned int inflight[2])
{
+ int cpu;
+
if (queue_is_mq(q)) {
blk_mq_in_flight(q, part, inflight);
return;
}

- inflight[0] = atomic_read(&part->in_flight[0]) +
- atomic_read(&part->in_flight[1]);
+ inflight[0] = 0;
+ for_each_possible_cpu(cpu) {
+ inflight[0] += local_read(&per_cpu_ptr(part->dkstats, cpu)->in_flight[0]) +
+ local_read(&per_cpu_ptr(part->dkstats, cpu)->in_flight[1]);
+ }
+ if ((int)inflight[0] < 0)
+ inflight[0] = 0;
+
if (part->partno) {
part = &part_to_disk(part)->part0;
- inflight[1] = atomic_read(&part->in_flight[0]) +
- atomic_read(&part->in_flight[1]);
+ inflight[1] = 0;
+ for_each_possible_cpu(cpu) {
+ inflight[1] += local_read(&per_cpu_ptr(part->dkstats, cpu)->in_flight[0]) +
+ local_read(&per_cpu_ptr(part->dkstats, cpu)->in_flight[1]);
+ }
+ if ((int)inflight[1] < 0)
+ inflight[1] = 0;
}
}

void part_in_flight_rw(struct request_queue *q, struct hd_struct *part,
unsigned int inflight[2])
{
+ int cpu;
+
if (queue_is_mq(q)) {
blk_mq_in_flight_rw(q, part, inflight);
return;
}

- inflight[0] = atomic_read(&part->in_flight[0]);
- inflight[1] = atomic_read(&part->in_flight[1]);
+ inflight[0] = 0;
+ inflight[1] = 0;
+ for_each_possible_cpu(cpu) {
+ inflight[0] += local_read(&per_cpu_ptr(part->dkstats, cpu)->in_flight[0]);
+ inflight[1] += local_read(&per_cpu_ptr(part->dkstats, cpu)->in_flight[1]);
+ }
+ if ((int)inflight[0] < 0)
+ inflight[0] = 0;
+ if ((int)inflight[1] < 0)
+ inflight[1] = 0;
}

struct hd_struct *__disk_get_part(struct gendisk *disk, int partno)
diff --git a/include/linux/genhd.h b/include/linux/genhd.h
index f2a0a52c874f..a03aa6502a83 100644
--- a/include/linux/genhd.h
+++ b/include/linux/genhd.h
@@ -17,6 +17,7 @@
#include <linux/percpu-refcount.h>
#include <linux/uuid.h>
#include <linux/blk_types.h>
+#include <asm/local.h>

#ifdef CONFIG_BLOCK

@@ -89,6 +90,7 @@ struct disk_stats {
unsigned long merges[NR_STAT_GROUPS];
unsigned long io_ticks;
unsigned long time_in_queue;
+ local_t in_flight[2];
};

#define PARTITION_META_INFO_VOLNAMELTH 64
@@ -122,7 +124,6 @@ struct hd_struct {
int make_it_fail;
#endif
unsigned long stamp;
- atomic_t in_flight[2];
#ifdef CONFIG_SMP
struct disk_stats __percpu *dkstats;
#else
@@ -380,9 +381,9 @@ void part_in_flight(struct request_queue *q, struct hd_struct *part,
unsigned int inflight[2]);
void part_in_flight_rw(struct request_queue *q, struct hd_struct *part,
unsigned int inflight[2]);
-void part_dec_in_flight(struct request_queue *q, struct hd_struct *part,
+void part_dec_in_flight(struct request_queue *q, int cpu, struct hd_struct *part,
int rw);
-void part_inc_in_flight(struct request_queue *q, struct hd_struct *part,
+void part_inc_in_flight(struct request_queue *q, int cpu, struct hd_struct *part,
int rw);

static inline struct partition_meta_info *alloc_part_info(struct gendisk *disk)
--
2.15.0
Jens Axboe
2018-12-05 17:30:38 UTC
Permalink
Post by Mike Snitzer
diff --git a/block/genhd.c b/block/genhd.c
index cdf174d7d329..d4c9dd65def6 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -45,53 +45,76 @@ static void disk_add_events(struct gendisk *disk);
static void disk_del_events(struct gendisk *disk);
static void disk_release_events(struct gendisk *disk);
-void part_inc_in_flight(struct request_queue *q, struct hd_struct *part, int rw)
+void part_inc_in_flight(struct request_queue *q, int cpu, struct hd_struct *part, int rw)
{
if (queue_is_mq(q))
return;
- atomic_inc(&part->in_flight[rw]);
+ local_inc(&per_cpu_ptr(part->dkstats, cpu)->in_flight[rw]);
I mentioned this in a previous email, but why isn't this just using
this_cpu_inc? There's also no need to pass in the cpu, if we're not
running with preempt disabled already we have a problem.
--
Jens Axboe
Mike Snitzer
2018-12-05 17:49:42 UTC
Permalink
On Wed, Dec 05 2018 at 12:30pm -0500,
Post by Jens Axboe
Post by Mike Snitzer
diff --git a/block/genhd.c b/block/genhd.c
index cdf174d7d329..d4c9dd65def6 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -45,53 +45,76 @@ static void disk_add_events(struct gendisk *disk);
static void disk_del_events(struct gendisk *disk);
static void disk_release_events(struct gendisk *disk);
-void part_inc_in_flight(struct request_queue *q, struct hd_struct *part, int rw)
+void part_inc_in_flight(struct request_queue *q, int cpu, struct hd_struct *part, int rw)
{
if (queue_is_mq(q))
return;
- atomic_inc(&part->in_flight[rw]);
+ local_inc(&per_cpu_ptr(part->dkstats, cpu)->in_flight[rw]);
I mentioned this in a previous email, but why isn't this just using
this_cpu_inc?
I responded to your earlier question on this point but, Mikulas just
extended the existing percpu struct disk_stats and he is using local_t
for reasons detailed in this patch's header:

We use the local-atomic type local_t, so that if part_inc_in_flight or
part_dec_in_flight is reentrantly called from an interrupt, the value will
be correct.

The other counters could be corrupted due to reentrant interrupt, but the
corruption only results in slight counter skew - the in_flight counter
must be exact, so it needs local_t.
Post by Jens Axboe
There's also no need to pass in the cpu, if we're not running with
preempt disabled already we have a problem.
Why should this be any different than the part_stat_* interfaces?
__part_stat_add(), part_stat_read(), etc also use
per_cpu_ptr((part)->dkstats, (cpu) accessors.
Jens Axboe
2018-12-05 17:54:39 UTC
Permalink
Post by Mike Snitzer
On Wed, Dec 05 2018 at 12:30pm -0500,
Post by Jens Axboe
Post by Mike Snitzer
diff --git a/block/genhd.c b/block/genhd.c
index cdf174d7d329..d4c9dd65def6 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -45,53 +45,76 @@ static void disk_add_events(struct gendisk *disk);
static void disk_del_events(struct gendisk *disk);
static void disk_release_events(struct gendisk *disk);
-void part_inc_in_flight(struct request_queue *q, struct hd_struct *part, int rw)
+void part_inc_in_flight(struct request_queue *q, int cpu, struct hd_struct *part, int rw)
{
if (queue_is_mq(q))
return;
- atomic_inc(&part->in_flight[rw]);
+ local_inc(&per_cpu_ptr(part->dkstats, cpu)->in_flight[rw]);
I mentioned this in a previous email, but why isn't this just using
this_cpu_inc?
I responded to your earlier question on this point but, Mikulas just
extended the existing percpu struct disk_stats and he is using local_t
We use the local-atomic type local_t, so that if part_inc_in_flight or
part_dec_in_flight is reentrantly called from an interrupt, the value will
be correct.
The other counters could be corrupted due to reentrant interrupt, but the
corruption only results in slight counter skew - the in_flight counter
must be exact, so it needs local_t.
Gotcha, make sense.
Post by Mike Snitzer
Post by Jens Axboe
There's also no need to pass in the cpu, if we're not running with
preempt disabled already we have a problem.
Why should this be any different than the part_stat_* interfaces?
__part_stat_add(), part_stat_read(), etc also use
per_cpu_ptr((part)->dkstats, (cpu) accessors.
Maybe audit which ones actually need it? To answer the specific question,
it's silly to pass in the cpu, if we're pinned already. That's true
both programatically, but also for someone reading the code.
--
Jens Axboe
Mike Snitzer
2018-12-05 18:03:47 UTC
Permalink
On Wed, Dec 05 2018 at 12:54pm -0500,
Post by Jens Axboe
Post by Mike Snitzer
On Wed, Dec 05 2018 at 12:30pm -0500,
Post by Jens Axboe
There's also no need to pass in the cpu, if we're not running with
preempt disabled already we have a problem.
Why should this be any different than the part_stat_* interfaces?
__part_stat_add(), part_stat_read(), etc also use
per_cpu_ptr((part)->dkstats, (cpu) accessors.
Maybe audit which ones actually need it? To answer the specific question,
it's silly to pass in the cpu, if we're pinned already. That's true
both programatically, but also for someone reading the code.
I understand you'd like to avoid excess interface baggage. But seems to
me we'd be better off being consistent, when extending the percpu
portion of block core stats, and then do an incremental to clean it all
up.

But I'm open to doing it however you'd like if you feel strongly about
how this should be done.

Mike
Jens Axboe
2018-12-05 18:04:39 UTC
Permalink
Post by Mike Snitzer
On Wed, Dec 05 2018 at 12:54pm -0500,
Post by Jens Axboe
Post by Mike Snitzer
On Wed, Dec 05 2018 at 12:30pm -0500,
Post by Jens Axboe
There's also no need to pass in the cpu, if we're not running with
preempt disabled already we have a problem.
Why should this be any different than the part_stat_* interfaces?
__part_stat_add(), part_stat_read(), etc also use
per_cpu_ptr((part)->dkstats, (cpu) accessors.
Maybe audit which ones actually need it? To answer the specific question,
it's silly to pass in the cpu, if we're pinned already. That's true
both programatically, but also for someone reading the code.
I understand you'd like to avoid excess interface baggage. But seems to
me we'd be better off being consistent, when extending the percpu
portion of block core stats, and then do an incremental to clean it all
up.
The incremental should be done first in that case, it'd be silly to
introduce something only to do a cleanup right after.
--
Jens Axboe
Mike Snitzer
2018-12-05 18:18:33 UTC
Permalink
On Wed, Dec 05 2018 at 1:04pm -0500,
Post by Jens Axboe
Post by Mike Snitzer
On Wed, Dec 05 2018 at 12:54pm -0500,
Post by Jens Axboe
Post by Mike Snitzer
On Wed, Dec 05 2018 at 12:30pm -0500,
Post by Jens Axboe
There's also no need to pass in the cpu, if we're not running with
preempt disabled already we have a problem.
Why should this be any different than the part_stat_* interfaces?
__part_stat_add(), part_stat_read(), etc also use
per_cpu_ptr((part)->dkstats, (cpu) accessors.
Maybe audit which ones actually need it? To answer the specific question,
it's silly to pass in the cpu, if we're pinned already. That's true
both programatically, but also for someone reading the code.
I understand you'd like to avoid excess interface baggage. But seems to
me we'd be better off being consistent, when extending the percpu
portion of block core stats, and then do an incremental to clean it all
up.
The incremental should be done first in that case, it'd be silly to
introduce something only to do a cleanup right after.
OK, all existing code for these percpu stats should follow the pattern:

int cpu = part_stat_lock();

<do percpu diskstats stuff>

part_stat_unlock();

part_stat_lock() calls get_cpu() which does preempt_disable(). So to
your point: yes we have preempt disabled. And yes we _could_ just use
smp_processor_id() in callers rather than pass 'cpu' to them.

Is that what you want to see?

Mike
Jens Axboe
2018-12-05 18:35:52 UTC
Permalink
Post by Mike Snitzer
On Wed, Dec 05 2018 at 1:04pm -0500,
Post by Jens Axboe
Post by Mike Snitzer
On Wed, Dec 05 2018 at 12:54pm -0500,
Post by Jens Axboe
Post by Mike Snitzer
On Wed, Dec 05 2018 at 12:30pm -0500,
Post by Jens Axboe
There's also no need to pass in the cpu, if we're not running with
preempt disabled already we have a problem.
Why should this be any different than the part_stat_* interfaces?
__part_stat_add(), part_stat_read(), etc also use
per_cpu_ptr((part)->dkstats, (cpu) accessors.
Maybe audit which ones actually need it? To answer the specific question,
it's silly to pass in the cpu, if we're pinned already. That's true
both programatically, but also for someone reading the code.
I understand you'd like to avoid excess interface baggage. But seems to
me we'd be better off being consistent, when extending the percpu
portion of block core stats, and then do an incremental to clean it all
up.
The incremental should be done first in that case, it'd be silly to
introduce something only to do a cleanup right after.
int cpu = part_stat_lock();
<do percpu diskstats stuff>
part_stat_unlock();
part_stat_lock() calls get_cpu() which does preempt_disable(). So to
your point: yes we have preempt disabled. And yes we _could_ just use
smp_processor_id() in callers rather than pass 'cpu' to them.
Is that what you want to see?
Something like that, yes.
--
Jens Axboe
Mike Snitzer
2018-11-30 22:22:22 UTC
Permalink
Now that request-based dm-multipath only supports blk-mq, make use of
the newly introduced blk_mq_queue_busy() to check for outstanding IO --
rather than (ab)using the block core's in_flight counters.

Signed-off-by: Mike Snitzer <***@redhat.com>
---
drivers/md/dm-rq.c | 9 ++++-----
1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
index 1f1fe9a618ea..d2397d8fcbd1 100644
--- a/drivers/md/dm-rq.c
+++ b/drivers/md/dm-rq.c
@@ -130,11 +130,11 @@ static void rq_end_stats(struct mapped_device *md, struct request *orig)
*/
static void rq_completed(struct mapped_device *md, int rw, bool run_queue)
{
- atomic_dec(&md->pending[rw]);
-
/* nudge anyone waiting on suspend queue */
- if (!md_in_flight(md))
- wake_up(&md->wait);
+ if (unlikely(waitqueue_active(&md->wait))) {
+ if (!blk_mq_queue_busy(md->queue))
+ wake_up(&md->wait);
+ }

/*
* dm_put() must be at the end of this function. See the comment above
@@ -436,7 +436,6 @@ ssize_t dm_attr_rq_based_seq_io_merge_deadline_store(struct mapped_device *md,
static void dm_start_request(struct mapped_device *md, struct request *orig)
{
blk_mq_start_request(orig);
- atomic_inc(&md->pending[rq_data_dir(orig)]);

if (unlikely(dm_stats_used(&md->stats))) {
struct dm_rq_target_io *tio = tio_from_request(orig);
--
2.15.0
Mike Snitzer
2018-11-30 22:22:25 UTC
Permalink
From: Mikulas Patocka <***@redhat.com>

The previous patches deleted all the code that needed the second value
returned from part_in_flight - now the kernel only uses the first value.

Consequently, part_in_flight (and blk_mq_in_flight) may be changed so that
it only returns one value.

This patch just refactors the code, there's no functional change.

Signed-off-by: Mikulas Patocka <***@redhat.com>
Signed-off-by: Mike Snitzer <***@redhat.com>
---
block/blk-mq.c | 12 +++++-------
block/blk-mq.h | 3 +--
block/genhd.c | 32 +++++++++++---------------------
block/partition-generic.c | 6 +++---
include/linux/genhd.h | 3 +--
5 files changed, 21 insertions(+), 35 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 7dcef565dc0f..88ed969cb9df 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -101,25 +101,23 @@ static bool blk_mq_check_inflight(struct blk_mq_hw_ctx *hctx,
struct mq_inflight *mi = priv;

/*
- * index[0] counts the specific partition that was asked for. index[1]
- * counts the ones that are active on the whole device, so increment
- * that if mi->part is indeed a partition, and not a whole device.
+ * index[0] counts the specific partition that was asked for.
*/
if (rq->part == mi->part)
mi->inflight[0]++;
- if (mi->part->partno)
- mi->inflight[1]++;

return true;
}

-void blk_mq_in_flight(struct request_queue *q, struct hd_struct *part,
- unsigned int inflight[2])
+unsigned int blk_mq_in_flight(struct request_queue *q, struct hd_struct *part)
{
+ unsigned inflight[2];
struct mq_inflight mi = { .part = part, .inflight = inflight, };

inflight[0] = inflight[1] = 0;
blk_mq_queue_tag_busy_iter(q, blk_mq_check_inflight, &mi);
+
+ return inflight[0];
}

static bool blk_mq_check_inflight_rw(struct blk_mq_hw_ctx *hctx,
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 7291e5379358..4022943cb191 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -184,8 +184,7 @@ static inline bool blk_mq_hw_queue_mapped(struct blk_mq_hw_ctx *hctx)
return hctx->nr_ctx && hctx->tags;
}

-void blk_mq_in_flight(struct request_queue *q, struct hd_struct *part,
- unsigned int inflight[2]);
+unsigned int blk_mq_in_flight(struct request_queue *q, struct hd_struct *part);
void blk_mq_in_flight_rw(struct request_queue *q, struct hd_struct *part,
unsigned int inflight[2]);

diff --git a/block/genhd.c b/block/genhd.c
index d4c9dd65def6..3397288a2926 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -65,34 +65,24 @@ void part_dec_in_flight(struct request_queue *q, int cpu, struct hd_struct *part
local_dec(&per_cpu_ptr(part_to_disk(part)->part0.dkstats, cpu)->in_flight[rw]);
}

-void part_in_flight(struct request_queue *q, struct hd_struct *part,
- unsigned int inflight[2])
+unsigned int part_in_flight(struct request_queue *q, struct hd_struct *part)
{
int cpu;
+ int inflight;

if (queue_is_mq(q)) {
- blk_mq_in_flight(q, part, inflight);
- return;
+ return blk_mq_in_flight(q, part);
}

- inflight[0] = 0;
+ inflight = 0;
for_each_possible_cpu(cpu) {
- inflight[0] += local_read(&per_cpu_ptr(part->dkstats, cpu)->in_flight[0]) +
+ inflight += local_read(&per_cpu_ptr(part->dkstats, cpu)->in_flight[0]) +
local_read(&per_cpu_ptr(part->dkstats, cpu)->in_flight[1]);
}
- if ((int)inflight[0] < 0)
- inflight[0] = 0;
+ if (inflight < 0)
+ inflight = 0;

- if (part->partno) {
- part = &part_to_disk(part)->part0;
- inflight[1] = 0;
- for_each_possible_cpu(cpu) {
- inflight[1] += local_read(&per_cpu_ptr(part->dkstats, cpu)->in_flight[0]) +
- local_read(&per_cpu_ptr(part->dkstats, cpu)->in_flight[1]);
- }
- if ((int)inflight[1] < 0)
- inflight[1] = 0;
- }
+ return (unsigned int)inflight;
}

void part_in_flight_rw(struct request_queue *q, struct hd_struct *part,
@@ -1348,7 +1338,7 @@ static int diskstats_show(struct seq_file *seqf, void *v)
struct disk_part_iter piter;
struct hd_struct *hd;
char buf[BDEVNAME_SIZE];
- unsigned int inflight[2];
+ unsigned int inflight;

/*
if (&disk_to_dev(gp)->kobj.entry == block_class.devices.next)
@@ -1360,7 +1350,7 @@ static int diskstats_show(struct seq_file *seqf, void *v)

disk_part_iter_init(&piter, gp, DISK_PITER_INCL_EMPTY_PART0);
while ((hd = disk_part_iter_next(&piter))) {
- part_in_flight(gp->queue, hd, inflight);
+ inflight = part_in_flight(gp->queue, hd);
seq_printf(seqf, "%4d %7d %s "
"%lu %lu %lu %u "
"%lu %lu %lu %u "
@@ -1376,7 +1366,7 @@ static int diskstats_show(struct seq_file *seqf, void *v)
part_stat_read(hd, merges[STAT_WRITE]),
part_stat_read(hd, sectors[STAT_WRITE]),
(unsigned int)part_stat_read_msecs(hd, STAT_WRITE),
- inflight[0],
+ inflight,
jiffies_to_msecs(part_stat_read(hd, io_ticks)),
jiffies_to_msecs(part_stat_read(hd, time_in_queue)),
part_stat_read(hd, ios[STAT_DISCARD]),
diff --git a/block/partition-generic.c b/block/partition-generic.c
index 42d6138ac876..8e596a8dff32 100644
--- a/block/partition-generic.c
+++ b/block/partition-generic.c
@@ -120,9 +120,9 @@ ssize_t part_stat_show(struct device *dev,
{
struct hd_struct *p = dev_to_part(dev);
struct request_queue *q = part_to_disk(p)->queue;
- unsigned int inflight[2];
+ unsigned int inflight;

- part_in_flight(q, p, inflight);
+ inflight = part_in_flight(q, p);
return sprintf(buf,
"%8lu %8lu %8llu %8u "
"%8lu %8lu %8llu %8u "
@@ -137,7 +137,7 @@ ssize_t part_stat_show(struct device *dev,
part_stat_read(p, merges[STAT_WRITE]),
(unsigned long long)part_stat_read(p, sectors[STAT_WRITE]),
(unsigned int)part_stat_read_msecs(p, STAT_WRITE),
- inflight[0],
+ inflight,
jiffies_to_msecs(part_stat_read(p, io_ticks)),
jiffies_to_msecs(part_stat_read(p, time_in_queue)),
part_stat_read(p, ios[STAT_DISCARD]),
diff --git a/include/linux/genhd.h b/include/linux/genhd.h
index a03aa6502a83..13b7ce01727a 100644
--- a/include/linux/genhd.h
+++ b/include/linux/genhd.h
@@ -377,8 +377,7 @@ static inline void free_part_stats(struct hd_struct *part)
#define part_stat_sub(cpu, gendiskp, field, subnd) \
part_stat_add(cpu, gendiskp, field, -subnd)

-void part_in_flight(struct request_queue *q, struct hd_struct *part,
- unsigned int inflight[2]);
+unsigned int part_in_flight(struct request_queue *q, struct hd_struct *part);
void part_in_flight_rw(struct request_queue *q, struct hd_struct *part,
unsigned int inflight[2]);
void part_dec_in_flight(struct request_queue *q, int cpu, struct hd_struct *part,
--
2.15.0
Mike Snitzer
2018-11-30 22:22:26 UTC
Permalink
From: Mikulas Patocka <***@redhat.com>

Remove the "pending" atomic counters, that duplicate block-core's
in_flight counters, and update md_in_flight() to look at percpu
in_flight counters.

Signed-off-by: Mikulas Patocka <***@redhat.com>
Signed-off-by: Mike Snitzer <***@redhat.com>
---
drivers/md/dm-core.h | 2 --
drivers/md/dm.c | 34 +++++++++++++++-------------------
2 files changed, 15 insertions(+), 21 deletions(-)

diff --git a/drivers/md/dm-core.h b/drivers/md/dm-core.h
index 224d44503a06..6fe883fac471 100644
--- a/drivers/md/dm-core.h
+++ b/drivers/md/dm-core.h
@@ -65,7 +65,6 @@ struct mapped_device {
*/
struct work_struct work;
wait_queue_head_t wait;
- atomic_t pending[2];
spinlock_t deferred_lock;
struct bio_list deferred;

@@ -119,7 +118,6 @@ struct mapped_device {
struct srcu_struct io_barrier;
};

-int md_in_flight(struct mapped_device *md);
void disable_write_same(struct mapped_device *md);
void disable_write_zeroes(struct mapped_device *md);

diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index a8ae7931bce7..ff6e5a5902f2 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -646,25 +646,30 @@ static void free_tio(struct dm_target_io *tio)
bio_put(&tio->clone);
}

-int md_in_flight(struct mapped_device *md)
+static bool md_in_flight(struct mapped_device *md)
{
- return atomic_read(&md->pending[READ]) +
- atomic_read(&md->pending[WRITE]);
+ int cpu;
+ struct hd_struct *part = &dm_disk(md)->part0;
+
+ for_each_possible_cpu(cpu) {
+ if (local_read(&per_cpu_ptr(part->dkstats, cpu)->in_flight[0]) ||
+ local_read(&per_cpu_ptr(part->dkstats, cpu)->in_flight[1]))
+ return true;
+ }
+
+ return false;
}

static void start_io_acct(struct dm_io *io)
{
struct mapped_device *md = io->md;
struct bio *bio = io->orig_bio;
- int rw = bio_data_dir(bio);

io->start_time = jiffies;

generic_start_io_acct(md->queue, bio_op(bio), bio_sectors(bio),
&dm_disk(md)->part0);

- atomic_inc(&md->pending[rw]);
-
if (unlikely(dm_stats_used(&md->stats)))
dm_stats_account_io(&md->stats, bio_data_dir(bio),
bio->bi_iter.bi_sector, bio_sectors(bio),
@@ -676,8 +681,6 @@ static void end_io_acct(struct dm_io *io)
struct mapped_device *md = io->md;
struct bio *bio = io->orig_bio;
unsigned long duration = jiffies - io->start_time;
- int pending;
- int rw = bio_data_dir(bio);

generic_end_io_acct(md->queue, bio_op(bio), &dm_disk(md)->part0,
io->start_time);
@@ -687,16 +690,11 @@ static void end_io_acct(struct dm_io *io)
bio->bi_iter.bi_sector, bio_sectors(bio),
true, duration, &io->stats_aux);

- /*
- * After this is decremented the bio must not be touched if it is
- * a flush.
- */
- pending = atomic_dec_return(&md->pending[rw]);
- pending += atomic_read(&md->pending[rw^0x1]);
-
/* nudge anyone waiting on suspend queue */
- if (!pending)
- wake_up(&md->wait);
+ if (unlikely(waitqueue_active(&md->wait))) {
+ if (!md_in_flight(md))
+ wake_up(&md->wait);
+ }
}

/*
@@ -1904,8 +1902,6 @@ static struct mapped_device *alloc_dev(int minor)
if (!md->disk)
goto bad;

- atomic_set(&md->pending[0], 0);
- atomic_set(&md->pending[1], 0);
init_waitqueue_head(&md->wait);
INIT_WORK(&md->work, dm_wq_work);
init_waitqueue_head(&md->eventq);
--
2.15.0
Loading...