File System Study 8: The File System MQ (Multi-Queue) Mechanism in Detail

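The previous post introduced the multi-queue (MQ) mechanism, which pairs multi-core CPUs with multiple queues so that I/O requests are processed concurrently and more efficiently.
This post walks through how the MQ mechanism carries a bio, step by step, from submission down to the I/O scheduler.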

1. MQ processing flow diagram

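[Figure: MQ processing flow diagram]
The flow diagram has three main parts: initializing the hardware device's target parameters, initializing the request queue (request_queue), and processing bio requests. The first two register the underlying storage device with the file system and set up state such as the mapping between software and hardware queues. The last covers how, under MQ, a bio finally becomes a request that is placed on a hardware queue.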

2. SCSI device initialization

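Every underlying storage device that speaks the SCSI protocol goes through this initialization. The kernel provides a set of target device parameters for each SCSI device, chiefly the hardware queue depth, the timeout, and the hardware interrupt handler.
These target parameters are later passed into the request_queue structure, where they initialize the hardware-queue-related fields.
The process is as follows:
[Figure: SCSI device initialization flow]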

3. Initializing the request_queue

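As mentioned earlier, the Linux kernel supports both a single-queue and a multi-queue mechanism, and in both cases I/O enters the block layer through the same hook, make_request_fn. So how does the kernel know which mechanism to use?
The answer is the function registered as make_request_fn.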

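The initialization steps are:

When the underlying storage device has a single channel (one hardware queue), blk_sq_make_request is registered as the kernel's make_request_fn; when the device has multiple channels, blk_mq_make_request is registered instead.
Once registration is done, a map array from CPU number (software queue number) to hardware queue number is built from the number of queues of each kind, following the mapping the kernel provides.
Software and hardware queue contexts are allocated and initialized according to the number of CPUs and the number of hardware queues, respectively.
The software and hardware queue contexts that map to each other are then linked together.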
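
The CPU-to-hardware-queue map from the second step can be pictured with a small user-space sketch. The helper name build_queue_map and the round-robin policy are assumptions made for illustration; the kernel's own mapping also takes CPU topology into account.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper, not a kernel API: fill map[cpu] with the index of the
 * hardware queue that serves the software queue of that CPU.
 * Round-robin assignment is an assumption made for this sketch.
 */
static void build_queue_map(unsigned int *map, unsigned int nr_cpus,
                            unsigned int nr_hw_queues)
{
    assert(map != NULL && nr_hw_queues > 0);
    for (unsigned int cpu = 0; cpu < nr_cpus; cpu++)
        map[cpu] = cpu % nr_hw_queues;   /* several CPUs may share one hardware queue */
}
```

With 8 CPUs and 3 hardware queues, CPUs 0, 3 and 6 would share hardware queue 0 under this policy. The point is only that the lookup on the I/O path is a plain array index by CPU number.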

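The function-level flow is shown below:
[Figure: request_queue initialization function flow]
A read-through of the queue initialization source follows.
1. In scsi_scan.c under drivers/scsi, scsi_alloc_sdev checks whether the SCSI host behind the block device uses multiqueue: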

if (shost_use_blk_mq(shost))    // multi-queue is enabled
    sdev->request_queue = scsi_mq_alloc_queue(sdev);    // allocate the MQ request queue
else
    sdev->request_queue = scsi_alloc_queue(sdev);       // legacy single-queue path

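2. If the device supports MQ, scsi_mq_alloc_queue is called. Its main job is to initialize the device queue through blk_mq_init_queue, which uses the information in set to initialize the parameters of that device queue, as follows: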

   /* mark the queue as mq asap */
   q->mq_ops = set->ops;   // mark this queue as an MQ queue

   q->queue_ctx = alloc_percpu(struct blk_mq_ctx); // allocate the per-CPU software queue contexts
   if (!q->queue_ctx)
       goto err_exit;

   q->queue_hw_ctx = kzalloc_node(nr_cpu_ids * sizeof(*(q->queue_hw_ctx)),
                       GFP_KERNEL, set->numa_node);    // allocate the array of hardware queue context pointers
   if (!q->queue_hw_ctx)
       goto err_percpu;

   q->mq_map = blk_mq_make_queue_map(set);   // build the software-queue-to-hardware-queue map
   if (!q->mq_map)
       goto err_map;

   blk_mq_realloc_hw_ctxs(set, q);
   if (!q->nr_hw_queues)
       goto err_hctxs;

   INIT_WORK(&q->timeout_work, blk_mq_timeout_work);
   blk_queue_rq_timeout(q, set->timeout ? set->timeout : 30 * HZ); // request timeout for the device queue; defaults to 30 s

   q->nr_queues = nr_cpu_ids;

   q->queue_flags |= QUEUE_FLAG_MQ_DEFAULT;

   if (!(set->flags & BLK_MQ_F_SG_MERGE))
       q->queue_flags |= 1 << QUEUE_FLAG_NO_SG_MERGE;

   q->sg_reserved_size = INT_MAX;

   INIT_WORK(&q->requeue_work, blk_mq_requeue_work);
   INIT_LIST_HEAD(&q->requeue_list);
   spin_lock_init(&q->requeue_lock);

   if (q->nr_hw_queues > 1)
       blk_queue_make_request(q, blk_mq_make_request); // register q->make_request_fn
   else
       blk_queue_make_request(q, blk_sq_make_request);

   /*
    * Do this after blk_queue_make_request() overrides it...
    */
   q->nr_requests = set->queue_depth;

   if (set->ops->complete)
       blk_queue_softirq_done(q, set->ops->complete);

   blk_mq_init_cpu_queues(q, set->nr_hw_queues);

   get_online_cpus();
   mutex_lock(&all_q_mutex);

   list_add_tail(&q->all_q_node, &all_q_list);
   blk_mq_add_queue_tag_set(set, q);
   blk_mq_map_swqueue(q, cpu_online_mask);

   mutex_unlock(&all_q_mutex);
   put_online_cpus();

   return q;

err_hctxs:
   kfree(q->mq_map);
err_map:
   kfree(q->queue_hw_ctx);
err_percpu:
   free_percpu(q->queue_ctx);
err_exit:
   q->mq_ops = NULL;
   return ERR_PTR(-ENOMEM);
}

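3. As part of the initialization above, blk_queue_make_request registers q->make_request_fn: blk_mq_make_request when there is more than one hardware queue, blk_sq_make_request otherwise.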
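
The registration step boils down to storing a function pointer once and calling through it on every submission. The following is a minimal user-space sketch of that pattern. The types struct queue and struct bio and the functions queue_init, submit, sq_make_request and mq_make_request are stand-ins invented for this example, not the kernel definitions.

```c
#include <assert.h>
#include <stddef.h>

struct bio;                               /* opaque in this sketch */
struct queue;
typedef int (*make_request_fn)(struct queue *q, struct bio *bio);

struct queue {
    unsigned int nr_hw_queues;
    make_request_fn make_request;         /* stands in for q->make_request_fn */
};

enum { PATH_SQ = 1, PATH_MQ = 2 };

/* Placeholder handlers: they only report which path was taken. */
static int sq_make_request(struct queue *q, struct bio *bio)
{
    (void)q; (void)bio;
    return PATH_SQ;
}

static int mq_make_request(struct queue *q, struct bio *bio)
{
    (void)q; (void)bio;
    return PATH_MQ;
}

/* Pick the handler once, at init time, from the number of hardware queues. */
static void queue_init(struct queue *q, unsigned int nr_hw_queues)
{
    assert(q != NULL);
    q->nr_hw_queues = nr_hw_queues;
    q->make_request = nr_hw_queues > 1 ? mq_make_request : sq_make_request;
}

/* The submission path never checks the queue type; it just calls the pointer. */
static int submit(struct queue *q, struct bio *bio)
{
    return q->make_request(q, bio);
}
```

Because the choice is made at initialization, the hot path pays only for an indirect call, which is why the caller can stay identical for both mechanisms.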

4. bio request processing

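bio processing happens after the device mapper submits bios, one at a time, toward the I/O scheduler. The overall flow is shown in the figure below:
[Figure: bio request processing flow]
First the bio is submitted (submit_bio) and enters generic_make_request, which marks the start of its processing in the block layer.
generic_make_request is shown below: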

blk_qc_t generic_make_request(struct bio *bio)
{
    struct bio_list bio_list_on_stack;
    blk_qc_t ret = BLK_QC_T_NONE;

    if (!generic_make_request_checks(bio))  // validate the bio
        goto out;
    if (current->bio_list) {
        bio_list_add(current->bio_list, bio);
        goto out;
    }
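    // make_request_fn may be active only once per task at a time. A non-NULL current->bio_list
    // means a submission is already in progress, so the bio is appended to the tail of that
    // list and handled later; otherwise this bio is processed right away.

    BUG_ON(bio->bi_next);
    bio_list_init(&bio_list_on_stack);      // initialize the on-stack bio list
    current->bio_list = &bio_list_on_stack; // it was NULL until now; mark the submission as active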
    do {
        struct request_queue *q = bdev_get_queue(bio->bi_bdev); // get the device queue for this bio

        if (likely(blk_queue_enter(q, false) == 0)) {   // the queue is live and can accept the request
            ret = q->make_request_fn(q, bio);   // hand the bio to the registered handler for block-layer processing
            blk_queue_exit(q);

            bio = bio_list_pop(current->bio_list);  
        } else {
            struct bio *bio_next = bio_list_pop(current->bio_list);

            bio_io_error(bio);
            bio = bio_next;
        }
    } while (bio);
    current->bio_list = NULL; /* deactivate */  // clear the list so make_request_fn is available again

out:
    return ret;
}

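As the code confirms, and as described earlier, bios are managed through a queue. To process a bio, the kernel first obtains the device queue for that bio (that is, its software and hardware queues), then checks whether the queue is live and able to accept the request. If so, it calls make_request_fn, the function registered during initialization.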
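
The role of current->bio_list is easier to see in isolation: a nested submission is queued instead of recursing, so stacked drivers cannot grow the call stack. Below is a user-space sketch of that idea. struct sbio, active_list, make_request and submit are names invented for this example, and the child pointer models a stacked driver that resubmits a bio to a lower device.

```c
#include <assert.h>
#include <stddef.h>

struct sbio {                   /* hypothetical stand-in for struct bio */
    int id;
    struct sbio *child;         /* bio that a stacked driver would resubmit */
    struct sbio *next;          /* list linkage, like bio->bi_next */
};

struct sbio_list { struct sbio *head, *tail; };

static struct sbio_list *active_list;   /* plays the role of current->bio_list */
static int order[16], order_len;        /* processing order, recorded for inspection */
static int depth, max_depth;            /* nesting depth of make_request() */

static void sbio_list_add(struct sbio_list *l, struct sbio *b)
{
    b->next = NULL;
    if (l->tail)
        l->tail->next = b;
    else
        l->head = b;
    l->tail = b;
}

static struct sbio *sbio_list_pop(struct sbio_list *l)
{
    struct sbio *b = l->head;

    if (b) {
        l->head = b->next;
        if (!l->head)
            l->tail = NULL;
        b->next = NULL;
    }
    return b;
}

static void submit(struct sbio *bio);

static void make_request(struct sbio *bio)
{
    if (++depth > max_depth)
        max_depth = depth;
    if (order_len < 16)
        order[order_len++] = bio->id;
    if (bio->child)
        submit(bio->child);     /* nested submission: queued, not recursed into */
    depth--;
}

static void submit(struct sbio *bio)
{
    struct sbio_list on_stack = { NULL, NULL };

    if (active_list) {          /* a submission is already active on this task */
        sbio_list_add(active_list, bio);
        return;
    }
    active_list = &on_stack;
    do {
        make_request(bio);
        bio = sbio_list_pop(active_list);
    } while (bio);
    active_list = NULL;         /* deactivate */
}
```

Submitting a bio whose handler resubmits a second bio, which in turn resubmits a third, processes all three in order while make_request never nests deeper than one level.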

Next, under MQ the function registered as make_request_fn is blk_mq_make_request, shown below:

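static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
{
    const int is_sync = rw_is_sync(bio->bi_rw); // is this a synchronous request?
    const int is_flush_fua = bio->bi_rw & (REQ_FLUSH | REQ_FUA);// is this a flush/FUA (barrier) I/O?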
    struct blk_map_ctx data;
    struct request *rq;
    unsigned int request_count = 0;
    struct blk_plug *plug;
    struct request *same_queue_rq = NULL;
    blk_qc_t cookie;

    blk_queue_bounce(q, &bio);      // DMA addressing limits: if the device can only reach low memory, data in high memory is copied to a bounce buffer in low memory

    if (bio_integrity_enabled(bio) && bio_integrity_prep(bio)) {    // prepare integrity metadata; fail the bio on error
        bio_io_error(bio);
        return BLK_QC_T_NONE;
    }

    blk_queue_split(q, &bio, q->bio_split); // split the bio if it exceeds the configured maximum size; the split-off part is resubmitted through generic_make_request

    if (!is_flush_fua && !blk_queue_nomerges(q) &&
        blk_attempt_plug_merge(q, bio, &request_count, &same_queue_rq)) // not a barrier I/O, the queue allows merging, and the bio merged into a request on the plug list: done
        return BLK_QC_T_NONE;

    rq = blk_mq_map_request(q, bio, &data); // allocate a request from the mapped software/hardware queue
    if (unlikely(!rq))
        return BLK_QC_T_NONE;

    cookie = blk_tag_to_qc_t(rq->tag, data.hctx->queue_num);

    if (unlikely(is_flush_fua)) {
        blk_mq_bio_to_request(rq, bio); // convert the bio into a request
        blk_insert_flush(rq);   // barrier I/O goes onto the flush queue, which is sent straight to the driver
        goto run_queue;
    }

    plug = current->plug;
    /*
     * If the driver supports defer issued based on 'last', then
     * queue it up like normal since we can potentially save some
     * CPU this way.
     */
    if (((plug && !blk_queue_nomerges(q)) || is_sync) &&    // a plug list exists and the queue allows merging, or the request is synchronous
        !(data.hctx->flags & BLK_MQ_F_DEFER_ISSUE)) {   // and the driver does not ask for deferred issue
        struct request *old_rq = NULL;

        blk_mq_bio_to_request(rq, bio); // convert the bio into a request

        /*
         * We do limited pluging. If the bio can be merged, do that.
         * Otherwise the existing request in the plug list will be
         * issued. So the plug list will have one request at most
         */
        if (plug) {
            /*
             * The plug list might get flushed before this. If that
             * happens, same_queue_rq is invalid and plug list is
             * empty
             */
            if (same_queue_rq && !list_empty(&plug->mq_list)) {
                old_rq = same_queue_rq;
                list_del_init(&old_rq->queuelist);  // the plug list already holds a request for this queue: take it off so it can be issued now
            }
            list_add_tail(&rq->queuelist, &plug->mq_list);  // add the new request to the plug list
        } else /* is_sync */
            old_rq = rq;
        blk_mq_put_ctx(data.ctx);   
        if (!old_rq)    // nothing pending to issue
            goto done;
        if (!blk_mq_direct_issue_request(old_rq, &cookie))  // try to issue straight to the driver (the SCSI layer here)
            goto done;
        blk_mq_insert_request(old_rq, false, true, true);   // the driver was busy: fall back to the software queue
        goto done;
    }

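    if (!blk_mq_merge_queue_io(data.hctx, data.ctx, rq, bio)) { // reached when the driver supports deferred issue, or for async requests
        // merge if possible; otherwise the request has been inserted into the software queue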
        /*
         * For a SYNC request, send it to the hardware immediately. For
         * an ASYNC request, just ensure that we run it later on. The
         * latter allows for merging opportunities and more efficient
         * dispatching.
         */
run_queue:
        blk_mq_run_hw_queue(data.hctx, !is_sync || is_flush_fua);   // run the hardware queue
    }
    blk_mq_put_ctx(data.ctx);
done:
    return cookie;
}

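This function turns a bio into a request and dispatches it. There are four main cases:
1. Flush and other barrier I/O bypasses the software queue and goes straight to the flush queue. Barrier I/O must not be delayed: it has to reach the driver without being held up behind other queued requests.
2. For non-barrier I/O, the function first checks whether the bio can be merged into a request already on the plug list. On a hit the bio is merged and the function returns immediately.
3. On a plug miss, a request is allocated and added to the plug list. Note that under MQ the plug list does not really accumulate requests: its depth never exceeds 1. A request displaced from the plug list is issued directly to the driver and, if the driver is busy, still ends up in the software queue.
4. A request headed for the software queue is likewise checked for merging. If it merges, the function returns; otherwise it is scheduled by the software queue's scheduler and finally added to the hardware queue.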
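
The four cases can be condensed into one decision function that follows the order of the checks in the code quoted above. This is a sketch only: struct bio_info, enum mq_path and classify are invented for this example, and the boolean fields stand in for checks the kernel performs on live structures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum mq_path {
    PATH_FLUSH,           /* case 1: flush/FUA goes to the flush queue */
    PATH_PLUG_MERGED,     /* case 2: merged into a request on the plug list */
    PATH_PLUG_OR_DIRECT,  /* case 3: plug list (depth <= 1) or direct issue */
    PATH_SW_QUEUE         /* case 4: merge into or insert in the software queue */
};

struct bio_info {             /* hypothetical summary of the inputs to the decision */
    bool is_flush_fua;
    bool is_sync;
    bool has_plug;            /* current->plug is set */
    bool queue_nomerges;      /* the queue forbids merging */
    bool plug_merge_hit;      /* the plug-list merge attempt succeeded */
    bool defer_issue;         /* the driver sets BLK_MQ_F_DEFER_ISSUE */
};

static enum mq_path classify(const struct bio_info *b)
{
    assert(b != NULL);
    if (!b->is_flush_fua && !b->queue_nomerges && b->plug_merge_hit)
        return PATH_PLUG_MERGED;
    if (b->is_flush_fua)
        return PATH_FLUSH;
    if (((b->has_plug && !b->queue_nomerges) || b->is_sync) && !b->defer_issue)
        return PATH_PLUG_OR_DIRECT;
    return PATH_SW_QUEUE;
}
```

For instance, an async bio submitted without a plug falls through every earlier check and lands in the software-queue path, which matches the comment in the kernel code about giving async requests a chance to merge.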

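The source code behind these cases is shown below.

blk_mq_direct_issue_request hands the request directly to the underlying driver and reports whether the SCSI device could accept it: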

static int blk_mq_direct_issue_request(struct request *rq, blk_qc_t *cookie)
{
    int ret;
    struct request_queue *q = rq->q;
    struct blk_mq_hw_ctx *hctx = q->mq_ops->map_queue(q,
            rq->mq_ctx->cpu);
    struct blk_mq_queue_data bd = {
        .rq = rq,
        .list = NULL,
        .last = 1
    };
    blk_qc_t new_cookie = blk_tag_to_qc_t(rq->tag, hctx->queue_num);

    /*
     * For OK queue, we are done. For error, kill it. Any other
     * error (busy), just add it to our list as we previously
     * would have done
     */
    ret = q->mq_ops->queue_rq(hctx, &bd);   // hand the request to the driver (SCSI here); the return value says whether it was accepted
    if (ret == BLK_MQ_RQ_QUEUE_OK) {
        *cookie = new_cookie;
        return 0;
    }

    __blk_mq_requeue_request(rq);   // not accepted: undo the start of the request (this may also decrement its nr_phys_segments)

    if (ret == BLK_MQ_RQ_QUEUE_ERROR) {
        *cookie = BLK_QC_T_NONE;
        rq->errors = -EIO;
        blk_mq_end_request(rq, rq->errors);
        return 0;
    }

    return -1;
}
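
The three outcomes of a direct issue (accepted, failed, busy) and the fallback to the software queue can be modelled in a few lines. This is a hedged user-space sketch: struct fake_driver, fake_queue_rq and issue_or_queue are hypothetical, and the RQ_* values only mirror the idea of the BLK_MQ_RQ_QUEUE_* return codes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum queue_rq_result { RQ_OK, RQ_BUSY, RQ_ERROR };
enum outcome { ISSUED, FAILED, QUEUED_IN_SW };

struct fake_driver {          /* hypothetical device with a fixed queue depth */
    unsigned int depth;
    unsigned int in_flight;
    bool broken;
};

/* Stand-in for the driver's queue_rq callback. */
static enum queue_rq_result fake_queue_rq(struct fake_driver *d)
{
    if (d->broken)
        return RQ_ERROR;
    if (d->in_flight >= d->depth)
        return RQ_BUSY;       /* no free slot right now */
    d->in_flight++;
    return RQ_OK;
}

/* Try the driver first; only a busy driver sends the request to the software queue. */
static enum outcome issue_or_queue(struct fake_driver *d, unsigned int *sw_queue_len)
{
    assert(d != NULL && sw_queue_len != NULL);
    switch (fake_queue_rq(d)) {
    case RQ_OK:
        return ISSUED;
    case RQ_ERROR:
        return FAILED;        /* the request is completed with an error, not queued */
    default:
        (*sw_queue_len)++;
        return QUEUED_IN_SW;
    }
}
```

With a depth of 2, the first two requests are issued and the third is queued in software, which is the behaviour the return value -1 triggers in the caller above.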

blk_mq_insert_request adds the request to the software queue:

void blk_mq_insert_request(struct request *rq, bool at_head, bool run_queue,
        bool async)
{
    struct request_queue *q = rq->q;
    struct blk_mq_hw_ctx *hctx;
    struct blk_mq_ctx *ctx = rq->mq_ctx, *current_ctx;

    current_ctx = blk_mq_get_ctx(q);    // get the current CPU's software queue context
    if (!cpu_online(ctx->cpu))
        rq->mq_ctx = ctx = current_ctx;

    hctx = q->mq_ops->map_queue(q, ctx->cpu);   // find the hardware queue context mapped to it

    spin_lock(&ctx->lock);
    __blk_mq_insert_request(hctx, rq, at_head); // find the ctx through rq and add rq to its software queue
    spin_unlock(&ctx->lock);

    if (run_queue)
        blk_mq_run_hw_queue(hctx, async);   // run the hardware queue, asynchronously if async is set

    blk_mq_put_ctx(current_ctx);
}

blk_mq_merge_queue_io checks whether the bio can be merged with a request already in the current software queue:

static inline bool blk_mq_merge_queue_io(struct blk_mq_hw_ctx *hctx,
                     struct blk_mq_ctx *ctx,
                     struct request *rq, struct bio *bio)
{
    if (!hctx_allow_merges(hctx) || !bio_mergeable(bio)) {  // merging is not allowed
        blk_mq_bio_to_request(rq, bio);
        spin_lock(&ctx->lock);
insert_rq:
        __blk_mq_insert_request(hctx, rq, false);   // add to the software queue
        spin_unlock(&ctx->lock);
        return false;
    } else {
        struct request_queue *q = hctx->queue;

        spin_lock(&ctx->lock);
        if (!blk_mq_attempt_merge(q, ctx, bio)) {   // try to merge
            blk_mq_bio_to_request(rq, bio); // cannot merge: go insert the request into the software queue
            goto insert_rq;
        }

        spin_unlock(&ctx->lock);
        __blk_mq_free_request(hctx, ctx, rq);   // the bio was merged into a request already in the software queue, so free the request that was just allocated for it
        return true;
    }
}
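
What "can be merged" means is not visible in the excerpt, so here is a deliberately simplified sketch of one kind of merge, a back merge, where a new bio starts exactly where an existing request ends. struct sreq, try_back_merge and the max_sectors limit are assumptions made for illustration; the kernel also handles front merges and checks more conditions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct sreq {                         /* hypothetical, stripped-down request */
    unsigned long long sector;        /* first sector */
    unsigned int nr_sectors;          /* length in sectors */
    bool write;                       /* data direction */
};

/*
 * Append [sector, sector + nr) to rq if it is contiguous with the end of rq,
 * has the same direction, and the result stays within max_sectors.
 * Returns true when the merge happened.
 */
static bool try_back_merge(struct sreq *rq, unsigned long long sector,
                           unsigned int nr, bool write,
                           unsigned int max_sectors)
{
    assert(rq != NULL);
    if (rq->write != write)
        return false;
    if (rq->sector + rq->nr_sectors != sector)
        return false;                 /* not contiguous */
    if (nr > max_sectors || rq->nr_sectors > max_sectors - nr)
        return false;                 /* merged request would be too large */
    rq->nr_sectors += nr;
    return true;
}
```

When the merge succeeds no new request needs to be queued, which is why the kernel path above frees the request it had allocated for the bio.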

blk_mq_run_hw_queue runs the hardware queue and dispatches its requests to the SCSI driver layer:

void blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async)
{
    if (unlikely(test_bit(BLK_MQ_S_STOPPED, &hctx->state) ||
        !blk_mq_hw_queue_mapped(hctx)))
        return;

    if (!async) {
        // async == false means the run is synchronous and must happen now;
        // otherwise the work is handed to kblockd
        int cpu = get_cpu();
        if (cpumask_test_cpu(cpu, hctx->cpumask)) {  // the current CPU is in this hctx's cpumask
            __blk_mq_run_hw_queue(hctx);    // run the hardware queue
            put_cpu();
            return;
        }

        put_cpu();
    }

    kblockd_schedule_delayed_work_on(blk_mq_hctx_next_cpu(hctx),
            &hctx->run_work, 0);
}
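
The inline-versus-deferred choice in this function depends on three inputs only, which the sketch below makes explicit. enum run_mode and choose_run_mode are invented names, and the CPU mask is modelled as a plain bitmask rather than a kernel cpumask.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

enum run_mode { RUN_SKIPPED, RUN_INLINE, RUN_DEFERRED };

/*
 * Decide how a hardware queue would be run.
 * cpumask has bit n set when CPU n is mapped to this hardware queue.
 */
static enum run_mode choose_run_mode(bool stopped, bool async,
                                     unsigned int cur_cpu,
                                     unsigned long cpumask)
{
    assert(cur_cpu < sizeof(unsigned long) * CHAR_BIT);
    if (stopped)
        return RUN_SKIPPED;               /* a stopped queue is left alone */
    if (!async && ((cpumask >> cur_cpu) & 1UL))
        return RUN_INLINE;                /* run right here, on a mapped CPU */
    return RUN_DEFERRED;                  /* hand over to a worker (kblockd in the kernel) */
}
```

Note that a synchronous run requested from a CPU outside the mask is still deferred, so that the dispatch happens on a CPU that belongs to the hardware queue.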

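That is the execution path of the MQ queues.

For a more detailed walk-through, see the blog post "Linux Block Layer块设备层基于MultiQueue的部分源码分析" (a partial source analysis of the MultiQueue-based Linux block layer).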

5. MQ queue scheduler

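The walk-through above covers how the MQ queues execute but does not touch the scheduler. After asking around, I learned that MQ also has a scheduler module. It mainly schedules within a single task, that is, it schedules the requests on one or more software queues. The goal is similar to that of the traditional CFQ: take the characteristics of the underlying disk into account so that reads and writes use the storage medium more efficiently.

Note: the number of software queues matches the number of CPU cores. While an application runs, CPU scheduling determines which cores it executes on. An application can run on several cores, in which case the I/O it issues lands in several software queues and is scheduled across them. This is my own guess and has not been verified.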

6. Questions and answers

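With MQ support in the kernel, suppose one hardware channel is busy and I/O on it is badly blocked. How does the upper-layer application notice? Can the software-to-hardware queue mapping change at that point?
Answer: the mapping between software and hardware queues is fixed during hardware initialization and cannot be changed. When a hardware channel is I/O-intensive, the software queues mapped to that hardware queue identify the CPUs with heavy I/O. Knowing the state of each core, the CPU scheduler can then reschedule when application tasks are dispatched and avoid binding heavy I/O to an already busy core.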

References:

https://blog.csdn.net/g382112762/article/details/79606485
https://blog.csdn.net/notbaron/article/details/81147591
https://blog.csdn.net/yedushu/article/details/82050933
https://hyunyoung2.github.io/2016/09/14/Multi_Queue/