亿级数据处理

2021-04-06

MongoDB中存在一张数据量过亿的表

目前文档数量 232,656,863

文档容量：46.35Gb

运行环境：

操作系统：Ubuntu 20.04.1 LTS
CPU: i5-4590 @ 3.30 Ghz
内存: 32G
硬盘：M.2 SSD

已在2个字段上创建索引，命中索引的话响应3ms，没命中索引80s以上。

批量导入优化

有百万条数据需导入，为加快导入速度，我们执行批量导入，一次1万条试试：

导入： 610000 [run 5309.4ms]
导入： 620000 [run 5290.9ms]
导入： 630000 [run 4837.2ms]
导入： 640000 [run 4927.6ms]
导入： 650000 [run 5749.3ms]
导入： 660000 [run 4754.5ms]
导入： 670000 [run 5378.4ms]
导入： 680000 [run 4548.4ms]
导入： 690000 [run 4558.4ms]

可以看到1万条一次需要大概5秒。

如果把索引去掉：

导入： 1200000 [run 52.5ms]
导入： 1210000 [run 53.6ms]
导入： 1220000 [run 52.5ms]
导入： 1230000 [run 53.7ms]
导入： 1240000 [run 53ms]
导入： 1250000 [run 52.8ms]
导入： 1260000 [run 60.8ms]
导入： 1270000 [run 52.7ms]
导入： 1280000 [run 55.3ms]
导入： 1290000 [run 55.3ms]

快了近100倍。

像一些冷数据，导入量比较大时，把索引去除加快导入速度。等导入完，重建索引，效率反而提升。

大量数据查询优化

当数据量返回过大时，除了必要的索引优化外，对数据集返回数的限制尤为必要。

请使用skip和limit做数据限制。

MongoDB性能分析和优化

explain

在查询语句后面跟上explain()能获取查询相关诊断信息

比如：

> db.getCollection("m_pass_base").find({_id:{$regex:/^malu/}}).explain(true)

{
    "queryPlanner": {
        "plannerVersion": NumberInt("1"),
        "namespace": "d1.m_pass_base",      // 查询的集合
        "indexFilterSet": false,            // 索引过滤
        "parsedQuery": {                    // 查询条件
            "_id": {
                "$regex": "^malu"
            }
        },
        "winningPlan": {                    // 最佳执行计划
            "stage": "FETCH",
            "inputStage": {
                "stage": "IXSCAN",
                "keyPattern": {
                    "_id": NumberInt("1")
                },
                "indexName": "_id_",
                "isMultiKey": false,
                "multiKeyPaths": {
                    "_id": [ ]
                },
                "isUnique": true,
                "isSparse": false,
                "isPartial": false,
                "indexVersion": NumberInt("2"),
                "direction": "forward",
                "indexBounds": {           // 当前查询具体使用的索引
                    "_id": [
                        "[\"malu\", \"malv\")",
                        "[/^malu/, /^malu/]"
                    ]
                }
            }
        },
        "rejectedPlans": [ ]               // 拒绝执行计划
    },
    "executionStats": {                    // executionStats会返回最佳执行计划的一些统计信息
        "executionSuccess": true,          // 是否执行成功
        "nReturned": NumberInt("466"),     // 返回结果数
        "executionTimeMillis": NumberInt("0"),
        "totalKeysExamined": NumberInt("467"),   // 索引扫描数
        "totalDocsExamined": NumberInt("466"),   // 文档扫描数
        "executionStages": {
            "stage": "FETCH",                    // 扫描方式
            "nReturned": NumberInt("466"),
            "executionTimeMillisEstimate": NumberInt("0"),
            "works": NumberInt("468"),
            "advanced": NumberInt("466"),
            "needTime": NumberInt("1"),
            "needYield": NumberInt("0"),
            "saveState": NumberInt("3"),
            "restoreState": NumberInt("3"),
            "isEOF": NumberInt("1"),
            "invalidates": NumberInt("0"),
            "docsExamined": NumberInt("466"),
            "alreadyHasObj": NumberInt("0"),
            "inputStage": {
                "stage": "IXSCAN",
                "nReturned": NumberInt("466"),
                "executionTimeMillisEstimate": NumberInt("0"),
                "works": NumberInt("468"),
                "advanced": NumberInt("466"),
                "needTime": NumberInt("1"),
                "needYield": NumberInt("0"),
                "saveState": NumberInt("3"),
                "restoreState": NumberInt("3"),
                "isEOF": NumberInt("1"),
                "invalidates": NumberInt("0"),
                "keyPattern": {
                    "_id": NumberInt("1")
                },
                "indexName": "_id_",
                "isMultiKey": false,
                "multiKeyPaths": {
                    "_id": [ ]
                },
                "isUnique": true,
                "isSparse": false,
                "isPartial": false,
                "indexVersion": NumberInt("2"),
                "direction": "forward",
                "indexBounds": {           // 当前查询具体使用的索引
                    "_id": [
                        "[\"malu\", \"malv\")",
                        "[/^malu/, /^malu/]"
                    ]
                },
                "keysExamined": NumberInt("467"),
                "seeks": NumberInt("2"),
                "dupsTested": NumberInt("0"),
                "dupsDropped": NumberInt("0"),
                "seenInvalidated": NumberInt("0")
            }
        },
        "allPlansExecution": [ ]     // 所有执行计划
    },
    "serverInfo": {
        "host": "M1",
        "port": NumberInt("27017"),
        "version": "3.6.8",
        "gitVersion": "8e540c0b6db93ce994cc548f000900bdc740f80a"
    },
    "ok": 1
}

扫描方式stage有如下几种：

COLLSCAN：全表扫描
IXSCAN：索引扫描
FETCH：根据索引去检索指定document
SHARD_MERGE：将各个分片返回数据进行merge
SORT：表明在内存中进行了排序
LIMIT：使用limit限制返回数
SKIP：使用skip进行跳过
IDHACK：针对_id进行查询
SHARDING_FILTER：通过mongos对分片数据进行查询
COUNT：利用db.coll.explain().count()之类进行count运算
COUNTSCAN：count不使用Index进行count时的stage返回
COUNT_SCAN：count使用了Index进行count时的stage返回
SUBPLA：未使用到索引的$or查询的stage返回
TEXT：使用全文索引进行查询时候的stage返回
PROJECTION：限定返回字段时候stage的返回

所以对于查询优化，我们希望看到stage的组合是(查询的时候尽可能用上索引)：

Fetch+IDHACK
Fetch+ixscan
Limit+（Fetch+ixscan）
PROJECTION+ixscan
SHARDING_FITER+ixscan
COUNT_SCAN

而不希望看到包含如下的stage：

COLLSCAN(全表扫描)
SORT(使用sort但是无index)
不合理的SKIP
SUBPLA(未用到index的$or)
COUNTSCAN(不使用index进行count)

hint

hint 可以强制 MongoDB 使用一个指定的索引，一般我们在联合索引上做优化。

hint({“$natural”:true}) 可以强制查询走全表扫描，这种情况适合在返回数据集很大的时候，不走索引反而效率更高。

MongoDB慢查询

官方文档：https://docs.mongodb.com/manual/reference/database-profiler/

开启慢查询Profiling

Profiling级别说明

0：关闭，不收集任何数据。
1：收集慢查询数据，默认是100毫秒。
2：收集所有数据

方式一：配置文件开启Profiling

修改启动mongo.conf，插入以下代码

#开启慢查询，200毫秒的记录
profile = 1
slowms = 200

方式二：通过命令开启

注意该方式只保留在内存中，重启mongo将失效

#查看状态：级别和时间
drug:PRIMARY> db.getProfilingStatus()   
{ "was" : 1, "slowms" : 100 }

#查看级别
drug:PRIMARY> db.getProfilingLevel()    
1

#设置级别
drug:PRIMARY> db.setProfilingLevel(2)
{ "was" : 1, "slowms" : 100, "ok" : 1 }

#设置级别和时间
drug:PRIMARY> db.setProfilingLevel(1,200)
{ "was" : 2, "slowms" : 100, "ok" : 1 }

修改“慢查询日志”的大小

#关闭Profiling
drug:PRIMARY> db.setProfilingLevel(0)
{ "was" : 0, "slowms" : 200, "ok" : 1 }

#删除system.profile集合
drug:PRIMARY> db.system.profile.drop()
true

#创建一个新的system.profile集合
drug:PRIMARY> db.createCollection( "system.profile", { capped: true, size:4000000 } )
{ "ok" : 1 }

#重新开启Profiling
drug:PRIMARY> db.setProfilingLevel(1)
{ "was" : 0, "slowms" : 200, "ok" : 1 }

慢日志示例：

{
    "op": "command",                      // 操作类型，有insert、query、update、remove、getmore、command   
    "ns": "d1.m_pass_base",               // 操作的集合
    "command": {                          // 查询语句
        "aggregate": "m_pass_base",
        "pipeline": [
            {
                "$match": {
                    "_id": /^1/
                }
            },
            {
                "$group": {
                    "_id": NumberInt("1"),
                    "n": {
                        "$sum": NumberInt("1")
                    }
                }
            }
        ],
        "allowDiskUse": false,
        "cursor": { },
        "$db": "d1",
        "lsid": {
            "id": UUID("e4f7b72e-b69a-42a9-91e5-ec85c7c11f2a")
        }
    },
    "keysExamined": NumberInt("77810366"),
    "docsExamined": NumberInt("0"),
    "cursorExhausted": true,
    "numYield": NumberInt("607894"),
    "locks": {
        "Global": {
            "acquireCount": {
                "r": NumberLong("1215792")
            }
        },
        "Database": {
            "acquireCount": {
                "r": NumberLong("607896")
            }
        },
        "Collection": {
            "acquireCount": {
                "r": NumberLong("607896")
            }
        }
    },
    "nreturned": NumberInt("1"),
    "responseLength": NumberInt("111"),
    "protocol": "op_msg",
    "millis": NumberInt("42190"),                  // 消耗的时间（毫秒）
    "planSummary": "IXSCAN { _id: 1 }",
    "ts": ISODate("2021-04-06T13:43:56.792Z"),     // 语句执行的时间
    "client": "192.168.50.1",
    "allUsers": [ ],
    "user": ""
}

日常使用的查询

#返回最近的10条记录
db.system.profile.find().limit(10).sort({ ts : -1 }).pretty()

#返回所有的操作，除command类型的
db.system.profile.find( { op: { $ne : 'command' } } ).pretty()

#返回特定集合
db.system.profile.find( { ns : 'mydb.test' } ).pretty()

#返回大于5毫秒慢的操作
db.system.profile.find( { millis : { $gt : 5 } } ).pretty()

#从一个特定的时间范围内返回信息
db.system.profile.find(
                       {
                        ts : {
                              $gt : new ISODate("2021-04-06T03:00:00Z") ,
                              $lt : new ISODate("2021-04-06T03:40:00Z")
                             }
                       }
                      ).pretty()

#特定时间，限制用户，按照消耗时间排序
db.system.profile.find(
                       {
                         ts : {
                              $gt : new ISODate("2021-04-06T03:00:00Z") ,
                              $lt : new ISODate("2021-04-06T03:40:00Z")
                              }
                       },
                       { user : 0 }
                      ).sort( { millis : -1 } )