Distributed File Storage: A Trial and Comparison Summary of SeaweedFS

July 29, 2019 | Middleware Technology

SeaweedFS, file storage

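Basic Concepts

1. SeaweedFS groups disks hierarchically

Storage is organized into DataCenters (data centers, i.e. machine rooms), Racks, Servers, and Hard Drives, so that replicas can be spread across these groups to ensure availability.

2. Replication: keeping multiple copies

This is a parameter set when starting the Master node: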

./weed master -defaultReplication=001

This means one replica is made on a different server in the same rack (two copies in total).

Why 001? The official definition is as follows:

000  no replication, just one copy
001  replicate once on the same rack
010  replicate once on a different rack in the same data center
100  replicate once on a different data center
200  replicate twice on two other different data center
110  replicate once on a different rack, and once on a different data center

That is, the three digits xyz mean:

x  number of replica in other data centers
y  number of replica in other racks in the same data center
z  number of replica in other servers in the same rack

In a test environment the servers usually all sit in one rack, so x and y are both 0.
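To make the xyz rule concrete, here is a minimal Python sketch that splits a replication code into its three digits and counts the total number of copies. The names Replication and parse_replication are made up for illustration and are not part of SeaweedFS.

```python
from typing import NamedTuple


class Replication(NamedTuple):
    """The three digits of a replication code (hypothetical helper type)."""
    other_data_centers: int  # x
    other_racks: int         # y
    other_servers: int       # z

    @property
    def total_copies(self) -> int:
        # The original copy plus one copy per replica.
        return 1 + self.other_data_centers + self.other_racks + self.other_servers


def parse_replication(code: str) -> Replication:
    """Split a three-digit code such as "001" into x, y and z."""
    if len(code) != 3 or not (code.isascii() and code.isdigit()):
        raise ValueError(f"expected three digits, got {code!r}")
    x, y, z = (int(ch) for ch in code)
    return Replication(x, y, z)
```

For example, parse_replication("001").total_copies is 2, which matches the "two copies in total" reading of -defaultReplication=001.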

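Components:

Core: Master server + Volume server
Extensions: Filer server + Cronjob server (Replication-job) + S3 server

The role of each component is described later.

One point worth noting: external clients talk to the Master Server, Volume Server, and Filer over an HTTP API. The official site documents the API usage in detail.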

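Concept Mapping

Node        an abstract node in the system; concretely a DataCenter, Rack, or DataNode
DataCenter  a data center, corresponding to a physical machine room
Rack        a rack, corresponding to a physical cabinet
DataNode    a storage node that manages and stores logical volumes
Volume      a logical volume, the logical storage structure; Needles are stored inside it
Needle      an object inside a logical volume, corresponding to a stored file (each file has a unique needle ID)
Collection  a set of files, which can be spread across multiple logical volumes

Note that the DataNode above is simply the Volume server, and one Volume server holds many logical volumes.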
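The containment relationships above can be sketched as a small data model. This is only an illustration of how the terms nest, written with Python dataclasses; the class and field names are my own, not SeaweedFS source code, and treating a collection as a label on each volume is an assumption.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Volume:
    volume_id: int
    collection: str = ""  # assumption: "" stands for the default collection
    needle_ids: List[int] = field(default_factory=list)


@dataclass
class DataNode:  # i.e. one Volume server
    address: str
    volumes: List[Volume] = field(default_factory=list)


@dataclass
class Rack:
    name: str
    nodes: List[DataNode] = field(default_factory=list)


@dataclass
class DataCenter:
    name: str
    racks: List[Rack] = field(default_factory=list)


def volumes_by_collection(dc: DataCenter) -> Dict[str, List[int]]:
    """Walk DataCenter -> Rack -> DataNode -> Volume and group volume ids by collection."""
    result: Dict[str, List[int]] = {}
    for rack in dc.racks:
        for node in rack.nodes:
            for vol in node.volumes:
                result.setdefault(vol.collection, []).append(vol.volume_id)
    return result
```

Grouping by collection shows the last row of the table in action: one collection can own volumes that live on different Volume servers.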

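The official documentation says very little about volumes and collections; much of what follows is what I pieced together after reading a lot of documentation.

Increasing Concurrent Writes and Reads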

By default, SeaweedFS grows the volumes automatically. For example, for no-replication volumes, there will be concurrently 7 writable volumes allocated.

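In other words, with the default settings and no replication, SeaweedFS keeps 7 writable volumes allocated at the same time and grows them automatically.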

If you want to distribute writes to more volumes, you can do so by instructing SeaweedFS master via this URL.

curl "http://localhost:9333/vol/grow?count=12&replication=001"

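This will assign 12 volumes with 001 replication. Since 001 replication means 2 copies for the same data, this will actually consume 24 physical volumes. Translator's note: a volume is a logical structure, but it is also where files are actually stored, so the "logical volumes" above and the "physical volumes" here are the same kind of object; the count is 24 because each of the 12 volumes exists as 2 copies.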
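The arithmetic behind "12 volumes become 24 physical volumes" can be sketched as follows. The helper name grow_request is hypothetical; the /vol/grow path and the count and replication parameters are taken from the curl command above.

```python
from typing import Tuple
from urllib.parse import urlencode


def grow_request(master: str, count: int, replication: str) -> Tuple[str, int]:
    """Return the /vol/grow URL and the number of physical volumes it consumes."""
    # Each volume exists once, plus one extra copy per replica digit.
    copies = 1 + sum(int(digit) for digit in replication)
    query = urlencode({"count": count, "replication": replication})
    return f"http://{master}/vol/grow?{query}", count * copies
```

Calling grow_request("localhost:9333", 12, "001") yields the URL used above together with 24.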

SeaweedFS also reclaims space at the volume level (after files are deleted, the volume is compacted automatically). The official text reads:

If your system has many deletions, the deleted file's disk space will not be synchronously re-claimed. There is a background job to check volume disk usage. If empty space is more than the threshold, default to 0.3, the vacuum job will make the volume readonly, create a new volume with only existing files, and switch on the new volume. If you are impatient or doing some testing, vacuum the unused spaces this way.

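In other words, deleting a file does not free disk space immediately. Reclamation is triggered only when the empty (deleted) space in a volume exceeds the threshold (default 0.3), and it works by creating a new volume that contains only the files that were not deleted.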
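The trigger condition can be written as a one-line check. This is a sketch of the rule as described in the quoted text, not SeaweedFS's actual implementation; in particular, measuring "empty space" as deleted bytes divided by the volume's total bytes is my assumption.

```python
def needs_vacuum(deleted_bytes: int, volume_bytes: int, threshold: float = 0.3) -> bool:
    """Return True when the share of deleted space exceeds the threshold."""
    if volume_bytes <= 0:
        # An empty volume has nothing to reclaim.
        return False
    return deleted_bytes / volume_bytes > threshold
```

With the default threshold, a volume where 40% of the bytes belong to deleted files qualifies for vacuuming, while one at exactly 30% does not.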

Pre-Allocate Volumes

One volume serves one write a time. If you need to increase concurrency, you can pre-allocate lots of volumes. Here are examples. You can combine all the different options also.

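Note, however, that volume creation takes many attributes, such as replication, collection, and data center; see the official documentation.

Component Overview

S3 Server (Amazon S3 API compatibility)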

To be compatible with Amazon S3 API, a separate "weed s3" command is provided. weed s3 will start a stateless gateway server to bridge the Amazon S3 API to SeaweedFS Filer.

For convenience, weed server -s3 will start a master, a volume server, a filer, and the S3 gateway.

Each bucket is stored in one collection, and mapped to folder /buckets/<bucket_name> by default.

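As I read it, each bucket is stored in its own collection and also appears as the folder /buckets/<bucket_name>. I am not certain whether a collection and a folder are the same thing; my guess is that the collection is the storage-side grouping and the folder is the path seen through the Filer.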

A bucket can be deleted efficiently by deleting the whole collection.

Currently, the following APIs are supported.

// Object operations
* PutObject
* GetObject
* HeadObject
* DeleteObject
* ListObjectsV2
* ListObjectsV1

// Bucket operations
* PutBucket
* DeleteBucket
* HeadBucket
* ListBuckets

// Multipart upload operations
* NewMultipartUpload
* CompleteMultipartUpload
* AbortMultipartUpload
* ListMultipartUploads

Filer Server

The Filer lets you browse files and directories, add and delete files, browse subdirectories and their files, and also search and rename.

Filer has a persistent client connecting to Master, to get the location updates of all volumes. As I understand it, this means only one Filer can be deployed per master server node.

Official description: https://github.com/chrislusf/seaweedfs/wiki/Directories-and-Files#architecture

weed mount requires a Filer; with it you can operate on files from the command line on a server. The supported operations are:

file read / write
create new file
mkdir
list
remove
rename
chmod
chown
soft link
display free disk space

The Filer's HTTP API can also be used to operate on files. Its functions are as follows:

Upload a file:

# Basic Usage:
> curl -F file=@report.js "http://localhost:8888/javascript/"
{"name":"report.js","size":866,"fid":"7,0254f1f3fd","url":"http://localhost:8081/7,0254f1f3fd"}
> curl  "http://localhost:8888/javascript/report.js"   # get the file content
...
# upload the file with a different name
> curl -F file=@report.js "http://localhost:8888/javascript/new_name.js"
{"name":"report.js","size":866,"fid":"3,034389657e","url":"http://localhost:8081/3,034389657e"}

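Note that when the upload path includes directories, the missing directories are created recursively and automatically.

List the files in a directory: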

# list all files under /javascript/
curl  -H "Accept: application/json" "http://localhost:8888/javascript/?pretty=y"
{
  "Directory": "/javascript/",
  "Files": [
    {
      "name": "new_name.js",
      "fid": "3,034389657e"
    },
    {
      "name": "report.js",
      "fid": "7,0254f1f3fd"
    }
  ],
  "Subdirectories": null
}
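As a sketch of how a client might consume this listing, the following Python function turns the JSON response into a name-to-fid mapping. The function name files_to_fids is made up, and treating "Files" as possibly null (by analogy with "Subdirectories": null above) is an assumption.

```python
import json
from typing import Dict

# The listing response shown above, used here as sample input.
SAMPLE_LISTING = """
{
  "Directory": "/javascript/",
  "Files": [
    {"name": "new_name.js", "fid": "3,034389657e"},
    {"name": "report.js", "fid": "7,0254f1f3fd"}
  ],
  "Subdirectories": null
}
"""


def files_to_fids(listing_json: str) -> Dict[str, str]:
    """Map each file name in a Filer directory listing to its fid."""
    listing = json.loads(listing_json)
    # Assumption: "Files" may be null or missing for an empty directory.
    entries = listing.get("Files") or []
    return {entry["name"]: entry["fid"] for entry in entries}
```

files_to_fids(SAMPLE_LISTING) returns a dictionary with the two entries from the listing.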

Delete files and directories:

curl -X DELETE http://localhost:8888/path/to/file
curl -X DELETE "http://localhost:8888/path/to/dir?recursive=true"
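The same two DELETE calls can be prepared from Python with only the standard library. This is a minimal sketch: filer_delete_request is a hypothetical helper, and it only builds the request so the URL can be inspected before anything is sent.

```python
from urllib.parse import quote, urlencode
from urllib.request import Request


def filer_delete_request(filer: str, path: str, recursive: bool = False) -> Request:
    """Build (but do not send) a DELETE request for a Filer file or directory."""
    # quote() keeps "/" intact and escapes characters such as spaces.
    url = f"http://{filer}{quote(path)}"
    if recursive:
        # Mirrors the ?recursive=true form used for directories above.
        url += "?" + urlencode({"recursive": "true"})
    return Request(url, method="DELETE")
```

Passing the returned object to urllib.request.urlopen would actually perform the deletion against a running Filer.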

Cronjob server (Replication-job)

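Put simply, when running a large cluster it is common to add more volume servers, to have some volume servers go down, or to replace some of them. These topology changes can leave volume replicas missing or leave the number of volumes unbalanced across volume servers. That is when the Cronjob server is needed.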

When running large clusters, it is common that some volume servers are down. If a volume is replicated and one replica is missing, the volume will be marked as readonly.

One way to fix is to find one healthy copy and replicated to other servers, to meet the replication requirement. This volume id will be marked as writable.

In weed shell, the command volume.fix.replication will do exactly that, automating the replication fixing process. You can start a crontab job to periodically run volume.fix.replication to ensure the system health.

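The Replication-job runs on the Cronjob server and automatically performs the replication fixing operation.

Official documentation: https://github.com/chrislusf/seaweedfs/wiki/Volume-Management

Volume Server

This is the so-called "Data Node", which mounts disks and stores files. The Volume Server communicates with the Master Server and is controlled by it. Volume Servers can be added and removed dynamically, which is a big advantage over MinIO, another cloud storage system.

The main functions of the volume server API are:

Upload a file: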

curl -F file=@/home/chris/myphoto.jpg http://127.0.0.1:8080/3,01637037d6
{"size": 43234}

Note that before uploading a file, you need to obtain a pre-assigned fileId from the master server.

Delete a file:

curl -X DELETE http://127.0.0.1:8080/3,01637037d6

Access or download a file:

curl http://127.0.0.1:8080/3,01637037d6

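Note: files uploaded directly through the Volume Server or Master Server into a custom collection cannot be accessed through the Filer's default collection. Files uploaded through the Filer land in the collection "" (empty). Perhaps the Filer has to be switched to the specific collection before it can reach those files, but I have not studied the Filer in depth and the official documentation does not cover this, so some experimentation may be needed.

Master Server

The Master stores no data; it only coordinates the cluster, playing a role similar to ZooKeeper.

The Master Server API provides the following functions:

Assign a fileId to be used for the file upload that follows: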

# Basic Usage:
curl http://localhost:9333/dir/assign
{"count":1,"fid":"3,01637037d6","url":"127.0.0.1:8080",
 "publicUrl":"localhost:8080"}
# To assign with a specific replication type:
curl "http://localhost:9333/dir/assign?replication=001"
# To specify how many file ids to reserve
curl "http://localhost:9333/dir/assign?count=5"
# To assign a specific data center
curl "

In addition, a collection can also be specified.
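The assign-then-upload flow can be sketched in Python as two small URL helpers: one builds the /dir/assign request, the other combines the "url" and "fid" fields of the response into the address the file should be posted to. Both function names are hypothetical, and only the replication and count parameters shown above are exercised.

```python
import json
from urllib.parse import urlencode


def assign_url(master: str, **params) -> str:
    """Build a /dir/assign URL, e.g. with replication="001" or count=5."""
    base = f"http://{master}/dir/assign"
    query = urlencode(params)
    return f"{base}?{query}" if query else base


def upload_url(assign_response: str) -> str:
    """Combine "url" and "fid" from an assign response into the upload address."""
    reply = json.loads(assign_response)
    return f"http://{reply['url']}/{reply['fid']}"
```

Fed the sample response above, upload_url returns http://127.0.0.1:8080/3,01637037d6, which is the address used in the volume server upload example.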

Assign a fileId and upload the file in one step:

curl -F file=@/home/chris/myphoto.jpg http://localhost:9333/submit
{"fid":"3,01fbe0dc6f1f38","fileName":"myphoto.jpg","fileUrl":"localhost:8080/3,01fbe0dc6f1f38","size":68231}

Delete a collection:

# delete a collection
curl "http://localhost:9333/col/delete?collection=benchmark&pretty=y"

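Clients in Various Languages

The project does not ship an official client, only the API; a client can be wrapped around it in any language.

Brief Summary of the Trial

I deployed with Docker Compose. When I accessed the Master from outside the VM host, the master leader URL and the volume server URLs that the Master returned were all Docker container-local IPs, which cannot be reached from outside the VM host.

The author does not seem to care much about this problem and has no plans to solve it. It is a fairly common problem, though, and I was able to work around it by modifying the client to map Docker-internal IPs to external IPs.

Also, there is no official docker-compose configuration for a cluster; here is the cluster configuration I used: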

https://github.com/shiguanghuxian/seaweedfs-docker

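I tested file upload and download as well as the S3 API, and everything worked. My one complaint is that the official documentation and tutorials are not comprehensive. But considering that the author works hard on this without much funding, we can only wait for the community to fill the gaps; the community currently looks fairly active.

A problem that FastDFS and SeaweedFS share:

FastDFS consists of tracker servers and node servers. The client configuration file only needs the tracker server IP; the tracker server then tells the client which node server to access, so the tracker server acts as a broker.

SeaweedFS has the same problem: the client first contacts the master, and the master returns the IP of a volume server.

This leads to a very common problem: the tracker/master is usually reached through an external IP, while the tracker and node servers (or the master and volume servers) talk to each other over local IPs. Acting as the broker, the tracker/master returns a local IP, which naturally cannot be reached from outside.

Solution 1 (requires middleware support)

The broker server obtains each node server's external IP and returns that.

Solution 2 (configured on the client side)

The client is configured with a mapping between internal and external IP addresses and ports. After receiving an internal URL, the client replaces it with the external URL before accessing it.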
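A minimal Python sketch of the mapping idea in Solution 2 follows. It is not the Java client mentioned in this post; the function name to_external and all addresses in the example mapping are made-up values for illustration.

```python
from typing import Dict
from urllib.parse import urlsplit, urlunsplit

# Hypothetical mapping: container-internal host:port -> externally reachable host:port.
ADDRESS_MAP: Dict[str, str] = {
    "172.18.0.3:8080": "192.168.1.10:18080",
    "172.18.0.4:8080": "192.168.1.10:18081",
}


def to_external(url: str, address_map: Dict[str, str]) -> str:
    """Replace an internal host:port with its external counterpart, if one is mapped."""
    # Master responses may omit the scheme ("127.0.0.1:8080"), so add one for parsing.
    has_scheme = "://" in url
    parts = urlsplit(url if has_scheme else "http://" + url)
    netloc = address_map.get(parts.netloc, parts.netloc)
    rewritten = urlunsplit(parts._replace(netloc=netloc))
    # Return the URL in the same shape it came in (with or without scheme).
    return rewritten if has_scheme else rewritten[len("http://"):]
```

A client would pass every URL returned by the master through to_external before connecting; addresses that are not in the mapping are returned unchanged.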

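I found a Java client for SeaweedFS, modified it along these lines, and it passed my tests.

Solution 3

The brute-force option: connect the internal and external IP ranges directly so that they can reach each other. This only works in specific environments (for example Kubernetes with Calico networking, with BGP configured on the border router). If this option is available, it is of course the best one.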