Configuration

Common Configuration

Environment configurations

  • MEGFILE_BLOCK_SIZE: block size used by some open functions, like http_open and s3_open; default is 8MB

  • MEGFILE_MAX_BLOCK_SIZE: max block size used by some open functions, like http_open and s3_open; default is block size * 16

  • MEGFILE_MAX_BUFFER_SIZE: max buffer size used by some open functions, like http_open and s3_open; default is block size * 16

  • MEGFILE_MAX_WORKERS: maximum number of threads that will be used; default is 32

  • MEGFILE_BLOCK_CAPACITY: default cache capacity of blocks and concurrency; default is 16

  • MEGFILE_S3_CLIENT_CACHE_MODE: s3 client cache mode, thread_local or process_local; default is thread_local. This is an experimental feature.
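
For example, here is a minimal sketch of tuning these settings from Python (the values are illustrative, not recommendations; since the variables are read from the process environment, setting them before megfile is imported is the safest ordering):

import os

os.environ['MEGFILE_BLOCK_SIZE'] = str(16 * 2 ** 20)  # use 16MB blocks instead of the 8MB default
os.environ['MEGFILE_MAX_WORKERS'] = '64'  # allow up to 64 threads

import megfile  # imported after the environment is prepared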

S3 Configuration

Before using megfile to access files on s3, you need to set up authentication credentials for your s3 account. In addition to the configuration supported by boto3, megfile supports some additional configuration items; the most common ones are described below. You can configure megfile with environment variables or a configuration file; environment variables take precedence over the configuration file.

Use environment variables

You can use environment variables to set up authentication credentials for your s3 account; a short example follows the list:

  • AWS_ACCESS_KEY_ID: access key

  • AWS_SECRET_ACCESS_KEY: secret key

  • OSS_ENDPOINT: endpoint url of s3

  • AWS_S3_ADDRESSING_STYLE: addressing style
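
For example, here is a minimal sketch of configuring s3 access through environment variables from Python (the credentials, endpoint and bucket/key are placeholders):

import os

os.environ['AWS_ACCESS_KEY_ID'] = 'accesskey'
os.environ['AWS_SECRET_ACCESS_KEY'] = 'secretkey'
os.environ['OSS_ENDPOINT'] = 'http://oss-cn-hangzhou.aliyuncs.com'
os.environ['AWS_S3_ADDRESSING_STYLE'] = 'virtual'

from megfile import smart_open

with smart_open('s3://bucket/key', 'r') as f:
    print(f.read())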

Use command

You can update the configuration file easily with the megfile command: megfile config s3 [OPTIONS] AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY

$ megfile config s3 accesskey secretkey

# for aliyun
$ megfile config s3 accesskey secretkey \
--addressing-style virtual \
--endpoint-url http://oss-cn-hangzhou.aliyuncs.com

Then you can find the configuration in ~/.aws/credentials, like:

[default]
aws_access_key_id = accesskey
aws_secret_access_key = secretkey

s3 =
    addressing_style = virtual
    endpoint_url = http://oss-cn-hangzhou.aliyuncs.com

Config for different s3 servers or authentications

You can operate on s3 files with different endpoint urls, access keys and secret keys. For example, suppose you have two s3 servers with different endpoint_url, access_key and secret_key. With the configuration below, you can use paths with a profile name, like s3+profile_name://bucket/key, to operate on the different s3 servers:

from megfile import smart_sync

smart_sync('s3+profile1://bucket/key', 's3+profile2://bucket/key')

Using environment variables

You need to add a PROFILE_NAME__ prefix to the variable names; a short example follows the list:

  • PROFILE1__AWS_ACCESS_KEY_ID

  • PROFILE1__AWS_SECRET_ACCESS_KEY

  • PROFILE1__OSS_ENDPOINT

  • PROFILE1__AWS_S3_ADDRESSING_STYLE
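
For example, here is a minimal sketch using the profile1 prefix (the credentials, endpoint and bucket/key are placeholders):

import os

os.environ['PROFILE1__AWS_ACCESS_KEY_ID'] = 'accesskey1'
os.environ['PROFILE1__AWS_SECRET_ACCESS_KEY'] = 'secretkey1'
os.environ['PROFILE1__OSS_ENDPOINT'] = 'http://oss-cn-hangzhou.aliyuncs.com'

from megfile import smart_exists

# paths with the s3+profile1:// scheme are resolved with the profile1 credentials
smart_exists('s3+profile1://bucket/key')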

Using command:

megfile config s3 accesskey secretkey --profile-name profile1

Then the config file’s content will be:

[profile1]
aws_access_key_id = accesskey
aws_secret_access_key = secretkey

Hdfs Configuration

Please use the command pip install 'megfile[hdfs]' to install the hdfs requirements. You can configure megfile with environment variables or a configuration file; environment variables take precedence over the configuration file.

Use environment variables

You can use environment variables to set up authentication credentials and other configuration items; a short example follows the list:

  • HDFS_USER: hdfs user

  • HDFS_URL: the url of the hdfs server. To support High Availability namenodes of WebHDFS, simply add more urls delimited with a semicolon (;)

  • HDFS_ROOT: hdfs root directory used when a relative path is given

  • HDFS_TIMEOUT: timeout for requests to the hdfs server

  • HDFS_TOKEN: hdfs token, if the hdfs server requires one

  • HDFS_CONFIG_PATH: hdfs config file, default is ~/.hdfscli.cfg
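
For example, here is a minimal sketch of configuring hdfs access through environment variables (the url, user and path are placeholders):

import os

os.environ['HDFS_URL'] = 'http://127.0.0.1:50070'
os.environ['HDFS_USER'] = 'admin'
os.environ['HDFS_ROOT'] = '/'

from megfile import smart_open

with smart_open('hdfs://path/to/file', 'r') as f:
    print(f.read())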

Use command

You can update the configuration file easily with the megfile command: megfile config hdfs [OPTIONS] URL

$ megfile config hdfs http://127.0.0.1:50070 --user admin --root '/' --token xxx

Then you can find the configuration in ~/.hdfscli.cfg, like:

[global]
default.alias = default

[default.alias]
url = http://127.0.0.1:50070
user = admin
root = /
token = xxx

More information about the configuration file: https://hdfscli.readthedocs.io/en/latest/quickstart.html#configuration

Config for different hdfs servers

You can operate on hdfs files in different hdfs servers. For example, suppose you have two hdfs servers with different urls. With the configuration below, you can use paths with a profile name, like hdfs+profile_name://path/to/file, to operate on the different hdfs servers:

from megfile import smart_sync

smart_sync('hdfs+profile1://path/to/file', 'hdfs+profile2://path/to/file')

Using environment variables

You need to add a PROFILE_NAME__ prefix to the variable names; a short example follows the list:

  • PROFILE1__HDFS_USER

  • PROFILE1__HDFS_URL

  • PROFILE1__HDFS_ROOT

  • PROFILE1__HDFS_TIMEOUT

  • PROFILE1__HDFS_TOKEN
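
For example, here is a minimal sketch using the profile1 prefix, analogous to the s3 case (the url, user and path are placeholders):

import os

os.environ['PROFILE1__HDFS_URL'] = 'http://127.0.0.1:8000'
os.environ['PROFILE1__HDFS_USER'] = 'admin'

from megfile import smart_exists

# paths with the hdfs+profile1:// scheme are resolved with the profile1 settings
smart_exists('hdfs+profile1://path/to/file')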

Using command:

megfile config hdfs http://127.0.0.1:8000 --user admin \
--root /b --token aaa --profile-name profile1

megfile config hdfs http://127.0.0.1:8001 --user admin \
--root /a --token bbb --profile-name profile2

Then the configuration file’s content will be:

[profile1.alias]
url = http://127.0.0.1:8000
user = admin
root = /b
token = aaa

[profile2.alias]
url = http://127.0.0.1:8001
user = admin
root = /a
token = bbb

Sftp Configuration

Sftp is a little different from the other protocols, because you can set some configuration in the path itself (sftp://[username[:password]@]hostname[:port]/file_path). However, we suggest that you do not put the password in the path. You can also use environment variables for configuration; settings in the path take precedence over environment variables.
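
For example, here is a minimal sketch of both styles (the username, hostname, port and path are placeholders; prefer keeping the password out of the path):

from megfile import smart_open

# configuration embedded in the path
with smart_open('sftp://username@hostname:22/path/to/file', 'r') as f:
    print(f.read())

# configuration taken from environment variables (see below)
with smart_open('sftp://hostname/path/to/file', 'r') as f:
    print(f.read())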

Use environment variables

You can use environment variables to set up authentication credentials; a short example follows the list:

  • SFTP_USERNAME

  • SFTP_PASSWORD

  • SFTP_PRIVATE_KEY_PATH: ssh private key path

  • SFTP_PRIVATE_KEY_TYPE: algorithm of the ssh key

  • SFTP_PRIVATE_KEY_PASSWORD: password of the private key; do not set this environment variable if the key has no password

  • SFTP_MAX_UNAUTH_CONN: related to the sftp server's MaxStartups configuration; set it when connecting to the sftp server concurrently
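
For example, here is a minimal sketch of configuring sftp access through environment variables (the username, key path, hostname and file path are placeholders):

import os

os.environ['SFTP_USERNAME'] = 'username'
os.environ['SFTP_PRIVATE_KEY_PATH'] = '/home/username/.ssh/id_rsa'

from megfile import smart_open

with smart_open('sftp://hostname/path/to/file', 'r') as f:
    print(f.read())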