megfile.smart module

class megfile.smart.SmartCacher(path: str, cache_path: str | None = None, mode: str = 'r')[source]

Bases: FileCacher

cache_path = None
megfile.smart.register_copy_func(src_protocol: str, dst_protocol: str, copy_func: Callable | None = None) None[source]

Register a copy function between two protocols. Duplicate registration is not allowed.

Parameters:
  • src_protocol – protocol name of source file, e.g. ‘s3’

  • dst_protocol – protocol name of destination file, e.g. ‘s3’

  • copy_func – copy func, its type is: Callable[[str, str, Optional[Callable[[int], None]], Optional[bool], Optional[bool]], None]
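A minimal sketch of a function with the shape described for copy_func. The parameter names (src_path, dst_path, callback, followlinks, overwrite) are an assumption borrowed from smart_copy, since the type above fixes only their order and types. The local-file body, the `register` helper and the 'myfs' protocol name are hypothetical.

```python
import os
from typing import Callable, Optional


def local_copy(
    src_path: str,
    dst_path: str,
    callback: Optional[Callable[[int], None]] = None,
    followlinks: Optional[bool] = False,  # accepted but ignored in this sketch
    overwrite: Optional[bool] = True,
) -> None:
    """Copy a local file in chunks, reporting progress through callback."""
    if not overwrite and os.path.exists(dst_path):
        return  # assumption: silently keep the existing destination
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            chunk = src.read(64 * 1024)
            if not chunk:
                break
            dst.write(chunk)
            if callback is not None:
                callback(len(chunk))  # bytes copied since the last call


def register() -> None:
    # Imported lazily so local_copy can be tried without megfile installed.
    from megfile.smart import register_copy_func

    # 'myfs' is a hypothetical protocol name used only for illustration.
    register_copy_func("myfs", "myfs", local_copy)
```

Because duplicate registration is not allowed, `register()` is meant to be called once, for example at import time of the module that defines the protocol.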

megfile.smart.smart_abspath(path: str | BasePath | PathLike)[source]

Return the absolute path of given path

Parameters:

path – Given path

Returns:

Absolute path of given path

megfile.smart.smart_access(path: str | BasePath | PathLike, mode: Access) bool[source]

Test if path has access permission described by mode

Parameters:
  • path – Path to be tested

  • mode – Access mode (Access.READ, Access.WRITE, Access.BUCKETREAD, Access.BUCKETWRITE)

Returns:

True if the path has the access permission described by mode, else False

megfile.smart.smart_cache(path, cacher=<class 'megfile.smart.SmartCacher'>, **options)[source]

Return a path to Posixpath Interface

Parameters:
  • path – Path to cache

  • cacher – Cacher used for the path

  • options – Optional arguments for the cacher

megfile.smart.smart_combine_open(path_glob: str, mode: str = 'rb', open_func=<function smart_open>) CombineReader[source]

Open a unified reader that supports multi file reading.

Parameters:
  • path_glob – A path may contain shell wildcard characters

  • mode – Mode to open file, supports ‘rb’

Returns:

A `CombineReader`

megfile.smart.smart_concat(src_paths: List[str | BasePath | PathLike], dst_path: str | BasePath | PathLike) None[source]

Concatenate src_paths to dst_path

Parameters:
  • src_paths – List of source paths

  • dst_path – Destination path

megfile.smart.smart_copy(src_path: str | BasePath | PathLike, dst_path: str | BasePath | PathLike, callback: Callable[[int], None] | None = None, followlinks: bool = False, overwrite: bool = True) None[source]

Copy file from source path to destination path

Here are a few examples:

>>> from tqdm import tqdm
>>> from megfile import smart_copy, smart_stat
>>> class Bar:
...     def __init__(self, total=10):
...         self._bar = tqdm(total=total)
...
...     def __call__(self, bytes_num):
...         self._bar.update(bytes_num)
...
>>> src_path = 'test.png'
>>> dst_path = 'test1.png'
>>> smart_copy(
...     src_path,
...     dst_path,
...     callback=Bar(total=smart_stat(src_path).size), followlinks=False
... )
856960it [00:00, 260592384.24it/s]
Parameters:
  • src_path – Given source path

  • dst_path – Given destination path

  • callback – Called periodically during the copy; its argument is the number of bytes copied since the last call

  • followlinks – If False, treat a symlink as a file; if True, follow it

  • overwrite – Whether to overwrite the destination file if it exists, default is True

megfile.smart.smart_exists(path: str | BasePath | PathLike, followlinks: bool = False) bool[source]

Test if path or s3_url exists

Parameters:

path – Path to be tested

Returns:

True if path exists, else False

megfile.smart.smart_getmd5(path: str | BasePath | PathLike, recalculate: bool = False, followlinks: bool = False)[source]

Get md5 value of file

Parameters:
  • path – File path

  • recalculate – Whether to calculate the md5 in real time; if not, return the s3 etag when path is on s3

  • followlinks – If True, calculate the md5 of the real file

megfile.smart.smart_getmtime(path: str | BasePath | PathLike) float[source]

Get last-modified time of the file on the given s3_url or file path (in Unix timestamp format).

If the path is an existing directory, return the latest modified time of all files in it. The mtime of an empty directory is 1970-01-01 00:00:00

Parameters:

path – Given path

Returns:

Last-modified time

Raises:

FileNotFoundError
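The directory rule above can be illustrated for local paths with the standard library. This is a sketch of the described behaviour, not megfile's implementation; `latest_mtime` is a hypothetical name, and walking into subdirectories is an assumption.

```python
import os


def latest_mtime(path: str) -> float:
    """Return a file's mtime, or the newest file mtime under a directory."""
    if not os.path.exists(path):
        raise FileNotFoundError(path)
    if not os.path.isdir(path):
        return os.path.getmtime(path)
    mtimes = (
        os.path.getmtime(os.path.join(root, name))
        for root, _dirs, files in os.walk(path)
        for name in files
    )
    # An empty directory maps to 0.0, i.e. 1970-01-01 00:00:00 as a Unix timestamp.
    return max(mtimes, default=0.0)
```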

megfile.smart.smart_getsize(path: str | BasePath | PathLike) int[source]

Get file size on the given s3_url or file path (in bytes).

If the path is a directory, return the total size of all files in it, including files in subdirectories (if any).

The result excludes the size of the directory itself. In other words, it returns 0 bytes for an empty directory path.

Parameters:

path – Given path

Returns:

File size

Raises:

FileNotFoundError
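For a local directory, the size rule above amounts to summing file sizes over the whole tree. The following is a standard-library sketch of that rule only; `total_size` is a hypothetical helper, not part of megfile.

```python
import os


def total_size(path: str) -> int:
    """Return a file's size, or the summed size of all files under a directory."""
    if not os.path.exists(path):
        raise FileNotFoundError(path)
    if not os.path.isdir(path):
        return os.path.getsize(path)
    # Directory entries themselves contribute nothing, so an empty tree sums to 0.
    return sum(
        os.path.getsize(os.path.join(root, name))
        for root, _dirs, files in os.walk(path)
        for name in files
    )
```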

megfile.smart.smart_glob(pathname: str | BasePath | PathLike, recursive: bool = True, missing_ok: bool = True) List[str][source]

Given a pathname that may contain shell wildcard characters, return a list of the paths that match the glob pattern, in ascending alphabetical order

Parameters:
  • pathname – A path pattern may contain shell wildcard characters

  • recursive – If False, this function will not glob recursively

  • missing_ok – If False and target path doesn’t match any file, raise FileNotFoundError

megfile.smart.smart_glob_stat(pathname: str | BasePath | PathLike, recursive: bool = True, missing_ok: bool = True) Iterator[FileEntry][source]

Given a pathname that may contain shell wildcard characters, return an iterator of tuples of path and file stat for the paths that match the glob pattern, in ascending alphabetical order

Parameters:
  • pathname – A path pattern may contain shell wildcard characters

  • recursive – If False, this function will not glob recursively

  • missing_ok – If False and target path doesn’t match any file, raise FileNotFoundError

megfile.smart.smart_iglob(pathname: str | BasePath | PathLike, recursive: bool = True, missing_ok: bool = True) Iterator[str][source]

Given a pathname that may contain shell wildcard characters, return an iterator over the paths that match the glob pattern, in ascending alphabetical order

Parameters:
  • pathname – A path pattern may contain shell wildcard characters

  • recursive – If False, this function will not glob recursively

  • missing_ok – If False and target path doesn’t match any file, raise FileNotFoundError

megfile.smart.smart_isabs(path: str | BasePath | PathLike) bool[source]

Test whether a path is absolute

Parameters:

path – Given path

Returns:

True if a path is absolute, else False

megfile.smart.smart_isdir(path: str | BasePath | PathLike, followlinks: bool = False) bool[source]

Test if a file path or an s3 url is a directory

Parameters:

path – Path to be tested

Returns:

True if path is directory, else False

megfile.smart.smart_isfile(path: str | BasePath | PathLike, followlinks: bool = False) bool[source]

Test if a file path or an s3 url is a file

Parameters:

path – Path to be tested

Returns:

True if path is file, else False

megfile.smart.smart_ismount(path: str | BasePath | PathLike) bool[source]

Test whether a path is a mount point

Parameters:

path – Given path

Returns:

True if a path is a mount point, else False

megfile.smart.smart_listdir(path: str | BasePath | PathLike | None = None) List[str][source]

Get all contents of given s3_url or file path. The result is in ascending alphabetical order.

Parameters:

path – Given path

Returns:

All contents of given s3_url or file path in ascending alphabetical order.

Raises:

FileNotFoundError, NotADirectoryError

megfile.smart.smart_load_content(path: str | BasePath | PathLike, start: int | None = None, stop: int | None = None) bytes[source]

Get specified file from [start, stop) in bytes

Parameters:
  • path – Specified path

  • start – start index

  • stop – stop index

Returns:

bytes content in range [start, stop)
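The half-open range [start, stop) means byte `start` is included and byte `stop` is not. A minimal local-file sketch of that convention follows; `load_range` is a hypothetical name, and treating a missing start as 0 and a missing stop as end-of-file is an assumption.

```python
from typing import Optional


def load_range(path: str, start: Optional[int] = None, stop: Optional[int] = None) -> bytes:
    """Read bytes in the half-open range [start, stop) from a local file."""
    offset = 0 if start is None else start
    with open(path, "rb") as f:
        f.seek(offset)
        if stop is None:
            return f.read()  # read through to the end of the file
        return f.read(max(stop - offset, 0))
```

With a file containing `b"0123456789"`, `load_range(path, 2, 5)` yields the three bytes at indices 2, 3 and 4.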

megfile.smart.smart_load_from(path: str | BasePath | PathLike) BinaryIO[source]

Read all content in binary on specified path and write into memory

User should close the BinaryIO manually

Parameters:

path – Specified path

Returns:

BinaryIO

megfile.smart.smart_load_text(path: str | BasePath | PathLike) str[source]

Read content from path

Parameters:

path – Path to be read

megfile.smart.smart_makedirs(path: str | BasePath | PathLike, exist_ok: bool = False) None[source]

Create a directory if the path is on fs. If the path is on s3, this function only checks whether the target exists and whether the bucket has WRITE access

Parameters:
  • path – Given path

  • exist_ok – If False and the target directory already exists, raise FileExistsError

Raises:

PermissionError, FileExistsError

megfile.smart.smart_move(src_path: str | BasePath | PathLike, dst_path: str | BasePath | PathLike, overwrite: bool = True) None[source]

Move file/directory on s3 or fs. s3:// or s3://bucket is not allowed to move

Parameters:
  • src_path – Given source path

  • dst_path – Given destination path

  • overwrite – Whether to overwrite the destination file if it exists

megfile.smart.smart_open(path: str | ~megfile.pathlike.BasePath | ~os.PathLike, mode: str = 'r', encoding: str | None = None, errors: str | None = None, *, s3_open_func: ~typing.Callable[[str, str], ~typing.BinaryIO] = <function s3_buffered_open>, **options) IO[source]

Open a file on the path

Note

On fs, the difference between this function and io.open is that this function creates directories automatically instead of raising FileNotFoundError

Here are a few examples:

>>> import cv2
>>> import numpy as np
>>> raw = smart_open(
...     'https://ss2.bdstatic.com/70cFvnSh_Q1YnxGkpoWK1HF6hhy'
...     '/it/u=2275743969,3715493841&fm=26&gp=0.jpg'
... ).read()
>>> img = cv2.imdecode(np.frombuffer(raw, np.uint8),
...                    cv2.IMREAD_ANYDEPTH | cv2.IMREAD_COLOR)
Parameters:
  • path – Given path

  • mode – Mode to open file, supports r’[rwa][tb]?+?’

  • encoding – encoding is the name of the encoding used to decode or encode the file. This should only be used in text mode.

  • errors – errors is an optional string that specifies how encoding and decoding errors are to be handled—this cannot be used in binary mode.

  • buffering – buffering is an optional integer used to set the buffering policy. Only used where supported.

  • followlinks – follow symbolic links, default False. Only used where supported.

  • s3_open_func – Function used to open s3_url. The function must accept 2 required parameters, file path and mode. Only used for s3 paths.

  • max_workers – Max download / upload thread number. None by default, which uses the global thread pool with 8 threads. Only used for s3, http, hdfs.

  • max_buffer_size – Max cached buffer size in memory, 128MB by default. Setting it to 0 disables the cache. Only used for s3, http, hdfs.

  • block_forward – Number of blocks of data cached from the offset position, only for read mode. Only used for s3, http, hdfs.

  • block_size – Size of a single block. Each block is uploaded by a single thread. Only used for s3, http, hdfs.

Returns:

File-Like object

Raises:

FileNotFoundError, IsADirectoryError, ValueError
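The note above about directories being created automatically on fs can be pictured with a small standard-library wrapper. This is a sketch of the described behaviour for local paths only; `open_creating_parents` is a hypothetical helper, and limiting the directory creation to write and append modes is an assumption.

```python
import os
from typing import IO


def open_creating_parents(path: str, mode: str = "r", **kwargs) -> IO:
    """Open a local file, creating missing parent directories when writing."""
    if any(flag in mode for flag in "wa"):
        parent = os.path.dirname(path)
        if parent:
            # The built-in open() would raise FileNotFoundError here instead.
            os.makedirs(parent, exist_ok=True)
    return open(path, mode, **kwargs)
```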

megfile.smart.smart_path_join(path: str | BasePath | PathLike, *other_paths: str | BasePath | PathLike) str[source]

Concatenate two or more paths into a complete path

Parameters:
  • path – Given path

  • other_paths – Paths to be concatenated

Returns:

Concatenated complete path

Note

For URI, the difference between this function and os.path.join is that this function ignores left side slash (which indicates absolute path) in other_paths and will directly concat.

e.g. os.path.join(‘s3://path’, ‘to’, ‘/file’) => ‘/file’, and smart_path_join(‘s3://path’, ‘to’, ‘/file’) => ‘/path/to/file’

But for fs path, this function behaves exactly like os.path.join

e.g. smart_path_join(‘/path’, ‘to’, ‘/file’) => ‘/file’
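The contrast in this note can be tried with the standard library alone. In the sketch below, `uri_join` is a hypothetical stand-in for the "ignore the leading slash and concatenate" rule, not megfile's code; it keeps the s3:// prefix in its result, which is an assumption, so check the library for the exact string it returns.

```python
import posixpath


def uri_join(path: str, *other_paths: str) -> str:
    """Join URI parts, ignoring any leading slash on the later parts."""
    result = path
    for part in other_paths:
        separator = "" if result.endswith("/") else "/"
        result = result + separator + part.lstrip("/")
    return result


# posixpath.join restarts at any absolute component, so the prefix is lost.
print(posixpath.join("s3://path", "to", "/file"))
# The sketch keeps every component.
print(uri_join("s3://path", "to", "/file"))
```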

Return a string representing the path to which the symbolic link points.

Parameters:

path – Path to be read

Returns:

A string representing the path to which the symbolic link points

megfile.smart.smart_realpath(path: str | BasePath | PathLike)[source]

Return the real path of given path

Parameters:

path – Given path

Returns:

Real path of given path

megfile.smart.smart_relpath(path: str | BasePath | PathLike, start=None)[source]

Return the relative path of given path

Parameters:
  • path – Given path

  • start – Given start directory

Returns:

Relative path from start

megfile.smart.smart_remove(path: str | BasePath | PathLike, missing_ok: bool = False) None[source]

Remove the file or directory on s3 or fs, s3:// and s3://bucket are not permitted to remove

Parameters:
  • path – Given path

  • missing_ok – If False and the target file/directory does not exist, raise FileNotFoundError

Raises:

PermissionError, FileNotFoundError

megfile.smart.smart_rename(src_path: str | BasePath | PathLike, dst_path: str | BasePath | PathLike, overwrite: bool = True) None[source]

Move file on s3 or fs. s3:// or s3://bucket is not allowed to move

Parameters:
  • src_path – Given source path

  • dst_path – Given destination path

  • overwrite – Whether to overwrite the destination file if it exists

megfile.smart.smart_save_as(file_object: BinaryIO, path: str | BasePath | PathLike) None[source]

Write the opened binary stream to specified path, but the stream won’t be closed

Parameters:
  • file_object – Stream to be read

  • path – Specified target path

megfile.smart.smart_save_content(path: str | BasePath | PathLike, content: bytes) None[source]

Save bytes content to specified path

Parameters:

path – Path to save content

megfile.smart.smart_save_text(path: str | BasePath | PathLike, text: str) None[source]

Save text to specified path

Parameters:

path – Path to save text

megfile.smart.smart_scan(path: str | BasePath | PathLike, missing_ok: bool = True, followlinks: bool = False) Iterator[str][source]

Iteratively traverse only files in given directory, in alphabetical order. Every iteration on generator yields a path string.

If path is a file path, yield the file only.

If path is a non-existent path, return an empty generator.

If path is a bucket path, return all file paths in the bucket.

Parameters:
  • path – Given path

  • missing_ok – If False and there’s no file in the directory, raise FileNotFoundError

Raises:

UnsupportedError

Returns:

A file path generator
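The three cases described for this function can be mirrored for local paths with the standard library. This is an illustrative sketch rather than megfile's implementation; `scan_files` is a hypothetical name, and sorting the full paths is one reading of "alphabetical order".

```python
import os
from typing import Iterator


def scan_files(path: str) -> Iterator[str]:
    """Yield only files: the path itself, or every file under a directory."""
    if os.path.isfile(path):
        yield path  # a file path yields just that file
        return
    if not os.path.isdir(path):
        return  # a non-existent path gives an empty generator
    found = [
        os.path.join(root, name)
        for root, _dirs, files in os.walk(path)
        for name in files
    ]
    yield from sorted(found)
```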

megfile.smart.smart_scan_stat(path: str | BasePath | PathLike, missing_ok: bool = True, followlinks: bool = False) Iterator[FileEntry][source]

Iteratively traverse only files in given directory, in alphabetical order. Every iteration on generator yields a tuple of path string and file stat

Parameters:
  • path – Given path

  • missing_ok – If False and there’s no file in the directory, raise FileNotFoundError

Raises:

UnsupportedError

Returns:

A generator of tuples of path string and file stat

megfile.smart.smart_scandir(path: str | BasePath | PathLike | None = None) Iterator[FileEntry][source]

Get all content of given s3_url or file path.

Parameters:

path – Given path

Returns:

An iterator containing all contents that have the prefix path

Raises:

FileNotFoundError, NotADirectoryError

megfile.smart.smart_stat(path: str | BasePath | PathLike, follow_symlinks=True) StatResult[source]

Get StatResult of s3_url or file path

Parameters:

path – Given path

Returns:

StatResult

Raises:

FileNotFoundError

Create a symbolic link named dst_path pointing to src_path.

Parameters:
  • src_path – Source path

  • dst_path – Destination path

megfile.smart.smart_sync(src_path: str | ~megfile.pathlike.BasePath | ~os.PathLike, dst_path: str | ~megfile.pathlike.BasePath | ~os.PathLike, callback: ~typing.Callable[[str, int], None] | None = None, followlinks: bool = False, callback_after_copy_file: ~typing.Callable[[str, str], None] | None = None, src_file_stats: ~typing.Iterable[~megfile.pathlike.FileEntry] | None = None, map_func: ~typing.Callable[[~typing.Callable, ~typing.Iterable], ~typing.Any] = <class 'map'>, force: bool = False, overwrite: bool = True) None[source]

Sync file or directory

Note

When the parameter is a file, this function behaves like smart_copy.

If a file and a directory have the same name at the same level, sync treats the path as a file first

Here are a few examples:

>>> from tqdm import tqdm
>>> from threading import Lock
>>> from megfile import smart_sync, smart_stat, smart_glob
>>> class Bar:
...     def __init__(self, total_file):
...         self._total_file = total_file
...         self._bar = None
...         self._now = None
...         self._file_index = 0
...         self._lock = Lock()
...     def __call__(self, path, num_bytes):
...         with self._lock:
...             if path != self._now:
...                 self._file_index += 1
...                 print("copy file {}/{}:".format(self._file_index,
...                                                 self._total_file))
...                 if self._bar:
...                     self._bar.close()
...                 self._bar = tqdm(total=smart_stat(path).size)
...                 self._now = path
...             self._bar.update(num_bytes)
>>> total_file = len(list(smart_glob('src_path')))
>>> smart_sync('src_path', 'dst_path', callback=Bar(total_file=total_file))
Parameters:
  • src_path – Given source path

  • dst_path – Given destination path

  • callback – Called periodically during the copy; its arguments are the file path and the number of bytes copied since the last call

  • followlinks – If False, treat a symlink as a file; if True, follow it

  • callback_after_copy_file – Called after a file is copied successfully; its arguments are the src file path and the dst file path

  • src_file_stats – If this parameter is not None, only the files it lists are synced, and src_path is the root path of these files, used to calculate the path of each target file. This parameter exists to reduce file traversal times.

  • map_func – A callable that behaves like map. You can use ThreadPoolExecutor.map, Pool.map and so on if you need concurrency. Default is the standard library map.

  • force – Sync files forcibly, without skipping identical files. Takes priority over ‘overwrite’, default is False

  • overwrite – Whether to overwrite the destination file if it exists, default is True
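A sketch of how callback and map_func might be combined, assuming smart_sync consumes the iterator returned by map_func. `BytesPerFile` and `sync_concurrently` are hypothetical names; the callback takes a lock because a concurrent map_func may invoke it from several threads.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from threading import Lock
from typing import Dict


class BytesPerFile:
    """Callback for smart_sync: accumulate copied bytes per source path."""

    def __init__(self) -> None:
        self._lock = Lock()
        self.copied: Dict[str, int] = defaultdict(int)

    def __call__(self, path: str, num_bytes: int) -> None:
        with self._lock:
            self.copied[path] += num_bytes


def sync_concurrently(src_path: str, dst_path: str, workers: int = 4) -> Dict[str, int]:
    # Imported lazily so BytesPerFile can be tried without megfile installed.
    from megfile import smart_sync

    progress = BytesPerFile()
    with ThreadPoolExecutor(max_workers=workers) as executor:
        # executor.map has the same call shape as the built-in map.
        smart_sync(src_path, dst_path, callback=progress, map_func=executor.map)
    return dict(progress.copied)
```

Leaving the `with` block waits for all submitted copy tasks, so the returned totals are complete by the time the function returns.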

megfile.smart.smart_sync_with_progress(src_path, dst_path, callback: ~typing.Callable[[str, int], None] | None = None, followlinks: bool = False, map_func: ~typing.Callable[[~typing.Callable, ~typing.Iterable], ~typing.Iterator] = <class 'map'>, force: bool = False, overwrite: bool = True)[source]

Sync file or directory with progress bar

Parameters:
  • src_path – Given source path

  • dst_path – Given destination path

  • callback – Called periodically during the copy; its arguments are the file path and the number of bytes copied since the last call

  • followlinks – If False, treat a symlink as a file; if True, follow it

  • callback_after_copy_file – Called after a file is copied successfully; its arguments are the src file path and the dst file path

  • src_file_stats – If this parameter is not None, only the files it lists are synced, and src_path is the root path of these files, used to calculate the path of each target file. This parameter exists to reduce file traversal times.

  • map_func – A callable that behaves like map. You can use ThreadPoolExecutor.map, Pool.map and so on if you need concurrency. Default is the standard library map.

  • force – Sync files forcibly, without skipping identical files. Takes priority over ‘overwrite’, default is False

  • overwrite – Whether to overwrite the destination file if it exists, default is True

megfile.smart.smart_touch(path: str | BasePath | PathLike)[source]

Create a new file on path

Parameters:

path – Path to create file

Remove the file on s3 or fs

Parameters:
  • path – Given path

  • missing_ok – If False and the target file does not exist, raise FileNotFoundError

Raises:

PermissionError, FileNotFoundError, IsADirectoryError

megfile.smart.smart_walk(path: str | BasePath | PathLike, followlinks: bool = False) Iterator[Tuple[str, List[str], List[str]]][source]

Generate the file names in a directory tree by walking the tree top-down. For each directory in the tree rooted at directory path (including path itself), it yields a 3-tuple (root, dirs, files).

  • root: a string of current path

  • dirs: name list of subdirectories (excluding ‘.’ and ‘..’ if they exist) in ‘root’. The list is sorted by ascending alphabetical order

  • files: name list of non-directory files (link is regarded as file) in ‘root’. The list is sorted by ascending alphabetical order

If path does not exist, return an empty generator.

If path is a file, return an empty generator.

If walk() is applied to an unsupported path, raise UnsupportedError.

Parameters:

path – Given path

Raises:

UnsupportedError

Returns:

A 3-tuple generator
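For local paths, the (root, dirs, files) contract described above resembles os.walk with sorted name lists. The following is a standard-library sketch of that contract, not megfile's implementation; `walk_sorted` is a hypothetical name.

```python
import os
from typing import Iterator, List, Tuple


def walk_sorted(path: str) -> Iterator[Tuple[str, List[str], List[str]]]:
    """Walk a local tree top-down, yielding sorted dirs and files per root."""
    # os.walk yields nothing for a missing path or a plain file.
    for root, dirs, files in os.walk(path):
        dirs.sort()  # in-place sort also fixes the order of descent
        yield root, dirs, sorted(files)
```

A typical loop is `for root, dirs, files in walk_sorted(path): ...`, joining `root` with each name in `files` to obtain full paths.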