megfile.hdfs_path module

class megfile.hdfs_path.HdfsPath(path: str | BasePath | PathLike, *other_paths: str | BasePath | PathLike)[source]

Bases: URIPath

absolute() HdfsPath[source]

Make the path absolute, without normalizing it or resolving symlinks. Returns a new path object.

exists(followlinks: bool = False) bool[source]

Test if path exists

If the bucket of the path is not permitted to be read, return False

Returns:

True if path exists, else False

getmtime(follow_symlinks: bool = False) float[source]

Get the last-modified time of the file at the given path (as a Unix timestamp). If the path is an existing directory, return the latest modification time of all files in it. The mtime of an empty directory is 1970-01-01 00:00:00

If the path does not exist, which means hdfs_exist(path) returns False, raise FileNotFoundError

Returns:

Last-modified time

Raises:

FileNotFoundError
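The directory rule can be illustrated without an HDFS cluster. The helper below is hypothetical and not part of megfile. It assumes the documented mtime of an empty directory corresponds to the Unix timestamp 0.

```python
from typing import Iterable


def directory_mtime(file_mtimes: Iterable[float]) -> float:
    """Return the latest mtime among the files, or 0.0 if there are none."""
    # default=0.0 models the documented 1970-01-01 00:00:00 for an empty directory
    return max(file_mtimes, default=0.0)


print(directory_mtime([1700000000.0, 1700000500.5]))  # 1700000500.5
print(directory_mtime([]))                            # 0.0
```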

getsize(follow_symlinks: bool = False) int[source]

Get the size of the file at the given path (in bytes). If the path is a directory, return the sum of the sizes of all files in it, including files in subdirectories (if any).

The result excludes the size of the directory itself. In other words, return 0 bytes for an empty directory path.

If the path does not exist, which means hdfs_exist(path) returns False, raise FileNotFoundError

Returns:

File size

Raises:

FileNotFoundError
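As a rough model of the directory case, the sketch below sums file sizes over an in-memory listing. The listing, its paths, and the helper name are all made up for illustration. It mirrors the documented behaviour, not megfile's implementation.

```python
from typing import Dict


def directory_size(files: Dict[str, int], directory: str) -> int:
    """Sum the sizes of all files under `directory`, including subdirectories."""
    prefix = directory.rstrip("/") + "/"
    # Only files contribute; the directory itself adds nothing.
    return sum(size for path, size in files.items() if path.startswith(prefix))


# Hypothetical listing: path -> size in bytes
listing = {
    "hdfs://bucket/data/a.bin": 100,
    "hdfs://bucket/data/sub/b.bin": 250,
    "hdfs://bucket/other/c.bin": 7,
}
print(directory_size(listing, "hdfs://bucket/data"))   # 350
print(directory_size(listing, "hdfs://bucket/empty"))  # 0
```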

glob(pattern, recursive: bool = True, missing_ok: bool = True) List[HdfsPath][source]

Return a list of hdfs paths that match the glob pattern

Notes: Globbing only works within a bucket. If trying to match the bucket with wildcard characters, raise UnsupportedError

Parameters:
  • pattern – Glob the given relative pattern in the directory represented by this path

  • recursive – If False, ** will not search directories recursively

  • missing_ok – If False and target path doesn’t match any file, raise FileNotFoundError

Raises:

UnsupportedError, when bucket part contains wildcard characters

Returns:

A list of paths that match hdfs_pathname
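The effect of the recursive flag on ** can be tried locally with the standard library's glob module, which has a parameter of the same name. This is only an analogue on a temporary local directory. It assumes HdfsPath.glob treats ** similarly. Note that the standard library defaults to recursive=False, whereas the signature above defaults to True.

```python
import glob
import os
import tempfile


def relative(paths, root):
    """Return sorted paths relative to root, using forward slashes."""
    return sorted(os.path.relpath(p, root).replace(os.sep, "/") for p in paths)


with tempfile.TemporaryDirectory() as root:
    # Layout: root/a.txt and root/sub/b.txt
    os.makedirs(os.path.join(root, "sub"))
    for name in ("a.txt", os.path.join("sub", "b.txt")):
        with open(os.path.join(root, name), "w") as f:
            f.write("x")

    pattern = os.path.join(root, "**", "*.txt")
    recursive = relative(glob.glob(pattern, recursive=True), root)
    # Without recursion, ** behaves like a single * path segment
    flat = relative(glob.glob(pattern, recursive=False), root)

print(recursive)  # ['a.txt', 'sub/b.txt']
print(flat)       # ['sub/b.txt']
```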

glob_stat(pattern, recursive: bool = True, missing_ok: bool = True) Iterator[FileEntry][source]

Return a generator of tuples of path and file stat, in which the path matches the glob pattern

Notes: Globbing only works within a bucket. If trying to match the bucket with wildcard characters, raise UnsupportedError

Parameters:
  • pattern – Glob the given relative pattern in the directory represented by this path

  • recursive – If False, ** will not search directories recursively

  • missing_ok – If False and target path doesn’t match any file, raise FileNotFoundError

Raises:

UnsupportedError, when bucket part contains wildcard characters

Returns:

A generator of tuples of path and file stat, in which the paths match hdfs_pathname

iglob(pattern, recursive: bool = True, missing_ok: bool = True) Iterator[HdfsPath][source]

Return an iterator of hdfs paths that match the glob pattern

Notes: Globbing only works within a bucket. If trying to match the bucket with wildcard characters, raise UnsupportedError

Parameters:
  • pattern – Glob the given relative pattern in the directory represented by this path

  • recursive – If False, ** will not search directories recursively

  • missing_ok – If False and target path doesn’t match any file, raise FileNotFoundError

Raises:

UnsupportedError, when bucket part contains wildcard characters

Returns:

An iterator of paths that match hdfs_pathname

is_dir(followlinks: bool = False) bool[source]

Test if an hdfs url is a directory. The url is treated as a directory in either of the following cases:

  • There exists a suffix for which os.path.join(path, suffix) is a file

  • The url is an empty bucket or hdfs://

Parameters:

followlinks – The result is the same whether followlinks is True or False, because hdfs symlinks do not support directories.

Returns:

True if path is hdfs directory, else False

is_file(followlinks: bool = False) bool[source]

Test if a path is a file

Returns:

True if path is hdfs file, else False

iterdir() Iterator[HdfsPath][source]

Get all contents of the given path.

Returns:

All contents that have the path as a prefix.

Raises:

FileNotFoundError, NotADirectoryError

listdir() List[str][source]

Get all contents of the given path.

Returns:

All contents that have the path as a prefix.

Raises:

FileNotFoundError, NotADirectoryError

load() BinaryIO[source]

Read all content at the specified path in binary mode and load it into memory

The caller must close the returned BinaryIO manually

Returns:

BinaryIO
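Since the returned BinaryIO must be closed by the caller, a with-block is one way to make sure that happens. The sketch below accepts any object with a load() method. FakePath is a stand-in invented here so the example runs without an HDFS cluster; with megfile installed, an HdfsPath instance would presumably be passed instead.

```python
import io
from typing import BinaryIO, Protocol


class Loadable(Protocol):
    def load(self) -> BinaryIO: ...


def read_all(path: Loadable) -> bytes:
    """Load the whole object into memory and close the buffer afterwards."""
    with path.load() as buffer:  # leaving the with-block closes the BinaryIO
        return buffer.read()


class FakePath:
    """Hypothetical stand-in that serves fixed bytes from memory."""

    def __init__(self, payload: bytes) -> None:
        self.buffer = io.BytesIO(payload)

    def load(self) -> BinaryIO:
        return self.buffer


fake = FakePath(b"hello hdfs")
print(read_all(fake))      # b'hello hdfs'
print(fake.buffer.closed)  # True
```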

md5(recalculate: bool = False, followlinks: bool = False) str[source]

Get checksum of the file or dir.

Parameters:
  • recalculate – Ignored; kept only for compatibility

  • followlinks – Ignored; kept only for compatibility

Returns:

checksum

mkdir(mode=511, parents: bool = False, exist_ok: bool = False)[source]

Create an hdfs directory. Purely creating a directory is invalid because it’s unavailable on OSS. This function tests whether the target bucket has WRITE access.

Parameters:
  • mode – Octal permission to set on the newly created directory. These permissions will only be set on directories that do not already exist.

  • parents – Ignored; kept only for compatibility with pathlib.Path

  • exist_ok – If False and target directory exists, raise FileExistsError

Raises:

BucketNotFoundError, FileExistsError
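The default mode=511 in the signature is a decimal rendering of an octal permission value. A quick check with the standard library shows which permission bits it stands for:

```python
import stat

DEFAULT_MODE = 511  # the default shown in the mkdir signature

print(oct(DEFAULT_MODE))                           # 0o777
print(stat.filemode(stat.S_IFDIR | DEFAULT_MODE))  # drwxrwxrwx
```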

move(dst_path: str | BasePath | PathLike, overwrite: bool = True) None[source]

Move the file or directory at this path to dst_path

Parameters:

dst_path – Given destination path

open(mode: str = 'r', *, buffering: int | None = None, encoding: str | None = None, errors: str | None = None, max_workers: int | None = None, max_buffer_size: int = 134217728, block_forward: int | None = None, block_size: int = 8388608, **kwargs) IO[source]

Open a file on the specified path.

Parameters:
  • mode – Mode to open the file. Supports ‘r’, ‘rb’, ‘w’, ‘wb’, ‘a’, ‘ab’.

  • buffering – Optional integer used to set the buffering policy.

  • encoding – Name of the encoding used to decode or encode the file. Should only be used in text mode.

  • errors – Optional string specifying how encoding and decoding errors are to be handled. Cannot be used in binary mode.

  • max_workers – Maximum number of download threads. None by default, in which case the global thread pool with 8 threads is used.

  • max_buffer_size – Maximum size of the in-memory cache buffer, 128MB by default. Setting it to 0 disables the cache.

  • block_forward – Number of blocks of data for reader cached from the offset position.

  • block_size – Size of a single block for reader, default is 8MB.

Returns:

A file-like object.

Raises:

ValueError – If an unacceptable mode is provided.
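The byte values in the signature map onto the sizes quoted in the parameter list. The arithmetic below only checks that relationship and how many default-sized blocks fit into the default buffer. How the reader actually schedules its blocks is not stated here and is not modelled.

```python
MAX_BUFFER_SIZE = 134217728  # default max_buffer_size from the signature
BLOCK_SIZE = 8388608         # default block_size from the signature

MIB = 1024 * 1024
buffer_mib = MAX_BUFFER_SIZE // MIB
block_mib = BLOCK_SIZE // MIB
blocks_in_buffer = MAX_BUFFER_SIZE // BLOCK_SIZE

print(buffer_mib, block_mib, blocks_in_buffer)  # 128 8 16
```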

property parts: Tuple[str, ...]

A tuple giving access to the path’s various components

property path_with_protocol: str

Return path with protocol, like hdfs://path

property path_without_protocol: str

Return path without protocol, example: if path is hdfs://path, return path
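The relationship between the two properties can be sketched at string level, using the class's protocol value 'hdfs'. The helper names are made up for illustration and this is not megfile's implementation.

```python
PROTOCOL = "hdfs"
PREFIX = PROTOCOL + "://"


def with_protocol(path: str) -> str:
    """Return the path with the hdfs:// prefix, adding it only if missing."""
    return path if path.startswith(PREFIX) else PREFIX + path


def without_protocol(path: str) -> str:
    """Return the path with a leading hdfs:// removed, if present."""
    return path[len(PREFIX):] if path.startswith(PREFIX) else path


print(with_protocol("bucket/key"))            # hdfs://bucket/key
print(without_protocol("hdfs://bucket/key"))  # bucket/key
```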

protocol = 'hdfs'

remove(missing_ok: bool = False) None[source]

Remove the file or directory on hdfs. Removing hdfs:// or hdfs://bucket is not permitted

Parameters:

missing_ok – If False and the target file/directory does not exist, raise FileNotFoundError

Raises:

FileNotFoundError, UnsupportedError

rename(dst_path: str | BasePath | PathLike, overwrite: bool = True) HdfsPath[source]

Move the hdfs file at this path to dst_path

Parameters:
  • dst_path – Given destination path

  • overwrite – Whether to overwrite the destination file if it already exists

save(file_object: BinaryIO)[source]

Write the opened binary stream to the specified path. The stream is not closed afterwards

Parameters:

file_object – Stream to be read

scan(missing_ok: bool = True, followlinks: bool = False) Iterator[str][source]

Iteratively traverse only files in given hdfs directory. Every iteration on generator yields a path string.

If path is a file path, yield the file only

If path is a non-existent path, return an empty generator

If path is a bucket path, return all file paths in the bucket

If path is an empty bucket, return an empty generator

If path doesn’t contain any bucket, which is path == ‘hdfs://’, raise UnsupportedError. walk() on complete hdfs is not supported in megfile

Parameters:

missing_ok – If False and there’s no file in the directory, raise FileNotFoundError

Raises:

UnsupportedError

Returns:

A file path generator

scan_stat(missing_ok: bool = True, followlinks: bool = False) Iterator[FileEntry][source]

Iteratively traverse only files in given directory. Every iteration on generator yields a tuple of path string and file stat

Parameters:

missing_ok – If False and there’s no file in the directory, raise FileNotFoundError

Raises:

UnsupportedError

Returns:

A generator of file entries (path string and file stat)

scandir() ContextIterator[source]

Get all contents of the given path. The results are returned in arbitrary order.

Returns:

All contents that have the path as a prefix

Raises:

FileNotFoundError, NotADirectoryError

stat(follow_symlinks=True) StatResult[source]

Get the StatResult of the file at this path, including file size and mtime; see hdfs_getsize and hdfs_getmtime

If the path does not exist, which means hdfs_exist(path) returns False, raise FileNotFoundError

If attempting to get the StatResult of the complete hdfs, such as hdfs_dir_url == ‘hdfs://’, raise BucketNotFoundError

Returns:

StatResult

Raises:

FileNotFoundError

unlink(missing_ok: bool = False) None[source]

Remove the file on hdfs

Parameters:

missing_ok – If False and the target file does not exist, raise FileNotFoundError

Raises:

FileNotFoundError, IsADirectoryError

walk(followlinks: bool = False) Iterator[Tuple[str, List[str], List[str]]][source]

Iteratively traverse the given hdfs directory in top-down order. In other words, traverse the parent directory first, then, if subdirectories exist, traverse the subdirectories.

Every iteration on generator yields a 3-tuple: (root, dirs, files)

  • root: Current hdfs path;

  • dirs: Name list of subdirectories in current directory.

  • files: Name list of files in current directory.

If path is a file path, return an empty generator

If path is a non-existent path, return an empty generator

If path is a bucket path, bucket will be the top directory, and will be returned at first iteration of generator

If path is an empty bucket, yield only one 3-tuple (note: hdfs doesn’t have empty directories)

If path doesn’t contain any bucket, which is path == ‘hdfs://’, raise UnsupportedError. walk() on complete hdfs is not supported in megfile

Parameters:

followlinks – The result is the same whether followlinks is True or False, because hdfs does not support symlinks.

Returns:

A 3-tuple generator
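The (root, dirs, files) tuples can be flattened into full file paths, which is roughly the view that scan() gives. The sketch below works on hand-written tuples with hypothetical paths so it runs without a cluster. The helper name is made up, and joining with "/" is an assumption about how roots and names combine.

```python
from typing import Iterable, Iterator, List, Tuple

WalkEntry = Tuple[str, List[str], List[str]]


def files_from_walk(entries: Iterable[WalkEntry]) -> Iterator[str]:
    """Yield the full path of every file described by (root, dirs, files) tuples."""
    for root, _dirs, files in entries:
        separator = "" if root.endswith("/") else "/"
        for name in files:
            yield root + separator + name


# Hand-written tuples in the documented top-down order
entries = [
    ("hdfs://bucket", ["logs"], ["README"]),
    ("hdfs://bucket/logs", [], ["a.log", "b.log"]),
]
for file_path in files_from_walk(entries):
    print(file_path)
```

Because the traversal is top-down, files in a parent directory are yielded before files in its subdirectories.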