megfile.hdfs module

megfile.hdfs.hdfs_exists(path: str | BasePath | PathLike, followlinks: bool = False) → bool

Test if path exists

If the bucket of path is not readable, return False

Parameters: path – Given path
Returns: True if path exists, else False

megfile.hdfs.hdfs_getmd5(path: str | BasePath | PathLike, recalculate: bool = False, followlinks: bool = False) → str

Get checksum of the file or dir.

Parameters:

path – Given path
recalculate – Ignored; kept only for compatibility
followlinks – Ignored; kept only for compatibility

Returns:

checksum

megfile.hdfs.hdfs_getmtime(path: str | BasePath | PathLike, follow_symlinks: bool = False) → float

Get the last-modified time of the file at path, as a Unix timestamp. If path is an existing directory, return the latest modified time of all files in it. The mtime of an empty directory is 1970-01-01 00:00:00

If path does not exist, i.e. hdfs_exists(path) returns False, raise FileNotFoundError

Parameters: path – Given path
Returns: Last-modified time
Raises: FileNotFoundError
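
A minimal sketch of how the return value might be consumed. The helper names mtime_as_datetime and last_modified are hypothetical, as is any path passed in; only hdfs_getmtime and its FileNotFoundError behaviour come from the description above.

```python
from datetime import datetime, timezone


def mtime_as_datetime(timestamp: float) -> datetime:
    """Convert a Unix timestamp (seconds) into a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(timestamp, tz=timezone.utc)


def last_modified(path: str):
    """Return the mtime of `path` as a datetime, or None if the path is missing.

    Hypothetical helper, not part of megfile.
    """
    # Imported lazily so mtime_as_datetime is usable without megfile installed.
    from megfile.hdfs import hdfs_getmtime

    try:
        return mtime_as_datetime(hdfs_getmtime(path))
    except FileNotFoundError:
        # Documented behaviour for a non-existent path.
        return None
```

Under this sketch an empty directory, whose mtime is documented as 1970-01-01 00:00:00, would be converted from whatever timestamp hdfs_getmtime reports for it, and a missing path would yield None instead of an exception.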

megfile.hdfs.hdfs_getsize(path: str | BasePath | PathLike, follow_symlinks: bool = False) → int

Get the size of the file at path, in bytes. If path is a directory, return the total size of all files in it, including files in subdirectories (if any).

The result excludes the size of the directory itself. In other words, an empty directory path returns 0 bytes.

If path does not exist, i.e. hdfs_exists(path) returns False, raise FileNotFoundError

Parameters: path – Given path
Returns: File size
Raises: FileNotFoundError
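
Since the result is a plain byte count, a caller will often want to format it. The following is a sketch only: format_size and describe_size are hypothetical helpers, and the binary units are a presentation choice rather than anything megfile prescribes.

```python
def format_size(num_bytes: int) -> str:
    """Render a byte count with binary units (1 KiB = 1024 bytes)."""
    size = float(num_bytes)
    for unit in ("B", "KiB", "MiB", "GiB"):
        if size < 1024:
            return f"{size:.1f} {unit}"
        size /= 1024
    return f"{size:.1f} TiB"


def describe_size(path: str) -> str:
    """Return a human-readable size for a file or directory path.

    Hypothetical helper, not part of megfile.
    """
    # Imported lazily so format_size is usable without megfile installed.
    from megfile.hdfs import hdfs_getsize

    # For a directory, hdfs_getsize is documented to sum all contained files.
    return format_size(hdfs_getsize(path))
```

For example, format_size(1536) gives "1.5 KiB", and an empty directory, documented to report 0 bytes, would be described as "0.0 B".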

megfile.hdfs.hdfs_glob(path: str | BasePath | PathLike, recursive: bool = True, missing_ok: bool = True) → List[str]

Return a list of hdfs paths that match the glob pattern in path, in ascending alphabetical order

Notes: Globbing only works inside a bucket. If the bucket part contains wildcard characters, raise UnsupportedError

Parameters:

recursive – If False, ** will not search directory recursively
missing_ok – If False and target path doesn’t match any file, raise FileNotFoundError

Raises:

UnsupportedError, when bucket part contains wildcard characters

Returns:

A list of the paths that match path

megfile.hdfs.hdfs_glob_stat(path: str | BasePath | PathLike, recursive: bool = True, missing_ok: bool = True) → Iterator[FileEntry]

Return a generator of tuples of path and file stat for the paths that match the glob pattern in path, in ascending alphabetical order

Notes: Globbing only works inside a bucket. If the bucket part contains wildcard characters, raise UnsupportedError

Parameters:

recursive – If False, ** will not search directory recursively
missing_ok – If False and target path doesn’t match any file, raise FileNotFoundError

Raises:

UnsupportedError, when bucket part contains wildcard characters

Returns:

A generator of tuples of path and file stat, for the paths that match path

megfile.hdfs.hdfs_iglob(path: str | BasePath | PathLike, recursive: bool = True, missing_ok: bool = True) → Iterator[str]

Return an iterator over the hdfs paths that match the glob pattern in path, in ascending alphabetical order

Notes: Globbing only works inside a bucket. If the bucket part contains wildcard characters, raise UnsupportedError

Parameters:

recursive – If False, ** will not search directory recursively
missing_ok – If False and target path doesn’t match any file, raise FileNotFoundError

Raises:

UnsupportedError, when bucket part contains wildcard characters

Returns:

An iterator over the paths that match path
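
Because hdfs_iglob returns an iterator rather than a list, a caller can stop early without collecting every match. A minimal sketch, assuming the iterator yields matches on demand; first_matches is a hypothetical helper and the pattern passed to it is up to the caller.

```python
from itertools import islice


def first_matches(pattern: str, limit: int = 10) -> list:
    """Return at most `limit` paths matching `pattern`.

    Hypothetical helper, not part of megfile.
    """
    # Imported lazily so this module can be imported without megfile installed.
    from megfile.hdfs import hdfs_iglob

    # islice stops pulling from the iterator once `limit` items were taken.
    matches = hdfs_iglob(pattern, recursive=True, missing_ok=True)
    return list(islice(matches, limit))
```

With missing_ok=True, a pattern that matches nothing simply produces an empty list here; passing missing_ok=False instead would surface the documented FileNotFoundError.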

megfile.hdfs.hdfs_isdir(path: str | BasePath | PathLike, followlinks: bool = False) → bool

Test if an hdfs url is a directory. Specific procedures are as follows:

If there exists a suffix for which os.path.join(path, suffix) is a file

If the url is an empty bucket or hdfs://

Parameters:

path – Given path
followlinks – The result is the same whether followlinks is True or False, because hdfs symlinks do not support directories.

Returns:

True if path is hdfs directory, else False

megfile.hdfs.hdfs_isfile(path: str | BasePath | PathLike, followlinks: bool = False) → bool

Test if a path is a file

Parameters: path – Given path
Returns: True if path is an hdfs file, else False

megfile.hdfs.hdfs_listdir(path: str | BasePath | PathLike) → List[str]

Get all contents of the given path.

Parameters: path – Given path
Returns: All contents that have path as their prefix.
Raises: FileNotFoundError, NotADirectoryError

megfile.hdfs.hdfs_load_from(path: str | BasePath | PathLike) → BinaryIO

Read all content at the specified path in binary mode and load it into memory

The caller should close the returned BinaryIO manually

Parameters: path – Given path
Returns: BinaryIO
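
Since the returned BinaryIO must be closed by the caller, wrapping it in contextlib.closing is one way to make sure that happens even when reading fails. A sketch only; read_head is a hypothetical helper.

```python
from contextlib import closing


def read_head(path: str, num_bytes: int = 16) -> bytes:
    """Return the first `num_bytes` bytes of the file at `path`.

    Hypothetical helper, not part of megfile.
    """
    # Imported lazily so this module can be imported without megfile installed.
    from megfile.hdfs import hdfs_load_from

    # closing() calls stream.close() on exit, including on exceptions.
    with closing(hdfs_load_from(path)) as stream:
        return stream.read(num_bytes)
```

Note that hdfs_load_from is documented to read the whole file into memory first, so this pattern suits small files; it does not avoid transferring the full content.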

megfile.hdfs.hdfs_makedirs(path: str | BasePath | PathLike, exist_ok: bool = False)

Create an hdfs directory. Purely creating a directory is invalid because it’s unavailable on OSS. This function instead tests whether the target bucket has WRITE access.

Parameters:

path – Given path
exist_ok – If False and the target directory exists, raise FileExistsError

Raises:

FileExistsError

megfile.hdfs.hdfs_move(src_path: str | BasePath | PathLike, dst_path: str | BasePath | PathLike, overwrite: bool = True) → None

Move the file or directory from src_path to dst_path

Parameters:

src_path – Given path
dst_path – Given destination path

Open a file on the specified path.

Parameters:

path – Given path
mode – Mode to open the file. Supports ‘r’, ‘rb’, ‘w’, ‘wb’, ‘a’, ‘ab’.
buffering – Optional integer used to set the buffering policy.
encoding – Name of the encoding used to decode or encode the file. Should only be used in text mode.
errors – Optional string specifying how encoding and decoding errors are to be handled. Cannot be used in binary mode.
max_workers – Maximum number of download threads. None by default, in which case the global thread pool with 8 threads is used.
max_buffer_size – Maximum size of the in-memory cache buffer, 128MB by default. Setting it to 0 disables the cache.
block_forward – Number of blocks of data for reader cached from the offset position.
block_size – Size of a single block for reader, default is 8MB.

Returns:

A file-like object.

Raises:

ValueError – If an unacceptable mode is provided.

megfile.hdfs.hdfs_remove(path: str | BasePath | PathLike, missing_ok: bool = False) → None

Remove the file or directory on hdfs. Removing hdfs:// or hdfs://bucket is not permitted

Parameters:

path – Given path
missing_ok – If False and the target file/directory does not exist, raise FileNotFoundError

Raises:

FileNotFoundError, UnsupportedError

megfile.hdfs.hdfs_save_as(file_object: BinaryIO, path: str | BasePath | PathLike)

Write the opened binary stream to the specified path. The stream is not closed afterwards

Parameters:

path – Given path
file_object – Stream to be read
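
Because hdfs_save_as leaves the stream open, the caller stays responsible for closing it. A minimal sketch that uploads an in-memory payload; upload_bytes is a hypothetical helper, and note that the stream is the first argument and the path the second, as in the signature above.

```python
import io


def upload_bytes(data: bytes, path: str) -> None:
    """Write `data` to `path` and release the temporary stream afterwards.

    Hypothetical helper, not part of megfile.
    """
    # Imported lazily so this module can be imported without megfile installed.
    from megfile.hdfs import hdfs_save_as

    # The with-block closes the stream, since hdfs_save_as will not.
    with io.BytesIO(data) as stream:
        hdfs_save_as(stream, path)
```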

megfile.hdfs.hdfs_scan(path: str | BasePath | PathLike, missing_ok: bool = True, followlinks: bool = False) → Iterator[str]

Iteratively traverse only files in given hdfs directory. Every iteration on generator yields a path string.

If path is a file path, yields the file only

If path is a non-existent path, return an empty generator

If path is a bucket path, return all file paths in the bucket

If path is an empty bucket, return an empty generator

If path doesn’t contain any bucket, which is path == ‘hdfs://’, raise UnsupportedError. walk() on complete hdfs is not supported in megfile

Parameters:

path – Given path
missing_ok – If False and there’s no file in the directory, raise FileNotFoundError

Raises:

UnsupportedError

Returns:

A file path generator
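
As an illustration of consuming the generator, the sketch below tallies files by extension. count_by_extension is a hypothetical helper; it assumes the yielded strings are slash-separated paths, which is why posixpath is used for splitting.

```python
import posixpath
from collections import Counter


def count_by_extension(path: str) -> Counter:
    """Count the files under `path`, grouped by file extension.

    Hypothetical helper, not part of megfile.
    """
    # Imported lazily so this module can be imported without megfile installed.
    from megfile.hdfs import hdfs_scan

    counts = Counter()
    # hdfs_scan yields files only, so no directory filtering is needed.
    for file_path in hdfs_scan(path, missing_ok=True):
        _, ext = posixpath.splitext(file_path)
        counts[ext or "<none>"] += 1
    return counts
```

With missing_ok=True, a non-existent path is documented to give an empty generator, so the result in that case is an empty Counter.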

megfile.hdfs.hdfs_scan_stat(path: str | BasePath | PathLike, missing_ok: bool = True, followlinks: bool = False) → Iterator[FileEntry]

Iteratively traverse only files in the given directory. Every iteration on the generator yields a tuple of path string and file stat

Parameters:

path – Given path
missing_ok – If False and there’s no file in the directory, raise FileNotFoundError

Raises:

UnsupportedError

Returns:

A generator of tuples of file path and file stat

megfile.hdfs.hdfs_scandir(path: str | BasePath | PathLike) → Iterator[FileEntry]

Get all contents of the given path. The results are in arbitrary order.

Parameters: path – Given path
Returns: All contents that have path as their prefix
Raises: FileNotFoundError, NotADirectoryError

megfile.hdfs.hdfs_stat(path: str | BasePath | PathLike, follow_symlinks=True) → StatResult

Get the StatResult of the file at path, including file size and mtime; see hdfs_getsize and hdfs_getmtime

If path does not exist, i.e. hdfs_exists(path) returns False, raise FileNotFoundError

If attempting to get the StatResult of the complete hdfs, such as hdfs_dir_url == ‘hdfs://’, raise BucketNotFoundError

Parameters: path – Given path
Returns: StatResult
Raises: FileNotFoundError

megfile.hdfs.hdfs_unlink(path: str | BasePath | PathLike, missing_ok: bool = False) → None

Remove the file on hdfs

Parameters:

path – Given path
missing_ok – If False and the target file does not exist, raise FileNotFoundError

Raises:

FileNotFoundError, IsADirectoryError

megfile.hdfs.hdfs_walk(path: str | BasePath | PathLike, followlinks: bool = False) → Iterator[Tuple[str, List[str], List[str]]]

Iteratively traverse the given hdfs directory in top-down order. In other words, the parent directory is traversed first, then its subdirectories, if any.

Every iteration on the generator yields a 3-tuple: (root, dirs, files)

root: Current hdfs path.
dirs: Name list of subdirectories in current directory.
files: Name list of files in current directory.

If path is a file path, return an empty generator

If path is a non-existent path, return an empty generator

If path is a bucket path, bucket will be the top directory, and will be returned at first iteration of generator

If path is an empty bucket, only yield one 3-tuple (notes: hdfs doesn’t have empty directory)

If path doesn’t contain any bucket, which is path == ‘hdfs://’, raise UnsupportedError. walk() on complete hdfs is not supported in megfile

Parameters:

path – Given path
followlinks – The result is the same whether followlinks is True or False, because hdfs does not support symlinks.

Returns:

A 3-tuple generator
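
The (root, dirs, files) tuples have the same shape as those of os.walk, so the usual pattern of joining root with each file name applies. A sketch under the assumption that root is a slash-separated hdfs path and files holds bare names; list_files is a hypothetical helper.

```python
def list_files(path: str) -> list:
    """Return the full path of every file found while walking `path`.

    Hypothetical helper, not part of megfile.
    """
    # Imported lazily so this module can be imported without megfile installed.
    from megfile.hdfs import hdfs_walk

    result = []
    for root, dirs, files in hdfs_walk(path):
        # `files` holds names only, so prepend the current directory.
        for name in files:
            result.append(root.rstrip("/") + "/" + name)
    return result
```

Since a file path or a non-existent path is documented to give an empty generator, list_files would return an empty list in both cases.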

megfile.hdfs.is_hdfs(path: str | BasePath | PathLike) → bool

Test if a path is an hdfs path

Parameters: path – Path to be tested
Returns: True if the path is an hdfs path, else False
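
A typical use of such a predicate is routing paths to the right backend. The sketch below assumes is_hdfs returns True for paths handled by this module (for example ones starting with hdfs://); split_by_backend is a hypothetical helper.

```python
def split_by_backend(paths) -> tuple:
    """Partition `paths` into (hdfs_paths, other_paths), preserving order.

    Hypothetical helper, not part of megfile.
    """
    # Imported lazily so this module can be imported without megfile installed.
    from megfile.hdfs import is_hdfs

    hdfs_paths, other_paths = [], []
    for path in paths:
        # Append to whichever list matches the predicate's answer.
        (hdfs_paths if is_hdfs(path) else other_paths).append(path)
    return hdfs_paths, other_paths
```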