megfile.hdfs module
- megfile.hdfs.hdfs_exists(path: str | BasePath | PathLike, followlinks: bool = False) -> bool
Test whether path exists.
If reading the bucket of path is not permitted, return False.
- Parameters:
path – Given path
- Returns:
True if path exists, else False
- megfile.hdfs.hdfs_getmd5(path: str | BasePath | PathLike, recalculate: bool = False, followlinks: bool = False) -> str
Get the checksum of the file or directory.
- Parameters:
path – Given path
recalculate – Ignored; kept only for compatibility
followlinks – Ignored; kept only for compatibility
- Returns:
checksum
- megfile.hdfs.hdfs_getmtime(path: str | BasePath | PathLike, follow_symlinks: bool = False) -> float
Get the last-modified time of the file at path, as a Unix timestamp. If path is an existing directory, return the latest modification time among all files in it. The mtime of an empty directory is 1970-01-01 00:00:00.
If path does not exist, that is, hdfs_exists(path) returns False, raise FileNotFoundError.
- Parameters:
path – Given path
- Returns:
Last-modified time
- Raises:
FileNotFoundError
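As a small worked example of the timestamp convention above, the sketch below converts an mtime to a datetime. It assumes the documented empty-directory value 1970-01-01 00:00:00 corresponds to Unix timestamp 0 interpreted in UTC; `mtime_to_utc` and `EMPTY_DIR_MTIME` are hypothetical names, not part of megfile.

```python
from datetime import datetime, timezone

# Assumption: the documented empty-directory mtime is Unix timestamp 0 (UTC).
EMPTY_DIR_MTIME = 0.0


def mtime_to_utc(mtime: float) -> datetime:
    """Convert a Unix timestamp, such as an mtime, to an aware UTC datetime."""
    return datetime.fromtimestamp(mtime, tz=timezone.utc)


epoch = mtime_to_utc(EMPTY_DIR_MTIME)
print(epoch.isoformat())  # 1970-01-01T00:00:00+00:00
```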
- megfile.hdfs.hdfs_getsize(path: str | BasePath | PathLike, follow_symlinks: bool = False) -> int
Get the size in bytes of the file at path. If path is a directory, return the total size of all files in it, including files in subdirectories (if any).
The result excludes the size of the directory itself. In other words, an empty directory has a size of 0 bytes.
If path does not exist, that is, hdfs_exists(path) returns False, raise FileNotFoundError.
- Parameters:
path – Given path
- Returns:
File size
- Raises:
FileNotFoundError
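The size rules above can be modeled on a local filesystem. This is only an analogue built on `os.walk`, not megfile's implementation; `tree_size` is a hypothetical helper that follows the documented rules (recursive sum, 0 for an empty directory, FileNotFoundError for a missing path).

```python
import os
import tempfile


def tree_size(path: str) -> int:
    """Local analogue of the documented size rules (not megfile code)."""
    if not os.path.exists(path):
        raise FileNotFoundError(path)
    if os.path.isfile(path):
        return os.path.getsize(path)
    # Sum file sizes only; directories themselves contribute nothing.
    return sum(
        os.path.getsize(os.path.join(root, name))
        for root, _dirs, files in os.walk(path)
        for name in files
    )


with tempfile.TemporaryDirectory() as top:
    empty_size = tree_size(top)  # empty directory
    os.makedirs(os.path.join(top, "sub"))
    with open(os.path.join(top, "a.bin"), "wb") as f:
        f.write(b"12345")
    with open(os.path.join(top, "sub", "b.bin"), "wb") as f:
        f.write(b"123")
    total_size = tree_size(top)  # 5 + 3 bytes, including the subdirectory
```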
- megfile.hdfs.hdfs_glob(path: str | BasePath | PathLike, recursive: bool = True, missing_ok: bool = True) -> List[str]
Return a list of hdfs paths that match the glob pattern path, in ascending alphabetical order.
Note: Globbing works only within a bucket. If the bucket part contains wildcard characters, raise UnsupportedError.
- Parameters:
recursive – If False, ** will not search directories recursively
missing_ok – If False and path does not match any file, raise FileNotFoundError
- Raises:
UnsupportedError, when the bucket part contains wildcard characters
- Returns:
A list of paths that match path
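The bucket restriction can be checked before calling the glob functions. The sketch below is a hypothetical pre-check, not megfile's own logic: the wildcard character set (`*`, `?`, `[`, `{`) and the helper name `bucket_has_wildcard` are assumptions.

```python
import re

# Assumed set of glob wildcard characters; megfile's own set may differ.
_WILDCARDS = re.compile(r"[*?\[{]")


def bucket_has_wildcard(path: str, scheme: str = "hdfs://") -> bool:
    """Return True if the bucket part of an hdfs path contains a wildcard."""
    rest = path[len(scheme):] if path.startswith(scheme) else path
    bucket = rest.split("/", 1)[0]  # text before the first "/" after the scheme
    return bool(_WILDCARDS.search(bucket))


print(bucket_has_wildcard("hdfs://bucket/logs/*.txt"))  # False
print(bucket_has_wildcard("hdfs://buc*/logs/a.txt"))    # True
```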
- megfile.hdfs.hdfs_glob_stat(path: str | BasePath | PathLike, recursive: bool = True, missing_ok: bool = True) -> Iterator[FileEntry]
Return a generator of tuples of path and file stat, for paths that match the glob pattern path, in ascending alphabetical order.
Note: Globbing works only within a bucket. If the bucket part contains wildcard characters, raise UnsupportedError.
- Parameters:
recursive – If False, ** will not search directories recursively
missing_ok – If False and path does not match any file, raise FileNotFoundError
- Raises:
UnsupportedError, when the bucket part contains wildcard characters
- Returns:
A generator of tuples of path and file stat, for paths that match path
- megfile.hdfs.hdfs_iglob(path: str | BasePath | PathLike, recursive: bool = True, missing_ok: bool = True) -> Iterator[str]
Return an iterator over hdfs paths that match the glob pattern path, in ascending alphabetical order.
Note: Globbing works only within a bucket. If the bucket part contains wildcard characters, raise UnsupportedError.
- Parameters:
recursive – If False, ** will not search directories recursively
missing_ok – If False and path does not match any file, raise FileNotFoundError
- Raises:
UnsupportedError, when the bucket part contains wildcard characters
- Returns:
An iterator over paths that match path
- megfile.hdfs.hdfs_isdir(path: str | BasePath | PathLike, followlinks: bool = False) -> bool
Test if an hdfs url is a directory. It is treated as a directory in either of the following cases:
There exists a suffix for which os.path.join(path, suffix) is a file
The url is an empty bucket or hdfs://
- Parameters:
path – Given path
followlinks – Has no effect: the result is the same whether it is True or False, because hdfs symlinks do not support directories.
- Returns:
True if path is an hdfs directory, else False
- megfile.hdfs.hdfs_isfile(path: str | BasePath | PathLike, followlinks: bool = False) -> bool
Test if a path is a file
- Parameters:
path – Given path
- Returns:
True if path is hdfs file, else False
- megfile.hdfs.hdfs_listdir(path: str | BasePath | PathLike, followlinks: bool = False) -> List[str]
Get all contents of the given path.
- Parameters:
path – Given path
- Returns:
All contents with the prefix path.
- Raises:
FileNotFoundError, NotADirectoryError
- megfile.hdfs.hdfs_load_from(path: str | BasePath | PathLike, followlinks: bool = False) -> BinaryIO
Read the entire content at the specified path in binary mode and load it into memory.
The caller must close the returned BinaryIO manually.
- Parameters:
path – Given path
- Returns:
BinaryIO
- megfile.hdfs.hdfs_makedirs(path: str | BasePath | PathLike, exist_ok: bool = False)
Create an hdfs directory. Purely creating a directory is invalid because it’s unavailable on OSS. This function tests whether the target bucket has WRITE access.
- Parameters:
path – Given path
exist_ok – If False and the target directory exists, raise FileExistsError
- Raises:
FileExistsError
- megfile.hdfs.hdfs_move(src_path: str | BasePath | PathLike, dst_path: str | BasePath | PathLike, overwrite: bool = True) -> None
Move file/directory path from src_path to dst_path
- Parameters:
src_path – Given path
dst_path – Given destination path
- megfile.hdfs.hdfs_open(path: str | BasePath | PathLike, mode: str = 'r', *, buffering: int | None = None, encoding: str | None = None, errors: str | None = None, max_workers: int | None = None, max_buffer_size: int = 134217728, block_forward: int | None = None, block_size: int = 8388608, **kwargs) -> IO
Open a file on the specified path.
- Parameters:
path – Given path
mode – Mode to open the file. Supports ‘r’, ‘rb’, ‘w’, ‘wb’, ‘a’, ‘ab’.
buffering – Optional integer used to set the buffering policy.
encoding – Name of the encoding used to decode or encode the file. Should only be used in text mode.
errors – Optional string specifying how encoding and decoding errors are to be handled. Cannot be used in binary mode.
max_workers – Maximum number of download threads. Defaults to None, in which case the global thread pool with 8 threads is used.
max_buffer_size – Maximum size of the in-memory cache buffer, 128MB by default. Setting it to 0 disables the cache.
block_forward – Number of blocks of data for reader cached from the offset position.
block_size – Size of a single block for reader, default is 8MB.
- Returns:
A file-like object.
- Raises:
ValueError – If an unacceptable mode is provided.
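The numeric defaults in the signature match the sizes named in the parameter descriptions, and the mode list can be expressed as a simple check. The sketch below only illustrates that arithmetic and validation; `SUPPORTED_MODES`, `check_mode` and the constant names are hypothetical and do not reproduce megfile's internal code.

```python
# Modes listed in the `mode` parameter description.
SUPPORTED_MODES = frozenset({"r", "rb", "w", "wb", "a", "ab"})

# Signature defaults, written as the sizes named in the descriptions.
DEFAULT_MAX_BUFFER_SIZE = 128 * 1024 ** 2  # 134217728 bytes = 128MB
DEFAULT_BLOCK_SIZE = 8 * 1024 ** 2         # 8388608 bytes = 8MB


def check_mode(mode: str) -> str:
    """Raise ValueError for a mode outside the documented list."""
    if mode not in SUPPORTED_MODES:
        raise ValueError(f"unacceptable mode: {mode!r}")
    return mode


# How many default-sized blocks fit into the default buffer (plain arithmetic).
blocks_per_buffer = DEFAULT_MAX_BUFFER_SIZE // DEFAULT_BLOCK_SIZE
```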
- megfile.hdfs.hdfs_remove(path: str | BasePath | PathLike, missing_ok: bool = False) -> None
Remove the file or directory on hdfs. Removing hdfs:// or hdfs://bucket is not permitted.
- Parameters:
path – Given path
missing_ok – If False and the target file/directory does not exist, raise FileNotFoundError
- Raises:
FileNotFoundError, UnsupportedError
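A caller can screen out the two non-removable forms before attempting a removal. The helper below is a hypothetical pre-check, not megfile's logic; treating a trailing slash (`hdfs://bucket/`) the same as the bare bucket is an assumption.

```python
def is_root_or_bucket(path: str, scheme: str = "hdfs://") -> bool:
    """Return True for 'hdfs://' or 'hdfs://bucket', which may not be removed."""
    if not path.startswith(scheme):
        return False
    rest = path[len(scheme):].strip("/")
    # No "/" left after stripping means there is no key below the bucket.
    return "/" not in rest


print(is_root_or_bucket("hdfs://"))                 # True
print(is_root_or_bucket("hdfs://bucket"))           # True
print(is_root_or_bucket("hdfs://bucket/dir/file"))  # False
```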
- megfile.hdfs.hdfs_save_as(file_object: BinaryIO, path: str | BasePath | PathLike)
Write the opened binary stream to the specified path. The stream is not closed.
- Parameters:
path – Given path
file_object – Stream to be read
- megfile.hdfs.hdfs_scan(path: str | BasePath | PathLike, missing_ok: bool = True, followlinks: bool = False) -> Iterator[str]
Iteratively traverse only the files in the given hdfs directory. Each iteration of the generator yields a path string.
If path is a file path, yield only that file
If path is a non-existent path, return an empty generator
If path is a bucket path, return all file paths in the bucket
If path is an empty bucket, return an empty generator
If path doesn’t contain any bucket, which is path == ‘hdfs://’, raise UnsupportedError. walk() on complete hdfs is not supported in megfile
- Parameters:
path – Given path
missing_ok – If False and there’s no file in the directory, raise FileNotFoundError
- Raises:
UnsupportedError
- Returns:
A file path generator
- megfile.hdfs.hdfs_scan_stat(path: str | BasePath | PathLike, missing_ok: bool = True, followlinks: bool = False) -> Iterator[FileEntry]
Iteratively traverse only the files in the given directory. Each iteration of the generator yields a tuple of path string and file stat
- Parameters:
path – Given path
missing_ok – If False and there’s no file in the directory, raise FileNotFoundError
- Raises:
UnsupportedError
- Returns:
A file path generator
- megfile.hdfs.hdfs_scandir(path: str | BasePath | PathLike, followlinks: bool = False) -> Iterator[FileEntry]
Get all contents of the given path. The order of the results is not guaranteed.
- Parameters:
path – Given path
- Returns:
All contents with the prefix path
- Raises:
FileNotFoundError, NotADirectoryError
- megfile.hdfs.hdfs_stat(path: str | BasePath | PathLike, follow_symlinks=True) -> StatResult
Get the StatResult of the file at path, including file size and mtime; see hdfs_getsize and hdfs_getmtime.
If path does not exist, that is, hdfs_exists(path) returns False, raise FileNotFoundError.
If attempting to get the StatResult of the complete hdfs, such as path == ‘hdfs://’, raise BucketNotFoundError.
- Parameters:
path – Given path
- Returns:
StatResult
- Raises:
FileNotFoundError
- megfile.hdfs.hdfs_unlink(path: str | BasePath | PathLike, missing_ok: bool = False) -> None
Remove the file on hdfs
- Parameters:
path – Given path
missing_ok – If False and the target file does not exist, raise FileNotFoundError
- Raises:
FileNotFoundError, IsADirectoryError
- megfile.hdfs.hdfs_walk(path: str | BasePath | PathLike, followlinks: bool = False) -> Iterator[Tuple[str, List[str], List[str]]]
Iteratively traverse the given hdfs directory in top-down order: the parent directory is traversed first, then its subdirectories, if any.
Every iteration on generator yields a 3-tuple: (root, dirs, files)
root: Current hdfs path;
dirs: Name list of subdirectories in current directory.
files: Name list of files in current directory.
If path is a file path, return an empty generator
If path is a non-existent path, return an empty generator
If path is a bucket path, bucket will be the top directory, and will be returned at first iteration of generator
If path is an empty bucket, only yield one 3-tuple (note: hdfs doesn’t have empty directories)
If path doesn’t contain any bucket, which is path == ‘hdfs://’, raise UnsupportedError. walk() on complete hdfs is not supported in megfile
- Parameters:
path – Given path
followlinks – Has no effect: the result is the same whether it is True or False, because hdfs does not support symlinks.
- Returns:
A 3-tuple generator
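The (root, dirs, files) shape and the top-down order described above mirror the standard library's `os.walk`. The sketch below shows that shape on a temporary local directory as an analogy only; it does not call megfile, and `build_demo_tree` and `walk_listing` are hypothetical helpers.

```python
import os
import tempfile


def build_demo_tree(top: str) -> None:
    """Create top/a.txt and top/sub/b.txt for the demonstration."""
    os.makedirs(os.path.join(top, "sub"))
    for rel in ("a.txt", os.path.join("sub", "b.txt")):
        with open(os.path.join(top, rel), "w") as f:
            f.write("x")


def walk_listing(top: str) -> list:
    """Collect (root, dirs, files) 3-tuples in top-down order."""
    result = []
    for root, dirs, files in os.walk(top, topdown=True):
        dirs.sort()  # make the traversal order deterministic
        result.append((os.path.relpath(root, top), list(dirs), sorted(files)))
    return result


with tempfile.TemporaryDirectory() as top:
    build_demo_tree(top)
    listing = walk_listing(top)

# The parent directory comes first, then its subdirectory.
for entry in listing:
    print(entry)
```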