megfile.hdfs_path module
- class megfile.hdfs_path.HdfsPath(path: str | BasePath | PathLike, *other_paths: str | BasePath | PathLike)[source]
Bases: URIPath
- absolute() HdfsPath [source]
Make the path absolute, without normalization or resolving symlinks. Returns a new path object.
- exists(followlinks: bool = False) bool [source]
Test if the path exists
If the bucket of the path is not permitted to be read, return False
- Returns:
True if path exists, else False
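A minimal sketch of an existence check; the hdfs://bucket/data/sample.txt path is a placeholder for a file on a cluster your environment can reach:

```python
from megfile.hdfs_path import HdfsPath

# Hypothetical file on an accessible HDFS cluster
path = HdfsPath("hdfs://bucket/data/sample.txt")
if path.exists():
    print("path exists")
else:
    print("path is missing or not readable")
```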
- getmtime(follow_symlinks: bool = False) float [source]
Get the last-modified time of the file on the given path (in Unix timestamp format). If the path is an existing directory, return the latest modified time of all files in it. The mtime of an empty directory is 1970-01-01 00:00:00
If the path does not exist, which means hdfs_exist(path) returns False, raise FileNotFoundError
- Returns:
Last-modified time
- Raises:
FileNotFoundError
- getsize(follow_symlinks: bool = False) int [source]
Get the file size on the given path (in bytes). If the path is a directory, return the sum of the sizes of all files in it, including files in subdirectories (if any).
The result excludes the size of the directory itself. In other words, return 0 bytes for an empty directory path.
If the path does not exist, which means hdfs_exist(path) returns False, raise FileNotFoundError
- Returns:
File size
- Raises:
FileNotFoundError
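A short sketch combining getmtime() and getsize(); the directory path is hypothetical, and both calls raise FileNotFoundError if it does not exist:

```python
from datetime import datetime, timezone

from megfile.hdfs_path import HdfsPath

path = HdfsPath("hdfs://bucket/data")  # hypothetical directory
mtime = datetime.fromtimestamp(path.getmtime(), tz=timezone.utc)
size = path.getsize()  # sum of the sizes of all files under the directory
print(f"last modified {mtime.isoformat()}, {size} bytes in total")
```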
- glob(pattern, recursive: bool = True, missing_ok: bool = True) List[HdfsPath] [source]
Return a list of hdfs paths in which each path matches the glob pattern. Notes: glob only works within a bucket; trying to match the bucket itself with wildcard characters raises UnsupportedError
- Parameters:
pattern – Glob the given relative pattern in the directory represented by this path
recursive – If False, ** will not search directories recursively
missing_ok – If False and the target path doesn’t match any file, raise FileNotFoundError
- Raises:
UnsupportedError, when the bucket part contains wildcard characters
- Returns:
A list of paths that match the given pattern
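For example, matching log files under a hypothetical directory (recursive=True lets ** descend into subdirectories):

```python
from megfile.hdfs_path import HdfsPath

root = HdfsPath("hdfs://bucket/logs")  # hypothetical directory
# '**/*.log' matches .log files at any depth below root
for match in root.glob("**/*.log"):
    print(match.path_with_protocol)
```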
- glob_stat(pattern, recursive: bool = True, missing_ok: bool = True) Iterator[FileEntry] [source]
Return a generator of tuples of path and file stat, in which each path matches the glob pattern
Notes: glob only works within a bucket; trying to match the bucket itself with wildcard characters raises UnsupportedError
- Parameters:
pattern – Glob the given relative pattern in the directory represented by this path
recursive – If False, ** will not search directories recursively
missing_ok – If False and the target path doesn’t match any file, raise FileNotFoundError
- Raises:
UnsupportedError, when the bucket part contains wildcard characters
- Returns:
A generator of tuples of path and file stat, in which each path matches the given pattern
- iglob(pattern, recursive: bool = True, missing_ok: bool = True) Iterator[HdfsPath] [source]
Return an hdfs path iterator, in which each path matches the glob pattern. Notes: glob only works within a bucket; trying to match the bucket itself with wildcard characters raises UnsupportedError
- Parameters:
pattern – Glob the given relative pattern in the directory represented by this path
recursive – If False, ** will not search directories recursively
missing_ok – If False and the target path doesn’t match any file, raise FileNotFoundError
- Raises:
UnsupportedError, when the bucket part contains wildcard characters
- Returns:
An iterator over paths that match the given pattern
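A sketch contrasting the lazy variants with glob(); the directory is hypothetical, and the example assumes FileEntry exposes name and stat attributes:

```python
from megfile.hdfs_path import HdfsPath

root = HdfsPath("hdfs://bucket/logs")  # hypothetical directory

# iglob() yields matches lazily instead of building the whole list
first = next(root.iglob("**/*.log"), None)

# glob_stat() yields FileEntry objects carrying name and stat together
for entry in root.glob_stat("**/*.log"):
    print(entry.name, entry.stat.size)
```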
- is_dir(followlinks: bool = False) bool [source]
Test if an hdfs url is a directory. Specific procedures are as follows: the path is a directory if there exists a suffix for which os.path.join(path, suffix) is a file, or if the url is an empty bucket or hdfs://
- Parameters:
followlinks – The result is the same whether followlinks is True or False, because hdfs symlinks do not support directories.
- Returns:
True if path is hdfs directory, else False
- is_file(followlinks: bool = False) bool [source]
Test if a path is a file
- Returns:
True if path is hdfs file, else False
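A small sketch distinguishing files from directories on a hypothetical path:

```python
from megfile.hdfs_path import HdfsPath

path = HdfsPath("hdfs://bucket/data/sample.txt")  # hypothetical path
if path.is_file():
    print("regular file")
elif path.is_dir():
    print("directory")
else:
    print("does not exist")
```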
- iterdir(followlinks: bool = False) Iterator[HdfsPath] [source]
Get all contents of the given path.
- Returns:
All contents that have the path as prefix.
- Raises:
FileNotFoundError, NotADirectoryError
- listdir(followlinks: bool = False) List[str] [source]
Get all contents of the given path.
- Returns:
All contents that have the path as prefix.
- Raises:
FileNotFoundError, NotADirectoryError
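The difference between the two listings, sketched on a hypothetical directory: listdir() returns plain name strings while iterdir() yields HdfsPath objects:

```python
from megfile.hdfs_path import HdfsPath

root = HdfsPath("hdfs://bucket/data")  # hypothetical directory

for name in root.listdir():    # e.g. 'sample.txt'
    print(name)
for child in root.iterdir():   # e.g. HdfsPath('hdfs://bucket/data/sample.txt')
    print(child.path_with_protocol)
```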
- load(followlinks: bool = False) BinaryIO [source]
Read all content of the specified path in binary mode and load it into memory
The user should close the returned BinaryIO manually
- Returns:
BinaryIO
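A sketch of reading a whole (hypothetical) file through load(), closing the returned stream explicitly as required:

```python
from megfile.hdfs_path import HdfsPath

path = HdfsPath("hdfs://bucket/data/sample.txt")  # hypothetical file
reader = path.load()
try:
    content = reader.read()  # the whole file, already buffered in memory
finally:
    reader.close()  # the caller is responsible for closing
```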
- md5(recalculate: bool = False, followlinks: bool = False) str [source]
Get the checksum of the file or directory.
- Parameters:
recalculate – Ignored; present only for compatibility
followlinks – Ignored; present only for compatibility
- Returns:
checksum
- mkdir(mode=511, parents: bool = False, exist_ok: bool = False)[source]
Create an hdfs directory.
- Parameters:
mode – Octal permission to set on the newly created directory. These permissions will only be set on directories that do not already exist.
parents – Ignored; present only for compatibility with pathlib.Path
exist_ok – If False and the target directory exists, raise FileExistsError
- Raises:
BucketNotFoundError, FileExistsError
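A minimal sketch on a hypothetical target directory; exist_ok=True makes the call a no-op when the directory is already there:

```python
from megfile.hdfs_path import HdfsPath

target = HdfsPath("hdfs://bucket/output")  # hypothetical directory
target.mkdir(exist_ok=True)
```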
- move(dst_path: str | BasePath | PathLike, overwrite: bool = True) None [source]
Move file/directory path from src_path to dst_path
- Parameters:
dst_path – Given destination path
overwrite – whether to overwrite the destination if it already exists
- open(mode: str = 'r', *, buffering: int | None = None, encoding: str | None = None, errors: str | None = None, max_workers: int | None = None, max_buffer_size: int = 134217728, block_forward: int | None = None, block_size: int = 8388608, **kwargs) IO [source]
Open a file on the specified path.
- Parameters:
mode – Mode to open the file. Supports ‘r’, ‘rb’, ‘w’, ‘wb’, ‘a’, ‘ab’.
buffering – Optional integer used to set the buffering policy.
encoding – Name of the encoding used to decode or encode the file. Should only be used in text mode.
errors – Optional string specifying how encoding and decoding errors are to be handled. Cannot be used in binary mode.
max_workers – Max number of download threads; None by default, which uses the global thread pool with 8 threads.
max_buffer_size – Max cached buffer size in memory, 128MB by default. Setting it to 0 disables the cache.
block_forward – Number of blocks the reader caches ahead of the current offset position.
block_size – Size of a single block for reader, default is 8MB.
- Returns:
A file-like object.
- Raises:
ValueError – If an unacceptable mode is provided.
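A sketch of a text write followed by a binary read on a hypothetical path; the context managers close the files:

```python
from megfile.hdfs_path import HdfsPath

path = HdfsPath("hdfs://bucket/data/report.txt")  # hypothetical file

with path.open("w", encoding="utf-8") as f:
    f.write("hello hdfs\n")

with path.open("rb", block_size=8 * 1024 * 1024) as f:
    head = f.read(16)  # first 16 bytes
```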
- property parts: Tuple[str, ...]
A tuple giving access to the path’s various components
- property path_with_protocol: str
Return path with protocol, like hdfs://path
- property path_without_protocol: str
Return path without protocol, example: if path is hdfs://path, return path
- protocol = 'hdfs'
- remove(missing_ok: bool = False) None [source]
Remove the file or directory on hdfs; removing hdfs:// or hdfs://bucket is not permitted
- Parameters:
missing_ok – If False and the target file/directory does not exist, raise FileNotFoundError
- Raises:
FileNotFoundError, UnsupportedError
- rename(dst_path: str | BasePath | PathLike, overwrite: bool = True) HdfsPath [source]
Move the hdfs file from src_path to dst_path
- Parameters:
dst_path – Given destination path
overwrite – whether to overwrite the file if it already exists
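A sketch with hypothetical source and destination paths; rename() returns the new HdfsPath, and overwrite=False preserves an existing destination:

```python
from megfile.hdfs_path import HdfsPath

src = HdfsPath("hdfs://bucket/staging/part-0000")  # hypothetical source
dst = "hdfs://bucket/final/part-0000"              # hypothetical destination

renamed = src.rename(dst, overwrite=False)
print(renamed.path_with_protocol)
```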
- save(file_object: BinaryIO)[source]
Write the opened binary stream to the specified path; the stream won’t be closed
- Parameters:
file_object – Stream to be read
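A sketch writing an in-memory stream to a hypothetical path; note that save() leaves the stream open:

```python
import io

from megfile.hdfs_path import HdfsPath

path = HdfsPath("hdfs://bucket/data/blob.bin")  # hypothetical file
stream = io.BytesIO(b"payload")
path.save(stream)  # write the stream's content to the path
stream.close()     # save() does not close the stream for you
```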
- scan(missing_ok: bool = True, followlinks: bool = False) Iterator[str] [source]
Iteratively traverse only files in the given hdfs directory. Every iteration on the generator yields a path string.
If path is a file path, yield the file only
If path is a non-existent path, return an empty generator
If path is a bucket path, return all file paths in the bucket
If path is an empty bucket, return an empty generator
If path doesn’t contain any bucket, i.e. path == ‘hdfs://’, raise UnsupportedError. walk() on the complete hdfs is not supported in megfile
- Parameters:
missing_ok – If False and there’s no file in the directory, raise FileNotFoundError
- Raises:
UnsupportedError
- Returns:
A file path generator
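For instance, printing every file under a hypothetical directory tree:

```python
from megfile.hdfs_path import HdfsPath

root = HdfsPath("hdfs://bucket/data")  # hypothetical directory
# scan() yields one path string per file, recursing into subdirectories
for file_path in root.scan():
    print(file_path)
```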
- scan_stat(missing_ok: bool = True, followlinks: bool = False) Iterator[FileEntry] [source]
Iteratively traverse only files in the given directory. Every iteration on the generator yields a tuple of path string and file stat
- Parameters:
missing_ok – If False and there’s no file in the directory, raise FileNotFoundError
- Raises:
UnsupportedError
- Returns:
A generator of tuples of path and file stat
- scandir(followlinks: bool = False) Iterator[FileEntry] [source]
Get all contents of the given path; the order of the results is not guaranteed.
- Returns:
All contents that have the path as prefix
- Raises:
FileNotFoundError, NotADirectoryError
- stat(follow_symlinks=True) StatResult [source]
Get the StatResult of the file on the given path, including file size and mtime, referring to hdfs_getsize and hdfs_getmtime
If the path does not exist, which means hdfs_exist(path) returns False, raise FileNotFoundError
If attempting to get the StatResult of the complete hdfs, i.e. path == ‘hdfs://’, raise BucketNotFoundError
- Returns:
StatResult
- Raises:
FileNotFoundError
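A sketch on a hypothetical file, assuming StatResult exposes size and mtime fields matching getsize() and getmtime():

```python
from megfile.hdfs_path import HdfsPath

path = HdfsPath("hdfs://bucket/data/sample.txt")  # hypothetical file
st = path.stat()
print(st.size, st.mtime)  # same values as getsize() and getmtime()
```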
- unlink(missing_ok: bool = False) None [source]
Remove the file on hdfs
- Parameters:
missing_ok – If False and the target file does not exist, raise FileNotFoundError
- Raises:
FileNotFoundError, IsADirectoryError
- walk(followlinks: bool = False) Iterator[Tuple[str, List[str], List[str]]] [source]
Iteratively traverse the given hdfs directory in top-down order. In other words, first traverse the parent directory; if subdirectories exist, traverse the subdirectories.
Every iteration on generator yields a 3-tuple: (root, dirs, files)
root: Current hdfs path
dirs: Name list of subdirectories in the current directory
files: Name list of files in the current directory
If path is a file path, return an empty generator
If path is a non-existent path, return an empty generator
If path is a bucket path, the bucket will be the top directory and will be returned in the first iteration of the generator
If path is an empty bucket, only yield one 3-tuple (note: hdfs doesn’t have empty directories)
If path doesn’t contain any bucket, i.e. path == ‘hdfs://’, raise UnsupportedError. walk() on the complete hdfs is not supported in megfile
- Parameters:
followlinks – The result is the same whether followlinks is True or False, because hdfs does not support symlinks.
- Returns:
A 3-tuple generator
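A sketch reassembling full paths from the yielded 3-tuples on a hypothetical directory:

```python
import posixpath

from megfile.hdfs_path import HdfsPath

root = HdfsPath("hdfs://bucket/data")  # hypothetical directory
for parent, dirs, files in root.walk():
    # parent is a path string; dirs and files are name lists
    for name in files:
        print(posixpath.join(parent, name))
```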