This section contains library documentation for Hadoopy.
Run Hadoop given the parameters
| Parameters: | |
|---|---|
| Return type: | Dictionary with some of the following entries (depending on options) |
| Returns: | freeze_cmds: Freeze command(s) run |
| Returns: | frozen_tar_path: HDFS path to frozen file |
| Returns: | hadoop_cmds: Hadoopy command(s) run |
| Returns: | process: subprocess.Popen object |
| Returns: | output: Iterator of (key, value) pairs |
| Raises: | subprocess.CalledProcessError: Hadoop error. |
| Raises: | OSError: Hadoop streaming not found. |
| Raises: | TypeError: Input types are not correct. |
| Raises: | ValueError: Script not found or check_script failed |
Freezes a script and then launches it.
This function freezes your Python program and places it on HDFS in ‘temp_path’. The frozen file is not removed afterwards: such files are typically small, they are easy to reuse or debug, and leaving them in place avoids any risk involved in removing files.
| Parameters: | |
|---|---|
| Return type: | Dictionary with some of the following entries (depending on options) |
| Returns: | freeze_cmds: Freeze command(s) run |
| Returns: | frozen_tar_path: HDFS path to frozen file |
| Returns: | hadoop_cmds: Hadoopy command(s) run |
| Returns: | process: subprocess.Popen object |
| Returns: | output: Iterator of (key, value) pairs |
| Raises: | subprocess.CalledProcessError: Hadoop error. |
| Raises: | OSError: Hadoop streaming not found. |
| Raises: | TypeError: Input types are not correct. |
| Raises: | ValueError: Script not found |
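Because the returned dictionary contains only some of the listed entries, callers are safer probing for keys than indexing them directly. The following is a minimal sketch that uses a hand-built dictionary in place of a real launch result; `describe_launch_result` is a hypothetical helper, not part of Hadoopy, and the example values are made up.

```python
def describe_launch_result(result):
    """Return human-readable lines for whichever entries are present."""
    lines = []
    # Key names are taken from the table above; any of them may be absent.
    for key in ('freeze_cmds', 'frozen_tar_path', 'hadoop_cmds'):
        if key in result:
            lines.append('%s: %s' % (key, result[key]))
    # 'output' is an iterator of (key, value) pairs, so it can be consumed once.
    if 'output' in result:
        for k, v in result['output']:
            lines.append('output: %r -> %r' % (k, v))
    return lines


# Stand-in for a real return value (illustrative values only).
example = {'frozen_tar_path': '_hadoopy_temp/example.tar',
           'output': iter([('a', 1), ('b', 2)])}
for line in describe_launch_result(example):
    print(line)
```

Since the output entry is an iterator, store its pairs in a list if they are needed more than once.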
A simple local emulation of Hadoop
This does not run Hadoop and does not support many advanced features; it is intended for simple debugging. The input/output uses HDFS if an HDFS path is given. This allows small tasks to be run locally (primarily while debugging). A temporary working directory is used and then removed.
Support
| Parameters: | |
|---|---|
| Return type: | Dictionary with some of the following entries (depending on options) |
| Returns: | freeze_cmds: Freeze command(s) run |
| Returns: | frozen_tar_path: HDFS path to frozen file |
| Returns: | hadoop_cmds: Hadoopy command(s) run |
| Returns: | process: subprocess.Popen object |
| Returns: | output: Iterator of (key, value) pairs |
| Raises: | subprocess.CalledProcessError: Hadoop error. |
| Raises: | OSError: Hadoop streaming not found. |
| Raises: | TypeError: Input types are not correct. |
| Raises: | ValueError: Script not found |
Hadoopy entrance function
This is to be called in all Hadoopy jobs. It handles the arguments passed in, calls the provided functions with the input, and stores the output.
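As an illustration of "the provided functions", here is a minimal word-count sketch. The names `mapper`, `reducer`, and `main` are illustrative, and the sketch assumes the entrance function is exposed as `hadoopy.run` and accepts the map and reduce callables; check this against your installed version.

```python
def mapper(key, value):
    """Emit (word, 1) for every whitespace-separated word in the input line."""
    for word in value.split():
        yield word, 1


def reducer(key, values):
    """Sum the counts collected for one word."""
    yield key, sum(values)


def main():
    # Assumption: the entrance function is hadoopy.run(mapper, reducer).
    # The import is kept local so the functions above can be tested
    # on a machine where Hadoopy is not installed.
    import hadoopy
    hadoopy.run(mapper, reducer)


# A real job script would end by calling main(). The functions can also be
# exercised locally without Hadoop:
print(list(mapper(0, 'a b a')))
print(list(reducer('a', [1, 1])))
```

Writing the map and reduce steps as plain generators keeps them easy to unit test before the job is submitted.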
TypedBytes are used if the following is True: os.environ['stream_map_input'] == 'typedbytes'
It is highly recommended that TypedBytes be used for all non-trivial tasks. Keep in mind that what you can safely emit from your functions is limited when using Text (i.e., no \t or \n). You can use the base64 module to ensure that your output is clean.
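The base64 suggestion can be sketched as follows. The helper names `encode_text_value` and `decode_text_value` are hypothetical; the point is only that a base64-encoded value contains no tab or newline characters, so it is safe to emit as Text.

```python
import base64


def encode_text_value(raw):
    """Encode bytes as base64 ASCII text, which contains no tab or newline."""
    return base64.b64encode(raw).decode('ascii')


def decode_text_value(text):
    """Recover the original bytes from an encoded value."""
    return base64.b64decode(text)


payload = b'col1\tcol2\nrow'      # would be unsafe to emit directly as Text
encoded = encode_text_value(payload)
print(encoded)
```

The consumer of the job output has to apply the matching decode step, so this is a workaround for Text mode rather than a replacement for TypedBytes.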
If the HADOOPY_CHDIR environmental variable is set, this will immediately change the working directory to the one specified. This is useful if your data is provided in an archive but your program assumes it is in that directory.
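The behaviour described above amounts to a check of the environment followed by a directory change. This is a rough sketch of that logic, not Hadoopy's actual implementation; `chdir_if_requested` is a hypothetical name, and a temporary directory stands in for an unpacked archive.

```python
import os
import tempfile


def chdir_if_requested(environ=None):
    """Change to HADOOPY_CHDIR if it is set; return the new cwd or None."""
    environ = os.environ if environ is None else environ
    target = environ.get('HADOOPY_CHDIR')
    if target is None:
        return None
    os.chdir(target)
    return os.getcwd()


# Demonstration with a throwaway directory.
original_cwd = os.getcwd()
with tempfile.TemporaryDirectory() as demo_dir:
    new_cwd = chdir_if_requested({'HADOOPY_CHDIR': demo_dir})
    moved = os.path.samefile(new_cwd, demo_dir)
    os.chdir(original_cwd)  # leave the directory before it is deleted
print(moved)
```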
As hadoop streaming relies on stdin/stdout/stderr for communication, anything that outputs on them in an unexpected way (especially stdout) will break the pipe on the Java side and can potentially cause data errors. To fix this problem, hadoopy allows file descriptors (integers) to be provided to each task. These will be used instead of stdin/stdout by hadoopy. This is designed to combine with the ‘pipe’ command.
To use the pipe functionality, run your_script.py pipe map instead of your_script.py map. This calls the script as a subprocess and uses the read_fd/write_fd command line arguments for communication, which isolates your script and eliminates the largest source of errors when using Hadoop streaming.
The pipe functionality has the following semantics:
- stdin: Always an empty file
- stdout: Redirected to stderr (which is visible in the Hadoop log)
- stderr: Kept as stderr
- read_fd: File descriptor that points to the true stdin
- write_fd: File descriptor that points to the true stdout
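To make the role of read_fd and write_fd concrete, the sketch below shows a task reading from and writing to explicit integer file descriptors instead of stdin/stdout. It is not Hadoopy's implementation: `uppercase_lines` is a hypothetical function, and two local pipes play the roles of the true stdin and stdout.

```python
import os


def uppercase_lines(read_fd, write_fd):
    """Copy lines from read_fd to write_fd, upper-casing each one."""
    # os.fdopen wraps an integer descriptor in a regular file object.
    with os.fdopen(read_fd, 'rb') as src, os.fdopen(write_fd, 'wb') as dst:
        for line in src:
            dst.write(line.upper())


# Local demonstration with two pipes.
in_read, in_write = os.pipe()
out_read, out_write = os.pipe()
with os.fdopen(in_write, 'wb') as feeder:
    feeder.write(b'hello\nworld\n')
uppercase_lines(in_read, out_write)
with os.fdopen(out_read, 'rb') as collector:
    result = collector.read()
print(result)
```

Because the data travels over its own descriptors, a stray print to stdout inside the task cannot corrupt the stream that the Java side reads.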
| Parameters: | |
|---|---|
Output a status message that is displayed in the Hadoop web interface
The status message replaces any previous one; if you want to append to it, you must do so yourself.
| Parameters: | |
|---|---|
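Since each status message replaces the previous one, appending means keeping the earlier text yourself. The sketch below does that and writes the `reporter:status:` line that Hadoop streaming reads from stderr. It assumes Hadoopy's status function emits the same kind of line, writes the line directly rather than calling Hadoopy, and `AppendingStatus` is a hypothetical helper.

```python
import sys


def format_status(message):
    """Build the stderr line Hadoop streaming interprets as a status update."""
    return 'reporter:status:%s\n' % message


class AppendingStatus(object):
    """Keeps earlier messages so that each update appends to the status."""

    def __init__(self, separator='; '):
        self.separator = separator
        self.parts = []

    def update(self, message, stream=None):
        self.parts.append(message)
        line = format_status(self.separator.join(self.parts))
        (stream if stream is not None else sys.stderr).write(line)
        return line


status = AppendingStatus()
status.update('loaded model')
last_line = status.update('processed 100 records')
```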
Output a counter update that is displayed in the Hadoop web interface
Counters are useful for quickly identifying the number of times an error occurred, current progress, or coarse statistics.
| Parameters: | |
|---|---|
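As an example of counting errors, the sketch below increments a counter each time a record fails to parse. It writes the `reporter:counter:` line that Hadoop streaming reads from stderr instead of calling Hadoopy. The group and counter names are made up, and `parse_ints` is a hypothetical function.

```python
import io
import sys


def format_counter(group, counter, amount=1):
    """Build the stderr line Hadoop streaming interprets as a counter update."""
    return 'reporter:counter:%s,%s,%d\n' % (group, counter, amount)


def parse_ints(records, stream=None):
    """Yield each record as an int; count the ones that fail to parse."""
    stream = sys.stderr if stream is None else stream
    for record in records:
        try:
            yield int(record)
        except ValueError:
            # Group and counter names are illustrative only.
            stream.write(format_counter('parse', 'bad_records'))


log = io.StringIO()  # stands in for stderr so the demonstration is self-contained
parsed = list(parse_ints(['1', 'x', '3', ''], stream=log))
print(parsed)
print(log.getvalue(), end='')
```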
Read typedbytes sequence files on HDFS (with optional compression).
By default, this ignores files whose names start with an underscore ‘_’, as they are log files. This allows you to cat a directory that may contain a variety of outputs from Hadoop (e.g., _SUCCESS, _logs). This works on directories and files. The KV pairs may be interleaved between files (they are read in parallel).
| Parameters: | |
|---|---|
| Returns: | An iterator of key, value pairs. |
| Raises: | IOError: An error occurred reading the directory (e.g., not available). |
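The default filtering of underscore-prefixed files described above can be sketched as a simple filter over a directory listing. This is an illustration of the rule, not Hadoopy's code; `visible_paths` is a hypothetical name and the listing is made up.

```python
import posixpath


def visible_paths(paths):
    """Keep only paths whose final component does not start with '_'."""
    return [p for p in paths
            if not posixpath.basename(p.rstrip('/')).startswith('_')]


listing = ['out/part-00000', 'out/part-00001', 'out/_SUCCESS', 'out/_logs/']
print(visible_paths(listing))
```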
Write typedbytes sequence file to HDFS given an iterator of KeyValue pairs
| Parameters: | |
|---|---|
| Raises: | IOError: An error occurred while saving the data. |
Return the absolute path to a file and canonicalize it
Path is returned without a trailing slash and without redundant slashes. Caches the user’s home directory.
| Parameters: | path – A string for the path. This should not have any wildcards. |
|---|---|
| Returns: | Absolute path to the file |
| Raises IOError: | If unsuccessful |
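The canonical form described above (absolute, no trailing slash, no redundant slashes) can be illustrated with a small string-only sketch. It does not contact HDFS and is not Hadoopy's implementation: `canonicalize` is a hypothetical function, and the home directory is passed in explicitly as an assumed value.

```python
import re


def canonicalize(path, home):
    """Make `path` absolute relative to `home` and tidy its slashes.

    Handles plain paths only; URIs such as hdfs://host/path are out of scope.
    """
    if not path.startswith('/'):
        path = home + '/' + path      # relative paths resolve against home
    path = re.sub(r'/+', '/', path)   # collapse runs of slashes
    if len(path) > 1:
        path = path.rstrip('/')       # drop the trailing slash, keep root '/'
    return path


print(canonicalize('data//in/', '/user/alice'))
```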
List files on HDFS.
| Parameters: | path – A string (potentially with wildcards). |
|---|---|
| Return type: | A list of strings representing HDFS paths. |
| Raises: | IOError: An error occurred listing the directory (e.g., not available). |
Get a file from HDFS
| Parameters: | |
|---|---|
| Raises: | IOError: If unsuccessful |
Put a file on HDFS
| Parameters: | |
|---|---|
| Raises: | IOError: If unsuccessful |
Remove a file if it exists (recursive)
| Parameters: | path – A string (potentially with wildcards). |
|---|---|
| Raises IOError: | If unsuccessful |
Check if a path has zero length (also true if it’s a directory)
| Parameters: | path – A string for the path. This should not have any wildcards. |
|---|---|
| Returns: | True if the path has zero length, False otherwise. |
Check if a path is a directory
| Parameters: | path – A string for the path. This should not have any wildcards. |
|---|---|
| Returns: | True if the path is a directory, False otherwise. |
Check if a file exists.
| Parameters: | path – A string for the path. This should not have any wildcards. |
|---|---|
| Returns: | True if the path exists, False otherwise. |