Outputs
Each backend processor generates outputs in a specific format, usually written to disk as text files. As a result, output formats vary significantly across backends. Bitbox therefore includes wrapper functions that convert these outputs into a standard Python dictionary format.
File Caching System
Running backend processors to produce output files can take time, usually a few minutes per video file, depending on the hardware. To improve analysis efficiency and provide a versioning system, Bitbox includes an integrated file caching mechanism.
Each time a processor runs, Bitbox checks whether the output files and their metadata, which record the last execution, already exist in the specified output directory. This metadata is stored as .json files. If both are found, Bitbox verifies that the time elapsed since their creation is within the retention period (default: 6 months). If so, Bitbox reuses the existing files instead of recreating them, which saves significant time.
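The check described above can be sketched as follows. This is an illustrative stand-in, not Bitbox's actual implementation; the function name and metadata key are assumptions (the `time` key mirrors the metadata example shown later).

```python
# Illustrative sketch of the cache-validity check; not Bitbox's actual code.
# The function name and metadata layout are assumptions for this example.
import json
import os
import time

RETENTION_SECONDS = 6 * 30 * 24 * 3600  # default retention: roughly 6 months

def is_cache_valid(output_file, retention=RETENTION_SECONDS):
    """Return True if the output file and its .json metadata exist and are fresh."""
    meta_file = os.path.splitext(output_file)[0] + '.json'
    if not (os.path.exists(output_file) and os.path.exists(meta_file)):
        return False
    with open(meta_file) as f:
        metadata = json.load(f)
    # 'time' records the last execution timestamp, e.g. "2025-07-18 12:33:50"
    last_run = time.mktime(time.strptime(metadata['time'], '%Y-%m-%d %H:%M:%S'))
    return (time.time() - last_run) < retention
```

If the check fails (missing files or an expired retention window), the processor is executed again and the cache is refreshed.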
Adjust the retention period according to your requirements.

```python
# you can set the retention period using natural language: 1 year, 3 minutes, etc.
processor.cache.change_retention_period('1 year')
```

Each file saved to disk will have an accompanying .json file, named identically, that tracks the details of the most recent execution.
"backend": "3DI",
"morphable_model": "BFMmm-19830",
"camera": 30,
"landmark": "global4",
"fast": false,
"local_bases": "0.0.1.F591-cd-K32d",
"input_hash": "4e31c4610ad3641ed651394855516d7989f9c5b3127520add6d87efc5618c162",
"cmd": "CUDA_VISIBLE_DEVICES=1 docker run --rm --gpus device=1 -v /home/test/bitbox/tutorials/data:/app/input -v /home/test/bitbox/tutorials/output:/app/output -w /app/3DI bitbox:cuda12 ./video_detect_landmarks /app/input/elaine.mp4 /app/output/elaine_rects.3DI /app/output/elaine_landmarks.3DI /app/3DI/configs/BFMmm-19830.cfg1.global4.txt > /dev/null",
"input": "/home/test/bitbox/tutorials/data/elaine.mp4",
"output": "/home/test/bitbox/tutorials/output",
"time": "2025-07-18 12:33:50"Outputs Types
Bitbox returns Python dictionaries by default after each processing step, allowing users to easily manipulate the output. If you prefer that the steps return nothing and only generate backend output files, set the return_output parameter to None. To receive the paths of the generated files instead, set it to 'file'.
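The three modes can be sketched with a toy stand-in; `run_step` is a hypothetical placeholder, not a Bitbox function, and the return values are illustrative only.

```python
# Toy stand-in for a Bitbox processing step, showing the three return_output
# modes described above. The real processors also write backend output files
# as a side effect, which this stub skips.
def run_step(return_output='dict'):
    output_files = ['/app/output/elaine_landmarks.3DI']  # example path
    if return_output is None:
        return None                # only backend output files are generated
    if return_output == 'file':
        return output_files        # paths of the generated files
    return {'file': output_files[0]}  # default: a Python dictionary
```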
Output Formats
Below is a list of common components of face and body analysis pipelines and their associated outputs. The wrapper functions generate these raw behavioral signals, which serve as inputs for analysis functions to produce behavioral measurements. Details on the outputs of analysis functions are given in the Biomechanics, Affective Expressions, and Social Dynamics sections.
Face Rectangles
The dictionary containing the coordinates for the face rectangles is structured as follows.
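As a rough sketch only: the key names and per-frame rectangle columns below are assumptions for illustration, not the documented schema.

```python
# Hypothetical sketch of a face-rectangle dictionary; key and column names
# are assumptions. The 288-frame count matches the example video used in
# this section.
import numpy as np
import pandas as pd

n_frames = 288
rects = {
    'frames': n_frames,                         # assumed key name
    'data': pd.DataFrame(
        np.zeros((n_frames, 4)),                # placeholder values
        columns=['x', 'y', 'width', 'height'],  # assumed column names
    ),
}
```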
Head Pose
The dictionary containing head pose is structured as follows.
The first three values (Tx, Ty, Tz) are the x, y, z coordinates of the translation vector and the last three values (Rx, Ry, Rz) are pitch, yaw, roll angles in radians of the rotation vector.
The frame count for pose estimation (287) is one less than the frames reported for face rectangles (288). This difference arises because pose estimation occurs after 3D fitting, which involves comparing subsequent frames.
When using 3DI-lite as the face processor, the pose variable only includes rotation angles, not translation coordinates.
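As a hedged illustration, a pose dictionary consistent with the description above might look like the following; only the column meanings (Tx, Ty, Tz, Rx, Ry, Rz) and the 287-frame count come from the text, everything else is assumed.

```python
# Hedged sketch of the head-pose output, assuming the dictionary wraps a
# pandas DataFrame; key name and exact layout are assumptions.
import numpy as np
import pandas as pd

n_frames = 287  # one less than the 288 face-rectangle frames (see above)
pose = {
    'frames': n_frames,                            # assumed key name
    'data': pd.DataFrame(
        np.zeros((n_frames, 6)),                   # placeholder values
        columns=['Tx', 'Ty', 'Tz', 'Rx', 'Ry', 'Rz'],
    ),
}
```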
2D Face Landmarks
The dictionary containing the coordinates for the facial landmarks is structured as follows.
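A hedged sketch of what such a dictionary might contain, assuming x and y coordinates per frame for each of the 51 iBUG landmarks; the key and column naming scheme is a guess, not the documented format.

```python
# Illustrative only: assuming each frame stores x and y for each of the
# 51 iBUG landmarks, giving 102 columns. Column names are assumptions.
import numpy as np
import pandas as pd

n_frames, n_landmarks = 288, 51
cols = [f'x_{i}' for i in range(n_landmarks)] + \
       [f'y_{i}' for i in range(n_landmarks)]
landmarks_2d = {
    'frames': n_frames,  # assumed key name
    'data': pd.DataFrame(np.zeros((n_frames, 2 * n_landmarks)), columns=cols),
}
```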
Below is an illustration showcasing the 51 landmarks from the iBUG schema included in Bitbox.

3D Face Landmarks
3DI and 3DI-Lite backends also identify 3D coordinates for the same 51 landmarks in a standardized/canonicalized template, adjusted for pose and individual identity. These coordinates effectively represent expression-related motion only. The dictionary containing these coordinates is structured as follows.
The frame count for 3D landmarks (287) is one less than the frames reported for 2D landmarks (288). This difference arises because 3D landmark estimation occurs after 3D fitting, which involves comparing subsequent frames.
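A hedged sketch of the 3D landmark dictionary, extending the 2D case with a z coordinate and the 287-frame count noted above; key and column names are assumptions.

```python
# Illustrative only: 3D coordinates (x, y, z) for the 51 canonicalized
# landmarks, one row per frame. Column names are assumptions.
import numpy as np
import pandas as pd

n_frames, n_landmarks = 287, 51  # one frame fewer than the 2D landmarks
cols = [f'{axis}_{i}' for axis in ('x', 'y', 'z') for i in range(n_landmarks)]
landmarks_3d = {
    'frames': n_frames,  # assumed key name
    'data': pd.DataFrame(np.zeros((n_frames, 3 * n_landmarks)), columns=cols),
}
```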
Facial Expressions
The dictionary containing facial expressions is structured as follows.
Depending on the backend processor used, the columns of the data frame have different meanings. With 3DI and 3DI-Lite, they may represent global, non-interpretable facial deformations along 79 PCA directions if generated by processor.fit(). Alternatively, they can denote localized, interpretable facial motions, similar to Action Units, if generated by processor.localized_expressions(). The format field will inform you of their specific representation. With OpenFace (coming soon), they will correspond to Action Units.
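As a hedged illustration of the processor.fit() case with 3DI, the data frame would have 79 columns, one per PCA direction; the key names, the 'format' value, and the column labels here are assumptions.

```python
# Hedged sketch of an expression dictionary from processor.fit() with 3DI:
# 79 global PCA expression coefficients per frame. Key names, the 'format'
# value, and column labels are assumptions.
import numpy as np
import pandas as pd

n_frames, n_coeffs = 287, 79
expressions = {
    'frames': n_frames,   # assumed key name
    'format': 'PCA',      # would differ for localized_expressions() output
    'data': pd.DataFrame(
        np.zeros((n_frames, n_coeffs)),           # placeholder values
        columns=[f'E{i}' for i in range(n_coeffs)],
    ),
}
```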
Body Joints
Coming Soon