
Depth-Aware Object Insertion in Videos Using Python | by Berkan Zorlubas | Aug, 2023

We now arrive at the core element of the project: video processing. In my repository, two key scripts are provided: video_processing_utils.py and depth_aware_object_insertion.py. As their names imply, video_processing_utils.py houses all of the essential functions for object insertion, whereas depth_aware_object_insertion.py serves as the main script that applies these functions to every video frame inside a loop.

A snipped version of the main section of depth_aware_object_insertion.py is given below. In a for loop that runs as many times as the number of frames in the input video, we load batched data from the depth computation pipeline, from which we get the original RGB frame and its depth estimation. Then we compute the inverse of the camera pose matrix. Afterwards, we feed the mesh, the depth, and the camera intrinsics into a function named render_mesh_with_depth().

import os
import numpy as np
from tqdm import tqdm

for i in tqdm(range(batch_count)):

    batch = np.load(os.path.join(BATCH_DIRECTORY, file_names[i]))

    # ... (snipped for brevity)

    # transformation of the mesh with the inverse camera extrinsics
    frame_transformation = np.vstack(np.split(extrinsics_data[i], 4))
    inverse_frame_transformation = np.empty((4, 4))
    inverse_frame_transformation[:3, :] = np.concatenate((np.linalg.inv(frame_transformation[:3, :3]),
                                                          np.expand_dims(-1 * frame_transformation[:3, 3], 0).T), axis=1)
    inverse_frame_transformation[3, :] = [0.00, 0.00, 0.00, 1.00]
    mesh.transform(inverse_frame_transformation)

    # ... (snipped for brevity)

    image = np.transpose(batch['img_1'], (2, 3, 1, 0))[:, :, :, 0]
    depth = np.transpose(batch['depth'], (2, 3, 1, 0))[:, :, 0, 0]

    # ... (snipped for brevity)

    # rendering the color and depth buffer of the transformed mesh in the image domain
    mesh_color_buffer, mesh_depth_buffer = render_mesh_with_depth(np.array(mesh.vertices),
                                                                  np.array(mesh.vertex_colors),
                                                                  np.array(mesh.triangles),
                                                                  depth, intrinsics)

    # depth-aware overlaying of the mesh and the original image
    combined_frame, combined_depth = combine_frames(image, mesh_color_buffer, depth, mesh_depth_buffer)

    # ... (snipped for brevity)

The render_mesh_with_depth function takes a 3D mesh, represented by its vertices, vertex colors, and triangles, and renders it onto a 2D depth frame. The function starts by initializing depth and color buffers to hold the rendered output. It then projects the 3D mesh vertices onto the 2D frame using the camera intrinsic parameters. The function uses scanline rendering to loop through each triangle in the mesh, rasterizing it into pixels on the 2D frame. During this process, the function computes barycentric coordinates for each pixel to interpolate depth and color values. These interpolated values are then used to update the depth and color buffers, but only if the pixel’s interpolated depth is closer to the camera than the current value in the depth buffer. Finally, the function returns the color and depth buffers as the rendered output, with the color buffer converted to a uint8 format suitable for image display.

def render_mesh_with_depth(mesh_vertices, vertex_colors, triangles, depth_frame, intrinsic):
    vertex_colors = np.asarray(vertex_colors)

    # Initialize depth and color buffers
    buffer_width, buffer_height = depth_frame.shape[1], depth_frame.shape[0]
    mesh_depth_buffer = np.ones((buffer_height, buffer_width)) * np.inf

    # Project 3D vertices to 2D image coordinates
    vertices_homogeneous = np.hstack((mesh_vertices, np.ones((mesh_vertices.shape[0], 1))))
    camera_coords = vertices_homogeneous.T[:-1, :]
    projected_vertices = intrinsic @ camera_coords
    projected_vertices /= projected_vertices[2, :]
    projected_vertices = projected_vertices[:2, :].T.astype(int)
    depths = camera_coords[2, :]

    mesh_color_buffer = np.zeros((buffer_height, buffer_width, 3), dtype=np.float32)

    # Loop through each triangle to render it
    for triangle in triangles:
        # Get 2D points and depths for the triangle vertices
        points_2d = np.array([projected_vertices[v] for v in triangle])
        triangle_depths = [depths[v] for v in triangle]
        colors = np.array([vertex_colors[v] for v in triangle])

        # Sort the vertices by their y-coordinates for scanline rendering
        order = np.argsort(points_2d[:, 1])
        points_2d = points_2d[order]
        triangle_depths = np.array(triangle_depths)[order]
        colors = colors[order]

        y_mid = points_2d[1, 1]

        for y in range(points_2d[0, 1], points_2d[2, 1] + 1):
            if y < 0 or y >= buffer_height:
                continue

            # Determine start and end x-coordinates for the current scanline
            if y < y_mid:
                x_start = interpolate_values(y, points_2d[0, 1], points_2d[1, 1], points_2d[0, 0], points_2d[1, 0])
                x_end = interpolate_values(y, points_2d[0, 1], points_2d[2, 1], points_2d[0, 0], points_2d[2, 0])
            else:
                x_start = interpolate_values(y, points_2d[1, 1], points_2d[2, 1], points_2d[1, 0], points_2d[2, 0])
                x_end = interpolate_values(y, points_2d[0, 1], points_2d[2, 1], points_2d[0, 0], points_2d[2, 0])

            x_start, x_end = int(x_start), int(x_end)

            # Loop through each pixel in the scanline
            for x in range(x_start, x_end + 1):
                if x < 0 or x >= buffer_width:
                    continue

                # Compute barycentric coordinates for the pixel
                s, t, u = compute_barycentric_coords(points_2d, x, y)

                # Check if the pixel lies inside the triangle
                if s >= 0 and t >= 0 and u >= 0:
                    # Interpolate depth and color for the pixel
                    depth_interp = s * triangle_depths[0] + t * triangle_depths[1] + u * triangle_depths[2]
                    color_interp = s * colors[0] + t * colors[1] + u * colors[2]

                    # Update the pixel if it is closer to the camera
                    if depth_interp < mesh_depth_buffer[y, x]:
                        mesh_depth_buffer[y, x] = depth_interp
                        mesh_color_buffer[y, x] = color_interp

    # Convert float colors to uint8
    mesh_color_buffer = (mesh_color_buffer * 255).astype(np.uint8)

    return mesh_color_buffer, mesh_depth_buffer
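
The two helpers called above, interpolate_values() and compute_barycentric_coords(), are presumably defined in video_processing_utils.py. For reference, here is a minimal sketch of what they might look like (plain linear interpolation and the standard barycentric-coordinate formula); the repository's own implementations may differ in detail.

def interpolate_values(y, y_start, y_end, x_start, x_end):
    # Linearly interpolate the x-coordinate at scanline y between (y_start, x_start) and (y_end, x_end)
    if y_end == y_start:
        return x_start
    return x_start + (x_end - x_start) * (y - y_start) / (y_end - y_start)

def compute_barycentric_coords(points_2d, x, y):
    # points_2d is a (3, 2) array of the triangle's projected vertices
    (x0, y0), (x1, y1), (x2, y2) = points_2d
    denom = (y1 - y2) * (x0 - x2) + (x2 - x1) * (y0 - y2)
    if denom == 0:
        # Degenerate (zero-area) triangle: report the pixel as outside
        return -1.0, -1.0, -1.0
    s = ((y1 - y2) * (x - x2) + (x2 - x1) * (y - y2)) / denom
    t = ((y2 - y0) * (x - x2) + (x0 - x2) * (y - y2)) / denom
    u = 1.0 - s - t
    return s, t, u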

The color and depth buffers of the transformed mesh are then fed into the combine_frames() function together with the original RGB image and its estimated depth map. This function is designed to merge two sets of image and depth frames. It uses the depth information to decide which pixels in the original frame should be replaced by the corresponding pixels in the rendered mesh frame. Specifically, for each pixel, the function checks whether the depth value of the rendered mesh is less than the depth value of the original scene. If it is, that pixel is considered to be “closer” to the camera in the rendered mesh frame, and the pixel values in both the color and depth frames are replaced accordingly. The function returns the combined color and depth frames, effectively overlaying the rendered mesh onto the original scene based on depth information.

# Combine the original and mesh-rendered frames based on depth information
def combine_frames(original_frame, rendered_mesh_img, original_depth_frame, mesh_depth_buffer):
    # Create a mask where the mesh is closer than the original depth
    mesh_mask = mesh_depth_buffer < original_depth_frame

    # Initialize combined frames
    combined_frame = original_frame.copy()
    combined_depth = original_depth_frame.copy()

    # Update the combined frames with mesh information where the mask is True
    combined_frame[mesh_mask] = rendered_mesh_img[mesh_mask]
    combined_depth[mesh_mask] = mesh_depth_buffer[mesh_mask]

    return combined_frame, combined_depth

Here is how the mesh_color_buffer, mesh_depth_buffer, and the combined_frame look for the first object, an elephant. Since the elephant object is not occluded by any other element within the frame, it remains fully visible. In different placements, occlusion would occur.

Next, I placed the second mesh, the car, at the curbside of the road. I also adjusted its initial orientation so that it looks like it has been parked there. The following visuals are the corresponding mesh_color_buffer, mesh_depth_buffer, and combined_frame for this mesh.
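
For illustration, positioning and orienting a mesh like this can be done with a couple of rigid transforms before the main loop. The snippet below is a minimal sketch assuming the meshes are handled with Open3D (which matches the mesh.transform / mesh.vertices usage above); the file name, rotation angle, and offsets are placeholders rather than values from the project.

import numpy as np
import open3d as o3d

# Placeholder path and values, for illustration only
car_mesh = o3d.io.read_triangle_mesh("car.ply")

# Rotate around the vertical axis so the car aligns with the curb,
# then translate it to the chosen spot in world coordinates
rotation = car_mesh.get_rotation_matrix_from_xyz((0.0, np.deg2rad(35.0), 0.0))
car_mesh.rotate(rotation, center=car_mesh.get_center())
car_mesh.translate((1.5, 0.0, 4.0))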

The point cloud visualization with both objects inserted is given below. More white gaps are introduced due to the new occlusion regions that come with the added objects.
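
As a rough sketch of how such a visualization can be produced, the combined RGB and depth frames can be back-projected through the camera intrinsics with the pinhole model. The code below assumes Open3D and is illustrative, not the project's exact visualization code.

import numpy as np
import open3d as o3d

def rgbd_to_point_cloud(color_frame, depth_frame, intrinsic):
    # Back-project every pixel with a valid, finite depth into 3D (pinhole model)
    height, width = depth_frame.shape
    fx, fy = intrinsic[0, 0], intrinsic[1, 1]
    cx, cy = intrinsic[0, 2], intrinsic[1, 2]

    u, v = np.meshgrid(np.arange(width), np.arange(height))
    valid = np.isfinite(depth_frame) & (depth_frame > 0)

    z = depth_frame[valid]
    x = (u[valid] - cx) * z / fx
    y = (v[valid] - cy) * z / fy

    point_cloud = o3d.geometry.PointCloud()
    point_cloud.points = o3d.utility.Vector3dVector(np.stack((x, y, z), axis=-1))
    point_cloud.colors = o3d.utility.Vector3dVector(color_frame[valid] / 255.0)
    return point_cloud

# e.g. o3d.visualization.draw_geometries([rgbd_to_point_cloud(combined_frame, combined_depth, intrinsics)])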

After computing the overlaid images for each of the video frames, we are now ready to render our video.
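
One straightforward way to do this, sketched below with OpenCV (not necessarily the exact approach used in the repository), is to write the combined frames into a video file; the directory name, frame rate, and file pattern are placeholders.

import os
import cv2

# Placeholder directory holding the combined frames as frame_0000.png, frame_0001.png, ...
FRAME_DIRECTORY = "combined_frames"
frame_files = sorted(f for f in os.listdir(FRAME_DIRECTORY) if f.endswith(".png"))

first_frame = cv2.imread(os.path.join(FRAME_DIRECTORY, frame_files[0]))
height, width = first_frame.shape[:2]

# 30 fps MP4 output; adjust the frame rate to match the source video
writer = cv2.VideoWriter("output.mp4", cv2.VideoWriter_fourcc(*"mp4v"), 30, (width, height))
for file_name in frame_files:
    writer.write(cv2.imread(os.path.join(FRAME_DIRECTORY, file_name)))
writer.release()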