Recording setup. Data was recorded with a mobile robot equipped with two 2D laser scanners (360° coverage) and a forward-facing RGB camera. Robot poses and the geometric map were obtained with GMapping (2D SLAM). RGB and laser observations are temporally synchronized, and each laser return is stored together with its image coordinates, which gives a point-to-pixel correspondence between geometric and visual data. The full exploration run was subsampled with motion thresholds (30 cm translation / 15° rotation), yielding 74 frames from a single controlled environment.
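The motion-threshold subsampling can be illustrated with a minimal sketch. It is not the authors' selection code: the function name `select_keyframes`, the `(x, y, theta)` pose layout, and the choice to keep a frame when either threshold is exceeded (rather than both) are assumptions made for illustration.

```python
import math


def select_keyframes(poses, min_dist=0.30, min_angle_deg=15.0):
    """Return indices of poses kept by a translation/rotation threshold.

    poses: sequence of (x, y, theta), x and y in meters, theta in radians.
    A pose is kept when it differs from the last kept pose by at least
    min_dist meters OR min_angle_deg degrees (the OR is an assumption).
    """
    if not poses:
        return []
    kept = [0]  # always keep the first pose as the reference
    min_angle = math.radians(min_angle_deg)
    for i in range(1, len(poses)):
        x0, y0, t0 = poses[kept[-1]]
        x1, y1, t1 = poses[i]
        dist = math.hypot(x1 - x0, y1 - y0)
        # Wrap the heading difference into [-pi, pi] before comparing.
        dtheta = math.atan2(math.sin(t1 - t0), math.cos(t1 - t0))
        if dist >= min_dist or abs(dtheta) >= min_angle:
            kept.append(i)
    return kept
```

Comparing against the last kept pose, rather than the previous pose, avoids discarding slow continuous motion that never exceeds a threshold between consecutive samples.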
Structure. The data is organized per frame (74 frames). Each frame provides four files: undistorted_image.png, laser.json, sam1_fine_fused_instances.json (automatically generated, fused SAM masks), and sam1_gt_instances.json (manually annotated ground-truth masks).
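As a rough sketch of how one frame could be read, the loader below assumes that each frame lives in its own directory containing the four files named above. The directory layout and the helper name `load_frame` are assumptions; the README.md in the dataset is the authoritative reference for the actual layout.

```python
import json
from pathlib import Path

# File names as listed in the dataset description.
FRAME_FILES = {
    "image": "undistorted_image.png",
    "laser": "laser.json",
    "fused": "sam1_fine_fused_instances.json",
    "gt": "sam1_gt_instances.json",
}


def load_frame(frame_dir):
    """Load the JSON files of one frame; the image is returned as a path.

    Assumes one directory per frame (hypothetical layout).
    """
    frame_dir = Path(frame_dir)
    missing = [n for n in FRAME_FILES.values() if not (frame_dir / n).is_file()]
    if missing:
        raise FileNotFoundError(f"{frame_dir} is missing: {', '.join(missing)}")
    frame = {"image_path": frame_dir / FRAME_FILES["image"]}
    for key in ("laser", "fused", "gt"):
        with open(frame_dir / FRAME_FILES[key], encoding="utf-8") as f:
            frame[key] = json.load(f)
    return frame
```

The image is left as a path so that the caller can open it with whichever image library is already in use.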
Formats.
Images: PNG, 768 × 480, RGB, lens-undistorted. Pixel origin top-left; u = column, v = row.
laser.json: JSON object with points (list of {u, v, x, y, z, intensity}) and a Unix timestamp. u, v are image coordinates; x, y, z are positions in meters relative to the LiDAR frame (z = 0 for planar scans).
Instance files: JSON list of instances. Each instance is defined by a segmentation mask given as a list of [u, v] pixel coordinates.
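The pixel conventions above (u = column, v = row, 768 × 480 images) can be made concrete with a short NumPy sketch that turns one instance's coordinate list into a boolean mask. The function name `mask_from_pixels` is hypothetical, and the sketch assumes the list enumerates mask pixels rather than polygon vertices; how the list is keyed inside each instance object is not specified here, so the function takes the coordinate list directly.

```python
import numpy as np

IMAGE_WIDTH, IMAGE_HEIGHT = 768, 480  # image size stated in the description


def mask_from_pixels(pixels, width=IMAGE_WIDTH, height=IMAGE_HEIGHT):
    """Build a boolean (height, width) mask from a list of [u, v] pairs.

    u is the column index and v the row index, so the array is
    addressed as mask[v, u]. Out-of-image coordinates are ignored.
    """
    mask = np.zeros((height, width), dtype=bool)
    if len(pixels) == 0:
        return mask
    uv = np.asarray(pixels, dtype=int).reshape(-1, 2)
    u, v = uv[:, 0], uv[:, 1]
    inside = (u >= 0) & (u < width) & (v >= 0) & (v < height)
    mask[v[inside], u[inside]] = True
    return mask
```

Note the index order: NumPy arrays are row-major, so a pixel [u, v] is read as `mask[v, u]`, which is also how the corresponding image array is indexed.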
Software / reuse. All annotations are plain JSON and images are standard PNG; no proprietary software is required. Files can be parsed with any JSON library (e.g. Python json) and inspected with standard image tools or NumPy/OpenCV/Matplotlib. Pixel coordinates index directly into the corresponding undistorted image, and laser (u, v) values map laser returns into the same image plane. The dataset supports tasks such as instance segmentation, multi-view object association, geometric/semantic mapping, and benchmarking vision-language models for intralogistics perception. A detailed README.md (including the two evaluated VLM prompts) is included in the dataset.
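One way to use the shared image plane is to collect the laser returns that fall inside an instance mask and derive a coarse distance for that object. The sketch below is illustrative only: the helper names are hypothetical, rounding (u, v) to the nearest pixel is an assumption, and the range is computed in the x-y plane of the LiDAR frame. The keys `u`, `v`, `x`, and `y` follow the laser.json description.

```python
import math
import statistics


def points_in_instance(laser_points, mask_pixels):
    """Return the laser points whose (u, v) falls on a mask pixel.

    laser_points: list of dicts with at least the keys u, v, x, y.
    mask_pixels: list of [u, v] pairs for one instance.
    """
    mask = {(int(u), int(v)) for u, v in mask_pixels}
    return [
        p for p in laser_points
        if (int(round(p["u"])), int(round(p["v"]))) in mask
    ]


def median_range(points):
    """Median planar distance (meters) of the points from the sensor origin.

    Returns None when no point is given.
    """
    if not points:
        return None
    return statistics.median(math.hypot(p["x"], p["y"]) for p in points)
```

For example, `median_range(points_in_instance(laser["points"], pixels))` gives a single distance estimate per instance, where `laser` is the parsed laser.json object and `pixels` is one instance's coordinate list. The median is used instead of the mean so that a few returns from the background behind the object have limited influence.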