Downloading Datasets

This tutorial explains how to download and access various Abstraction and Reasoning Corpus (ARC) datasets using JaxARC. It covers automatic downloading, querying available tasks, working with subsets, and loading specific tasks.

Supported ARC Datasets

JaxARC supports multiple ARC dataset variants, each designed for different use cases:

ARC-AGI-1

The original Abstraction and Reasoning Corpus challenge dataset.

  • Source: GitHub - fchollet/ARC-AGI

  • Size: 400 training tasks, 400 evaluation tasks

  • Grid Size: Variable, from 1x1 up to 30x30

  • Use Case: Original ARC challenge benchmark

ARC-AGI-2

Updated version with additional tasks and refinements.

ConceptARC

Organized by concept groups for systematic evaluation.

  • Source: GitHub - victorvikram/ConceptARC

  • Size: 16 concept groups

  • Concepts: Spatial and semantic concepts such as AboveBelow, Center, CleanUp, CompleteShape, and Copy

  • Use Case: Systematic testing of specific reasoning capabilities

MiniARC

Compact 5x5 grid version for rapid prototyping.

  • Source: Subset of ARC-AGI

  • Size: Smaller task set

  • Grid Size: Fixed 5x5

  • Use Case: Fast experimentation and debugging

Pre-Download Using CLI Script (Optional)

If you prefer to download datasets before running your code, use the CLI script:

# Download MiniARC
python scripts/download_dataset.py miniarc

# Download ARC-AGI-1
python scripts/download_dataset.py arc_agi_1

# Download ARC-AGI-2
python scripts/download_dataset.py arc_agi_2

# Download ConceptARC
python scripts/download_dataset.py conceptarc

# Download all datasets
python scripts/download_dataset.py all

# Download to custom directory
python scripts/download_dataset.py miniarc --output ./my_data

# Force re-download
python scripts/download_dataset.py miniarc --force
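
To script several downloads from Python, the CLI calls above can be assembled programmatically. The following is a minimal sketch: `build_download_command` is a hypothetical helper (not part of JaxARC), and it assumes the script lives at `scripts/download_dataset.py` and accepts the `--output` and `--force` flags exactly as shown above.

```python
import sys


def build_download_command(dataset, output=None, force=False):
    """Build the argument list for scripts/download_dataset.py (hypothetical helper)."""
    # Use the current interpreter so the script runs in the same environment.
    cmd = [sys.executable, "scripts/download_dataset.py", dataset]
    if output is not None:
        cmd += ["--output", str(output)]
    if force:
        cmd.append("--force")
    return cmd


# Print the commands that would be run for two datasets.
for name in ["miniarc", "conceptarc"]:
    print(" ".join(build_download_command(name, output="./my_data")))
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` from the repository root to perform the actual download.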

Query Available Tasks

List the task IDs available in a dataset:

from jaxarc.registration import available_task_ids

# Query available tasks (downloads if needed)
task_ids = available_task_ids("Mini", auto_download=True)
print(f"Available MiniARC tasks: {len(task_ids)}")
print(f"First 5 tasks: {task_ids[:5]}")

# Query without auto-download (requires pre-downloaded dataset)
try:
    agi1_tasks = available_task_ids("AGI1", auto_download=False)
    print(f"ARC-AGI-1 tasks: {len(agi1_tasks)}")
except Exception as e:
    print(f"Dataset not downloaded: {e}")
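
Since the result of `available_task_ids` is sliced and formatted like a sequence of ID strings above, plain Python is enough to search it. This is a minimal sketch under that assumption: `find_tasks` is a hypothetical helper, and every sample ID except the first is invented for illustration.

```python
def find_tasks(task_ids, keyword):
    """Return the task IDs that contain `keyword`, ignoring case."""
    needle = keyword.lower()
    return [tid for tid in task_ids if needle in str(tid).lower()]


# Stand-in for available_task_ids("Mini", auto_download=True).
# Only the first ID appears in this tutorial; the others are made up.
sample_ids = [
    "Most_Common_color_l6ab0lf3xztbyxsu3p",
    "example_rotate_task",
    "example_COLOR_swap_task",
]
print(find_tasks(sample_ids, "color"))
```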

Work with Named Subsets

Query and use subsets of tasks:

from jaxarc.registration import available_named_subsets, get_subset_task_ids

# See available subsets for a dataset
subsets = available_named_subsets("AGI1")
print(f"AGI1 subsets: {subsets}")
# Output: ('all', 'eval', 'train')

# Get task IDs for a specific subset
train_tasks = get_subset_task_ids("AGI1", "train", auto_download=True)
eval_tasks = get_subset_task_ids("AGI1", "eval", auto_download=True)
print(f"Training tasks: {len(train_tasks)}")
print(f"Evaluation tasks: {len(eval_tasks)}")

# ConceptARC has concept-based subsets
concept_subsets = available_named_subsets("Concept")
print(f"Concept subsets: {concept_subsets[:5]}")
# Output: ('AboveBelow', 'Center', 'CleanUp', 'CompleteShape', 'Copy')

# Get tasks for a specific concept
center_tasks = get_subset_task_ids("Concept", "Center", auto_download=True)
print(f"'Center' concept tasks: {len(center_tasks)}")
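
When comparing subsets, it can help to tabulate their sizes and confirm that splits such as train and eval do not share tasks. The sketch below shows one way to do that. `summarize_subsets` and `overlapping_tasks` are hypothetical helpers, and the stub data stands in for real JaxARC calls.

```python
def summarize_subsets(subset_names, get_ids):
    """Map each subset name to its task count.

    `get_ids` is any callable that takes a subset name and returns task IDs.
    """
    return {name: len(get_ids(name)) for name in subset_names}


def overlapping_tasks(ids_a, ids_b):
    """Return the sorted task IDs that appear in both collections."""
    return sorted(set(ids_a) & set(ids_b))


# Stub data standing in for real subsets (IDs are invented).
fake_subsets = {
    "train": ["t1", "t2", "t3"],
    "eval": ["e1", "e2"],
    "all": ["t1", "t2", "t3", "e1", "e2"],
}
print(summarize_subsets(("all", "eval", "train"), fake_subsets.__getitem__))
print(overlapping_tasks(fake_subsets["train"], fake_subsets["eval"]))
```

With JaxARC installed, `get_ids` could be `lambda name: get_subset_task_ids("AGI1", name, auto_download=True)`.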

Load Specific Tasks

Create environments for specific tasks or subsets:

import jaxarc

# Load a specific task by ID
env, env_params = jaxarc.make(
    "Mini-Most_Common_color_l6ab0lf3xztbyxsu3p", auto_download=True
)
print("Loaded specific task")

# Load train split for AGI1
env, env_params = jaxarc.make("AGI1-train", auto_download=True)
print("Loaded AGI1 training split")

# Load eval split for AGI1
env, env_params = jaxarc.make("AGI1-eval", auto_download=True)
print("Loaded AGI1 evaluation split")

# Load specific concept from ConceptARC
env, env_params = jaxarc.make("Concept-Center", auto_download=True)
print("Loaded Center concept tasks")
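
The IDs passed to `jaxarc.make` above all follow the same pattern: a dataset key, a hyphen, then either a task ID or a subset name. The sketch below makes that pattern explicit. `build_env_id` and `split_env_id` are hypothetical helpers, and the format is inferred from this tutorial's examples rather than from JaxARC's source.

```python
def build_env_id(dataset_key, selector=None):
    """Join a dataset key and an optional task ID or subset name."""
    return dataset_key if selector is None else f"{dataset_key}-{selector}"


def split_env_id(env_id):
    """Split on the first hyphen only, so selectors may themselves contain hyphens."""
    dataset_key, sep, selector = env_id.partition("-")
    return dataset_key, (selector if sep else None)


print(build_env_id("AGI1", "train"))
print(split_env_id("Mini-Most_Common_color_l6ab0lf3xztbyxsu3p"))
print(split_env_id("Mini"))
```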

Complete Example

Here’s a complete script that explores datasets:

#!/usr/bin/env python3
"""
Explore ARC datasets with JaxARC.
"""

import jax
import jaxarc
from jaxarc.registration import (
    available_task_ids,
    available_named_subsets,
    get_subset_task_ids,
)


def explore_dataset(dataset_key="Mini"):
    """Explore an ARC dataset."""

    print(f"=== Exploring {dataset_key} Dataset ===\n")

    # Step 1: Query available subsets
    print("Available subsets:")
    subsets = available_named_subsets(dataset_key)
    print(f"  {subsets}\n")

    # Step 2: Get all available tasks
    print("Querying tasks...")
    task_ids = available_task_ids(dataset_key, auto_download=True)
    print(f"  Total tasks: {len(task_ids)}")
    print(f"  First 5: {task_ids[:5]}\n")

    # Step 3: Create environment for specific task
    print("Loading first task...")
    env, env_params = jaxarc.make(f"{dataset_key}-{task_ids[0]}", auto_download=True)
    print("  Environment created\n")

    # Step 4: Reset and explore
    print("Testing environment:")
    # Split the key so reset and action sampling use independent randomness
    reset_key, action_key = jax.random.split(jax.random.PRNGKey(42))
    state, timestep = env.reset(reset_key, env_params=env_params)

    print(f"  Observation shape: {timestep.observation.shape}")
    print(f"  Step type: {timestep.step_type}")
    print(f"  Initial reward: {timestep.reward}")

    # Step 5: Take a random action
    action_space = env.action_space(env_params)
    action = action_space.sample(action_key)
    next_state, next_timestep = env.step(state, action, env_params=env_params)

    print(f"  After step - reward: {next_timestep.reward}")
    print("  Environment working correctly\n")

    return env, env_params, task_ids


if __name__ == "__main__":
    # Try different datasets
    for dataset_key in ["Mini", "AGI1"]:
        print("=" * 60)
        try:
            explore_dataset(dataset_key)
        except Exception as e:
            print(f"Error exploring {dataset_key}: {e}")
        print()

Custom Subsets

Register your own subset of tasks for curriculum learning or benchmarking:

from jaxarc.registration import register_subset, get_subset_task_ids
import jaxarc

# Get available tasks
all_tasks = get_subset_task_ids("Mini", "all", auto_download=True)

# Create custom subset (e.g., first 10 tasks for quick testing)
quick_test_tasks = all_tasks[:10]
register_subset("Mini", "quick", quick_test_tasks)

# Use your custom subset
env, env_params = jaxarc.make("Mini-quick", auto_download=True)
print(f"Created environment with {len(quick_test_tasks)} tasks")

# Verify it worked
loaded_tasks = get_subset_task_ids("Mini", "quick")
print(f"Quick subset has {len(loaded_tasks)} tasks")
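
For curriculum learning, one simple approach is to cut the task list into consecutive stages and register each stage as its own subset. The following is a minimal sketch: `make_curriculum_stages` and the `stage_N` names are hypothetical, and the task IDs are placeholders.

```python
def make_curriculum_stages(task_ids, stage_size):
    """Split task IDs into consecutive stages of at most `stage_size` tasks."""
    if stage_size <= 0:
        raise ValueError("stage_size must be positive")
    task_ids = list(task_ids)
    return {
        f"stage_{start // stage_size}": task_ids[start : start + stage_size]
        for start in range(0, len(task_ids), stage_size)
    }


# Placeholder IDs; in practice these would come from get_subset_task_ids.
stages = make_curriculum_stages([f"task_{n}" for n in range(7)], stage_size=3)
for name, ids in stages.items():
    # Each stage could then be registered with register_subset("Mini", name, ids).
    print(name, ids)
```

The split here is purely positional. Ordering tasks by difficulty before staging is a separate design choice that this sketch does not address.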

Common Issues

Issue: “Dataset not found”

Cause: The dataset has not been downloaded and auto_download=False is set.

Solution: Enable auto-download, or pre-download the dataset with the CLI script:

import jaxarc

# Enable auto-download
env, env_params = jaxarc.make("Mini", auto_download=True)

# Or pre-download with CLI
# python scripts/download_dataset.py miniarc

Issue: “Task ID not found”

Cause: Typo in task ID or task doesn’t exist in that dataset.

Solution: Query available tasks first:

import jaxarc
from jaxarc.registration import available_task_ids

# List all available tasks
tasks = available_task_ids("Mini", auto_download=True)
print(f"Available tasks: {tasks}")

# Use exact task ID
task_id = tasks[0]  # Use an actual task ID
env, env_params = jaxarc.make(f"Mini-{task_id}", auto_download=True)
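
When the cause is a typo, the standard library's `difflib` can suggest the closest known IDs. This is a minimal sketch, not a JaxARC feature: `suggest_task_ids` is a hypothetical helper, and the second ID in `known` is invented.

```python
import difflib


def suggest_task_ids(wanted, task_ids, limit=3):
    """Return up to `limit` known task IDs that closely resemble `wanted`."""
    return difflib.get_close_matches(wanted, task_ids, n=limit, cutoff=0.6)


# Stand-in for the list returned by available_task_ids.
known = ["Most_Common_color_l6ab0lf3xztbyxsu3p", "example_other_task"]

# "Comon" is a deliberate typo for "Common".
print(suggest_task_ids("Most_Comon_color_l6ab0lf3xztbyxsu3p", known))
```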