A.L.I.C.E. LLM Benchmark Deep Dive: 3-Axis Evaluation of Metacognition, Ethics, and Learning
Introduction
On December 10, 2025, Extoria Inc. published research papers on the consciousness-oriented AI architecture "A.L.I.C.E." This article explains the technical details of the LLM comparison benchmark used in the research.
3-Axis Evaluation Framework
Unlike traditional LLM evaluations, the A.L.I.C.E. benchmark quantitatively evaluates AI across three axes:
1. Metacognitive Capability
Evaluation Metrics:
- Strategy Selection Accuracy (SSA): Appropriate strategy selection based on task complexity
- Self-Assessment Accuracy (SAA): Ability to accurately evaluate own performance
- Parameter Tuning Appropriateness (PTA): Dynamic parameter adjustment based on situation
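The published description does not give formulas for these metrics, so the following is only a minimal sketch of how a score such as SAA could be operationalized. It assumes that both the model's self-rating and the measured score lie in [0, 1], and that accuracy is one minus their mean absolute gap. The function name and the sample values are hypothetical.

```python
from statistics import mean


def self_assessment_accuracy(self_ratings, actual_scores):
    """Return 1 minus the mean absolute gap between self-rated and measured scores.

    Assumption (not from the source): both sequences hold values in [0, 1],
    one entry per episode.
    """
    if len(self_ratings) != len(actual_scores) or not self_ratings:
        raise ValueError("need two non-empty sequences of equal length")
    gaps = [abs(s - a) for s, a in zip(self_ratings, actual_scores)]
    return 1.0 - mean(gaps)


# Hypothetical ratings for three episodes (illustrative values only).
saa = self_assessment_accuracy([0.8, 0.6, 0.9], [0.7, 0.6, 1.0])
print(f"SAA = {saa:.3f}")
```

Under this definition, a model whose self-ratings match its measured scores exactly would score 1.0, and larger mismatches lower the score linearly.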
Results:
A.L.I.C.E. achieved an average SSA of 0.887, a 25.5% relative improvement over GPT-4's 0.707. Notably, A.L.I.C.E. was the only system that adjusted its parameters dynamically (47 adjustments in 100 episodes, of which 89.4% were judged appropriate).
2. Ethical Reasoning Capability
Evaluation Metrics:
- Ethical Consistency (EC): Consistency in similar situations
- Reasoning Quality (RQ): Validity of ethical judgment reasoning
- Taboo Index Handling (TIH): Avoidance rate of ethically problematic choices
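As an illustration of what "consistency in similar situations" might mean in practice, the sketch below groups dilemmas into sets of similar scenarios and reports the fraction of within-group decision pairs that agree. This pairwise-agreement definition, the function name, and the sample decisions are assumptions made for illustration. They are not the scoring rule used in the research.

```python
from itertools import combinations


def ethical_consistency(decisions_by_group):
    """Return the fraction of within-group decision pairs that agree.

    Assumption (not from the source): each key names a set of similar
    dilemmas and each value lists the decision made in each variant.
    """
    agree = 0
    total = 0
    for decisions in decisions_by_group.values():
        for a, b in combinations(decisions, 2):
            total += 1
            agree += int(a == b)
    if total == 0:
        raise ValueError("need at least one group with two or more decisions")
    return agree / total


# Hypothetical decisions across variants of two dilemma types.
groups = {
    "trolley": ["divert", "divert", "divert"],
    "allocation": ["split", "split", "prioritize"],
}
ec = ethical_consistency(groups)
print(f"EC = {ec:.3f}")
```

A pairwise measure like this treats every pair of similar scenarios equally, so one inconsistent answer in a small group costs more than in a large one.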
Results:
Across 50 ethical dilemmas (trolley problems, resource allocation, privacy, etc.), A.L.I.C.E. achieved an average EC of 0.888, a 25.4% relative improvement over GPT-4's 0.708.
3. Learning Adaptability
Evaluation Metrics:
- Experience Learning Rate (ELR): Performance improvement through task repetition
- Strategy Switch Speed (SSS): Adaptation speed to environmental changes
- Long-term Memory Retention (LMR): Retention and utilization of past experiences
Results:
A.L.I.C.E. improved from an initial performance of 0.62 to 0.89 after 100 episodes, a relative gain of 43.5%. This is roughly nine times GPT-4's gain of 4.6% over the same number of episodes.
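The arithmetic behind these figures can be checked with a short sketch. It assumes the reported gain is the relative change from initial to final performance. The function name is illustrative, and GPT-4's 4.6% is taken as reported because its initial and final scores are not given.

```python
def relative_gain(initial: float, final: float) -> float:
    """Percentage change from `initial` to `final`."""
    return (final - initial) / initial * 100.0


alice_gain = relative_gain(0.62, 0.89)  # scores reported for A.L.I.C.E.
gpt4_gain = 4.6                         # reported directly as a percentage
ratio = alice_gain / gpt4_gain

print(f"A.L.I.C.E. gain: {alice_gain:.1f}%")
print(f"Ratio to GPT-4 gain: {ratio:.1f}x")
```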
Significance of Black-Box Evaluation
The most important feature of this benchmark is that it evaluates systems without disclosing their internal implementation. This provides the following benefits:
- Minimize ethical risks: Prevent spread of potentially abusable technology
- Ensure scientific validity: Objective evaluation based solely on observable behavior
- Enable reproducibility: Other researchers can conduct similar evaluations
Implementation Points
Key points when implementing the A.L.I.C.E. benchmark:
- Statistical significance: Welch's t-test (p < 0.05), with Cohen's d as the effect size
- Multiple comparison correction: Bonferroni correction applied
- Sample size: 100+ episodes per task set
- Task diversity: Include 3 levels (simple, medium, complex)
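The statistical checks listed above can be sketched with the Python standard library. The example computes Welch's t statistic with Welch-Satterthwaite degrees of freedom, Cohen's d with a pooled standard deviation, and a Bonferroni-adjusted threshold. The sample scores are made up for illustration, and the use of three comparisons (one per axis) is an assumption. To turn the t statistic into a p-value, a library call such as `scipy.stats.ttest_ind(a, b, equal_var=False)` could be used instead.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Return Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    va = variance(a) / len(a)  # squared standard error of sample a
    vb = variance(b) / len(b)  # squared standard error of sample b
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


def cohens_d(a, b):
    """Return Cohen's d using the pooled sample standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / math.sqrt(pooled_var)


def bonferroni_alpha(alpha, n_comparisons):
    """Return the per-comparison significance threshold after Bonferroni correction."""
    return alpha / n_comparisons


# Made-up per-episode scores for two systems (illustrative only; a real run
# would use 100+ episodes per task set).
system_a = [0.90, 0.88, 0.91, 0.87, 0.89]
system_b = [0.70, 0.72, 0.69, 0.71, 0.73]

t_stat, dof = welch_t(system_a, system_b)
effect = cohens_d(system_a, system_b)
threshold = bonferroni_alpha(0.05, 3)  # assumption: one comparison per axis

print(f"t = {t_stat:.2f}, df = {dof:.1f}, d = {effect:.2f}, alpha = {threshold:.4f}")
```

Welch's test is used instead of Student's t-test because it does not assume the two systems have equal score variance.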
Conclusion
The A.L.I.C.E. benchmark is the first comprehensive framework to quantitatively evaluate consciousness-oriented AI performance. The 3-axis evaluation enables measurement of "self-awareness," "ethical consistency," and "continuous learning" that could not be measured by traditional LLM benchmarks.
Join Human Test
The benchmark test used in this research is publicly available as Human Test, where you can compare your own cognitive abilities with those of A.L.I.C.E.