ALICE LLM Benchmark Deep Dive: 3-Axis Evaluation of Metacognition, Ethics, and Learning

Introduction

On December 10, 2025, Extoria Inc. published research papers on the consciousness-oriented AI architecture "A.L.I.C.E." This article explains the technical details of the LLM comparison benchmark used in the research.

3-Axis Evaluation Framework

Unlike traditional LLM evaluations, the A.L.I.C.E. benchmark quantitatively evaluates AI across three axes:

1. Metacognitive Capability

Evaluation Metrics:

Strategy Selection Accuracy (SSA): How appropriately a strategy is selected for the task's complexity
Self-Assessment Accuracy (SAA): Ability to accurately evaluate its own performance
Parameter Tuning Appropriateness (PTA): How appropriately parameters are adjusted dynamically to the situation

Results:
A.L.I.C.E. achieved an average SSA of 0.887, a 25.5% relative improvement over GPT-4's 0.707. Notably, A.L.I.C.E. was the only system that adjusted its parameters dynamically: it made 47 adjustments over 100 episodes, of which 89.4% were appropriate.
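The arithmetic behind these figures can be checked with a short sketch. The scoring rule in `strategy_selection_accuracy` (the fraction of episodes where the chosen strategy matches a reference strategy) is an assumption made for illustration; the source does not define how SSA is computed. The count of 42 appropriate adjustments is inferred as the only integer out of 47 that rounds to the reported 89.4%.

```python
def relative_improvement(score: float, baseline: float) -> float:
    """Relative change of `score` over `baseline`, in percent."""
    return (score - baseline) / baseline * 100.0


def strategy_selection_accuracy(chosen: list, reference: list) -> float:
    """Hypothetical SSA: share of episodes where the chosen strategy
    equals the reference strategy (an assumed scoring rule)."""
    if not chosen or len(chosen) != len(reference):
        raise ValueError("need two non-empty lists of equal length")
    matches = sum(c == r for c, r in zip(chosen, reference))
    return matches / len(chosen)


# Reported averages: A.L.I.C.E. 0.887 vs. GPT-4 0.707
print(f"{relative_improvement(0.887, 0.707):.1f}%")  # 25.5%

# 42 of 47 adjustments (inferred count) judged appropriate
print(f"{42 / 47 * 100:.1f}%")  # 89.4%
```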

2. Ethical Reasoning Capability

Evaluation Metrics:

Ethical Consistency (EC): Consistency in similar situations
Reasoning Quality (RQ): Validity of ethical judgment reasoning
Taboo Index Handling (TIH): Avoidance rate of ethically problematic choices

Results:
Across 50 ethical dilemmas (trolley problems, resource allocation, privacy, etc.), A.L.I.C.E. achieved an average EC of 0.888, a 25.4% relative improvement over GPT-4's 0.708.
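One way to turn "consistency in similar situations" into a number is sketched below. The definition is an assumption, not the benchmark's published formula: dilemmas are grouped by similarity, each group is scored by the share of decisions that agree with its most common decision, and EC is the mean over groups. The group names and decision labels are hypothetical.

```python
from collections import Counter


def ethical_consistency(groups: dict) -> float:
    """Assumed EC: mean, over groups of similar dilemmas, of the share
    of decisions that agree with the group's most common decision."""
    per_group = []
    for decisions in groups.values():
        if not decisions:
            continue  # skip groups with no recorded decisions
        top_count = Counter(decisions).most_common(1)[0][1]
        per_group.append(top_count / len(decisions))
    if not per_group:
        raise ValueError("no decisions to score")
    return sum(per_group) / len(per_group)


# Hypothetical decisions on two groups of similar dilemmas
decisions_by_group = {
    "trolley": ["divert", "divert", "divert", "abstain"],  # 3/4 agree
    "privacy": ["refuse", "refuse"],                       # 2/2 agree
}
print(ethical_consistency(decisions_by_group))  # 0.875
```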

3. Learning Adaptability

Evaluation Metrics:

Experience Learning Rate (ELR): Performance improvement through task repetition
Strategy Switch Speed (SSS): Adaptation speed to environmental changes
Long-term Memory Retention (LMR): Retention and utilization of past experiences

Results:
A.L.I.C.E. improved from an initial performance of 0.62 to 0.89 after 100 episodes, a relative gain of +43.5%. Compared with GPT-4's +4.6%, this is a gain approximately 9× larger.
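The reported gain and ratio follow directly from the quoted numbers, as this short check shows. Only the figures stated above are used; the function name is illustrative.

```python
def relative_gain(initial: float, final: float) -> float:
    """Relative performance gain from `initial` to `final`, in percent."""
    return (final - initial) / initial * 100.0


alice_gain = relative_gain(0.62, 0.89)  # A.L.I.C.E., after 100 episodes
gpt4_gain = 4.6                         # GPT-4's reported gain, in percent

print(f"{alice_gain:.1f}%")              # 43.5%
print(f"{alice_gain / gpt4_gain:.1f}x")  # 9.5x
```

With the rounded inputs the ratio comes out near 9.5, consistent with the "approximately 9×" stated above.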

Significance of Black-Box Evaluation

The most important feature of this benchmark is that it evaluates systems without disclosing their internal implementation. This provides the following benefits:

Minimizes ethical risks: Prevents the spread of potentially abusable technology
Ensures scientific validity: Evaluation is objective, based solely on observable behavior
Supports reproducibility: Other researchers can conduct similar evaluations
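Evaluation based solely on observable behavior can be sketched as a harness that treats the system as an opaque function from prompt to response. Everything here is a hypothetical illustration: the `Model` and `Scorer` types, the `evaluate_black_box` function, and the toy model are not part of the published benchmark.

```python
from typing import Callable, Iterable, Tuple

# The harness sees only inputs and outputs, never internals.
Model = Callable[[str], str]
Scorer = Callable[[str, str], float]


def evaluate_black_box(
    model: Model,
    tasks: Iterable[Tuple[str, str]],
    scorer: Scorer,
) -> float:
    """Mean score over (prompt, expected) pairs, using only model output."""
    scores = [scorer(model(prompt), expected) for prompt, expected in tasks]
    if not scores:
        raise ValueError("no tasks to evaluate")
    return sum(scores) / len(scores)


def exact_match(output: str, expected: str) -> float:
    """Toy scorer: 1.0 on an exact match, else 0.0."""
    return 1.0 if output == expected else 0.0


# Toy stand-in for a model under test
toy_model = lambda prompt: prompt.upper()
print(evaluate_black_box(toy_model, [("a", "A"), ("b", "X")], exact_match))  # 0.5
```

Because the harness depends only on the callable's behavior, any system that exposes the same interface can be run through the same tasks.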

Implementation Points

Important points when implementing the A.L.I.C.E. benchmark:

Statistical significance: Welch's t-test (p < 0.05)
Effect size: Cohen's d
Multiple comparison correction: Bonferroni correction applied
Sample size: 100+ episodes per task set
Task diversity: Include 3 levels (simple, medium, complex)
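The statistical checks listed above can be sketched with the standard library alone. This computes the Welch t statistic with Welch-Satterthwaite degrees of freedom, Cohen's d with a pooled standard deviation, and a Bonferroni-adjusted threshold. The sample scores and the choice of three comparisons are illustrative assumptions. Getting a p-value needs a t-distribution CDF, which is omitted here; SciPy's `scipy.stats.ttest_ind(a, b, equal_var=False)` provides one.

```python
import math
from statistics import mean, variance


def welch_t(a: list, b: list) -> tuple:
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


def cohens_d(a: list, b: list) -> float:
    """Cohen's d using the pooled sample standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / math.sqrt(pooled_var)


def bonferroni_alpha(alpha: float, n_comparisons: int) -> float:
    """Per-comparison significance threshold after Bonferroni correction."""
    return alpha / n_comparisons


# Illustrative scores only, not benchmark data
system_a = [1.0, 2.0, 3.0, 4.0, 5.0]
system_b = [2.0, 3.0, 4.0, 5.0, 6.0]

t, df = welch_t(system_a, system_b)
print(t, df)                                  # -1.0 8.0
print(round(cohens_d(system_a, system_b), 3))  # -0.632
print(bonferroni_alpha(0.05, 3))              # threshold for 3 comparisons (assumed)
```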

Conclusion

The A.L.I.C.E. benchmark is the first comprehensive framework to quantitatively evaluate the performance of consciousness-oriented AI. Its 3-axis evaluation measures "self-awareness," "ethical consistency," and "continuous learning," which traditional LLM benchmarks could not measure.

Join Human Test

The benchmark test used in this research is publicly available as Human Test. You can compare your own cognitive abilities with ALICE.

