LDC – Arabic Corpus

Category : Speech Corpus

microphone speech

LDC2005S08 BBN/AUB DARPA Babylon Levantine Arabic Speech and Transcripts

LDC2002S02 West Point Arabic Speech

ELRA-Useful Links

Category : Speech Corpus

Speech resources specific items - Technical details for speech resources

http://www.elra.info/LRs-Collection.html - ELRA can assist in developing speech databases

http://www.elda.org/blark/ - BLARK (Basic Language Resources Kit)

http://www.elra.info/Overview.html , http://www.elra.info/Validation-Standards.html - speech validation

ELRA maintains three catalogues - http://catalog.elra.info/ , http://catalog.elra.info/retd/ , http://universal.elra.info/

http://www.elra.info/Newsletter.html - subscribe to the newsletter

http://catalog.elra.info/retd/ - R&D catalogue; some of the resources are free

http://www.hlt-evaluation.org/spip.php?article147 - speech recognition evaluation

ELRA-Universal Catalog

Category : Speech Corpus

http://universal.elra.info/

ELRA-Arabic Corpus

Category : Speech Corpus

2011

ELRA-S0315 A-SpeechDB 
A-SpeechDB© is an Arabic speech database that contains about 20 hours of continuous speech recorded through one desktop omnidirectional microphone by 205 native speakers (about 30% female and 70% male), aged between 20 and 45. Automatically generated transcriptions are provided, along with a manually revised version for each sentence.

2010

ELRA-S0308 Egyptian Arabic Speecon database
The Egyptian Arabic Speecon database comprises the recordings of 550 adult Egyptian speakers and 50 child Egyptian speakers, who uttered over 290 items and 210 items respectively (read and spontaneous).

2009

ELRA-W0049 “Le Monde Diplomatique” Arabic tagged corpus
This corpus contains 102,960 vowelised, lemmatised and tagged words (58 texts from Le Monde Diplomatique Arabic, see also ELRA-W0036-04). Each text comes with 3 files: the raw text in Arabic, the vowelised text in Arabic, and one XML file containing the morphological annotation of the text.

2008

ELRA-S0289 OrienTel Jordan MCA (Modern Colloquial Arabic) database 
This speech database contains the recordings of 757 Jordanian speakers recorded over the Jordanian fixed and mobile telephone network. Each speaker uttered around 49 read and spontaneous items.
ELRA-S0290 OrienTel Jordan MSA (Modern Standard Arabic) database 
This speech database contains the recordings of 556 Jordanian speakers recorded over the Jordanian fixed and mobile telephone network. Each speaker uttered around 49 read and spontaneous items.

2007

ELRA-S0258 OrienTel United Arab Emirates MCA (Modern Colloquial Arabic) 
This speech database contains the recordings of 750 Arabic speakers recorded over the United Arab Emirates' fixed and mobile telephone network. Each speaker uttered around 49 read and spontaneous items.
ELRA-S0259 OrienTel United Arab Emirates MSA (Modern Standard Arabic)
This speech database contains the recordings of 500 Arabic speakers recorded over the United Arab Emirates' fixed and mobile telephone network. Each speaker uttered around 49 read and spontaneous items.

ELRA-S0247 LC-STAR Standard Arabic Phonetic lexicon 
The LC-STAR Standard Arabic Phonetic lexicon comprises 110,271 entries, including a set of 52,981 common words, a set of 50,135 proper names (including person names, family names, cities, streets, companies and brand names) and a list of 7,155 special application words. The lexicon is provided in XML format and includes phonetic transcriptions in SAMPA.

ELRA-S0157 NetDC Arabic BNSC (Broadcast News Speech Corpus)
The NetDC Arabic BNSC (Broadcast News Speech Corpus) is a corpus developed by ELDA in the framework of the European-funded project Network of Data Centres (NetDC). The project was carried out in collaboration with the LDC (Linguistic Data Consortium), which has produced a similar corpus from the news broadcast by Voice of America Arabic in the United States. The database contains ca. 22.5 hours of broadcast news speech recorded from Radio Orient (France) during a 3-month period.

OLAC: Open Language Archives Community

Category : Speech Corpus

http://www.language-archives.org/

http://www.language-archives.org/archive/catalogue.elra.info

Arabic Speech Corpus – details

Category : Speech Corpus

SpeechDat-like database UOB/ENS More than 100 speakers, French/Arabic. For speech recognition. Lebanese/Syrian/French 1,1,1

http://en.wikipedia.org/wiki/University_of_Burundi

http://www.ub.edu.bi/

Arabic digits UOB For speech recognition. Lebanese accent 1,1,1

http://www.balamand.edu.lb/english/index.asp

http://www.mghamdi.com/KACST&UOB.pdf

Speech database in 4 languages LibanCell 10,000 announcements with 10 words per announcement 3

http://www.libancell.com.lb/

Labelled database for TTS Millenium 3

Arabic broadcast news speech corpus (BNSC) ELRA/LDC More than 20 hours of transcribed Arabic news in Modern Standard Arabic. Domain: news 1,2,1

Arabic acoustic corpus mono-speaker Benabbou, Morocco 3

Arabic Phonetic database King Abdulaziz City for Science and Technology Lang: En-Ar 3

http://www.kacst.edu.sa/en/Pages/default.aspx

Holy Qur’an multi-speaker RDI 60 hours 1,4,1

http://www.rdi-eg.com/

Arabic concatenative TTS male recording Sakhr MSA 3 hours 3

http://www.sakhr.com/Default.aspx

The Corpus of Spoken Palestinian Arabic (CoSPA) University of Haifa, Israel Between 1996 and 1998, 200 hours of recorded speech were collected. The aim is to collect data that would cover the whole linguistic area of Palestinian Arabic.

http://www.haifa.ac.il/index_eng.html

CALLHOME Egyptian Arabic Speech LDC 120 Egyptian Colloquial Arabic telephone conversations, with calls lasting up to 30 minutes 1,2,1

http://www.ldc.upenn.edu/

OrienTel United Arab Emirates MSA ELRA 500 speakers (254 males, 246 females) Recorded over the local fixed and mobile telephone network. 1,4

http://www.elra.info/

Speech Corpus

Category : Speech Corpus

http://www.natcorp.ox.ac.uk/ - British English

 

http://www.ldc.upenn.edu/

http://www.elra.info/

http://www.medar.info/BLARK/speech_resources.php

http://www.bibalex.org/Home/Default_EN.aspx