Microsoft releases Speech Corpus for three Indian languages to aid researchers

September 7, 2018

• The largest publicly available Indian language speech data for use in research and building models

Bangalore, September 07, 2018: Microsoft India today announced the availability of the Microsoft Indian Language Speech Corpus, offering speech training and test data for Telugu, Tamil and Gujarati. This is the largest publicly available Indian language speech dataset that includes audio and corresponding transcripts. It is aimed at helping researchers and academia build Indian language speech recognition for all applications where speech is used. The Indian Language Speech Corpus is provided by the Microsoft Research Open Data initiative, a collection of free datasets from Microsoft Research to advance state-of-the-art research in areas such as natural language processing, computer vision, and domain-specific sciences.

Today, there is a scarcity of adequate digital data for text, speech and linguistic resources for many vernacular languages across the world, even though such data is imperative for building large machine learning models. Moreover, the differences in enunciation, accent, diction, and slang across various regions in India are very subtle. As a result of these complexities, development of accurate digital tools in Indian languages has been slow. Microsoft is working to address this lack of data and catalyze the development of machine-learning-based models that can help in building systems for low-resource languages, thus enabling the ecosystem of researchers, academia and tech companies working on Indian language models to meet the needs of Indian users more quickly. The launch of the Microsoft Indian Language Speech Corpus is a part of this effort.

“We believe India’s increasing digital literacy needs to be supported by a multi-lingual digital world. Microsoft Indian Language Speech Corpus is an extension of our on-going efforts to reduce language barriers and empower Indians to harness the full potential of the Internet. Using our technology expertise, we want to accelerate innovation in voice based computing for India by supporting researchers and academia,” said Sundar Srinivasan, General Manager, Artificial Intelligence & Research, Microsoft India.

Microsoft’s Indian Language Speech Corpus was tested at Interspeech 2018, the world’s largest and most comprehensive conference on the science and technology of spoken language processing. In a Low Resource Speech Recognition Challenge, participants used data from the Microsoft Indian Language Speech Corpus to build Automatic Speech Recognition (ASR) systems. They were able to create high-quality speech recognition models using this data, thus validating the efficacy of the Corpus.

Microsoft has been working with Indian languages for over two decades since the launch of Project Bhasha in 1998, allowing users to input localized text easily and quickly using the Indian Language Input tool. With the help of AI and Deep Neural Networks, Microsoft is working on improving real-time language translation for Hindi, Bengali and Tamil, and is now expanding it to Telugu. Microsoft also recently announced support for email addresses in multiple Indian languages across most of its email apps and services. Also, as part of the latest Windows update, Microsoft added the Tamil 99 virtual keyboard to Windows 10. Through its global Local Language Program (LLP), Microsoft provides people access to technology in their native language. This includes Language Interface Packs for Indian languages such as Hindi, Kannada, Bengali and Malayalam, amongst others.
